Session Information
09 SES 03 B, Challenges in Educational Measurement Practices
Paper Session
Contribution
This paper discusses a peculiarity in institutionalized educational measurement practices: an inherent contradiction between the guidelines for how scales/tests are developed and the way those scales/tests are typically analyzed.
Standard guidelines for developing scales/tests emphasize the need to identify the intended construct and to select items that capture the construct’s full breadth, leading items (or subsets of items) to target different aspects of the construct. In test development, this occurs through specifying the test’s content domain along with a blueprint allocating items to content domains, item formats, and/or cognitive demand levels (AERA, APA, & NCME, 2014, ch. 4). Similarly, scale development guidelines emphasize identifying sub-facets of constructs, such that items can be targeted to capture each sub-facet, ensuring that the full construct is measured (e.g., Gehlbach & Brinkworth, 2011; Steger et al., 2022). These guidelines intentionally ensure that items (or subsets of items) contain construct-relevant variation that is not contained in every other item (e.g., it is recommended to include geometry-related items when measuring math ability because such items capture construct-relevant variation that is not present in, say, algebra-related items; cf. Stadler et al., 2021).
At the same time, scales/tests are typically analyzed with reflective measurement models (Fried, 2020). I focus on factor models for simplicity, but the same basic point applies to item-response theory models, as a reparameterization of item-response theory models as non-linear factor models would show (McDonald, 2013). In the unidimensional factor model, the item response X_ip is modelled as X_ip = (alpha_i + lambda_i*F_p) + e_ip, where i indexes items, p indexes persons, alpha_i is an item intercept, lambda_i is a factor loading, F_p is the latent factor (the construct), and e_ip is the item- and person-specific error. The (alpha_i + lambda_i*F_p) term can be understood as an item-specific linear rescaling of the latent factor (which is on an arbitrary scale) to the item’s scale, just as one might rescale a test to obtain more interpretable scores. The factor model, then, consists of two parts: the rescaled factor and the error term. Since each item is defined as containing a rescaling of the factor, and this is the only construct-relevant variation contained in items, each item must contain all construct-related variation (i.e., all changes in the construct are reflected in each item). Note that these points are conceptual, stemming from the mathematics of the factor model, not claims about the results of fitting models to specific data.
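To make this concrete, the following minimal sketch (not part of the paper; all numeric values are hypothetical) simulates responses under the unidimensional factor model described above. It illustrates the conceptual point that, under this model, every item is a noisy linear rescaling of the same factor, so no item carries construct-relevant variation that the other items lack.

```python
# Minimal sketch (hypothetical values): data generated under the unidimensional
# reflective factor model X_ip = alpha_i + lambda_i*F_p + e_ip.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 10_000, 5

alpha = np.array([1.0, 2.0, 0.5, 1.5, 3.0])        # item intercepts (hypothetical)
lam = np.array([0.8, 0.6, 0.9, 0.7, 0.5])          # factor loadings (hypothetical)
sigma = np.array([0.4, 0.5, 0.3, 0.6, 0.7])        # residual SDs (hypothetical)

F = rng.normal(size=n_persons)                     # latent factor scores F_p
E = rng.normal(size=(n_persons, n_items)) * sigma  # item- and person-specific errors e_ip
X = alpha + np.outer(F, lam) + E                   # observed item responses X_ip

# Every item correlates with the factor; all construct-related variation in the
# simulated data is carried by F, so no item (or subset of items) contributes
# unique construct-related variance beyond noise.
for i in range(n_items):
    print(f"item {i}: corr(X_i, F) = {np.corrcoef(X[:, i], F)[0, 1]:.2f}")
```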
There is a contradiction here: Scales/tests are intentionally designed so that each item (or subset of items) captures unique, construct-related variation, but analyses are conducted under the assumption that no item (or subset of items) contains unique, construct-related variation. To have such a clear contradiction baked into the institutionalized practices of measurement in the educational and social sciences is peculiar indeed.
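The tension can also be illustrated numerically. The sketch below (again an illustration under assumed values, not an analysis from the paper) generates item responses from two deliberately distinct but correlated sub-facets, mimicking a blueprint that allocates items to, say, geometry and algebra. Within-facet items then correlate more strongly than between-facet items, a pattern that a single reflective factor (which implies corr(X_i, X_j) equals the product of the standardized loadings for every pair) cannot fully reproduce.

```python
# Illustration (hypothetical values): items intentionally built to tap two
# correlated sub-facets, analyzed descriptively via their correlation matrix.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

facet_corr = 0.7                                   # correlation between sub-facets (hypothetical)
cov = np.array([[1.0, facet_corr], [facet_corr, 1.0]])
F = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)  # columns: F_geo, F_alg

lam, resid_sd = 0.8, 0.6                           # common loading and residual SD (hypothetical)
geo = lam * F[:, [0]] + rng.normal(scale=resid_sd, size=(n, 3))  # 3 geometry items
alg = lam * F[:, [1]] + rng.normal(scale=resid_sd, size=(n, 3))  # 3 algebra items
X = np.hstack([geo, alg])

R = np.corrcoef(X, rowvar=False)
within = np.mean([R[i, j] for i in range(3) for j in range(3) if i < j] +
                 [R[i, j] for i in range(3, 6) for j in range(3, 6) if i < j])
between = np.mean([R[i, j] for i in range(3) for j in range(3, 6)])
print(f"mean within-facet correlation:  {within:.2f}")
print(f"mean between-facet correlation: {between:.2f}")
# A single-factor model implies equal-loading items must have equal pairwise
# correlations, so it cannot reproduce within-facet correlations that
# systematically exceed between-facet correlations.
```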
Method
This is a discussion paper, so there are no true methods per se. The analyses are based on careful study of institutionalized guidelines for constructing tests and survey scales and of the typical approaches for analyzing data from tests and survey scales. The presentation will focus on reviewing direct quotes from these guidelines in order to build the case that there is an inbuilt contradiction baked into current “best practices” for measurement in the educational sciences. I will then present a logical analysis of the implications of this contradiction. Drawing on past and recent critiques of reflective modelling, I will propose that this contradiction persists because reflective models provide a clear and direct set of steps to support a set of epistemological claims about measuring the intended construct reliably and invariantly. I will then argue that, given the contradiction, these epistemological claims are not strongly supported through appeal to reflective modelling approaches. Rather, this contradiction leads to breakdowns in scientific practice (White & Stovner, 2023).
Expected Outcomes
The reflective measurement models that are used to evaluate the quality of educational measurement are built on a set of assumptions that contradict those used to build tests and scales. This peculiarity leaves the field evaluating the quality of measurement with models that, by design, do not fit the data to which they are applied. This raises important questions about the accuracy of claims that one has measured a specific construct, that measurement is reliable, and/or that measurement is or is not invariant. Measurement practices need to shift to create alignment between the ways that tests/scales are created and the ways they are analyzed. I will discuss new modelling approaches that would facilitate this alignment (e.g., Henseler et al., 2014; Schuberth, 2021). However, questions of construct validity, reliability, and invariant measurement become more difficult when moving away from the reflective measurement paradigm.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association. http://www.apa.org/science/programs/testing/standards.aspx
Fried, E. I. (2020). Theories and models: What they are, what they are for, and what they are about. Psychological Inquiry, 31(4), 336–344. https://doi.org/10.1080/1047840X.2020.1854011
Gehlbach, H., & Brinkworth, M. E. (2011). Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology, 15(4), 380–387. https://doi.org/10.1037/a0025704
Henseler, J., Dijkstra, T. K., Sarstedt, M., Ringle, C. M., Diamantopoulos, A., Straub, D. W., Ketchen, D. J., Hair, J. F., Hult, G. T. M., & Calantone, R. J. (2014). Common beliefs and reality about PLS: Comments on Rönkkö and Evermann (2013). Organizational Research Methods, 17(2), 182–209. https://doi.org/10.1177/1094428114526928
Maraun, M. D. (1996). The claims of factor analysis. Multivariate Behavioral Research, 31(4), 673–689. https://doi.org/10.1207/s15327906mbr3104_20
McDonald, R. P. (2013). Test theory: A unified treatment. Psychology Press.
Schuberth, F. (2021). The Henseler-Ogasawara specification of composites in structural equation modeling: A tutorial. Psychological Methods, 28(4), 843–859. https://doi.org/10.1037/met0000432
Stadler, M., Sailer, M., & Fischer, F. (2021). Knowledge as a formative construct: A good alpha is not always better. New Ideas in Psychology, 60, 1–14. https://doi.org/hqcg
Steger, D., Jankowsky, K., Schroeders, U., & Wilhelm, O. (2022). The road to hell is paved with good intentions: How common practices in scale construction hurt validity. Assessment, 1–14. https://doi.org/10.1177/10731911221124846
White, M., & Stovner, R. B. (2023). Breakdowns in scientific practices: How and why practices can lead to less than rational conclusions (and proposed solutions). OSF Preprints. https://doi.org/10.31219/osf.io/w7e8q