Optimizing Dependability in Classroom Action Research Skill Assessment: A Multivariate Generalizability Theory Analysis of Alternative Measurement Designs

Authors

R. Klyprayong, K. Tangdhanakanond, and S. Kanjanawasee

DOI:

https://doi.org/10.48161/qaj.v6n2a2462

Keywords:

Multivariate generalizability theory, Classroom action research, Performance assessment, Measurement design, Index of dependability.

Abstract

Reliable assessment of classroom action research skills is essential for supporting valid absolute decisions in teacher education, particularly when performance assessments involve multidimensional constructs and rater-mediated judgment. This study investigated how alternative measurement designs influence the index of dependability and the variance structure of classroom action research skill assessment scores within a multivariate generalizability theory (MGT) framework. Specifically, the study compared fully crossed and nested measurement designs and examined the number of raters required to achieve acceptable dependability for absolute decision-making. Participants were 58 fourth-year student teachers majoring in primary education whose classroom action research reports were evaluated by four raters using a multidimensional assessment form aligned with the Plan–Act–Observe–Reflect (PAOR) framework and supported by double-layer scoring rubrics. Data were analyzed sequentially, first with the many-facet Rasch model (MFRM) to examine rater effects, and then with MGT-based generalizability (G) and decision (D) studies. The findings showed that the fully crossed design produced a higher composite index of dependability than the nested design (Φ = .8468 vs. .7823) and yielded different composite universe-score variance structures. Under the fully crossed design, three raters were sufficient to achieve acceptable dependability for individual-level absolute decisions, whereas the nested design required four raters to reach a comparable level. The study contributes to the educational measurement literature by demonstrating that measurement design influences not only the magnitude of dependability but also the variance structure underlying multidimensional performance assessment scores. The findings further highlight the importance of aligning measurement design with the intended interpretation and use of assessment scores in teacher education contexts.
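
As a reading aid for the coefficients reported above, the following is a minimal sketch of the index of dependability under a simple persons-crossed-with-raters (p × r) design; the variance components σ²_p, σ²_r, σ²_{pr,e}, the decision-study rater count n′_r, and the dimension weights w_v are standard generalizability-theory notation rather than values taken from this study:

\[
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2(\Delta)},
\qquad
\sigma^2(\Delta) = \frac{\sigma^2_r + \sigma^2_{pr,e}}{n'_r}.
\]

In the multivariate extension, the composite index weights each skill dimension v, so that

\[
\Phi_C = \frac{\sigma^2_C(\tau)}{\sigma^2_C(\tau) + \sigma^2_C(\Delta)},
\qquad
\sigma^2_C(\tau) = \sum_{v}\sum_{v'} w_v\, w_{v'}\, \sigma_p(v, v'),
\]

where σ²_C(τ) is the composite universe-score variance built from the universe-score variances and covariances σ_p(v, v′) across dimensions, and σ²_C(Δ) is the corresponding composite absolute-error variance. Because the rater-linked error terms are divided by n′_r, adding raters shrinks σ²(Δ) and raises Φ toward the commonly applied .80 criterion; which error components enter σ²(Δ), and hence how quickly Φ grows, depends on whether raters are crossed with or nested within persons, which is the design contrast examined in this study.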

Published

2026-05-15

How to Cite

Klyprayong, R., Tangdhanakanond, K., & Kanjanawasee, S. (2026). Optimizing Dependability in Classroom Action Research Skill Assessment: A Multivariate Generalizability Theory Analysis of Alternative Measurement Designs. Qubahan Academic Journal, 6(2), 293–310. https://doi.org/10.48161/qaj.v6n2a2462
