2011 Volume 38 Issue 1 Pages 33-50
This study investigated measurement problems in essay test data from several perspectives while controlling for essay length. Two sets of essay test data (A: on the early introduction of English education; B: on sex differences in child-rearing) were obtained from 303 high school students. The students were divided into two groups: one group (N = 155) wrote essays A and B within 400 and 800 words respectively, while the other group (N = 148) wrote them with the word limits reversed. Four raters evaluated all 606 (303 × 2) essays both holistically and analytically (11 or 12 items).
Factor analysis and covariance structure analysis of the analytically evaluated data statistically confirmed that a two-factor model (a "linguistic ability" factor and a "writing ability" factor) was valid regardless of essay length and rater. Inter-rater and intra-rater reliability varied across items, and essay length affected different items differently. Regarding internal consistency, an analysis based on multivariate generalizability theory indicated that, regardless of essay length and evaluation method, increasing the number of test tasks is more effective than adding raters. Propensity score analysis, with the analytically evaluated scores as covariates, suggested that "beauty of handwriting" and "direction of opinion" might bias holistic scores.
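The generalizability-theory finding above can be illustrated with a small numerical sketch. The variance components below are purely hypothetical (they are not taken from the study); they show how, in a fully crossed persons × tasks × raters design, the generalizability coefficient for relative decisions responds more to adding a test task than to adding a rater when the person × task interaction dominates the error variance.

```python
# Hypothetical variance components for a persons x tasks x raters G-study.
# These values are illustrative only, not estimates from the paper.
var_p   = 1.00   # persons (universe-score variance)
var_pt  = 0.40   # person x task interaction
var_pr  = 0.10   # person x rater interaction
var_ptr = 0.30   # person x task x rater residual (error)

def g_coefficient(n_tasks, n_raters):
    """Generalizability coefficient for relative decisions
    in a fully crossed p x T x R design."""
    rel_error = (var_pt / n_tasks
                 + var_pr / n_raters
                 + var_ptr / (n_tasks * n_raters))
    return var_p / (var_p + rel_error)

base        = g_coefficient(1, 1)   # one essay task, one rater
more_tasks  = g_coefficient(2, 1)   # add a second essay task
more_raters = g_coefficient(1, 2)   # add a second rater

print(round(base, 3), round(more_tasks, 3), round(more_raters, 3))
# Adding a task raises the coefficient more than adding a rater
# because var_pt (0.40) is larger than var_pr (0.10).
```

Under these assumed components, doubling the number of tasks yields a larger gain in the generalizability coefficient than doubling the number of raters, consistent with the pattern the study reports.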