2017 Volume 28 Pages 241-256
Second language (L2) speaking assessment can be affected by raters as well as by tasks and other factors. High-stakes speaking tests require that high rater reliability be assured and that such information be reported to the public. In Japan, investigations into rater reliability and the use of multifaceted Rasch analysis have been limited for L2 speaking assessment in both high-stakes contexts and classroom situations. To fill this gap, this study examines the rater reliability of the Speaking Section of the Global Test of English Communication Computer Based Testing (GTEC CBT). This test has nine tasks for evaluation and 23 assessment criteria. We analyzed 648 test takers' responses using multifaceted Rasch analysis. The results showed that raters differed in severity to a small degree but demonstrated high rater agreement and rater self-consistency. The bias analysis indicated a small percentage of systematically biased patterns between raters and test takers, and 25.78% of biases between raters and criteria. Implications for improving assessment are discussed.
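To clarify the model underlying the analysis, the sketch below shows how a many-facet (multifaceted) Rasch rating scale model expresses the probability of each score category as a function of person ability, task/criterion difficulty, and rater severity. This is an illustrative, minimal implementation with hypothetical parameter values; the study's actual estimation would be done with dedicated software, and none of the names or numbers below come from the GTEC CBT data.

```python
import math

def mfrm_probs(theta, difficulty, severity, thresholds):
    """Category probabilities under a many-facet rating scale Rasch model.

    theta      : person ability in logits (hypothetical value)
    difficulty : task/criterion difficulty in logits
    severity   : rater severity in logits (higher = harsher rater)
    thresholds : rating-scale step thresholds tau_1..tau_m

    Returns P(score = 0), ..., P(score = m).
    """
    # Log-odds of each category relative to category 0 are cumulative
    # sums of (theta - difficulty - severity - tau_k).
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - difficulty - severity - tau))
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A harsher rater (larger severity) shifts probability mass toward
# lower score categories for the same test taker and criterion.
lenient = mfrm_probs(theta=1.0, difficulty=0.0, severity=-0.5,
                     thresholds=[-1.0, 0.0, 1.0])
severe = mfrm_probs(theta=1.0, difficulty=0.0, severity=0.5,
                    thresholds=[-1.0, 0.0, 1.0])
```

Under this model, a between-facet bias analysis of the kind reported above asks whether observed scores for particular rater-by-test-taker or rater-by-criterion pairings depart systematically from these model-expected probabilities.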