2022 Volume 25 Pages 60-79
This study examines the measurement quality of a classroom-based second-language speaking test that uses picture description tasks and a functional adequacy (FA) rating scale. We used four tasks and FA scores from 36371 Japanese learners of English at two public senior high schools and 44 undergraduate and graduate students. Each examinee was evaluated by two or three raters. We stacked the longitudinal data and analyzed the test scores in detail using many-facet Rasch measurement. We found that the examinees, tasks, raters, and FA rating scale generally functioned well, although some concerns remained. For example, some examinees showed serious underfit, and large differences in rater severity were apparent. The bias analysis suggested that the percentages of biased patterns were small for the interactions between examinees and tasks, between examinees and raters, and between tasks and raters. Based on these results, we discuss implications for assessment, such as the need for rigorous rater training.