Abstract
This research aims to develop a more valid and efficient in-house pre- and post-testing system for program evaluation in university language programs. To avoid administering two paper-and-pencil tests to over 5,000 students, a post-test was given to about 14% of all students within a large university language program. As a first stage, we sought the best-fitting system for equating the pre- and post-test data. Two item response theory (IRT) models, a two-parameter logistic model (2PL) and a three-parameter logistic model (3PL), were combined with five equating methods: mean-mean (MM), mean-sigma (MS), Haebara (HB), Stocking & Lord (SL), and CALR (CR). In total, ten combinations of IRT model and equating method were evaluated. The best-fitting combination was 2PL-CR, while the least-fitting was 3PL-MS. The average gain in ability (θ) of the roughly 700 students who took both tests was then calculated under both the best- and least-fitting combinations; the largest difference in average θ gain between the two was 0.143. The results indicate that we should continue to apply 2PL-CR, for accuracy and consistency, so that the average gain in students' θ can be compared across two or more school years to monitor changes. As further evidence of program efficacy, regression analyses were run to examine the relationships between the gain in θ and survey results on students' needs, in order to reflect the voices of the students who gained more ability. Overall, the results indicate that pre- and post-testing systems, when conducted appropriately, can yield reliable and efficient student evaluation data within large university language programs.
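As an illustration of the scale-linking step summarized above, the sketch below shows mean-sigma (MS) equating under a 2PL model and the resulting average gain in θ. It is a minimal, hypothetical example, not the authors' implementation: the anchor-item parameters and θ estimates are invented for demonstration only.

```python
import numpy as np

# Hypothetical anchor-item parameters (a = discrimination, b = difficulty)
# estimated separately on the pre-test (old) and post-test (new) scales.
a_old = np.array([1.10, 0.85, 1.30, 0.95])
b_old = np.array([-0.50, 0.20, 0.80, 1.10])
a_new = np.array([1.05, 0.90, 1.25, 1.00])
b_new = np.array([-0.35, 0.40, 0.95, 1.30])

# Mean-sigma (MS) linking: place the post-test scale onto the pre-test scale
# via theta_old = A * theta_new + B, using the anchor-item difficulties.
A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

# Under this transformation, b_new maps to A * b_new + B and a_new to a_new / A.
# Transform post-test theta estimates onto the pre-test scale, then compute
# the average gain for examinees who took both tests (hypothetical values).
theta_pre = np.array([-0.80, 0.10, 0.45])
theta_post = np.array([-0.30, 0.55, 0.90])
theta_post_linked = A * theta_post + B
mean_gain = (theta_post_linked - theta_pre).mean()
print(f"A = {A:.3f}, B = {B:.3f}, mean gain in theta = {mean_gain:.3f}")
```

The characteristic-curve methods named in the abstract (Haebara, Stocking & Lord) estimate the same A and B constants by minimizing differences between item or test characteristic curves rather than matching parameter means and standard deviations.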