The purpose of the present study was to examine the validity of 16 can-do items taken from the EIKEN can-do list (STEP, 2008). A total of 2,571 Japanese junior high school students were asked to assess their degree of confidence in the 16 can-do statements: four items each from EIKEN Grade 5, Grade 4, Grade 3, and Grade Pre-2. The present study employed the Rasch model to investigate whether (a) the items are unidimensional, (b) their item difficulty is appropriate, (c) item difficulty correlates with the items' EIKEN grades, and (d) the students' confidence levels correlate with their proficiency levels. The results showed that the can-do items are highly reliable and unidimensional. However, the students tended to feel that the items were unchallenging, especially the speaking and listening items.
This article examines the main data of a task-based writing performance test in which five junior high school teachers participated as novice raters. The purpose of this research is to implement a task-based writing test (TBWT), which was developed on the basis of a construct-based processing approach to testing, and to examine the degree of reliability and validity of the assessment tasks and rating scales. Accuracy and communicability were defined as constructs, and the test development proceeded in three stages: designing and characterizing writing tasks, reviewing existing scoring procedures, and drafting rating scales. Each of the forty scripts collected from twenty undergraduate students was scored by five new raters, and the analyses were done using FACETS. The results indicated that all novice raters displayed acceptable levels of self-consistency, and that there was no significant difference in scoring on the two tasks and overall impression, which provided reasonable fit to the Rasch model. The modified scales associated with the five rating categories and their specific written samples were shown to be mostly comprehensible and usable by raters, and the results demonstrated that the students' ability was effectively measured using these tasks and rating scales. However, further research is necessary to consider how inter-rater differences can be eliminated.
The purpose of this study is to determine whether repeating the question in the auditory version of multiple-choice (MC) listening tests affects listening comprehension. Two formats were compared: (a) each item was presented orally in the order of question, text, and options, and (b) each item was presented orally in the order of question, text, question, and options. Data collected from fifty-eight Japanese university students showed that there was no significant difference between the mean scores for the two formats. The data analyses also showed that the two formats did not greatly differ in reliability, item facility, item discrimination, or actual equivalent number of options. In light of these results, this study proposes that, on the auditory version of MC listening tests, presenting a question once is a better format than repeating it, if for no other reason than to save time.
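Two of the item statistics named in this abstract, item facility and item discrimination, can be illustrated with a minimal sketch. It assumes dichotomously scored responses (1 = correct, 0 = incorrect) and an upper-lower discrimination index. The abstract does not say which discrimination index the study used, so that choice, the function names, and the sample data are all hypothetical.

```python
def item_facility(item_responses):
    """Proportion of test-takers who answered the item correctly (1 = correct, 0 = incorrect)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(item_responses, total_scores, group_fraction=1 / 3):
    """Upper-lower index: facility in the top-scoring group minus facility in the bottom group.

    The group size (one third of the test-takers by default) is an assumption of this sketch.
    """
    n_group = max(1, round(len(total_scores) * group_fraction))
    # Indices of test-takers ordered from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower = order[:n_group]
    upper = order[-n_group:]
    p_upper = sum(item_responses[i] for i in upper) / n_group
    p_lower = sum(item_responses[i] for i in lower) / n_group
    return p_upper - p_lower


# Hypothetical data: responses of six test-takers to one item, and their total test scores.
item = [1, 1, 1, 0, 1, 0]
totals = [9, 8, 7, 4, 3, 2]

print(f"item facility: {item_facility(item):.2f}")
print(f"item discrimination: {item_discrimination(item, totals):.2f}")
```

Computing these values for each item under both presentation formats would allow the kind of format comparison the abstract describes.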
The present paper aims to validate the oral proficiency rating scales that were developed in Nekoda, Nekoda, & Miura (2007), in which three-facet data were collected. The data concerned the way in which 47 junior high school and high school teachers assessed 46 pieces of video-recorded performance (10-minute interviews) with reference to 52 descriptors (short descriptions of performance characteristics). The dataset was analyzed with a many-facet Rasch model (FACETS), and four analytic rating scales ('vocabulary range', 'grammaticality', 'fluency', 'pronunciation') were developed by means of a quantitative method. This study examines these results and attempts to verify the rating scales by means of a qualitative method. More concretely, three high school teachers were asked to describe several pieces of video-recorded performance in their own words. The performance videos in this process were selected (from those used in the previous study) on the basis of performance-quality 'logit values' gained from the FACETS analysis. A many-facet Rasch model is based on probability theory, and thus a closer look at the results reveals how likely each performance is to receive each score (on a four-point scale from 0 points to 3 points) from teachers in general. On the basis of this information, this study examines (1) performance which is highly likely to be assessed as 'achieved' with respect to a proficiency level described by a certain descriptor (a full mark = 3 points) and (2) performance which is highly likely to be assessed as 'not yet achieved' with respect to the same level (a slightly lower mark = 2 points). The three teachers watched these performances and described favorable characteristics of (1) and unfavorable characteristics of (2). This study checks whether the descriptive terms that the teachers produced correspond to what is meant by the matching descriptors in the rating scales.
The results show that the descriptions produced by these teachers cover various characteristics mentioned in the descriptors in the rating scales, but also that some of the descriptors need to be revised further.
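The idea that a many-facet Rasch analysis shows how likely a performance is to receive each score from 0 to 3 can be sketched as follows. This is a minimal illustration assuming the common rating scale formulation, in which the log-odds of score k over score k-1 equal performer ability minus descriptor difficulty minus rater severity minus the k-th threshold. The abstract does not give the exact model specification, and every numeric value and name below is hypothetical.

```python
import math


def category_probabilities(ability, descriptor_difficulty, rater_severity, thresholds):
    """Return P(score = k) for k = 0..len(thresholds) under a many-facet rating scale model.

    All arguments are in logits. The three thresholds used below give a four-point (0-3) scale.
    """
    eta = ability - descriptor_difficulty - rater_severity
    # Log-numerator of category k is the sum of (eta - F_j) for j = 1..k; category 0 is 0.
    log_numerators = [0.0]
    for threshold in thresholds:
        log_numerators.append(log_numerators[-1] + (eta - threshold))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical logit values for one performance, one descriptor, and one rater.
probs = category_probabilities(
    ability=1.5,
    descriptor_difficulty=0.0,
    rater_severity=0.2,
    thresholds=[-1.0, 0.0, 1.0],
)
for score, p in enumerate(probs):
    print(f"score {score}: {p:.3f}")
```

Under these assumptions, a performance whose logit value makes a score of 3 the most probable outcome would correspond to case (1) in the abstract, and one for which a score of 2 is most probable would correspond to case (2).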
This paper investigated the effects of the choice between two different strategies for solving TOEIC(R) listening test part 4 type questions on item difficulties. The two strategies examined were the preview-questions and answer-while-listening strategy and the listen-without-question-preview and answer-after-listening strategy. A group of 64 Japanese EFL undergraduate or postgraduate learners solved 30 TOEIC part 4 type listening problems using one of the two strategies. Rasch-based common item equating located these items on a single difficulty dimension with measurement error information. Even though a t test indicated no significant difference between the means, correlational analyses and 95% control line analyses suggested that the choice of strategy did make a difference in the difficulty of each item as well as in the exact nature of the construct tapped. Test-takers' perceptions of these strategies were also explored by examining their verbal comments, which indicated a unanimous preference for question preview but divided opinions regarding while-listening answering. Based on these results, we argue that the current structure of TOEIC listening test part 4 poses a serious threat to its own test validity.
The neural test theory (NTT) 'uses the mechanism of a self-organizing map' or generative topographic mapping and 'assumes the latent scale is ordinal' (Shojima, 2008c). The current study aims to reveal the characteristics of the NTT by applying it to the analysis of a placement test at a university. We compared the NTT results with those obtained using the Classical Test Theory (CTT) and Rasch modeling (RM). The participants comprised 147 Japanese learners of English, whose major subject was international studies or management studies. They took a 90-item multiple-choice vocabulary test. We obtained the test scores using the CTT (percentages correct), RM (latent ability estimates), and NTT (latent rank estimates), and classified the students into three or five groups with different proficiency levels based on the scores derived from the CTT and RM. After a detailed analysis, we ascertained three findings. First, our analysis revealed that the three types of scores and the two types of groups were highly correlated. This suggests that similar results can be obtained by using any one of the three test theories. Second, the maximum number of groups into which we could divide the students was the same (i.e., three) according to the separation index in the RM and the test model-fit indices in the NTT. Third, we compared the item difficulty and discrimination obtained from these three theories and showed that the results of the item difficulty using the CTT, RM, and NTT were highly correlated; similar results were observed for item discrimination computed using the CTT and NTT. Overall, the NTT results (i.e., the test-takers' latent ranks, the maximum number of groups, item difficulty, and item discrimination) are similar to those obtained using the CTT and RM. Furthermore, the NTT is advantageous in computing ordinal ranks based on test-takers' test response patterns with a relatively small sample size and in presenting more information on item monotonicity. 
Thus, the present study provides evidence for the effectiveness of the NTT in analyzing language testing data, especially when only ordinal scale results are required.
This paper reports the findings of a predictive validity study on two versions of the Test of English for International Communication(R) (TOEIC(R)) in the context of English language education in a Japanese university. Twenty students of English as a foreign language in Japan participated in this study. The study investigated the correlation between TOEIC(R) scores and the grade points that the participants earned in six English language-related subjects taught at the university. Retrospective data, such as those collected from interviews with the participants and teachers of English, were analyzed in order to qualitatively support the empirical results. The results indicated that TOEIC(R) can predict academic performance in actual English language classrooms at the university level. Further discussion of employing TOEIC(R) in the context of English language education in Japan is strongly recommended.
Paying attention to language input is important for the memorization of language (Robinson, 2003). This study examined upper-grade students (5th and 6th graders) who engaged in English class activities in 2007 in order to investigate the relation between English language abilities and two kinds of attention abilities. Reverse-Stroop and Stroop tests in two languages (the first language, Japanese (L1), and English as a Foreign Language (EFL)) were used to measure attention abilities (processing abilities) and selective attention abilities (automaticity) in the access and storage of the two languages. Jidoeiken, Junior STEP Bronze Test (Bronze) was used to measure English proficiency. The main results are the following four points: (1) Processing abilities in L1 were more predominant than those in EFL, and the upper English proficiency group had higher processing abilities in both languages. (2) Though automaticity had not developed much in the processing of either L1 or EFL, automaticity in L1 processing might be more predominant than that in EFL processing, irrespective of English proficiency. (3) Processing abilities for accessing verbal code and generating imagery code from long-term memory (LTM) differed between L1 and EFL, whereas processing abilities for accessing imagery code and generating verbal code from LTM were similar between L1 and EFL. (4) The upper group might have had greater processing abilities to attend to smaller units of information (such as words and phrases). On the other hand, the attention abilities of the lower group in task switching might be poor when handling more complicated English language information. Overall, this research suggested that there were different kinds of language processing depending on the kind of attention, the language (L1 or EFL), and English proficiency.
A widespread view on second language (L2) learning is that vocabulary should be learned in context. Some researchers have suggested that the extent to which context affects vocabulary learning may be somewhat narrow; others have reported positive effects of context on the learning or recognition of L2 words. In the present study, the effects of context on vocabulary learning and testing were examined in the light of word imageability, which is the image-evoking value of words. There were three main points of interest: (a) the relation between context type and imageability, (b) context and imageability effects on learning, and (c) the differences in these effects between test formats. A total of 22 Japanese university students participated in the experiment. They learned 21 target words in three learning conditions differing by context type, and recalled them on definition-cued, translation, and multiple-choice tests. In addition, the participants rated word imageability on two occasions. The results suggested the following three findings. First, contextualized learning may affect target words' imageability, even when context effects on retention are not readily apparent. Second, context effects on the posttest scores were not significant even when the learners were provided with definition sentences (in the definition-cued test) or choices (in the multiple-choice test). Third, imageability effects were found in several types of posttests. Taken together, even if the presentation of context does not enhance learning directly, it may affect learners' performance by giving target vocabulary a relevant image. Therefore, it is suggested that imageability can help reveal context effects on vocabulary learning and testing. Future research should investigate what learners represent in their minds based on contextual information.
Although many studies assume that the construction of situation models is essential for reading comprehension, few of them have examined EFL learners' reading process from the perspective of situation models. This study investigated the situation models of EFL readers with the verb-clustering test, which is used in the event-indexing model paradigm (Zwaan, Langston, & Graesser, 1995). The event-indexing model paradigm is highly recommended for classroom instruction because it refers to elements for the development of elaborate situation models. The verb-clustering test is one way to examine the situation models of readers. It is thought to be useful because one can investigate learners' situation models without disturbing them (Iseki & Kawasaki, 2006). Furthermore, this test is effective for classrooms because it does not take much time to administer and it allows teachers to examine the reading processes of L2 learners. This study investigated the situation models of L2 readers with the test. A total of 122 high school students participated in this study. They read a short narrative text and took the verb-clustering test. They were divided into two groups based on their language proficiency. The results of this study showed that the construction of L2 situation models that are comparable to those of L1 demands a high degree of language proficiency, which is supported by Zwaan and Brown (1996). This study also suggested that the pre-reading instructions had different effects on situation models depending on learners' language proficiency. Given instructions, good readers could construct more elaborate situation models, whereas poor readers could not.