Niigata Journal of Health and Welfare
Online ISSN : 2435-8088
Print ISSN : 1346-8782
Original article
Student evaluation in orthoptics: evaluation of rubric-based assessments from interdisciplinary team of faculty
Hokuto Ubukata, Haruo Toda, Noriaki Murata, Fumiatsu Maeda, Haruki Abe

2020 Volume 20 Issue 2 Pages 73-84

Abstract

To achieve effective medical education, including orthoptic education, activities that focus on small groups have been encouraged. In accordance with this trend, interdisciplinary teams of faculty instructors, who often come from a variety of academic or clinical backgrounds, are typically involved in providing small-group instruction. While this diversity may serve to broaden students' perspectives, evaluation of student performance must remain consistent within a given class. A rubric is a scoring tool that can be used to assess student performance. One possible solution is education with rubric-based evaluation, which typically focuses on achievement and performance via the use of explicit terms or criteria. However, the full impact of rubric-based assessments, and how they might be used to standardize the quality of evaluations provided by interdisciplinary teams of faculty instructors, remains to be investigated. To address this issue, we compared the scores provided by four different instructors, including one who is not a certified orthoptist (CO), for the performance of students (n=56) on five practice topics that shared the same rubric. There were significant inter-rater differences in the scoring of the same reports despite the fact that the same rubric was used throughout. On the other hand, the interaction between instructors and students was not significant. These results indicate that standardized evaluation using rubric-based assessments was not fully successful per se, whereas all four instructors scored the students fairly, without bias.

Introduction

Orthoptic education, similar to educational practices in all other medical disciplines, works to achieve a uniform level of high performance and achievement by students; the importance of appropriate evaluation is commonly recognized by both hospitals and clinics [1]. To achieve effective medical education, activities that focus on small groups have been encouraged. In accordance with this trend, interdisciplinary teams of faculty instructors are typically involved in providing small-group instruction. In higher-education faculties in Japan, the instructors often come from a variety of academic backgrounds and can provide students with numerous clinical and educational experiences. While this diversity may serve to broaden students' perspectives, evaluation of student performance must remain consistent within a given class.

A rubric is a scoring tool that can be used to assess student performance. Rubric-based education was developed in the U.S. in the late 1970s; rubrics typically focus on achievement and performance via the use of explicit terms or criteria [2]. As rubric-based methods present scoring criteria to the instructors in an explicit and descriptive way, they can help instructors perform consistent assessments of student accomplishment with respect to a wide range of tasks, including those relevant to elementary through higher-level education [3-7]. Among several important examples, Maeda et al. [8] introduced rubric assessments for off-campus clinical orthoptic internships; these were found to be useful by both students and instructors as a means to standardize student evaluations. Our department includes instructors with a wide variety of backgrounds; among them are highly experienced certified orthoptists (COs), younger COs, and individuals with PhD degrees in Engineering who have no specific experience with clinical orthoptics. This group may provide a suitably diverse cohort for a quantitative evaluation of rubric-based assessments for student evaluation. We previously reported that the disparities between the students' and instructors' scores decreased significantly through repeated use of the rubric, suggesting an improvement in students' self-evaluation [9].

However, the full impact of rubric-based assessments, and how they might be used to standardize the quality of evaluations provided by interdisciplinary teams of faculty instructors, remains to be investigated. To address this issue, we began by comparing the scores provided by four different instructors, including one who is not a CO, for student performance on five practice topics that shared the same rubric.

Materials and Methods

1. Subjects and Practice Schedule

This study was approved by the Ethics Committee of Niigata University of Health and Welfare (Approval No. 17827-170829). The data were obtained from 56 second-year undergraduate students enrolled in the “Practice of Visual Physiology” course in the Department of Orthoptics and Visual Sciences of Niigata University of Health and Welfare; this course covers a variety of practical learning objectives focused on a fundamental understanding of visual function. All study participants provided written consent and agreement to participate, except for one student who withdrew from the university shortly after the completion of the course. Of the participants, 42 were female and 14 were male, and the average age was 19.6 years at the beginning of the course. The instructors included three COs (instructors #1, #2, and #3) and one PhD of Engineering (instructor #4); all instructors were male, and all were members of our department. Instructors #1, #2, and #3 had 16, 5, and 1 years of experience as a CO, respectively, when the course of study began.

The course sessions were carried out once a week from April 10 to July 17, 2015; these sessions covered fourteen topics and were carried out under the instruction of one primary instructor for each topic, with assistance from the other three instructors, as described in Table 1. After each session, each student wrote a report on the topic, which was evaluated by the primary instructor. For five of the fourteen topics, the three non-primary instructors also evaluated the same student reports using the same rubric, which included six terms (Table 2), in order to examine differences between instructors with respect to student evaluation. The instructors individually reviewed the rubric scoring criteria before evaluating the reports. The instructors explained the rubric criteria to the students and encouraged them to work hard to get the highest score on each rubric item. Our study focused on the outcomes associated with these five specific topics.

2. Evaluation of disagreement and statistics

To quantify scoring divergence, we asked each instructor to assign scores of 0, 1, and 2 to “Poor,” “Marginal,” and “Good,” respectively, for every criterion included in the rubric; however, in order to provide adequate encouragement, scores of 0, 1, and 3 were used to provide feedback to the students. Complete datasets were obtained from 46 of the 56 students. To evaluate the scoring divergence, we calculated the distance of each instructor's score vector from the average of the four scores via a multi-dimensional calculation involving 46 (students) × 6 (rubric terms) × 5 (topics) = 1380 dimensions. The Euclidean distance for the complete dataset can be defined as:

distance_target = sqrt( Σ_{i=1}^{n} (score_i − average score_i)² )

where “target” reflects the students, topics, or rubric terms under consideration and “n” reflects the number of datapoints. The average score is the inter-rater mean for the students, topics, or rubric terms under consideration. Statistical analyses were performed with GNU R (http://www.r-project.org/) versions 3.5.1 or 3.6.3 on a Macintosh computer, and results are expressed as mean ± standard deviation unless otherwise stated. ANOVA-kun (http://riseki.php.xdomain.jp/index.php?ANOVA%E5%90%9B) versions 4.8.2 or 4.8.5 was employed for calculations of repeated-measures analysis of variance (ANOVA).
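As an illustration of this calculation, the following GNU R sketch computes the Euclidean distance of each instructor's score set from the inter-rater mean; the score array, its random contents, and the object names are hypothetical stand-ins and do not reproduce the actual analysis scripts used in this study.

  # Hypothetical score array: 4 instructors x 46 students x 6 rubric terms x 5 topics,
  # each cell holding a score of 0 (Poor), 1 (Marginal), or 2 (Good).
  set.seed(1)
  scores <- array(sample(0:2, 4 * 46 * 6 * 5, replace = TRUE), dim = c(4, 46, 6, 5))

  # Inter-rater mean for every student x term x topic cell (46 x 6 x 5 = 1380 values).
  mean_scores <- apply(scores, c(2, 3, 4), mean)

  # Euclidean distance of each instructor's 1380-dimensional score vector from the mean.
  inter_rater_distance <- sapply(1:4, function(i) sqrt(sum((scores[i, , , ] - mean_scores)^2)))
  names(inter_rater_distance) <- paste("instructor", 1:4)
  inter_rater_distance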

Results

The instructors' scores were not centered on “Marginal” (1), the middle of the three choices (Figure 1a). As shown in Figure 1b, ANOVA revealed a significant main effect of instructor (p = 1.03 × 10⁻⁹, Table 3), as well as of practice topic and rubric term (both p < 2 × 10⁻¹⁶, Figures 1c and 1d, and Table 3); these results indicated that there were significant differences among instructors with respect to scoring of the same reports despite the fact that the same rubric was used throughout. Significant differences in mean scores were observed between each pair of instructors except between instructors #2 and #4 (p = 5.17 × 10⁻¹¹, 2.40 × 10⁻⁴, 6.81 × 10⁻¹², 0.00820, 5.38, and 0.00673 for the differences between instructors #1 and #2, #1 and #3, #1 and #4, #2 and #3, #2 and #4, and #3 and #4, respectively; post-hoc t-test with Bonferroni correction). Inter-rater differences were also observed when comparing practice topics and rubric terms (Figure 2 and Table 3). The evaluations of the reports on topics in which the students measured standard visual functions (“Glare,” “Eccentric,” and “Landolt”) were lower and more widely distributed than those of the reports on topics in which the students made models (“Ocular” and “Brain”). In addition, we found that instructors evaluated the reports on their own primary topics (denoted as “P” in Figure 2a) significantly higher than reports submitted on non-primary topics (primary topics scored 1.94 ± 1.10, non-primary topics scored 1.81 ± 1.09; p = 3.84 × 10⁻⁵, Student's t-test). For rubric terms, instructor-based evaluations of “Object”, “Results”, and “Discussion” tended to be lower and more widely distributed than those addressing “Reference”, “Format”, or “Writing.” Interestingly, the interaction between instructors and students was not significant (p = 0.29, Table 3); this result indicated that all four instructors scored the students fairly, without bias or partiality.
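As an aside for readers who wish to reproduce this type of comparison, a minimal GNU R sketch of pairwise comparisons with Bonferroni correction is shown below; the long-format data frame, its column names, and the random scores are hypothetical, and the sketch is a simplification rather than the repeated-measures procedure (ANOVA-kun) actually used in this study.

  # Hypothetical long-format data: one row per instructor x student x rubric term x topic.
  d <- expand.grid(instructor = factor(paste0("#", 1:4)),
                   student = factor(1:46),
                   term = factor(1:6),
                   topic = factor(1:5))
  set.seed(2)
  d$score <- sample(0:2, nrow(d), replace = TRUE)

  # Pairwise comparisons of mean scores between instructors with Bonferroni correction.
  pairwise.t.test(d$score, d$instructor, p.adjust.method = "bonferroni")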

To investigate student-specific inter-rater divergence, we used the scores provided by instructor #4 to create a rank order of the 46 students who provided a complete dataset. Of note, instructor #4 was the only individual in this cohort who was not a CO. If instructors #1, #2, and #3 had scored the reports of each student in a manner similar to that of instructor #4, all the scores would increase monotonically from left to right, as in the case of the average scores (Figure 3), resulting in positive rank correlations. Our data supported this assumption for the average scores (Table 4, the rightmost column in the topmost row). Upon examination of each rubric term, however, the rank correlations were not significant in ten of the thirty-six possible combinations of instructor × rubric term (see the gray-highlighted items in Table 4); these results indicated the existence of student-specific inter-rater divergence. Therefore, we quantified student-specific inter-rater divergence by summing the distance of each instructor's score from the center (see Materials and Methods). As shown in Figure 4, there were only a few cases in which the scores of all four instructors were completely the same for a given student (i.e., mean inter-rater distance = 0). The relationship between score and mean distance varied from term to term. Within the category “Results,” mean inter-rater distance had a significant positive correlation with score (p = 2.44 × 10⁻⁵, r = 0.58, Spearman's rank-order analysis); this result suggested that each of the four instructors regarded different aspects of the reports as important within this rubric criterion. By contrast, significant negative correlations were observed with respect to “Reference”, “Format”, and “Writing” (p = 2.73 × 10⁻⁶, r = −0.630; p = 9.33 × 10⁻⁵, r = −0.544; and p = 0.0013, r = −0.459, respectively; Spearman's rank-order analysis).
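The rank-correlation analysis described above can be outlined with the following GNU R sketch; the per-student vectors are randomly generated, hypothetical stand-ins for the mean scores and mean inter-rater distances analysed here.

  # Hypothetical per-student values for one rubric term:
  # mean score across the four instructors and mean inter-rater distance.
  set.seed(3)
  mean_score    <- runif(46, min = 0, max = 2)
  mean_distance <- runif(46, min = 0, max = 1)

  # Spearman's rank-order correlation between score and inter-rater divergence.
  cor.test(mean_score, mean_distance, method = "spearman")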

Discussion

While instructors with divergent backgrounds may help students to acquire wider viewpoints, consistency with respect to student evaluation remains among the most important principles in medical education. Rubric-based assessments are a practical tool that can be used to standardize student evaluations carried out by interdisciplinary teams of faculty instructors, because the scoring criteria are presented to the instructors in an explicit and descriptive way [3-7]. In this study, however, we found that standardized evaluation using rubric-based assessments was not fully successful per se. Our findings support the importance of confirming the contents and goals with all instructors prior to the use of a rubric-based evaluation. Furthermore, our results imply that even instructors belonging to the same faculty evaluate students from different viewpoints.

Inter-rater differences were observed in several categories of orthoptic practice, including those focused on topics such as “Glare”, “Eccentric”, and “Landolt”; these differences were less stark when considering topics such as “Ocular” and “Brain.” Interestingly, the first group includes practical topics related to patient examinations and involves quantitative evaluations of visual function, whereas the latter group includes topics in which the students are asked to create models. While the instructors rated the student reports on all topics using the same rubric and scoring criteria, they may have actually superimposed their own criteria on the measured data presented in the reports. For the model-based reports, the expectations might have been somewhat low (i.e., scores were high). Another possibility is that the students who enrolled in this course were generally quite capable of making models.

With respect to the terms in the rubric, inter-rater differences were less distinct for categories including “Reference”, “Format”, and “Writing” (Figure 2b). These terms concern formalized academic writing that could be scored using the same criteria. In a previous study, student self-evaluation with respect to the term “Format” matched well with the instructor evaluation once rubric-based evaluations had been repeated several times [9], probably for the same reason. Among the rubric terms, those that are more formal and those with particularly specific criteria were evaluated in the same way by all four instructors. By contrast, scores for the less-formalized rubric terms were more likely to be influenced by the backgrounds and experiences of the individual instructors. Each topic was assigned to one primary instructor; the other three instructors evaluated the reports on these topics as well. The overall evaluations were significantly higher when an instructor was evaluating student reports on his own primary topic or topics (“P” in Figure 2a). These problems need to be overcome in order to create a successful educational program.

Although these inter-rater differences might be a matter of inconsistency, we did not find a significant instructor × student interaction; this result indicates that the instructors scored the students evenly, with no apparent bias toward one or more “favorite” students (Tables 3 and 5). There was a significant correlation between the rank order based on the scores provided by instructor #4 and those provided by the other instructors. Therefore, it is possible that the faculty members performed relatively consistent evaluations. Moreover, the average inter-rater distance was highest for students with higher scores and was lower for terms including “Reference”, “Format”, and “Writing”; this result suggests that each instructor focused on different aspects of the reports when considering the high-scoring students, typically reflecting their own professional backgrounds.

When interdisciplinary teams of faculty evaluators are asked to perform rubric-based evaluations, the contents and goals of the evaluation are typically confirmed in advance [10,11]. In addition, according to Allen et al. [12], “When used as teaching tools, rubrics not only make the instructor's standards and resulting grading explicit, but they can give students a clear sense of what the expectations are for a high level of performance on a given assignment, and how they can be met.” The rubric used in this study did not change according to the content of the topics. As such, the instructors need to have a common understanding not only of the rubric itself, but also of the evaluation criteria and the purpose of the individual topics. A well-designed scoring rubric serves to reduce inconsistencies in the scoring process by minimizing errors due to problems associated with evaluator training, evaluator feedback, and the clarity of the reference description [13]. Furthermore, Roger et al. [10] created a rubric that both instructors and students could understand in order to improve subject evaluation, and confirmed its usefulness.

In this study, we used a rubric evaluated on a three-point scale (3, 1, and 0 points), but the instructors' scores were not concentrated on “Marginal”. There are also reports in which rubric-based evaluation was performed on a five-point Likert scale [14,15]. In the future, we will refer to such advanced rubrics when creating new ones. In addition, we plan to use a newly created rubric to compare the degree of agreement between faculty members with that observed in this study. Instructors need to create a rubric that complies with the learning objectives posted in the syllabus and to explain it to the students as an evaluation standard; they can then collect comments from the students. Such a feedback mechanism may be helpful when considering how to revise the evaluation criteria used in this study.

This study was presented at the 18th Annual Meeting of the Niigata Society of Health and Welfare.

Acknowledgments

We are grateful to Assoc. Prof. Igarashi for her advice on English writing.

Conflicts of Interest

There are no conflicts of interest to declare.

References
 
© 2020 Niigata Society of Health and Welfare

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/