The Tohoku Journal of Experimental Medicine
Online ISSN : 1349-3329
Print ISSN : 0040-8727
ISSN-L : 0040-8727
Regular Contribution
Development of a Peer Review System Using Patient Records for Outcome Evaluation of Medical Education: Reliability Analysis
Junichi Kameoka, Tomoya Okubo, Emi Koguma, Fumie Takahashi, Seiichi Ishii, Hiroshi Kanatsuka

2014 Volume 233 Issue 3 Pages 189-195

Abstract

In addition to input evaluation (education delivered at school) and output evaluation (students’ capability at graduation), methods for outcome evaluation (performance after graduation) of medical education need to be established. One approach is a review of medical records, which, however, has been met with difficulties because of poor inter-rater reliability. Here, we attempted to develop a peer review system of medical records with high inter-rater reliability. We randomly selected 112 patients (110 were finally analyzed, after two ineligible patients were excluded) who visited, and were eventually hospitalized in, one of four general hospitals in the Tohoku region of Japan between 2008 and 2012. Four reviewers, who were well-trained general internists from outside the Tohoku region, visited the hospitals independently and evaluated outpatient medical records based on an evaluation sheet that consisted of 14 items (3-point scale) for record keeping and 15 items (5-point scale) for quality of care. The mean total score was 84.1 ± 7.7. Cronbach’s alpha for these items was 0.798. The single-measure and average-measure intraclass correlation coefficients for the reviewers were 0.733 (95% confidence interval: 0.720-0.745) and 0.917 (95% confidence interval: 0.912-0.921), respectively. An exploratory factor analysis revealed six factors: history taking, physical examination, clinical reasoning, management and outcome, rhetoric, and patient relationship. In conclusion, we have developed a peer review system of medical records with high inter-rater reliability, which may enable us, with further validity analysis, to measure quality of patient care as an outcome evaluation of medical education in the future.

Introduction

The evaluation of education has been divided into three categories: input (education delivered at school), output (students’ capability at graduation), and outcome (performance after graduation) evaluations (IPRA Gold Paper No. 11 1994). In medical education, “input evaluation” includes the accreditation of medical schools, such as the Educational Commission for Foreign Medical Graduates (ECFMG) in the United States (Kassebaum 1994), and the Japan Accreditation Council for Medical Education (JACME) in Japan. “Output evaluation” includes examinations, both at each medical university and by official institutes, such as the United States Medical Licensing Examination (USMLE) in the United States (Williams 1993), and the National Certificate Examination in Japan (Kozu 2006). In contrast, methods for “outcome evaluation” have not been sufficiently established because of their inherent difficulties (Prystowsky and Bordage 2001). However, considering that the ultimate goal of medical education is to develop good doctors who can provide superior patient care, the development of outcome evaluation methods is a long-term imperative in the field of medical education.

Outcome evaluation has been attempted at only a few universities, such as Thomas Jefferson Medical College, where the clinical competence of 4,560 graduates between 1975 and 2004 was rated by the program directors of their hospitals (Hojat et al. 2007). Apart from longitudinal analyses, ratings by program directors or other staff members have been investigated for reliability and validity in assessing pediatric trainees’ clinical performance (Archer et al. 2010) and physicians’ professionalism (Cruess et al. 2006; Tsugawa et al. 2011). These methods mainly assess the “process of clinical performance” rather than “patient outcomes,” but the importance of patient outcomes has been increasingly recognized in medical education (Dauphinee 2012; Gonnella and Hojat 2012).

Another approach used to assess clinical competence is a review of medical records, which contain information about “patient outcomes” in addition to the “process of clinical performance.” Assessing the quality of patient care by reviewing medical records has been vigorously pursued for many decades, mainly from the viewpoint of health care (Payne 1979; Goldman 1992, 1994; Hayward et al. 1993; Rethans et al. 1994; Smith et al. 1997; Peabody et al. 2000; Hofer et al. 2004; Goulet et al. 2007). However, reviewing medical records, implicit review in particular, has been met with difficulties because of poor inter-rater reliability (intra-class correlation coefficients (ICCs): 0.16-0.56) (Hayward et al. 1993; Hofer et al. 2004; Goulet et al. 2007). Proposed strategies to achieve adequate reliability include providing structured assessments, setting higher standards for reviewers, averaging scores from multiple reviewers, adjusting for systematic bias resulting from the different backgrounds of individual reviewers, using outcome judgments, and adopting practice guidelines (Goldman 1992; Smith et al. 1997).

To establish a method to measure quality of patient care and provide outcome evaluation of medical education, we launched a program to develop a peer review system of medical records in 2010. For this purpose, we planned to take two steps: (1) a retrospective study to develop a system with high inter-rater reliability as well as construct validity, and (2) a prospective study to establish a system with content and criterion validity. Here, we took the first step and, by employing the strategies mentioned above, developed a peer review system of medical records with high inter-rater reliability.

Methods

Study design

For this study, a peer-review system (PRS) committee was constituted at Tohoku University, comprising seven physicians in various fields such as cardiology, gastroenterology, neurology, and hematology. This study was approved by the Tohoku University Research Ethics Board, and the Institutional Review Boards (IRB) of each hospital.

The procedure was as follows: reviewers visited each hospital independently, and evaluated medical records (all medical records of outpatient care and a summary sheet of inpatient care) based on the evaluation sheet described below. Since we wanted to evaluate both the “process” and “outcome” of patient care, we focused on outpatient care because inpatient care is normally performed by teams instead of individual physicians in many Japanese hospitals, making it difficult to evaluate the “process” of patient care by the physician in charge.

To determine the feasibility and examine the appropriateness of an evaluation sheet, we performed a pilot study with 51 cases in February 2012. Having improved the evaluation sheet with the PRS committee members and reviewers in the pilot study, we performed the main study between January and February 2013. After the pilot study, we also developed benchmark case records with varying quality of patient care, which we used to train reviewers in the main study.

Evaluation sheet

The peer review evaluation sheet, which we originally developed, was designed by the PRS committee to measure several factors, including those previously reported in the literature (Rethans et al. 1994; Goulet et al. 2007), such as record keeping, gathering of information, clinical assessment, and management, as well as factors we developed, such as rhetoric, physician-patient relationship (including “empathy”), and overall outcome. “Empathy” here was defined as the ability to understand the feelings and experiences of patients and their family members.

The evaluation sheet consists of two parts: record keeping, rated on a 3-point Likert-type scale: 3 (written), 2 (partially written), and 1 (not written); and quality of care, rated on a 5-point Likert-type scale: 5 (outstanding), 4 (standard), 3 (fair), 2 (poor), and 1 (very poor). After modifications following the pilot study, the final form contained 29 items: 14 items for record keeping and 15 items for quality of care (Table 1). Some of the 15 items for quality of care, such as B8 (appropriate treatment) and B9 (EBM), seemed difficult to assess, but we assumed that excellent reviewers, using their knowledge and experience, could read between the lines of medical records.

The most controversial issue after the pilot study was whether the “NA (not applicable)” option in the initial evaluation sheet should be omitted. It was eventually retained for rare but possible situations, such as answering B11 (Does he/she refer the patient to other doctors, if necessary?) when referral is not necessary, even though the presence of NA would hamper the statistical analysis.
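The paper does not state how “NA” responses entered the total score; as a purely hypothetical sketch (the skip-NA rule and the function name are our assumptions, not the authors’ method), the aggregation could look like:

```python
def total_score(record_keeping, quality_of_care):
    """Sum 14 record-keeping items (scored 1-3) and 15 quality-of-care
    items (scored 1-5); items marked None ("NA") are skipped.
    NOTE: skipping NA items is an assumption for illustration only."""
    items = list(record_keeping) + list(quality_of_care)
    return sum(v for v in items if v is not None)

# A chart with every record-keeping item fully written (3) and every
# quality-of-care item rated "standard" (4): 14*3 + 15*4 = 102.
print(total_score([3] * 14, [4] * 15))  # prints 102

# The same chart with B11 marked NA: that item simply drops out.
print(total_score([3] * 14, [4] * 10 + [None] + [4] * 4))  # prints 98
```

Dropping NA items biases totals downward for cases with many NA responses, which is one concrete way the retained NA option could “hamper the statistical analysis.”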

Table 1.

Mean scores (standard deviations) of each item according to hospitals.

Participants

The PRS committee selected five hospitals based on the following criteria: (1) general hospitals in the Tohoku region (northeastern Japan), and (2) approval from the IRB of the hospital was obtained. The average number of beds of the selected hospitals (Ishinomaki Red Cross Hospital, Sendai City Hospital, Yamagata Prefectural Central Hospital, Iwate Prefectural Central Hospital, Osaki Citizen Hospital) was 546 (range: 404-685). Three hospitals were chosen for the pilot study in 2011, and four (including two from the pilot study) were chosen for the main study in 2012. Three hospitals had electronic medical records and one hospital had paper records.

Patients were selected by a representative at each hospital and a member of the PRS committee based on the following criteria: outpatients (1) who visited the hospital for the first time between April 2008 and March 2012 and were eventually hospitalized, (2) who were seen by doctors three to ten years after graduation from medical school, and (3) whose final diagnosis could be any condition, as long as it was in the field of internal medicine. Patients seen by residents (doctors within two years of graduation) were excluded, because senior doctors always supervised their patient care.

Reviewers were selected by the PRS committee based on the following criteria: general internists (1) who were working in hospitals outside the Tohoku region, and (2) who had reputations for excellence across a broad field of internal medicine. The selected reviewers came from workplaces all over Japan, from Hokkaido (the northernmost region of Japan) to Okinawa (the southwesternmost region of Japan).

Data analysis

Mean scores and standard deviations of the 29 items were calculated for each hospital. The internal consistency of the items was evaluated using Cronbach’s alpha. Inter-rater reliability among the scores of the four reviewers was examined by calculating ICCs. In addition, an exploratory factor analysis was performed to investigate the construct validity of the evaluation sheet. Parameters were estimated by maximum-likelihood estimation, and Promax rotation was employed for the rotation of the estimated factors. SPSS (version 15.0) was used for the statistical analysis.
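The paper reports these statistics as SPSS output. As a rough illustration (not the authors’ code), Cronbach’s alpha and the single- and average-measure ICCs can be computed as follows; the two-way random-effects, absolute-agreement ICC model (Shrout-Fleiss ICC(2,1) and ICC(2,k)) is an assumption here, since the paper does not state which SPSS ICC variant was used, and the factor analysis is omitted.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_cases x n_items) score matrix."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = x.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

def icc_two_way_random(scores):
    """Single- and average-measure ICCs for an (n_targets x k_raters)
    matrix: two-way random effects, absolute agreement
    (Shrout & Fleiss ICC(2,1) and ICC(2,k))."""
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()
    rows = y.mean(axis=1)  # per-target (case) means
    cols = y.mean(axis=0)  # per-rater (reviewer) means
    msr = k * ((rows - grand) ** 2).sum() / (n - 1)  # between-target MS
    msc = n * ((cols - grand) ** 2).sum() / (k - 1)  # between-rater MS
    mse = (((y - rows[:, None] - cols[None, :] + grand) ** 2).sum()
           / ((n - 1) * (k - 1)))                    # residual MS
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average
```

With four reviewers (k = 4), the average-measure ICC is necessarily higher than the single-measure ICC (here 0.917 vs. 0.733), which is why averaging scores from multiple reviewers, one of the strategies cited in the Introduction, raises reliability.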

Results

Time

It took three to four days for the reviewers to visit the hospitals and complete the review. The total time required for an evaluation ranged from 1,170 to 1,405 minutes (mean: 1,260 minutes; 11.3 minutes per patient).

Scores

Among the 112 cases reviewed, two cases were excluded from the analyses below because of incomplete evaluation sheets. The diagnoses of 110 cases included 30 gastrointestinal diseases, 28 cardiovascular diseases, 12 respiratory diseases, and 40 other diseases.

The mean scores (standard deviations) of each item according to the hospitals are shown in Table 1. The average score (standard deviation) of items B1 through B15 (quality of care) for the 110 cases was 3.57 (0.34). The average scores (standard deviations) of items B1 through B15 for the four reviewers were 3.73 (0.51), 3.46 (0.44), 3.60 (0.51), and 3.55 (0.41) (data not shown). The average scores (standard deviations) of items B1 through B15 of gastrointestinal diseases, cardiovascular diseases, respiratory diseases, and other diseases were 3.57 (0.37), 3.54 (0.26), 3.62 (0.34), and 3.59 (0.35), respectively (data not shown).

The percentages of “NA” were very high in item B11 (referral to other doctors, 41.2%), and high in item B8 (treatment, 6.8%), probably because we focused on outpatient care, in which patients were sometimes hospitalized quickly before receiving any treatment or being referred to other doctors. Among the record keeping items, the percentage of “NA” was high in A6 (social history, 2.8%), possibly because reviewers may have decided this information was unnecessary in some cases.

Although preliminary, several observations can be made from this table. First, with regard to record keeping, each hospital had weak items, such as A13 (explanation to the patient) in hospital 1 (1.67) and A7 (history of allergies) in hospital 2 (1.38). These weak items appeared to correlate with the forms used by the hospitals; the chart in hospital 1 had no form for “explanation to the patient,” and the chart in hospital 2 had no form for “history of allergies.” Second, the total mean score for item B14 (outcome) was high (3.90) despite the relatively low scores for items B1 through B4 (history taking and physical examination). Hospital 2 presented a typical case: its mean score for item B14 (3.92) was the highest, while its mean scores for items B1 through B4 were the lowest among the four hospitals.

Reliability and validity

Cronbach’s alpha was approximately 0.8 for all 29 items, indicating sufficient internal consistency among the items (Table 2, part I). The ICCs for the reviewers revealed high correlations, 0.733 for the single measure and 0.917 for the average measure, indicating high inter-rater reliability among the scores of the four reviewers (Table 2, part II).

An exploratory factor analysis revealed six factors: “history taking,” “physical examination,” “clinical reasoning,” “management and outcome,” “rhetoric,” and “patient relationship” (Table 3). We removed the following 16 items from the factor analysis: A1 through A14, because they record objective facts; B11, because of the high rate of “NA” responses (41%); and B15, because “overall assessment” was not suitable for factor analysis.

Table 2.

Reliability analyses.

Table 3.

Factor analysis.

Bold values indicate factor loadings higher than 0.3.

Factor 1: history taking, Factor 2: management and outcome, Factor 3: clinical reasoning, Factor 4: patient relationship, Factor 5: physical examination, Factor 6: rhetoric.

Discussion

In the present study, we have developed a peer review system of medical records with high inter-rater reliability (exhibiting one of the highest ICCs ever reported). We have also shown some construct validity of the evaluation sheet by factor analysis. The next step is a prospective study to determine content and criterion validity, as well as further construct validity.

The present system, in which medical records are reviewed by visiting each hospital, proved feasible with no practical problems. However, considering the time and cost of visiting hospitals, a system by which reviewers can review records in their own workplace, similar to the current peer review system of academic papers, may be a preferable alternative in the future, if the security of the patients’ information can be guaranteed.

High inter-rater reliability was obtained, probably because (1) we selected reviewers who had a reputation as good internists in a broad field of medicine, (2) we provided criteria to the reviewers by presenting benchmark medical records obtained from the pilot study, (3) the evaluation sheet was modified after the pilot study by reviewers as well as members of the PRS committee, and (4) reviewers were able to read the summary sheet of inpatient care to evaluate the “outcome” of outpatient care. These structured conditions were among the previously proposed strategies to achieve adequate reliability, as described in the Introduction section (Goldman 1992; Smith et al. 1997).

Construct validity was supported by the exploratory factor analysis, indicating that our evaluation sheet measured various skill domains, including those previously emphasized, such as history taking, physical examination, clinical reasoning, and management (Rethans et al. 1994; Goulet et al. 2007). In addition to these established skill domains, we attempted to measure the physician-patient relationship, mainly using items B12 and B13. Whether we can measure the empathy of doctors by reviewing medical records remains to be determined, despite the internal consistency obtained in the current study: Cronbach’s alpha was lower when item B12 (empathy) was deleted than when all items were included. The measurement of empathy has been receiving international attention, and a study using the Japanese version (Kataoka et al. 2009) of the Jefferson Scale of Physician Empathy (JSPE) (Hojat et al. 2002), composed of 20 items answered on a seven-point Likert-type scale, suggested cultural differences in empathic behavior. We hope to examine the correlation between empathy measured by our system and that measured by these established instruments in the future.

Several issues from the data are worth mentioning, although we recognize that they are preliminary. First, some weak items of record keeping appeared to correlate with the format of the charts used in the hospitals, supporting the theory that the quality of patient care depends on “structure” as well as “process” and “outcome” (Donabedian 1988). Second, the low mean scores of B1 through B4 (history taking and physical examination) indicate that Japanese physicians, at least those in the current study, may not be good at systematic history taking and physical examination, as has been pointed out by Western-trained Japanese physicians (Shimahara 2002). Third, despite the low scores on items B1 through B4, the mean score of item B14 (outcome) was high, perhaps because many physicians in the current study quickly resorted to diagnostic examinations such as CT scans. Both the number of CT scanners per million population and the estimated number of radiation-induced cases of cancer per year are the highest in Japan (Berrington de González and Darby 2004; Hall and Brenner 2008); the implications of these findings need further discussion.

Our study has several limitations. First, the numbers of reviewers and record samples were both small; therefore, a generalizability analysis was not performed. An extension study, including a greater number of hospitals outside the Tohoku region and more reviewers from various backgrounds, is underway for generalizability analysis. Second, the validity analysis was insufficient. To further examine validity, particularly content and criterion validity, a prospective study investigating the correlation between the assessments in the current system and those by persons familiar with the doctors’ performance (program directors, co-medical staff members, and patients) is being planned. Third, we focused only on outpatient care, which was performed by individual physicians. Whether we can evaluate inpatient care, which is normally performed by teams rather than individual physicians, remains to be investigated.

Japanese medical education has undergone significant changes since 1990, such as the introduction of problem-based learning tutorials, the objective structured clinical examination (OSCE), and clinical clerkships (Kozu 2006; Teo 2007). In 2004, a new postgraduate medical education program including mandatory rotations in various clinical departments, such as pediatrics, obstetrics/gynecology, and psychiatry, was introduced (Nomura et al. 2008). To recruit students from various backgrounds, such as the humanities/social science track, to medical schools, the introduction of a new medical school system has also been proposed (Tokuda et al. 2008). However, these reforms have been conducted, and are still being discussed, without any measures of outcome evaluation. We hope the current system will enable us to contribute to the measurement of outcome evaluation in the future.

Acknowledgments

This work was supported in part by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science, and Technology of Japan (22590448). We thank Drs. Yutaka Kagaya, Yoshiyuki Ueno, Akira Imatani, Atsushi Takeda, and Masaki Kanemura (Tohoku University Graduate School of Medicine) for cooperating as members of the PRS committee, Dr. Mitsunori Miyashita (Tohoku University Graduate School of Medicine, Department of Health Sciences) for statistical analysis in the pilot study, and Dr. Yasumichi Kinoshita (Ishinomaki Red Cross Hospital), Dr. Masao Hiwatari (Sendai City Hospital), Dr. Hiroaki Takahashi (Iwate Prefectural Central Hospital), Dr. Makio Gamo (Osaki Citizen Hospital), and Dr. Toshikazu Goto (Yamagata Prefectural Central Hospital) for their support and cooperation in reviewing patients’ medical records. We also thank all the reviewers for reviewing the medical records of patients, Dr. Makoto Kikukawa (Kyushu University) and Dr. Junya Iwazaki (Tohoku University) for critical reading of the manuscript, and Mr. Yutaro Arata, Mr. Katsunori Tanaka, Mr. Shinya Otsuki, Ms. Naoko Chiba, and Ms. Ayaka Arata (Office of Medical Education, Tohoku University) for their technical assistance.

Conflict of Interest

All authors declare no conflict of interest.

References
 
© 2014 Tohoku University Medical Press