Declining Accuracy in Disease Classification on Health Insurance Claims: Should We Reconsider Classification by Principal Diagnosis?

Background: An ideal classification should have maximum intercategory variance and minimal intracategory variance. Health insurance claims typically include multiple diagnoses and are classified into different disease categories by choosing principal diagnoses. The accuracy of classification based on principal diagnoses was evaluated by comparing intercategory and intracategory variance of per-claim costs, and the trend in accuracy was reviewed.
Methods: Means and standard deviations of log-transformed per-claim costs were estimated from outpatient claims data from the National Health Insurance Medical Benefit Surveys of 1995 to 2007, a period during which only the ICD10 classification was applied. Intercategory and intracategory variances were calculated for each of 38 mutually exclusive disease categories, and the percentage of intercategory variance to overall variance was calculated to assess the trend in accuracy of classification.
Results: A declining trend in the percentage of intercategory variance was observed: from 19.5% in 1995 to 10% in 2007. This suggests that there was a decline in the accuracy of disease classification in discriminating per-claim costs for different disease categories. The declining trend temporarily reversed in 2002, when hospitals and clinics were directed to assign the principal diagnosis. However, this reversal was only temporary and the declining trend appears to be consistent.
Conclusions: Classification of health insurance claims based on principal diagnoses is becoming progressively less accurate in discriminating per-claim costs. Researchers who estimate disease-specific health care costs using health insurance claims must therefore proceed with caution.


INTRODUCTION
Health insurance claims contain diagnostic information and are a valuable data source for economic and epidemiological studies. However, 2 problems arise when researchers use health insurance claims for epidemiological studies: the need to ensure (1) the accuracy of diagnoses and (2) the accuracy of disease classification. The former is challenging because health insurance claims are essentially financial documents and not medical records. The latter derives from the fact that principal diagnoses are chosen rather arbitrarily when the coders are not properly trained.
To bypass these difficulties, studies attempting to evaluate the economic effects of smoking, 1 walking, 2 and health promotional activities 3 have largely used per-capita health care cost, without disease classification. Some studies estimating disease-specific health care costs for diseases such as asthma 4 and liver disease 5 also used other data sources, including the Patient Survey (a one-day cross-sectional sampling survey conducted by the Japanese Ministry of Health, Labour & Welfare), to increase the accuracy of disease classification. Indeed, disease classification on health insurance claims was shown to be of questionable accuracy when compared with the Patient Survey even for a well-defined disease category like dialysis. 6
Health insurance claims are widely used for epidemiological studies abroad, and foreign researchers have validated the accuracy of diagnoses empirically, with more or less positive results. A Korean study reported 76% accuracy of acute myocardial infarction (AMI) diagnoses through matching with medical records. 7 A Taiwanese study reported 74.6% accuracy of diabetes diagnoses through a questionnaire survey of patients. 8 Researchers in the United States reported even higher accuracy: 94.1% positive predictive value (PPV) for AMI diagnoses, 9 72.6% to 80.8% PPV for pneumonia, 10 and 76.2% sensitivity and 93.3% specificity for hypertension. 11 Some researchers went so far as to match cases with the cancer registry to validate diagnoses of malignancy. 12
However, when researchers use health insurance claims classified by principal diagnoses, the second problem, ie, the accuracy of classification, is more important than the accuracy of diagnoses per se. There may also be systematic biases in classification, because some diseases are more likely to be chosen as principal diagnoses than others. 13
Accuracy of diagnoses can only be validated empirically through matching with a gold standard such as medical records, but accuracy of disease classification can be evaluated statistically. If all claims in the same disease category had the same cost, accurate classification would yield uniform claims within each category, ie, zero intracategory variance. In other words, accurate classification should maximize the intercategory variance while minimizing intracategory variance.
In this study, statistical analysis is used to evaluate the accuracy of classification by analyzing per-claim costs of outpatient claims. Per-claim cost is the amount of money charged for medical treatment and is written on the bottom line of a health insurance claim. Per-claim cost is expressed in points and can be converted into Japanese yen by multiplying by 10.

METHODS
Theory
Disease-specific means and variances can be estimated from published frequency tables, without referring to microdata that are not readily available, by using an optimization program such as Excel Solver under the assumption of a particular distribution. If a normal distribution is assumed, as is usually the case, then the frequency tables must follow a normal distribution for the optimization program to yield good estimates. Per-claim costs of health insurance claims do not follow a normal distribution, as evidenced by the skewed distribution in the frequency tables; they follow a log-normal distribution. 14 Therefore, the ranges of the frequency tables were log-transformed to ensure a normal distribution. The goodness-of-fit of the log-normal distribution was assessed with the Kolmogorov-Smirnov test.

Data source
The National Health Insurance Medical Benefit Survey (NHIMBS) is a survey of all insurers (1818 municipal governments and 165 National Health Insurance societies as of March 2007). Health insurance claims are sampled randomly by each insurer at a specified sampling proportion. The sampling proportion is approximately 1/500 for regular and elderly beneficiaries. Until 2002, elderly was defined as age 70 years or older, after which the threshold was raised gradually to 75 years in 2007. Elderly beneficiaries also include people 65 years or older with certain disabilities. The sampling proportion for retiree beneficiaries is 1/100.
Because health insurance claims are administrative data, the population of health insurance claims can be determined from monthly administrative reports compiled by the Central Federation of National Health Insurance (www.kokuho.or.jp). The exact population and sample size of outpatient claims, as well as the number of beneficiaries from which the data were derived, are shown in Table 1. Thirteen years of data (1995 to 2007) were used because the same ICD10 classification (commonly referred to as the "119 classification" 15) has been applied since 1995.
The representativeness of the data is believed to be satisfactory because the survey includes all insurers. However, some irregularities were observed for renal failure in 2000, as shown in Table 2. Data in the genitourinary disease category for 2000 were therefore adjusted on the basis of the 1999 and 2001 data.

Estimation of means and standard deviations of log-transformed per-claim costs
To determine whether an observed distribution follows a hypothesized distribution (such as a normal distribution), the Kolmogorov-Smirnov (KS) test is used. In this test, the maximum discrepancy between the 2 cumulative distributions, or KS value, is used as the test statistic. If the KS value is smaller than 1.63/√n, then one can assume that the observed distribution follows the hypothesized distribution at P = 0.01. 16
Because the NHIMBS provides only arithmetic means for per-claim costs, the means (m) and standard deviations (σ) of log-transformed per-claim costs were estimated from disease-specific frequency tables (Summary table 16-2) using Excel Solver, an add-in program for Microsoft Excel software.
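The acceptance threshold of the KS test described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the sample sizes are made up, and the coefficient 1.63 is simply the value quoted in the text for P = 0.01.

```python
import math


def ks_critical_value(n, coefficient=1.63):
    """Approximate critical KS value, coefficient / sqrt(n) (1.63 for P = 0.01)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return coefficient / math.sqrt(n)


def is_compatible(ks_value, n):
    """True when the observed KS value is below the critical value."""
    return ks_value < ks_critical_value(n)


# Hypothetical sample sizes, chosen only to show how the threshold shrinks with n.
for n in (100, 1_000, 10_000):
    print(n, round(ks_critical_value(n), 4))
```

Because the threshold shrinks with the square root of n, a category with a very large sample can be rejected for a discrepancy that a smaller category would pass.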
For example, 25.1% of outpatient claims with a principal diagnosis of diabetes were in the range of 500 to 1000 yen per claim in 2006. The range of 500 to 1000 yen was log-transformed to LN(500) to LN(1000), or 6.21 to 6.91 (LN, natural logarithm). If the log-transformed per-claim costs follow a normal distribution, the proportion of claims in this range is expressed with Excel functions as follows ("TRUE" in Excel functions denotes cumulative distribution functions; "FALSE" denotes probability density functions):
NORMDIST(LN(1000), m, σ, TRUE) − NORMDIST(LN(500), m, σ, TRUE)   [1]
Frequency tables consist of 7 ranges (1-500, 500-1000, 1000-2000, 2000-3000, 3000-5000, 5000-10 000, and ≥10 000 yen per claim). Let $R_k$ denote the cumulative proportion of claims in the frequency tables in the kth range (1 ≤ k ≤ 7) and $E_k$ denote the estimated cumulative proportion in the log-transformed kth range using formula [1]. Then, the KS value is expressed as follows:
$$\mathrm{KS} = \max_{1 \le k \le 7} \lvert R_k - E_k \rvert \qquad [2]$$
Optimal m and σ were obtained using Excel Solver to minimize the KS value of formula [2] for all disease categories and years. The square of σ, σ², gives the variance within a given disease category (hereafter referred to as intracategory variance).
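The estimation step can be sketched in Python as follows. This is not the Excel Solver procedure used in the study: a simple coarse-to-fine grid search stands in for Solver, the search windows are assumptions, and the frequency table is synthetic (generated from a known log-normal distribution), so the example only checks that minimizing the KS value recovers m and σ.

```python
import math

# Upper edges (yen) of the first 6 ranges. The 7th range (>= 10 000 yen) is
# open-ended, so its cumulative proportion is always 1 and adds no information.
UPPER_EDGES_YEN = [500, 1000, 2000, 3000, 5000, 10000]


def normal_cdf(x, m, sigma):
    """Cumulative normal distribution (the analogue of Excel's cumulative NORMDIST)."""
    return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))


def ks_value(observed_cum, m, sigma):
    """Largest gap between observed and fitted cumulative proportions."""
    return max(
        abs(r - normal_cdf(math.log(edge), m, sigma))
        for r, edge in zip(observed_cum, UPPER_EDGES_YEN)
    )


def fit_m_sigma(observed_cum, rounds=8, steps=20):
    """Coarse-to-fine grid search for the (m, sigma) pair with the smallest KS value."""
    m_lo, m_hi = 4.0, 11.0   # assumed search window for m (log of yen)
    s_lo, s_hi = 0.1, 3.0    # assumed search window for sigma
    best_m, best_s = m_lo, s_lo
    for _ in range(rounds):
        grid = [
            (m_lo + (m_hi - m_lo) * i / steps, s_lo + (s_hi - s_lo) * j / steps)
            for i in range(steps + 1)
            for j in range(steps + 1)
        ]
        best_m, best_s = min(grid, key=lambda p: ks_value(observed_cum, p[0], p[1]))
        # Halve the window around the current best point and search again.
        m_half, s_half = (m_hi - m_lo) / 4.0, (s_hi - s_lo) / 4.0
        m_lo, m_hi = best_m - m_half, best_m + m_half
        s_lo, s_hi = max(1e-3, best_s - s_half), best_s + s_half
    return best_m, best_s, ks_value(observed_cum, best_m, best_s)


# Synthetic cumulative table from a known log-normal (m = 7.0, sigma = 1.0);
# these are not NHIMBS figures.
synthetic_cum = [normal_cdf(math.log(edge), 7.0, 1.0) for edge in UPPER_EDGES_YEN]
m_hat, sigma_hat, ks = fit_m_sigma(synthetic_cum)
print(f"m = {m_hat:.3f}, sigma = {sigma_hat:.3f}, KS = {ks:.4f}")
```

With a real frequency table, `observed_cum` would hold the published cumulative proportions at the six upper edges, and the returned KS value could then be compared with the critical value for the category's sample size.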

Estimation of intercategory variances
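Let n, m, and σ denote the number of claims and the mean and standard deviation of log-transformed per-claim costs, respectively, of the entire sample, and $n_k$, $m_k$, and $\sigma_k$ denote those of the kth disease category. The relationship between the entire sample and the disease categories is expressed as follows:
$$n = \sum_k n_k, \qquad m = \frac{1}{n} \sum_k n_k m_k \qquad [3]$$
$$\sigma^2 = \frac{1}{n} \sum_k n_k \sigma_k^2 + \frac{1}{n} \sum_k n_k (m_k - m)^2 \qquad [4]$$
Formula [4] signifies the following relationship:
overall variance = intracategory variance + intercategory variance
Hence, intercategory variance was calculated using the second part of the right side of formula [4].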
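The decomposition of overall variance into intracategory and intercategory parts can be illustrated with a short Python sketch. It assumes the standard pooled-variance identity (weighted mean of category variances plus weighted variance of category means); the function names and the three categories are hypothetical, not NHIMBS figures. The same identity can be solved for one missing category, which is one way a residual subcategory could be backed out of a parent category.

```python
import math


def decompose_variance(categories):
    """Return (n, mean, intracategory variance, intercategory variance).

    categories: list of (n_k, m_k, sigma_k) tuples on the log scale.
    """
    n = sum(n_k for n_k, _, _ in categories)
    mean = sum(n_k * m_k for n_k, m_k, _ in categories) / n
    intra = sum(n_k * s_k ** 2 for n_k, _, s_k in categories) / n
    inter = sum(n_k * (m_k - mean) ** 2 for n_k, m_k, _ in categories) / n
    return n, mean, intra, inter


def residual_category(parent, sub):
    """Back out (n, m, sigma) of 'parent minus sub' from the same identity."""
    n_p, m_p, s_p = parent
    n_s, m_s, s_s = sub
    n_r = n_p - n_s
    m_r = (n_p * m_p - n_s * m_s) / n_r
    # Second moments add up across categories: n * (sigma^2 + m^2).
    second_moment = (n_p * (s_p ** 2 + m_p ** 2) - n_s * (s_s ** 2 + m_s ** 2)) / n_r
    return n_r, m_r, math.sqrt(second_moment - m_r ** 2)


# Hypothetical categories: (number of claims, mean, SD of log cost).
cats = [(600, 7.0, 0.9), (300, 7.5, 1.0), (100, 9.0, 0.8)]
n, mean, intra, inter = decompose_variance(cats)
share = inter / (intra + inter)
print(f"intercategory share of overall variance: {share:.3f}")

# Round trip: pool the first two categories, then recover the second one.
n2, mean2, intra2, inter2 = decompose_variance(cats[:2])
parent = (n2, mean2, math.sqrt(intra2 + inter2))
recovered = residual_category(parent, cats[0])
```

The `share` value corresponds to the percentage of intercategory variance to overall variance that the study tracks over time.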

Extrapolation of sample size
The number of sampled claims was obtained from the raw output tables (Table 7-1 for regular, 7-2 for elderly, and 7-3 for retiree beneficiaries) of the NHIMBS. However, the numbers of claims from these 3 beneficiary categories cannot simply be summed because the sampling proportions differ (1/500 for regular and elderly beneficiaries and 1/100 for retiree beneficiaries). Hence, the number of claims for retiree beneficiaries was divided by 5 to adjust for the difference in sampling proportion.
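The adjustment for unequal sampling proportions amounts to rescaling every count to a common 1/500 basis. A minimal Python sketch, with hypothetical claim counts and names chosen only for illustration:

```python
# Sampling denominators stated in the text: 1/500 and 1/100.
SAMPLING_DENOMINATOR = {"regular": 500, "elderly": 500, "retiree": 100}
BASE_DENOMINATOR = 500  # common basis to which all counts are rescaled


def comparable_counts(sampled):
    """Rescale sampled claim counts so that all reflect a 1/500 sampling proportion."""
    return {
        group: count * SAMPLING_DENOMINATOR[group] / BASE_DENOMINATOR
        for group, count in sampled.items()
    }


# Hypothetical sampled counts, not NHIMBS figures.
sampled = {"regular": 40_000, "elderly": 30_000, "retiree": 25_000}
adjusted = comparable_counts(sampled)
total = sum(adjusted.values())
print(adjusted, total)
```

Retiree counts are multiplied by 100/500, which is the division by 5 described above; regular and elderly counts are unchanged.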

Calculation of means and variance of residual subcategories
The NHIMBS presents disease-specific data on all 19 major disease categories in ICD10 (I-XIX), plus some selected subcategories. For example, the NHIMBS presents data on ophthalmic disease (VII), as well as on a subcategory, cataract (H25-26). From these data, a residual subcategory, "other ophthalmic diseases (H00-59 minus H25-26)", must be derived to create a mutually exclusive disease classification. The means and variances of residual subcategories can be calculated using formula [4]. A total of 38 mutually exclusive disease categories were thus created. A subcategory, "renal failure", was merged with a major category, "genitourinary diseases", because of irregularities in the data.

RESULTS
Table 3 illustrates how the optimal m and σ were obtained. The left frequency table presents the actual distribution of per-claim costs, and the right frequency table presents the theoretical distribution when per-claim costs are log-transformed and assumed to follow a normal distribution with the optimal m and σ that minimize the KS value. Table 4 shows the results of the KS test for goodness-of-fit. Overall, per-claim costs were shown to follow a log-normal distribution in 5 of 13 years (1995, 1996, 1997, 2001, and 2005). On a disease-specific level, a majority of disease categories were shown to follow log-normal distributions. Most notably, all disease categories followed a log-normal distribution in 1995 and 1996. Hypertension had the largest number of non-compatible years (11 out of 13 years), reflecting its large sample size, followed by genitourinary diseases (9 out of 13 years), including dialysis, which has an exceptionally high per-claim cost. The overall compatibility improved when hypertension and/or genitourinary diseases were excluded (shown as a reference in Table 4). Without these 2 categories, per-claim costs were shown to follow a log-normal distribution in all 13 years.
Tables 5 and 6 show the exponentiated m and σ (exp(m) and exp(σ)), ie, the geometric means and standard ratios, for all disease categories and years. Geometric means of per-claim costs have consistently decreased, which may reflect a reduction in drug costs due to increasing separation of dispensing and prescription. In contrast, the standard ratio remains constant, around 2.55 to 2.70, throughout the study period. It is noteworthy that the standard ratio of per-claim costs is close to the Napier constant (e = 2.718). If the geometric mean of per-claim costs is 1000 yen and the standard ratio is 2.7, one can assume that 68% of claims fall within the range of 1000/2.7 to 1000 × 2.7, or 370 to 2700 yen.
[Table 5. Exponentiated optimal means of log-transformed per-claim costs (= geometric means; in Japanese yen)]
Table 7 shows the consistent decline in intercategory variance relative to overall variance: in 1995, intercategory variance was 19.5% of overall variance, but it declined to only 10% in 2007. This means that disease categories now discriminate differences in per-claim costs less well than before; Figure 1 shows the trend line (Y = −0.0065X + 0.1659, R² = 0.901). The declining trend reversed in 2002, when hospitals and clinics were mandated to choose principal diagnoses; however, the reversal was only temporary and the declining trend appears consistent. In 2007, another reversal occurred, but it is too early to determine if it is temporary.
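The 68% range quoted for a geometric mean of 1000 yen and a standard ratio of 2.7 follows from the one-standard-deviation interval of a normal distribution on the log scale. A minimal Python sketch of that arithmetic, with hypothetical function names:

```python
import math


def one_sigma_range(geometric_mean, standard_ratio):
    """Range covering one standard deviation on the log scale, back in yen."""
    return geometric_mean / standard_ratio, geometric_mean * standard_ratio


def share_within(k=1.0):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


low, high = one_sigma_range(1000, 2.7)
print(f"{low:.0f} to {high:.0f} yen covers about {share_within():.1%} of claims")

# A standard ratio equal to e corresponds to sigma = 1 on the log scale.
sigma_if_ratio_is_e = math.log(math.e)
```

Since the standard ratio is exp(σ), a ratio close to e simply means that the standard deviation of log-transformed per-claim costs is close to 1.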

DISCUSSION
This study demonstrated a consistent decline in the intercategory variance of per-claim costs. If the true differences in per-claim costs among disease categories are assumed to be constant, the declining intercategory variance can be interpreted as declining accuracy of classification or, in other words, increasing misclassification. Until 2001, disease classification was conducted rather arbitrarily, with no explicit criteria, by nonprofessionals at insurers. Starting in 2002, hospitals and clinics were required to specify principal diagnoses, which, it was hoped, would enhance the accuracy of classification. The change in classifiers did increase intercategory variance, as suggested by this author's previous study, 17 but the effect was short-lived and does not appear to have altered the overall declining trend. This finding is sufficient to rebut the common claim that classification is accurate when doctors choose principal diagnoses.
The goodness-of-fit evaluated by the KS test revealed that all disease categories followed log-normal distributions in 1995 and 1996, but that the goodness-of-fit deteriorated year by year as more categories failed to follow a log-normal distribution, as indicated by increasing KS values. At the same time, this study revealed that the standard ratio of per-claim costs remained stable and close to the Napier constant (2.718). This finding should prove to be a useful rule of thumb for analysis of health insurance claims: 68% of claims fall between 1/2.718 and 2.718 times the geometric mean.
What, then, is the cause of the decline in accuracy? The most probable cause is the increasing number of diagnoses, as suggested by this author in 1996. 13 The average number of diagnoses per claim has consistently increased, as shown in Figure 2. The increased intercategory variance in 2002 can be explained by the sudden reduction in the number of diagnoses due to the revised rule exempting diagnoses for inexpensive drugs. Whatever the causes, disease classification by principal diagnoses is becoming progressively less accurate in discriminating per-claim costs. With the rapid computerization of claims, there is a need for a statistical method that can objectively quantify all diagnoses. Such a method was described by this author in a previous study. 18