Evaluation of the completeness of cancer case ascertainment in the Seoul male cohort study: application of the capture-recapture method.

Since the completeness of case ascertainment is directly related to the validity of a study, the evaluation of completeness is an essential feature of a cohort study. To estimate the completeness of cancer case ascertainment during the three-year period (Jan. 1, 1993, to Dec. 31, 1995) in which the Seoul Male Cohort was followed up, we applied the capture-recapture method. Data were obtained from cancer registries, medical records and death certificates, with cases identified from each source numbering 103, 105, and 38, respectively. After eliminating duplicate cases, the total number was 141, and by using a log-linear model, the number of cases not detected by any of the three data sources was estimated to be 16. For all cancers, the estimated completeness of follow-up was 89.9%.

Keywords: completeness, cancer, cohort follow-up, capture-recapture method

To detect the occurrence of the target disease during follow-up in a cohort study, researchers may use various data sources, including medical claim data, medical records, cancer registries, death certificate data, and pathologic data. Despite this multiplicity of data sources, it is hard to gain access to all of them. Researchers therefore choose some, not all, of the data sources, namely those regarded as available and effective for follow-up. However, even when several data sources are combined, detection of cases is not complete, because some true cases are missed by every data source. We therefore need to evaluate how completely follow-up is being conducted in a cohort study.
The first and essential step of the evaluation is to assess the completeness of the data, that is, the proportion of target events captured in the data 1). In a cohort study, for example, completeness of follow-up means the ratio of the number of detected cases to the number of actual cases, which is hard to identify exhaustively 2,3). This study was carried out to estimate the completeness of cancer case ascertainment in a cohort study with a follow-up period of January 1993 to December 1995 4). The data sources used during follow-up were medical insurance claim data of the Korean Medical Insurance Corporation (KMIC), medical records in hospitals, cancer registry data and death certificates. To evaluate the completeness of ascertainment, we used a capture-recapture method to estimate the number of undetected cases.

Study Population
To be eligible for inclusion in the study, participants had to be men resident in Seoul, insured by KMIC, aged 40 to 59 in January 1991, and found to be healthy at the regular health examination that KMIC conducts biennially as a health promotion program. The eligible population was 54,378 men in 1,262 institutions. By cluster random sampling, 29,918 were sampled. We sent postal questionnaires to these 29,918 persons to obtain information about lifestyle factors related to cancer risk. Of the 29,918, 14,533 returned interpretable questionnaires (48.6%). We selected study subjects from those insured by KMIC for two reasons: each insured person receives a health examination biennially, which makes it easy to construct a disease-free cohort, and the medical insurance claims of the insured are computerized and preserved for inspection, so the data necessary for follow-up in a cohort study could easily be obtained.
Cancer case ascertainment from each data source
Cancer cases were ascertained in three different ways (Figure 2). Medical insurance claim data were used to screen for potential cancer cases, whose medical records were then reviewed by visiting the hospitals. Cancer registry data were obtained to detect cancer cases. Death certificate data were accessible from the National Statistical Office. Data linkage was done using the resident identification number, which is unique and given to every citizen of Korea. International Classification of Diseases codes 140-208 (ICD-9) or C00-D09 (ICD-10) were regarded as relevant to cancer.
The medical insurance claims related to cancer were identified among cohort subjects. We had access to the computerized insurance claim data by courtesy of KMIC. The KMIC data consist of records of every hospital visit of each insured person, with the diagnosis as an ICD code, the date of visit, whether the visit was inpatient or outpatient, and the code number of the hospital visited. We regarded subjects recorded as cancer cases in the claim data as potential cancer cases, not true cases, because the validity of medical insurance claim data has not yet been proven. Therefore, we had to check the validity of the claim data by visiting hospitals and reviewing medical records. The claim data relevant to cancer between January 1993 and December 1995 were reviewed. A total of 369 persons were potential cancer cases; among these we accessed 301 medical records (Table 1). The medical records of 68 potential cases were inaccessible because of lack of cooperation from the hospitals or loss of medical records; of these 68, 10 were found in the cancer registry data. The remaining 58 potential cases were regarded as non-cases or missed cases. A structured abstract form was used for the medical records review. The contents of the abstract form included the date of first diagnosis, diagnostic measures and interpretation of the results, stage of cancer, treatment, etc. A well-trained reviewer (M.D.) decided whether each potential case had been a true cancer case by evaluating the completed abstract form. The first visit recorded as cancer in the KMIC data was regarded as the incidence date.

Table 1. Number of persons who were recorded as a cancer case in medical insurance claim data of KMIC (1993-1995) and number of accessible medical records.

Cancer cases were also identified from two cancer registries: the Central Cancer Registry, which collects cancer cases from 121 nationwide training hospitals, and the Seoul Cancer Registry, which covers small hospitals in Seoul. Using these data, cancer cases in the cohort were identified for the period 1993 to 1995. The cancer registry data had not been merged with death certificate data. Thus, the cancer registry data used in this study did not include death certificate only (DCO) cases.

In addition, deaths recorded in death certificates as due to cancer were identified. We used death certificate data from January 1993 through December 1996. To be used in the capture-recapture method together with the other data sources of this study, the incidence date must be known or otherwise estimated. The incidence date could not be determined from death certificate data alone. We therefore estimated it from the KMIC data, assuming that the first record of cancer in the KMIC data was the incidence date. Fortunately, all mortality cases were found among the potential cancer cases of the KMIC data, so their incidence dates were estimated using the first cancer claim date in the KMIC data.

Statistical analysis
When cancer cases are identified from three data sources, the overlap of incident cases can be illustrated as in Figure 1. 'H' represents cases not detected in any of the three data sources. Using a log-linear model, we can estimate the number of unobserved cases 'H' 1,2,5). The estimation requires several assumptions to hold 5). Independence between sources of data is the most important. Independence means that the probability of an individual being included in one data source does not depend on whether he or she was included in the other data sources. Statistically, the log-linear model handles dependency by adding an interaction term, which allows dependency to be examined and adjusted for 5).
Another important assumption is that there should be no heterogeneity. If there is a group of individuals who are less likely to be caught than another group, there is heterogeneity, by which an estimate can be biased. If catchability varies from individual to individual, those caught in every sample will have a higher average catchability than those who miss one sample, and so on down to those who are caught only once. We analysed heterogeneity using a method suggested by Cormack 7). A three-source model including all cases was compared with a three-source model excluding those who were caught by all three sources of data. The significance of the difference in catchability was tested using the difference between the two scaled deviances.
The goodness-of-fit of the three-source models was tested by the likelihood ratio statistic. For log-linear modelling and parameter estimation, the GLIM 3.77 statistical package was used 6,7). In this study, the G² likelihood ratio goodness-of-fit statistic was used to estimate the 95% confidence interval 8,9). To calculate the goodness-of-fit based confidence interval, we used the CATMOD procedure of SAS 6.12 10), which was more convenient for iteration than the GLIM package.
To compare the estimates from various models, we applied two-source capture-recapture analysis 10). Each pair of the three data sources was analysed to estimate the total population, for example, [medical records review] versus [cancer registries]. In addition, each combined pair of data sources was analysed against the remaining one, for example, [medical records review and/or death certificates] versus [cancer registries]. Consequently, we tried six two-source models in addition to the three-source model.
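The two-source calculation can be sketched with the standard Lincoln-Petersen estimator and Chapman's bias-corrected variant. The text does not state which variant was used, so both are shown here as assumptions, and the function names are hypothetical. The example input uses counts reported in the Results (105 cases from medical records review, 103 from cancer registries, 140 from the two combined); the overlap is obtained by subtraction, so the output is only an illustration and need not match the values in Table 6.

```python
def lincoln_petersen(n1, n2, both):
    """Two-source estimate of the total number of cases: n1 * n2 / both."""
    if both <= 0:
        raise ValueError("the two sources must share at least one case")
    return n1 * n2 / both


def chapman(n1, n2, both):
    """Chapman's variant of the two-source estimate (defined even if both == 0)."""
    return (n1 + 1) * (n2 + 1) / (both + 1) - 1


# Counts taken from the text; the overlap is derived, not read from Table 3.
mr = 105          # cases found by medical records review
cr = 103          # cases found in cancer registries
mr_or_cr = 140    # cases found by at least one of the two sources
both = mr + cr - mr_or_cr   # cases found by both sources

lp_total = lincoln_petersen(mr, cr, both)
ch_total = chapman(mr, cr, both)
print(f"overlap={both}, Lincoln-Petersen={lp_total:.1f}, Chapman={ch_total:.1f}")
```

A combined pair such as [medical records review and/or death certificates] versus [cancer registries] is handled the same way, with the pooled count passed as `n1`.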
If an expected value of cancer incidence were available, the validity of the estimate could be evaluated, though roughly. We calculated the expected incidence for the three years of follow-up by indirect standardization using population-based incidence data of Seoul (1991) 11). Cancer incidence data for Seoul beyond 1991 were not yet available. To take into account the ageing of the subjects, the expected value was calculated for each year of follow-up with its own age distribution, and the yearly values were summed. The incidence data were from the implementation study of the Seoul Cancer Registry, which did not contain DCO cases 11).
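The expected-incidence calculation can be sketched as follows. Every number in this block is HYPOTHETICAL: the age groups, the reference rates and the yearly cohort counts are placeholders chosen only to show the arithmetic, not the Seoul 1991 rates or the actual cohort distribution. The sketch also assumes each subject contributes one person-year per calendar year.

```python
# HYPOTHETICAL reference incidence rates, cases per 100,000 person-years.
reference_rates = {
    "40-44": 100.0, "45-49": 200.0, "50-54": 400.0,
    "55-59": 600.0, "60-64": 900.0,
}

# HYPOTHETICAL number of cohort members in each age group, per follow-up year.
# The distribution shifts upward each year because the cohort ages.
cohort_by_year = {
    1993: {"40-44": 4000, "45-49": 4500, "50-54": 3500, "55-59": 2000, "60-64": 500},
    1994: {"40-44": 3000, "45-49": 4500, "50-54": 3800, "55-59": 2400, "60-64": 800},
    1995: {"40-44": 2000, "45-49": 4400, "50-54": 4000, "55-59": 2900, "60-64": 1200},
}


def expected_cases(person_years, rates):
    """Expected cases = sum over age groups of person-years x reference rate."""
    return sum(py * rates[age] / 100_000 for age, py in person_years.items())


# One expected value per follow-up year, then summed over the three years.
expected_by_year = {
    year: expected_cases(counts, reference_rates)
    for year, counts in cohort_by_year.items()
}
expected_total = sum(expected_by_year.values())
print(expected_by_year, round(expected_total, 1))
```

Computing the expectation year by year, rather than once from the baseline age distribution, is what lets the shift of subjects into older, higher-rate age groups be reflected in the total.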
Finally, by comparing the detected number of cancer cases with the estimated total cases, we calculated the completeness of ascertainment from medical records, cancer registry data, death certificates and overall case ascertainment.
observed completeness = detected cases in each data source / observed cancer cases
estimated completeness = detected cases in each data source / estimated cancer cases
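The two ratios can be checked with a short sketch. The counts are those reported in the paper (105, 103 and 38 cases per source, 141 observed in total, 156.8 estimated in total); the variable and function names are hypothetical.

```python
# Cases detected by each data source, as reported in the paper.
detected = {
    "medical records review": 105,
    "cancer registries": 103,
    "death certificates": 38,
}
OBSERVED_TOTAL = 141     # distinct cases found by at least one source
ESTIMATED_TOTAL = 156.8  # total estimated by the selected three-source model


def completeness(detected_cases, total_cases):
    """Completeness in percent: detected cases divided by total cases."""
    return 100.0 * detected_cases / total_cases


observed = {s: completeness(n, OBSERVED_TOTAL) for s, n in detected.items()}
estimated = {s: completeness(n, ESTIMATED_TOTAL) for s, n in detected.items()}
overall_estimated = completeness(OBSERVED_TOTAL, ESTIMATED_TOTAL)

for source in detected:
    print(f"{source}: observed {observed[source]:.1f}%, "
          f"estimated {estimated[source]:.1f}%")
print(f"all sources combined: estimated {overall_estimated:.1f}%")
```

Because the estimated total is larger than the observed total, the estimated completeness of every source is lower than its observed completeness.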

Number of cancer cases ascertained by each data source
From a review of the 301 medical records by visiting hospitals, 105 incident cancer cases were identified as having occurred between 1993 and 1995. From cancer registry data, 103 cases were ascertained during the same period. From death certificate data, 38 cancer cases were identified (Table 2). There was a gradual increase in the proportion of cases identified from cancer registries, whereas the number of cases from the other data sources fluctuated somewhat.
After excluding duplicates, a total of 141 cancer cases were found (Figure 2, Table 2). There was no remarkable increase in the annual number of incident cancer cases. Medical records review ascertained 74.5% of total cases. The number of cases found in cancer registries was not much different from that ascertained by medical records review. However, only 27.0% of total cases were found in death certificate data.
All cases, as identified by combining the three data sources, are shown in Table 3. Sixteen cases (11.3%) were detected from all three sources; 73 (51.8%) were detected from two, and 52 (36.9%) from one only. When the sources were combined in pairs, medical records review and cancer registries together detected 140 cases, which is 99.3% of total cases. The combination of medical records review with death certificates detected 123 cases, and that of cancer registries with death certificates detected 109.

Table 5 shows the statistical significance of each data source in the three-source log-linear models. After adding an interaction term, there was a significant decrease in scaled deviance. The significant interaction terms were 'medical record review + death certificates' and 'cancer registries + death certificates'. However, when the interaction term 'medical record review + death certificates' was included in the model, the main effect of 'medical record review' lost significance. Finally, the model including the interaction 'cancer registries + death certificates' was found to fit best. This means that there was dependence between cancer registry data and death certificate data. The number of undetected cases was estimated from the exponentiated parameter estimates. According to the model selected in Table 5, the total number of cancer cases was estimated to be 156.8, which is 15.8 more than the total detected from the three data sources and 4.2 more than the total estimated by the three-source model without an interaction term.
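The figure of 156.8 can be cross-checked without refitting the model. For a three-source log-linear model whose only interaction is between two of the sources, the estimated total has a closed form: it equals the two-source estimate obtained by setting the remaining source against the other two pooled. The sketch below applies this to medical records review versus cancer registries and/or death certificates, using only counts quoted in the text. It is not the GLIM fitting procedure used in the study, it assumes the selected model contains no interaction other than 'cancer registries + death certificates', and the function name is hypothetical.

```python
def total_with_one_interaction(n_independent, n_pooled, n_observed):
    """Closed-form total when one source is independent of the other two.

    n_independent: cases found by the source outside the interaction
    n_pooled:      cases found by at least one of the two dependent sources
    n_observed:    distinct cases found by any of the three sources
    """
    # Cases found both by the independent source and by the pooled pair.
    overlap = n_independent + n_pooled - n_observed
    if overlap <= 0:
        raise ValueError("the independent source must overlap the pooled pair")
    return n_independent * n_pooled / overlap


MR = 105          # medical records review
CR_OR_DC = 109    # cancer registries and/or death certificates
OBSERVED = 141    # all three sources combined, duplicates removed

estimated_total = total_with_one_interaction(MR, CR_OR_DC, OBSERVED)
undetected = estimated_total - OBSERVED
print(f"estimated total {estimated_total:.1f}, undetected {undetected:.1f}")
```

The result agrees with the reported total of 156.8 and with the 15.8 undetected cases, which is consistent with the selected model treating medical records review as independent of the other two sources.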
We present various estimates from two-source and three-source models (Table 6). Two-source models cannot resolve the problem of dependence; consequently, two-source models with positively dependent sources underestimate the total 1,2). If the positive dependence shown in the three-source analysis is real, models 3, 5 and 6 (Table 6) are two-source models in which the positive dependence remains unresolved, and their estimated totals would be underestimates. Indeed, the estimated totals from models 3, 5 and 6 were lower than those of both three-source models, with and without the interaction term. The smallest estimate was from model 3, which consists of the two data sources that depend on each other.
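The direction of the bias can be illustrated with a small deterministic sketch. All inputs are HYPOTHETICAL (a true total of 1,000 cases and made-up capture probabilities), and the function name is an assumption. The sketch plugs expected counts into the two-source estimator rather than simulating samples, so it shows the systematic effect of dependence only.

```python
def expected_two_source_estimate(n_true, p1, p2_given_1, p2_given_not_1):
    """Two-source estimate computed from expected counts.

    p1:             probability that a case is caught by source 1
    p2_given_1:     probability of capture by source 2 if caught by source 1
    p2_given_not_1: probability of capture by source 2 if missed by source 1
    """
    n1 = n_true * p1                              # expected cases in source 1
    both = n1 * p2_given_1                        # expected cases in both
    only2 = n_true * (1 - p1) * p2_given_not_1    # expected cases in source 2 only
    n2 = both + only2                             # expected cases in source 2
    return n1 * n2 / both


N_TRUE = 1000  # HYPOTHETICAL true number of cases

# In all three scenarios source 2 catches 40% of cases overall.
independent = expected_two_source_estimate(N_TRUE, 0.5, 0.4, 0.4)
positive = expected_two_source_estimate(N_TRUE, 0.5, 0.6, 0.2)
negative = expected_two_source_estimate(N_TRUE, 0.5, 0.2, 0.6)

print(f"independent: {independent:.0f}")
print(f"positive dependence: {positive:.0f}")
print(f"negative dependence: {negative:.0f}")
```

With positive dependence the overlap is inflated and the estimate falls below the true total; with negative dependence the overlap shrinks and the estimate exceeds it.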
To evaluate the estimated cancer incidence, the expected number of cases was calculated by standardization from the cancer incidence data of Seoul, 1991 (Figure 3). The expected value (155.9) was close to the estimated value and far from the observed value. Compared with the expected value, the estimate from the three-source model with interaction was more reasonable than those from the other models. Table 7 illustrates the completeness of each data source.

Table 5. Log-linear model fitting and evaluation for parameters and dependencies in three data sources, all cancers.

Completeness of case ascertainment
*1, medical record ascertainment; 2, cancer registry; 3, death certificate data. **p-value < 0.05
Table 6. Comparison of various models and total cancer occurrence estimated. Abbreviations: MR, medical records ascertainment; CR, cancer registries; DC, death certificates. *Model with interaction term of DC*CR.
Figure 3. Observed incidence, estimated incidence from log-linear model and expected incidence from indirect standardization using populational cancer incidence data of Seoul, 1991.

DISCUSSION
Among the total of 369 potential cancer cases from KMIC data, 301 (81.6%) medical records were accessible. However, only 105 were ascertained to be true cancer cases by medical record review. This high miscoding rate (65.1%) is likely due to the following reason. In Korea, claims for expensive or redundant diagnostic procedures tend to be submitted with a more persuasive, even if false, ICD code rather than the code of the actual diagnosis. Several validity studies are currently being conducted to demonstrate this phenomenon, but they have not yet been published. However, the 196 miscoded subjects do not influence the result because they were not included in the capture-recapture analysis. Although there may be true cases among the miscoded, they would not introduce bias, since the capture-recapture method itself presumes the presence of missed cases and attempts to estimate their number.
Validity of data is important in the capture-recapture method; in other words, there should be no false-positive cases. Misdiagnosis can lead the capture-recapture method to over- or underestimate the result 12). In particular, if there are false-positive cases, the correction for underestimation is not perfect. The cases ascertained by medical records review have few problems with false positives, because we judged validity from various aspects of each record, for example, diagnostic tools with their interpretation, treatment given, stage of cancer, etc.
The cancer registry data used in this study did not include DCO cases. All cases in the cancer registries were reported from hospitals with histological diagnosis, first date of diagnosis, pathologic diagnosis, etc. The proportion of histological verification (HV%) was 77% in the Seoul Cancer Registry (1991), which was slightly lower than in other reliable cancer registries (94% for whites in US Connecticut, 95% for whites in US SEER, 70 and 73% for males and females in Osaka, Japan). The lower HV% might be due to the high relative frequency of liver cancer in Korea. In fact, the HV% of liver cancer was 26% in males, while that of other major cancers was over 85%.
In the case of mortality data, there could be a problem of validity. In the nationwide death certificate data for 1995, the cause of death was unclassifiable in 0.8%, and a medical doctor certified the cause of death in only 60.6% 13). However, in this study, all 38 cases from death certificate data were certified by a medical doctor. The high certification rate is likely due to the relatively young age of the subjects and to the fact that all study subjects were public servants and residents of a metropolitan area.
The relatively small number of cases detected from death certificates is explained by our use of only 4 years of death certificate data, from 1993 through 1996. Once data after 1996 become available, the number of cases from the first three years of follow-up identified through death certificates is expected to exceed 38.
The capture-recapture method involves several assumptions 5). First, the population should not change during the investigation; second, there should be no loss of tags; third, for each sample, each individual should have the same chance of being included; fourth, the two samples should be independent.
When the capture-recapture method is applied to a cohort study, the first assumption is usually true because there is no temporal order between capture and recapture in case ascertainment by multiple sources 14). Migration out of the jurisdiction, death, and other losses to follow-up can also violate the first assumption in a fixed cohort study 14). However, the three sources of data used in this study are nationwide databases, so migration out of Seoul does not mean loss to follow-up. Death cannot be a loss to follow-up because we used death certificate data. One cause of loss to follow-up may be emigration, but among cases or potential cases it is unlikely to have been substantial enough to matter in this study.
The second assumption is completely satisfied in Korea, since every citizen has his or her own identification number, and it was included in all three sources of data without missing values.
The third and fourth assumptions are more often violated.
The third is that every individual should have the same chance of being included in the sample 1,2,5,15); if not, there is heterogeneity, which implies that dependency is present in all possible combinations of data sources 5). The third assumption fails when individuals differ in their probability of being caught in any sample. In this study, if some of the cohort subjects were less likely than others to be caught by any of the three data sources, there might be heterogeneity. Heterogeneity depends on the characteristics of individuals, not of the sources of data. Heterogeneity can be tested in various ways; we tried a brief test, in which we found no evidence of heterogeneity. Heterogeneity can also be considered on biological grounds. We had no reason to suppose that any group among the cancer cases of this study had a different probability of being caught from any other group, because the study subjects were of the same race and sex, within a narrow age range, and of not very different socio-economic status.
The fourth is that cases detected and cases not detected in one data source have the same probability of being detected in another data source. For example, if cancer cases detected by the cancer registry are more likely to be detected by a review of medical records than cases not detected in the cancer registry, the fourth assumption is violated. In this instance, the two sources of data show positive dependency, and this underestimates the number of undetected cases. Conversely, negative dependency may occur, and this overestimates the number of undetected cases. If there are three sources of data, two-source dependence (first-order interaction) can be handled by fitting a model with an interaction term 9). The dependence between cancer registries and death certificates was tested by fitting log-linear models, although second-order interaction could not be tested. Another method to test dependence, as Wittes et al. suggested, is that, if there are more than two sources, one can compare the sources two at a time 16). If one estimate is much lower than the rest, then one might suspect positive dependence. In this study, we performed two-source analysis to support the presence of dependence. The two-source analysis of cancer registries with death certificate data gave a lower estimate than any other two-source analysis, which supports the presence of the positive dependency that appeared in the three-source log-linear analysis. When two sources of data were combined into one sample and a two-source capture-recapture analysis was conducted with the remaining source, the two models that had the dependent sources on opposite sides gave lower estimates than the other two-source analysis and the three-source analysis with the interaction term. Thus, we believe that the positive dependence between cancer registries and death certificate data is appropriately treated by the log-linear model.
Since we had no true value for comparison, the population-based cancer incidence rate of Seoul was projected onto this study population to obtain an expected value. The estimated value was almost the same as the expected value. However, the population-based cancer registry in question is the Seoul Cancer Registry, which we also used in case ascertainment, and the number of cases from cancer registries, 103, is much lower than the expected value, 155.9, derived from the Seoul Cancer Registry. The much lower incidence in the study population can be explained by the fact that study subjects were a healthy population at the beginning of follow-up, so that the prevalence of risk factors for cancer might be lower than in the general population. With regard to completeness, if the estimated completeness of cancer registries could be generalized, the completeness-adjusted expected number of incident cases in this cohort would be 237.3, which is 155.9 (expected cases) divided by 0.657 (completeness of cancer registries). Thus, the interpretation of Figure 3 should be limited.
Exhaustive case ascertainment can definitely not be replaced by the capture-recapture method, but without the method, it is very difficult to evaluate the completeness and effectiveness of follow-up in a cohort study. This study showed that the capture-recapture method can be used to evaluate underestimation of the incidence rate calculated during such a study.