Validity of Using Japanese Administrative Data to Identify Inpatients With Acute Pulmonary Embolism: Referencing the COMMAND VTE Registry

Background: Acute pulmonary embolism (PE) is a life-threatening in-hospital complication. Recently, several studies have reported the clinical characteristics of PE among Japanese patients using the diagnostic procedure combination (DPC)/per diem payment system database. However, the validity of PE identification algorithms for Japanese administrative data is not yet clear. The purpose of this study was to evaluate the validity of using DPC data to identify acute PE inpatients.
Methods: The reference standard was symptomatic/asymptomatic PE patients included in the COntemporary ManageMent AND outcomes in patients with Venous ThromboEmbolism (COMMAND VTE) registry, which is a cohort study of acute symptomatic venous thromboembolism (VTE) patients in Japan. The validation cohort included all patients discharged from the six hospitals included in both the registry and DPC database. The identification algorithms comprised diagnosis, anticoagulation therapy, thrombolysis therapy, and inferior vena cava filter placement. Each algorithm’s sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were estimated.
Results: A total of 43.4% of the validation cohort was female, with a mean age of 67.3 years. The diagnosis-based algorithm showed a sensitivity of 90.2% (222/246; 95% confidence interval [CI], 85.8–93.6%), a specificity of 99.8% (228,485/229,027; 95% CI, 99.7–99.8%), a PPV of 29.1% (222/764; 95% CI, 25.9–32.4%), and an NPV of 99.9% (228,485/228,509; 95% CI, 99.9–99.9%) for identifying symptomatic/asymptomatic PE. Additionally, 94.6% (159/168; 95% CI, 90.1–97.5%) of symptomatic PE patients were identified using the diagnosis-based algorithm.
Conclusion: The diagnosis-based algorithm may be a relatively sensitive method for identifying acute PE inpatients in the Japanese DPC database.


INTRODUCTION
Recently, real-world data, including data from electronic health records, medical claims, and other sources, have played an increasing role in observational studies. 1 In many studies that use real-world data, the operational definitions of study elements (eg, inclusion and exclusion criteria, exposures, outcomes, key covariates) are derived from code-based algorithms using structured data elements or the extraction of relevant information from unstructured data, such as physician notes. However, because operational algorithms are usually imperfect, there is concern that the misclassification of study elements might impact the measure of associations and the interpretation of results. 2
The diagnostic procedure combination/per diem payment (DPC/PDPS) system is a case mix-based inclusive fee schedule for inpatient care that was launched in 2002 by the Ministry of Health, Labour and Welfare (MHLW) in Japan. 3 The DPC/PDPS system covered approximately 55% of acute general care beds nationwide in 2014. 4 Hospitals participating in the DPC/PDPS system are obliged to submit "DPC data" to the MHLW, including discharge abstract data and claims information (regardless of reimbursement) in addition to reimbursement claims. Discharge abstract data (referred to as "Format 1"), which are created for each patient per hospitalization, are easy to analyze due to their well-organized structure and can also be utilized for research purposes. Therefore, Japanese DPC data support epidemiological studies or ongoing surveys of low-prevalence diseases, such as venous thromboembolism (VTE), although the data are limited to the in-hospital setting.
A series of questionnaire-based reports showed that pulmonary embolism (PE) occurs relatively less frequently in Japan than in Western countries, [8][9][10][11] but its incidence increased during the 1990s and 2000s. 12 In the late 2000s, several clinical studies based on administrative data were conducted. Kunisawa et al reported that the incidence of postoperative PE was 0.05% using a diagnosis code-based algorithm for discharge abstract data from a DPC database. 13 Additionally, Nagase et al described thromboprophylaxis and the prevalence of PE after lower extremity surgery using another DPC/claims database. 14 However, to the best of our knowledge, the validity of PE patient identification using Japanese administrative data has not yet been well addressed. The purpose of this study was to evaluate the validity of PE patient identification using DPC data.

METHODS

Reference standard and validation design
Due to the low prevalence of PE, we used existing data from the COntemporary ManageMent AND outcomes in patients with Venous ThromboEmbolism (COMMAND VTE) registry 15 as a reference standard. The design of the registry has been reported in detail elsewhere. 15 Briefly, the registry was a physician-initiated, retrospective cohort study of consecutive patients with acute symptomatic VTE objectively confirmed by imaging examination or autopsy in 29 centers in Japan between January 2010 and August 2014. Patients were extracted from hospital databases of imaging test results (contrast-enhanced computed tomography, ultrasound, ventilation-perfusion lung scintigraphy, pulmonary angiography, or contrast venography), and diagnostic information was used as an adjunct. As a result, data from 19,634 patients suspected of having VTE were extracted from hospital databases. Thereafter, the medical charts were manually reviewed, and 3,027 VTE patients were enrolled in the registry. Figure 1A shows a schematic diagram of the registry cohort, which included symptomatic PE patients with/without deep vein thrombosis (DVT) and asymptomatic PE patients with symptomatic DVT but did not include asymptomatic PE patients with asymptomatic DVT.
The flowchart of the COMMAND VTE Registry and the current study is shown in Figure 1B. The study period was from January 2010 to August 2014. The validation cohort (DPC data) included all patients discharged from the six hospitals participating in both the registry and the DPC database. Among the registry data, we selected symptomatic VTE patients from the six hospitals, and the following patients were excluded: those who were treated on an outpatient basis only, those who were diagnosed outside of the research period, and those who were discharged outside of the research period. After DPC data and registry data were linked, the reference standards (true-positive symptomatic/asymptomatic PE patients) were defined, excluding those who were not linked with DPC data and those who were diagnosed with DVT but not diagnosed with PE. Among the reference standards, the AT and TT subgroups were defined as those receiving anticoagulation therapy in the acute phase and those receiving thrombolysis therapy, respectively. Additionally, the AT+IVCf subgroup and the TT+IVCf subgroup were defined as those with inferior vena cava (IVC) filter placement in addition to the corresponding therapy.

Source of DPC data
The DPC data source was the Real World Data (RWD) database, which is maintained by the Health, Clinic, and Education Information Evaluation Institute (HCEI; Kyoto, Japan) with support from the Real World Data Co., Ltd. (Kyoto, Japan). 16 This database contains the records of ∼20 million patients from ∼160 medical institutions across Japan as of 2020. The stored information includes DPC data from inpatients, administrative claims data, and laboratory results from both outpatient and inpatient services. The DPC data consist of discharge abstract data (Format 1 file) and claims information (EF file). Format 1 files include the following data: patient demographics; diagnoses, comorbidities at admission, and complications after admission recorded by using International Classification of Diseases, Tenth Revision (ICD-10) codes and text data in Japanese; selected clinical information; admission and discharge statuses; surgeries and procedures with the original Japanese codes (K codes); and special reimbursements. EF files include the following data (regardless of reimbursement): drug administration data, medical device use data, and service data.

Identifying PE patients in DPC data
Diagnosis information was derived from Format 1, and medication and procedure information was derived from the EF files. Five algorithms were examined in the current study: 1) diagnosis (D); 2) diagnosis and anticoagulation therapy (D+AT); 3) diagnosis, anticoagulation therapy, and inferior vena cava filter placement (D+AT+IVCf); 4) diagnosis and thrombolytic therapy (D+TT); and 5) diagnosis, thrombolytic therapy, and IVCf (D+TT+IVCf). eTable 1 presents the algorithm elements and the definitions. The diagnosis algorithm was developed to represent acute PE caused by thrombi, fat, and air.
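The five algorithms can be read as logical conjunctions of per-discharge flags. The following is a minimal sketch under that reading, not the authors' SAS implementation. The class `Discharge` and its field names are hypothetical, and how each flag is derived from Format 1 and EF records (the definitions in eTable 1) is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discharge:
    """Per-discharge flags (hypothetical field names, assumed precomputed)."""
    has_pe_diagnosis: bool     # D: PE diagnosis recorded in Format 1
    has_anticoagulation: bool  # AT: anticoagulation therapy found in the EF file
    has_thrombolysis: bool     # TT: thrombolytic therapy found in the EF file
    has_ivc_filter: bool       # IVCf: IVC filter placement found in the EF file


# Each algorithm is positive only when all of its elements are present.
ALGORITHMS = {
    "D": ("has_pe_diagnosis",),
    "D+AT": ("has_pe_diagnosis", "has_anticoagulation"),
    "D+AT+IVCf": ("has_pe_diagnosis", "has_anticoagulation", "has_ivc_filter"),
    "D+TT": ("has_pe_diagnosis", "has_thrombolysis"),
    "D+TT+IVCf": ("has_pe_diagnosis", "has_thrombolysis", "has_ivc_filter"),
}


def positive_algorithms(discharge: Discharge) -> list[str]:
    """Return the names of all algorithms that flag this discharge as PE."""
    return [
        name
        for name, fields in ALGORITHMS.items()
        if all(getattr(discharge, field) for field in fields)
    ]


example = Discharge(
    has_pe_diagnosis=True,
    has_anticoagulation=True,
    has_thrombolysis=False,
    has_ivc_filter=True,
)
print(positive_algorithms(example))
```

Because every algorithm requires the diagnosis element, the set of discharges flagged by (D) contains the sets flagged by the other four, which is why adding elements can only lower sensitivity.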

Data linkage
We were provided with COMMAND VTE Registry data and DPC data, with research IDs specific to the registry and other research IDs specific to the RWD database, respectively. Therefore, we merged the IDs from the two databases in the following two steps: 1) the correspondence table relating the hash value (generated from the hospital-specific patient ID) and the registry research ID was generated for each of the six institutions and provided to RWD Co.; and 2) the correspondence table relating the registry research ID and the RWD research ID was generated at RWD Co. and provided to us. The registry data and DPC data were linked at the episode/admission level deterministically using the patient ID and the timing of the PE event. With respect to timing, an episode was considered a match if the diagnosis date in the registry fell between 6 days before the date of admission and 1 day after the date of discharge in the DPC data. To assess the quality of the linkage, the following analyses were performed: 1) the linkage proportion was calculated, 2) the agreement ratio of sex and year of birth was reported, 3) the characteristics of the episodes among those whose sex or year of birth disagreed were compared, and 4) the characteristics of symptomatic VTE patients and the reference standard were described. 17
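The deterministic linkage rule can be sketched as follows. This is an illustration only: the function names and the tuple layouts are hypothetical, and the sketch assumes that both ends of the date window are inclusive, which the text does not state explicitly.

```python
from datetime import date, timedelta


def is_linked(registry_dx_date, admission_date, discharge_date,
              days_before=6, days_after=1):
    """True if the registry diagnosis date falls inside the admission window.

    The window runs from `days_before` days before admission to `days_after`
    days after discharge; inclusive bounds are an assumption of this sketch.
    """
    window_start = admission_date - timedelta(days=days_before)
    window_end = discharge_date + timedelta(days=days_after)
    return window_start <= registry_dx_date <= window_end


def link_episodes(registry_rows, dpc_rows):
    """Link registry episodes to DPC admissions on patient ID and timing.

    registry_rows: list of (patient_id, diagnosis_date)
    dpc_rows:      list of (patient_id, admission_date, discharge_date)
    Returns a list of (registry_index, dpc_index) pairs.
    """
    admissions_by_patient = {}
    for j, (patient_id, admitted, discharged) in enumerate(dpc_rows):
        admissions_by_patient.setdefault(patient_id, []).append(
            (j, admitted, discharged))

    links = []
    for i, (patient_id, dx_date) in enumerate(registry_rows):
        for j, admitted, discharged in admissions_by_patient.get(patient_id, []):
            if is_linked(dx_date, admitted, discharged):
                links.append((i, j))
    return links


# Made-up records for illustration only.
registry = [("p1", date(2012, 3, 1)), ("p2", date(2012, 5, 20))]
dpc = [
    ("p1", date(2012, 3, 5), date(2012, 3, 20)),   # diagnosis 4 days before admission
    ("p1", date(2013, 1, 1), date(2013, 1, 10)),   # unrelated later admission
    ("p2", date(2012, 5, 1), date(2012, 5, 18)),   # diagnosis 2 days after discharge
]
print(link_episodes(registry, dpc))
```

The linkage proportion reported in the study corresponds to the share of registry episodes that obtain a link under this rule.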

Main analysis and stratified analyses
We used descriptive statistics to describe the validation cohort, symptomatic VTE patients, and true-positive PE episodes. The prevalence of acute PE was defined as the proportion of true-positive PE episodes in the validation cohort.
We estimated the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) with 95% confidence intervals (CIs) for binomial distributions using the exact method. We then stratified these estimates by age, sex, and setting (medical or surgical admission) to assess whether these factors significantly affected the study estimates. Surgical admissions were defined as those whose Format 1 was recorded with K codes excluding transfusion codes, pulmonary thromboendarterectomy codes (K codes: K92x, K592, K593), and any K codes under local anesthesia. For stratified analyses, the sensitivity of algorithm (D) was estimated by the following characteristics of PE episodes based on the registry data: 1) symptomatic or asymptomatic, 2) hospital-acquired or community-acquired, 3) severity classification, and 4) risk of recurrent VTE.
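As an illustration of these estimates, the sketch below computes the four validity measures and the prevalence from a two-by-two table, with exact binomial CIs. It assumes that "the exact method" refers to the Clopper-Pearson interval and solves for the limits by bisection on the binomial tail probability; the study itself used SAS, and all function names here are hypothetical. The counts in the usage example are those reported for algorithm (D).

```python
import math


def _log_pmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def _tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing whichever tail is shorter."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    if n - k + 1 <= k:
        return sum(math.exp(_log_pmf(i, n, p)) for i in range(k, n + 1))
    return 1.0 - sum(math.exp(_log_pmf(i, n, p)) for i in range(k))


def _solve(k, n, target):
    """Bisection for p with P(X >= k) = target (the tail increases with p)."""
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if _tail_ge(k, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    lower = 0.0 if k == 0 else _solve(k, n, alpha / 2)
    upper = 1.0 if k == n else _solve(k + 1, n, 1 - alpha / 2)
    return lower, upper


def validity_metrics(tp, fn, fp, tn):
    """Map each measure to (estimate, lower limit, upper limit)."""
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    return {name: (k / n, *exact_ci(k, n)) for name, (k, n) in pairs.items()}


def prevalence_per_10000(tp, fn, fp, tn):
    """True-positive reference episodes per 10,000 discharges."""
    return 10000 * (tp + fn) / (tp + fn + fp + tn)


# Counts reported for algorithm (D): TP = 222, FN = 24, FP = 542, TN = 228,485.
metrics = validity_metrics(tp=222, fn=24, fp=542, tn=228485)
for name, (estimate, lower, upper) in metrics.items():
    print(f"{name}: {estimate:.3%} ({lower:.3%} to {upper:.3%})")
print(f"prevalence: {prevalence_per_10000(222, 24, 542, 228485):.1f} per 10,000")
```

Summing the shorter tail keeps the calculation fast even when the denominator is in the hundreds of thousands, as it is for specificity and NPV here.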

Subgroup analyses and other analyses
Subgroup analyses were performed to evaluate the performance of the algorithm in identifying PE patients receiving AT, those receiving AT+IVCf, those receiving TT, and those receiving TT+IVCf. Only those who received the corresponding therapy were defined as subgroup references.
The following analyses were performed on true-positive symptomatic/asymptomatic PE episodes: 1) the position of PE diagnosis in Format 1 was described; and 2) comparisons of the treatment between DPC data and registry data were described using two-by-two tables and simple kappa statistics.
All data handling and statistical analyses were performed using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA). This study was approved by the Kyoto University Graduate School and the Faculty of Medicine Ethics Committee (Kyoto, Japan, R1484). The requirement for written informed consent was waived because of the retrospective design using previously anonymized patient records. The information about this study was publicly disclosed via a webpage of Kyoto University Hospital, including an opportunity to opt out of participation. 18

RESULTS

Reference standard
The flowchart of the current study is shown in Figure 2. Of the 3,027 patients enrolled in the COMMAND VTE Registry, 692 patients (from six hospitals) were included. After we excluded patients who met the exclusion criteria, 375 of 379 (98.9%) symptomatic VTE episodes/discharges were successfully linked with DPC data. The agreement of both sex and year of birth was 98.7% (370/375). Five discharges with disagreement in sex or age were considered input errors in the registry or DPC data because the characteristics of the episodes/discharges were similar (data not shown). The characteristics of the symptomatic VTE patients were similar to those of the entire registry (shown in eTable 2). 15 Among the symptomatic VTE episodes, 246 episodes/discharges were identified as the reference standard. Table 1 shows the prevalence of acute PE in each hospital. The combined average PE prevalence for the six hospitals was 7.3 and 10.7 per 10,000 discharges for symptomatic PE and symptomatic/asymptomatic PE, respectively. The prevalence of PE did not include cases of asymptomatic PE with asymptomatic DVT, which were not eligible for the registry. The characteristics of the true-positive PE patients (reference standard) are described in eTable 3. The proportions of symptomatic PE patients (with or without DVT) were 168/246 (68.3%) and 148/223 (66.4%) among the reference standard and AT subgroup reference, respectively. Among the TT subgroup reference, 44/49 (89.8%) had symptomatic PE.

DPC data (validation cohort)
Of the 289,979 cumulative discharges included, 60,706 were excluded following the exclusion criteria. The characteristics of

Main results
The sensitivity, PPV, NPV, and prevalence of each algorithm are shown in Table 2. The algorithm (D+AT) resulted in a slightly higher PPV of 37.9% (212/560; 95% CI, 33.8-42.0%) compared to that of the algorithm (D). The algorithm (D+TT) showed a much higher PPV of 65.7% (44/67; 95% CI, 53.1-76.8%). The sensitivities of algorithm (D+AT+IVCf) and algorithm (D+TT+IVCf) were significantly lower; on the other hand, the PPVs tended to be higher compared to those of algorithm (D+AT) and algorithm (D+TT), respectively.
Table 3 shows the results of the stratified analyses. A total of 94.6% (159/168; 95% CI, 90.1-97.5%) of symptomatic PE episodes were identified using the algorithm (D). Among the nine false-negative symptomatic PE episodes, five had DVT diagnoses or codes for IVC filter placements, one had a suspected PE diagnosis (this patient presented with cardiac arrest and died on the third hospital day), one had a chronic PE diagnosis, and the other two had no diagnoses or codes relating to VTE.
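A stratified sensitivity is simply the sensitivity computed within each stratum of the reference standard. The sketch below uses the symptomatic counts reported above; the asymptomatic counts are not quoted in the text and are derived here by subtraction from the overall figures (222/246), on the assumption that every reference episode is either symptomatic or asymptomatic. The function name is hypothetical.

```python
def stratified_sensitivity(strata):
    """Map each stratum name to true positives / reference-standard positives."""
    return {name: tp / ref for name, (tp, ref) in strata.items()}


strata = {
    "symptomatic": (159, 168),                # reported in the text
    "asymptomatic": (222 - 159, 246 - 168),   # derived by subtraction (assumption)
}

for name, value in stratified_sensitivity(strata).items():
    print(f"{name}: {value:.1%}")
```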
The position of PE diagnoses in Format 1 among true-positive PE patients is described in Table 5. In 10 of 28 patients with hospital-acquired PE who were diagnosed within 2 months of surgery, PE diagnoses were recorded as comorbidities at the time of admission. eTable 5 presents comparisons of treatment in DPC data and registry data among the true-positive PEs. The agreement proportions for "UFH or fondaparinux" and warfarin were 89.0% (216/219) and 87.3% (215/246), respectively; however, the kappa statistics were 0.15 and 0.39, respectively. The kappa statistics of thrombolysis therapy and IVC filter placement were 0.88 and 0.91, respectively.
Table footnotes: AT was defined as the administration of unfractionated heparin or fondaparinux and warfarin during hospitalization. TT was defined as the administration of monteplase or urokinase during hospitalization. a The prevalence of true-positive cases of symptomatic/asymptomatic PE per 10,000 discharges.

DISCUSSION
The diagnosis information in Format 1 in the DPC database enabled us to detect more than 90% of the acute PE inpatients in our validation cohort. The validity was fairly consistent across age, sex, and medical/surgical setting groups. The diagnosis-based algorithm also had high sensitivity in identifying subgroups of acute PE inpatients receiving anticoagulation therapy or thrombolysis therapy. This is the first report on the sensitivity of PE patient identification using Japanese administrative data.
The validity of PE patient identification using Japanese administrative data, especially sensitivity, is difficult to estimate because of the low disease prevalence. 12 To overcome this challenge, an existing registry was used as a reference standard in our study. All discharges in DPC data were tested using the index algorithm; not all of them were tested with the reference standard because only "all possible patients" were manually reviewed in the process of registry inclusion. This is a modified stratification design technique that is applicable for diagnostic accuracy studies in low-prevalence situations. 19 In the current study, this technique was applied on the assumption that there were no true-positive patients other than those listed as "all possible patients". Several studies regarding the accuracy of multiple sclerosis identification used this technique, reporting that a random sample would not have yielded enough cases due to the low prevalence. 20,21
Table footnotes: AT was defined as the administration of unfractionated heparin or fondaparinux and warfarin during hospitalization. The TT subgroup comprised symptomatic/asymptomatic PE patients receiving TT. The TT + IVCf subgroup comprised PE patients receiving TT and IVCf. TT was defined as the administration of monteplase or urokinase during hospitalization. a The prevalence of true-positive cases per 10,000 discharges.
The PPV of algorithm (D) was lower than 30%, and adding the treatment element to the algorithm had a marginal effect on the PPV. However, our PPV estimates should be considered underestimates. This is because asymptomatic PE patients who were detected by DPC data but were not eligible for inclusion in the registry were counted as false positives, which decreased the PPV. In support of this interpretation, the higher proportion of symptomatic patients among the TT subgroup reference resulted in a higher PPV in algorithm (D+TT) than in algorithm (D).
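The direction of this bias can be made concrete with a small sensitivity calculation. In the sketch below, some of the algorithm-positive discharges counted as false positives are assumed to be true PE cases that were not eligible for the registry; reclassifying them moves them from the false-positive to the true-positive count while the denominator stays fixed. The function name is hypothetical, and the figure of 100 unregistered cases is a made-up value for illustration, not a study result.

```python
def ppv_with_unregistered_cases(tp, fp, unregistered_true_cases):
    """Return (observed PPV, PPV after reclassifying unregistered true cases).

    `unregistered_true_cases` is the assumed number of algorithm-positive
    discharges that were counted as false positives but truly had PE.
    """
    if not 0 <= unregistered_true_cases <= fp:
        raise ValueError("unregistered_true_cases must lie between 0 and fp")
    observed = tp / (tp + fp)
    corrected = (tp + unregistered_true_cases) / (tp + fp)
    return observed, corrected


# Reported counts for algorithm (D): TP = 222, FP = 542.
# The value 100 is hypothetical and only illustrates the direction of the bias.
observed, corrected = ppv_with_unregistered_cases(222, 542, 100)
print(f"observed PPV: {observed:.1%}, corrected PPV: {corrected:.1%}")
```

For any nonnegative number of unregistered true cases, the corrected PPV is at least as large as the observed PPV, which is why the reported value is best read as a lower bound.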
The addition of IVC filter placement to algorithm (D+AT) and algorithm (D+TT) tended to increase the PPVs. Moreover, the sensitivities for the AT+IVCf and TT+IVCf subgroups remained high. Therefore, these algorithms would be useful when a researcher is interested in PE patients undergoing IVC filter placement.
Differentiating between hospital-acquired and community-acquired onset is important for future clinical studies. In stratified analyses, the sensitivity for hospital-acquired PE tended to be lower than that for community-acquired PE (Table 3). However, it was not possible to distinguish between patients with and without onset during hospitalization. This is because hospital-acquired PE patients in the registry included those who developed PE once discharged from the hospital after surgery or those who developed it during treatment (eg, chemotherapy for cancer) in the outpatient department of that hospital. This can be inferred from the fact that some patients with hospital-acquired PE had the diagnosis of "trigger-for-hospitalization condition" or "comorbidities at the time of admission" (Table 5).
We compared the treatment for true-positive PE episodes between DPC data and registry data. For thrombolysis therapy and IVC filter placement, the records showed excellent reproducibility, which is consistent with the findings of previous studies. 22 However, for anticoagulation therapy, the reproducibility was poor, although the agreement ratio was high. One possible reason for this is that we defined one or more prescriptions as the treatment in our study. Further research is needed to create a more valid algorithm for anticoagulation therapy for PE patients.
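High raw agreement combined with a low kappa is expected when nearly all patients receive the treatment, because most of the observed agreement is then attributable to chance. The sketch below computes both quantities from a two-by-two table using the standard formula for the simple (Cohen's) kappa; the function name is hypothetical and the counts are invented for illustration, not taken from eTable 5.

```python
def agreement_and_kappa(both_yes, dpc_only, registry_only, both_no):
    """Return (observed agreement, Cohen's kappa) for a two-by-two table."""
    n = both_yes + dpc_only + registry_only + both_no
    observed = (both_yes + both_no) / n
    # Agreement expected by chance, from the marginal proportions.
    p_yes = ((both_yes + dpc_only) / n) * ((both_yes + registry_only) / n)
    p_no = ((registry_only + both_no) / n) * ((dpc_only + both_no) / n)
    expected = p_yes + p_no
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this case
    return observed, (observed - expected) / (1 - expected)


# Invented counts: 95% of patients are treated according to each source.
observed, kappa = agreement_and_kappa(
    both_yes=90, dpc_only=5, registry_only=5, both_no=0)
print(f"agreement: {observed:.1%}, kappa: {kappa:.3f}")
```

In this invented table the two sources agree on 90% of patients, yet kappa is slightly negative because chance alone would produce 90.5% agreement.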
In the design of epidemiological studies using databases, the use of an algorithm with high sensitivity is important when the goal is to identify all persons with certain characteristics. 23 Our findings, which showed that the algorithms were highly sensitive, suggest that PE identification using DPC data could be a powerful screening tool for surveillance studies with confirmation via manual chart review. When designing an outcome study, the use of an algorithm with a high PPV is important in the development of an appropriate cohort, and equivalent sensitivity among groups is needed for the precise estimation of relative risks. 23 Therefore, manual chart review after screening using DPC data enables researchers to develop a cohort without false-positive participants. In both design situations, highly sensitive algorithms are important for the generalizability of results because less sensitive algorithms may be differentially sensitive to different disease characteristics. 23
This is the first study to evaluate the validity of PE identification algorithms in Japanese administrative data using registry data as the reference standard. Furthermore, the strength of this study is that it estimated sensitivity, which is difficult to measure due to the low frequency of PE. Therefore, our results would be helpful for future clinical studies using Japanese DPC data for acute hospital inpatients, even when only the PPV is validated. In addition, the diagnosis algorithm could be applied to research using claims data among hospitals participating in the DPC/PDPS system because claims data include the "SB records", which are identical to diagnosis information in Format 1 of DPC data.
Our study had several limitations. First, true-positive PE patients were possibly missed because of the assumption that there were no true-positive patients other than those listed as "all possible patients". However, data from 19,634 "possible patients" were extracted through screening for imaging examination results and clinical diagnosis information. We believe that this was the best available method to determine the presence of the target condition. Therefore, the PE patients confirmed by the chart review for "all possible patients" in the registry were justified as a reference standard.
Second, the results of the current study have limited generalizability because only patients treated at six institutions were included. This is also because the institutions participating in the COMMAND VTE Registry may have better disease recording or clinical practice patterns. Ideally, a validation study should be conducted for each study population because the performance of an algorithm is dependent on various factors, such as the data source, study population, and choice of the reference standard. 2 However, it seems infeasible for a manual chart review to measure a reference standard because the prevalence of PE in Japan is very low. Although our validation cohort was limited to patients in six institutions, the baseline characteristics of our patients were consistent with those of patients in general acute hospitals in a report by the MHLW. 24 In addition, the prevalence and characteristics of true-positive PE patients in the current study were consistent with those of previous reports. 12,13,25,26
Third, the PPV was underestimated because asymptomatic PE patients without VTE were not eligible for inclusion in the registry. The PPV is easy to measure even for diseases with a low prevalence because only algorithm-positive patients are included in the reference standard test (eg, manual chart review). Therefore, for each study in the future, the PPV should be evaluated using a validation cohort that replicates the target population.
Fourth, direct oral anticoagulants were not yet approved during the period of this study. Because some of the direct oral anticoagulants are used for initial therapy for PE patients except among critically ill patients, 27 the definition of anticoagulation therapy needs to be modified for the period when they became clinically available.
In conclusion, PE diagnostic codes obtained from Japanese DPC data may provide a relatively sensitive method to identify inpatients with acute PE, especially symptomatic patients. Highly sensitive algorithms can be useful screening tools for surveillance studies. Additionally, these algorithms may extract appropriate PE cohorts with high generalizability when combined with confirmation using manual chart review. However, the PPV should be evaluated as a part of future individual research because it was underestimated in the current study.
The data are automatically extracted from

Table 1. The prevalence of true-positive PE in each hospital
a The prevalence of true-positive cases of PE per 10,000 discharges.

Table 2. The sensitivity, specificity, positive predictive value, and negative predictive value of each algorithm referencing the symptomatic/asymptomatic PE patients in the COMMAND VTE registry
AT, anticoagulation therapy; CI, confidence interval; IVCf, inferior vena cava filter placement; PE, pulmonary embolism; TT, thrombolysis therapy; VTE, venous thromboembolism.

Table 3. The sensitivity of the diagnosis-based algorithm by patient characteristic based on the registry (stratified analyses)

Table 4. The algorithm validation results for PE patient subgroups
AT, anticoagulation therapy; CI, confidence interval; IVCf, inferior vena cava filter placement; TT, thrombolysis therapy. The AT subgroup comprised symptomatic/asymptomatic PE patients receiving AT. The AT + IVCf subgroup comprised PE patients receiving AT and IVCf.

Table 5. The position of PE diagnosis in discharge abstract data (Format 1 file) among true-positive symptomatic/asymptomatic PE patients (246 patients)
Hospital-acquired PE episode that was diagnosed within 2 months of surgery.