Data Resource Profile of Shizuoka Kokuho Database (SKDB) Using Integrated Health- and Care-insurance Claims and Health Checkups: The Shizuoka Study

Background Analyzing real-world data, including health insurance claims, may help provide insights into preventing and treating various diseases. We developed a database covering Shizuoka Prefecture (Shizuoka Kokuho Database [SKDB]) in Japan, which included individual-level linked data on health- and care-insurance claims and health checkup results. Methods Anonymized claims data on health insurance (National Health Insurance [age <75 years] and Latter-Stage Elderly Medical Care System [age ≥75 years]), care insurance, subscriber lists, annual health checkups, and all dates of death were collected from 35 municipalities in Shizuoka Prefecture. To efficiently link claims and health checkups, unique individual IDs were assigned using a novel procedure. Results From April 2012 to September 2018, the SKDB included 2,230,848 individuals (men, 1,019,687; 45.7%). The median age (min–max) of men and women was 60 (0–106) and 62 (0–111) years, respectively. During the study period, the median subscription time was 4.4 years; 40.8% of individuals continuously subscribed for the 6.5 years; 213,566 individuals died. Health checkup data were available for 654,035 individuals, amounting to 2,469,648 records. Care-service recipient data were available for 283,537 individuals; they used care insurance to pay for care costs. Conclusion SKDB, a population-based longitudinal cohort, provides a comprehensive dataset covering health checkups, disorders, medication, and care service. This database may provide a robust platform to identify epidemiological problems and generate hypotheses for preventing and treating disorders in the elderly.


INTRODUCTION
Advances in therapeutic techniques and medical technologies have contributed to health promotion and longevity. Clinical trials mostly evaluate the efficacy of new medicines, device candidates, and therapeutic regimens. However, a single trial can usually test just one hypothesis, even though a trial demands considerable time and cost. Thus, in medical heuristics, there are many potentially efficient therapies and types of care that remain unconfirmed.
Recently, individual-level real-world data, including health insurance data, have become available for medical research. [1][2][3][4] There are residual biases and strong assumptions with the mathematical modeling of findings obtained from health insurance data 5 ; however, analysis using health insurance data may be valuable in hypothesis generation and confirmation of the effectiveness of medical heuristics. 6 In Japan, the Ministry of Health, Labour and Welfare has developed and operated the National Database of Health Insurance Claims and Specific Health Checkups of Japan (NDB) since 2008, which has been available for academic research since 2011. The NDB includes information on disease and disorder names, medical therapy details, frequency and dosage of prescribed medicines, and treatment costs. [7][8][9][10] However, the NDB was not linked to the long-term care insurance (LTCI) database until October 1, 2020. Further, the NDB includes only individuals who used insurance to pay for medical care and underwent health checkups; the incidence and prevalence of the disease may have been incorrectly estimated owing to the lack of an identifier for all residents.
To improve the usefulness of the database, we developed a new longitudinal cohort from the Shizuoka Kokuho Database (SKDB); we did so using a unique procedure to connect individuals and remove overlaps among scattered data. The SKDB consists of data of Shizuoka Prefecture residents insured under National Health Insurance (NHI; for subjects under age 75 years) and the Latter-Stage Elderly Medical Care System (LSEMCS; age 75 years or older). The present study includes individuals who used insurance to pay for medical care, as well as all individuals insured under NHI and LSEMCS; it includes the claims of LTCI and the results of health checkups. This is a collaborative project by academia and local government. The aim of the Shizuoka Study is to clarify current and future epidemiological problems in the prefecture and identify solutions. We present here the characteristics of the SKDB.

METHODS
Shizuoka Kokuho Database
Shizuoka Prefecture is located in central Japan and had 3.7 million inhabitants as of 2015 (census, 11 Figure 1). We obtained the Kokuho Database on Shizuoka Prefecture residents from the Federation of National Health Insurance Organizations (FNHIO); it was named the SKDB. The Kokuho Database includes the monthly claims data of health insurance for NHI and LSEMCS; it also contains the results of annual health checkups and daily care-service data from LTCI for NHI and LSEMCS subscribers. All data are linked by an identifier of the Kokuho Database (KDBID). The SKDB allowed us to keep track of all subscribers, including individuals who did not use insurance to pay for medical care. Japan's health- and care-insurance system and health checkups are explained in eMaterials 1 and eTable 1.
Variables included in the SKDB appear in Table 1. The subscriber lists included age, sex, postal code, observation period, reason for withdrawal, and death dates over the study period. The available information from the health insurance data was the patient's disease, treatment, the care given to both outpatients and hospitalized patients, and the corresponding cost. The care-insurance data included support and care level, as well as information about care services provided for insured individuals. We also obtained the results of health checkups, which included questionnaire responses and results from laboratory examinations.

Determination of unique identifiers
To analyze the SKDB, we had to prepare a unique identifier for each individual in the database, so we performed the following three procedures (eFigure 1). Problems related to the KDBID are explained in eMaterials 2. First, a new ID automatically replaces the KDBID at the time of subscribing to the LSEMCS at age 75 years; correspondence tables linking KDBIDs given for the same individuals were provided by the Shizuoka FNHIO. These lists were applied to the subscriber lists; KDBIDs were renamed as Unified KDBIDs (UKDBIDs).
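The first step, collapsing the KDBIDs issued to one person into a single identifier, can be illustrated with a short sketch. It is not the procedure used to build the SKDB: the function name `build_unified_ids`, the pair format of the correspondence table, and the choice of the lexicographically smallest ID as the representative are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def build_unified_ids(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every KDBID to one representative ID (a stand-in for the UKDBID).

    `pairs` is a hypothetical correspondence table: each entry links two
    KDBIDs that belong to the same individual (e.g. before and after
    enrolment in the LSEMCS at age 75).
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for old_id, new_id in pairs:
        root_old, root_new = find(old_id), find(new_id)
        if root_old != root_new:
            # Assumption: keep the smaller ID as representative so the
            # result is deterministic; any stable rule would do.
            low, high = sorted((root_old, root_new))
            parent[high] = low

    return {kdbid: find(kdbid) for kdbid in list(parent)}
```

Chained links are handled transitively, so a table containing A1 to B7 and B7 to C3 assigns all three IDs to one person.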
Second, we excluded the following dispensable records and IDs: (1) working records generated for the insurers' administrative purposes; (2) UKDBIDs where the sex and age information were inconsistent within the subscriber list; (3) UKDBIDs that lacked the postal code within the subscriber list (these are perhaps UKDBIDs of people insured with care insurance but lacking health insurance); (4) UKDBIDs for the
Third, if there were multiple UKDBIDs with the same sex, birth date, and postal code, we selected only one, prioritizing UKDBIDs with a longer observational period and with death and health checkup records. As a result, the same person does not appear as multiple individuals in the SKDB.
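The third step can be sketched as follows. The record layout (`Subscriber`) and the exact ranking are assumptions: the text states only that a longer observational period, deaths, and health checkups were prioritized, so the tuple ordering below is one plausible reading, not the documented rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Subscriber:
    # Hypothetical record layout for one UKDBID in the subscriber list.
    ukdbid: str
    sex: str
    birth_date: date
    postal_code: str
    observed_days: int
    has_death_record: bool
    n_checkups: int


def _rank(rec: Subscriber) -> Tuple[int, bool, int]:
    # Assumed priority: longer observation, then death record, then checkups.
    return (rec.observed_days, rec.has_death_record, rec.n_checkups)


def select_unique(records: List[Subscriber]) -> List[Subscriber]:
    """Keep one UKDBID per (sex, birth date, postal code) combination."""
    best: Dict[Tuple[str, date, str], Subscriber] = {}
    for rec in records:
        key = (rec.sex, rec.birth_date, rec.postal_code)
        kept = best.get(key)
        if kept is None or _rank(rec) > _rank(kept):
            best[key] = rec
    return list(best.values())
```

Because the key uses only sex, birth date, and postal code, two different people who share all three would be merged; this is the same trade-off noted under Limitations.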

Determining outliers in health checkup data
To deal with outliers for laboratory data in the health checkups, we flagged values that met outlier criteria (eTable 2). 12 When the values of height and weight or values of systolic blood pressure and diastolic blood pressure were the same, we flagged both values as outliers. In cases with three or more values, we flagged as outliers values that were ≥99.7 percentile for the standard deviation (SD) of all values (total SD) and the proportion (SD for the remaining values excluding one/total SD) was <40%, as well as values that were ≥99.9 percentile for the total SD. Those potential outliers will be reported to the individual who will conduct the statistical analysis. We also excluded multiple records (ie, records having the same UKDBID on the same measurement day).
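Two of the simpler rules above, flagging identical paired values and excluding same-day multiple records, might be sketched as below. The field names are hypothetical, and the sketch reads "excluded multiple records" as dropping every record in a duplicated group; keeping one record per group would be an equally defensible reading. The percentile-based SD rule is not covered here.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical field names; pairs that should never hold identical values.
PAIRED_FIELDS = (("height_cm", "weight_kg"), ("sbp_mmhg", "dbp_mmhg"))


def flag_identical_pairs(record: Dict[str, object]) -> Set[str]:
    """Return the fields to flag as outliers in one checkup record."""
    flagged: Set[str] = set()
    for first, second in PAIRED_FIELDS:
        a, b = record.get(first), record.get(second)
        if a is not None and a == b:
            flagged.update((first, second))  # flag both values
    return flagged


def exclude_same_day_records(
    records: List[Dict[str, object]]
) -> List[Dict[str, object]]:
    """Drop records sharing a UKDBID and measurement day with another record."""
    counts = Counter((r["ukdbid"], r["measured_on"]) for r in records)
    return [r for r in records if counts[(r["ukdbid"], r["measured_on"])] == 1]
```

Flags are returned rather than applied so that, as in the text, the analyst can decide how to treat potential outliers.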

Statistical analysis
We analyzed the SKDB as a population-based cohort study. We prepared a longitudinal dataset based on information related to monthly claims, annual health checkups, and daily care services. In this analysis, the start date of the follow-up period was defined as the insurance registration date or April 1, 2012, whichever came later; the end date was the date of insurance withdrawal or September 30, 2018, whichever came first.
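As an illustration, follow-up can be confined to the study window by clipping each subscription to it. The sketch assumes the window runs from April 1, 2012 through September 30, 2018 (the last day of that month); the function names and the 365.25-day year are illustrative choices, not taken from the study.

```python
from datetime import date
from typing import Optional, Tuple

STUDY_START = date(2012, 4, 1)
STUDY_END = date(2018, 9, 30)  # assumed last day of the study period


def follow_up_window(
    registered: date, withdrawn: Optional[date]
) -> Optional[Tuple[date, date]]:
    """Clip one subscription to the study window.

    Returns None when the subscription lies entirely outside the window.
    `withdrawn` is None for individuals still insured at the end.
    """
    start = max(registered, STUDY_START)
    end = STUDY_END if withdrawn is None else min(withdrawn, STUDY_END)
    if end < start:
        return None
    return start, end


def person_years(start: date, end: date) -> float:
    """Length of follow-up in years, counting both end points."""
    return ((end - start).days + 1) / 365.25
```

Under these assumptions, an individual insured throughout the window contributes about 6.5 person-years.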
Continuous variables are presented as mean and SD or median and range; categorical variables are presented as frequency and percentage. The individuals were classified into age groups: ≤4, We counted the numbers of cases with only NHI, both NHI and LSEMCS, and only LSEMCS. We estimated sex- and age-group-specific survival rates using the Kaplan-Meier method for the whole follow-up period. We treated withdrawals from health insurance as losses to follow-up. For our analyses, we used the age as of the initial date for the follow-up period.
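For readers less familiar with the Kaplan-Meier method, a minimal product-limit estimator is sketched below. The analyses in this study were run in SAS; this Python version is only an illustration, with withdrawals entered as censored observations (`False` in `events`) and individuals censored at a given time counted as still at risk at that time.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time.

    times  : follow-up time for each individual
    events : True if the individual died at that time, False if censored
             (e.g. withdrawal from health insurance)
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []

    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve
```

Running the function separately on each sex and age-group subset would give stratified curves of the kind described above.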
We calculated the coverage rates in the SKDB against public statistics for insured individuals (on March 31, 2015, obtained from Shizuoka FNHIO) and residents (according to the census of 2015 11 ) among subgroups for the insured period of fiscal 2015 (April 1, 2015 to March 31, 2016). We also determined the coverage rates for health-insurance users (with NHI and LSEMCS), care-service users (with LTCI), and individuals undergoing health checkups against the SKDB data for fiscal 2015. For these analyses, we used the age as of April 1, 2015.
We summarized the support and care levels of care insurance, using an initial certified level during fiscal 2015 for each case. We also calculated the coverage rates among care insurance users in the SKDB against residents (census of 2015 11 ) among subgroups for the insured period of fiscal 2015 (April 1, 2015 to March 31, 2016).
We summarized the results of health checkups using initial results during the follow-up period for each case. We performed a summary of the results of health checkups after deleting the measurements with outlier flags.
We conducted statistical analyses using SAS version 9.4 (SAS Institute, Cary, NC, USA).

Data disclosure
The SKDB is not accessible to the public. Researchers at certain medical institutes, such as Shizuoka General Hospital, are allowed access to the dataset for medical research following approval by the ethics committee of Shizuoka General Hospital.
Owing to a contract made with Shizuoka Prefecture, local municipalities, and Shizuoka FNHIO, the SKDB can currently be accessed only by our coresearchers. From April 2021, it will be necessary to collaborate with a full-time faculty of the Shizuoka School of Public Health to access the SKDB.

Ethical considerations
This study conforms to the Ethical Principles for Medical Research Involving Human Subjects issued by the Ministry of Health, Labour and Welfare and the Ministry of Education, Culture, Sports, Science, and Technology in Japan. We also obtained approval from each municipality review board in Shizuoka Prefecture for using the data. The Ethics Committee of Shizuoka General Hospital approved the whole research project (SGHIRB#2018058, 2018); that committee will also review individual plans to undertake additional research using the data. Information related to this research has been disclosed on the Web sites of the FNHIO in Shizuoka Prefecture, Shizuoka Prefectural Government Office, and Shizuoka General Hospital (eTable 3). Following approval by the review committees and the information disclosure, each person's information was anonymized and sent from the Shizuoka FNHIO to the Research Support Center of Shizuoka General Hospital for analysis.

RESULTS
The initial subscriber list, which was provided by the Shizuoka FNHIO, included 4,499,614 KDBIDs. After applying the procedure for identifying unique IDs (eFigure 1), 2,230,848 individuals (men, 1,019,687; 45.7%) with median person-years of 4.93 (range, 0.005-6.50) years, were included in the SKDB; 910,365 cases (40.8%) continuously subscribed to health insurance over the 6.5-year period. The median age at the initial date of the study period in men and women was, respectively, 60 (range, 0-106) and 62 (range, 0-111) years, and the frequencies of the sex and age-groups classified by the year of the initial date of follow-up period appear in eTable 4. The numbers of cases with only NHI, both NHI and LSEMCS, and only LSEMCS were 1,712,297 (76.8%), 85,471 (3.8%), and 433,080 (19.4%), respectively.
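The reported breakdown by insurance scheme can be checked with simple arithmetic. The helper below is illustrative only; the counts are those given in the paragraph above.

```python
from typing import Dict


def shares_percent(counts: Dict[str, int], total: int) -> Dict[str, float]:
    """Percentage share of each category, rounded to one decimal place."""
    if sum(counts.values()) != total:
        raise ValueError("category counts do not add up to the total")
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


TOTAL = 2_230_848
BY_SCHEME = {
    "NHI only": 1_712_297,
    "NHI and LSEMCS": 85_471,
    "LSEMCS only": 433_080,
}

print(shares_percent(BY_SCHEME, TOTAL))
```

The three counts sum to 2,230,848, and the computed shares reproduce the reported 76.8%, 3.8%, and 19.4%.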
During the study period, among the 2,230,848 cases, the numbers of individuals who died for any reason were 110,873 (10.9%) for men and 109,323 (9.0%) for women. Figure 2 presents the age- and sex-specific survival rates from the insurers' initial subscription to death by any cause. The sex- and age-group-specific reasons for loss to follow-up appear in Table 2.
In all, 1,332,625 individuals (men, 601,750; 45.1%) were insured in fiscal 2015. The sex-specific age distribution on April 1, 2015 appears in eFigure 2; the median age among men and women was 67 (range, 0-107) and 70 (range, 0-108) years, respectively. We compared the sex- and age-group-specific case numbers with public statistics and the number of residents (Figure 3). In every age group, those numbers were close to the publicly available numbers of subscribers to NHI and LSEMCS.
The sex- and age-group-specific coverage rates for medical claims (including dental claims in NHI and LSEMCS), care-service suppliers (in LTCI), and health checkups in fiscal 2015 appear in Figure 4.
In fiscal 2015, data related to recipients of care service were available for 130,760 individuals (men, 38,807; 29.7%) who used
Figure 5 shows the frequency of age-group- and sex-specific initial care-service levels.
Health checkup data were available for 678,501 individuals (men, 288,890; 42.6%) with 2,383,523 results; those data amounted to 30.4% of the whole analysis set. Sex- and age-group-specific results of first health checkups within the study period appear in Table 3 and Table 4.
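Among the health checkup results, the eGFR is a derived value; the table notes give the equation as beginning 194 × creatinine^−1.094. The sketch below assumes this refers to the standard Japanese eGFR equation, which also includes an age term and a factor of 0.739 for women, and assumes serum creatinine in mg/dL. The function name and argument names are illustrative.

```python
def egfr_japan(creatinine_mg_dl: float, age_years: float, female: bool) -> float:
    """Estimated GFR (mL/min/1.73 m^2) from serum creatinine, age, and sex.

    Assumes the standard Japanese equation:
    194 x creatinine^-1.094 x age^-0.287 (x 0.739 for women).
    """
    if creatinine_mg_dl <= 0 or age_years <= 0:
        raise ValueError("creatinine and age must be positive")
    value = 194.0 * creatinine_mg_dl ** -1.094 * age_years ** -0.287
    return value * 0.739 if female else value
```

For a 60-year-old man with a creatinine of 1.0 mg/dL, this gives a value of roughly 60 mL/min/1.73 m².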

Notes to the health checkup tables:
ALT, alanine aminotransferase; AST, aspartate aminotransferase; BMI, body mass index; BP, blood pressure; γ-GTP, γ-glutamyl transpeptidase; HDL, high-density lipoprotein; LDL, low-density lipoprotein. Continuous and categorical variables were summarized by means (standard deviations) and frequency (percentages). In cases with multi-year data, the earliest one was used for analysis. The estimated glomerular filtration rate (eGFR) was calculated using the following equation: 194 × creatinine^−1.094 × age^−0.287 (× 0.739 for women). The test paper was dipped into the urine sample and qualitatively classified as (−), (±), (+), (++), or (+++) according to the discoloration density of the reagent.
Categorical variables were summarized by frequency (percentage). The questionnaires at the health checkups addressed the following questions:
a. Have you been told by a physician that you have suffered a stroke (eg, cerebral hemorrhage, cerebral infarction) or have you ever received treatment for stroke?
b. Have you been told by a physician that you suffer from heart disease (eg, angina pectoris, myocardial infarction) or have you ever received treatment for heart disease?
c. Have you been told by a physician that you suffer from chronic renal failure or have you ever received treatment for chronic renal failure (dialysis)?
d. Have you been told by a physician that you suffer from anemia?
e. Are you currently a habitual smoker?
f. How often do you drink alcoholic beverages (eg, sake, distilled spirit, beer, whiskey, wine)?
g. How much do you drink a day? The amount of drinking was assessed using the Japanese liquor unit of go, whereby 1 go corresponds to 22 g of ethanol.
h. Has your body weight increased by 10 kg or more since age 20?
i. Have you undergone a weight gain or loss of 3 kg or more in the past year?
j. Have you performed exercise that involved slight sweating for 30 minutes or more at least twice a week for over 1 year?
k. Do you walk or engage in some physical exercise equivalent to walking for 1 hour or more a day?
l. Do you walk faster than people who are of around the same age as you and the same sex?
m. Do you sleep well and get sufficient rest?
n. Do you want to improve your life habits regarding eating and exercise?
o. Do you eat faster than other people?
p. Do you eat dinner within 2 hours before sleeping at least three times a week?
q. Do you eat any snacks after dinner (a bedtime snack other than three regular meals) three times or more a week?
r. Do you miss breakfast three times or more a week?
s. Are you taking medication to reduce your blood pressure?
t. Are you taking medication to reduce your cholesterol level?
u. Are you taking insulin injections or medication to reduce your blood sugar?

DISCUSSION
The SKDB is a prefecture-wide, individual-level linked, and longitudinal dataset; it comprises health- and care-insurance claims and the results of health checkups. Several studies have used the Kokuho Database of a single municipality 13 and annually linked data 14 ; however, the SKDB is the first prefecture-wide, longitudinal dataset, and it includes over 2 million Shizuoka Prefecture residents. Several studies using the SKDB have been reported. [15][16][17] In the future, the SKDB will be updated by adding further data, starting from October 1, 2018.
Health insurance data have become widely used for epidemiological studies in several countries. The Taiwan National Health Insurance Research Database is the largest nationwide population database; it includes approximately 23 million Taiwanese, 1 and by 2018 over 2,700 medical papers related to it had been published. 18 The Netherlands, Scandinavian countries, and South Korea have also established national health-insurance databases. [2][3][4] In Japan, the NDB has covered almost all health-insurance claims submitted electronically from medical institutions since 2009. 19, 20 The NDB covers only individuals who used insurance to pay for medical care and undertook health checkups; it does not include all insured individuals. Furthermore, the NDB was not linked to the LTCI database until October 1, 2020. Accordingly, compared with other countries, health-insurance data in Japan have not been fully utilized for epidemiological studies. We believe that even though the SKDB covers a single prefecture, not the whole of Japan, it has complete personal links between insurance claims data and health checkup data; thus, it is appropriate for epidemiological analysis in several ways.
The coverage of the SKDB for individuals aged <75 years is limited to residents who enrolled in the NHI; it does not include people who subscribed to employee health insurance. However, all individuals ≥75 years are included in the SKDB. Japan is becoming a super-aged society ahead of other parts of the world 21,22 ; thus, population-based data about older people constitute essential information for understanding and addressing related problems. For other developed countries facing a rapidly aging population, it is essential to understand such health problems as frailty and to identify solutions. The SKDB may be one of the prime options for providing health-care evidence for an older Asian population.
The Japan Medical Data Center (JMDC) has a health-insurance claims dataset that can be used for medical research. 23,24 That dataset mainly includes individual-level linked data on health-insurance claims and specific health checkups; the data were obtained mainly from insurers providing employee health insurance. The SKDB, however, does not cover beneficiaries of employee health insurance. The JMDC dataset does not include data on retired people; it is unsuitable for analyzing geriatric diseases (such as bone fractures, dementia, and terminal care) and long-term care. The ability to undertake such analyses is a strength of the SKDB. Accordingly, JMDC data and organized Kokuho Databases, such as the SKDB, may need to complement one another. An extensive database makes it possible to grasp the basic epidemiology of certain diseases with low incidence, including rare conditions. For example, the incidence of progressive multifocal leukoencephalopathy among patients with autoimmune diseases was determined by analyzing a United States health insurance database. 25 Kuo et al assessed familial aggregation of systemic lupus erythematosus and other autoimmune diseases from the data of over 18,000 patients. 26 Using the SKDB as an extensive dataset may help determine the characteristics (eg, lifestyle information from health checkups) of several low-incidence diseases in addition to calculating the incidence and prevalence. The SKDB may be useful for finding solutions to health-care issues that cannot easily be assessed by hospital- and population-based cohorts.
It is difficult to precisely conduct a prognosis analysis (including an economic analysis) of cancer patients. In the future (as of November 18, 2020), if it becomes possible to match the population-based cancer registry data in Shizuoka Prefecture 27 with the SKDB, the detailed baseline characteristics of cancer and detailed cause-of-death information will be added to the SKDB. Conversely, the SKDB provides a population-based cancer registry with medical or long-term care services, as well as more detailed medical information, such as comorbidities in cancer patients. Thus, a precise analysis, which is currently not possible, may be conducted in future by combining the SKDB with other data.

Limitations
When analyzing the SKDB, there are several limitations regarding identifiers, care-insurance claims, health checkups, and cause of death. First, with respect to identifiers, we found cases with multiple KDBIDs by checking the coincidence of sex, birth date, and postal code such that only a single ID remained in the SKDB. Different people having exactly the same information could be deleted over suspicion of being the same person. Second, conversely, cases with multiple KDBIDs may not be eliminated; for example, individuals who moved to another municipality in Shizuoka Prefecture within the study period may have had different KDBIDs and thus were treated as different people. Third, we treated readmission cases with the same KDBID as continuous subscribers; however, data on insurance claims and specific health checkups during a period of temporary withdrawal were not available.
Regarding care-insurance claims, first, only two digits, not six digits, of the care-service code were available, so we lacked details about that service, knowing only approximately the type of long-term care service. Thus, a detailed analysis of care services provided for insured individuals is impractical using the SKDB. Second, we could not analyze all insured individuals receiving LTCI: the dataset included claims data only for care-service receivers who used care insurance to pay for care costs. Further, certified information about care levels was unavailable. The SKDB included only about 84% of care-service receivers from March to September 2018 among the individuals certified as needing care services in March 2018 28 ; thus, individuals with poorer health and with a certified care level could not be accurately identified.
Health checkups are not mandatory for older people in Japan; systems for health checkups depend on the annual health administration policy of each municipality. Therefore, in a subpopulation with available health checkup results, there may be subgroups that should be distinguished by year and municipality in addition to sex and age. We investigated the presence of associations among those four classification variables, as well as among other variables; perhaps owing to the large number of cases, almost all the tests for independence were significant (P < 0.001, data not shown). Thus, to avoid a severe bias when assessing health checkup data, the classification variables should be confirmed and an adjustment analysis should be undertaken when analyzing the SKDB.
This database contains all deaths and dates of death (provided by the FNHIO), but the cause of death is unknown. It may be identifiable by extracting the disease name code for the months before death.

Conclusion
The SKDB is organized as an individual-level linked, population-based longitudinal cohort; it comprises data of Shizuoka Prefecture residents with NHI and LSEMCS. The dataset covers all individuals insured with NHI and LSEMCS, not just those receiving medical care. The database also contains subscriber lists, health checkup results, LTCI claims data, and all dates of death. The SKDB may be useful for addressing healthcare issues of older people and epidemiological issues that remain unclarified by conventional population-based cohorts.
Nakatani E, et al.