Annals of Clinical Epidemiology
Online ISSN : 2434-4338
SEMINAR
Introduction to High-dimensional Propensity Score Analysis
Miho Ishimaru

2020 Volume 2 Issue 4 Pages 85-94

ABSTRACT

High-dimensional propensity score analysis automatically selects independent variables for calculating propensity scores, using a vast amount of information from real-world health care databases. This technique can reduce confounding by indication and the effect of unmeasured confounders more effectively than conventional propensity score analysis. High-dimensional propensity score analysis assumes that proxy information for important unmeasured confounders can be obtained from the underlying data. The number of published studies using high-dimensional propensity score analysis has increased, mainly in pharmacoepidemiology. This report explains the main assumption and the limitations of this analytical method and provides step-by-step procedures to implement the method.

INTRODUCTION

Real-world data (RWD) have been used to study the effectiveness of medical procedures and medications in many countries in recent years [1]. However, observational studies using RWD have several limitations, including confounding by indication and unmeasured confounders. Propensity score analyses have recently been applied to adjust for confounding by indication. An overview of propensity score analysis has been provided elsewhere [2].

High-dimensional propensity score (hd-PS) analysis is a novel analytical method that uses RWD to overcome the limitations of conventional propensity score analysis. The hd-PS approach was first proposed in 2009 [3] and was updated in 2018 by Schneeweiss et al. [4]. Hd-PS analysis can be used in clinical studies with RWD such as electronic health records or administrative claims databases. Using a vast amount of medical information from these types of databases, hd-PS analysis can reduce both confounding by indication and the effect of unmeasured confounders.

The number of clinical studies using hd-PS analysis has gradually increased; however, the hd-PS approach may often be difficult for clinicians or researchers to understand. The present article aims to introduce the concept of hd-PS, provide information on how to use hd-PS analysis, and review previous studies that have used the hd-PS approach.

OVERVIEW OF THE HIGH-DIMENSIONAL PROPENSITY SCORE APPROACH

The basic concept and assumption of hd-PS analysis are similar to those of conventional propensity score analysis. Indeed, hd-PS analysis can be regarded as an extension of conventional propensity score analysis.

The key assumption of hd-PS analysis is that proxy information for important unmeasured confounders can be obtained from the underlying data. In conventional propensity score analysis, researchers select several variables from a database as confounding factors on the basis of knowledge from previous studies or from the investigators’ experience. However, in hd-PS analysis, the independent variables used to estimate the propensity scores are selected automatically from a vast amount of historical data on diagnoses, prescriptions, and procedures in RWD. These medical records are presumably related to a variety of patient characteristics, including health and socioeconomic status. Thus, the information in the records can be regarded as proxy variables for unmeasured confounding factors.

Assume that a researcher wants to evaluate the association between Treatment A and mortality using a database. Previous studies have reported age, body mass index, diabetes mellitus, and activities of daily living (ADL) as confounding factors. The researcher’s database contains information on age, body mass index, and diabetes mellitus but unfortunately does not include data on ADL. Thus, ADL is an unmeasured confounder (Fig. 1). However, ADL may be associated with various illnesses, procedures, and prescriptions, such as history of fracture, cerebrovascular disease, rehabilitation, and prescription of anti-dementia drugs. Thus, it is possible to indirectly control for ADL by adjusting for as many of the available variables that are associated with ADL as possible.

Fig. 1 Proxy measures of unobserved confounders for high-dimensional covariate adjustment

The utility of hd-PS analysis compared with conventional propensity score analysis has been reported in several previous studies. First, compared with the results from conventional propensity score analyses, the effect sizes obtained from hd-PS analyses are closer to those obtained from randomized, controlled trials [3, 5]. Second, a simulation study reported that the estimated effect sizes are closer to the expected values in hd-PS analyses than in conventional propensity score analyses [6]. Third, using hd-PS analysis, two groups were found to be well balanced with respect to unmeasured variables [7, 8].

HOW TO CREATE HIGH-DIMENSIONAL PROPENSITY SCORES

This section details 6 steps on how to create hd-PSs. As an example, I refer to a previous study we published in 2019 using the JMDC Claims Database (JMDC Inc., Tokyo, Japan) [9]. The JMDC database is a commercial Japanese claims database that contains data on 7.3 million corporate employees and their family members aged under 75 years.

Determination of Data Dimensions

Administrative claims databases can usually be divided into data dimensions, each including a subset of information with a specific coding system. My study identified 6 data dimensions in the JMDC database: (i) inpatient and (ii) outpatient diagnoses (International Classification of Diseases, 10th Revision codes), (iii) inpatient and (iv) outpatient procedures (Japanese original procedure codes), and (v) inpatient and (vi) outpatient prescriptions dispensed by a pharmacy (Anatomical Therapeutic Chemical Classification System codes). Examples of the diagnoses recorded in the database are shown in Table 1. Researchers can add data dimensions for laboratory test results, biomarker status, and free text if necessary.

Table 1 An example of recorded diagnoses

| Patient identifier | Inpatient/outpatient | Year/Month | Diagnosis | ICD-10 code |
|---|---|---|---|---|
| A | Inpatient | 2015/01 | Hypertension | I10.00 |
| A | Inpatient | 2015/01 | Malignant neoplasm: body of stomach | C16.20 |
| A | Outpatient | 2015/02 | Hypertension | I10.00 |
| A | Outpatient | 2015/03 | Hypertension | I10.00 |
| B | Outpatient | 2016/01 | Hypertension | I10.00 |
| B | Outpatient | 2016/02 | Hypertension | I10.00 |
| B | Outpatient | 2016/03 | Hypertension | I10.00 |
| C | Inpatient | 2012/03 | Type 2 diabetes mellitus | E11.90 |
| C | Inpatient | 2012/03 | Hypertension | I10.00 |
| D | Outpatient | 2015/02 | Hypertension | I10.00 |
| E | Outpatient | 2014/10 | Type 2 diabetes mellitus | E11.90 |

Abbreviations: ICD-10, International Classification of Diseases, 10th Revision.

Identification of Candidate Empirical Covariates

The codes for the variables are listed for each dimension and sorted in order of prevalence. Prevalence is defined as the percentage of patients who have a specific code at least once during a 6-month or 12-month baseline period before the exposure. When the prevalence is greater than 50%, the prevalence should be replaced by 100% minus the prevalence. For example, when the prevalence is 70%, it is replaced by 30%. The top 100 to 200 most prevalent codes are identified as candidate empirical covariates.
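The prevalence ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `candidate_codes`, the `(patient_id, code)` record layout, and the five-patient example cohort are hypothetical and are not taken from the JMDC study. Records are assumed to be already restricted to the baseline period and to a single data dimension.

```python
from collections import defaultdict

def candidate_codes(records, n_patients, top_n=200):
    """Rank the codes of one data dimension by folded prevalence.

    records: iterable of (patient_id, code) pairs from the baseline period.
    n_patients: size of the whole study cohort (the denominator).
    """
    patients_with_code = defaultdict(set)
    for patient_id, code in records:
        # A patient counts once per code, however often the code recurs.
        patients_with_code[code].add(patient_id)

    folded = {}
    for code, patients in patients_with_code.items():
        prevalence = len(patients) / n_patients
        # Prevalence above 50% is replaced by 100% minus the prevalence.
        folded[code] = min(prevalence, 1.0 - prevalence)

    # Highest folded prevalence first; ties are broken alphabetically.
    ranked = sorted(folded.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical outpatient diagnoses in a cohort of five patients (A to E).
outpatient = [("A", "I10"), ("A", "I10"), ("B", "I10"), ("B", "I10"),
              ("B", "I10"), ("D", "I10"), ("E", "E11")]
top = candidate_codes(outpatient, n_patients=5, top_n=2)
```

In this toy cohort, I10 is recorded for 3 of 5 patients (60%), so its prevalence is folded to 40%, and it still ranks above E11 (20%). The function would be called once per data dimension.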

In the JMDC database, diagnosis codes include International Classification of Diseases, 10th Revision codes ranging from A00.00 to Z99.90, procedure codes include Japanese procedure codes ranging from A000 (initial consultation fee) to N007 (histopathology), and pharmacy dispensing codes include Anatomical Therapeutic Chemical Classification System codes ranging from A01 to V10. The number of digits used for each code must be determined before the prevalence of each code is calculated. Using 3 or 4 digits is generally appropriate: a larger number of digits gives a finer classification, which spreads patients across more codes and lowers the prevalence of each code.
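The effect of code granularity can be illustrated with a small sketch. The helper `truncate_code` is hypothetical, and I assume here that "digits" means the leading characters of a code once the decimal point is removed, as in the ICD-10 codes of Table 1. The code E11.40 is added purely for illustration and does not appear in Table 1.

```python
from collections import Counter

def truncate_code(code, n_chars=3):
    """Keep the first n_chars characters of a code, ignoring the decimal point."""
    return code.replace(".", "")[:n_chars]

# Three codes from Table 1 plus one hypothetical code (E11.40).
codes = ["I10.00", "C16.20", "E11.90", "E11.40"]

three = Counter(truncate_code(c, 3) for c in codes)  # coarser grouping
four = Counter(truncate_code(c, 4) for c in codes)   # finer grouping
```

With 3 characters, the two diabetes codes collapse into a single E11 category; with 4 characters they remain separate (E119 and E114), so each category covers fewer patients.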

Evaluation of the Occurrence of the Variable Codes

For each variable code appearing for each patient during the baseline period, the number of occurrences is counted. Three binary variables are then created for each candidate empirical covariate: (i) at least one occurrence of the code, (ii) sporadic occurrences, and (iii) many occurrences. An example of the evaluation of the occurrence of a candidate empirical covariate using these binary variables is shown in Table 2.

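Table 2 Count of the occurrences of each variable code

| Patient identifier | Number of occurrences of I10 (hypertension) | (1) At least one occurrence of the code | (2) Sporadic occurrences | (3) Many occurrences |
|---|---|---|---|---|
| A | 5 | 1 | 1 | 1 |
| B | 3 | 1 | 1 | 0 |
| C | 1 | 1 | 0 | 0 |
| D | 1 | 1 | 0 | 0 |
| E | 0 | 0 | 0 | 0 |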
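The three binary variables can be derived from the occurrence counts as in the following sketch. The cut-offs are an assumption on my part: this article does not state them, and I take "sporadic" to mean at least the median count and "many" to mean at least the 75th percentile, both computed among patients with at least one occurrence, which is my reading of the original hd-PS algorithm [3]. The function name `recurrence_indicators` is hypothetical.

```python
import statistics

def recurrence_indicators(counts):
    """Map {patient_id: occurrence count} to (at least once, sporadic, many)."""
    positive = sorted(c for c in counts.values() if c > 0)
    if len(positive) >= 2:
        # Quartiles among patients who have the code at least once (assumed cut-offs).
        _, median, q3 = statistics.quantiles(positive, n=4, method="inclusive")
    elif positive:
        median = q3 = positive[0]
    else:
        median = q3 = float("inf")  # the code never occurs in this cohort
    return {
        pid: (int(c >= 1), int(c >= median), int(c >= q3))
        for pid, c in counts.items()
    }

# Occurrence counts of I10 (hypertension) from Table 2.
i10_counts = {"A": 5, "B": 3, "C": 1, "D": 1, "E": 0}
indicators = recurrence_indicators(i10_counts)
```

Under these assumed cut-offs (median 2 and 75th percentile 3.5 among patients A to D), the sketch reproduces the indicator columns of Table 2. A different percentile convention could shift the thresholds in small samples.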

Bias Evaluation of the Covariates

Next, potential biases for candidate variables are assessed by considering the relationship between each variable and the outcome. First, the multiplicative bias term (BiasM) is calculated using the following formula:

  

$$\mathrm{Bias}_M = \frac{P_{c1}\,(RR_{cd} - 1) + 1}{P_{c0}\,(RR_{cd} - 1) + 1},$$

where Pc1 and Pc0 are the proportions of patients for whom the code is observed in the treatment and control groups, respectively, and RRcd is the unadjusted relative risk of the outcome associated with the variable.

The absolute values of the logarithms of BiasM are then compared across candidate empirical covariates. An example of this bias evaluation is shown in Table 3.

Table 3 Evaluation of bias using the multiplicative bias term (BiasM)

| Variable | Pc1 | Pc0 | RRcd | BiasM | Absolute value of log(BiasM) |
|---|---|---|---|---|---|
| I10_1 (at least one occurrence of I10) | 0.6 | 0.2 | 1.50 | 1.1818 | 0.167 |
| I10_2 (sporadic occurrences of I10) | 0.2 | 0.2 | 1.33 | 1 | 0 |
| I10_3 (many occurrences of I10) | 0.2 | 0 | 3.00 | 1.4 | 0.336 |

Note. Pc1 and Pc0 are the proportions of patients with the variable in the treatment and control groups, respectively. RRcd is the unadjusted relative risk of the outcome associated with the variable.
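As a check on the arithmetic, the bias term can be computed directly from Pc1, Pc0, and RRcd. The sketch below uses only the formula given in this section and the natural logarithm; the function names are hypothetical, and any special handling of relative risks below 1 is outside its scope.

```python
import math

def bias_multiplier(pc1, pc0, rr_cd):
    """Multiplicative bias term BiasM for one candidate covariate."""
    return (pc1 * (rr_cd - 1) + 1) / (pc0 * (rr_cd - 1) + 1)

def abs_log_bias(pc1, pc0, rr_cd):
    """Absolute value of the natural logarithm of BiasM, used for ranking."""
    return abs(math.log(bias_multiplier(pc1, pc0, rr_cd)))

# Inputs taken from Table 3: (Pc1, Pc0, RRcd) for each variable.
table3 = {
    "I10_1": (0.6, 0.2, 1.50),
    "I10_2": (0.2, 0.2, 1.33),
    "I10_3": (0.2, 0.0, 3.00),
}
scores = {name: abs_log_bias(*args) for name, args in table3.items()}
```

Rounded to three decimal places, the scores agree with Table 3. Note that I10_2 has a score of 0 because the code is equally frequent in both groups, so it cannot confound the comparison however strongly it is related to the outcome.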

Selection of Empirical Covariates

After the absolute values of the logarithms of BiasM are sorted in descending order, the top k variables are selected for inclusion in the propensity score estimation model. Previous studies have reported that 50 to 300 top-ranked variables were sufficient for stable estimation; in practice, the top 500 variables have generally been selected [10].

Adding other factors is optional; additional factors might include patient background factors (e.g., age, gender, year, and race), known confounding factors, and variables on health service utilization such as the number of visits and drug prescriptions.
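The ranking and selection described in this subsection amount to a sort followed by a cut at k. A minimal sketch, using the rounded scores from Table 3 and a hypothetical helper `select_top_k`, is as follows; in a real analysis k would typically be in the hundreds rather than 2.

```python
def select_top_k(bias_scores, k):
    """Return the k variable names with the largest absolute log(BiasM)."""
    # Sort by descending score; ties are broken by variable name.
    ranked = sorted(bias_scores, key=lambda name: (-bias_scores[name], name))
    return ranked[:k]

# Absolute log(BiasM) values from Table 3.
bias_scores = {"I10_1": 0.167, "I10_2": 0.0, "I10_3": 0.336}
selected = select_top_k(bias_scores, k=2)
```

Here I10_3 and I10_1 are retained and I10_2 is dropped. Investigator-specified factors such as age or known confounders would be added to the selected list separately and are not subject to this ranking.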

Propensity Score Estimation

Finally, a multivariable logistic regression model is used to estimate the hd-PSs. This model includes the selected empirical covariates, background factors, and known confounding factors. The estimated hd-PSs are then applied in the same ways as conventional propensity scores, such as matching, inverse probability of treatment weighting, stratification, and covariate adjustment.
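To make the final step concrete, the following sketch fits a logistic regression by plain gradient ascent and converts the predicted probabilities into stabilized inverse probability of treatment weights. It is a toy illustration only: the single binary covariate, the eight-patient data set, the learning rate, and all function names are assumptions for demonstration. A real hd-PS model would contain hundreds of covariates and would normally be fitted with an established statistical package.

```python
import math

def predict(w, xi):
    """Predicted probability of treatment; w = [intercept, coefficients...]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Fit an unpenalized logistic regression by gradient ascent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - predict(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def stabilized_iptw(treated, ps):
    """Stabilized weights: P(T=1)/ps if treated, P(T=0)/(1 - ps) otherwise."""
    p_treat = sum(treated) / len(treated)
    return [p_treat / e if t == 1 else (1 - p_treat) / (1 - e)
            for t, e in zip(treated, ps)]

# Hypothetical data: one selected covariate; 3 of 4 treated patients carry it,
# compared with 1 of 4 control patients.
X = [[1], [1], [1], [0], [1], [0], [0], [0]]
treated = [1, 1, 1, 1, 0, 0, 0, 0]

w = fit_logistic(X, treated)
ps = [predict(w, xi) for xi in X]
weights = stabilized_iptw(treated, ps)
```

In this example, patients with the covariate receive a propensity score near 0.75 and those without it a score near 0.25. The treated patient without the covariate and the control patient with it are therefore up-weighted (weights near 2), which is how weighting balances the covariate across groups.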

OVERVIEW OF STUDIES USING HIGH-DIMENSIONAL PROPENSITY SCORE METHODS

This section provides a narrative review of published studies conducted using the hd-PS approach in a variety of settings. In June 2020, I searched studies in PubMed using the following search string: (“high-dimensional propensity score*” OR “high dimensional propensity score*”).

I found 157 studies, 4 of which were excluded because 2 were not original articles and 2 did not focus on hd-PS analysis. Therefore, 153 studies were eligible for this review [3–155]. Fig. 2 shows the number of studies per publication year. The numbers of studies focusing on the hd-PS approach or using hd-PS analyses have gradually increased. Of the identified studies, 30 were methodology articles that focused on hd-PS analysis, 108 were clinical epidemiological studies, 8 were health service research, and 7 were other types of studies.

Fig. 2 Number of studies using high-dimensional propensity score analysis by publication year

Of the 108 clinical epidemiological studies, 87 were pharmacoepidemiological studies, 9 were studies focusing on diagnoses, and 12 were studies examining medical procedures.

Of the 87 pharmacoepidemiological studies, 18 focused on drugs for diabetes mellitus, 14 focused on anticoagulant or antiplatelet drugs, and 13 focused on antidepressant drugs.

Regarding the methods for using hd-PSs, 74 studies used matching, 37 used adjustment, 7 used inverse probability of treatment weighting, and 2 used stratification.

The number of studies by country was 49 in the United States, 27 in Canada, 24 in the United Kingdom, 16 in other European countries (France, Sweden, Germany, and Italy), and 7 in Asian countries (Taiwan and Japan).

One study evaluated the association between benzodiazepines and risk of all-cause mortality in adults [72]. For creating hd-PSs, the top 200 variables were selected for inclusion in the propensity score estimation model on the basis of the evaluation of BiasM. In addition to these variables, investigator-identified variables such as demographic characteristics, comorbidities and lifestyle factors, medications, and health care utilization indicators were included. After 1:1 hd-PS matching was performed, 1,252,988 eligible matched pairs were identified. The results from an unadjusted Cox regression analysis indicated that mortality was significantly higher in benzodiazepine users than in non-initiators (hazard ratio, 1.79; 95% confidence interval, 1.73 to 1.85). However, the hazard ratio declined to 1.00 (95% confidence interval, 0.96 to 1.04) in the hd-PS-matched population. The authors concluded that these results suggested either no increase or a small increase in the risk of all-cause mortality associated with benzodiazepine use.

Another study evaluated the quality of care for percutaneous coronary interventions (PCI) in outpatient settings compared with in-hospital settings [85]. To create hd-PSs, the following 6 dimensions were defined: inpatient diagnoses, ambulatory diagnoses without an exclusion diagnosis, inpatient procedures, outpatient procedures, ambulatory treatments, and ambulatory prescribed medications. The top 200 most prevalent codes were selected for each dimension. Finally, the top 500 variables were selected for inclusion in the propensity score estimation model after the evaluation of BiasM. In the main hd-PS analysis, the authors used a weighted Cox regression with stabilized inverse probability of treatment weights based on the hd-PSs. In the conventional propensity score analysis, mortality was lower among patients receiving the outpatient PCI than among those receiving the inpatient PCI (hazard ratio, 0.45; 95% confidence interval, 0.37 to 0.55). However, in the hd-PS analysis, the mortality of patients receiving the outpatient PCI was not lower than that of those receiving the inpatient PCI (hazard ratio, 1.20; 95% confidence interval, 0.91 to 1.52).

LIMITATIONS

Hd-PS analysis has several limitations. First, not all unmeasured confounders can be adjusted using hd-PS analysis. Only unmeasured confounders associated with variables in the database can be controlled in this way. If there are independent unmeasured confounders, other analytical methods and study designs should also be considered.

Second, there are concerns about the possibility of over-adjustment with the hd-PS approach. Adjusting for too many pre-exposure covariates will lead to collinearity and statistical inefficiency in the estimation. However, a previous study has demonstrated that the biases from over-adjustment in hd-PS analysis are minor [4].

CONCLUSIONS

Hd-PS analyses automatically select a large number of variables for use in estimating propensity scores, drawing on a vast amount of information in RWD. Hd-PS analysis can reduce residual confounding, with the assumption that proxy information for important unmeasured confounders can be obtained from the underlying data. The number of studies using hd-PS analyses is increasing, particularly in pharmacoepidemiology. I believe that hd-PS analysis is a promising tool for enhancing comparative effectiveness studies using RWD.

CONFLICT OF INTERESTS

MI has no conflict of interest.

ACKNOWLEDGMENTS

This work was supported by grants from the Ministry of Health, Labour and Welfare, Japan (19AA2007 and H30-Policy-Designated-004) and the Ministry of Education, Culture, Sports, Science and Technology, Japan (17H04141). The funders had no role in the execution of this study or the interpretation of the results.

REFERENCES
 
© 2020 Society for Clinical Epidemiology

This article is licensed under a Creative Commons [Attribution-NonCommercial-NoDerivatives 4.0 International] license.
https://creativecommons.org/licenses/by-nc-nd/4.0/