Risk Ratio and Risk Difference Estimation in Case-cohort Studies

Background: In case-cohort studies with binary outcomes, ordinary logistic regression has been widely used because of its computational simplicity. However, the resulting odds ratio estimates cannot be interpreted as relative risks unless the event rate is low. The risk ratio and risk difference are preferable because they can be interpreted directly as effect measures without the rare disease assumption.

Methods: We provide pseudo-Poisson and pseudo-normal linear regression methods for estimating risk ratios and risk differences in analyses of case-cohort studies. These multivariate regression models are fitted by weighting each sampled subject by the inverse of its sampling probability. The precision of the risk ratio and risk difference estimators can also be improved by using auxiliary variables that are measured on all subjects in the whole cohort, specifically by adopting calibrated or estimated weights. Finally, we provide computational code in R (R Foundation for Statistical Computing, Vienna, Austria) that can easily perform these methods.

Results: In numerical analyses of simulated data and of the National Wilms Tumor Study data, the pseudo-Poisson and pseudo-normal linear regression methods yielded accurate risk ratio and risk difference estimates. Using the auxiliary variable information from the whole cohort markedly improved the precision of these estimators.

Conclusion: Ordinary logistic regression analyses may provide effect measure estimates that cannot be interpreted, and the risk ratio and risk difference estimation methods are effective alternatives for case-cohort studies. These methods are especially recommended when the event rate is not low.

where the auxiliary variables are those measured for all subjects in the whole cohort, and the weights are set so that the weighted sum of the auxiliary variables over the phase-2 samples is equal to their population total over the phase-1 cohort (i = 0,1; j = 1,…,J; k = 1,…,Nij); Ω is the index set of the phase-1 samples. When the target measure is the population total of a variable , and if . See Deville and Särndal (1992) and Deville et al. (1993) for more distance functions and their properties.
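As a minimal sketch of the calibration step, the following Python code (not the authors' R implementation) adjusts design weights so that weighted totals in the phase-2 sample match known phase-1 totals. It assumes a raking-type distance function, under which the calibrated weights take the form d_k exp(λ0 + λ1 a_k), and a single auxiliary variable plus a constant. The function name `rake_weights` and the toy numbers are hypothetical.

```python
import math

def rake_weights(design_w, aux, totals, n_iter=50, tol=1e-10):
    """Calibrate design weights so that sum(w) and sum(w * aux) match `totals`.

    Assumed raking form: w_k = d_k * exp(lam0 + lam1 * aux_k),
    with (lam0, lam1) found by Newton's method.
    """
    lam0, lam1 = 0.0, 0.0
    for _ in range(n_iter):
        w = [d * math.exp(lam0 + lam1 * a) for d, a in zip(design_w, aux)]
        # Calibration residuals: weighted totals minus known cohort totals.
        f0 = sum(w) - totals[0]
        f1 = sum(wk * a for wk, a in zip(w, aux)) - totals[1]
        if abs(f0) < tol and abs(f1) < tol:
            break
        # Jacobian of the calibration equations (2 x 2, symmetric).
        j00 = sum(w)
        j01 = sum(wk * a for wk, a in zip(w, aux))
        j11 = sum(wk * a * a for wk, a in zip(w, aux))
        det = j00 * j11 - j01 * j01
        # Newton step: lam <- lam - J^{-1} f.
        lam0 -= (j11 * f0 - j01 * f1) / det
        lam1 -= (-j01 * f0 + j00 * f1) / det
    return [d * math.exp(lam0 + lam1 * a) for d, a in zip(design_w, aux)]

# Toy phase-1 cohort (N = 10) with one auxiliary variable known for everyone.
cohort_aux = [0.2, 1.5, 0.7, 2.1, 1.1, 0.4, 1.9, 0.8, 1.3, 0.6]
sample_idx = [0, 3, 4, 7, 9]              # hypothetical phase-2 sample
sample_aux = [cohort_aux[i] for i in sample_idx]
design_w = [2.0] * len(sample_idx)        # inverse sampling probabilities

calibrated = rake_weights(design_w, sample_aux,
                          totals=(len(cohort_aux), sum(cohort_aux)))
```

After calibration, the weights sum to the cohort size and reproduce the cohort total of the auxiliary variable, while staying positive because of the exponential form.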
In case-cohort studies analyzed with the pseudo-Poisson and pseudo-normal linear regressions, the IPW estimator with design weights is approximately equal to the true regression coefficient β0 plus a weighted sum of the efficient scores, each premultiplied by the inverse of I(β0), the information matrix of the Poisson or normal linear regression model. Since β0 is a fixed quantity, the estimator is expected to improve when the weights are calibrated with respect to auxiliary variables that are correlated with these scaled efficient scores. Breslow et al. (2009a, b) proposed using the dfbetas computed at β̃, the regression coefficient estimate for the phase-1 cohort data. For the computation of the dfbetas, since β̃ is unknown, we propose using the following approximate computational method.
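In assumed notation (the symbols below are illustrative and need not match the authors'), the approximation and the resulting dfbetas might be written as follows. Here $\hat{\beta}$ is the IPW estimator, $S \subset \Omega$ the phase-2 sample, $w_{ijk}$ the design weights, $N$ the phase-1 cohort size, $U_{ijk}(\beta)$ the score contribution of subject $(i,j,k)$, and $I(\beta)$ the per-subject information matrix.

```latex
\hat{\beta} \;\approx\; \beta_0
  + \frac{1}{N} \sum_{(i,j,k) \in S} w_{ijk}\, I^{-1}(\beta_0)\, U_{ijk}(\beta_0),
\qquad
\mathrm{dfbeta}_{ijk} \;=\; I^{-1}(\tilde{\beta})\, U_{ijk}(\tilde{\beta}),
\quad (i,j,k) \in \Omega .
```

Under this reading, the dfbetas are phase-1 stand-ins for the terms in the sum, which is why calibrating on them is expected to reduce the variance of the estimator.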
(i) Because the phase-2 variables are missing, Breslow et al. (2009a, b) proposed imputing a single suitable value for each missing covariate. To predict the missing covariates, construct a regression model with the fully observed covariates as explanatory variables and fit this prediction model by weighted estimation.
(ii) For the imputed phase-1 dataset, use the predicted values generated in Step (i) to estimate β̃ by fitting the pseudo-Poisson or pseudo-normal linear regression.
Then, extract the dfbetas from the regression model.
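The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' R code: the prediction model in Step (i) is a design-weighted mean of x1 within levels of a single binary phase-1 variable z1, predictions are used for every phase-1 subject, the outcome model is an identity-link (normal linear) regression with an intercept and x1 only, and the dfbetas are the first-order influence terms (X'X)^{-1} x_i r_i. All function names and toy data are hypothetical.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def impute_x1(z1_all, phase2_idx, x1_phase2, design_w):
    """Step (i): predict x1 for everyone from z1 using design-weighted phase-2 means."""
    pred = {}
    for level in (0, 1):
        rows = [r for r, i in enumerate(phase2_idx) if z1_all[i] == level]
        pred[level] = weighted_mean([x1_phase2[r] for r in rows],
                                    [design_w[r] for r in rows])
    return [pred[z] for z in z1_all]

def fit_linear(x, y):
    """Step (ii): least squares for y = b0 + b1 * x; returns (b0, b1) and (X'X)^{-1}."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    inv = ((sxx / det, -sx / det), (-sx / det, n / det))
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    return (b0, b1), inv

def dfbetas(x, y, beta, inv):
    """Per-subject influence terms (X'X)^{-1} x_i (y_i - x_i' beta)."""
    out = []
    for xi, yi in zip(x, y):
        resid = yi - (beta[0] + beta[1] * xi)
        out.append((resid * (inv[0][0] + inv[0][1] * xi),
                    resid * (inv[1][0] + inv[1][1] * xi)))
    return out

# Toy phase-1 cohort of 12 subjects; x1 is observed only in phase 2.
z1_all = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_all = [0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1]
phase2_idx = [0, 1, 4, 5, 8, 9, 11]
x1_phase2 = [0, 1, 0, 0, 1, 1, 0]
design_w = [3.5, 1.0, 3.5, 1.0, 1.0, 1.0, 1.0]   # cases have weight 1

x1_imputed = impute_x1(z1_all, phase2_idx, x1_phase2, design_w)
beta_tilde, xtx_inv = fit_linear(x1_imputed, y_all)
aux_dfbetas = dfbetas(x1_imputed, y_all, beta_tilde, xtx_inv)
```

Because the fitted model satisfies its own estimating equations, each dfbeta component sums to zero over the phase-1 cohort, so the calibration target for these auxiliary variables is zero.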
Then, approximate dfbetas are computed using the estimates of β̃, and the design weights are calibrated using the computed dfbetas as the auxiliary variables. We also propose using the calibrated weights with the approximate dfbetas. Through a weighted pseudo-

x1 was assumed to be measured only for the phase-2 samples and to be correlated with two other variables, z1 ~ Bernoulli(0.10) and z2 ~ N(0, 1). These variables were observed for all subjects in the phase-1 cohort, such that logit{Pr(X1 = 1 | z1, z2)} = γ0 + γ1 z1 + γ2 z2. x2 and x3 were assumed to be measured for all participants in the phase-1 cohort and were dummy variables of a trinomial distribution with event probabilities 0.16 and 0.48, respectively. The stratified phase-2 sampling used six strata defined by z1 and tertiles of z2. The sample size of the phase-1 cohort was set to 4,000. In the phase-2 sampling, all cases were sampled, and the number of cases in the subcohort was set to 400 for the three strata with z1 = 0 and to 100 for the three strata with z1 = 1.

For the log link model, the regression parameters were set to β0 = −1.81, β1 = 0.96, β2 = −0.28, β3 = −0.39, and δ12 = 1.84. For the identity link model, the regression parameters were set to β0 = 0.17, β1 = 0.22, β2 = −0.03, β3 = −0.06, and δ12 = 0.38. For the model used to generate x1, we considered z1 as a correlated surrogate variable of x1. Two settings for the correlation of x1 and z1 were considered: (i) γ0 = −3.50, γ1 = 0.50, and γ2 = 4.5 (sensitivity = 0.720, specificity = 0.967; high correlation); and (ii) γ0 = −3.50, γ1 = 0.50, and γ2 = 3.0 (sensitivity = 0.384, specificity = 0.967; moderate correlation). We performed 10,000 simulations for each scenario.

We compared the performance of the IPW estimators (with the design weight, calibrated weight, and estimated weight) with the entire cohort estimator as a benchmark. For the calibrated weights, we computed approximate dfbetas, predicted the missing x1 using the logistic model logit{Pr(X1 = 1 | z1, z2)} = γ0 + γ1 z1 + γ2 z2 (correct model), and adopted a raking distance function. For the estimated weights, we used a logistic regression model and adopted the stratum indicators and the dfbetas as covariates.
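A rough Python sketch of one simulated dataset under the identity link setting is given below. It is not the authors' simulation code and rests on several labeled assumptions: δ12 is taken to multiply x1·x2, the subcohort is drawn by simple random sampling within each stratum with hypothetical per-stratum sizes (133 when z1 = 0, 33 when z1 = 1), and sampled non-cases receive the weight (non-cases in stratum) / (sampled non-cases in stratum), which is one common choice of design weight.

```python
import math
import random

def simulate_cohort(n=4000, gamma=(-3.5, 0.5, 4.5), seed=1):
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        z1 = 1 if rng.random() < 0.10 else 0
        z2 = rng.gauss(0.0, 1.0)
        lin = gamma[0] + gamma[1] * z1 + gamma[2] * z2
        x1 = 1 if rng.random() < 1.0 / (1.0 + math.exp(-lin)) else 0
        u = rng.random()                 # trinomial dummies: 0.16 and 0.48
        x2 = 1 if u < 0.16 else 0
        x3 = 1 if 0.16 <= u < 0.64 else 0
        # Identity-link risk; delta12 is ASSUMED to multiply x1 * x2.
        risk = 0.17 + 0.22 * x1 - 0.03 * x2 - 0.06 * x3 + 0.38 * x1 * x2
        y = 1 if rng.random() < risk else 0
        cohort.append({"z1": z1, "z2": z2, "x1": x1, "x2": x2, "x3": x3, "y": y})
    return cohort, rng

def assign_strata(cohort):
    """Six strata: z1 (0/1) crossed with tertiles of z2."""
    z2_sorted = sorted(s["z2"] for s in cohort)
    cut1 = z2_sorted[len(cohort) // 3]
    cut2 = z2_sorted[2 * len(cohort) // 3]
    for s in cohort:
        tertile = 0 if s["z2"] < cut1 else (1 if s["z2"] < cut2 else 2)
        s["stratum"] = (s["z1"], tertile)

def case_cohort_sample(cohort, n_sub, rng):
    """Draw a stratified subcohort, add all cases, and return design weights by index."""
    by_stratum = {}
    for i, s in enumerate(cohort):
        by_stratum.setdefault(s["stratum"], []).append(i)
    weights = {}
    for stratum, idx in by_stratum.items():
        sub = rng.sample(idx, min(n_sub[stratum[0]], len(idx)))
        noncases = [i for i in idx if cohort[i]["y"] == 0]
        sub_noncases = [i for i in sub if cohort[i]["y"] == 0]
        for i in idx:
            if cohort[i]["y"] == 1:
                weights[i] = 1.0          # all cases are sampled
        for i in sub_noncases:
            weights[i] = len(noncases) / len(sub_noncases)
    return weights

cohort, rng = simulate_cohort()
assign_strata(cohort)
design_w = case_cohort_sample(cohort, {0: 133, 1: 33}, rng)  # hypothetical sizes
```

With these weights, the phase-2 sample reproduces the phase-1 cohort size in each stratum, which is the starting point for the calibrated and estimated weights.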
The results of the simulation studies are presented in eTables 1 and 2. We assessed the mean, standard deviation (SD), root mean squared error (RMSE), empirical coverage probability (CP) for the 95% confidence intervals, and estimated relative efficiency (RE) of the IPW estimators compared with the entire cohort estimator for the 10,000 simulations. All of the IPW estimators could estimate the risk ratios and risk differences without bias. The relative efficiencies depended on the scenarios and regression coefficients, but the efficiencies of the IPW estimators with calibrated and estimated weights were generally higher than that of the IPW estimator with design weights. In particular, marked efficiency gains were obtained for the regression coefficients (β2 and δ12) that corresponded to covariates correlated with the phase-2 variable x1. The relative efficiencies for estimating β1 with the calibrated and estimated weights were also slightly improved compared with the IPW estimator with design weights. In addition, the 95% confidence intervals were validly constructed; the CPs of all of the proposed IPW methods were approximately 0.95.
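The summary measures above can be computed from simulation replicates along the following lines. This Python sketch is illustrative only: the function name `summarize` and the toy inputs are hypothetical, and RE is assumed here to be the ratio of the empirical variance of the entire cohort estimator to that of the IPW estimator, which the text does not spell out.

```python
import math
import statistics

def summarize(estimates, std_errors, truth, benchmark_estimates):
    """Mean, SD, RMSE, 95% CI coverage, and relative efficiency over replicates."""
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    # Wald-type 95% interval: estimate +/- 1.959964 * SE covers the truth?
    covered = [abs(e - truth) <= 1.959964 * se
               for e, se in zip(estimates, std_errors)]
    cp = sum(covered) / len(covered)
    # ASSUMED definition: variance of benchmark / variance of IPW estimator.
    re = statistics.variance(benchmark_estimates) / statistics.variance(estimates)
    return {"mean": mean, "sd": sd, "rmse": rmse, "cp": cp, "re": re}

# Toy replicates for a single coefficient whose true value is 1.0.
result = summarize(estimates=[0.9, 1.0, 1.1, 1.0],
                   std_errors=[0.1, 0.1, 0.1, 0.1],
                   truth=1.0,
                   benchmark_estimates=[0.95, 1.0, 1.05, 1.0])
```

Under this definition an RE close to 1 means that the phase-2 analysis loses little precision relative to an analysis with complete data on the whole cohort.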
eTable 1. Summary of the estimates of pseudo-Poisson regression parameters derived from 10,000 simulated datasets of a phase-2 cohort (N=500) sampled from a phase-1 cohort (N=4,000); the regression coefficients are interpreted as the log risk ratio
eTable 2. Summary of the estimates of pseudo-normal linear regression parameters derived from 10,000 simulated datasets of a phase-2 cohort (N=500) sampled from a phase-1 cohort (N=4,000); the regression coefficients are interpreted as the risk difference
CP, coverage probability for the 95% confidence intervals; RE, estimated relative efficiency of the estimator compared with the entire cohort; RMSE, root mean squared error of the estimates from the true regression parameter. Mean, SE: Mean and SD of the estimates in 10,000 simulations.