Variance Estimation for Logistic Regression in Case-cohort Studies

Background: The logistic regression analysis proposed by Schouten et al (Stat Med. 1993;12:1733–1745) has been a standard method in the statistical analysis of case-cohort studies. It enables effective estimation of risk ratios from selected subsamples, with adjustment for potential confounding factors. Schouten et al (1993) also proposed that the standard error of the risk ratio estimator be calculated using the robust variance estimator, and this method has been widely adopted.

Methods and Results: The robust variance estimator does not account for the duplications of case and subcohort samples and is generally biased (ie, inaccurate confidence intervals and P-values may be obtained). To address this invalid statistical inference problem, we provide an alternative, valid variance estimator based on the bootstrap. In simulation studies, the bootstrap method consistently provided more precise confidence intervals than the robust variance method, while retaining adequate coverage probabilities.

Conclusion: The robust variance estimator is biased, and inadequate conclusions might be drawn from the resultant statistical analyses. The proposed bootstrap variance estimator can provide more accurate and precise interval estimates, and it would be an effective alternative in practice for providing accurate evidence.


Introduction
The case-cohort design 1 has been widely used as an efficient study design to reduce costs and effort in clinical and epidemiologic studies. In statistical analyses of case-cohort studies, the logistic regression analysis proposed by Schouten et al. 2 has been one of the standard methods in current practice, and it enables risk ratios to be estimated effectively from the selected subsamples. A notable advantage of the method is that the computation can be easily implemented using statistical packages for ordinary logistic regression analysis (e.g., glm in R). In the logistic regression analysis, participants who appear in both the case and subcohort samples are formally regarded as different participants, and a logistic regression model is fitted to the selected subsamples as in ordinary case-control studies 3 :

logit Pr(D = 1 | x1, …, xp) = β0 + β1 x1 + … + βp xp,

where D is an indicator variable that is equal to 1 if a participant is in the case samples and 0 if he/she is in the subcohort samples, and x1, …, xp are the explanatory variables.
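As a small illustration of how duplicated participants enter the fitted dataset, the following Python sketch stacks the two samples. The function name `stack_case_cohort` and the toy covariate values are hypothetical and are not taken from Schouten et al. or from any package.

```python
def stack_case_cohort(case_rows, subcohort_rows):
    """Stack case and subcohort samples for an ordinary logistic regression.

    Each row is a tuple of covariates. A participant who is both a case and a
    subcohort member appears twice: once with D = 1 and once with D = 0.
    """
    stacked = [(1, x) for x in case_rows]         # D = 1: case samples
    stacked += [(0, x) for x in subcohort_rows]   # D = 0: subcohort samples
    return stacked


# Hypothetical toy data with one binary covariate per participant.
cases = [(1,), (1,), (0,)]
subcohort = [(0,), (1,), (0,), (1,)]  # suppose the last member is also a case
data = stack_case_cohort(cases, subcohort)
```

The stacked rows can then be passed to any ordinary logistic regression routine; the duplicated participant contributes two rows with opposite values of D.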
The formal maximum likelihood estimators of the regression coefficients, except for the intercept β0, have been shown to be consistent estimators of the log risk ratios in the target population 2 . Schouten et al. 2 noted that their standard errors can be estimated by the robust (sandwich) variance estimator. However, this variance estimation ignores the duplications of case and subcohort samples and simply applies the ordinary robust variance formulae to the pseudo-likelihood function. In the present article, we show that the variance estimator is biased because the duplications are not adequately accounted for and that the resultant confidence intervals are usually imprecise. We also provide an alternative valid variance estimator using the bootstrap and show its effectiveness via simulation studies.

Bias of the robust variance estimator
Consider a cohort that consists of N participants. We consider a case-cohort study that samples n1 case samples among the cases and n0 subcohort samples among the entire cohort. Schouten et al. 2 proposed that the risk ratio of the target population is consistently estimated by the ordinary logistic regression and that the standard errors of the regression coefficients are also consistently estimated by the robust variance estimator, because they considered that D is correlated for the duplicated samples. However, they did not confirm the validity of their approach via rigorous asymptotic theory or simulation studies. D is actually not correlated for the duplicated samples, because the subcohort sampling is determined independently of the case status. Thus, the model-based and robust variance estimators are asymptotically equivalent. Also, from the asymptotic theory of response-selective designs 4 , especially for the conventional case-control study 3,5 , the consistency of the Prentice-Pyke-type model variance estimator 3 is assured when all of the case and subcohort samples are unduplicated. However, duplications usually exist between the case and subcohort samples, and this assumption is generally violated. Thus, the robust variance estimator is biased because the duplications are not adequately accounted for. In the subsequent simulation studies, the robust variance estimator consistently overestimated the actual standard errors of the regression coefficients (Table 1).
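For orientation, the model-based and ordinary (Huber-White) sandwich variance estimators of a logistic regression can be written as follows. The notation is ours and is an assumption for illustration: the sums run over the n1 + n0 stacked rows treated as independent, the vector x_i includes a leading 1 for the intercept, and \hat{p}_i is the fitted probability. The text does not state whether a given software implementation uses exactly this form or a clustered variant.

```latex
\begin{align}
\hat{V}_{\mathrm{model}} &= A^{-1}, &
\hat{V}_{\mathrm{robust}} &= A^{-1} B A^{-1}, \\
A &= \sum_{i} \hat{p}_i (1 - \hat{p}_i)\, \mathbf{x}_i \mathbf{x}_i^{\top}, &
B &= \sum_{i} (D_i - \hat{p}_i)^2\, \mathbf{x}_i \mathbf{x}_i^{\top}.
\end{align}
```

When the rows are independent and the working model holds, A and B share the same limit after scaling by the sample size, which is why the two estimators agree asymptotically. Neither expression contains a term for the m duplicated participants.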

Alternative valid variance estimator using bootstrap
To construct unbiased variance estimators, the duplications should be adequately accounted for. Suppose that m duplicate samples exist between the case and subcohort samples. To account for these duplications, we formally regard the sampling scheme of a case-cohort study as conventional case-control sampling; Noma and Tanaka 6 showed that the sampling design of case-cohort studies is theoretically equivalent to that of case-control sampling. The duplicated samples are then regarded as samples selected randomly from the case samples.
To quantify the uncertainty adequately, an effective approach is to incorporate this sampling mechanism into a bootstrap. The bootstrap algorithm is given as follows.

Algorithm (bootstrap variance estimation)
1. Perform a bootstrap resampling from the n1 case samples.
2. Perform a bootstrap resampling from the n0 − m non-case samples in the subcohort.
3. Select m samples randomly from the n1 bootstrap samples of case samples, and add them to the subcohort (the duplicated samples).
4. Fit the logistic regression to the bootstrap samples generated by processes 1-3.
5. Repeat processes 1-4 a sufficient number of times to obtain bootstrap samples of the regression coefficient estimates. Then, compute the empirical variances of the bootstrap samples of the regression coefficients.
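The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration using only the standard library and is not the implementation in the bootcc package. The function names, the Newton-Raphson fitting routine, the toy data, and the choice to draw the m duplicates without replacement in step 3 are all assumptions made for the example.

```python
import math
import random
import statistics


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(d, xs, iters=15):
    """Fit logit Pr(D = 1) = b0 + b1 x1 + ... by Newton-Raphson; returns [b0, b1, ...]."""
    rows = [(1.0,) + tuple(x) for x in xs]
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for di, r in zip(d, rows):
            eta = sum(b * v for b, v in zip(beta, r))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            mu = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (di - mu) * r[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * r[j] * r[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


def bootstrap_se(cases, noncases, m, n_boot=200, seed=1):
    """Bootstrap standard errors of the coefficients, following steps 1-5.

    cases:    covariate tuples of the n1 case samples
    noncases: covariate tuples of the n0 - m non-case samples in the subcohort
    m:        number of duplicated samples
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        bc = rng.choices(cases, k=len(cases))        # step 1
        bn = rng.choices(noncases, k=len(noncases))  # step 2
        dup = rng.sample(bc, m)                      # step 3 (assumed without replacement)
        d = [1] * len(bc) + [0] * (len(bn) + m)      # D = 1 cases, D = 0 subcohort
        draws.append(fit_logistic(d, bc + bn + dup)) # step 4
    # step 5: empirical standard deviation of each coefficient
    return [statistics.stdev(b[j] for b in draws) for j in range(len(draws[0]))]


# Hypothetical toy data with one binary covariate.
cases = [(1,)] * 30 + [(0,)] * 20
noncases = [(1,)] * 40 + [(0,)] * 80
ses = bootstrap_se(cases, noncases, m=10, n_boot=100)
```

The toy call uses 100 resamples only to keep the example fast; the simulations in this article used 2000.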
Through the bootstrap algorithm, consistent standard error estimates can be obtained. Processes 2 and 3 are the steps that account for the duplications. An alternative, asymptotically equivalent resampling strategy is to resample separately from the n1 − m unduplicated case samples, the n0 − m non-case samples, and the m duplicated samples. We can also consider a naïve bootstrap strategy that substitutes processes 2 and 3 with process 2′:
2′. Perform a bootstrap resampling from the n0 samples in the subcohort.
This bootstrap resampling corresponds to the naïve bootstrap for ordinary case-control studies, and the duplications are not accounted for. The naïve bootstrap algorithm provides standard error estimates similar to those of Schouten et al.'s 2 robust variance estimator.
An R package bootcc (https://github.com/nomahi/bootcc) is available for implementing the proposed bootstrap inference method by simple commands.

Simulation studies
To assess the validity of the theoretical results, we carried out simulation studies. The simulation settings were based on the Wilms' tumor studies of Breslow et al. 7 For the event occurrence mechanism, we considered a binomial regression model with a log link. The cohort size N was set to 2000, 4000, and 10,000, and the subcohort size was set to 20% and 40% of N. All cases were sampled as case samples. The number of bootstrap resamples was set to 2000 throughout, and 10,000 simulations were performed for each scenario.
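A data-generating step of this kind can be sketched as follows. The single binary covariate, its prevalence of 0.5, and the coefficient values are hypothetical placeholders and are not the Wilms' tumor settings used in the article. The subcohort is drawn by simple random sampling without regard to case status, and all cases are taken as case samples.

```python
import math
import random


def simulate_case_cohort(n_cohort, subcohort_frac, beta0, beta1, seed=0):
    """Generate one cohort under a log-link binomial model and draw a case-cohort sample."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n_cohort):
        x = 1 if rng.random() < 0.5 else 0   # hypothetical binary exposure
        risk = math.exp(beta0 + beta1 * x)   # log link: log Pr(Y = 1 | x) = beta0 + beta1 * x
        y = 1 if rng.random() < risk else 0
        cohort.append((y, x))
    n0 = round(subcohort_frac * n_cohort)
    subcohort = set(rng.sample(range(n_cohort), n0))           # independent of case status
    cases = [i for i, (y, _) in enumerate(cohort) if y == 1]   # all cases are sampled
    m = sum(1 for i in cases if i in subcohort)                # duplicated participants
    return cohort, cases, sorted(subcohort), m


# Hypothetical values: baseline risk 0.1 and risk ratio 2.
cohort, cases, subcohort, m = simulate_case_cohort(2000, 0.2, math.log(0.1), math.log(2.0))
```

The coefficients must keep every risk at or below 1, which is a known practical constraint of the log link.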

Results
The simulation results are presented in Table 1. We assessed the means and standard deviations of the regression coefficient estimates across the 10,000 simulations. We also evaluated the means of the standard error estimates across the 10,000 simulations for the robust variance estimator, the naïve bootstrap variance estimator, and the proposed bootstrap variance estimator. In addition, we evaluated the empirical coverage probabilities of Wald-type 95% confidence intervals of the regression coefficients based on the three variance estimators.
The regression coefficients were estimated without bias using the logistic regression analysis method of Schouten et al. 2 However, the means of the standard error estimates obtained by the robust variance estimator were clearly biased relative to the actual standard errors in all scenarios, and the bias was in the direction of overestimation. Furthermore, the means of the standard error estimates obtained by the robust variance and the naïve bootstrap variance estimators were similar under all settings. These results indicate that the robust variance estimator did not account for the duplications of the samples. By contrast, the proposed bootstrap variance estimator estimated the actual standard errors without bias under all settings. The coverage rates of the 95% confidence intervals reflected these properties: the confidence intervals obtained by the robust variance and the naïve bootstrap variance estimators were generally too conservative, whereas the proposed bootstrap method provided more precise and valid interval estimates while retaining adequate coverage rates.
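The coverage evaluation described above can be illustrated with a short Python sketch. The function names and the toy numbers are hypothetical and are not results from Table 1.

```python
from statistics import NormalDist


def wald_ci(estimate, se, level=0.95):
    """Wald-type confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


def coverage(estimates, ses, truth, level=0.95):
    """Proportion of simulated intervals that contain the true coefficient."""
    hits = 0
    for b, s in zip(estimates, ses):
        lo, hi = wald_ci(b, s, level)
        hits += lo <= truth <= hi
    return hits / len(estimates)


# Hypothetical toy values, not taken from the simulations in this article.
estimates = [0.6, 0.8, 0.7, 1.2]
cov_small_se = coverage(estimates, [0.1] * 4, truth=0.7)   # three of four intervals cover 0.7
cov_large_se = coverage(estimates, [0.3] * 4, truth=0.7)   # wider intervals, all four cover 0.7
```

An overestimated standard error widens every interval, so the coverage rises above the nominal level at the cost of precision, which is the pattern described for the robust and naïve bootstrap estimators.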

Discussion and Conclusions
Logistic regression analysis has been a standard method for case-cohort studies, and the robust variance estimator has been used for these analyses. As shown in the present study, the robust variance estimator is biased, and inadequate conclusions might be drawn from the resultant statistical analyses. By contrast, the proposed bootstrap variance estimator was shown to be unbiased. The resultant confidence intervals are more precise, and more accurate interval estimates are generally obtained. Thus, the bootstrap method should be adopted in practice to provide accurate evidence.
Recently, alternative efficient inverse probability weighting methods have been established for case-cohort studies 6,8 . However, because of its simplicity and usefulness, the logistic regression analysis will continue to be used as a standard method for case-cohort studies. Our results enable more accurate and precise evaluations of effect measures in case-cohort studies and would facilitate the use of Schouten et al.'s 2 effective method in practice.

Table 1. Results of the simulations for the logistic regression analysis*.