Understanding Marginal Structural Models for Time-Varying Exposures: Pitfalls and Tips

Epidemiologists are increasingly encountering complex longitudinal data in which exposures and their confounders vary during follow-up. When a prior exposure affects the confounders of subsequent exposures, estimating the effects of the time-varying exposures requires special statistical techniques, possibly with structural (ie, counterfactual) models for the targeted effects, even if all confounders are accurately measured. Among the methods used to estimate such effects, which can be cast as a marginal structural model in a straightforward way, one popular approach is inverse probability weighting. Despite the seemingly intuitive theory and easy-to-implement software, misunderstandings (or "pitfalls") remain. For example, one may mistakenly equate marginal structural models with inverse probability weighting, failing to distinguish a marginal structural model encoding the causal parameters of interest from a nuisance model for exposure probability, and thereby failing to separate the problems of variable selection and model specification for these distinct models. Assuming the causal parameters of interest are identified given the study design and measurements, we provide a step-by-step illustration of generalized computation of standardization (called the g-formula) and inverse probability weighting, as well as the specification of marginal structural models, particularly for time-varying exposures. We use a novel hypothetical example that gives us access to the typically hidden potential outcomes. This illustration provides stepping stones (or "tips") toward a more concrete understanding of the estimation of the effects of complex time-varying exposures.

When we try to say something meaningful about a specific exposure-outcome causal relationship, counterfactual models are among the most popular and widely accepted approaches in the epidemiologic community. 1-6 A counterfactual approach not only formalizes the language of cause and effect, 7-13 but has also triggered the explosive development of novel analytic methods, including propensity scores (ie, the probability of exposure conditional on measured confounders) 14-19 and regression model-based estimation methods (ie, multivariable-adjusted outcome modeling, possibly followed by averaging predicted risks under distinct exposure statuses), 20,21 which have evolved into doubly robust estimation. 22-28 More importantly, a counterfactual approach has spurred extensive discussion on the assumptions for inferring causality from data and the conditions for specific statistical methods to work using, for example, causal diagrams. 2-4,6,29-35 Yet, the most striking illustration brought about by the counterfactual approach may be that it offers an elegant solution to the controversy surrounding the definition and estimability of the effects of exposures that vary over time. For example, antiretroviral therapy (exposure) initiated for acquired immunodeficiency syndrome may be interrupted upon the appearance of symptoms of pneumonia, which is a predictor of clinical outcomes (eg, death) but is affected by the prior exposure, and is thus considered a part of the exposure's effects. While no theory existing in the statistics literature at the time offered clear guidance on adjusting or not adjusting for such intermediate variables to estimate the effects of time-varying exposures, new causal methodologies emerged in the 1980s.
These include Robins' unified approach, which comprises the generalized computational algorithm formula (abbreviated as g-formula) and estimation methods (ie, inverse probability weighting and g-estimation) for two classes of counterfactual, or structural, models. 36-42 In 2000, marginal structural models were introduced as a tool to make the effects of such time-varying exposures easily estimable. 43-45 Specifically, a marginal structural model is an equation that states prespecified assumptions on the causal effects to be estimated (ie, causal estimands). Thanks to the series of seminal works by Robins and Hernán, 46-51 as well as others' tutorials on the topic with intuitive theory and easy-to-implement software, 52-59 marginal structural models have been widely applied to longitudinal data. Herein, we illustrate the use of marginal structural models, whose parameters can be estimated in a comparable way using inverse probability weighting and the g-formula in certain situations, featuring hypothetical data with a time-varying exposure to point out common pitfalls and to serve as a stepping stone toward better understanding the use of these methods.

CONCEPTUAL PITFALLS
If readers feel confused by the following statements, they may be trapped by the pitfalls around the methodology considered in this paper:

1. Marginal structural models should be distinguished from inverse probability weighting.
2. A marginal structural model is an equation stating prespecified assumptions on causal estimands, while an exposure probability model for inverse probability weighting is a restriction imposed on the observed distribution for estimation.
3. As a marginal structural model and an exposure probability model (for inverse probability weighting) are used for different purposes, misspecification of these models leads to biases in different ways.
4. Principles of variable selection for marginal structural models are distinct from those for exposure probability models, and thus their model specification raises different challenges.
5. Inverse probability weighting shares identifiability assumptions with the g-formula and can be used to fit marginal structural models whenever those assumptions are met, whereas the g-formula can be used to fit them only when the models are saturated.

Although some of these pitfalls have been appreciated previously, 57 we aim to discuss them from a different perspective. Before entering these subtleties, it is helpful to grasp the rationale of the specialized causal methods developed for time-varying exposures with simple worked examples, without relying on computerized packages. Unlike point-exposure settings, 1,6,60,61 however, we rarely encounter such pedagogic examples for time-varying exposures, including counterfactual data that explicate causal estimands and underlying conditions. Although there are at least four excellent numerical examples appropriate for exercise, they rely on either external causal knowledge (ie, causal diagrams without explicit estimands 51,59 or the "g-null" theorem implied by a causal diagram and observed data 6 ) or "true" parameters for simulated data. 54 In this paper, we provide a step-by-step illustration, or tips, using a novel, hypothetical numerical example dataset that includes potential outcomes and directly incorporates the minimal information needed to explicitly define causal estimands and the conditions for their identification.
One may consider that a causal diagram would be helpful to understand the structure of the dataset. As noted later, however, causal diagrams typically encode more causal assumptions than the conditions sufficient to identify causal effects. That is why we do not start by drawing causal diagrams and use them only complementarily in our illustration, despite the fact that they are indeed useful tools for explicating our assumptions in real data analysis. 35 The following "tips" emanate from two introductory subsections regarding the effects of point exposures and time-varying exposures. We then step into the main contents to understand the unique roles of, and distinctions between, inverse probability weighting, marginal structural models, and regression/exposure probability models.

TIPS TO UNDERSTAND WHAT, WHY, AND HOW OF MARGINAL STRUCTURAL MODELS
Prerequisite: identification of point-exposure effects

As many epidemiologists are familiar with the potential-outcome framework for a single time point, or a point-exposure setting, we only briefly review it here; readers unfamiliar with the basic concepts and notation may refer to Part 1 of Causal Inference: What If 6 or concise introductory papers. 61,62 Suppose that exposure Ai (eg, antihypertensive drug), outcome Yi (eg, the occurrence of cardiovascular disease), and a set of covariates Li (eg, current/prior health conditions, unhealthy behaviors, and social support) are observed for individuals i = 1,…, n. Let Yi^a denote the possibly unobserved, potential outcome that would be observed if, possibly counterfactually, exposure Ai were set to level a = 0 (unexposed) or 1 (exposed) (hereafter, we may omit subscript i if no confusion will occur). Then, the average causal effect of exposure A on outcome Y may be defined as E[Y^1] − E[Y^0], which compares the counterfactual expectations (or risks, for a binary outcome) of Y^1 and Y^0 in the same population on the difference scale.
Suppose a hypothetical cohort (Table 1) in which consistency (ie, Y = Y^a if A = a, for all a), positivity (ie, 0 < P(A = a|L) almost everywhere, for all a), and the following conditional exchangeability given covariates, say, L, hold. Table 1 also presents the observed distribution of (Li, Ai, Yi) in accordance with the potential outcomes Yi^a (a = 0, 1) under consistency. In Table 1, the potential risk under a = 0 in the exposed, E[Y^0|A = 1] = 213/460 = 0.463, is not equal to that in the unexposed, E[Y^0|A = 0] = 447/780 = 0.573, and the same is true for the potential risk under a = 1, E[Y^1|A]. When A is associated with Y^a in this way, marginal (unconditional) exchangeability is violated and the A-Y association (observed risk difference) is said to be confounded; here it is 4%, indicating an almost null association. Fortunately, within every stratum of L, we can verify from Table 1 ("Risk" columns) that E[Y^a|A = 0, L = l] = E[Y^a|A = 1, L = l] (a = 0, 1 and l = 0, 1), which is thus equal to E[Y^a|L = l]. This condition is called "conditional exchangeability given L," and the sets of covariates that satisfy the condition are said to be confounders. 6,36 Under conditional exchangeability given L, together with consistency and positivity, risks standardized over the distribution of L equal the counterfactual risks 6 ; that is, causal effects are identifiable. In our data, the standardized risks for A = 0 and A = 1 are 0.6(1,000/1,240) + 0.25(240/1,240) = 0.532 = E[Y^0] and 0.75(1,000/1,240) + 0.333(240/1,240) = 0.669 = E[Y^1], respectively. The next subsection extends the definitions of, and the conditions sufficient to identify, causal effects in time-varying settings. To focus on the complexity of conditional exchangeability in time-varying settings, we suppose throughout this paper that the consistency and positivity assumptions, as well as their time-varying versions, 51,63,64 are met in our data.
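The standardization arithmetic above can be reproduced in a few lines. As a minimal sketch (in Python, rather than the SAS/Stata code of the appendices), using only the stratum-specific risks and the distribution of L reported in the text:

```python
# Standardization (g-formula for a point exposure):
# E[Y^a] = sum_l E[Y | A=a, L=l] * P(L=l)
# Stratum-specific risks and stratum sizes are those reported in the text
# for the hypothetical cohort of Table 1.
risk = {  # E[Y | A=a, L=l]
    (0, 0): 0.60, (1, 0): 0.75,   # stratum L=0 (1,000 subjects)
    (0, 1): 0.25, (1, 1): 1 / 3,  # stratum L=1 (240 subjects)
}
p_l = {0: 1000 / 1240, 1: 240 / 1240}  # marginal distribution of L

def standardized_risk(a):
    return sum(risk[(a, l)] * p_l[l] for l in p_l)

print(round(standardized_risk(0), 3))  # 0.532 = E[Y^0]
print(round(standardized_risk(1), 3))  # 0.669 = E[Y^1]
```

Under conditional exchangeability given L, together with consistency and positivity, these two weighted averages are exactly the counterfactual risks E[Y^0] and E[Y^1].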
Definition and identification of effects of time-varying exposures

Targeted effects of time-varying exposure

If an exposure varies over time, the aforementioned definition of effects should be redefined. Consider a simple case with 2 time points. At time 1, baseline confounders L1i are measured and then exposure A1i is commenced; at time 2, confounder set L2i is measured and the exposure is changed to A2i; finally, outcome Yi is measured. Thus, the observed data are (L1i, A1i, L2i, A2i, Yi), for i = 1,…, n. Note that A1 and A2 may represent the same exposure (eg, start/stop antihypertensive drugs) or different exposures introduced sequentially (eg, first-line and second-line chemotherapy for cancer patients). Likewise, L1 and L2 may consist of the same set of variables or (partly or entirely) distinct sets of variables.
For a time-varying exposure, potential outcomes can be defined by the combination of interventions on the joint exposure (A1i, A2i): let Yi^{a1,a2} denote the potential outcome that would be observed if exposures A1i and A2i were set to levels a1 and a2, respectively. We assume that the exposure at each time takes on 0 (unexposed) or 1 (exposed), leading to 4 different potential outcomes, Yi^{0,0}, Yi^{0,1}, Yi^{1,0}, and Yi^{1,1}, for each individual i. The average causal effect of exposure on outcome may be defined as any contrast between the counterfactual expectations E[Y^{a1,a2}]; eg, E[Y^{1,1}] − E[Y^{0,0}]. We can also consider E[Y^{1,0}] − E[Y^{0,0}], which is referred to as the "controlled direct effect of A1 while A2 is set at 0." 65-67 Note that the joint exposure (A1, A2) can affect not only outcome Y, but also L2 (by A1), which is measured after exposure initiation. Under the implausible assumption of no effect of (any part of) the exposure on (any part of) the subsequent confounders, the effect of (A1, A2) can simply be seen as that of a multivalued exposure at a single time point; as shown earlier, Σ_{l1,l2} E[Y|A1 = a1, A2 = a2, L1 = l1, L2 = l2]P(L1 = l1, L2 = l2) is equal to E[Y^{a1,a2}] if the corresponding exchangeability assumptions for a point exposure hold. In the following hypothetical data, however, there is no single set of confounders for the joint effects of (A1, A2). Rather, L1 is a sufficient set of confounders for A1, and (L1, A1, L2) is a sufficient set of confounders for A2. This condition enables us to identify E[Y^{a1,a2}], but the usual standardization formula, Σ_{l1,l2} E[Y|A1 = a1, A2 = a2, L1 = l1, L2 = l2]P(L1 = l1, L2 = l2), leads to biased estimates unless the aforementioned implausible assumption of no effect of past exposures on time-varying confounders holds. 6,36,43,51

A hypothetical cohort

For simplicity, consider a hypothetical cohort with empty L1.
This situation would arise if A1 is randomized at baseline but non-adherence occurs or another exposure is introduced during the follow-up, or if the cohort is restricted based on measured variables L1. In either case, the following illustration is unaffected by including diverse values of L1, so let us ignore the adjustment for baseline confounders in our illustration. 6,51,54 Table 2 provides the data distribution of (A1i, L2i, A2i, Yi) augmented by the unobserved potential outcomes Yi^{a1,a2} (a1, a2 = 0, 1) in the hypothetical cohort. As in Table 1, the observed outcome Yi coincides with the Yi^{a1,a2} such that (A1i, A2i) = (a1, a2), by consistency. We want to identify from the observational data the four expectations E[Y^{a1,a2}] ("Total" row of "Risk" columns).
Instead of using P(L2 = l2) in the standardization formula, the "g-formula" in Table 3 averages the stratified risks E[Y|A1 = a1, L2 = l2, A2 = a2] over the conditional distribution P(L2 = l2|A1 = a1). Unlike the previous two naïve estimates, we can see that these values are equal to E[Y^{a1,a2}] in Table 2. As elaborated in the next subsection, the g-formula is one expression of E[Y^{a1,a2}] in terms of the observed distribution, under a condition that is different from unconditional/conditional exchangeability.
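To make the mechanics concrete, here is a sketch of the two-time-point g-formula in Python. The cell counts below are purely illustrative stand-ins (they are not the data of Table 2, which we cannot reproduce here); what matters is the computation: the stratum-specific risks are averaged over P(L2 = l2 | A1 = a1), not over the marginal P(L2 = l2):

```python
# Two-time-point g-formula with empty L1, on illustrative (invented) counts:
# E[Y^{a1,a2}] = sum_{l2} E[Y | A1=a1, L2=l2, A2=a2] * P(L2=l2 | A1=a1)
from collections import defaultdict

cells = {  # (a1, l2, a2): (number of subjects, number with Y=1) -- hypothetical
    (0, 0, 0): (4000, 800),  (0, 0, 1): (1000, 300),
    (0, 1, 0): (1000, 500),  (0, 1, 1): (2000, 1200),
    (1, 0, 0): (500, 150),   (1, 0, 1): (2500, 1000),
    (1, 1, 0): (1000, 600),  (1, 1, 1): (3000, 2100),
}

def risk(a1, l2, a2):
    n, y = cells[(a1, l2, a2)]
    return y / n

def g_formula(a1, a2):
    # average stratified risks over P(L2 = l2 | A1 = a1)
    n_l2 = defaultdict(int)
    for (b1, l2, _), (n, _) in cells.items():
        if b1 == a1:
            n_l2[l2] += n
    total = sum(n_l2.values())
    return sum(risk(a1, l2, a2) * n / total for l2, n in n_l2.items())

def naive_standardization(a1, a2):
    # the biased analogue that weights by the marginal P(L2 = l2) instead
    n_l2 = defaultdict(int)
    for (_, l2, _), (n, _) in cells.items():
        n_l2[l2] += n
    total = sum(n_l2.values())
    return sum(risk(a1, l2, a2) * n / total for l2, n in n_l2.items())
```

With these illustrative counts, g_formula(1, 1) works out to 4/7 ≈ 0.571, whereas naive_standardization(1, 1) gives 0.54; the two differ because A1 shifts the distribution of L2 in the invented data, mirroring the text's point about time-varying confounding.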

Conditions for identification of the effects
In Table 2 for the joint exposure, we can easily check the following conditions for all a1 and a2: (C1) E[Y^{a1,a2}|A1 = 1] = E[Y^{a1,a2}|A1 = 0] = E[Y^{a1,a2}], from the upper four rows vs the lower four rows, and (C2) E[Y^{a1,a2}|A1 = a1, L2, A2 = 1] = E[Y^{a1,a2}|A1 = a1, L2, A2 = 0], from every 2 rows within the same stratum of (A1, L2). These conditions are collectively called sequential exchangeability for (A1, A2), 6,36,51 which is typically easier to satisfy than joint conditional exchangeability but is neither a necessary nor a sufficient condition for it (see Appendix A for more technical notes on the conditions). The covariates that satisfy (C2) through their stratification (ie, L2 here) are called time-varying confounders. In fact, the slightly stronger condition (C2A) E[Y^{a1,a2}|A1, L2, A2 = 1] = E[Y^{a1,a2}|A1, L2, A2 = 0] (which requires conditional independence at all values of A1 instead of only at the A1 = a1 compatible with the intervention defining Y^{a1,a2}) also holds in our example, although this is not required for the g-formula to equal E[Y^{a1,a2}]. The g-formula equals E[Y^{a1,a2}] if sequential exchangeability (C1) and (C2) hold.
It is helpful to depict the conditions in causal diagrams, namely, causal directed acyclic graphs (DAGs) 2,29 and single-world intervention graphs (SWIGs) 31,32 ; we would like readers unfamiliar with this graphical terminology and these rules (eg, opening/blocking paths, d-separation, the backdoor criterion) to refer to introductory articles 30,32,35 or book chapters 6,34 on the topic. Informally, variables are d-separated if they are not connected with each other, or are connected only through paths on which at least one unadjusted "collider" or adjusted "non-collider" exists. If a supposed exposure is d-separated from a supposed outcome by adjusting for non-descendant variables of the exposure (in the original graph) after deletion of the arrows emanating from the exposure, then we say the backdoor criterion is satisfied. Figure 1, which is adapted from Part 3 of Causal Inference: What If, 6 presents causal diagrams that imply (C1) and (C2). Note that the typical strategy for causal inference in practice starts by drawing a causal DAG (eg, Figure 1(a)) or a SWIG (eg, Figure 1(c)) assumed for the data-generating process. Then, (conditional) independences between potential and observed variables, such as (C1) and (C2), are deduced from the graph. Here, we go backward; we start with counterfactual data (Table 2) in which (C1) and (C2) hold and proceed to causal DAGs/SWIGs that are compatible with those conditions.
In Figure 1(a), there is no non-descendant variable set that blocks all backdoor paths from the collective nodes (A1, A2) to Y (ie, satisfies the backdoor criterion). On the contrary, the backdoor paths to Y from A1 and A2 are separately blocked by distinct sets of variables: the empty set for A1 and (A1, L2) for A2. The arguments can be depicted more directly using potential variables in Figure 1(c), which is a "template" of the SWIG representing each intervention (a1, a2) on (A1, A2). 32 For example, A1 is d-separated from any of the potential variables, and A2^{a1} is d-separated from Y^{a1,a2} given L2^{a1}. After additionally conditioning on A1 = a1 (which is automatically done in the "template"), A2^{a1} = A2 (by consistency) is still d-separated from Y^{a1,a2} given L2^{a1} = L2 (by consistency) and A1 = a1; thus, (C1) and (C2) are satisfied in this SWIG. The same arguments apply to Figure 1(b) and (d), where A1-L2 is confounded (ie, connected by a backdoor path) by unobserved W. In other words, there are settings where the joint effects of (A1, A2) on Y can be identified (via sequential exchangeability) even if the effects of A1 on L2 are not identifiable (owing to the unobserved variable). Further implications of Figure 1 are detailed in Appendix B. The remainder of the paper does not require reference to causal diagrams.

Table 3. Estimates of effects of time-varying exposure from hypothetical cohort data. Data in the "Risk" and "p(L2)" columns in each L2 = l2 (0 or 1) row are used for E[Y|A1, A2, L2 = l2] and p(l2), respectively; p(l2|A1) is obtained as above, except that the probabilities in the "p(L2|A1)" column are used instead of "p(L2)" for the corresponding L2 and A1 values.

Different view of the g-formula: inverse probability weighting

We have seen that under sequential exchangeability (C1) and (C2), the g-formula is equivalent to the averages of potential outcomes. If baseline confounders L1 exist, the g-formula is

E[E[E[Y | L1, A1 = a1, L2, A2 = a2] | L1, A1 = a1]] = Σ_{l1,l2} E[Y | A1 = a1, A2 = a2, L1 = l1, L2 = l2]P(L2 = l2 | L1 = l1, A1 = a1)P(L1 = l1), (1)

which is equivalent to E[Y^{a1,a2}] if (C1) and (C2) hold after additionally conditioning on L1. The left-hand side of equation (1) is the iterated conditional expectation representation of the g-formula.
The alternative expression of E[Y^{a1,a2}] under (C1) and (C2) is inverse probability weighting 6,42,51,64 :

E[Y^{a1,a2}] = E[I(A1 = a1, A2 = a2)Y / {p(a1|L1)p(a2|L1, a1, L2)}], (2)

where I(A1i = a1, A2i = a2) is an indicator function that takes 1 if individual i has joint exposure level (a1, a2) and 0 otherwise, p(a1|l1) = P(A1 = a1|L1 = l1) is the conditional probability function of the first exposure having level a1, and p(a2|l1, a1, l2) = P(A2 = a2|L1 = l1, A1 = a1, L2 = l2) is the conditional probability function of the second exposure having level a2 given the past exposure and covariates. Accordingly, p(A1i|L1i) and p(A2i|L1i, A1i, L2i) in formula (2) are functions of individual data. These two expressions are equivalent forms of E[Y^{a1,a2}] under sequential exchangeability (C1) and (C2), as well as the time-varying versions of consistency and positivity. Despite the equivalence of these identification formulas, the estimator that plugs each estimate into (1) is called a g-formula estimator and that based on (2) an inverse probability weighted estimator. The arguments can be extended to "dynamic regimes" under stronger conditions (Appendix B). 51,64 Now, let us obtain inverse probability weighted estimates from Table 2. First, we collect the probability of the actually received exposure given the past exposure and covariates, separately for A1 and A2. As L1 is empty, p(A1i) and p(A2i|A1i, L2i) suffice to achieve sequential exchangeability; they are provided for each combination of (A1i, L2i, A2i) in Table 4. Next, we calculate the "inverse probability weights" 1/{p(A1i)p(A2i|A1i, L2i)} and multiply the numbers in each combination of (A1i, L2i, A2i) by the weights. Note that the sum of the weights I(A1 = a1, A2 = a2)/{p(A1)p(A2|A1, L2)} for each (a1, a2) equals the total sample size (ie, n = 15,000 in our data).
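The weighting steps just described can be sketched as follows. The cell counts are illustrative stand-ins, not those of the paper's Table 2 (they are merely chosen to total n = 15,000, as in the text). Because the exposure probabilities are "saturated" (taken directly from the cell proportions), the weighted risks coincide with the g-formula, and the weights within each regime sum to n:

```python
# Inverse probability weighting with saturated exposure probabilities,
# on illustrative (invented) counts totaling n = 15,000.
cells = {  # (a1, l2, a2): (number of subjects, number with Y=1) -- hypothetical
    (0, 0, 0): (4000, 800),  (0, 0, 1): (1000, 300),
    (0, 1, 0): (1000, 500),  (0, 1, 1): (2000, 1200),
    (1, 0, 0): (500, 150),   (1, 0, 1): (2500, 1000),
    (1, 1, 0): (1000, 600),  (1, 1, 1): (3000, 2100),
}
n_total = sum(n for n, _ in cells.values())  # 15,000

def p_a1(a1):
    # marginal probability of the first exposure (L1 is empty)
    return sum(n for (b1, _, _), (n, _) in cells.items() if b1 == a1) / n_total

def p_a2(a2, a1, l2):
    # conditional probability of the second exposure given (A1, L2)
    n_stratum = sum(n for (b1, m2, _), (n, _) in cells.items()
                    if (b1, m2) == (a1, l2))
    return cells[(a1, l2, a2)][0] / n_stratum

def ipw_risk(a1, a2):
    # weighted risk and total weight among those following regime (a1, a2)
    wsum = wy = 0.0
    for l2 in (0, 1):
        n, y = cells[(a1, l2, a2)]
        w = 1.0 / (p_a1(a1) * p_a2(a2, a1, l2))  # 1/{p(A1) p(A2|A1,L2)}
        wsum += w * n
        wy += w * y
    return wy / wsum, wsum
```

For every regime (a1, a2), the returned total weight equals n = 15,000, illustrating the weight-sum check mentioned in the text; with saturated exposure models, the weighted risks agree exactly with the g-formula values.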
Hence, formula (2) indicates that we only have to estimate the probability of Y = 1 for every combination of (a1, a2) in these multiplied numbers, or the inverse probability weighted population; for example, 9,300/(9,000 + 6,000) = 9,300/15,000 = 0.62. So far, each of the four counterfactual risks has been estimated separately. Alternatively, we may impose a model on them, for example,

E[Y^{a1,a2}] = β0 + β1 a1 + β2 a2. (3)

This is a correctly specified marginal structural model; in the data of Table 2, the left-hand side of (3) can take any of four values, but the right-hand side expresses them by only three parameters. Model (3) is marginal because the expectations are taken with the marginal distributions of Y^{a1,a2}, unconditional on other observed variables (though this condition is relaxed later) and on the potential outcomes other than that for (a1, a2) (thus, we need not consider any cross-world joint distributions under different interventions). 42,46 Model (3) is also structural because it imposes restrictions on the potential outcomes Y^{a1,a2} rather than on observed distributions. There are other possibilities for the specification of marginal structural models. For example, we can fit the simpler additive model

E[Y^{a1,a2}] = β0 + β1(a1 + a2), (4)

which has only two parameters, assuming that A1 and A2 have the same effect (risk difference) on Y, or a multiplicative marginal structural model

E[Y^{a1,a2}] = exp(β0 + β1 a1 + β2 a2), (5)

where exp(β1) and exp(β2) represent the (common) risk ratios E[Y^{1,a2}]/E[Y^{0,a2}] (a2 = 0, 1) and E[Y^{a1,1}]/E[Y^{a1,0}] (a1 = 0, 1), respectively. However, in our data these are incorrectly specified, or misspecified, marginal structural models because no parameter values (β0, β1) or (β0, β1, β2) in the right-hand sides of (4) and (5) can exactly express the left-hand sides. A marginal structural model on the multiplicative scale becomes correctly specified by making it saturated, for example, by including an interaction term of a1 and a2:

E[Y^{a1,a2}] = exp(β0 + β1 a1 + β2 a2 + β3 a1 a2). (6)

We estimate these marginal structural models through inverse probability weighting from the observed data in Table 3, where sequential exchangeability (C1) and (C2) holds.
Of course, models (4) and (5) are misspecified and necessarily result in biased estimates of E[Y^{a1,a2}]. Nevertheless, the estimates from misspecified marginal structural models may well approximate the true E[Y^{a1,a2}] unless the model forms differ greatly from the true relationship between E[Y^{a1,a2}] and (a1, a2). A typical estimation process is as follows: 1) calculate the inverse probability weight, 1/{p(A1i)p(A2i|A1i, L2i)}, for each variable pattern (A1i, L2i, A2i) as in Table 4; 2) fit the regression model for E[Y|A1 = a1, A2 = a2] with the same functional form as the marginal structural model; and 3) obtain confidence intervals by the sandwich estimator or the bootstrap. The SAS and Stata codes to create the dataset and replicate the results are provided in Appendix C and the Supplementary Material, respectively. Table 5 shows the parameter estimates of these models. The expectations E[Y^{a1,a2}] are also estimated by linear combinations of these estimates in the corresponding models; eg, E[Y^{0,0}] = β0 (models 3 and 4) or exp(β0) (models 5 and 6), and E[Y^{1,1}] = β0 + β1 + β2 (model 3), β0 + 2β1 (model 4), exp(β0 + β1 + β2) (model 5), or exp(β0 + β1 + β2 + β3) (model 6).
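Step 2 of this process (fitting a weighted regression with the same functional form as the marginal structural model) can be sketched in pure Python for a saturated linear MSM, E[Y^{a1,a2}] = b0 + b1·a1 + b2·a2 + b3·a1·a2, by solving the weighted least-squares normal equations directly. The counts are illustrative stand-ins, not the paper's Table 2:

```python
# Weighted least squares for a saturated linear MSM
#   E[Y^{a1,a2}] = b0 + b1*a1 + b2*a2 + b3*a1*a2,
# with inverse probability weights from saturated exposure models.
# Counts are illustrative (invented), not the paper's Table 2.
cells = {  # (a1, l2, a2): (number of subjects, number with Y=1)
    (0, 0, 0): (4000, 800),  (0, 0, 1): (1000, 300),
    (0, 1, 0): (1000, 500),  (0, 1, 1): (2000, 1200),
    (1, 0, 0): (500, 150),   (1, 0, 1): (2500, 1000),
    (1, 1, 0): (1000, 600),  (1, 1, 1): (3000, 2100),
}
n_total = sum(n for n, _ in cells.values())

def weight(a1, l2, a2):
    # 1 / {p(A1) * p(A2 | A1, L2)} with saturated (empirical) probabilities
    n_a1 = sum(n for (b1, _, _), (n, _) in cells.items() if b1 == a1)
    n_a1l2 = sum(n for (b1, m2, _), (n, _) in cells.items()
                 if (b1, m2) == (a1, l2))
    return 1.0 / ((n_a1 / n_total) * (cells[(a1, l2, a2)][0] / n_a1l2))

def solve(mat, vec):
    # Gaussian elimination with partial pivoting for a small dense system
    d = len(vec)
    m = [row[:] + [vec[i]] for i, row in enumerate(mat)]
    for k in range(d):
        piv = max(range(k, d), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, d):
            f = m[r][k] / m[k][k]
            for j in range(k, d + 1):
                m[r][j] -= f * m[k][j]
    x = [0.0] * d
    for k in range(d - 1, -1, -1):
        x[k] = (m[k][d] - sum(m[k][j] * x[j] for j in range(k + 1, d))) / m[k][k]
    return x

# Normal equations (X'WX) beta = X'Wy over the weighted population
xtwx = [[0.0] * 4 for _ in range(4)]
xtwy = [0.0] * 4
for (a1, l2, a2), (n, y) in cells.items():
    w = weight(a1, l2, a2)
    x = [1.0, a1, a2, a1 * a2]
    for i in range(4):
        xtwy[i] += w * y * x[i]
        for j in range(4):
            xtwx[i][j] += w * n * x[i] * x[j]
beta = solve(xtwx, xtwy)
```

Because the MSM is saturated, each fitted value b0 + b1·a1 + b2·a2 + b3·a1·a2 reproduces the corresponding inverse probability weighted risk; with these particular invented counts the fitted interaction b3 happens to be zero, so a three-parameter model of the form (3) would also be correctly specified here.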
Why do we need to model E[Y^{a1,a2}] at the risk of causing bias? Suppose the exposure can change at one additional time point. Without models, we need to estimate 2^3 = 8 (double our case) distinct E[Y^{a1,a2,a3}]. If we have six time points, the task requires 2^6 = 64 estimates from a limited amount of data. Furthermore, if we have a continuous exposure, we have to rely on dose-response curves irrespective of the number of exposure time points. Given that we always have a limited amount of data, our estimation tasks must rely on reducing the dimension of the parameter space by imposing restrictions on the possible values of the counterfactual outcome means. In Table 5, although both models (3) and (6) are correctly specified and unbiasedly estimated, the estimates of E[Y^{a1,a2}] from model (3) (three parameters) have slightly narrower confidence intervals than those from model (6) (four parameters). The efficiency gain owing to dimension reduction, modest here, will grow as the number of time points increases.
Note that models (3)-(6) do not require covariate information, though they can incorporate baseline confounders L1 to examine effect modification by certain variables on specific scales (eg, risk difference or ratio). 6,68 The convenient choice commonly seen in practice may be the simplest model assuming a common exposure effect across time and baseline confounder strata,

E[Y^{a1,a2} | L1] = β0 + β1(a1 + a2) + β2^T L1,

which imposes more restriction than the marginal structural model (4), which is agnostic about (ie, does not assume) no effect modification by L1. To assess effect modification by baseline confounders, the model can be modified as

E[Y^{a1,a2} | L1] = β0 + β1(a1 + a2) + β2^T L1 + β3^T (a1 + a2)L1,

though this is still generally stricter than model (4) because the effect of exposure is restricted to be linearly modified by L1.

Dealing with high-dimensional covariates
In our example, we have no baseline confounders and only one time-varying binary confounder variable L2, as well as two binary exposures A1 and A2. As a result, we can estimate all conditional expectations and conditional probabilities in g-formula (1) and inverse probability weighting (2) by direct calculation of the mean/proportion in each stratum; in other words, we used saturated regression and exposure probability models. In practice, however, we have many variables in L1 or L2, or both, some of which may follow continuous or multinomial distributions. In such cases, we must rely on models for the observed distribution of (L1, A1, L2, A2, Y). 4,20,69,70 For example, g-formula (1) can be estimated by fitting outcome and covariate regression models such as

E[Y | L1, A1, L2, A2] = θ0 + θ1^T L1 + θ2 A1 + θ3^T L2 + θ4 A2 and
E[L2k | L1, A1, L20,…, L2,k−1] = γk0 + γk1^T L1 + γk2 A1 + Σ_{j<k} γk3j L2j,

where L2k is the kth variable in the arbitrarily ordered L2 = (L21,…, L2K)^T with a constant L20. Note that, in general, we must numerically approximate the conditional distribution of L2k by simulating Monte-Carlo samples from the model fit, with conditional means following the above regression models (the parametric g-formula estimator). 38,47 Alternatively, we could iteratively model the left-hand side of g-formula (1) from the inside to the outside of the expectations by fitting outcome regression models to the predictions from the previous model fit (equivalent to the Q-learning estimator). 24,71 Inverse probability weighting formula (2) can also be estimated by, for example, logistic models for the exposure probabilities:

logit P(A1 = 1 | L1) = α10 + α11^T L1 and
logit P(A2 = 1 | L1, A1, L2) = α20 + α21^T L1 + α22 A1 + α23^T L2.

We then calculate the weighted mean using 1/{p(A1i|L1i)p(A2i|L1i, A1i, L2i)} formed from the predicted values of these models.
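A minimal sketch of the weighting approach: fitting a parametric exposure probability model of the form logit P(A2 = 1 | A1, L2) = α0 + α1·A1 + α2·L2 by Newton-Raphson on aggregated counts (illustrative, invented data, not the paper's Table 2), then turning the fitted probabilities into inverse probability weights. In a real analysis one would use standard logistic regression software (as in the SAS/Stata appendices); this hand-rolled fit only shows what such software computes:

```python
# Newton-Raphson fit of logit P(A2=1 | A1, L2) = a0 + a1*A1 + a2*L2
# on aggregated (illustrative) counts, then inverse probability weights.
import math

strata = {  # (a1, l2): (number of subjects, number with A2 = 1) -- hypothetical
    (0, 0): (5000, 1000), (0, 1): (3000, 2000),
    (1, 0): (3000, 2500), (1, 1): (4000, 3000),
}

def design(a1, l2):
    return [1.0, float(a1), float(l2)]  # covariate vector for (a0, a1, a2)

def solve(mat, vec):
    # Gaussian elimination for a small linear system mat * x = vec
    d = len(vec)
    m = [row[:] + [vec[i]] for i, row in enumerate(mat)]
    for k in range(d):
        piv = max(range(k, d), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, d):
            f = m[r][k] / m[k][k]
            for j in range(k, d + 1):
                m[r][j] -= f * m[k][j]
    x = [0.0] * d
    for k in range(d - 1, -1, -1):
        x[k] = (m[k][d] - sum(m[k][j] * x[j] for j in range(k + 1, d))) / m[k][k]
    return x

def fit_logistic(n_iter=25):
    alpha = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for (a1, l2), (n, n1) in strata.items():
            x = design(a1, l2)
            p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(alpha, x))))
            for i in range(3):
                grad[i] += (n1 - n * p) * x[i]                   # score
                for j in range(3):
                    hess[i][j] += n * p * (1 - p) * x[i] * x[j]  # information
        alpha = [a + s for a, s in zip(alpha, solve(hess, grad))]
    return alpha

alpha = fit_logistic()

def p_hat(a1, l2):
    x = design(a1, l2)
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(alpha, x))))

def weight_t2(a1, l2, a2):
    # time-2 factor of the inverse probability weight: 1/p(observed A2 | A1, L2)
    p = p_hat(a1, l2)
    return 1.0 / (p if a2 == 1 else 1.0 - p)
```

Because the model omits the a1·l2 interaction, it cannot match all four empirical proportions exactly in these invented counts; this is precisely the kind of exposure probability model misspecification that biases the estimates in Table 6 of the text.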
Note that neither of these approaches imposes any restriction on the values of E[Y^{a1,a2}]; we could use regression or exposure probability models without specifying marginal structural models, and vice versa (recall the calculation of Table 5). Marginal structural models are causal assumptions about the relationship between E[Y^{a1,a2}] and the hypothetical intervention (a1, a2); on the contrary, regression and exposure probability models are approximations of certain aspects of the observed distribution of (L1, A1, L2, A2, Y). In practice, however, we should rely on both marginal structural models and exposure probability models when using inverse probability weighting to estimate the effects of exposures with a moderate number of time points. 44-50 Table 6 shows the estimates of marginal structural models (3)-(6) using the fit of a misspecified exposure probability model: logit P(A2 = 1|A1 = a1, L2 = l2) = α0 + α1 a1 + α2 l2. As expected, all estimates of E[Y^{a1,a2}] are biased from Table 2 owing to the exposure probability model misspecification. Moreover, even for the correctly specified marginal structural models (3) and (6), the estimates diverge from each other when an exposure probability model is used to estimate the inverse probability weights. Similar to the dimension reduction via marginal structural models, we would expect a greater efficiency gain (ie, variance reduction) in inverse probability weighting estimators when high-dimensional confounders must be conditioned on to achieve sequential exchangeability.

Summary of pitfalls and tips
Our hypothetical dataset explicitly shows the estimands (ie, E[Y^{a1,a2}]) and minimally possesses the counterfactual conditions (ie, sequential exchangeability) needed to estimate counterfactual means under joint intervention on the time-varying exposure (A1, A2). We have hitherto illustrated the tips (Box) for a formal understanding of marginal structural modeling and its estimation through inverse probability weighting (pitfall 1), as well as the required causal assumptions on unobservable data. Models are used to account for the "curse of dimensionality." On one hand, marginal structural models reduce the dimension of the counterfactual outcome means under a huge number of combinations of time-varying exposures. On the other hand, exposure probability models must be adopted in practice to account for the large numbers of baseline and time-varying confounders, which usually have no implications for marginal structural modeling (pitfalls 2 and 4). We also showed, separately, the biases arising from misspecification of exposure probability models and from misspecification of marginal structural models (pitfall 3). Note that while inverse probability weighting and the g-formula are both applicable to estimate marginal counterfactual means (ie, saturated marginal structural models), only the former can estimate general, unsaturated marginal structural models (pitfall 5). Although running into these pitfalls may not necessarily lead to large biases in practical analyses, failure to recognize these subtleties would encourage unprincipled and suboptimal strategies for causal inference.

Table 5. Inverse probability weighted estimates of marginal structural models from observed hypothetical cohort data (Table 3): risk differences β (MSMs (3) and (4)) or risk ratios exp(β) (MSMs (5) and (6)).
Shinozaki T, et al.
We conclude this section by emphasizing two of the pitfalls. First, variable selection and model specification are generally different tasks in modeling for causal inference. For inverse probability weighting, exposure probability models should select confounders whose stratification is sufficient to achieve sequential exchangeability. In our example, all analyses, with or without an exposure probability model, include the full set of confounder(s), L2. Even if the models include all confounders, however, they may be misspecified, as in the analysis in Table 6. The same is true of the regression models for the g-formula. On the contrary, it is unnecessary for marginal structural models to include confounders; only covariates (which need not be confounders but should be conditioned on in the propensity score 19 ) that may modify the exposure effect of interest may be included in marginal structural models. 6,68 Second, doubly robust estimators can alleviate the bias from misspecification of regression and exposure probability models, 22-28 but not the bias owing to misspecification of marginal structural models or of other causal models (not introduced in this paper). For example, Table 6 provides the biased estimates using a misspecified exposure probability model for correct/incorrect marginal structural models. Among them, the bias in the estimates of the correct marginal structural models (3) and (6) would be mitigated by doubly robust methods, ie, by including outcome regression models via the iterative model-fitting algorithm of Bang and Robins, 24 while the fitting of the incorrect marginal structural models (4) and (5) must result in biased estimates. Hence, even with doubly robust methods, careful consideration of marginal structural models is needed, especially for long-term follow-up studies with many time points at which the exposure can change.
Marginal structural models for dynamic regimes may also have to rely on strong modeling assumptions, 51,64,[72][73][74] even when the exposure is binary and changes at only several time points.

FUTURE DIRECTIONS
There is another method, besides the g-formula and inverse probability weighting, that requires essentially the same assumptions to estimate causal effects of time-varying exposures: g-estimation. 15,18,[38][39][40][41]51,67 Analogous to the relation between marginal structural modeling and inverse probability weighting, g-estimation is a method to estimate the parameters of structural nested models. Structural nested models and g-estimation indeed have attractive statistical properties (eg, robustness, efficiency, and flexible parameterization), which work successfully within Robins' causal "interventionism" framework with minimal conditions. 31,41,46,63,75 Despite its theoretical superiority, g-estimation has been underused in the epidemiologic literature, probably because of the complexity of the background theory and the interpretability of the parameters. 75 However, structural nested models are especially useful for dynamic regimes of time-varying exposures because they can model effect modification by time-varying covariates, 38,41,51 which cannot generally be included in marginal structural models. 46,68

Box.
• An MSM encodes prespecified assumptions on causal estimands, whereas an exposure probability model is a restriction imposed on the observed distribution
• As MSMs and exposure probability models are used for different purposes, misspecification of these models leads to bias in different ways
• Specifying MSMs and specifying exposure probability models raise different challenges in real data analysis
• The g-formula, which shares its identifiability assumptions with inverse probability weighting, can be used to fit MSMs only when the models are saturated

Besides the conceptual pitfalls considered in this paper, there are important pitfalls regarding the specification and estimation of marginal structural models, which often lead to mistakes in practice:
• One should always use independence working correlations in marginal structural models of repeated-measures outcomes. 47,76,77
• If "stabilized" weights include covariates in the numerator of the weights, 43 those covariates should be conditioned on in the marginal structural models. 50
• "Stabilization" of the weights is not always acceptable (eg, in dynamic-regime marginal structural models [72][73][74] ).
• It is always important to check the fit of the exposure probability models (eg, by checking calibration or model-diagnostic measures 78 and weight distributions 50 ) and of the marginal structural models (eg, by comparing the estimating equation-based quasi-likelihood information criterion with that of less restricted models 79 or by testing the equivalence of the asymptotic values of parameter estimates obtained through different weighting options 80 ).
There are other practical concerns in real data analysis. For example, many follow-up studies compare time-to-event outcomes, which complicates the modeling and estimation of the effects of time-varying exposures. In these settings, time-dependent Cox models or risk-set switching Kaplan-Meier estimators would need unrealistic assumptions to yield causally interpretable estimates. 43,81 In addition, censoring of the events must be handled with care, for example, by constructing inverse probability of censoring weights to prevent attrition bias. 44,45,51 Note that the idea of inverse probability of censoring weights appears in diverse fields of causal inference; eg, adjustment for treatment discontinuation in clinical trials, 82,83 estimation of the effects of dynamic regimes, 72 and estimation of the effects of treatment duration on survival. 84
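As a supplement to the advice above on checking weight distributions, a small pure-Python sketch (on an invented dataset; the names A1, L2, A2 mirror the example) verifies two algebraic facts that underlie the usual diagnostics: with saturated exposure probability models, the unstabilized weights average to the number of exposure histories (here 2 × 2 = 4), and the "stabilized" weights average to exactly 1, so a sample mean of the stabilized weights far from 1 signals a problem with the weight model.

```python
# Invented cohort collapsed into cells. Each tuple: (a1, l2, a2, n_subjects).
cells = [
    (0, 0, 0, 30), (0, 0, 1, 20), (0, 1, 0, 10), (0, 1, 1, 10),
    (1, 0, 0, 10), (1, 0, 1, 10), (1, 1, 0, 30), (1, 1, 1, 40),
]
N = sum(n for *_, n in cells)

def n_of(cond):
    """Number of subjects in the cells satisfying cond(a1, l2, a2)."""
    return sum(n for (a1, l2, a2, n) in cells if cond(a1, l2, a2))

mean_w = mean_sw = 0.0
for (a1, l2, a2, n) in cells:
    p1 = n_of(lambda A1, L2, A2: A1 == a1) / N
    p2 = (n_of(lambda A1, L2, A2: (A1, L2, A2) == (a1, l2, a2))
          / n_of(lambda A1, L2, A2: (A1, L2) == (a1, l2)))
    # Numerator of the "stabilized" weights: P(A1) * P(A2 | A1).
    p2_marg = (n_of(lambda A1, L2, A2: (A1, A2) == (a1, a2))
               / n_of(lambda A1, L2, A2: A1 == a1))
    w = 1.0 / (p1 * p2)              # unstabilized inverse probability weight
    sw = (p1 * p2_marg) / (p1 * p2)  # stabilized inverse probability weight
    mean_w += n * w / N
    mean_sw += n * sw / N
```

With nonparametric (saturated) exposure models these means are exact; with parametric models they hold only approximately, which is precisely what makes them useful as a check.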

APPENDIX A. EXCHANGEABILITY CONDITIONS FOR IDENTIFYING THE EFFECTS OF TIME-VARYING EXPOSURES
As shown in Figure 1, sequential exchangeability ((C1) and (C2)) is more likely to hold in practice than conditional exchangeability for the joint exposure (A1, A2), that is, E[Y^{a1,a2} | A1, L2, A2] = E[Y^{a1,a2} | L2]; conditional exchangeability for the joint exposure is not a necessary condition for sequential exchangeability, which would be intuitively understandable to many readers. Mathematically, however, conditional exchangeability for the joint exposure is not a sufficient condition for sequential exchangeability, either. Nevertheless, these conditions are closely related to each other in other realistic situations, as shown subsequently.
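The contrast can be written out explicitly. As a sketch, assuming the mean-exchangeability forms below for (C1) and (C2) (their exact definitions are given in the main text, which is not reproduced here):

```latex
\begin{align*}
\text{(C1):}\quad & E[Y^{a_1,a_2} \mid A_1] = E[Y^{a_1,a_2}],\\
\text{(C2):}\quad & E[Y^{a_1,a_2} \mid A_1, L_2, A_2] = E[Y^{a_1,a_2} \mid A_1, L_2],\\
\text{joint:}\quad & E[Y^{a_1,a_2} \mid A_1, L_2, A_2] = E[Y^{a_1,a_2} \mid L_2].
\end{align*}
```

Under these forms, the joint condition implies (C2), because its right-hand side is free of both A1 and A2, but it does not imply (C1): E[Y^{a1,a2} | A1] averages E[Y^{a1,a2} | L2] over the distribution of L2 given A1, which can still depend on A1 when A1 affects L2.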

APPENDIX B. INDEPENDENCE ASSUMPTIONS ENCODED IN CAUSAL DIAGRAMS AND IDENTIFIABILITY OF GENERAL INTERVENTION REGIMES
Sequential exchangeability ((C1) and (C2)) is insufficient for identifying the effects of more general exposure interventions (also known as dynamic regimes or strategies) that may depend on (time-varying) covariates, say, (L1, L2). That identification is built on identification of the distribution f(Y^{a1,a2}, L2^{a1}), or more generally, f(Y^g, L2^g) under the intervention g = (g1(L1), g2(L1, A1, L2)), where gk(·) corresponds to the intervention on Ak, possibly depending on past A and L values (rather than a prespecified value such as ak). Hence, we need additional assumptions to identify the effects of a dynamic regime g; one sufficient condition is

(Y^{a1,a2}, L2^{a1}) ⊥⊥ A1 and Y^{a1,a2} ⊥⊥ A2 | A1 = a1, L1,   (C3)
where Z1 ⊥⊥ Z2 | Z3 denotes statistical independence between Z1 and Z2 conditional on Z3. 51 However, our example is compatible both with (C3) and with settings in which E[L2^{a1} | A1] ≠ E[L2^{a1}]; in other words, the data are agnostic about condition (C3). Thus, the data in Table 2 themselves are not sufficient for the validity of the g-formula for the effects of a general regime g.
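The role of (C3) can be made concrete with the g-formula for a dynamic regime. A minimal pure-Python sketch (all risks and counts are invented; the regime "treat at time 2 only if L2 = 1" is a hypothetical example) computes E[Y^{a1,g2}] as the sum over l2 of E[Y | a1, l2, g2(l2)] × P(l2 | a1); the computation reduces to the static g-formula when g2 is constant, but it is a valid causal quantity only under conditions such as (C3).

```python
# g-formula for a dynamic regime (a1, g2(L2)); its causal validity rests on
# (C3)-type conditions, not on (C1) and (C2) alone. All numbers are invented.
# Each tuple: (a1, l2, a2, risk, n) with risk = E[Y | A1=a1, L2=l2, A2=a2].
cells = [
    (0, 0, 0, 0.20, 30), (0, 0, 1, 0.15, 20),
    (0, 1, 0, 0.50, 10), (0, 1, 1, 0.30, 10),
    (1, 0, 0, 0.25, 10), (1, 0, 1, 0.20, 10),
    (1, 1, 0, 0.40, 30), (1, 1, 1, 0.25, 40),
]

def p_l2(l2, a1):
    """P(L2 = l2 | A1 = a1) estimated from the cell counts."""
    num = sum(n for (A1, L2, _, _, n) in cells if (A1, L2) == (a1, l2))
    den = sum(n for (A1, _, _, _, n) in cells if A1 == a1)
    return num / den

def risk(a1, l2, a2):
    """E[Y | A1 = a1, L2 = l2, A2 = a2], read off the table."""
    return next(r for (A1, L2, A2, r, _) in cells
                if (A1, L2, A2) == (a1, l2, a2))

def gformula_regime(a1, g2):
    """g-formula for E[Y^{a1, g2}]: sum over l2 of risk * P(l2 | a1)."""
    return sum(risk(a1, l2, g2(l2)) * p_l2(l2, a1) for l2 in (0, 1))

static_11 = gformula_regime(1, lambda l2: 1)           # always treat at time 2
dynamic = gformula_regime(1, lambda l2: int(l2 == 1))  # treat only if L2 = 1
```

The point of Appendix B is that the same arithmetic can be carried out on any dataset, but only a (C3)-type condition, read off a causal diagram, licenses interpreting `dynamic` as the counterfactual mean under the covariate-dependent regime.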
By contrast, causal diagrams would indicate whether condition (C3) holds and whether the assumptions encoded in the diagrams allow f(Y^g, L2^g) to be identified. From the SWIG of Figure 1(c), we can read off the independences (Y^{a1,a2}, L2^{a1}) ⊥⊥ A1 and Y^{a1,a2} ⊥⊥ A2^{a1} | A1 = a1, L1, which imply (C3) by consistency under conditioning on A1 = a1 (ie, the "world" represented by the SWIG) in the second condition. Thus, the corresponding causal DAG of Figure 1(a) allows us to identify E[Y^g]. However, we cannot deduce (C3) from Figure 1(d) owing to a d-connected path between L2^{a1} and A1; hence, under the corresponding causal DAG of Figure 1(b), E[Y^g] for a general regime g cannot be identified, even though E[Y^{a1,a2}] for the non-dynamic exposure intervention (a1, a2) is identified, as illustrated in the main text. That is, Figure 1 gives examples of causal diagrams that are compatible with our example data but on which stronger causal assumptions are implicitly imposed. As we have documented earlier, 35 causal diagrams (when tied to underlying causal models) often represent a "finer" description of causal assumptions than counterfactual notation.
The difficulty in identifying E[Y^g] with g = (g1, g2) = (g1(L1), g2(L1, A1, L2)) is directly depicted in Appendix Figure 1(a) and (c), where L1 is suppressed for simplicity although it can affect any variable in the graphs. The causal DAG of Appendix Figure 1(a) is the same as Figure 1(b), but the corresponding SWIGs differ. The structural distinction is the presence (Appendix Figure 1(c)) or absence (Figure 1(d)) of an arrow from L2 to the hypothetical intervention (g2 or a2, according to whether the intervention depends on covariates). We can easily see that Y^g = Y^{g1,g2} and A1 are dependent either with or without conditioning on L2^{g1} = L2 in Appendix Figure 1(c), which suggests that E[Y^g] is not identifiable without referring to condition (C3).
Finally, we show a slightly modified causal DAG of Figure 1(b) in Appendix Figure 1(b), in which L2, which is affected by A1, also affects Y. The corresponding SWIG, Appendix Figure 1(d), reveals that Y^{a1,a2} is d-connected with A1 either with or without conditioning on L2^{a1} = L2; hence, the effects of both the dynamic regime g and the non-dynamic exposure intervention (a1, a2) are unidentifiable if the association between the exposure and its effect lying on a path to the outcome is confounded by unobservables. Of course, our example data in Table 2 are incompatible with Appendix Figure 1(b).

SAS code for fitting the exposure probability models and calculating the inverse probability weights (Tables 5 and 6):

proc logistic data = MSM desc;
  model A1 = ;
  output out = MSM p = P1;
run;
proc logistic data = MSM desc;
  /* Use either of the following two commands */
  model A2 = A1 L2 A1*L2; * Fitting correct exposure probability model for Table 5;
  model A2 = A1 L2;       * Fitting misspecified exposure probability model for Table 6;
  output out = MSM p = P2;
run;
* Calculate inverse probability weights;
data MSM;
  set MSM;
  IPW = (A1/P1 + (1 - A1)/(1 - P1)) * (A2/P2 + (1 - A2)/(1 - P2));
run;

Appendix Figure 1. Causal DAGs and SWIGs for dynamic regimes with and without identifiability conditions: (a) causal DAG identical to Figure 1(b); (b) causal DAG with an arrow from L2 to Y, in which L2 is affected by A1 and the A1-L2 association is confounded by unobserved W; (c) a "template" under the intervention g = (g1, g2) = (g1(L1), g2(L1, A1, L2)) of the SWIG that corresponds to causal DAG (a); (d) a "template" under the intervention (a1, a2) of the SWIG that corresponds to causal DAG (b).