Multiple imputation: how many imputations are needed?
Multiple imputation is increasingly advocated as a way to deal with missing data because of its improved performance over alternative approaches [1-4]. However, multiple imputation is still rarely used in epidemiology [2], perhaps in part because relatively little practical guidance is available for implementing and evaluating multiple imputation techniques.

This is particularly true for large data sets, such as are common in epidemiology. This paper presents a case study of imputation of data from the national evaluation of the Community Mental Health Services for Children and Their Families Program. Multiple imputation is a powerful and flexible technique for dealing with missing data. Conceived by Rubin [5] and described further by Little and Rubin [6] and Schafer [7], multiple imputation imputes each missing value multiple times.

Inferences using the multiply imputed data thus account for the missing data and the uncertainty in the imputations. Multiple imputation is relatively easy to implement and is appropriate for a wide range of data sets. Two other general techniques for dealing with missing data are single imputation methods—such as mean imputation—and maximum likelihood approaches, which directly estimate parameters by accounting for the missingness.

Multiple imputation offers distinct advantages over each of these alternatives. Single imputation fills in each missing value only once and then treats the imputed values as if they were known; it thus does not account for the uncertainty in the missing values and so underestimates the variance of estimates, leading to inflated type I error rates. Maximum likelihood techniques can be difficult to implement for complex or nonstandard models, although special-purpose software is making them more feasible [1].

In contrast, multiple imputation techniques provide valid variance estimates and are easy to implement and describe. Although multiple imputation is often used by individual researchers performing imputation for a particular analysis, this paper focuses on situations in which a large data set will be used by multiple researchers, for diverse purposes.

For example, a research group may want to create one large, multiply imputed data set rather than have each researcher deal with the missing data individually, and perhaps differently [8, 9].

This paper builds on previous papers that provide an overview of multiple imputation [3, 10, 11] by focusing on the complications encountered in implementing multiple imputation for large data sets that will be used for a range of analyses.

Since its inception, the CMHI has funded communities to develop service systems that provide a comprehensive spectrum of mental health and support services to US children with mental illness. A national evaluation is currently collecting descriptive data on all children referred to the CMHI. More detailed data are gathered from youth and families who agree to participate in a longitudinal outcomes study. These data potentially provide a wealth of information on children's mental health services, with longitudinal data on more than 9,000 children across 45 sites.

However, substantial data are also missing. In this paper, we focus on the baseline data, which include a large number of variables measuring, among other things, behavior problems, family resources, and service receipt. Many of the end users of the CMHI data are trained in psychology and other applied fields and likely have limited experience with methods for missing data.

The goal of the imputation process described here was to create a general use, multiply imputed data set that a broad range of researchers could use to answer their research questions, similar to a situation that may be encountered by research groups that aim to create imputed data sets for all of their members to use.

Nearly all studies have some missing data; the question is, how much of a problem is it? The answer depends on the mechanism that caused the missingness. Missing completely at random (MCAR) means that the missingness is unrelated to the variables under study: the missingness is purely random, and the individuals with missing data are a simple random sample of the full sample. Missing at random (MAR) means that the probability of an observation being missing may depend on observed values but not on unobserved values.

Finally, not missing at random (NMAR) means that the probability of missingness depends on both observed and unobserved values. Most commonly, missing data are assumed to be MAR. The MCAR assumption is generally unrealistic, and it can often be refuted from the data: if the missingness is related to any of the observed characteristics, the data are not MCAR. In these cases, MAR, while empirically unverifiable, is often a reasonable assumption to make unless substantive knowledge about the data or data collection process indicates that the missingness may depend on unobserved values.

In that case, an NMAR model should be posited. Understanding when and why variables are missing is crucial. This step can be accomplished by calculating the rate of missingness for each variable as well as by examining the patterns of missingness. Finally, the observed characteristics of individuals with observed and missing values for key variables of interest should be compared.
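The diagnostics just described can be sketched in a few lines. The simulation below is entirely hypothetical (the variable names, sample size, and missingness mechanism are invented for illustration): missingness in y is made to depend on the observed x (a MAR mechanism), and the complete cases are then compared with the full sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: x is always observed; y is subject to missingness.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# MAR mechanism: y is missing more often when the observed x is large.
missing = rng.random(n) < 1 / (1 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)

# Step 1: rate of missingness for the variable.
print(f"missing rate: {np.isnan(y_obs).mean():.0%}")

# Step 2: compare an observed characteristic (x) between individuals with
# observed vs. missing y. A clear difference rules out MCAR.
print(f"mean x, y observed: {x[~missing].mean():+.2f}")
print(f"mean x, y missing:  {x[missing].mean():+.2f}")
```

Under MCAR the two group means would agree up to sampling error; the gap seen here is exactly the kind of evidence that, as in Table 1 for the CMHI data, rules out MCAR.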

The missingness is scattered, with no clear pattern, except that variability is large across sites. Sites were responsible for their own data collection, and this variability likely reflects variation in the time and resources available as well as possibly differences in the populations served. In the CMHI data, the missingness on key variables is related to a number of observed characteristics of the children, confirming that the data are not MCAR. For example, children with a missing value on the internalizing problems scale were more likely to be eligible for Medicaid and to have conduct disorder and were less likely to be white or Hispanic (Table 1).

Unfortunately, there is no way to empirically confirm that the data are not NMAR; however, for our purposes, we are comfortable assuming MAR here.

Multiple imputation has emerged as an appropriate and flexible way of handling missing data. Complete-case methods, which simply discard observations with any missing data, generally make the usually unrealistic assumption that the data are MCAR, or at least MAR within categories defined by the variables included in the analysis model. Multiple imputation methods work by imputing, or filling in, the missing values with reasonable predictions multiple times.

The analysis is then run separately on each data set, and the results are combined across data sets by using the multiple imputation combining rules [5]. The resulting estimates account for both within- and between-imputation uncertainty, reflecting the fact that the imputed values are not the known true values. Doing so results in correct standard error estimates and coverage rates, as compared with single imputation methods or simply including a missing-data indicator for each variable in the model [17]. The original approaches to creating multiple imputations generally assumed a large, joint model for all of the variables, for example, multivariate normality [6, 7].
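The combining rules can be sketched concretely. The helper below (the function name and the numeric inputs are hypothetical) pools a point estimate across M imputed data sets: the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance, which is what inflates the standard error relative to any single imputed data set.

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine point estimates and squared SEs from M imputed data sets
    using the multiple imputation combining rules:
    total variance = within + (1 + 1/M) * between."""
    q = np.asarray(estimates, dtype=float)  # Q_m: estimate from each data set
    u = np.asarray(variances, dtype=float)  # U_m: squared SE from each data set
    m = len(q)
    q_bar = q.mean()                        # pooled point estimate
    w = u.mean()                            # within-imputation variance
    b = q.var(ddof=1)                       # between-imputation variance
    t = w + (1 + 1 / m) * b                 # total variance
    return q_bar, np.sqrt(t)

# Hypothetical regression coefficient estimated on 5 imputed data sets,
# each analysis reporting a squared SE of 0.01.
est, se = rubin_combine([0.52, 0.48, 0.55, 0.50, 0.45],
                        [0.01, 0.01, 0.01, 0.01, 0.01])
print(est, se)
```

Note that the pooled SE exceeds the single-data-set SE of 0.1, because the between-imputation spread of the five estimates is added in.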

More recently, a more flexible method called multiple imputation by chained equations (MICE) has been developed. MICE cycles through the variables, modeling each one conditional on the others. The imputations themselves are predicted values from these regression models, with the appropriate random error included.

The procedure is as follows: first, the variable with the least missingness (variable 1) is imputed conditional on all variables with no missingness. The variable with the second least missingness is then imputed conditional on the variables with no missing values and variable 1, and so on.

This process is then repeated using this data set with no missing values; Raghunathan et al. suggest running 10 such iterations. The idea is that, at the end of 10 iterations, the imputations should have stabilized such that the order in which variables were imputed no longer matters. The imputed values at the end of the 10th iteration, combined with the observed data, constitute one imputed data set.
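The chained-equations cycle above can be illustrated with a minimal two-variable sketch. Everything here is an illustrative assumption, not the CMHI implementation: the data are simulated, both conditional models are simple linear regressions with Gaussian residual draws, and crude mean fills provide starting values (real implementations also order variables by missingness, as described above).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated data: x fully observed; y and z each missing for some rows.
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)
z = x - y + rng.normal(scale=0.5, size=n)
y_miss = rng.random(n) < 1 / (1 + np.exp(-x))  # y missing more when x is large (MAR)
z_miss = rng.random(n) < 0.2
y_obs, z_obs = y.copy(), z.copy()
y_obs[y_miss] = np.nan
z_obs[z_miss] = np.nan

def impute_one(target, predictors, miss, rng):
    """Fill the missing entries of `target` with draws from a linear model
    fitted to the complete cases: predicted value plus a random residual."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    beta, *_ = np.linalg.lstsq(X[~miss], target[~miss], rcond=None)
    resid_sd = np.std(target[~miss] - X[~miss] @ beta)
    target[miss] = X[miss] @ beta + rng.normal(scale=resid_sd, size=miss.sum())
    return target

# Start from crude mean fills, then cycle: impute y given (x, z), then z
# given (x, y), for 10 iterations, re-drawing the missing entries each time.
y_imp = np.where(y_miss, np.nanmean(y_obs), y_obs)
z_imp = np.where(z_miss, np.nanmean(z_obs), z_obs)
for _ in range(10):
    y_imp = impute_one(y_imp, [x, z_imp], y_miss, rng)
    z_imp = impute_one(z_imp, [x, y_imp], z_miss, rng)

# The result is one imputed data set; repeating the whole procedure with
# different random draws yields the M data sets needed for the analysis.
print(round(np.corrcoef(x, y_imp)[0, 1], 2))
```

Because the imputed values include a residual draw rather than just the regression prediction, the imputed data preserve the variability of the observed data instead of clustering on the regression line.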

A strength of MICE is that each variable can be modeled by using a model tailored to its distribution, such as Poisson, logistic, or Gaussian. MICE can also incorporate additional data challenges, such as bounds or variables defined for only a subset of the sample.

For the CMHI, we created 10 imputed data sets. Although MICE is very useful in practice, it does lack the theoretical justification of some other imputation approaches. In particular, a drawback of MICE is that fitting the series of conditional models does not necessarily imply a proper joint distribution, which could lead to inconsistencies across models, where, for example, the model for variable 2 given variable 1 may not be consistent with the model for variable 1 given variable 2.

Initial research has indicated that this drawback is not generally an issue in applied problems [3, 24], but this is an area of ongoing statistical research. Another drawback, discussed further below, is the need to include many interactions to preserve associations in the data; for example, to preserve all 3-way interactions, all of the 2-way interactions must be included in all regression models, which is often not feasible.

A number of complications are encountered when actually implementing MICE, particularly with large data sets. These complications include model selection and computing limitations.

Ideally, the model for each variable to be imputed should fit the data well and be as general as possible, in the sense of including as many predictors and interactions as possible, as discussed above. In practice, this step is sometimes difficult to accomplish. One strategy is to use stepwise selection to choose the model for each variable at each iteration. This process will include in the regression models those variables most predictive of the variable being imputed, for example, a fixed number of variables (those most predictive, or those leading to some minimum additional R-squared value). The exact model for each variable may change across iterations but should stabilize as the imputations themselves stabilize.
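As a rough sketch of this per-variable selection step, the helper below (a hypothetical name, on simulated data) picks the k predictors most correlated with the variable being imputed among the complete cases. This is a cheap stand-in for full stepwise selection, but it shows the "fixed number of most predictive variables" idea.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 8

# Hypothetical predictor matrix; the target truly depends on columns 0 and 3.
X = rng.normal(size=(n, p))
target = 2 * X[:, 0] - X[:, 3] + rng.normal(size=n)
miss = rng.random(n) < 0.25  # rows where the target is missing

def top_k_predictors(X, target, miss, k=3):
    """Pick the k predictors most correlated (in absolute value) with the
    target among complete cases -- a cheap stand-in for stepwise selection
    run at each MICE iteration."""
    cors = [abs(np.corrcoef(X[~miss, j], target[~miss])[0, 1])
            for j in range(X.shape[1])]
    return sorted(np.argsort(cors)[-k:])

print(top_k_predictors(X, target, miss))  # should include columns 0 and 3
```

In a full MICE run this selection would be repeated for each variable at each iteration, so the chosen predictor set can change until the imputations stabilize.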

An important consideration in implementing MICE procedures is that the model used to create the imputations should be more general than the analysis model in terms of including all interactions that will be examined in the analyses [1, 8].

This step will prevent the analyses from missing associations that actually exist. For example, if there is particular interest in the relation between gender and internalizing symptoms, then that relation should be included in the imputation model. In contrast, if the variables are treated as independent in the imputation model, any association between them will be attenuated in the imputed data.

This is not a large issue for bivariate associations, because variables that are associated in the observed data will be selected by the stepwise selection procedures. However, crucial 3-way interactions (e.g., of race and gender with mental health need) must be included in the imputation models explicitly.

Because of computational limitations, it was not possible to include a large number of interactions in the CMHI imputation models. There is particular interest in disparities in care and in interactions of race and gender with mental health needs and services.

References

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8(3).

Harel, O. (2007). Inferences on missing information under multiple imputation and two-stage multiple imputation. Statistical Methodology, 4(1).

Royston, P., Carlin, J. B., & White, I. R. (2009). Multiple imputation of missing values: New features for mim. Stata Journal, 9(2).

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4).

How many imputations do you need?

October 30. By Paul von Hippel.

When using multiple imputation, you may wonder how many imputations you need.

So you often need more imputations to get replicable SE estimates. But how many more? Read on.

A New Formula

I recently published a new formula (von Hippel) that estimates how many imputations M you need for replicable SE estimates. The key input is the fraction of missing information (FMI). The FMI is not the fraction of values that are missing; it is the fraction by which the squared SE would shrink if the data were complete. Standard MI software gives you an estimate of the FMI, but that estimate is itself imprecise. For that reason, I recommend a two-step recipe (von Hippel): first, carry out a pilot analysis.

Impute the data using a convenient number of imputations and estimate the FMI, along with a confidence interval for it. The upper limit of the confidence interval is then plugged into the formula to give the number of imputations for the final analysis. White, Royston, and Wood take a quote from von Hippel as a rule of thumb: the number of imputations should be similar to the percentage of cases that are incomplete. This rule applies to fractions of missing information of up to 0.5.

White, Royston, and Wood suggest these criteria provide an adequate level of reproducibility in practice.
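For concreteness, here is a small calculator based on one published form of von Hippel's quadratic rule, M ≈ 1 + ½(FMI/cv)², where cv is the target coefficient of variation of the SE estimate. The function name and the default cv of 0.05 (SEs varying by about 5% across repeated imputation runs) are illustrative choices, not taken from the text above.

```python
import math

def imputations_needed(fmi, cv=0.05):
    """Number of imputations M for replicable SE estimates, using the
    quadratic rule M ~= 1 + (1/2) * (FMI / cv)**2, where cv is the
    acceptable coefficient of variation of the SE estimate."""
    return math.ceil(1 + 0.5 * (fmi / cv) ** 2)

# Two-step recipe: estimate the FMI (with a confidence interval) from a
# pilot run, then plug the CI's upper limit into the rule.
print(imputations_needed(0.3))  # moderate missing information
print(imputations_needed(0.5))  # heavier missing information
```

Because M grows with the square of the FMI, data sets with substantial missing information can require far more imputations than the handful that older rules of thumb suggested.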


