Should I use the imputed values in the HILDA Survey data?

Some variables in the HILDA Survey data are imputed when respondents do not provide complete responses. Many income variables, for example, contain imputed values. The HILDA Survey team tells users which values are imputed: “household financial year disposable total income” (_hifditp/_hifditn), for instance, has an imputation flag, _hifditf. Across the first 16 waves of the HILDA Survey, about 25% of the household-level values for this variable are imputed. The share is this high because the variable is the sum of many income components, and a single missing component is enough to make the overall total missing.
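To see how quickly missing components add up, here is a minimal arithmetic sketch. It assumes, purely for illustration, that each component is missing independently and with the same probability; the function name and the numbers (10 components, 3% missing each) are hypothetical and are not HILDA figures.

```python
def share_total_missing(p_component: float, n_components: int) -> float:
    """Probability that a summed total is missing when it is missing
    as soon as any one of its components is missing.

    Assumes components go missing independently with equal probability,
    which is a simplification made only for this illustration.
    """
    return 1 - (1 - p_component) ** n_components


# Hypothetical inputs: 10 income components, each missing 3% of the time.
share = share_total_missing(0.03, 10)
print(f"Share of totals needing imputation: {share:.1%}")  # roughly 26%
```

Even a low item non-response rate per component can therefore leave a sizeable share of totals requiring imputation.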

A user might be tempted to discard these observations because they do not contain actual responses from participants. Users might also worry that their analysis is affected by imputed values, which are the product of a model chosen by the HILDA Survey team.

First, users should know that most imputation is relatively innocuous. Often a respondent has left a single item blank in one year, and a good guess for that item is fairly easy to work out from the same respondent’s answers in other years and from the answers of other respondents.

More importantly, discarding imputed observations will bias estimates far more than including the imputed values will. While the imputation procedure may introduce some errors, the errors introduced by excluding observations with imputed values will be much larger. The reason is “selection bias”: observations for which values have been imputed are systematically different from those for which no imputation was needed. By excluding those observations, users risk introducing large amounts of selection bias into their estimates.
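The selection-bias argument can be illustrated with a small simulation. This is a sketch on made-up data, not HILDA data: it assumes that higher-income households are more likely to have missing income (40% versus 10%) and that imputed values are noisy but right on average. All of these numbers are hypothetical.

```python
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical "true" incomes; in real data these are unknown for non-respondents.
true_income = [rng.lognormvariate(11, 0.6) for _ in range(20000)]
cutoff = median(true_income)

complete_cases = []  # what remains if imputed observations are dropped
with_imputed = []    # what remains if imputed values are kept

for value in true_income:
    # Assumption: missingness is more likely above the median income.
    p_missing = 0.4 if value > cutoff else 0.1
    if rng.random() < p_missing:
        # Assumption: the imputed value is unbiased but has 20% noise.
        with_imputed.append(value * (1 + rng.gauss(0, 0.2)))
    else:
        complete_cases.append(value)
        with_imputed.append(value)

true_mean = mean(true_income)
bias_dropping = mean(complete_cases) - true_mean
bias_keeping = mean(with_imputed) - true_mean

print(f"Bias in mean income, imputed cases dropped: {bias_dropping / true_mean:+.1%}")
print(f"Bias in mean income, imputed cases kept:    {bias_keeping / true_mean:+.1%}")
```

Under these assumptions, dropping the imputed cases understates mean income noticeably, while keeping them leaves the mean close to the truth despite the imputation noise. The size of the gap depends entirely on the assumed missingness pattern and imputation quality.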

There is widespread agreement in the empirical social science literature and in the statistics literature that it is far better to keep the observations with imputed values. It is also advisable to include the imputation indicator (dummy) variable as an explanatory variable in your regression. For example, if you are estimating a model with “household financial year disposable total income”, you should include the _hifditf indicator as an additional explanatory variable. This will help to “soak up” any errors that may have been introduced by the imputation process. (See Frick and Grabka (2007), “Item non-response and Imputation of Annual Labor Income in Panel Surveys from a Cross-National Perspective”, DIW Discussion Paper 736.)
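As a rough sketch of what including the indicator looks like, the example below fits a regression with and without an imputation dummy on simulated data. Everything here is an assumption made for illustration: the data are invented, the flag is taken to be coded 0/1 (check the coding of _hifditf in your release), the imputation error is a stylised constant overstatement of 5 units, and the small ols helper is a hypothetical stand-in for whatever regression routine you normally use.

```python
import random


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


rng = random.Random(0)
rows_with_flag, rows_no_flag, outcome = [], [], []
for _ in range(5000):
    true_income = rng.gauss(50, 10)
    flag = 1.0 if rng.random() < 0.25 else 0.0  # 1 = value was imputed
    # Stylised assumption: imputed incomes are overstated by 5 units.
    observed_income = true_income + 5 * flag
    # The outcome depends on true income with a slope of 0.5.
    outcome.append(2 + 0.5 * true_income + rng.gauss(0, 1))
    rows_with_flag.append([1.0, observed_income, flag])
    rows_no_flag.append([1.0, observed_income])

_, slope_with_flag, flag_coef = ols(rows_with_flag, outcome)
_, slope_no_flag = ols(rows_no_flag, outcome)

print(f"Income slope, flag included: {slope_with_flag:.3f}")
print(f"Income slope, flag omitted:  {slope_no_flag:.3f}")
print(f"Coefficient on the flag:     {flag_coef:.3f}")
```

In this stylised setup the dummy absorbs the systematic shift in the imputed values, so the income slope stays close to 0.5. A dummy of this kind captures level differences between imputed and non-imputed observations; it does not remove purely random noise in the imputed values.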

This answer was provided by Prof Robert Breunig, Australian National University.