Frequently Asked Questions
Analysing HILDA Survey data
How do I navigate through all of the documentation supplied with HILDA?
The HILDA Survey User Manual is the best place to start, as it answers many common questions about using the data.
The Manual covers a variety of topics, including:
- Missing data conventions
- Derived variables
- Matching wave data files to create longitudinal files
- Income and expenditure imputation
- Industry and occupation variables
- Australian and derived international coding schemes
- Using weights, and
- Data quality issues.
The User Manual also provides an overview of the data documentation and a summary of the design and data collection procedures used by the HILDA Survey.
What is the Documentation zip file?
The Documentation zip file contains:
- Subject level coding frameworks (in PDF format) for all variables in the household, enumerated person, responding person and combined files
- A cross-wave index (a brief summary of the Subject level coding framework)
- Coding frameworks for the longitudinal weights and master files
- Marked-up questionnaires, and
- Showcards (in PDF format) showing the associated variable names excluding derived and history variables.
Frequencies for each wave are also provided in a separate file. String variables (IDs and timestamps) are usually excluded from these frequencies.
To quickly locate variable names, the HILDA Survey User Manual should be used in conjunction with the cross-wave index, which is searchable by question number, keyword or variable name (excluding the first character wave identifier).
The cross-wave index indicates the availability of a particular variable in each wave, as well as the source questionnaire (or history, for derived variables).
What is the distinction between “employee of own business” and “employer/self-employed”?
The HILDA Survey generally adopts the standard labour market variables defined by the Australian Bureau of Statistics (ABS).
However, we are not comfortable with the ABS’ definition of the self-employed.
The ABS defines an employee as
a person who works for a public or private employer and receives remuneration in wages, salary, a retainer fee from their employer while working on a commission basis, tips, piece-rates, or payment in kind; or a person who operates his or her own incorporated enterprise with or without hiring employees.
In other words, their definition of employee includes owner managers who operate their own incorporated businesses (i.e., they are treated as “employees of their own business”).
In contrast, a person who operates their own unincorporated business is treated as an “own account worker” (i.e., they are self-employed).
We believe this distinction is misleading for many research purposes, so in our data releases we provide all of the necessary information for researchers to construct their own definition of employees and self-employed.
If you wish to adopt the ABS definition of “employee”, you should take the variable _esempst and combine the two groups “employee” (1) and “employee of own business” (2).
Alternatively, you can use the variable _es, which is a derived variable that reproduces the ABS definition of employment status.
Whether you combine “employee of own business” and “employer/self-employed” into one group depends on your research question.
If you wish to conform to ABS definitions, you should combine “employee” and “employee of own business”.
In Mark Wooden’s own research of labour market behaviour, for example, he almost always discards the ABS definition and combines “employee of own business” with the “employer/self-employed” group.
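The two groupings described above can be sketched as a simple recode. This is only an illustration: codes 1 and 2 come from the text above, while the code 3 used here for “employer/self-employed” and the function names are assumptions that should be checked against the coding framework.

```python
# Codes 1 and 2 are taken from the FAQ text; code 3 is an ASSUMPTION.
EMPLOYEE = 1
EMPLOYEE_OWN_BUSINESS = 2
EMPLOYER_SELF_EMPLOYED = 3  # assumed code, check the coding framework


def abs_style_status(esempst):
    """ABS-style grouping: owner managers of incorporated businesses count as employees."""
    if esempst in (EMPLOYEE, EMPLOYEE_OWN_BUSINESS):
        return "employee"
    if esempst == EMPLOYER_SELF_EMPLOYED:
        return "self-employed"
    return None  # other or missing codes are left unclassified


def alternative_status(esempst):
    """Alternative grouping: 'employee of own business' is treated as self-employed."""
    if esempst == EMPLOYEE:
        return "employee"
    if esempst in (EMPLOYEE_OWN_BUSINESS, EMPLOYER_SELF_EMPLOYED):
        return "self-employed"
    return None


codes = [1, 2, 3]
abs_groups = [abs_style_status(c) for c in codes]
alt_groups = [alternative_status(c) for c in codes]
```

Only the treatment of code 2 differs between the two functions, which is the whole of the distinction discussed above.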
Which weight should I use?
Weights are used to make inferences from the sample to the population.
The weight you use depends on the question you are answering. The HILDA Survey User Manual provides some guidance on which weight to use in which circumstances.
Should I weight an unbalanced panel?
Maybe. When you construct an unbalanced panel of responding persons, you take all of the responding persons from each wave and stack them into a long file that has one record per person per wave.
The weight that could be used to weight this sample is the cross-sectional responding person weight from each wave. That is, in their Wave 1 observation, the person would be weighted by their Wave 1 cross-sectional responding person weight, their Wave 2 observation would be weighted by their Wave 2 cross-sectional responding person weight, and so on.
Similarly, if you are constructing an unbalanced panel of enumerated persons, then you could use the cross-sectional enumerated person weight.
If you pool, say, five waves of data together, the sum of the weights will be around 100 million (that is, five times the average population size between 2001 and 2005). Therefore, you may wish to rescale the weights by dividing each weight by the number of waves you have included in the unbalanced panel, so that they sum to roughly the average population size.
The decision to weight the sample in this way depends on the type of analysis you are undertaking on the unbalanced panel. For example:
- If your analysis is of uncommon events and you are effectively taking a pooled sample, then the weighting strategy suggested above should be fine.
- Alternatively, if your analysis requires at least two observations on the same individual, then you will be dropping those people who are only interviewed once. The cross-sectional weights, therefore, will not be appropriate (nor will the longitudinal weights).
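The stacking and rescaling described above might look like the following sketch. The record layout, the key name "weight" (standing in for each wave's cross-sectional responding person weight) and all values are made up for illustration.

```python
def stack_waves(waves):
    """Stack per-wave records into one long list, one record per person per wave.

    `waves` maps a wave number to a list of dicts that hold that wave's
    cross-sectional weight under the hypothetical key "weight".
    """
    long_file = []
    for wave, records in sorted(waves.items()):
        for rec in records:
            long_file.append(
                {"xwaveid": rec["xwaveid"], "wave": wave, "weight": rec["weight"]}
            )
    return long_file


def rescale_weights(long_file):
    """Divide every weight by the number of waves pooled."""
    n_waves = len({rec["wave"] for rec in long_file})
    return [dict(rec, weight=rec["weight"] / n_waves) for rec in long_file]


# Made-up identifiers and weights, for illustration only.
waves = {
    1: [{"xwaveid": "0100001", "weight": 1000.0}, {"xwaveid": "0100002", "weight": 3000.0}],
    2: [{"xwaveid": "0100001", "weight": 1500.0}, {"xwaveid": "0100003", "weight": 2500.0}],
}
pooled = rescale_weights(stack_waves(waves))
total_weight = sum(rec["weight"] for rec in pooled)
```

After rescaling, the weights sum to the average of the per-wave totals rather than to their sum.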
What weight should I use if I pool a sample across waves?
When you are analysing an uncommon event (for example, divorce), you can pool the sample across waves. This sample, however, is subject to attrition that is not random, so it needs to be weighted.
If you have pooled responding persons across waves, you should use the cross-sectional responding person weight for the wave from which the case has been contributed.
How do I match people across waves?
Use the cross-wave identifier xwaveid to match people across waves.
How do I match people within households?
People within the same household have the same household identifier, _hhrhid*.
The household identifier will change from wave to wave. Use a person's cross-wave identifier xwaveid to match them over time.
* Replace the underscore with the appropriate letter for the wave, where ‘a’ corresponds to Wave 1, ‘b’ corresponds to Wave 2, and so on.
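As a rough sketch of both kinds of matching, the example below builds the wave letter, groups people by household within a wave, and links two waves on xwaveid. It assumes the household identifier is _hhrhid (as named elsewhere in this FAQ) and that each wave is held as a list of dictionaries. The function names and all identifier values are hypothetical.

```python
from collections import defaultdict


def wave_prefix(wave):
    """Return the letter that replaces the underscore: 1 -> 'a', 2 -> 'b', and so on."""
    return chr(ord("a") + wave - 1)


def household_members(records, wave):
    """Group xwaveid values by the wave-specific household identifier."""
    hh_var = wave_prefix(wave) + "hhrhid"
    members = defaultdict(list)
    for rec in records:
        members[rec[hh_var]].append(rec["xwaveid"])
    return dict(members)


def link_across_waves(first_wave, second_wave):
    """Merge two waves on xwaveid, keeping only people present in both."""
    second_by_id = {rec["xwaveid"]: rec for rec in second_wave}
    return [
        {**rec, **second_by_id[rec["xwaveid"]]}
        for rec in first_wave
        if rec["xwaveid"] in second_by_id
    ]


# Made-up identifiers, for illustration only.
wave1 = [
    {"xwaveid": "0100001", "ahhrhid": "10001"},
    {"xwaveid": "0100002", "ahhrhid": "10001"},
    {"xwaveid": "0100003", "ahhrhid": "10002"},
]
wave2 = [
    {"xwaveid": "0100001", "bhhrhid": "20007"},
    {"xwaveid": "0100003", "bhhrhid": "20007"},
]
households = household_members(wave1, 1)
linked = link_across_waves(wave1, wave2)
```

In this made-up example the two linked people share a household in Wave 2 but not in Wave 1, which is why the household identifier cannot be used to follow people over time.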
How do I match couples together?
People who are married or in a de facto relationship can be matched to their partner via:
- _hhpxid, the partner’s cross-wave identifier, or
- _hhprtrid, the partner’s two-digit person number, which can be appended to the end of the household identifier _hhrhid to create the partner’s identifier for that wave.
A partner identifier is only available for partners living in the same household. Same-sex couples also have a partner identifier.
Note: Replace the underscore with the appropriate letter for the wave, where ‘a’ corresponds to Wave 1, ‘b’ corresponds to Wave 2, and so on.
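A minimal sketch of matching via the partner’s cross-wave identifier follows. It assumes one wave is held as a list of dictionaries and that a person without a co-resident partner has either None or a value that matches no xwaveid in that field. The function name and data are hypothetical.

```python
def match_partners(records, wave_letter):
    """Return each couple once, as a sorted pair of xwaveid values."""
    partner_var = wave_letter + "hhpxid"
    by_id = {rec["xwaveid"]: rec for rec in records}
    couples = []
    seen = set()
    for rec in records:
        partner_id = rec.get(partner_var)
        if not partner_id or partner_id not in by_id:
            continue  # no co-resident partner recorded for this person
        pair = tuple(sorted((rec["xwaveid"], partner_id)))
        if pair not in seen:
            seen.add(pair)
            couples.append(pair)
    return couples


# Made-up identifiers, for illustration only.
wave1 = [
    {"xwaveid": "0100001", "ahhpxid": "0100002"},
    {"xwaveid": "0100002", "ahhpxid": "0100001"},
    {"xwaveid": "0100003", "ahhpxid": None},
]
couples = match_partners(wave1, "a")
```

Because both partners point at each other, the `seen` set keeps each couple from being listed twice.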
How do I match children to their parents?
A child can be matched to their mother or father via:
- _hhfxid, the cross-wave identifiers for mothers and fathers; or
- _hhfid, the two-digit person number for mothers and fathers, which can be appended to the end of the household identifier _hhrhid to create the mother’s and father’s identifiers for a particular wave; or
- _hhfxid, the cross-wave identifiers for biological mothers and fathers.
Note that the first four identifiers for mothers and fathers are only available for people who live in the same household as their parent or parents. These identifiers include step and de facto parents. The last two identifiers are for the biological parents and require the parent to have lived in the same household as the child in at least one wave. The identifier is then distributed to every wave the child is in.
Note: Replace the underscore with the appropriate letter for the wave, where ‘a’ corresponds to Wave 1, ‘b’ corresponds to Wave 2, and so on.
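The person-number route can be sketched as below, using the father’s person number _hhfid as the example. Several things here are assumptions rather than documented behaviour: that household identifiers are held as strings, that person numbers are zero-padded to two digits, that each person’s own number is in _hhpno, and that a missing father is coded as None or a non-positive number.

```python
def wave_person_id(hhrhid, person_number):
    """Append a two-digit person number to the household identifier."""
    return f"{hhrhid}{int(person_number):02d}"


def match_fathers(records, wave_letter):
    """Map each child's xwaveid to the xwaveid of their co-resident father, if any."""
    hh_var = wave_letter + "hhrhid"
    pno_var = wave_letter + "hhpno"
    father_var = wave_letter + "hhfid"
    # Index everyone in the wave by their constructed wave-specific identifier.
    by_wave_id = {
        wave_person_id(rec[hh_var], rec[pno_var]): rec["xwaveid"] for rec in records
    }
    matches = {}
    for rec in records:
        father_no = rec.get(father_var)
        if father_no is None or int(father_no) <= 0:
            continue  # no father in the household (missing-value coding is assumed)
        father_id = wave_person_id(rec[hh_var], father_no)
        if father_id in by_wave_id:
            matches[rec["xwaveid"]] = by_wave_id[father_id]
    return matches


# Made-up identifiers, for illustration only.
wave1 = [
    {"xwaveid": "0100001", "ahhrhid": "10001", "ahhpno": 1, "ahhfid": None},
    {"xwaveid": "0100002", "ahhrhid": "10001", "ahhpno": 2, "ahhfid": None},
    {"xwaveid": "0100004", "ahhrhid": "10001", "ahhpno": 3, "ahhfid": 1},
]
fathers = match_fathers(wave1, "a")
```

Where a cross-wave parent identifier is available, matching on it directly (as in the xwaveid examples) avoids having to construct the wave-specific identifier at all.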
Why do some respondents have zero weights?
Zero weights for respondents can occur for two reasons.
First, the HILDA sample in Wave 1 excluded people living in institutions (such as hospitals and other healthcare institutions, military and police installations, correctional and penal institutions, convents and monasteries) and other non-private dwellings (such as hotels and motels).
As a result, the HILDA sample is not representative of people living in non-private dwellings.
People who move into these dwellings after Wave 1 are given zero cross-sectional weights and zero longitudinal weights for the balanced panel starting from the wave in which they began living in a non-private dwelling.
Second, the HILDA sample also excluded people living in remote and sparsely populated areas. Some of these areas are excluded from the Australian Bureau of Statistics' population benchmarks, which are used in the weighting process.
For Releases 1 to 4, the benchmarks only excluded remote and sparsely populated areas in the Northern Territory. Following Release 4, however, the ABS revised the areas considered remote and sparsely populated to include very remote parts of New South Wales, Queensland, South Australia, Western Australia and the Northern Territory.
These areas are determined by the Remoteness Area classification and have a value greater than 10.53 in the Accessibility/Remoteness Index of Australia. As a result, from Release 5, a small number of sample members living in these areas are given zero cross-sectional weights and zero longitudinal weights.
How do I reference HILDA?
The following paragraph must appear in any research that uses HILDA Survey data:
This paper uses unit record data from Household, Income and Labour Dynamics in Australia Survey [HILDA] conducted by the Australian Government Department of Social Services (DSS). The findings and views reported in this paper, however, are those of the author[s] and should not be attributed to the Australian Government, DSS, or any of DSS’ contractors or partners. DOI: ####
Note: The DOI (digital object identifier) references the Dataset(s) used and can be found on the download page of the DSS Longitudinal Studies Dataverse website.
Including the above statement is a requirement of the terms and conditions set out in the Confidentiality Deed Poll signed when you obtained the HILDA Survey data.
The following reference is also suggested if you wish to refer to the design of HILDA:
Watson, N., and Wooden, M. (2012), 'The HILDA Survey: A Case Study in the Design and Development of a Successful Household Panel Study', Longitudinal and Life Course Studies, vol. 3, no. 3, pp. 369-381.
How do I calculate if someone is retired in Wave 4?
Retirement status in Wave 4 is problematic. There was an oversight during preparation for Wave 4 that resulted in questions on retirement status contained in the Wave 2 Continuing Person Questionnaire not being reinstated.
The questions were removed in Wave 3 because a more comprehensive set of retirement-related questions was included as part of a retirement module.
Removal of this retirement module for Wave 4 should have been accompanied by the reinstatement of the original retirement questions, but this was overlooked and not rectified until Wave 5.
You can define retirement status based solely on age and labour force status, but to be consistent across waves you would need to apply the same criteria across all waves.
The alternative is to exclude Wave 4 entirely.
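If you take the age and labour force status route, the key point is that one rule is applied to every wave. The sketch below shows the idea only: the age threshold of 55, the status label "not in labour force" and the key names "age" and "lfs" are placeholders, not a recommended definition of retirement.

```python
def is_retired(age, labour_force_status, min_age=55):
    """Hypothetical rule: at or above min_age and not in the labour force."""
    return age >= min_age and labour_force_status == "not in labour force"


def flag_retired(long_file, min_age=55):
    """Apply the same rule to every wave so the definition is consistent over time."""
    return [
        dict(rec, retired=is_retired(rec["age"], rec["lfs"], min_age))
        for rec in long_file
    ]


# Made-up records, for illustration only.
long_file = [
    {"xwaveid": "0100001", "wave": 3, "age": 67, "lfs": "not in labour force"},
    {"xwaveid": "0100001", "wave": 4, "age": 68, "lfs": "not in labour force"},
    {"xwaveid": "0100002", "wave": 4, "age": 40, "lfs": "not in labour force"},
    {"xwaveid": "0100003", "wave": 4, "age": 60, "lfs": "employed"},
]
flagged = flag_retired(long_file)
retired_flags = [rec["retired"] for rec in flagged]
```

The rule deliberately ignores any wave-specific retirement questions, so Wave 4 is treated in exactly the same way as the other waves.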
How do I find the household reference person for a household?
A household reference person is not provided in the HILDA datasets. Researchers will have different definitions they may wish to apply to define a household reference person. It may depend on their particular research topic or on how they want this definition to apply over time as circumstances within the household change (e.g., if relative income levels differ over time, if relationships change over time, or if someone moves out or in). Some variables that you might find useful in defining a household reference person are relationship in household (_hhrih), income (_tifefp and _tifefn), owner (_hsoid1 to _hsoid18, but these are only available in some years) and age (_hgage).
Please note that the person numbers (_hhpno) indicate the row on the Household Form on which that person is listed. The order in the first wave is simply the order in which the respondent mentions the people in the household to the interviewer. In later waves, joiners are added, leavers are removed and the remaining people are shuffled up for the next wave.
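As one possible definition among many, the sketch below picks the person with the highest income, breaking ties by age and then by person number. It assumes that income is the positive component (_tifefp) minus the negative component (_tifefn). The function name, tie-breaking order and data are hypothetical.

```python
def reference_person(household_records, wave_letter):
    """Pick one reference person per household under a hypothetical rule.

    Rule: highest income first, then oldest, then lowest person number.
    """
    pos_var = wave_letter + "tifefp"
    neg_var = wave_letter + "tifefn"
    age_var = wave_letter + "hgage"
    pno_var = wave_letter + "hhpno"

    def sort_key(rec):
        income = rec[pos_var] - rec[neg_var]  # assumed: positive minus negative part
        return (-income, -rec[age_var], rec[pno_var])

    return min(household_records, key=sort_key)["xwaveid"]


# Made-up household, for illustration only.
household = [
    {"xwaveid": "0100001", "atifefp": 50000, "atifefn": 0, "ahgage": 40, "ahhpno": 1},
    {"xwaveid": "0100002", "atifefp": 50000, "atifefn": 0, "ahgage": 45, "ahhpno": 2},
    {"xwaveid": "0100004", "atifefp": 0, "atifefn": 2000, "ahgage": 19, "ahhpno": 3},
]
ref_person = reference_person(household, "a")
```

Person number is used only as a last tie-breaker because, as noted above, it reflects listing order on the Household Form rather than any substantive role in the household.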
How do I link households over time?
We don’t provide a longitudinal household id as different users will have different definitions of what it means to be part of a longitudinal household. Does a birth or death change the household? What if someone moves in or out? Does it matter who they are or how they are related to the ‘core’ people in the household? If a couple divorces, who does the household belong to after the divorce? Or what happens if an adult son moves back into the family home – is it the same household or a different one? Does it matter if the adult son is 25 or 60? You would need to link households over time via the people who live within them.
The best file to use to do this is the master file as it contains summary information of all people who were ever part of an enumerated household. This includes the xwaveid and, for each wave, the household id and outcome status. You would need to make some decisions about what constitutes a continuing household or a new household for your purposes.
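One way to start is to count how many people each household shares with each household in the next wave, and then apply whatever continuity rule suits your purposes. The sketch below assumes a simplified master-file layout in which each person has an "hhid" dictionary mapping wave number to household id (None when not enumerated). The layout, function names and the "most shared members" rule are all hypothetical.

```python
from collections import Counter, defaultdict


def household_transitions(master, wave_from, wave_to):
    """Count members shared between each household in wave_from and each in wave_to."""
    links = defaultdict(Counter)
    for person in master:
        hh_from = person["hhid"].get(wave_from)
        hh_to = person["hhid"].get(wave_to)
        if hh_from is not None and hh_to is not None:
            links[hh_from][hh_to] += 1
    return {hh: dict(counts) for hh, counts in links.items()}


def continuing_household(links, hh_from):
    """Hypothetical rule: the successor is the later household with the most shared members."""
    successors = links.get(hh_from)
    if not successors:
        return None
    # Sorting first makes ties resolve to the smallest household id.
    return max(sorted(successors), key=successors.get)


# Made-up identifiers, for illustration only.
master = [
    {"xwaveid": "0100001", "hhid": {1: "10001", 2: "20007"}},
    {"xwaveid": "0100002", "hhid": {1: "10001", 2: "20007"}},
    {"xwaveid": "0100004", "hhid": {1: "10001", 2: "20009"}},
    {"xwaveid": "0100003", "hhid": {1: "10002", 2: None}},
]
links = household_transitions(master, 1, 2)
successor = continuing_household(links, "10001")
```

A rule based on who the shared members are (for example, the ‘core’ couple) would replace only `continuing_household`; the transition counts stay the same.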
You might also like to consider if you actually do need to think about your research question in terms of the longitudinal household concept. It may be possible to redefine it to what happens to people who live in certain types of households over time. Households are not a well-defined concept over time (as researchers would have different definitions depending on their particular research question) whereas individuals are.
Should I use the imputed values in the HILDA Survey data?
Some variables in the HILDA Survey data are imputed when complete responses are not available from respondents. For example, many income variables contain imputed values. The HILDA Survey team provides users with information about which values are imputed. For example, for “household financial year disposable total income” (_hifditp/_hifditn) there is an imputation flag, _hifditf. Across the first 16 waves of the HILDA Survey, about 25% of the values for this variable at the household level are imputed. This variable is the sum of many income components and it only takes one missing value at the lower level for this overall total to be missing.
A user might be tempted to throw these observations away as they do not contain actual responses from participants. Users might be worried that their analysis is affected by using these imputed values which are the product of some model that is being used by the HILDA Survey team.
First, users should know that most imputation is relatively innocuous. Often, respondents will have left one item blank in one year, and a good guess for this item can readily be worked out from the same respondents' answers in other years and from other respondents.
Most importantly, however, throwing away imputed observations will create large amounts of bias in estimates relative to including the imputed values. While there may be some errors introduced by the imputation procedure, the errors introduced by excluding observations with imputed values will be much larger. This is due to “selection bias.” Observations for which values have been imputed are systematically different from those for which imputation has not been done. By excluding those observations, users risk introducing large amounts of selection bias into their estimates.
There is widespread agreement in the empirical social science literature and in the statistics literature that it is far superior to include the observations with imputed values. It is also recommended to include the imputation indicator (dummy) variable as an explanatory variable in your regression. For example, if you are estimating a model with “household financial year disposable total income”, you should include the _hifditf indicator as an additional explanatory variable. This will help to “soak up” any errors that may have been introduced by the imputation process. (See Frick and Grabka (2007), “Item non-response and Imputation of Annual Labor Income in Panel Surveys from a Cross-National Perspective”. DIW Discussion Paper 736.)
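In practice this means keeping every observation and adding the flag as an extra column of regressors. The sketch below only builds the regressor rows and leaves estimation to your statistical package. It assumes that the flag is 0 when nothing was imputed and non-zero otherwise, and that income is the positive component minus the negative component. The function names and values are hypothetical.

```python
def design_row(record, wave_letter):
    """Build [constant, income, imputation dummy] for one household."""
    income = record[wave_letter + "hifditp"] - record[wave_letter + "hifditn"]
    imputed = 1.0 if record[wave_letter + "hifditf"] else 0.0  # assumed flag coding
    return [1.0, float(income), imputed]


def share_imputed(records, wave_letter):
    """Proportion of households whose income total contains imputed values."""
    flag_var = wave_letter + "hifditf"
    return sum(1 for rec in records if rec[flag_var]) / len(records)


# Made-up households, for illustration only.
households = [
    {"ahifditp": 60000, "ahifditn": 0, "ahifditf": 0},
    {"ahifditp": 45000, "ahifditn": 0, "ahifditf": 1},
]
design = [design_row(rec, "a") for rec in households]
imputed_share = share_imputed(households, "a")
```

No rows are dropped: imputed households stay in the sample and are distinguished only by the dummy in the third column.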
This answer was provided by Prof Robert Breunig, Australian National University.
What level of geography can I use?
We did not oversample any particular areas or regions in the initial sample (or in the Top-up sample added in 2011) and the sample is not designed to support small area estimates. There are also restrictions on the use of geographic variables as set out by DSS in the National Centre for Longitudinal Data Access and Use Guidelines. For the level of geography suitable for your particular analysis, you will need to let the sample guide you via the size of the appropriately calculated standard errors.
More detailed geographic variables are provided in the Restricted Release than the General Release. This is only so that users can build up alternative higher level geographic classifications from these standard lower level geographic classifications. The HILDA User Manual describes the geographic variables provided in the two datasets. See the ‘Geography’ section in the ‘Derived variables’ chapter (Section 4.3 in the HILDA User Manual).
Finally, if your analysis is purely cross-sectional in nature, you should consider whether the Australian Bureau of Statistics has data that could better meet your needs. In many ABS surveys, smaller states and regions are oversampled to get more accurate estimates for those areas.