Analysis of Longitudinal Data with Missing Values: Methods and Applications in Medical Statistics.
Abstract
Missing data are values that are, for some reason, not observed in a dataset. Most standard analysis methods cannot be applied directly to datasets with missing values, and an inappropriate method for handling missing data may result in biased and/or imprecise estimates. It is therefore important to employ suitable methods when analyzing such data. Cardiac surgery is a procedure suitable for patients suffering from different types of heart disease. It is a physically and psychologically demanding operation for the patients, although the mortality rate is low. Health-related quality of life (HRQOL) is a widely used measure for monitoring the overall situation of patients undergoing cardiac surgery, especially elderly patients with naturally limited life expectancies [Gjeilo, 2009]. There has been growing attention to possible differences between men and women with respect to HRQOL after cardiac surgery, and the literature is not consistent on this topic. Gjeilo et al. [2008] studied HRQOL in patients before and after cardiac surgery, with emphasis on differences between men and women. In the period from September 2004 to September 2005, 534 patients undergoing cardiac surgery at St Olavs Hospital were included in the study. HRQOL was measured by the self-reported questionnaires Short-Form 36 (SF-36) and the Brief Pain Inventory (BPI) before surgery and at six and twelve months follow-up. The SF-36 reflects health-related quality of life by measuring eight conceptual domains of health [Loge and Kaasa, 1998]. Some of the patients did not respond to all questions, and there are missing values in the records of about 41% of the patients. Women have more missing values than men at all time points. The statistical analyses performed in Gjeilo et al. [2008] employ the complete-case method, which until recent years was the most common method for handling missing data.
The complete-case method discards all subjects with unobserved data prior to the analyses. It makes standard statistical analyses accessible and is the default method for handling missing data in several statistical software packages. The complete-case method gives correct estimates only if data are missing completely at random, that is, without any relation to other observed or unobserved measurements. This assumption is seldom met, and violations can result in incorrect estimates and decreased efficiency. The focus of this paper is on improved methods for handling missing values in longitudinal data, that is, observations of the same subjects on multiple occasions. Multiple imputation and imputation by expectation maximization are general methods that can be combined with many standard analysis methods and applied in several missing-data situations. Regression models can also give correct estimates and are available for longitudinal data. In this paper we present the theory of these approaches and apply them to the dataset introduced above. The results are compared to the complete-case analyses published in Gjeilo et al. [2008], and the methods are discussed with respect to how well they handle missing values in this setting. The data on patients undergoing cardiac surgery are analyzed in Gjeilo et al. [2008] with respect to gender differences at each of the measurement occasions: presurgery, six months, and twelve months after the operation. This is done by a two-sample Student's t-test assuming unequal variances. All patients observed at the relevant occasion are included in these analyses. Repeated measures ANOVA is used to determine gender differences in the evolution of the HRQOL variables. Only patients with fully observed measurements at all three occasions are included in the ANOVA. The methods of expectation maximization (EM) and multiple imputation (MI) are used to obtain plausible complete datasets including all patients.
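The two case-selection rules described above (every patient observed at a given occasion for the t-tests, only patients observed at all three occasions for the ANOVA) can be illustrated with a minimal Python sketch. The field names ("sex", "pre", "m6", "m12"), the helper names, and the toy scores are all hypothetical and are not taken from the study data; the sketch computes only the unequal-variance t statistic, not its degrees of freedom or p-value.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical field names for the three measurement occasions.
OCCASIONS = ("pre", "m6", "m12")

# Made-up toy records; None marks a missing score.
records = [
    {"sex": "F", "pre": 40.0, "m6": 55.0, "m12": 60.0},
    {"sex": "F", "pre": 35.0, "m6": None, "m12": 50.0},
    {"sex": "F", "pre": 45.0, "m6": 50.0, "m12": None},
    {"sex": "M", "pre": 50.0, "m6": 65.0, "m12": 70.0},
    {"sex": "M", "pre": 55.0, "m6": 60.0, "m12": 75.0},
    {"sex": "M", "pre": None, "m6": 70.0, "m12": 72.0},
]


def complete_cases(rows, occasions=OCCASIONS):
    """Keep only subjects observed at every occasion (complete-case rule)."""
    return [r for r in rows if all(r[o] is not None for o in occasions)]


def observed_at(rows, occasion):
    """Keep every subject observed at one occasion (rule used per t-test)."""
    return [r for r in rows if r[occasion] is not None]


def scores_by_sex(rows, occasion, sex):
    """Extract the scores of one sex at one occasion."""
    return [r[occasion] for r in rows if r["sex"] == sex]


def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


pre_rows = observed_at(records, "pre")
t_pre = welch_t(scores_by_sex(pre_rows, "pre", "F"),
                scores_by_sex(pre_rows, "pre", "M"))
print(len(complete_cases(records)), len(pre_rows), round(t_pre, 3))
```

In this toy example half of the subjects are lost under the complete-case rule, while only one is lost for the presurgery comparison, which shows how quickly the complete-case sample can shrink in longitudinal data.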
EM gives a single imputed dataset that can be analyzed in the same way as in the complete-case analysis. MI gives multiple imputed datasets, each of which must be analyzed separately, and the resulting estimates are combined according to a technique called Rubin's rules. Both Student's t-tests and repeated measures ANOVA can be performed with these imputation methods. The repeated measures ANOVA can be expressed as a regression equation that describes the improvement in HRQOL score over time and the variation between subjects. Mixed regression models (MRM) are known to accommodate longitudinal data with non-responses, and they can be extended beyond the repeated measures ANOVA to fit the data more adequately. Several MRMs are fitted to the data on cardiac surgery patients to display their properties and advantages over ANOVA. These models are alternatives to the imputation analyses when the aim is to determine gender differences in improvement of HRQOL after surgery. The imputation methods and mixed regression models are assumed to handle missing data adequately, and they give similar analysis results. These results differ from the complete-case results for some of the HRQOL variables when examining gender differences in improvement of HRQOL after surgery.
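As an illustration of how estimates from multiply imputed datasets are combined, the following is a minimal sketch of the standard form of Rubin's rules for a scalar estimate: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name pool_rubin and the example numbers are hypothetical, and the degrees-of-freedom adjustment used for tests and confidence intervals is omitted.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m scalar estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and one variance per estimate")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total, math.sqrt(total)


# Hypothetical estimates of a gender difference from m = 3 imputed datasets.
example_estimates = [1.0, 2.0, 3.0]
example_variances = [0.5, 0.5, 0.5]
pooled, total_var, pooled_se = pool_rubin(example_estimates, example_variances)
print(pooled, total_var, pooled_se)
```

The between-imputation term is what separates MI from single imputation: it carries the uncertainty due to the missing values into the pooled standard error, whereas analyzing one imputed dataset as if it were fully observed ignores that uncertainty.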