A framework for significance analysis of gene expression data using dimension reduction methods
Journal article, Peer reviewed
Date
2007
Collections
- Institutt for kjemi [1418]
- Publikasjoner fra CRIStin - NTNU [39165]
Abstract
Background: The most popular methods for significance analysis on microarray data are well
suited to finding genes differentially expressed across predefined categories. However, identification
of features that correlate with continuous dependent variables is more difficult using these
methods, and long lists of significant genes returned are not easily probed for co-regulations and
dependencies. Dimension reduction methods are widely used in the microarray literature for
classification or for obtaining low-dimensional representations of data sets. These methods have an
additional interpretative strength that is often not fully exploited when expression data are
analysed. In addition, significance analysis may be performed directly on the model parameters to
find genes that are important for any number of categorical or continuous responses. We
introduce a general scheme for analysis of expression data that combines significance testing with
the interpretative advantages of the dimension reduction methods. This approach is applicable both
for explorative analysis and for classification and regression problems.
Results: Three public data sets are analysed. One is used for classification, one contains spiked-in
transcripts of known concentrations, and one represents a regression problem with several
measured responses. Model-based significance analysis is performed using a modified version of
Hotelling's T²-test, and a false discovery rate significance level is estimated by resampling. Our
results show that underlying biological phenomena and unknown relationships in the data can be
detected by a simple visual interpretation of the model parameters. It is also found that measured
phenotypic responses may model the expression data more accurately than if the design parameters
are used as input. For the classification data, our method finds much the same genes as
the standard methods, along with some additional genes that are shown to be biologically relevant. The
list of spiked-in genes is also reproduced with high accuracy.
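
As an illustration of how a false discovery rate can be estimated by resampling, the following is a minimal sketch and not the authors' procedure. It assumes a plain absolute Pearson correlation as the per-gene statistic (a stand-in for the modified Hotelling's T²-test) and a simple permutation of the response as the resampling scheme. The function names `pearson`, `estimate_fdr` and `simulate`, the threshold, and the simulated data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def estimate_fdr(genes, response, threshold, n_perm=200, seed=0):
    """Estimate the FDR when genes with |r| >= threshold are called significant.

    genes: list of expression profiles, one list of sample values per gene.
    response: list of response values, one per sample.
    Returns (estimated FDR, number of genes called significant).
    """
    rng = random.Random(seed)
    observed = [abs(pearson(g, response)) for g in genes]
    n_called = sum(s >= threshold for s in observed)
    if n_called == 0:
        return 0.0, 0
    # Null distribution: shuffling the response breaks any gene-response link,
    # so every gene passing the threshold in a shuffled run is a false call.
    shuffled = list(response)
    total_false = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total_false += sum(abs(pearson(g, shuffled)) >= threshold for g in genes)
    expected_false = total_false / n_perm
    return min(1.0, expected_false / n_called), n_called


def simulate(n_samples=12, n_true=5, n_null=45, seed=1):
    """Simulated data (assumption): a few genes track the response, the rest are noise."""
    rng = random.Random(seed)
    response = [float(i) for i in range(n_samples)]
    true_genes = [[v + rng.gauss(0.0, 0.5) for v in response] for _ in range(n_true)]
    null_genes = [[rng.gauss(0.0, 1.0) for _ in response] for _ in range(n_null)]
    return true_genes + null_genes, response


if __name__ == "__main__":
    genes, response = simulate()
    fdr, n_called = estimate_fdr(genes, response, threshold=0.9)
    print(f"genes called: {n_called}, estimated FDR: {fdr:.4f}")
```

The estimate is the average number of genes passing the threshold under permutation divided by the number passing in the observed data. The same ratio can be formed for any per-gene statistic, including one computed on the parameters of a dimension reduction model.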
Conclusion: The dimension reduction methods are versatile tools that may also be used for
significance testing. Visual inspection of model components is useful for interpretation, and the
methodology is the same whether the goal is classification, prediction of responses, feature
selection or exploration of a data set. The presented framework is conceptually and algorithmically
simple, and a Matlab toolbox (Mathworks Inc, USA) is provided as a supplement.