Loading Gene Expression Data
Interaction networks are
useful as stand-alone models. However, they are most powerful for
answering scientific questions when integrated with additional biological
information, such as
gene or protein expression levels. Once loaded, expression
ratios/levels may be visually superimposed on the network, used in a
filter to select a subset of nodes, or used to identify active
modules and subsystems (via plugin analysis tools). An expression
data set can be loaded at any time, but is only relevant once a
network has been loaded.
Format
Gene expression ratios or values are specified
over one or more experiments using a text file.
Ratios result from a comparison of two expression measurements (experiment vs.
control). Some expression platforms, such as Affymetrix, directly measure
expression values, without a comparison.
The file consists of
a header and a number of space- or tab-delimited fields, one line per
gene, with the following format:
GeneName [CommonName] ratio1 ratio2 ... ratioN [pval1 pval2 ... pvalN]
Brackets [] indicate fields that are
optional. The first two fields are the systematic gene name followed
by an optional common name. Expression ratios/values are provided for each
experiment, optionally followed by a p-value per experiment or other
measure of the significance of each ratio/value, i.e. whether the ratio
represents a true change in expression or whether the value accurately
measures the expression level of a gene (according to some statistical
model). Significance values are generated by a variety of software
packages for analyzing expression data from DNA microarrays,
for instance the VERA program from the Institute for Systems Biology
(http://www.systemsbiology.org/VERAandSAM).
A list of other microarray analysis packages is available at:
http://www.nslij-genetics.org/microarray/soft.html
Example
GENE    DESCRIPT gal1RG.sig gal2RG.sig gal3RG.sig gal1RG.sig gal2RG.sig gal3RG.sig
YHR051W COX6     -0.034     -0.052     0.152      1.177      0.102      0.857
YHR124W NDT80    -0.090     -0.000     0.041      0.130      0.341      0.061
YKL181W PRS1     -0.167     -0.063     -0.230     -0.233     0.143      0.089
The first line is a header line
giving the names of the experimental conditions. Note that each
condition is duplicated; the first set of columns gives expression
ratios and the second set gives significance values. The significance
columns can be omitted if your data doesn't include significance
measures. Every remaining row specifies the values for a gene,
starting with the formal name of the gene, then a common name, then
the ratios, then the significance values.
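As a rough illustration of the layout just described, the following Python sketch reads the example data into ratio and significance lists. It is not Cytoscape's own parser. The function name parse_expression_text is hypothetical, and the sketch assumes that every row carries a common name and that the header repeats the condition names whenever significance columns are present.

```python
def parse_expression_text(text):
    """Split a whitespace-delimited expression table into conditions and rows.

    Assumptions (not taken from Cytoscape's code): each row has a formal name
    and a common name, and the header repeats the condition names when
    significance columns are present.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cond_tokens = lines[0].split()[2:]  # skip the two name columns
    half = len(cond_tokens) // 2
    has_sig = (half > 0 and len(cond_tokens) % 2 == 0
               and cond_tokens[:half] == cond_tokens[half:])
    conditions = cond_tokens[:half] if has_sig else cond_tokens
    n = len(conditions)

    genes = {}
    for line in lines[1:]:
        fields = line.split()
        values = [float(v) for v in fields[2:]]
        genes[fields[0]] = {
            "common_name": fields[1],
            "ratios": values[:n],
            "significance": values[n:2 * n] if has_sig else [],
        }
    return conditions, genes


SAMPLE = """\
GENE DESCRIPT gal1RG.sig gal2RG.sig gal3RG.sig gal1RG.sig gal2RG.sig gal3RG.sig
YHR051W COX6 -0.034 -0.052 0.152 1.177 0.102 0.857
YHR124W NDT80 -0.090 -0.000 0.041 0.130 0.341 0.061
YKL181W PRS1 -0.167 -0.063 -0.230 -0.233 0.143 0.089
"""

conditions, genes = parse_expression_text(SAMPLE)
print(conditions)
print(genes["YHR051W"]["ratios"], genes["YHR051W"]["significance"])
```

For the sample above, the first three numeric columns of each row end up in "ratios" and the last three in "significance".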
Some variations on this basic format
are recognized: see the formal file format specification below for
more information. Expression data files commonly have the file
extensions ".mrna" or ".pvals", and these file
extensions are recognized by Cytoscape when browsing for data files.
Commands
Load an expression data file using
the File menu of Cytoscape, or by specifying the filename
using the -e option at the command line. The -x
command line option indicates that the expression data should not be
loaded into node attributes. This is an advanced option, and is
typically only used when the number of expression conditions is
sufficiently large that it becomes unwieldy in the normal user
interface.
Example
Load a sample gene expression data
set using the menu: File / Load / Expression Matrix File. In
the resulting file dialog box (shown at right), select the file
“sampleData/galExpData.pvals”. As described in the following
sections, Cytoscape is now ready to integrate these data with the
underlying molecular interaction network.
Detailed file format (Advanced users)
In all expression data files, any
whitespace (spaces and/or tabs) is considered a delimiter between
adjacent fields. Every line of text is either the header line or
contains all the measurements for a particular gene. No name
conversion is applied to expression data files (see the discussion of
name resolution in section 5, Building and Storing Interaction
Networks).
The names given in the first column of the expression data file
should match exactly the names used elsewhere (i.e. in SIF or GML
files).
The first line is a header line with
one of the following three formats:
<text> <text> cond1 cond2 ... cond1 cond2 ... [NumSigConds]
<text> <text> cond1 cond2 ...
<tab><tab>RATIOS<tab><tab>...LAMBDAS
The first format specifies that both
expression ratios and significance values are included in the file.
The first two text tokens are ignored; these columns will contain
names for each gene. The next token set specifies the names of the
experimental conditions; these columns will contain ratio values.
This list of condition names must then be duplicated exactly, each
spelled the same way and in the same order. Optionally, a final
column with the title NumSigConds
may be present. If present, this column will contain integer values
indicating the number of conditions in which each gene had a
statistically significant change according to some threshold.
The second format is similar to the
first except that the duplicate column names are omitted and there
is no NumSigConds field. This format specifies data with ratios but
no significance values.
The third format specifies an MTX
header, which is a commonly used format. Two tab characters precede
the RATIOS
token. This token is followed by a number of tabs equal to the number
of conditions, followed by the LAMBDAS
token. This format specifies both ratios and significance values.
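The three header formats can be told apart from the header line alone. The Python sketch below shows one possible way to do so. The function name classify_header and the returned dictionary keys are hypothetical, and the logic is a reading of the descriptions above rather than Cytoscape's actual implementation. In particular, it assumes that a duplicated list of condition names always signals significance columns.

```python
def classify_header(header_line):
    """Classify a header line as "mtx", "ratios+significance", or "ratios".

    Hypothetical helper: the rules follow the prose description of the three
    header formats, not Cytoscape's source code.
    """
    tokens = header_line.split()

    # Third format: RATIOS, then one tab per condition, then LAMBDAS.
    if "RATIOS" in tokens and "LAMBDAS" in tokens:
        start = header_line.index("RATIOS") + len("RATIOS")
        end = header_line.index("LAMBDAS")
        return {"kind": "mtx", "conditions": None,
                "n_conditions": header_line[start:end].count("\t"),
                "num_sig_conds": False}

    names = tokens[2:]  # the first two text tokens are ignored
    num_sig_conds = bool(names) and names[-1] == "NumSigConds"
    if num_sig_conds:
        names = names[:-1]

    # First format: the condition names appear twice, in the same order.
    half = len(names) // 2
    duplicated = (half > 0 and len(names) % 2 == 0
                  and names[:half] == names[half:])
    conditions = names[:half] if duplicated else names
    return {"kind": "ratios+significance" if duplicated else "ratios",
            "conditions": conditions, "n_conditions": len(conditions),
            "num_sig_conds": num_sig_conds}


print(classify_header("GENE DESCRIPT cond1 cond2 cond1 cond2 NumSigConds"))
print(classify_header("GENE DESCRIPT cond1 cond2 cond3"))
print(classify_header("\t\tRATIOS\t\t\tLAMBDAS"))
```

An MTX header does not carry condition names, so the sketch reports only the number of conditions, counted from the tabs between the two tokens.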
Each line after the first is a data
line with the following format:
FormalGeneName CommonGeneName ratio1 ratio2 ... [lambda1 lambda2 ...] [numSigConds]
The first two tokens are gene names.
The names in the first column are the keys used for node name lookup;
these names should be the same as the names used elsewhere in
Cytoscape (i.e. in the SIF or GML files). In the gene expression
microarray community, which defined these file formats, the first
token is traditionally the formal name of the gene (in systems
where there is a formal naming scheme for genes), while the second is
a synonym for the gene commonly used by biologists. Cytoscape does
not make use of the common name column. The
next columns contain floating point values for the ratios, followed
by columns with the significance values if specified by the header
line. The final column, if specified by the header line, should
contain an integer giving the number of significant conditions for
that gene.
Missing values are not allowed and
will confuse the parser. For example, using two consecutive tabs to
indicate a missing value will not work; the parser will regard both
tabs as a single delimiter and be unable to parse the line correctly.
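The effect of a missing value can be reproduced with any tokenizer that treats runs of whitespace as one delimiter. The Python sketch below uses str.split() with no arguments as a stand-in for that behavior (it is not Cytoscape's tokenizer) and shows how a row with two consecutive tabs comes out one field short. The helper name has_expected_fields is hypothetical.

```python
def has_expected_fields(line, n_conditions, n_name_columns=2):
    """Check that a ratios-only data line has one value per condition.

    str.split() with no arguments collapses consecutive whitespace, which
    mimics the delimiter handling described in the text.
    """
    return len(line.split()) == n_name_columns + n_conditions


complete = "YHR051W\tCOX6\t-0.034\t-0.052\t0.152"
missing = "YHR051W\tCOX6\t-0.034\t\t0.152"  # second ratio left empty

print(complete.split())  # five fields: two names and three ratios
print(missing.split())   # four fields: the empty slot has vanished
print(has_expected_fields(complete, 3), has_expected_fields(missing, 3))
```

A check of this kind, run before loading, can flag rows whose field count does not match the header.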
Optionally, the last line of the
file may be a special footer line with the following format:
NumSigGenes int1 int2 ...
This line specifies the number of
genes that were significantly differentially expressed in each
condition. The first text token must be spelled exactly as shown; the
rest of the line should contain one integer value for each
experimental condition.
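A footer line of this form can be recognized by its first token. The following Python sketch shows one way to read it. The function name parse_footer is hypothetical, and the counts in the usage line are invented for illustration only.

```python
def parse_footer(last_line, n_conditions):
    """Return the per-condition counts from a NumSigGenes footer line.

    Returns None when the line is not a footer (for example, an ordinary
    data line). Raises ValueError when the number of integers does not
    match the number of experimental conditions.
    """
    tokens = last_line.split()
    if not tokens or tokens[0] != "NumSigGenes":  # must be spelled exactly
        return None
    counts = [int(tok) for tok in tokens[1:]]
    if len(counts) != n_conditions:
        raise ValueError(
            f"expected {n_conditions} counts, found {len(counts)}")
    return counts


# Illustrative values only; they do not come from a real data set.
print(parse_footer("NumSigGenes 12 7 30", 3))
print(parse_footer("YHR051W COX6 -0.034 -0.052 0.152", 3))
```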