Building and Storing Interaction Networks
Cytoscape reads an interaction network in two
ways: from a simple interaction file (SIF or .sif format) or from a
standard format known as Graph Markup Language (GML or .gml format).
SIF specifies nodes and interactions only, while GML stores
additional information about network layout and allows network data
exchange with a variety of other network display programs. Typically,
SIF is used to import interactions when building a network
for the first time, since it is easy to create in a text editor or
spreadsheet. Once the interactions have been loaded and layout has
been performed, the network may be saved to and subsequently reloaded
from GML format in future Cytoscape sessions. Both SIF and GML are
ASCII text files, and you can edit and view them in a regular text
editor. Additionally, GML is supported by some other network
visualization tools.
SIF FORMAT
The simple interaction format is
convenient for building a graph from a list of interactions. It also
makes it easy to combine different interaction sets into a larger
network, or add new interactions to an existing data set. The main
disadvantage is that this format does not include any layout
information, forcing Cytoscape to re-compute a new layout of the
network each time it is loaded.
Lines in the SIF file specify a
source node, a relationship type (or edge type), and one or more
target nodes:
nodeA <relationship type> nodeB
nodeC <relationship type> nodeA
nodeD <relationship type> nodeE nodeF nodeB
nodeG
...
nodeY <relationship type> nodeZ
A more specific example is:
node1 typeA node2
node2 typeB node3 node4 node5
node0
The first line identifies two nodes, called node1 and node2, and a
single relationship between node1 and node2 of type typeA. The second
line specifies three new nodes, node3, node4, and node5; here "node2"
refers to the same node as in the first line. The second line also
specifies three relationships, all of type typeB and with node2 as the
source, with node3, node4, and node5 as the targets, respectively.
This second form is simply shorthand
for specifying multiple relationships of the same type with the same
source node. The third line indicates how to specify a node that has
no relationships with other nodes. This form is not needed for nodes
that do have relationships, since the specification of the
relationship implicitly identifies the nodes as well.
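The line structure described above can be sketched in a few lines of Python. This is only an illustrative reading of the format, not Cytoscape's own parser: the function name parse_sif_lines and the (source, relationship type, target) tuple layout are assumptions made for this example, and it handles only the space-delimited case. The text does not say what happens to a line with exactly two fields, so the sketch simply rejects it.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (source, relationship type, target)
Edge = Tuple[str, str, str]


def parse_sif_lines(lines: List[str]) -> Tuple[Set[str], List[Edge]]:
    """Parse whitespace-delimited SIF lines into a node set and an edge list."""
    nodes: Set[str] = set()
    edges: List[Edge] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = fields[0]
        nodes.add(source)
        if len(fields) == 1:
            continue  # a lone name declares a node with no relationships
        if len(fields) == 2:
            # Assumption of this sketch: a relationship type without a
            # target is treated as an error.
            raise ValueError(f"missing target node in line: {line!r}")
        relationship = fields[1]
        # One source and type may be followed by several targets (shorthand).
        for target in fields[2:]:
            nodes.add(target)
            edges.append((source, relationship, target))
    return nodes, edges


example = [
    "node1 typeA node2",
    "node2 typeB node3 node4 node5",
    "node0",
]
nodes, edges = parse_sif_lines(example)
for edge in edges:
    print(edge)
print(sorted(nodes))
```

Applied to the three example lines, the sketch yields four relationships and six nodes, with node0 present only because it was listed on its own line.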
Duplicate entries are allowed and indicate multiple edges between the
same nodes. For example, the following specifies three edges between
the same pair of nodes, two of type pp and one of type pd:
node1 pp node2
node1 pp node2
node1 pd node2
Edges connecting a node to itself
(self-edges) are also allowed:
node1 pp node1
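To make the handling of duplicates and self-edges concrete, the following Python sketch keeps every entry rather than collapsing repeats. The tuple layout (source, relationship type, target) and the variable names are assumptions for illustration only and do not reflect Cytoscape's internal data model.

```python
from collections import Counter

# The four entries from the examples above, one tuple per edge.
edges = [
    ("node1", "pp", "node2"),
    ("node1", "pp", "node2"),
    ("node1", "pd", "node2"),
    ("node1", "pp", "node1"),
]

# Duplicate entries are kept, so each one counts as a separate edge.
multiplicity = Counter(edges)

# A self-edge has the same node as source and target.
self_edges = [edge for edge in edges if edge[0] == edge[2]]

for edge, count in multiplicity.items():
    print(edge, count)
print(self_edges)
```

Using a list (or a counter) instead of a set is the design point here: a set would silently drop the second pp edge between node1 and node2.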
Every node and edge in Cytoscape has
an identifying name, most commonly used with the node and edge data
attribute structures. Node names must be unique, as identically named
nodes will be treated as identical nodes. The name of each node will
be the name in this file by default (unless another string is mapped
to display on the node using the visual mapper – see 9. Visual
Styles).
The name of each edge will be formed from the name of the source and
target nodes plus the interaction type: for example, sourceName
edgeType targetName.
The <relationship type> tag (the interaction type) should be one of:
pp .................. protein – protein interaction
pd .................. protein -> DNA (e.g. transcription factor
                      binding upstream of a regulating gene.)
Any text string will work, but the
above are the conventions that have been followed thus far.
Additional interaction types are also
possible, but not widely used, e.g.:
pr .................. protein -> reaction
rc .................. reaction -> compound
cr .................. compound -> reaction
gl .................. genetic lethal relationship
pm .................. protein-metabolite interaction
mp .................. metabolite-protein interaction
Even whole words or concatenated words may be used to define other
types of relationships, e.g. geneFusion, cogInference, pullsDown,
activates, degrades, inactivates, inhibits, phosphorylates,
upRegulates.
Delimiters. Whitespace (space
or tab) is used to delimit the names in the simple interaction file
format. However, in some cases spaces are desired in a node name or
edge type. The standard is that, if the file contains any tab
characters, then tabs are used to delimit the fields and spaces are
considered part of the name. If the file contains no tabs, then any
spaces are delimiters that separate names (and names cannot contain
spaces).
If your network unexpectedly
contains no edges and node names that look like edge names, it
probably means your file contains a stray tab that's fooling the
parser. On the other hand, if your network has nodes whose names are
half of a full name, then you probably meant to use tabs to separate
node names with spaces.
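The delimiter rule and the two failure modes described above can be illustrated with a short Python sketch. The function name split_sif_fields is hypothetical, and the sketch is one reading of the stated rule (any tab in the file switches the whole file to tab-delimited mode), not Cytoscape's actual parser.

```python
from typing import List


def split_sif_fields(text: str) -> List[List[str]]:
    """Split SIF text into fields, using tabs if the file contains any tab."""
    use_tabs = "\t" in text  # the decision is made once for the whole file
    rows: List[List[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if use_tabs:
            # Spaces are part of the name when tabs delimit the fields.
            rows.append(line.split("\t"))
        else:
            # With no tabs anywhere, any run of spaces separates names.
            rows.append(line.split())
    return rows


# Tab-delimited file: names may contain spaces.
tabbed = split_sif_fields("a b\tbinds\tc d\n")

# Space-delimited file: every space is a delimiter.
spaced = split_sif_fields("node1 pp node2\n")

# A single stray tab changes how every line is read: the first line
# becomes one node whose name looks like an edge.
stray = split_sif_fields("node1 pp node2\nnode3 pp\tnode4\n")

print(tabbed)
print(spaced)
print(stray)
```

In the last case the first line is read as a single node named "node1 pp node2" with no edges, which is the symptom described in the troubleshooting paragraph.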
Networks in simple interaction
format are often stored in files with a ".sif" extension,
and Cytoscape recognizes this extension when browsing a directory for
files of this type.
GML FORMAT
In contrast to SIF, GML is a rich
graph format language supported by many other network visualization
packages. The GML file format specification is available at:
http://www.infosun.fmi.uni-passau.de/Graphlet/GML/
It is generally not necessary to
modify the content of a GML file directly. Once a network is built
in SIF format and then laid out, the layout is preserved by saving to
and loading from GML.
Visual attributes specified in a GML file will result in a new visual style
named "Filename.style" when that GML file is loaded.
COMMANDS:
Load and save network files using
the File menu of Cytoscape. Network files may also be loaded directly
from the command line using the -i (SIF format) or -g (GML format)
options.
FOR EXAMPLE:

To load a sample molecular interaction network in SIF format, use the
menu File / Load / Network. In the resulting file
dialog box, select the file “sampleData/galFiltered.sif”. After a
few seconds, a small network of 331 nodes should appear in the main
window. To load the same interaction network as a GML, use the menu:
File / Load / Network again. In the resulting file dialog box,
select the file “sampleData/galFiltered.gml”. Node and edge
attribute files as well as expression data and extra annotation can
be loaded as well.
NODE NAMING ISSUES IN CYTOSCAPE:
Typically, genes are represented by
nodes, and interactions (or other biological relationships) are
represented by edges between nodes. For compactness, a gene also
represents its corresponding protein. Nodes may also be used to
represent compounds and reactions (or anything else) instead of
genes.
If a network of genes or proteins is
to be integrated with Gene Ontology (GO) annotation or gene
expression data, the gene names must exactly match the
names specified in the other data files. We strongly encourage
naming genes and proteins by their systematic ORF name or standard
accession number; common names may be displayed on the screen for
ease of interpretation, so long as these are available to the program
in the annotation directory or in a node attribute file. Cytoscape
ships with all yeast ORF-to-common name mappings in a synonym table
within the annotation/ directory. Other organisms will be supported
in the future.
Why do we recommend using standard
gene names? All of the external data formats recognized by Cytoscape
provide data associated with particular names of particular objects.
For example, a network of protein-protein interactions would list the
names of the proteins, and the attribute and expression data would
likewise be indexed by the name of the object.
The problem is in connecting data
from different data sources that don't necessarily use the same name
for the same object. For example, genes are commonly referred to by
different names, including a formal "location on the chromosome"
identifier and one or more common names that are used by ordinary
researchers when talking about that gene. Additionally, database
identifiers from every database where the gene is stored may be used
to refer to a gene (e.g. protein accession numbers from Swiss-Prot).
If one data source uses the formal name while a different data source
uses a common name or identifier, then Cytoscape must figure out that
these two different names really refer to the same biological entity.
Cytoscape has two strategies for
dealing with this naming issue, one simple and one more complex. The
simple strategy is to assume that every data source uses the
same set of names for every object. If this is the case, then
Cytoscape can easily connect all of the different data sources.
To handle data sources with
different sets of names, as is usually the case when manually
integrating gene information from different sources, Cytoscape needs
a data server that provides synonym information (See section on
Annotation Server Format).
A synonym table gives a canonical name for each object in a given
organism and one or more recognized synonyms for that object. Note
that the synonym table itself defines what set of names are the
"canonical" names. For example, in budding yeast the ORF
names are commonly used as the canonical names.
If a synonym server is available,
then by default Cytoscape will convert every name that appears in a
data file to the associated canonical name. Unrecognized names will
not be changed. This conversion of names to a common set allows
Cytoscape to connect the genes present in different data sources,
even if they have different names – as long as those names are
recognized by the synonym server.
For this to work, Cytoscape must
also be provided with the species to which the objects belong, since
the data server requires the species in order to uniquely identify
the object referred to by a particular name. This is usually done in
Cytoscape by specifying the species name on the command line with the
-s option or by adding a line to the cytoscape.props file of the
form:
defaultSpeciesName=Saccharomyces cerevisiae
The automatic canonicalization of
names can be turned off with the -c command line argument (i.e. java
-jar cytoscape.jar -c) or by not loading any annotation. This
canonicalization of names currently does not apply to expression
data. Expression data should use the same names as the other data
sources or use the canonical names as defined by the synonym table.
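The canonicalization step described in this section can be pictured with a small Python sketch. It assumes the synonym table is available as a simple mapping from canonical name to a list of synonyms; that in-memory layout, the function names, and the sample entries are illustrative assumptions and do not reflect the actual annotation server format. As in the description above, names the table does not recognize are left unchanged.

```python
from typing import Dict, List


def build_synonym_index(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known name (canonical or synonym) to its canonical name."""
    index: Dict[str, str] = {}
    for canonical, synonyms in table.items():
        index[canonical] = canonical
        for synonym in synonyms:
            index[synonym] = canonical
    return index


def canonicalize(name: str, index: Dict[str, str]) -> str:
    """Return the canonical name, or the name itself if it is unrecognized."""
    return index.get(name, name)


# Illustrative yeast entries: ORF names act as the canonical names.
synonym_table = {
    "YBR020W": ["GAL1"],
    "YPL248C": ["GAL4"],
}
index = build_synonym_index(synonym_table)

names_in_file = ["GAL1", "YPL248C", "unknownGene"]
converted = [canonicalize(name, index) for name in names_in_file]
print(converted)
```

Once two data sources have both been passed through the same index, a gene listed as GAL1 in one and YBR020W in the other ends up under a single name and can be matched directly.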