Srinivasan et al. use this paper as a call to the community to begin developing whole-genome reference networks for key model organisms. The paper is part review (in that it summarizes methods of network generation and analysis) and part call to arms, arguing that reference networks are needed. It begins by describing systems biology as "the science of quantitatively defining and analyzing" functional modules, or components of biological systems.
There are many different definitions of systems biology, but the twin pillars of data integration and the study, at various levels of granularity, of biological systems are present in most of them. A focus on integration and top-down research, rather than the more traditional reductionist point of view, is also often mentioned.
The authors then divide systems
biology into three broad categories: high-level networks of the
interactome or metabolome, deterministic models of kinetics and
diffusion, and finally stochastic models of variation in cell lines.
This division would be slightly clearer if they specified continuous deterministic models and discrete stochastic models. I realize that these adjectives are generally assumed for these model types, but since it is their discreteness or continuity that drives the complexity of the models, the distinction is worth stating explicitly.
They collapse many
different types of network data into a single global interaction
network, stating that it would be prohibitively expensive to try to
prise out all of the sub-graphs, as variables such as time or
sub-cellular location are often not simple to pull out on their own.
This "lowest common denominator" method of network generation is not ideal, but, the authors attest, it provides more information than a genome sequence alone. In their networks, nodes represent proteins and edge weights are probabilities of association between proteins.
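Concretely, such a network can be represented as a weighted graph. A minimal Python sketch (the protein names and probabilities below are invented for illustration, not taken from the paper):

```python
# Toy probabilistic interaction network: nodes are proteins,
# edge weights are estimated probabilities of association.
# All names and numbers here are hypothetical.
network = {
    ("YFG1", "YFG2"): 0.92,  # strong evidence of interaction
    ("YFG1", "ABC3"): 0.40,  # weaker, single-experiment support
    ("ABC3", "XYZ7"): 0.05,  # likely noise
}

def neighbors(network, protein, threshold=0.5):
    """Return partners of `protein` with association probability >= threshold."""
    out = set()
    for (a, b), p in network.items():
        if p >= threshold:
            if a == protein:
                out.add(b)
            elif b == protein:
                out.add(a)
    return out

print(neighbors(network, "YFG1"))  # {'YFG2'}
```

Keeping probabilities on the edges, rather than hard yes/no calls, is what lets downstream analyses filter at whatever confidence threshold suits them.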
Noise is a real problem in most of these high-throughput data sets, and such data sets are not all created equal: one group may produce a very good gene expression data set, and another may not. How can variable data quality be dealt with? Early efforts integrated multiple networks and kept only those nodes and edges present in more than one network. Later methods of network generation that calibrated against "gold standards" produced better integrated networks.
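The gold-standard idea can be sketched roughly as follows: estimate each data set's reliability by how often its reported edges appear in a curated gold standard, then combine the evidence per edge. This is a hypothetical toy, not any specific published scoring scheme, and all data are invented:

```python
# Hypothetical sketch of gold-standard-based integration. A dataset's
# reliability is the fraction of its reported edges found in a curated
# gold standard; a candidate edge's score combines the reliabilities of
# the datasets supporting it, treating them as independent evidence.
gold_standard = {("A", "B"), ("B", "C")}

datasets = {
    "yeast_two_hybrid": {("A", "B"), ("A", "D"), ("C", "E")},
    "coexpression":     {("A", "B"), ("B", "C"), ("A", "D")},
}

def reliability(edges, gold):
    return len(edges & gold) / len(edges)

weights = {name: reliability(edges, gold_standard)
           for name, edges in datasets.items()}

def edge_score(edge):
    # P(edge is real) = 1 - product over supporting datasets of (1 - reliability),
    # i.e. the chance that not every supporting dataset is wrong.
    p_all_wrong = 1.0
    for name, edges in datasets.items():
        if edge in edges:
            p_all_wrong *= 1.0 - weights[name]
    return 1.0 - p_all_wrong

print(round(edge_score(("A", "D")), 2))  # 0.78
```

An edge backed by two mediocre data sets can thus outscore one backed by a single noisy screen, which is the intuition behind probabilistic integration.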
Methods of network analysis (rather than network creation) focus on network
alignment and experiment prioritization. The latter is a general term
for pulling out elements of the network that haven't been
experimentally verified, such as likely additions to known pathways or
important disease genes. The former is an interesting extrapolation to
networks of sequence alignments for genomes. In network alignments,
conserved modules of nodes are identified if they have "both conserved
primary sequences and conserved pair-wise interactions
between species". They specifically mention Graemlin, which is a tool
they have developed that can identify conserved functional modules across multiple networks.
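The "conserved pair-wise interaction" criterion can be illustrated with a toy example. To be clear, this is not Graemlin's actual algorithm, just the underlying notion; the ortholog pairs and edges are invented:

```python
# Toy illustration of conserved pair-wise interactions between two species.
# NOT Graemlin's algorithm -- just the core idea of network alignment:
# an edge is "conserved" if both endpoints have orthologs in the other
# species and the orthologous edge exists there too.
orthologs = {"yA": "wA", "yB": "wB", "yC": "wC"}  # species 1 -> species 2

edges_sp1 = {frozenset(("yA", "yB")), frozenset(("yB", "yC"))}
edges_sp2 = {frozenset(("wA", "wB")), frozenset(("wA", "wC"))}

def conserved_interactions(edges1, edges2, ortho):
    """Edges in species 1 whose orthologous edge also exists in species 2."""
    conserved = set()
    for e in edges1:
        a, b = tuple(e)
        if a in ortho and b in ortho and frozenset((ortho[a], ortho[b])) in edges2:
            conserved.add(e)
    return conserved

print(conserved_interactions(edges_sp1, edges_sp2, orthologs))
```

Here yA–yB survives because wA–wB exists, while yB–yC does not (there is no wB–wC edge). Scaling this idea to modules across many noisy networks is where tools like Graemlin earn their keep.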
Finally, they suggest that the reference networks should show only those reactions present in the "‘average cell’ of a given organism near the median of the norm of reaction".
While they acknowledge that, like the reference human genome sequence,
such a creation is a "useful fiction", it is my opinion that finding
the average cell will be much more difficult, and perhaps less
illuminating, than its equivalent in the sequencing world. Further, defining what is "normal" is genuinely difficult, and will vary from species to species. The PATO / quality ontology people (http://obofoundry.org/cgi-bin/detail.cgi?quality) have been aware of the problems with the "average" phenotype for a while now. I do, however, like their
idea of storing the reference networks using RDF, as that seems a
fitting format for networks. Overall, a laudable goal but one which
will need some more thinking about. I tried to run Graemlin using one of their example searches, and it didn't run (at least today); the main author's website also won't load for me today, though one of the other authors' pages did work.
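As an aside, the RDF suggestion is appealing because interactions map naturally onto triples, and reification lets confidence scores ride along with each edge. A hypothetical Turtle sketch (the namespace, terms, and score are all invented for illustration):

```turtle
@prefix ex:  <http://example.org/refnet/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A bare interaction as a single triple:
ex:proteinA ex:interactsWith ex:proteinB .

# Modelling the interaction as a resource of its own lets us
# attach an edge-weight probability, as the authors' networks require:
ex:interactionAB a ex:Interaction ;
    ex:participant  ex:proteinA , ex:proteinB ;
    ex:probability  "0.92"^^xsd:double .
```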
All in all, a useful review of recent network methods in bioinformatics, and an interesting goal. Low-noise reference networks for key model organisms, together with annotation tracks describing deviations from the norm, would be a good idea.
Topics for discussion (aka leading questions): More fine-grained reference implementations are already available, such as Reactome, a curated database of human biological pathways with inferred orthologous events for 22 other organisms. Do we need reference networks when we're gradually growing our knowledge of reference pathways? Are reference networks of "normal" organism states helpful? How do we define average? Would the median of the norm of a reaction differ under different environmental conditions? What if one group's notion of an average cell differs from another's? Having reference networks would make it easier to compare different network analysis programs. Would this end up being a major use of the networks? Or would such comparisons just lead to programs that overfit the reference network without working in a generic manner? What do others think?
Srinivasan, B.S., Shah, N.H., Flannick, J.A., Abeliuk, E., Novak, A.F., Batzoglou, S. (2007). Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinformatics, 8(5), 318-332. DOI: 10.1093/bib/bbm038