Whole-Genome Reference Networks for the Community
Srinivasan et al. use this paper as a call to the community to begin developing whole-genome reference networks for key model organisms. The paper is part review (it summarizes methods of network generation and analysis) and part call to arms, arguing that reference networks are needed. It begins by describing systems biology as “the science of quantitatively defining and analyzing” functional modules, or components of biological systems.
There are many different definitions of systems biology (see here, here, here, here and here, just to name a few), but generally it seems the twin pillars of data integration and study – at various levels of granularity – of biological systems are present in most of them. A focus on integration and top-down research rather than the more traditional reductionist point of view is also often mentioned.
The authors then divide systems biology into three broad categories: high-level networks of the interactome or metabolome, deterministic models of kinetics and diffusion, and finally stochastic models of variation in cell lines. This division would be slightly clearer if they specified continuous deterministic models and discrete stochastic models. I realize that these adjectives are generally assumed for these model types, but as it is their discrete- or continuous-ness that increases the complexity of the models, it is something that would be useful to include.
They collapse many different types of network data into a single global interaction network, stating that it would be prohibitively expensive to try to prise out all of the sub-graphs, as variables such as time or sub-cellular location are often not simple to pull out on their own. This “lowest common denominator” method of network generation is not ideal, but it does provide more information, they attest, than a simple genome sequence. In their networks, nodes represent proteins and edge weights are probabilities of association between proteins.
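To make the data structure concrete, here is a minimal sketch of collapsing several evidence-specific networks into one global network whose edge weights are probabilities of association. The combination rule (noisy-OR, which treats evidence sources as independent), the function name and the protein identifiers are all my own assumptions for illustration; the paper does not prescribe this rule.

```python
from collections import defaultdict

def collapse_networks(evidence_networks):
    """Merge per-evidence edge probabilities into one global network.

    Assumption: sources are combined with noisy-OR, i.e. an association
    is absent only if every evidence source fails to support it.
    """
    prob_absent = defaultdict(lambda: 1.0)
    for edges in evidence_networks.values():
        for (a, b), p in edges.items():
            # frozenset makes the edge undirected: (P1, P2) == (P2, P1)
            prob_absent[frozenset((a, b))] *= (1.0 - p)
    return {edge: 1.0 - q for edge, q in prob_absent.items()}

# Hypothetical evidence types and protein names
evidence = {
    "coexpression": {("P1", "P2"): 0.5, ("P2", "P3"): 0.2},
    "two_hybrid": {("P2", "P1"): 0.5},
}
global_net = collapse_networks(evidence)
```

Note what is lost: once collapsed, the global network no longer records which evidence type, time point or sub-cellular location supported each edge, which is exactly the “lowest common denominator” trade-off.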
Noise is a real problem in most of these high-throughput data sets, and such data sets are not all created equal: one group may make a very good gene expression data set, and another may not. How can variable quality of data be dealt with? Early efforts focused on integrating multiple networks and only taking those nodes and edges that were present in more than one network. After that, methods of network generation that used “gold standards” created better integrated networks.
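The early consensus approach can be sketched in a few lines: keep only those edges that appear in more than one network. The function name, the `min_support` parameter and the toy data are mine, not the authors’, and this says nothing about the later gold-standard methods.

```python
from collections import Counter

def consensus_edges(networks, min_support=2):
    """Return the undirected edges found in at least `min_support` networks."""
    counts = Counter()
    for edges in networks:
        # Build a set first so each network votes at most once per edge
        counts.update({frozenset(edge) for edge in edges})
    return {edge for edge, n in counts.items() if n >= min_support}

# Three hypothetical data sets of varying quality
networks = [
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "D")],
    [("C", "D"), ("A", "B")],
]
kept = consensus_edges(networks)
```

With this toy data, A-B and C-D survive while B-C (seen only once) is dropped. The obvious weakness is that every data set gets an equal vote regardless of its quality.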
Descriptions of network analyses (rather than network creation) focus on network alignment and experiment prioritization. The latter is a general term for pulling out elements of the network that haven’t been experimentally verified, such as likely additions to known pathways or important disease genes. The former is an interesting extension of genome sequence alignment to networks. In network alignments, conserved modules of nodes are identified if they have “both conserved primary sequences and conserved pair-wise interactions between species”. They specifically mention Graemlin, a tool they have developed that can identify conserved functional modules across multiple networks.
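As a toy illustration of the alignment criterion quoted above, the sketch below keeps an interaction from species A only if both proteins have an ortholog in species B (standing in for “conserved primary sequences”) and the orthologs also interact in B (“conserved pair-wise interactions”). This is emphatically not Graemlin’s algorithm; the ortholog map is assumed to be given, and all names are hypothetical.

```python
def conserved_edges(net_a, net_b, orthologs):
    """Edges of net_a whose ortholog pair also interacts in net_b.

    `orthologs` maps species-A proteins to species-B proteins and is
    assumed to come from a separate sequence comparison step.
    """
    edges_b = {frozenset(edge) for edge in net_b}
    conserved = set()
    for u, v in net_a:
        if u in orthologs and v in orthologs:
            if frozenset((orthologs[u], orthologs[v])) in edges_b:
                conserved.add(frozenset((u, v)))
    return conserved

# Hypothetical two-species example
net_a = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
net_b = [("b2", "b1"), ("b3", "b4")]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
shared = conserved_edges(net_a, net_b, orthologs)
```

Here only a1-a2 is conserved: a2-a3 has orthologs that do not interact in species B, and a4 has no ortholog at all. A real aligner would also have to cope with noisy edges, many-to-many orthology and more than two species.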
Finally, they suggest that the reference networks should show only those reactions present in the “‘average cell’ of a given organism near the median of the norm of reaction”.
While they acknowledge that, like the reference human genome sequence, such a creation is a “useful fiction”, it is my opinion that finding the average cell will be much more difficult, and perhaps less illuminating, than its equivalent in the sequencing world. Further, describing what is “normal” is truly difficult, and will vary from species to species. The PATO / quality ontology people (http://obofoundry.org/cgi-bin/detail.cgi?quality) have known about the problems facing the “average” phenotype for a while now. I do, however, like their idea of storing the reference networks using RDF, as that seems a fitting format for networks. Overall, a laudable goal but one which will need some more thinking about. I’ve tried to run Graemlin using one of their example searches, and it didn’t run (at least today), and the main author’s website won’t load for me today, though one of the other authors’ pages did work.
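On the RDF point, here is a rough sketch of how one weighted edge might be written as N-Triples. Because a plain triple cannot carry a weight, each interaction becomes its own resource with two participants and a probability literal. The `example.org` namespace, the predicate names and the modelling choice are all hypothetical on my part; the paper does not specify a schema.

```python
BASE = "http://example.org/refnet/"  # hypothetical namespace, not from the paper
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"

def edge_to_ntriples(edge_id, protein_a, protein_b, probability):
    """Serialize one weighted interaction as three N-Triples statements."""
    edge = f"<{BASE}interaction/{edge_id}>"
    return [
        f"{edge} <{BASE}participant> <{BASE}protein/{protein_a}> .",
        f"{edge} <{BASE}participant> <{BASE}protein/{protein_b}> .",
        f'{edge} <{BASE}probability> "{probability}"^^<{XSD_DOUBLE}> .',
    ]

triples = edge_to_ntriples("e1", "P1", "P2", 0.75)
print("\n".join(triples))
```

Modelling the interaction as a node also leaves room to hang further annotations on it later, such as the evidence type or a deviation-from-reference track.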
All in all, a useful review of recent network methods in bioinformatics, and an interesting goal described. Low-noise reference networks for key model organisms, together with the annotation tracks that would describe deviations from the norm, are a good idea.
Topics for discussion (aka leading questions):
More fine-grained reference implementations are available, such as Reactome. Reactome provides a curated database of human biological pathways, with inferred orthologous events for 22 other organisms. Do we need reference networks when we’re gradually growing our knowledge of reference pathways?
Are reference networks of “normal” organism states helpful? How do we define average? Would the median of the norm of reaction be different under different environmental conditions? What if what one group considers an average cell differs from another group’s average cell?
Having reference networks would mean easier comparisons of different network analysis programs. Would this end up being a major use of the networks? Would such comparisons just lead to network analysis programs that fit the reference network, but do not work in a generic manner?
What do others think?
Srinivasan, B.S., Shah, N.H., Flannick, J.A., Abeliuk, E., Novak, A.F., Batzoglou, S. (2007). Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinformatics, 8(5), 318-332. DOI: 10.1093/bib/bbm038