Data Integration

Background: Modelling Biological Systems (Thesis 1.3)

Modelling biological systems

Le Novère described the 1952 Hodgkin–Huxley model of the squid giant nerve fibre [1] as the beginning of computational systems biology [2]. Since that time, systems biology models have been available in a variety of representations and with a large number of corresponding syntaxes. This variation is mainly a result of the different approaches used to model pathways and interactions in systems biology. Biological networks are generally qualitative and large-scale, and are created to provide a high level of granularity. Most networks do not yet have the information required to run successful mathematical simulations of the interactions under study; their remit is much broader and they are often composed of transitive binary interactions discovered in high-throughput experiments rather than complex biology-based pathways.

However, many experiments also produce quantitative data. For instance, high-throughput experimentation creates both qualitative and quantitative signalling pathway data important for the understanding of cellular communication [3]. Therefore, modelling at the level of quantitative biological pathways is common in systems biology. Quantitative simulatable models are mainly used to enhance dynamic analyses and study biological pathways, and include specific details on the parametrisation of those pathways. The quantitative description of a biological pathway or behaviour is essential for high-quality systems biology, informing hypotheses and creating an iterative cycle of prediction and experimental verification [3, 4, 5]. This section describes the basics of both networks and quantitative models in systems biology.

1 Networks in systems biology

Most scientists perceive networks as views over sections of an in vivo cellular network [6]. There are five main types of network in systems biology: (i) transcription factor-binding networks, (ii) protein-protein interaction networks, (iii) protein phosphorylation networks, (iv) metabolic interaction networks and (v) genetic and small molecule interaction networks [7]. While such networks are just a conceptualisation of biological reality, they remain a useful virtual organisation of biological entities. For instance, network-mediated integration methodologies make use of the inherent graph-based organisation of networks to combine many different datasets.

The CID integrates interaction data for multiple organisms into a weighted PFIN. CID allows users to determine the reliability of a particular connection between two interactors. Additional work has made use of the inherent bias of a dataset to generate PFINs which include a relevance score [8]. Biases are exploited rather than eliminated; they can be introduced by an experiment being designed for a particular biological process or by choosing the final published data because it reflects the process of interest [8]. By adding a relevance score to the existing integrated confidence scores of a network, biased networks perform better than unbiased networks at gene function assignment and identification of important sets of interactors [8].

Genome-scale metabolic networks attempt to add new annotation as well as reconcile often-contradictory information present in the original networks and have been created for yeast [9, 10, 11] and human [12, 13]. The ultimate goal is to generate a view spanning an entire cellular network rather than sections of it.

Common formats for large-scale network data include linked data via the RDF [14] or semantically structured data via BioPAX [15]. RDF is a format which uses triples to create a directed, labelled graph of data and is the underpinning of the Semantic Web. A triple is similar to a sentence comprising a Subject, a Predicate and an Object [14]. A collection of RDF data can also be visualised as a graph where the Subjects and Objects are nodes linked together by Predicates, which form the edges between nodes.
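
The triple-to-graph correspondence described above can be sketched in a few lines of Python. This is only an illustration of the idea, not an RDF library: the `ex:` identifiers and the helper names `build_graph` and `nodes` are hypothetical and do not come from any real dataset or API.

```python
from collections import defaultdict

# Hypothetical triples (Subject, Predicate, Object); the "ex:" names are
# illustrative placeholders, not identifiers from a real resource.
triples = [
    ("ex:ProteinA", "ex:interactsWith", "ex:ProteinB"),
    ("ex:ProteinA", "ex:locatedIn", "ex:Cytoplasm"),
    ("ex:ProteinB", "ex:participatesIn", "ex:Glycolysis"),
]


def build_graph(triples):
    """Index triples as a directed, labelled graph: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def nodes(triples):
    """Subjects and Objects are the nodes; Predicates label the edges between them."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


graph = build_graph(triples)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Each printed line is one labelled edge, which is exactly one triple read as a sentence.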

Linked Life Data, one of the biggest networks of life science data, uses RDF to store and link its entities. Linked Life Data contains over one billion entities and over five billion statements (as of December 2011). ONDEX [16] is a tool for data visualisation and integration which has been used to generate [17, 18, 19] and visualise [18] biological networks. While ONDEX can export data as RDF, internally it uses a labelled graph with a similar level of expressivity to RDF. BioPAX describes pathways in great qualitative detail, but does not have the capacity to store parameters or other quantitative data about pathways and interactions.

2 From networks to mathematical models

Although networks are generally qualitative, they can aid in the creation of quantitative models; indeed, mathematical modelling is one way of studying the complex behaviour of networks [3]. Some researchers are even blurring the line between networks and models, attempting to create genome-scale models with partial quantisation of data. These integrated consensus networks sit on the border between qualitative large-scale networks and quantitative small-scale pathway models. Herrgård and colleagues [9] created the “Yeast 1.0” consensus network by re-formatting existing yeast interaction data in SBML [20], a structure more commonly used for quantitative modelling. Although Yeast 1.0 is represented in SBML, it does not yet contain enough quantitative data to be simulated.

Irrespective of a computational model’s size, reactions, effectors, kinetic rate equations and parameters of those equations need to be added for it to be fully functional [9]. Of those requirements, the original version of Yeast 1.0 only contained known reactions. Updates to the consensus network have vastly increased the connectivity of the nodes as well as the number of metabolites and enzymes, but have not yet increased its quantitative information [21]. Once complete, such network-scale models will benefit both from the large amount of data contained within them and from the ability to run in silico modelling experiments normally only accessible to smaller quantitative models.
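
To make the distinction between a reaction list and a simulatable model concrete, the following is a minimal Python sketch. It is not SBML and not how Yeast 1.0 is stored: the `Reaction` class, the `is_simulatable` check, the single-reaction example and the forward-Euler integrator are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A rate law maps (concentrations, parameters) to a reaction rate.
RateLaw = Callable[[Dict[str, float], Dict[str, float]], float]


@dataclass
class Reaction:
    """Hypothetical container for the four ingredients named in the text."""
    name: str
    reactants: Dict[str, int]          # species -> stoichiometric coefficient
    products: Dict[str, int]
    effectors: List[str] = field(default_factory=list)
    rate_law: Optional[RateLaw] = None
    parameters: Dict[str, float] = field(default_factory=dict)


def is_simulatable(reactions: List[Reaction]) -> bool:
    """True only if every reaction has a rate law and parameters for it."""
    return bool(reactions) and all(
        r.rate_law is not None and bool(r.parameters) for r in reactions
    )


def euler_step(conc: Dict[str, float], reactions: List[Reaction], dt: float) -> Dict[str, float]:
    """Advance concentrations by one forward-Euler step of size dt."""
    new = dict(conc)
    for r in reactions:
        rate = r.rate_law(conc, r.parameters)
        for species, n in r.reactants.items():
            new[species] -= n * rate * dt
        for species, n in r.products.items():
            new[species] += n * rate * dt
    return new


def mass_action(conc: Dict[str, float], params: Dict[str, float]) -> float:
    """Illustrative first-order mass-action law for A -> B."""
    return params["k"] * conc["A"]


# Structure only, as in a qualitative network: the reaction is known, nothing else.
network_only = [Reaction("A_to_B", {"A": 1}, {"B": 1})]
# The same reaction with a rate law and a (made-up) rate constant.
kinetic = [Reaction("A_to_B", {"A": 1}, {"B": 1},
                    rate_law=mass_action, parameters={"k": 0.5})]

print(is_simulatable(network_only))  # False
print(is_simulatable(kinetic))       # True
state = euler_step({"A": 1.0, "B": 0.0}, kinetic, dt=0.1)
print(state)
```

The first list carries the same topology as the second, yet only the second can be stepped forward in time, which is the gap the text describes for network-scale models.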

Formats such as SBML are able to describe pathways quantitatively, and were created to provide machine-readable formats for model simulation. Other resources such as the BioPAX ontology were created to describe pathways qualitatively. Even with this logical division in purpose between qualitative and quantitative descriptions of systems biology, some projects have begun to bridge the divide. The network-scale Yeast 1.0 SBML model does not yet contain quantitative information. Rather than producing Yeast 1.0 for the purposes of simulation, there is a commitment to realistic representation and high quality selection of reactions [21]. Additionally, research into adding quantitative information to the qualitative pathway ontology BioPAX via SBPAX is progressing [22].

Work by Smallbone and colleagues [23, 24] uses flux balance analysis to create kinetic models of metabolism using only information regarding the reaction stoichiometries. Even though there is little experimental data for the variables, the dynamics concerning the concentrations of cellular metabolites can be inferred. Work on genome-scale kinetic models has progressed by adding the information from kinetic models stored within BioModels to this flux balance analysis method [24].
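
For orientation, flux balance analysis is usually stated as a linear programme over reaction fluxes at steady state. The formulation below is the general textbook form, given here as background rather than as a reproduction of the specific method in [23, 24]. The symbols are assumptions introduced for this sketch: $x$ is the vector of metabolite concentrations, $S$ the stoichiometric matrix, $v$ the vector of reaction fluxes, $c$ a vector of objective weights, and $v_{\min}$, $v_{\max}$ the flux bounds.

```latex
\begin{aligned}
\frac{\mathrm{d}x}{\mathrm{d}t} &= S\,v = 0
  && \text{(steady-state assumption)} \\[4pt]
\max_{v} \quad & c^{\mathsf{T}} v \\
\text{subject to} \quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

Only $S$ is needed to pose the problem, which is why stoichiometry alone can serve as a starting point when kinetic parameters are unavailable.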

3 Quantitative modelling

Processes are the fundamental unit of systems biology, and the biological entities and associated quantitative data such as concentrations and rate constants are of prime importance for systems biology research [6]. The simulation and analysis of computational models that describe the dynamics of the interactions between biological entities is a vital facet of systems biology research [5]. Models and experiments are typically refined iteratively, as described in Section 1.2; models provide useful feedback to experimentalists, and experimental results help in turn to improve the models. The creation of systems biology models, such as those written in SBML or CellML [25], is primarily performed manually. When faced with such a time-consuming process, many researchers will not represent the biological context of a pathway or make use of many of the data sources and formats relevant to model development. While a small number of core databases can be used to retrieve a large amount of relevant biological information, accessing the “long tail” of information stored in other resources is a more complicated process. However, computational aids could help modellers retrieve new biological information easily and quickly.

Formats such as SBML and CellML provide machine-readable interpretations of biological pathways, complex formation and fundamental processes such as transcription and translation [26]. They are intended to make the task of understanding models easy for the programs that process them. The success of computational models in systems biology is not just shown by their prevalence in the literature, but also by the 231 programs and applications making use of them (as of December 2011). These programs allow the creation, simulation, analysis, annotation, and storage of SBML in a way that hides the underlying machine-readable format, making the model information accessible to humans. Further information on systems biology formats is available in Section 1.4.

4 Annotation of systems biology models

Systems biology models may contain both the quantitative information required to run a simulation of a biological system and biologically meaningful annotation describing the entities in the system. Annotation provides a description of how a model has been generated and defines the biology of its components in a computationally accessible fashion. However, while the mathematical information necessary for simulation models must be included, syntactically valid models capable of simulation are not required to contain explicit information about the biological context. Therefore, even though the presence of biological annotation aids efficient exchange, reuse, and integration of models, such information is often limited or lacking [27]. As a result, model usefulness is often limited to the person who created it; ambiguity in naming schemes and a lack of biological context hinder model reuse as an input in other computational tasks and as a reference for researchers [28, 29].

BioModels is a database of SBML models divided into curated and non-curated branches [27]. In the curated section, MIRIAM [28] compliance is assured and biological semantics have been added. BioModels curators regularly add biological annotations to an entry prior to promotion to the curated branch. These annotations resolve ambiguity of identification through links to external resources using stable URIs as provided by the MIRIAM Registry. However, model annotation either by BioModels curators or the modellers themselves is complex [27]. Programmatic methods to add such annotation would enrich publicly available models and therefore improve their quality and reusability.

In SBML, biological annotation is structured according to the MIRIAM specification [28]. There are three parts to MIRIAM: (i) a recommended URI-based structure for compliant annotations, (ii) a set of resources to generate and interpret those URIs and (iii) a checklist of minimal information requested in the annotation of biological models. While other annotations are allowed within the specification, MIRIAM annotations are the most relevant to the work presented here. MIRIAM annotations are added to models in a standardised way, and link external resources such as ontologies or data sources to a model. MIRIAM provides a standard structure for explicit links between the mathematical and biological aspects of a model. Aids to model annotation exist [30, 31, 22, 32, 33, 34], but rely extensively on the expert knowledge of the modeller for identification of appropriate additions. More information on such tools is available in Chapters 3 and 5. Ultimately, there is a need for computational approaches that automate the integration of multiple sources to decrease the annotation burden on the modeller.
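
As a rough illustration of what a URI-based annotation amounts to, the sketch below pairs a model element with a qualifier and a resource URI, then lists the elements that still lack annotation. It is a simplified assumption, not the SBML or MIRIAM data model: the `Annotation` class and `unannotated` helper are hypothetical, the `urn:miriam:<collection>:<identifier>` layout with percent-encoded identifiers is assumed here rather than taken from the text, and the element ids and accession values are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import quote


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record linking one model element to one external resource."""
    element_id: str   # id of the annotated model element (placeholder values below)
    qualifier: str    # relationship to the resource, e.g. "bqbiol:is"
    collection: str   # data collection namespace, e.g. "obo.go"
    identifier: str   # entry within that collection

    def urn(self) -> str:
        # Assumed URN layout; the identifier is percent-encoded so that
        # characters such as ":" do not clash with the URN separators.
        return f"urn:miriam:{self.collection}:{quote(self.identifier, safe='')}"


def unannotated(element_ids: Iterable[str], annotations: Iterable[Annotation]) -> List[str]:
    """Return the element ids that have no annotation, preserving input order."""
    covered = {a.element_id for a in annotations}
    return [e for e in element_ids if e not in covered]


# Illustrative data only: ids and accessions are placeholders.
annotations = [
    Annotation("s1", "bqbiol:is", "obo.go", "GO:0006096"),
    Annotation("s2", "bqbiol:isVersionOf", "uniprot", "P12345"),
]

for a in annotations:
    print(a.element_id, a.qualifier, a.urn())

print(unannotated(["s1", "s2", "s3"], annotations))
```

A report like the final line is the kind of gap an automated annotation aid would aim to fill, leaving the modeller to confirm rather than search.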


References

[1] A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544, August 1952.
[2] Nicolas Le Novère. The long journey to a Systems Biology of neuronal function. BMC Systems Biology, 1(1):28+, 2007.
[3] Anna Bauer-Mehren, Laura I. Furlong, and Ferran Sanz. Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Molecular Systems Biology, 5, July 2009.
[4] Hiroaki Kitano. Systems Biology: A Brief Overview. Science, 295(5560):1662–1664, March 2002.
[5] Stephan Philippi and Jacob Köhler. Addressing the problems with life-science databases for traditional uses and systems biology. Nature Reviews Genetics, 7(6):482–488, June 2006.
[6] Joanne S. Luciano and Robert D. Stevens. e-Science and biological pathway semantics. BMC Bioinformatics, 8 Suppl 3:S3+, 2007.
[7] Xiaowei Zhu, Mark Gerstein, and Michael Snyder. Getting connected: analysis and principles of biological networks. Genes & Development, 21(9):1010–1024, May 2007.
[8] Katherine James, Anil Wipat, and Jennifer Hallinan. Integration of Full-Coverage Probabilistic Functional Networks with Relevance to Specific Biological Processes. In Norman Paton, Paolo Missier, and Cornelia Hedeler, editors, Data Integration in the Life Sciences, volume 5647 of Lecture Notes in Computer Science, chapter 4, pages 31–46. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2009.
[9] Markus J. Herrgård, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalcin Arga, Mikko Arvas, Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Nicolas Le Novère, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana P. Oliveira, Dina Petranovic, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betul Kirdar, Merja Penttila, Edda Klipp, Bernhard O. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen, and Douglas B. Kell. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, 26(10):1155–1160, October 2008.
[10] Natalie C. Duarte, Markus J. Herrgård, and Bernhard Palsson. Reconstruction and Validation of Saccharomyces cerevisiae iND750, a Fully Compartmentalized Genome-Scale Metabolic Model. Genome Research, 14(7):1298–1309, July 2004.
[11] Jochen Förster, Iman Famili, Patrick Fu, Bernhard Ø. Palsson, and Jens Nielsen. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Research, 13(2):244–253, February 2003.
[12] Hongwu Ma, Anatoly Sorokin, Alexander Mazein, Alex Selkov, Evgeni Selkov, Oleg Demin, and Igor Goryanin. The Edinburgh human metabolic network reconstruction and its functional analysis. Molecular Systems Biology, 3, 2007.
[13] Natalie C. Duarte, Scott A. Becker, Neema Jamshidi, Ines Thiele, Monica L. Mo, Thuy D. Vo, Rohith Srivas, and Bernhard Ø. Palsson. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America, 104(6):1777–1782, February 2007.
[14] Dave Beckett. RDF/XML Syntax Specification (Revised). February 2004.
[15] Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Ruebenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovsky, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9):935–942, September 2010.
[16] Jacob Köhler, Jan Baumbach, Jan Taubert, Michael Specht, Andre Skusa, Alexander Rüegg, Chris Rawlings, Paul Verrier, and Stephan Philippi. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics, 22(11):1383–1390, June 2006.
[17] Artem Lysenko, Michael D. Platel, Keywan H. Pak, Jan Taubert, Charlie Hodgman, Christopher Rawlings, and Mansoor Saqi. Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis. BMC Bioinformatics, 12(1):203+, 2011.
[18] Jochen Weile, Matthew Pocock, Simon J. Cockell, Phillip Lord, James M. Dewar, Eva-Maria Holstein, Darren Wilkinson, David Lydall, Jennifer Hallinan, and Anil Wipat. Customizable views on semantically integrated networks for systems biology. Bioinformatics, 27(9):1299–1306, May 2011.
[19] Simon J. Cockell, Jochen Weile, Phillip Lord, Claire Wipat, Dmytro Andriychenko, Matthew Pocock, Darren Wilkinson, Malcolm Young, and Anil Wipat. An integrated dataset for in silico drug discovery. Journal of Integrative Bioinformatics, 7(3), 2010.
[20] M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[21] Paul Dobson, Kieran Smallbone, Daniel Jameson, Evangelos Simeonidis, Karin Lanthaler, Pinar Pir, Chuan Lu, Neil Swainston, Warwick Dunn, Paul Fisher, Duncan Hull, Marie Brown, Olusegun Oshota, Natalie Stanford, Douglas Kell, Ross King, Stephen Oliver, Robert Stevens, and Pedro Mendes. Further developments towards a genome-scale metabolic model of yeast. BMC Systems Biology, 4(1):145+, October 2010.
[22] O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
[23] Kieran Smallbone, Evangelos Simeonidis, David S. Broomhead, and Douglas B. Kell. Something from nothing – bridging the gap between constraint-based and kinetic modelling. FEBS Journal, 274(21):5576–5585, November 2007.
[24] Kieran Smallbone, Evangelos Simeonidis, Neil Swainston, and Pedro Mendes. Towards a genome-scale kinetic model of cellular metabolism. BMC Systems Biology, 4(1):6+, January 2010.
[25] Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
[26] M. Hucka, A. Finney, B. J. Bornstein, S. M. Keating, B. E. Shapiro, J. Matthews, B. L. Kovitz, M. J. Schilstra, A. Funahashi, J. C. Doyle, and H. Kitano. Evolving a lingua franca and associated software infrastructure for computational systems biology: the Systems Biology Markup Language (SBML) project. Systems Biology, 1(1):41–53, June 2004.
[27] Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novère, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
[28] Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[29] Camille Laibe and Nicolas Le Novère. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
[30] Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
[31] Peter Li, Tom Oinn, Stian Soiland, and Douglas B. Kell. Automated manipulation of systems biology models using libSBML within Taverna workflows. Bioinformatics, 24(2):287–289, January 2008.
[32] Neil Swainston and Pedro Mendes. libAnnotationSBML: a library for exploiting SBML annotations. Bioinformatics, 25(17):2292–2293, September 2009.
[33] M. L. Blinov, O. Ruebenacker, and I. I. Moraru. Complexity and modularity of intracellular networks: a systematic approach for modelling and simulation. IET Systems Biology, 2(5):363–368, September 2008.
[34] Falko Krause, Jannis Uhlendorf, Timo Lubitz, Marvin Schulz, Edda Klipp, and Wolfram Liebermeister. Annotation and merging of SBML models with semanticSBML. Bioinformatics, 26(3):421–422, February 2010.

By Allyson Lister
