Data Integration

Background: Standards as a shared structure for data (Thesis: 1.4)


Standards as a shared structure for data

As described in Section 1.3, systems biology benefits from the use of common standards for describing data and from unified naming schemes to ensure precise identification of entities. The research community must make use of standardised methods to increase the annotation of data to a point where it matches the rate of data generation [1]. Researchers must adapt, documenting and managing their data “with as much professionalism as they devote to their experiments” [2]. However, the level of support that consistent adoption and deployment of standards requires is a difficult commitment for individuals [2]. Careful deployment by bioinformaticians can make standards use transparent to the researchers generating the data. This section provides an overview of the standards specific to systems biology as well as those useful both to systems biologists and the wider life sciences community.

Virtually every life science community has at least one proposed or accepted standard, including transcriptomics [3, 4], proteomics [5], genomics [6], flow cytometry [7] and systems biology [8, 9]. The MIBBI Registry alone contains 32 minimal information checklists for various experimental types. Standards are difficult to create and maintain in terms of manpower, money, and consensus. Standards begin with islands of initial researchers in a field, who gradually develop into a nascent community. With new scientific developments, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with peers and among machines becomes more important and standards become a critical requirement.

The best solution may seem to be the creation of an elegant or complex ontology which models a domain in a richly described and logically rigorous way. However, the most practical solutions are generally easy to use and intuitive to understand, characteristics which do not always apply to more complex ontologies. Some solutions involve relying on realist upper level ontologies such as the BFO [10], but even experienced researchers find limitations and difficulties in such highly philosophical solutions [11, 12, 13, 14]. In addition to the complexity of the solution, the practicalities of creating a community standard require a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months—or years—of meetings, document revisions and conference calls to derive a working standard. For instance, OBI [15] has been in development for seven years and is yet to be officially published. Despite the time it takes to reach consensus, even the best structure or semantics will not guarantee usage if the work has been developed without the input of the wider community.

While it may seem that only one standard is needed for each community, there are in fact two axes along which multiple standards can be developed. The first is the axis of scope and the second is the axis of type. Even though a single community may wish to describe a single type of information (for instance, a systems biology model), the amount and scope of the data might be too large to feasibly fit into just one standard. There are three types of standard which must be considered: the minimal descriptive content, or metadata, a piece of data must include; the standard syntax to which that data must adhere; and a standard semantics for describing the meaning of the data in a way understandable to both computers and humans. Unlike the scope of a standard, which changes depending on the community, the types of standard are consistent across communities. Figure 1 summarises these axes with the use of existing systems biology community standards. The standards relevant to systems biology and to the work presented in this thesis are described in the sections below.

Figure 1: The two axes of standards development, illustrated using systems biology community standards. The horizontal axis is the scope of the standard and the vertical axis is the type of standard. The content and syntax standards for system behaviour are empty as these standards have not yet been finalised. *BioPAX is both a syntax standard and a semantic standard. Written in OWL, it provides both the structure and the meaning for representing pathways qualitatively. Adapted from [16], first presented in [17].

Some researchers may wish to describe or visualise a mathematical model while others might be more interested in storing the simulation details or the results of multiple simulation runs. Though mathematical descriptions, visualisations and results storage are all separate activities, they fall within a single community and are related via the computational model itself. These activities create the scope of the various systems biology standards: one representing the model, another describing a simulation and a third structuring the results of a simulation. These divisions are present within the SBML community, and at least three other similar – but not identical – representation standards have also been developed in the systems biology community as a whole [18, 19, 20]. The columns in Figure 1 sort the most common systems biology standards according to scope.

Discovering where to delineate areas within a single community’s standards can be difficult. Further, the scope of a standard might overlap with another community’s, requiring cross-community participation. Such efforts result in higher-level standards which many different communities can utilise. Resources such as OBI and the FuGE [21] standards have a very broad scope and a correspondingly high level of granularity, enabling the description of virtually any experiment. Individual communities are meant to extend these upper-level standards to provide terminology and structure to meet their specific needs at a lower level of granularity. By sharing a common upper-level standard, integration and reuse of the data it organises becomes much easier. Tracking how standards interconnect is a task in itself, and one for which dedicated organisations such as BioSharing [22] and the MIBBI Registry [23] were created.

1 Systems biology modelling standards

This section describes a number of standards important in systems biology according to the type of the standard, as described in Figure 1. Greater detail is provided for those standards used extensively in the work described in this thesis. In particular, a comparison of the main systems biology formats is provided in Figure 2. Other cross-community standards commonly used in systems biology are described as they are introduced.

Figure 2: A graphical comparison of the capabilities of BioPAX, SBML, PSI-MIF and CellML. While CellML and SBML can model many of the same concepts, CellML is also capable of modelling tissue and organ interactions, and SBML currently provides richer biological annotations. While PSI-MIF and BioPAX can both be used to model interactions, BioPAX is able to model more types of interaction as well as pathways. The figure shows BioPAX as capable of modelling pathways in low detail because BioPAX does not capture quantitative information. Modified from [24].

Content standards

Content standards provide a checklist for the minimal descriptive content an experimental type must include. The results and conclusions of a scientific investigation are dependent upon the investigation’s context, such as the methods and other metadata describing an experiment. Defining a common set of metadata to guide researchers in reporting scientific context is increasingly gaining favour with data repositories, journals and funders [23]. Checklists outline the minimal information required to evaluate, interpret and disseminate an experiment; such guidelines effectively define what is contained within a scientific dataset and how the set was generated. Currently, the MIBBI Registry provides a list of these guidelines.

MIRIAM is a content standard which provides a minimal checklist for interpreting a model correctly, a controlled method of providing annotation through URIs [25] and, via the MIRIAM Registry, Web services for correctly resolving these URIs and for listing supported data types [26]. MIRIAM annotations are URIs which are added to SBML in a standardised way and link external resources such as ontologies or data sources to a model. MIRIAM provides a standard scheme for unambiguously identifying biological entities in networks and models; without such a scheme, the quality of the data suffers [27]. By providing an integration methodology which enhances MIRIAM annotations, this thesis aids collaboration and data reuse in systems biology.

While models complying with MIRIAM provide consistently named annotations about the model to allow for its correct interpretation and reuse, the MIRIAM checklist does not cover the reproduction of simulation results. For this, the MIASE was created [28]. This checklist describes the models to use, the modifications made to those models, the order in which all simulation procedures were applied, how the raw results were processed and a description of the final output [28].
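The checklist nature of a content standard can be sketched in a few lines of Python. This is a minimal illustration only: the field names below are hypothetical paraphrases of the five MIASE items listed above and are not taken from the MIASE guidelines or from any tool.

```python
# Hypothetical field names paraphrasing the five MIASE items described above.
MIASE_ITEMS = (
    "models",               # the models to use
    "model_modifications",  # the modifications made to those models
    "simulation_steps",     # the ordered simulation procedures
    "result_processing",    # how the raw results were processed
    "output_description",   # a description of the final output
)


def missing_miase_items(description):
    """Return the checklist items that are absent from a description.

    An item counts as present if its key exists with a non-None value,
    so an explicit empty list ("no modifications were made") is accepted.
    """
    return [item for item in MIASE_ITEMS if description.get(item) is None]


# An incomplete, made-up simulation description.
experiment = {
    "models": ["BIOMD0000000001"],
    "model_modifications": [],
    "simulation_steps": ["time course"],
}

print(missing_miase_items(experiment))
```

A check of this kind only confirms that each item has been reported; it says nothing about whether the reported content is correct or sufficient.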

Syntax standards

While the difficulty inherent in translating native data schemas to a unified format is just one of the integration challenges facing researchers [29], it can also be one of the most easily remedied if there is a strong focus within a community for creating a common syntax. Choosing a structure for the data such as XML, RDF or even structured flat files creates a single format and eases data sharing and reuse. Syntax standards aim to provide a common structure for all data of a given experimental type.

SBML is a syntax standard which allows the exchange and reuse of quantitative models for systems biology [8]. SBML is primarily an XML format for describing computational models in systems biology. It is supported by a large community and a wide range of tools, allowing model generation, analysis and curation in any one of the many independently maintained software applications. While the majority of models written in this format describe relatively small and well contained pathways, SBML is capable of storing larger-scale views of entire metabolic networks [30]. SBML models tend to remain small due to limitations in both the simulation environments and in the kinetic data available for parametrising the models. Herrgård and colleagues [30] sidestep this problem by currently providing only qualitative information for their large SBML metabolic network, without parametrisation.

Metadata, or extra information about any component of an SBML model, is stored in two ways: (i) as a link to an SBO [31] term and (ii) as an RDF triple structured according to the MIRIAM specification. In the SBML data model, each element inherits from the SBase top-level class which has an optional attribute sboTerm for referencing a specific term in SBO [32]. Further biological annotations can be added to the annotation element of any SBase class. Although the annotation element may contain any RDF, annotation complying with the MIRIAM specification must be in the form of valid MIRIAM identifiers. Resolution of these identifiers is available via the MIRIAM Registry [26]. An example of annotation within SBML is shown in Figure 3. A detailed description of the components and constraints on an SBML model is available in the specification document [33].

Figure 3: How SBO and MIRIAM annotation are used within an SBML element. MIRIAM-compliant RDF is present within the annotation element and via the sboTerm attribute linking to the SBO term SBO:0000297, “protein complex”. RDF namespaces of the rdf element and some children of the species element have been removed to aid readability of the figure. The identifiers used in this example are URNs rather than URIs; BioModels is in the process of converting all URNs to URIs. XML taken from the “BLL” species of BioModels entry BIOMD0000000001.
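The two annotation mechanisms can be read back out of an SBML element with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a hand-written fragment. The fragment is an assumption made for illustration: only the species id "BLL" and the term SBO:0000297 come from the text, while the metaid, the choice of the isVersionOf qualifier and the URN "urn:miriam:example-datatype:EX0001" are placeholders, not the real BioModels annotation.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Illustrative fragment; the metaid and the MIRIAM URN are placeholders.
SBML_SNIPPET = """
<species xmlns="http://www.sbml.org/sbml/level2/version4"
         metaid="_000003" id="BLL" sboTerm="SBO:0000297">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="#_000003">
        <bqbiol:isVersionOf>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:example-datatype:EX0001"/>
          </rdf:Bag>
        </bqbiol:isVersionOf>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def read_annotation(species_xml):
    """Return (sboTerm, [annotation resources]) for one SBML element."""
    species = ET.fromstring(species_xml)
    sbo_term = species.get("sboTerm")  # unqualified attribute on the element
    resources = [
        li.get(f"{{{RDF_NS}}}resource")  # namespace-qualified rdf:resource
        for li in species.iter(f"{{{RDF_NS}}}li")
    ]
    return sbo_term, resources


def split_miriam_urn(urn):
    """Split 'urn:miriam:<data type>:<identifier>' into its last two parts."""
    prefix, scheme, data_type, identifier = urn.split(":", 3)
    if (prefix, scheme) != ("urn", "miriam"):
        raise ValueError(f"not a MIRIAM URN: {urn}")
    return data_type, identifier


sbo, resources = read_annotation(SBML_SNIPPET)
print(sbo, resources, split_miriam_urn(resources[0]))
```

The sboTerm is a plain attribute of the element itself, whereas the MIRIAM identifiers sit inside namespaced RDF, which is why the two are retrieved differently.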

The SBGN is the first community standard for the graphical representation of systems biology models [34]. Though many different notations were available prior to SBGN, those efforts dealt mainly with notation proposals and software implementations without seeking the backing of the entire community and without addressing the many biological and technical needs of the users. Specifically, SBGN was created for the following purposes: to be semantically, syntactically and visually concise and unambiguous; to be free of copyright restrictions; to minimise the number of possible symbols; to support modularity as well as many different biological entities; and to support the automated generation of diagrams based on simulatable models [34]. Conversion between SBGN and a descriptive format such as SBML is only possible through the shared semantics such as those provided by SBO [32].

SED-ML is the format counterpart to the MIASE checklist and provides an XML structure for describing simulations that are independent of both the model encoding and the software used to perform the simulations [35]. SED-ML tasks are used to link a model to a particular simulation setting and a DataGenerator is used to structure the post-processing steps used on the simulation result before final output [35].
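The separation described above, in which a task links a model to a simulation setting and a data generator post-processes the result, can be sketched with plain Python data classes. The class and field names here are hypothetical and chosen for illustration; they are not the SED-ML element definitions, and the raw results are made-up numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Links one model to one simulation setting (both held by reference)."""
    id: str
    model: str       # reference to a model, whatever its encoding
    simulation: str  # reference to a simulation setting


@dataclass
class DataGenerator:
    """Describes a post-processing step applied to the results of a task."""
    id: str
    task: str
    transform: Callable[[List[float]], List[float]]


def generate(generator: DataGenerator,
             raw_results: Dict[str, List[float]]) -> List[float]:
    """Apply a generator's post-processing to the raw results of its task."""
    return generator.transform(raw_results[generator.task])


task = Task(id="task1", model="model1", simulation="sim1")

# Post-processing step: scale every value by the maximum of the series.
normalise = DataGenerator(
    id="dg1",
    task=task.id,
    transform=lambda values: [v / max(values) for v in values],
)

raw = {"task1": [1.0, 2.0, 4.0]}  # made-up raw simulation output
print(generate(normalise, raw))
```

Because the task holds only references, the same description could in principle be reused with a different model encoding or simulator, which is the independence the format aims for.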

Like SBML, CellML is an XML format used to encode and exchange quantitative systems biology models. Unlike SBML, CellML is able to describe a wider range of mathematical expressions [36]. Additionally, CellML has a component-based structure, allowing reuse of individual modules of a model in a way currently impossible for SBML. However, SBML user support and software functionality are much greater than those provided by CellML, and SBML has a much more active user community. While historically SBML was able to provide richer biological annotations, recent developments within CellML have all but resolved this limitation.

The Physiome project stores models of integrative functions of cells, organs and organisms [37]. This project was expressly developed for modelling at both a cellular and at an organ level, and the models can vary from non-simulatable diagrammatic schema to fully quantitative computational models. A final example is the Virtual Cell project, which provides not just the VCML XML format for describing mathematical and biological models for simulation, but also an entire software infrastructure and user interfaces for performing creation, simulation and analysis of those models [38].

Semantic standards

Describing the meaning, or semantics, of data and its associated metadata is a complex and difficult problem being addressed in the life sciences through the use of controlled vocabularies and ontologies [39]. The use of ontologies for describing data has a twofold purpose. Firstly, ontologies help ensure consistent annotation, such as the spellings and choice of terms. Consistent naming of terms allows the use of a single word-definition pair to describe a single concept. Secondly, ontologies can add human- and computationally amenable semantics to the data. The curation of datasets with common ontology terms minimises querying and integration errors due to semantic ambiguity by providing a method of consistent annotation; for instance, GO is a community-driven ontology in widespread use, and its presence in many datasets creates semantic links between them [40, 41, 42, 43, 39]. In addition to ontologies, RDF can be used to organise life science data in a generic, high-level fashion. RDF can provide a simple, single format for scientific data, but cannot provide a biologically meaningful semantic layer.

SBO is a semantic standard initially developed for the addition of a biologically meaningful semantic layer on quantitative systems biology models [44]. SBO provides unambiguous terms for biological annotation [31], therefore increasing MIRIAM compliance. Initially, only SBML models supported SBO’s use; currently many other formats such as CellML use SBO. Further, SBO is capable of more than simple entity annotation:

  • every SBGN glyph corresponds exactly to an SBO term, allowing model conversion from a simulation format such as SBML to the graphical SBGN notation;
  • SBO aids conversion between pathway formats utilising its terms;
  • SBO-annotated models can be translated between continuous deterministic frameworks and discrete stochastic frameworks; and
  • models can be merged more cleanly and precisely when model entities are unambiguously defined with SBO [32].

While SBO is natively stored as a relational database, it can be exported either in OBO [39] or OWL on demand. The only property used within SBO is the subsumption, or “is a” relationship. A summary of the main classes and their children is available in Figure 4.

Figure 4: A summary of the SBO subsumption hierarchy, taken from [31, Box 1]. The seven orthogonal branches of the SBO hierarchy each have their own colour. Dashed lines indicate that intermediate terms have been removed from the summary for readability.
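Because subsumption is the only relationship, reasoning over such a hierarchy reduces to following "is a" links upwards. The sketch below shows this on a toy hierarchy. Only the term name "protein complex" is taken from this section; the parent links and the other term names are assumptions made for illustration and have not been copied from SBO.

```python
# Toy "is a" hierarchy: each term maps to its direct parents.
# The parent links are illustrative assumptions, not an extract of SBO.
IS_A = {
    "protein complex": ["macromolecular complex"],
    "macromolecular complex": ["material entity"],
    "material entity": [],
}


def ancestors(term):
    """Return every term reachable from `term` by following "is a" links."""
    found = []
    to_visit = list(IS_A[term])
    while to_visit:
        parent = to_visit.pop()
        if parent not in found:
            found.append(parent)
            to_visit.extend(IS_A[parent])
    return found


def is_a(term, candidate):
    """True if `term` is `candidate` or is subsumed by it."""
    return term == candidate or candidate in ancestors(term)


print(ancestors("protein complex"))
print(is_a("protein complex", "material entity"))
```

A query of this kind lets software treat an element annotated with a specific term as an instance of any of its more general ancestors.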

All information other than the SBO class names, identifiers, synonyms and subsumption hierarchy is contained either within human-readable annotations on each SBO term or within MathML annotations. Human-readable annotations include general comments, history of the term and a textual definition. Any constraints on the usage of an SBO term are not defined in the ontology itself, but rather in the other formats (such as SBML) where the constraint is applied.

The MIASE checklist requires that the applied algorithm and the initial set-up for a model simulation are described. However, as some algorithms are proprietary and others are not well documented, repeatability can be difficult. KiSAO is an ontology which describes algorithms, unambiguously identifying those which are similar enough to perform a particular simulation task [31]. Unlike SBO, the native encoding of KiSAO is OWL. KiSAO can be used in conjunction with SED-ML to allow software to automatically choose the best algorithm for a simulation.

TEDDY is an ontology which models the dynamic behaviour, observable phenomena and control elements of systems and synthetic biology models [31]. It is still at an early stage of development, but the ontology and some limited documentation are available. Like KiSAO, its native format is OWL.

BioPAX is an OWL ontology created to qualitatively describe biological pathway information [18]. It uses GO and the cell type ontology [45] to describe compartments and locations as well as the NCBI taxonomy database for organisms [46]. In contrast to other standards such as SBML which provide a syntactic representation, BioPAX provides a semantic representation of pathways. An example of BioPAX properties and classes is available in Figure 5 and a summary of classes in BioPAX Level 3 is available in Figure 6.

While SBML is primarily intended for quantitative encoding of pathways for simulation and PSI-MIF is capable only of storing binary interactions, BioPAX is intended to store both levels of granularity equally well, albeit at a qualitative level [47]. For a graphical representation of the differences between BioPAX, SBML, PSI-MIF and CellML, see Figure 2. Each release, or level, of BioPAX has increased the expressivity of the ontology. The latest version of BioPAX is Level 3, which is capable of modelling signalling pathways, molecular state, gene regulation and genetic interactions [18]. As a comparison, the earliest release, Level 1, supported only metabolic pathways. However, most databases such as Pathway Commons still provide their data in BioPAX Level 2, which adds molecular interactions and post-translational modifications.

Figure 5: The AKT pathway shown graphically (left) and using BioPAX (right). Taken from [18, Figure 3].

Although BioPAX provides a detailed semantic representation of pathways and interactions, it has a number of limitations: there is no way of explicitly describing broader experimental metadata other than through simple cross-references [47, 48], no ability to represent mathematical relations other than providing chemical details about interactions [47, 48], and dynamic and quantitative aspects of processes are not supported [18]. Although there is a BioPAX export available for all entries in the BioModels database, such a conversion is not perfect. However, a new bridge between SBML and BioPAX in the form of a quantitative module for BioPAX, called SBPAX, is under development [49].

Figure 6: High level overview of BioPAX Level 3. Taken from [18, Figure 4].

2 Data Sources for Systems Biology

This section describes three commonly used data sources for the creation of systems biology models which have been used in the work described in this thesis (see Chapter ). This list is not intended to be exhaustive, and many other databases are available. For information on BioModels, a database for storing systems biology models, see Section 1.3.


BioGRID

BioGRID stores 24 different types of interactions and exports pairs of interacting entities in PSI-MIF 2.5 format [5]. As shown in Figure 2, Pathway Commons and BioGRID store similar types of data, but have different underlying representations. As this thesis was being completed, Pathway Commons began importing limited BioGRID data, allowing retrieval of some BioGRID data in BioPAX format and simplifying the integration process for that portion of BioGRID.


UniProtKB

The UniProtKB is a comprehensive public protein sequence and function database, consisting both of manually curated and automatically annotated data [50]. While some limited pathway and interaction data is provided, mainly within the comments and feature table sections of a UniProtKB entry, its main use in the creation of systems biology models is as a high quality reference for protein information. It also contains 144 cross references to other resources such as GO and IntAct [51], many of which are useful for model creation.

Pathway Commons

Pathway Commons is a database which provides pathway and interaction data in a number of formats, including BioPAX. The Pathway Commons binary interaction data has limitations not present in the BioPAX format itself. For binary interactions, Pathway Commons describes the participants in a reaction without providing any directionality. Specifically, the participant subtypes available within BioPAX (product, reactant and modifier) are unused. Where pathway (rather than interaction) data is provided from Pathway Commons, more complete use of BioPAX is possible. Pathway Commons uses BioPAX Level 2, which has a number of limitations compared with BioPAX Level 3, as described under Semantic standards above.
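The practical consequence of losing participant roles can be shown with a small sketch. The two classes below are hypothetical data structures written for this illustration; they are not BioPAX classes and do not reproduce how Pathway Commons derives its binary interactions. The flattening into all unordered pairs is likewise an assumption, used only to show that roles and direction cannot be recovered afterwards.

```python
import itertools
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class DirectedReaction:
    """A reaction in which every participant has an explicit role."""
    reactants: Tuple[str, ...]
    products: Tuple[str, ...]
    modifiers: Tuple[str, ...] = ()


@dataclass(frozen=True)
class BinaryInteraction:
    """An unordered pair of participants with no roles and no direction."""
    participants: FrozenSet[str]


def to_binary(reaction):
    """Flatten a reaction into unordered pairs, discarding all roles."""
    names = sorted(
        set(reaction.reactants + reaction.products + reaction.modifiers)
    )
    return {
        BinaryInteraction(frozenset(pair))
        for pair in itertools.combinations(names, 2)
    }


# Two different reactions: A is converted to B, or B is converted to A.
forward = DirectedReaction(reactants=("A",), products=("B",), modifiers=("E",))
reverse = DirectedReaction(reactants=("B",), products=("A",), modifiers=("E",))

# Once flattened, the two reactions are indistinguishable.
print(to_binary(forward) == to_binary(reverse))
```

Any integration step that starts from role-free pairs therefore has to obtain direction and role information from another source.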

3 Data standards and data integration

Although data standards are key to data sharing and reuse, they do not provide a complete solution. By their very nature, standards cannot be implemented until a new experimental type has been available long enough to produce a list of standard requirements. Further, while orthogonality is important, in practice there remains some level of overlap among standards. As such, the integration of information stored in different representations remains important. To integrate systems biology data successfully, the data needs to be human readable and computationally accessible. Section 1.6 describes data integration methodologies and their use within systems biology.


Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
Community cleverness required. Nature, 455(7209):1, September 2008.
Ron Edgar and Tanya Barrett. NCBI GEO standards and services for microarray data. Nature Biotechnology, 24(12):1471–1472, December 2006.
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, Terry Gaasterland, Patrick Glenisson, Frank C. P. Holstege, Irene F. Kim, Victor Markowitz, John C. Matese, Helen Parkinson, Alan Robinson, Ugis Sarkans, Steffen Schulze-Kremer, Jason Stewart, Ronald Taylor, Jaak Vilo, and Martin Vingron. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics, 29(4):365–371, December 2001.
Henning Hermjakob, Luisa Montecchi-Palazzi, Gary Bader, Jérôme Wojcik, Lukasz Salwinski, Arnaud Ceol, Susan Moore, Sandra Orchard, Ugis Sarkans, Christian von Mering, Bernd Roechert, Sylvain Poux, Eva Jung, Henning Mersch, Paul Kersey, Michael Lappe, Yixue Li, Rong Zeng, Debashis Rana, Macha Nikolski, Holger Husi, Christine Brun, K. Shanker, Seth G. Grant, Chris Sander, Peer Bork, Weimin Zhu, Akhilesh Pandey, Alvis Brazma, Bernard Jacq, Marc Vidal, David Sherman, Pierre Legrain, Gianni Cesareni, Ioannis Xenarios, David Eisenberg, Boris Steipe, Chris Hogue, and Rolf Apweiler. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nature biotechnology, 22(2):177–183, February 2004.
Guy Cochrane, Ruth Akhtar, James Bonfield, Lawrence Bower, Fehmi Demiralp, Nadeem Faruque, Richard Gibson, Gemma Hoad, Tim Hubbard, Christopher Hunter, Mikyung Jang, Szilveszter Juhos, Rasko Leinonen, Steven Leonard, Quan Lin, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Sheila Plaister, Rajesh Radhakrishnan, Stephen Robinson, Siamak Sobhany, Petra T. Hoopen, Robert Vaughan, Vadim Zalunin, and Ewan Birney. Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research, 37(suppl 1):D19–D25, January 2009.
Yu Qian, Olga Tchuvatkina, Josef Spidlen, Peter Wilkinson, Maura Gasparetto, Andrew Jones, Frank Manion, Richard Scheuermann, Rafick P. Sekaly, and Ryan Brinkman. FuGEFlow: data model and markup language for flow cytometry. BMC Bioinformatics, 10(1):184+, June 2009.
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, , the rest of the SBML Forum:, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
Andrew Miller, Justin Marsh, Adam Reeve, Alan Garny, Randall Britten, Matt Halstead, Jonathan Cooper, David Nickerson, and Poul Nielsen. An overview of the CellML API and its implementation. BMC Bioinformatics, 11(1):178+, April 2010.
Robert Arp and Barry Smith. Function, Role, and Disposition in Basic Formal Ontology. Nature Precedings, (713).
Phillip Lord and Robert Stevens. Adding a Little Reality to Building Ontologies for Biology. PLoS ONE, 5(9):e12258+, September 2010.
Phillip Lord. An evolutionary approach to Function. Journal of Biomedical Semantics, 1(Suppl 1):S4+, 2010.
Robert Stevens. Unicorns in my Ontology, May 2011.
Michel Dumontier and Robert Hoehndorf. Realism for scientific ontologies. In Proceeding of the 2010 conference on Formal Ontology in Information Systems: Proceedings of the Sixth International Conference (FOIS 2010), pages 387–399, Amsterdam, The Netherlands, The Netherlands, 2010. IOS Press.
The OBI Consortium. OBI Ontology.
V. Chelliah, L. Endler, N. Juty, C. Laibe, C. Li, N. Rodriguez, and N. Le Novere. Data Integration and Semantic Enrichment of Systems Biology Models and Simulations. In N. W. Paton, P. Missier, and C. Hedeler, editors, Data Integration in the Life Sciences, Proceedings; Lecture Notes in Computer Science; 6th International Workshop on Data Integration in the Life Sciences, volume 5647, pages 5–15. [Chelliah, Vijayalakshmi; Endler, Lukas; Juty, Nick; Laibe, Camille; Li, Chen; Rodriguez, Nicolas; Le Novere, Nicolas] EMBL European Bioinformat Inst, Cambridge CB10 1SD, England.; Le Novere, N, EMBL European Bioinformat Inst, Wellcome Trust Genome Campus, Cambridge CB10 1SD, England., July 2009.
Nicolas Le Novere. Principled annotation of quantitative models in systems biology. In Genomes to Systems, 2008.
Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Reubenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovksy, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, September 2010.
Alan Garny, David P. Nickerson, Jonathan Cooper, Rodrigo W. Santos, Andrew K. Miller, Steve Mckeever, Poul M. F. Nielsen, and Peter J. Hunter. CellML and associated tools and techniques. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 366(1878):3017–3043, September 2008.
Leslie M. Loew. The Virtual Cell project. Novartis Foundation symposium, 247, 2002.
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
Dawn Field, Susanna-Assunta A. Sansone, Amanda Collis, Tim Booth, Peter Dukes, Susan K. Gregurick, Karen Kennedy, Patrik Kolar, Eugene Kolker, Mary Maxon, Siân Millard, Alexis-Michel M. Mugabushaka, Nicola Perrin, Jacques E. Remacle, Karin Remington, Philippe Rocca-Serra, Chris F. Taylor, Mark Thorley, Bela Tiwari, and John Wilbanks. Megascience. ’Omics data sharing. Science (New York, N.Y.), 326(5950):234–236, October 2009.
Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma, Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper, Nicolas L. Novere, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison, Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez, Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J. Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser, Jo Vandesompele, and Stefan Wiemann. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology, 26(8):889–896, August 2008.
Abhishek Tiwari. BioPAX or SBML?, July 2009.
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
Camille Laibe and Nicolas Le Novere. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
Paul Dobson, Kieran Smallbone, Daniel Jameson, Evangelos Simeonidis, Karin Lanthaler, Pinar Pir, Chuan Lu, Neil Swainston, Warwick Dunn, Paul Fisher, Duncan Hull, Marie Brown, Olusegun Oshota, Natalie Stanford, Douglas Kell, Ross King, Stephen Oliver, Robert Stevens, and Pedro Mendes. Further developments towards a genome-scale metabolic model of yeast. BMC Systems Biology, 4(1):145+, October 2010.
Dagmar Waltemath, Richard Adams, Daniel A. Beard, Frank T. Bergmann, Upinder S. Bhalla, Randall Britten, Vijayalakshmi Chelliah, Michael T. Cooling, Jonathan Cooper, Edmund J. Crampin, Alan Garny, Stefan Hoops, Michael Hucka, Peter Hunter, Edda Klipp, Camille Laibe, Andrew K. Miller, Ion Moraru, David Nickerson, Poul Nielsen, Macha Nikolski, Sven Sahle, Herbert M. Sauro, Henning Schmidt, Jacky L. Snoep, Dominic Tolle, Olaf Wolkenhauer, and Nicolas Le Novère. Minimum Information About a Simulation Experiment (MIASE). PLoS Comput Biol, 7(4):e1001122+, April 2011.
Stephan Philippi and Jacob Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
Markus J. Herrgard, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalcin Arga, Mikko Arvas, Nils Blüthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Nicolas Le Novere, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana P. Oliveira, Dina Petranovic, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasić, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betul Kırdar, Merja Penttila, Edda Klipp, Bernhard O. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen, and Douglas B. Kell. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, 26(10):1155–1160, October 2008.
Melanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novere. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
Nick Juty. Systems Biology Ontology: Update. Nature Precedings, (713), October 2010.
Michael Hucka, Frank T. Bergmann, Stefan Hoops, Sarah M. Keating, Sven Sahle, James C. Schaff, Lucian P. Smith, and Darren J. Wilkinson. The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. Nature Precedings, (713), October 2010.
Nicolas Le Novère, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schreiber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I. Aladjem, Sarala M. Wimalaratne, Frank T. Bergmann, Ralph Gauges, Peter Ghazal, Hideya Kawaji, Lu Li, Yukiko Matsuoka, Alice Villeger, Sarah E. Boyd, Laurence Calzone, Melanie Courtot, Ugur Dogrusoz, Tom C. Freeman, Akira Funahashi, Samik Ghosh, Akiya Jouraku, Sohyoung Kim, Fedor Kolpakov, Augustin Luna, Sven Sahle, Esther Schmidt, Steven Watterson, Guanming Wu, Igor Goryanin, Douglas B. Kell, Chris Sander, Herbert Sauro, Jacky L. Snoep, Kurt Kohn, and Hiroaki Kitano. The Systems Biology Graphical Notation. Nature Biotechnology, 27(8):735–741, August 2009.
Dagmar Köhn and Nicolas Le Novère. SED-ML – An XML Format for the Implementation of the MIASE Guidelines. In Monika Heiner and Adelinde Uhrmacher, editors, Computational Methods in Systems Biology, volume 5307 of Lecture Notes in Computer Science, chapter 15, pages 176–190. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2008.
Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
James B. Bassingthwaighte. Strategies for the Physiome Project. Annals of Biomedical Engineering, 28(8):1043–1058, August 2000.
L. M. Loew and J. C. Schaff. The Virtual Cell: a software environment for computational cell biology. Trends in biotechnology, 19(10):401–406, October 2001.
Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–1255, November 2007.
Judith A. Blake and Carol J. Bult. Beyond the data deluge: data integration and bio-ontologies. Journal of biomedical informatics, 39(3):314–320, June 2006.
J. Lomax and A. T. McCray. Mapping the gene ontology into the unified medical language system. Comparative and functional genomics, 5(4):354–361, 2004.
Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, AmiGO Hub, and Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics (Oxford, England), 25(2):288–289, January 2009.
Erick Antezana, Ward Blondé, Mikel Egaña, Alistair Rutherford, Robert Stevens, Bernard De Baets, Vladimir Mironov, and Martin Kuiper. BioGateway: a semantic systems biology tool for the life sciences. BMC bioinformatics, 10 Suppl 10(Suppl 10):S11+, 2009.
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
Jonathan Bard, Seung Y. Rhee, and Michael Ashburner. An ontology for cell types. Genome biology, 6(2):R21+, 2005.
J. S. Luciano. PAX of mind for pathway researchers. Drug discovery today, 10(13):937–942, July 2005.
Lena Strömbäck and Patrick Lambrix. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics, 21(24):4401–4407, December 2005.
L. Strömbäck, V. Jakoniene, H. Tan, and P. Lambrix. Representing, storing and accessing molecular interaction data: a review of models and tools. Briefings in bioinformatics, 7(4):331–338, December 2006.
O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 38(Database issue):D525–D531, October 2009.

By Allyson Lister
