
Background: Standards as a shared structure for data (Thesis: 1.4)

Standards as a shared structure for data

As described in Section 1.3, systems biology benefits from the use of common standards for describing data and from unified naming schemes to ensure precise identification of entities. The research community must make use of standardised methods to increase the annotation of data to a point where it matches the rate of data generation [1]. Researchers must adapt, documenting and managing their data “with as much professionalism as they devote to their experiments” [2]. However, the level of support required to adopt and deploy standards consistently is a difficult commitment for individuals [2]. Careful deployment by bioinformaticians can make standards use transparent to the researchers generating the data. This section provides an overview of the standards specific to systems biology as well as those useful both to systems biologists and the wider life sciences community.

Virtually every life science community has at least one proposed or accepted standard, including transcriptomics [3, 4], proteomics [5], genomics [6], flow cytometry [7] and systems biology [8, 9]. The MIBBI Registry alone contains 32 minimal information checklists for various experimental types. Standards are difficult to create and maintain in terms of manpower, money, and consensus. Standards begin with islands of initial researchers in a field, who gradually develop into a nascent community. With new scientific developments, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with peers and among machines becomes more important and standards become a critical requirement.

The best solution may seem to be the creation of an elegant or complex ontology which models a domain in a richly described and logically rigorous way. However, the most practical solutions are generally easy to use and intuitive to understand, characteristics which do not always apply to more complex ontologies. Some solutions involve relying on realist upper level ontologies such as the BFO [10], but even experienced researchers find limitations and difficulties in such highly philosophical solutions [11, 12, 13, 14]. In addition to the complexity of the solution, the practicalities of creating a community standard require a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months, or even years, of meetings, document revisions and conference calls to derive a working standard. For instance, OBI [15] has been in development for seven years and is yet to be officially published. Despite the time it takes to reach consensus, even the best structure or semantics will not guarantee usage if the work has been developed without the input of the wider community.

While it may seem that only one standard is needed for each community, there are in fact two axes along which multiple standards can be developed. The first is the axis of scope and the second is the axis of type. Even though a single community may wish to describe a single type of information (for instance, a systems biology model), the amount and scope of the data might be too large to feasibly fit into just one standard. There are three types of standard which must be considered: the minimal descriptive content, or metadata, a piece of data must include; the standard syntax to which that data must adhere; and a standard semantics for describing the meaning of the data in a way understandable to both computers and humans. Unlike the scope of a standard, which changes depending on the community, standards types are consistent across communities. Figure 1 summarises these axes with the use of existing systems biology community standards. The standards relevant to systems biology and to the work presented in this thesis are described in sections below.

Figure 1: The two axes of standards development, illustrated using systems biology community standards. The horizontal axis is the scope of the standard and the vertical axis is the type of standard. The content and syntax standards for system behaviour are empty as these standards have not yet been finalised. *BioPAX is both a syntax standard and a semantic standard. Written in OWL, it provides both the structure and the meaning for representing pathways qualitatively. Adapted from [16], first presented in [17].

Some researchers may wish to describe or visualise a mathematical model while others might be more interested in storing the simulation details or the results of multiple simulation runs. Though mathematical descriptions, visualisations and results storage are all separate activities, they fall within a single community and are related via the computational model itself. These activities create the scope of the various systems biology standards: one representing the model, another describing a simulation and a third structuring the results of a simulation. These divisions are present within the SBML community, and at least three other similar, but not identical, representation standards have also been developed in the systems biology community as a whole [18, 19, 20]. The columns in Figure 1 sort the most common systems biology standards according to scope.

Discovering where to delineate areas within a single community’s standards can be difficult. Further, the scope of a standard might overlap with another community’s, requiring cross-community participation. Such efforts result in higher-level standards which many different communities can utilise. Resources such as OBI and the FuGE [21] standards have a very broad scope and a correspondingly high level of granularity, enabling the description of virtually any experiment. Individual communities are meant to extend these upper-level standards to provide terminology and structure to meet their specific needs at a lower level of granularity. By sharing a common upper-level standard, integration and reuse of the data it organises become much easier. Managing the interconnectedness of standards is a task in itself, and one for which dedicated organisations such as BioSharing [22] and the MIBBI Registry [23] were created.

1 Systems biology modelling standards

This section describes a number of standards important in systems biology according to the type of the standard, as described in Figure 1. Greater detail is provided for those standards used extensively in the work described in this thesis. In particular, a comparison of the main systems biology formats is provided in Figure 2. Other cross-community standards commonly used in systems biology are described as they are introduced.

Figure 2: A graphical comparison of the capabilities of BioPAX, SBML, PSI-MIF and CellML. While CellML and SBML can model many of the same concepts, CellML is also capable of modelling tissue and organ interactions, and SBML currently provides richer biological annotations. While PSI and BioPAX can both be used to model interactions, BioPAX is able to model more types of interaction as well as pathways. The figure shows BioPAX as capable of modelling pathways in low detail because BioPAX does not capture quantitative information. Modified from [24].

Content standards

Content standards provide a checklist for the minimal descriptive content an experimental type must include. The results and conclusions of a scientific investigation are dependent upon the investigation’s context, such as the methods and other metadata describing an experiment. Defining a common set of metadata to guide researchers in reporting scientific context is increasingly gaining favour with data repositories, journals and funders [23]. Checklists outline the minimal information required to evaluate, interpret and disseminate an experiment; such guidelines effectively define what is contained within a scientific dataset and how the set was generated. Currently, the MIBBI Registry provides a list of these guidelines.

MIRIAM is a content standard which provides a minimal checklist for interpreting a model correctly, a controlled method of providing annotation through URIs [25] and, via the MIRIAM Registry, Web services for correctly resolving these URIs and for listing supported data types [26]. MIRIAM annotations are URIs which are added to SBML in a standardised way and link external resources such as ontologies or data sources to a model. MIRIAM provides a standard scheme for unambiguously identifying biological entities in networks and models; without such a scheme, the quality of the data suffers [27]. By providing an integration methodology which enhances MIRIAM annotations, this thesis aids collaboration and data reuse in systems biology.

While models complying with MIRIAM provide consistently named annotations about the model to allow for its correct interpretation and reuse, the MIRIAM checklist does not cover the reproduction of simulation results. For this, the MIASE was created [28]. This checklist describes the models to use, the modifications made to those models, the order in which all simulation procedures were applied, how the raw results were processed and a description of the final output [28].

Syntax standards

While the difficulty inherent in translating native data schemas to a unified format is just one of the integration challenges facing researchers [29], it can also be one of the most easily remedied if there is a strong focus within a community for creating a common syntax. Choosing a structure for the data such as XML, RDF or even structured flat files creates a single format and eases data sharing and reuse. Syntax standards aim to provide a common structure for all data of a given experimental type.

SBML is a syntax standard which allows the exchange and reuse of quantitative models for systems biology [8]. SBML is primarily an XML format for describing computational models in systems biology. It is supported by a large community and a wide range of tools, allowing model generation, analysis and curation in any one of the many independently maintained software applications. While the majority of models written in this format describe relatively small and well contained pathways, SBML is capable of storing larger-scale views of entire metabolic networks [30]. SBML models tend to remain small due to limitations in both the simulation environments and in the kinetic data available for parametrising the models. Herrgård and colleagues [30] sidestep this problem by currently providing only qualitative information for their large SBML metabolic network, without parametrisation.

Metadata, or extra information about any component of an SBML model, is stored in two ways: (i) as a link to an SBO [31] term and (ii) as an RDF triple structured according to the MIRIAM specification. In the SBML data model, each element inherits from the SBase top-level class which has an optional attribute sboTerm for referencing a specific term in SBO [32]. Further biological annotations can be added to the annotation element of any SBase class. Although the annotation element may contain any RDF, annotation complying with the MIRIAM specification must be in the form of valid MIRIAM identifiers. Resolution of these identifiers is available from http://www.identifiers.org. An example of annotation within SBML is shown in Figure 3. A detailed description of the components and constraints on an SBML model is available in the specification document [33].

Figure 3: How SBO and MIRIAM annotation are used within an SBML element. MIRIAM-compliant RDF is present within the annotation element and via the sboTerm attribute linking to the SBO term SBO:0000297, “protein complex”. RDF namespaces of the rdf element and some children of the species element have been removed to aid readability of the figure. The identifiers used in this example are URNs rather than URIs; BioModels is in the process of converting all URNs to identifiers.org-based URIs. XML taken from the “BLL” species of BioModels entry BIOMD0000000001.
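
The two annotation mechanisms described above can be read back out of a species element with a few lines of code. The following is a minimal sketch using only the Python standard library, not a substitute for a full SBML library. The XML is an illustrative fragment: the `id` and `sboTerm` values follow the Figure 3 caption, while the `metaid`, the `compartment` value and the `urn:miriam:example.db:EX0001` identifier are placeholders invented for this example. The conversion function assumes the common `urn:miriam:<collection>:<identifier>` layout and the `http://identifiers.org/<collection>/<identifier>` URI layout.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# Namespaces used in the RDF annotation block.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bqbiol": "http://biomodels.net/biology-qualifiers/",
}

# Illustrative species element. The metaid, compartment and the URN are
# placeholders, not the real content of the BioModels entry.
SPECIES_XML = """
<species xmlns="http://www.sbml.org/sbml/level2/version4"
         metaid="_example_metaid" id="BLL"
         compartment="example_compartment" sboTerm="SBO:0000297">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="#_example_metaid">
        <bqbiol:isVersionOf>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:example.db:EX0001"/>
          </rdf:Bag>
        </bqbiol:isVersionOf>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def read_annotations(species_xml):
    """Return the sboTerm and a list of (qualifier, resource) pairs."""
    species = ET.fromstring(species_xml.strip())
    sbo_term = species.get("sboTerm")
    resource_attr = "{%s}resource" % NS["rdf"]
    links = []
    for description in species.iterfind(".//rdf:Description", NS):
        # Each direct child of rdf:Description is a qualifier element.
        for qualifier in description:
            qualifier_name = qualifier.tag.split("}", 1)[1]
            for item in qualifier.iterfind(".//rdf:li", NS):
                links.append((qualifier_name, item.get(resource_attr)))
    return sbo_term, links


def urn_to_identifiers_org(urn):
    """Rewrite a MIRIAM URN as an identifiers.org-style URI (assumed layout)."""
    prefix = "urn:miriam:"
    if not urn.startswith(prefix):
        raise ValueError("not a MIRIAM URN: %r" % urn)
    collection, _, local_id = urn[len(prefix):].partition(":")
    # MIRIAM URNs percent-encode reserved characters such as ':'.
    return "http://identifiers.org/%s/%s" % (collection, unquote(local_id))


sbo_term, links = read_annotations(SPECIES_XML)
uris = [urn_to_identifiers_org(resource) for _, resource in links]
```

Keeping the SBO term and the RDF links separate mirrors the two storage routes: the `sboTerm` attribute is a single value on the element itself, whereas the annotation element can carry any number of qualified links.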

The SBGN is the first community standard for the graphical representation of systems biology models [34]. Though many different notations were available prior to SBGN, those efforts dealt mainly with notation proposals and software implementations without seeking the backing of the entire community and without addressing the many biological and technical needs of the users. Specifically, SBGN was created for the following purposes: to be semantically, syntactically and visually concise and unambiguous; to be free of copyright restrictions; to minimise the number of possible symbols; to support modularity as well as many different biological entities; and to support the automated generation of diagrams based on simulatable models [34]. Conversion between SBGN and a descriptive format such as SBML is only possible through shared semantics such as those provided by SBO [32].

SED-ML is the format counterpart to the MIASE checklist and provides an XML structure for describing simulations that are independent of both the model encoding and the software used to perform the simulations [35]. SED-ML tasks are used to link a model to a particular simulation setting and a DataGenerator is used to structure the post-processing steps used on the simulation result before final output [35].
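
The relationship between a task and a data generator can be pictured as a small data flow. The sketch below is a simplified illustration in Python, not the SED-ML XML schema: the class and field names (`Task`, `DataGenerator`, `model_ref`, `simulation_ref`, `task_ref`, `transform`), the `run` function and the stand-in simulator are all hypothetical and chosen only to mirror the roles described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Links one model to one simulation setting (hypothetical structure)."""
    id: str
    model_ref: str
    simulation_ref: str


@dataclass
class DataGenerator:
    """Post-processing applied to the raw result of one task."""
    id: str
    task_ref: str
    transform: Callable[[List[float]], List[float]]


def run(task: Task,
        generators: List[DataGenerator],
        simulate: Callable[[str, str], List[float]]) -> Dict[str, List[float]]:
    """Simulate once, then apply every data generator that refers to the task."""
    raw = simulate(task.model_ref, task.simulation_ref)
    return {g.id: g.transform(raw) for g in generators if g.task_ref == task.id}


def fake_simulate(model_ref: str, simulation_ref: str) -> List[float]:
    """Stand-in for a real simulator: returns a fixed time course."""
    return [1.0, 2.0, 4.0]


task = Task(id="task1", model_ref="model1", simulation_ref="sim1")
normalise = DataGenerator(
    id="normalised",
    task_ref="task1",
    transform=lambda values: [v / max(values) for v in values],
)
unrelated = DataGenerator(id="ignored", task_ref="task2", transform=lambda v: v)

result = run(task, [normalise, unrelated], fake_simulate)
```

Because the task only holds references, the same description could in principle be run against any model encoding or simulator that can resolve them, which is the independence the format aims for.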

Like SBML, CellML is an XML format used to encode and exchange quantitative systems biology models. Unlike SBML, CellML is able to describe a wider range of mathematical expressions [36]. Additionally, CellML has a component-based structure, allowing reuse of individual modules of a model in a way currently impossible for SBML. However, SBML user support and software functionality are much greater than those provided by CellML, and SBML has a much more active user community. While historically SBML was able to provide richer biological annotations, recent developments within CellML have all but resolved this limitation.

The Physiome project stores models of integrative functions of cells, organs and organisms [37]. This project was expressly developed for modelling at both a cellular and at an organ level, and the models can vary from non-simulatable diagrammatic schema to fully quantitative computational models. A final example is the Virtual Cell project, which provides not just the VCML XML format for describing mathematical and biological models for simulation, but also an entire software infrastructure and user interfaces for performing creation, simulation and analysis of those models [38].

Semantic standards

Describing the meaning, or semantics, of data and its associated metadata is a complex and difficult problem being addressed in the life sciences through the use of controlled vocabularies and ontologies [39]. The use of ontologies for describing data has a twofold purpose. Firstly, ontologies help ensure consistent annotation, such as the spellings and choice of terms. Consistent naming of terms allows the use of a single word-definition pair to describe a single concept. Secondly, ontologies can add human- and computationally amenable semantics to the data. The curation of datasets with common ontology terms minimises querying and integration errors due to semantic ambiguity by providing a method of consistent annotation; for instance, GO is a community-driven ontology in widespread use, and its presence in many datasets creates semantic links between them [40, 41, 42, 43, 39]. In addition to ontologies, RDF can be used to organise life science data in a generic, high-level fashion. RDF can provide a simple, single format for scientific data, but cannot provide a biologically meaningful semantic layer.

SBO is a semantic standard initially developed for the addition of a biologically meaningful semantic layer on quantitative systems biology models [44]. SBO provides unambiguous terms for biological annotation [31], therefore increasing MIRIAM compliance. Initially, only SBML models supported SBO’s use; currently many other formats such as CellML use SBO. Further, SBO is capable of more than simple entity annotation:

  • every SBGN glyph corresponds exactly to an SBO term, allowing model conversion from a simulation format such as SBML to the graphical SBGN notation;
  • SBO aids conversion between pathway formats utilising its terms;
  • SBO-annotated models can be translated between continuous deterministic frameworks and discrete stochastic frameworks; and
  • models can be merged more cleanly and precisely when model entities are unambiguously defined with SBO [32].

While SBO is natively stored as a relational database, it can be exported either in OBO [39] or OWL on demand. The only property used within SBO is the subsumption, or “is a” relationship. A summary of the main classes and their children is available in Figure 4.
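
Because subsumption is the only relationship, asking whether one term is a kind of another reduces to walking up the hierarchy. The sketch below illustrates this with a toy child-to-parents map; the labels are placeholders chosen for readability and are not an extract of SBO, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Toy "is a" hierarchy: each term maps to its direct parents.
# Labels are illustrative placeholders, not actual SBO content.
IS_A: Dict[str, List[str]] = {
    "protein complex": ["macromolecular complex"],
    "macromolecular complex": ["material entity"],
    "enzymatic catalyst": ["catalyst"],
    "catalyst": ["participant role"],
}


def ancestors(term: str) -> Set[str]:
    """Return every term reachable from `term` by following "is a" links."""
    found: Set[str] = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(IS_A.get(parent, []))
    return found


def is_a(term: str, candidate: str) -> bool:
    """True if `term` equals `candidate` or is subsumed by it."""
    return term == candidate or candidate in ancestors(term)
```

A tool converting between formats could use such a check to accept any term falling under a required branch, rather than matching on one exact identifier.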

Figure 4: A summary of the SBO subsumption hierarchy, taken from [31, Box 1]. The seven orthogonal branches of the SBO hierarchy each have their own colour. Dashed lines indicate that intermediate terms have been removed from the summary for readability.

All information other than the SBO class names, identifiers, synonyms and subsumption hierarchy is contained either within human-readable annotations on each SBO term or within MathML annotations. Human-readable annotations include general comments, history of the term and a textual definition. Constraints on the usage of an SBO term are not defined in the ontology itself, but rather in the other formats (such as SBML) where the constraint is applied.

The MIASE checklist requires that the applied algorithm and the initial set-up for a model simulation are described. However, as some algorithms are proprietary and others are not well documented, repeatability can be difficult. KiSAO is an ontology which describes algorithms, unambiguously identifying those which are similar enough to perform a particular simulation task [31]. Unlike SBO, the native encoding of KiSAO is OWL. KiSAO can be used in conjunction with SED-ML to allow software to automatically choose the best algorithm for a simulation.

TEDDY is an ontology which models the dynamic behaviour, observable phenomena and control elements of systems and synthetic biology models [31]. It is still at an early stage of development, but the ontology and some limited documentation are available. Like KiSAO, its native format is OWL.

BioPAX is an OWL ontology created to qualitatively describe biological pathway information [18]. It uses GO and the cell type ontology [45] to describe compartments and locations as well as the NCBI taxonomy database for organisms [46]. In contrast to other standards such as SBML which provide a syntactic representation, BioPAX provides a semantic representation of pathways. An example of BioPAX properties and classes is available in Figure 5 and a summary of classes in BioPAX Level 3 is available in Figure 6.

While SBML is primarily intended for quantitative encoding of pathways for simulation and PSI-MIF is capable only of storing binary interactions, BioPAX is intended to store both levels of granularity equally well, albeit at a qualitative level [47]. For a graphical representation of the differences between BioPAX, SBML, PSI-MIF and CellML, see Figure 2. Each release, or level, of BioPAX has increased the expressivity of the ontology. The latest version of BioPAX is Level 3, which is capable of modelling signalling pathways, molecular state, gene regulation and genetic interactions [18]. As a comparison, the earliest release, Level 1, supported only metabolic pathways. However, most databases such as Pathway Commons still provide their data in BioPAX Level 2, which adds molecular interactions and post-translational modifications.

Figure 5: The AKT pathway shown graphically (left) and using BioPAX (right). Taken from [18, Figure 3].

Although BioPAX provides a detailed semantic representation of pathways and interactions, it has a number of limitations: there is no way of explicitly describing broader experimental metadata other than through simple cross-references [47, 48], no ability to represent mathematical relations other than providing chemical details about interactions [47, 48], and dynamic and quantitative aspects of processes are not supported [18]. Although there is a BioPAX export available for all entries in the BioModels database, such a conversion is not perfect. However, a new bridge between SBML and BioPAX in the form of a quantitative module for BioPAX, called SBPAX, is under development [49].

Figure 6: High level overview of BioPAX Level 3. Taken from [18, Figure 4].

2 Data Sources for Systems Biology

This section describes three commonly used data sources for the creation of systems biology models which have been used in the work described in this thesis (see Chapter ). This list is not intended to be exhaustive, and many other databases are available. For information on BioModels, a database for storing systems biology models, see Section 1.3.

BioGRID

BioGRID stores 24 different types of interactions and exports pairs of interacting entities in PSI-MIF 2.5 format [5]. As shown in Figure 2, Pathway Commons and BioGRID store similar types of data, but have different underlying representations. As this thesis was being completed, Pathway Commons began importing limited BioGRID data, allowing retrieval of some BioGRID data in BioPAX format and simplifying the integration process for that portion of BioGRID.

UniProtKB

The UniProtKB is a comprehensive public protein sequence and function database, consisting both of manually curated and automatically annotated data [50]. While some limited pathway and interaction data is provided, mainly within the comments and feature table sections of a UniProtKB entry, its main use in the creation of systems biology models is as a high quality reference for protein information. It also contains 144 cross references to other resources such as GO and IntAct [51], many of which are useful for model creation.

Pathway Commons

Pathway Commons is a database which provides pathway and interaction data in a number of formats, including BioPAX. The Pathway Commons binary interaction data has limitations not present in the BioPAX format itself. For binary interactions, Pathway Commons describes the participants in a reaction without providing any directionality. Specifically, the participant subtypes (product, reactant, and modifier) available within BioPAX are unused. Where pathway (rather than interaction) data is provided from Pathway Commons, more complete use of BioPAX is possible. Pathway Commons uses BioPAX Level 2, which has a number of limitations compared with BioPAX Level 3, as described in Section .
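
The practical effect of dropping participant subtypes can be seen in a small sketch. The code below is an illustration only and does not reproduce how Pathway Commons derives its binary interactions: the role-labelled dictionary, the entity names and the function `to_binary_interactions` are hypothetical. It shows that once roles are discarded, a reaction and its reverse become indistinguishable.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def to_binary_interactions(reaction: Dict[str, str]) -> Set[FrozenSet[str]]:
    """Flatten a role-labelled reaction into unordered participant pairs.

    `reaction` maps a participant name to its role ("reactant", "product"
    or "modifier"). The roles are deliberately ignored, which is the loss
    of information being illustrated.
    """
    return {frozenset(pair) for pair in combinations(sorted(reaction), 2)}


# Two hypothetical reactions that differ only in direction.
forward = {"entity_A": "reactant", "entity_B": "product", "entity_E": "modifier"}
reverse = {"entity_A": "product", "entity_B": "reactant", "entity_E": "modifier"}

forward_pairs = to_binary_interactions(forward)
reverse_pairs = to_binary_interactions(reverse)
```

Here `forward_pairs` and `reverse_pairs` are equal, so any integration step working from binary interactions alone would need another source to recover which participant is consumed and which is produced.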

3 Data standards and data integration

Although data standards are key to data sharing and reuse, they do not provide a complete solution. By their very nature, standards cannot be implemented until a new experimental type has been available long enough to produce a list of standard requirements. Further, while orthogonality is important, in practice there remains some level of overlap among standards. As such, the integration of information stored in different representations remains important. To integrate systems biology data successfully, the data needs to be human readable and computationally accessible. Section 1.6 describes data integration methodologies and their use within systems biology.

Bibliography

[1]
Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
[2]
Community cleverness required. Nature, 455(7209):1, September 2008.
[3]
Ron Edgar and Tanya Barrett. NCBI GEO standards and services for microarray data. Nature Biotechnology, 24(12):1471–1472, December 2006.
[4]
Alvis Brazma, Pascal Hingamp, John Quackenbush, Gavin Sherlock, Paul Spellman, Chris Stoeckert, John Aach, Wilhelm Ansorge, Catherine A. Ball, Helen C. Causton, Terry Gaasterland, Patrick Glenisson, Frank C. P. Holstege, Irene F. Kim, Victor Markowitz, John C. Matese, Helen Parkinson, Alan Robinson, Ugis Sarkans, Steffen Schulze-Kremer, Jason Stewart, Ronald Taylor, Jaak Vilo, and Martin Vingron. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics, 29(4):365–371, December 2001.
[5]
Henning Hermjakob, Luisa Montecchi-Palazzi, Gary Bader, Jérôme Wojcik, Lukasz Salwinski, Arnaud Ceol, Susan Moore, Sandra Orchard, Ugis Sarkans, Christian von Mering, Bernd Roechert, Sylvain Poux, Eva Jung, Henning Mersch, Paul Kersey, Michael Lappe, Yixue Li, Rong Zeng, Debashis Rana, Macha Nikolski, Holger Husi, Christine Brun, K. Shanker, Seth G. Grant, Chris Sander, Peer Bork, Weimin Zhu, Akhilesh Pandey, Alvis Brazma, Bernard Jacq, Marc Vidal, David Sherman, Pierre Legrain, Gianni Cesareni, Ioannis Xenarios, David Eisenberg, Boris Steipe, Chris Hogue, and Rolf Apweiler. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nature biotechnology, 22(2):177–183, February 2004.
[6]
Guy Cochrane, Ruth Akhtar, James Bonfield, Lawrence Bower, Fehmi Demiralp, Nadeem Faruque, Richard Gibson, Gemma Hoad, Tim Hubbard, Christopher Hunter, Mikyung Jang, Szilveszter Juhos, Rasko Leinonen, Steven Leonard, Quan Lin, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Sheila Plaister, Rajesh Radhakrishnan, Stephen Robinson, Siamak Sobhany, Petra T. Hoopen, Robert Vaughan, Vadim Zalunin, and Ewan Birney. Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research, 37(suppl 1):D19–D25, January 2009.
[7]
Yu Qian, Olga Tchuvatkina, Josef Spidlen, Peter Wilkinson, Maura Gasparetto, Andrew Jones, Frank Manion, Richard Scheuermann, Rafick P. Sekaly, and Ryan Brinkman. FuGEFlow: data model and markup language for flow cytometry. BMC Bioinformatics, 10(1):184+, June 2009.
[8]
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[9]
Andrew Miller, Justin Marsh, Adam Reeve, Alan Garny, Randall Britten, Matt Halstead, Jonathan Cooper, David Nickerson, and Poul Nielsen. An overview of the CellML API and its implementation. BMC Bioinformatics, 11(1):178+, April 2010.
[10]
Robert Arp and Barry Smith. Function, Role, and Disposition in Basic Formal Ontology. Nature Precedings, (713).
[11]
Phillip Lord and Robert Stevens. Adding a Little Reality to Building Ontologies for Biology. PLoS ONE, 5(9):e12258+, September 2010.
[12]
Phillip Lord. An evolutionary approach to Function. Journal of Biomedical Semantics, 1(Suppl 1):S4+, 2010.
[13]
Robert Stevens. Unicorns in my Ontology, May 2011.
[14]
Michel Dumontier and Robert Hoehndorf. Realism for scientific ontologies. In Proceeding of the 2010 conference on Formal Ontology in Information Systems: Proceedings of the Sixth International Conference (FOIS 2010), pages 387–399, Amsterdam, The Netherlands, 2010. IOS Press.
[15]
The OBI Consortium. OBI Ontology.
[16]
V. Chelliah, L. Endler, N. Juty, C. Laibe, C. Li, N. Rodriguez, and N. Le Novere. Data Integration and Semantic Enrichment of Systems Biology Models and Simulations. In N. W. Paton, P. Missier, and C. Hedeler, editors, Data Integration in the Life Sciences, Proceedings; Lecture Notes in Computer Science; 6th International Workshop on Data Integration in the Life Sciences, volume 5647, pages 5–15, July 2009.
[17]
Nicolas Le Novere. Principled annotation of quantitative models in systems biology. In Genomes to Systems, 2008.
[18]
Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Reubenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovksy, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, September 2010.
[19]
Alan Garny, David P. Nickerson, Jonathan Cooper, Rodrigo W. Santos, Andrew K. Miller, Steve Mckeever, Poul M. F. Nielsen, and Peter J. Hunter. CellML and associated tools and techniques. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 366(1878):3017–3043, September 2008.
[20]
Leslie M. Loew. The Virtual Cell project. Novartis Foundation symposium, 247, 2002.
[21]
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
[22]
Dawn Field, Susanna-Assunta A. Sansone, Amanda Collis, Tim Booth, Peter Dukes, Susan K. Gregurick, Karen Kennedy, Patrik Kolar, Eugene Kolker, Mary Maxon, Siân Millard, Alexis-Michel M. Mugabushaka, Nicola Perrin, Jacques E. Remacle, Karin Remington, Philippe Rocca-Serra, Chris F. Taylor, Mark Thorley, Bela Tiwari, and John Wilbanks. Megascience. ’Omics data sharing. Science (New York, N.Y.), 326(5950):234–236, October 2009.
[23]
Chris F. Taylor, Dawn Field, Susanna-Assunta Sansone, Jan Aerts, Rolf Apweiler, Michael Ashburner, Catherine A. Ball, Pierre-Alain Binz, Molly Bogue, Tim Booth, Alvis Brazma, Ryan R. Brinkman, Adam Michael Clark, Eric W. Deutsch, Oliver Fiehn, Jennifer Fostel, Peter Ghazal, Frank Gibson, Tanya Gray, Graeme Grimes, John M. Hancock, Nigel W. Hardy, Henning Hermjakob, Randall K. Julian, Matthew Kane, Carsten Kettner, Christopher Kinsinger, Eugene Kolker, Martin Kuiper, Nicolas L. Novere, Jim Leebens-Mack, Suzanna E. Lewis, Phillip Lord, Ann-Marie Mallon, Nishanth Marthandan, Hiroshi Masuya, Ruth McNally, Alexander Mehrle, Norman Morrison, Sandra Orchard, John Quackenbush, James M. Reecy, Donald G. Robertson, Philippe Rocca-Serra, Henry Rodriguez, Heiko Rosenfelder, Javier Santoyo-Lopez, Richard H. Scheuermann, Daniel Schober, Barry Smith, Jason Snape, Christian J. Stoeckert, Keith Tipton, Peter Sterk, Andreas Untergasser, Jo Vandesompele, and Stefan Wiemann. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology, 26(8):889–896, August 2008.
[24]
Abhishek Tiwari. BioPAX or SBML?, July 2009.
[25]
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[26]
Camille Laibe and Nicolas Le Novere. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
[27]
Paul Dobson, Kieran Smallbone, Daniel Jameson, Evangelos Simeonidis, Karin Lanthaler, Pinar Pir, Chuan Lu, Neil Swainston, Warwick Dunn, Paul Fisher, Duncan Hull, Marie Brown, Olusegun Oshota, Natalie Stanford, Douglas Kell, Ross King, Stephen Oliver, Robert Stevens, and Pedro Mendes. Further developments towards a genome-scale metabolic model of yeast. BMC Systems Biology, 4(1):145+, October 2010.
[28]
Dagmar Waltemath, Richard Adams, Daniel A. Beard, Frank T. Bergmann, Upinder S. Bhalla, Randall Britten, Vijayalakshmi Chelliah, Michael T. Cooling, Jonathan Cooper, Edmund J. Crampin, Alan Garny, Stefan Hoops, Michael Hucka, Peter Hunter, Edda Klipp, Camille Laibe, Andrew K. Miller, Ion Moraru, David Nickerson, Poul Nielsen, Macha Nikolski, Sven Sahle, Herbert M. Sauro, Henning Schmidt, Jacky L. Snoep, Dominic Tolle, Olaf Wolkenhauer, and Nicolas Le Novère. Minimum Information About a Simulation Experiment (MIASE). PLoS Comput Biol, 7(4):e1001122+, April 2011.
[29]
Stephan Philippi and Jacob Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
[30]
Markus J. Herrgard, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalcin Arga, Mikko Arvas, Nils Buthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Nicolas Le Novere, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana P. Oliveira, Dina Petranovic, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasie, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betul Kurdar, Merja Penttila, Edda Klipp, Bernhard O. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen, and Douglas B. Kell. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, 26(10):1155–1160, October 2008.
[31]
Melanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novere. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
[32]
Nick Juty. Systems Biology Ontology: Update. Nature Precedings, (713), October 2010.
[33]
Michael Hucka, Frank T. Bergmann, Stefan Hoops, Sarah M. Keating, Sven Sahle, James C. Schaff, Lucian P. Smith, and Darren J. Wilkinson. The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. Nature Precedings, (713), October 2010.
[34]
Nicolas Le Novère, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schreiber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I. Aladjem, Sarala M. Wimalaratne, Frank T. Bergmann, Ralph Gauges, Peter Ghazal, Hideya Kawaji, Lu Li, Yukiko Matsuoka, Alice Villeger, Sarah E. Boyd, Laurence Calzone, Melanie Courtot, Ugur Dogrusoz, Tom C. Freeman, Akira Funahashi, Samik Ghosh, Akiya Jouraku, Sohyoung Kim, Fedor Kolpakov, Augustin Luna, Sven Sahle, Esther Schmidt, Steven Watterson, Guanming Wu, Igor Goryanin, Douglas B. Kell, Chris Sander, Herbert Sauro, Jacky L. Snoep, Kurt Kohn, and Hiroaki Kitano. The Systems Biology Graphical Notation. Nature Biotechnology, 27(8):735–741, August 2009.
[35]
Dagmar Köhn and Nicolas Le Novère. SED-ML – An XML Format for the Implementation of the MIASE Guidelines. In Monika Heiner and Adelinde Uhrmacher, editors, Computational Methods in Systems Biology, volume 5307 of Lecture Notes in Computer Science, chapter 15, pages 176–190. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2008.
[36]
Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
[37]
James B. Bassingthwaighte. Strategies for the Physiome Project. Annals of Biomedical Engineering, 28(8):1043–1058, August 2000.
[38]
L. M. Loew and J. C. Schaff. The Virtual Cell: a software environment for computational cell biology. Trends in biotechnology, 19(10):401–406, October 2001.
[39]
Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J. Mungall, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H. Scheuermann, Nigam Shah, Patricia L. Whetzel, and Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11):1251–1255, November 2007.
[40]
Judith A. Blake and Carol J. Bult. Beyond the data deluge: data integration and bio-ontologies. Journal of biomedical informatics, 39(3):314–320, June 2006.
[41]
J. Lomax and A. T. McCray. Mapping the gene ontology into the unified medical language system. Comparative and functional genomics, 5(4):354–361, 2004.
[42]
Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, AmiGO Hub, and Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics (Oxford, England), 25(2):288–289, January 2009.
[43]
Erick Antezana, Ward Blondé, Mikel Egaña, Alistair Rutherford, Robert Stevens, Bernard De Baets, Vladimir Mironov, and Martin Kuiper. BioGateway: a semantic systems biology tool for the life sciences. BMC bioinformatics, 10 Suppl 10(Suppl 10):S11+, 2009.
[44]
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
[45]
Jonathan Bard, Seung Y. Rhee, and Michael Ashburner. An ontology for cell types. Genome biology, 6(2):R21+, 2005.
[46]
J. S. Luciano. PAX of mind for pathway researchers. Drug discovery today, 10(13):937–942, July 2005.
[47]
Lena Stromback and Patrick Lambrix. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics, 21(24):4401–4407, December 2005.
[48]
L. Strömbäck, V. Jakoniene, H. Tan, and P. Lambrix. Representing, storing and accessing molecular interaction data: a review of models and tools. Briefings in bioinformatics, 7(4):331–338, December 2006.
[49]
O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
[50]
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
[51]
B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow, M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Orchard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molecular interaction database in 2010. Nucleic Acids Research, 38(Database issue):D525–D531, October 2009.

Background: Modelling Biological Systems (Thesis 1.3)

Modelling biological systems

Le Novère described the 1952 Hodgkin–Huxley model of the squid giant nerve fibre [1] as the beginning of computational systems biology [2]. Since that time, systems biology models have been available in a variety of representations and with a large number of corresponding syntaxes. This variation is mainly a result of the different approaches used to model pathways and interactions in systems biology. Biological networks are generally qualitative and large-scale, and are created to provide a high level of granularity. Most networks do not yet have the information required to run successful mathematical simulations of the interactions under study; their remit is much broader and they are often composed of transitive binary interactions discovered in high-throughput experiments rather than complex biology-based pathways.

However, many experiments also produce quantitative data. For instance, high-throughput experimentation creates both qualitative and quantitative signalling pathway data important for the understanding of cellular communication [3]. Therefore, modelling at the level of quantitative biological pathways is common in systems biology. Quantitative simulatable models are mainly used to enhance dynamic analyses and study biological pathways, and include specific details on the parametrisation of those pathways. The quantitative description of a biological pathway or behaviour is essential for high-quality systems biology, informing hypotheses and creating an iterative cycle of prediction and experimental verification [3, 4, 5]. This section describes the basics of both networks and quantitative models in systems biology.

1 Networks in systems biology

Most scientists perceive networks as views over sections of an in vivo cellular network [6]. There are five main types of network in systems biology: (i) transcription factor-binding networks, (ii) protein-protein interaction networks, (iii) protein phosphorylation networks, (iv) metabolic interaction networks and (v) genetic and small molecule interaction networks [7]. While such networks are just a conceptualisation of biological reality, they remain a useful virtual organisation of biological entities. For instance, network-mediated integration methodologies make use of the inherent graph-based organisation of networks to integrate many different datasets.

The CID integrates interaction data for multiple organisms into a weighted PFIN (Note: http://cisban-silico.cs.ncl.ac.uk/cid.html). CID allows users to determine the reliability of a particular connection between two interactors. Additional work has made use of the inherent bias of a dataset to generate PFINs which include a relevance score [8]. Here, biases are exploited rather than eliminated; they can be introduced when an experiment is designed for a particular biological process, or when the final published data are chosen because they reflect the process of interest [8]. When a relevance score is added to the existing integrated confidence scores of a network, biased networks perform better than unbiased networks at gene function assignment and identification of important sets of interactors [8].

Genome-scale metabolic networks, which attempt to add new annotation and to reconcile the often-contradictory information present in the original networks, have been created for yeast [9, 10, 11] and human [12, 13]. The ultimate goal is to generate a view spanning an entire cellular network rather than sections of it.

Common formats for large-scale network data include linked data via the RDF [14] or semantically structured data via BioPAX [15]. RDF is a format which uses triples to create a directed, labelled graph of data and is the underpinning of the Semantic Web. A triple is similar to a sentence comprising a Subject, a Predicate and an Object [14]. A collection of RDF data can also be visualised as a graph where the Subjects and Objects are nodes linked together by Predicates, which form the edges between nodes.
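The triple structure described above can be illustrated with a minimal sketch in plain Python. It does not use an RDF library, and the `ex:` identifiers are hypothetical placeholders rather than terms from any real dataset; the sketch only shows how a collection of Subject, Predicate, Object statements can be read as a directed, labelled graph.

```python
from collections import defaultdict

# Hypothetical triples of the form (subject, predicate, object).
# The "ex:" names are illustrative only and come from no real resource.
triples = [
    ("ex:ProteinA", "ex:interactsWith", "ex:ProteinB"),
    ("ex:ProteinA", "ex:locatedIn", "ex:Cytoplasm"),
    ("ex:ProteinB", "ex:participatesIn", "ex:Glycolysis"),
]


def build_graph(statements):
    """Map each subject node to its labelled outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        # The predicate is the edge label; subject and object are nodes.
        graph[subject].append((predicate, obj))
    return dict(graph)


def nodes(statements):
    """Collect every subject and object, i.e. every node of the graph."""
    return {s for s, _, _ in statements} | {o for _, _, o in statements}


graph = build_graph(triples)
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --{predicate}--> {obj}")
```

Each printed line corresponds to one triple, read as a sentence: the Subject, then the Predicate labelling the edge, then the Object.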

Linked Life Data, one of the biggest networks of life science data, uses RDF to store and link its entities (Note: http://linkedlifedata.com). Linked Life Data contains over one billion entities and over five billion statements (Note: as of December 2011, http://linkedlifedata.com/sources). ONDEX [16] is a tool for data visualisation and integration which has been used to generate [17, 18, 19] and visualise [18] biological networks. While ONDEX can export data as RDF, internally it uses a labelled graph with a similar level of expressivity to RDF. BioPAX describes pathways in great qualitative detail, but does not have the capacity to store parameters or other quantitative data about pathways and interactions.

2 From networks to mathematical models

Although networks are generally qualitative, they can aid in the creation of quantitative models; indeed, mathematical modelling is one way of studying the complex behaviour of networks [3]. Some researchers are even blurring the line between networks and models, attempting to create genome-scale models with partial quantisation of data. These integrated consensus networks sit on the border between qualitative large-scale networks and quantitative small-scale pathway models. Herrgård and colleagues [9] created the “Yeast 1.0” consensus network by re-formatting existing yeast interaction data in SBML [20], a structure more commonly used for quantitative modelling. Although Yeast 1.0 is represented in SBML, it does not yet contain enough quantitative data to be simulated.

Irrespective of a computational model’s size, reactions, effectors, kinetic rate equations and the parameters of those equations need to be added for it to be fully functional [9]. Of those requirements, the original version of Yeast 1.0 only contained known reactions. Updates to the consensus network have vastly increased the connectivity of the nodes as well as the number of metabolites and enzymes, but have not yet increased its quantitative information [21]. Once complete, such network-scale models will benefit both from the large amount of data contained within them and from the ability to run in silico modelling experiments normally only accessible to smaller quantitative models.
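To make the difference between a list of known reactions and a simulatable model concrete, the following is a minimal sketch of a single reaction A -> B that has been given a kinetic rate equation and a parameter. Everything in it is an assumption made for illustration: the mass-action rate law, the rate constant, the initial concentrations and the simple Euler integration are not taken from Yeast 1.0 or from any published model, and effectors are omitted entirely.

```python
def simulate(k, a0, b0, dt, steps):
    """Euler integration of A -> B with the assumed rate law v = k * [A]."""
    a, b = a0, b0
    for _ in range(steps):
        rate = k * a       # kinetic rate equation (mass action, assumed)
        a -= rate * dt     # A is consumed by the reaction
        b += rate * dt     # B is produced at the same rate
    return a, b


# Hypothetical parameter and initial concentrations, in arbitrary units.
a_end, b_end = simulate(k=0.1, a0=10.0, b0=0.0, dt=0.01, steps=1000)
print(f"[A] = {a_end:.3f}, [B] = {b_end:.3f}")
```

Without the rate law and the value of `k`, only the reaction itself (A is converted to B) would be known, which is the situation described for the original Yeast 1.0 network.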

Formats such as SBML are able to describe pathways quantitatively, and were created to provide machine-readable formats for model simulation. Other resources such as the BioPAX ontology were created to describe pathways qualitatively. Even with this logical division in purpose between qualitative and quantitative descriptions of systems biology, some projects have begun to bridge the divide. The network-scale Yeast 1.0 SBML model does not yet contain quantitative information. Rather than producing Yeast 1.0 for the purposes of simulation, there is a commitment to realistic representation and high quality selection of reactions [21]. Additionally, research into adding quantitative information to the qualitative pathway ontology BioPAX via SBPAX is progressing [22].

Work by Smallbone and colleagues [23, 24] uses flux balance analysis to create kinetic models of metabolism using only information regarding the reaction stoichiometries. Even though there is little experimental data for the variables, the dynamics concerning the concentrations of cellular metabolites can be inferred. Work on genome-scale kinetic models has progressed by adding the information from kinetic models stored within BioModels to this flux balance analysis method [24].
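Flux balance analysis rests on the assumption that, at steady state, the net production of every internal metabolite is zero, a condition that depends only on the reaction stoichiometries. The sketch below illustrates that mass-balance constraint for a hypothetical three-reaction pathway; it is not an implementation of flux balance analysis itself (no objective function is optimised) and is not the method of Smallbone and colleagues. The network and flux values are invented for illustration.

```python
# Hypothetical stoichiometric matrix. Rows are metabolites (A, B);
# columns are reactions (uptake of A, conversion A -> B, export of B).
S = [
    [1, -1, 0],   # A: produced by uptake, consumed by conversion
    [0, 1, -1],   # B: produced by conversion, consumed by export
]


def net_production(stoichiometry, fluxes):
    """Net rate of change of each metabolite for the given reaction fluxes."""
    return [sum(c * v for c, v in zip(row, fluxes)) for row in stoichiometry]


def is_steady_state(stoichiometry, fluxes, tol=1e-9):
    """True if no metabolite accumulates or is depleted."""
    return all(abs(x) <= tol for x in net_production(stoichiometry, fluxes))


balanced = [2.0, 2.0, 2.0]     # every reaction carries the same flux
unbalanced = [2.0, 1.0, 1.0]   # uptake exceeds conversion, so A accumulates

print(is_steady_state(S, balanced), net_production(S, balanced))
print(is_steady_state(S, unbalanced), net_production(S, unbalanced))
```

In this toy network the steady-state condition forces all three fluxes to be equal, which shows how stoichiometry alone constrains the possible behaviour of a pathway before any kinetic data are available.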

3 Quantitative modelling

Processes are the fundamental unit of systems biology, and the biological entities and associated quantitative data such as concentrations and rate constants are of prime importance for systems biology research [6]. The simulation and analysis of computational models that describe the dynamics of the interactions between biological entities is a vital facet of systems biology research [5]. Models and experiments are typically refined iteratively, as described in Section 1.2; models provide useful feedback to experimentalists, and experimental results help in turn to improve the models. The creation of systems biology models, such as those written in SBML or CellML [25], is primarily performed manually. When faced with such a time-consuming process, many researchers will not represent the biological context of a pathway or make use of many of the data sources and formats relevant to model development. While a small number of core databases can be used to retrieve a large amount of relevant biological information, accessing the “long tail” of information stored in other resources is a more complicated process. However, computational aids could help modellers retrieve new biological information easily and quickly.

Formats such as SBML and CellML provide machine-readable interpretations of biological pathways, complex formation and fundamental processes such as transcription and translation [26]. They are intended to make the task of understanding models easy for the programs that process them. The success of computational models in systems biology is shown not just by their prevalence in the literature, but also by the 231 programs and applications making use of them (Note: as of December 2011, http://sbml.org/SBML_Software_Guide). These programs allow the creation, simulation, analysis, annotation, and storage of SBML in a way that hides the underlying machine-readable format, making the model information accessible to humans. Further information on systems biology formats is available in Section 1.4.

4 Annotation of systems biology models

Systems biology models may contain both the quantitative information required to run a simulation of a biological system and biologically meaningful annotation describing the entities in the system. Annotation provides a description of how a model has been generated and defines the biology of its components in a computationally accessible fashion. However, while the mathematical information necessary for simulation models must be included, syntactically valid models capable of simulation are not required to contain explicit information about the biological context. Therefore, even though the presence of biological annotation aids efficient exchange, reuse, and integration of models, such information is often limited or lacking [27]. As a result, model usefulness is often limited to the person who created it; ambiguity in naming schemes and a lack of biological context hinder model reuse as an input in other computational tasks and as a reference for researchers [28, 29].

BioModels is a database of SBML models divided into curated and non-curated branches [27]. In the curated section, MIRIAM [28] compliance is assured and biological semantics have been added. BioModels curators regularly add biological annotations to an entry prior to promotion to the curated branch. These annotations resolve ambiguity of identification through links to external resources using stable URIs as provided by the MIRIAM Registry (Note: http://www.ebi.ac.uk/miriam). However, model annotation either by BioModels curators or the modellers themselves is complex [27]. Programmatic methods to add such annotation would enrich publicly available models and therefore improve their quality and reusability.

In SBML, biological annotation is structured according to the MIRIAM specification [28]. There are three parts to MIRIAM: (i) a recommended URI-based structure for compliant annotations, (ii) a set of resources to generate and interpret those URIs and (iii) a checklist of minimal information requested in the annotation of biological models. While other annotations are allowed within the specification, MIRIAM annotations are the most relevant to the work presented here. MIRIAM annotations are added to models in a standardised way, and link external resources such as ontologies or data sources to a model. MIRIAM provides a standard structure for explicit links between the mathematical and biological aspects of a model. Aids to model annotation exist [30, 31, 22, 32, 33, 34], but rely extensively on the expert knowledge of the modeller for identification of appropriate additions. More information on such tools is available in Chapters 3 and 5. Ultimately, there is a need for computational approaches that automate the integration of multiple sources to decrease the annotation burden on the modeller.
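The idea of an explicit link between a model component and an external resource can be sketched as a simple data structure. The example below assumes the `urn:miriam:<collection>:<identifier>` form of MIRIAM URI and the `bqbiol:is` qualifier; the component name `species_1`, the collection names and the accession used are hypothetical choices for illustration, and the code does not read or write SBML.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Annotation:
    """A link from one model component to one external resource."""
    component_id: str   # identifier of the annotated element in the model
    qualifier: str      # relationship between the element and the resource
    uri: str            # stable URI identifying the external entry


def miriam_urn(collection, identifier):
    """Build a URN of the assumed form urn:miriam:<collection>:<identifier>."""
    # Percent-encode the identifier so that characters such as ':' are safe.
    return f"urn:miriam:{collection}:{quote(identifier, safe='')}"


# Hypothetical annotation: the component name and accession are placeholders.
annotation = Annotation(
    component_id="species_1",
    qualifier="bqbiol:is",
    uri=miriam_urn("uniprot", "P12345"),
)
print(annotation)
```

Because the annotation is stored separately from the mathematics of the model, a tool could add or replace such links without affecting whether the model can be simulated.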

Bibliography

[1]
A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, August 1952.
[2]
Nicolas Le Novère. The long journey to a Systems Biology of neuronal function. BMC systems biology, 1(1):28+, 2007.
[3]
Anna Bauer-Mehren, Laura I. Furlong, and Ferran Sanz. Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Molecular Systems Biology, 5, July 2009.
[4]
Hiroaki Kitano. Systems Biology: A Brief Overview. Science, 295(5560):1662–1664, March 2002.
[5]
Stephan Philippi and Jacob Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
[6]
Joanne S. Luciano and Robert D. Stevens. e-Science and biological pathway semantics. BMC bioinformatics, 8 Suppl 3(Suppl 3):S3+, 2007.
[7]
Xiaowei Zhu, Mark Gerstein, and Michael Snyder. Getting connected: analysis and principles of biological networks. Genes & Development, 21(9):1010–1024, May 2007.
[8]
Katherine James, Anil Wipat, and Jennifer Hallinan. Integration of Full-Coverage Probabilistic Functional Networks with Relevance to Specific Biological Processes. In Norman Paton, Paolo Missier, and Cornelia Hedeler, editors, Data Integration in the Life Sciences, volume 5647 of Lecture Notes in Computer Science, chapter 4, pages 31–46. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2009.
[9]
Markus J. Herrgard, Neil Swainston, Paul Dobson, Warwick B. Dunn, K. Yalcin Arga, Mikko Arvas, Nils Buthgen, Simon Borger, Roeland Costenoble, Matthias Heinemann, Michael Hucka, Nicolas Le Novere, Peter Li, Wolfram Liebermeister, Monica L. Mo, Ana P. Oliveira, Dina Petranovic, Stephen Pettifer, Evangelos Simeonidis, Kieran Smallbone, Irena Spasie, Dieter Weichart, Roger Brent, David S. Broomhead, Hans V. Westerhoff, Betul Kurdar, Merja Penttila, Edda Klipp, Bernhard O. Palsson, Uwe Sauer, Stephen G. Oliver, Pedro Mendes, Jens Nielsen, and Douglas B. Kell. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology, 26(10):1155–1160, October 2008.
[10]
Natalie C. Duarte, Markus J. Herrgard, and Bernhard Palsson. Reconstruction and Validation of Saccharomyces cerevisiae iND750, a Fully Compartmentalized Genome-Scale Metabolic Model. Genome Research, 14(7):1298–1309, July 2004.
[11]
Jochen Förster, Iman Famili, Patrick Fu, Bernhard Ø. Palsson, and Jens Nielsen. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome research, 13(2):244–253, February 2003.
[12]
Hongwu Ma, Anatoly Sorokin, Alexander Mazein, Alex Selkov, Evgeni Selkov, Oleg Demin, and Igor Goryanin. The Edinburgh human metabolic network reconstruction and its functional analysis. Molecular systems biology, 3, 2007.
[13]
Natalie C. Duarte, Scott A. Becker, Neema Jamshidi, Ines Thiele, Monica L. Mo, Thuy D. Vo, Rohith Srivas, and Bernhard Ø. Palsson. Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America, 104(6):1777–1782, February 2007.
[14]
Dave Beckett. RDF/XML Syntax Specification (Revised). http://www.w3.org/TR/rdf-syntax-grammar/, February 2004.
[15]
Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Reubenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovksy, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, September 2010.
[16]
Jacob Köhler, Jan Baumbach, Jan Taubert, Michael Specht, Andre Skusa, Alexander Rüegg, Chris Rawlings, Paul Verrier, and Stephan Philippi. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics, 22(11):1383–1390, June 2006.
[17]
Artem Lysenko, Michael D. Platel, Keywan H. Pak, Jan Taubert, Charlie Hodgman, Christopher Rawlings, and Mansoor Saqi. Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis. BMC Bioinformatics, 12(1):203+, 2011.
[18]
Jochen Weile, Matthew Pocock, Simon J. Cockell, Phillip Lord, James M. Dewar, Eva-Maria Holstein, Darren Wilkinson, David Lydall, Jennifer Hallinan, and Anil Wipat. Customizable views on semantically integrated networks for systems biology. Bioinformatics, 27(9):1299–1306, May 2011.
[19]
Simon J. Cockell, Jochen Weile, Phillip Lord, Claire Wipat, Dmytro Andriychenko, Matthew Pocock, Darren Wilkinson, Malcolm Young, and Anil Wipat. An integrated dataset for in silico drug discovery. Journal of integrative bioinformatics, 7(3), 2010.
[20]
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[21]
Paul Dobson, Kieran Smallbone, Daniel Jameson, Evangelos Simeonidis, Karin Lanthaler, Pinar Pir, Chuan Lu, Neil Swainston, Warwick Dunn, Paul Fisher, Duncan Hull, Marie Brown, Olusegun Oshota, Natalie Stanford, Douglas Kell, Ross King, Stephen Oliver, Robert Stevens, and Pedro Mendes. Further developments towards a genome-scale metabolic model of yeast. BMC Systems Biology, 4(1):145+, October 2010.
[22]
O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
[23]
Kieran Smallbone, Evangelos Simeonidis, David S. Broomhead, and Douglas B. Kell. Something from nothing – bridging the gap between constraint-based and kinetic modelling. FEBS Journal, 274(21):5576–5585, November 2007.
[24]
Kieran Smallbone, Evangelos Simeonidis, Neil Swainston, and Pedro Mendes. Towards a genome-scale kinetic model of cellular metabolism. BMC Systems Biology, 4(1):6+, January 2010.
[25]
Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
[26]
M. Hucka, A. Finney, B. J. Bornstein, S. M. Keating, B. E. Shapiro, J. Matthews, B. L. Kovitz, M. J. Schilstra, A. Funahashi, J. C. Doyle, and H. Kitano. Evolving a lingua franca and associated software infrastructure for computational systems biology: the Systems Biology Markup Language (SBML) project. Systems biology, 1(1):41–53, June 2004.
[27]
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
[28]
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[29]
Camille Laibe and Nicolas Le Novere. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
[30]
Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
[31]
Peter Li, Tom Oinn, Stian Soiland, and Douglas B. Kell. Automated manipulation of systems biology models using libSBML within Taverna workflows. Bioinformatics (Oxford, England), 24(2):287–289, January 2008.
[32]
Neil Swainston and Pedro Mendes. libAnnotationSBML: a library for exploiting SBML annotations. Bioinformatics, 25(17):2292–2293, September 2009.
[33]
M. L. Blinov, O. Ruebenacker, and I. I. Moraru. Complexity and modularity of intracellular networks: a systematic approach for modelling and simulation. IET systems biology, 2(5):363–368, September 2008.
[34]
Falko Krause, Jannis Uhlendorf, Timo Lubitz, Marvin Schulz, Edda Klipp, and Wolfram Liebermeister. Annotation and merging of SBML models with semanticSBML. Bioinformatics, 26(3):421–422, February 2010.

Background: What does systems biology data look like? (Thesis 1.2)

What does systems biology data look like?

Properties of a system exist that are more than just the sum of their parts; systems that contain these emergent properties are said to be irreducible (Note: Why Systems Matter, accessed December 2011.). Though reductionist methods of research can provide a large amount of detail for specific biological entities, a more holistic systems approach is required to understand emergent systems properties [1]. Such a top-down approach creates a life cycle of systems biology research. Beginning with the Hodgkin–Huxley model of squid axons in 1952 [2], hypotheses have been tested both in the laboratory and through simulations of mathematical models. Data from the laboratory informs these models, which can then be used to inform further experimentation and validate or invalidate hypotheses.

Systems biology focuses on the study of systems as a whole rather than on the examination of individual constituent parts. Data useful to systems biology tends to be large and heterogeneous both in dimensionality and in structure, with modern high-throughput techniques collecting vast amounts of relevant information [3]. It is standard practice to take data points from a sample not just once, but across space, time, geographical location, organisational or even spectral range [4]. The wide variety of experimental types leads to a correspondingly large number of data representations, analysis methods and modelling strategies [5]. The reconciliation of disparate systems biology data, and the concomitant organisation and management of biological data sources into an exploitable “resourceome”, is of great importance to researchers requiring access to existing data [6].

With the maturation of research methods, interpretations of the systems biology life cycle have become correspondingly more complex. Kitano detailed a relatively simple systems biology life cycle in 2002 which is summarised in Figure 1. By 2006, Philippi and colleagues had incorporated a data integration step as described in Figure 2. By 2009, semantics had become important enough to systems biology research that Antezana and colleagues had added formalisation of knowledge and reasoning to the cycle (see Figure 3).

Figure 1: The systems biology life cycle in 2002, based on Kitano [3, Fig.1]. “Dry”, in silico modelling and simulation experiments inform “wet” experiments, which in turn generate data used to create and further inform hypotheses.

Figure 2: The systems biology life cycle in 2006, based on Philippi and colleagues [7, Fig.1b]. Four years after the Kitano [3] life cycle was published, data integration methodologies, highlighted in yellow, were common enough to be added. Further, the entire cycle could be completed with either wet or dry experiments, or a combination of both.

Figure 3: The semantic systems biology life cycle in 2009, based on Antezana and colleagues [6, Fig.2]. The new methods of integration and the addition of a reasoning step are highlighted in yellow. The semantic phase is iterative, shown with an arrow back to the integration and formalism step. The continued importance of the original Kitano life cycle is described with an arrow bypassing the semantic phase. While the original figure by Antezana and colleagues did not explicitly include a reference to in silico research, the experiments described in the paper could have been either dry or wet.

Kitano’s life cycle does not mention databases or integration of generated data with other data sources. Philippi and colleagues’ modified life cycle has these additions as well as the acknowledgement that “dry” in silico experiments produce useful data independently of “wet” experiments. Historically, data integration in bioinformatics consisted of cross references between databases or links out via URLs (see Section 1.6 for more information). More complex linking became common as ontologies such as the GO [8] made it possible to reference community-wide hierarchies of descriptive biological terms.

Very recently, with an increase in the use of Semantic Web (Note: http://www.w3.org/2001/sw/) technologies such as ontologies, semantic data integration has become an important tool in systems biology research [6] (see Section 1.6). Figure 3 shows an interesting progression in the perception of researchers with regard to the systems biology life cycle with the addition of semantic techniques. By 2009, semantics and ontologies were becoming a bigger part of systems biology research. As such, Antezana and colleagues added the formalisation of data to the integration step, allowing data to be viewed in a semantically uniform way. The semantic data then becomes accessible to computational methods, allowing reasoning and consistency checking of the data. Even so, the research described in this thesis is one of only a handful of projects focusing on semantic data integration in systems biology.

There are four main areas of study in systems biology research: (i) the structure (e.g. interactions and pathways) of a system; (ii) how a system behaves over time, or its dynamics; (iii) the method of controlling and modulating the system; and (iv) the design method, or the deliberate progress using well-defined design principles [3]. These four properties are strongly tied to the quantitative modelling aspect of systems biology, and illustrate the importance of such models. However, models are of limited use to either people or computers if they do not have structured biological annotations to provide context [9]. For instance, until SBML [10] models are annotated by the BioModels team, elements often contain short-hand, biologically irrelevant names and descriptions in computationally incompatible free text [11]. While attaching additional biological knowledge to quantitative models is not a requirement for their simulation, without such annotations model sharing, interpretation of simulation results, integration and reuse become nearly impossible [9]. Therefore the addition of biologically relevant, computationally accessible metadata will not only enhance the semantics of a model but also provide a method of unambiguously identifying its elements.
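To make the difference between an annotated and an unannotated model element concrete, the following is a minimal sketch in Python using only the standard library. The toy model, the species names and the choice of a `bqbiol:is` qualifier with a MIRIAM-style URN are assumptions made purely for illustration; they are not taken from any curated BioModels entry, and the helper `species_annotations` is a hypothetical name rather than part of any SBML library.

```python
import xml.etree.ElementTree as ET

# Namespaces for SBML Level 2 Version 4, RDF and the BioModels biology qualifiers.
NS = {
    "sbml": "http://www.sbml.org/sbml/level2/version4",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bqbiol": "http://biomodels.net/biology-qualifiers/",
}

# Toy model (illustrative only): species s1 carries a structured annotation,
# whereas species s2 has nothing but a short-hand name.
SBML_TEXT = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cell"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="glc" compartment="cell" metaid="meta_s1">
        <annotation>
          <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                   xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
            <rdf:Description rdf:about="#meta_s1">
              <bqbiol:is>
                <rdf:Bag>
                  <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A17234"/>
                </rdf:Bag>
              </bqbiol:is>
            </rdf:Description>
          </rdf:RDF>
        </annotation>
      </species>
      <species id="s2" name="X2" compartment="cell"/>
    </listOfSpecies>
  </model>
</sbml>
"""


def species_annotations(sbml_text):
    """Map each species id to the list of annotation URIs attached to it."""
    root = ET.fromstring(sbml_text)
    resource_attr = "{%s}resource" % NS["rdf"]
    result = {}
    for species in root.iterfind(".//sbml:species", NS):
        uris = [li.get(resource_attr) for li in species.iterfind(".//rdf:li", NS)]
        result[species.get("id")] = uris
    return result


annotations = species_annotations(SBML_TEXT)
# Species with no machine-readable annotation can only be interpreted by a human.
unannotated = [sid for sid, uris in annotations.items() if not uris]
print(annotations)
print(unannotated)
```

In this sketch, a program can resolve `s1` to an external identifier, while `s2` offers only the free-text name `X2`, which is the situation the paragraph above describes for models that have not yet been annotated.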

The majority of systems biology research projects can ultimately be interpreted to produce interconnected data such as gene networks, protein networks and metabolic networks [12]. The level of granularity of these networks of information can vary from large-scale omics networks with thousands of nodes to precisely calibrated quantitative models of specific molecular interactions. The integration of networks and models presents a challenge to systems biology, increasing the importance of bioinformatics techniques to the life science community, a result in opposition to early predictions [13]. In Section 1.3, the description of biological systems is examined through the use of networks and models.

Bibliography

[1]
Uwe Sauer, Matthias Heinemann, and Nicola Zamboni. Getting Closer to the Whole Picture. Science, 316(5824):550–551, April 2007.
[2]
A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, August 1952.
[3]
Hiroaki Kitano. Systems Biology: A Brief Overview. Science, 295(5560):1662–1664, March 2002.
[4]
Jason R. Swedlow, Suzanna E. Lewis, and Ilya G. Goldberg. Modelling data across labs, genomes, space and time. Nature Cell Biology, 8(11):1190–1194, November 2006.
[5]
Katrin Hübner, Sven Sahle, and Ursula Kummer. Applications and trends in systems biology in biochemistry. FEBS Journal, 278(16):2767–2857, August 2011.
[6]
Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.
[7]
Stephan Philippi and Jacob Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
[8]
Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
[9]
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[10]
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[11]
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
[12]
James E. Ferrell. Q&A: systems biology. Journal of biology, 8(1):2+, January 2009.
[13]
Lincoln D. Stein. Bioinformatics: alive and kicking. Genome biology, 9(12):114+, December 2008.

Background: Overview (Thesis 1.1)

[Previous: Additional Front Material]
[Next: What does systems biology data look like?]

Overview

Studying biological systems requires a large amount of data of different experimental types. Historically, each of these types is stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. The use of community data standards can reduce the need for specialised, independent formats by providing a common syntax to make data retrieval and manipulation easier. However, standards uptake is not universal and the disparate data types required by systems biologists create data that is not, or cannot be, completely described by a single standard. If existing data does not share a standard structure, theoretically any heterogeneous data could be reproduced in a single format by rerunning experiments. Because in practice such a method would be expensive and time consuming, integrative methods which reuse existing data should be explored.

Though it is possible for the biology represented by any given data format to be completely orthogonal to other experimental types, more commonly, portions of the biology (but not necessarily the formats describing them) overlap. These common aspects of biological data representations create theoretical integration points among the representations, allowing information reuse and re-purposing. However, shared biological concepts do not necessarily result in shared definitions of the biology. Whereas differences in format result in syntactic heterogeneity, the differences in meaning of seemingly identical biological concepts across different formats result in semantic heterogeneity. A portion of the work presented in this thesis addresses syntactic heterogeneity both through the use of a common experimental metadata standard and by systematically integrating data for the purposes of systems biology model annotation. While syntactic heterogeneity can be resolved through the alignment of common structures, semantic heterogeneity is a more complex challenge. Once the meaning of the underlying biological concepts of interest for all data sources has been made explicit, the semantic heterogeneity can be identified. Further, if the semantics of a data format are accessible to machines, computational reasoning can be applied to find inferences and logical inconsistencies. The work described in this thesis includes the conversion of a systems biology standard specification, spread across multiple documents, into a single semantically aware model of that specification.
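The distinction between syntactic and semantic heterogeneity can be illustrated with a small Python sketch. Everything in it is invented for the purpose of illustration: the two source formats, the field names, the placeholder accession and, in particular, the assumption that one source reports amounts in millimolar and the other in micromolar.

```python
import json

# Two hypothetical sources describing the same protein in different formats.
SOURCE_A = "P12345\tcytoplasm\t2.5"  # tab-delimited text
SOURCE_B = '{"accession": "P12345", "location": "cytoplasm", "amount": 2500}'  # JSON


def parse_a(line):
    """Resolve syntactic heterogeneity for source A (tab-delimited)."""
    accession, location, amount = line.split("\t")
    return {"accession": accession, "location": location, "amount": float(amount)}


def parse_b(text):
    """Resolve syntactic heterogeneity for source B (JSON)."""
    record = json.loads(text)
    return {
        "accession": record["accession"],
        "location": record["location"],
        "amount": float(record["amount"]),
    }


# Explicit semantics (assumed for this sketch): A reports millimolar, B micromolar.
TO_MICROMOLAR = {"A": 1000.0, "B": 1.0}


def harmonise(record, source):
    """Resolve semantic heterogeneity by stating what 'amount' means per source."""
    out = {key: value for key, value in record.items() if key != "amount"}
    out["amount_uM"] = record["amount"] * TO_MICROMOLAR[source]
    return out


a = parse_a(SOURCE_A)
b = parse_b(SOURCE_B)

same_structure = a.keys() == b.keys()  # structures align after parsing
naive_match = a == b                   # but the values still disagree
semantic_match = harmonise(a, "A") == harmonise(b, "B")
print(same_structure, naive_match, semantic_match)
```

After parsing, both records share one structure, yet a naive comparison still treats them as different. Only once the meaning of the shared field is made explicit do the two records turn out to describe the same thing.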

A wide variety of integration methodologies addressing various aspects of syntactic and semantic heterogeneity are available, often optimised for different situations. Many of these methods do not address semantic heterogeneity in systems biology data. Of those that do, very few combine all of the following: the use of existing technologies, the resolution of both syntactic and semantic heterogeneity, and the use of both simple syntactic conversions of non-semantic formats and semantically meaningful models of the biological domain of interest. The work described in this thesis includes a method of semantic data integration called rule-based mediation which provides these features and which was developed as an aid to systems biology model annotation. Integrating resources with rule-based mediation accommodates multiple formats with different semantics and provides richly-modelled biological knowledge suitable for annotation of systems biology models.

Within this introductory chapter, Section 1.2 provides an overview of systems biology and the challenge presented when multiple formats are used. Section 1.3 describes how differences in format are not simply the result of different types of experiments, but are also due to the variety of ways systems are modelled. Section 1.4 describes the content, syntax and semantics standards relevant for systems biology. Data heterogeneity is an issue not limited to systems biology, and as such there is a large amount of previous work on data integration (see Section 1.6). As described in Section 1.5, existing technologies such as ontologies, rules and reasoning can bring together heterogeneous data in a homogeneous, meaning-rich, computationally-amenable manner.

Figure 1: How the work described in this thesis relates to the semantic systems biology life cycle described further in Section 1.2, Figure 3. SyMBA provides a common structure for experimental metadata and an archive for experimental data of any type, thus aiding data storage and analysis. Saint helps systems biology modellers annotate models with biological information in a standard way, thus enhancing the quality of models used in in silico experiments. MFO and rule-based mediation use Semantic Web technologies to formalise systems biology data and perform automated reasoning over that data. Rule-based mediation also semantically integrates information relevant to systems biology.

Figure 1 provides a summary of the work described in the chapters of this thesis in the context of the semantic systems biology life cycle originally described by Antezana and colleagues [1, Fig.2]. Retrieval and storage of systems biology experimental metadata using a common syntax becomes easier with applications like the collaboratively-developed SyMBA (Chapter 2). The Saint Web application (Chapter 3) provides syntactic integration of systems biology data as well as a simple interface for viewing, manipulating and exporting new biological annotation to existing systems biology models. Saint is useful both in its own right and as a test case for determining the data sources, capabilities and requirements for the implementation of rule-based mediation. New data sources can easily be added, and therefore Saint has the capacity to provide access to data integrated via rule-based mediation.

In some cases, community standards and syntactic integration are not enough. Rules and restrictions on the use of a standard syntax are not always confined to the syntax itself; extra information can be present in human-readable documentation such as Word or PDF documents, but not directly accessible to computers. Therefore, even if the data is in a common syntax, there are limits on its computational accessibility. This problem is resolved for SBML models through the use of MFO (Chapter 4), an ontology which holds SBML data as well as rules and restrictions on the SBML structure. MFO provides a format through which reasoning can be applied to SBML models, and stands both on its own and as part of the semantic data integration methodology described in Chapter 5.

Semantic data integration via rule-based mediation, described in Chapter 5, is a useful way of reusing and re-purposing heterogeneous datasets which cannot be, or are not, structured according to a common standard. This method of integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations. Syntactic heterogeneity is resolved through the conversion to a computationally-accessible syntactic ontology, and semantic heterogeneity is resolved by mapping the syntactic ontology to a biological domain of interest which is itself modelled using an ontology.
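The two-step shape of this approach can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the implementation described in Chapter 5: plain tuples take the place of a syntactic ontology, a dictionary of hand-written rules takes the place of ontology mappings, and the source names, field names and core concept names are all hypothetical.

```python
# Step 1: a format-mirroring ("syntactic") view of each source record.
# Each statement keeps the source's own field names, untouched.
def to_syntactic(source_name, record):
    return [(source_name, field, value) for field, value in record.items()]


# Step 2: mapping rules from (source, field) to a concept in a shared
# domain model. All names here are invented for illustration.
RULES = {
    ("sourceA", "acc"): "Protein.identifier",
    ("sourceB", "protein_id"): "Protein.identifier",
    ("sourceA", "loc"): "Protein.locatedIn",
    ("sourceB", "compartment"): "Protein.locatedIn",
}


def mediate(statements):
    """Apply the rules, collecting values under shared core concepts."""
    core = {}
    for source, field, value in statements:
        concept = RULES.get((source, field))
        if concept is not None:  # fields without a rule stay in the source view only
            core.setdefault(concept, set()).add(value)
    return core


record_a = {"acc": "P12345", "loc": "cytoplasm", "curator": "someone"}
record_b = {"protein_id": "P12345", "compartment": "cytoplasm"}

statements = to_syntactic("sourceA", record_a) + to_syntactic("sourceB", record_b)
core_view = mediate(statements)
print(core_view)
```

The design point the sketch tries to convey is the separation of concerns: adding a new source means writing one format-mirroring conversion and a set of rules, without changing the shared domain model or the other sources.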

Chapter 6 discusses future possibilities for data use and reuse. Will data integration become a thing of the past? An increase in the uptake of standards will likely occur as communities mature. Experiments and their outputs could become better organised, cheaper and more open. Data usage would be much easier if integration were not required at all. Ultimately, however, science moves faster than standards and there are always new questions and new experiments. Hopefully, integration will become seamless and transparent to the user as semantic methods, combined with use of large, open resources on the Semantic Web, serve heterogeneous data via a homogeneous view.

Bibliography

[1]
Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.

Additional Front Material (Thesis)

[Previous: Abstract]
[Next: 1.1 Overview]

Dedication

This thesis is dedicated to my parents, who encouraged me in all things, and is particularly dedicated to my husband and my son, without whose patience and support I could not have finished this research.

“Among those who have endeavoured to promote learning and rectify judgement, it has long been customary to complain of the abuse of words, which are often admitted to signify things so different, that, instead of assisting the understanding as vehicles of knowledge, they produce error, dissension, and perplexity….” Dr. Samuel Johnson, 1752, via Nature Structural & Molecular Biology 14, 681 (2007).

“Metadata is a love note to the future.” (NYPL Labs, http://twitpic.com/6ry6ar, via @CameronNeylon @kissane)


Acknowledgements

Many thanks go to my supervisors Dr. Anil Wipat, Dr. Phillip Lord and Dr. Matthew Pocock. A special thanks goes to those people who found extra commas and other errors: Phoebe Boatright, Dr. Dagmar Waltemath, Lucy Slattery, Mélanie Courtot and Dr. Paul Williams, Jr. Past and present colleagues at Newcastle University have provided much support and inspiration, including Dr. Frank Gibson and Dr. Katherine James.

I gratefully acknowledge the support of the BBSRC and the EPSRC for funding CISBAN at Newcastle University. I also acknowledge the support of the Newcastle University Systems Biology Resource Centre, the Newcastle University Bioinformatics Support Unit and the North East regional e-Science centre.

Declaration

I declare that the following work embodies the result of my own work, that it has been composed by myself and does not include work forming part of a thesis presented successfully for a degree in this or another University.

Allyson Lister

Contributions and Papers

During the course of this work, I have been involved both in the development of a number of standards efforts and in the publishing of a number of papers.

The list below describes the standards efforts to which I have contributed and the roles I have had within those efforts:

  • a developer of UniProt/TrEMBL [1];
  • a program developer and contributor to the FuGE [2, 3, 4], a standard XML syntax for describing experiments;
  • a core developer and coordinator of the OBI [5, 6, 7], a standard semantics for describing experiments;
  • an early developer of the ISA-TAB [8] tab-delimited syntax for describing experiments;
  • a developer of the minimal information checklist MIGS/MIMS for genomics and metagenomics information [9];
  • an advisor for the SBO, the KiSAO and the TEDDY;
  • a co-author of the SBML Level 3 Annotation package [10];
  • an advisor for the Cell Behavior Ontology; and
  • an advisor in the nascent synthetic biology standards.

I have co-authored 20 papers, specification documents, articles and technical reports during the course of this work:

  1. Mélanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
  2. Stephen G. Addinall, Eva-Maria Holstein, Conor Lawless, Min Yu, Kaye Chapman, A. Peter Banks, Hien-Ping Ngo, Laura Maringele, Morgan Taschuk, Alexander Young, Adam Ciesiolka, Allyson L. Lister, Anil Wipat, Darren J. Wilkinson, and David Lydall. Quantitative fitness analysis shows that NMD proteins and many other protein complexes suppress or enhance distinct telomere cap defects. PLoS Genet, 7(4):e1001362+, April 2011.
  3. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, January 2011.
  4. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through rule-based semantic integration. Journal of biomedical semantics, 1 Suppl 1(Suppl 1):S3+, 2010.
  5. Andrew R. Jones and Allyson L. Lister. Managing experimental data using FuGE. Methods in molecular biology (Clifton, N.J.), 604:333–343, 2010.
  6. Allyson L. Lister. Semantic integration in the life sciences. Ontogenesis, http://ontogenesis.knowledgeblog.org/126. January 2010.
  7. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live coverage of intelligent systems for molecular Biology/European conference on computational biology (ISMB/ECCB) 2009. PLoS Comput Biol, 6(1):e1000640+, January 2010.
  8. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live coverage of scientific conferences using Web technologies. PLoS Comput Biol, 6(1):e1000563+, January 2010.
  9. Allyson Lister, Varodom Charoensawan, Subhajyoti De, Katherine James, Sarath Chandra C. Janga, and Julian Huppert. Interfacing systems biology and synthetic biology. Genome biology, 10(6):309+, 2009.
  10. Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
  11. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: the minimum information to reference an external ontology term. In Barry Smith, editor, International Conference on Biomedical Ontology, pages 87–90. University at Buffalo College of Arts and Sciences, National Center for Ontological Research, National Center for Biomedical Ontology, July 2009.
  12. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through Rule-Based semantic integration. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 49+, June 2009.
  13. The OBI Consortium. Modeling biomedical experimental processes with OBI. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 41+, June 2009.
  14. Andrew R. Jones, Allyson L. Lister, Leandro Hermida, Peter Wilkinson, Martin Eisenacher, Khalid Belhajjame, Frank Gibson, Phil Lord, Matthew Pocock, Heiko Rosenfelder, Javier Santoyo-Lopez, Anil Wipat, and Norman W. W. Paton. Modeling and managing experimental data using FuGE. OMICS: A Journal of Integrative Biology, 13(3):239–251, June 2009.
  15. Mélanie Courtot, William Bug, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. The OWL of biomedical investigations. In OWLED 2008, October 2008.
  16. Susanna-Assunta Sansone, Philippe Rocca-Serra, Marco Brandizi, Alvis Brazma, Dawn Field, Jennifer Fostel, Andrew G. Garrow, Jack Gilbert, Federico Goodsaid, Nigel Hardy, Phil Jones, Allyson Lister, Michael Miller, Norman Morrison, Tim Rayner, Nataliya Sklyar, Chris Taylor, Weida Tong, Guy Warner, and Stefan Wiemann. The first RSBI (ISA-TAB) workshop: “Can a simple format work for complex studies?”. OMICS: A Journal of Integrative Biology, 12(2):143–149, June 2008.
  17. Dawn Field, George Garrity, Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J. Allen, Samuel V. Angiuoli, Michael Ashburner, Nelson Axelrod, Sandra Baldauf, Stuart Ballard, Jeffrey Boore, Guy Cochrane, James Cole, Peter Dawyndt, Paul De Vos, Claude dePamphilis, Robert Edwards, Nadeem Faruque, Robert Feldman, Jack Gilbert, Paul Gilna, Frank O. Glockner, Philip Goldstein, Robert Guralnick, Dan Haft, David Hancock, Henning Hermjakob, Christiane Hertz-Fowler, Phil Hugenholtz, Ian Joint, Leonid Kagan, Matthew Kane, Jessie Kennedy, George Kowalchuk, Renzo Kottmann, Eugene Kolker, Saul Kravitz, Nikos Kyrpides, Jim Leebens-Mack, Suzanna E. Lewis, Kelvin Li, Allyson L. Lister, Phillip Lord, Natalia Maltsev, Victor Markowitz, Jennifer Martiny, Barbara Methe, Ilene Mizrachi, Richard Moxon, Karen Nelson, Julian Parkhill, Lita Proctor, Owen White, Susanna-Assunta Sansone, Andrew Spiers, Robert Stevens, Paul Swift, Chris Taylor, Yoshio Tateno, Adrian Tett, Sarah Turner, David Ussery, Bob Vaughan, Naomi Ward, Trish Whetzel, Ingio San Gil, Gareth Wilson, and Anil Wipat. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology, 26(5):541–547, May 2008.
  18. A. L. Lister, M. Pocock, and A. Wipat. Integration of constraints documented in SBML, SBO, and the SBML manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80+, 2007.
  19. Dawn Field, George Garrity, Tanya Gray, Jeremy Selengut, Peter Sterk, Nick Thomson, Tatiana Tatusova, Guy Cochrane, Frank O. Glöckner, Renzo Kottmann, Allyson L. Lister, Yoshio Tateno, and Robert Vaughan. eGenomics: Cataloguing our complete genome collection III. Comparative and Functional Genomics, 2007.
  20. A. L. Lister, A. R. Jones, M. Pocock, O. Shaw, and A. Wipat. CS-TR number 1016: Implementing the FuGE object model: a systems biology data portal and integrator. Technical report, Newcastle University, April 2007.

Bibliography

[1]
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
[2]
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
[3]
Andrew R. Jones and Allyson L. Lister. Managing experimental data using FuGE. Methods in molecular biology (Clifton, N.J.), 604:333–343, 2010.
[4]
Andrew R. Jones, Allyson L. Lister, Leandro Hermida, Peter Wilkinson, Martin Eisenacher, Khalid Belhajjame, Frank Gibson, Phil Lord, Matthew Pocock, Heiko Rosenfelder, Javier Santoyo-Lopez, Anil Wipat, and Norman W. W. Paton. Modeling and managing experimental data using FuGE. OMICS: A Journal of Integrative Biology, 13(3):239–251, June 2009.
[5]
Mélanie Courtot, William Bug, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. The OWL of Biomedical Investigations. In OWLED 2008, October 2008.
[6]
The OBI Consortium. Modeling biomedical experimental processes with OBI. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 41+, June 2009.
[7]
Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, January 2011.
[8]
Susanna-Assunta Sansone, Philippe Rocca-Serra, Marco Brandizi, Alvis Brazma, Dawn Field, Jennifer Fostel, Andrew G. Garrow, Jack Gilbert, Federico Goodsaid, Nigel Hardy, Phil Jones, Allyson Lister, Michael Miller, Norman Morrison, Tim Rayner, Nataliya Sklyar, Chris Taylor, Weida Tong, Guy Warner, and Stefan Wiemann. The First RSBI (ISA-TAB) Workshop: “Can a Simple Format Work for Complex Studies?”. OMICS: A Journal of Integrative Biology, 12(2):143–149, June 2008.
[9]
Dawn Field, George Garrity, Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J. Allen, Samuel V. Angiuoli, Michael Ashburner, Nelson Axelrod, Sandra Baldauf, Stuart Ballard, Jeffrey Boore, Guy Cochrane, James Cole, Peter Dawyndt, Paul De Vos, Claude dePamphilis, Robert Edwards, Nadeem Faruque, Robert Feldman, Jack Gilbert, Paul Gilna, Frank O. Glockner, Philip Goldstein, Robert Guralnick, Dan Haft, David Hancock, Henning Hermjakob, Christiane Hertz-Fowler, Phil Hugenholtz, Ian Joint, Leonid Kagan, Matthew Kane, Jessie Kennedy, George Kowalchuk, Renzo Kottmann, Eugene Kolker, Saul Kravitz, Nikos Kyrpides, Jim Leebens-Mack, Suzanna E. Lewis, Kelvin Li, Allyson L. Lister, Phillip Lord, Natalia Maltsev, Victor Markowitz, Jennifer Martiny, Barbara Methe, Ilene Mizrachi, Richard Moxon, Karen Nelson, Julian Parkhill, Lita Proctor, Owen White, Susanna-Assunta Sansone, Andrew Spiers, Robert Stevens, Paul Swift, Chris Taylor, Yoshio Tateno, Adrian Tett, Sarah Turner, David Ussery, Bob Vaughan, Naomi Ward, Trish Whetzel, Ingio San Gil, Gareth Wilson, and Anil Wipat. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology, 26(5):541–547, May 2008.
[10]
Dagmar Waltemath, Neil Swainston, Allyson Lister, Frank Bergmann, Ron Henkel, Stefan Hoops, Michael Hucka, Nick Juty, Sarah Keating, Christian Knuepfer, Falko Krause, Camille Laibe, Wolfram Liebermeister, Catherine Lloyd, Goksel Misirli, Marvin Schulz, Morgan Taschuk, and Nicolas Le Novère. SBML Level 3 Package Proposal: Annotation. Nature Precedings, (713), January 2011.

Thesis Abstract

[Previous: Converting a Latex Thesis to Multiple WordPress Posts]
[Next: Additional Front Material]

Studying and modelling biology at a systems level requires a large amount of data of different experimental types. Historically, each of these types is stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. While the use of community data standards can reduce the need for specialised, independent formats by providing a common syntax, standards uptake is not universal and a single standard cannot yet describe all biological data. In the work described in this thesis, a variety of integrative methods have been developed to reuse and restructure already extant systems biology data.

SyMBA is a simple Web interface which stores experimental metadata in a published, common format. The creation of accurate quantitative SBML models is a time-intensive manual process. Modellers need to understand both the systems they are modelling and the intricacies of the SBML format. However, the amount of relevant data for even a relatively small and well-scoped model can be overwhelming. Saint is a Web application which accesses a number of external Web services and which provides suggested annotation for SBML and CellML models. MFO was developed to formalise all of the knowledge within the multiple SBML specification documents in a manner which is accessible to both humans and computers. Rule-based mediation, a form of semantic data integration, is a useful way of reusing and re-purposing heterogeneous datasets which cannot be, or are not, structured according to a common standard. This method of ontology-based integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations.

The work described in this thesis is one step towards the formalisation of biological knowledge useful to systems biology. Experimental metadata has been transformed into common structures, a Web application has been created for the retrieval of data appropriate to the annotation of systems biology models and multiple data models have been formalised and made accessible to semantic integration techniques.

Categories
Data Integration

Converting a Latex Thesis to Multiple WordPress Posts

A few months ago I finished my thesis, passed my viva and then submitted the hardbound copies to the library, and all was right with the world. However, after a few weeks I realised that I only had my thesis in PDF form, which is very hard to read, and in Latex, which is unintelligible to people who don’t know how to use it. Therefore, with the kind permission of Phil Lord, I am trying out the latex to wordpress software which he and others have been developing for use with the various knowledgeblogs.

The output of this conversion work is now available here. The thesis was separated into chapters or sections (depending on size) and posted individually. You can get them all in one place via the “thesis” tag in each of the posts or via the list below. Alternatively, you can download the human friendly (but computer unfriendly) PDF version of the thesis.

Thesis Posts

Current limitations of the conversion

There are a few things that aren’t quite right with the conversion at the minute. These were fixed through manual changes to the resulting HTML.

  1. The URLs in the bibliography sections were not being displayed automatically.
  2. The footnotes were created in the main text but not displayed at the end of the text. This will not be fixed programmatically as it is too awkward – they are just in the text as “(Note: […])” instead.

Technical Conversion Details

Please only read this section if you’re interested in a similar conversion process using the knowledgeblogs code.

This code is still in development, so it isn’t easy for someone not familiar with Make to understand. However, if you have a working knowledge of how makefiles work, then please read on.

  1. First, you need to check out the knowledgeblog codebase. You can use mercurial to check out the project at http://code.google.com/p/knowledgeblog/. The code can be found in the trunk/tooling/latextowordpress subdirectory.
  2. You need to install plastex. If you’re running Ubuntu or similar this can be done with the standard sudo apt-get install python-plastex command.
  3. Everything runs using the Makefile available in the latextowordpress directory. Test that things are OK by running the make simple_test command.
  4. Make a directory to store your .tex file input (and put your tex files in there) and another directory to store your HTML output.
  5. Comment out (e.g. use a “##” at the beginning of the line) the self['cite'] = self.do_cite line in knowledgeblog/wordpress/__init__.py within the latextowordpress directory.
  6. There are a few changes you may need to make to your tex file prior to compiling it and running the makefile. Ensure your tex file is a complete document (and not included in a parent document, for example). You need those \begin{document} […] \end{document} tags in the file. I also add the following line at the very top to sort out the problem with footnotes: \newcommand{\footnot}[1]{ (Note: \textit{#1})}. Then I can do a global search and replace for “\footnote” and replace it with “\footnot”, ensuring that footnotes are present, even if not visually ideal. Finally, ensure you add your bibliography tags to the end of the file if you don’t already have them (e.g. if they are normally in a parent tex document). I also had to replace “compactitem” references with “itemize”, as I didn’t want to include the package. You may have similar replacements to make.
  7. You’ll need to run your normal latex/pdflatex command that you would run to compile your latex code. This will do things like create a .bbl file for latextowordpress to make use of when generating the HTML.
  8. Add a new command within your Makefile to ensure that your tex files are being converted, and run that command. As an example, here is my new rule within the makefile (make sure that your tabs are correct!):
    my_thesis :
    	$(LTWP) -d 'thesis-output' tex-files/Abstract.tex
  9. Once you have your html file, you’ll notice that internal hrefs (here, just for citations) use a full url rather than a local url. You’ll have to do a global search-and-replace to make sure those appear correctly. For example, replace “Abstract.html#Smith2000” with “#Smith2000” by removing all references to “Abstract.html”.
  10. Create a new wordpress post, and put the HTML into the post. Upload images for any figures to wordpress.
  11. I manually added “Previous” and “Next” links to each post.
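
The two search-and-replace steps above (6 and 9) could also be scripted rather than done by hand. The following is only a minimal Java sketch and is not part of the knowledgeblog code: the class and method names are made up for illustration. It also assumes that matching the opening brace is acceptable, so that commands such as \footnotesize are left untouched by the footnote replacement.

```java
/** Hypothetical helper for steps 6 and 9; not part of the knowledgeblog code. */
final class ThesisPostFixups {

    /** Step 6: point footnotes at the custom \footnot macro. */
    static String renameFootnotes(String tex) {
        // Matching the opening brace leaves \footnotesize and similar commands alone.
        return tex.replace("\\footnote{", "\\footnot{");
    }

    /** Step 9: turn "Abstract.html#Smith2000" into "#Smith2000". */
    static String localiseHrefs(String html, String pageName) {
        return html.replace(pageName + "#", "#");
    }

    public static void main(String[] args) {
        System.out.println(renameFootnotes("Some text\\footnote{A note}."));
        System.out.println(localiseHrefs("<a href=\"Abstract.html#Smith2000\">", "Abstract.html"));
    }
}
```

Reading the .tex and .html files into strings and writing them back out is left to whatever file handling you prefer.
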
Categories
Data Integration Semantics and Ontologies Software and Tools

Current Research into Reasoning over BioPAX and SBML

What’s going on these days in the world of reasoning and systems biology modelling? What were people’s experiences when trying to reason over systems biology data in BioPAX and/or SBML format? These were the questions that Andrea Splendiani wanted to answer, and so he collected three of us with some experience in the field to give 10-minute presentations to interested parties at a BioPAX telecon. About 15 people turned up for the call, and there were some very interesting talks. I’ll leave you to decide for yourselves if you’d class my presentation as interesting: it was my first talk since getting back from leave, and so I may have been a little rusty!

The first talk was given by Michel Dumontier, and covered some recent work that he and colleagues performed on converting SBML to OWL and reasoning over the resulting entities.

Essentially, with the SBMLHarvester project, the entities in the resulting OWL file can be divided into two broad categories: in silico entities covering the model elements themselves, and in vivo entities covering the (generally biological) subjects the model elements represent. They copied all of BioModels into the OWL format and performed reasoning and analysis over the resulting information. Inconsistencies were found in the annotation of some of the models, and additionally queries can be performed over the resulting data set.

I gave the second talk about my experiences a few years ago converting SBML to OWL using Model Format OWL (MFO) (paper) and then, more recently, using MFO as part of a larger semantic data integration project whose ultimate aim is to annotate systems biology models as well as create skeleton (sub)models.

I first started working on MFO in 2007, and started applying that work to the wider data integration methodology called rule-based mediation (RBM) (paper) in 2009. As with SBMLHarvester, libSBML and the OWLAPI are used in the creation of the OWL files based on BioModels entries. All MFO entries can be reasoned over, and constraints present within MFO from the SBML XSD, the SBML Manual, and SBO do provide some useful checks on converted SBML entries. The semantics of SBMLHarvester are more advanced than those of MFO; however, MFO is intended to be a conversion of a format only, so that SWRL mappings can be used to input/output data from MFO to/from the core of the rule-based mediation. Slide 8 of the above presentation provides a graphic of how rule-based mediation works. In summary, you start with a core ontology which should be small and tightly-scoped to your biological domain of interest. Data is fed to the core from multiple syntactic ontologies using SWRL mappings. These syntactic ontologies can be either direct format conversions from other, non-OWL, formats or pre-existing ontologies in their own right. I use BioPAX in this integration work, and while I have mainly reasoned over MFO (and therefore SBML), I do also work with BioPAX and plan to work more with it in the near future.

The final presenter was Ismael Navas Delgado, whose presentation is available from Dropbox. His talk covered two topics: reasoning over BioPAX data taken from Reactome, and the use of a database back-end called DBOWL for the OWL data. By simply performing reasoning over a large number of BioPAX entries, Ismael and colleagues were able to discover not just inconsistencies in the data entries themselves, but also in the structure of BioPAX. It was a very interesting summary of their work, and I highly recommend looking over the slides.

And what is the result of this TC? Andrea has suggested that we continue the discussion on the mailing list (contact Andrea Splendiani if you are not on it and want to be added) and then have another TC in a couple of weeks. Andrea has also suggested that it would be nice to “setup a task force within this group to prepare a proof of concept of reasoning on BioPAX, across BioPAX/SBML, or across information resources (BioPAX/OMIM…)”. I think that would be a lot of fun. Join us if you do too!

Categories
CISBAN Data Integration Meetings & Conferences

Henning Hermjakob: PSICQUIC and EnVision

This is a presentation given on 29 April, 2010, at the Link-Age / LifeSpan Workshop on Data Handling for Biogerontology Research held by CISBAN, Newcastle University.

Data integration: one definition is to combine data residing in different sources providing users a unified view of these data. Questions of relevance for the data integration field: scope (all, datasets), type (same, different), implementation (federation, centralisation), access (programmatic i.e. computer to computer, web i.e. interactive) and ownership (public, private). Henning covers federated, mainly programmatic techniques using data of the same type in this talk.

To take an example, you start with a sample (e.g. from a mouse). Observations of this sample result in one or more (overlapping or non-overlapping) publications. Then, the publication information can be used to annotate interaction databases and sent to PSICQUIC servers. PSICQUIC should allow the user to reconstruct an idealised view of the original system from the interaction data.

The molecular interaction standard is the PSI-MI standard, whose first XML version was produced in February 2004. There have been updates and extensions since then, and it has been widely implemented by the major interaction databases including DIP, MINT, MIPS, IntAct, HPRD, etc. (http://www.psidev.info/MI)

The PSI-MI XML format is full featured, but complex. This complexity is both its strength and its weakness. Therefore, due to user request, they developed a simplified tabular format called MITAB where one row equals one binary interaction. You lose a lot of information, such as whether a binary interaction is part of a more complex reaction, but it has proven popular.
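
To make the "one row equals one binary interaction" idea concrete, here is a minimal Java sketch of reading a single MITAB row. It is not an official PSI-MI parser. It assumes, as in the MITAB 2.5 layout, that columns are tab-separated and that the first two columns hold the identifiers of interactors A and B. The class name is hypothetical and the identifiers in the example are made up.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative holder for one MITAB row; not an official PSI-MI class. */
final class MitabRow {
    final String idA;                 // identifier of interactor A (column 1)
    final String idB;                 // identifier of interactor B (column 2)
    final List<String> otherColumns;  // everything after the two identifiers

    private MitabRow(String idA, String idB, List<String> otherColumns) {
        this.idA = idA;
        this.idB = idB;
        this.otherColumns = otherColumns;
    }

    /** Splits one tab-separated line into a binary interaction. */
    static MitabRow parse(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        String[] cols = line.split("\t", -1);
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected at least two columns");
        }
        return new MitabRow(cols[0], cols[1], Arrays.asList(cols).subList(2, cols.length));
    }

    public static void main(String[] args) {
        // Made-up identifiers, for illustration only.
        MitabRow row = parse("uniprotkb:P00001\tuniprotkb:P00002\t-");
        System.out.println(row.idA + " interacts with " + row.idB);
    }
}
```
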

PSICQUIC is one API which is implemented by many databases such as those mentioned earlier. Its purpose is querying molecular interaction databases, and it uses a common query language (MIQL, which is based on Lucene) for this data. It can be used for PPIs, drug-target interactions and simplified pathway data. The simple PSICQUIC viewer is at http://www.ebi.ac.uk/Tools/webservices/psicquic/view. The PSICQUIC viewer can also point to other resources such as IntAct and many other non-EBI databases. The viewer also has a fancier, graphics-based implementation where there is an overlay of molecular interactions on Reactome pathways.

MIQL can query every field available within MITAB in a precise way. SOAP and REST interfaces are available and documented at http://code.google.com/p/psicquic.
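
Because MIQL is based on Lucene, a query is essentially a set of field:value terms combined with boolean operators. The following is a minimal Java sketch of assembling such a query string, not part of any PSICQUIC client library. The class is hypothetical, and the field names used in the example (species, detmethod) are assumptions that should be checked against the PSICQUIC documentation before use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical builder for Lucene-style field:value queries. */
final class MiqlQuery {
    // LinkedHashMap keeps the terms in the order they were added.
    private final Map<String, String> terms = new LinkedHashMap<>();

    MiqlQuery where(String field, String value) {
        terms.put(field, value);
        return this;
    }

    /** Joins all terms with AND, quoting values that contain spaces. */
    String build() {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (Map.Entry<String, String> term : terms.entrySet()) {
            String value = term.getValue();
            if (value.contains(" ")) {
                value = "\"" + value + "\"";
            }
            joiner.add(term.getKey() + ":" + value);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Field names here are assumed, not taken from the talk.
        System.out.println(new MiqlQuery()
                .where("species", "human")
                .where("detmethod", "two hybrid")
                .build());
    }
}
```

The resulting string would still need URL-encoding before being sent to a REST interface.
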

The challenge is to move PSICQUIC from simple access to all the resources to a real integrated view of all those resources. How to determine if two sources really are talking about the same interaction? Also, the compute time quickly moves beyond suitable interactive times.

PSICQUIC is a technical solution, whereas IMEx is the social/collaborative answer. IMEx is the International Molecular Exchange Consortium. The aims of its members include: avoiding redundant curation, providing a consistent body of public data using detailed joint curation guidelines, and providing a network of stable and comprehensive resources for MI data. This work has been in production phase since February 2010. The work is split up into the different databases by journal type. You can find out more information about IMEx at http://imex.sf.net. Each interaction has its own database’s identifier, but also an identifier from a common IMEx identifier space. The hardest part was harmonizing curation procedures, and they now have a common curation manual across all databases.

Another aspect of his work is EnCORE, which integrates different data types using a federated, programmatic approach. EnCORE is an ENFIN platform to enable mining data across various domains, sources, formats and types. It integrates database resources and analysis tools across different disciplines. The first focus is on developing an EnXML format. Access interfaces include Perl API, Java API, ftp, SOAP, REST, GUI, etc. The return formats are in a variety of flavours, e.g. XML, CSV, plain text, JSON, etc. All of this must be squeezed into one consistent format. This is done by putting wrappers around the various programs.

The EnXML structure is set oriented – not only does it tell you about one thing (e.g. protein), but also about a set of them. In this structure, an experiment is run which identifies the results. Each experiment references a Set structure, which contains the structure of the result. Sets can hold further nested sets. There are a number of other further sub-structures. The EnCORE results always include both a positive and negative result set (in the case of the negative result, it lists all identifiers for which *no* hit was found). Negative results allow you to track why you might not have gotten a response, and how you “lost” some identifiers from the result.
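
The idea of always returning both a positive and a negative result set can be sketched in a few lines of Java. This is not the EnXML schema or the EnCORE API: the class, its fields and the in-memory map standing in for a data source are all hypothetical, and the sketch only illustrates how identifiers with no hit can be tracked rather than silently dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical result holder with a positive and a negative set. */
final class SplitResult {
    final Map<String, String> positive = new LinkedHashMap<>(); // identifier -> hit
    final List<String> negative = new ArrayList<>();            // identifiers with no hit

    /** Looks up every identifier and records misses instead of discarding them. */
    static SplitResult lookup(List<String> identifiers, Map<String, String> source) {
        SplitResult result = new SplitResult();
        for (String id : identifiers) {
            String hit = source.get(id);
            if (hit != null) {
                result.positive.put(id, hit);
            } else {
                result.negative.add(id);
            }
        }
        return result;
    }
}
```

With this shape, the identifiers in the positive and negative sets together always add up to the identifiers that were submitted, which is what makes it possible to see where an identifier was "lost".
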

EnVision is an end-user tool for the above EnCORE work based on the EnXML format. It provides an initial, integrative view for Sets of molecular entities without the need for programming. It also allows the possibility for further local processing. It allows you to save the status/analysis of your material on a particular date, and use that for, e.g., supplementary materials. You can also download your sub-results in a tabular format. Further information and the ability to run this GUI is available from http://www.enfin.org, where you can play with an EnCORE tutorial.

All of this can be quite laborious – web services that are used by EnCORE can change without warning, so it’s a constant challenge to maintain all of these wrappers. A partial answer is to use, wherever possible, underlying standards for the individual services. Such standards include PSICQUIC for MI data. DAS will be used to access protein annotation and information.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Categories
Data Integration Semantics and Ontologies Software and Tools

Short Tutorial on using Pellet + OWLAPI + SWRL Rules

I’ve been looking through Pellet and OWLAPI documentation over the past few days, looking for a good example of running existing SWRL rules via the OWLAPI using Pellet’s built-in DL-safe SWRL support. SWRL is used in ontology mapping, and is a powerful tool. Up until now, I’ve just used the SWRLTab, but needed to start running my rules via plain Java programs, and so needed to code the running of the mapping rules in the OWLAPI (which I’m more familiar with than Jena). Once I clean up the test code, I’ll link it from here so others can take a look if they feel like it.

This example uses the following versions of the software:

Pre-existing Examples

Pellet provides a SWRL rule example (RulesExample.java in the Pellet download), but only for Jena, and not the OWLAPI. The OWLAPI Example3.java covers the creation of SWRL rules, but not their running. Therefore, to help others who may be walking the same path as I, a short example of OWLAPI + Pellet + SWRL follows.

New Example

This example assumes that you already have the classes, individuals, and rules mentioned below in an OWL file or files. Here is how the test ontology looks, before running the rule (you can use reasoner.getKB().printClassTree() to get this sort of output):


owl:Thing
source:SourceA - (source:indvSourceA)
source:SourceB - (source:indvSourceB)
target:TargetA
target:TargetB

The example SWRL rule is this (the rule.toString() method prints this kind of output, while iterating over ontology.getRules()):


Rule( antecedent(SourceA(?x)) consequent(TargetA(?x)) )
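
To show what this rule means without any reasoner involved, here is a toy illustration in plain Java. It does not use Pellet or the OWLAPI and is not how Pellet evaluates SWRL. The class and method names are hypothetical. It only mimics the effect of a rule of the form SourceA(?x) → TargetA(?x) on a simple map from class names to their individuals.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy illustration of a one-atom rule; not a SWRL engine. */
final class RuleSketch {

    /** Returns a copy of the membership map in which every member of fromClass is also in toClass. */
    static Map<String, Set<String>> apply(Map<String, Set<String>> members,
                                          String fromClass, String toClass) {
        // Copy the input so that the asserted memberships stay untouched.
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : members.entrySet()) {
            result.put(entry.getKey(), new TreeSet<>(entry.getValue()));
        }
        // Individuals matching the antecedent of the rule.
        Set<String> matched = new TreeSet<>();
        if (result.containsKey(fromClass)) {
            matched.addAll(result.get(fromClass));
        }
        // The consequent: each matched individual also belongs to toClass.
        result.computeIfAbsent(toClass, key -> new TreeSet<>()).addAll(matched);
        return result;
    }
}
```

Applied to the test ontology above with SourceA and TargetA, indvSourceA ends up under both classes while indvSourceB stays where it was.
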

Please note that if you want to modularise your OWL files, as I do (I have different files for the source classes, the target classes, the source individuals, the target individuals, and the rules) then make sure your owl:imports in the primary OWL ontology are correct, and that you’ve mapped them correctly with the SimpleURIMapper class and the manager.addURIMapper(mapper) method. I will update this post with some unit tests of this setup once I’ve cleaned up the code for public consumption.

Once you have your ontology properly loaded into an OWLAPI OWLOntology class, you should simply realize the ontology with the following command to run the SWRL rules:


getReasoner().getKB().realize();

After this command, all that’s left to do is save the new inferences. In this simple case, one individual is inferred to also be an instance of the TargetA class, as follows:


owl:Thing
source:SourceA - (source:indvSourceA)
source:SourceB - (source:indvSourceB)
target:TargetA - (source:indvSourceA)
target:TargetB

You can do this by using the following code to explicitly save the new inferences to a separate ontology file. You can modify InferredOntologyGenerator to just save a subset of the inferences, if you like. Have a look in the OWLAPI code or javadoc for more information. Alternatively, you could just iterate over the ABox and just save the new individuals to a file. Here’s the code for saving the ontology to a new location:


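// Create an empty ontology to hold the inferred axioms
OWLOntology exportedOntology = manager.createOntology( URI.create( outputLogicalUri ) );
// Fill it with the inferences computed by the reasoner
InferredOntologyGenerator generator = new InferredOntologyGenerator( reasoner );
generator.fillOntology( manager, exportedOntology );
// Save the result as RDF/XML at the physical location
manager.saveOntology( exportedOntology, new RDFXMLOntologyFormat(), URI.create( outputPhysicalUri ) );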

I hope this helps some people!