Entanglement: Embarrassingly-scalable graphs

In a previous post, I talked about what kind of visualizations would make sense for large-scale epigenomic and related data. Now, I’d like to introduce the kind of data structures we’re building in the Newcastle ARIES project to support the creation of such visualizations.

Entanglement is a software platform that provides a generic, scalable, graph framework suitable for data integration applications that require embarrassing scalability. With the data sets we are currently importing, write times for Entanglement scale linearly over millions of database entities. A poster (Entanglement: Embarrassingly Scalable Graphs) and abstract for Entanglement were presented last month at the Integrative Bioinformatics 2013 symposium in IPK-Gatersleben, Germany.

Included below is the text of the abstract, together with a summary of the poster’s contents.

Epigenetics is becoming a major focus of research in the area of human genetics. In addition to contributing to our knowledge of inheritance, epigenetic profiles can be used as prognostic or predictive biomarkers. Methylation of DNA in leukocytes is one of the most commonly measured forms of epigenetic modification. The ARIES project generates epigenomic information for a range of human tissues at multiple points in time. ARIES uses both Illumina Infinium 450K methylation arrays and BS-seq approaches to generate epigenetic data on a number of samples from the Avon Longitudinal Study of Parents and their Children (ALSPAC) cohort. ALSPAC is a unique resource for studying how methylation patterns change over time. The ARIES project also provides tools and resources to the community for the interpretation of epigenomic data in the form of an integrated dataset and associated Web portal for browsing and integrative analysis of experimental methylation data. The integrated dataset includes a range of data types such as phenotypic, transcriptomic and methylation data from rodents, together with data generated by studies such as the ENCODE project.

The integration of these data is a considerable bioinformatics challenge. To meet this challenge we are developing a graph-based data integration platform, extending our previous work with the ONDEX system. We have developed a scalable, parallel graph storage system that exploits Cloud computing infrastructures for integrating data to produce graphs of entities and the relationships between them. This system, called Entanglement, has been designed to tackle the problem of scalability that is inherent in most graph-based approaches to bioinformatics data integration.

Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.
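The revision-history and branching model described above can be pictured as an append-only log in which a branch records a pointer to its parent and the point at which it diverged. The following minimal Python sketch illustrates the idea only; the class and method names are illustrative and are not Entanglement's actual API:

```python
from dataclasses import dataclass

@dataclass
class Revision:
    """One update to a graph entity, recorded in the history."""
    entity_id: str
    operation: str      # e.g. "create-node", "set-attribute"
    payload: dict

class RevisionLog:
    """Append-only revision history supporting forked branches."""

    def __init__(self):
        # branch name -> (parent branch, fork point, local revisions)
        self.branches = {"master": (None, 0, [])}

    def commit(self, branch, revisions):
        """Append a transaction (a group of updates) to a branch."""
        self.branches[branch][2].extend(revisions)

    def fork(self, parent, child):
        """Diverge a new branch from the current tip of `parent`."""
        fork_point = len(self.history(parent))
        self.branches[child] = (parent, fork_point, [])

    def history(self, branch):
        """All revisions visible on `branch`, inherited ones included."""
        parent, fork_point, local = self.branches[branch]
        if parent is None:
            return list(local)
        return self.history(parent)[:fork_point] + local
```

In this sketch, forking after an initial data import gives each analysis its own branch: commits made to one branch after the fork are invisible to the others, while everything imported before the fork is shared, which mirrors the workflow described above.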

Data is distributed across a MongoDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed across local computing clusters as well as to cloud computing from commodity providers. Larger problems can be tackled by increasing the CPU and storage resources in a scalable fashion. An API provides access to a range of graph operations including rapidly cloning or merging existing graphs to form new graphs. Although the ultimate aim is a fully integrated dataset, by intentionally storing different data sources in different graphs a large amount of flexibility can be obtained.
Multiple ad-hoc integrated views can be composed by importing references to the nodes and edges in various individual dataset graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.
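The idea of keeping each data source in its own graph and composing ad-hoc views from references, rather than copying data, can be sketched as follows. This is a hypothetical illustration of the pattern, not Entanglement's actual classes:

```python
class Graph:
    """A named graph holding its own nodes (edges omitted for brevity)."""
    def __init__(self, name):
        self.name = name
        self.nodes = {}          # node id -> attribute dict

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

class IntegratedView:
    """An ad-hoc view composed purely of (graph, node id) references,
    leaving the underlying dataset graphs untouched."""
    def __init__(self):
        self.refs = []

    def include(self, graph, node_id):
        self.refs.append((graph, node_id))

    def resolve(self):
        """Materialise the referenced nodes, e.g. for export to a
        visualisation tool."""
        return {f"{g.name}:{nid}": g.nodes[nid] for g, nid in self.refs}
```

Because the view holds only references, many overlapping views can coexist cheaply over the same dataset graphs, which is where the flexibility described above comes from.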

Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly available data resources. A data model with project-specific node and edge definitions has also been developed.

Entanglement (http://www.entanglementgraph.com) Architecture Overview

Data integration for ARIES will ultimately require graphs containing hundreds of millions, if not billions, of graph entities. Entanglement has been shown to scale linearly with our initial ARIES datasets involving graphs with up to 50 million nodes and edges. Our results suggest that the system will scale to much larger graph sizes. Data storage capacity can be expanded by adding MongoDB servers to an existing cluster. Indexes required for efficient querying are designed to fit in memory, as long as enough machines are available to the cluster.

Entanglement is available under the Apache license at http://www.entanglementgraph.com.

Epigenomic Data Integration with Graphs: a Beginning

I am currently working in Prof. Neil Wipat’s group at Newcastle University on the ARIES project. This involves working with large amounts of epigenomics data generated by ARIES itself, as well as with all sorts of related information from external data sources.

As well as producing an integrated data set for ARIES’ nascent genome track browser (an indispensable tool for this type of data), we’re working on something else very exciting: graph data. Specifically, we’re trying out all sorts of visualizations for relevant data sets (including the ARIES data). Here’s one I’ve been playing with recently.

The pink nodes represent the median beta values for methylation sites along the human genome. The lighter the pink, the less likely this particular point on the chromosome is methylated, and vice versa. At a glance, the user can see all integrated beta values (and therefore all experiments containing methylation information) for a particular chromosome location. This is a small, gene-centric graph (the gene is in green) intended for people who would like to see an overview of known experimental results for a particular gene of interest.
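The colour scheme (lighter pink for lower beta values, darker for higher) amounts to a simple linear interpolation between two RGB endpoints. A quick sketch of the idea; the two endpoint colours here are illustrative choices, not the exact values used in the figure:

```python
def beta_to_pink(beta):
    """Map a methylation beta value in [0, 1] to an RGB pink shade:
    beta = 0 gives a very pale pink (mostly unmethylated),
    beta = 1 a saturated pink (heavily methylated)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta values must lie in [0, 1]")
    pale = (255, 228, 240)       # near-white pink endpoint
    saturated = (219, 48, 122)   # deep pink endpoint
    # linear interpolation per colour channel
    return tuple(
        round(p + (s - p) * beta) for p, s in zip(pale, saturated)
    )
```
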

This is just the start; we have lots of other visualization ideas, as well as lots of ideas for the creation of novel (and interesting) types of subgraphs. Our database is huge, and a hairball of the entire thing (or even of just one chromosome) is unlikely to be as informative as subgraphs like this one, created around a particular area of interest.

But we’re not just working on the export of interesting subgraphs from our graph database: my colleague Keith Flanagan has been developing a highly scalable and incredibly neat graph database built on MongoDB.

You’ll probably see a lot of pictures like the one above in the coming weeks on this blog, as we experiment with visualizations and views of our data. If you’re into epigenomics, and have always wanted to view your data in a particular way, please leave a comment below. I’d also love to hear your ideas about this particular visualization type. Your input would be most welcome.

PhD Thesis: Table of Contents

Below you can find a complete table of contents for all thesis-related posts (you can also get to the posts via the “thesis” tag I have used for each). Enjoy!

Thesis Posts

And, if you’re interested in how I performed the conversion, I’ve written about that too.

Chapter 6: General Discussion (Thesis)

[Previous: Chapter 5: Rule-based mediation]
[Next: PhD Chapter Outline and Latex to WordPress Conversion Notes]

General Discussion

As the structure, description and interconnectedness of life science data has become more sophisticated, a shared formalisation of all data for systems biology has become easier to imagine. Tools are available, such as those used by the Semantic Web community, to capture any area of biology in the form of a semantically defined model. While challenges such as achieving tractable reasoning over highly expressive ontologies still remain to be addressed, a future where vast amounts of data and metadata are modelled according to a shared formalisation is not just fantasy, as shown by the work described in this thesis. The research presented here demonstrates the transformation of existing systems biology data models into more computationally amenable and more semantically aware structures.

The work presented also enables a more computationally amenable approach to systems biology, ultimately aiding the progression of systems biology metadata into a more formal structure. There is a general trend towards “big data”, both with in-house private repositories and open data across all of the sciences, making maintenance and structuring of that data of primary importance [1, 2]. SyMBA (Chapter 2) aids the formalisation of systems biology knowledge by providing a common structure for experimental metadata and limiting the complexity of metadata input at the user interface. Traditionally, metadata input procedures are time consuming and often complex for large volumes of data. SyMBA was created to make it easy for researchers to add metadata even when large amounts of data are involved. In general the easier metadata entry becomes, the more metadata will be stored by users. By structuring systems biology metadata in a common format, it is made much more accessible to computational techniques and to the systems biology community as a whole.

Saint (Chapter 3) helps researchers find appropriate biological data for systems biology models by reusing existing data from disparate locations. Saint provides a method of integrating multiple Web-accessible resources using syntactic clues in the query model, offering prospective annotation to the modeller. This type of syntactic data integration is fast, providing a large amount of integrated data for relatively low cost and high scalability. However, syntactic integration does not resolve semantic differences in the data, nor does it allow a high level of expressivity between data sources or in the description of those data sources. The schema reconciliation in non-semantic approaches is generally hard-coded for the immediate integrative task, and not easily reused. Often, data is aligned by linking structural units such as XSD components or table and row names rather than the underlying biological components. Further, concepts between the source and target schema are often linked based on syntactic similarity, which does not necessarily account for possible differences in the meanings of those concepts.
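The limitation described here, linking concepts on surface similarity rather than meaning, is easy to see in a minimal sketch of syntactic matching. This toy example uses a normalised string-similarity heuristic; the threshold and term names are illustrative, not drawn from Saint itself:

```python
from difflib import SequenceMatcher

def syntactic_match(source_terms, target_terms, threshold=0.8):
    """Pair up schema terms whose names look alike. Fast and scalable,
    but blind to meaning: lexically similar terms may denote different
    biological concepts, and true synonyms are missed entirely."""
    matches = []
    for s in source_terms:
        for t in target_terms:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                matches.append((s, t, round(score, 2)))
    return matches
```

A matcher like this happily links `geneName` to `gene_name`, but has no way to discover that `organism` and `species` refer to the same concept, which is exactly the gap that semantic approaches aim to close.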

Semantic data integration allows greater expressivity at the expense of scalability. Controlled vocabularies and ontologies are of primary importance in the resolution of heterogeneity via semantic means [3]. By using ontologies, entities can be integrated across domains according to their meaning. Ruttenberg and colleagues view the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; indeed, any life science field where there are large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies [4]. However, application of such techniques in bioinformatics is difficult, partly due to the bespoke nature of the majority of available tools [5]. MFO (Chapter 4) transforms existing SBML specification documents into a single model in OWL, allowing the storage of both SBML models and the restrictions on those models. While some of the SBML specification documents are already computationally accessible, one document is readable only by humans and each of the others is in a different format. MFO brings concepts from all of the SBML documents into a single computationally accessible format which can be used with a variety of Semantic Web technologies.

Rule-based mediation (Chapter 5) transforms multiple systems biology data models into a single, more computationally accessible model. This work examines the feasibility of semantic data integration within a systems biology context. To gain access to the underlying semantics of the data, a semantically-rich core ontology is utilised together with mappings to and from source ontologies representing the data sources of interest. New data sources can be easily inserted without modification of the biologically-relevant core ontology. Separation of syntactic integration and semantic description of the biology is robust to changes to both the source ontologies and the core ontology. The use of existing tools decreased development time and increased the applicability of this approach for future projects. While some hurdles, such as scalability and the creation of a simple user interface, remain, rule-based mediation has been shown to be an effective approach for model annotation.

The field of semantics and ontologies in the life sciences is continuing to mature. The traditional and highly useful role of the ontology as metadata and link provider is pervasive. In this role, a particular database entry structures the data according to a defined data format, and within that format links are provided to specific ontology terms. The data format is the primary structure, and the ontological terms are keywords and links out to more information and to other database entries sharing that term. Very recently, new research such as rule-based mediation has freed ontologies, reversing the roles of ontology and structured format [6, 7, 8, 9]. With semantic integration methodologies, ontologies—rather than the syntactic format—have become the backbone from which data hangs. Ontologies can therefore be used to describe the entirety of a data model or biological domain rather than just provide links to further information. The use of ontologies as a sophisticated semantic integration tool allows the creation of a methodology which not only follows cross-referenced links from one database entry to the next, but aligns information according to biological knowledge, limiting ambiguity and making that knowledge accessible to both humans and computers.

The work described in this thesis is one step towards the formalisation of biological knowledge useful to systems biology. Experimental metadata has been transformed into common structures, data appropriate to the annotation of systems biology models can be retrieved and presented to users, and multiple data models have been formalised and made accessible to semantic integration techniques.

1 Simplicity versus complexity

When transforming data and metadata into a common structure or model, there are always compromises to be made between simplicity and complexity. Rather than there being a single correct answer, researchers investigating particular integration techniques choose points along this spectrum amenable to their requirements. The dichotomy of simplicity and complexity is appropriate for a range of topics relevant to systems biology data integration including metadata, simulatable model creation and expressivity. The question of expressivity versus scalability is discussed in more detail below as it has a direct effect on the performance of semantic data integration techniques such as rule-based mediation.

In general, researchers find it easier to create data than to annotate that data with large amounts of metadata; while simplistic metadata requirements allow quicker data deposition, complex metadata requirements can create a barrier to data entry. Standards are often developed once “just-enough” data description strategies no longer suffice (as described in Section 1.4); however, as the amount of metadata asked of researchers grows, the chance of receiving complete metadata shrinks.

Systems biology models themselves are another example of choosing between simplicity and complexity. This is not simply a question of the granularity of such models, as whole organism models and single pathway models of a similar level of complexity can be produced. However, simpler models with fewer species and less complex interactions can be more easily understood by people and simulated by computers. While complex models present a more complete picture, the resources required to simulate and understand them will be correspondingly greater.

As a final example, formal models such as ontologies can be either simple and scalable or complex and expressive; there is always a balance to be made between these two choices. Expressivity is a measure of what it is possible to say with a language, and as such has a direct bearing on the inferences possible with that language; too little expressivity results in a lack of “reasoning opportunities” in a language which would otherwise not provide any benefits over existing languages [10]. Simple formal models are usually scalable and can easily be reasoned over but do not, by definition, contain a high amount of expressivity. Complex models have a higher level of expressivity and are capable of modelling more biological context, for example. With such models a more precise description of the domain of interest is created at the cost of reasoning time.

OWL-DL is an ontology formalism which is guaranteed to be decidable. The decidable subsets of OWL, such as OWL-DL, are intended to allow both expressivity and tractability for reasoning purposes; these requirements are typically in opposition to each other, and as such it is a goal of OWL to find a suitable balance [10]. Specifically, there are no guarantees made about the level of scalability of any given ontology written in OWL-DL. The more complex an ontology the less amenable it is to reasoning; a change of even a single axiom can result in a reasoner completing in days rather than seconds. Detailed studies of the complexity of ontological reasoning and how that complexity increases with the expressiveness of an ontology are available [11, Chapter 3], and are beyond the scope of this thesis. However, expressivity versus tractability is of vital importance in semantic data integration in general and in rule-based mediation specifically, where the core ontology is intended to be small but highly expressive.

Decidability and expressiveness are vital considerations when choosing a language for rule-based mediation; languages such as OBO were not originally created with these challenges in mind. Not only is it important to choose a language which supports a balance between expressivity and tractability, but it is important to create ontologies with those considerations in mind. Even if a decidable language such as OWL-DL is used, an ontology can be easily created which cannot be reasoned over in sensible time scales. Currently, in semantic data integration methods such as rule-based mediation, scalability issues occur before more general integration issues such as database churn become important. Therefore for many in the life sciences, ontology engineering involves keeping ontologies as simple as possible for the stated requirements, adding detail and expressivity only where necessary and in a non-disruptive way [12].

2 The future of data integration

The use of confidence values for particular mappings between ontologies has been described [13], although research in this area is still limited. However, the ability to modify the set of mappings between source and core ontologies based on the perceived quality of the mappings would be an interesting area of further research. Rule-based mediation makes use of an expressive core ontology, which currently places limitations on its scalability. As brute computational power and technology such as reasoners improve, methodologies such as rule-based mediation could go from precise, small-scale integrative tasks to much larger integrative challenges. Additional automation is also a necessary step if semantic methods such as rule-based mediation are to be implemented on a wider scale. However, existing and future collaborations should provide scope for these and other possibilities for this research. A collaboration of the CellML project with the Saint project started in 2009 and has continued in 2011 and 2012 with additional programmer support to extend the functionality of Saint with respect to CellML, and further aid the integration of the MIRIAM annotation with the CellML format and API. Rule-based mediation has also been identified by Hoehndorf and co-workers as a possible source of collaboration [9].

A much broader perspective on the future of data integration is not just which integrative methodology is best suited for a task, but whether data integration will even have a place in biological research. If all life science data were expressed in a homogeneous manner, data integration would be unnecessary. The rest of this section explores some of the ways in which data might become accessible to all, without the need for interfaces and integration layers among formats.

If standards become fully utilised within the life sciences, every researcher would be publishing data in recognised, common formats. However, this does not necessarily mean a single format for the entirety of the life sciences; more likely, a large number of standards would remain, as described in Section 1.4. While it is true that many community standards are already in common use (e.g. SBML and BioPAX), standards efforts have a history of slow development (e.g. OBI) and slow uptake (e.g. FuGE). Additionally, new questions will continue to be asked and experimental types will continue to be invented. By definition, these new experiment types will be created faster than the corresponding standards can be developed. Even as experiment types and, ultimately, community standards mature, new standards or combinations of standards might be required. Through experience gained as a developer of many community standards it has become clear that, irrespective of both the number of experimental types and the quality of standardised data formats, there will likely always be some overlap in the data descriptions among the formats. If there is even a small amount of information in one community standard that is of use to another community, data integration will continue to play a vital role in the life sciences. As such even widespread, endemic use of standards will not remove the need for high-quality data integration. As quickly as standards are made, new data is created and new descriptions of data are written which require standardisation.

Alternatively, existing data integration methods might become useless as newer methodologies are investigated. For instance, semantic methodologies are able to capture more knowledge about the data than syntactic methods. More complete, less ambiguous models of the domain of interest lead to less confusion and fewer mapping errors during the integration process. But as has been described in Section 1.6, many syntactic data integration methods remain popular and are being successfully developed throughout the life sciences. Emerging technologies such as reasoner-accessible semantic methods have been proven useful, but are not yet supplanting existing methods. Scalability and tractability are core issues when reasoners are applied, and the very employment of the expressivity which makes ontology-based techniques useful quickly exposes such issues.

There is a growing movement within science towards openness of data. Open Data movements such as the Panton Principles (Note: http://pantonprinciples.org/) describe appropriate licensing for open experimental data, while many others have advocated open models of publishing papers as well as data (Note: More information can be found at http://cameronneylon.net/ and https://opencitations.wordpress.com/2011/08/04/the-plate-tectonics-of-research-data-publication/.). For instance, arXiv (Note: http://arxiv.org/), an e-print archive for topics including physics, mathematics, computer science, quantitative biology and statistics, provides open access to both data and papers. Conferences are also becoming more open, with the publishing of papers and presentations in open access journals and with the use of social networking to provide live coverage of conferences [14, 15]. It may seem as if this increased availability and openness of data and publications might decrease the importance of data integration. However, open data is a prime example of why data reuse and integration will remain important. Larger amounts of data do not lessen the need for integration methodologies, as open data is not guaranteed to be in the same format.

Finally, if there is a move away from small, tightly-scoped experiments towards large-scale experiments, large amounts of data could be produced in a single format. For example, there would be no need to integrate disparate experimental data files, as the research could be reproduced cheaply and quickly. While large-scale experimentation would ease the burden of integration within one community or experimental type, it would not solve the integration problems across multiple communities. As such, this would not make data integration redundant.

Although there are a number of ways in which data will become more homogeneous, none will render data integration obsolete. If all of systems biology was formalised and structured according to a common data model, transformations on that data model would still be required. For instance, the model could go through rounds of optimisation or updating during schema evolution. Syntactic methods will most likely play a pivotal role in data integration for the foreseeable future, until the limitations of semantic methods are ameliorated and ontologies and other Semantic Web technologies gain much higher coverage within the life sciences. The ability of Semantic Web formats such as RDF to describe any type of data, and of Semantic Web languages such as ontologies to model any type of biology will ensure that semantic data integration will only increase in importance as scaling issues are addressed and the time taken to perform the integration decreases.

3 Guidance for modellers and other researchers

If data generators and systems biology modellers followed just a few guidelines when describing their information, data integration tasks could become much more straightforward and many existing methods of integration would immediately become more powerful. Small changes could result in substantive improvements in the way data is described and ultimately, the quality of the integration that can be performed. For instance, unambiguous naming would result in better hits against external data sources using a systems biology annotation application such as Saint (Chapter 3). The list below describes these guidelines, and how their use would aid data integration in general as well as help the integration methods presented in this thesis.

  1. Informative naming. In general, systems biology modellers name model entities using non-informative names and identifiers as well as non-intuitive naming schemes. A species might be called “A” or “species1”, providing no clue as to the identity of the species for prospective users of the model. Further, while an individual modeller might conform to their own personal naming scheme, there is no guarantee that others will know that the “_p” added to a species name means it is the phosphorylated form; it could equally mean gene product, probable name, or even possibly unknown. An informative naming methodology is simple and quick to implement while also being a great help to others. Even if no further information is provided about a model entity, educated guesses as to the identity of that entity can be made both by humans and computers if commonly-used names are provided.
  2. Consider standards from the start of the research. The life science community has many ways to store experimental data in a format recognisable to computers and to humans, and many helper applications to do most of the work for the researcher. SyMBA (Chapter 2) is an archival application capable of storing experimental data and metadata. The ISA Suite [16] assists with data management and standards-compliant storage while the experiments are being run, and does so in a spreadsheet-style environment comfortable to many researchers. Ensuring that the correct information is stored while the data is being gathered saves time for the researcher and eases submission of that data to public repositories prior to publication.
  3. Think about the future. As described by one twitter user, “metadata is a love note to the future” (Note: http://twitpic.com/6ry6ar). High quality metadata makes data integration easier, and enables greater reuse, sharing and re-purposing of the underlying information both for the creator of that information and for future interested researchers. The use of systems such as Saint (Chapter 3) and rule-based mediation (Chapter 5), which add annotation in a semi-automated way, can help with the addition of metadata, fostering understanding and allowing the research to reach a wider audience.
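Guideline 1 above is mechanical enough to check automatically. A hypothetical lint-style sketch might flag placeholder-like species names in a model; the pattern list here is an illustrative assumption, not an exhaustive or established convention:

```python
import re

# Patterns suggesting a name carries no biological meaning:
# single letters ("A") or numbered placeholders ("species1").
# These patterns are illustrative only.
UNINFORMATIVE = [
    re.compile(r"^[A-Za-z]$"),
    re.compile(r"^(species|spec|s)\d+$", re.IGNORECASE),
]

def flag_uninformative_names(species_names):
    """Return the species names that look like placeholders and would
    give humans and computers no clue to the entity's identity."""
    return [
        name for name in species_names
        if any(p.match(name) for p in UNINFORMATIVE)
    ]
```

Run over a model's species list, a check like this would pass commonly-used names such as "glucose" while flagging "A" and "species1" for the modeller's attention.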

4 Conclusions

The early releases of life science databases such as EMBL/GenBank/DDBJ [17] and UniProtKB [18] provided data in a structured, text-based flat file format. As techniques, knowledge and computing power improved, the storage and exchange formats for these and other databases have become more complex. While EMBL/GenBank/DDBJ and UniProtKB can still be retrieved as flat files, internally the data is stored in relational databases, and UniProtKB provides both an XML and an RDF export. Newer life science databases such as BioModels [19] and Pathway Commons [20] never even produced a flat file format; instead, XML and OWL began to be used as the primary data format. The progression of the structure of data from text file to XML to OWL is a progression towards a formalisation of the representation of biology.

Semantic technologies are being heavily researched and used in the life sciences and in larger-scale commercial portals such as Google. However, not all semantic projects have been successful. For instance, the semantic database and publishing system Ambra (Note: http://www.topazproject.org/trac/wiki/Ambra), in use by the PLoS group of journals for a number of years (Note: http://blogs.plos.org/everyone/2009/05/13/all-plos-titles-now-on-the-same-publishing-platform/), was dropped in November 2011 for a faster system called New Hope (Note: http://blogs.plos.org/plos/2011/12/new-hope-the-new-platform-for-the-plos-journal-websites/). In contrast, both social networks and other large commercial portals have begun increasing their usage of semantic technologies. As of October 2011, Facebook has begun supporting linked data and RDF (Note: http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData). Google makes use of RDFa and microformats for their Rich Snippets (Note: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170) technology. The massive amounts of data stored by companies such as Google and Facebook, or to a lesser degree by semantically-aware life science databases such as Bio2RDF [21], may provide the volume of metadata required to increase awareness and usage of the Semantic Web (Note: http://semanticweb.com/the-evolution-of-search-at-google_b25042). As the amount of life science data grows, improved semantic structure of that data will aid integration and querying tasks.

The research presented in this thesis standardises experimental metadata (Chapter 2) and integrates (Chapters 3 and 5) and formalises (Chapters 4 and 5) systems biology knowledge. As a result, new biological annotation can be added to systems biology models, and information useful for systems biology has been transformed into more computationally amenable and semantically aware structures.

This research demonstrates the advantages gained with an increase in the formalisation of biological knowledge. The use of richer semantics allows the automation of many aspects of the annotation process for systems biology models as well as computational access to those models to rigorously assess their correctness. While this comes at the cost of greater formality and standardisation for experimental data, that cost can be partially alleviated with the use of appropriate applications to make the input of the biological descriptions easier. As the life sciences become ever more data rich, this formalisation becomes less of an option and more of a necessity to ensure that the multitude of data resources remain accessible and manageable to researchers and computers.


Clifford Lynch. Big data: How do your data grow? Nature, 455(7209):28–29, September 2008.
Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
Lee Harland, Christopher Larminie, Susanna-Assunta A. Sansone, Sorana Popa, M. Scott Marshall, Michael Braxenthaler, Michael Cantor, Wendy Filsell, Mark J. Forster, Enoch Huang, Andreas Matern, Mark Musen, Jasmin Saric, Ted Slater, Jabe Wilson, Nick Lynch, John Wise, and Ian Dix. Empowering industrial research with shared biomedical vocabularies. Drug discovery today, 16(21-22):940–947, September 2011.
Alan Ruttenberg, Tim Clark, William Bug, Matthias Samwald, Olivier Bodenreider, Helen Chen, Donald Doherty, Kerstin Forsberg, Yong Gao, Vipul Kashyap, June Kinoshita, Joanne Luciano, M. Scott Marshall, Chimezie Ogbuji, Jonathan Rees, Susie Stephens, Gwendolyn Wong, Elizabeth Wu, Davide Zaccagnini, Tonya Hongsermeier, Eric Neumann, Ivan Herman, and Kei H. Cheung. Advancing translational research with the Semantic Web. BMC bioinformatics, 8 Suppl 3(Suppl 3):S2+, 2007.
Allyson L. Lister. Semantic Integration in the Life Sciences. Ontogenesis, January 2010.
Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Reubenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovksy, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, September 2010.
The OBI Consortium. OBI Ontology.
Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through rule-based semantic integration. Journal of biomedical semantics, 1 Suppl 1(Suppl 1):S3+, 2010.
Robert Hoehndorf, Michel Dumontier, John Gennari, Sarala Wimalaratne, Bernard de Bono, Daniel Cook, and Georgios Gkoutos. Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5(1):124+, August 2011.
Web Ontology Working Group. OWL Web Ontology Language Use Cases and Requirements, February 2004.
Franz Baader, Diego Calvanese, Deborah Mcguinness, Daniele Nardi, and Peter Patel-Schneider, editors. The Description Logic Handbook – Cambridge University Press. Cambridge University Press, first edition, January 2003.
Stefan Schulz, Kent Spackman, Andrew James, Cristian Cocos, and Martin Boeker. Scalable representations of diseases in biomedical ontologies. Journal of Biomedical Semantics, 2(Suppl 2):S6+, 2011.
Jinguang Gu, Baowen Xu, and Xinmeng Chen. An XML query rewriting mechanism with multiple ontologies integration based on complex semantic mapping. Information Fusion, 9(4):512–522, October 2008.
Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live Coverage of Intelligent Systems for Molecular Biology/European Conference on computational biology (ISMB/ECCB) 2009. PLoS computational biology, 6(1):e1000640+, January 2010.
Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live Coverage of Scientific Conferences Using Web Technologies. PLoS Comput Biol, 6(1):e1000563+, January 2010.
Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, and Susanna-Assunta Sansone. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26(18):2354–2356, September 2010.
Guy Cochrane, Ruth Akhtar, James Bonfield, Lawrence Bower, Fehmi Demiralp, Nadeem Faruque, Richard Gibson, Gemma Hoad, Tim Hubbard, Christopher Hunter, Mikyung Jang, Szilveszter Juhos, Rasko Leinonen, Steven Leonard, Quan Lin, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Sheila Plaister, Rajesh Radhakrishnan, Stephen Robinson, Siamak Sobhany, Petra T. Hoopen, Robert Vaughan, Vadim Zalunin, and Ewan Birney. Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research, 37(suppl 1):D19–D25, January 2009.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
Ethan G. Cerami, Benjamin E. Gross, Emek Demir, Igor Rodchenkov, Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D. Bader, and Chris Sander. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research, 39(suppl 1):D685–D690, January 2011.
F. Belleau, M. Nolin, N. Tourigny, P. Rigault, and J. Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, October 2008.

Chapter 5: Rule-based mediation (Thesis)


Rule-based mediation for the semantic integration of systems biology data


The creation of accurate quantitative SBML models is a time-intensive manual process. Modellers need to know and understand both the systems they are modelling and the intricacies of the SBML format. However, the amount of relevant data for even a relatively small and well-scoped model can be overwhelming. Systems biology model annotation in particular can be time-consuming; the data required to perform the annotation is represented in many different formats and present in many different locations. The modeller may not even be aware of all suitable resources.

There is a need for computational approaches that automate the integration of multiple sources to enable the model annotation process. Ideally, the retrieval and integration of biological knowledge for model annotation should be performed quickly, precisely, and with a minimum of manual effort. If information is added manually, it is very difficult for the modeller to annotate exhaustively. Current systems biology model annotation tools rely mainly on syntactic data integration methods such as query translation. A review of data integration methodologies in the life sciences is available in Section 1.6, and the Saint syntactic integration system for model annotation is described in detail in Chapter 3.

While large amounts of data can be gathered and queried with syntactic integration, the integration process is limited to resolving formatting differences without attempting to address semantic heterogeneity, which occurs when a concept in one data source is matched to a concept in another that is not exactly equivalent. Semantic data integration methods can be used to resolve differences in the meaning of the data. This chapter describes rule-based mediation, a novel methodology for semantic data integration. While rule-based mediation can be used for the integration of any domain of interest, in this instance it has been applied to systems biology model annotation. Off-the-shelf Semantic Web technologies have been used to provide novel annotation and interactions for systems biology models. Existing tools and technology provide a framework around which the system is built, reducing development time and increasing usability.

Figure 1: An overview of rule-based mediation. One or more data sources sharing the same format (shown here as small circles) can be represented using a source ontology (shown as pentagons). Rules, or mappings, from the source ontology to the core ontology provide a route through which information from the data sources is mapped. The information within the core ontology can then be queried and reasoned over. Relevant knowledge for a particular systems biology model or models can then be exported from the core ontology via a chosen source ontology and ultimately saved in an underlying data source format such as SBML. In the upper left corner, the names of the core and source ontologies created or used for the research described in this thesis are listed.

There are three main parts to rule-based mediation: the source ontologies for describing the disparate data formats, the mapping rules for linking the source ontologies to a biologically-relevant core, and the core ontology for describing and manipulating semantically-homogeneous data (see Figure 1). In rule-based mediation, the data is materialised in the core ontology by using a set of rules to translate the data from the source ontologies to the core ontology, avoiding the need for complex query translation algorithms.

Rule-based mediation has a number of defining features:

  • the data from syntactically and semantically different data sources are materialised within a core ontology for easy access and stable querying;
  • the core ontology is a semantically-meaningful model of the biological domain of interest; and
  • implementation can be performed using off-the-shelf tools and mapping languages.

Integrating resources in this way accommodates multiple formats with different semantics and provides richly-modelled biological knowledge suitable for annotation of systems biology models. This work establishes the feasibility of rule-based mediation as part of an automated systems biology model annotation system.

Section 1 describes the type of integration performed in rule-based mediation as compared with existing data integration methodologies. The first step in rule-based mediation is the syntactic conversion of heterogeneous data sources into ontologies, as detailed in Section 2. These source ontologies are then aligned to a small core ontology by applying a rule base (Section 3). This core ontology is a model of the biological domain of interest which can be queried and reasoned over (Section 4). As demonstrated in Section 5 with two use cases, selected results from the core domain ontology can then be applied to systems biology models as a variety of types of biological annotation. Section 6 describes possible future directions and applications for rule-based mediation, and Section 7 provides a final discussion and summary of the methodology.


The rule-based mediation website is http://cisban-silico.cs.ncl.ac.uk/RBM/, and the ontologies and software toolset are available from Subversion (Note: http://metagenome.ncl.ac.uk/subversion/rule-based-mediation/telomere/trunk). The ontologies stored within the rule-based mediation project are licensed by Allyson Lister and Newcastle University under a Creative Commons Attribution 2.0 UK: England & Wales License (Note: http://creativecommons.org/licenses/by/2.0/uk/). The rule-based mediation software toolset is covered under the LGPL (Note: http://www.gnu.org/copyleft/lesser.html). For more information, see LICENSE.txt in the Subversion project.

The SWRL rules described in this chapter can be run programmatically or via the Protégé 3.4 editor (Note: http://protege.stanford.edu). Programmatic access requires the rule-based mediation software toolset, which in turn requires the OWL API. The SWRLTab plugin within Protégé 3.4 requires the installation of the Jess Rule Engine (Note: http://www.jessrules.com/) for Java. All mapping rules are available for viewing within swrl-mappings.owl in the Subversion distribution. Instructions for viewing the telomere ontology, the source ontologies, the SWRL mappings, SQWRL queries, and the SBML model before and after annotation in these use cases are available in the Subversion repository (Note: http://metagenome.ncl.ac.uk/subversion/rule-based-mediation/telomere/trunk/INSTALL.txt).

1 Design decisions

This section describes the design decisions made for rule-based mediation with respect to existing methodologies. An overview of this integration methodology is summarised in Figure 2 in the context of SBML model annotation. First, information is syntactically converted into OWL or retrieved, already correctly formatted, from a pre-existing source ontology. Second, the information is mapped into a core ontology. Third, the core ontology is queried to answer specific biological questions. Finally, the new information is sent back to an external format, in this case SBML, via mappings from the core ontology to MFO. From a biological perspective, rule-based mediation produces an integrated view of information useful for modelling.

Figure 2: Rule-based mediation in the context of SBML model annotation. Non-OWL formats are first converted into syntactic ontologies via the XMLTab plugin for Protégé 3.4. Here, both UniProtKB and BioGRID data are in XML: UniProtKB has its own schema, while BioGRID uses PSI-MIF. BioPAX, the format used for Pathway Commons, is already in OWL and needs no conversion. Next, the instances present in the source ontologies are mapped to the core ontology using SWRL. Finally, querying is performed with SQWRL using only core ontology concepts. Information is then passed through a final syntactic ontology (MFO) into an SBML model.
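The four steps above can be sketched in miniature. The following Python toy is purely illustrative: the class names, record fields and rule table are all invented for this example, and the real system works with OWL ontologies, SWRL rules and SQWRL queries rather than dicts and lists.

```python
# Toy sketch of the rule-based mediation pipeline (illustrative only).

# Step 1: syntactic conversion -- each source record becomes an instance
# tagged with a "source ontology" class (the ids/names here are invented).
source_instances = [
    {"class": "uniprot:Protein", "id": "P40075", "name": "CSG2"},
    {"class": "biogrid:Interactor", "id": "35373", "name": "EST1"},
]

# Step 2: mapping rules from source classes to core-ontology classes,
# in the spirit of a SWRL rule such as uniprot:Protein(?x) -> core:Protein(?x).
rules = {"uniprot:Protein": "core:Protein", "biogrid:Interactor": "core:Protein"}

def materialise(instances, rules):
    """Assert each source instance into the core class named by its rule."""
    core = []
    for inst in instances:
        target = rules.get(inst["class"])
        if target:
            core.append({**inst, "class": target})
    return core

core_ontology = materialise(source_instances, rules)

# Step 3: query the core ontology using only core concepts.
names = [i["name"] for i in core_ontology if i["class"] == "core:Protein"]

# Step 4: export back out through a syntactic ontology into an
# SBML-like form (again, a stand-in for the real MFO export).
sbml_species = [f'<species id="{n}"/>' for n in names]
print(sbml_species)
```

Note how the query in step 3 never mentions UniProtKB or BioGRID: once the instances are materialised in the core, their source formats are irrelevant.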

1.1 Syntactic or semantic integration

The majority of systems biology model annotation tools such as those described in Section 1.3 rely on syntactic data integration methods. Taverna workflows can act as query translation services to pull annotation into models from data sources [1]. SemanticSBML [2] provides MIRIAM annotations, but does not create any other model elements. LibAnnotationSBML connects to and serves data from a large number of Web services useful for model annotation but has only a minimal user interface [3]. BioNetGen combines its own rules of molecular interactions with BioPAX-compliant data and the Virtual Cell editor to produce SBML [4]. However, BioNetGen requires some level of manual intervention and neither adds MIRIAM annotations nor integrates disparate data formats, instead accepting only BioPAX-formatted data. Chapter 3 describes Saint, which uses query translation and an intuitive user interface to pull information from multiple Web services and syntactically integrate the results, which the user can then select and apply to a pre-existing model [5].

However, while large amounts of data can be gathered and queried in this manner, the syntactic integration process is limited to resolving formatting differences without attempting to address semantic heterogeneity. While existing syntactic tools for model annotation are suitable for many tasks, semantic data integration methods can resolve differences in the meaning of the data and theoretically provide a more useful level of integration. A number of semantic integration approaches have been applied to many different research areas [6, Section 9], and are beginning to be used in the life sciences (see also Sections 1.6 and 1.5).

Formal ontologies are useful data integration tools, as they make the semantics of terms and relations explicit [7]. They also make knowledge accessible to both humans and machines, and provide access to automated reasoners. By modelling source data and the core domain of interest as ontologies, semantically-meaningful mappings can be used to link them. Such mappings are not possible with hard-coded syntactic conversions. The use of ontologies provides a rich integration environment, but can be costly with regard to the time taken to analyse and reason over the data. The primary cost is the scalability of ontology-based integration methods: as ontologies grow in size and complexity, reasoning time rapidly increases.

1.2 Mapping type for the integration interface

As described in Section 1.6, neither information linkage nor direct mapping meets the needs of this project. Information linkage is not true integration, and direct mapping is not suitable for situations where more than two resources need to be integrated. EKoM [8] and SBPAX [9] are two semantic direct mapping approaches discussed in Section 1.6 which most closely match the goals of rule-based mediation. However, modification of either of these approaches is not feasible, as they were both developed to integrate only two resources.

The integrated data in EKoM [8] is presented in RDF rather than OWL. The high scalability of RDF compared with OWL comes at the expense of expressivity. Although large amounts of data can be efficiently stored and queried with RDF, the RDF query language SPARQL cannot make use of OWL, thus limiting the possible query space. There are other limitations to the EKoM method: the addition of new data in different formats would require the modification of the main ontology, and perhaps even the stored queries used to retrieve information from it; and there is no mention of any ability to export results using one of the import formats, as is possible with rule-based mediation.

SBPAX, originally developed to integrate SBML and BioPAX, seems to provide much of the required annotation functionality for systems biology models. Extension of this method to more than two ontologies by expansion of the role of the bridging ontology initially sounds feasible. However, SBPAX was not created to be a generic methodology and the latest release (SBPAX3) has a different purpose from previously published versions (Note: http://sbpax.org/sbpax3/index.html). Instead of being created for purely integrative purposes, SBPAX3 has moved away from ontology mapping and is instead a proposed extension to BioPAX Level 4 to incorporate quantitative information into that standard. While SBPAX3 does allow closer alignment between the SBML XML format and the BioPAX OWL format, its primary purpose has moved from data integration to extension of BioPAX functionality, making this methodology highly unlikely to incorporate additional data sources in future.

Because of the limitations with non-mediator methods, a mediator methodology has been used for rule-based mediation. Within the three mediator types described in Section , multiple mediator mapping was rejected due to its complexity; this mapping subtype requires that the users, whether machine or human, be aware of the multiple data models. As a result, semantic solutions using this approach, such as linked data or RDF triple stores, are not integrated enough to be useful for this project.

A single unified view makes examination, querying and ultimately viewing the integrated dataset easier. This leaves either single mediator mapping or hybrid mediator mapping. Hybrid mediator mapping was rejected as it requires that each source ontology be aware of the global ontology, immediately making it impossible to utilise ontologies with no knowledge of rule-based mediation, such as pre-existing community ontologies. Therefore single mediator mapping, which has the benefit of being a straightforward mapping method with minimal prerequisites, was chosen.

There are three subtypes within single mediator mapping: global-as-view, local-as-view, and hybrid approaches. While the global-as-view approach makes the passing of queries to the underlying data sources simple, it has other restrictions which make it unsuitable for model annotation. For instance, defining the global ontology merely as a model of a set of data sources creates an ontology which is brittle with respect to the addition of new data sources and new formats. However, choosing local-as-view can also be problematic, as query complexity increases quickly with local-as-view approaches, and data sources need to be forced into the structure of the global ontology [10].

Both methods are limited in their ability to describe the nuances of the biological concepts being studied. When one layer (either the global or source ontologies) is a view of the other, conflation of format and biological domain concepts into a single model can occur. For global-as-view, the source ontologies are intended to model the data sources, information which is of no use in a global ontology intended to be a model of a biological domain of interest, as in rule-based mediation. For local-as-view, the source ontologies are merely views over a global ontology, restricting the description of the syntax of the sources.

Therefore, hybrid approaches which do not force one layer to be the view of another provide a more flexible solution. The BGLaV [10] project describes a hybrid mapping method where the global ontology is modelled completely separately from the source ontologies. However, its overall methodology depends upon complex query translation algorithms to pull information from the underlying sources [10]. Rule-based mediation therefore uses this hybrid mapping type but, by building on existing Semantic Web mapping strategies, creates a new methodology that avoids the query complexity of BGLaV.

1.3 The integration interface: warehousing or federation

The final refinement to the methodology is the integration interface itself, and the decision as to whether a warehousing or federated approach is appropriate for rule-based mediation. While warehousing methods (described fully in Section 1.6) are powerful and quick, they are not scalable and can create large resources which current ontological reasoning tools struggle to process. The Neurocommons integration methodology, which uses data encapsulation via multiple mediation mapping, partially solves these scalability issues [11]. While OWL is used to re-format the original data sources, Neurocommons minimises the reasoning cost by using plain RDF as the format for the main data store, which does not get reasoned over. Further, while Neurocommons makes use of Semantic Web technologies, it integrates at the level of the database entry itself [12] and does not try to model the biological concepts referenced by those entries. As the focus of rule-based mediation is on providing modellers with a method of targeted annotation of specific models, such a large triple store is not required; instead, a methodology generating a smaller, more semantically-rich dataset amenable to reasoning is needed.

Another ontology-based model comparison system has recently been developed by Schulz and colleagues to integrate the semantic MIRIAM annotations in SBML across different models in the BioModels database [13]. While this method is useful for finding semantically similar entities in SBML, it is not intended as a large-scale, multiple data source integration methodology.

As described in Chapter 4, SBML Harvester allows the loading and querying of SBML models in OWL. Both rule-based mediation and SBML Harvester apply automated reasoning to enrich the knowledge available to an SBML model. SBML Harvester also makes use of an upper-level ontology that integrates in silico and in vivo information and includes a number of community ontologies [14]. However, the conflation of both structural and biological entities in a single ontology could make it more difficult to clearly isolate syntactic from semantic heterogeneity. Additionally, Hoehndorf and colleagues have stated that the existing biomedical ontologies used within the system often have a weak formalisation which limits the capabilities of SBML Harvester for verification and knowledge extraction [14]. The authors have expressed a desire to see if rule-based mediation and SBML Harvester can be combined in future [14].

Data federation (see Section 1.6) provides its own challenges. Federated approaches require constant calls to remote Web sources to retrieve the requested subset of information and are not guaranteed to be able to access source data. Older federated integration techniques such as TAMBIS use only a single ontology, mapping both the semantics and syntax of the underlying data sources into that ontology [15]. TAMBIS also assumes only one data source per format type. In contrast, rule-based mediation separates the tasks of resolving syntactic and semantic heterogeneity through the use of source ontologies as well as a core ontology.

Query translation systems such as ComparaGRID suffer from inefficiency during data retrieval. With local-as-view methods such as OntoFusion, input data can only be modelled as a view over the global ontology. As the global ontology is independent of the data sources, it may not have the expressivity to model all of the information in every format of interest. In PICSEL [16], class and query rewriting were used to pull information from each data source at the time of the query. However, in rule-based mediation the core ontology is a tightly-scoped ontology which benefits from having all of the relevant information stored directly within it, thus obviating the need for a query rewriting step.

Large-scale ontology alignment has very recently been performed using a variety of OBO Foundry ontologies by aligning each ontology to an upper-level philosophical ontology [7]. While this method has found numerous modelling errors, inconsistencies, and contradictions in the aligned ontologies, integration is limited to the knowledge within the ontologies themselves and is not attempted for any instance-level data or biological databases. It provides valuable error checking across ontologies which are intended to be orthogonal according to OBO Foundry principles (Note: http://www.obofoundry.org/wiki/index.php/FP_005_delineated_content), but is a methodology tailored to ontologies without instance data, making it unsuitable for this thesis.

The integrated data in rule-based mediation needs stability to ensure effective querying, but also needs to be lightweight enough that reasoning can be performed. As such, rule-based mediation has aspects of both warehousing and federation. In rule-based mediation, instance data is stored in the core ontology for as long as a user wishes, creating a warehouse of integrated data. However, the user selects only a subset of each data source to be loaded into the core ontology. The query results are then filtered through the source ontologies and mapped to the core. This query step can be performed repeatedly, adding or removing information from the global ontology as required, thus providing a federated quality to the methodology. Because the core ontology in rule-based mediation is a tightly-scoped ontology modelling a specific biological domain of interest, only the information required for a particular use-case needs to be added.

1.4 Ontology mapping rules

SWRL [17] is a Semantic Web rule language that extends OWL with the addition of a new if-then conditional and a number of convenience methods called built-ins [18]. The DL-safe version of SWRL is used within rule-based mediation to create mappings, also called rules, from one entity to another. SWRL has mature libraries and is interoperable with the existing OWL API. Although it is possible to hard code rules using a library such as the OWL API, SWRL rules allow the logical separation of the knowledgebase and the rulebase from any lower-level code, make knowledge shareable across ontologies, and make the knowledge easier to maintain [19].

SWRL rules are divided into an antecedent and a consequent. In the simple case below, both the antecedent (classA(?X)) and the consequent (classB(?X)) are classes, though SWRL is often used for more complex assertions involving properties, classes and even instances:

    classA(?X) → classB(?X)

In SWRL, whenever the antecedent of the rule is true, the conditions in the consequent must also hold. The above mapping in plain English is “if X is a member of classA, then X is a member of classB”. With this mapping, all instances ?X of classA are also asserted as members of classB. classA and classB could be from the same or different ontologies. SWRL rules provide only instance-to-instance mapping, which results in the assignment of instance data to membership in another class. No classes or properties are ever modified by SWRL.
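This instance-assertion behaviour can be mimicked with plain Python sets. This is a deliberately minimal sketch of the rule semantics just described, not the OWL API or SWRLTab; the class names and instance identifiers are invented.

```python
# Class extensions modelled as sets of instance identifiers (invented names).
class_extensions = {
    "classA": {"x1", "x2"},
    "classB": {"x3"},
}

def apply_rule(extensions, antecedent, consequent):
    """classA(?X) -> classB(?X): assert every instance of the antecedent
    class as a member of the consequent class. The operation is monotonic:
    nothing is removed, and no class or property definition is changed."""
    extensions[consequent] = extensions[consequent] | extensions[antecedent]

apply_rule(class_extensions, "classA", "classB")
print(sorted(class_extensions["classB"]))  # x1 and x2 are now also in classB
```

As in SWRL proper, running the rule only adds instance memberships; the extension of classA is left untouched.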

SWRL is not the only way to create rules in OWL. Two or more OWL ontologies can be linked together with a number of methods. In ontology alignments, classes can be marked as equivalent or joined together via other, more complex, axioms (see Section 1.6 for a discussion of ontology alignment and merging). This method does not require any languages other than OWL itself, and is easily visible within the asserted hierarchy. However, each aligned ontology must be imported into the same file; imports can cause problems if the ontologies are large or numerous, and it is impractical if only part of the ontology needs to be aligned. DL rules [20] are pure OWL and are a fragment of SWRL. However, the use of complete rule languages such as SWRL is often easier than OWL rules and can sometimes be a requirement for expressing particular mappings (Note: http://answers.semanticweb.com/questions/1354/why-do-we-need-swrl-and-rif-in-an-owl2-world, http://answers.semanticweb.com/questions/2329/rifspin-needed-or-not-as-when-owl2-seems-sufficient).

Although multiple implementations for ontology mapping exist, ultimately only the SWRLTab [21] plugin for Protégé 3.4 and the OWL API were used within rule-based mediation. SWRLTab can perform ontology mapping as well as query relational data sources or XML, and was used within rule-based mediation to create and execute SWRL rules. SWRL is used for the mapping rules and SQWRL [22] for querying them. While SWRL rules are written using OWL and are therefore easy to understand, there are drawbacks to this implementation. SWRLTab requires that each additional ontology to be mapped be imported directly into the main ontology. Therefore, when mapping many source ontologies to a single core ontology, the number of imports could become unwieldy. Further, importing via the normal OWL import mechanism means that multiple dissimilar ontologies are combined in one logical area. However, once the SWRL rules have been run, a cleaner version of the final ontology, without the source ontologies or the rule file, may be viewed if desired.

Other ontology mapping applications were investigated and ultimately deemed inappropriate for rule-based mediation. Snoggle (Note: http://snoggle.semwebcentral.org/) is a graphical, SWRL-based tool for aiding alignment of OWL ontologies. Ontology alignment is mainly used for merging two ontologies, and as such is not ideally suited to rule-based mediation, which retains the independence of each ontology. Similarly, Chimaera (Note: http://www-ksl.stanford.edu/software/chimaera/) is intended for alignment and manipulation, but not mapping. COMA++ (Note: http://dbs.uni-leipzig.de/Research/coma.html) currently only supports OWL-Lite. While capable of limited OWL functionality, COBrA [23] is primarily for OBO files, specifically GO. Falconer and colleagues provide an overview of a number of other ontology mapping tools [24].

Closer to the requirements of rule-based mediation are PROMPT [6] and Fluxion (Note: http://bioinf.ncl.ac.uk/comparagrid/). PROMPT is a Protégé plugin which performs automatic mapping and allows user-provided manual mappings. It is a mature application with the ability to map classes, relations, and instances. However, PROMPT could not provide any automated mappings between the BioGRID syntactic ontology and the telomere ontology. Fluxion is a standalone ontology mapping tool used within ComparaGrid, a service-oriented architecture for federated semantic data integration which uses translation rules to map between the source ontologies and the core ontology. However, Fluxion is no longer under active development and cannot be easily integrated with the overall structure of rule-based mediation.

Querying an ontology

The simplest method of ontology querying is the creation of defined classes which function as queries over the instance data, as with Luciano and colleagues [25]. Alternatively, SPARQL queries can be constructed for ontologies which have an underlying RDF representation. However, both methods have their limitations. Defined classes created as queries are difficult to add on-the-fly, and have limitations similar to those of using plain OWL for the creation of rules. SPARQL, in turn, is only compatible with RDF-formatted ontologies. SQWRL is a query language based on SWRL which is used for querying OWL ontologies and formatting the query response. Unlike SWRL, SQWRL does not modify the underlying ontology. Therefore, SQWRL queries are useful for testing SWRL mappings as well as for running queries whose results should not be stored in the ontology.

In general, queries over mapped ontologies can be performed either directly on data materialised and stored in the core ontology, or by reformulating a core ontology query as a set of queries over the data stored in the source ontologies. Within rule-based mediation, SQWRL queries are executed when subsets of the knowledge stored within the core ontology need to be extracted.
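The read-only nature of SQWRL-style querying can be illustrated with a minimal sketch in pure Python (the instance store, accession numbers and location name below are hypothetical examples, not part of the actual telomere ontology): a mapping rule adds facts, whereas a query only collects matching bindings and leaves the store untouched.

```python
# Toy "core ontology": class and property names map to sets of
# individuals and of (subject, object) pairs. All identifiers here
# are illustrative placeholders.
core = {
    "tuo:Protein": {"P40357", "Q04636"},
    "tuo:isLocated": {("P40357", "nucleus_1")},
}

def select_located_proteins(core):
    """SQWRL-style selection: return proteins with a known location,
    without modifying the underlying store."""
    proteins = core["tuo:Protein"]
    return sorted(p for (p, _loc) in core["tuo:isLocated"] if p in proteins)

result = select_located_proteins(core)
```

After the call, `core` is unchanged; only `result` holds the extracted subset, mirroring the way a SQWRL query formats a response without asserting new axioms.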

2 Source Ontologies

In rule-based mediation, all ontologies which move data in and out of the core ontology are called source ontologies. Information flow between source and core ontologies is bi-directional and can be transitive, as translation between data sources via the core ontology is also possible. This section describes the source ontologies used to implement rule-based mediation for the model annotation use cases in Section 5.

A source ontology can be either a purpose-built syntactic ontology or a pre-existing ontology. Pre-existing source ontologies have no expressivity restrictions; community developed or other pre-existing ontologies which represent a field only tangentially related to the domain of interest can be used without modification. Mapping an existing ontology requires only a set of SWRL rules to connect it to the core ontology.

Information from one or more non-OWL data sources sharing the same format can be recast as a syntactic ontology, isolating syntactic variability so the core ontology is free to model the semantics of the relevant biology. Most syntactic ontologies are intended to be a useful representation of a format in OWL. Such a representation should be semantically lightweight and simple and quick to build. Strictly, it might be argued that syntactic ontologies are not true ontologies, being more precisely defined as a list of axioms. This is because syntactic ontologies are built as a simple conversion of a non-OWL format into OWL, with no additional semantics. There is little reason to enrich the syntactic ontologies when the majority of the inference, reasoning and querying will take place within the core ontology. Syntactic ontologies are a type of application ontology intended, not to be a reference for a particular biological domain of interest, but to facilitate a larger bioinformatics application.

Four data sources were used for the implementation of rule-based mediation presented in this chapter: BioGRID [26], exported in PSI-MIF [27]; the Pathway Commons [28] BioPAX Level 2 export; BioModels [29], in SBML format; and UniProtKB [30] in XML format. For each data format, a suitable source ontology was identified or syntactic ontology was created. BioModels was the data source for the SBML models to be annotated, while the other three resources were used as general data inputs to the integration system. Basic information on the data formats as well as the numbers of classes, relations and mapping rules in their source ontologies is available in Table 1.

Data Source     | Data Format   | Classes | Object Properties | Data Properties | DL Expressivity | Rules
UniProtKB       | UniProtKB XML | 27      | 24                | 34              | ALEN(D)         | 11
BioGRID         | PSI-MIF       | 26      | 24                | 15              | ALEN(D)         | 17
SBML            | SBML          | 648     | 56                | 44              | SHOIN(D)        | 14*
Pathway Commons | BioPAX        | 41      | 33                | 37              | ALCHN(D)        | 11
Table 1: Basic information about the source ontologies and their mapping rules to the core ontology. Pathway Commons differs from the other sources as an OWL format is already available. For the SBML syntactic ontology (MFO), the numbers are a summary of the metrics for both SBO and MFO. The Rules column lists the number of SWRL mapping statements used to link each source ontology with the core ontology. MFO was used for export only, and therefore its rules are marked with “*” to show that these are export rules. Statistics generated using Protégé 4.

2.1 Pathway Commons and BioPAX

Two data sources, UniProtKB and Pathway Commons, provide data exports in a format directly accessible to SWRL and OWL. For BioPAX, the creation of a specific syntactic ontology was not required. For UniProtKB, XML rather than RDF was chosen for the reasons outlined below.

Pathway Commons integrates nine pathway and interaction databases (Note: http://www.pathwaycommons.org/pc/dbSources.do). Information on the network neighbours of any given entity can be accessed as a BioPAX Level 2 document, written in OWL. Pathway Commons provides both binary interaction data and pathway data. BioPAX can store information about which species are reactants, products or modifiers, although the data coming from Pathway Commons does not always provide this information. As BioPAX is already expressed in OWL, no additional syntactic ontology needs to be created; the populated BioPAX ontology returned from the nearest-neighbour query is the source ontology.

Figure 3: Screenshot of Protégé 3.4 showing part of a Pathway Commons interaction set in BioPAX Level 2 format.

2.2 Novel syntactic ontologies

BioModels and BioGRID do not export their data in a format amenable to the application of SWRL rules, and therefore syntactic ontologies were created based on the export format to which each data source was best suited. This required the use of MFO for BioModels entries and the creation of a syntactic ontology based on PSI-MIF. Additionally, the UniProtKB XML format needed to be converted to OWL as the RDF export was deemed inappropriate for the use cases.

The use of existing tools to implement rule-based mediation increases its usability for other researchers and decreases development time. Protégé 3.4 and 4 were used to edit and view all ontologies used in this project, while the syntactic ontologies were generated using the XMLTab plugin [31] for Protégé 3.4. Once the syntactic ontologies are created, further editing may be performed as needed. Specifically, MFO was modified beyond the simple conversion created by XMLTab, as described in Chapter 4.

The XMLTab plugin takes only a few seconds to create syntactic ontologies. Additionally, if an XML file is provided instead of an XSD, then both classes and instances are generated: the classes represent structural elements and attributes present in the XML file, while the actual data is created as instances within the ontology. The resulting OWL file is an exact duplicate of the XML structure, which is acceptable for the rule-based mediation methodology as semantic heterogeneity is resolved when the data is mapped from the syntactic ontology to the core ontology.
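The essence of this conversion can be sketched in a few lines of Python. This is not the XMLTab implementation, only a rough illustration of the idea under simplifying assumptions (the XML snippet and naming scheme are invented): element tags become classes, and each concrete element becomes an instance of the class named by its tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a UniProtKB-like XML entry.
doc = ET.fromstring(
    "<entry><protein><fullName>Telomerase</fullName></protein></entry>"
)

classes = set()    # structural elements -> OWL classes
instances = []     # actual data -> instances of those classes

def convert(element, counter=[0]):
    """Walk the XML tree, recording a class per tag and an
    instance per element occurrence."""
    counter[0] += 1
    classes.add(element.tag)
    instances.append((element.tag, f"{element.tag}_{counter[0]}"))
    for child in element:
        convert(child)

convert(doc)
```

The resulting structure duplicates the XML exactly, with no added semantics, which is precisely what a syntactic ontology is meant to be.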

However, XMLTab does not scale well, cannot load multiple XML files into a single syntactic ontology, and is no longer in active development. Tools such as the SWRLAPI’s XMLMapper [32] or Krextor [33] may provide viable alternatives. Quality assurance and normalisation are largely irrelevant for syntactic ontologies, as they are direct syntax conversions and do not bear the greatest reasoning load in rule-based mediation. As such, the most important feature is that they are quick and easy to produce, rather than highly expressive or optimised for reasoning.

Model Format OWL

MFO is an ontology representing constraints from SBML, SBO and the SBML Manual [34] (see also Chapter 4). MFO is used as one of the source ontologies, and has a dual purpose for the presented use cases: firstly, it acts as the format representation for any data stored as SBML such as entries from the BioModels database, and secondly it is used to format query responses as SBML.

BioModels is an example of a single data source that may be loaded into a core ontology from more than one source ontology. BioModels entries are available in SBML format as well as in BioPAX and CellML. However, BioModels uses SBML as its native format, and conversion from SBML to other formats is not lossless; for example, quantitative data such as rate laws are not supported in BioPAX. Therefore, when manipulating BioModels data, SBML—and therefore MFO—is the most suitable format.


PSI-MIF syntactic ontology

The PSI-MIF syntactic ontology was created from BioGRID’s native XML export. BioGRID stores 24 different types of interactions; for the purposes of this work, its physical protein-protein interactions were used. BioGRID data is available in PSI-MI format (PSI-MIF) [27], which can easily be converted into a simple syntactic ontology using XMLTab in Protégé 3.4. Figure 4 shows an overview of the classes created for the PSI-MIF syntactic ontology, while Figure 5 shows a selection of the axioms applied to the PSI-MIF interaction class.

Figure 4: Overview of the PSI-MIF syntactic ontology. This syntactic ontology can be used for BioGRID data, and was created by the Protégé XMLTab plugin using the PSI-MIF-formatted result of a query over BioGRID.

Figure 5: Axioms created for the PSI-MIF syntactic ontology’s interaction class.

As this work progressed, Pathway Commons began importing BioGRID data, and can therefore be used to retrieve BioGRID data in BioPAX format rather than requiring the PSI-MIF syntactic ontology. However, Pathway Commons does not store the genetic interactions available from BioGRID. The incorporation of BioGRID data into Pathway Commons does not reduce the usefulness of the PSI-MIF syntactic ontology: it can be used to integrate any other PSI-MIF-based data sources, as well as those portions of BioGRID which are not stored within Pathway Commons. The integration discussed in the use cases in Section 5 made use of the PSI-MIF syntactic ontology, while a Web application currently under development uses the BioPAX output from Pathway Commons.


UniProtKB syntactic ontology

The UniProtKB is a comprehensive public protein sequence and function database, consisting of both manually curated and automatically annotated data. Adding UniProtKB as a data source provides access to a wealth of information about proteins, although its reaction data is more limited. As evidenced by the non-curated models stored within the BioModels database, new systems biology models are often created with only a small amount of biological information, such as species names and skeleton reactions. In these cases, even the simple addition of cross-references to UniProtKB is useful. Unambiguous assignment of a UniProtKB primary accession to a new species in a systems biology model also provides the core ontology with high-quality knowledge of protein localization and identification.

While the original intent was to make use of the UniProtKB RDF format for rule-based mediation, reasoning over the RDF was slow and there was a large number of ontology and RDF files to import. Instead, the UniProtKB XML format was used for the initial work. At the beginning of this research, UniProtKB Release 14.8 XML files appropriate for the use cases were downloaded (Note: ftp://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/). The UniProtKB syntactic ontology was created using the XMLTab plugin for Protégé 3.4. Figure 6 shows an excerpt of this ontology. The UniProtKB syntactic ontology was successfully used for the initial use cases detailed in Section 5.

A rule-based mediation Web application is being developed, where the set of relevant entries will be expanded from that required for the use cases to the entire UniProtKB database. As these larger sets of XML files can no longer easily be manually converted via the XMLTab plugin, the UniProtKB RDF syntax was revisited as a data source for the rule-based mediation Web application. The UniProtKB RDF format, together with the structure provided by the UniProtKB OWL ontology, is suitable for the application. The RDF syntax is accessible to SWRL rules without modification, and the slow reasoning times are irrelevant as reasoning over syntactic ontologies is not required with rule-based mediation. However, limitations remain in the UniProtKB RDF format.

Figure 6: Overview of the UniProtKB syntactic ontology. This syntactic ontology can be used for UniProtKB data, and was created by the Protégé XMLTab plugin using an XML file representing a UniProtKB entry. Screenshot taken from Protégé 3.4.

Firstly, the core UniProtKB OWL file is not guaranteed to be decidable. This is because its expressivity level is OWL Full and not one of the less expressive, decidable profiles such as OWL DL. If an ontology is not decidable, there is no guarantee that a reasoner can successfully be run over it. The expressivity level, or OWL profile, of the core UniProtKB OWL file (Note: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/core.owl, file downloaded and test performed on 16 October 2011.) was tested using both the Manchester OWL Validator (Note: http://owl.cs.manchester.ac.uk/validator/) and the WonderWeb OWL Ontology Validator (Note: http://www.mygrid.org.uk/OWL/Validator). The former stated that, while the UniProtKB OWL file conformed to OWL 2, it did not conform to any decidable profile of OWL 2. The WonderWeb validator produced a similar result, stating that the file was OWL Full.

Secondly, the full model of the UniProtKB structure is stored in the core file plus a further nine files, all ten of which must be imported to obtain a complete model (Note: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/README). The nine non-core files describe those parts of the UniProtKB structure which might be reused across multiple UniProtKB entries, covering topics such as citations, keywords and pathways. As each of these files contains all of the entities required by the entire UniProtKB dataset, many are quite large. This design requires the import of all ten files to completely view or process even a single UniProtKB entry, creating a heavyweight method of modelling. For example, the RDF document containing all citations is 1.4 gigabytes in size, making it difficult for many users even to load the document into memory for access by a reasoner.

3 Mapping between source and core ontologies

Once a source ontology has been created, it must be mapped to the core, biologically-relevant ontology. The core ontology (Section 4) is the mediator for rule-based mediation, and holds all of the biologically-meaningful statements that need to be made about the data to be integrated.

The mappings between the source ontologies and the core ontology are not as straightforward as the conversion from a data format to its syntactic ontology, and they are not created in the same way. The mapping in rule-based mediation is defined with the OWL-based SWRL rule language, which divides rules into an antecedent describing the initial conditions of the rule and a consequent describing the target of the rule (see Section 1.4). When mapping from the source ontologies to the core ontology, as a general rule the antecedent comes from a source ontology while the consequent corresponds to the mapped concept within the core ontology. The antecedent and consequent can be named classes or non-atomic classes such as those expressed by properties. In each of the rules presented in this section, the right arrow is a convenience rather than standard OWL notation, and the namespaces for each entity denote which ontology or library they belong to. For example, in the rule below, the antecedent is the protein class from the UniProtKB syntactic ontology, and the consequent is the Protein class from the telomere ontology (Note: The “tuo” namespace for the telomere ontology is a result of an earlier naming scheme, where the telomere ontology was called the telomere uncapping ontology.):

UP_00001 : upkb:protein(?protein) → tuo:Protein(?protein)
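The behaviour of a class-to-class mapping rule such as UP_00001 can be sketched in pure Python (the instance store and accession names below are hypothetical, not taken from the actual ontologies): every individual matching the antecedent class in the source ontology is additionally asserted to be a member of the consequent class in the core ontology.

```python
# Toy instance store: class name -> set of individual names.
# The accession identifiers are illustrative placeholders.
facts = {
    "upkb:protein": {"P40357", "Q04636"},
    "tuo:Protein": set(),
}

def apply_class_mapping(facts, antecedent, consequent):
    """Forward-chaining step: assert consequent(?x) for every ?x
    with antecedent(?x), without removing any existing facts."""
    facts[consequent] = facts.get(consequent, set()) | facts[antecedent]

apply_class_mapping(facts, "upkb:protein", "tuo:Protein")
```

Note that the source facts are left in place; like SWRL, the mapping is purely additive.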

Many of these rules could be merged together, having one rule perform much of the mapping currently executed by many. However, that would reduce the flexibility now present in the system. With each rule performing just one part of the mapping, each mapping step may be run in isolation. Further, having each rule accomplish just one task aids understanding and readability of the rules for humans. Readability is also aided with the addition of comments for each rule within the SWRL mapping file.

Once instances have been integrated into the core ontology, the knowledge contained within the core ontology can be reasoned over and queried. SQWRL is used to query the core ontology, and the query response can then be formatted according to a suitable source ontology. SQWRL queries are used to answer interesting but only indirectly-relevant questions without cluttering the core ontology, while SWRL mapping rules are used when the results should be stored within the core ontology. The mapping between source and core ontologies can be bi-directional, thus allowing output of information as well as input. The quality of the data exported to a source ontology, and from there to a base format, depends upon both the quality of the mappings to the source ontology and its scope: if either the mappings are incomplete, or the underlying data format has no way of describing the new data, the export will be lacking.

3.1 Rules for the UniProtKB syntactic ontology

This section describes the SWRL rules of the UniProtKB syntactic ontology and the information those rules map to the telomere ontology. The UniProtKB rule set differs from the other data sources’ rules. While UniProtKB contains a large amount of information, it is protein-centric rather than reaction-centric. More importantly for these use cases, rich information about a protein, described in other databases with much less detail, is available to the telomere ontology via this data source. Specifically, information regarding protein localization within the cell, as well as synonyms of gene and protein names, is available. This rule set is simple, with many of the classes required for the use cases having direct equivalents in the telomere ontology, and with many cardinalities of the properties exactly matching those in the telomere ontology itself. While the UniProtKB does contain some limited information on reactions via the comment and cross-reference sections, these are not currently modelled by the mapping rules for the UniProtKB syntactic ontology.

Protein identification

The first three rules in the UniProtKB syntactic ontology populate the Protein class with the protein instance itself as well as with a primary name and synonym:

UP_00001 :
upkb:protein(?protein) →
tuo:Protein(?protein)

UP_00002 :
upkb:recommendedNameSlot(?protein, ?name) ∧
upkb:fullName(?name, ?fullName) →
tuo:recommendedName(?protein, ?fullName)

UP_00003 :
upkb:entry(?entry) ∧
upkb:proteinSlot(?entry, ?protein) ∧
upkb:geneSlot(?entry, ?gene) ∧
upkb:nameSlot(?gene, ?geneName) ∧
upkb:Text(?geneName, ?geneNameText) →
tuo:synonym(?protein, ?geneNameText)

As it is a primary data source for protein information, UniProtKB is the only source ontology which populates the recommendedName property within the telomere ontology. All other data sources are mapped to the synonym data property. In UP_00003, the first part of the rule contains only the statement entry(?entry). From a computational point of view, this extra statement is superfluous, as the next part of the rule (proteinSlot(?entry, ?protein)) ensures the correct type of ?entry instance is used. However, the extra statement in this and other rules aids readability of those rules.

Protein localisation

If the location of a protein is the nucleus, then the instance is mapped directly to the Nucleus class. Otherwise, the instance is mapped to the more generic BiologicalLocalization class:

UP_00004 :
upkb:locationSlot(?subcellularLocation, ?location) ∧
upkb:Text(?location, ?locationText) ∧
swrlb:equal(?locationText, “Nucleus”) →
tuo:Nucleus(?location)

UP_00011 :
upkb:locationSlot(?subcellularLocation, ?location) ∧
upkb:Text(?location, ?locationText) ∧
swrlb:notEqual(?locationText, “Nucleus”) →
tuo:BiologicalLocalization(?location)

The two rules above use SWRL built-ins. The swrlb:equal and swrlb:notEqual built-ins test whether two arguments are equivalent: the test for equality is satisfied if and only if the first and second arguments are identical, while swrlb:notEqual is the simple negation of swrlb:equal [17].
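The routing effect of these built-ins can be illustrated with a minimal sketch (the instance names and the “Cytoplasm” literal are hypothetical): the swrlb:equal and swrlb:notEqual tests act as complementary filters on the bound text, sending each location instance to either the Nucleus class or the generic BiologicalLocalization class.

```python
# Hypothetical location instances with their bound text literals.
locations = {"loc_1": "Nucleus", "loc_2": "Cytoplasm"}

nucleus = set()
biological_localization = set()

for instance, text in locations.items():
    if text == "Nucleus":            # swrlb:equal(?locationText, "Nucleus")
        nucleus.add(instance)
    else:                            # swrlb:notEqual(?locationText, "Nucleus")
        biological_localization.add(instance)
```

Because the two conditions are mutually exclusive, every location instance is classified exactly once.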

As well as creating instances for each biological location, the rules also link those locations to specific proteins. In UP_00006, the subcellular location of the protein is mapped directly onto the isLocated property within the telomere ontology.

UP_00006 :
upkb:entry(?entry) ∧
upkb:proteinSlot(?entry, ?protein) ∧
upkb:commentSlot(?entry, ?comment) ∧
upkb:subcellularLocationSlot(?comment, ?subcellularLocation) ∧
upkb:locationSlot(?subcellularLocation, ?location) →
tuo:isLocated(?protein, ?location)

Taxonomic identification and other database cross-references

The next three rules describe how the cross references of a protein are mapped from UniProtKB to the telomere ontology. UP_00005 creates an instance of TaxonomicSpecies with a specific taxonomic identifier. Conversely, UP_00007 is used to direct all other UniProtKB references to the more generic DatabaseReference from the telomere ontology. Both swrlb:equal and swrlb:notEqual have been used in these rules to determine whether or not a particular reference is a taxonomic identifier. At this stage, the information is not yet connected to a protein. UP_00008 provides this functionality, linking a particular protein with its taxonomic reference.

UP_00005 :
upkb:dbReference(?reference) ∧
upkb:_type(?reference, ?referenceType) ∧
swrlb:equal(?referenceType, “NCBI Taxonomy”) ∧
upkb:_id(?reference, ?id) →
tuo:TaxonomicSpecies(?reference) ∧
tuo:ncbiTaxId(?reference, ?id)

UP_00007 :
upkb:dbReference(?reference) ∧
upkb:_type(?reference, ?referenceType) ∧
upkb:_id(?reference, ?id) ∧
swrlb:notEqual(?referenceType, “NCBI Taxonomy”) →
tuo:DatabaseReference(?reference) ∧
tuo:databaseName(?reference, ?referenceType) ∧
tuo:accession(?reference, ?id)

UP_00008 :
upkb:entry(?entry) ∧
upkb:proteinSlot(?entry, ?protein) ∧
upkb:organismSlot(?entry, ?organism) ∧
upkb:dbReferenceSlot(?organism, ?reference) →
tuo:hasTaxon(?protein, ?reference)

UP_00009 has a structure virtually identical to UP_00008, but rather than mapping the organism references for a particular protein instance, it maps all associated database references to their equivalents in the telomere ontology.

UP_00009 :
upkb:entry(?entry) ∧
upkb:proteinSlot(?entry, ?protein) ∧
upkb:dbReferenceSlot(?entry, ?reference) →
tuo:hasDatabaseReference(?protein, ?reference)

UP_00010 uses a SWRL built-in called swrlx:makeOWLThing to create a new DatabaseReference for each UniProtKB entry. Rather than being part of the W3C Submission, built-ins with the namespace swrlx are experimental convenience methods. swrlx:makeOWLThing takes an empty variable as its first argument, and will create and bind a new instance to that variable for every instance of the second argument matched within the rule (Note: http://protege.cim3.net/cgi-bin/wiki.pl?SWRLExtensionsBuiltIns). The SWRL built-in does not fill the value of the first argument with the actual instance or instances present in later arguments; it simply uses the second and later arguments as a statement of how many new instances to create. The associated UniProtKB accession number is used as the value for the new instance’s accession data property, while the database reference name is given as “UniProtKB”.
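The minting behaviour described for swrlx:makeOWLThing can be sketched in pure Python (all identifiers below are hypothetical placeholders): one fresh individual is created per matched entry, then linked to the protein and given the entry’s accession plus the literal database name “UniProtKB”.

```python
# Hypothetical matched bindings: (entry, protein, accession) per rule match.
entries = [("entry_1", "prot_1", "P40357"), ("entry_2", "prot_2", "Q04636")]

has_database_reference = {}  # tuo:hasDatabaseReference
accession = {}               # tuo:accession
database_name = {}           # tuo:databaseName

for i, (_entry, protein, acc) in enumerate(entries, start=1):
    # swrlx:makeOWLThing(?reference, ?entry): mint one fresh individual
    # per ?entry match; the entry itself is never bound to ?reference.
    reference = f"ref_{i}"
    has_database_reference[protein] = reference
    accession[reference] = acc
    database_name[reference] = "UniProtKB"
```

The key point mirrored here is that the new individual is anonymous with respect to the matched data: only the count of matches determines how many references are created.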

UP_00010 :
upkb:entry(?entry) ∧
upkb:proteinSlot(?entry, ?protein) ∧
upkb:accession(?entry, ?accessionValue) ∧
swrlx:makeOWLThing(?reference, ?entry) ∧
swrlb:equal(?referenceValue, “UniProtKB”) →
tuo:hasDatabaseReference(?protein, ?reference) ∧
tuo:DatabaseReference(?reference) ∧
tuo:accession(?reference, ?accessionValue) ∧
tuo:databaseName(?reference, ?referenceValue)

3.2 Rules for the PSI-MIF syntactic ontology

This section describes the SWRL rules of the PSI-MIF syntactic ontology, and what sort of information those rules map to the telomere ontology. This ontology is centred around binary interactions, with some extra information provided regarding taxonomy and database cross-references. Unlike BioPAX, BioGRID provides only binary interactions, and as such the information modelled in the BioGRID syntactic ontology is only of this interaction type.

One-to-one mappings

As with BioPAX, there is a high degree of convergence between the telomere ontology and BioGRID. BioPAX, BioGRID and the telomere ontology all model interactions in a similar way. This is exemplified by the five classes within the BioGRID syntactic ontology which are mapped directly to classes within the telomere ontology. PSIMIF_00001 is shown here, and PSIMIF_00003-6 are described in later subsections.

PSIMIF_00001 :
psimif:interactor(?interactor) →

The convergence is further visible in rules PSIMIF_00002, PSIMIF_00010 and PSIMIF_00012-14 (see subsections below), where the classes and the properties described have direct one-to-one mappings with classes and properties in the telomere ontology. Although the one-to-one mapping rules could be accomplished with standard OWL axioms, it is useful to keep all cross-ontology mapping logically separate from the telomere ontology itself. Not only does this allow changes to the rules without changing the telomere ontology itself, but it also collects all mappings in one location for improved readability.


Protein identification

Two rules in the BioGRID syntactic ontology populate the Protein class with instance data as well as with synonyms. Unlike UniProtKB, which only stores protein entities, BioGRID can have a number of different types of interactor. Therefore, in order to retrieve only those interactors which are proteins, PSIMIF_00015 is restricted to interactor instances of the “protein” type. Note that no recommendedName is populated in the telomere ontology via the BioGRID syntactic ontology; only the UniProtKB syntactic ontology provides information for that property. All other source ontologies provide information for the synonym data property only.

PSIMIF_00015 :
psimif:fullName(?name, ?nameText) ∧
swrlb:equal(?nameText, “protein”) ∧
psimif:interactor(?interactor) ∧
psimif:interactorTypeSlot(?interactor, ?interactorType) ∧
psimif:namesSlot(?interactorType, ?name) →
tuo:Protein(?interactor)

PSIMIF_00011 :
psimif:interactor(?interactor) ∧
psimif:namesSlot(?interactor, ?names) ∧
psimif:shortLabel(?names, ?nameText) →
tuo:synonym(?interactor, ?nameText)

Taxonomic identification and other database cross-references

The two rules below assign the taxonomy instance from BioGRID to the telomere ontology TaxonomicSpecies class, and then assign the mapped taxonomic instance to the appropriate interactor. In PSIMIF_00002, as in other rules, object and data properties for all PSI-MIF interactor instances are mapped, rather than just those for protein instances as in PSIMIF_00015. While this will generate more properties within the telomere ontology than are required, no extraneous PhysicalEntity instances are created. General-purpose rules such as these are simpler, and do not contribute to a large increase in computational complexity.

PSIMIF_00002 :
psimif:organism(?organism) ∧
psimif:_ncbiTaxId(?organism, ?taxId) →
tuo:TaxonomicSpecies(?organism) ∧
tuo:ncbiTaxId(?organism, ?taxId)

PSIMIF_00010 :
psimif:interactor(?interactor) ∧
psimif:organismSlot(?interactor, ?organism) →
tuo:hasTaxon(?interactor, ?organism)

All non-taxonomic cross-references are identified in PSI-MIF with the primaryRef or secondaryRef classes. These, and the names of the databases themselves, are mapped to the telomere ontology’s DatabaseReference class in rules PSIMIF_00005-6 and PSIMIF_00012. Finally, in PSIMIF_00013-14 the identifiers of each of the individual cross-references are mapped to the accession property in the telomere ontology.

PSIMIF_00005 :
psimif:primaryRef(?reference) →
tuo:DatabaseReference(?reference)

PSIMIF_00006 :
psimif:secondaryRef(?reference) →
tuo:DatabaseReference(?reference)

PSIMIF_00012 :
psimif:_db(?reference, ?referenceName) →
tuo:databaseName(?reference, ?referenceName)

PSIMIF_00013 :
psimif:primaryRef(?primaryReference) ∧
psimif:_id(?primaryReference, ?id) →
tuo:accession(?primaryReference, ?id)

PSIMIF_00014 :
psimif:secondaryRef(?secondaryReference) ∧
psimif:_id(?secondaryReference, ?id) →
tuo:accession(?secondaryReference, ?id)

After the cross-references are mapped, they need to be tied to the individual PhysicalEntity instances which reference them. The link between a cross-reference and a PSI-MIF entity occurs with the xrefSlot property and either the primaryRefSlot or the secondaryRefSlot. These properties are simplified and mapped across to the hasDatabaseReference property in the telomere ontology.

PSIMIF_00016 :
psimif:xrefSlot(?entity, ?crossRef) ∧
psimif:primaryRefSlot(?crossRef, ?primaryRef) →
tuo:hasDatabaseReference(?entity, ?primaryRef)

PSIMIF_00017 :
psimif:xrefSlot(?entity, ?crossRef) ∧
psimif:secondaryRefSlot(?crossRef, ?secondaryRef) →
tuo:hasDatabaseReference(?entity, ?secondaryRef)

Binary interactions

Although less sophisticated than BioPAX, BioGRID defines interactions in a similar way. The participant in a binary interaction and the physical entity that is referenced by that participant are two separate entities. However, unlike BioPAX, BioGRID reactions are only expressed as binary interactions of varying types, with no concept of complex reactions containing multiple inputs or outputs. Additionally, there is no way to determine which entity is a product or modifier of the interaction. Therefore, PSIMIF_00003 maps all participants in the binary interactions to the telomere ontology’s Reactant class. In PSIMIF_00004 all interactions, of whatever type, are mapped to the telomere ontology’s PhysicalProcess class. In PSIMIF_00007 the instances of the “ReconstitutedComplex” interactions are then further mapped to the ProteinComplexFormation class in the telomere ontology.

PSIMIF_00003 :
psimif:participant(?participant) →
tuo:Reactant(?participant)

PSIMIF_00004 :
psimif:interaction(?interaction) →
tuo:PhysicalProcess(?interaction)

PSIMIF_00007 :
psimif:interaction(?interaction) ∧
psimif:interactionTypeSlot(?interaction, ?interactionType) ∧
psimif:namesSlot(?interactionType, ?interactionName) ∧
psimif:shortLabel(?interactionName, ?nameText) ∧
swrlb:equal(?nameText, “ReconstitutedComplex”) →
tuo:ProteinComplexFormation(?interaction)

The link between the interactor and its participant in BioGRID is modelled with the interactorRef property. In the telomere ontology, this connection is made with the plays property:

PSIMIF_00008 :
psimif:interactor(?interactor) ∧
psimif:_id(?interactor, ?id) ∧
psimif:participant(?participant) ∧
psimif:interactorRef(?participant, ?id) →
tuo:plays(?interactor, ?participant)
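The id-based join performed by PSIMIF_00008 can be sketched in Python as follows. The record shapes and the participant names are hypothetical; only the matching logic reflects the rule.

```python
def link_plays(interactors, participants):
    """Pair each interactor with the participants whose interactorRef
    matches its _id, yielding tuo:plays(interactor, participant) pairs."""
    by_id = {i["_id"]: i["name"] for i in interactors}
    return [(by_id[p["interactorRef"]], p["name"])
            for p in participants if p["interactorRef"] in by_id]

interactors = [{"_id": 59, "name": "interactor_59"}]
participants = [{"interactorRef": 59, "name": "participant_12"},
                {"interactorRef": 99, "name": "participant_13"}]  # no matching interactor
print(link_plays(interactors, participants))  # [('interactor_59', 'participant_12')]
```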

Finally, PSIMIF_00009 is used to link an interaction to its participants:

PSIMIF_00009 :
psimif:participantListSlot(?interaction, ?list) ∧
psimif:participantSlot(?list, ?participant) →
tuo:hasReactant(?interaction, ?participant)
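The flattening of the participant list performed by PSIMIF_00009 can be sketched similarly; the interaction record below is an illustrative stand-in (only the identifier interaction_143 appears later in this chapter), not real PSI-MIF data.

```python
def has_reactants(interaction):
    """Flatten the nested participant list into direct
    tuo:hasReactant(interaction, participant) links."""
    plist = interaction["participantListSlot"]["participantSlot"]
    return [(interaction["name"], p) for p in plist]

ix = {"name": "interaction_143",
      "participantListSlot": {"participantSlot": ["participant_a", "participant_b"]}}
print(has_reactants(ix))
```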

3.3 Rules for BioPAX Level 2

One-to-one mapping and interactions

As described in Section 2.1, Pathway Commons provides information in BioPAX Level 2 format. The simplicity of the mapping between BioPAX Level 2 and the telomere ontology is due to the similar modelling of physical processes within them. As described in Section 3.2 with respect to other simple mappings, rules such as BP_00001-5 could be written in pure OWL but are written in SWRL to keep rules separate from ontology modelling. These rules identify the direct equivalents in the telomere ontology for each relevant class and property within BioPAX. Specifically, BP_00001-5 provide mappings for BioPAX proteins, reaction participants and interactions.

BP_00001 :
bp:protein(?protein) →
tuo:Protein(?protein)

BP_00002 :
bp:physicalEntityParticipant(?participant) →
tuo:Reactant(?participant)

BP_00004 :
bp:physicalInteraction(?interaction) →
tuo:PhysicalProcess(?interaction)

BP_00003 :
bp:PHYSICAL-ENTITY(?participant, ?physicalEntity) →
tuo:playedBy(?participant, ?physicalEntity)

BP_00005 :
bp:PARTICIPANTS(?interaction, ?participant) →
tuo:hasReactant(?interaction, ?participant)

Protein identification

Rules BP_00008-9 map the name and synonym data properties present within a BioPAX protein to the telomere ontology synonym data property. As described in Section 3.1, only UniProtKB entries populate the recommendedName data property; all names and synonyms from other source ontologies map directly to the synonyms within the telomere ontology.

BP_00008 :
bp:protein(?protein) ∧
bp:NAME(?protein, ?value) →
tuo:synonym(?protein, ?value)

BP_00009 :
bp:protein(?protein) ∧
bp:SYNONYMS(?protein, ?value) →
tuo:synonym(?protein, ?value)

Taxonomic identification and other database cross-references

BP_00006 fills the TaxonomicSpecies class as well as the ncbiTaxId data property within the telomere ontology. Only those unificationXref instances whose database name is “NCBI” are used; the identifier of such a unificationXref is then used to fill the ncbiTaxId property. BP_00007 then links a Protein in the telomere ontology to its taxonomic species.

BP_00006 :
bp:unificationXref(?xref) ∧
bp:DB(?xref, ?value) ∧
bp:ID(?xref, ?id) ∧
swrlb:equal(?value, “NCBI”) →
tuo:TaxonomicSpecies(?xref) ∧
tuo:ncbiTaxId(?xref, ?id)

BP_00007 :
bp:protein(?protein) ∧
bp:ORGANISM(?protein, ?organism) ∧
bp:TAXON-XREF(?organism, ?ref) →
tuo:hasTaxon(?protein, ?ref)

BP_00010 fills the DatabaseReference class in the telomere ontology, together with the properties connecting the DatabaseReference to a database name and accession. Taxonomic identifiers are specifically excluded from this rule, as they have already been parsed by BP_00006-7. BP_00011 then links the protein to the mapped DatabaseReference class via the hasDatabaseReference property. Only unificationXref instances are mapped from BioPAX, as they assert equivalence between the BioPAX entity and the specified database reference. There are other reference types within BioPAX; however, such cross-references link to data that is related, but not identical, to the entity in question.

BP_00010 :
bp:unificationXref(?xref) ∧
bp:DB(?xref, ?dbvalue) ∧
bp:ID(?xref, ?idvalue) ∧
swrlb:notEqual(?dbvalue, “NCBI”) →
tuo:DatabaseReference(?xref) ∧
tuo:databaseName(?xref, ?dbvalue) ∧
tuo:accession(?xref, ?idvalue)

BP_00011 :
bp:protein(?protein) ∧
bp:XREF(?protein, ?xref) ∧
bp:unificationXref(?xref) →
tuo:hasDatabaseReference(?protein, ?xref)
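The division of labour between BP_00006 and BP_00010 amounts to routing each unificationXref on its database name. A minimal Python sketch, with illustrative record shapes:

```python
def map_unification_xrefs(xrefs):
    """Split unificationXrefs: NCBI ones become taxonomic identifiers
    (cf. BP_00006); all others become generic database references (cf. BP_00010)."""
    taxa, db_refs = [], []
    for x in xrefs:
        if x["DB"] == "NCBI":
            taxa.append({"ncbiTaxId": x["ID"]})
        else:
            db_refs.append({"databaseName": x["DB"], "accession": x["ID"]})
    return taxa, db_refs

taxa, db_refs = map_unification_xrefs(
    [{"DB": "NCBI", "ID": "4932"}, {"DB": "SGD", "ID": "S000002625"}])
print(taxa)     # [{'ncbiTaxId': '4932'}]
print(db_refs)  # [{'databaseName': 'SGD', 'accession': 'S000002625'}]
```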

3.4 Rules for MFO

There is a clear delineation in most systems biology formalisms between a protein as an abstract concept and a protein as a specific interactor in a reaction. Within SBML, this is defined using the speciesType or species elements for the concept of a protein, and speciesReference for the participant in the reaction. Within BioPAX Level 2, the biological entity / participant pairing is modelled with physicalEntity and physicalEntityParticipant. Within BioGRID, interactor and participant elements are used.

A similar dichotomy exists within the telomere ontology, and when converting the information retrieved in Use Case 1 back to SBML, these considerations need to be taken into account. One solution is to link the telomere ontology’s Protein class (and its parent class PhysicalEntity) to the SBML speciesType element. However, most SBML models do not use speciesType, and future versions of SBML will have a different way of modelling such concepts [35]. Therefore, to align with future versions of SBML as well as to make the mapping simple, the instances of the telomere ontology PhysicalEntity class are instead linked to the appropriate attributes of an SBML species element.

Biological annotation

MIRIAM annotation is added to MFO by mapping a variety of database cross-references from the telomere ontology. The rules used to create these annotations introduce two additional SWRL built-ins, swrlb:stringEqualIgnoreCase and swrlb:stringConcat. The swrlb:stringEqualIgnoreCase built-in is satisfied if and only if the two arguments match in a case-insensitive manner, while the first argument in swrlb:stringConcat is filled with a concatenation of all further arguments [17].
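The behaviour of these two built-ins is easy to mimic. The sketch below shows how a MIRIAM URI would be assembled from a database name and accession, with a lower-cased comparison standing in for swrlb:stringEqualIgnoreCase and string concatenation for swrlb:stringConcat; the accession "P00000" is a hypothetical placeholder.

```python
def miriam_uri(db_name, accession):
    """Build a MIRIAM URI for the databases handled by the TUO_MFO rules.
    The lower-cased comparison mimics swrlb:stringEqualIgnoreCase; the
    "uniprot"/"uniprotkb" pair mirrors the *_synonym rule variants."""
    if db_name.lower() in ("uniprot", "uniprotkb"):
        return "urn:miriam:uniprot:" + accession
    if db_name.lower() == "sgd":
        return "urn:miriam:sgd:" + accession
    return None  # each further database needs its own prefix rule

print(miriam_uri("UniProtKB", "P00000"))  # urn:miriam:uniprot:P00000
print(miriam_uri("SGD", "S000002625"))    # urn:miriam:sgd:S000002625
```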

In the first two rules for mapping information from the telomere ontology to MFO (TUO_MFO_00001 and TUO_MFO_00001_synonym), all Rad9 instances are enhanced with two biological annotations: firstly, a MIRIAM URI for the UniProtKB entry associated with the Rad9 protein and secondly, a link to the SBO “macromolecule” term (SBO_0000245). (Note: For brevity, the second rule (TUO_MFO_00001_synonym) is not shown. It is identical to the first rule except that the swrlb:stringEqualIgnoreCase comparison uses “uniprotkb” rather than “uniprot”; different source datasets use different names for this database.) This association is double-checked by ensuring that the “rad9” species name is already present. This rule returns information, required for the use cases, on the Rad9 protein only. While it is highly specific, it can easily be generalised to run over Protein, the parent class of Rad9. In such a general case, shown later for TUO_MFO_00006 and TUO_MFO_00006_synonym, all references to specific MFO species instances have been removed.

TUO_MFO_00001 :
tuo:Rad9(?protein) ∧
tuo:hasDatabaseReference(?protein, ?crossref) ∧
tuo:databaseName(?crossref, ?dbname) ∧
tuo:accession(?crossref, ?accession) ∧
swrlb:stringEqualIgnoreCase(?dbname, “uniprot”) ∧
swrlb:stringConcat(?miriamIs, “urn:miriam:uniprot:”, ?accession) ∧
sbo:SBO_0000245(?term) ∧
mfo:Species(?mfoSpecies) ∧
mfo:name(?mfoSpecies, ?nameText) ∧
swrlb:stringEqualIgnoreCase(?nameText, “rad9”) →
mfo:MiriamAnnotation(?crossref) ∧
mfo:miriamAnnotation(?mfoSpecies, ?crossref) ∧
mfo:bqbIs(?crossref, ?miriamIs) ∧
mfo:sboTerm(?mfoSpecies, ?term)

Additional MIRIAM annotations are added with TUO_MFO_00002, which adds any SGD cross-references to MFO for Rad9 instances. As with TUO_MFO_00001 and TUO_MFO_00001_synonym, this rule could easily be made more generic to allow all MFO species to be annotated with their relevant cross-references. As MIRIAM URIs require a prefix identifying the referenced database (e.g. “urn:miriam:uniprot:”), each database requires its own rule to correctly build the URI. Rules TUO_MFO_00003-4 (not shown) differ from TUO_MFO_00001-2 only in the database used to create the new MIRIAM annotation: TUO_MFO_00003 for IntAct and TUO_MFO_00004 for Pathway Commons.

TUO_MFO_00002 :
tuo:Rad9(?protein) ∧
tuo:hasDatabaseReference(?protein, ?crossref) ∧
tuo:databaseName(?crossref, ?dbname) ∧
tuo:accession(?crossref, ?accession) ∧
swrlb:stringEqualIgnoreCase(?dbname, “sgd”) ∧
swrlb:stringConcat(?miriamIs, “urn:miriam:sgd:”, ?accession) ∧
mfo:Species(?species) ∧
mfo:name(?species, ?nameText) ∧
swrlb:stringEqualIgnoreCase(?nameText, “rad9”) →
mfo:MiriamAnnotation(?crossref) ∧
mfo:miriamAnnotation(?species, ?crossref) ∧
mfo:bqbIs(?crossref, ?miriamIs)

In TUO_MFO_00006 and TUO_MFO_00006_synonym, a UniProtKB MIRIAM annotation is created for every Protein in the telomere ontology. (Note: TUO_MFO_00006_synonym is not shown in this section as it is identical to TUO_MFO_00006 except for the string comparison swrlb:stringEqualIgnoreCase, which instead uses the literal “uniprotkb”.) In addition to the UniProtKB MIRIAM annotations within MFO for each telomere ontology Protein, TUO_MFO_00007-9 add MIRIAM annotations for SGD, IntAct and Pathway Commons. Due to their similarity with TUO_MFO_00006, they are not shown, for brevity.

TUO_MFO_00006 :
tuo:Protein(?mfoSpecies) ∧
mfo:Species(?mfoSpecies) ∧
tuo:hasDatabaseReference(?mfoSpecies, ?crossref) ∧
tuo:databaseName(?crossref, ?dbname) ∧
tuo:accession(?crossref, ?accession) ∧
swrlb:stringEqualIgnoreCase(?dbname, “uniprot”) ∧
swrlb:stringConcat(?miriamIs, “urn:miriam:uniprot:”, ?accession) ∧
sbo:SBO_0000245(?term) →
mfo:MiriamAnnotation(?crossref) ∧
mfo:miriamAnnotation(?mfoSpecies, ?crossref) ∧
mfo:bqbIs(?crossref, ?miriamIs) ∧
mfo:sboTerm(?mfoSpecies, ?term)

TUO_MFO_00011-12 provide name properties for the Species instances within MFO. As SWRL rules will not overwrite existing name values, this rule will only modify the mapped name property if it is empty. This ensures that existing values are not overwritten, and further ensures—if the rules are run in sequence—that names from the recommendedName property are not overridden by those from the synonym property. If pre-existing names from the MFO model should be re-written, they can simply be deleted prior to or during conversion from SBML to MFO.

TUO_MFO_00011 :
tuo:Protein(?spec) ∧
mfo:Species(?spec) ∧
tuo:recommendedName(?spec, ?recName) →
mfo:name(?spec, ?recName)

TUO_MFO_00012 :
tuo:Protein(?spec) ∧
mfo:Species(?spec) ∧
tuo:synonym(?spec, ?synonym) →
mfo:name(?spec, ?synonym)
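The intended ordering of TUO_MFO_00011-12 can be pictured as fill-if-empty logic: a recommended name is applied first, and a synonym only fills a still-empty slot. The dictionary below is an illustrative stand-in for an MFO Species instance, and the name values are hypothetical.

```python
def fill_name(species, rec_name=None, synonym=None):
    """Fill the name property only when it is empty, so existing
    names are never overwritten."""
    # cf. TUO_MFO_00011: recommended name fills an empty name first
    if species.get("name") is None and rec_name is not None:
        species["name"] = rec_name
    # cf. TUO_MFO_00012: a synonym only fills a still-empty name
    if species.get("name") is None and synonym is not None:
        species["name"] = synonym
    return species

print(fill_name({"name": None}, rec_name="Rad9", synonym="alt_name"))  # {'name': 'Rad9'}
print(fill_name({"name": "kept"}, rec_name="Rad9"))                    # {'name': 'kept'}
```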

Rad9-Rad17 Interaction (TUO_MFO_00005)

Specific interactions relating to Rad9 were retrieved from the telomere ontology and mapped to MFO. Rad9Rad17 is described in this section, and Rad9Mec1 in the next. TUO_MFO_00005 maps the Rad9Rad17 interaction and is tailored to populating a single SBML model; if multiple models were stored within MFO, incorrect listOfReactions and listOfSpecies instances could be matched. Where necessary, this limitation can easily be resolved by extending the rule to specify the ListOf__ instances associated with a particular systems biology model.

As TUO_MFO_00005 is a long and complex rule, it is presented in sections rather than all at once. Further, in contrast to the native ordering in the OWL files themselves, portions of the rule have been reordered slightly for increased readability. Firstly the reaction to be mapped, the reactants involved, and the proteins associated with those reactants within the telomere ontology are identified:

TUO_MFO_00005 :
tuo:ProteinComplexFormation(?reaction) ∧
tuo:Rad9Rad17Interaction(?reaction) ∧
tuo:hasReactant(?reaction, ?reactant) ∧
tuo:playedBy(?reactant, ?protein) ∧

Next, the already-extant ListOfReactions and ListOfSpecies instances from MFO are stored in SWRL variables for later use:

mfo:ListOfReactions(?listOfReaction) ∧
mfo:ListOfSpecies(?listOfSpecies) ∧

SWRL built-ins are then called to create seven new variables. Firstly, ?nameVariable is filled with the value “Rad9Rad17Complex” using swrlb:stringConcat; this is a string literal only, and is not an instance itself. ?nameVariable is used later in the rule to create a value for the name data property within MFO. Six calls to swrlx:makeOWLThing are then run. These calls create the appropriate number of MFO instances for storing the information related to the reactions and proteins matched within this rule. For instance, if there are X instances of the Protein class matched within TUO_MFO_00005, then X new instances are created and bound to the variable ?speciesId. As the Rad9Rad17 interaction is a protein complex formation, one product—the complex itself—is created for each reaction. The newly-created instance variable ?product describes the new product of the reaction. The other new variables create additional parts of an SBML model which are not directly modelled within the telomere ontology.

swrlb:stringConcat(?nameVariable, “Rad9Rad17Complex”) ∧
swrlx:makeOWLThing(?reactantList, ?reaction) ∧
swrlx:makeOWLThing(?reactionSId, ?reaction) ∧
swrlx:makeOWLThing(?speciesSId, ?protein) ∧
swrlx:makeOWLThing(?product, ?reaction) ∧
swrlx:makeOWLThing(?prodSpecSid, ?reaction) ∧
swrlx:makeOWLThing(?prodRef, ?reaction) ∧
swrlx:makeOWLThing(?productList, ?reaction)

These SWRL built-ins mark the end of the antecedent for TUO_MFO_00005, which has created or identified all variables needed to map the reaction data to MFO. The consequent begins with assignment of all of the identifiers to SId instances in MFO.

mfo:SId(?reactionSId) ∧
mfo:SId(?speciesSId) ∧
mfo:SId(?prodSpecSid) ∧

Then, each of the links to those SId instances is built using the id object property. Additionally, the name property is filled for the ?product instance:

mfo:id(?reaction, ?reactionSId) ∧
mfo:id(?protein, ?speciesSId) ∧
mfo:id(?product, ?prodSpecSid) ∧
mfo:name(?product, ?nameVariable) ∧

Next, each of the instances named in the first argument of the id property are mapped to their appropriate MFO classes. Additionally, the Reaction instance gets a mapped list of reactants and products and both the Species and Reaction instances are linked to their appropriate ListOf__ instances.

mfo:Reaction(?reaction) ∧
mfo:Species(?protein) ∧
mfo:Species(?product) ∧
mfo:listOfReactants(?reaction, ?reactantList) ∧
mfo:listOfProducts(?reaction, ?productList) ∧
mfo:reaction(?listOfReaction, ?reaction) ∧
mfo:species(?listOfSpecies, ?protein) ∧
mfo:species(?listOfSpecies, ?product) ∧

The ListOfSpeciesReferences class in MFO is then populated with the appropriate new instances from the swrlx:makeOWLThing built-in, and each individual SpeciesReference is mapped and added to its list:

mfo:ListOfSpeciesReferences(?reactantList) ∧
mfo:ListOfSpeciesReferences(?productList) ∧
mfo:ProductSpeciesReference(?prodRef) ∧
mfo:ReactantSpeciesReference(?reactant) ∧
mfo:speciesReference(?reactantList, ?reactant) ∧
mfo:speciesReference(?productList, ?prodRef)
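The key property of swrlx:makeOWLThing relied on throughout TUO_MFO_00005 is that one fresh individual is minted per distinct binding of its second argument, so repeated matches of the same reaction reuse the same new instance. A sketch of that behaviour, with a hypothetical naming scheme:

```python
import itertools

_counter = itertools.count()
_made = {}

def make_owl_thing(key):
    """Mint a fresh individual per distinct key, reusing it on repeat calls,
    mimicking swrlx:makeOWLThing(?newVar, ?key)."""
    if key not in _made:
        _made[key] = f"new_individual_{next(_counter)}"
    return _made[key]

a = make_owl_thing("reaction_1")  # first match mints a new individual
b = make_owl_thing("reaction_1")  # second match of the same binding reuses it
c = make_owl_thing("protein_7")   # a different binding mints another
print(a == b, a == c)             # True False
```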

Rad9-Mec1 Interaction (TUO_MFO_00010)

There are two interactions between Rad9 and Mec1 present within the telomere ontology. In this rule, one of these interactions has been arbitrarily chosen to send to MFO. This is a choice the modeller can make in the future, after examining the information contained in each candidate interaction. Unlike in TUO_MFO_00005, there is no extra information from the data sources that allows the interaction to be identified as a protein complex formation. As such, a likely product for the reaction cannot be created. This lack of knowledge about the interaction makes TUO_MFO_00010 a much simpler rule than TUO_MFO_00005. Except for those sections of the rule dealing with products and lists of products, TUO_MFO_00010 is the same as TUO_MFO_00005. As a result, the reaction has reactants only, and no products.

TUO_MFO_00010 :
tuo:Rad9Mec1Interaction(’psimif:interaction_143’) ∧
tuo:hasReactant(’psimif:interaction_143’, ?react) ∧
tuo:playedBy(?react, ?protein) ∧
mfo:ListOfReactions(?listOfReaction) ∧
swrlx:makeOWLThing(?reactantList, ’psimif:interaction_143’) ∧
swrlx:makeOWLThing(?reactSId, ’psimif:interaction_143’) ∧
swrlx:makeOWLThing(?specSId, ?protein) →
mfo:ListOfSpeciesReferences(?reactantList) ∧
mfo:SId(?reactSId) ∧
mfo:SId(?specSId) ∧
mfo:ReactantSpeciesReference(?react) ∧
mfo:speciesReference(?reactantList, ?react) ∧
mfo:Reaction(’psimif:interaction_143’) ∧
mfo:id(’psimif:interaction_143’, ?reactSId) ∧
mfo:reaction(?listOfReaction, ’psimif:interaction_143’) ∧
mfo:Species(?protein) ∧
mfo:id(?protein, ?specSId) ∧
mfo:listOfReactants(’psimif:interaction_143’, ?reactantList)

4 The core ontology

Often, mediator-based approaches build a core ontology as a union of source ontologies rather than as a semantically-rich description of the research domain in its own right [36, 16, 37]. While such ontologies thoroughly describe the associated data formats, a description of the structure of data is not a description of the biology of the data. A union of source ontologies would not produce a description of the biological domain of interest, but simply a description of the combined formats imported into the system. Further, if a core ontology is defined as merely an ontology which models a set of data sources, the core ontology becomes brittle with respect to the addition of new data sources and new formats.

In contrast, for rule-based mediation a core ontology should be a biologically-relevant, tightly-scoped, logically-rigorous description of the semantics of the research domain. Although the definition of a core ontology shares much with that of a reference ontology, a core ontology is not required to stand as a reference model for a biological domain of interest, although that may be one of its additional purposes. As originally written about BioPAX, a useful representation of the research domain should be detailed enough to express richness where required, but general enough to maintain the overall framework when few details are available [38]. The core ontology may be created for rule-based mediation or drawn from existing ontologies. However, pre-existing biomedical ontologies should be chosen with care, as many do not have sufficiently formalised semantics and as such may not be useful for automated reasoning tasks in a data integration context [7]. A core ontology is not intended to capture all of biology; instead, it is scoped tightly to its purpose, which is modelling the research domain of interest.

By creating a core ontology which is more than the entailment of a set of source ontologies, and which stands on its own as a semantically-rich model of a research domain, the core ontology becomes much more flexible with respect to changes in data sources. Because the core ontology is abstracted away from data formats, importing new data sources is made easier. Furthermore, the process of adding to the core ontology is simplified: each new mapping, class, or data import is incremental, without the need for large-scale changes. The richness of the core ontology depends on the type of biological questions that it has been created to answer and the amount of resources available; a detailed ontology may have higher coverage of the research domain, but may take longer to develop and reason over.

In contrast to global-as-view or local-as-view approaches, where either the target or source ontologies are entirely described using views of the other, the methods used in rule-based mediation provide a way to decouple syntactic and semantic heterogeneity and allow the core and source ontologies to remain distinct. Additionally, rule-based mediation maps and loads data directly into the core ontology, allowing simplified querying via the use of standard tools, mapping and querying languages. Being a semantically-rich model of a research domain, a core ontology could also be used as a standalone ontology for marking up data or as a reference model of its biological domain.

It is possible for a core ontology to incorporate other domain ontologies or upper-level ontologies. Hoehndorf and colleagues included a custom-built upper-level ontology to aid the integration of the in vivo and in silico aspects of SBML models [14]. However, as rule-based mediation uses the source and core ontologies to deliberately keep the biology and data structure separate, such an upper-level ontology is not required. Upper-level ontologies are often created to aid in the merging of ontologies with distantly-connected topics. As such, they are not as important for core ontologies, which are tied to one specific subject.

4.1 The Telomere Ontology

For the use cases presented in this chapter the core ontology, referred to as the telomere ontology, models a portion of the telomere binding process in S.cerevisiae. Specifically, it models selected parts of the biology relevant to the Proctor and colleagues model [39] of telomere uncapping. Figure 7 shows an overview of this ontology.

Figure 7: An overview of the telomere ontology. The telomere ontology is the core ontology for the use cases presented in this chapter.

The majority of the reasoning and querying in rule-based mediation is run over the integrated data in the core ontology, allowing appropriate annotations to be queried for and selected. The results of those queries can then be exported, via a source ontology, to a non-OWL format for that data source. The use cases described in Section 5 illustrate the retrieval of knowledge from the telomere ontology and its application to systems biology model annotation via the MFO syntactic ontology.

telomere ontology would only be of the type DirectedReaction, this constraint does not need to be explicitly stated. If required, inferences about the data source itself could be made by examining its instances in the telomere ontology.

4.2 Rules and stored queries within the telomere ontology

There are a number of rules where both the antecedent and the consequent contain entities only from the telomere ontology. These are generally convenience rules that make statements about incoming data which would otherwise be difficult to express in plain OWL. For instance, TUO_00001-2 further specialise instances of Protein to be instances of Rad9 if they have “rad9” somewhere in the recommended name or the synonym list. These rules are used within Use Case 1 to store the results of a query against the integrated dataset which places all RAD9 proteins within the Rad9 class (see Section 5.1). While these rules work for relatively small datasets, the false positive rate might increase as the datasets grow. Further work on the telomere ontology could replace these rules with more precise axioms defining the classes they modify. Alternatively, automated instance-to-class mapping (see Section 6.4) would eliminate the need for specific rules such as these, as would modifying the way proteins are handled such that each protein (e.g. Rad9) is not given its own class.

TUO_00001 :
tuo:Protein(?someEntity) ∧
tuo:synonym(?someEntity, ?synonym) ∧
swrlb:containsIgnoreCase(?synonym, “rad9”) →
tuo:Rad9(?someEntity)

TUO_00002 :
tuo:Protein(?someEntity) ∧
tuo:recommendedName(?someEntity, ?name) ∧
swrlb:containsIgnoreCase(?name, “rad9”) →
tuo:Rad9(?someEntity)

TUO_00004-11 are similar rules for four other subclasses of Protein: Rad53, Chk1, Mec1 and Rad17. These include matching specific names such as “rad53” in TUO_00004, as well as other known synonyms such as “YPL153C” in TUO_00004_Systematic. Only two of these rules are shown; the others are omitted for brevity.

TUO_00004 :
tuo:Protein(?someEntity) ∧
tuo:synonym(?someEntity, ?synonym) ∧
swrlb:containsIgnoreCase(?synonym, “rad53”) →
tuo:Rad53(?someEntity)

TUO_00004_Systematic :
tuo:Protein(?someEntity) ∧
tuo:synonym(?someEntity, ?synonym) ∧
swrlb:containsIgnoreCase(?synonym, “YPL153C”) →
tuo:Rad53(?someEntity)
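The effect of this family of rules can be pictured as case-insensitive substring classification. In the sketch below the synonym table is hypothetical and covers only two of the five protein classes; "YPL153C" is the systematic name matched by TUO_00004_Systematic above.

```python
# Illustrative rule table: class name -> substrings matched in synonyms
RULES = {"Rad9": ["rad9"], "Rad53": ["rad53", "ypl153c"]}

def classify(synonyms):
    """Return the set of protein classes whose patterns appear
    (case-insensitively) in any of the given synonyms."""
    found = set()
    for cls, patterns in RULES.items():
        for syn in synonyms:
            if any(p in syn.lower() for p in patterns):
                found.add(cls)
    return found

print(classify(["RAD9", "some-other-name"]))  # {'Rad9'}
print(classify(["YPL153C"]))                  # {'Rad53'}
```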

Other rules of this type were created specifically to perform the same task as the axioms present in the defined class referenced in the rules (see Use Case 2 in Section 5.2 for more information), thus allowing users of the ontology to see the results of the inference without needing to realise the ontology with a reasoner (see Section 1.5 for a description of realisation). An example of this is TUO_00012, which populates the Rad9Rad17Interaction and associated properties without needing to infer the placement of instances via a reasoner. Rules TUO_00013-15 (not shown) are virtually identical to TUO_00012, instead mapping the information required for Rad9Rad53Interaction, Rad9Mec1Interaction and Rad9Chk1Interaction.

TUO_00012 :
tuo:Rad9(?rad9instance) ∧
tuo:plays(?rad9instance, ?rad9participant) ∧
tuo:Rad17(?rad17instance) ∧
tuo:plays(?rad17instance, ?rad17participant) ∧
tuo:hasParticipant(?process, ?rad9participant) ∧
tuo:hasParticipant(?process, ?rad17participant) →
tuo:Rad9Rad17Interaction(?process)

Finally, there are two stored SQWRL queries in the telomere ontology. TUO_SQWRL_00001 selects those entities with a taxonomic identifier of 4932 and which are instances of the Rad9 class, which itself is filled via TUO_00001 and TUO_00002.

TUO_SQWRL_00001 :
tuo:Rad9(?someEntity) ∧
tuo:hasTaxon(?someEntity, ?x) ∧
tuo:ncbiTaxId(?x, ?y) ∧
swrlb:equal(?y, 4932) →
sqwrl:select(?someEntity)

TUO_SQWRL_00004 retrieves only those reactions that involve Rad9, and is used within Use Case 2 to retrieve specific interactions with Rad9 instances (see Section 5.2).

TUO_SQWRL_00004 :
tuo:Rad9(?rad9instance) ∧
tuo:plays(?rad9instance, ?participant) ∧
tuo:hasParticipant(?process, ?participant) →
sqwrl:select(?rad9instance, ?process)

5 Two use cases demonstrate rule-based mediation

Rule-based mediation has been implemented for the two use cases presented in this section. A more generic Web implementation of rule-based mediation is currently under development. The use cases show how rule-based mediation extends a quantitative SBML model as a proof-of-principle for the methodology.

The context for the use cases is based on the Proctor and colleagues’ S.cerevisiae telomere model previously described in this thesis [39]. In this way the integration results for the use cases can be compared against an already-extant, curated model. One of the key proteins within this model, RAD9, was chosen as the basis for the two use cases. In the use cases, various types of information about this query protein are retrieved. These examples highlight how rule-based mediation works in a systems biology setting and provide insight into how such a system might be extended in the future to provide automated model annotation.

Each use case shows a different aspect of the integration process. Use Case 1 shows how an existing entity within an SBML model can be augmented, and Use Case 2 describes the addition of near neighbours to the selected model species. The use cases begin with the same initial conditions. The telomere ontology described in Section 4.1 was used as the core ontology for the sources described in Section 2. Three standard Semantic Web tools (XMLTab [31], SWRLTab and SQWRLQueryTab [40]) were then used to implement the integration step via SWRL mappings. The information retrieved from the core ontology allowed the annotation of an SBML model, including the provisional identification of information not previously present in the model. These use cases demonstrate the feasibility of the approach, the applicability of the technology and its utility in gaining new knowledge.

The following subsections describe two methods for enriching the telomere model. In Use Case 1, a single SBML species is annotated with information relevant to the gene RAD9. Adding information to existing SBML elements at an early stage aids model development and extension prior to publication or deposition. In Use Case 2, possible protein-protein interactions are retrieved involving the RAD9 protein. This approach results in the identification of model elements as well as a putative match for an enzyme that was not identified in the original curated model. These examples show how rule-based mediation works in a systems biology setting. These use cases also show that rule-based mediation successfully integrates information from multiple sources using existing tools, and that it would be useful to expand the implementation of this method to additional biological questions and automated model annotation.

Prior to performing the queries for each use case, all mapping rules from the source ontologies to the telomere ontology were run (BP_*, PSIMIF_*, and UP_*, as described in Section 3). These rules populated the telomere ontology with the instance data from the source ontologies. Next, the rules to further classify the new instance data within the telomere ontology (TUO_000*) were executed. Once the execution of the SWRL rules was complete, the telomere ontology could be queried with SQWRL. To get the final results, the queries and rules specific to the use cases (TUO_MFO_* and TUO_SQWRL_*) were executed.

Protégé 3.4 and the built-in Pellet 1.5.2 reasoner were used to reason over the entire set of integrated ontologies as well as the SBML model within MFO. When running on a dual-CPU laptop running Ubuntu 11.10 with 4 Gigabytes of memory, consistency checks for the set of integrated ontologies take approximately 10-15 seconds, while computing inferred types requires approximately 9-12 minutes. Reasoning over just MFO with the annotated model from the use case takes less than 10 seconds for both consistency checks and the computation of inferred types. Finally, the MFO software toolset was used to convert the newly-annotated model from OWL to SBML.

Once instance data is loaded into the telomere ontology from all of the source ontologies via the rules described in Section 3, reasoning, together with the application of telomere-ontology-specific rules, can be performed to merge semantically identical instances, detect logical inconsistencies, and infer the most specific position of each instance in the hierarchy. For each use case, a query term and a taxonomic restriction have been provided. The query results are mapped from the telomere ontology to either a new or an existing SBML model in MFO.

5.1 Use Case 1: Augmentation of existing model species

In Use Case 1, the SBML species ’rad9’ is investigated. The only information provided by the modeller is the query term ’rad9’ and the taxonomic species S.cerevisiae. In contrast to a taxonomic species, an SBML species is any physical entity that can be part of an SBML reaction. There are a number of motivations for such a query: many models are initially created as skeletons, with all species and reactions present but without accompanying data; alternatively, modellers might be searching for annotation at an even earlier stage, and want to see what is available for just one particular species. By adding information to SBML components at an early stage, modellers will not only have a more complete model, but such extra information could also aid them in developing and extending the model prior to publication or deposition in a database such as BioModels [29]. This use case showcases the ability of the methodology to add basic information to the species such as cross-references and other annotations.

In Use Case 1, therefore, the telomere ontology was queried for information about the S.cerevisiae gene RAD9. This use case demonstrates the addition of basic information to the species such as cross-references, SBO annotations, compartment localisations and a recommended name. The query for proteins containing ’rad9’ as a name or synonym is stored within the telomere ontology as SWRL rules TUO_00001-2 (described initially in Section 4.2). Running the rules collects all matching telomere ontology Protein instances and stores them as instances of Rad9, a subclass of Protein.

The antecedent of TUO_00001 queries the telomere ontology synonym property within Protein for a string matching “rad9”, and the consequent of the rule fills Rad9 with any matched instances. TUO_00002 works in the same way, with recommendedName as the queried field. Running these SWRL rules places the following two instances under Rad9, each from a different data source:


The next step is to discover equivalent instances elsewhere in the integrated schema and mark them as such. This is done by mimicking the behaviour of the OWL 2 construct owl:hasKey (see also Section 6). The end result of this procedure is that all instances with the same SGD accession will be marked as identical using the owl:sameAs construct. In future, this can happen automatically as the relation linking a telomere ontology Protein to its SGD accession number can be classed as the key for the Protein class in an owl:hasKey construct.

Until the owl:hasKey construct is used, instances sharing an SGD accession can be identified using a SQWRL query such as that shown below, after which the owl:sameAs assertions are added manually. A SQWRL query is more appropriate than a SWRL rule here, as SWRL rules would modify the target ontology.

tuo:Protein(?someEntity) ∧
tuo:hasDatabaseReference(?someEntity, ?dbref) ∧
tuo:databaseName(?dbref, ?name) ∧
swrlb:containsIgnoreCase(?name, “SGD”) ∧
tuo:accession(?dbref, ?acc) ∧
swrlb:equal(?acc, “S000002625”) →
sqwrl:select(?someEntity)
There are already two instances of Rad9 as a result of TUO_00001-2. After running the SQWRL query above and viewing the results (upkb:protein_0 and psimif:interactor_59), the owl:sameAs construct can be applied between those two instances. The new instance, psimif:interactor_59, does not contain ‘rad9’ in its name or synonyms, but does contain a matching SGD accession. After inferring the placement of all individuals in the ontology using a reasoner, psimif:interactor_59 is inferred as an instance of Rad9, bringing the total number of Rad9 instances to three:


The final missing owl:sameAs equivalences were then declared for these three instances, allowing the knowledge contained within each of them to be accessible as a single logical unit. The final step of this use case is to restrict the instances of Rad9 to the specified NCBI taxonomic identifier for S.cerevisiae. The SQWRL query TUO_SQWRL_00001 (see Section 4.2) performs this function.
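The key-based equivalence described above can be sketched in ordinary code. This is only an illustration, not the thesis implementation: instances are reduced to (database, accession) records, the instance `biopax:protein_12` is invented for the example, and `same_as_pairs` is a hypothetical helper that emits the pairs to be linked with owl:sameAs.

```python
from itertools import combinations

# Illustrative data only: the first two instances come from the use case;
# biopax:protein_12 is a hypothetical third instance with a different key.
instances = {
    "upkb:protein_0":       {"db": "SGD", "accession": "S000002625"},
    "psimif:interactor_59": {"db": "SGD", "accession": "S000002625"},
    "biopax:protein_12":    {"db": "SGD", "accession": "S000000340"},
}

def same_as_pairs(instances):
    """Emulate owl:hasKey: group instances by (database, accession) and
    return the pairs that should be declared owl:sameAs."""
    by_key = {}
    for name, ref in instances.items():
        by_key.setdefault((ref["db"], ref["accession"]), []).append(name)
    pairs = []
    for members in by_key.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs

print(same_as_pairs(instances))
```

Only instances sharing the full (database, accession) key are paired, which is why using the wrong kind of cross-reference as a key (discussed below) directly produces false equivalences.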

The information contained within these three instances is then sent out, via MFO, to a new version of the telomere model. In order to export the relevant information to MFO, a number of mapping rules with telomere ontology classes in the antecedent and MFO classes in the consequent were created (see Section 3.4). To create MIRIAM cross-references and SBO terms, rules TUO_MFO_00001 through TUO_MFO_00004 were run. For instance, in TUO_MFO_00001, SBO terms and all UniProtKB cross-references for Rad9 were retrieved and reformatted according to MIRIAM guidelines, and added to the appropriate MFO Rad9 species.

The original curated telomere model contained a single reference to UniProtKB, which was confirmed by the information contained within the telomere ontology. In addition, an SBO term and a number of other MIRIAM annotations were exported into the model for the two RAD9 species in the model. As described in Section 3.4, Protein classes from the telomere ontology are mapped directly to species elements in SBML. In Use Case 1, mapping from the telomere ontology back to MFO links annotation from the relevant Rad9 instances back to MIRIAM annotations within the MFO Species class. The XML snippet below shows one of the Rad9 species after the results of this use case were exported:

    <species metaid="_044468" id="Rad9A" name="Rad9" compartment="nucleus"
    initialAmount="0" hasOnlySubstanceUnits="true" sboTerm="SBO:0000245">
                <rdf:li rdf:resource="urn:miriam:uniprot:P14737"/>
                <rdf:li rdf:resource="urn:miriam:intact:EBI-14788"/>
                <rdf:li rdf:resource="urn:miriam:uniprot:Q04920"/>
                <rdf:li rdf:resource="urn:miriam:pathwaycommons:92332"/>
                <rdf:li rdf:resource="urn:miriam:sgd:S000002625"/>
                <rdf:li rdf:resource="urn:miriam:intact:P14737"/>

All three data sources contributed to the set of information retrieved in Use Case 1. By successfully mapping data from the source ontologies into the core telomere ontology, information was retrieved regarding proteins with ’rad9’ as one of their names. This information was then added to an SBML file by mapping from the telomere ontology back to MFO.
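The export step of Use Case 1 can be approximated as follows. This is a hedged sketch, not the actual MFO exporter: `miriam_urn` and `add_cross_references` are hypothetical helpers, and the rdf:li elements are attached flatly, mirroring only the excerpted snippet above rather than the full annotation/rdf:RDF/bqbiol:is/rdf:Bag nesting used in complete SBML MIRIAM annotations.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def miriam_urn(database, accession):
    """Format a cross-reference as a MIRIAM URN, e.g. urn:miriam:uniprot:P14737."""
    return f"urn:miriam:{database.lower()}:{accession}"

def add_cross_references(species, refs):
    """Append one rdf:li per cross-reference under an rdf:Bag.
    Simplified layout: a real exporter would nest the Bag inside the
    species' annotation/rdf:RDF/rdf:Description/bqbiol:is structure."""
    bag = ET.SubElement(species, f"{{{RDF}}}Bag")
    for db, acc in refs:
        li = ET.SubElement(bag, f"{{{RDF}}}li")
        li.set(f"{{{RDF}}}resource", miriam_urn(db, acc))
    return species

species = ET.Element("species", id="Rad9A", name="Rad9")
add_cross_references(species, [("uniprot", "P14737"), ("sgd", "S000002625")])
print(ET.tostring(species).decode())
```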

The use of the owl:sameAs construct for identifying equivalent instances is helpful, but caused some problems for this use case. For instance, a bug in the SQWRL implementation used in this work prevented querying any instances marked with owl:sameAs axioms, so these axioms had to be removed before running the queries. There is also the potential for false positives when declaring two instances identical with owl:sameAs. BioPAX, for example, has multiple types of cross-reference, including unificationXref (linking to semantically identical entries in other data resources) and relationshipXref, which implies a less rigorous link. If the wrong type of key is created (e.g. using relationshipXref rather than unificationXref), or an imprecise rule is used, two related but non-identical instances could be incorrectly marked as identical. Mapping rules must therefore be created carefully.

5.2 Use Case 2: Augmentation by addition of near neighbours

Use Case 2 shows how information about possible protein-protein interactions involving RAD9 can be retrieved from the telomere ontology, confirming interactions already present in the curated model as well as discovering novel interactions. The retrieved information results in the proposal of new model elements as well as the putative identification of the enzyme responsible for activation of RAD9. The results of Use Case 1 are used as part of the input in this use case. TUO_SQWRL_00004 is the SQWRL query used to display, but not store, all interactions involving Rad9 (see also Section 4.2).

In TUO_SQWRL_00004, the antecedent of the SQWRL query identifies interactions involving Rad9 instances via the Participant class. The consequent displays the name of each instance and the process in which it is involved. Using SQWRL here rather than SWRL allows the more than 270 returned interactions to be viewed without saving them in the asserted ontology; there are too many, and some are not meaningful for the model under study. If necessary, such unwanted results could be excluded with a small modification to the query or to the output mapping rules for MFO. Here, the interactions retrieved by TUO_SQWRL_00004 were filtered to those species already present in the model, many of which had no prior involvement with RAD9. Before such relevant interactions can be found, the identity of the proteins involved must be determined: just as Rad9 instances were identified in Use Case 1, so must all other proteins of interest be identified. While these rules are currently written manually, they could easily be generated and run on the fly. TUO_00001-11 are the SWRL rules which populate the classes for these proteins of interest (see Section 4.2).
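The on-the-fly generation of such rules could look something like the following sketch; the template, helper name and simplified rule syntax are assumptions, loosely imitating the style of TUO_00001-2.

```python
# Hypothetical template in the style of TUO_00001-2; the real rules live in
# the telomere ontology and their exact syntax is richer than shown here.
RULE_TEMPLATE = (
    "tuo:Protein(?p) ∧ tuo:{prop}(?p, ?n) ∧ "
    'swrlb:containsIgnoreCase(?n, "{query}") → tuo:{cls}(?p)'
)

def name_rules(protein_class, query):
    """Generate one rule matching on synonym and one on recommendedName,
    each classifying matches into the given protein class."""
    return [
        RULE_TEMPLATE.format(prop=prop, query=query, cls=protein_class)
        for prop in ("synonym", "recommendedName")
    ]

for rule in name_rules("Rad53", "rad53"):
    print(rule)
```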

Running rules TUO_00001-11 identifies and classifies the appropriate proteins for this model. Once this is complete, specific interactions of interest can be identified. Below are the OWL axioms on the telomere ontology defined class Rad9Rad53Interaction, one of four such defined classes which filter the Rad9 interactions. When reasoning over the instance data is performed, the PhysicalProcess instances which meet the criteria set out in these necessary and sufficient axioms are inferred to be children of Rad9Rad53Interaction.

Class: Rad9Rad53Interaction
    EquivalentTo:
        (hasParticipant some (playedBy some Rad53))
        and (hasParticipant some (playedBy some Rad9))
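The classification these necessary and sufficient axioms express can be mimicked with a simple participant check. The interaction identifiers (other than interaction_160, which appears later in this use case) and the Dun1 participant below are illustrative only:

```python
# Illustrative interaction data: each interaction maps to the set of
# proteins playing its participant roles.
interactions = {
    "interaction_160": {"Rad9", "Rad17"},
    "interaction_42":  {"Rad9", "Rad53"},   # hypothetical identifier
    "interaction_7":   {"Rad53", "Dun1"},   # hypothetical identifier
}

def classify(interactions, required):
    """Return interactions whose participants include every required protein,
    mirroring the 'hasParticipant some (playedBy some X)' conjunction."""
    return sorted(
        name for name, parts in interactions.items() if required <= parts
    )

print(classify(interactions, {"Rad9", "Rad53"}))  # ['interaction_42']
```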

Table 2 shows a summary of the inference results from the defined interaction classes Rad9Rad53Interaction, Rad9Chk1Interaction, Rad9Mec1Interaction and Rad9Rad17Interaction. The results may also be obtained by running TUO_00012-15 (see Section 4.2), or by loading the asserted ontology and re-running the reasoning. There are a total of four interactions involving RAD9 present in the curated model. Two of these, with RAD53 and CHK1, were confirmed in this use case. The RAD53 interaction was found in both BioGRID and Pathway Commons, while the CHK1 interaction was only in Pathway Commons.

| Discovered interactor with RAD9 (P14737) | Proctor et al. | BioGRID | Pathway Commons |
|---|---|---|---|
| Serine/threonine-protein kinase RAD53 (P22216) | yes | yes | yes |
| Serine/threonine-protein kinase CHK1 (P38147) | yes | | yes |
| Serine/threonine-protein kinase MEC1 (P38111) | | yes | |
| DNA damage checkpoint control RAD17 (P48581) | | yes | |
| Rad9Kin (*) | yes | | |
| ExoX (*) | yes | | |

Table 2: Summary of interactions retrieved for Use Case 2 against the telomere ontology. All discovered interactions already present in the curated model (RAD53 and CHK1) are shown, together with example interactions from the curated model that were not discovered (with Rad9Kin and ExoX) and discovered interactions that were not present in the model (with MEC1 and RAD17). SBML species marked with an asterisk (*) are placeholder species, and therefore cannot be matched to a real protein. Some interactions are false positives inherent in the data source, while others are out of scope of the modelling domain of interest and should not be included.

The other two RAD9 interactions in the curated model involve placeholder species, created by the modeller to describe unknown proteins that could not be matched further. In addition to confirming the RAD53 and CHK1 interactions with RAD9, one of those placeholder species, marked as ’Rad9Kin’ in the model, was provisionally identified as the protein MEC1 using data originating in BioGRID. In the curated model, the ’Rad9Kin’ species has no UniProtKB primary accession, as the model author did not know which protein activated RAD9. However, the telomere ontology shows MEC1 interacting with RAD9, MEC1 is present elsewhere in the curated model reactions, and UniProtKB reports it as a kinase which phosphorylates RAD9. From this information, the model author now believes that MEC1 could be the correct protein to use as the activator of RAD9 [41]. RAD17 was added to Table 2 as another example of a protein present in the curated model, but not as an interacting partner of RAD9. This interaction may have been unknown to the modeller, and the curated model might be improved by the addition of a RAD9/RAD17 reaction.

After the information was retrieved within the telomere ontology, SWRL rules TUO_MFO_00005-12 were run to export the information from the telomere ontology into MFO and, ultimately, into the original SBML model. The XML snippet below shows the new SBML species added to the model as part of the new reaction with MEC1:

     <species metaid="_3a9a2e27" id="interactor_24_id" name="YBR136W"
          <rdf:Description rdf:about="#_3a9a2e27">
                <rdf:li rdf:resource="urn:miriam:sgd:S000000340"/>
                <rdf:li rdf:resource="urn:miriam:uniprot:P38111"/>

Here, the id is based on the name of the telomere ontology Mec1 instance, the name is that instance’s recommendedName or synonym, and the MIRIAM annotations are drawn from the list of DatabaseReference instances. MIRIAM annotations, names and SBO terms were added to the SBML species referenced in the use cases, and two example reactions were added: RAD9 with RAD17, and RAD9 with MEC1. The former was chosen as an example of how a ProteinComplexFormation reaction is mapped back to MFO, and the latter was chosen as an example of the mapping of a simpler interaction. Further, as MEC1 is a candidate for the value of the ’Rad9Kin’ species in the curated model, creating a skeleton reaction in the annotated version of the curated model is the first step in updating the model. An excerpt of the RAD9/RAD17 reaction in SBML is included below. Further work by the modeller can flesh out the base reaction, species and annotation created by the integration method to include parameters, initial values and rate constants.

     <reaction id="interaction_160_id">
        <speciesReference species="interactor_59_id"/>
        <speciesReference species="interactor_8_id"/>
        <speciesReference species="product_of_rad9rad17interaction"/>
    <species metaid="_720ffc30" id="interactor_8_id" name="YOR368W"
          <rdf:Description rdf:about="#_720ffc30">
                <rdf:li rdf:resource="urn:miriam:sgd:S000005895"/>
                <rdf:li rdf:resource="urn:miriam:uniprot:P48581"/>
    <species id="product_of_rad9rad17interaction" name="Rad9Rad17Complex"/>

For Use Case 2, while BioGRID contained the most interactions, it was UniProtKB and Pathway Commons which together contributed the majority of the other types of information. Curated RAD9 reactions with RAD53 and CHK1 were recovered, while of the many other reactions, one (with RAD17) was with an existing species in the model and another (with MEC1) putatively identified a placeholder species in that model.

Within the telomere ontology, the product of a protein complex formation must be a protein complex, as described in the ProteinComplexFormation class. Specifically, the RAD9/RAD53 and RAD9/RAD17 reactions are part of a BioGRID ‘Reconstituted Complex’. This reaction type can be represented for RAD9/RAD53 as rad9 + rad53 ↔ rad9rad53Complex. While not identical to the curated reaction, which shows activation of RAD53 by RAD9 (rad9Active + rad53Inactive → rad9Active + rad53Active), the proposed interaction is potentially informative.
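Rendering a ‘Reconstituted Complex’ as a reversible complex-formation reaction of this shape can be sketched generically; the helper name and naming convention are hypothetical:

```python
def complex_formation(a, b):
    """Render two proteins as a reversible complex-formation reaction,
    e.g. rad9 + rad53 <-> rad9rad53Complex (naming convention assumed)."""
    product = f"{a}{b}Complex"
    return f"{a} + {b} ↔ {product}"

print(complex_formation("rad9", "rad53"))  # rad9 + rad53 ↔ rad9rad53Complex
```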

Use Case 2 addresses a modeller’s need to learn of possible new reactions involving the query species retrieved from Use Case 1; in this scenario, the modeller is only interested in protein-protein interactions. The query for this use case would be beneficial both during model creation and when extending an existing model.

Combining data sources in a semantically-meaningful, rule-based mediation methodology thus proved an effective way of retrieving information for SBML model annotation.

6 Future directions

6.1 Data provenance

There are limitations in describing data provenance using the SBML standard notation. When new MIRIAM annotations are added to an SBML model, there is no way of stating what procedure added them, or when. At a minimum, the software name, version number and the date on which the information was added should be recorded; at the moment, however, SBML cannot make statements about annotations. The proposed annotation package for SBML Level 3 will address this issue, among others [42]. Until it does, no provenance information is included in the newly-annotated models.

6.2 Mapping and the creation of syntactic ontologies

When mapping the telomere ontology to MFO, a number of limitations were encountered with the technology used. For instance, in order to map a telomere ontology PhysicalProcess such as ProteinComplexFormation correctly, a number of new instances must be created when the rule is run (see the rule in Section 3.4). When multiple reactions make use of a single protein instance (such as, in these use cases, any Rad9 instance), a new MFO identifier, or SId, is created each time. The result is that more than one SId is created for the same protein instance; left uncorrected, this produces an inconsistent ontology, as only one SId is allowed per MFO Species. Also, because SWRL rules do not add information to a data property slot unless that slot is empty, running a rule which adds a telomere ontology recommendedName value to an MFO species will not work if a similar rule using the synonym value has previously been run. These limitations can be ameliorated through more sophisticated use of OWL and SWRL tools. One such tool already used within the MFO project is the OWL API [43], which allows finer control over ontologies and their mapping and reasoning. Greater use of the SWRLAPI [32] might also address these limitations.
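One way to ameliorate the duplicate-SId problem is a cache that mints an identifier only on the first use of a protein instance and hands back the same SId thereafter. A minimal sketch, with an assumed SId naming scheme:

```python
import itertools

class SIdAllocator:
    """Hand out a stable SBML SId per protein instance, instead of minting
    a fresh identifier on every rule firing (naming scheme is assumed)."""

    def __init__(self):
        self._ids = {}
        self._counter = itertools.count()

    def sid_for(self, instance):
        """Return the cached SId for an instance, minting one on first use."""
        if instance not in self._ids:
            self._ids[instance] = f"interactor_{next(self._counter)}_id"
        return self._ids[instance]

alloc = SIdAllocator()
first = alloc.sid_for("psimif:interactor_59")
again = alloc.sid_for("psimif:interactor_59")
assert first == again  # one SId per MFO Species, as the constraint requires
```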

While extensive use has been made of pre-existing applications, plugins and libraries, the limits of some of these technologies could be reached in the near future. In particular, improvements will need to be made to the way XML is converted into OWL. Further, if this methodology is to be used in a larger-scale semantic data integration project, the correspondingly large number of instances will most likely necessitate a database back-end for the ontologies. Technologies are available for storing ontology instances within a database schema, such as that of Sahoo and colleagues [8], the Protégé database back-end (Note: http://protegewiki.stanford.edu/wiki/Working_with_the_Database_Backend_in_OWL) or DBOWL [44].

6.3 OWL 2

The use cases presented in this chapter, while performed with existing tools, required manual intervention. Improvements can be made to the implementation of this methodology through the use of additional OWL 2 features. In order to assert equivalences between instance data from the various data sources, the owl:sameAs construct was manually applied. Tool and reasoning support for OWL 2, now a W3C Recommendation [45], is growing, and advances such as the owl:hasKey construct [46] allow more automation in this area. These keys, which act like inverse functional datatype properties, allow equivalence to be defined in terms of properties such as database accessions, which in the current system was achieved by manual mapping. For example, by applying an owl:hasKey axiom to the data property holding UniProtKB accession numbers, any instances with matching UniProtKB accessions will be inferred to be identical.

6.4 Classes versus instance modelling for biological entities

In ontologies such as BioPAX Level 3, a specific protein is modelled as an instance of a class rather than as a class in its own right. For instance, RAD9 would be an instance of the BioPAX Protein class. Conversely, in the telomere ontology every protein, including Rad9, has been given its own class. This allows expressive statements to be made about the protein and about the particular interactions it is involved in, but there are challenges to the broader usefulness of such mapping. As Demir has highlighted (Note: https://groups.google.com/forum/?hl=en#!topic/biopax2model/01XfxFfLxhA), these hurdles are social as well as technological in nature.

Technologically, the use of expressive classes for specific biological entities such as Rad9 makes the information difficult for a “traditional” bioinformatician to use, as reasoners must be applied to get the most out of such ontologies. Reasoners can be difficult for newcomers: their purpose is often unclear and their error messages tend to be esoteric. From a social perspective, allowing users to create an unlimited number and variety of entities such as Rad9 can lead to a lack of standard methods for creating and using those entities.

However, these challenges are all associated with the creation of an ontology which is expressly developed as a community, or shared, ontology. Core ontologies in rule-based mediation may be used by multiple groups, but are primarily intended to address a specific and possibly short-term biological question, and as such do not need to fulfil such a wide remit. While the initial development of the source and core ontologies requires an ontologist, populating the resulting knowledgebase within the core ontology is intended to be accessible via a Web application currently under development, and, in future, querying that knowledgebase will be possible via Saint (Chapter 3). Reasoners are an intrinsic part of rule-based mediation, and the number of ontologists building the core ontology would normally be small. Therefore, while the issues raised by Demir are valid for large-scale ontologies intended for public use, they do not apply to the primary role of rule-based mediation.

In the use cases described in this chapter, instance-to-instance SWRL mappings assign membership of instances from source ontologies to classes from the core ontology. For example, the rad9A instance from the MFO version of the telomere model becomes an instance of the Rad9 class in the telomere ontology. It would be useful to be able to use mappings to automatically create core ontology classes such as Rad9. This instance-to-class mapping is theoretically possible, but no programmatically-accessible rule language has this feature. Such rules could instead be hard-coded using a library such as the OWL API. This would allow the automated population of a core ontology with all of the necessary classes for a particular pathway or systems biology model of interest.
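The instance-to-class mapping described here can be imitated in a general-purpose language: a class named after the protein is created dynamically, then the source instance is registered under it. This is only an analogy with hypothetical names; the real implementation would use the OWL API.

```python
class Protein:
    """Stand-in for the core ontology's Protein class."""
    instances = None

def class_for_protein(name, registry):
    """Create (or fetch) a subclass named after the protein, e.g. Rad9,
    mimicking automated instance-to-class promotion in the core ontology."""
    cls_name = name.capitalize()
    if cls_name not in registry:
        registry[cls_name] = type(cls_name, (Protein,), {"instances": []})
    return registry[cls_name]

registry = {}
rad9 = class_for_protein("rad9", registry)
rad9.instances.append("mfo:rad9A")  # register the MFO source instance
print(rad9.__name__, rad9.instances)  # Rad9 ['mfo:rad9A']
```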

7 Discussion

In rule-based mediation, a set of source ontologies is individually linked to a core ontology which describes the domain of interest. Similar or even overlapping data sources and formats can be handled, as each format has its own source ontology. Although some researchers question the conversion from a closed-world syntax such as XML to an open-world language such as OWL (Note: https://groups.google.com/forum/?hl=en#!topic/biopax2model/01XfxFfLxhA), the use of syntactic ontologies separates the closed-world source syntax from the biological domain modelling which takes place in the core ontology. In essence, the OWL in a syntactic ontology is “abused” in order to allow the conversion of the underlying source data to the more expressive model of the core ontology. Unlike previous integration methods where one ontology type is created as a view of the other, rule-based mediation allows both the straightforward addition of new source ontologies and the maintenance of the core ontology as an independent entity. New data sources and data formats can therefore be added without changing how the core model of the biological domain is structured.

Rule-based mediation was shown to be a valid methodology for two use cases relevant to the biological modelling community. Two syntactic ontologies were created to represent PSI-MIF and UniProtKB XML entries in OWL format. BioPAX has also been adopted as one of the source ontologies, showing how pre-existing ontologies can successfully be used as a source ontology without modification. The telomere ontology, which describes the research domain surrounding the use cases, can be reasoned over and is able to store information mapped from multiple source ontologies. The source ontologies and telomere ontology are available for download, pre-filled with the use case instances. In future, a Web application will allow the user to query the source databases and retrieve a tailor-made dataset which is then automatically mapped to the telomere ontology. A semantic, rule-based data integration methodology has been created, and proof-of-principle for that methodology in the context of SBML model annotation has been shown.



Peter Li, Tom Oinn, Stian Soiland, and Douglas B. Kell. Automated manipulation of systems biology models using libSBML within Taverna workflows. Bioinformatics (Oxford, England), 24(2):287–289, January 2008.
Falko Krause, Jannis Uhlendorf, Timo Lubitz, Marvin Schulz, Edda Klipp, and Wolfram Liebermeister. Annotation and merging of SBML models with semanticSBML. Bioinformatics, 26(3):421–422, February 2010.
Neil Swainston and Pedro Mendes. libAnnotationSBML: a library for exploiting SBML annotations. Bioinformatics, 25(17):2292–2293, September 2009.
M. L. Blinov, O. Ruebenacker, and I. I. Moraru. Complexity and modularity of intracellular networks: a systematic approach for modelling and simulation. IET systems biology, 2(5):363–368, September 2008.
Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
Natalya F. Noy and Mark A. Musen. The PROMPT suite: interactive tools for ontology merging and mapping. Int. J. Hum.-Comput. Stud., 59(6):983–1024, December 2003.
Robert Hoehndorf, Michel Dumontier, Anika Oellrich, Dietrich Rebholz-Schuhmann, Paul N. Schofield, and Georgios V. Gkoutos. Interoperability between Biomedical Ontologies through Relation Expansion, Upper-Level Ontologies and Automatic Reasoning. PLoS ONE, 6(7):e22006+, July 2011.
Satya S. Sahoo, Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner, and Amit P. Sheth. An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. Journal of biomedical informatics, 41(5):752–765, October 2008.
O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
Li Xu and David W. Embley. Combining the Best of Global-as-View and Local-as-View for Data Integration. In Anatoly E. Doroshenko, Terry A. Halpin, Stephen W. Liddle, Heinrich C. Mayr, Anatoly E. Doroshenko, Terry A. Halpin, Stephen W. Liddle, and Heinrich C. Mayr, editors, ISTA, volume 48 of LNI, pages 123–136. GI, 2004.
Alan Ruttenberg, Jonathan A. Rees, Matthias Samwald, and M. Scott Marshall. Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in bioinformatics, 10(2):193–204, March 2009.
Michael Bada, Kevin Livingston, and Lawrence Hunter. An Ontological Representation of Biomedical Data Sources and Records. Ontogenesis, June 2011.
Marvin Schulz, Falko Krause, Nicolas Le Novere, Edda Klipp, and Wolfram Liebermeister. Retrieval, alignment, and clustering of computational models based on semantic annotations. Molecular Systems Biology, 7(1), July 2011.
Robert Hoehndorf, Michel Dumontier, John Gennari, Sarala Wimalaratne, Bernard de Bono, Daniel Cook, and Georgios Gkoutos. Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5(1):124+, August 2011.
R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and A. Brass. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics (Oxford, England), 16(2):184–185, February 2000.
Marie C. Rousset and Chantal Reynaud. Knowledge representation for information integration. Inf. Syst., 29(1):3–22, 2004.
Ian Horrocks, Peter F. Patel-Schneider, Harold Boley, Said Tabet, Benjamin Grosof, and Mike Dean. SWRL: A Semantic Web Rule Language Combining OWL and RuleML. http://www.w3.org/Submission/SWRL/, May 2004.
Bijan Parsia. Understanding SWRL (Part 1), August 2007.
Pascal Hitzler. OWL and Rules. In OWLED 2011, June 2011.
Markus Krötzsch, Sebastian Rudolph, and Pascal Hitzler. Description Logic Rules. In Proceeding of the 2008 conference on ECAI 2008: 18th European Conference on Artificial Intelligence, pages 80–84, Amsterdam, The Netherlands, The Netherlands, 2008. IOS Press.
Martin O’Connor, Holger Knublauch, Samson Tu, Benjamin Grosof, Mike Dean, William Grosso, and Mark Musen. Supporting Rule System Interoperability on the Semantic Web with SWRL. pages 974–986. 2005.
M. J. O’Connor and A. K. Das. SQWRL: a Query Language for OWL. In OWL: Experiences and Directions (OWLED), Fifth International Workshop, 2009.
Aitken Stuart, Korf Roman, Webber Bonnie, and Bard Jonathan. COBrA: a bio-ontology editor. Bioinformatics, 21(6):825–826, March 2005.
S. M. Falconer, N. F. Noy, and M. A. Storey. Ontology Mapping – A User Survey. In The Second International Workshop on Ontology Matching at ISWC 07 + ASWC 07, 2007.
Joanne S. Luciano and Robert D. Stevens. e-Science and biological pathway semantics. BMC bioinformatics, 8 Suppl 3(Suppl 3):S3+, 2007.
Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 34(suppl 1):D535–D539, January 2006.
Henning Hermjakob, Luisa Montecchi-Palazzi, Gary Bader, Jérôme Wojcik, Lukasz Salwinski, Arnaud Ceol, Susan Moore, Sandra Orchard, Ugis Sarkans, Christian von Mering, Bernd Roechert, Sylvain Poux, Eva Jung, Henning Mersch, Paul Kersey, Michael Lappe, Yixue Li, Rong Zeng, Debashis Rana, Macha Nikolski, Holger Husi, Christine Brun, K. Shanker, Seth G. Grant, Chris Sander, Peer Bork, Weimin Zhu, Akhilesh Pandey, Alvis Brazma, Bernard Jacq, Marc Vidal, David Sherman, Pierre Legrain, Gianni Cesareni, Ioannis Xenarios, David Eisenberg, Boris Steipe, Chris Hogue, and Rolf Apweiler. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nature biotechnology, 22(2):177–183, February 2004.
Ethan G. Cerami, Benjamin E. Gross, Emek Demir, Igor Rodchenkov, Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D. Bader, and Chris Sander. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research, 39(suppl 1):D685–D690, January 2011.
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
Michael Sintek. XML Tab. http://protegewiki.stanford.edu/wiki/XML_Tab, June 2008.
M. J. O’Connor, R. D. Shankar, S. W. Tu, C. I. Ny, and A. K. Dasulas. Developing a Web-Based Application using OWL and SWRL. In AAAI Spring Symposium, 2008.
Christoph Lange. Krextor — An Extensible XML->RDF Extraction Framework. In Chris Bizer, Sören Auer, and Gunnar A. Grimnes, editors, 5th Workshop on Scripting and Development for the Semantic Web, Colocated with ESWC 2009, May 2009.
A. L. Lister, M. Pocock, and A. Wipat. Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80+, 2007.
Michael Hucka. SBML Groups Proposal (2009-09). http://sbml.org/Community/Wiki/SBML_Level_3_Proposals/Groups_Proposal_%282009-09%29, September 2009.
Holger Wache, T. Vögele, Ubbo Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information — a survey of existing approaches. In H. Stuckenschmidt, editor, Proceedings of the IJCAI’01 Workshop on Ontologies and Information Sharing, Seattle, Washington, USA, Aug 4-5, pages 108–117, 2001.
Jinguang Gu, Baowen Xu, and Xinmeng Chen. An XML query rewriting mechanism with multiple ontologies integration based on complex semantic mapping. Information Fusion, 9(4):512–522, October 2008.
J. S. Luciano. PAX of mind for pathway researchers. Drug discovery today, 10(13):937–942, July 2005.
C. J. Proctor, D. A. Lydall, R. J. Boys, C. S. Gillespie, D. P. Shanley, D. J. Wilkinson, and T. B. L. Kirkwood. Modelling the checkpoint response to telomere uncapping in budding yeast. Journal of The Royal Society Interface, 4(12):73–90, February 2007.
Martin O’Connor. SQWRLQueryTab. http://protege.cim3.net/cgi-bin/wiki.pl?SQWRLQueryTab, February 2010.
C. J. Proctor. Personal Communication, 2009.
Dagmar Waltemath, Neil Swainston, Allyson Lister, Frank Bergmann, Ron Henkel, Stefan Hoops, Michael Hucka, Nick Juty, Sarah Keating, Christian Knuepfer, Falko Krause, Camille Laibe, Wolfram Liebermeister, Catherine Lloyd, Goksel Misirli, Marvin Schulz, Morgan Taschuk, and Nicolas Le Novère. SBML Level 3 Package Proposal: Annotation. Nature Precedings, (713), January 2011.
Matthew Horridge, Sean Bechhofer, and Olaf Noppens. Igniting the OWL 1.1 Touch Paper: The OWL API. In Proceedings of OWLEd 2007: Third International Workshop on OWL Experiences and Directions, 2007.
María del Mar Roldán-García, Ismael Navas-Delgado, Amine Kerzazi, Othmane Chniber, Joaquín Molina-Castro, and José F. Aldana-Montes. KA-SB: from data integration to large scale reasoning. BMC Bioinformatics, 10 Suppl 10(Suppl 10):S5+, 2009.
The W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview. http://www.w3.org/TR/owl2-overview/, October 2009.
Christine Golbreich and Evan K. Wallace. hasKey, in OWL 2 Web Ontology Language: New Features and Rationale. http://www.w3.org/TR/owl2-new-features/#F9:a_Keys, October 2009.

Chapter 4: MFO (Thesis)

[Previous: Chapter 3: Saint]
[Next: Chapter 5: Rule-based mediation]

Model Format OWL integrates multiple SBML specification documents


The creation of quantitative SBML models that simulate the system under study is a time-consuming manual process. Currently, the rules and constraints of model creation, curation, and annotation are distributed over at least three separate specification documents: the SBML XSD, SBO, and the SBML Language Specification for each SBML level (known hereafter as the SBML Manual). Although the XSD and SBO are computationally accessible, the majority of the SBML usage constraints are contained within the SBML Manual, which is not amenable to computational processing.

MFO, a syntactic conversion of the SBML standard to OWL, was created to integrate the three specification documents and captures much of the information they contain [1]. MFO is a fit-for-purpose ontology performing both structural and constraint integration; it can also be reasoned over and used for model validation. SBML models are represented as instances of OWL classes, resulting in a single machine-readable resource for model checking. Knowledge that was previously accessible only to humans is now explicit and directly available computationally. The integration of all structural knowledge into a single resource enables a new style of model development and checking. Inconsistencies in the computationally inaccessible SBML Manual were discovered while creating MFO, and non-conformance of SBML models to the specification was identified while reasoning over the models.

Section 1 describes the rules and constraints of the SBML standard, while Section 2 provides a detailed description of MFO as well as how each section of the SBML standard was modelled. Large-scale conversion of SBML entries as well as the reasoning times for the populated ontologies are described in Section 3. Finally, a discussion of similar work and further applicability is in Section 4.


The MFO website is http://cisban-silico.cs.ncl.ac.uk/MFO/, and MFO itself is available for download (Note: MFO is available from http://metagenome.ncl.ac.uk/subversion/MFO/trunk/src/main/resources/owl/MFO.owl. In order for the import statements to work, a copy of SBO must be in the same directory. A release of SBO guaranteed to work with MFO is available from http://metagenome.ncl.ac.uk/subversion/MFO/trunk/src/main/resources/owl/SBO_OWL.owl). The website contains general information about the project, links to the Subversion repository, associated publications and instructions for use. MFO is licensed by Allyson Lister and Newcastle University under a Creative Commons Attribution 2.0 UK: England & Wales License (Note: http://creativecommons.org/licenses/by/2.0/uk/). The MFO Software Toolset is covered under the LGPL (Note: http://www.gnu.org/copyleft/lesser.html). For more information, see LICENSE.txt in the MFO Subversion project (Note: http://metagenome.ncl.ac.uk/subversion/MFO/trunk/).

1 The SBML standard

Both biological and structural knowledge are stored within an SBML document. Biological knowledge describes and identifies the biological entities themselves, while the structural information defines the capturing of biological knowledge in well-formed documents suitable for processing by machines. The structural knowledge required to create an SBML model is tied up in three main locations:

  • The SBML [2] XSD, describing the range of XML documents considered to be in SBML syntax;
  • SBO [3], containing a hierarchy of unambiguous terms useful as a biologically-meaningful semantic layer of annotation for systems biology models; and
  • The SBML Manual [4], detailing the many restrictions and constraints upon SBML documents, the context within which SBO terms can be used, and information on the correct interpretation of documents conforming to the SBML specification.

The SBML specification is in separate documents because each document has been created to match a different type of consumption. The portion of the knowledge codified in the XSD transmits only the information needed to parametrise and run a computational simulation of the system. The knowledge in SBO is intended to aid human understanding of the model and to aid conversion between formats. Finally, the SBML Manual is aimed at modellers as well as tool developers who need to ensure their developed software is fully compliant with the specification.

Only two of the three documents are directly computationally amenable. Firstly, the SBML XSD describes the range of legal XML documents, the required and optional elements and attributes, and the constraints on certain values within those entities. Secondly, SBO provides a term hierarchy of human-readable descriptions and labels together with machine-readable identifiers and inheritance information. Unfortunately, these two documents contain little information on how they interact, the usage of the XML entities or SBO terms, or what a conforming SBML document means to an end user. The majority of such compliance information is in the SBML Manual in human-readable English. With traditional programming techniques, the implementation of these English descriptions must be manually hard-coded.

Many libraries are available for the manipulation and validation of SBML models, with libSBML [5] being the most prominent example. LibSBML is an extensive library for manipulating SBML. It is also a powerful constraint validator, reporting any errors in an SBML document by incorporating the constraints found in all three SBML specification documents. The SBML website provides an online XML validator (Note: http://sbml.org/Facilities/Validator/), and the SBML Toolbox [6] is also capable of XML validation via libSBML. The online validator uses only the SBML XSD to check the consistency of a model against the schema. The purpose of the SBML Toolbox is to integrate SBML models with the MATLAB environment, but it does have some validation features that check for the correct combination of components. The SBML Test Suite (Note: http://sbml.org/Software/SBML_Test_Suite) provides a set of test cases meant to be used in software development, and can also be used to compare different software applications.

However, such manual encoding provides scope for misinterpretation of the specification, and has the potential to produce code that accepts or creates non-compliant documents due to silent bugs. In practice, these problems are ameliorated by regular SBML Hackathons (Note: http://sbml.org/Events) and the widespread use of libSBML. However, a formal, machine-readable description of the information contained in the SBML Manual will only become more relevant as the community grows beyond the point where developers and users can be adequately served by face-to-face meetings and informal representations of constraints.

2 Model Format OWL

While OWL provides the vocabulary and structure required to describe entities in a semantically-rich and computationally-expressive way, such usage is not mandated by the language. Simpler ontologies can be created in OWL which do not make full use of its semantics, trading expressivity for faster reasoning. A reduction in complexity does not necessarily mean a concomitant reduction in clarity; irrespective of expressivity level, OWL has a number of features unavailable to OBO such as access to reasoners and complex reasoner-based querying [7]. Additionally, when using fit-for-purpose modelling to create computer science ontologies, a rich ontology is often not needed and would generate a high overhead for little gain (see Section 1.5). Semantically simpler ontologies include those adhering to OWL profiles and those performing direct, syntactic conversion to OWL from a non-semantic format. The three official subsets, or profiles, of OWL are optimised for large numbers of classes (OWL-EL), large numbers of instances (OWL-QL), or rules usage (OWL-RL) [8]. Syntactic conversion of a data format into OWL allows reconciliation of the format in preparation for the application of more advanced data integration methods.

MFO is within the OWL 2 DL profile, but makes use of a number of expressions which prevent it from conforming to any of the other OWL profiles. MFO captures the entirety of the SBML structure and the SBO hierarchy plus a selection of the human-readable structural rules and semantic constraints from the SBML Manual. This method of constraint integration improves the accessibility of the specification for machines. Since MFO was originally published [1], reasoning and inference times have been dramatically reduced, and more aspects of the three main sources of rules and constraints in SBML have been incorporated.

Detailed descriptions of the modelling of these documents are available in Sections 2.3–2.5. These sections use the progression of SBase from XSD complex type to OWL class as a description of how MFO models information from each of the three SBML specification documents.

The mapping between SBML documents and the OWL representation is bi-directional: information can be parsed as OWL individuals from an SBML document, manipulated and studied, and then serialised back out again as SBML. Standard Semantic Web tools can be used to validate and reason over SBML models stored in MFO, checking their conformance and deriving any conclusions that follow from the facts stated in the document, all without manual intervention. In Chapter 5, we use MFO as the basis for a method of automatically improving the annotation of SBML models with rich biological knowledge.

2.1 Comments and annotations in MFO

Comments contained within the SBML XSD are converted to OWL along with the rest of the document. Such comments are identified within the XSD by the xsd:documentation element, and are copied to OWL and identified using the rdfs:comment field. This field is also used to provide attribution for other facts using the phrases below.

  • Structural Element of SBML: The classes with this comment have a direct association to a specific element or attribute of the SBML format, and generally have identical names to the original XML.
  • Structural Element of XML: Classes with this comment have a direct association to a specific part of an XML document. The only OWL class currently containing this statement is XMLNamespace, which holds any namespace information from a given SBML document.
  • Documented Constraint: The comments which begin with this phrase describe documented constraints on their containing classes. They are generally taken from the SBML Manual, and include the version of the specification supplying the constraint.

SBO classes do not contain these comment types because they are neither based on a structural XML element nor drawn from the SBML Manual. However, many MFO classes have more than one of these comments. The most common combination is “Structural Element of SBML” combined with “Documented Constraint”, indicating that a specific SBML component is defined not only in the XSD, but also in the SBML Manual. One example is the MFO class Reaction, which restricts its SBO terms to those found in the occurring entity representation hierarchy. This documented constraint is found in Table 6, Section 5.2.2 of Level 2 Version 3 Release 1 of the SBML Manual, while the XML element itself is defined in the XSD.

2.2 Metrics

Table 1 contains selected metrics for MFO and SBO. Definitions of commonly-used metrics are available in Section 1.5. As an ontology whose primary purpose is as a controlled vocabulary to facilitate the annotation and reuse of systems biology models [3], SBO is high in classes and high in annotations. It is low in instance data and, aside from the standard subsumption property is a, contains only the transitive object property part of. This lack of complex properties and instances, combined with a high number of classes and a high level of annotation, reflects its purpose as a vocabulary.

                               Model Format OWL (MFO)   Systems Biology Ontology (SBO)
Classes                                            88                              583
Equivalent classes                                 13                                0
Subclass axioms                                   264                              633
Disjoint statements                               692                                0
Object properties                                  55                                1
Functional object properties                       35                                0
Data properties                                    44                                0
Functional data properties                         26                                0
Instances                                          35                                0
Annotation axioms                                  63                             1319
Table 1: Selected ontology metrics for MFO and SBO. The first column of data is the metrics for MFO alone, although SBO is a direct import within MFO. A summary of metrics for SBO is provided in the second column. Metrics created using Protégé 4.

In contrast, MFO only needs enough classes to model the various entities present in the SBML XSD and their set of constraints. The number of classes is therefore small in comparison to a vocabulary-based ontology such as SBO. Due to the interconnectedness of the entities and constraints within the SBML specification documents, the total number of class axioms and property types is high relative to the number of classes. Even though MFO is primarily a syntactic ontology and not a model of a biological domain, the syntax of SBML combined with the constraints present in the specification creates an interconnected structure.

2.3 Modelling the SBML XML Schema

SBML is a good candidate for representation in OWL: its large number of rules and constraints is described in detail, and the interlocking nature of those rules makes integrating them worthwhile. The XMLTab plugin [9] for Protégé 3.4 reads in either an XSD or XML file and converts that file into a representation in OWL. More information on the use of the XMLTab plugin is available in Chapter 5. This plugin was applied to the SBML XSD, resulting in the creation of an initial set of classes and relations. This creation step was virtually instantaneous, and did not require human intervention. However, limitations of the plugin required some manual additions. For instance, although the enumeration of allowed UnitSId values is available within the XSD, it is not converted by the XMLTab plugin. Instead, it was manually created as a set of 35 instances (shown in Table 1) of the corresponding UnitSId class within MFO.
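This manual step can be approximated programmatically. The sketch below (stdlib Python, using a hypothetical three-value fragment in place of the real 35-value enumeration) extracts the enumerated values from an XSD simple type; each value would then become one OWL individual of the UnitSId class:

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical three-value fragment; the real UnitSId simple type
# in the SBML XSD enumerates 35 values.
xsd_fragment = """
<xs:simpleType name="UnitSId" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:restriction base="xs:string">
    <xs:enumeration value="second"/>
    <xs:enumeration value="metre"/>
    <xs:enumeration value="mole"/>
  </xs:restriction>
</xs:simpleType>
"""

root = ET.fromstring(xsd_fragment)
values = [e.get("value") for e in root.iter(XS + "enumeration")]

# Each enumerated value becomes one OWL individual of the UnitSId class.
individuals = ["Individual: %s  Types: UnitSId" % v for v in values]
```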

In the creation of MFO, SBML elements became OWL classes and SBML attributes became OWL properties (either data or object properties, as appropriate). When specific SBML models are loaded, their data is stored as instances. An example of the conversion is shown in Figure 1, using a small portion of the SBML SBase complex type. SBase is useful as an example because nearly every SBML type inherits from this complex type.
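As a rough illustration of this mapping, the following Python sketch turns one SBML element into pseudo-Manchester axiom strings. The OBJECT_VALUED set is an assumption made for the example; the real conversion determines whether an attribute is object- or data-valued from its XSD type:

```python
import xml.etree.ElementTree as ET

# Hypothetical: which attributes are object-valued is assumed here;
# the real conversion derives this from the attribute's XSD type.
OBJECT_VALUED = {"compartment", "sboTerm", "units"}

def to_axioms(xml_text):
    """Turn one SBML element into pseudo-Manchester axiom strings."""
    elem = ET.fromstring(xml_text)
    cls = elem.tag[0].upper() + elem.tag[1:]       # element -> OWL class
    axioms = ["Class: " + cls]
    for attr, value in elem.attrib.items():        # attribute -> property
        kind = "ObjectProperty" if attr in OBJECT_VALUED else "DataProperty"
        axioms.append("%s: %s = %s" % (kind, attr, value))
    return axioms

axioms = to_axioms('<species id="DL" name="DesensitisedACh" compartment="comp1"/>')
```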

Figure 1: A portion of the SBML SBase element for Level 2 Version 4, as defined by the XSD. For brevity, the complete complex type has not been shown. Source: http://sbml.org/Special/xml-schemas/sbml-l2v4-schema/sbml.xsd

The XSD provides information on the characteristics of each complex type. Such information includes the optionality of attributes, how often attributes can be used and the typing of attributes (e.g. boolean). While not easy for humans to read, XML is easily parseable by computers. Figure 2 shows how constraints within the XSD for SBase are fully converted into OWL.

Figure 2: A portion of the MFO SBase class when converted directly from the XSD using the XMLTab plugin. Each of the elements and attributes shown in Figure 1 are shown converted into OWL axioms. The Manchester OWL Syntax has been used for readability. For brevity, the complete axiom list has not been shown.

A number of XSD constraints have direct equivalents in OWL. For instance, if an attribute is optional and can only be present once, the max 1 construct is used, as with the sboTerm property:

 sboTerm max 1 Thing

Using an axiom from the Species class as an example, the conversion of required attributes makes use of the exactly construct:

 compartment_id exactly 1 SId

If the attribute can only be filled with a particular type of entity, then that can be specified as well:

 sboTerm only SBOTerm

While the OWL shown in Figure 2 is a direct conversion of the SBML XSD, closer inspection reveals the limitations of XSDs, as very little rule and constraint information is contained within it. For example, the relationship between SBase and SBOTerm cannot be adequately described without SBO itself, and some usages of the SId class within Species cannot be correctly modelled without access to the additional specification information found in the SBML Manual. These and other examples of how MFO is able to model aspects of the SBML specification documents are described in detail in the next two sections.

2.4 Modelling SBO

SBO is a hierarchy of terms developed by the systems biology modelling community to assist compliance with MIRIAM, ensure unambiguous understanding of the meaning of the annotated entities and foster mapping between annotated elements from multiple formats making use of the ontology [10]. By adding SBO terms, modellers take an essential step towards such compliance [3]. Each term in SBO is related to its direct parent with an is a subsumption relationship such as catalyst is a modifier. For a full description of SBO itself and its relationship to other systems biology standards, see Section 1.4.

While SBO is natively stored as a relational database, it can be exported either as OWL or OBO (Note: http://www.ebi.ac.uk/sbo/main/download.do). The OWL export of SBO is regularly downloaded and imported into MFO. The current version of SBO in MFO is the auto-generated OWL export downloaded on 13 December 2011. While the import could directly reference the online version of SBO from the EBI servers, a local copy ensures that MFO works offline and that unexpected changes in SBO do not cause any adverse effects.

Figure 3: The SpeciesPreL2V4 and SpeciesAtAndAfterL2V4 classes whose union is the full necessary and sufficient definition of Species, as shown in Figure 4. These classes allow the reasoner to check that the appropriate SBO terms are used for the appropriate SBML model versions. The Manchester OWL Syntax has been used for readability, and only part of the SpeciesAtAndAfterL2V4 class is shown for brevity.

In SBML, SBase provides an optional sboTerm attribute which creates a link to a specific term in SBO. Most elements in SBML inherit from SBase and therefore have access to the sboTerm attribute. In MFO, this is represented as the object property sboTerm. Figure 3 provides examples of sboTerm usage within MFO. The SBOTerm simple type, which is the type for the sboTerm attribute used within SBase, defines a string pattern restriction of “(SBO:)([0-9]{7})” rather than referencing SBO directly. As such, the XSD constraint is only an approximation rather than a complete representation of the correct values of the sboTerm attribute. MFO improves upon this situation by linking the sboTerm attribute directly to SBO itself. Figure 3 shows how the material entity and physical entity representation classes from SBO are linked directly to the MFO classes SpeciesAtAndAfterL2V4 and SpeciesPreL2V4, respectively.
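The weakness of the XSD pattern can be demonstrated directly. In the stdlib-Python sketch below, any syntactically well-formed identifier passes the pattern check, including identifiers that name no actual SBO term; only a direct link into SBO, as MFO provides, could reject those:

```python
import re

# The XSD's approximation: "SBO:" followed by exactly seven digits.
SBO_TERM_PATTERN = re.compile(r"^SBO:[0-9]{7}$")

def matches_xsd_pattern(term):
    return SBO_TERM_PATTERN.match(term) is not None

# A well-formed but meaningless identifier passes the pattern check.
assert matches_xsd_pattern("SBO:0000247")   # plausible identifier
assert matches_xsd_pattern("SBO:9999999")   # shape is valid, term need not exist
assert not matches_xsd_pattern("SBO:247")   # too few digits
```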

There are both formal and informal restrictions on how SBO terms should be used within an SBML model. The formal restrictions are listed in the SBML Manual and covered in the next section. The informal restrictions are concerned with those entities which inherit from SBase but which do not have any specific rules governing how SBO should be used. For instance, while the ListOf__ elements inherit from SBase, the SBML Manual does not provide the range (if any) of allowed SBO terms for these elements [4], making it difficult to model them correctly in OWL. Therefore in MFO, ListOf__ classes have no restrictions on the use of SBO terms. While this is an exact reproduction of the rules and constraints in the specification, a lack of clarity within the specification leads to possibly incomplete modelling.

2.5 Modelling the SBML Manual

The SBML Manual contains the majority of the rules and constraints to which valid models must conform. For instance, depending on the value of the constant and boundaryCondition attributes of the species element, corresponding speciesReference elements may or may not be products or reactants in reactions [4, Section 4.8.6, Table 5].

A new document is released for each new version of SBML. New versions of the manual may create new rules, invalidate old ones or publish changes to existing rules. For example, an explicit statement of the SBO terms allowed for modifierSpeciesReference was removed in later versions of the SBML Manual. The Level 2 Version 4 Manual states that modifierSpeciesReference elements should have SBO terms from the modifier hierarchy [11, Section 4.13.2]. In Level 3 Version 1 this constraint is no longer explicitly stated, though it is implied [4, Section 4.11.4]. Determining the correct restriction to use in cases such as this becomes impossible without direct consultation with the developers of the specification. Such a lack of precision is a serious pitfall of specification documents which are only human readable.

In addition to the inexact wording possible with an English specification, new versions of the SBML Manual also change which SBO term hierarchies can be used for particular elements [11, 4, Table 6]. For example, as shown in Figure 3, material entity and its children are the only SBO terms that can be used with Species classes at or after Level 2 Version 4. However, Species classes have different constraints on their SBO terms if they are before Level 2 Version 4.

To solve this problem and allow the reasoner access to this logic, the Species class was given two additional children, SpeciesAtAndAfterL2V4 and SpeciesPreL2V4. These child classes are together marked as equivalent to the Species class (Figure 4), and are fully modelled based on the level and version type of the SBML document (Figure 3). These are some of the many constraints placed on the use of SBO terms documented within the SBML Manual, and not captured in either the SBML XSD or the SBO OBO file.
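Procedurally, the case split encoded by these equivalent classes amounts to a comparison on the model's level and version. The sketch below is an informal restatement of that logic, not part of the ontology itself; in MFO the split is carried by the two class definitions and applied by the reasoner when classifying individuals:

```python
# Informal restatement of the reasoner's case split for Species.
def species_subclass(level, version):
    # Tuple comparison orders (level, version) pairs lexicographically.
    if (level, version) >= (2, 4):
        return "SpeciesAtAndAfterL2V4"
    return "SpeciesPreL2V4"
```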

Figure 4: Example constraints added to MFO based on information contained in the SBML Manual. The constraints listed here are in addition to those already added by XMLTab. As some constraints are dependent upon the level and version of the SBML model, subclasses of Species were created (see also Figure 3). Additionally, the constraints present in the SBML Manual allowed the correct linking of properties such as speciesType_id and compartment_id to their owning classes via a restriction on the SId used in those properties.

The SId class models all identifiers internal to the SBML model, and is heavily used in the XML to provide links from one part of the document to another. It and its child class, UnitSId, constrain the type of identifier allowed for entities within SBML [4, Section 3.1.7]. The conversion to OWL from the XSD does create an SId class, but a lack of further information means that the relevant parts of the SBML Manual must also be used to correctly model SIds within MFO. Simple statements using the id object property require no changes to the axioms created by the conversion process:

 id only SId

The statement above, taken from the Species class, states that the Species identifier must be of type SId. This matches the behaviour defined in the SBML Manual, and no further modifications to the OWL are required. However, other cases where SIds are used within the Species class are not completely defined in the XSD. These cases are when a SId is used as the link between Species and another class. For instance, compartment_id is the object property which, ultimately, links a given Species to its containing Compartment, but the XSD does not specify this, so the conversion to OWL results in:

 compartment_id exactly 1 SId

While this statement tells us that compartment_id must be of type SId, it does not let us know that it must connect to an SId belonging to a Compartment. Therefore, the addition of this information from the SBML Manual results in the following modification to the statement (also shown in Figure 4):

 compartment_id only (id_of only Compartment),
 compartment_id exactly 1 Thing

Figure 5: Example constraints added to MFO based on information contained in the SBML Manual. A new class called MiriamAnnotation was created and linked to SBase, allowing the full and complete modelling of the MIRIAM structure. The constraints listed here are in addition to those already added by XMLTab.

On its own, the conversion of SBase into OWL results in the creation of a generic annotation data property which is filled by a string literal (Figure 2). However, a major part of the annotation of an SBML entity is its MIRIAM annotation. Therefore, to align more closely with the SBML Manual and to aid further integration work in this thesis, classes describing MIRIAM annotation in SBML have been added. Figure 5 shows the MiriamAnnotation class and the miriamResource data property in the context of SBase, while Figure 6 shows the MIRIAM qualifiers within MFO. The MIRIAM biological qualifiers are held within the bqb property hierarchy, and the model qualifiers are listed within the bqm hierarchy. MIRIAM-compliant annotation elements are important to rule-based mediation (Chapter 5), as these elements contain information such as cross references and taxonomic information.

Figure 6: The hierarchy of MIRIAM qualifiers in OWL, as shown in Protégé 4. miriamResource is the top-level data property, and the biology qualifier (modelled here as bqb) and model qualifier (modelled here as bqm) hierarchies follow those found in the SBML Manual.

3 Working with MFO

On its own, MFO integrates the rules and constraints present in the three SBML specification documents, making those rules available to machines and presenting them in a single view for humans. Further, by adding data from SBML models themselves, the information stored within the model is exposed to reasoners and semantic querying. This section describes the population and exporting of MFO using the entirety of the BioModels database as an example, and shows the benefits of reasoning over even a relatively simple syntactic ontology such as MFO. Chapter 5 then describes the larger semantic integration project and how MFO can be used in such a situation.

3.1 Creating instance data

Figure 7: The species “DL” from BioModels entry BIOMD0000000001 in the native XML format. Please note that multiple sections of the XML have been removed to aid readability.

In Figure 7, an SBML species is shown in its original XML format, while its corresponding instance data in MFO is shown in Figure 8. Each attribute of Species is filled with the data provided from the XML. Data properties point directly to the value for that property. For example, the name data property points to “DesensitisedACh”. Object properties can be followed to the instances being referenced. For example, a miriamAnnotation object property points to BIOMD0000000001_000012_bqbIsVersionOf_0_0, an instance of the class MiriamAnnotation. This instance has a bqbIsVersionOf data property whose value is “urn:miriam:interpro:IPR002394”. Additionally, the sboTerm attribute points, not to a string literal as described in the XSD, but to an actual SBOTerm, as required by the SBML Manual.

Figure 8: The BIOMD0000000001_DL instance of the MFO Species class from BioModels entry BIOMD0000000001. In addition to BIOMD0000000001_DL itself, a small number of object properties are expanded to show their actual values.

3.2 Populating MFO

When converting data from SBML to MFO, first the XML is parsed and then corresponding OWL is created. Each of these steps is aided by existing libraries coordinated with the MFO Software Toolset. LibSBML 5.0.0 [5] converts the XML structures into plain Java objects. Each Java object is then assigned to a particular MFO entity and saved as OWL using the OWL API version 3.2.3 [12]. After conversion back to SBML, roundtrip comparisons of the XML both before and after processing were performed to ensure that all data is retained.
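The roundtrip check can be sketched in stdlib Python using Canonical XML, which ignores insignificant differences such as attribute order. This is an illustrative stand-in; how the MFO Software Toolset actually performs its comparison is not prescribed here:

```python
import xml.etree.ElementTree as ET

def canonical(xml_text):
    # Canonical XML (C14N) normalises attribute order and
    # insignificant whitespace before comparison.
    return ET.canonicalize(xml_text, strip_text=True)

def roundtrip_ok(original, regenerated):
    return canonical(original) == canonical(regenerated)

before = '<species id="DL" name="DesensitisedACh" compartment="comp1"/>'
after  = '<species compartment="comp1" name="DesensitisedACh" id="DL" />'
```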

The BioModels [13] database release of 15 April 2011 contains 326 curated and 373 non-curated models. Within these two datasets, a total of six models could not be loaded into libSBML at all, and a further nine did not pass the roundtrip test from XML to MFO and back again. There was only a single curated model which failed loading into libSBML: BIOMD0000000326 contains a notes element which is in an incorrect format and was rejected. An additional five entries from the non-curated set failed loading for the same reason (Note: MODEL0911120000, MODEL1011020000, MODEL1012110001, MODEL1150151512, MODEL6655501972). Table 2 shows the number of instances and property axioms created when both the curated and non-curated BioModels datasets were converted into MFO.

                             Curated BioModels   Non-Curated BioModels
Instances                              149,912               1,217,067
Object Property Assertions             246,679               2,225,935
Data Property Assertions               221,738               1,884,250
Table 2: The number of created instances and related axioms when the BioModels database is converted into MFO. Curated and non-curated datasets are shown separately.

As SBML models are standalone and cannot be imported into each other, each BioModels entry is stored in its own file when the database is downloaded. Figure 9 summarises individual creation times for MFO models of all parseable curated and non-curated entries in the BioModels database. The conversion to MFO generally takes less than one second, although non-curated entries take longer, on average, to be converted (Note: All timings in this chapter were performed on a Dell Latitude E6400 computer, unless otherwise stated.). Conversion times were longer for larger SBML models such as the yeast molecular interaction network model [14] present in the non-curated dataset (Note: MODEL3883569319).

When roundtrip testing was performed, two curated models (BIOMD0000000169 and BIOMD0000000195) had errors in their original MIRIAM annotation, resulting in libSBML dropping some of the MIRIAM URIs. These have been confirmed as errors in the SBML models by the BioModels curators (Note: http://sbml.org/Forums/index.php?t=tree&goto=7097&rid=0). A third curated model, BIOMD0000000228, gains a superfluous sbml namespace on most elements when converting from MFO to SBML, causing an error within libSBML. A total of six non-curated models failed the roundtrip process. Four of these models (Note: MODEL2463683119, MODEL4132046015, MODEL1009150002, MODEL1949107276) caused segmentation faults in libSBML's C++ core during the comparison of the original and the newly-converted SBML, and the two largest non-curated models (Note: MODEL3883569319, MODEL6399676120) could not feasibly be compared in memory.

Figure 9: Summary of the conversion times for all MFO models stemming from the complete BioModels database. Each dot corresponds to a single BioModels entry, and the conversion times are plotted on a logarithmic scale. The extreme outlier, MODEL3883569319, is not included in this graph to aid readability (conversion time 2401 seconds).

Although each BioModels entry is stored within the MFO Subversion project in its own file to match how those database entries are provided, there is no reason why the entire database cannot be stored in a single OWL file. Within the MFO Subversion project, OWL files are available for download both singly and as a concatenation of all curated data.

3.3 Reasoning over MFO

Reasoning over an ontology can provide a number of benefits. Inconsistencies in the logic or in the instance data can be identified, and inference of more precise locations for classes and instances can create new hierarchies and pull out otherwise hidden knowledge. By using MFO, constraints can be automatically enforced that currently are only described textually in the SBML Manual or hard-coded into libSBML.

Libraries such as libSBML capture the process of validating constraints, while MFO explicitly captures the constraints themselves, thus identifying where the SBML Manual is not explicit or appears to contradict itself. Examples of this constraint capture and verification were already described in Section 2.5, where a lack of precision in the English-language definitions, as well as changes in the correct annotation for sboTerm attributes, results in logical inconsistencies. These inconsistencies were discovered while MFO was being created, as the axioms for the relevant parts of the specification were being added. However, such errors are not easy to spot manually, nor do they need to be noticed at this stage. If two incompatible branches of the SBO hierarchy were both stated to be the only branches allowed for a particular MFO entity, the reasoner would catch this inconsistency even if the ontologist did not.

The ongoing process of incorporating constraints into MFO will improve the quality of the SBML Manual through identification of these weaknesses. Once captured in MFO, these constraints can be investigated systematically by running a reasoner. This allows many aspects of models to be validated without the need for hard-coded checking libraries. Furthermore, while adding new constraints to a library such as libSBML requires the release of a new version of that library, with MFO it only requires retrieval of the new MFO file; no changes are required to the supporting MFO Toolset.
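The practical difference between hard-coded and declaratively captured constraints can be sketched with a toy table-driven validator in Python. This is purely illustrative: the constraint table below is hypothetical, and MFO's real constraints are OWL axioms, not dictionaries.

```python
# Toy illustration of the declarative approach: constraints live in
# data (here a dict; in MFO, OWL axioms), so tightening them means
# shipping a new data file rather than releasing a new library.
# The constraint table below is entirely hypothetical.

CONSTRAINTS = {
    "species": {"id", "compartment"},
    "reaction": {"id"},
}

def validate(element_type: str, attributes: set) -> list:
    """Return the names of any required attributes that are missing."""
    required = CONSTRAINTS.get(element_type, set())
    return sorted(required - attributes)
```

With a hard-coded library, adding a required attribute means a new release; here it is a one-line change to the data.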

Factors affecting reasoning time are complex, and are not just tied to ontology, ABox or TBox size. Such a discussion is beyond the scope of this thesis, but Motik and colleagues provide a thorough discussion and comparison of current reasoners [15]. Figure 10 shows a plot of the reasoning times for curated BioModels, providing a general overview of how long reasoning takes over MFO models. Although, very broadly, larger models took longer to reason over, Figure 10 shows that size is not a highly useful indicator of reasoning time. Only the 290 models which were both consistent and satisfiable were included in the plot. The vast majority of models in this set took less than one second to reason over. Two extreme outliers took much longer to reason over, most likely due to their sizes, as they are two of the largest models in the curated BioModels dataset (Note: BIOMD0000000235 and BIOMD0000000255).

Figure 10: Reasoning times over curated BioModels modelled in MFO. Each dot corresponds to a single BioModels entry, and the reasoning times are plotted on a logarithmic scale. The reasoning time is plotted against the size of each model. Two extreme outliers (BIOMD0000000235: 924 seconds, BIOMD0000000255: 370 seconds) have not been included in the graph. HermiT [15] version 1.3.3 was used as the reasoner.

As MFO is a syntactic ontology, its simple structure allows fast reasoning, and the benefits gained include SBO validation and the inference of more specific subtypes of classes. Of the 35 inconsistent curated BioModels entries, a sample was examined manually to determine the causes of the inconsistencies. For both BIOMD0000000055 and BIOMD0000000216, the inconsistencies stemmed from errors in the use of SBO terms. In BIOMD0000000055, the SBO term messenger RNA was used within a species. As this model is Level 2 Version 4, and messenger RNA is not part of the material entity hierarchy (see Section 2.5), the use of this term makes the ontology inconsistent. Similarly, in BIOMD0000000216 the incorrect SBO term simple chemical is used within a parameter. Therefore, inconsistencies in models represented in MFO can be traced to errors in the underlying SBML documents.

Reasoners can also place instances in more specific sub-classes than the class in which they were originally asserted. For instance, how a species’ quantity can be changed depends upon the values of its boundaryCondition and constant attributes [4, Table 5]. If both are true, the quantity of the species cannot be altered by the reaction system and therefore its value will not change during a simulation run. MFO models all four possible combinations of these booleans as four distinct sub-types of Species. The placement in the Species hierarchy of a given instance can be inferred using the values of these properties. As an example, if an instance is asserted to be a Species and both its boundaryCondition and constant attributes are true, reasoners will correctly classify this individual as a ConstantBoundarySpecies.
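This inference can be illustrated with a plain-Python sketch rather than OWL. Only ConstantBoundarySpecies is named in the text; the other three sub-type names are hypothetical placeholders for MFO's classes.

```python
# Plain-Python sketch of the Species sub-type inference. Only
# ConstantBoundarySpecies appears in the text; the other three
# names are hypothetical placeholders for MFO's four sub-types.

def classify_species(boundary_condition: bool, constant: bool) -> str:
    """Map the two SBML boolean attributes to one of four sub-types."""
    if boundary_condition and constant:
        # Quantity cannot be altered by the reaction system, so its
        # value will not change during a simulation run.
        return "ConstantBoundarySpecies"
    if boundary_condition:
        return "VariableBoundarySpecies"
    if constant:
        return "ConstantNonBoundarySpecies"
    return "VariableNonBoundarySpecies"
```

In MFO, the same decision is made by the reasoner from the asserted property values, with no procedural code involved.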

4 Discussion

4.1 Similar work

The CellML Ontological Framework [16] has parallels with MFO in that it aims to represent CellML models in OWL. This project was published two years after MFO and uses, among other things, graph reduction algorithms to represent CellML models in a biological “view” which shows the biology of the model while hiding the complexity of the full CellML document. However, the purpose of the CellML Ontological Framework is to aid the readability of CellML models, not to integrate or validate them.

SBML Harvester is a tool similar to MFO, allowing the conversion of SBML models into OWL [17]. Like MFO, multiple SBML documents can be loaded into a single OWL file, reasoned over, and queried. SBML Harvester can be exported into the OWL-EL profile, and incorporates a minimal, custom-made upper-level ontology which itself makes use of the Relation Ontology [18]. Unlike MFO, which models only the SBML specification and not the underlying biology, SBML Harvester incorporates both biological concepts and model concepts into the same ontology, conflating two completely different domains. Further, SBML Harvester is not a complete model of the SBML specification: it currently covers only a limited subset of an SBML document, namely compartments, species, reactions and the model element itself. This means that it is not capable of performing a full round-trip of SBML documents.

In theory, BioPAX could be used to model all SBML documents, as tools are available for converting from one format to the other. However, this solution is not feasible for a number of reasons. Most importantly, BioPAX does not store quantitative information or the logical restrictions and axioms that are otherwise stored only in SBO or the SBML Manual. As such, BioPAX cannot easily be used to perform the tasks possible with MFO. Le Novère stated that current converters result in “pathological SBML or BioPAX, and the round-tripping is near impossible” (Note: https://groups.google.com/d/topic/biopax-discuss/X2wkwvLJEhg/discussion). Few modellers in the target group for MFO use BioPAX natively: their simulators accept SBML, and they develop and test their models in SBML. For those familiar with SBML, MFO is likely to be a much more accessible view of models than BioPAX. Finally, the export of data from the integration project described in Chapter 5 requires the SBML format, which can be exported losslessly with MFO but not with BioPAX.

4.2 Applicability

MFO represents the information required to build and validate an SBML model in a way that is accessible to machines. The three separate sources of this information meet the needs of distinct user groups; while optimal for the existing range of human-driven applications, this disparate arrangement is not suitable for computational approaches to model construction and annotation. This is primarily because much of the knowledge about what constitutes a valid and meaningful SBML model is tied up in a non-machine-readable format. Once available to machines, the model is exposed to reasoners, which can be used to check for internal consistency and correct structure.

MFO is not intended to be a replacement for any of the APIs or software programs available to the SBML community. Rather, MFO is a complementary approach that facilitates the development of the rule-based mediation integration strategy (Chapter 5) for model improvement and manipulation by drawing upon internal consistency and inference capabilities. With MFO, inconsistencies both in the specification documents and in the models themselves can be identified. It addresses the very specific need of a sub-community within SBML that wishes to be able to express their models in OWL for the purpose of reasoning, validation and querying.

The scope of MFO does not extend to mapping other domains or integrating other data sources. These integrative tasks have a logically distinct purpose. MFO can be used on its own or as one of the source ontologies integrated via rule-based mediation. Chapter 5 goes beyond the standalone usage described here to explain how rule-based mediation utilises MFO with other source ontologies to integrate and query systems biology data. In rule-based mediation, MFO has a dual role: firstly, it acts as the format representation for SBML models such as those stored within the BioModels database, and secondly it exports query results as new annotation on SBML documents.


A. L. Lister, M. Pocock, and A. Wipat. Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80+, 2007.
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
Michael Hucka, Frank T. Bergmann, Stefan Hoops, Sarah M. Keating, Sven Sahle, James C. Schaff, Lucian P. Smith, and Darren J. Wilkinson. The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. Nature Precedings, (713), October 2010.
Benjamin J. Bornstein, Sarah M. Keating, Akiya Jouraku, and Michael Hucka. LibSBML: an API Library for SBML. Bioinformatics, 24(6):880–881, March 2008.
Sarah M. Keating, Benjamin J. Bornstein, Andrew Finney, and Michael Hucka. SBMLToolbox: an SBML toolbox for MATLAB users. Bioinformatics (Oxford, England), 22(10):1275–1277, May 2006.
Mikel Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and Robert Stevens. Understanding and using the meaning of statements in a bio-ontology: recasting the Gene Ontology in OWL. BMC Bioinformatics, 8(1):57+, February 2007.
Diego Calvanese, Jeremy Carroll, Giuseppe De Giacomo, Jim Hendler, Ivan Herman, Bijan Parsia, Peter F. Patel-Schneider, Alan Ruttenberg, Uli Sattler, and Michael Schneider. OWL2 Web Ontology Language Profiles, October 2009.
Michael Sintek. XML Tab. http://protegewiki.stanford.edu/wiki/XML_Tab, June 2008.
Melanie Courtot, Nick Juty, Christian Knüpfer, Dagmar Waltemath, Anna Zhukova, Andreas Dräger, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machné, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villéger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
Michael Hucka, Stefan Hoops, Sarah M. Keating, Nicolas Le Novère, Sven Sahle, and Darren Wilkinson. Systems Biology Markup Language (SBML) Level 2: Structures and Facilities for Model Definitions. Nature Precedings, (713), December 2008.
Matthew Horridge, Sean Bechhofer, and Olaf Noppens. Igniting the OWL 1.1 Touch Paper: The OWL API. In Proceedings of OWLEd 2007: Third International Workshop on OWL Experiences and Directions, 2007.
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novère, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
Tommi Aho, Henrikki Almusa, Jukka Matilainen, Antti Larjo, Pekka Ruusuvuori, Kaisa-Leena Aho, Thomas Wilhelm, Harri Lähdesmäki, Andreas Beyer, Manu Harju, Sharif Chowdhury, Kalle Leinonen, Christophe Roos, and Olli Yli-Harja. Reconstruction and Validation of RefRec: A Global Model for the Yeast Molecular Interaction Network. PLoS ONE, 5(5):e10662+, May 2010.
Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau Reasoning for Description Logics. Journal of Artificial Intelligence Research, 36:165–228, 2009.
S. M. Wimalaratne, M. D. B. Halstead, C. M. Lloyd, E. J. Crampin, and P. F. Nielsen. Biophysical annotation and representation of CellML models. Bioinformatics, 25(17):2263–2270, September 2009.
Robert Hoehndorf, Michel Dumontier, John Gennari, Sarala Wimalaratne, Bernard de Bono, Daniel Cook, and Georgios Gkoutos. Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5(1):124+, August 2011.
Barry Smith, Werner Ceusters, Bert Klagges, Jacob Kohler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan Rector, and Cornelius Rosse. Relations in biomedical ontologies. Genome Biology, 6(5):R46+, 2005.

Chapter 3: Saint (Thesis)


Saint: a Web application for model annotation utilising syntactic integration


Quantitative modelling is at the heart of systems biology. Model description languages such as SBML [1] and CellML [2] allow the relationships between biological entities to be captured and the dynamics of these interactions to be described mathematically. Without biological context, these dynamic models are easily understandable only by their creators. Interested third parties must rely on extra documentation, such as publications or possibly-incomplete descriptions of species and reactions included in an element’s name or identifier, to determine the relevance of a model to their work. Many dynamic models include only the mathematical information required to run simulations, and do not explicitly contain the full biological context. However, the efficient exchange, reuse and integration of models is aided by the presence of biological information in the model. If biological information about a model is added consistently and thoroughly, the model becomes useful not just for simulation, but also as an input to other computational tasks and as a reference for researchers.

Saint, an automated annotation integration environment, aids the modeller and reduces development time by providing extra information about a model in an easy-to-use interface [3]. Saint is a lightweight Web application which enables modellers to rapidly mark up models with biological information derived from a range of data sources. It supports the addition of basic annotation to SBML and CellML entities and identifies new reactions which may be valuable for extending a model. While the addition of biological annotation does not modify the behaviour of the model, the incorporation of new reactions or species adds new features that can later be built upon to potentially change the model’s output. Saint is a syntactic integration solution to the annotation of systems biology models, developed both to explore the data sources relevant to modellers and to provide a practical examination of data integration for the purpose of model annotation.

Section 1 describes the importance of linking biology to models, and why the syntax of those links is vital to their usefulness. Section 2 describes the technology used to implement Saint, while Section 3 focuses on how Saint retrieves model annotation and how it can be used to pick and choose the annotation to keep. Section 4 provides practical examples of annotating models using Saint. Finally, Section 5 describes the future plans for Saint, the collaborations that have been developed, and how Saint can be used with rule-based mediation (see Chapter 5).

Availability

Saint is freely available for use on the Web at http://saint.ncl.ac.uk. The Web application is implemented in GWT, with all major browsers supported. Saint has a Sourceforge.net project website (Note: http://saint-annotate.sourceforge.net) and wiki (Note: http://sourceforge.net/apps/mediawiki/saint-annotate/index.php?title=Main_Page). There is also a Newcastle University project page (Note: http://cisban-silico.cs.ncl.ac.uk/saint.html). The Sourceforge.net project website provides access to the Subversion repository (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/) as well as various documents for programmers. The Newcastle University website and the Sourceforge.net wiki are both user-oriented, and contain general information about the project, associated publications, screenshots and instructions for use.

Saint is licensed by Allyson Lister, Morgan Taschuk and Newcastle University under the LGPL (Note: http://www.gnu.org/copyleft/lesser.html). For more information, see LICENSE.txt in the Saint Subversion project. Installing Saint on a Web server requires an installation of LibSBML [4] and the CellML API [5]. Installation and running of Saint have been tested on 32-bit Ubuntu 8.10 and higher.

1 Linking biology to models

Figure 1: Diagram of glycolysis using the SBGN Process Description language. Taken from http://www.sbgn.org/Documents/PD_L1_Examples.

Figure 1 is one of a set of examples provided by SBGN to demonstrate the standard. Specifically, it is a diagrammatic representation of glycolysis. There are two layers to systems biology models such as this one: the mathematical model, where a term like “hexokinase” is the name of a variable in a model to be simulated; and the real, underlying biological system, where hexokinase is a molecule inside a biological compartment which physically interacts with other entities. The mathematical model is self-contained, and will work perfectly without information regarding hexokinase’s biological importance. However, mathematical models are created to aid the understanding of biology; therefore, links to the biology of the model should be created and maintained. In systems biology models, these links are added via annotations.

Model annotations are necessary to describe how the model has been generated and to define the meaning of the components that make up the model in a computationally-accessible fashion. The addition of biological annotations to a model is usually a manual, time-consuming process; there is no single resource encompassing all suitable data sources. A modeller usually has to visit many websites, applications and interfaces in order to identify relevant information, and may not be aware of all potentially useful databases. Because modellers add information manually, it is very difficult to annotate exhaustively.

However, without computationally-amenable biological annotations, it is difficult to create software tools that determine the biological pathway or reaction being modelled. The manual addition of such semantic information to non-curated models is a complex process. For instance, the advice for the curation of new BioModels database entries includes checks that all entities are what they say they are (e.g. that a species is a species and not a parameter), the creation of meaningful names for SBML entities, and the association of the model with at least one publication (Note: http://biomodels.caltech.edu/curationtips). For many submitted models, the quality of the biological annotation is low, making it difficult to programmatically determine which biological pathway or pathways a model describes.

While annotations are vital for model sharing and reuse, they do not contribute to the mathematical content of a model and are not critical to its successful functioning. The addition of biological knowledge must therefore be quick and easy in order to make annotation worthwhile to a modeller. A large number of tools (Note: http://sbml.org/SBML_Software_Guide) are available for the construction, manipulation and simulation of models, but there are currently few tools that facilitate rapid and systematic model annotation. While websites and applications specialising in integrating disparate data sources do exist, such as BioMart [6] and Pathway Commons, few are designed to put information directly into a model.

SBML and CellML can store both the mathematical model and the links to the biological information necessary to interpret the model. The difficulty lies in retrieving appropriate biological information quickly and with a minimal amount of effort. In the glycolysis example, UniProtKB [7] states that hexokinase is a protein with an enzymatic role in glycolysis (Note: http://www.uniprot.org/uniprot/P19367). STRING [8] and Pathway Commons provide a list of interactions and pathways in which hexokinase is involved. MIRIAM, SBO [9] and GO [10] provide Saint with the vocabulary needed to label hexokinase and describe its role in the model.

MIRIAM annotations use vocabularies and ontologies to document the biological knowledge in a systems biology model. In SBML, biological annotation is added to annotation elements according to the MIRIAM specification [11]. MIRIAM annotations link external resources such as ontologies and data sources to the model in a standardised way. MIRIAM also defines an annotation scheme, accessible via Web services, which specifies the format and set of standard data types which should be used for these URIs [12]. The use of the MIRIAM format provides a standard structure for explicit links between the mathematical and biological aspects of an SBML model.
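The standardised link structure can be illustrated with a small Python sketch. A MIRIAM URN combines a registered data-type collection with an identifier in that collection; the urn:miriam form follows MIRIAM's published URN scheme, though real models may also contain older URI variants, so treat the exact strings as indicative.

```python
# Sketch of the standardised MIRIAM link structure: a URN combining a
# registered data-type collection with an identifier from it. The
# urn:miriam form follows MIRIAM's URN scheme; older URI variants
# also exist in published models.

def miriam_urn(collection: str, identifier: str) -> str:
    """Compose a MIRIAM URN for a data-type collection and identifier."""
    return f"urn:miriam:{collection}:{identifier}"

# e.g. the hexokinase UniProtKB entry cited in the glycolysis example:
hexokinase_urn = miriam_urn("uniprot", "P19367")
```

Because the collection names are centrally registered, any tool can resolve the same URN to the same external resource.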

2 Implementation

2.1 Integration methodology

Saint makes it easy to find information related to the domain of interest and save that information back to SBML. The biological annotation of models is facilitated by using query translation to present an integrated view of data sources and suggested ontological terms. As described in Section 1.6, query translation uses a mediator to distribute a single query across known data sources. When the results are retrieved, they are formatted according to the set of global Saint objects and sent to the user interface.

Within Saint, the query for each species is translated into a set of queries over Web services provided by each data source. The results are imported into global objects, which can then be further manipulated before presentation: querying and integration occur together. This method is suitable for a single-purpose application that runs pre-specified queries and then stores the query results generically. The combined query results are then displayed in the Web browser.

Query translation removes the need for a local database of information, and allows the easy addition of new data sources. However, changes to the source Web services or data structures require equivalent changes in the Saint query system. As changes to services are relatively infrequent and the size of the available datasets very large, query translation is the best option for this type of integration.
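The query-translation pattern described above can be sketched as follows. The adapter functions and result fields are hypothetical stubs; the real versions call remote Web services such as UniProtKB and STRING.

```python
# Sketch of query translation: one query per species is fanned out to
# per-source adapters, and each result is normalised into a shared
# "global object" shape before display. The adapters here are stubs;
# the real versions call remote Web services.

def query_uniprot(name: str) -> list:
    # Stub standing in for a UniProtKB Web-service call.
    return [{"source": "UniProtKB", "term": name, "kind": "protein"}]

def query_string_db(name: str) -> list:
    # Stub standing in for a STRING interaction query.
    return [{"source": "STRING", "term": name, "kind": "interaction"}]

ADAPTERS = [query_uniprot, query_string_db]

def annotate(species_name: str) -> list:
    """Distribute a single query across all known data sources."""
    results = []
    for adapter in ADAPTERS:
        results.extend(adapter(species_name))
    return results
```

Adding a new data source is then a matter of registering one more adapter, which is why no local database is needed.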

Saint performs syntactic integration by matching data to a particular identifier, name or MIRIAM URI through syntactic equivalence between the query term and entries in the external data source. No semantic integration occurs in the standard version of Saint, although Section 5 discusses how Saint can be connected to the rule-based mediation knowledgebase, thus adding a semantically-aware data source.

2.2 Saint back-end

On the client side, Saint is a simple Web application which can be used immediately, without creating an account or requesting access. Behind the user interface, Saint is implemented in GWT (Note: http://code.google.com/webtoolkit/) and hosted on a Tomcat (Note: http://tomcat.apache.org/) server. For more information on GWT, see Chapter 2. GWT was chosen for a number of reasons:

  • Asynchronous data transfer. Remote procedure calls can be used to query the server-side of the GWT application. As soon as the query is dispatched, control is handed back to the client. A notification is sent back to the client once the query to the server completes. This allows for multiple calls to be performed simultaneously and asynchronously. For instance, new annotation can be retrieved for many SBML species at the same time.
  • Abstraction of the details of creating a user interface. Although the programmer writes code describing the look and feel of the panels and forms, the specific details of the user interface are hidden. The default skin of GWT Web applications also creates a clean and lightweight feel with very little cost in development time.
  • Speed of development. GWT only requires a knowledge of standard packages and libraries in Java, so there is a low initial overhead for creating a working Web application. The majority of bioinformaticians are not experts in the creation of user interfaces, and by using GWT more time can be spent on functionality rather than the user interface.

3 Data retrieval and model annotation

This section describes both the structure of Saint and its annotation methodology, using the curated BioModels entry BIOMD0000000087 [13] by Proctor and colleagues as an example, hereafter referred to as the telomere model. This model describes telomere uncapping in S. cerevisiae, a process implicated in ageing.

Figure 2: An overview of how Saint queries remote data sources and retrieves new annotation. The image of the database cylinders is LGPL (Image source: http://commons.wikimedia.org/wiki/File:Crystal_Clear_app_database.png). The image of the user is licensed under the CC-BY-SA 2.5 (Image source: http://commons.wikimedia.org/wiki/File:User_icon_2.svg).

Figure 2 shows an overview of the Saint integration infrastructure. As soon as the user session begins, the full complement of official MIRIAM URIs is downloaded via EBI Web services. These official URIs are used within MIRIAM-compliant models to provide cross-references to many different data sources. By retrieving the URIs for each data source at the beginning of a session, Saint remains up-to-date with respect to any changes made by the MIRIAM developers.

A model is uploaded by copying either a CellML or SBML model into the input box and pressing the “Load Model” button. Saint then validates the model with LibSBML for SBML documents or the CellML API for CellML documents. If the model is valid, the SBML species or CellML variables are retrieved and displayed, together with any MIRIAM annotation already present. Obsolete MIRIAM annotations are updated, and Saint then uses a table to display the elements available for annotation (see Figure 3).

It is possible to restrict which types of annotation are retrieved by Saint. The “Advanced Annotation Options” shown at the bottom of Figure 3 can be disabled as necessary prior to starting the annotation process. For instance, a modeller who has a skeleton model might be interested in adding all types of annotation, while one who has already defined the reactions of interest might want to deliberately exclude the addition of new reactions. A model that has already been submitted to a database such as BioModels will already have annotation for each species. However, a modeller might want to extend such a model, and therefore might only be interested in new reactions. Additionally, modellers might not add GO annotation to a species if the species is already annotated with a UniProtKB accession, as that information can be found in the UniProtKB entry itself. However, if the curated SBO term is quite high level, providing a GO term with a more specific description could be helpful.

Figure 3: SBML species and CellML variables are loaded into a tabular structure, allowing the user to pick entities of interest and delete irrelevant entities. This screenshot shows how the telomere model is displayed.

Both SBML and CellML models can be read by Saint, although the CellML export is not yet implemented as the method used to store MIRIAM annotation within CellML has not yet been finalised. The user can then remove any items they are not interested in annotating. For instance, some terms such as “sink” are modelling artefacts and do not correspond to genes or proteins. Therefore, the user would normally wish to delete this species from the search space to prevent any possible matches with actual biological species of a similar name. Deletion of a row in the table is accomplished by clicking on the “Hide Row” button for the appropriate row. Figure 3 shows a portion of the main table in Saint after the telomere model has been loaded.

Each of the species from the telomere model is added as a new row to the table, although the screenshot does not show the entire list. Those that already contain annotation will have a collapsible list allowing the user to browse the pre-existing annotation. For instance, Figure 4 shows how the Ctelo species has two items of evidence, the GO terms chromosome, telomeric region (GO:0000781) and telomere cap complex (GO:0000782). At this stage, the only tab visible within this row is the “Known” tab containing the pre-existing items of evidence.

Figure 4: The expanded display of the Ctelo species row in Saint, showing both the id of the species and its pre-existing MIRIAM annotation. This species corresponds to the first line of the table shown in Figure 3.

Once the user is satisfied with the list of items to be annotated, the model is submitted using the “Annotate” button at the bottom of the table. Pressing “Clear” will instead unload the current model and return the user to the Saint homepage. As annotation queries complete, summaries are added to the main table. Each row of the table is submitted for annotation separately, which allows the query response to be sent back to the user as soon as it is received, even if other rows have not yet returned information.
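The per-row asynchronous behaviour can be mimicked in Python with a thread pool; this is an illustrative stand-in, since the real implementation uses GWT remote procedure calls, and the fetch function here is a stub.

```python
# Illustrative stand-in for the per-row asynchronous submission (the
# real implementation uses GWT remote procedure calls): each row is a
# separate task, and results are handled in completion order, so one
# slow query does not block the others.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_annotation(species: str) -> tuple:
    # Stub standing in for a remote annotation query.
    return (species, f"annotation for {species}")

def annotate_rows(species_names: list) -> dict:
    """Submit each row separately; record results as queries finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch_annotation, s) for s in species_names]
        for future in as_completed(futures):
            name, annotation = future.result()
            results[name] = annotation  # the corresponding UI row updates here
    return results
```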

In the example, the focus is on the annotation of Ctelo, Cdc13 and Mec1. After all other species are hidden from view, the “Annotate” button is pressed. Once annotation is added, a status message appears in the second column of the table, and an “Inferred” tab is added next to “Known” (see Figure 5). Clicking on the Inferred tab allows the user to view and modify the newly-suggested annotation. When hovering over a new item of annotation with the mouse, the evidence for that piece of annotation will appear. Information on each of the data sources where the annotation was found is displayed. The results can be customised by the user, who can examine the evidence and then discard any individual piece of inappropriate annotation. The new model is kept in sync with the choices being made, and can be exported at any time.

Figure 5: How Saint looks immediately after querying for information on the chosen species. No annotation is found for Ctelo, but annotation has been identified for the other two species. If new annotation is provided by Saint, a new “Inferred” tab containing the suggestions is made visible.

In systems biology models, separate species are often used to describe the various states of a single biological entity. For instance, in the telomere model the enzyme Exo1 is represented by Exo1A for the active state and Exo1I for the inactive. For the modeller, it is useful to name multiple species in this way to allow for differing amounts of protein present in each state. Such naming schemes cause difficulties for Saint because the provided name or identifier does not match the canonical name.

Naming ambiguities have been partially solved by parsing commonly-used systematic naming schemes into more likely query candidates. Systematic naming schemes are conventions used by researchers for naming species in alternative states. If no results are retrieved for the original name or identifier, Saint removes any appended “_I”, “_A”, “active” or “inactive” literals and tries the query again. In some cases, species names or identifiers are created as a hybrid of two protein names, e.g. “proteinA_proteinB”. In such cases, Saint suggests annotation based around both proteins. Having a name attribute as well as the id attribute helps, as generally the name used by modellers is a more precise identification of the biological entity.
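The fallback parsing described above can be sketched as follows; the exact suffix set and matching order inside Saint are assumptions reconstructed from the text:

```python
import re

# Suffix literals stripped when the original query returns nothing;
# reconstructed from the text, the real list may differ.
STATE_SUFFIX = re.compile(r'(_?[IA]|_?(in)?active)$')

def query_candidates(identifier):
    """Generate fallback query terms for a species identifier."""
    candidates = [identifier]
    stripped = STATE_SUFFIX.sub('', identifier)
    if stripped and stripped != identifier:
        candidates.append(stripped)
    # Hybrid ids such as "proteinA_proteinB": query both proteins.
    if '_' in stripped:
        candidates.extend(part for part in stripped.split('_') if part)
    return candidates

# query_candidates("Exo1I")    -> ["Exo1I", "Exo1"]
# query_candidates("Mdm2_p53") -> ["Mdm2_p53", "Mdm2", "p53"]
```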

Figure 6: A selected subset of the annotation suggested by Saint for Cdc13. SBO and GO terms are available, and as the name attribute was missing, a variety of alternative names are listed. Finally, interactions are available to select and will be added as skeleton reactions to the new model.

As shown in Figure 5, in the telomere model example Saint found new annotation for Cdc13 and Mec1 but not for Ctelo. Figures 6 and 7 show a selected subset of the suggested annotation for Cdc13 and Mec1, respectively. Ctelo is the modeller’s private name for the capped state of a telomere, just as Utelo represents the uncapped telomere. As such, no hits are found in the source data. However, both Cdc13 and Mec1 are correctly identified, and the new annotation is displayed to the user. The first section of the results shows which ontology terms are suggested for the species. For SBO, a suggested term is provided as the default value in a pull-down list, allowing the user to either accept, change or delete the suggestion. Saint suggests the SBO term ‘macromolecule’ (SBO:0000245) as the best SBO match to a protein. This term was suggested both because ‘cdc13’ was found within UniProtKB and because the interaction set identified the species as a protein.

Because multiple MIRIAM annotations such as GO and UniProtKB cross-references are allowed within the annotation element, each annotation has its own row and may be individually removed. Each of these rows also shows the relationships available to link the annotation to the model. While SBO terms are added directly to SBML elements, GO terms and UniProtKB accessions are added to the MIRIAM annotation section.
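The per-row structure can be illustrated with a minimal sketch of how qualifier/cross-reference pairs map onto MIRIAM URNs (the datatype tokens and tuple representation are illustrative assumptions, not Saint's internal data model):

```python
MIRIAM_TEMPLATE = "urn:miriam:{datatype}:{accession}"

def miriam_rows(cross_refs):
    """Turn (qualifier, datatype, accession) triples into the
    individually removable rows shown in the interface; SBO terms
    are excluded because they live on the sboTerm attribute."""
    return [(qualifier, MIRIAM_TEMPLATE.format(datatype=d, accession=a))
            for qualifier, d, a in cross_refs]

rows = miriam_rows([("is", "uniprot", "P04637"),
                    ("isVersionOf", "obo.go", "GO:0000781")])
# rows[0] == ("is", "urn:miriam:uniprot:P04637")
```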

As well as annotations, extensions to the model are also suggested. Binary interactions retrieved from both Pathway Commons and STRING can be made into basic SBML reactions. Also, alternative names for the species or variables are presented to the user by parsing the description lines of all retrieved UniProtKB entries. For each row, the depth of the green background depends on the number of sources of evidence: the more sources, the darker the green. Reactions and associated species are added directly to the SBML model, while the majority of the other annotation is added as MIRIAM-compliant URIs.

Once the new annotation has been approved by the user, the model can be saved as a new SBML document. Clicking on the “Get Annotated Model Code” tab in Saint adds all approved annotation and presents the new SBML model for viewing and download.

Figure 7: A selected subset of the annotation suggested by Saint for Mec1. SBO and GO terms are available, and as the name attribute was missing, a variety of alternative names are listed. Finally, interactions are available to select and will be added as skeleton reactions to the new model.

4 Comparing Saint-annotated models with pre-existing annotation

Saint uses the name and identifier of an element as well as the existing resource URIs as starting points for adding annotation to models. Saint can provide new MIRIAM-compliant annotation with even a minimal amount of data, making it useful both for early-stage and already-annotated models. This section compares two Saint-annotated models against their original models as well as against stripped variants of those original models. In the first example, Saint annotates versions of the original models stripped of all biological annotation. Each stripped model is then compared with its original to assess the performance of Saint when presented with minimal information. In the second example, Saint annotates the original models. This comparison determines the extra annotation possible when Saint is provided with all available information.

4.1 Initial set-up of the comparisons

The models used for these comparisons are the telomere model as defined in Section 3, BIOMD0000000189 [14] and BIOMD0000000032 [15]. The telomere model describes telomere uncapping in S. cerevisiae. Proctor and Gray created BIOMD0000000189, hereafter referred to as the p53-Mdm2 model, to model oscillations and variability in the p53-Mdm2 system in humans. Finally, in BIOMD0000000032 Kofahl and Klipp model the dynamics of the pheromone pathway in S. cerevisiae. For readability, BIOMD0000000032 is hereafter referred to as the pheromone model.

As well as making use of the intact, original version of each model, stripped versions were created by removing their biological annotation. Specifically, after each model is downloaded from BioModels, the sboTerm and the annotation element of each species is removed. Even though SBO terms, species names and annotation elements are added by Saint in these comparisons, names and identifiers were deliberately not removed: without these fields, Saint would lose those strings as query terms. Generally, modellers use either descriptive id attributes without names, or descriptive name attributes together with an arbitrary identifier scheme. Though all URIs within the annotation elements are stripped from the model, only a proportion of those URIs are currently supported by Saint, and therefore only those supported types can be added back. These stripped models are then used as a starting point for Saint annotation for the first set of comparisons.
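The stripping step can be sketched with Python's standard ElementTree; this is an illustrative reconstruction for SBML Level 2, not the code actually used to prepare the comparison models:

```python
import xml.etree.ElementTree as ET

SBML_L2_NS = "http://www.sbml.org/sbml/level2"

def strip_species_annotation(sbml_text):
    """Remove the sboTerm attribute and the annotation child of
    every species, leaving ids and names intact as query terms."""
    ET.register_namespace("", SBML_L2_NS)
    root = ET.fromstring(sbml_text)
    for species in root.iter(f"{{{SBML_L2_NS}}}species"):
        species.attrib.pop("sboTerm", None)
        for ann in species.findall(f"{{{SBML_L2_NS}}}annotation"):
            species.remove(ann)
    return ET.tostring(root, encoding="unicode")
```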

The reactions added by Saint are new and as such cannot be usefully compared against the original models. This functionality of Saint is therefore not compared. Additionally, a species which describes a composite or complex formed from multiple proteins will not necessarily behave in the same way as its constituents. As such, only the SBO terms and UniProtKB cross-references suggested by Saint are retained for comparison in those instances. An expert modeller would be able to correctly choose the appropriate name and GO terms, but for the purposes of this example they were simply removed from consideration.

To perform the comparison, the stripped models are loaded into Saint and the new annotation is retrieved. After the new annotation is displayed, basic editing of the annotation is performed, including the choosing of sensible species names from the list and the removal of obvious false positive annotation. These choices mimic the expected behaviour of the modeller using the application. Then the newly-annotated model is downloaded and compared to the original, intact model.

Two phrases are used throughout these comparisons: useful species and Saint-supported URIs. Species with informative id or name attributes in the original model are termed useful species. Informative identifiers and names closely match actual protein or gene names, and are therefore suitable for querying external resources. Where an explicit statement must be made as to whether the identifier or the name is useful, the terms useful identifier or useful name are used. There is a limit to how much information Saint can pull from non-informative identifiers and names. In these cases, it is important to use any MIRIAM information already stored in the annotation elements. This style of annotation is covered in Section 4.3. Saint-supported URIs are MIRIAM URIs, such as those for GO and UniProtKB, which are parsed by Saint as well as added during the annotation process. Saint-supported URIs are therefore, by definition, those which can be compared both before and after new annotation has been added.
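The Saint-supported distinction can be expressed as a small predicate; the URN prefixes for UniProtKB and GO below are assumptions for illustration, and the real supported list may be longer:

```python
# MIRIAM URN prefixes treated as Saint-supported in the comparisons
# (illustrative; GO and UniProtKB are the datatypes named in the text).
SUPPORTED_PREFIXES = ("urn:miriam:uniprot:", "urn:miriam:obo.go:")

def saint_supported(uri):
    """True if the URI belongs to a datatype Saint can both parse
    from a model and add back during annotation."""
    return uri.startswith(SUPPORTED_PREFIXES)

def supported_uris(uris):
    """The comparable subset of a species' URIs."""
    return [u for u in uris if saint_supported(u)]
```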

All original, stripped and re-annotated versions of the models used in this example are available from the Saint Subversion repository (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/trunk/docs/supplementary-data).

4.2 Annotating stripped models

The first comparison illustrates the benefits of searching based on the id or name attributes of an SBML species element. The stripped versions of the telomere model and the p53-Mdm2 model, both of which contained a number of useful species, were annotated by Saint and then compared against their originals. Even within a single model, some names and identifiers can be useful to Saint while others are not.

Annotation of the stripped telomere model

Basic information on the annotation of both the original and re-annotated versions of the telomere model is available in Table 1, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 2.

Species 55
Species with any annotation† 51/55
Useful species 15
Species with names 0
Species with SBO terms 0
Saint-supported URIs in useful species 14/15
Species with names after re-annotation 14
Species with SBO terms after re-annotation 15
URIs after re-annotation 163
URIs recovered from useful species 14/14
Table 1: Basic information on the annotation present in the original version of the telomere model and in the re-annotated version. This model had no species name attributes, and therefore Saint relies on the id attribute. There were also no SBO terms in the original model. Although there are 15 useful species, only 14 contain either a Saint-supported URI, a name or an SBO term; the fifteenth (the “RPA” species) has a useful identifier but contains only a PIRSF URI unsupported by Saint. †Any annotation refers to MIRIAM URIs, names and SBO terms.

The total number of Saint-supported URIs has increased from 14 to 163, with further addition or subtraction of URIs possible if required by the modeller. In addition, Saint recovered all 14 of the URIs from the useful species. The original model contained no SBO terms or species names, and Saint has added 15 SBO Terms and 14 species names, which is almost complete coverage of the useful species.

The species “ssDNA” is an example of incorrect annotation by Saint due to a lack of available information. “ssDNA” is difficult for Saint to filter: it is not a protein, but can be used as a phrase within protein names. Currently, the Saint user must spot these errors and remove the suggested annotation using the Saint interface prior to downloading the model. The one useful species which was not completely annotated with SBO term, species name and URIs was “RPA”. A large amount of annotation was suggested by Saint, but all except the SBO term was manually rejected. In the original annotation, the “RPA” annotation is a single URI which points to the PIR family (PIRSF) “replication protein A, RPA70 subunit” (Note: http://pir.georgetown.edu/cgi-bin/ipcSF?id=PIRSF002091). This URI is marked as having an “isVersionOf” relationship to the RPA species, rather than as an exact “is” match, as the PIRSF URI is linked to mouse, and not yeast. A search for an equivalent for yeast in UniProtKB shows a number of yeast subunits. In theory, an appropriate choice would be made by the modeller. In PIRSF002091, the equivalent yeast UniProtKB accession is listed as P22336, which is “Replication factor A protein 1”, a.k.a. “RFA1”. This leads to the conclusion that the species in the model was named after the mouse protein and not the yeast protein. As “RFA” is not a name that matches the original id (“RPA”), there is no way to retrieve the correct information from the data sources.

In conclusion, in comparing the annotation on the useful species for this model, all Saint-supported URIs were recovered, and new URIs, SBO terms and species names were added. A detailed listing of all new and recovered annotation present in the useful species is shown in Table 2.

Cdc13 SBO:0000245
Rad17 SBO:0000245
Rad24 SBO:0000245
RPA SBO:0000245
Mec1 SBO:0000245
Exo1I SBO:0000245
Exo1A SBO:0000245
Rad9I SBO:0000245
Rad9A SBO:0000245
Rad53I SBO:0000245
Rad53A SBO:0000245
Chk1I SBO:0000245
Chk1A SBO:0000245
Dun1I SBO:0000245
Dun1A SBO:0000245
Table 2: A detailed breakdown of recovered (in the “MATCH” column) and new (in the “NEW” column) annotation by Saint for the telomere model. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms added as MIRIAM URIs are omitted for brevity.

Annotation of the stripped p53-Mdm2 model

Basic information on the annotation of both the original and re-annotated p53-Mdm2 model is presented in Table 3, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 4. The total number of Saint-supported URIs has increased from 7 to 89, and of the seven Saint-supported URIs present in the original useful species, five were recovered. The other two URIs were from related species which were incorrectly identified, and are described below. The 84 remaining URIs added to the model are GO URIs. The original model contained only two SBO terms, both of which were recovered. Additionally, two new SBO terms were added. Finally, there were no species names in the original model, and two new names were added with Saint.

In contrast to the telomere model, the p53-Mdm2 model contains species whose id attribute was created as a mixture of two protein names, e.g. “Mdm2_p53”. Of the six useful species, four have identifiers formed in this way. For each of these four, Saint suggested annotation based around both of their associated proteins. However, as a composite of two proteins will not necessarily behave in the same way as its constituents, all annotation except for SBO terms and UniProtKB URIs was manually removed. An expert modeller would be able to correctly choose the appropriate name and GO terms. The UniProtKB URIs were kept, but with a relationship of hasPart.

Two species, “ARF” and “ARF_Mdm2”, were incorrectly identified. In the original model, both species have an exact “is” match with UniProtKB Q8N726, whose names and synonyms are “Cyclin-dependent kinase inhibitor 2A, isoform 4”, “p14ARF” and “p19ARF”. The annotation procedure by Saint only had the id to use, however, as the MIRIAM URIs were removed in the stripped models. As such, the correct entry could not be returned due to the mismatch of the query term with the UniProtKB names. However, a number of alternative suggestions were provided by Saint.

Species 18
Species with any annotation† 13
Useful species 6
Species with names 0
Species with SBO terms 2
Saint-supported URIs in useful species 7
Species with names after re-annotation 2
SBO terms after re-annotation 6
SBO terms recovered from useful species 2/2
URIs after re-annotation 89
URIs recovered from useful species 5/7
Table 3: Basic information on the annotation present in the original version of the p53-Mdm2 model and in the re-annotated version. This model has no species name attributes, and therefore Saint relies on the id attribute. The two unrecoverable URIs are for the “ARF” species and are described in the main text. †Any annotation refers to MIRIAM URIs, names and SBO terms.

Old New
E3 ubiquitin-protein ligase Mdm2
Mdm2 SBO:0000245
Cellular tumor antigen p53
p53 SBO:0000245
Mdm2_p53 urn:miriam:uniprot:P04637
Mdm2_mRNA urn:miriam:uniprot:Q00987
ARF urn:miriam:uniprot:Q8N726
ARF_Mdm2 urn:miriam:uniprot:Q00987
Table 4: A detailed breakdown of annotation applied by Saint for the p53-Mdm2 model. Recovered annotation is shown in the MATCH column. If an item of annotation is new, it will only be present as a “New” annotation. If an item of annotation has been removed, it will only be present as an “Old” annotation. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms are omitted for brevity.

4.3 Annotating an intact model

Two annotation runs are performed on the pheromone model in this example. First, a stripped version is re-annotated and then the original, intact model is annotated. This example illustrates the benefits of searching based on the resource URI of a species element. Comparison using URIs is suitable primarily for those models whose identifiers and names are uninformative with respect to their biologically-relevant names or descriptions.

As in the previous examples, species composed of more than one protein had all of their suggested GO URIs removed. The authors of the model would themselves be able to choose the correct terms, but for simplicity GO URIs were kept only when a species had a single UniProtKB URI, ensuring no false positives were retained.

Annotation of the stripped pheromone model

This section summarises the annotation received by the stripped version of this model. Basic information on the annotation of both the original and re-annotated pheromone model, as well as a summary of the comparison between them, is presented in Table 5. A detailed listing of the modifications made to the model is available in Table 6.

The total number of Saint-supported URIs has increased from 14 to 174. Saint recovered all 14 of the URIs from the useful species, as shown in Table 5. Fifteen new SBO terms were added where none had been present previously. The number of species names increased from 10 to 21 when the new annotation was applied. Of the ten original names, four were changed to be more descriptive and six were unmodified.
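Recovery figures of this kind can be reproduced by comparing per-species URI sets between the two model versions; a minimal sketch, where the dict-of-sets representation is an assumption, the UniProtKB accession is taken from Table 6, and the GO URN is illustrative:

```python
def recovery_stats(original, reannotated):
    """Given two models as {species_id: set of Saint-supported URIs},
    return (URIs recovered, URIs in original, URIs after re-annotation)."""
    recovered = sum(len(uris & reannotated.get(sid, set()))
                    for sid, uris in original.items())
    total_original = sum(len(uris) for uris in original.values())
    total_new = sum(len(uris) for uris in reannotated.values())
    return recovered, total_original, total_new

original = {"Fus3": {"urn:miriam:uniprot:P16892"}}
reannotated = {"Fus3": {"urn:miriam:uniprot:P16892",
                        "urn:miriam:obo.go:GO:0000187"}}
# recovery_stats(original, reannotated) -> (1, 1, 2)
```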

With some species such as “α factor” it was unclear which α factor was intended from the original model, as no further annotation was provided. Therefore one of the most likely candidates was chosen, and only those GO URIs which had the most sources of evidence were kept. Names for complexes such as “complexB” were completely uninformative, and did not retrieve any results.

Species 37
Species with any annotation† 36
Useful species 15
Species with names 10
Species with SBO terms 0
Saint-supported URIs in useful species 14
Species with names after re-annotation 21
  Species names changed 4/10
  New species names 11
SBO terms after re-annotation 15
URIs after re-annotation 174
URIs recovered from useful species 14/14
Table 5: Basic information on the annotation present in the stripped version of the pheromone model, and in the re-annotated version. This model has no SBO terms in its original form, although it did contain 10 names, 6 of which were recovered and 4 of which were modified. †Any annotation refers to MIRIAM URIs, names and SBO terms.

Table 6 shows that a number of UniProtKB accessions were not recovered when re-annotating the stripped model. This is likely due to the nature of the annotations themselves. For instance, Gabc has three UniProtKB URIs in the original model, each of which was associated with the species via the hasPart MIRIAM qualifier. These three proteins are subunits of a larger complex, and neither the complex nor any of the subunits has a protein name matching the string literal “Gabc”. As such, when the stripped version was annotated, there was no way for the species to be reconnected to these three proteins. The lack of similarity of the other identifiers to their protein names caused a similar loss of UniProtKB URIs for their species.

Old New
α-factor Mating factor alpha-1
alpha SBO:0000245
Pheromone alpha factor receptor
Ste2 SBO:0000245
Ste2active Pheromone alpha factor receptor
Ste2a SBO:0000245
Protein STE5
Ste5 SBO:0000245
Serine/threonine-protein kinase STE11
Ste11 SBO:0000245
Serine/threonine-protein kinase STE7
Ste7 SBO:0000245
Mitogen-activated protein kinase FUS3
Fus3 SBO:0000245
Serine/threonine-protein kinase STE20
Ste20 SBO:0000245
Protein STE12
Ste12 SBO:0000245
Ste12active Protein STE12
Ste12a SBO:0000245
Bar1 SBO:0000245
Bar1active Barrierpepsin
Bar1a SBO:0000245
Cyclin-dependent kinase inhibitor FAR1
Far1 SBO:0000245
Cell division control protein 28
Cdc28 SBO:0000245
Protein SST2
Sst2 SBO:0000245
Gabc urn:miriam:uniprot:P18851
GaGTP urn:miriam:uniprot:P08539
Gbc urn:miriam:uniprot:P18852
GaGDP urn:miriam:uniprot:P08539
Fus3PP urn:miriam:uniprot:P16892
Bar1aex urn:miriam:uniprot:P12630
Far1PP urn:miriam:uniprot:P21268
Far1U urn:miriam:uniprot:P21268
Table 6: A detailed breakdown of Saint annotation for the pheromone model, starting from a stripped model. Recovered annotation is shown in the MATCH column. New and removed annotation are in the “New” and “Old” columns, respectively. Modified annotation items have their original values in the “Old” column and their new values in the “New” column. Only names, SBO terms and Saint-supported URIs are shown in this table, with GO terms omitted for brevity. The choice among possible species names is at the discretion of the user.

Annotation of the intact pheromone model

This section summarises the improvements in annotation gained by making use of the MIRIAM URIs already present in the original model. When annotating the stripped model, there were a number of species (e.g. Fus3PP) which had no annotation retrieved as the identifier and name were unsuitable. In these cases, making use of the URIs present in the intact, original model increases annotation coverage. For instance, UniProtKB URIs can unambiguously identify species with uninformative identifiers such as GaGTP and complexC, allowing Saint to suggest appropriate species names, GO URIs and SBO terms.
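Seeding queries from existing URIs begins with splitting each URN into its datatype and accession; a minimal sketch (Saint's actual parser is not shown in the text, so this is an assumption about the general approach):

```python
def parse_miriam_urn(urn):
    """Split a MIRIAM URN such as urn:miriam:uniprot:P08539 into
    (datatype, accession); the accession then keys an unambiguous
    lookup even when the species id (e.g. GaGTP) is uninformative."""
    parts = urn.split(":")
    if parts[:2] != ["urn", "miriam"] or len(parts) < 4:
        raise ValueError("not a MIRIAM URN: " + urn)
    # Accessions may contain further colons, so re-join the tail.
    return parts[2], ":".join(parts[3:])

# parse_miriam_urn("urn:miriam:uniprot:P08539") -> ("uniprot", "P08539")
```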

Basic information on the annotation of the intact pheromone model by Saint is presented in Table 7, together with a summary of how the new annotation compares both with the original model and with the re-annotated stripped model described above. A detailed listing of the modifications made to the model is available in Table 8.

Overall, Saint produced more annotation when starting from the intact model versus the stripped model. The total number of Saint-supported URIs increased from 101 to 325, as compared with 174 URIs added to the stripped version of the model. While the original model did not contain SBO terms and the re-annotation of the stripped model added only 15, the annotation of the intact model produced a total of 21 terms. Further, while the re-annotation of the stripped model was able to add 11 more species names, a further two names were added when starting with an intact model.

Original model Stripped model annotated by Saint Original model annotated by Saint
Species with names 10 21 23
SBO terms 0 15 21
Saint-supported URIs 101 174 325
Table 7: The amount of annotation added by Saint when annotating the original version of the pheromone model and when annotating its stripped form. More annotation is added by Saint when the original model is annotated as compared to annotating the stripped model.

Old New
GαGTP Guanine nucleotide-binding protein alpha-1 subunit
GaGTP SBO:0000245
GαGDP Guanine nucleotide-binding protein alpha-1 subunit
GaGDP SBO:0000245
Mitogen-activated protein kinase FUS3
Fus3PP SBO:0000245
Bar1activeEx Barrierpepsin
Bar1aex SBO:0000245
Cyclin-dependent kinase inhibitor FAR1
Far1PP SBO:0000245
Far1ubiquitin Cyclin-dependent kinase inhibitor FAR1
Far1U SBO:0000245
Table 8: A detailed breakdown of recovered (in the MATCH column) and new or modified annotation (in the CHANGED columns) by Saint for the pheromone model, starting from an intact model. Saint discovers all annotation found in Table 6 plus the annotation shown in this table. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms are omitted for brevity. Whether the original or new species name is used is entirely at the discretion of the user, and these particular examples are not necessarily the only choice a user could make.

5 Discussion

Saint is a syntactic data integration application for the enhancement of systems biology models. It was developed as an interactive Web tool to annotate models with new MIRIAM resources and reactions, keeping track of data provenance so the modeller can make an informed decision about the quality of the suggested annotation. The system makes it easy for modellers to add explicit biological knowledge to their models, increasing a model’s usefulness both as a reference for other researchers and as an input for further computational analysis.

While Saint allows the user to see which data source provided the prospective annotation, annotation accepted by the user does not itself have any provenance information within the completed model. At the moment, both SBML and CellML are unable to make statements about annotations. The proposed annotation package for SBML Level 3 will address this and other issues [16], while the updated metadata specification for CellML will also allow such provenance statements (Note: http://www.cellml.org/specifications/metadata/mcdraft).

Saint can save time as well as add information that might not otherwise be introduced without the help of specialists such as BioModels curators. Saint can be useful either as a first-pass annotation tool or as a way of adding useful cross-references and possible new reactions. Queries in Saint are sent to the data sources asynchronously, letting the user work on each result as it is returned, without having to wait for a complete response. With Saint, the user examines retrieved annotation, chooses the required annotations, and then moves on to the next task. In general, Saint returns a large amount of prospective annotation for the modeller to examine. Not all of it will be useful; many suggested reactions will be beyond the scope of the uploaded model, and suggested ontology terms might be too general to be useful. However, Saint provides modeller-directed annotation, where some level of filtering is expected.

By examining how Saint behaves with both stripped and intact models, we have isolated the two main query types used within the application: id / name querying and querying based on pre-existing MIRIAM resources. The use of URIs to retrieve information is an effective method of bypassing the variable quality of identifiers and names.

It is difficult for automated procedures such as Saint to disambiguate the identifiers and names used to identify species. While a number of checks and automated modifications can and are performed, many different naming conventions are used. Saint performs well with a minimal amount of data, thus allowing annotation of models that are still at an early stage of development. However, the use of a common naming scheme would improve the coverage of new annotations.

5.1 Similar work

While there are a number of tools available for manipulating and validating SBML (e.g. LibSBML), simulating SBML models, analysing simulations and running modelling workflows (e.g. Taverna), Saint is the first to provide basic automatic annotation of SBML models in an easy-to-use GUI. The purpose of Saint is to aid the researcher in the difficult task of information discovery by seamlessly querying multiple databases and providing the results of that query within the SBML model itself. By providing a modelling interface to existing data sources, modellers are able to add valuable information to models quickly and simply.

A few related applications are available for automating the retrieval and integration of data for the annotation of SBML models. SemanticSBML [17] provides MIRIAM annotations via a combination of data warehousing and query translation via Web services as part of a larger application. The Java library libAnnotationSBML [18] uses query translation to provide annotation functionality with a minimal user interface. Unlike libAnnotationSBML, Saint is accessible through an easy-to-use Web interface, and unlike both tools it can add new reactions and associated species.

5.2 Collaborations and future plans

Saint is under active development. Future enhancements will include the addition of new data sources and ontologies, annotation of elements other than species and reactions, and complete support for other modelling formalisms such as CellML. Since early 2009, the CellML developer and curator community has been in active collaboration with the Saint group. A CellML developer has recently been assigned to complete the CellML integration with Saint. Additionally, the collaboration is helping to finalise the way MIRIAM annotations are described in CellML and the addition of helper code for such annotations in the CellML API. They have provided suggestions on how the Saint interface could be modified, and these and other requirements have been formalised in a Saint requirements document (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/trunk/docs/SaintRequirements.pdf). This requirements document contains a full, itemised list of all of the future plans for Saint.

5.3 Rule-based mediation and Saint

Saint is a useful, lightweight syntactic data integration environment for systems biology. It not only provides a simple user interface allowing the modeller to integrate data and export it into their models, but also demonstrates a successful method of enhancing systems biology models through data integration. However, the work described in this thesis goes further, investigating semantic as well as syntactic methods of integration for the annotation of systems biology models.

Chapter describes the application of semantic data integration to enhance systems biology model annotation. This integration methodology results in the creation of a domain ontology populated with data from multiple data sources. In future, Saint could be modified to use this populated ontology as a data source and query it in a manner similar to the already existing data sources. This would combine the user-friendly interface of Saint with a semantically-integrated knowledgebase.


M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
Benjamin J. Bornstein, Sarah M. Keating, Akiya Jouraku, and Michael Hucka. LibSBML: an API Library for SBML. Bioinformatics, 24(6):880–881, March 2008.
Andrew Miller, Justin Marsh, Adam Reeve, Alan Garny, Randall Britten, Matt Halstead, Jonathan Cooper, David Nickerson, and Poul Nielsen. An overview of the CellML API and its implementation. BMC Bioinformatics, 11(1):178+, April 2010.
Damian Smedley, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. BioMart – biological queries made easy. BMC Genomics, 10(1):22+, 2009.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Research, 36(suppl_1):D190–195, January 2008.
Lars J. Jensen, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Peer Bork, and Christian von Mering. STRING 8 – a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research, 37(Database issue):D412–416, January 2009.
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
Camille Laibe and Nicolas Le Novère. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
C. J. Proctor, D. A. Lydall, R. J. Boys, C. S. Gillespie, D. P. Shanley, D. J. Wilkinson, and T. B. L. Kirkwood. Modelling the checkpoint response to telomere uncapping in budding yeast. Journal of The Royal Society Interface, 4(12):73–90, February 2007.
Carole Proctor and Douglas Gray. Explaining oscillations and variability in the p53-Mdm2 system. BMC Systems Biology, 2(1):75+, 2008.
Bente Kofahl and Edda Klipp. Modelling the dynamics of the yeast pheromone pathway. Yeast, 21(10):831–850, July 2004.
Dagmar Waltemath, Neil Swainston, Allyson Lister, Frank Bergmann, Ron Henkel, Stefan Hoops, Michael Hucka, Nick Juty, Sarah Keating, Christian Knuepfer, Falko Krause, Camille Laibe, Wolfram Liebermeister, Catherine Lloyd, Goksel Misirli, Marvin Schulz, Morgan Taschuk, and Nicolas Le Novère. SBML Level 3 Package Proposal: Annotation. Nature Precedings, (713), January 2011.
Falko Krause, Jannis Uhlendorf, Timo Lubitz, Marvin Schulz, Edda Klipp, and Wolfram Liebermeister. Annotation and merging of SBML models with semanticSBML. Bioinformatics, 26(3):421–422, February 2010.
Neil Swainston and Pedro Mendes. libAnnotationSBML: a library for exploiting SBML annotations. Bioinformatics, 25(17):2292–2293, September 2009.