Data Integration

Chapter 4: MFO (Thesis)

[Previous: Chapter 3: Saint]
[Next: Chapter 5: Rule-based mediation]

Model Format OWL integrates multiple SBML specification documents


The creation of quantitative SBML models that simulate the system under study is a time-consuming manual process. Currently, the rules and constraints of model creation, curation, and annotation are distributed over at least three separate specification documents: the SBML XSD, SBO, and the SBML Language Specification for each SBML level (known hereafter as the SBML Manual). Although the XSD and SBO are computationally accessible, the majority of the SBML usage constraints are contained within the SBML Manual, which is not amenable to computational processing.

MFO, a syntactic conversion of the SBML standard to OWL, was created to integrate the three specification documents and contains a large amount of the information contained within them [1]. MFO is a a fit-for-purpose ontology performing both structural and constraint integration; it can also be reasoned over and used for model validation. SBML models are represented as instances of OWL classes, resulting in a single machine-readable resource for model checking. Knowledge that was only accessible to humans is now explicit and directly available computationally. The integration of all structural knowledge into a single resource creates a new style of model development and checking. Inconsistencies in the computationally inaccessible SBML Manual were discovered while creating MFO, and non-conformance of SBML models to the specification were identified while reasoning over the models.

Section 1 describes the rules and constraints of the SBML standard, while Section 2 provides a detailed description of MFO as well as how each section of the SBML standard was modelled. Large-scale conversion of SBML entries as well as the reasoning times for the populated ontologies are described in Section 3. Finally, a discussion of similar work and further applicability is in Section 4.


The MFO website is, and MFO itself is available for download (Note: MFO is available from In order for the import statements to work, a copy of SBO must be in the same directory. A release of SBO guaranteed to work with MFO is available from The website contains general information about the project, links to the Subversion repository, associated publications and instructions for use. MFO is licensed by Allyson Lister and Newcastle University under a Creative Commons Attribution 2.0 UK: England & Wales License (Note: The MFO Software Toolset is covered under the LGPL (Note: For more information, see LICENSE.txt in the MFO Subversion project (Note:

1 The SBML standard

Both biological and structural knowledge are stored within an SBML document. Biological knowledge describes and identifies the biological entities themselves, while the structural information defines the capturing of biological knowledge in well-formed documents suitable for processing by machines. The structural knowledge required to create an SBML model is tied up in three main locations:

  • The SBML [2] XSD, describing the range of XML documents considered to be in SBML syntax;
  • SBO [3], containing a hierarchy of unambiguous terms useful as a biologically-meaningful semantic layer of annotation for systems biology models; and
  • The SBML Manual [4], detailing the many restrictions and constraints upon SBML documents, the context within which SBO terms can be used, and information on the correct interpretation of documents conforming to the SBML specification.

The SBML specification is in separate documents because each document has been created to match a different type of consumption. The portion of the knowledge codified in the XSD transmits only the information needed to parametrise and run a computational simulation of the system. The knowledge in SBO is intended to aid human understanding of the model and to aid conversion between formats. Finally, the SBML Manual is aimed at modellers as well as tool developers who need to ensure their developed software is fully compliant with the specification.

Only two of the three documents are directly computationally amenable. Firstly, the SBML XSD describes the range of legal XML documents, the required and optional elements and attributes, and the constraints on certain values within those entities. Secondly, SBO provides a term hierarchy of human-readable descriptions and labels together with machine-readable identifiers and inheritance information. Unfortunately, these two documents contain little information on how they interact, the usage of the XML entities or SBO terms, or what a conforming SBML document means to an end user. The majority of such compliance information is in the SBML Manual in human-readable English. With traditional programming techniques, the implementation of these English descriptions must be manually hard-coded.

Many libraries are available for the manipulation and validation of SBML models, with libSBML [5] being the most prominent example. LibSBML is an extensive library for manipulating SBML. It is also a powerful constraint validator, reporting any errors in an SBML document by incorporating the constraints found in all three SBML specification documents. The SBML website provides an online XML validator (Note:, and the SBML Toolbox [6] is also capable of XML validation via libSBML. The online validator uses only the SBML XSD to check the consistency of a model against the schema. The purpose of the SBML Toolbox is to integrate SBML models with the MATLAB environment, but it does have some validation features that check for the correct combination of components. The SBML Test Suite (Note: provides a set of test- cases meant to be used in software development, and can also be used to compare different software applications.

However, such manual encoding provides scope for misinterpretation of the specification, and has the potential to produce code that accepts or creates non-compliant documents due to silent bugs. In practice, these problems are ameliorated by regular SBML Hackathons (Note: and the widespread use of libSBML. However a formal, machine-readable description of the information contained in the SBML Manual will only become more relevant as the community grows beyond the point where developers and users can be adequately served by face-to-face meetings and informal representations of constraints.

2 Model Format OWL

While OWL provides the vocabulary and structure required to describe entities in a semantically-rich and computationally-expressive way, such usage is not mandated by the language. Simpler ontologies can be created in OWL which do not make full use of its semantics, trading expressivity for faster reasoning. A reduction in complexity does not necessarily mean a concomitant reduction in clarity; irrespective of expressivity level, OWL has a number of features unavailable to OBO such as access to reasoners and complex reasoner-based querying [7]. Additionally, when using fit-for-purpose modelling to create computer science ontologies, a rich ontology is often not needed and would generate a high overhead for little gain (see Section 1.5). Semantically simpler ontologies include those adhering to OWL profiles and those performing direct, syntactic conversion to OWL from a non-semantic format. The three official subsets, or profiles, of OWL are optimised for large numbers of classes (OWL-EL), large numbers of instances (OWL-QL), or rules usage (OWL-RL) [8]. Syntactic conversion of a data format into OWL allows reconciliation of the format in preparation for the application of more advanced data integration methods.

MFO is within the OWL 2 DL profile, but makes use of a number of expressions which prevent it from conforming to any of the other OWL profiles. MFO captures the entirety of the SBML structure and the SBO hierarchy plus a selection of the human-readable structural rules and semantic constraints from the SBML Manual. This method of constraint integration improves the accessibility of the specification for machines. Since MFO was originally published [1], reasoning and inference times have been dramatically reduced, and more aspects of the three main sources of rules and constraints in SBML have been incorporated.

Detailed descriptions of the modelling of these documents is available in Sections 2.32.5. These sections use the progression of SBase from XSD complex type to OWL class as a desciption of how MFO models information from each of the three SBML specification documents.

The mapping between SBML documents and the OWL representation is bi-directional: information can be parsed as OWL individuals from an SBML document, manipulated and studied, and then serialised back out again as SBML. Standard Semantic Web tools can be used to validate and reason over SBML models stored in MFO, checking their conformance and deriving any conclusions that follow from the facts stated in the document, all without manual intervention. In Chapter 5, we use MFO as the basis for a method of automatically improving the annotation of SBML models with rich biological knowledge.

2.1 Comments and annotations in MFO

Comments contained within the SBML XSD are converted to OWL along with the rest of the document. Such comments are identified within the XSD by the use of the xsd:documentation element, and are copied to OWL and identified using the rdfs:comment field. This field is also used provide attribution for other facts using the phrases below.

  • Structural Element of SBML: The classes with this comment have a direct association to a specific element or attribute of the SBML format, and generally have identical names to the original XML.
  • Structural Element of XML: Classes with this comment have a direct association to a specific part of an XML document. The only OWL class currently containing this statement is XMLNamespace, which holds any namespace information from a given SBML document.
  • Documented Constraint: The comments which begin with this phrase describe documented constraints on their containing classes. They are generally taken from the SBML Manual, and include the version of the specification supplying the constraint.

SBO classes do not contain these comment types because they are neither based on a structural XML element nor drawn from the SBML Manual. However, many MFO classes have more than one of these comments. The most common combination is “Structural Element of SBML” combined with “Documented Constraint”, indicating that a specific SBML component is defined not only in the XSD, but also in the SBML Manual. One example is the MFO class Reaction, which restricts its SBO terms to those found in the occurring entity representation hierarchy. This documented constraint is found in Table 6, Section 5.2.2 of Level 2 Version 3 Release 1 of the SBML Manual, while the XML element itself is defined in the XSD.

2.2 Metrics

Table 1 contains selected metrics for MFO and SBO. Definitions of commonly-used metrics are available in Section 1.5. As an ontology whose primary purpose is as a controlled vocabulary to facilitate the annotation and reuse of systems biology models [3], SBO is high in classes and in high in annotations. It is also low in instance data and contains only the transitive object property part of in addition to the standard subsumption property is a. This lack of complex properties and instances combined with a high number of classes and high level of annotation reflects its purpose as a vocabulary.

Model Format OWL (MFO) Systems Biology Ontology (SBO)
Classes 88 583
Equivalent classes 13 0
Subclass axioms 264 633
Disjoint statements 692 0
Object properties 55 1
Functional object properties 35 0
Data properties 44 0
Functional data properties 26 0
Instances 35 0
Annotation axioms 63 1319
Table 1: Selected ontology metrics for MFO and SBO. The first column of data is the metrics for MFO alone, although SBO is a direct import within MFO. A summary of metrics for SBO is provided in the second column. Metrics created using Protégé 4.

In contrast, MFO only needs enough classes to model the various entities present in the SBML XSD and their set of constraints. Therefore the number of classes is smaller in comparison to a vocabulary-based ontology such as SBO. Due to the interconnectedness of the entities and constraints within the SBML specification documents, the total number of class axioms and property types is high relative to the number of classes. Even though MFO is primarily a syntactic ontology and not a model of a biological domain, the syntax of SBML combined with the constraints present in the specification create an interconnected structure.

2.3 Modelling the SBML XML Schema

SBML is a good candidate for representation in OWL as the large amount of rules and constraints are described in detail, and the interlocking nature of those rules makes integrating them worthwhile. The XMLTab plugin [9] for Protégé 3.4 reads in either an XSD or XML file and converts that file into a representation in OWL. More information on the use of the XMLTab plugin is available in Chapter 5. This plugin was applied to the SBML XSD, resulting in the creation of an initial set of classes and relations. This creation step was virtually instantaneous, and did not require human intervention. However, limitations of the plugin required some manual additions. For instance, although the enumeration of allowed UnitSId values is available within the XSD, it is not converted by the XMLTab plugin. Instead, it was manually created as a set of 35 instances (shown in Table 1) of the corresponding UnitSId class within MFO.

In the creation of MFO, SBML elements became OWL classes and SBML attributes became OWL properties (either data or object properties, as appropriate). When specific SBML models are loaded, their data is stored as instances. An example of the conversion is shown in Figure 1, using a small portion of the SBML SBase complex type. SBase is useful as an example because nearly every SBML type inherits from this complex type.

Figure 1: A portion of the SBML SBase element for Level 2 Version 4, as defined by the XSD. For brevity, the complete complex type has not been shown. Source:

The XSD provides information on the characteristics of each complex type. Such information includes the optionality of attributes, how often attributes can be used and the typing of attributes (e.g. boolean). While not easy for humans to read, XML is easily parseable by computers. Figure 2 shows how constraints within the XSD for SBase are fully converted into OWL.

Figure 2: A portion of the MFO SBase class when converted directly from the XSD using the XMLTab plugin. Each of the elements and attributes shown in Figure 1 are shown converted into OWL axioms. The Manchester OWL Syntax has been used for readability. For brevity, the complete axiom list has not been shown.

A number of XSD constraints have direct equivalents in OWL. For instance, if an attribute is optional and can only be present once, the max 1 construct is used, as with the sboTerm property:

 sboTerm max 1 Thing

Using an axiom from the Species class as an example, the conversion of required attributes makes use of the exactly construct:

 compartment_id exactly 1 SId

If the attribute can only be filled with a particular type of entity, then that can be specified as well:

 sboTerm only SBOTerm

While the OWL shown in Figure 2 is a direct conversion of the SBML XSD, closer inspection reveals the limitations of XSDs, as very little rule and constraint information is contained within it. For example, the relationship between SBase and SBOTerm cannot be adequately described without SBO itself, and some usages of the SId class within Species cannot be correctly modelled without access to the additional specification information found in the SBML Manual. These and other examples of how MFO is able to model aspects of the SBML specification documents are described in detail in the next two sections.

2.4 Modelling SBO

SBO is a hierarchy of terms developed by the systems biology modelling community to assist compliance with MIRIAM, ensure unambiguous understanding of the meaning of the annotated entities and foster mapping between annotated elements from multiple formats making use of the ontology [10]. By adding SBO terms, modellers take an essential step towards such compliance [3]. Each term in SBO is related to its direct parent with an is a subsumption relationship such as catalyst is a modifier. For a full description of SBO itself and its relationship to other systems biology standards, see Section 1.4.

While SBO is natively stored as a relational database, it can be exported either as OWL or OBO (Note: The OWL export of SBO is regularly downloaded and imported into MFO. The current version of SBO in MFO is the auto-generated OWL export downloaded on 13 December 2011. While the import could directly reference the online version of SBO from the EBI servers, a local copy ensures that MFO works offline and that unexpected changes in SBO do not cause any adverse effects.

Figure 3: The SpeciesPreL2V4 and SpeciesAtAndAfterL2V4 classes whose union is the full necessary and sufficient definition of Species, as shown in Figure 4. These classes allow the reasoner to check that the appropriate SBO terms are used for the appropriate SBML model versions. The Manchester OWL Syntax has been used for readability, and only part of the SpeciesAtAndAfterL2V4 class is shown for brevity.

In SBML, SBase provides an optional sboTerm attribute which creates a link to a specific term in SBO. Most elements in SBML inherit from SBase and therefore have access to the sboTerm attribute. In MFO, this is represented as the object property sboTerm. Figure 3 provides examples of sboTerm usage within MFO. The SBOTerm simple type, which is the type for the sboTerm attribute used within SBase, defines a string pattern restriction of “(SBO:)([0-9]7)” rather than referencing SBO directly. As such, the XSD constraint is only an approximation rather than a complete representation of the correct values of the sboTerm attribute. MFO improves upon this situation by linking the sboTerm attribute directly to SBO itself. Figure 3 shows how the material entity and physical entity representation classes from SBO are linked directly to the MFO classes SpeciesAtAndAfterL2V4 and SpeciesPreL2V4, respectively.

There are both formal and informal restrictions on how SBO terms should be used within an SBML model. The formal restrictions are listed in the SBML Manual and covered in the next section. The informal restrictions are concerned with those entities which inherit from SBase but which do not have any specific rules governing how SBO should be used. For instance, while the ListOf__ elements inherit from SBase, the SBML Manual does not provide the range (if any) of allowed SBO terms for these elements [4], making it difficult to model them correctly in OWL. Therefore in MFO, ListOf__ classes have no restrictions on the use of SBO terms. While this is an exact reproduction of the rules and constraints in the specification, a lack of clarity within the specification leads to possibly incomplete modelling.

2.5 Modelling the SBML Manual

The SBML Manual contains the majority of the rules and constraints to which valid models must conform. For instance, depending on the value of the constant and boundaryCondition attributes of the species element, corresponding speciesReference elements may or may not be products or reactants in reactions [4, Section 4.8.6, Table 5].

A new document is released for each new version of SBML. New versions of the manual may create new rules, invalidate old ones or publish changes to existing rules. For example, an explicit statement of the SBO terms allowed for modifierSpeciesReference was removed in later versions of the SBML Manual. The Level 2 Version 4 Manual states that modifierSpeciesReference elements should have SBO terms from the modifier hierarchy [11, Section 4.13.2]. In Level 3 Version 1 this constraint is no longer explicitly stated, though it is implied [4, Section 4.11.4]. Determining the correct restriction to use in cases such as this becomes impossible without direct consultation with the developers of the specification. Such a lack of precision is a serious pitfall of specification documents which are only human readable.

In addition to the inexact wording possible with an English specification, new versions of the SBML Manual also change which SBO term hierarchies can be used for particular elements [11, 4, Table 6]. For example, as shown in Figure 3, material entity and its children are the only SBO terms that can be used with Species classes at or after Level 2 Version 4. However, Species classes have different constraints on their SBO terms if they are before Level 2 Version 4.

To solve this problem and allow the reasoner access to this logic, the Species class was given two additional children, SpeciesAtAndAfterL2V4 and SpeciesPreL2V4. These child classes are together marked as equivalent to the Species class (Figure 4), and are fully modelled based on the level and version type of the SBML document (Figure 3). These are some of the many constraints placed on the use of SBO terms documented within the SBML Manual, and not captured in either the SBML XSD or the SBO OBO file.

Figure 4: Example constraints added to MFO based on information contained in the SBML Manual. The constraints listed here are in addition to those already added by XMLTab. As some constraints are dependent upon the level and version of the SBML model, subclasses of Species were created (see also Figure 3). Additionally, the constraints present in the SBML Manual allowed the correct linking of properties such as speciesType_id and compartment_id to their owning classes via a restriction on the SId used in those properties.

The SId class models all identifiers internal to the SBML model, and is heavily used in the XML to provide links from one part of the document to another. It and its child class, UnitSId, constrain the type of identifier allowed for entities within SBML [4, Section 3.1.7]. The conversion to OWL from the XSD does create an SId class, but a lack of further information means that the relevant parts of the SBML Manual must also be used to correctly model SIds within MFO. Simple statements using the id object property require no changes to the axioms created by the conversion process:

 id only SId

The statement above, taken from the Species class, states that the Species identifier must be of type SId. This matches the behaviour defined in the SBML Manual, and no further modifications to the OWL are required. However, other cases where SIds are used within the Species class are not completely defined in the XSD. These cases are when a SId is used as the link between Species and another class. For instance, compartment_id is the object property which, ultimately, links a given Species to its containing Compartment, but the XSD does not specify this, so the conversion to OWL results in:

 compartment_id exactly 1 SId

While this statement tells us that compartment_id must be of type SId, it does not let us know that it must connect to an SId belonging to a Compartment. Therefore, the addition of this information from the SBML Manual results in the following modification to the statement (also shown in Figure 4):

 compartment_id only (id_of only Compartment),
 compartment_id exactly 1 Thing

Figure 5: Example constraints added to MFO based on information contained in the SBML Manual. A new class called MiriamAnnotation was created and linked to SBase, allowing the full and complete modelling of the MIRIAM structure. The constraints listed here are in addition to those already added by XMLTab.

On its own, the conversion of SBase into OWL results in the creation of a generic annotation data property which is filled by a string literal (Figure 2). However, a major part of the annotation of an SBML entity is its MIRIAM annotation. Therefore to align more closely with the SBML Manual and to aid further integration work in this thesis, classes describing MIRIAM annotation in SBML have been added. Figure 5 shows the MiriamAnnotation class and the miriamResource data property in the context of SBase, while Figure 6 shows the MIRIAM qualifiers within MFO. The MIRIAM biological qualifiers are held within the bqb property hierarchy, and the model qualifiers are listed within the bqm hierarchy. MIRIAM-compliant annotation elements are important to rule-based mediation (Chapter 5), as these elements contain information such as cross references and taxonomic information.

Figure 6: The hierarchy of MIRIAM qualifiers in OWL, as shown in Protégé 4. miriamResource is the top-level data property, and the biology qualifier (modelled here as bqb) and model qualifier (modelled here as bqm) hierarchies follow those found in the SBML Manual.

3 Working with MFO

On its own, MFO integrates the rules and constraints present in the three SBML specification documents, making those rules available to machines and presenting them in a single view for humans. Further, by adding data from SBML models themselves, the information stored within the model is exposed to reasoners and semantic querying. This section describes the population and exporting of MFO using the entirety of the BioModels database as an example, and shows the benefits of reasoning over even a relatively simple syntactic ontology such as MFO. Chapter 5 then describes the larger semantic integration project and how MFO can be used in such a situation.

3.1 Creating instance data

Figure 7: The species “DL” from BioModels entry BIOMD0000000001 in the native XML format. Please note that multiple sections of the XML have been removed to aid readability.

In Figure 7, an SBML species is shown in its original XML format, while its corresponding instance data in MFO is shown in Figure 8. Each attribute of Species is filled with the data provided from the XML. Data properties point directly to the value for that property. For example, the name data property points to “DesensitisedACh”. Object properties can be followed to the instances being referenced. For example, a miriamAnnotation object property points to
BIOMD0000000001_000012_bqbIsVersionOf_0_0, an instance of the class MiriamAnnotation. This instance has a bqbIsVersionOf data property whose value is “urn:miriam:interpro:IPR002394”. Additionally, the sboTerm attribute points, not to a string literal as described in the XSD, but to an actual SBOTerm, as required by the SBML Manual.

Figure 8: The BIOMD0000000001_DL instance of the MFO Species class from BioModels entry BIOMD0000000001. In addition to BIOMD0000000001_DL itself, a small number of object properties are expanded to show their actual values.

3.2 Populating MFO

When converting data from SBML to MFO, first the XML is parsed and then corresponding OWL is created. Each of these steps is aided by existing libraries coordinated with the MFO Software Toolset. LibSBML 5.0.0 [5] converts the XML structures into plain Java objects. Each Java object is then assigned to a particular MFO entity and saved as OWL using the OWL API version 3.2.3 [12]. After conversion back to SBML, roundtrip comparisons of the XML both before and after processing were performed to ensure that all data is retained.

The BioModels [13] database release of 15 April 2011 contains 326 curated and 373 non-curated models. Within these two datasets, a total of six models could not be loaded into libSBML at all, and a further nine did not pass the roundtrip test from XML to MFO and back again. There was only a single curated model which failed loading into libSBML: BIOMD0000000326 contains a notes element which is in an incorrect format and was rejected. An additional five entries from the non-curated set failed loading for the same reason (Note: MODEL0911120000, MODEL1011020000, MODEL1012110001, MODEL1150151512, MODEL6655501972). Table 2 shows the number of instances and property axioms created when both the curated and non-curated BioModels datasets were converted into MFO.

Curated BioModels Non-Curated BioModels
Instances 149,912 1,217,067
Object Property Assertions 246,679 2,225,935
Data Property Assertions 221,738 1,884,250
Table 2: The number of created instances and related axioms when the BioModels database is converted into MFO. Curated and non-curated datasets are shown separately.

As SBML models are standalone and cannot be imported into each other, when downloading the BioModels database, each entry is stored in its own file. Figure 9 summarises individual creation times for MFO models of all parseable curated and uncurated entries in the BioModels database. The conversion to MFO generally takes less than one second, although uncurated entries take longer, on average, to be converted. (Note: All timings in this chapter were performed on a Dell Latitude E6400 computer, unless otherwise stated.). Conversion times were longer for larger SBML models such as the yeast molecular interaction network model [14] present in the non-curated dataset (Note: MODEL3883569319).

When roundtrip testing was performed, two curated models (BIOMD0000000169 and
BIOMD0000000195) had errors in their original MIRIAM annotation resulting in libSBML dropping some of the MIRIAM URIs. These have been confirmed as errors in the SBML model by the BioModels curators  (Note: A third curated model, BIOMD0000000228, gains a superfluous sbml namespace on most elements when converting from MFO to SBML, causing an error within libSBML. A total of six non-curated models failed the roundtrip process. Four of these models (Note: MODEL2463683119, MODEL4132046015, MODEL1009150002, MODEL1949107276) had segmentation faults from libSBML’s C++ core during the comparison of the original and the newly-converted SBML, and the two biggest non-curated models (Note: MODEL3883569319, MODEL6399676120) could not feasibly be compared in memory.

Figure 9: Summary of the conversion times for all MFO models stemming from the complete BioModels database. Each dot corresponds to a single BioModels entry, and the conversion times are plotted on a logarithmic scale. The extreme outlier, MODEL3883569319, is not included in this graph to aid readability (conversion time 2401 seconds).

Although each BioModels entry is stored within the MFO Subversion project in its own file to match how those database entries are provided, there is no reason why the entire database cannot be stored in a single OWL file. Within the MFO Subversion project, OWL files are available for download both singly and as a concatenation of all curated data.

3.3 Reasoning over MFO

Reasoning over an ontology can provide a number of benefits. Inconsistencies in the logic or in the instance data can be identified, and inference of more precise locations for classes and instances can create new hierarchies and pull out otherwise hidden knowledge. By using MFO, constraints can be automatically enforced that currently are only described textually in the SBML Manual or hard-coded into libSBML.

Libraries such as libSBML capture the process of validating constraints, while MFO explicitly captures the constraints themselves, thus identifying where the SBML Manual is not explicit or appears to contradict itself. Examples of this constraint capture and verification were already described in Section 2.5, where a lack of precision in the English language definitions as well as changes in correct annotation for sboTerm attributes result in logical inconsistencies. These inconsistencies were discovered while MFO was being created, as the axioms for the relevant parts of the specification were being added. However, while it is not easy to manually spot such errors it is also not required that they be noticed at this stage. If two incompatible SBO term branches of the SBO hierarchy were both stated to be the only branches allowed for a particular MFO entity, the reasoner would catch this inconsistency even if the ontologist did not.

The ongoing process of incorporating constraints into MFO will improve the quality of the SBML Manual through identification of these weaknesses. Once captured in MFO, these constraints can be investigated systematically by running a reasoner. This allows many aspects of models to be validated without the need for hard-coded checking libraries. Furthermore, the addition of new constraints requires the release of a new version of a library such as libSBML, but would only require retrieval of the new MFO file; no changes are required to the supporting MFO Toolset.

Factors affecting reasoning time are complex, and are not just tied to ontology, ABox or TBox size. Such a discussion is beyond the scope of this thesis, but Motik and colleagues provide a thorough discussion and comparison of current reasoners [15]. Figure 10 shows a plot of the reasoning times for curated BioModels, providing a general overview of how long reasoning takes over MFO models. Although, very broadly, larger models took longer to reason over, Figure 10 shows that size is not a highly useful indicator of reasoning time. Only the 290 models which were both consistent and satisfiable were included in the plot. The vast majority of models in this set took less than one second to reason over. Two extreme outliers took much longer to reason over, most likely due to their sizes, as they are two of the largest models in the curated BioModels dataset (Note: BIOMD0000000235 and BIOMD0000000255).

Figure 10: Reasoning times over curated BioModels modelled in MFO. Each dot corresponds to a single BioModels entry, and the reasoning times are plotted on a logarithmic scale. The reasoning time is plotted against the size of each model. Two extreme outliers (BIOMD0000000235: 924 seconds, BIOMD0000000255: 370 seconds) have not been included in the graph. HermiT [15] version 1.3.3 was used as the reasoner.

As a syntactic ontology, the simple structure allows fast reasoning, and the benefits gained include SBO validation and inference of more specific subtypes of classes. Of the 35 inconsistent curated BioModels entries, a sample were examined manually to determine the causes of the inconsistencies. For both BIOMD0000000055 and BIOMD0000000216, the inconsistencies stemmed from errors in the use of SBO terms. In BIOMD0000000055, the SBO term messenger RNA was used within a species. As this model is Level 2 Version 4, and messenger RNA is not part of the material entity hierarchy (see Section 2.5), the use of this term makes the ontology inconsistent. Similarly, in BIOMD0000000216 the incorrect SBO term simple chemical is used within a parameter. Therefore, inconsistencies in models represented in MFO can be traced to errors in the underlying SBML documents.

Reasoners can also place instances in more specific sub-classes than the class in which they were originally asserted. For instance, how a species’ quantity can be changed depends upon the values of the boundaryCondition and constant attributes [4, Table 5]. If both are true, the quantity of the species cannot be altered by the reaction system and therefore its value will not change during a simulation run. MFO models all four possible combinations of these booleans as four distinct sub-types of Species. The placement in the Species hierarchy of a given instance can be inferred using the values of these properties. As an example, if an instance is asserted to be a Species and both boundaryCondition and its constant are true, reasoners will correctly classify this individual as a ConstantBoundarySpecies.

4 Discussion

4.1 Similar work

The CellML Ontological Framework [16] has parallels with MFO in that it aims to represent CellML models in OWL. This project was published two years after MFO, and uses, among other things, graph reduction algorithms to represent CellML models in a biological “view” which shows the biology of the model while hiding the complexity of the full CellML document. However, the purpose of the CellML Ontological Framework is to aid readability of CellML models, not to integrate or validate.

SBML Harvester is a tool similar to MFO, allowing the conversion of SBML models into OWL [17]. Like MFO, multiple SBML documents can be loaded into a single OWL file, reasoned over, and queried. SBML Harvester can be exported into the OWL-EL profile, and incorporates a minimal, custom-made upper-level ontology which itself makes use of the Relation Ontology [18]. Unlike MFO, which models only the SBML specification and not the underlying biology, SBML Harvester incorporates both biological concepts and the model concepts into the same ontology. This conflates two completely different domains into a single ontology. Further, SBML Harvester is not a complete model of the SBML specification. It currently only covers a limited subset of the SBML document: compartments, species, reactions and the model element itself. This means that it is not capable of performing a full round-trip of the SBML documents.

In theory BioPAX could be used to model all SBML documents, as there are tools available for converting from one format to another. However, this solution is not feasible for a number of reasons. Most importantly, BioPAX does not store quantitative information or the logical restrictions and axioms that are otherwise only stored in either SBO or the SBML Manual. As such, BioPAX cannot be easily used to perform the tasks possible with MFO. Le Novère stated that current converters result in “pathological SBML or BioPAX, and the round-tripping is near impossible” (Note: Few modellers in the target group for MFO use BioPAX natively. Their simulators accept SBML, and they develop and test their models in SBML. For those familiar with SBML, MFO is likely to be a much more accessible view of models than BioPAX. Finally, the export of data from the integration project described in Chapter 5 needs the SBML format, which can be exported losslessly with MFO but not with BioPAX.

4.2 Applicability

MFO represents the information required to build and validate an SBML model in a way accessible to machines. The three separate sources of this information meet the needs of distinct user groups and, while optimal for the existing range of human-driven applications, this disparate arrangement is not suitable for computational approaches to model construction and annotation. This is primarily because much of the knowledge about what constitutes valid and meaningful SBML models is tied up in a non-machine-readable format. Once available to machines, the model is exposed to reasoners which can be used to check for internal consistency and correct structure.

MFO is not intended to be a replacement for any of the APIs or software programs available to the SBML community. Rather, MFO is a complementary approach that facilitates the development of the rule-based mediation integration strategy (Chapter 5) for model improvement and manipulation by drawing upon internal consistency and inference capabilities. With MFO, inconsistencies both in the specification documents and in the models themselves can be identified. It addresses the very specific need of a sub-community within SBML that wishes to be able to express their models in OWL for the purpose of reasoning, validation and querying.

The scope of MFO does not extend to mapping other domains or integrating other data sources. These integrative tasks have a logically distinct purpose. MFO can be used on its own or as one of the source ontologies integrated via rule-based mediation. Chapter 5 goes beyond the standalone usage described here to explain how rule-based mediation utilises MFO with other source ontologies to integrate and query systems biology data. In rule-based mediation, MFO has a dual role: firstly, it acts as the format representation for SBML models such as those stored within the BioModels database, and secondly it exports query results as new annotation on SBML documents.


A. L. Lister, M. Pocock, and A. Wipat. Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80+, 2007.
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, , the rest of the SBML Forum:, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
Michael Hucka, Michael Hucka, Frank Bergmann, Stefan Hoops, Sarah Keating, Sven Sahle, James Schaff, Lucian Smith, Darren Wilkinson, Michael Hucka, Frank T. Bergmann, Stefan Hoops, Sarah M. Keating, Sven Sahle, James C. Schaff, Lucian P. Smith, and Darren J. Wilkinson. The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core. Nature Precedings, (713), October 2010.
Benjamin J. Bornstein, Sarah M. Keating, Akiya Jouraku, and Michael Hucka. LibSBML: an API Library for SBML. Bioinformatics, 24(6):880–881, March 2008.
Sarah M. Keating, Benjamin J. Bornstein, Andrew Finney, and Michael Hucka. SBMLToolbox: an SBML toolbox for MATLAB users. Bioinformatics (Oxford, England), 22(10):1275–1277, May 2006.
Mikel Aranguren, Sean Bechhofer, Phillip Lord, Ulrike Sattler, and Robert Stevens. Understanding and using the meaning of statements in a bio-ontology: recasting the Gene Ontology in OWL. BMC Bioinformatics, 8(1):57+, February 2007.
Diego Calvanese, Jeremy Carroll, Giuseppe De Giacomo, Jim Hendler, Ivan Herman, Bijan Parsia, Peter F. Patel-Schneider, Alan Ruttenberg, Uli Sattler, and Michael Schneider. OWL2 Web Ontology Language Profiles, October 2009.
Michael Sintek. XML Tab., June 2008.
Melanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novere. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
Nicolas Le Novère, Michael Hucka, Stefan Hoops, Sarah Keating, Sven Sahle, Darren Wilkinson, Michael Hucka, Stefan Hoops, Sarah M. Keating, Nicolas Le Novère, Sven Sahle, and Darren Wilkinson. Systems Biology Markup Language (SBML) Level 2: Structures and Facilities for Model Definitions. Nature Precedings, (713), December 2008.
Matthew Horridge, Sean Bechhofer, and Olaf Noppens. Igniting the OWL 1.1 Touch Paper: The OWL API. In Proceedings of OWLEd 2007: Third International Workshop on OWL Experiences and Directions, 2007.
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
Tommi Aho, Henrikki Almusa, Jukka Matilainen, Antti Larjo, Pekka Ruusuvuori, Kaisa-Leena Aho, Thomas Wilhelm, Harri Lähdesmäki, Andreas Beyer, Manu Harju, Sharif Chowdhury, Kalle Leinonen, Christophe Roos, and Olli Yli-Harja. Reconstruction and Validation of RefRec: A Global Model for the Yeast Molecular Interaction Network. PLoS ONE, 5(5):e10662+, May 2010.
Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau Reasoning for Description Logics. Journal of Artificial Intelligence Research, 36:165–228, 2009.
S. M. Wimalaratne, M. D. B. Halstead, C. M. Lloyd, E. J. Crampin, and P. F. Nielsen. Biophysical annotation and representation of CellML models. Bioinformatics, 25(17):2263–2270, September 2009.
Robert Hoehndorf, Michel Dumontier, John Gennari, Sarala Wimalaratne, Bernard de Bono, Daniel Cook, and Georgios Gkoutos. Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5(1):124+, August 2011.
Barry Smith, Werner Ceusters, Bert Klagges, Jacob Kohler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan Rector, and Cornelius Rosse. Relations in biomedical ontologies. Genome Biology, 6(5):R46+, 2005.
Data Integration Semantics and Ontologies Software and Tools

Current Research into Reasoning over BioPAX and SBML

What’s going on these days in the world of reasoning and systems biology modelling? What were people’s experiences when trying to reason over systems biology data in BioPAX and/or SBML format? These were the questions that Andrea Splendiani wanted to answer, and so he collected three of us with some experience in the field to give 10-minute presentations to interested parties at a BioPAX telecon. About 15 people turned up for the call, and there were some very interesting talks. I’ll leave you to decide for yourselves if you’d class my presentation as interesting: it was my first talk since getting back from leave, and so I may have been a little rusty!

The first talk was given by Michel Dumontier, and covered some recent work that he and colleagues performed on converting SBML to OWL and reasoning over the resulting entities.

Essentially, with the SBMLHarvester project, the entities in the resulting OWL file can be divided into two broad categories: in silico entites covering the model elements themselves, and in vivo entities covering the (generally biological) subjects the model elements represent. They copied all of BioModels into the OWL format and performed reasoning and analysis over the resulting information. Inconsistencies were found in the annotation of some of the models, and additionally queries can be performed over the resulting data set.

I gave the second talk about my experiences a few years ago converting SBML to OWL using Model Format OWL (MFO) (paper) and then, more recently, using MFO as part of a larger semantic data integration project whose ultimate aim is to annotate systems biology models as well as create skeleton (sub)models.

I first started working on MFO in 2007, and started applying that work to the wider data integration methodology called rule-based mediation (RBM) (paper) in 2009. As with SBMLHarvester, libSBML and the OWLAPI are used in the creation of the OWL files based on BioModels entries. All MFO entries can be reasoned over and constraints present within MFO from the SBML XSD, the SBML Manual, and from SBO do provide some useful checks on converted SBML entries. The semantics of SBMLHarvester are more advanced than that of MFO, however MFO is intended to be a conversion of a format only, so that SWRL mappings can be used to input/output data from MFO to/from the core of the rule-based mediation. Slide 8 of the above presentation provides a graphic of how rule-based mediation works. In summary, you start with a core ontology which should be small and tightly-scoped to your biological domain of interest. Data is fed to the core from multiple syntactic ontologies using SWRL mappings. These syntactic ontologies can be either direct format conversions from other, non-OWL, formats or pre-existing ontologies in their own right. I use BioPAX in this integration work, and while I have mainly reasoned over MFO (and therefore SBML), I do also work with BioPAX and plan to work more with it in the near future.

The final presenter was Ismael Navas Delgado, whose presentation is available from Dropbox. His talk covered two topics: reasoning over BioPAX data taken from Reactome, and the use of a database back-end called DBOWL for the OWL data. By simply performing reasoning over a large number of BioPAX entries, Ismael and colleagues were able to discover not just inconsistencies in the data entries themselves, but also in the structure of BioPAX. It was a very interesting summary of their work, and I highly recommend looking over the slides.

And what is the result of this TC? Andrea has suggested that, after discussion on the mailing list (contact Andrea Splendiani if you are not on it and want to be added) and then have another TC in a couple of weeks. Andrea has also suggested that it would be nice to “setup a task force within this group to prepare a proof of concept of reasoning on BioPAX, across BioPAX/SBML, or across information resources (BioPAX/OMIM…)”. I think that would be a lot of fun. Join us if you do too!

CISBAN Semantics and Ontologies Software and Tools

SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.

A partial overview of the classes (and number of individuals) in MFO.
A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals!1 It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Pellet 2.0.0 rc7 performs the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO subverison project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.
Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on a semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies of the format in which your data sources are available. By layering your ontologies like this, you can separate out the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels. This is a database of simulatable, biological models whose primary format is SBML. I’m especially interested in BioModels, as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. In BioModels, the “native” format for the models is SBML, though other formats are available. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.
How a single reaction looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. It’s purpose is to be a direct translation of SBML, SBO, and the SBML Specification document in OWL format. Using an editor such as Protege, you can manipulate and create models, and then using the MFO code you can export the completed model back to SBML (while the import feature is complete, the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features. However, I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathway types. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that – it stores logical restrictions and axioms that are only otherwise stored in either SBO itself or, more importantly, the purely human-readable content from the SBML specification document2. Therefore MFO is more than SBML, it is a bunch of extra constraints that aren’t present in the BioPAX version of SBML, and therefore, I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!.
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

CISBAN Data Integration Semantics and Ontologies Software and Tools Standards

Of GelML and MFO

A couple of papers from here at Newcastle University have appeared over the past couple of weeks. Here's a summary of them both.

  • Data Standards
    From "An Update on Data Standards for Gel Electrophoresis" in Practical Proteomics Issue 1, September 2007, and by Andrew R. Jones and Frank Gibson.
    From the abstract: "We report on standards development by the Gel Analysis Workgroup of the
    Proteomics Standards Initiative. The workgroup develops reporting
    requirements, data formats and controlled vocabularies for experimental
    gel electrophoresis, and informatics performed on gel images. We
    present a tutorial on how such resources can be used and how the
    community should get involved with the on-going projects. Finally, we
    present a roadmap for future developments in this area."
    Provides a summary of ongoing work in the Gel electrophoresis and Gel informatics fields in terms of data and metadata standardization. This includes work on MIAPE GE and MIAPE GI, two checklists for minimal information required on these types of experiments and analyses. For both GE and GI, there are data formats (GelML and GelInfoML, respectively, both extensions of FuGE) and a suggested controlled vocabulary (sepCV). More information can be found on
    Frank works in the CARMEN neuroscience project here at Newcastle, and Andy is in Liverpool and works on, among other things, FuGE. CARMEN collaborates with the SyMBA project, which was originally developed by me and a few others within Neil Wipat's Integrative Bioinformatics Group here at Newcastle but which is now a sourceforge project at Andy Jones is a co-author with me, Neil Wipat, Matt Pocock and Olly Shaw on an upcoming SyMBA paper.
  • Semantic Data Integration
    A paper that was presented at the Integrative Bioinformatics Conference 2007 by me and my co-authors, Matt Pocock and Neil Wipat, is now available from the Journal of Integrative Bioinformatics website.
    Allyson L. Lister, Matthew Pocock, Anil Wipat. Integration of
    constraints documented in SBML, SBO, and the SBML Manual facilitates
    validation of biological models
    . Journal of Integrative Bioinformatics,
    4(3):80, 2007.

Read and post comments |
Send to a friend