Integrative Bioinformatics 2007, Day 2: Model Format OWL (MFO), Lister et al.

Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models

Published September 2007 by the Journal of Integrative Bioinformatics

Allyson L. Lister1,2, Matthew Pocock2, Anil Wipat1,2,*
1 Centre for Integrated Systems Biology of Ageing and Nutrition (http://www.cisban.ac.uk)
2 School of Computing Science (http://www.cs.ncl.ac.uk),
Newcastle University (http://www.ncl.ac.uk)*

Abstract

The creation of
quantitative, simulatable, Systems Biology Markup Language (SBML) models that
accurately simulate the system under study is a time-intensive manual process
that requires careful checking. Currently, the rules and constraints of model
creation, curation, and annotation are distributed over at least three separate
documents: the SBML schema document (XSD), the Systems Biology Ontology (SBO),
and the “Structures and Facilities for Model Definition” document. The latter
document contains the richest set of constraints on models, and yet it is not
amenable to computational processing. We have developed a Web Ontology Language
(OWL) knowledge base that integrates these three structure documents, and that
contains a representative sample of the information contained within them. This
Model Format OWL (MFO) performs both structural and constraint integration and
can be reasoned over and validated. SBML Models are represented as individuals
of OWL classes, resulting in a single computationally amenable resource for
model checking. Knowledge that was only accessible to humans is now explicitly
and directly available for computational approaches. The integration of all
structural knowledge for SBML models into a single resource creates a new style
of model development and checking.

Introduction

Systems Biology Markup Language[1] (SBML) is an XML format that has emerged as the de facto standard file format for
describing computational models in systems biology. It is supported by a
vibrant community who have developed a wide range of tools, allowing models to
be generated, analysed and curated in any one of many independently maintained
software applications[1].
The Systems Biology Ontology[2][2] (SBO)
was developed to enable a useful understanding of the biology to which a model
relates, and to provide well-understood terms for describing common modelling
concepts. The community is engaged in an on-going effort to develop the SBML
standard in ways needed to support systems biology applications. As part of
this process, a manual is maintained that describes and defines SBML and SBO[3].

The biological knowledge used to create and
annotate a high-quality SBML model is typically analysed and integrated by a researcher.
These modellers know and understand both the systems they are modelling and the
intricacies of SBML. However, as with most areas of biology, the amount of data
that is relevant to generating even a relatively small and well-scoped model is
overwhelming. In order to extend the range of modelling tasks that can be
automated, it is necessary to both capture the salient biological knowledge in
a form that computers can process, and represent the SBML rules in a way
computers can systematically interpret. Here we address the latter issue:
describing SBML, SBO and the rules about what constitutes a correctly formed
model in a way suitable for computational manipulation.

The Semantic Web[4]
can be seen as today’s incarnation of the goal to allow computers to go beyond
performing numerical computations, and to share and integrate information more
easily. There are now several standards forming within the Semantic Web
community that together formalise computational languages for representing
knowledge and strictly define what conclusions can be reached from facts
expressed in these languages. The Web Ontology Language[3][5] (OWL) is
one such language that enjoys strong tools support and which is used for
capturing biological and medical knowledge (e.g. OBI[6], BioPAX[7], EXPO[4],
and FMA[5] and GALEN[6] in OWL). Once the information about the domain has been
modelled in an OWL file, a software application called a reasoner[7, 8]
can automatically deduce all other facts that must logically follow
as well as find inconsistencies between asserted facts.
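
As a toy illustration of those two reasoner tasks (deriving entailed facts and
detecting inconsistencies), the following Python sketch closes a set of asserted
classes under a subclass hierarchy and then checks a disjointness axiom. It is
not an OWL reasoner and is not taken from the paper; the class names and the
SUBCLASS_OF and DISJOINT tables are hypothetical.

```python
# Toy illustration only: not an OWL reasoner, and all class names are hypothetical.
SUBCLASS_OF = {
    "Enzyme": "Protein",
    "Protein": "Macromolecule",
    "Metabolite": "SmallMolecule",
}
# Pairs of classes that may not share an individual.
DISJOINT = {frozenset({"Macromolecule", "SmallMolecule"})}


def inferred_types(asserted):
    """Close a set of asserted class names under the subclass hierarchy."""
    result = set(asserted)
    frontier = list(asserted)
    while frontier:
        parent = SUBCLASS_OF.get(frontier.pop())
        if parent is not None and parent not in result:
            result.add(parent)
            frontier.append(parent)
    return result


def is_consistent(asserted):
    """Return False if the inferred types include two disjoint classes."""
    types = inferred_types(asserted)
    return not any(pair <= types for pair in DISJOINT)


print(sorted(inferred_types({"Enzyme"})))        # entailed class memberships
print(is_consistent({"Enzyme"}))                 # no clash
print(is_consistent({"Enzyme", "Metabolite"}))   # clash via disjoint superclasses
```

A real OWL-DL reasoner handles far richer axioms than this, but the division of
labour is the same: the modeller asserts a few facts, and the software derives
the rest and flags contradictions.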

The knowledge about a system described in
SBML can be divided into two parts. Firstly, there is the biological knowledge. This includes information about the
biological entities involved and their biological. Secondly, there is the structural knowledge, describing how the
biological knowledge must be captured in well-formed documents suitable for
processing by applications. In the case of a high-quality SBML model, the structural knowledge required to create
such a model is tied up in three main locations:

  • The Systems Biology Markup Language (SBML[1][8])
    XML Schema Document (XSD[9]),
    describing the range of XML documents considered to be in SBML syntax,
  • The Systems Biology Ontology (SBO[2][10]),
    describing the range of terms that can be used to describe parts of the
    model in a way understandable to the community using the Open Biological
    Ontologies (OBO[11])
    format, and
  • The "Structures and Facilities for Model Definition"
    document[12]
    (hereafter referred to as the "SBML Manual"), describing many
    additional restrictions and constraints upon SBML documents, and the
    context within which SBO terms can be used, as well as information about
    how conformant documents should be interpreted.

From a knowledge-engineering point of view,
it makes sense to represent these sources of structural knowledge as part of a
single knowledge base. Although this current separation of documents could
appear arbitrary to a knowledge engineer, it is in fact well motivated by the
needs of the consumers of each type of information. The portion of the
knowledge codified in SBML transmits all of and only the information needed to
parameterise and run a computational simulation of the system. The knowledge in
SBO is intended to aid humans in understanding what is being modelled. The SBML
Manual is aimed at tools developers needing to ensure that software developed
is fully compliant with the specification.

Only two of these three sources of
structural knowledge are directly computationally amenable. SBML has an
associated XSD that describes the range of legal XML documents, which elements
and attributes must appear, and constraints on the values of text within the
file. SBO captures a term hierarchy containing human-readable descriptions and
labels for each term and a machine-readable ID for each term. Neither of these documents
contains much information about how XML elements or SBO terms should be used in
practice, how the two interact, or what a particular conformant SBML document
should mean to an end-user. The majority of information required to develop a
format-compliant model is in the SBML Manual, in formal English. Anything more
than simple programmatic steps, such as XML validation, can currently only be
done by manually encoding the English descriptions in the SBML Manual into
rules in a program. libSBML[13] is the reference implementation of this
procedure, capturing the process of validating constraints. Manual encoding
provides scope for misinterpretation of the intent of the SBML Manual, or may
produce code that accepts or generates non-compliant documents due to silent
bugs. In practice, these problems are
ameliorated by regular SBML Hackathons[14]
and the use of libSBML by many SBML applications. However, the need for a more
formal and complete description of the information in the SBML Manual becomes
more pressing as the community grows beyond the point where all of the relevant
developer groups can be adequately served by face-to-face meetings.
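
To make the idea of hand-encoding a prose rule concrete, here is a minimal
Python sketch. It assumes a cross-reference rule of the kind stated in English
in the SBML Manual: every species must name a compartment declared in the same
model. The XML fragment is simplified and namespace-free, and the function name
is hypothetical; this is not how libSBML implements its checks.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free SBML-like fragment used only for illustration.
SBML_FRAGMENT = """
<sbml>
  <model id="toy">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytosol"/>
      <species id="atp" compartment="nucleus"/>
    </listOfSpecies>
  </model>
</sbml>
"""


def species_with_unknown_compartment(xml_text):
    """Return ids of species whose compartment names no declared compartment."""
    root = ET.fromstring(xml_text)
    declared = {c.get("id") for c in root.iter("compartment")}
    return [
        s.get("id")
        for s in root.iter("species")
        if s.get("compartment") not in declared
    ]


violations = species_with_unknown_compartment(SBML_FRAGMENT)
print(violations)  # the species "atp" refers to an undeclared compartment
```

Each such rule has to be read from the prose and re-expressed by a programmer,
which is where misreadings and silent bugs can creep in.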

We find that some of these issues can be
avoided by combining the structural knowledge currently spread across three
documents in three formats into a single computationally amenable resource.
This method of constraint integration for all information pertinent to SBML
will require a degree of rigour that can only improve the clarity of the
specification. Once established, standard OWL tools can be used to validate and
reason over SBML models, to check their conformance and to derive any
conclusions that follow from the facts stated in the document, all without
manual intervention.

To address this proposition, we have
developed the Model Format OWL (MFO), implemented in OWL-DL and capturing the
SBML structure plus a representative sample of SBO and human-readable
constraints from the SBML Manual. We demonstrate that MFO is capable of directly
capturing many of the structural rules and semantic constraints documented in
the SBML Manual. The mapping between SBML documents and the OWL representation
is bi-directional: information can be parsed as OWL individuals from an SBML
document, manipulated and studied, and then serialized back out again as SBML.
We demonstrate feasibility with two simple, illustrative examples. In future,
we hope to use this as the basis for a method of automatically improving the
annotation of SBML models with rich biological knowledge, and as an aid to principled
automated model improvement and merging.
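
The round trip described above (parse, manipulate, serialise) can be pictured
with a minimal Python sketch. This excerpt does not describe MFO's actual
mapping, so everything here is an assumption: plain dictionaries stand in for
OWL individuals, the element tag stands in for the OWL class, and the function
names are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_individuals(xml_text):
    """Map each <species> element to a record standing in for an OWL individual."""
    root = ET.fromstring(xml_text)
    return [
        {"class": el.tag, "properties": dict(el.attrib)}
        for el in root.iter("species")
    ]


def to_xml(individuals):
    """Serialise the records back into a <listOfSpecies> element."""
    root = ET.Element("listOfSpecies")
    for ind in individuals:
        ET.SubElement(root, ind["class"], ind["properties"])
    return ET.tostring(root, encoding="unicode")


SOURCE = '<listOfSpecies><species id="glucose" compartment="cytosol"/></listOfSpecies>'

individuals = to_individuals(SOURCE)
# Manipulate the in-memory representation before writing it back out.
individuals[0]["properties"]["name"] = "D-glucose"
round_tripped = to_xml(individuals)
print(round_tripped)
```

The point of the sketch is only that nothing is lost between the two
representations: parsing the serialised output again yields the same records.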

The integration of all structural knowledge
for SBML models into a single resource creates a new style of model document
development, which we believe will greatly reduce the overheads associated with
computational transformations between biological knowledge and high-quality
systems biology models. MFO is not intended to be a replacement for any of the
APIs or software programs available to the SBML community today. It addresses
the very specific need of a sub-community within SBML that wishes to be able to
express their models in OWL for the purpose of reasoning, validation, and
querying. It has also been created as the first step in a larger data
integration strategy that will eventually encompass the biological as well as
structural knowledge present in SBML documentation and models.

[1] Hucka, M. et al.: The systems biology markup language (SBML): a medium for
representation and exchange of biochemical network models. Bioinformatics
(Oxford, England) 19 (2003) 524-531

[2] Le Novere, N.: Model storage, exchange and integration. BMC Neurosci 7
Suppl 1 (2006) S11

[3] Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to
OWL: The making of a web ontology language. J. of Web Semantics 1 (2003) 7-26

[4] Soldatova, L.N., King, R.D.: An ontology of scientific experiments. Journal
of the Royal Society, Interface / the Royal Society 3 (2006) 795-803

[5] Heja, G., Varga, P., Pallinger, P., Surjan, G.: Restructuring the
foundational model of anatomy. Studies in health technology and informatics 124
(2006) 755-760

[6] Heja, G., Surjan, G., Lukacsy, G., Pallinger, P., Gergely, M.: GALEN based
formal representation of ICD10. International journal of medical informatics 76
(2007) 118-123

Enjoyed this? To read the rest, please see the Journal of Integrative Bioinformatics
