# Chapter 3: Saint (Thesis)

[Previous: Chapter 2: SyMBA]
[Next: Chapter 4: MFO]

# Introduction

Quantitative modelling is at the heart of systems biology. Model description languages such as SBML [1] and CellML [2] allow the relationships between biological entities to be captured and the dynamics of these interactions to be described mathematically. Without a biological context to these dynamic models, they are only easily understandable by their creators. Interested third parties must rely on extra documentation such as publications or possibly-incomplete descriptions of species and reactions included in an element’s name or identifier to determine the relevance of a model to their work. Many dynamic models include only the mathematical information required to run simulations, and do not explicitly contain the full biological context. However, the efficient exchange, reuse and integration of models is aided by the presence of biological information in the model. If biological information about a model is added consistently and thoroughly, the model becomes useful not just for simulation, but also as an input in other computational tasks and as a reference for researchers.

Saint, an automated annotation integration environment, aids the modeller and reduces development time by providing extra information about a model in an easy-to-use interface [3]. Saint is a lightweight Web application which enables modellers to rapidly mark up models with biological information derived from a range of data sources. It supports the addition of basic annotation to SBML and CellML entities and identifies new reactions which may be valuable for extending a model. While the addition of biological annotation does not modify the behaviour of the model, the incorporation of new reactions or species adds new features that can later be built upon to potentially change the model’s output. Saint is a syntactic integration solution to the annotation of systems biology models, developed both to explore the data sources relevant to modellers and to provide a practical examination of data integration for the purpose of model annotation.

Section 1 describes the importance of linking biology to models, and why the syntax of those links is vital to their usefulness. Section 2 describes the technology used to implement Saint, while Section 3 focuses on how Saint retrieves model annotation and how it can be used to pick and choose the annotation to keep. Section 4 provides practical examples of annotating models using Saint. Finally, Section 5 describes the future plans for Saint, the collaborations that have been developed, and how Saint can be used with rule-based mediation (see Chapter 5).

## 1 Availability

Saint is freely available for use on the Web at http://saint.ncl.ac.uk. The Web application is implemented in GWT, with all major browsers supported. Saint has a Sourceforge.net project website (Note: http://saint-annotate.sourceforge.net) and wiki (Note: http://sourceforge.net/apps/mediawiki/saint-annotate/index.php?title=Main_Page). There is also a Newcastle University project page (Note: http://cisban-silico.cs.ncl.ac.uk/saint.html). The Sourceforge.net project website provides access to the Subversion repository (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/) as well as various documents for programmers. The Newcastle University website and the Sourceforge.net wiki are both user-oriented, and contain general information about the project, associated publications, screenshots and instructions for use.

Saint is licensed by Allyson Lister, Morgan Taschuk and Newcastle University under the LGPL (Note: http://www.gnu.org/copyleft/lesser.html). For more information, see LICENSE.txt in the Saint Subversion project. Installing Saint on a Web server requires an installation of LibSBML [4] and the CellML API [5]. Installation and running of Saint has been tested on 32-bit Ubuntu 8.10 and higher.

# 1 Linking biology to models

Figure 1 is one of a set of examples provided by SBGN to demonstrate the standard. Specifically, it is a diagrammatic representation of glycolysis. There are two layers to systems biology models such as this one: the mathematical model, where a term like “hexokinase” is the name of a variable in a model to be simulated; and the real, underlying biological system where hexokinase is a molecule inside a biological compartment which physically interacts with other entities. The mathematical model is self contained, and will work perfectly without information regarding hexokinase’s biological importance. However, mathematical models are created to aid the understanding of biology therefore links to the biology of the model should be created and maintained. In systems biology models, these links are added via annotations.

Model annotations are necessary to describe how the model has been generated and to define the meaning of the components that make up the model in a computationally-accessible fashion. The addition of biological annotations to a model is usually a manual, time-consuming process; there is no single resource encompassing all suitable data sources. A modeller usually has to visit many websites, applications and interfaces in order to identify relevant information, and may not be aware of all potentially useful databases. Because modellers add information manually, it is very difficult to annotate exhaustively.

However, without computationally-amenable biological annotations, it is difficult to create software tools that determine the biological pathway or reaction being modelled. The manual curation of such semantic information to non-curated models is a complex process. For instance, the advice for the curation of new BioModels database entries includes checks that all entities are what they say they are (e.g. that a species is a species and not a parameter), the creation of meaningful names for SBML entities and the association of the model with at least one publication (Note: http://biomodels.caltech.edu/curationtips). For many submitted models, the quality of the biological annotation is low, making it difficult to programmatically determine what biological pathway or pathways a model is describing.

While annotations are vital for model sharing and reuse, they do not contribute to the mathematical content of a model and are not critical to its successful functioning. The addition of biological knowledge must be performed quickly and easily in order to make annotation worthwhile to a modeller. A large number of tools (Note: http://sbml.org/SBML_Software_Guide) are available for the construction, manipulation and simulation of models, but there is currently a limited number of tools to facilitate rapid and systematic model annotation. While websites and applications specialising in integrating disparate data sources exist such as BioMart [6] and Pathway Commons, few are designed to put information directly into a model.

SBML and CellML can store both the mathematical model and the links to the biological information necessary to interpret the model. The difficulty lies in retrieving appropriate biological information quickly and with a minimal amount of effort. In the glycolysis example, UniProtKB [7] states that hexokinase is a protein with an enzymatic role in glycolysis (Note: http://www.uniprot.org/uniprot/P19367). STRING [8] and Pathway Commons provide a list of interactions and pathways in which hexokinase is involved. MIRIAM, SBO [9] and GO [10] provide Saint with the vocabulary needed to label hexokinase and describe its role in the model.

MIRIAM annotations use vocabularies and ontologies to document the biological knowledge in a systems biology model. In SBML, biological annotation is added to annotation elements according to the MIRIAM specification [11]. MIRIAM annotations link external resources such as ontologies and data sources to the model in a standardised way. MIRIAM also defines an annotation scheme, accessible via Web services, which specifies the format and set of standard data types which should be used for these URIs [12]. The use of the MIRIAM format provides a standard structure for explicit links between the mathematical and biological aspects of an SBML model.

# 2 Implementation

## 2.1 Integration methodology

Saint makes it easy to find information related to the domain of interest and save that information back to SBML. The biological annotation of models is facilitated by using query translation to present an integrated view of data sources and suggested ontological terms. As described in Section 1.6, query translation uses a mediator to distribute a single query across known data sources. When the results are retrieved, they are formatted according to the set of global Saint objects and sent to the user interface.

Within Saint, the query for each species is translated into a set of queries over Web services provided by each data source. The results are imported into global objects, which can then be further manipulated before presentation: querying and integration occur together. This method is suitable for a single-purpose application that runs pre-specified queries and then stores the query results generically. The combined query results are then displayed in the Web browser.

Query translation removes the need for a local database of information, and allows the easy addition of new data sources. However, changes to the source Web services or data structures require equivalent changes in the Saint query system. As changes to services are relatively infrequent and the size of the available datasets very large, query translation is the best option for this type of integration.

Saint performs syntactic integration by matching data to a particular identifier, name or MIRIAM URI through syntactic equivalence of the query term and the external data source. No semantic integration occurs with the standard version of Saint, although Section 5 discusses how Saint can be connected to the rule-based mediation knowledgebase, thus adding a semantically-aware data source.

## 2.2 Saint back-end

On the client side, Saint is a simple Web application which can be used immediately, without creating an account or requesting access. Behind the user interface, Saint is implemented in GWT (Note: http://code.google.com/webtoolkit/) and hosted on a Tomcat (Note: http://tomcat.apache.org/) server. For more information on GWT, see Chapter 2. GWT was chosen for a number of reasons:

• Asynchronous data transfer. Remote procedure calls can be used to query the server-side of the GWT application. As soon as the query is dispatched, control is handed back to the client. A notification is sent back to the client once the query to the server completes. This allows for multiple calls to be performed simultaneously and asynchronously. For instance, new annotation can be retrieved for many SBML species at the same time.
• Abstraction of the details of creating a user interface. Although the programmer writes code describing the look and feel of the panels and forms, the specific details of the user interface are hidden. The default skin of GWT Web applications also creates a clean and lightweight feel with very little cost in development time.
• Speed of development. GWT only requires a knowledge of standard packages and libraries in Java, so there is a low initial overhead for creating a working Web application. The majority of bioinformaticians are not experts in the creation of user interfaces, and by using GWT more time can be spent on functionality rather than the user interface.

# 3 Data retrieval and model annotation

This section describes both the structure of Saint and its annotation methodology using the curated BioModels entry BIOMD0000000087 [13] by Proctor and colleagues as an example, hereafter referred to as the telomere model. This model describes telomere uncapping in S.cerevisiae, a process implicated in ageing.

Figure 2 shows an overview of the Saint integration infrastructure. As soon as the user session begins, the full complement of official MIRIAM URIs are downloaded via EBI Web services. These official URIs are used within MIRIAM-compliant models to provide cross-references to many different data sources. By retrieving the URIs for each data source at the beginning of a session, Saint will remain up-to-date with respect to any changes made by MIRIAM developers.

A model is uploaded by copying either a CellML or SBML model into the input box and pressing the “Load Model” button. Saint then validates the model with LibSBML for SBML or the CellML API for CellML documents. If the model is valid the SBML species or the CellML variables are retrieved and displayed, together with any MIRIAM annotation already present. Obsolete MIRIAM annotations are updated, and Saint then uses a table to display the elements available for annotation (see Figure 3).

It is possible to restrict which types of annotation are retrieved by Saint. The “Advanced Annotation Options” shown at the bottom of Figure 3 can be disabled as necessary prior to starting the annotation process. For instance, a modeller who has a skeleton model might be interested in adding all types of annotation, while one who has already defined the reactions of interest might want to deliberately exclude the addition of new reactions. A model that has already been submitted to a database such as BioModels will already have annotation for each species. However, a modeller might want to extend such a model, and therefore might only be interested in new reactions. Additionally, modellers might not add GO annotation to a species if the species is already annotated with a UniProtKB accession, as that information can be found in the UniProtKB entry itself. However, if the curated SBO term is quite high level, providing a GO term with a more specific description could be helpful.

Both SBML and CellML models can be read by Saint, although the CellML export is not yet implemented as the method used to store MIRIAM annotation within CellML has not yet been finalised. The user can then remove any items they are not interested in annotating. For instance, some terms such as “sink” are modelling artefacts and do not correspond to genes or proteins. Therefore, the user would normally wish to delete this species from the search space to prevent any possible matches with actual biological species of a similar name. Deletion of a row in the table is accomplished by clicking on the “Hide Row” button for the appropriate row. Figure 3 shows a portion of the main table in Saint after the telomere model has been loaded.

Each of the species from the telomere model is added as a new row to the table, although the screenshot does not show the entire list. Those that already contain annotation will have a collapsible list allowing the user to browse the pre-existing annotation. For instance, Figure 4 shows how the Ctelo species has two items of evidence, the GO terms chromosome, telomeric region (GO:0000781) and telomere cap complex (GO:0000782). At this stage, the only tab visible within this row is the “Known” tab containing the pre-existing items of evidence.

Once the user is satisfied with the list of items to be annotated, the model is submitted using the “Annotate” button at the bottom of the table. Pressing “Clear” will instead unload the current model and return the user to the Saint homepage. As annotation queries complete, summaries are added to the main table. Each row of the table is submitted for annotation separately, which allows the query response to be sent back to the user as soon as it is received, even if other rows have not yet returned information.

In the example, the focus is on the annotation of Ctelo, Cdc13 and Mec1. After all other species are hidden from view, the “Annotate” button is pressed. Once annotation is added, a status message appears in the second column of the table, and an “Inferred” tab is added next to “Known” (see Figure 5). Clicking on the Inferred tab allows the user to view and modify the newly-suggested annotation. When hovering over a new item of annotation with the mouse, the evidence for that piece of annotation will appear. Information on each of the data sources where the annotation was found is displayed. The results can be customised by the user, who can examine the evidence and then discard any individual piece of inappropriate annotation. The new model is kept in sync with the choices being made, and can be exported at any time.

In systems biology models, separate species are often used to describe the various states of a single biological entity. For instance, in the telomere model the enzyme Exo1 is represented by Exo1A for the active state and Exo1I for the inactive. For the modeller, it is useful to name multiple species in this way to allow for differing amounts of protein present in each state. Such naming schemes caused difficulties for Saint because the provided name or identifier does not match the canonical name.

Naming ambiguities have been partially solved by parsing commonly-used systematic naming schemes into more likely query candidates. Systematic naming schemes are conventions used by researchers for naming species in alternative states. If no results are retrieved for the original name or identifier, Saint removes any appended “_I”, “_A”, “active” or “inactive” literals and tries the query again. In some cases, species names or identifiers are created as a hybrid of two protein names, e.g. “proteinA_proteinB”. In such cases, Saint suggests annotation based around both proteins. Having a name attribute as well as the id attribute helps, as generally the name used by modellers is a more precise identification of the biological entity.

As shown in Figure 5, in the telomere model example Saint found new annotation for Cdc13 and Mec1 but not for Ctelo. Figures 6 and 7 show a selected subset of the suggested annotation for Cdc13 and Mec1, respectively. Ctelo is the modeller’s private name for the capped state of a telomere, just as Utelo represents the uncapped telomere. As such, no hits are found in the source data. However, both Cdc13 and Mec1 are correctly identified, and the new annotation is displayed to the user. The first section of the results shows which ontology terms are suggested for the species. For SBO, a suggested term is provided as the default value in a pull-down list, allowing the user to either accept, change or delete the suggestion. Saint suggests the SBO term ‘macromolecule’ (SBO:0000245) as the best SBO match to a protein. This term was suggested both because ‘cdc13’ was found within UniProtKB and because the interaction set identified the species as a protein.

Because multiple MIRIAM annotations such as GO and UniProtKB cross-references are allowed within the annotation element, each annotation has its own row and may be individually removed. Each of these rows also shows the relationships available to link the annotation to the model. While SBO terms are added directly to SBML elements, GO terms and UniProtKB accessions are added to the MIRIAM annotation section.

As well as annotations, extensions to the model are also suggested. Binary interactions retrieved from both Pathway Commons and STRING can be made into basic SBML reactions. Also, alternative names for the species or variables are presented to the user by parsing the description lines of all retrieved UniProtKB entries. For each row, the depth of the green background is dependent upon the number of sources of evidence; the greater the sources, the darker the green. Reactions and associated species are added directly to the SBML model, while the majority of the other annotation is added as MIRIAM-compliant URIs.

Once the new annotation has been approved by the user, the model can be saved as a new SBML document. Clicking on the “Get Annotated Model Code” tab in Saint adds all approved annotation and presents the new SBML model for viewing and download.

# 4 Comparing Saint-annotated models with pre-existing annotation

Saint uses the name and identifier of an element as well as the existing resource URIs as starting points for adding annotation to models. Saint can provide new MIRIAM-compliant annotation with even a minimal amount of data, making it useful both for early-stage and already-annotated models. This section compares two Saint-annotated models against their original models as well as against stripped variants of those original models. In the first example, Saint annotates versions of the original model stripped of all biological annotation. The stripped model is then compared with the original model to asses the performance of Saint when presented with minimal information. In the second example, Saint annotates the original models. This comparison determines the extra annotation possible when Saint is provided with all available information.

## 4.1 Initial set-up of the comparisons

The models used for these comparisons are the telomere model as defined in Section 3,
BIOMD0000000189 [14] and BIOMD0000000032 [15]. The telomere model describes telomere uncapping in S.cerevisiae. Proctor and Gray created BIOMD0000000189, hereafter referred to as the p53-Mdm2 model, to model oscillations and variability in the p53-Mdm2 system in humans. Finally, in BIOMD0000000032 Kofahl and Klipp model the dynamics of the pheromone pathway in S.cerevisiae. For readability, BIOMD0000000032 is hereafter referred to as the pheromone model.

As well as making use of the intact, original version of each model, stripped versions were created by removing their biological annotation. Specifically, after each model is downloaded from BioModels, the sboTerm and the annotation element of each species is removed. Even though SBO terms, species names and annotation elements are added by Saint in these comparisons, names and identifiers were deliberately not removed. Without these fields, Saint would lose those strings as query terms. Generally, modellers use either descriptive id attributes without names, or descriptive name attributes together with an arbitrary identifier scheme. Though all URIs within the annotation elements are stripped from the model, only a proportion of those URIs are currently supported by Saint and therefore only those supported types can be added back by Saint. These stripped models are then used as a starting point for Saint annotation for the first set of comparisons.

The reactions added by Saint are new and as such cannot be usefully compared against the original models. This functionality of Saint is therefore not compared. Additionally, species which describe a composite or complex formed from multiple proteins will not necessarily behave in the same way as its constituents. As such, only the SBO terms and UniProtKB cross-references suggested by Saint will be retained for comparison in those instances. An expert modeller would be able to correctly choose the appropriate name and GO terms, but for the purposes of this example they were simply removed from consideration.

To perform the comparison, the stripped models are loaded into Saint and the new annotation is retrieved. After the new annotation is displayed, basic editing of the annotation is performed, including the choosing of sensible species names from the list and the removal of obvious false positive annotation. These choices mimic the expected behaviour of the modeller using the application. Then the newly-annotated model is downloaded and compared to the original, intact model.

Two phrases are used throughout these comparisons: useful species and Saint-supported URIs. Species with informative id or name attributes in the original model are termed useful species. Informative identifiers and names closely match actual protein or gene names, and are therefore suitable for querying external resources. Sometimes, if an explicit statement must be made as to which of either the identifier or name is useful, the terms useful identifier or useful name will be used. There is a limit to how much information Saint can pull from non-informative identifiers and names. In these cases, it is important to use any MIRIAM information already stored in the annotation elements. This style of annotation is covered Section 4.3. Saint-supported URIs are MIRIAM URIs, such as those for GO and UniProtKB, which are parsed by Saint as well as added during the annotation process. Saint-supported URIs are therefore, by definition, those which can be compared both before and after new annotation has been added.

All original, stripped and re-annotated versions of the models used in this example are available from the Saint Subversion repository (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/trunk/docs/supplementary-data).

## 4.2 Annotating stripped models

The first comparison illustrates the benefits of searching based on the id or name attributes of an SBML species element. The stripped versions of the telomere model and the p53-Mdm2 model, both of which contained a number of useful species, were annotated by Saint and then compared against their originals. Even within a single model, some names and identifiers can be useful to Saint while others are not.

### Annotation of the stripped telomere model

Basic information on the annotation of both the original and re-annotated the telomere model is available in Table 1, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 2.

 Species 55 Species with any annotation† 51/55 Useful species 15 Species with names 0 Species with SBO terms 0 Saint-supported URIs in useful species 14/15 Species with names after re-annotation 14 Species with SBO terms after re-annotation 15 URIs after re-annotation 163 URIs recovered from useful species 14/14
Table 1: Basic information on the annotation present in the original version of the telomere model and in bold for the re-annotated version. This model had no species name attributes, and therefore Saint relies on the id attribute. There were also no SBO terms in the original model. Although there are 15 useful species, only 14 contain either a Saint-supported URI, a name or an SBO term; the fifteenth (the “RPA” species) has a useful identifier but contains only a PIRSF URI unsupported by Saint. †Any annotation refers to MIRIAM URIs, names and SBO terms.

The total number of Saint-supported URIs has increased from 14 to 163, with further addition or subtraction of URIs possible if required by the modeller. In addition, Saint recovered all 14 of the URIs from the useful species. The original model contained no SBO terms or species names, and Saint has added 15 SBO Terms and 14 species names, which is almost complete coverage of the useful species.

The species “ssDNA” is an example of incorrect annotation by Saint due to a lack of available information. ”ssDNA“ is difficult for Saint to filter: it is not a protein, but can be used as a phrase within protein names. Currently, the Saint user must spot these errors and remove the suggested annotation using the Saint interface prior to downloading the model. The one useful species which was not completely annotated with SBO term, species name and URIs was “RPA”. A large amount of annotation was suggested by Saint, but all except the SBO term was manually rejected. In the original annotation, the “RPA” annotation is a single URI which points to the PIR family (PIRSF) “replication protein A, RPA70 subunit” (Note: http://pir.georgetown.edu/cgi-bin/ipcSF?id=PIRSF002091). This URI is marked as having an “isVersionOf” relationship the RPA species, rather than as an exact “is” match, as the PIRSF URI is linked to mouse, and not yeast. A search for an equivalent for yeast in UniProtKB shows a number of yeast subunits. In theory, an appropriate choice would be made by the modeller. In PIRSF002091, the equivalent yeast UniProtKB accession is listed as P22336, which is “ Replication factor A protein 1”, a.k.a. “RFA1”. This leads to the conclusion that the species in the model was named after the mouse protein and not the yeast protein. As “RFA” is not a name that matched the original ID (“RPA”), there is no way to retrieve the correct information from the data sources.

In conclusion, in comparing the annotation on the useful species for this model, all Saint-supported URIs were recovered, and new URIs, SBO terms and species names were added. A detailed listing of all new and recovered annotation present in the useful species are shown in Table 2.

Table 2: A detailed breakdown of recovered (in the “MATCH” column) and new (in the “NEW” column) annotation by Saint for the telomere model. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms added as MIRIAM URIs are omitted for brevity.

### Annotation of the stripped p53-Mdm2 model

Basic information on the annotation of both the original and re-annotated p53-Mdm2 model is presented in Table 3, together with a summary of the comparison between the two models. A detailed listing of the modifications made to the model is available in Table 4. The total number of Saint-supported URIs has increased from 7 to 89, and of the original seven Saint-supported URIs present in the original useful species, five were recovered. The other two URIs were from related species which were incorrectly identified, and are described below. The 84 remaining URIs added to the model are GO URIs. The original model contained only two SBO terms, both of which were recovered. Additionally, two new SBO terms were added. Finally, there were no species names in the original model, and two new names were added with Saint.

In contrast to the telomere model, the p53-Mdm2 model contains species whose id attribute was created as a mixture of two protein names, e.g. “Mdm2_p53”. Of the six useful species, four have identifiers formed in this way. For each of these four, Saint suggested annotation based around both of their associated proteins. However, as a composite of two proteins will not necessarily behave in the same way as its constituents, all annotation except for SBO terms and UniProtKB URIs was manually removed. An expert modeller would be able to correctly choose the appropriate name and GO terms. The UniProtKB URIs were kept, but with a relationship of hasPart.

Two species, “ARF” and “ARF_Mdm2”, were incorrectly identified. In the original model, the both species have an exact “is” match with UniProtKB Q8N726, whose names and synonyms are “Cyclin-dependent kinase inhibitor 2A, isoform 4”, “p14ARF” and “p19ARF”. The annotation procedure by Saint only had the id to use, however, as the MIRIAM URIs were removed in the stripped models. As such, the correct entry could not be returned due to the mismatch of the query term with the UniProtKB names. However, a number of alternative suggestions were provided by Saint.

 Species 18 Species with any annotation† 13 Useful species 6 Species with names 0 Species with SBO terms 2 Saint-supported URIs in useful species 7 Species with names after re-annotation 2 SBO terms after re-annotation 6 SBO terms recovered from useful species 2/2 URIs after re-annotation 89 URIs recovered from useful species 5/7
Table 3: Basic information on the annotation present in the original version of the p53-Mdm2 model and in bold for the re-annotated version. This model has no species name attributes, and therefore Saint relies on the id attribute. The two unrecoverable URIs are for the “ARF” species and are described in the main text. †Any annotation refers to MIRIAM URIs, names and SBO terms.

 SPECIES ID MATCH CHANGED Old New E3 ubiquitin-protein ligase Mdm2 Mdm2 SBO:0000245 urn:miriam:uniprot:Q00987 Cellular tumor antigen p53 p53 SBO:0000245 urn:miriam:uniprot:P04637 SBO:0000245 Mdm2_p53 urn:miriam:uniprot:P04637 urn:miriam:uniprot:Q00987 SBO:0000245 Mdm2_mRNA urn:miriam:uniprot:Q00987 SBO:0000245 ARF urn:miriam:uniprot:Q8N726 SBO:0000245 ARF_Mdm2 urn:miriam:uniprot:Q00987 urn:miriam:uniprot:Q8N726
Table 4: A detailed breakdown of annotation applied by Saint for the p53-Mdm2 model. Recovered annotation is shown in the MATCH column. If an item of annotation is new, it will only be present as a “New” annotation. If an item of annotation has been removed, it will only be present in as an “Old” annotation. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms are omitted for brevity.

## 4.3 Annotating an intact model

Two annotation runs are performed on the pheromone model in this example. First, a stripped version is re-annotated and then the original, intact model is annotated. This example illustrates the benefits of searching based on the resource URI of a species element. Comparison using URIs is suitable primarily for those models whose identifiers and names are uninformative with respect to their biologically-relevant names or descriptions.

As in the previous examples, species composed of more than one protein had all of their suggested GO URIs removed. The authors of the model would themselves be able to choose the correct terms, but for simplicity GO URIs were only kept when there was only a single UniProtKB URI per species to ensure no false positives were retained.

### Annotation of the stripped pheromone model

This section summarises the annotation received by the stripped version of this model. Basic information on the annotation of both the original and re-annotated pheromone model model as well as a summary of the comparison between them is presented in Table 5. A detailed listing of the modifications made to the model is available in Table 6.

The total number of Saint-supported URIs has increased from 14 to 174. Saint recovered all 14 of the URIs from the useful species, as shown in Table 5. 15 new SBO terms were added where none had been present previously. The number of species names increased from 10 to 21 when the new annotation was applied. Of the ten original names, four names were changed to be more descriptive and six were unmodified.

With some species such as “$\alpha$ factor” it was unclear which $\alpha$ factor was intended from the original model, as no further annotation was provided. Therefore one of the most likely candidates was chosen, and only those GO URIs which had the most sources of evidence were kept. Names for complexes such as “complexB” were completely uninformative, and did not retrieve any results.

 Species 37 Species with any annotation† 36 Useful species 15 Species with names 10 Species with SBO terms 0 Saint-supported URIs in useful species 14 Species with names after re-annotation 21 Species names changed 4/10 New species names 11 SBO terms after re-annotation 15 URIs after re-annotation 174 URIs recovered from useful species 14/14
Table 5: Basic information on the annotation present in the stripped version of the pheromone model, and in bold for the re-annotated version. This model has no SBO terms in its original form, although it did contain 10 names, 6 of which were recovered and 4 of which were modified. †Any annotation refers to MIRIAM URIs, names and SBO terms.

Table 6 shows that a number of UniProtKB accessions were not recovered when re-annotating the stripped model. This is likely due to the nature of the annotations themselves. For instance, Gabc has three UniProtKB URIs in the original model, each of which were associated with the species via the hasPart MIRIAM qualifier. These three proteins are subunits of a larger complex, and neither the complex nor any of the subunits have protein names matching the string literal “Gabc”. As such, when the stripped version was annotated, there was no way for the species to be reconnected to these three proteins. The lack of similarity of the other identifiers to their protein names caused a similar loss of UniProtKB URIs for their species.

 SPECIES ID MATCH CHANGED Old New $\alpha$-factor Mating factor alpha-1 alpha SBO:0000245 urn:miriam:uniprot:P01149 Pheromone alpha factor receptor Ste2 SBO:0000245 urn:miriam:uniprot:P06842 Ste2active Pheromone alpha factor receptor Ste2a SBO:0000245 urn:miriam:uniprot:P06842 Protein STE5 Ste5 SBO:0000245 urn:miriam:uniprot:P32917 Serine/threonine-protein kinase STE11 Ste11 SBO:0000245 urn:miriam:uniprot:P23561 Serine/threonine-protein kinase STE7 Ste7 SBO:0000245 urn:miriam:uniprot:P06784 Mitogen-activated protein kinase FUS3 Fus3 SBO:0000245 urn:miriam:uniprot:P16892 Serine/threonine-protein kinase STE20 Ste20 SBO:0000245 urn:miriam:uniprot:Q03497 Protein STE12 Ste12 SBO:0000245 urn:miriam:uniprot:P13574 Ste12active Protein STE12 Ste12a SBO:0000245 urn:miriam:uniprot:P13574 Barrierpepsin Bar1 SBO:0000245 urn:miriam:uniprot:P12630 Bar1active Barrierpepsin Bar1a SBO:0000245 urn:miriam:uniprot:P12630 Cyclin-dependent kinase inhibitor FAR1 Far1 SBO:0000245 urn:miriam:uniprot:P21268 Cell division control protein 28 Cdc28 SBO:0000245 urn:miriam:uniprot:P00546 Protein SST2 Sst2 SBO:0000245 urn:miriam:uniprot:P11972 urn:miriam:uniprot:P08539 Gabc urn:miriam:uniprot:P18851 urn:miriam:uniprot:P18852 GaGTP urn:miriam:uniprot:P08539 urn:miriam:uniprot:P18851 Gbc urn:miriam:uniprot:P18852 urn:miriam:bind:50058 GaGDP urn:miriam:uniprot:P08539 Fus3PP urn:miriam:uniprot:P16892 Bar1aex urn:miriam:uniprot:P12630 Far1PP urn:miriam:uniprot:P21268 Far1U urn:miriam:uniprot:P21268
Table 6: A detailed breakdown of Saint annotation for the pheromone model, starting from a stripped model. Recovered annotation is shown in the MATCH column. New and removed annotation are in the “New” and “Old” columns, respectively. Modified annotation items have their original values in the “Old” column and their new values in the “New” column. Only names, SBO terms and Saint-supported URIs are shown in this table, with GO terms omitted for brevity. The choice among possible species name is at the discretion of the user.

### Annotation of the intact pheromone model

This section summarises the improvements in annotation gained by making use of the MIRIAM URIs already present in the original model. When annotating the stripped model, there were a number of species (e.g. Fus3PP) which had no annotation retrieved as the identifier and name were unsuitable. In these cases, making use of the URIs present in the intact, original model increases annotation coverage. For instance, UniProtKB URIs can unambiguously identify species with uninformative identifiers such as GaGTP and complexC, allowing Saint to suggest appropriate species names, GO URIs and SBO terms.

Basic information on the annotation of the intact pheromone model by Saint is presented in Table 7, together with a summary of how the new annotation compares both with the original model and the re-annotated stripped model previously described (see Section ). A detailed listing of the modifications made to the model is available in Table 8.

Overall, Saint produced more annotation when starting from the intact model versus the stripped model. The total number of Saint-supported URIs increased from 101 to 325, as compared with 174 URIs added to the stripped version of the model. While the original model did not contain SBO terms and the re-annotation of the stripped model added only 15, the annotation of the intact model produced a total of 21 terms. Further, while the re-annotation of the stripped model was able to add 11 more species names, a further two names were added when starting with an intact model.

 Original model Stripped model annotated by Saint Original model annotated by Saint Species with names 10 21 23 SBO terms 0 15 21 Saint-supported URIs 101 174 325
Table 7: The amount of annotation added by Saint when annotating the original version of the pheromone model and when annotating its stripped form. More annotation is added by Saint when the original model is annotated as compared to annotating the stripped model.

 SPECIES ID MATCH CHANGED Old New G$\alpha$GTP Guanine nucleotide-binding protein alpha-1 subunit GaGTP SBO:0000245 urn:miriam:uniprot:P08539 G$\alpha$GDP Guanine nucleotide-binding protein alpha-1 subunit GaGDP SBO:0000245 urn:miriam:uniprot:P08539 Mitogen-activated protein kinase FUS3 Fus3PP SBO:0000245 urn:miriam:uniprot:P16892 Bar1activeEx Barrierpepsin Bar1aex SBO:0000245 urn:miriam:uniprot:P12630 Cyclin-dependent kinase inhibitor FAR1 Far1PP SBO:0000245 urn:miriam:uniprot:P21268 Far1ubiquitin Cyclin-dependent kinase inhibitor FAR1 Far1U SBO:0000245 urn:miriam:uniprot:P21268
Table 8: A detailed breakdown of recovered (in the MATCH column) and new or modified annotation (in the CHANGED columns) by Saint for the pheromone model, starting from an intact model. Saint discovers all annotation found in Table 6 plus the annotation shown in this table. Only names, SBO terms and Saint-supported URIs are shown in this table. Additionally, the large number of new GO terms are omitted for brevity. Whether the original or new species name is used is entirely at the discretion of the user, and these particular examples are not necessarily the only choice a user could make.

# 5 Discussion

Saint is a syntactic data integration application for the enhancement of systems biology models. It was developed as an interactive Web tool to annotate models with new MIRIAM resources and reactions, keeping track of data provenance so the modeller can make an informed decision about the quality of the suggested annotation. The system makes it easy for modellers to add explicit biological knowledge to their models, increasing a model’s usefulness both as a reference for other researchers and as an input for further computational analysis.

While Saint allows the user to see which data source provided the prospective annotation, annotation accepted by the user does not itself have any provenance information within the completed model. At the moment, both SBML and CellML are unable to make statements about annotations. The proposed annotation package for SBML Level 3 will address this and other issues [16], while the updated metadata specification for CellML will also allow such provenance statements (Note: http://www.cellml.org/specifications/metadata/mcdraft).

Saint can save time as well as add information that might not otherwise be introduced without the help of specialists like BioModels curators. Saint can be useful either as a first-pass annotation tool or as a way of adding useful cross-references and possible new reactions. Queries in Saint are sent to the data sources asynchronously, letting the user work on each result as it is returned, without having to wait for a complete response. With Saint, the user examine retrieved annotation, chooses the required annotations, and then moves on to the next task. In general, Saint returns a large amount of prospective annotation for the modeller to examine. Not all of it will be useful; many suggested reactions will be beyond the scope of the uploaded model, and suggested ontology terms might be too general to be useful. However, Saint provides modeller-directed annotation, where some level of filtering is expected.

By examining how Saint behaves with both stripped and intact models, we have isolated the two main query types used within the application: id / name querying and querying based on pre-existing MIRIAM resources. The use of URIs to retrieve information is an effective method of bypassing the variable quality of identifiers and names.

It is difficult for automated procedures such as Saint to disambiguate the identifiers and names used to identify species. While a number of checks and automated modifications can and are performed, many different naming conventions are used. Saint performs well with a minimal amount of data, thus allowing annotation of models that are still at an early stage of development. However, the use of a common naming scheme would improve the coverage of new annotations.

## 5.1 Similar work

While there are a number of tools available for manipulating and validating SBML (e.g. LibSBML), simulating SBML models, analysing simulations and running modelling workflows (e.g. Taverna), Saint is the first to provide basic automatic annotation of SBML models in an easy-to-use GUI. The purpose of Saint is to aid the researcher in the difficult task of information discovery by seamlessly querying multiple databases and providing the results of that query within the SBML model itself. By providing a modelling interface to existing data sources, modellers are able to add valuable information to models quickly and simply.

A few related applications are available for automating the retrieval and integration of data for the annotation of SBML models. SemanticSBML [17] provides MIRIAM annotations via a combination of data warehousing and query translation via Web services as part of a larger application. The Java library libAnnotationSBML [18] uses query translation to provide annotation functionality with a minimal user interface. Unlike libAnnotationSBML, Saint is accessible through an easy-to-use Web interface and unlike both tools is unique in its ability to add new reactions and associated species.

## 5.2 Collaborations and future plans

Saint is under active development. Future enhancements will include the addition of new data sources and ontologies, annotation of elements other than species and reactions and complete support for other modelling formalisms such as CellML. Since early 2009, the CellML developer and curator community has been in active collaboration with the Saint group. A CellML developer has recently been assigned to complete the CellML integration with Saint. Additionally, the collaboration is helping to finalise the way MIRIAM annotations are described in CellML and the addition of helper code for such annotations in the CellML API. They have provided suggestions on how the Saint interface could be modified, and these and other requirements have been formalised in a Saint requirements document (Note: http://saint-annotate.svn.sourceforge.net/viewvc/saint-annotate/trunk/docs/SaintRequirements.pdf). This requirements document contains a full, itemised list of all of the future plans for Saint.

## 5.3 Rule-based mediation and Saint

Saint is a useful, lightweight syntactic data integration environment for systems biology. It provides not only a simple user interface to allow the modeller to integrate data and export it into their models, but is also a successful method of enhancing systems biology models through data integration. However, the work described in this thesis goes further, investigating semantic as well as syntactic methods of integration for the annotation of systems biology models.

Chapter describes the application of semantic data integration to enhance systems biology model annotation. This integration methodology results in the creation of a domain ontology populated with data from multiple data sources. In future, Saint could be modified to use this populated ontology as a data source and query it in a manner similar to the already existing data sources. This would combine the user-friendly interface of Saint with a semantically-integrated knowledgebase.

# Bibliography

[1]
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, , the rest of the SBML Forum:, A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[2]
Catherine M. Lloyd, James R. Lawson, Peter J. Hunter, and Poul F. Nielsen. The CellML Model Repository. Bioinformatics, 24(18):2122–2123, September 2008.
[3]
Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
[4]
Benjamin J. Bornstein, Sarah M. Keating, Akiya Jouraku, and Michael Hucka. LibSBML: an API Library for SBML. Bioinformatics, 24(6):880–881, March 2008.
[5]
Andrew Miller, Justin Marsh, Adam Reeve, Alan Garny, Randall Britten, Matt Halstead, Jonathan Cooper, David Nickerson, and Poul Nielsen. An overview of the CellML API and its implementation. BMC Bioinformatics, 11(1):178+, April 2010.
[6]
Damian Smedley, Syed Haider, Benoit Ballester, Richard Holland, Darin London, Gudmundur Thorisson, and Arek Kasprzyk. BioMart–biological queries made easy. BMC genomics, 10(1):22+, 2009.
[7]
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
[8]
Lars J. Jensen, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Peer Bork, and Christian von Mering. STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic acids research, 37(Database issue):D412–416, January 2009.
[9]
Nicolas Le Novère. Model storage, exchange and integration. BMC neuroscience, 7 Suppl 1(Suppl 1):S11+, 2006.
[10]
Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
[11]
Nicolas L. Novere, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[12]
Camille Laibe and Nicolas Le Novere. MIRIAM Resources: tools to generate and resolve robust cross-references in Systems Biology. BMC Systems Biology, 1(1):58+, 2007.
[13]
C. J. Proctor, D. A. Lydall, R. J. Boys, C. S. Gillespie, D. P. Shanley, D. J. Wilkinson, and T. B. L. Kirkwood. Modelling the checkpoint response to telomere uncapping in budding yeast. Journal of The Royal Society Interface, 4(12):73–90, February 2007.
[14]
Carole Proctor and Douglas Gray. Explaining oscillations and variability in the p53-Mdm2 system. BMC Systems Biology, 2(1):75+, 2008.
[15]
Bente Kofahl and Edda Klipp. Modelling the dynamics of the yeast pheromone pathway. Yeast, 21(10):831–850, July 2004.
[16]
Dagmar Waltemath, Neil Swainston, Allyson Lister, Frank Bergmann, Ron Henkel, Stefan Hoops, Michael Hucka, Nick Juty, Sarah Keating, Christian Knuepfer, Falko Krause, Camille Laibe, Wolfram Liebermeister, Catherine Lloyd, Goksel Misirli, Marvin Schulz, Morgan Taschuk, and Nicolas Le Novère. SBML Level 3 Package Proposal: Annotation. Nature Precedings, (713), January 2011.
[17]
Falko Krause, Jannis Uhlendorf, Timo Lubitz, Marvin Schulz, Edda Klipp, and Wolfram Liebermeister. Annotation and merging of SBML models with semanticSBML. Bioinformatics, 26(3):421–422, February 2010.
[18]
Neil Swainston and Pedro Mendes. libAnnotationSBML: a library for exploiting SBML annotations. Bioinformatics, 25(17):2292–2293, September 2009.