Data Integration

Background: Data integration methodologies for systems biology (Thesis 1.6)

Data integration methodologies for systems biology

The amount of data available to systems biologists is constantly increasing. However, its integration remains an ongoing challenge due to the heterogeneous nature of biological information, the multitude of data sources and the variability of data formats both in syntax and semantics [1]. Very broadly, integration methodologies resolve this heterogeneity by loading multiple data sources into an integration interface and then providing access to that interface. Integrated data sources, whether distributed or centralised, allow querying of multiple data sources in a single search.

This section provides a review of those data integration structures and methodologies relevant to systems biology and to the work described in this thesis. Section 1 describes the three main obstacles facing researchers attempting to integrate systems biology data. Section 2 describes the difference between structuring the integration methodology syntactically and semantically. The methods of integration described in this section can be classified along two axes: firstly, the mapping type (Section 3) determines how the integration interface is connected to the underlying data sources; and secondly, the integration interface (Section 4) determines how the data is retrieved and structured for the end application or user. The implications for the work presented in this thesis and the design decisions made with regard to integration structure and methodology are described in Chapter 5.

There are a number of existing reviews of data integration methodologies in the life sciences as a whole. Joyce and Palsson [2] review the algorithms for the integration and generation of networks from omics data, and the benefit of combining different pairs of such datasets. Stein [3] and Sujansky [4] present a more global view of data integration in biology and biomedicine, respectively, without touching on the specific difficulties for systems biology. Philippi and Koehler [5] discuss the practical, political and social hurdles to integrating high throughput and primary databases in systems biology, but do not evaluate specific tools. Harland and colleagues provide an up-to-date summary of the challenges facing both industry and academia in data integration, and how the use of controlled vocabularies can address these challenges [1].

1 Data integration challenges in the life sciences

Experimental data is noisy, incomplete, of different granularities, changeable, context dependent, does not always have a long lifetime and is not necessarily freely available. If data integration is to surmount these hurdles and be successful in the long term, three practical, social and political obstacles must be overcome.

  • The format challenge. Representational heterogeneity occurs when the underlying databases have completely different methods of representing data [4]. Resolving representational heterogeneity through a unified representation is difficult, with misunderstandings possible if matching terms are not semantically equal [5]. Even if the underlying databases contain semantically identical knowledge, there is no guarantee that they share a common representation of that knowledge [4]. Further, the lack of attribution of annotation must be resolved [5].
  • The availability challenge. The availability and lifetime of the data must be considered, and Web access should be provided for all data sources. Partial downloads of databases should be available to allow manageable offline use of data [5].
  • The social challenge. Social and political factors can often be central to the success of a data source. Integration can be hampered by a lack of communication and understanding among database providers and biologists, and by restrictive licensing methods [5]. Local autonomy of data resources may result in a lack of knowledge about any integration efforts as well as complete independence from those efforts [4].

Although a detailed discussion of the availability and social challenges is beyond the scope of this thesis, mapping types such as hybrid mediator mapping (see also Section ), which enforce the use of a common schema or ontology across multiple research groups, minimise format and social challenges and allow researchers to concentrate on availability issues. In such cases, format-based obstacles are minimised as all data sources are part of a collaborative project. However, a common structure for data does not in any way ensure that the associated data repositories survive past the lifetime of their grant or project. Prospective users often find only broken links or an outdated database [6, 7, 8]. These problems are exemplified by the former Integrated Genome Database which, as Stein reported, collapsed due to database churn after about one year [3].

Most biological databases are neither based on identical schemas nor refer to a common ontology; it would be impractical for a nucleotide sequence database, for example, to have the same data model as a database that stores mass spectrometry results. Hybrid mediator mapping is therefore not always possible, especially when resolution of social challenges via a shared schema across research groups is simply not available. In such cases, format challenges must be resolved through the integration of heterogeneous databases. In systems biology, heterogeneity needs to be addressed not only for practical management purposes, but also to build a more systems-level view of the organisms under study [9]. The use of community-endorsed data standards for guiding the content, syntax and semantics of life-science data (see Section 1.4) provides partial solutions to format and social obstacles. However, there will always be new data formats and new experiment types, preventing format challenges from being fully resolvable through common data standards alone. Further, development of a common structure for related data at an early stage is often hampered by political and social factors.

When common data standards have been used as much as possible, and a hybrid mediator mapping approach is not feasible, other methods for integrating heterogeneous data sources must be chosen. The work described in this thesis directly addresses the format challenge in a number of ways. An archive for storing experimental metadata in a common format has been developed and a Web application integrating data from disparate sources for the purposes of systems biology model annotation has been created. Additionally, and of greatest relevance to this section, a semantic integration methodology has been developed to resolve representational heterogeneity. The end result is a unified ontology-based view of multiple data sources which can be used to add knowledge to systems biology models.

2 Structuring integrated data

Traditional methods of data integration map multiple schemas to an integration interface based on the common structures, such as table names or fields in database entries, within each format. Such methods tend to resolve syntactic heterogeneity but do not address semantic heterogeneity. The problems of, and historical approaches to, syntactic and semantic data integration have been well described [4, 10]. Whereas traditional methods of data integration rely on the mapping of component data models to a single data model, semantic data integration incorporates computationally-accessible meanings, or definitions of the data types to be integrated. In general, the integration structures used in syntactic integration are schemas, while those used in semantic integration are ontologies.

Syntactic heterogeneity occurs when data of interest is available in different formats, such as SBML or BioPAX, or in XML as well as flat-file format. Data in multiple formats can be aligned to a global schema by linking structural units such as XSD components or table and row names. The mapping of syntaxes to a global schema in syntactic integration approaches tends to be hard-coded for the task at hand, and therefore not easily reusable in other projects. Further, concepts between the source and global schema are often linked based on syntactic similarity, which does not necessarily account for differences in the meanings of those concepts. For instance, a protein in BioPAX is strictly defined as having only one polypeptide chain, while a protein in UniProtKB can consist of multiple chains. Such semantic inequalities in syntactically identical terms (in this example, “protein”) can result in errors in data integration, creating possible inconsistencies in how concepts are defined among those data sources [5].

A high level of semantic heterogeneity makes mapping difficult; to be successful, extra information about the entities of interest might be required. Semantic data integration resolves the syntactic heterogeneity present in multiple data models as well as the semantic heterogeneity among similar concepts across those data models. Semantic integration can make use of a richer description of biology than is possible with syntactic methods, and makes that description accessible to machines and humans.

The example of differing protein definitions described earlier in this section for BioPAX and UniProtKB can be further extended to illustrate the practical differences between syntactic and semantic integration. In traditional data integration methods, two database schemas may contain a “Protein” table, but if the developers’ definitions of “Protein” are not identical, it is difficult to determine this difference programmatically. A syntactic integration project using these two schemas as data sources may erroneously mark them as equivalent tables. In semantic integration, if the two data sources model their Protein classes correctly, the differences in their meaning would be clear both programmatically and to a human. Identification of semantic differences is the first step in their resolution. For instance, one possible solution would be the creation of a Protein superclass that describes a protein in a high-level, generic way. The two source definitions could then be modelled as children of that Protein superclass.
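The contrast can be sketched in a few lines of Python. This is a minimal illustration only: the `ClassDef` structure, the `max_chains` attribute and the `GenericProtein` superclass name are hypothetical simplifications of the definitions discussed above, not the actual BioPAX or UniProtKB models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClassDef:
    """Toy class definition: a label plus one machine-readable constraint."""
    source: str
    label: str
    max_chains: Optional[int]  # None means no upper bound on polypeptide chains
    parent: Optional[str] = None


def syntactic_match(a: ClassDef, b: ClassDef) -> bool:
    """Name-only comparison, as a schema-level mapping would perform."""
    return a.label.lower() == b.label.lower()


def semantic_match(a: ClassDef, b: ClassDef) -> bool:
    """Labels must agree and so must the formal constraint."""
    return syntactic_match(a, b) and a.max_chains == b.max_chains


biopax_protein = ClassDef("BioPAX", "Protein", max_chains=1)
uniprot_protein = ClassDef("UniProtKB", "Protein", max_chains=None)

# One possible resolution: a generic superclass with both definitions as children.
generic_protein = ClassDef("mediator", "GenericProtein", max_chains=None)
children = [
    ClassDef(p.source, p.label, p.max_chains, parent=generic_protein.label)
    for p in (biopax_protein, uniprot_protein)
]
```

Here the name-only comparison treats the two classes as equivalent, while the comparison that also inspects the constraint does not. Both definitions can still be placed under the generic superclass.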

Often, ontologies or other Semantic Web tools such as RDF [11] are used both to provide a richer model of biological data and to perform the integration. The Semantic Web is a network of data which acts as a framework for sharing, reuse and integration of that data. It fulfils its purpose via common formats for integration and via a common language for relating that data to real-life objects. Ruttenberg and colleagues see the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; any life science field where there are large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies [12]. While RDF is a Semantic Web technology, making use of this format is not enough on its own to resolve semantic heterogeneity. Often OWL ontologies, either on their own or in conjunction with RDF, are used to address semantic differences in data.
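A small sketch may help show why a shared format alone is insufficient. Triples are represented here as plain Python tuples rather than with an RDF library, and every identifier (`ex:`, `srcA:`, `srcB:`, `med:`) is a hypothetical placeholder. The `sub_property_of` table stands in for the kind of statement a shared ontology could supply.

```python
# Triples as (subject, predicate, object); all identifiers are hypothetical.
triples = [
    ("ex:P1", "srcA:encodedBy", "ex:geneA"),  # vocabulary of source A
    ("ex:P2", "srcB:hasGene", "ex:geneB"),    # vocabulary of source B
]


def objects(triples, predicate):
    """Query by exact predicate: same format, but no shared meaning."""
    return sorted(o for _, p, o in triples if p == predicate)


# A shared vocabulary declares both source predicates as specialisations
# of a single mediator property.
sub_property_of = {
    "srcA:encodedBy": "med:encodedBy",
    "srcB:hasGene": "med:encodedBy",
}


def objects_with_ontology(triples, predicate):
    """Query that also follows the declared sub-property links."""
    return sorted(
        o for _, p, o in triples
        if p == predicate or sub_property_of.get(p) == predicate
    )
```

Both sources are expressed in the same triple format, yet a query against either source's predicate misses the other source. Only the query that consults the shared vocabulary retrieves both genes.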

Ontologies can be more generic, reusable and independent of the integrative applications they were created for when compared with traditional approaches [13]. Ontologies are enhanced logic-based formalisms where classes can be logically compared for equivalence and where classes and other entities can be integrated across domains [14]. Mappings between schemas in non-semantic approaches are specific to those schemas, and cannot be applied to other data sources; conversely, mappings between ontologies—and therefore data sources utilising those ontologies—can be used by any resource making use of those ontologies, and not just the original, intended, data sources [15]. Section 1.5 provides a detailed description of ontologies.

3 Mapping source data

There are two axes, the mapping type used and the integration interface, along which the four main integration categories can be classified (see Table 1). These four categories are data encapsulation, query encapsulation, data translation and query translation. In this section, the mapping types for linking native data to an integrated end result are described. Section 4 then relates these mapping types with the two most common integration interfaces for source data: warehouses and federation.

Table 1: Summary of integration methodologies and their main characteristics according to mapping type (rows) and integration interface (columns). Information linkage and direct mapping, the two mapping types which do not make use of an integration interface, are not included in this table as the lack of an integration interface makes them neither a warehouse nor a federated resource. (†) Query encapsulation is not a commonly used method, but is included for completeness. *As hybrid mediator mapping guarantees that the local data sources are aware of the global schema, it has no valid warehousing interface, as warehoused hybrid mapping would be equivalent to data translation.

The mapping methods described here were originally defined by Alonso-Calvo and colleagues [10] for federated databases only. In the work described in this thesis, their definitions have been extended and corresponding changes to the names of the mapping types made. For example, Alonso-Calvo and colleagues originally used a naming scheme containing the phrase “conceptual schema” (e.g. multiple conceptual schema). Here, “conceptual” has been removed, as it relates only to a federated method’s conceptual data model as opposed to a warehouse’s applied data model. Further, as “schema” is often shorthand for “database schema”, “mediator” is used instead to incorporate both syntactic and semantic integration methods. In syntactic integration, the data sources are generally described by database schemas or other schemas which define the format of the data. Conversely, semantic integration uses ontologies or other similar semantic structures to model the data.

Figures 1 and 2 describe simple mapping methods which do not require an explicit integration interface. As such, these simple types are fully explained in this section and are not included in Section 4. The “integration interface” named in Figures 3 to 5 refers to either a warehouse or a federated solution, as described in Section 4.

Information linkage

The simplest method of linking data sources to a user is information linkage [10], where the data is presented without an integration step. This method is also known as link integration [3, 16], navigational integration [17] and portal [18] integration. Shown in Figure 1, information linkage provides the data directly to the user and is therefore more a measure of interconnectedness than of integration. This solution is lightweight for the program developer, as it only involves linking to related information. However, it puts a greater burden on the user, who must visit all links to determine which resources are useful and then manually pull the relevant information from those resources. It also forces the user to click through into a site of unknown quality [3].

Figure 1: A visualisation of information linkage. Here, data is provided directly to the user, usually in the form of hyperlinks. There are no connections among data sources, only links from the end user’s interface (such as a Web page) to each data source. As such, information linkage is not strictly a mapping method, but rather a method of connecting the user to resources.

Links out

Links out are information linkage in its simplest form. This method connects disparate datasets and analytical tools via hyperlinks or other explicit links, allowing the user to easily navigate from one data entry to another in a different dataset. Most Web-based database viewers which provide cross references fall into this category.

Integrated analysis environments

Integrated analysis environments are specialised cases of information linkage. While having the same capabilities as links out, an integrated analysis environment also locally stores user data for analysis. The main goal of an integrated analysis environment is to provide an environment specifically for analysing user-provided raw data such as high throughput omics datasets. One example of this mapping type is Gaggle, which provides access to a variety of analysis tools and databases using external data sources as well as locally stored copies of user datasets [19]. Gaggle’s main purpose is as a single environment for accessing Web-based resources and programs. Its controller (the Gaggle “boss”) directly accesses remote or local sources, each via their own interface, or “goose” [19]. VANTED is similar to Gaggle in that it accepts various types of expression data in Microsoft Excel format then allows the mapping of this data onto either user-defined or publicly-accessible networks [20]. It also provides various analytical tools and network-based viewers. However, the environment is centralised into a single application, rather than the boss-geese application set of Gaggle.

Direct mapping

Figure 2: Direct mapping between two schemas or ontologies. Though not often used in syntactic integration, schema mapping would directly link specific columns in one database with appropriate columns in another. Within semantic integration, ontology alignment and ontology mapping allow the mapping of a source class from one ontology to a target class in another ontology.

To present a more integrated view of data to the user than is available with information linkage, some level of mapping between resources must occur. The simplest type of mapping is direct mapping. As shown in Figure 2, this method maps one data source directly onto a second data source. Though possible in syntactic integration, direct mapping is more commonly used for performing ontology mapping or alignment in semantic integration. Ontology mapping and alignment create associations between classes in two or more ontologies [21]. While alignment and mapping are often used interchangeably in the literature, a clear separation between the terms is made in this thesis. Ontology mapping is the term used when the existing ontologies cannot or should not be changed; it does not necessarily result in a true merging of the original ontologies. Ontology alignment is used when the integrated result must be a single ontology created by merging the source ontologies together.
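The distinction between mapping and alignment, as the terms are used here, can be sketched as follows. The two toy ontologies, their class names and the `align` helper are hypothetical and serve only to illustrate the definitions above.

```python
# Two toy ontologies: class name -> set of parent class names (hypothetical).
onto_a = {"Enzyme": {"Protein"}, "Protein": set()}
onto_b = {"Catalyst": {"Macromolecule"}, "Macromolecule": set()}

# Ontology mapping: the sources stay untouched and the correspondences
# are kept in a separate structure.
mapping = {("A", "Enzyme"): ("B", "Catalyst")}


def translate(annotation, mapping):
    """Return the corresponding class in the other ontology, if any."""
    return mapping.get(annotation)


def align(a, b, equivalences):
    """Ontology alignment: merge b into a copy of a, producing one ontology.

    `equivalences` maps a class name in b to its equivalent class in a.
    """
    merged = {name: set(parents) for name, parents in a.items()}
    for name, parents in b.items():
        target = equivalences.get(name, name)
        renamed_parents = {equivalences.get(p, p) for p in parents}
        merged.setdefault(target, set()).update(renamed_parents)
    return merged


merged = align(onto_a, onto_b, {"Catalyst": "Enzyme"})
```

With mapping, data annotated with `Enzyme` can be associated with `Catalyst` while both ontologies remain as published. With alignment, the result is a single ontology in which `Enzyme` inherits the parents of both source classes.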

Direct mapping via ontology alignment was performed by Lomax and colleagues between GO and UMLS [22]. These two mature ontologies were linked such that data annotated with one ontology could also be associated with the second. Ontology mapping rather than alignment was performed when Entrez Gene/HomoloGene was integrated with pathway resources [23]. Here, the EKoM OWL ontology of gene resources was directly mapped to BioPAX [23]. To integrate the two ontologies, BioPAX was imported into EKoM and relationships were added to link the two ontologies together. These ontologies were then populated directly from the data sources, and SPARQL was used to query over the integrated knowledge base [23]. Direct mapping is not easily extensible, nor is it easy to update the integrated ontology when either source ontology changes.

Other mapping approaches make use of a bridging ontology between two resources. Often, the resources linked via a bridging ontology are not required to be ontologies themselves. One example is the alignment of the UMLS semantic network with the upper-level ontology BioTop by Schulz and colleagues [24]. Schulz and colleagues created a bridging ontology which imports both the UMLS semantic network and BioTop in order to improve the quality and richness of BioTop. While the UMLS semantic network is a hierarchical set of classes and relationships, it is not a formal ontology but rather an information legacy system [24]. SBPAX is another bridging ontology between the SBML format and the BioPAX ontology. SBPAX was created to integrate the information available in both formats to aid systems biology model integration [25]. In this approach, data can be either converted from one format to another, or merged together and then saved in a target format. Only concepts common to both formats are modelled within SBPAX [25]. While SBPAX provides a useful approach for conversion and merging of model and pathway data, it is not a generic solution. Further, later releases of SBPAX (discussed in Chapter 5) are moving away from integration and towards the creation of an extension to the BioPAX ontology which includes quantitative information.

Mapping types

The two axes of classification for integration methods defined earlier are the mapping type and the integration interface. In this section, the various mapping types are described. The single common feature of all of these mapping types is the presence of a mediator. Mediators can be created in different ways and using different technologies, and when combined with a data storage solution become the integration interface for a particular methodology. Integration interfaces are further described in Section 4.

Rather than trying to map three or more data sources directly to each other, mediator-based approaches individually map data sources to one or more mediator schemas or ontologies. The mediator-based mapping types are represented in the rows of Table 1. Each of the mappings illustrated in Figures 3 to 5 uses integration interfaces to mediate between the user and the source data. The first of these types, multiple mediator mapping, has an integration interface for each data source or group of similar data sources. This makes it robust with respect to changes in the data source, but produces a complex integration interface. Single mediator mapping and hybrid mediator mapping utilise a single integration interface, which creates a stable, unified global schema but which requires changes to that schema when underlying data sources are modified. These two single interface mapping types have three further subtypes. First developed for database schemas, global-as-view, local-as-view and hybrid strategies are also applicable for ontology-based integration techniques where more than two ontologies are required.

Global-as-view mapping defines a mediator ontology or schema as a function of the data sources rather than as a semantically rich description of the research domain in its own right, though the level of dependence of the mediator ontology can vary [26, 27, 28]. With local-as-view mapping the mediator is independent of the sources and the sources themselves are described as views of the mediator. Local-as-view mapping can be used to automatically generate queries over the data source ontologies [27].
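The two subtypes can be contrasted with a deliberately small sketch. The source records, field names and the `lav_descriptions` structure are hypothetical. Real local-as-view systems perform query rewriting that is considerably more involved than the dictionary lookup shown here.

```python
# Two toy sources with different field names (hypothetical records).
source_a = [{"acc": "P1", "name": "kinase X"}]
source_b = [{"id": "P2", "label": "phosphatase Y"}]
sources = {"source_a": source_a, "source_b": source_b}


def mediator_protein_gav():
    """Global-as-view: the mediator relation is written as a query over
    the sources, so it must be edited whenever a source changes."""
    rows = [{"id": r["acc"], "name": r["name"]} for r in source_a]
    rows += [{"id": r["id"], "name": r["label"]} for r in source_b]
    return rows


# Local-as-view: the mediator is defined independently, and each source is
# described as a view of it (mediator field -> source field).
lav_descriptions = {
    "source_a": {"mediator_class": "Protein", "fields": {"id": "acc", "name": "name"}},
    "source_b": {"mediator_class": "Protein", "fields": {"id": "id", "name": "label"}},
}


def query_lav(mediator_class, wanted):
    """Generate and run a per-source query from the view descriptions."""
    results = []
    for source_name, desc in lav_descriptions.items():
        if desc["mediator_class"] != mediator_class:
            continue
        for row in sources[source_name]:
            results.append({f: row[desc["fields"][f]] for f in wanted})
    return results
```

Both routes return the same integrated rows for this toy data. The difference lies in where a change to a source is absorbed: in the body of the mediator definition for global-as-view, or in that source's own description for local-as-view.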

In addition to the pure global-as-view and local-as-view subtypes, hybrid approaches, such as that used in this thesis [29, 30], are also available. Though these approaches generate mappings between sources and the mediator, unlike traditional approaches, the mediator is completely independent of any of the sources. Such approaches allow both the straightforward addition of new sources as well as the maintenance of the mediator as an independent entity. More information on ontology-based integration and these mapping subtypes is available in Chapter 5.

Multiple mediator mapping

With multiple mediator mapping, a schema or ontology for each data source or grouping of similar data sources is created, as shown in Figure 3. While any schema, controlled vocabulary or ontology may be used as the integration interface, there is no guarantee that the underlying data sources will have any knowledge of that interface [10]. This type of mapping allows new data sources to be added without disruption to existing resources or existing code. However, the complexity of the integrative interface is much greater, as many schemas or ontologies need to be queried.

Figure 3: Multiple mediator mapping for data integration. Each data format is assigned its own mediator within the integration interface.
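The trade-off of multiple mediator mapping can be sketched as a registry holding one mediator per source. The `SourceMediator` class, the records and the field names are all hypothetical.

```python
class SourceMediator:
    """One mediator per source: it knows only its own source's field names."""

    def __init__(self, records, id_field, name_field):
        self.records = records
        self.id_field = id_field
        self.name_field = name_field

    def find(self, name_fragment):
        return [
            (r[self.id_field], r[self.name_field])
            for r in self.records
            if name_fragment in r[self.name_field]
        ]


mediators = {
    "source_a": SourceMediator([{"acc": "P1", "name": "kinase X"}], "acc", "name"),
}

# Adding a source only registers a new mediator; existing ones are untouched.
mediators["source_b"] = SourceMediator(
    [{"id": "P2", "label": "phosphatase Y"}], "id", "label"
)


def find_everywhere(name_fragment):
    """The cost of this design: every query must visit every mediator."""
    return {name: m.find(name_fragment) for name, m in mediators.items()}
```

Registering `source_b` required no change to `source_a` or its mediator, but the results are returned per mediator and still have to be combined by the caller.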

Single mediator mapping

The complexity problems of multiple mediators are solved in single mediator mapping by modelling all incoming data sources using a unified global schema, as shown in Figure 4. This is the second of the mediator-based approaches and may be realised via a number of routes, including schema matching and ontology mapping [10]. This method presents a unified view of the data to the user and does not force the underlying data sources to subscribe to that unified view.

Global-as-view and hybrid ontology mapping subtypes are possible with single mediator mapping as they do not require the data sources to have any knowledge of the mediator ontology. Global-as-view implementations must change their mediator ontology with any change to any underlying data source, which may affect the entire integration method. The local-as-view mapping subtype is not available for single mediator mapping, as data source schemas or ontologies are a view of the global ontology and therefore must have knowledge of the global mediator ontology.

Figure 4: Single mediator mapping provides a single view over multiple data sources.

Hybrid mediator mapping

Hybrid mediator mapping utilises multiple schemas that either inherit from a global mediator or map to each other via a common mediator. Such a mapping must guarantee that all queries be resolved correctly for all appropriate data sources [10]. While data traditionally comes from multiple, autonomous database sources, the use of hybrid mediator mapping from the start of a multi-group project could simplify integration and ease the data storage burden on any one group. However, the difficulty of enforcing such a social and political restriction necessarily limits the environments in which this mapping method can be utilised. In suitable systems, related but non-intersecting data is stored in disparate locations using a common schema and viewed from a common interface [4]. Local-as-view mapping as well as the other mappings available to single mediator methods can be used with hybrid mediator systems, as the data sources are aware of the integration interface.

Figure 5: Hybrid mediator mapping provides high quality integration at the cost of independence of the data sources. This mapping type requires that all data sources know and implement the common schema or ontology used in the integration layer.
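The requirement that every source implements the common schema resembles an abstract interface in object-oriented programming. The following sketch makes that analogy explicit. The class names, identifiers and records are hypothetical, and the analogy is an illustration rather than a description of any particular hybrid mediator system.

```python
from abc import ABC, abstractmethod


class CommonSchema(ABC):
    """The shared schema that every participating source agrees to implement."""

    @abstractmethod
    def get_entity(self, identifier):
        """Return a dict with 'id' and 'description', or None if unknown."""


class SequenceStore(CommonSchema):
    def __init__(self):
        self._data = {"G1": "gene sequence record"}

    def get_entity(self, identifier):
        desc = self._data.get(identifier)
        return None if desc is None else {"id": identifier, "description": desc}


class ProteomicsStore(CommonSchema):
    def __init__(self):
        self._data = {"M1": "mass spectrometry run"}

    def get_entity(self, identifier):
        desc = self._data.get(identifier)
        return None if desc is None else {"id": identifier, "description": desc}


def query_all(sources, identifier):
    """Because all sources share the schema, one query works for all of them."""
    hits = (s.get_entity(identifier) for s in sources)
    return [h for h in hits if h is not None]
```

The related but non-intersecting data stays in separate stores, yet a single query function serves all of them. The cost is the one noted above: a source that does not implement the common schema cannot take part.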

4 Integration interfaces

Irrespective of whether syntactic or semantic structures are used, integration interfaces for the mediator-based mapping types described in Section 3 can be divided into two broad categories: data warehousing using applied schemas or ontologies and federated data access using conceptual schemas or ontologies. If only two sources need to be reconciled, simpler methods without a mediator—such as direct mapping—are an effective strategy, as only two conceptualisations need to be compared. However, the complexity increases as more schemas or ontologies are added, rapidly limiting direct approaches. The use of a mediator allows an arbitrary number of resources without a large increase in complexity of the integration. Each integration type has its own benefits and constraints, and an integration project’s particular requirements will influence which method is used. Warehousing provides a stable resource under the control of the integration project which can be much quicker at retrieving data than federated methods. However, data federation ensures that the data being queried is up-to-date and the integration infrastructure lightweight.
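The growth in complexity can be made concrete by counting mappings. The sketch assumes, as a simplification not taken from the cited sources, that direct integration needs one mapping per pair of resources while a mediator needs one mapping per resource.

```python
def direct_mappings(n_sources: int) -> int:
    """Pairwise mappings needed when every source is mapped to every other."""
    return n_sources * (n_sources - 1) // 2


def mediator_mappings(n_sources: int) -> int:
    """Mappings needed when every source is mapped once to a mediator."""
    return n_sources


for n in (2, 5, 10):
    print(n, direct_mappings(n), mediator_mappings(n))
```

Under this assumption two sources need a single direct mapping, whereas ten sources need 45 direct mappings but only 10 mappings to a mediator.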

Data warehousing

Data warehousing is when a single resource partially or wholly centralises multiple data sources into one integrated view [3, 18, 17, 16]. Data warehousing techniques can be useful, especially if there is relatively easy access to the data sources, but can be problematic due to high maintenance requirements. Whenever underlying databases change, the integrated database needs to be updated [10], and any format changes require the appropriate parsers to be modified as well. This “database churn” was identified by Stein to be a major limiting factor in establishing a successful data warehouse [3].

When applied to syntactic integration methodologies, the data warehouse is normally a relational database. When used for semantic integration, RDF-based triple stores are often used. Linked data is a way of publishing information which Tim Berners-Lee has defined as a collection of URIs structured using RDF for the purpose of making links that are accessible and searchable [31]. Linked open data is a W3C community project which publishes freely available linked data as RDF datasets on the Web and which links data among the different sources. Some linked data systems are enhanced with OWL as well as RDF. Shared ontologies can be used to build connections between RDF data files, building upon existing connections among datasets.

Data encapsulation

Data encapsulation is the name chosen in this thesis for the subtype of data warehousing where multiple mediator mappings are stored in a single database. These data sources may be queried and presented as a completely integrated view, but the underlying data sources remain distinct. The querying side of data encapsulation is identical to query translation in that, for both methods, the API resolves differences in data structure as and when the data is requested. However, as a warehousing technique, data encapsulation stores information locally, and may pre-process incoming data before storage.
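A minimal sketch of data encapsulation using an in-memory SQLite database is given below. The table names, columns and records are hypothetical. The point of the sketch is that each source keeps its own table and structure inside the warehouse, and that the API function, not the storage layer, reconciles the differences at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each source is stored locally in its own section of the warehouse,
# keeping its native column names.
conn.executescript(
    """
    CREATE TABLE src_a_protein (acc TEXT, full_name TEXT);
    CREATE TABLE src_b_protein (id TEXT, label TEXT);
    INSERT INTO src_a_protein VALUES ('P1', 'kinase X');
    INSERT INTO src_b_protein VALUES ('P2', 'phosphatase Y');
    """
)


def find_proteins(conn, name_fragment):
    """API layer: resolves structural differences when data is requested."""
    pattern = f"%{name_fragment}%"
    rows = conn.execute(
        """
        SELECT 'src_a' AS source, acc AS id, full_name AS name
          FROM src_a_protein WHERE full_name LIKE ?
        UNION ALL
        SELECT 'src_b', id, label
          FROM src_b_protein WHERE label LIKE ?
        """,
        (pattern, pattern),
    ).fetchall()
    return sorted(rows)
```

The caller sees one result set with uniform columns, while the `source` column shows that the underlying data sources remain distinct within the warehouse.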

Though SRS [32] is frequently described as an example of information linkage, it is more correctly a form of information linkage via data encapsulation. For instance, Stein [3] and Birkland [18] classify SRS as using information linkage, albeit one where fields can be structured and two fields in different data sources can be explicitly connected. However, information linkage is commonly defined as a decentralised method of relating data items where hyperlinks are provided as cross-references between data entries in different databases [16]. SRS does provide linking between data entries, but it stores all databases locally in a flat-file format which is then indexed. While not a relational implementation, it is a data warehouse with powerful local indexing, searching and analysis tools. All data sources are centralised, and all analysis tools are run on the server rather than the client side.

The Atlas data warehouse is a data encapsulation scheme which stores multiple data sources accessible through SQL queries and a higher level API [33]. Each individual database is imported into its own section of the global Atlas schema, retaining its own identity within the larger warehouse [33]. Atlas relies on its API to resolve differences and present a unified view. While it makes limited use of ontologies, a semantically identical protein present in two different databases will not be given the same identifier.

The majority of the data warehousing projects utilising RDF use data encapsulation. RDF databases provide datasets in a syntactically similar way using this Semantic Web technology, but the use of RDF does not necessarily imply resolution of semantic heterogeneity. For instance, while the linked data resource Bio2RDF [34, 35] stores millions of RDF triples, queries must still trace a path against existing resources rather than have those resources linked via a shared ontology or ontologies [14]. Similarly with linked open data, warehousing is common as conversion to RDF must occur even when the triples remain organised according to the original data structure. However, it is more common for ontology-based semantic integration systems using mediators to be federated systems, as warehousing creates large ontologies and consequent slow reasoning times.

Neurocommons is a large RDF resource which can be accessed through standard RDF methods such as SPARQL. Neurocommons converts the structure of selected data sources to OWL and then passes data instances via these OWL representations into an RDF database [36]. Neurocommons stores one named graph per data source and provides a SPARQL endpoint for querying. Each one of these named graphs acts as a mediator in the data encapsulation process. The intention of the Neurocommons authors is to create a Semantic Web for science [36].

Other linked open data projects include BioLOD and OpenFlyData [37]. Some projects, such as Linked Life Data, provide both open and copyright-restricted data. Other RDF databases in the life sciences include BioGateway [38], YeastHub [39], LinkHub [40] and S3DB [41]. Many others are listed in Antezana and colleagues [42, Table 1].

Data translation

Data translation [4, 10] is the phrase commonly used to describe a data warehouse which transforms data from multiple native formats to a single unified schema. This method allows seamless integration in a controlled environment and greater control over the data than federated approaches (see Data federation below). However, it is unclear whether data that appear to be the same are actually semantically identical. Additionally, frequent update cycles are required to maintain usability, and political obstacles such as licensing or access restrictions can prevent redistribution of databases. Licensing restrictions force both data translation and data encapsulation implementations to limit availability to either a download of the software on its own, or to a limited implementation of a working installation containing only unrestricted resources [5].
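As a minimal sketch of the idea, assuming two hypothetical sources with different native field names, data translation can be pictured as one translator per native format, each producing records in the same unified schema before they are stored. The `Protein` class, the source names and the records are illustrative only and do not correspond to any system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Hypothetical unified schema shared by every translated record."""
    identifier: str
    name: str
    source: str


def from_source_a(rec):
    # Source A (hypothetical) uses the fields 'acc' and 'name'.
    return Protein(identifier=rec["acc"], name=rec["name"].lower(), source="source_a")


def from_source_b(rec):
    # Source B (hypothetical) uses the fields 'id' and 'protein_name'.
    return Protein(identifier=rec["id"], name=rec["protein_name"].lower(), source="source_b")


TRANSLATORS = {"source_a": from_source_a, "source_b": from_source_b}


def load_warehouse(native_records):
    """Translate every native record into the unified schema at load time."""
    return [
        TRANSLATORS[source](rec)
        for source, records in native_records.items()
        for rec in records
    ]


unified = load_warehouse({
    "source_a": [{"acc": "A001", "name": "hexokinase"}],
    "source_b": [{"id": "B-17", "protein_name": "Hexokinase"}],
})
print(unified)
```

Once loaded, queries run against a single structure. Note that the two records end up with the same normalised name purely through string handling, which says nothing about whether they are semantically identical.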

The Genome-based Modelling System [43] is a database of genome-scale networks generated by linking a variety of databases via EC number. Enzyme information from UniProt/Swiss-Prot, KEGG [44], WIT [45] and BRENDA [46] is integrated into the GEM database together with all EC-associated genes from EMBL/GenBank/DDBJ [47] whole bacterial genomes. All genomes are available for viewing as qualitative models; however, very few quantitative SBML models are available. Further, while the Genome-based Modelling System is successful at presenting a genomic view of known pathways, it does not suggest any novel ones.
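Linking records through a shared key such as the EC number amounts to a join. The sketch below is not the GEM implementation: the gene names are hypothetical and the data structures are chosen only to show how enzyme annotations and genes from different sources can be grouped under the same EC number.

```python
from collections import defaultdict

# (EC number, enzyme name) pairs, as might come from an enzyme database.
enzymes = [("2.7.1.1", "hexokinase"), ("1.1.1.1", "alcohol dehydrogenase")]

# (gene, EC number) pairs, as might come from genome annotation; gene names are hypothetical.
genes = [("geneA", "2.7.1.1"), ("geneB", "2.7.1.1"), ("geneC", "4.2.1.11")]


def link_by_ec(enzymes, genes):
    """Group genes under each known enzyme using the EC number as the join key."""
    genes_by_ec = defaultdict(list)
    for gene, ec in genes:
        genes_by_ec[ec].append(gene)
    return {
        ec: {"enzyme": name, "genes": genes_by_ec.get(ec, [])}
        for ec, name in enzymes
    }


linked = link_by_ec(enzymes, genes)
print(linked)
```

Genes whose EC number has no matching enzyme entry (here `geneC`) simply drop out of the result, which is one way in which a join of this kind reproduces known associations without suggesting new ones.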

Another integration system using data translation is BioWarehouse [48], which uses a relational database for storage. Its public implementation, PublicHouse, contains only openly accessible databases for which republication is legally allowed. Biozon [18] integrates data from protein sequence, nucleotide sequence, pathway and interaction databases, as well as a variety of novel information derived from these databases. Unlike BioWarehouse, Biozon uses graphs as its internal data structure rather than a relational format [18]. Each native schema is translated into a graph and then overlaid on the Biozon graph schema. When the user interface reads from this schema, the results view is not unified; instead, it presents a disjointed result set in which the query term is shown in each of the data sources separately.

PathSys [49] is another graph-based data translation application and was the original back end to the BiologicalNetworks [50] client application. BiologicalNetworks presents the user with a much more unified view than that provided by Biozon. For each data source in PathSys, two schemas are loaded prior to data upload: the native schema of the data source for initial reading of the data, and the mapping schema for conversion of the native data into the global PathSys schema. As the BiologicalNetworks resource has matured, it has replaced its original syntactic data translation system PathSys with a new back end called IntegromeDB [51]. This new system uses OWL and RDF for integrating and formatting the data. Like Biozon and BiologicalNetworks, ONDEX [52] makes use of a graph-based storage structure and stores its information with a similar level of expressivity to RDF. ONDEX uses a set of 15 ontology and pathway database parsers to create a graph-based data structure for generating, viewing and analysing networks which can be exported as RDF [52].

Data federation

Also known as view integration [3, 16] and mediator systems [17, 18], data federation translates a unified query into native queries for each underlying data source. The retrieved results are then combined and presented to the user. Because federation does not rely on local, and possibly outdated, copies of data sources, it is easier and cheaper to maintain than data warehousing methods such as data translation, and it provides the most up-to-date versions of the underlying data. Space requirements are also lower, as no local installation of the source databases is required. However, decomposing a single query into the sub-queries to be sent to each data source is a complex process which can itself have a high overhead. Additionally, the integration method’s query engine is limited by the querying abilities of each native resource and by the quality of the internet connection to each resource. Finally, querying remote databases is often slower than equivalent warehousing options, and will not be any faster than the slowest data source [16].
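The federation pattern can be sketched as follows, under loud assumptions: the two "remote" sources are stand-in in-memory lists, and their native search functions and naming conventions are hypothetical. Nothing is copied into a local store; the unified query is rewritten for each source at query time and the results are merged.

```python
# Stand-ins for remote sources; in a real federation these would be network calls.
SOURCE_A = [{"acc": "A001", "name": "hexokinase"}]
SOURCE_B = [
    {"id": "B-17", "protein_name": "Hexokinase"},
    {"id": "B-18", "protein_name": "Enolase"},
]


def search_a(name):
    """Hypothetical native search of source A, which stores lower-case names."""
    return [r for r in SOURCE_A if r["name"] == name]


def search_b(protein_name):
    """Hypothetical native search of source B, which stores capitalised names."""
    return [r for r in SOURCE_B if r["protein_name"] == protein_name]


# One wrapper per source: how to rewrite the unified query, and how to read the answer.
WRAPPERS = [
    ("source_a", lambda q: search_a(q.lower()), lambda r: r["acc"]),
    ("source_b", lambda q: search_b(q.capitalize()), lambda r: r["id"]),
]


def federated_search(protein_name):
    """Send one unified query to every source and combine the native results."""
    combined = []
    for source, run_native_query, get_identifier in WRAPPERS:
        for rec in run_native_query(protein_name):
            combined.append({"source": source, "identifier": get_identifier(rec)})
    return combined


hits = federated_search("hexokinase")
print(hits)
```

Because the sources are queried one after another in this sketch, the total response time is the sum of the individual response times; even with parallel requests it could not drop below that of the slowest source.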

While query encapsulation is possible in theory, in practice such methods are rarely used. A hypothetical example of such a system would be similar to SRS or some other Web-based information retrieval service. Rather than presenting a unified query interface, this system would show the available data sources and provide query forms for each source. Such methods are impractical because they require considerable work for very little gain over simply visiting the data sources directly. Therefore, data federation is almost entirely composed of query translation methods.

Query translation [4, 10] uses a mediator schema or ontology to distribute a single query across all data sources known to the integration system [18]. Sujansky [4] divides query translation into two broad categories: declarative mapping, and a hybrid of query and data translation called procedural mapping. Procedural mapping methods fetch data from underlying data sources and import the results into global objects, which can then be further manipulated before presentation. Querying and integration occur within the same procedure. Declarative mappings separate the global and native data models from any code that performs the queries. This separation results in a lower level of maintenance, as changes to a native data model only require updating of the global schema, and the query layer only has to work within a single global model.
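The separation that declarative mapping provides can be shown with a small sketch. The mapping table, source names and field names are hypothetical; the point is only that the correspondences between global and native fields are held as data, while the translation code is generic and never mentions a specific source.

```python
# Declarative description of how global fields map to each source's native fields.
# All names here are hypothetical.
MAPPINGS = {
    "source_a": {"identifier": "acc", "name": "name"},
    "source_b": {"identifier": "id", "name": "protein_name"},
}


def to_native_query(source, global_query):
    """Rewrite a query phrased in global field names into a source's native field names."""
    mapping = MAPPINGS[source]
    return {mapping[field]: value for field, value in global_query.items()}


def to_global_record(source, native_record):
    """Rewrite a native result record into global field names."""
    mapping = MAPPINGS[source]
    return {
        global_field: native_record[native_field]
        for global_field, native_field in mapping.items()
    }


native_query = to_native_query("source_b", {"name": "hexokinase"})
global_record = to_global_record("source_b", {"id": "B-17", "protein_name": "Hexokinase"})
print(native_query, global_record)
```

In this sketch, a renamed field in one source would be handled by editing a single entry in `MAPPINGS`, leaving both functions untouched. A procedural mapping, by contrast, would embed the same knowledge inside source-specific fetching code.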

ArrayXPath II [53] is a query translation system which accepts user input in the form of gene expression clusters, and visualises those clusters in the context of a variety of major pathway databases, providing a level of statistical significance for how well each expression cluster matches each node of the best-fit pathways. ArrayXPath II then layers the expression levels, if provided, over the pathway. Another example of query translation is the DiscoveryLink system by IBM [54]. DiscoveryLink presents the user with an SQL-based API which connects to a single mediator mapping scheme. The mediator is a middleware engine which breaks down the original query into sub-queries for the external data sources.

Query translation systems utilising ontologies include the TAMBIS ontology-based query system [55] and ComparaGRID. TAMBIS allows the user to create a query according to its own global ontology. Each query is transformed into concrete searches specific to each data source. The results are then sent back to the user and displayed within the global ontology [55]. ComparaGRID uses local-as-view mapping to create application ontologies which are trimmed-down versions of the semantically rich domain ontology describing the biology of interest. This method of creating on-the-fly application views allows ComparaGRID to reach a wide variety of users quickly and easily, but means it also suffers from a lack of efficiency [56].

OntoFusion uses the local-as-view method of query translation to build virtual local schemas—one per data format—that describe those formats using semantically identical concepts from the shared domain ontology [10]. While this allows for automated alignment of different data types because their virtual schemas share a common core ontology, the method relies on query translation rather than data warehousing, which can be costly.

5 Rule-based mediation

The purpose of the work described in this thesis is to annotate systems biology models using data from multiple data sources stored in multiple formats. Ultimately, while current methodologies have a variety of benefits as described in this section, a new semantic integration methodology was designed. Semantic approaches were chosen over syntactic integration methods to allow rich models of the biological domain to be created. Semantic structuring of the data is vital for making the knowledge accessible to humans and machines. Non-mediator mapping types were discarded due to a lack of flexibility. Single mediator mapping was chosen for its ability to present a unified view to both users and applications accessing the data. Further, the hybrid mapping subtype was chosen as it allows the global domain ontology to be completely independent from the source formats. This ensures that the biological domain ontology remains free from any reference to the structure of the underlying data sources. Finally, pure warehousing options were discarded as the large amount of data would be incompatible with reasoning and other semantic tasks. Therefore, the integration interface normally functions as a federated resource to minimise reasoning times, but can also store data and build queries over longer time scales.

As a result, rule-based mediation was developed using single mediator mapping with modified query translation. Ultimately, rule-based mediation builds tailored views of systems biology data accessible to computer programs and to users, allowing both automated and manual model annotation. Further details of rule-based mediation, including a comparison with the methodologies presented in this section, are available in Chapter 5.


Lee Harland, Christopher Larminie, Susanna-Assunta A. Sansone, Sorana Popa, M. Scott Marshall, Michael Braxenthaler, Michael Cantor, Wendy Filsell, Mark J. Forster, Enoch Huang, Andreas Matern, Mark Musen, Jasmin Saric, Ted Slater, Jabe Wilson, Nick Lynch, John Wise, and Ian Dix. Empowering industrial research with shared biomedical vocabularies. Drug discovery today, 16(21-22):940–947, September 2011.
Andrew R. Joyce and Bernhard O. Palsson. The model organism as a system: integrating ’omics’ data sets. Nature Reviews Molecular Cell Biology, 7(3):198–210, March 2006.
Lincoln D. Stein. Integrating biological databases. Nature Reviews Genetics, 4(5):337–345, May 2003.
W. Sujansky. Heterogeneous database integration in biomedicine. J Biomed Inform, 34(4):285–298, August 2001.
Stephan Philippi and Jacob Köhler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
Jonathan D. Wren. URL decay in MEDLINE—a 4-year follow-up study. Bioinformatics, 24(11):1381–1385, June 2008.
Jonathan D. Wren and Alex Bateman. Databases, data tombs and dust in the wind. Bioinformatics, 24(19):2127–2128, October 2008.
Stella Veretnik, J. Lynn Fink, and Philip E. Bourne. Computational Biology Resources Lack Persistence and Usability. PLoS Comput Biol, 4(7):e1000136+, July 2008.
Alvis Brazma, Maria Krestyaninova, and Ugis Sarkans. Standards for systems biology. Nature Reviews Genetics, 7(8):593–605, August 2006.
R. Alonso-Calvo, V. Maojo, H. Billhardt, F. Martin-Sanchez, M. García-Remesal, and D. Pérez-Rey. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J Biomed Inform, 40(1):17–29, February 2007.
Dave Beckett. RDF/XML Syntax Specification (Revised). February 2004.
Alan Ruttenberg, Tim Clark, William Bug, Matthias Samwald, Olivier Bodenreider, Helen Chen, Donald Doherty, Kerstin Forsberg, Yong Gao, Vipul Kashyap, June Kinoshita, Joanne Luciano, M. Scott Marshall, Chimezie Ogbuji, Jonathan Rees, Susie Stephens, Gwendolyn Wong, Elizabeth Wu, Davide Zaccagnini, Tonya Hongsermeier, Eric Neumann, Ivan Herman, and Kei H. Cheung. Advancing translational research with the Semantic Web. BMC bioinformatics, 8 Suppl 3(Suppl 3):S2+, 2007.
Kei-Hoi Cheung, Andrew Smith, Kevin Yip, Christopher Baker, and Mark Gerstein. Semantic Web Approach to Database Integration in the Life Sciences. pages 11–30. 2007.
Michel Dumontier. Review of Semantic Integration in the Life Sciences. Ontogenesis, January 2010.
Allyson L. Lister. Semantic Integration in the Life Sciences. Ontogenesis, January 2010.
C. Goble and R. Stevens. State of the nation in data integration for bioinformatics. Journal of Biomedical Informatics, 41(5):687–693, October 2008.
Thomas Hernandez and Subbarao Kambhampati. Integration of biological sources: current systems and challenges ahead. SIGMOD Rec., 33(3):51–60, September 2004.
Aaron Birkland and Golan Yona. BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC bioinformatics, 7(1), February 2006.
Paul Shannon, David Reiss, Richard Bonneau, and Nitin Baliga. The Gaggle: An open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics, 7(1):176+, 2006.
Bjorn Junker, Christian Klukas, and Falk Schreiber. VANTED: A system for advanced data analysis and visualization in the context of biological networks. BMC Bioinformatics, 7(1):109+, 2006.
Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: the state of the art. The Knowledge Engineering Review, 18(01):1–31, January 2003.
J. Lomax and A. T. McCray. Mapping the gene ontology into the unified medical language system. Comparative and functional genomics, 5(4):354–361, 2004.
Satya S. Sahoo, Olivier Bodenreider, Joni L. Rutter, Karen J. Skinner, and Amit P. Sheth. An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. Journal of biomedical informatics, 41(5):752–765, October 2008.
Stefan Schulz, Elena Beisswanger, Laszlo van den Hoek, Olivier Bodenreider, and Erik M. van Mulligen. Alignment of the UMLS semantic network with BioTop: methodology and assessment. Bioinformatics (Oxford, England), 25(12):i69–76, June 2009.
O. Ruebenacker, I. I. Moraru, J. C. Schaff, and M. L. Blinov. Integrating BioPAX pathway knowledge with SBML models. IET Systems Biology, 3(5):317–328, 2009.
Holger Wache, T. Vögele, Ubbo Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information — a survey of existing approaches. In H. Stuckenschmidt, editor, Proceedings of the IJCAI’01 Workshop on Ontologies and Information Sharing, Seattle, Washington, USA, Aug 4-5, pages 108–117, 2001.
Marie C. Rousset and Chantal Reynaud. Knowledge representation for information integration. Inf. Syst., 29(1):3–22, 2004.
Jinguang Gu, Baowen Xu, and Xinmeng Chen. An XML query rewriting mechanism with multiple ontologies integration based on complex semantic mapping. Information Fusion, 9(4):512–522, October 2008.
Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through rule-based semantic integration. Journal of biomedical semantics, 1 Suppl 1(Suppl 1):S3+, 2010.
Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML Models Through Rule-Based Semantic Integration. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 49+, June 2009.
Tim Berners-Lee. Linked Data – Design Issues. July 2006.
Evgeni M. Zdobnov, Rodrigo Lopez, Rolf Apweiler, and Thure Etzold. The EBI SRS server—recent developments. Bioinformatics, 18(2):368–373, February 2002.
Sohrab P. Shah, Yong Huang, Tao Xu, Macaire M. Yuen, John Ling, and B. Francis Ouellette. Atlas – a data warehouse for integrative bioinformatics. BMC bioinformatics, 6(1):34+, 2005.
Peter Ansell. Model and prototype for querying multiple linked scientific datasets. Future Generation Computer Systems, 27(3):329–333, March 2011.
F. Belleau, M. Nolin, N. Tourigny, P. Rigault, and J. Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, October 2008.
Alan Ruttenberg, Jonathan A. Rees, Matthias Samwald, and M. Scott Marshall. Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in bioinformatics, 10(2):193–204, March 2009.
Alistair Miles, Jun Zhao, Graham Klyne, Helen White-Cooper, and David Shotton. OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster. Journal of Biomedical Informatics, 43(5):752–761, October 2010.
Erick Antezana, Ward Blondé, Mikel Egaña, Alistair Rutherford, Robert Stevens, Bernard De Baets, Vladimir Mironov, and Martin Kuiper. BioGateway: a semantic systems biology tool for the life sciences. BMC bioinformatics, 10 Suppl 10(Suppl 10):S11+, 2009.
Kei-Hoi H. Cheung, Kevin Y. Yip, Andrew Smith, Remko Deknikker, Andy Masiar, and Mark Gerstein. YeastHub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics (Oxford, England), 21 Suppl 1, June 2005.
Andrew Smith, Kei H. Cheung, Kevin Yip, Martin Schultz, and Mark Gerstein. LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics. BMC Bioinformatics, 8(Suppl 3):S5+, 2007.
Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, and Jonas S. Almeida. A Semantic Web Management Model for Integrative Biomedical Informatics. PLoS ONE, 3(8):e2946+, August 2008.
Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.
Kazuharu Arakawa, Yohei Yamada, Kosaku Shinoda, Yoichi Nakayama, and Masaru Tomita. GEM System: automatic prototyping of cell-wide metabolic pathway models from genomes. BMC Bioinformatics, 7(1):168+, 2006.
Minoru Kanehisa, Michihiro Araki, Susumu Goto, Masahiro Hattori, Mika Hirakawa, Masumi Itoh, Toshiaki Katayama, Shuichi Kawashima, Shujiro Okuda, Toshiaki Tokimatsu, and Yoshihiro Yamanishi. KEGG for linking genomes to life and the environment. Nucleic acids research, 36(Database issue):D480–484, January 2008.
R. Overbeek, N. Larsen, G. D. Pusch, M. D’Souza, E. Selkov, N. Kyrpides, M. Fonstein, N. Maltsev, and E. Selkov. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic acids research, 28(1):123–125, January 2000.
Maurice Scheer, Andreas Grote, Antje Chang, Ida Schomburg, Cornelia Munaretto, Michael Rother, Carola Söhngen, Michael Stelzer, Juliane Thiele, and Dietmar Schomburg. BRENDA, the enzyme information system in 2011. Nucleic acids research, 39(Database issue), January 2011.
Tamara Kulikova, Ruth Akhtar, Philippe Aldebert, Nicola Althorpe, Mikael Andersson, Alastair Baldwin, Kirsty Bates, Sumit Bhattacharyya, Lawrence Bower, Paul Browne, Matias Castro, Guy Cochrane, Karyn Duggan, Ruth Eberhardt, Nadeem Faruque, Gemma Hoad, Carola Kanz, Charles Lee, Rasko Leinonen, Quan Lin, Vincent Lombard, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Francesco Nardone, Maria P. Pastor, Sheila Plaister, Siamak Sobhany, Peter Stoehr, Robert Vaughan, Dan Wu, Weimin Zhu, and Rolf Apweiler. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Research, 35(suppl 1):D16–D20, January 2007.
Thomas J. Lee, Yannick Pouliot, Valerie Wagner, Priyanka Gupta, David W. Stringer-Calvert, Jessica D. Tenenbaum, and Peter D. Karp. BioWarehouse: a bioinformatics database warehouse toolkit. BMC bioinformatics, 7(1):170+, 2006.
Michael Baitaluk, Xufei Qian, Shubhada Godbole, Alpan Raval, Animesh Ray, and Amarnath Gupta. PathSys: integrating molecular interaction graphs for systems biology. BMC bioinformatics, 7(1), February 2006.
Sergey Kozhenkov, Yulia Dubinina, Mayya Sedova, Amarnath Gupta, Julia Ponomarenko, and Michael Baitaluk. BiologicalNetworks 2.0 – an integrative view of genome biology data. BMC Bioinformatics, 11(1):610+, 2010.
Michael Baitaluk and Julia Ponomarenko. Semantic integration of data on transcriptional regulation. Bioinformatics (Oxford, England), 26(13):1651–1661, July 2010.
Jacob Köhler, Jan Baumbach, Jan Taubert, Michael Specht, Andre Skusa, Alexander Rüegg, Chris Rawlings, Paul Verrier, and Stephan Philippi. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics, 22(11):1383–1390, June 2006.
Hee-Joon Chung, Chan H. Park, Mi R. Han, Seokho Lee, Jung H. Ohn, Jihoon Kim, Jihun Kim, and Ju H. Kim. ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Research, 33(suppl 2):W621–W626, July 2005.
L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope. DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40(2):489–511, 2001.
R. Stevens, P. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and A. Brass. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics (Oxford, England), 16(2):184–185, February 2000.
Robert Stevens, Phillip Lord, and Andrew Gibson. Something Nasty in the Woodshed: The Public Knowledge Model. In OWLED Workshop on OWL: Experiences and Directions, November 2006.

By Allyson Lister
