Data Integration

Background: Overview (Thesis 1.1)

[Previous: Additional Front Material]
[Next: What does systems biology data look like?]


Studying biological systems requires a large amount of data of different experimental types. Historically, each of these types is stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. The use of community data standards can reduce the need for specialised, independent formats by providing a common syntax to make data retrieval and manipulation easier. However, standards uptake is not universal and the disparate data types required by systems biologists creates data that is not, or cannot, be completely described by a single standard. If existing data does not share a standard structure, theoretically any heterogeneous data could be reproduced in a single format by rerunning experiments. Because in practice such a method would be expensive and time consuming, integrative methods which reuse existing data should be explored.

Though it is possible for the biology represented by any given data format to be completely orthogonal with other experimental types, more commonly, portions of the biology—but not necessarily the formats describing them—overlap. These common aspects of biological data representations create theoretical integration points among the representations, allowing information reuse and re-purposing. However, shared biological concepts do not necessarily result in shared definitions of the biology. Whereas differences in format result in syntactic heterogeneity, the differences in meaning of seemingly identical biological concepts across different formats results in semantic heterogeneity. A portion of the work presented in this thesis addresses syntactic heterogeneity both through the use of a common experimental metadata standard and by systematically integrating data for the purposes of systems biology model annotation. While syntactic heterogeneity can be resolved through the alignment of common structures, semantic heterogeneity is a more complex challenge. Once the meaning of the underlying biological concepts of interest for all data sources has been made explicit, the semantic heterogeneity can be identified. Further, if the semantics of a data format are accessible to machines, computational reasoning can be applied to find inferences and logical inconsistencies. The work described in this thesis includes the conversion of a systems biology standard specification—in multiple documents—into a single semantically aware model of that specification.

A wide variety of integration methodologies addressing various aspects of syntactic and semantic heterogeneity are available, often optimised for different situations. Many of these methods do not address semantic heterogeneity in systems biology data and of those that do, very few use existing technologies, address syntactic and semantic heterogeneity and make use both of simple syntactic conversions of non-semantic formats and semantically-meaningful models of the biological domain of interest. The work described in this thesis includes a method of semantic data integration called rule-based mediation which provides these features and which was developed as an aid to systems biology model annotation. Integrating resources with rule-based mediation accommodates multiple formats with different semantics and provides richly-modelled biological knowledge suitable for annotation of systems biology models.

Within this introductory chapter, Section 1.2 provides an overview of systems biology and the challenge presented when multiple formats are used. Section 1.3 describes how differences in format are not simply the result of different types of experiments, but are also due to the variety of ways systems are modelled. Section 1.4 describes the content, syntax and semantics standards relevant for systems biology. Data heterogeneity is an issue not limited to systems biology, and as such there is a large amount of previous work on data integration (see Section 1.6). As described in Section 1.5, existing technologies such as ontologies, rules and reasoning can bring together heterogeneous data in a homogeneous, meaning-rich, computationally-amenable manner.

Figure 1: How the work described in this thesis relates to the semantic systems biology life cycle described further in Section 1.2, Figure 3. SyMBA provides a common structure for experimental metadata and a archive for experimental data of any type, thus aiding data storage and analysis. Saint helps systems biology modellers annotate models with biological information in a standard way, thus enhancing the quality of models used in in silico experiments. MFO and rule-based mediation use Semantic Web technologies to formalise systems biology data and perform automated reasoning over that data. Rule-based mediation also semantically integrates information relevant to systems biology.

Figure 1 provides a summary of the work described in the chapters of thesis in the context of the semantic systems biology life cycle originally described by Antezana and colleagues [1, Fig.2]. Retrieval and storage of systems biology experimental metadata using a common syntax becomes easier with applications like the collaboratively-developed SyMBA (Chapter 2). The Saint Web application (Chapter 3) provides syntactic integration of systems biology data as well as a simple interface for viewing, manipulating and exporting new biological annotation to existing systems biology models. Saint is useful both in its own right and as a test case for determining the data sources, capabilities and requirements for the implementation of rule-based mediation. New data sources can easily be added, and therefore Saint has the capacity to provide access to data integrated via rule-based mediation.

In some cases, community standards and syntactic integration are not enough. Rules and restrictions on the use of a standard syntax are not always confined to the syntax itself; extra information can be present in human-readable documentation such as Word or PDF documents, but not directly accessible to computers. Therefore, even if the data is in a common syntax, there are limits on its computational accessibility. This problem is resolved for SBML models through the use of MFO (Chapter 4), an ontology which holds SBML data as well as rules and restrictions on the SBML structure. MFO provides a format through which reasoning can be applied to SBML models, and stands both on its own and as part of the semantic data integration methodology described in Chapter 5.

Semantic data integration via rule-based mediation, described in Chapter 5, is a useful way of reusing and re-purposing heterogeneous datasets which cannot, or are not, structured according to a common standard. This method of integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations. Syntactic heterogeneity is resolved through the conversion to a computationally-accessible syntactic ontology, and semantic heterogeneity is resolved by mapping the syntactic ontology to a biological domain of interest which is itself modelled using an ontology.

Chapter 6 discusses future possibilities for data use and reuse. Will data integration become a thing of the past? An increase in the uptake of standards will likely occur as communities mature. Experiments and their outputs could become better organised, cheaper and more open. Data usage would be much easier if integration were not required at all. Ultimately, however, science moves faster than standards and there are always new questions and new experiments. Hopefully, integration will become seamless and transparent to the user as semantic methods, combined with use of large, open resources on the Semantic Web, serve heterogeneous data via a homogeneous view.


Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.

By Allyson Lister

Find me at and

6 replies on “Background: Overview (Thesis 1.1)”

[…] We had a very interesting meeting with Joann Roskoski, Deputy Assistant Director of Biological Sciences at the US National Science Foundation. We already have a number of joint programmes with the NSF, and it was most interesting to see the consonance of our thinking in areas such as sustainability, ‘big data’, open access and the like. In this regard, I read an interesting blog post on the extent to which (non-open-access) scientific journals help or hinder scientific progress, and another that involved blogging a thesis. […]

I read an interesting blog post on the extent to which (non-open-access) scientific journals help or hinder research and BBSRC are referencing your blog post !

I am really impressed they way you articulate about data intergration, my opinion is that is here to stay and we will be able to reuse this data

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s