Background: Overview (Thesis 1.1)
Studying biological systems requires large amounts of data of different experimental types. Historically, each of these types has been stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. The use of community data standards can reduce the need for specialised, independent formats by providing a common syntax that makes data retrieval and manipulation easier. However, standards uptake is not universal, and the disparate data types required by systems biologists create data that is not, or cannot be, completely described by a single standard. If existing data does not share a standard structure, theoretically any heterogeneous data could be reproduced in a single format by rerunning experiments. Because in practice such a method would be expensive and time consuming, integrative methods which reuse existing data should be explored.
Though it is possible for the biology represented by any given data format to be completely orthogonal to other experimental types, more commonly portions of the biology (but not necessarily the formats describing them) overlap. These common aspects of biological data representations create theoretical integration points among the representations, allowing information reuse and re-purposing. However, shared biological concepts do not necessarily result in shared definitions of the biology. Whereas differences in format result in syntactic heterogeneity, differences in the meaning of seemingly identical biological concepts across different formats result in semantic heterogeneity. A portion of the work presented in this thesis addresses syntactic heterogeneity both through the use of a common experimental metadata standard and by systematically integrating data for the purposes of systems biology model annotation. While syntactic heterogeneity can be resolved through the alignment of common structures, semantic heterogeneity is a more complex challenge. Once the meaning of the underlying biological concepts of interest has been made explicit for all data sources, the semantic heterogeneity can be identified. Further, if the semantics of a data format are accessible to machines, computational reasoning can be applied to draw inferences and find logical inconsistencies. The work described in this thesis includes the conversion of a systems biology standard specification, spread across multiple documents, into a single semantically aware model of that specification.
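The distinction between the two kinds of heterogeneity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two record layouts, the field names, the values and the helper functions are hypothetical and are not taken from any real data format or from the tools described in this thesis. The two sources name the same concept differently (syntactic heterogeneity), and both use a field called `location` with different meanings (semantic heterogeneity).

```python
# Hypothetical records from two invented sources (not real formats).
source_a = {"protein_name": "Hexokinase", "location": "cytoplasm"}   # location: subcellular compartment
source_b = {"name": "Hexokinase", "location": "chrIV:100-200"}       # location: genomic coordinates

# Syntactic heterogeneity: the same concept is stored under different keys.
# It is resolved here by aligning the structures with a field mapping.
FIELD_MAP_B_TO_A = {"name": "protein_name"}


def align_fields(record, field_map):
    """Rename keys so that both records share one structure."""
    return {field_map.get(key, key): value for key, value in record.items()}


aligned_b = align_fields(source_b, FIELD_MAP_B_TO_A)

# Semantic heterogeneity: after alignment the keys match, but 'location'
# still means something different in each source. It can only be detected
# once the meaning of each field has been made explicit.
FIELD_MEANING = {
    ("source_a", "location"): "subcellular_compartment",
    ("source_b", "location"): "genomic_coordinates",
    ("source_a", "protein_name"): "protein_name",
    ("source_b", "protein_name"): "protein_name",
}


def semantic_conflicts(field_meaning):
    """Return the field names whose declared meaning differs between sources."""
    meanings = {}
    for (_source, field_name), meaning in field_meaning.items():
        meanings.setdefault(field_name, set()).add(meaning)
    return sorted(name for name, found in meanings.items() if len(found) > 1)


conflicts = semantic_conflicts(FIELD_MEANING)
```

In this sketch the structural alignment succeeds mechanically, whereas the conflict over `location` is found only because the meanings were declared explicitly, which mirrors the argument made in the paragraph above.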
A wide variety of integration methodologies addressing various aspects of syntactic and semantic heterogeneity are available, often optimised for different situations. Many of these methods do not address semantic heterogeneity in systems biology data and, of those that do, very few combine three features: the use of existing technologies, the resolution of both syntactic and semantic heterogeneity, and the use of both simple syntactic conversions of non-semantic formats and semantically meaningful models of the biological domain of interest. The work described in this thesis includes a method of semantic data integration called rule-based mediation, which provides these features and which was developed as an aid to systems biology model annotation. Integrating resources with rule-based mediation accommodates multiple formats with different semantics and provides richly modelled biological knowledge suitable for annotation of systems biology models.
Within this introductory chapter, Section 1.2 provides an overview of systems biology and the challenge presented when multiple formats are used. Section 1.3 describes how differences in format are not simply the result of different types of experiments, but are also due to the variety of ways systems are modelled. Section 1.4 describes the content, syntax and semantics standards relevant for systems biology. Data heterogeneity is an issue not limited to systems biology, and as such there is a large amount of previous work on data integration (see Section 1.6). As described in Section 1.5, existing technologies such as ontologies, rules and reasoning can bring together heterogeneous data in a homogeneous, meaning-rich, computationally-amenable manner.
Figure 1 provides a summary of the work described in the chapters of this thesis in the context of the semantic systems biology life cycle originally described by Antezana and colleagues [1, Fig. 2]. Retrieval and storage of systems biology experimental metadata using a common syntax becomes easier with applications like the collaboratively developed SyMBA (Chapter 2). The Saint Web application (Chapter 3) provides syntactic integration of systems biology data as well as a simple interface for viewing, manipulating and exporting new biological annotation to existing systems biology models. Saint is useful both in its own right and as a test case for determining the data sources, capabilities and requirements for the implementation of rule-based mediation. New data sources can easily be added, and therefore Saint has the capacity to provide access to data integrated via rule-based mediation.
In some cases, community standards and syntactic integration are not enough. Rules and restrictions on the use of a standard syntax are not always confined to the syntax itself; extra information can be present in human-readable documentation such as Word or PDF documents, but not directly accessible to computers. Therefore, even if the data is in a common syntax, there are limits on its computational accessibility. This problem is resolved for SBML models through the use of MFO (Chapter 4), an ontology which holds SBML data as well as rules and restrictions on the SBML structure. MFO provides a format through which reasoning can be applied to SBML models, and stands both on its own and as part of the semantic data integration methodology described in Chapter 5.
Semantic data integration via rule-based mediation, described in Chapter 5, is a useful way of reusing and re-purposing heterogeneous datasets which are not, or cannot be, structured according to a common standard. This method of integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations. Syntactic heterogeneity is resolved through conversion to a computationally accessible syntactic ontology, and semantic heterogeneity is resolved by mapping the syntactic ontology to a biological domain of interest which is itself modelled using an ontology.
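The two-step pattern described above can be sketched in a few lines. This is only a rough illustration under stated assumptions, not the implementation presented in Chapter 5: plain Python tuples and dictionaries stand in for the syntactic and domain ontologies, a lookup table stands in for the mapping rules, and all source records, property names and function names are hypothetical.

```python
# Hypothetical source records in two different formats (illustrative only).
source_one = {"entry": "P1", "fullName": "Example kinase", "compartment": "cytoplasm"}
source_two = [("id", "X9"), ("label", "Example kinase"), ("organism", "Example organism")]


# Step 1: syntactic conversion. Each source is converted, without any
# interpretation, into a uniform (source, property, value) form. This
# stands in for a per-source syntactic ontology.
def to_syntactic(source_name, pairs):
    """Convert key/value pairs into (source, property, value) triples."""
    return [(source_name, prop, value) for prop, value in pairs]


syntactic = to_syntactic("one", source_one.items()) + to_syntactic("two", source_two)

# Step 2: semantic mapping. Hypothetical rules state which source-specific
# property corresponds to which concept in the core domain model.
RULES = {
    ("one", "fullName"): "protein_name",
    ("two", "label"): "protein_name",
    ("one", "compartment"): "located_in",
    ("two", "organism"): "found_in_taxon",
}


def mediate(triples, rules):
    """Apply the mapping rules and merge the results by protein name."""
    per_source = {}
    for source, prop, value in triples:
        concept = rules.get((source, prop))
        if concept is not None:  # unmapped properties stay in the syntactic layer
            per_source.setdefault(source, {})[concept] = value

    core = {}
    for mapped in per_source.values():
        name = mapped.get("protein_name")
        if name is None:
            continue
        entry = core.setdefault(name, {})
        entry.update({k: v for k, v in mapped.items() if k != "protein_name"})
    return core


core_view = mediate(syntactic, RULES)
```

Keeping the format conversion free of interpretation and placing all meaning in the rules means that, in this sketch, adding a new source requires only a new converter and new rule entries; the core model itself is unchanged.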
Chapter 6 discusses future possibilities for data use and reuse. Will data integration become a thing of the past? An increase in the uptake of standards will likely occur as communities mature. Experiments and their outputs could become better organised, cheaper and more open. Data usage would be much easier if integration were not required at all. Ultimately, however, science moves faster than standards and there are always new questions and new experiments. Hopefully, integration will become seamless and transparent to the user as semantic methods, combined with use of large, open resources on the Semantic Web, serve heterogeneous data via a homogeneous view.