By the end of last year, I had finished my work with both Manchester and Newcastle. Happily, I’ve found a new (working) home with Susanna, Philippe and the gang at the OERC in Oxford. I’ll be working part time on the BioSharing.org project, and will be doing all sorts of things related to biological data standards, policies, and databases.
I have a history in biological data standards and in developing community-driven standard formats, checklists and ontologies. I look forward to devoting some time to this collaborative, integrative project, and to helping people structure and manage their data well. To finish, here is a little bit about BioSharing, taken from the website itself:
“BioSharing works to map the landscape of community developed standards in the life sciences, broadly covering biological, natural and biomedical sciences. […] As part of the growing movement for reproducible research, a growing number of community-driven standardization efforts are working to make data along with the experimental details available in a standardized manner. BioSharing works to serve those seeking information on the existing standards, identify areas where duplications or gaps in coverage exist and promote harmonization to stop wasteful reinvention, and developing criteria to be used in evaluating standards for adoption.”
You may have noticed a pause in my posts (here, and on Twitter, and on G+ etc. etc.) – this is due to an 8 lb 15 1/2 ounce (4.07 kg) bouncing baby boy coming into our lives this past July :) So my apologies, I do plan to post more about bioinformatics and ontologies in the near future, but as this is a work blog (and not a baby blog!) there will be a little break now. Normal service will resume, eventually!
In a previous post, I talked about what kind of visualizations would make sense for large-scale epigenomic and related data. Now, I’d like to introduce the kind of data structures we’re building in the Newcastle ARIES project to support the creation of such visualizations.
Entanglement is a software platform that provides a generic, scalable, graph framework suitable for data integration applications that require embarrassing scalability. With the data sets we are currently importing, write times for Entanglement scale linearly over millions of database entities. A poster (Entanglement: Embarrassingly Scalable Graphs) and abstract for Entanglement were presented last month at the Integrative Bioinformatics 2013 symposium in IPK-Gatersleben, Germany.
Included below is the text of the abstract, together with a summary of the poster’s contents.
Epigenetics is becoming a major focus of research in the area of human genetics. In addition to contributing to our knowledge of inheritance, epigenetic profiles can be used as prognostic or predictive biomarkers. Methylation of DNA in leukocytes is one of the most commonly measured forms of epigenetic modification. The ARIES project generates epigenomic information for a range of human tissues at multiple points in time. ARIES uses both Illumina Infinium 450K methylation arrays and BS-seq approaches to generate epigenetic data on a number of samples from the Avon Longitudinal Study of Parents and their Children (ALSPAC) cohort. ALSPAC is a unique resource for studying how methylation patterns change over time. The ARIES project also provides tools and resources to the community for the interpretation of epigenomic data in the form of an integrated dataset and associated Web portal for browsing and integrative analysis of experimental methylation data. The integrated dataset includes a range of data types such as phenotypic, transcriptomic and methylation data from rodents, together with data generated by studies such as the ENCODE project.
The integration of these data is a considerable bioinformatics challenge. To meet this challenge we are developing a graph-based data integration platform, extending our previous work with the ONDEX system. We have developed a scalable, parallel graph storage system that exploits Cloud computing infrastructures for integrating data to produce graphs of entities and the relationships between them. This system, called Entanglement, has been designed to tackle the problem of scalability that is inherent in most graph-based approaches to bioinformatics data integration.
Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.
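The branching model described above can be sketched in miniature. The following is an illustrative Python sketch only, not Entanglement's actual API; all class and method names are invented.

```python
# A sketch of a forkable revision history for graph updates, loosely
# following the description above. Illustrative only: these are not
# Entanglement's real class or method names.

class RevisionHistory:
    def __init__(self):
        self.revisions = {"master": []}   # branch -> ordered graph updates
        self.origins = {"master": None}   # branch -> (parent branch, fork point)

    def commit(self, branch, update):
        """Record a graph update (e.g. a node or edge revision) on a branch."""
        self.revisions[branch].append(update)

    def fork(self, parent, child):
        """Diverge a new branch from the current tip of `parent`."""
        self.origins[child] = (parent, len(self.revisions[parent]))
        self.revisions[child] = []

    def log(self, branch):
        """Every update visible on `branch`, including inherited revisions."""
        origin = self.origins[branch]
        inherited = []
        if origin is not None:
            parent, fork_point = origin
            inherited = self.log(parent)[:fork_point]
        return inherited + self.revisions[branch]


history = RevisionHistory()
history.commit("master", ("add_node", "geneA"))
history.fork("master", "analysis-1")           # one branch per analysis
history.commit("analysis-1", ("add_edge", "geneA", "cpg1"))
history.commit("master", ("add_node", "geneB"))
# "analysis-1" inherits geneA from before the fork, but never sees geneB.
```

Each analysis branch carries its own updates plus everything committed to its ancestors before the fork point, which is what lets multiple analyses run in parallel from a common import.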
Data is distributed across a MongoDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed out across local computing clusters as well as to commodity cloud providers. Larger problems can be tackled by increasing CPU and storage resources in a scalable fashion. An API provides access to a range of graph operations, including rapidly cloning or merging existing graphs to form new graphs. Although the ultimate aim is a fully integrated dataset, a large amount of flexibility can be gained by intentionally storing different data sources in different graphs.
Multiple ad-hoc integrated views can be composed by importing references to the nodes and edges in various individual dataset graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.
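A minimal sketch of such a reference-based view, using invented names and toy in-memory data; in Entanglement itself the references would point at nodes and edges in MongoDB-backed graphs rather than Python dictionaries.

```python
# Illustrative sketch of an ad-hoc integrated view: the view holds only
# (graph, node id) references into separate per-dataset graphs, and the
# data is dereferenced on demand. Dataset names and contents are invented.

datasets = {
    "methylation": {"cpg1": {"beta": 0.82}, "cpg2": {"beta": 0.11}},
    "transcriptome": {"geneA": {"fpkm": 40.2}},
}

# The integrated view imports references, not copies of the data.
view = [("methylation", "cpg1"), ("transcriptome", "geneA")]

def resolve(view, datasets):
    """Materialise a view by dereferencing each entry in its source graph."""
    return {node_id: datasets[graph][node_id] for graph, node_id in view}
```

Because the view stores references rather than copies, multiple overlapping views can be composed over the same dataset graphs without duplicating data.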
Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly available data resources. A data model with project-specific node and edge definitions has also been developed.
Data integration for ARIES will ultimately require graphs containing hundreds of millions, if not billions, of graph entities. Entanglement has been shown to scale linearly with our initial ARIES datasets involving graphs with up to 50 million nodes and edges. Our results suggest that the system will scale to much larger graph sizes. Data storage capacity can be expanded by adding MongoDB servers to an existing cluster. Indexes required for efficient querying are designed to fit in memory, as long as enough machines are available to the cluster.
Entanglement is available under the Apache license at http://www.entanglementgraph.com.
I am currently working in Prof. Neil Wipat’s group at Newcastle University on the ARIES project. This involves working with large amounts of epigenomics data from ARIES itself, as well as with all sorts of related information from external data sources.
As well as producing an integrated data set for ARIES’ nascent genome track browser (an indispensable tool for this type of data), we’re working on something else very exciting: graph data. Specifically, we’re trying out all sorts of visualizations for relevant data sets, including the ARIES data. Here’s one I’ve been playing with recently.
The pink nodes represent the median beta values for methylation sites along the human genome. The lighter the pink, the less likely this particular point on the chromosome is methylated, and vice versa. At a glance, the user can see all integrated beta values (and therefore all experiments containing methylation information) for a particular chromosome location. This is a small, gene-centric graph (the gene is in green) intended for people who would like to see an overview of known experimental results for a particular gene of interest.
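As an illustration of the colour scale just described, a beta value in [0, 1] can be mapped to a pink shade. The exact colour ramp used in our visualizations isn't specified here, so the interpolation below is an assumption for illustration.

```python
# Map a methylation beta value (0 = unmethylated, 1 = fully methylated)
# to a pink shade: lighter for low beta, darker for high beta. The exact
# colour ramp is an assumption, chosen for illustration.

def beta_to_pink(beta):
    """Return a hex colour; red stays saturated, green/blue fade with beta."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta values lie in [0, 1]")
    gb = round(255 - beta * 150)  # white (#ffffff) down to pink (#ff6969)
    return f"#ff{gb:02x}{gb:02x}"
```

Any monotonic ramp works here; the important property is that a glance at node colour conveys the methylation likelihood without reading the underlying value.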
This is just the start; we have lots of other visualization ideas, as well as lots of ideas for creating novel (and interesting) types of subgraphs. Our database is huge, and a hairball of the entire thing (or even of a single chromosome) is unlikely to be as informative as subgraphs like this one, created around a particular area of interest.
But we’re not just working on the export of interesting subgraphs from our graph database: my colleague Keith Flanagan has been developing a highly scalable and incredibly neat graph database built on MongoDB.
You’ll probably see a lot of pictures like the one above in the coming weeks on this blog, as we experiment with visualizations and views of our data. If you’re into epigenomics, and have always wanted to view your data in a particular way, please leave a comment below. I’d also love to hear your ideas about this particular visualization type. Your input would be most welcome.
Below you can find a complete table of contents for all thesis-related posts (you can also get to the posts via the “thesis” tag I have used for each). Enjoy!
- Additional Front Material
- Chapter 1: Background
- Chapter 2: SyMBA: a Web-based metadata and data archive for biological experiments
- Chapter 3: Saint: a Web application for model annotation utilising syntactic integration
- Chapter 4: Model Format OWL integrates multiple SBML specification documents
- Chapter 5: Rule-based mediation for the semantic integration of systems biology data
- Chapter 6: General discussion
And, if you’re interested in how I performed the conversion, I’ve written about that too.
As the structure, description and interconnectedness of life science data has become more sophisticated, a shared formalisation of all data for systems biology has become easier to imagine. Tools are available, such as those used by the Semantic Web community, to capture any area of biology in the form of a semantically defined model. While challenges such as achieving tractable reasoning over highly expressive ontologies still remain to be addressed, a future where vast amounts of data and metadata are modelled according to a shared formalisation is not just fantasy, as shown by the work described in this thesis. The research presented here demonstrates the transformation of existing systems biology data models into more computationally amenable and more semantically aware structures.
The work presented also enables a more computationally amenable approach to systems biology, ultimately aiding the progression of systems biology metadata into a more formal structure. There is a general trend towards “big data”, both in in-house private repositories and in open data across all of the sciences, making maintenance and structuring of that data of primary importance [1, 2]. SyMBA (Chapter 2) aids the formalisation of systems biology knowledge by providing a common structure for experimental metadata and limiting the complexity of metadata input at the user interface. Traditionally, metadata input procedures are time consuming and often complex for large volumes of data. SyMBA was created to make it easy for researchers to add metadata even when large amounts of data are involved. In general, the easier metadata entry becomes, the more metadata will be stored by users. By structuring systems biology metadata in a common format, it is made much more accessible to computational techniques and to the systems biology community as a whole.
Saint (Chapter 3) helps researchers find appropriate biological data for systems biology models by reusing existing data from disparate locations. Saint integrates multiple Web-accessible resources using syntactic clues in the query model, presenting prospective annotation to the modeller. This type of syntactic data integration is fast, providing a large amount of integrated data at relatively low cost and with high scalability. However, syntactic integration does not resolve semantic differences in the data, nor does it allow a high level of expressivity between data sources or in the description of those data sources. The schema reconciliation in non-semantic approaches is generally hard-coded for the immediate integrative task, and not easily reused. Often, data is aligned by linking structural units such as XSD components or table and row names rather than the underlying biological components. Further, concepts in the source and target schemas are often linked based on syntactic similarity, which does not necessarily account for differences in the meanings of those concepts.
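A toy example of the syntactic matching discussed above, using Python's difflib to pair schema fields by string similarity; the field names and threshold are invented. Note that identical strings score 1.0 even when their meanings differ, which is exactly the semantic gap this chapter identifies.

```python
# Toy syntactic schema matcher: pair fields purely by string similarity.
# Field names and the threshold are invented for illustration.

from difflib import SequenceMatcher

def match_fields(source_fields, target_fields, threshold=0.8):
    """Link source and target schema fields whose names look alike."""
    matches = []
    for s in source_fields:
        for t in target_fields:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                matches.append((s, t, round(score, 2)))
    return matches

# "cell" matches "cell" perfectly even if one schema means a biological
# cell and the other a spreadsheet cell -- the semantic mismatch that
# syntactic integration cannot detect.
```

This kind of matcher is cheap and scales well, but it is blind to meaning; semantic integration, discussed next, trades away some of that scalability for expressivity.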
Semantic data integration allows greater expressivity at the expense of scalability. Controlled vocabularies and ontologies are of primary importance in the resolution of heterogeneity via semantic means. By using ontologies, entities can be integrated across domains according to their meaning. Ruttenberg and colleagues view the Semantic Web, of which both OWL and RDF are components, as having the potential to aid translational and systems biology research; indeed, any life science field with large amounts of data in distributed, disparate formats should benefit from Semantic Web technologies. However, the application of such techniques in bioinformatics is difficult, partly due to the bespoke nature of the majority of available tools. MFO (Chapter 4) transforms the existing SBML specification documents into a single model in OWL, allowing the storage both of SBML models and of the restrictions on those models. While some of the SBML specification documents are already computationally accessible, one document is readable only by humans and each of the others is in a different format. MFO brings concepts from all of the SBML documents into a single computationally accessible format which can be used with a variety of Semantic Web technologies.
Rule-based mediation (Chapter 5) transforms multiple systems biology data models into a single, more computationally accessible model. This work examines the feasibility of semantic data integration within a systems biology context. To gain access to the underlying semantics of the data, a semantically-rich core ontology is utilised together with mappings to and from source ontologies representing the data sources of interest. New data sources can be easily inserted without modification of the biologically-relevant core ontology. Separation of syntactic integration and semantic description of the biology is robust to changes to both the source ontologies and the core ontology. The use of existing tools decreased development time and increased the applicability of this approach for future projects. While some hurdles, such as scalability and the creation of a simple user interface, remain, rule-based mediation has been shown to be an effective approach for model annotation.
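The mapping arrangement just described can be sketched as per-source rules that lift records into a small shared core vocabulary, so that adding a data source means adding rules rather than changing the core. The source names, types and mappings below are all invented examples, not the actual core ontology or rules from Chapter 5.

```python
# Schematic sketch of the rule-based mapping arrangement described above:
# per-source rules lift records into a small core vocabulary, so a new
# source only adds rules. All names and mappings here are invented.

CORE_CONCEPTS = {"Protein", "SmallMolecule"}

# One rule set per source; the core is never modified to admit a source.
RULES = {
    "source_a": {"polypeptide": "Protein", "metabolite": "SmallMolecule"},
    "source_b": {"protein": "Protein", "compound": "SmallMolecule"},
}

def lift(source, record):
    """Translate a source record's local type into the core vocabulary."""
    core_type = RULES[source][record["type"]]
    assert core_type in CORE_CONCEPTS  # every rule must target the core
    return {"core_type": core_type, "name": record["name"]}
```

The one-way dependence is the point: rules know about the core, but the core knows nothing about any source, which is what makes the approach robust to source changes.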
The field of semantics and ontologies in the life sciences is continuing to mature. The traditional and highly useful role of the ontology as metadata and link provider is pervasive. In this role, a particular database entry structures the data according to a defined data format, and within that format links are provided to specific ontology terms. The data format is the primary structure, and the ontological terms are keywords and links out to more information and to other database entries sharing that term. Very recently, new research such as rule-based mediation has freed ontologies, reversing the roles of ontology and structured format [6, 7, 8, 9]. With semantic integration methodologies, ontologies, rather than the syntactic format, have become the backbone from which data hangs. Ontologies can therefore be used to describe the entirety of a data model or biological domain rather than just provide links to further information. The use of ontologies as a sophisticated semantic integration tool allows the creation of a methodology which not only follows cross-referenced links from one database entry to the next, but aligns information according to biological knowledge, limiting ambiguity and making that knowledge accessible to both humans and computers.
The work described in this thesis is one step towards the formalisation of biological knowledge useful to systems biology. Experimental metadata has been transformed into common structures, data appropriate to the annotation of systems biology models can be retrieved and presented to users, and multiple data models have been formalised and made accessible to semantic integration techniques.
1 Simplicity versus complexity
When transforming data and metadata into a common structure or model, there are always compromises to be made between simplicity and complexity. Rather than there being a single correct answer, researchers investigating particular integration techniques choose points along this spectrum amenable to their requirements. The dichotomy of simplicity and complexity is appropriate for a range of topics relevant to systems biology data integration including metadata, simulatable model creation and expressivity. The question of expressivity versus scalability is discussed in more detail below as it has a direct effect on the performance of semantic data integration techniques such as rule-based mediation.
In general, researchers find it easier to create data than to annotate that data with large amounts of metadata; while simplistic metadata requirements allow quicker data deposition, complex metadata requirements can create a barrier to data entry. Standards are often developed once ‘just-enough’ data description strategies no longer suffice (as described in Section 1.4), yet as the amount of metadata asked of researchers grows, the chance of receiving complete metadata shrinks.
Systems biology models themselves are another example of choosing between simplicity and complexity. This is not simply a question of the granularity of such models, as whole-organism models and single-pathway models of a similar level of complexity can be produced. However, simpler models with fewer species and less complex interactions can be more easily understood by people and simulated by computers. While complex models present a more complete picture, the resources required to simulate and understand them will be correspondingly greater.
As a final example, formal models such as ontologies can be either simple and scalable or complex and expressive; there is always a balance to be struck between the two. Expressivity is a measure of what it is possible to say with a language, and as such has a direct bearing on the inferences possible with that language; too little expressivity results in a lack of “reasoning opportunities”, leaving a language that provides no benefit over existing, simpler languages. Simple formal models are usually scalable and can easily be reasoned over but do not, by definition, contain a high level of expressivity. Complex models have a higher level of expressivity and are capable of modelling, for example, more biological context. With such models, a more precise description of the domain of interest is created at the cost of reasoning time.
OWL-DL is an ontology formalism which is guaranteed to be decidable. The decidable subsets of OWL, such as OWL-DL, are intended to allow both expressivity and tractability for reasoning purposes; these requirements typically stand in opposition to each other, and as such it is a goal of OWL to find a suitable balance. Notably, there are no guarantees made about the level of scalability of any given ontology written in OWL-DL. The more complex an ontology, the less amenable it is to reasoning; a change of even a single axiom can result in a reasoner completing in days rather than seconds. Detailed studies of the complexity of ontological reasoning, and of how that complexity increases with the expressiveness of an ontology, are available [11, Chapter 3], and are beyond the scope of this thesis. However, expressivity versus tractability is of vital importance in semantic data integration in general and in rule-based mediation specifically, where the core ontology is intended to be small but highly expressive.
Decidability and expressiveness are vital considerations when choosing a language for rule-based mediation; languages such as OBO were not originally created with these challenges in mind. Not only is it important to choose a language which supports a balance between expressivity and tractability, but it is important to create ontologies with those considerations in mind. Even if a decidable language such as OWL-DL is used, an ontology can be easily created which cannot be reasoned over in sensible time scales. Currently, in semantic data integration methods such as rule-based mediation, scalability issues occur before more general integration issues such as database churn become important. Therefore, for many in the life sciences, ontology engineering involves keeping ontologies as simple as possible for the stated requirements, adding detail and expressivity only where necessary and in a non-disruptive way.
2 The future of data integration
The use of confidence values for particular mappings between ontologies has been described, although research in this area is still limited. The ability to modify the set of mappings between source and core ontologies based on the perceived quality of those mappings would be an interesting area of further research. Rule-based mediation makes use of an expressive core ontology, which currently places limitations on its scalability. As brute computational power and technology such as reasoners improve, methodologies such as rule-based mediation could move from precise, small-scale integrative tasks to much larger integrative challenges. Additional automation is also a necessary step if semantic methods such as rule-based mediation are to be implemented on a wider scale. Existing and future collaborations should provide scope for these and other possibilities for this research. A collaboration between the CellML project and the Saint project started in 2009 and has continued in 2011 and 2012 with additional programmer support to extend the functionality of Saint with respect to CellML, and to further aid the integration of MIRIAM annotation with the CellML format and API. Rule-based mediation has also been identified by Hoehndorf and co-workers as a possible avenue of collaboration.
A much broader perspective on the future of data integration is not just which integrative methodology is best suited for a task, but whether data integration will even have a place in biological research. If all life science data were expressed in a homogeneous manner, data integration would be unnecessary. The rest of this section explores some of the ways in which data might become accessible to all, without the need for interfaces and integration layers among formats.
If standards were fully utilised within the life sciences, every researcher would be publishing data in recognised, common formats. However, this does not necessarily mean a single format for the entirety of the life sciences; more likely, a large number of standards would remain, as described in Section 1.4. While it is true that many community standards are already in common use (e.g. SBML and BioPAX), standards efforts have a history of slow development (e.g. OBI) and slow uptake (e.g. FuGE). Additionally, new questions will continue to be asked and new experimental types will continue to be invented. By definition, these new experiment types will be created faster than the corresponding standards can be developed. Even as experiment types and, ultimately, community standards mature, new standards or combinations of standards might be required. Through experience gained as a developer of many community standards, it has become clear that, irrespective of both the number of experimental types and the quality of standardised data formats, there will likely always be some overlap in the data descriptions among the formats. If there is even a small amount of information in one community standard that is of use to another community, data integration will continue to play a vital role in the life sciences. As such, even widespread, endemic use of standards will not remove the need for high-quality data integration. As quickly as standards are made, new data is created and new descriptions of data are written which require standardisation.
Alternatively, existing data integration methods might become useless as newer methodologies are investigated. For instance, semantic methodologies are able to capture more knowledge about the data than syntactic methods. More complete, less ambiguous models of the domain of interest lead to less confusion and fewer mapping errors during the integration process. However, as described in Section 1.6, many syntactic data integration methods remain popular and are being successfully developed throughout the life sciences. Emerging technologies such as reasoner-accessible semantic methods have proven useful, but are not yet supplanting existing methods. Scalability and tractability are core issues when reasoners are applied, and the very expressivity which makes ontology-based techniques useful quickly exposes such issues.
There is a growing movement within science towards openness of data. Open Data movements such as the Panton Principles (Note: http://pantonprinciples.org/) describe appropriate licensing for open experimental data, while many others have advocated open models of publishing papers as well as data (Note: More information can be found at http://cameronneylon.net/ and https://opencitations.wordpress.com/2011/08/04/the-plate-tectonics-of-research-data-publication/.). For instance, arXiv (Note: http://arxiv.org/), an e-print archive for topics including physics, mathematics, computer science, quantitative biology and statistics, provides open access to both data and papers. Conferences are also becoming more open, with the publishing of papers and presentations in open access journals and with the use of social networking to provide live coverage of conferences [14, 15]. It may seem as if this increased availability and openness of data and publications might decrease the importance of data integration. However, open data is a prime example of why data reuse and integration will remain important. Larger amounts of data do not lessen the need for integration methodologies, as open data is not guaranteed to be in the same format.
Finally, if there is a move away from small, tightly-scoped experiments towards large-scale experiments, large amounts of data could be produced in a single format. For example, there would be no need to integrate disparate experimental data files if the research could be reproduced cheaply and quickly. While large-scale experimentation would ease the burden of integration within one community or experimental type, it would not solve the integration problems across multiple communities. As such, this would not make data integration redundant.
Although there are a number of ways in which data may become more homogeneous, none will render data integration obsolete. If all of systems biology were formalised and structured according to a common data model, transformations on that data model would still be required. For instance, the model could go through rounds of optimisation or updating during schema evolution. Syntactic methods will most likely play a pivotal role in data integration for the foreseeable future, until the limitations of semantic methods are ameliorated and ontologies and other Semantic Web technologies gain much higher coverage within the life sciences. The ability of Semantic Web formats such as RDF to describe any type of data, and of ontology languages such as OWL to model any type of biology, will ensure that semantic data integration will only increase in importance as scaling issues are addressed and the time taken to perform the integration decreases.
3 Guidance for modellers and other researchers
If data generators and systems biology modellers followed just a few guidelines when describing their information, data integration tasks could become much more straightforward and many existing methods of integration would immediately become more powerful. Small changes could result in substantive improvements in the way data is described and, ultimately, in the quality of the integration that can be performed. For instance, unambiguous naming would result in better hits against external data sources using a systems biology annotation application such as Saint (Chapter 3). The list below describes these guidelines, and how their use would aid data integration in general as well as help the integration methods presented in this thesis.
- Informative naming. In general, systems biology modellers name model entities using non-informative names and identifiers as well as non-intuitive naming schemes. A species might be called “A” or “species1”, providing no clue as to the identity of the species for prospective users of the model. Further, while an individual modeller might conform to their own personal naming scheme, there is no guarantee that others will know that the “_p” added to a species name means it is the phosphorylated form; it could equally mean gene product, probable name, or even possibly unknown. An informative naming methodology is simple and quick to implement while also being a great help to others. Even if no further information is provided about a model entity, educated guesses as to the identity of that entity can be made both by humans and computers if commonly-used names are provided.
- Consider standards from the start of the research. The life science community has many ways to store experimental data in a format recognisable to computers and to humans, and many helper applications to do most of the work for the researcher. SyMBA (Chapter 2) is an archival application capable of storing experimental data and metadata. The ISA Suite assists with data management and standards-compliant storage while the experiments are being run, and does so in a spreadsheet-style environment comfortable to many researchers. Ensuring that the correct information is stored while the data is being gathered saves time for the researcher and eases submission of that data to public repositories prior to publication.
- Think about the future. As described by one Twitter user, “metadata is a love note to the future” (Note: http://twitpic.com/6ry6ar). High quality metadata makes data integration easier, and enables greater reuse, sharing and re-purposing of the underlying information, both for the creator of that information and for future interested researchers. The use of systems such as Saint (Chapter 3) and rule-based mediation (Chapter 5), which add annotation in a semi-automated way, can help with the addition of metadata, fostering understanding and allowing the research to reach a wider audience.
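The first guideline above (informative naming) can even be turned into a simple automated check. The patterns below are illustrative assumptions about what counts as uninformative, not an established convention.

```python
# A simple heuristic for the informative-naming guideline above: flag
# species names that are bare letters, generic numbered ids, or carry the
# ambiguous "_p" suffix. The patterns are illustrative assumptions.

import re

UNINFORMATIVE = [
    re.compile(r"^[A-Za-z]$"),          # single letters: "A", "x"
    re.compile(r"^species\d+$", re.I),  # generic ids: "species1"
    re.compile(r"_p$"),                 # "_p": phosphorylated? product? unknown?
]

def flag_names(names):
    """Return the model entity names a modeller should reconsider."""
    return [n for n in names if any(p.search(n) for p in UNINFORMATIVE)]
```

A check like this could run at model-deposition time, nudging modellers towards commonly-used names before a model reaches a public repository.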
The early releases of life science databases such as EMBL/GenBank/DDBJ and UniProtKB provided data in a structured, text-based flat file format. As techniques, knowledge and computing power improved, the storage and exchange formats for these and other databases have become more complex. While EMBL/GenBank/DDBJ and UniProtKB can still be retrieved as flat files, internally the data is stored in relational databases, and UniProtKB provides both an XML and an RDF export. Newer life science databases such as BioModels and Pathway Commons never produced a flat file format; instead, XML and OWL were used as the primary data format from the start. The progression of the structure of data from text file to XML to OWL is a progression towards a formalisation of the representation of biology.
Semantic technologies are being heavily researched and used in the life sciences and in larger-scale commercial portals such as Google. However, not all semantic projects have been successful. For instance, the semantic database and publishing system Ambra (Note: http://www.topazproject.org/trac/wiki/Ambra), in use by the PLoS group of journals for a number of years (Note: http://blogs.plos.org/everyone/2009/05/13/all-plos-titles-now-on-the-same-publishing-platform/), was dropped in November 2011 for a faster system called New Hope (Note: http://blogs.plos.org/plos/2011/12/new-hope-the-new-platform-for-the-plos-journal-websites/). In contrast, social networks and other large commercial portals have been increasing their usage of semantic technologies. As of October 2011, Facebook has begun supporting linked data and RDF (Note: http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData). Google makes use of RDFa and microformats for its Rich Snippets technology (Notes: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170, http://www.readwriteweb.com/archives/google_semantic_web_push_rich_snippets_usage_grow.php). The massive amounts of data stored by companies such as Google and Facebook, or to a lesser degree by semantically aware life science databases such as Bio2RDF, may provide the volume of metadata required to increase awareness and usage of the Semantic Web (Note: http://semanticweb.com/the-evolution-of-search-at-google_b25042). As the amount of life science data grows, improved semantic structure of that data will aid integration and querying tasks.
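Rich Snippets work by reading structured annotations embedded directly in page markup. The sketch below embeds RDFa Lite attributes (vocab, typeof, property) in a small, well-formed HTML fragment and extracts the annotated properties with the standard library; the fragment is a minimal illustration of the annotation style, not Google's actual parser:

```python
import xml.etree.ElementTree as ET

# A well-formed HTML fragment annotated with RDFa Lite attributes,
# of the kind consumed by search-engine rich snippets.
FRAGMENT = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Ada Lovelace</span>
  <span property="jobTitle">Mathematician</span>
</div>
"""

def extract_properties(fragment):
    """Collect property -> text pairs from RDFa 'property' attributes."""
    root = ET.fromstring(fragment)
    return {element.attrib["property"]: element.text
            for element in root.iter() if "property" in element.attrib}

properties = extract_properties(FRAGMENT)
# properties == {"name": "Ada Lovelace", "jobTitle": "Mathematician"}
```

The human-visible text and the machine-readable statements live in the same markup, which is what lets a crawler recover typed metadata without a separate data feed.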
The research presented in this thesis standardises experimental metadata (Chapter 2), and both integrates (Chapters 3 and 5) and formalises (Chapters 4 and 5) systems biology knowledge. As a result, new biological annotation can be added to systems biology models, and information useful for systems biology has been transformed into more computationally amenable and semantically aware structures.
This research demonstrates the advantages gained from an increased formalisation of biological knowledge. The use of richer semantics allows the automation of many aspects of the annotation process for systems biology models, as well as computational access to those models so that their correctness can be rigorously assessed. While this comes at the cost of greater formality and standardisation of experimental data, that cost can be partially alleviated through applications that make the input of biological descriptions easier. As the life sciences become ever more data rich, such formalisation becomes less of an option and more of a necessity if the multitude of data resources is to remain accessible and manageable to both researchers and computers.
- Clifford Lynch. Big data: How do your data grow? Nature, 455(7209):28–29, September 2008.
- Doug Howe, Maria Costanzo, Petra Fey, Takashi Gojobori, Linda Hannick, Winston Hide, David P. Hill, Renate Kania, Mary Schaeffer, Susan St Pierre, Simon Twigger, Owen White, and Seung Yon Y. Rhee. Big data: The future of biocuration. Nature, 455(7209):47–50, September 2008.
- Lee Harland, Christopher Larminie, Susanna-Assunta A. Sansone, Sorana Popa, M. Scott Marshall, Michael Braxenthaler, Michael Cantor, Wendy Filsell, Mark J. Forster, Enoch Huang, Andreas Matern, Mark Musen, Jasmin Saric, Ted Slater, Jabe Wilson, Nick Lynch, John Wise, and Ian Dix. Empowering industrial research with shared biomedical vocabularies. Drug Discovery Today, 16(21-22):940–947, September 2011.
- Alan Ruttenberg, Tim Clark, William Bug, Matthias Samwald, Olivier Bodenreider, Helen Chen, Donald Doherty, Kerstin Forsberg, Yong Gao, Vipul Kashyap, June Kinoshita, Joanne Luciano, M. Scott Marshall, Chimezie Ogbuji, Jonathan Rees, Susie Stephens, Gwendolyn Wong, Elizabeth Wu, Davide Zaccagnini, Tonya Hongsermeier, Eric Neumann, Ivan Herman, and Kei H. Cheung. Advancing translational research with the Semantic Web. BMC bioinformatics, 8 Suppl 3(Suppl 3):S2+, 2007.
- Allyson L. Lister. Semantic Integration in the Life Sciences. Ontogenesis, January 2010.
- Emek Demir, Michael P. Cary, Suzanne Paley, Ken Fukuda, Christian Lemer, Imre Vastrik, Guanming Wu, Peter D’Eustachio, Carl Schaefer, Joanne Luciano, Frank Schacherer, Irma Martinez-Flores, Zhenjun Hu, Veronica Jimenez-Jacinto, Geeta Joshi-Tope, Kumaran Kandasamy, Alejandra C. Lopez-Fuentes, Huaiyu Mi, Elgar Pichler, Igor Rodchenkov, Andrea Splendiani, Sasha Tkachev, Jeremy Zucker, Gopal Gopinath, Harsha Rajasimha, Ranjani Ramakrishnan, Imran Shah, Mustafa Syed, Nadia Anwar, Ozgün Babur, Michael Blinov, Erik Brauner, Dan Corwin, Sylva Donaldson, Frank Gibbons, Robert Goldberg, Peter Hornbeck, Augustin Luna, Peter Murray-Rust, Eric Neumann, Oliver Reubenacker, Matthias Samwald, Martijn van Iersel, Sarala Wimalaratne, Keith Allen, Burk Braun, Michelle Whirl-Carrillo, Kei-Hoi H. Cheung, Kam Dahlquist, Andrew Finney, Marc Gillespie, Elizabeth Glass, Li Gong, Robin Haw, Michael Honig, Olivier Hubaut, David Kane, Shiva Krupa, Martina Kutmon, Julie Leonard, Debbie Marks, David Merberg, Victoria Petri, Alex Pico, Dean Ravenscroft, Liya Ren, Nigam Shah, Margot Sunshine, Rebecca Tang, Ryan Whaley, Stan Letovksy, Kenneth H. Buetow, Andrey Rzhetsky, Vincent Schachter, Bruno S. Sobral, Ugur Dogrusoz, Shannon McWeeney, Mirit Aladjem, Ewan Birney, Julio Collado-Vides, Susumu Goto, Michael Hucka, Nicolas Le Novère, Natalia Maltsev, Akhilesh Pandey, Paul Thomas, Edgar Wingender, Peter D. Karp, Chris Sander, and Gary D. Bader. The BioPAX community standard for pathway data sharing. Nature biotechnology, 28(9):935–942, September 2010.
- The OBI Consortium. OBI Ontology.
- Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through rule-based semantic integration. Journal of Biomedical Semantics, 1 Suppl 1(Suppl 1):S3+, 2010.
- Robert Hoehndorf, Michel Dumontier, John Gennari, Sarala Wimalaratne, Bernard de Bono, Daniel Cook, and Georgios Gkoutos. Integrating systems biology models and biomedical ontologies. BMC Systems Biology, 5(1):124+, August 2011.
- Web Ontology Working Group. OWL Web Ontology Language Use Cases and Requirements, February 2004.
- Franz Baader, Diego Calvanese, Deborah Mcguinness, Daniele Nardi, and Peter Patel-Schneider, editors. The Description Logic Handbook. Cambridge University Press, first edition, January 2003.
- Stefan Schulz, Kent Spackman, Andrew James, Cristian Cocos, and Martin Boeker. Scalable representations of diseases in biomedical ontologies. Journal of Biomedical Semantics, 2(Suppl 2):S6+, 2011.
- Jinguang Gu, Baowen Xu, and Xinmeng Chen. An XML query rewriting mechanism with multiple ontologies integration based on complex semantic mapping. Information Fusion, 9(4):512–522, October 2008.
- Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live Coverage of Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) 2009. PLoS Computational Biology, 6(1):e1000640+, January 2010.
- Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live Coverage of Scientific Conferences Using Web Technologies. PLoS Computational Biology, 6(1):e1000563+, January 2010.
- Philippe Rocca-Serra, Marco Brandizi, Eamonn Maguire, Nataliya Sklyar, Chris Taylor, Kimberly Begley, Dawn Field, Stephen Harris, Winston Hide, Oliver Hofmann, Steffen Neumann, Peter Sterk, Weida Tong, and Susanna-Assunta Sansone. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26(18):2354–2356, September 2010.
- Guy Cochrane, Ruth Akhtar, James Bonfield, Lawrence Bower, Fehmi Demiralp, Nadeem Faruque, Richard Gibson, Gemma Hoad, Tim Hubbard, Christopher Hunter, Mikyung Jang, Szilveszter Juhos, Rasko Leinonen, Steven Leonard, Quan Lin, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Sheila Plaister, Rajesh Radhakrishnan, Stephen Robinson, Siamak Sobhany, Petra T. Hoopen, Robert Vaughan, Vadim Zalunin, and Ewan Birney. Petabyte-scale innovations at the European Nucleotide Archive. Nucleic Acids Research, 37(suppl 1):D19–D25, January 2009.
- The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Research, 36(suppl 1):D190–D195, January 2008.
- Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
- Ethan G. Cerami, Benjamin E. Gross, Emek Demir, Igor Rodchenkov, Özgün Babur, Nadia Anwar, Nikolaus Schultz, Gary D. Bader, and Chris Sander. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research, 39(suppl 1):D685–D690, January 2011.
- F. Belleau, M. Nolin, N. Tourigny, P. Rigault, and J. Morissette. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, October 2008.