Categories
Data Integration

Background: What does systems biology data look like? (Thesis 1.2)

What does systems biology data look like?

Properties of a system exist that are more than just the sum of their parts; systems that contain these emergent properties are said to be irreducible (Note: Why Systems Matter, accessed December 2011.). Though reductionist methods of research can provide a large amount of detail for specific biological entities, a more holistic systems approach is required to understand emergent systems properties [1]. Such a top-down approach creates a life cycle of systems biology research. Beginning with the Hodgkin–Huxley model of squid axons in 1952 [2], hypotheses have been tested both in the laboratory and through simulations of mathematical models. Data from the laboratory informs these models, which can then be used to inform further experimentation and validate or invalidate hypotheses.

Systems biology focuses on the study of systems as a whole rather than on the examination of individual constituent parts. Data useful to systems biology tends to be large and heterogeneous both in dimensionality and in structure, with modern high-throughput techniques collecting vast amounts of relevant information [3]. It is standard practice to take data points from a sample not just once, but across space, time, geographical location, organisational or even spectral range [4]. The wide variety of experimental types leads to a correspondingly large number of data representations, analysis methods and modelling strategies [5]. The reconciliation of disparate systems biology data, and the concomitant organisation and management of biological data sources into an exploitable “resourceome”, is of great importance to researchers requiring access to existing data [6].

With the maturation of research methods, interpretations of the systems biology life cycle have become correspondingly more complex. Kitano detailed a relatively simple systems biology life cycle in 2002 which is summarised in Figure 1. By 2006, Philippi and colleagues had incorporated a data integration step as described in Figure 2. By 2009, semantics had become important enough to systems biology research that Antezana and colleagues had added formalisation of knowledge and reasoning to the cycle (see Figure 3).

Figure 1: The systems biology life cycle in 2002, based on Kitano [3, Fig.1]. “Dry”, in silico modelling and simulation experiments inform “wet” experiments, which in turn generate data used to create and further inform hypotheses.

Figure 2: The systems biology life cycle in 2006, based on Philippi and colleagues [7, Fig.1b]. Four years after the Kitano [3] life cycle was published, data integration methodologies, highlighted in yellow, were common enough to be added. Further, the entire cycle could be completed with either wet or dry experiments, or a combination of both.

Figure 3: The semantic systems biology life cycle in 2009, based on Antezana and colleagues [6, Fig.2]. The new methods of integration and the addition of a reasoning step are highlighted in yellow. The semantic phase is iterative, shown with an arrow back to the integration and formalism step. The continued importance of the original Kitano life cycle is described with an arrow bypassing the semantic phase. While the original figure by Antezana and colleagues did not explicitly include a reference to in silico research, the experiments described in the paper could have been either dry or wet.

Kitano’s life cycle does not mention databases or integration of generated data with other data sources. Philippi and colleagues’ modified life cycle has these additions as well as the acknowledgement that “dry” in silico experiments produce useful data independently of “wet” experiments. Historically, data integration in bioinformatics consisted of cross references between databases or links out via URLs (see Section 1.6 for more information). More complex linking became common as ontologies such as the Gene Ontology (GO) [8] made it possible to reference community-wide hierarchies of descriptive biological terms.

Very recently, with an increase in the use of Semantic Web (Note: http://www.w3.org/2001/sw/) technologies such as ontologies, semantic data integration has become an important tool in systems biology research [6] (see Section 1.6). Figure 3 shows an interesting progression in the perception of researchers with regard to the systems biology life cycle with the addition of semantic techniques. By 2009, semantics and ontologies were becoming a bigger part of systems biology research. As such, Antezana and colleagues added the formalisation of data to the integration step, allowing data to be viewed in a semantically uniform way. The semantic data then becomes accessible to computational methods, allowing reasoning and consistency checking of the data. Even so, the research described in this thesis is one of only a handful of projects focusing on semantic data integration in systems biology.

There are four main areas of study in systems biology research: (i) the structure (e.g. interactions and pathways) of a system; (ii) how a system behaves over time, or its dynamics; (iii) the method of controlling and modulating the system; and (iv) the design method, or the deliberate progress using well-defined design principles [3]. These four properties are strongly tied to the quantitative modelling aspect of systems biology, and illustrate the importance of such models. However, models are of limited use to either people or computers if they do not have structured biological annotations to provide context [9]. For instance, until SBML [10] models are annotated by the BioModels team, elements often contain short-hand, biologically irrelevant names and descriptions in computationally incompatible free text [11]. While attaching additional biological knowledge to quantitative models is not a requirement for their simulation, without such annotations model sharing, interpretation of simulation results, integration and reuse become nearly impossible [9]. Therefore, the addition of biologically relevant, computationally accessible metadata will not only enhance the semantics of a model but also provide a method of unambiguously identifying its elements.
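As a minimal sketch of what such an annotation can look like, the Python below parses a small SBML-style species element whose id and name carry no biological meaning, and reads the database reference attached to it. The element itself is hypothetical and written only for illustration: the id "s1", the name "X" and the choice of UniProt entry are assumptions, not taken from any published model. The RDF layout with a bqbiol:is qualifier follows the general MIRIAM annotation convention [9].

```python
import xml.etree.ElementTree as ET

# Namespaces used inside SBML annotations: RDF plus the BioModels.net
# biology qualifiers.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bqbiol": "http://biomodels.net/biology-qualifiers/",
}

# Hypothetical species element. The id "s1" and name "X" are the kind of
# short-hand labels that say nothing about the biology; the annotation
# block is what identifies the entity unambiguously.
SPECIES_XML = """
<species xmlns="http://www.sbml.org/sbml/level2/version4"
         metaid="meta_s1" id="s1" name="X" compartment="cell">
  <annotation>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
      <rdf:Description rdf:about="#meta_s1">
        <bqbiol:is>
          <rdf:Bag>
            <rdf:li rdf:resource="urn:miriam:uniprot:P04637"/>
          </rdf:Bag>
        </bqbiol:is>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>
"""


def annotation_uris(species_xml):
    """Return the resources that the species is declared to be (bqbiol:is)."""
    root = ET.fromstring(species_xml)
    resource_attr = "{%s}resource" % NS["rdf"]
    items = root.iterfind(".//bqbiol:is/rdf:Bag/rdf:li", NS)
    return [item.get(resource_attr) for item in items]


if __name__ == "__main__":
    print(annotation_uris(SPECIES_XML))
```

A program that ignores the name "X" and reads only the annotation can still work out which biological entity the species represents, which is the property that makes annotated models shareable and reusable.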

The majority of systems biology research projects can ultimately be interpreted to produce interconnected data such as gene networks, protein networks and metabolic networks [12]. The level of granularity of these networks of information can vary from large-scale omics networks with thousands of nodes to precisely calibrated quantitative models of specific molecular interactions. The integration of networks and models presents a challenge to systems biology, increasing the importance of bioinformatics techniques to the life science community, a result in opposition to early predictions [13]. In Section 1.3, the description of biological systems is examined through the use of networks and models.

Bibliography

[1]
Uwe Sauer, Matthias Heinemann, and Nicola Zamboni. Getting Closer to the Whole Picture. Science, 316(5824):550–551, April 2007.
[2]
A. L. Hodgkin and A. F. Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, August 1952.
[3]
Hiroaki Kitano. Systems Biology: A Brief Overview. Science, 295(5560):1662–1664, March 2002.
[4]
Jason R. Swedlow, Suzanna E. Lewis, and Ilya G. Goldberg. Modelling data across labs, genomes, space and time. Nature Cell Biology, 8(11):1190–1194, November 2006.
[5]
Katrin Hübner, Sven Sahle, and Ursula Kummer. Applications and trends in systems biology in biochemistry. FEBS Journal, 278(16):2767–2857, August 2011.
[6]
Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.
[7]
Stephan Philippi and Jacob Kohler. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet, 7(6):482–488, June 2006.
[8]
Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
[9]
Nicolas Le Novère, Andrew Finney, Michael Hucka, Upinder S. Bhalla, Fabien Campagne, Julio Collado-Vides, Edmund J. Crampin, Matt Halstead, Edda Klipp, Pedro Mendes, Poul Nielsen, Herbert Sauro, Bruce Shapiro, Jacky L. Snoep, Hugh D. Spence, and Barry L. Wanner. Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology, 23(12):1509–1515, December 2005.
[10]
M. Hucka, A. Finney, H. M. Sauro, H. Bolouri, J. C. Doyle, H. Kitano, and the rest of the SBML Forum: A. P. Arkin, B. J. Bornstein, D. Bray, A. Cornish-Bowden, A. A. Cuellar, S. Dronov, E. D. Gilles, M. Ginkel, V. Gor, I. I. Goryanin, W. J. Hedley, T. C. Hodgman, J. H. Hofmeyr, P. J. Hunter, N. S. Juty, J. L. Kasberger, A. Kremling, U. Kummer, N. Le Novère, L. M. Loew, D. Lucio, P. Mendes, E. Minch, E. D. Mjolsness, Y. Nakayama, M. R. Nelson, P. F. Nielsen, T. Sakurada, J. C. Schaff, B. E. Shapiro, T. S. Shimizu, H. D. Spence, J. Stelling, K. Takahashi, M. Tomita, J. Wagner, and J. Wang. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4):524–531, March 2003.
[11]
Chen Li, Marco Donizelli, Nicolas Rodriguez, Harish Dharuri, Lukas Endler, Vijayalakshmi Chelliah, Lu Li, Enuo He, Arnaud Henry, Melanie Stefan, Jacky Snoep, Michael Hucka, Nicolas Le Novere, and Camille Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4(1):92+, June 2010.
[12]
James E. Ferrell. Q&A: systems biology. Journal of biology, 8(1):2+, January 2009.
[13]
Lincoln D. Stein. Bioinformatics: alive and kicking. Genome biology, 9(12):114+, December 2008.

Background: Overview (Thesis 1.1)

Overview

Studying biological systems requires a large amount of data of different experimental types. Historically, each of these types is stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. The use of community data standards can reduce the need for specialised, independent formats by providing a common syntax to make data retrieval and manipulation easier. However, standards uptake is not universal, and the disparate data types required by systems biologists create data that is not, or cannot be, completely described by a single standard. If existing data does not share a standard structure, theoretically any heterogeneous data could be reproduced in a single format by rerunning experiments. Because in practice such a method would be expensive and time consuming, integrative methods which reuse existing data should be explored.

Though it is possible for the biology represented by any given data format to be completely orthogonal to other experimental types, more commonly, portions of the biology (but not necessarily the formats describing them) overlap. These common aspects of biological data representations create theoretical integration points among the representations, allowing information reuse and re-purposing. However, shared biological concepts do not necessarily result in shared definitions of the biology. Whereas differences in format result in syntactic heterogeneity, differences in the meaning of seemingly identical biological concepts across different formats result in semantic heterogeneity. A portion of the work presented in this thesis addresses syntactic heterogeneity both through the use of a common experimental metadata standard and by systematically integrating data for the purposes of systems biology model annotation. While syntactic heterogeneity can be resolved through the alignment of common structures, semantic heterogeneity is a more complex challenge. Once the meaning of the underlying biological concepts of interest for all data sources has been made explicit, the semantic heterogeneity can be identified. Further, if the semantics of a data format are accessible to machines, computational reasoning can be applied to find inferences and logical inconsistencies. The work described in this thesis includes the conversion of a systems biology standard specification, spread across multiple documents, into a single semantically aware model of that specification.
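The distinction between the two kinds of heterogeneity can be illustrated with a small Python sketch. Everything in it is hypothetical: the two records, their field names, the identifier "PROT1" and the units are assumptions invented for this example, not drawn from any real data source. Renaming fields resolves the syntactic difference, but the "mass" values still disagree until their meaning (here, the unit) is made explicit.

```python
# Two hypothetical records describing the same protein in different formats.
record_a = {"accession": "PROT1", "organism": "Homo sapiens", "mass": 50.0}
record_b = {"id": "PROT1", "species": "Homo sapiens", "mass": 50000.0}

# Syntactic heterogeneity: the same information sits under different field
# names. Aligning the structures is a simple renaming.
FIELD_MAP_B_TO_A = {"id": "accession", "species": "organism", "mass": "mass"}


def align_structure(record, field_map):
    """Rename the fields of a record so that it matches the target structure."""
    return {field_map[key]: value for key, value in record.items()}


# Semantic heterogeneity: both formats use the field "mass", but (by
# assumption) format A means kilodaltons and format B means daltons.
DALTONS_PER_UNIT = {"kDa": 1000.0, "Da": 1.0}


def mass_in_daltons(value, unit):
    """Make the meaning of a mass value explicit by converting it to daltons."""
    return value * DALTONS_PER_UNIT[unit]


aligned_b = align_structure(record_b, FIELD_MAP_B_TO_A)

if __name__ == "__main__":
    # The structures now match, yet the values still appear to conflict.
    print(aligned_b.keys() == record_a.keys())
    print(aligned_b["mass"] == record_a["mass"])
    # Once the units are explicit, the two records agree.
    print(mass_in_daltons(record_a["mass"], "kDa")
          == mass_in_daltons(aligned_b["mass"], "Da"))
```

The field mapping can be written by inspecting the two structures alone, whereas the unit conversion depends on knowledge that neither structure states. That gap is why semantic heterogeneity is the harder of the two problems.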

A wide variety of integration methodologies addressing various aspects of syntactic and semantic heterogeneity are available, often optimised for different situations. Many of these methods do not address semantic heterogeneity in systems biology data. Of those that do, very few use existing technologies, address both syntactic and semantic heterogeneity, and combine simple syntactic conversions of non-semantic formats with semantically meaningful models of the biological domain of interest. The work described in this thesis includes a method of semantic data integration called rule-based mediation, which provides these features and which was developed as an aid to systems biology model annotation. Integrating resources with rule-based mediation accommodates multiple formats with different semantics and provides richly modelled biological knowledge suitable for annotation of systems biology models.

Within this introductory chapter, Section 1.2 provides an overview of systems biology and the challenge presented when multiple formats are used. Section 1.3 describes how differences in format are not simply the result of different types of experiments, but are also due to the variety of ways systems are modelled. Section 1.4 describes the content, syntax and semantics standards relevant for systems biology. Data heterogeneity is an issue not limited to systems biology, and as such there is a large amount of previous work on data integration (see Section 1.6). As described in Section 1.5, existing technologies such as ontologies, rules and reasoning can bring together heterogeneous data in a homogeneous, meaning-rich, computationally-amenable manner.

Figure 1: How the work described in this thesis relates to the semantic systems biology life cycle described further in Section 1.2, Figure 3. SyMBA provides a common structure for experimental metadata and an archive for experimental data of any type, thus aiding data storage and analysis. Saint helps systems biology modellers annotate models with biological information in a standard way, thus enhancing the quality of models used in in silico experiments. MFO and rule-based mediation use Semantic Web technologies to formalise systems biology data and perform automated reasoning over that data. Rule-based mediation also semantically integrates information relevant to systems biology.

Figure 1 provides a summary of the work described in the chapters of this thesis in the context of the semantic systems biology life cycle originally described by Antezana and colleagues [1, Fig.2]. Retrieval and storage of systems biology experimental metadata using a common syntax becomes easier with applications like the collaboratively-developed SyMBA (Chapter 2). The Saint Web application (Chapter 3) provides syntactic integration of systems biology data as well as a simple interface for viewing, manipulating and exporting new biological annotation to existing systems biology models. Saint is useful both in its own right and as a test case for determining the data sources, capabilities and requirements for the implementation of rule-based mediation. New data sources can easily be added, and therefore Saint has the capacity to provide access to data integrated via rule-based mediation.

In some cases, community standards and syntactic integration are not enough. Rules and restrictions on the use of a standard syntax are not always confined to the syntax itself; extra information can be present in human-readable documentation such as Word or PDF documents, but not directly accessible to computers. Therefore, even if the data is in a common syntax, there are limits on its computational accessibility. This problem is resolved for SBML models through the use of MFO (Chapter 4), an ontology which holds SBML data as well as rules and restrictions on the SBML structure. MFO provides a format through which reasoning can be applied to SBML models, and stands both on its own and as part of the semantic data integration methodology described in Chapter 5.

Semantic data integration via rule-based mediation, described in Chapter 5, is a useful way of reusing and re-purposing heterogeneous datasets which cannot be, or are not, structured according to a common standard. This method of integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations. Syntactic heterogeneity is resolved through the conversion to a computationally-accessible syntactic ontology, and semantic heterogeneity is resolved by mapping the syntactic ontology to a biological domain of interest which is itself modelled using an ontology.
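The two-step shape of this approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in Chapter 5: plain dictionaries stand in for the syntactic and domain ontologies, ordinary functions stand in for the mapping rules, and the source names, field names and the "Protein" concept are all hypothetical.

```python
def to_syntactic(source, record):
    """Step 1: wrap a source record as-is, without interpreting its meaning.

    This stands in for conversion into a source-specific syntactic ontology.
    """
    return {"source": source, "fields": dict(record)}


# Step 2: rules that map source-specific structures onto shared domain
# concepts. Each rule returns a domain individual, or None if it does not
# apply to the given syntactic individual.
def rule_source_a(individual):
    fields = individual["fields"]
    if individual["source"] == "source_a" and "accession" in fields:
        return {"type": "Protein",
                "identifier": fields["accession"],
                "name": fields.get("full_name")}
    return None


def rule_source_b(individual):
    fields = individual["fields"]
    if individual["source"] == "source_b" and fields.get("kind") == "protein":
        return {"type": "Protein",
                "identifier": fields["id"],
                "name": fields.get("label")}
    return None


RULES = [rule_source_a, rule_source_b]


def mediate(individuals, rules=RULES):
    """Apply the first matching rule to each individual; skip unmapped ones."""
    domain = []
    for individual in individuals:
        for rule in rules:
            mapped = rule(individual)
            if mapped is not None:
                domain.append(mapped)
                break
    return domain


individuals = [
    to_syntactic("source_a", {"accession": "PROT1", "full_name": "example protein"}),
    to_syntactic("source_b", {"id": "PROT1", "kind": "protein", "label": "example protein"}),
    to_syntactic("source_b", {"id": "X9", "kind": "unknown"}),
]
domain_view = mediate(individuals)

if __name__ == "__main__":
    for item in domain_view:
        print(item)
```

Keeping the source-specific conversion separate from the mapping rules means that a new data source needs only its own conversion and a few rules, while the shared domain model stays unchanged.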

Chapter 6 discusses future possibilities for data use and reuse. Will data integration become a thing of the past? An increase in the uptake of standards will likely occur as communities mature. Experiments and their outputs could become better organised, cheaper and more open. Data usage would be much easier if integration were not required at all. Ultimately, however, science moves faster than standards and there are always new questions and new experiments. Hopefully, integration will become seamless and transparent to the user as semantic methods, combined with use of large, open resources on the Semantic Web, serve heterogeneous data via a homogeneous view.

Bibliography

[1]
Erick Antezana, Martin Kuiper, and Vladimir Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics, 10(4):392–407, July 2009.

Additional Front Material (Thesis)

Dedication

This thesis is dedicated to my parents, who encouraged me in all things, and is particularly dedicated to my husband and my son, without whose patience and support I could not have finished this research.

“Among those who have endeavoured to promote learning and rectify judgement, it has long been customary to complain of the abuse of words, which are often admitted to signify things so different, that, instead of assisting the understanding as vehicles of knowledge, they produce error, dissension, and perplexity….” Dr. Samuel Johnson, 1752, via Nature Structural & Molecular Biology 14, 681 (2007).

“Metadata is a love note to the future.” (NYPL Labs, http://twitpic.com/6ry6ar, via @CameronNeylon @kissane)

Acknowledgements

Many thanks go to my supervisors Dr. Anil Wipat, Dr. Phillip Lord and Dr. Matthew Pocock. A special thanks goes to those people who found extra commas and other errors: Phoebe Boatright, Dr. Dagmar Waltemath, Lucy Slattery, Mélanie Courtot and Dr. Paul Williams, Jr. Past and present colleagues at Newcastle University have provided much support and inspiration, including Dr. Frank Gibson and Dr. Katherine James.

I gratefully acknowledge the support of the BBSRC and the EPSRC for funding CISBAN at Newcastle University. I also acknowledge the support of the Newcastle University Systems Biology Resource Centre, the Newcastle University Bioinformatics Support Unit and the North East regional e-Science centre.

Declaration

I declare that the following work embodies the result of my own work, that it has been composed by myself and does not include work forming part of a thesis presented successfully for a degree in this or another University.

Allyson Lister

Contributions and Papers

During the course of this work, I have been involved both in the development of a number of standards efforts and in the publishing of a number of papers.

The list below describes the standards efforts to which I have contributed and the roles I have had within those efforts:

  • a developer of UniProt/TrEMBL [1];
  • a program developer and contributor to the FuGE [2, 3, 4], a standard XML syntax for describing experiments;
  • a core developer and coordinator of the OBI [5, 6, 7], a standard semantics for describing experiments;
  • an early developer of the ISA-TAB [8] tab-delimited syntax for describing experiments;
  • a developer of the minimal information checklist MIGS/MIMS for genomics and metagenomics information [9];
  • an advisor for the SBO, the KiSAO and the TEDDY;
  • a co-author of the SBML Level 3 Annotation package [10];
  • an advisor for the Cell Behavior Ontology; and
  • an advisor in the nascent synthetic biology standards.

I have co-authored 20 papers, specification documents, articles and technical reports during the course of this work:

  1. Mélanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011.
  2. Stephen G. Addinall, Eva-Maria Holstein, Conor Lawless, Min Yu, Kaye Chapman, A. Peter Banks, Hien-Ping Ngo, Laura Maringele, Morgan Taschuk, Alexander Young, Adam Ciesiolka, Allyson L. Lister, Anil Wipat, Darren J. Wilkinson, and David Lydall. Quantitative fitness analysis shows that NMD proteins and many other protein complexes suppress or enhance distinct telomere cap defects. PLoS Genet, 7(4):e1001362+, April 2011.
  3. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, January 2011.
  4. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through rule-based semantic integration. Journal of biomedical semantics, 1 Suppl 1(Suppl 1):S3+, 2010.
  5. Andrew R. Jones and Allyson L. Lister. Managing experimental data using FuGE. Methods in molecular biology (Clifton, N.J.), 604:333–343, 2010.
  6. Allyson L. Lister. Semantic integration in the life sciences. Ontogenesis, http://ontogenesis.knowledgeblog.org/126, January 2010.
  7. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live coverage of intelligent systems for molecular Biology/European conference on computational biology (ISMB/ECCB) 2009. PLoS Comput Biol, 6(1):e1000640+, January 2010.
  8. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live coverage of scientific conferences using Web technologies. PLoS Comput Biol, 6(1):e1000563+, January 2010.
  9. Allyson Lister, Varodom Charoensawan, Subhajyoti De, Katherine James, Sarath Chandra C. Janga, and Julian Huppert. Interfacing systems biology and synthetic biology. Genome biology, 10(6):309+, 2009.
  10. Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009.
  11. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: the minimum information to reference an external ontology term. In Barry Smith, editor, International Conference on Biomedical Ontology, pages 87–90. University at Buffalo College of Arts and Sciences, National Center for Ontological Research, National Center for Biomedical Ontology, July 2009.
  12. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through Rule-Based semantic integration. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 49+, June 2009.
  13. The OBI Consortium. Modeling biomedical experimental processes with OBI. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 41+, June 2009.
  14. Andrew R. Jones, Allyson L. Lister, Leandro Hermida, Peter Wilkinson, Martin Eisenacher, Khalid Belhajjame, Frank Gibson, Phil Lord, Matthew Pocock, Heiko Rosenfelder, Javier Santoyo-Lopez, Anil Wipat, and Norman W. W. Paton. Modeling and managing experimental data using FuGE. OMICS: A Journal of Integrative Biology, 13(3):239–251, June 2009.
  15. Mélanie Courtot, William Bug, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. The OWL of biomedical investigations. In OWLED 2008, October 2008.
  16. Susanna-Assunta Sansone, Philippe Rocca-Serra, Marco Brandizi, Alvis Brazma, Dawn Field, Jennifer Fostel, Andrew G. Garrow, Jack Gilbert, Federico Goodsaid, Nigel Hardy, Phil Jones, Allyson Lister, Michael Miller, Norman Morrison, Tim Rayner, Nataliya Sklyar, Chris Taylor, Weida Tong, Guy Warner, and Stefan Wiemann. The first RSBI (ISA-TAB) workshop: “Can a simple format work for complex studies?”. OMICS: A Journal of Integrative Biology, 12(2):143–149, June 2008.
  17. Dawn Field, George Garrity, Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J. Allen, Samuel V. Angiuoli, Michael Ashburner, Nelson Axelrod, Sandra Baldauf, Stuart Ballard, Jeffrey Boore, Guy Cochrane, James Cole, Peter Dawyndt, Paul De Vos, Claude dePamphilis, Robert Edwards, Nadeem Faruque, Robert Feldman, Jack Gilbert, Paul Gilna, Frank O. Glockner, Philip Goldstein, Robert Guralnick, Dan Haft, David Hancock, Henning Hermjakob, Christiane Hertz-Fowler, Phil Hugenholtz, Ian Joint, Leonid Kagan, Matthew Kane, Jessie Kennedy, George Kowalchuk, Renzo Kottmann, Eugene Kolker, Saul Kravitz, Nikos Kyrpides, Jim Leebens-Mack, Suzanna E. Lewis, Kelvin Li, Allyson L. Lister, Phillip Lord, Natalia Maltsev, Victor Markowitz, Jennifer Martiny, Barbara Methe, Ilene Mizrachi, Richard Moxon, Karen Nelson, Julian Parkhill, Lita Proctor, Owen White, Susanna-Assunta Sansone, Andrew Spiers, Robert Stevens, Paul Swift, Chris Taylor, Yoshio Tateno, Adrian Tett, Sarah Turner, David Ussery, Bob Vaughan, Naomi Ward, Trish Whetzel, Ingio San Gil, Gareth Wilson, and Anil Wipat. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology, 26(5):541–547, May 2008.
  18. A. L. Lister, M. Pocock, and A. Wipat. Integration of constraints documented in SBML, SBO, and the SBML manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80+, 2007.
  19. Dawn Field, George Garrity, Tanya Gray, Jeremy Selengut, Peter Sterk, Nick Thomson, Tatiana Tatusova, Guy Cochrane, Frank O. Glöckner, Renzo Kottmann, Allyson L. Lister, Yoshio Tateno, and Robert Vaughan. eGenomics: Cataloguing our complete genome collection III. Comparative and Functional Genomics, 2007.
  20. A. L. Lister, A. R. Jones, M. Pocock, O. Shaw, and A. Wipat. CS-TR number 1016: Implementing the FuGE object model: a systems biology data portal and integrator. Technical report, Newcastle University, April 2007.

Bibliography

[1]
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucl. Acids Res., 36(suppl_1):D190–195, January 2008.
[2]
Andrew R. Jones, Michael Miller, Ruedi Aebersold, Rolf Apweiler, Catherine A. Ball, Alvis Brazma, James DeGreef, Nigel Hardy, Henning Hermjakob, Simon J. Hubbard, Peter Hussey, Mark Igra, Helen Jenkins, Randall K. Julian, Kent Laursen, Stephen G. Oliver, Norman W. Paton, Susanna-Assunta Sansone, Ugis Sarkans, Christian J. Stoeckert, Chris F. Taylor, Patricia L. Whetzel, Joseph A. White, Paul Spellman, and Angel Pizarro. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nature Biotechnology, 25(10):1127–1133, October 2007.
[3]
Andrew R. Jones and Allyson L. Lister. Managing experimental data using FuGE. Methods in molecular biology (Clifton, N.J.), 604:333–343, 2010.
[4]
Andrew R. Jones, Allyson L. Lister, Leandro Hermida, Peter Wilkinson, Martin Eisenacher, Khalid Belhajjame, Frank Gibson, Phil Lord, Matthew Pocock, Heiko Rosenfelder, Javier Santoyo-Lopez, Anil Wipat, and Norman W. W. Paton. Modeling and managing experimental data using FuGE. Omics : a journal of integrative biology, 13(3):239–251, June 2009.
[5]
Mélanie Courtot, William Bug, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. The OWL of Biomedical Investigations. In OWLED 2008, October 2008.
[6]
The OBI Consortium. Modeling biomedical experimental processes with OBI. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 41+, June 2009.
[7]
Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, January 2011.
[8]
Susanna-Assunta Sansone, Philippe R. Serra, Marco Brandizi, Alvis Brazma, Dawn Field, Jennifer Fostel, Andrew G. Garrow, Jack Gilbert, Federico Goodsaid, Nigel Hardy, Phil Jones, Allyson Lister, Michael Miller, Norman Morrison, Tim Rayner, Nataliya Sklyar, Chris Taylor, Weida Tong, Guy Warner, and Stefan Wiemann. The First RSBI (ISA-TAB) Workshop: “Can a Simple Format Work for Complex Studies?”. OMICS: A Journal of Integrative Biology, 12(2):143–149, 2008.
[9]
Dawn Field, George Garrity, Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J. Allen, Samuel V. Angiuoli, Michael Ashburner, Nelson Axelrod, Sandra Baldauf, Stuart Ballard, Jeffrey Boore, Guy Cochrane, James Cole, Peter Dawyndt, Paul De Vos, Claude dePamphilis, Robert Edwards, Nadeem Faruque, Robert Feldman, Jack Gilbert, Paul Gilna, Frank O. Glockner, Philip Goldstein, Robert Guralnick, Dan Haft, David Hancock, Henning Hermjakob, Christiane Hertz-Fowler, Phil Hugenholtz, Ian Joint, Leonid Kagan, Matthew Kane, Jessie Kennedy, George Kowalchuk, Renzo Kottmann, Eugene Kolker, Saul Kravitz, Nikos Kyrpides, Jim Leebens-Mack, Suzanna E. Lewis, Kelvin Li, Allyson L. Lister, Phillip Lord, Natalia Maltsev, Victor Markowitz, Jennifer Martiny, Barbara Methe, Ilene Mizrachi, Richard Moxon, Karen Nelson, Julian Parkhill, Lita Proctor, Owen White, Susanna-Assunta Sansone, Andrew Spiers, Robert Stevens, Paul Swift, Chris Taylor, Yoshio Tateno, Adrian Tett, Sarah Turner, David Ussery, Bob Vaughan, Naomi Ward, Trish Whetzel, Ingio San Gil, Gareth Wilson, and Anil Wipat. The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology, 26(5):541–547, May 2008.
[10]
Dagmar Waltemath, Neil Swainston, Allyson Lister, Frank Bergmann, Ron Henkel, Stefan Hoops, Michael Hucka, Nick Juty, Sarah Keating, Christian Knuepfer, Falko Krause, Camille Laibe, Wolfram Liebermeister, Catherine Lloyd, Goksel Misirli, Marvin Schulz, Morgan Taschuk, and Nicolas Le Novère. SBML Level 3 Package Proposal: Annotation. Nature Precedings, (713), January 2011.

Thesis Abstract

Studying and modelling biology at a systems level requires a large amount of data of different experimental types. Historically, each of these types has been stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. While the use of community data standards can reduce the need for specialised, independent formats by providing a common syntax, standards uptake is not universal and a single standard cannot yet describe all biological data. In the work described in this thesis, a variety of integrative methods have been developed to reuse and restructure existing systems biology data.

SyMBA is a simple Web interface which stores experimental metadata in a published, common format. The creation of accurate quantitative SBML models is a time-intensive manual process. Modellers need to understand both the systems they are modelling and the intricacies of the SBML format. However, the amount of relevant data for even a relatively small and well-scoped model can be overwhelming. Saint is a Web application which accesses a number of external Web services and provides suggested annotation for SBML and CellML models. MFO was developed to formalise all of the knowledge within the multiple SBML specification documents in a manner which is accessible to both humans and computers. Rule-based mediation, a form of semantic data integration, is a useful way of reusing and re-purposing heterogeneous datasets which cannot be, or are not, structured according to a common standard. This method of ontology-based integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations.

The work described in this thesis is one step towards the formalisation of biological knowledge useful to systems biology. Experimental metadata has been transformed into common structures, a Web application has been created for the retrieval of data appropriate to the annotation of systems biology models and multiple data models have been formalised and made accessible to semantic integration techniques.


Converting a Latex Thesis to Multiple WordPress Posts

A few months ago I finished my thesis, passed my viva and then submitted the hardbound copies to the library, and all was right with the world. However, after a few weeks I realised that I only had my thesis either as a PDF, which is very hard to read, or in Latex, which is unintelligible to people who don’t know how to use it. Therefore, with the kind permission of Phil Lord, I am trying out the latex to wordpress software which he and others have been developing for use with the various knowledgeblogs.

The output of this conversion work is now available here. The thesis was separated into chapters or sections (depending on size) and posted individually. You can get them all in one place via the “thesis” tag in each of the posts or via the list below. Alternatively, you can download the human friendly (but computer unfriendly) PDF version of the thesis.

Thesis Posts

Current limitations of the conversion

There are a few things that aren’t quite right with the conversion at the minute. These were fixed through manual changes to the resulting HTML.

  1. The URLs in the bibliography sections were not being displayed automatically.
  2. The footnotes were created in the main text but not displayed at the end of the text. This will not be fixed programmatically as it is too awkward – they are just in the text as “(Note: […])” instead.

Technical Conversion Details

Please only read this section if you’re interested in a similar conversion process using the knowledgeblogs code.

This code is still in development, so it isn’t easy for someone not familiar with Make to understand. However, if you have a working knowledge of how makefiles work, then please read on.

  1. First, you need to check out the knowledgeblog codebase. You can use mercurial to check out the project at http://code.google.com/p/knowledgeblog/. The code can be found in the trunk/tooling/latextowordpress subdirectory.
  2. You need to install plastex. If you’re running Ubuntu or similar this can be done with the standard sudo apt-get install python-plastex command.
  3. Everything runs using the Makefile available in the latextowordpress directory. Test that things are OK by running the make simple_test command.
  4. Make a directory to store your .tex file input (and put your tex files in there) and another directory to store your HTML output.
  5. Comment out (e.g. use a “##” at the beginning of the line) the self['cite'] = self.do_cite line in knowledgeblog/wordpress/__init__.py within the latextowordpress directory.
  6. There are a few changes you may need to make to your tex file prior to compiling it and running the makefile. Ensure your tex file is a complete document (and not included in a parent document, for example): you need the \begin{document} […] \end{document} tags in the file. I also add the following line at the very top to sort out the problem with footnotes: \newcommand{\footnot}[1]{ (Note: \textit{#1})}. Then I can do a global search and replace for “\footnote” and replace it with “\footnot”, ensuring that footnotes are present, even if not visually ideal. Finally, ensure you add your bibliography tags to the end of the file if you don’t already have them (e.g. if they are normally in a parent tex document). I also had to replace “compactitem” references with “itemize”, as I didn’t want to include the package. You may have similar replacements to make.
  7. You’ll need to run your normal latex/pdflatex command that you would run to compile your latex code. This will do things like create a .bbl file for latextowordpress to make use of when generating the HTML.
  8. Add a new command within your Makefile to ensure that your tex files are being converted, and run that command. As an example, here is my new rule within the makefile (make sure that your tabs are correct!):
    my_thesis :
        $(LTWP) -d 'thesis-output' tex-files/Abstract.tex
  9. Once you have your HTML file, you’ll notice that internal hrefs (here, just for citations) use a full URL rather than a local one. You’ll have to do a global search-and-replace to make sure those appear correctly. For example, turn “Abstract.html#Smith2000” into “#Smith2000” by removing all references to “Abstract.html”.
  10. Create a new wordpress post, and put the HTML into the post. Upload images for any figures to wordpress.
  11. I manually added “Previous” and “Next” links to each post.
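
The text edits in steps 6 and 9 are mechanical, so they could be scripted rather than done by hand in an editor. The following Python is only a rough sketch of that idea and is not part of the knowledgeblog code: the function names prepare_tex and localise_hrefs are made up for illustration. One small difference from the plain global search and replace described above is that the sketch leaves longer commands such as \footnotesize alone.

```python
import re

# The macro quoted in step 6: it renders a footnote inline as "(Note: ...)".
FOOTNOT_MACRO = r"\newcommand{\footnot}[1]{ (Note: \textit{#1})}"


def prepare_tex(source: str) -> str:
    """Apply the manual edits from step 6 to the text of one .tex file.

    Hypothetical helper, not part of latextowordpress.
    """
    # Rename \footnote to \footnot. The lookahead skips longer commands
    # such as \footnotesize or \footnotemark. A function is used as the
    # replacement so that the backslash is inserted literally.
    source = re.sub(r"\\footnote(?![A-Za-z])", lambda _m: "\\footnot", source)
    # Swap the compactitem environment for the standard itemize one.
    source = source.replace("{compactitem}", "{itemize}")
    # Put the macro definition on the first line unless it is already there,
    # so running the function twice does not define the macro twice.
    if FOOTNOT_MACRO not in source:
        source = FOOTNOT_MACRO + "\n" + source
    return source


def localise_hrefs(html: str, page_name: str) -> str:
    """Turn links such as 'Abstract.html#Smith2000' into '#Smith2000' (step 9).

    Hypothetical helper; page_name is the generated file name, e.g.
    'Abstract.html'.
    """
    return html.replace(page_name + "#", "#")
```

To use it, read a tex file into a string, pass it through prepare_tex and write the result back out before compiling. After the conversion, do the same with the generated HTML and localise_hrefs, giving the name of the HTML file as page_name. Any other package-specific replacements you need would be added alongside the compactitem one.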