In my relatively short career (approximately 12 years – wait, how long?) in bioinformatics, I have been involved to a greater or lesser degree in a number of standards efforts. It started in 1999 at the EBI, where I worked on the production of the protein sequence database UniProt. Now, I’m working with systems biology data and beginning to look into synthetic biology. I’ve been involved in the development (or maintenance) of a standard syntax for protein sequence data; standardized biological investigation semantics and syntax; standardized content for genomics and metagenomics information; and standardized systems biology modelling and simulation semantics.
(Bear with me – the reason for this wander through memory lane becomes apparent soon.)
How many standards have you worked on? How can there be multiple standards, and why do we insist on creating new ones? Doesn’t the definition of a standard mean that we would only need one? Not exactly. Take the field of systems biology as an example. Some people are interested in describing a mathematical model, but have no need for storing either the details of how to simulate that model or the results of multiple simulation runs. These are logically separate activities, yet they fall within a single community (systems biology) and are broadly connected. A model is used in a simulation, which then produces results. So, when building a standard, you end up with the same separation: have one standard for the modelling, another for describing a simulation, and a third for structuring the results of a simulation. All that information does not need to be stored in a single location all the time. The separation becomes even more clear when you move across fields.
But this isn’t completely clear cut. Some types of information overlap within standards of a single domain and even among domains, and this is where it gets interesting. Not only do you need a single community talking to each other about standard ways of doing things, but you also need cross-community participation. Such efforts result in even more high-level standards which many different communities can utilize. This is where work such as OBI and FuGE sit: with such standards, you can describe virtually any experiment. The interconnectedness of standards is a whole job (or jobs) in itself – just look at the BioSharing and MIBBI projects. And sometimes standards that seem (at least mostly) orthogonal do share a common ground. Just today, Oliver Ruebenacker posted some thoughts on the biopax-discuss mailing list where he suggests that at least some of BioPAX and SBML share a common ground and might be usefully “COMBINE“d more formally (yes, I’d like to go to COMBINE; no, I don’t think I’ll be able to this year!). (Scroll down that thread for a response by Nicolas Le Novère as to why that isn’t necessarily correct.) So, orthogonality, or the extent to which two or more standards overlap, is sometimes a hard thing to determine.
So, what have I learnt? As always, we must be practical. We should try to develop an elegant solution, but it really, really should be one which is easy to use and intuitive to understand. It’s hard to get to that point, especially as I think that point is (and should be) a moving target. From my perspective, group standards begin with islands of initial research in a field, which then gradually develop into a nascent community. As a field evolves, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with your peers becomes more and more important, and it becomes imperative that standards are developed.
This may sound obvious, but the practicalities of creating a community standard means such work requires a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months – or years – of meetings, document revisions and conference calls to hash out a working standard. This isn’t necessarily a bad thing, though. All voices do need to be heard, and you cannot have a viable standard without input from the community you are creating that standard for. You can have the best structure or semantics in the world, but if it’s been developed without the input of others, you’ll find people strangely reluctant to use it.
Every time I take part in a new standard, I see others like me who have themselves been involved in the creation of standards. It’s refreshing and encouraging. Hopefully the time it takes to create standards will drop as the science community as a whole gets more used to the idea. When I started, the only real standards in biological data (at least that I had heard of) were the structures defined by SWISS-PROT and EMBL/GenBank/DDBJ. By the time I left the EBI in 2006, I could have given you a list a foot long (GO, PSI, and many others), and that list continues to grow. Community engagement and cross-community discussions continue to be popular.
In this context, I can now add synthetic biology standards to my list of standards I’ve been involved in. And, as much as I’ve seen new communities and new standards, I’ve also seen a large overlap in the standardization efforts and an even greater willingness for lots of different researchers to work together, even taking into account the sometimes violent disagreements I’ve witnessed! The more things change, the more they stay the same…
At this stage, it is just a limited involvement, but the BBSRC Synthetic Biology Standards Workshop I’m involved in today and tomorrow is a good place to start with synthetic biology. I describe most of today’s talks in this post, and will continue with another blog post tomorrow. Enjoy!
For those with less time, here is a single sentence for each talk that most resounded with me:
- Mike Cooling: Emphasising the ‘re’ in reusable, and make it easier to build and understand large models from reusable components.
- Neil Wipat: For a standard to be useful, it must be computationally amenable as well as useful for humans.
- Herbert Sauro: Currently there is no formal ontology for synthetic biology, but one will need to be developed.
This meeting is organized by Jen Hallinan and Neil Wipat of Newcastle University. Its purpose is to set up key relationships in the synthetic biology community to aid the development of a standard for that community. Today, I listened to talks by Mike Cooling, Neil Wipat, and Herbert Sauro. I was – unfortunately – unable to be present for the last couple of talks, but will be around again for the second – and final – day of the workshop tomorrow.
Mike Cooling – Bioengineering Institute Auckland, New Zealand
Mike uses CellML (it’s made where he works, but that’s not the only reason…) in his work with systems and synthetic biology models. Among other things, it wraps MathML and partitions the maths, variables and units into reusable pieces. Although many of the parts seem domain specific, CellML itself is actually not domain specific. Further, unlike other modelling languages such as SBML, components in CellML are reusable and can be imported into other models. (Yes, a new package called comp in SBML Level 3 is being created to allow the importing of models into other models, but it isn’t mature – yet.)
How are models stored? There is the CellML repository, but what is out there for synthetic biology? The Registry of Standard Biological Parts was available, but only described physical parts. Therefore they created a Registry of Standard Virtual Parts (SVPs) to complement the original registry. This was developed as a group effort with a number of people including Neil Wipat and Goksel Misirli at Newcastle University.
They start with template mathematical structures (which are little parts of CellML), and then use the import functionality available as part of CellML to combine the templates into larger physical things/processes (‘SVPs’) and ultimately to combine things into system models.
They extended the CellMLRepository to hold the resulting larger multi-file models, which included adding a method of distributed version control and allow the sharing of models between projects through embedded workspaces.
What can these pieces be used for? Some of this work included the creation of a CellML model of the biology represented in Levskaya et al. 2005 and deposit all of the pieces of the model in the CellML repository. Another example is a model he’s working on about shear stress and multi-scale modelling for aneurysms.
Modules are being used and are growing in number, which is great, but he wants to concentrate more at the moment on the ‘re’ of the reusable goal, and make it easier to build and understand large models from reusable components. Some of the integrated services he’d like to have: search and retrieval, (semi-automated) visualization, semantically-meaningful metadata and annotations, and semi-automated composition.
All this work above converges on the importance of metadata. With the CellML Metadata Framework 1.0, not many used it. With version 2.0 they have developed a core specification with is very simple and then provide many additional satellite specifications. For example, there is a biological information satellite, where you use the biomodels qualifiers as relationships between your data and MIRIAM URNs. The main challenge is to find a database that is at the right level of abstraction (e.g. canonical forms of your concept of interest).
Neil Wipat – Newcastle University
Please note Neil Wipat is my PhD supervisor.
Speaking about data standards, tool interoperability, data integration and synthetic biology, a.k.a “Why we need standards”. They would like to promote interoperability and data exchange between their own tools (important!) as well as other tools. They’d also like to facilitate data integration to inform the design of biological systems both from a manual designer’s perspective and from the POV of what is necessary for computational tool use. They’d also like to enable the iterative exchange of data and experimental protocols in the synthetic biology life cycle.
A description of some of the tools developed in Neil’s group (and elsewhere) exemplify the differences in data structures present within synthetic biology. BacilloBricks was created to help get, filter and understand the information from the MIT registry of standard parts. They also created the Repository of Standard Virtual Biological Parts. This SVP repository was then extended with parts from Bacillus and was extended to make use of SBML as well as CellML. This project is called BacilloBricks Virtual. All of these tools use different formats.
It’s great having a database of SVPs, but you need a way of accessing and utilizing the database. Hallinan and Wipat have started a collaboration with Microsoft Research with the people who created a programming language for genetic engineering of living cells called the genetic engineering of cells (GEC) simulator. Some work a summer student did created a GEC compiler for SVPs from BacilloBricks virtual. Goksel has also created the MoSeC system where you can automatically go from a model to a graph to a EMBL file.
They also have BacillusRegNet, which is an information repository about transcription factors for Bacillus spp. It is also a source of orthogonal transcription factors for use in B. subtilis and Geobacillus. Again, it is very important to allow these tools to communicate efficiently.
The data warehouse they’re using is ONDEX. They feed information from the ONDEX data store to the biological parts database. ONDEX was created for systems biology to combine large experimental datasets. ONDEX views everything as a network, and is therefore a graph-based data warehouse. ONDEX has a “mini-ontology” to describe the nodes and edges within it, which makes querying the data (and understanding how the data is structured) much easier. However, it doesn’t include any information about the synthetic biology side of things. Ultimately, they’d like an integrated knowledgebase using ONDEX to provide information about biological virtual parts. Therefore they need a rich data model for synthetic biology data integration (perhaps including an RDF triplestore).
Interoperabiligy, Design and Automation: why we need standards.
Requirement 1. There needs to be interoperability and data exchange among these tools as well as among these tools and other external tools. Requirement 2. Standards for data integration aid the design of synthetic systems. The format must be both computationally amenable and useful for humans. Requirement 3. Automation of the design and characterization of synthetic systems, and this also requires standards.
The requirements of synthetic biology research labs such as Neil Wipat’s make it clear that standards are needed.
KEYNOTE: Herbert Sauro – University of Washington, US
Herbert Sauro described the developing community within synthetic biology, the work on standards that has already begun, and the Synthetic Biology Open Language (SBOL).
He asks us to remember that Synthetic Biology is not biology – it’s engineering! Beware of sending synthetic biology grant proposals to a biology panel! It is a workflow of design-build-test. He’s mainly interested in the bit between building and testing, where verification and debugging happens.
What’s so important about standards? It’s critical in engineering, where if increases productivity and lowers costs. In order to identify the requirement you must describe a need. There is one immediate need: store everything you need to reconstruct an experiment within a paper (for more on this see the Nature Biotech paper by Peccoud et al. 2011: Essential information for synthetic DNA sequences). Currently, it’s almost impossible to reconstruct a synthetic biology experiment from a paper.
There are many areas requiring standards to support the synthetic biology workflow: assembly, design, distributed repositories, laboratory parts management, and simulation/analysis. From a practical POV, the standards effort needs to allow researchers to electronically exchange designs with round tripping, and much more.
The standardization effort for synthetic biology began with a grant from Microsoft in 2008 and the first meeting was in Seattle. The first draft proposal was called PoBoL but was renamed to SBOL. It is a largely unfunded project. In this way, it is very similar to other standardization projects such as OBI.
DARPA mandated 2 weeks ago that all projects funded from Living Foundries must use SBOL.
SBOL is involved in the specification, design and build part of the synthetic biology life cycle (but not in the analysis stage). There are a lot of tools and information resources in the community where communication is desperately needed.
SBOL Semantic, SBOL Visual, and SBOL Script. SBOL Semantic is the one that’s going to be doing all of the exchange between people and tools. SBOL Visual is a controlled vocabulary and symbols for sequence features.
Have you been able to learn anything from SBML/SBGN, as you have a foot in both worlds? SBGN doesn’t address any of the genetic side, and is pretty complicated. You ideally want a very minimalistic design. SBOL semantic is written in UML and is relatively small, though has taken three years to get to this point. But you need host context above and beyond what’s modelled in SBOL Semantic. Without it, you cannot recreate the experiment.
Feature types such as operator sites, promoter sites, terminators, restriction sites etc can go into the sequence ontology (SO). The SO people are quite happy to add these things into their ontology.
SBOLr is a web front end for a knowledgebase of standard biological parts that they used for testing (not publicly accessible yet). TinkerCell is a drag and drop CAD tool for design and simulation. There is a lot of semantic information underneath to determine what is/isn’t possible, though there is no formal ontology. However, you can semantically-annotate all parts within TinkerCell, allowing the plugins to interpret a given design. A TinkerCell model can be composed of sub-models. Makes it easy to swap in new bits of models to see what happens.
WikiDust is a TinkerCell plugin written in Python which searches SBPkb for design components, and ultimately uploads them to a wiki. LibSBOLj is a library for developers to help them connect software to SBOL.
The physical and host context must be modelled to make all of this useful. By using semantic web standards, SBOL becomes extensible.
Currently there is no formal ontology for synthetic biology but one will need to be developed.
Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!
HL53: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project (ISMB 2009)
Standards are hugely dependent on their respective communities for reqs gathering, develppment, testing, uptake by stakeholders. In modeling the biosciences there are are a few generic features such as description of source material and experimental design components. Then there are biologically-delineated and technologically-delineated views of the world. These views are still common across many different areas of the life sciences. Much of it can fall under an ISA (Investigation-Study-Assay) structure.
You should then use three types of standards: syntax (images of FuGE, ISA-TAB etc), semantics, and scope. MIBBI is all about scope. How well are things working? Well, there is still separation, but things are getting better. There aren’t many carrots, though there are some sticks for using these standards. Why do we care about standards? Data exchange, comprehensibility, and scope for reuse. Many funders (esp public funders) are now requiring data sharing or ability for data storage and exchange.
“Metaprojects”: FuGE, OBI, ISA-TAB – draw together many different domains and present in structure/semantics useful across all. Many of the “MI” (Minimum information guidelines) are developed independently, and are sometimes defunct. It’s also hard to track what’s going on in these projects, can be redundant, difficult to obtain an overview of the full range of checklists. When the MI projects overlap, arbitrary decisions on wording and substructuring make integration difficult. This makes it hard to take parts of different guidelines – not very modular. Enter MIBBI. Two distinct goals: portal (registry of guidelines) and foundry (integration and modularization).
There’s lots of enthusiasm for the project (drafters, users, funders, journals). MIBBI raises awareness of various checklists and promotes gradual integration of checklists. Nature Biotechnology 26, 889 – 896 (2008) doi:10.1038/nbt0808-889 for the paper. He’s performed clustering and analysis of the different guidelines: displayed MIs in cytoscape and in fake phylogenetic tree. By the end of the year they’ll have a shopping-basked based tool, MICheckout, to get all concepts together and then you get your own specialized checklist as output. You can make use of isacreator and its configuration to set mandatory parameters etc.
The objections to fuller reporting. Why should I share? funders and publishers are starting to require a bare minimum of metadata – and researchers will just do the bare minimum then, however. Some people think that this is just a ‘make work’ scheme for bioinformaticians, or that bioinformaticians are parasitic. Some people don’t trust what others have done, but then that’s what the reporting guidelines are for in the first place – so you can figure out if you should trust it. Problems of quality are justified to an extent, but what of people lacking resource for large-scale work, or people who want to refer to proteomics data but don’t do proteomics? How should they follow theese guidelines? Perception is that there is no money for this, and no mature free tools, and worries about vendor support. Vendors will support what researchers say they need.
Credit: data sharing is more or less a given now, and need central registries of data sets that can record reuse (also openids, DOIs for data). Side benefits and challenges include clearing up problems with paper authorship wrt reporting who’s done which bit. Would also enable other kinds of credit, and may have to be self-policing. Finally, the problem of micro data sets and legacy data. Example of the former is EMBL entries – when searching against EMBL, you’re using the data in some way, even if you don’t pull it out for later analysis.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
[Update: Duncan's written a call for comments on the OBO Foundry criteria on his blog. Also posting on this are Melanie and Frank. Take a look! Update 2: I should have called the 10 criteria "principles" rather than "rules". My apologies. I think the title may be a little bit of a misnomer for the post. I'm not sure you need to choose between principles and checklists. It's nice to have the "short and sweet" and the detailed.]
The OBO Foundry Workshop (OBO Foundry paper) is coming up this weekend, and Duncan Hull and I were talking about the 10 criteria the Foundry has for member ontologies. We had been wondering what sort of questions we would ask the OBO Foundry people if we wanted to see the 10 criteria “upgraded” to a minimal checklist for OBO Foundry ontologies in the style of MIBBI. As a result of that, here are my thoughts on each criterion. Perhaps some of these have been answered in mailing lists or elsewhere, but they’re not visible on the OBO Foundry site. Hopefully this post would be useful as a starting point for a discussion on more complete definitions and explanations for the minimal requirements of an OBO Foundry ontology.
Each criterion is reproduced in bold, with my opinions after in italicised text. For any further text present in the criteria list, please see the list page itself.
- The ontology must be open and available to be used by all without any constraint other than (a) its origin must be acknowledged and (b) it is not to be altered and subsequently redistributed under the original name or with the same identifiers.
This is a license without a name or a strong structure. Is it a first attempt at an OBO-specific license? If so, it is too generic to be of much use. Alternatively, is it a requirements list for choosing an existing license? Or, as another option, are they suggesting that people choose their own licenses along these lines? I believe strongly that already-extant licenses should be used in biological research wherever possible. You can see a summary of a FriendFeed discussion and an email discussion with Science Commons in my blog post on Choosing a License for Your Ontology for my opinion on the subject. Therefore I would suggest option 2, with the Foundry choosing an appropriate license (or shortlist of compatible licenses) as soon as they could.
- The ontology is in, or can be expressed in, a common shared syntax. This may be either the OBO syntax, extensions of this syntax, or OWL.
Firstly, I would like clarification of what “extensions of the [OBO] syntax” means. Secondly, just saying “OWL” as a syntax is too vague; there’s OWL-Full, OWL-DL, and OWL-Lite, to name a few. Are all acceptable, or is the most commonly-used (OWL-DL) the one they want people to use?
- The ontologies possesses a unique identifier space within the OBO Foundry.
Aside from the (nitpicky) statement that it should be either “The ontologies possess” or “Each ontology possesses”, this is one of the most useful criteria. However, a little more detail would be useful here. What should come after the prefix? An underscore or some other dividing character? The rest of the identifier without a dividing character? Should the OBO Foundry assign a prefix to avoid confusion? By the way, a paper has just been published about the *naming* conventions for the OBO Foundry which is interesting. This isn’t the same thing as this criterion, which is about unique identifiers, but it’s still worth a read.
- The ontology provider has procedures for identifying distinct successive versions.
A little vague, but that probably cannot be helped, as you probably don’t want to legislate the type of versioning that takes place with each ontology. Links out to GO’s procedures or OBI’s procedures might provide some ideas to people who don’t know what versioning to use.
- The ontology has a clearly specified and clearly delineated content.
The “domain” of the ontology, used in the further description of this criterion, is a vague term. Yes, we all want orthogonality, but that is difficult to achieve in practice and a clearer description of how people can achieve it might be useful. How are two terms expressing the same concept in the different ontologies resolved? Via the mailing list? Is there an established procedure? It’s easy to say that no two terms should be covering the same concept, but harder to check. There’s been some recent papers in finding similar concepts within a single ontology (e.g. 10.1093/bioinformatics/btp195) might be applicable to multiple ontologies.
- The ontologies include textual definitions for all terms.
Good point. It would also be nice to say formal logic statements for classes would be useful (but not required), as it might help ensure the internal consistency of Foundry ontologies.
- The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.
This says you have to define your relations “following the pattern” from the RO. Does this mean all your relations must be children of relations in RO, or just that you follow their style? Probably the latter, but this is unclear at the moment.
- The ontology is well documented.
Definitely! But how? Where? In the ontology file? On a website? Does the OBO website provide the ability to have lots of documentation, or should it just be links out?
- The ontology has a plurality of independent users.
I’m a bit of a failure here, as I don’t know what this means. I can think of at least 2-3 different ways of interpreting this. What are users in this context? What makes them independent? How can you tell what your users are?
- The ontology will be developed collaboratively with other OBO Foundry members.
Great idea. But what if you can’t find anyone who wants to help? Does that mean you can’t develop? Again, perhaps this just means regular reviews of the developing ontology by other OBO members, but could be made clearer.
Most of these opinions don’t try to provide an answer, but instead just raise some questions that the attendees at this week’s workshop might like to have in their minds. If the OBO Foundry, which exists to “align ontology development efforts” doesn’t provide clear guidance, there is a risk that each member ontology would come up with their own answers, thus negating some of the benefits provided by their membership (quote from the Nature Biotech paper).
Have a great workshop – wish I had the time to attend this year!
Last week I attended the first RSBI (Reporting Structure for Biological Investigations) Workshop, carrying with me a multitude of hats. RSBI is a working group committed to the progression of standardization in multi-omics investigations. The purpose of the workshop was to examine and offer suggestions on the initial draft of ISA-TAB (more on that in a moment).
My first hat was a FuGE-user's hat, as the triumvirate of standards upon which RSBI is built is the Functional Genomics Experiment Model (FuGE), the Minimum Information for Biological and Biomedical Investigations (MIBBI) Project, and the Ontology for Biomedical Investigations (OBI). I was asked to give a current status update on FuGE itself, and on any communities that have already built extensions to FuGE. Andy Jones from Liverpool provided me with all of the hot-off-the-press information (my FuGE slides) – thanks Andy!
My second hat was a SyMBA-developer's hat. SyMBA uses FuGE to build a database and web front-end for storing data and experimental metadata. We use it in-house to store all of our large, high-throughput 'omics data. The use of FuGE in the system made it relevant for the workshop (my SyMBA slides, more SyMBA slides).
My final hat was a CISBAN-employee's hat. I work in the Wipat group there, and CISBAN is one of the "leading groups" involved in RSBI. As such, I was CISBAN's representative to the workshop.
The reason for the workshop, as stated earlier, was the evaluation of ISA-TAB, a proposed tabular format whose purpose is to provide a standard format for data and metadata submission into the formative BioMAP database at the EBI. ISA-TAB would have two uses:
- Humans: As a tabular format, it is quite easy for people to view and manipulate such templates within spreadsheet software such as Excel.
- Computers: As an interim solution only, ISA-TAB would be used as a computational exchange format until such time as each of the FuGE-based community extensions are complete for Metabolomics, Proteomics, and Transcriptomics. At this time, ISA-TAB would remain available for human use, but there would be a conversion step into "FuGE-ML".
The scope for ISA-TAB is large, and this was reflected in the attendees of the meeting. Representatives from ArrayExpress, Pride, and BioMAP were of course present, but also attending were people from the Metabolomics community, the MIACA project, toxico- and environmental genomics, and the FDA's NCTR.
A full write-up of the results of the workshop will soon be available online at the project's RSBI Google Group, so I'll leave it there. It was an exciting meeting, with fantastic food and even better discussions on getting public databases organized quickly for simple, straightforward multi-omics investigation data and metadata submission.
You can contact the RSBI via email@example.com.