Design and reality in SynBio: Host context and provenance (Neil Wipat, Bryan Bartley)
Synthetic biology is engineering applied to biology, and it combines wet lab and in silico work. Until now, SBOL has been primarily concerned with the design stage of the process, but SBOL should be able to travel around the entire engineering life cycle, capturing data as it goes. Every data set generated during the life cycle should be capturable within the SBOL structure.
Take as an example a system that has been built exactly as described in the original design, e.g. with the original strain of E. coli. Even with the same design and the best of intentions, different labs will run different experiments and therefore produce different experimental data. An SBOL design can be built by many labs, in many ways, and in different host contexts. At the moment, SBOL doesn’t capture the differences among these host contexts.
Host context requires information about all of the details of the design (who/what/when/where/why/how), which is why provenance and host context are relevant together. As Bryan mentioned in his talk earlier, characterising a cell during “steady state” can often be subjective and difficult. Measurements of the output of a genetic circuit depend strongly on how well adapted the cells are to the environmental conditions. Further, human error must be taken into account, and it can be necessary to backtrack and check your assumptions. Some of the components in use may also have incomplete QC data.
There was a discussion of the difference between context and provenance: it was decided that the context was like the annotation on the nodes of a graph, and the provenance was how the edges between them were being walked. That is, provenance is how you got to a particular node, and context is about how you would re-create the conditions at that node.
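The node/edge distinction above can be sketched in a few lines of Python. This is only an illustration: the class names, the `provenance_of` helper, and all example values (host, lab, temperature, medium) are hypothetical and were not part of the discussion. It also assumes the simplest case, in which each node has at most one incoming edge.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A point in the life cycle, annotated with its context."""
    name: str
    context: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A provenance step: how `target` was obtained from `source`."""
    source: str
    target: str
    activity: str


def provenance_of(target, edges):
    """Walk edges backwards from `target`; return the steps in forward order.

    Assumes each node has at most one incoming edge (a simple chain).
    """
    incoming = {e.target: e for e in edges}
    steps, current, seen = [], target, set()
    while current in incoming and current not in seen:
        seen.add(current)
        edge = incoming[current]
        steps.append((edge.source, edge.activity, edge.target))
        current = edge.source
    return steps[::-1]


# Hypothetical example values, for illustration only.
nodes = {
    "design": Node("design"),
    "construct": Node("construct", {"host": "E. coli", "lab": "Lab A"}),
    "measurement": Node("measurement", {"temperature_c": 37, "medium": "LB"}),
}
edges = [
    Edge("design", "construct", "build"),
    Edge("construct", "measurement", "assay"),
]

history = provenance_of("measurement", edges)   # provenance: the edges walked
conditions = nodes["measurement"].context       # context: what to re-create
```

Here `history` answers "how did we get to this node?", while `conditions` answers "what would we need to reproduce at this node?".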
The minimal way to represent host context would be to model the host as a type of ModuleDefinition. The host-specific annotation would be:
- StrainId: reference
- VendorId: reference
- TaxonId: reference
- Genome: reference
- Genotype: Gnomic string
Gnomic is a machine-readable way of representing genotypes (http://github.com/biosustain/gnomic). It was then suggested that we should directly RDFize all of the information contained within a Gnomic string, rather than adopting a new format that would have to be learnt and parsed. Alternatively, use proper ontological terms and reference them with URIs.
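As a rough sketch of the host annotation listed above, the five fields could be held in a small record and flattened into triple-like tuples. Everything here is an assumption made for illustration: the `HostAnnotation` class, the `http://example.org/...` namespace and URIs, and the predicate names are hypothetical, and the genotype is kept as an opaque placeholder string rather than a real Gnomic expression.

```python
from dataclasses import dataclass

EX = "http://example.org/host#"  # hypothetical namespace, for illustration only


@dataclass(frozen=True)
class HostAnnotation:
    strain_id: str   # reference (URI)
    vendor_id: str   # reference (URI)
    taxon_id: str    # reference (URI)
    genome: str      # reference (URI)
    genotype: str    # Gnomic string, kept as an opaque literal

    def to_triples(self, host_uri):
        """Return (subject, predicate, object, kind) tuples for the host."""
        references = {
            "strainId": self.strain_id,
            "vendorId": self.vendor_id,
            "taxonId": self.taxon_id,
            "genome": self.genome,
        }
        triples = [(host_uri, EX + name, value, "uri")
                   for name, value in references.items()]
        triples.append((host_uri, EX + "genotype", self.genotype, "literal"))
        return triples


# Placeholder values only; none of these identifiers are real.
host = HostAnnotation(
    strain_id="http://example.org/strains/strain-1",
    vendor_id="http://example.org/vendors/vendor-1",
    taxon_id="http://example.org/taxa/taxon-1",
    genome="http://example.org/genomes/genome-1",
    genotype="<Gnomic string>",
)
triples = host.to_triples("http://example.org/hosts/host-1")
```

In this sketch the genotype is the only literal. The suggestion to RDFize the Gnomic content would replace that single literal with further triples, one per genotype feature.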
PROV-O, the provenance ontology, defines three core classes: Entity, Activity and Agent. An agent runs an activity to generate one entity from another. Is there an ontology for the activities themselves? Something like OBI could be used, but note that each activity instance is tied to a particular timestamp, and therefore an activity is only done once.
There is a contrasting opinion that the important thing is that an activity can be reused, and therefore there should be a class/definition for each activity which gets instantiated at particular times.
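The two views can be reconciled with a definition/instance split, sketched below. This is one possible reading of the discussion, not an agreed design: `ActivityDefinition`, `ActivityInstance`, and `perform` are hypothetical names, and the agents, entities, and timestamps are made-up example values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActivityDefinition:
    """The reusable description of an activity (what is done)."""
    name: str


@dataclass(frozen=True)
class ActivityInstance:
    """One execution of a definition, tied to a particular time."""
    definition: ActivityDefinition
    agent: str
    used: str
    generated: str
    started_at: datetime


def perform(definition, agent, used, generated, started_at=None):
    """Instantiate a definition; defaults to the current UTC time."""
    started_at = started_at or datetime.now(timezone.utc)
    return ActivityInstance(definition, agent, used, generated, started_at)


# Hypothetical example: the same definition executed by two labs.
transformation = ActivityDefinition("transformation")
run_a = perform(transformation, "lab-a", "plasmid-1", "strain-a",
                datetime(2016, 1, 1, 9, 0, tzinfo=timezone.utc))
run_b = perform(transformation, "lab-b", "plasmid-1", "strain-b",
                datetime(2016, 1, 2, 14, 30, tzinfo=timezone.utc))
```

Each instance happens exactly once, which satisfies the first view, while both instances point at the same reusable definition, which satisfies the second.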
The proposal suggests that all sbol2:Identified types can potentially be annotated with provenance information. As such, the following additional classes should be added: prov:Derivation, prov:Activity, prov:Agent, prov:Association, prov:Usage. (Though I definitely saw a prov:role in one of the examples.)
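To make the proposed classes concrete, the sketch below builds plain triples using terms from the PROV-O namespace (`http://www.w3.org/ns/prov#`). How the proposal actually attaches these to SBOL objects was not recorded in these notes, so the `provenance_triples` function, the URI layout for the usage and association nodes, and the example URIs are all assumptions. For brevity the derivation is written with the unqualified `prov:wasDerivedFrom` property rather than a `prov:Derivation` node.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def provenance_triples(entity, source, activity, agent, role=None):
    """Return triples saying `entity` was derived from `source` by `activity`.

    The "/usage" and "/association" URI suffixes are a hypothetical naming
    scheme chosen for this sketch.
    """
    usage = activity + "/usage"
    association = activity + "/association"
    triples = [
        (entity, PROV + "wasDerivedFrom", source),
        (entity, PROV + "wasGeneratedBy", activity),
        (activity, RDF_TYPE, PROV + "Activity"),
        (activity, PROV + "qualifiedUsage", usage),
        (usage, RDF_TYPE, PROV + "Usage"),
        (usage, PROV + "entity", source),
        (activity, PROV + "qualifiedAssociation", association),
        (association, RDF_TYPE, PROV + "Association"),
        (association, PROV + "agent", agent),
        (agent, RDF_TYPE, PROV + "Agent"),
    ]
    if role is not None:
        # PROV-O expresses roles with prov:hadRole on the qualified node.
        triples.append((association, PROV + "hadRole", role))
    return triples


# Placeholder URIs, for illustration only.
triples = provenance_triples(
    entity="http://example.org/strain-a",
    source="http://example.org/design-1",
    activity="http://example.org/build-1",
    agent="http://example.org/lab-a",
    role="http://example.org/roles/builder",
)
```

The optional `role` argument shows where a role could sit: in PROV-O it hangs off the qualified Association (or Usage) node via `prov:hadRole`, which may be what the example mentioned above was using.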