COMBINE 2016 Day 3: From Grassroots community standards to ISO Standards

COMBINE 2016

Martin Golebiewski

You need standards at every stage of the systems biology life cycle, and these standards need to work together and be interoperable. From modelling to simulation, to experimental data and back again, there are standards for each step. There are a large number of community standards for the life sciences, in many different subdomains (he references biosharing.org here).

The presence of many standards across different domains creates quite a lot of overlap, which can cause issues. Even within a single domain, it is normal to see different standards for different purposes, e.g. for describing the model, for specifying its simulation, and for reporting the simulation results. The way in which the synbio and sysbio standards interrelate is complex.

In COMBINE, there are the official standards, the associated standardization efforts, and the related standardization efforts. The tasks in COMBINE for the board and the whole community are to: organize concerted meetings (COMBINE and HARMONY) and training events for applying the standards, coordinate standards development, develop common procedures and tools (such as the COMBINE archive), and provide a recognized voice.

A similar approach, but with a broader focus, is the European CHARME network, which has been created to harmonize standardization strategies to increase the efficiency and competitiveness of European life-science research. It funds networking actions for five years from March 2016; see http://www.cost-charme.eu. There are 5 working groups within CHARME. WG2 covers innovation transfer, aiming for more involvement with industry.

NormSys is intended to bring together standards developers, research initiatives, publishers, industry, journals, funders, and standardization bodies. How should standards be published and distributed? How do we convince communities to apply standards, and how do we certify the implementation of standards? There is a nice matrix of the standards they are dealing with at http://normsys.h-its.org/biological/application/matrix.

NormSys is meant to be a bridge builder between research communities, industry and standardization bodies. There are actually a very large number of standardization bodies worldwide. ISO is the world's largest developer of voluntary international standards. Anything that comes from ISO has to emerge from a consensus of 164 national standards bodies, so finding such a consensus within ISO can be tricky. Most of the experts involved in ISO standards contribute voluntarily, or through dedicated non-ISO projects that fund the work.

Within ISO, there are technical committees (TCs). These TCs might have further subgroups or working groups. There can also be national mirror committees, and delegates from these committees are sent to the international committee meetings. The timeline for the full 6 stages of standard development with ISO can be around 36 months. However, this doesn't include any of the preliminary work that needs to happen before the official stages begin.

There are three main ISO document types: IS (International Standard), TS (Technical Specification) and TR (Technical Report). Most relevant for us here is ISO TC 276 for Biotechnology. Its scope is standardization in the field of biotechnology processes, covering: terms and definitions, biobanks and bioresources, analytical methods, bioprocessing, data processing including annotation, analysis, validation, comparability and integration, and finally metrology.

There are 5 WGs for this TC: terminology, biobanks, analytical methods, bioprocessing, and finally data processing and integration (WG5). ISO/IEC JTC 1/SC 29 covers the coding of audio, picture, multimedia and hypermedia information (this includes genome compression). ISO TC 276 WG5 was established in April 2015, and there are 60 experts from 13 countries. He says the next meeting is in Dublin, and there is still scope for people to join and help in this effort.

They have been working on standards for data collection, structuring and handling during the deposition, preservation and distribution of microbes, and on a recommended minimal information (MI) data set for data publication. One of the most important tasks of WG5 is the standardization of genome compression, a need identified by the MPEG consortium.

The biggest deal for COMBINE is the focus on developing an ISO standard for applying and connecting community modelling standards. “Downstream data processing and integration workflows – minimal requirements for downstream data processing and integration workflows for interfacing and linking heterogeneous data, models and corresponding metadata.”

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

COMBINE 2016 Day 2: SBOL Breakout – Host Context

Design and reality in SynBio: Host context and provenance (Neil Wipat, Bryan Bartley)

In synthetic biology, you are doing engineering in biology, combining wet-lab and in silico work. Until now, SBOL has been primarily concerned with the design stage of the process, but SBOL should be able to travel around the entire engineering life cycle, capturing data as it goes. Every data set generated throughout the life cycle should be able to be captured within the SBOL structure.

Take as an example a build of a system that has been done as described in the original design, e.g. with the original strain of E. coli. Even if it is the same design, you will get different experiments in different labs, even with the best of intentions, and therefore different experimental data. An SBOL design can be built by many labs and in many ways, in different host contexts. At the moment, SBOL doesn't capture the differences among these host contexts.

Host context requires information about all of the details of the design (who/what/when/where/why/how), which is why provenance and host context are relevant together. As Bryan mentioned in his talk earlier, characterising a cell during "steady state" can often be subjective and difficult. Measurements of the output of a genetic circuit depend strongly on how well adapted your cells are to the environmental conditions. Further, human error must be taken into account, and it can be necessary to backtrack and check your assumptions. Some components that you're using may have incomplete QC data.

There was a discussion of the difference between context and provenance: it was decided that the context was like the annotation on the nodes of a graph, and the provenance was how the edges between them were being walked. That is, provenance is how you got to a particular node, and context is about how you would re-create the conditions at that node.

The minimal information for the host context would involve representing the host as a type of ModuleDefinition. The host-specific annotations would be:

  • StrainId: reference
  • VendorId: reference
  • TaxonId: reference
  • Genome: reference
  • Genotype: Gnomic string

Gnomic is a machine-readable way of representing genotypes (http://github.com/biosustain/gnomic). It was then suggested that we should directly RDFize all of the information contained within Gnomic rather than using a new format that would have to be learnt and parsed. Alternatively, use proper ontological terms and reference them with URIs.
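
To make the RDFize suggestion concrete, here is a minimal sketch (mine, not part of any SBOL specification) of how the proposed host annotations might look as plain RDF triples on a host ModuleDefinition, using rdflib in Python. The CTX namespace, property names and all identifiers are hypothetical placeholders.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

SBOL = Namespace("http://sbols.org/v2#")
CTX = Namespace("http://example.org/hostcontext#")  # hypothetical namespace for the annotations

g = Graph()
host = URIRef("http://example.org/design/MG1655_host")  # invented URI for the host

# The host itself is represented as a ModuleDefinition
g.add((host, RDF.type, SBOL.ModuleDefinition))

# Host-specific annotations from the proposal, expressed as plain RDF properties
g.add((host, CTX.strainId, URIRef("http://example.org/strains/MG1655")))
g.add((host, CTX.vendorId, URIRef("http://example.org/vendors/some-vendor")))
g.add((host, CTX.taxonId,  URIRef("https://identifiers.org/taxonomy:511145")))
g.add((host, CTX.genome,   URIRef("https://identifiers.org/refseq:NC_000913.3")))
g.add((host, CTX.genotype, Literal("-lacZ +tetR")))  # illustrative Gnomic-style genotype string

print(g.serialize(format="turtle"))
```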

PROV-O, the provenance ontology, defines 3 core classes: Entity, Activity and Agent. An agent runs an activity to generate one entity from another. Is there an ontology for the activity? You could use something like OBI, but realize that each activity instance is tied to a particular timestamp, and therefore an activity is only done once.

There is a contrasting opinion that the important thing is that an activity can be reused, and therefore there should be a class/definition for each activity which gets instantiated at particular times.

The proposal suggests that all sbol2:Identified types could potentially be annotated with provenance information. As such, the following additional classes should be added: prov:Derivation, prov:Activity, prov:Agent, prov:Association, prov:Usage. (Though I definitely saw a prov:role in one of the examples.)
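
As an entirely hypothetical illustration of how these PROV-O terms fit together, here is a small rdflib sketch in which an agent runs a build activity that uses a design entity and generates a construct entity. For brevity it uses the simple prov:used / prov:wasDerivedFrom / prov:wasAssociatedWith properties rather than the qualified prov:Usage, prov:Derivation and prov:Association classes from the proposal; all URIs are invented.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/lab/")  # invented namespace

g = Graph()
design    = EX.plasmid_design_v1   # prov:Entity - the in silico design
construct = EX.plasmid_build_v1    # prov:Entity - the physical construct
build     = EX.assembly_run_42     # prov:Activity - the build step (done once, at a timestamp)
engineer  = EX.alice               # prov:Agent - who ran it

g.add((design,    RDF.type, PROV.Entity))
g.add((construct, RDF.type, PROV.Entity))
g.add((build,     RDF.type, PROV.Activity))
g.add((engineer,  RDF.type, PROV.Agent))

g.add((build,     PROV.used, design))               # the activity consumed the design
g.add((construct, PROV.wasGeneratedBy, build))      # ... and generated the construct
g.add((construct, PROV.wasDerivedFrom, design))     # entity-to-entity derivation
g.add((build,     PROV.wasAssociatedWith, engineer))
g.add((build,     PROV.endedAtTime, Literal("2016-09-21T14:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```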

COMBINE 2016 Day 2: Version and Variant Control for Synthetic Biology

COMBINE 2016

Bryan Bartley

Synthetic biology, as with many projects, gets complex quickly and could be improved through the use of versioning systems. SBOL currently supports versioning of designs, but not constructs. Further, versioning for synthetic biology needs to track provenance and contextual information. But how do we approach versioning in biological systems? In biology, branching tends to be how it's done (constructing in parallel): feature branches are much more the rule in biology than successive commits.

Variant Control is based on phylogenetic analysis of DNA sequences (scoring matrix -> multiple sequence alignment -> pairwise distance matrix -> phylogenetic tree). In Variant Control, the composition of a genetic circuit is encoded as a sequence of parts. You can then do an MSA on these part sequences, performing a parts-based phylogenetic analysis. From this, you get a tree of variants.

Next, add semantic annotations to score the alignments: going up the hierarchy to reach a common SO term incurs a penalty score. Variant Control thus clusters similar designs by both sequence and functional similarity (e.g. repressors together).
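
The penalty idea can be sketched roughly as follows; this is my own toy reconstruction of the scoring step, not the speaker's code, and the miniature term hierarchy is invented.

```python
# Toy child -> parent hierarchy standing in for a tiny slice of the Sequence Ontology
toy_hierarchy = {
    "constitutive_promoter": "promoter",
    "inducible_promoter": "promoter",
    "promoter": "regulatory_region",
    "operator": "regulatory_region",
    "regulatory_region": "sequence_feature",
    "CDS": "sequence_feature",
}

def ancestors(term):
    """Return the term followed by all of its ancestors, in order."""
    path = [term]
    while term in toy_hierarchy:
        term = toy_hierarchy[term]
        path.append(term)
    return path

def substitution_penalty(a, b):
    """Penalty = number of steps up the hierarchy needed to reach a common term."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_a, term in enumerate(path_a):
        if term in path_b:
            return steps_a + path_b.index(term)
    return len(path_a) + len(path_b)  # no shared ancestor at all

print(substitution_penalty("constitutive_promoter", "inducible_promoter"))  # 2 (meet at "promoter")
print(substitution_penalty("constitutive_promoter", "CDS"))                 # 4 (meet at "sequence_feature")
```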

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

COMBINE 2016 Day 2: How to Remember and Revisit Many Genetic Design Variants Automatically

COMBINE 2016

Nicholas Roehner

In other words, a version control system for variations on genetic design.

With even just a library of 16 parts, a 4-gene cluster can be encoded in over 684,000 variants. Clearly, GenBank files are not appropriate here. Their solution is Knox, where the genetic design space is only about 200k, rather than gigabytes. This "genetic design space" is a format where each edge is labelled with a *set* of parts, and each path through the graph yields one design. Design spaces can be concatenated via graph operations using Knox, and merged in a variety of different ways.

If you build up a series of these operations, you can create very large design spaces; a single design space encodes all of the various paths. These design spaces can be stored and versioned, much as is done with git. Combining design spaces in Knox also merges version histories. You can also branch a design space, giving you two different versions to work with. Reversion is also supported.
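
A toy sketch of the design-space idea as I understood it (not the actual Knox data model, and all node and part names are invented): a small graph whose edges carry sets of interchangeable parts, so that every path through the graph spells out one concrete design.

```python
# adjacency list: node -> list of (next_node, set of interchangeable parts on that edge)
design_space = {
    "start": [("n1", {"pTac", "pBad"})],
    "n1":    [("n2", {"RBS1"})],
    "n2":    [("end", {"gfp", "rfp", "lacZ"})],
    "end":   [],
}

def enumerate_designs(node="start", prefix=()):
    """Walk every path from `node`, yielding one tuple of parts per design variant."""
    edges = design_space[node]
    if not edges:
        yield prefix
        return
    for next_node, parts in edges:
        for part in sorted(parts):
            yield from enumerate_designs(next_node, prefix + (part,))

for design in enumerate_designs():
    print(design)
# 2 promoters x 1 RBS x 3 genes = 6 variants from a 4-node graph
```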

There is a RESTful API to allow connection between the web application and the graph database. Finch and Eugene are two tools which use Knox. Finch uses regular expressions, so you can encode variable-length designs; this makes designs more machine-comparable and mergeable, but it can make them harder for humans to read. This is where Eugene is beneficial: it is a more human-readable and writeable language, though it is less expressive than Finch and has a fixed design length.
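
As a rough illustration (my own, not taken from Finch) of why regular expressions make variable-length designs machine-comparable, the pattern below encodes "a promoter, one or more RBS+CDS units, then a terminator" over made-up part names:

```python
import re

# One promoter, one or more RBS+CDS units, then a terminator (part names are invented)
design_pattern = re.compile(r"^pTac(;RBS\d;cds_\w+)+;term1$")

print(bool(design_pattern.match("pTac;RBS1;cds_gfp;term1")))               # True
print(bool(design_pattern.match("pTac;RBS1;cds_gfp;RBS2;cds_rfp;term1")))  # True: longer design, same pattern
print(bool(design_pattern.match("pTac;term1")))                            # False: no RBS+CDS unit
```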

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

COMBINE 2016 Day 2: A new thermodynamics-based approach to aid the design of natural-product biosynthetic pathways

COMBINE 2016

Hiroyuki Kuwahara

The design of biosynthetic systems involves a large search space, so it is essential to have a computational tool that predicts productive pathways to aid in that design. There are a number of pre-existing approaches, including flux-analysis-based (the host is often limited to E. coli), reaction-count-based, and thermodynamic-favorability-based (but the effects of competing reactions cannot be captured, and the ranking doesn't depend on the host's metabolic system). Given a starting material, a target product, and a host organism, they wanted to find promising biosynthetic routes by allowing the introduction of foreign metabolic enzymes into the host.

They have a host-dependent weighting scheme, under which the ranking of pathways can be widely different from the thermodynamic favorability approach. They first compute a weight for each edge in the reaction network, such that two edges can have different weights even if their energy values are identical. In this way, the model can include additional steps that lower the ranking of otherwise high-scoring reactions if their routes lead to undesirable consequences.
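
My rough reconstruction of the general idea (not the authors' actual algorithm or data): search a weighted reaction graph for a low-cost route from a start compound to a target, where each edge weight combines a thermodynamic term with a host-dependent penalty, so identical energy values can still yield different weights. All compound names and numbers below are made up.

```python
import networkx as nx

def edge_weight(delta_g, host_penalty):
    # Hypothetical weighting: the same free-energy value can still give
    # different weights depending on the host-dependent penalty term.
    return max(delta_g, 0.0) + host_penalty

G = nx.DiGraph()
G.add_edge("glucose", "intermediate_A", weight=edge_weight(-5.0, 0.2))
G.add_edge("glucose", "intermediate_B", weight=edge_weight(-5.0, 3.0))  # same energy, worse in this host
G.add_edge("intermediate_A", "target",  weight=edge_weight(2.0, 0.5))
G.add_edge("intermediate_B", "target",  weight=edge_weight(-1.0, 0.1))

best = nx.shortest_path(G, "glucose", "target", weight="weight")
print(best)  # ['glucose', 'intermediate_A', 'target'] under these made-up weights
```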

They have also developed SBOLme.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

COMBINE 2016 Day 2: Data Integration and Mining for Synthetic Biology Design

COMBINE 2016

Goksel Misirli

How can we use ontologies to facilitate synthetic biology? Engineering biological systems is challenging, and integrating the data about them is even more so. Information may be spread across different databases, in different formats and with different semantics. This information should be integrated to inform and constrain biological design. Hence on to Gruber, and his "specification of a conceptualization" definition of ontologies. Ontologies are useful for capturing different relationships between biological parts and for facilitating data mining. They are already used widely in bioinformatics, including GO, SO, SBO, SBOL etc.

They have created the Synthetic Biology Ontology (SyBiOnt), available at http://w3id.org/synbio/ont. The SyBiOnt knowledgebase includes information about sequences, annotations, metabolic pathways, gene regulatory networks, protein-protein interactions, and gene expression. Once the KB was built, it was examined via a set of competency questions. For example: which parts can be used as inducible promoters? When an appropriate query was run, 51 promoters within the KB were classified as inducible.
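
A hedged sketch of how such a competency question might be posed against an RDF knowledgebase using rdflib and SPARQL; the class URI and the local file name are placeholder guesses, not the published SyBiOnt terms.

```python
from rdflib import Graph

g = Graph()
g.parse("sybiont_kb.rdf")  # hypothetical local copy of the knowledgebase

# "Which parts can be used as inducible promoters?"
query = """
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sybio: <http://w3id.org/synbio/ont#>

SELECT ?part WHERE {
    ?part rdf:type sybio:InduciblePromoter .
}
"""

for row in g.query(query):
    print(row.part)
```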

They also performed an automatic identification of biological parts, classified according to activator sites, repressor sites, inducible promoters, repressible promoters, SigA promoters, SigB promoters, constitutive promoters, repressor-encoding CDSs, activator-encoding CDSs, response-regulator-encoding CDSs and more.

There were many other competency questions that could be, and were, asked.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

COMBINE 2016 Day 2: Creating a SynBio Software Commons

COMBINE 2016

 

Curtis Madsen & Nicholas Roehner

Nona was created to address an issue with academic software and its development cycle: software is built, developed, published, and then gets lost as people move around in academia. By working with Nona, academics can get feedback and develop a community that can help with maintaining their software.

How do you participate? Go to http://nonasoftware.org and browse the currently-available software. Software is broken down into specification, design, data management and integration types. You can transfer the software to Nona and have them host it, or you can host the software yourself and they will provide a link to both the homepage and the GitHub (or similar) repository.

When you're ready to submit software to Nona, you start by choosing a license (to work with Nona, you must have an open-source license). Then you provide a link to the GitHub repo (or simply give a tarball to Nona, who will put it on GitHub). Nona will provide promotional materials, FAQs, forums etc. for your software.

In February 2017 there will be a 2 1/2 day hackathon (Nona Works) where they bring together biologists and computer scientists.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!