Archive

Archive for the ‘Standards’ Category

What should you think about when you think about standards?

July 12, 2011 Leave a comment

The creation of a new standard is very exciting (yes, really). You can easily get caught up in the fun of the moment, and just start creating requirements and minimal checklists and formats and ontologies…. But what should you be thinking about when you start down this road? Today, the second and final day of the BBSRC Synthetic Biology Standards Workshop, was about discussing what parts of a synthetic biology standard are unique to that standard, and what can be drawn from other sources. And, ultimately, it was about reminding ourselves not to reinvent the wheel and not to require more information than the community was willing to provide.

Matthew Pocock gave a great introduction to this topic when he summarized what he thinks about when he thinks about standards. Make sure you don’t miss my notes on his presentation further down this post.

(If you’re interested, have a look at yesterday’s blog post on the first day of this workshop: The more things change, the more they stay the same.)

Half a day was a perfect amount of time to get the ball rolling, but we could have talked all day and into the next. Other workshops are planned for the coming months, and it will be very interesting to see what happens as things progress, both in person and via remote discussions.

Once again, for the time constrained among us, here are my favorite sentences from the presentations and discussions of the day:

  1. Dick Kitney: Synthetic biology is already important in industry, and if you want to work with major industrial companies, you need to get acceptance for your standards, making the existing standard (DICOM) very relevant to what we do here.
  2. Matthew Pocock: Divide your nascent standard into a continuum of uniqueness, from the components of your standard which are completely unique to your field, through to those which are important but have overlap with a few other related fields, and finally to the components which are integral to the standard but which are also almost completely generic.
  3. Discussion 1: Modelling for the purposes of design is very different from modelling for the purposes of analysis and explanation of existing biology.
  4. Discussion 2: I learnt that, just as in every other field I’ve been involved in, there are terms in synthetic biology so overloaded with meaning (for example, “part”) it is better to use a new word when you want to add those concepts to an ontology or controlled vocabulary.

Dick Kitney – Imperial College London: “Systematic Design and Standards in Synthetic Biology”

Dick Kitney discussed how SynBIS, a synthetic biology web-based information system with an integrated BioCAD and modelling suite, was developed and how it is currently used. There are three parts to the CAD in SynBIS: DNA assembly, characterization, and chassis (data for SynBIS). They are using automation in the lab as much as possible. With BioCAD, you can use a parallel strategy for both computer modelling and the synthetic biology itself.

With SynBIS, you can get inputs from other systems as well as part descriptions, models and model data from internal sources. SynBIS has four layers: an interface/HTML layer, a communication layer, an application layer and a database layer.

Information can be structured into four types: the biological “continuum” (or the squishy stuff), modalities (experimental types, and standards relating to them), (sorry – missed this one), and ontologies. SynBIS incorporates the DICOM standard for its biological information. DICOM can be used and modified to store/send parts and associated metadata, related images, and related/collected data. They are interested in DICOM because of the industrialization of synthetic biology. Most major industries and companies already use the DICOM standard. If you want to work with major industrial companies, you need to get acceptance for your standards, making DICOM very important. The large user base of DICOM is a result of the large amount of effort that went into creating this modular, modality-friendly standard.

Images are becoming more and more important for synthetic biology. If you rely on GFP fluorescence, for example, then you need high levels of accuracy in order to replicate results, and DICOM helps you do this. It isn’t just a file format: it also includes transfer protocols and more. Each image in DICOM has its own metadata.
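To make that last point concrete, here is a toy sketch of the DICOM idea of tagged, per-image metadata: every element is identified by a (group, element) tag pair. This is purely illustrative (real applications would use a proper library such as pydicom, and only the tag names shown here are drawn from the standard; the values are invented).

```python
# Toy model of DICOM-style tagged metadata, not a real DICOM implementation.
# DICOM identifies each metadata element by a (group, element) tag pair.
DICOM_TAGS = {
    (0x0008, 0x0060): "Modality",   # imaging modality, e.g. microscopy
    (0x0008, 0x0020): "StudyDate",
    (0x0028, 0x0010): "Rows",       # image height in pixels
    (0x0028, 0x0011): "Columns",    # image width in pixels
}

def describe(dataset):
    """Render a tag->value mapping using human-readable names."""
    return {DICOM_TAGS.get(tag, f"Unknown{tag}"): value
            for tag, value in dataset.items()}

# Each image carries its own metadata dictionary.
image_metadata = {
    (0x0008, 0x0060): "GM",        # "GM" = General Microscopy
    (0x0008, 0x0020): "20110712",
    (0x0028, 0x0010): 512,
    (0x0028, 0x0011): 512,
}

print(describe(image_metadata))
```

The point is only that the metadata travels with each image, keyed by standardized tags, which is what makes results comparable across labs and instruments.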

What are the downsides of DICOM? DICOM is very complex, and most academics might not have the resources to make use of it (the specification runs to some 3,000 pages). In actuality, however, it is a lot easier to use than you might think: there are libraries, viewers and standard packages that hide most of the complexity. What are the most popular uses of DICOM right now? MR and CT imaging, ultrasound, light microscopy, lab data, and many other modalities. In a hospital, most machines’ outputs are compliant with DICOM.

As SBOL develops and expands, they plan to incorporate it into SynBIS.

Issues relating to the standard – Run by Matthew Pocock

The rest of the workshop was structured discussion on the practical aspects of building this standard. Matthew Pocock corralled us all and made sure we remained useful, and also provided the discussion points.

To start, Matt provided some background. What does he ponder when he thinks about standards? Adoption of the standard, for one, and who your adopters might be: they may be providers of data, consumers of data, or both. Also, both machines and humans will interact with the standard. The standard should be easy to implement, with a low cost of buy-in.

You need to think about copyright and licensing issues: who owns the standard, and who maintains it? Are people allowed to change it for their own or public use? Your standard needs a clearly-defined scope: you don’t want it to force you to think about what you’re not interested in. To help define that scope, you should have a list of competency questions.

You want the standard to be orthogonal to other standards, composing in any related standards you wish to use but which don’t belong in your new standard. You should also define a minimal level of compliance for data to be accepted.

Finally, above all, users of your standard would like it to be lightweight and agile.

What are the technical areas that standards often cover? You should have domain-specific models of what you’re interested in (terminologies, ontologies, UML): essentially, what your data looks like. You also need to have a method of data persistence and protocols, e.g. how you write it down (format, XML, etc.). You also need to think about transport of the data, or how you move it about (SOAP, REST, etc.). Access has to be thought about as well, or how you query for some of the data (SQL, DAS, custom API, etc.).
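Matt's four technical areas can be sketched in a few lines of code. This is a toy illustration of my own, not any real standard: a domain model of what the data looks like, a persistence format for writing it down, and an access method for querying it (the part names and sequences are from the Registry of Standard Biological Parts, but the data structure is invented).

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Part:                          # domain model: what your data looks like
    name: str
    sequence: str
    part_type: str

def to_json(part):                   # persistence: how you write it down
    return json.dumps(asdict(part))

def find_by_type(parts, part_type):  # access: how you query for some of it
    return [p for p in parts if p.part_type == part_type]

catalogue = [Part("BBa_B0034", "aaagaggagaaa", "RBS"),
             Part("BBa_E0040", "atgcgtaaaggagaagaac", "CDS")]

print(find_by_type(catalogue, "RBS")[0].name)   # → BBa_B0034
```

Transport (SOAP, REST, etc.) would sit on top of this, moving the serialized records between tools; the key observation is that each layer can be standardized separately.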

Within synthetic biology, there is a continuum from incredibly generic, useful standards through to things that are absolutely unique to our (synthetic biology) use case; in between is material that’s really important, but which might be shared with some other areas such as systems biology. For example, LIMS and generic metadata are completely generic and can be taken care of by things like Dublin Core. DNA sequence and features are important to synthetic biology, but are not unique to it. Synthetic biology’s peculiar constraints include things like the chassis. You could say that host is synonymous with chassis, but in fact they play completely different roles: chassis is a term used to describe something very specific in synthetic biology.

Some fields relevant to synthetic biology: microscopy, all the ‘omics, genetic and metabolic engineering, bioinformatics.

Discussion 1

Consider the unique ↔ generic continuum: where do activities in the synthetic biology lifecycle lie on the diagram? What standards already exist for these? What standards are missing?

The notes that follow are a merge of the results from the two groups, but it may be an imperfect merge and as a consequence, there may be some overlap.

UNIQUE (to synthetic biology)

  • design (the composition of behaviour (rather than of DNA, for example)).
    • modelling a novel design is different from modelling for systems biology, which seeks to discover information about existing pathways and interactions
    • quantification for design
  • Desired behaviour: higher-level design, intention. I am of the opinion that other fields also have an intention when performing an experiment, which may or may not be realized during the course of an experiment. I may be wrong in this, however. And I don’t mean an expected outcome – that is something different again.
  • Device (reusable) / parts / components
  • Multi-component, multiple-stage assembly
    • biobricks
    • assembly and machine-automated characterization, experiments and protocols (some of this might be covered in more generic standards such as OBI)
  • Scale and scaling of design
  • engineering approaches
  • characterization
  • computational accessibility
  • positional information
  • metabolic load (burden)
  • evolutionary stability

IMPORTANT

  • modelling (from systems biology): some aspects of both types of modelling are common.
    • you use modelling tools in different ways when you are starting from a synbio viewpoint
    • SBML, CellML, BioPAX
  • module/motifs/components – reusable models
  • Biological interfaces (RiPS, PoPS)
  • parts catalogues
  • interactions between parts (and hosts)
  • sequence information
  • robustness to various conditions
  • scaling of production

GENERIC

  • Experimental (Data, Protocols)
    • OBI + FuGE
  • sequence and feature metadata
    • SO, GO
  • LIMS
  • success/performance metrics (comparison with specs)
  • manufacturing/production cost

Discussion 2

From the components of a synthetic biology standard identified in discussion 1, choose two and answer:

  • what data must be captured by the standard?
  • What existing standards should it leverage?
  • Where do the boundaries lie?

Parts and Devices

What data must be captured by the standard?

  • part/device definition and nomenclature
  • sequence data
  • type (enumerated list)
  • relationships between parts (enumerated list / ontology)
  • part aggregation (ordering and composition of nested parts)
  • incompatibilities/contraindications (including the range of hosts where the chassis is viable)
  • part buffers and interfaces/inputs/outputs (as a sub-type of part)
  • provenance and curation level
  • any improvements (including what changes were made, and why they were made, e.g. mCherry with the linkers removed)
  • versioning information (version number, release notes, feature list, and known issues)
  • equivalent parts which are customized for other chassis (codon optimization and usage, chassis-agnostic part)
  • provenance information including authorship, originating lab, and the date/age of the part (much of this is covered by the SBOL-seq standard)
  • the derivation of the part from other parts or other biological sequence databases, with a human- and machine-readable description of the derivation

What existing standards? SBOL, DICOM, SO, EMBL, MIBBI

Boundaries: Device efficiency (only works in the biological contexts it’s been described in), chassis and its environment, related parts could be organized into part ‘families’ (perhaps use GO for some of this), also might be able to attach other quantitative information that could be common across some parts.
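A few of the fields the group listed, particularly nested part aggregation and provenance, can be sketched as a record type. This is my own hypothetical sketch for illustration; the field names are invented, not taken from SBOL or any real schema.

```python
from dataclasses import dataclass, field

@dataclass
class PartRecord:
    name: str
    sequence: str
    part_type: str                                  # from an enumerated list
    version: str = "1.0"                            # versioning information
    chassis_constraints: list = field(default_factory=list)  # hosts where viable
    derived_from: list = field(default_factory=list)          # provenance
    subparts: list = field(default_factory=list)    # nested composition, in order

    def flatten(self):
        """Depth-first list of this part and all of its nested subparts."""
        parts = [self]
        for sub in self.subparts:
            parts.extend(sub.flatten())
        return parts

rbs = PartRecord("BBa_B0034", "aaagaggagaaa", "RBS")
device = PartRecord("reporter_device", "", "Device", subparts=[rbs])
print([p.name for p in device.flatten()])   # → ['reporter_device', 'BBa_B0034']
```

Even a sketch this small shows why ordering and nesting must be first-class in the standard: a device is defined by the ordered composition of its parts, not just by their set.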

Characterization

We need to state the type of the device, and we would need a new specification for each type of device, e.g. a promoter is not a GFP. We need to know some measurement information such as statistics, experimental conditions required to record, lab, protocols. Another important value is whether or not you’re using a reference part or device. The context information would include the chassis, in vitro/in vivo, conditions, half-life, and interactions with other devices/hosts.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

The more things change, the more they stay the same

July 11, 2011 1 comment

…also known as Day 1 of the BBSRC Synthetic Biology Standards Workshop at Newcastle University, and musings arising from the day’s experiences.

In my relatively short career (approximately 12 years – wait, how long?) in bioinformatics, I have been involved to a greater or lesser degree in a number of standards efforts. It started in 1999 at the EBI, where I worked on the production of the protein sequence database UniProt. Now, I’m working with systems biology data and beginning to look into synthetic biology. I’ve been involved in the development (or maintenance) of a standard syntax for protein sequence data; standardized biological investigation semantics and syntax; standardized content for genomics and metagenomics information; and standardized systems biology modelling and simulation semantics.

(Bear with me – the reason for this wander through memory lane becomes apparent soon.)

How many standards have you worked on? How can there be multiple standards, and why do we insist on creating new ones? Doesn’t the definition of a standard mean that we would only need one? Not exactly. Take the field of systems biology as an example. Some people are interested in describing a mathematical model, but have no need for storing either the details of how to simulate that model or the results of multiple simulation runs. These are logically separate activities, yet they fall within a single community (systems biology) and are broadly connected. A model is used in a simulation, which then produces results. So, when building a standard, you end up with the same separation: have one standard for the modelling, another for describing a simulation, and a third for structuring the results of a simulation. All that information does not need to be stored in a single location all the time. The separation becomes even more clear when you move across fields.

But this isn’t completely clear cut. Some types of information overlap within standards of a single domain and even among domains, and this is where it gets interesting. Not only do you need a single community talking to each other about standard ways of doing things, but you also need cross-community participation. Such efforts result in even more high-level standards which many different communities can utilize. This is where work such as OBI and FuGE sit: with such standards, you can describe virtually any experiment. The interconnectedness of standards is a whole job (or jobs) in itself – just look at the BioSharing and MIBBI projects. And sometimes standards that seem (at least mostly) orthogonal do share a common ground. Just today, Oliver Ruebenacker posted some thoughts on the biopax-discuss mailing list where he suggests that at least some of BioPAX and SBML share a common ground and might be usefully “COMBINE“d more formally (yes, I’d like to go to COMBINE; no, I don’t think I’ll be able to this year!). (Scroll down that thread for a response by Nicolas Le Novère as to why that isn’t necessarily correct.) So, orthogonality, or the extent to which two or more standards overlap, is sometimes a hard thing to determine.

So, what have I learnt? As always, we must be practical. We should try to develop an elegant solution, but it really, really should be one which is easy to use and intuitive to understand. It’s hard to get to that point, especially as I think that point is (and should be) a moving target. From my perspective, group standards begin with islands of initial research in a field, which then gradually develop into a nascent community. As a field evolves, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with your peers becomes more and more important, and it becomes imperative that standards are developed.

This may sound obvious, but the practicalities of creating a community standard mean such work requires a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months – or years – of meetings, document revisions and conference calls to hash out a working standard. This isn’t necessarily a bad thing, though. All voices do need to be heard, and you cannot have a viable standard without input from the community you are creating that standard for. You can have the best structure or semantics in the world, but if it’s been developed without the input of others, you’ll find people strangely reluctant to use it.

Every time I take part in a new standard, I see others like me who have themselves been involved in the creation of standards. It’s refreshing and encouraging. Hopefully the time it takes to create standards will drop as the science community as a whole gets more used to the idea. When I started, the only real standards in biological data (at least that I had heard of) were the structures defined by SWISS-PROT and EMBL/GenBank/DDBJ. By the time I left the EBI in 2006, I could have given you a list a foot long (GO, PSI, and many others), and that list continues to grow. Community engagement and cross-community discussions continue to be popular.

In this context, I can now add synthetic biology standards to my list of standards I’ve been involved in. And, as much as I’ve seen new communities and new standards, I’ve also seen a large overlap in the standardization efforts and an even greater willingness for lots of different researchers to work together, even taking into account the sometimes violent disagreements I’ve witnessed! The more things change, the more they stay the same…

At this stage, it is just a limited involvement, but the BBSRC Synthetic Biology Standards Workshop I’m involved in today and tomorrow is a good place to start with synthetic biology. I describe most of today’s talks in this post, and will continue with another blog post tomorrow. Enjoy!

For those with less time, here is a single sentence for each talk that most resounded with me:

  1. Mike Cooling: Emphasise the ‘re’ in reusable, and make it easier to build and understand large models from reusable components.
  2. Neil Wipat: For a standard to be useful, it must be computationally amenable as well as useful for humans.
  3. Herbert Sauro: Currently there is no formal ontology for synthetic biology, but one will need to be developed.

This meeting is organized by Jen Hallinan and Neil Wipat of Newcastle University. Its purpose is to set up key relationships in the synthetic biology community to aid the development of a standard for that community. Today, I listened to talks by Mike Cooling, Neil Wipat, and Herbert Sauro. I was – unfortunately – unable to be present for the last couple of talks, but will be around again for the second – and final – day of the workshop tomorrow.

Mike Cooling – Bioengineering Institute Auckland, New Zealand

Mike uses CellML (it’s made where he works, but that’s not the only reason…) in his work with systems and synthetic biology models. Among other things, it wraps MathML and partitions the maths, variables and units into reusable pieces. Although many of the parts seem domain specific, CellML itself is actually not domain specific. Further, unlike other modelling languages such as SBML, components in CellML are reusable and can be imported into other models. (Yes, a new package called comp in SBML Level 3 is being created to allow the importing of models into other models, but it isn’t mature – yet.)

How are models stored? There is the CellML repository, but what is out there for synthetic biology? The Registry of Standard Biological Parts was available, but only described physical parts. Therefore they created a Registry of Standard Virtual Parts (SVPs) to complement the original registry. This was developed as a group effort with a number of people including Neil Wipat and Goksel Misirli at Newcastle University.

They start with template mathematical structures (which are little parts of CellML), and then use the import functionality available as part of CellML to combine the templates into larger physical things/processes (‘SVPs’) and ultimately to combine things into system models.
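The import mechanism that makes this composition possible looks roughly like the fragment below. This is an illustrative sketch of the general shape of a CellML 1.1 import; the file name and component names are hypothetical, not from the actual SVP repository.

```xml
<!-- A model pulls in a reusable template component from another file. -->
<model name="svp_promoter" xmlns="http://www.cellml.org/cellml/1.1#">
  <import xlink:href="templates/hill_activation.cellml"
          xmlns:xlink="http://www.w3.org/1999/xlink">
    <!-- component_ref names the component in the imported file;
         name is what this model will call it locally. -->
    <component component_ref="hill_activation" name="promoter_kinetics"/>
  </import>
</model>
```

Because imports reference whole files, templates, SVPs and system models can live as separate documents and be recombined freely, which is exactly the reuse Mike is after.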

They extended the CellML Repository to hold the resulting larger multi-file models, which included adding a method of distributed version control and allowing the sharing of models between projects through embedded workspaces.

What can these pieces be used for? Some of this work included creating a CellML model of the biology represented in Levskaya et al. 2005 and depositing all of the pieces of the model in the CellML repository. Another example is a model he’s working on involving shear stress and multi-scale modelling for aneurysms.

Modules are being used and are growing in number, which is great, but he wants to concentrate more at the moment on the ‘re’ of the reusable goal, and make it easier to build and understand large models from reusable components. Some of the integrated services he’d like to have: search and retrieval, (semi-automated) visualization, semantically-meaningful metadata and annotations, and semi-automated composition.

All of this work converges on the importance of metadata. Version 1.0 of the CellML Metadata Framework saw little use. With version 2.0, they have developed a core specification which is very simple, and then provide many additional satellite specifications. For example, there is a biological information satellite, where you use the BioModels qualifiers as relationships between your data and MIRIAM URNs. The main challenge is to find a database that is at the right level of abstraction (e.g. canonical forms of your concept of interest).

Neil Wipat – Newcastle University

Please note Neil Wipat is my PhD supervisor.

Speaking about data standards, tool interoperability, data integration and synthetic biology, a.k.a “Why we need standards”. They would like to promote interoperability and data exchange between their own tools (important!) as well as other tools. They’d also like to facilitate data integration to inform the design of biological systems both from a manual designer’s perspective and from the POV of what is necessary for computational tool use. They’d also like to enable the iterative exchange of data and experimental protocols in the synthetic biology life cycle.

A description of some of the tools developed in Neil’s group (and elsewhere) exemplifies the differences in data structures present within synthetic biology. BacilloBricks was created to help get, filter and understand the information from the MIT registry of standard parts. They also created the Repository of Standard Virtual Biological Parts. This SVP repository was then extended with parts from Bacillus, and extended again to make use of SBML as well as CellML; this project is called BacilloBricks Virtual. All of these tools use different formats.

It’s great having a database of SVPs, but you need a way of accessing and utilizing it. Hallinan and Wipat have started a collaboration with the people at Microsoft Research who created GEC, a programming language for the genetic engineering of living cells. A summer student’s project produced a GEC compiler for SVPs from BacilloBricks Virtual. Goksel has also created the MoSeC system, where you can automatically go from a model to a graph to an EMBL file.

They also have BacillusRegNet, which is an information repository about transcription factors for Bacillus spp. It is also a source of orthogonal transcription factors for use in B. subtilis and Geobacillus. Again, it is very important to allow these tools to communicate efficiently.

The data warehouse they’re using is ONDEX. They feed information from the ONDEX data store to the biological parts database. ONDEX was created for systems biology to combine large experimental datasets. ONDEX views everything as a network, and is therefore a graph-based data warehouse. ONDEX has a “mini-ontology” to describe the nodes and edges within it, which makes querying the data (and understanding how the data is structured) much easier. However, it doesn’t include any information about the synthetic biology side of things. Ultimately, they’d like an integrated knowledgebase using ONDEX to provide information about biological virtual parts. They therefore need a rich data model for synthetic biology data integration (perhaps including an RDF triplestore).
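The everything-is-a-network idea behind ONDEX (and behind an RDF triplestore) can be sketched in a few lines. This assumes nothing about ONDEX's actual API; it just illustrates the model, where every fact is a (subject, predicate, object) edge and queries are pattern matches over those edges. The part names are real registry identifiers; the predicates are invented for illustration.

```python
# Every fact is a triple: an edge in a graph.
triples = {
    ("BBa_B0034", "is_a", "RBS"),
    ("BBa_E0040", "is_a", "CDS"),
    ("BBa_E0040", "encodes", "GFP"),
    ("reporter_device", "has_part", "BBa_B0034"),
    ("reporter_device", "has_part", "BBa_E0040"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return {(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)}

# Which parts make up the reporter device?
print(sorted(o for (_, _, o) in query("reporter_device", "has_part", None)))
# → ['BBa_B0034', 'BBa_E0040']
```

The appeal for data integration is that new kinds of facts (characterization data, host context, provenance) just become more edges, without changing the schema.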

Interoperability, Design and Automation: why we need standards.

Neil identified three requirements:

  1. There needs to be interoperability and data exchange among these tools, as well as between these tools and other external tools.
  2. Standards for data integration must aid the design of synthetic systems. The format must be both computationally amenable and useful for humans.
  3. Automation of the design and characterization of synthetic systems also requires standards.

The requirements of synthetic biology research labs such as Neil Wipat’s make it clear that standards are needed.

KEYNOTE: Herbert Sauro – University of Washington, US

Herbert Sauro described the developing community within synthetic biology, the work on standards that has already begun, and the Synthetic Biology Open Language (SBOL).

He asks us to remember that Synthetic Biology is not biology – it’s engineering! Beware of sending synthetic biology grant proposals to a biology panel! It is a workflow of design-build-test. He’s mainly interested in the bit between building and testing, where verification and debugging happens.

What’s so important about standards? Standardization is critical in engineering, where it increases productivity and lowers costs. In order to identify the requirements, you must describe a need. There is one immediate need: store everything you need to reconstruct an experiment within a paper (for more on this, see the Nature Biotech paper by Peccoud et al. 2011: Essential information for synthetic DNA sequences). Currently, it’s almost impossible to reconstruct a synthetic biology experiment from a paper.

There are many areas requiring standards to support the synthetic biology workflow: assembly, design, distributed repositories, laboratory parts management, and simulation/analysis. From a practical POV, the standards effort needs to allow researchers to electronically exchange designs with round tripping, and much more.

The standardization effort for synthetic biology began with a grant from Microsoft in 2008 and the first meeting was in Seattle. The first draft proposal was called PoBoL but was renamed to SBOL. It is a largely unfunded project. In this way, it is very similar to other standardization projects such as OBI.

DARPA mandated 2 weeks ago that all projects funded from Living Foundries must use SBOL.

SBOL is involved in the specification, design and build part of the synthetic biology life cycle (but not in the analysis stage). There are a lot of tools and information resources in the community where communication is desperately needed.

SBOL has three parts: SBOL Semantic, SBOL Visual, and SBOL Script. SBOL Semantic is the one that’s going to be doing all of the exchange between people and tools. SBOL Visual is a controlled vocabulary and set of symbols for sequence features.

Have you been able to learn anything from SBML/SBGN, as you have a foot in both worlds? SBGN doesn’t address any of the genetic side, and is pretty complicated; you ideally want a very minimalistic design. SBOL Semantic is written in UML and is relatively small, though it has taken three years to get to this point. But you need host context above and beyond what’s modelled in SBOL Semantic: without it, you cannot recreate the experiment.

Feature types such as operator sites, promoter sites, terminators, and restriction sites can go into the Sequence Ontology (SO). The SO people are quite happy to add these things into their ontology.

SBOLr is a web front end for a knowledgebase of standard biological parts that they used for testing (not publicly accessible yet). TinkerCell is a drag-and-drop CAD tool for design and simulation. There is a lot of semantic information underneath to determine what is and isn’t possible, though there is no formal ontology. However, you can semantically annotate all parts within TinkerCell, allowing plugins to interpret a given design. A TinkerCell model can be composed of sub-models, which makes it easy to swap in new bits of models to see what happens.

WikiDust is a TinkerCell plugin written in Python which searches SBPkb for design components, and ultimately uploads them to a wiki. LibSBOLj is a library for developers to help them connect software to SBOL.

The physical and host context must be modelled to make all of this useful. By using semantic web standards, SBOL becomes extensible.

Currently there is no formal ontology for synthetic biology but one will need to be developed.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

BioSharing and M3 Special Interest Group at ISMB 2010

March 31, 2010 Leave a comment

Last year, I attended ISMB 2009 in Stockholm. Prior to the main ISMB conference, there were a number of Special Interest Groups, which are a little like mini-conferences lasting 1-2 days. I really enjoyed going to a number of these SIGs, and they were one of the highlights of the conference. I even presented a paper at one of them.

Last year also saw the first M3 SIG. This year the organisers are expanding the SIG’s scope, including sessions on data sharing and standards in biology, and highlighting the BioSharing community and those interested in open data and data standards. Further, during this SIG, the organisers:

…aim to launch the BioSharing forum, as discussed at M3 2009 to enable a broader dialogue among funders, journals, standards developers, technology developers and researchers on the critical issue of data sharing within the metagenomics community and beyond. (source)

Although I will not be able to attend ISMB or the SIGs this year, I highly recommend attending this SIG, and the conference in general. For more information, please see the BioSharing post advertising the SIG.

Science Commons provide a list of considerations for researchers looking to license their ontology

November 12, 2009 Leave a comment

Back in March, I wrote a blog post about my experiences trying to find out a) if ontologies should be licensed, b) if ontologies could be licensed, and c) what sort of license would be appropriate. After all, it isn’t clear what sort of thing an ontology is: is it software, or is it a document, or is it something else completely? In this post, I included a response I had received from the nice folks over at Science Commons, giving their perspective on the situation.

Today, I came across a Science Commons blog post by Kaitlin Thaney announcing OWL 2. In it, she also mentions that Science Commons now have a Reading Room article on Ontology Copyright Licensing Considerations which is well worth a read. It updates the information contained in my March post, and provides some useful thoughts on how we should go about licensing ontologies. The section below was the part that particularly caught my eye:

For sharing ontologies in a community or publicly, it would be prudent to think about copyright and licensing. For example, the ontology creator could say that “to the extent I may have copyright in my ontology, I license it in the following way.” In that way, she can reassure the community that even in the event copyright is later found to exist, they may rely upon her offer of a license. This provides an important “safety net” for the community of users, given the uncertainty about whether a given ontology may be copyrightable.

The above section seems to be the biggest new point compared with their earlier statement. While they primarily recommend CC0, they do acknowledge that many researchers may wish to choose an attribution-based licence such as the CC Attribution license.

If you create ontologies, then you should read this article: it’s short, easy to understand, and gives you the information you need to make your own decisions.
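One lightweight way to act on this advice is to embed the licence statement in the ontology header itself, so it travels with the file. Below is a sketch of what that looks like in RDF/XML, with the statement pulled back out using only the standard library; the ontology IRI, the use of `dcterms:license`, and the choice of CC0 are all illustrative, not prescribed by the Science Commons article.

```python
import xml.etree.ElementTree as ET

# A minimal OWL/RDF-XML ontology header carrying an explicit licence
# statement via dcterms:license (ontology IRI and licence are illustrative).
owl_header = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <owl:Ontology rdf:about="http://example.org/my-ontology">
    <dcterms:license rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
  </owl:Ontology>
</rdf:RDF>"""

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
DCTERMS = "http://purl.org/dc/terms/"

root = ET.fromstring(owl_header)
ontology = root.find(f"{{{OWL}}}Ontology")
license_iri = ontology.find(f"{{{DCTERMS}}}license").get(f"{{{RDF}}}resource")
print(license_iri)  # the CC0 public-domain dedication URI
```

The point of the "safety net" wording above is exactly this: the statement sits inside the artefact, so anyone who downloads the ontology sees the offered licence even if copyright status is uncertain.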

Breakout 3: Author identity – Creating a new kind of reputation online (Science Online London 2009)

August 22, 2009 1 comment

Duncan Hull, Geoffrey Bilder, Michael Habib, Reynold Guida

ResearcherID, Contributor ID, Scopus Author ID, etc. help to connect your scientific record. How do these tools connect to your online identity, and how can OpenID and other tools be integrated? How can we build an online reputation and when should we worry about our privacy?

Geoff Bilder:

Almost every aspect of a person can change without the person themselves changing. So, you want to have an identifier that is a hook to you, and which is better than a name (which is changeable). What about retinal scans? Fingerprints? OpenID? Where does your profile come in? A profile is a collection of attributes that you use to describe who you are. With author identity, what we want is the ability to get at the profile of a person in an unambiguous manner. Until we have such a thing, how do you tell people what your canonical profile is? To complicate matters even more, each user will want multiple personas, each with their own profiles.

When talking about identity, two issues are often conflated: identity authentication and knowledge-discovery identity. That is, you must be far more rigorous in determining who someone is when they log in with their identity than when figuring out who wrote a paper. Further complications arise from the lossy conversion of authors' names between languages.

Whatever is done has to be done on an international scale, and must be both interdisciplinary and interinstitutional. The oldest content cited thus far in CrossRef (with a DOI) is from the 1600s. What happens to your identifier when you die? A final issue is scale: there are about 200K new DOIs per month, and even if we guess at 5 authors per DOI, there could be between 5K and 21K identification failures per month if you estimate a 96-97% success rate for author identification.
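The scale argument is worth seeing as back-of-envelope arithmetic. The exact assumptions behind the 5-21K range weren't spelled out in the talk (it depends on how you count failures per paper versus per author), so the sketch below is only a naive per-author-mention calculation with illustrative numbers:

```python
# Back-of-envelope sketch of the author-identification scale problem.
# All figures are illustrative; the talk's 5-21K range rests on
# assumptions not recorded in these notes.
dois_per_month = 200_000
authors_per_doi = 5
author_mentions = dois_per_month * authors_per_doi  # ~1M name instances/month

for success_rate in (0.96, 0.97):
    failures = author_mentions * (1 - success_rate)
    print(f"{success_rate:.0%} success -> ~{failures:,.0f} misidentified mentions/month")
```

Even under this crude model, a 3-4% per-mention error rate produces failures in the tens of thousands every month, which is the essence of Bilder's point: small error rates become large absolute numbers at CrossRef scale.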

Duncan Hull:

He spoke about OpenID in science, among other things. Currently, authentication is handled differently by most online applications, and is generally done with just a simple username and password combination. Simon Willison (The Guardian) estimates that the average online user has at least 18 user accounts and 3.49 passwords. OpenID is trying to get us to a situation with fewer usernames AND passwords.

OpenID works by redirecting you to your OpenID provider to log in, which then sends you back to the location you started from. However, having a URL as a username is not very intuitive, and logging in via redirection can be confusing. Therefore, while adoption of OpenID is growing, it may not properly take off until browsers and other vendors support it better. He mentioned myExperiment as one site that accepts OpenID.
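To make the redirection step concrete, here is a minimal sketch of the first leg of the OpenID 2.0 flow: the relying party builds a `checkid_setup` redirect that sends the user to their provider. The parameter names come from the OpenID 2.0 specification; all the URLs are made up for illustration, and a real relying party would also have to perform discovery and verify the signed response that comes back.

```python
from urllib.parse import urlencode

def checkid_setup_url(provider_endpoint, claimed_id, return_to, realm):
    """Build the redirect URL that sends a user to their OpenID provider.

    Only the first leg of the OpenID 2.0 flow; verifying the provider's
    signed response at `return_to` is omitted. URLs are illustrative.
    """
    params = {
        "openid.ns": "http://specs.openid.net/auth/2.0",
        "openid.mode": "checkid_setup",
        "openid.claimed_id": claimed_id,
        "openid.identity": claimed_id,
        "openid.return_to": return_to,
        "openid.realm": realm,
    }
    return provider_endpoint + "?" + urlencode(params)

url = checkid_setup_url(
    "https://openid.example.org/auth",      # provider endpoint (hypothetical)
    "https://alice.example.org/",           # the user's identifier URL
    "https://app.example.com/openid/return",
    "https://app.example.com/",
)
print(url)
```

The "URL as a username" complaint above is visible here: `claimed_id` is literally a URL the user is expected to remember and type.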

Michael Habib:

Michael presented a nice diagram: a square divided into 4 parts, with “about me” and “not about me” across the top, and “by me” and “not by me” down the side. It is in the two “not” categories that the disambiguation of people matters most. He used the example of Einstein and the LC Authority Files to figure out all of the different versions of his name.

Completely different from the LC Authority files, which is manually and carefully checked by only certain people, is ClaimID. ClaimID is a way to collect all aspects of your identity in one place. However, it is dependent upon each individual being truthful about what they have ownership over.

Another approach is the Scopus Author ID, which is completely machine-aggregated. It is validated by publications, and scales well: it has 99% precision and 95% recall. The downsides are that it is impersonal, and that those precision and recall values really aren’t very good when you consider that this is about ownership of an article, and that there are a very large number of people.
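It is worth unpacking why "99% precision, 95% recall" is less comfortable than it sounds. The sketch below works through the arithmetic for a hypothetical corpus size (the talk did not give the total number of author-article links, so that figure is purely illustrative):

```python
# Why 99% precision / 95% recall still means many errors at scale.
# The corpus size is a made-up illustration, not a figure from the talk.
true_links = 10_000_000          # hypothetical genuine author-article links
precision, recall = 0.99, 0.95

found = true_links * recall      # true links the system recovers
missed = true_links - found      # false negatives: papers not credited
# With precision p, the returned set has size found / p in total,
# so the number of wrong attributions (false positives) is:
false_positives = found / precision - found

print(f"missed links:       {missed:,.0f}")
print(f"wrong attributions: {false_positives:,.0f}")
```

Half a million uncredited papers and nearly a hundred thousand misattributions, from numbers that look excellent as percentages: that is the objection being made above.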

There is also 2collab, where you can combine author ids (that you know about) into one identity. Then, you can add any other item on the web that is about you.

Reynold Guida (from Thomson Reuters):

They’ve built software to try to address author identity and attribution. If you look at the literature since 2000, communication and scientific collaboration have really changed: the number of multi-author papers has started to increase, while single-author papers have decreased. A Google search for common surnames really highlights the problems associated with identity. Name ambiguity is a real problem, as is the connection between the researcher, the institution and the community. Two of the most important questions in this discussion are: who do I know, and who do I want to know? The connections a person makes affect all aspects of their career.

Therefore they have created ResearcherID (free, secure, open). Privacy options are controlled by the user, even if the institution created the record. There is integration with EndNote, Web of Knowledge, and other systems to help build publication lists. You can also link to and visualize your ResearcherID profile really easily from your own websites.

Discussion:

Question: Has anyone thought through the security implications of these single-ID systems: one slip-up and your entire identity has been hacked? GB: Multiple identities encourage poor behaviour, as the thought of changing your password everywhere is so overwhelming that people don’t do it. But yes, these problems exist; in their minds, however, the tradeoffs make it worthwhile. You should NOT conflate knowledge issues with security issues, because the information in your scholarly profile is, by definition, public anyway.

Question: Do different openId providers and author id and researcher id know about each other in the computational sense? Not really yet.

Question: What about just making the markup of the web more semantically friendly? DH: The Google approach is a good one. RG: It’s all about getting the information into the workflow.

Question (Phil Lord): What worries me is that there has been a big land grab for the author identity space: for example, you cannot log into Yahoo with any OpenID other than a Yahoo OpenID. There’s a lot of value in being in control of someone’s ID, so there is a big potential danger. GB: For every distributed system, you need a centralized index to get it to work correctly. Therefore we need to make sure that if a centralized system appears, there is accountability.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

HL53: Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project (ISMB 2009)

July 2, 2009 3 comments

Chris Taylor

Standards are hugely dependent on their respective communities for requirements gathering, development, testing, and uptake by stakeholders. In modeling the biosciences there are a few generic features, such as descriptions of source material and experimental design components. Then there are biologically-delineated and technologically-delineated views of the world. These views are still common across many different areas of the life sciences, and much of it can fall under an ISA (Investigation-Study-Assay) structure.

You should then use three types of standards: syntax (images of FuGE, ISA-TAB etc.), semantics, and scope. MIBBI is all about scope. How well are things working? Well, there is still separation, but things are getting better. There aren’t many carrots, though there are some sticks, for using these standards. Why do we care about standards? Data exchange, comprehensibility, and scope for reuse. Many funders (especially public funders) are now requiring data sharing or the ability for data storage and exchange.

“Metaprojects” such as FuGE, OBI and ISA-TAB draw together many different domains and present them in a structure, and with semantics, useful across all of them. Many of the MI (minimum information) guidelines are developed independently, and some are now defunct. It’s also hard to track what’s going on in these projects: they can be redundant, and it is difficult to obtain an overview of the full range of checklists. Where the MI projects overlap, arbitrary decisions on wording and substructuring make integration difficult, so it is hard to take parts from different guidelines: they are not very modular. Enter MIBBI, which has two distinct goals: a portal (a registry of guidelines) and a foundry (integration and modularization).

There’s lots of enthusiasm for the project (drafters, users, funders, journals). MIBBI raises awareness of the various checklists and promotes their gradual integration. The paper is Nature Biotechnology 26, 889-896 (2008), doi:10.1038/nbt0808-889. He’s performed clustering and analysis of the different guidelines, displaying the MIs in Cytoscape and in a fake phylogenetic tree. By the end of the year they’ll have a shopping-basket-based tool, MICheckout, to gather all the relevant concepts together, with your own specialized checklist as output. You can make use of ISAcreator and its configuration to set mandatory parameters etc.

Then there are the objections to fuller reporting. Why should I share? Funders and publishers are starting to require a bare minimum of metadata, though researchers will then do just that bare minimum. Some people think that this is just a ‘make work’ scheme for bioinformaticians, or that bioinformaticians are parasitic. Some people don’t trust what others have done, but that’s what the reporting guidelines are for in the first place: so you can figure out whether you should trust it. Concerns about quality are justified to an extent, but what of people lacking the resources for large-scale work, or people who want to refer to proteomics data but don’t do proteomics? How should they follow these guidelines? The perception is that there is no money for this and no mature free tools, plus there are worries about vendor support; but vendors will support what researchers say they need.

Credit: data sharing is more or less a given now, and we need central registries of data sets that can record reuse (also OpenIDs, and DOIs for data). Side benefits and challenges include clearing up problems with paper authorship with respect to reporting who’s done which bit. It would also enable other kinds of credit, and may have to be self-policing. Finally, there is the problem of micro data sets and legacy data. An example of the former is EMBL entries: when searching against EMBL, you’re using the data in some way, even if you don’t pull it out for later analysis.

http://www.mibbi.org

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

TT26: BioModels Database, a database of curated and annotated quantitative models with Web Services and analysis tools (ISMB 2009)

June 30, 2009 Leave a comment

Nicolas Le Novère

Lots of things are called models. He’s NOT going to talk about HMMs, Bayesian models, sailboat models, supermodels :) For him, a model is computer-readable, simulatable, and covers biological pathways. Models and their descriptions/metadata need to be accessible. The models in BioModels come from the peer-reviewed literature: they check that each model is OK and simulate it before accepting it into the database. Models can be submitted by the curators themselves (e.g. re-implemented from the literature), directly submitted by authors, or arrive in a few other ways.

Models also have to be encoded in SBML and follow the MIRIAM guidelines, which are reporting guidelines for the encoding and annotation of models; MIRIAM is limited at the moment to models that can be quantitatively evaluated. There are seven basic requirements for MIRIAM compliance, which are available online. Within a model, MIRIAM annotations are identified by URIs and are stored as RDF. There’s been a steady increase in the number of models in BioModels: there are now about 400 models containing about 35,000 reactions. Standard search functionality is available from their website at the EBI (http://www.ebi.ac.uk/biomodels).
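To illustrate what "annotations identified by URIs and stored as RDF" looks like in practice, here is a minimal MIRIAM-style annotation of the kind that sits inside an SBML element: a biology qualifier (`bqbiol:is`) wrapping an `rdf:Bag` of resource URIs. The element ID and the UniProt accession are made up for illustration; the overall shape follows the SBML/MIRIAM annotation scheme.

```python
import xml.etree.ElementTree as ET

# A minimal MIRIAM-style RDF annotation as embedded in an SBML element.
# The metaid (#species_0001) and accession (P12345) are illustrative.
annotation = """<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
    <rdf:Description rdf:about="#species_0001">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="urn:miriam:uniprot:P12345"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>"""

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
root = ET.fromstring(annotation)
uris = [li.get(f"{{{RDF}}}resource") for li in root.iter(f"{{{RDF}}}li")]
print(uris)  # the MIRIAM URIs attached to this model element
```

Because the cross-references are plain URIs in a standard RDF container, any tool can harvest them without understanding the biology, which is much of the point of the MIRIAM scheme.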

You can export in CellML, BioPAX and other formats (though the SBML is the curated, perhaps more “trusted”, version). There are also two simple simulators available directly from each entry’s webpage, and if you want to change parameters you can click through to JWS Online. You can also extract portions of the models, and these will end up as valid SBML models in their own right.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
