Increasingly accurate biochemical knowledge representation with precise, structure-based chemical identifiers (ISMB Bio-Ont SIG 2009)

Michael Dumontier et al.

Problem: an identifier is just a name for some biochemical entity, whereas a record offers a rich description of the named entity. When viewing data, it’s sometimes difficult to know which form of a chemical the site is referring to. People use identifiers when reporting experimental results, but it’s often unclear which species they mean, so results can be reported erroneously or in an underspecified way. They’d like to generate stable identifiers based on explicit, machine-understandable descriptions which are unchanging and fully self-describing. Under this scheme, different molecules must have different identifiers. InChI strings, for example, are good, but you need specialized software to parse them.

Formats that already exist include SDF and CML, and existing identifiers that contain chemical information include InChI and SMILES. So, what happens if you ask CML for the differences between 3 very similar chemical species that differ only in their stereochemistry? It isn’t really possible. He’d like to reason over relations and class membership, and to perform classification tasks.
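To make the stereochemistry point concrete, here is a minimal sketch (my own illustration, not from the talk) using two isomeric SMILES strings for the enantiomers of alanine. The helper `strip_stereo` is hypothetical and only handles the simple `[C@H]` / `[C@@H]` case; a real comparison would need a proper cheminformatics toolkit, which is exactly the "specialized software" problem.

```python
import re

# Isomeric SMILES for the two enantiomers of alanine.
# They differ only in the chirality tag (@@ versus @) on the central carbon.
enantiomer_a = "C[C@@H](C(=O)O)N"
enantiomer_b = "C[C@H](C(=O)O)N"


def strip_stereo(smiles: str) -> str:
    """Toy helper (hypothetical): drop the chirality tag from [C@H] / [C@@H] atoms.

    This is plain string rewriting, not a SMILES parser, so it only covers
    the one pattern used in this example.
    """
    return re.sub(r"\[C@@?H\]", "C", smiles)


# As plain strings the two identifiers are different...
print(enantiomer_a == enantiomer_b)
# ...but once the stereo tag is removed they collapse to the same flat structure.
print(strip_stereo(enantiomer_a) == strip_stereo(enantiomer_b))
```

The difference between the two species is in there, but only as an opaque character in a string: nothing in the identifier itself says "these two are stereoisomers of each other" in a way a reasoner could use.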

In the vein of functional groups, they’d like to capture some form of generalisation: experimental conditions necessitate a certain level of structural (un)certainty. So they want a more flexible and accurate representation of biochemical knowledge that goes beyond the exact structure. Classes would include: specification, minimum, combination, possibilities/uncertainties, exclusion.
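As a rough sketch of how I understood some of these classes (this is my interpretation, not the speaker's model), a description could say which features must be present ("minimum"), which must not be ("exclusion"), and where only one of several alternatives is known ("possibilities"). The class `StructuralDescription` and the feature names below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralDescription:
    """Hypothetical partial description of a molecule in terms of named features."""

    required: frozenset = frozenset()   # "minimum": features that must be present
    excluded: frozenset = frozenset()   # "exclusion": features that must be absent
    alternatives: tuple = ()            # "possibilities": one feature from each set

    def matches(self, features) -> bool:
        """Return True if a molecule with these features fits the description."""
        features = set(features)
        if not self.required <= features:
            return False
        if self.excluded & features:
            return False
        # Every group of alternatives must be satisfied by at least one feature.
        return all(features & options for options in self.alternatives)


# Assumed example: "has a hydroxyl, has no carboxyl, and carries a methyl or an ethyl".
description = StructuralDescription(
    required=frozenset({"hydroxyl"}),
    excluded=frozenset({"carboxyl"}),
    alternatives=(frozenset({"methyl", "ethyl"}),),
)

print(description.matches({"hydroxyl", "ethyl"}))
print(description.matches({"hydroxyl", "ethyl", "carboxyl"}))
```

The point is only that a description can deliberately be less specific than an exact structure while still being machine-checkable.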

Ultimately, what we want to do is generate a useful identifier that points to an accurate and unchanging description. So, take what was done with InChI and generate something that is self-explanatory. We need OWL description -> identifier. They have a prototype service that allows you to submit an OWL snippet and get back an identifier. This means that if the description changes, the identifier changes. They will add new knowledge into the linked data web through Bio2RDF.
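The talk did not say how the prototype service derives its identifiers, so the following is only a sketch of one way the "description -> identifier" idea could work, assuming a content hash over a normalised description. The function names, the `desc:` prefix, and the OWL-like snippets are all hypothetical.

```python
import hashlib


def normalize(description: str) -> str:
    """Collapse whitespace so that formatting alone does not change the identifier."""
    return " ".join(description.split())


def make_identifier(description: str, prefix: str = "desc") -> str:
    """Derive an identifier from the description text itself (assumed scheme).

    Same description -> same identifier; any change in content -> new identifier.
    """
    digest = hashlib.sha256(normalize(description).encode("utf-8")).hexdigest()
    return f"{prefix}:{digest[:16]}"


# Illustrative OWL-like snippets; not taken from any real ontology.
snippet = "Class: Molecule1 SubClassOf: hasPart some HydroxylGroup"
snippet_reformatted = "Class: Molecule1\n    SubClassOf: hasPart some HydroxylGroup"
snippet_changed = "Class: Molecule1 SubClassOf: hasPart some CarboxylGroup"

print(make_identifier(snippet) == make_identifier(snippet_reformatted))
print(make_identifier(snippet) == make_identifier(snippet_changed))
```

Whitespace normalisation is a stand-in here: two OWL descriptions can be logically equivalent while being written quite differently, so a real service would need a proper canonical form before hashing.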

Benefits of this system include that no curation is required and that identifiers can be made for knowledge at various levels of granularity. Situational modeling enables the careful separation of what is known under particular circumstances.

FriendFeed discussion:

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

