Semantics and Ontologies

How does your ontologizing style compare with the style of others?

Do you run the reasoner after every axiom addition, or do you bravely go minutes or even hours before clicking on “Synchronize Reasoner”? Do you add synonyms, definitions, definition sources and other annotation avidly, or lazily (at least compared with your compatriots)?

Ever wondered if you built your ontology the same way as everyone else? Not in a competitive way (OK, maybe a little bit in a competitive way), but in your stylistic choices and natural rhythms? I have wondered exactly this, and last month I got the chance to provide some data to some researchers who are studying the styles and behaviours of ontologists while they are, well, ontologizing (Robert Stevens says it’s a word, and I believe him!). Markel Vigo (work page, blog site, Twitter) and Robert Stevens (work page, blog site) at the University of Manchester are looking for more ontologists to do the same as me and load up Protege 4 in The Name of Science (well, more science than you were already doing by producing the ontology in the first place).

[If you’re already sold, download the information you need from this Dropbox folder or email Markel Vigo.]

And, for only approximately 90 non-consecutive minutes of your time, you can contribute to their research too! You can pick it up and put it down as you have time; I did a few minutes here and there in about 5 or 6 sessions. You simply download their version of Protege with their event recorder built in, load up your favorite ontology and just work exactly as you would normally work. Although, saying that, I did feel like I was working with Robert sitting beside me, which did make me sit up straighter and feel vaguely like I was in an exam – in a good way…!

As Robert originally told me:

Protege4US (the name of their version of Protege which contains the event recorder) is a standard version of Protege 4, but it logs what people are doing – button presses, menu options used, axioms written etc. They then analyse these logs for patterns of activity. You can see a blog post that describes a paper about a recent study they did with Protege4US that used a pre-determined, defined task.

This study, in which you would take part if you’re interested, does more or less the same thing (although with no screen capture and no eye-tracking), but this time participants (that’s you) will be doing their own ontology task, in their own time, as they would usually do it.

Markel Vigo is running the study and you can ask him or Robert Stevens any questions you might have. Markel has supplied the Protege4US extensions and a readme and so on in a Dropbox folder. Markel has been very quick to answer whenever I had any questions, and even made an Ubuntu version for me as soon as I asked.

If you regularly work with ontologies, please consider donating some of your expertise to this task – I’m sure the results will be very interesting!


Ontogenesis: rapid reviewing and publishing of articles on semantics and ontologies

What happens when an ontologist or two gets frustrated at the drawn-out publication process that is the norm when publishing scientific books? You get Ontogenesis: a quick-turnaround, low-maintenance solution using WordPress blog software. Next, a bunch of other ontologists are invited to a two-day, intensive writing and peer-reviewing workshop and the initial content is created. Result? Well, my favourite result was Kendall Clark tweeting this: “#Ontogenesis is awesome”.

What is Ontogenesis?

Phil Lord had the idea, and together with Robert Stevens and others organised the 2-day Ontogenesis workshop that occurred last week, 21-22 January 2010. Why look around for an alternative to traditional publishing methods? When writing a book, accepting the invitation might take 5 minutes, but getting around to doing it can take 6 months or more. You may only spend a couple of days writing the article, but then need to wait months for reviews (and do reviews for the other authors’ articles). Then there is the formatting and camera-ready copy. Then you may wait many months for proofs and then only get a few days to make corrections. Then, you can wait a year or so for actual publication, by which time it is possibly out of date. Not ideal, but still necessary for some forms of publishing.

There are a number of benefits to using blog software, and to the Ontogenesis model in general:

  • stable, permanent URLs: permanent URLs for both articles and peer reviews. DOIs have also been discussed and are being considered.
  • automatic linking of peer reviews and related online articles. The WordPress software automatically adds trackbacks, pingbacks, etc. as comments on the relevant articles, making it easy for interested readers to visit the peer reviews written for that article.
  • completely open review system. Unlike many peer-review systems in use today, the reviewer publicly publishes his/her review in Ontogenesis.
  • less work and quick turnaround time for the editors, reviewers, and authors. Once you have written your article (in whatever format you like, other than a few broad suggestions about licensing and intention), you publish it as “Uncategorised” in the system, and then once reviewers have agreed to look at it, move it to “Under Review”. Once reviews are complete, and the editors have checked everything, it is moved to “Reviewed”. Pretty simple.

A blog that isn’t a blog

But is Ontogenesis a blog? Not really. Is it a book? Not in the traditional sense. While it seems to be correct to call it a blog, how the blog software is being used isn’t the way many people use it. And, though Duncan has called it “blogging a book”, this isn’t quite right either: while content, once completed, will not be changed, new content will be continually added. Phil discussed this point in his introduction to the workshop. He stated that wikis are best suited for certain styles of articles, but not for this sort of infrequently-updated information. Further, in wikis in general, crediting is poor. Google Knol is a nice idea, but not many people are using it. If it’s just a plain website, then there is no real way to have (and to show, more importantly), peer review.

To me, and to the general agreement of the people at the workshop, Ontogenesis can be viewed as a title/proper noun, in the same way as Nature is a title of a journal. Ontogenesis is the first of a class of websites called Knowledge Blogs. It has more in common with the high-quality, article-style blogging of ScienceBlogs or Research Blogging than it does with the short, informal blogging style that is used by most bloggers. Each article stands on its own, is of a high quality, and describes a topic of interest to both ontologists and novices in the ontology world. Each article is aimed at a general life science readership, ensuring accessibility of knowledge and broad appeal.

My experiences as a contributor

I was lucky enough to be invited to the workshop last week, and had a great time. After an introductory set of presentations, we all got started writing our articles. The idea was that, once written, each article would be peer reviewed by at least 2 others at the workshop. Once the peer reviews were complete, the article would be re-categorised from “Under Review” to “Reviewed”. As Phil said in a recent blog post, we wrote a large number of articles, though the number that have gone through the full review process was not as high. We expect that over the next few days, the number of completed articles will rise.

My article on Semantic Integration in the Life Sciences was the first to come out of peer review. Thanks are very much due to Frank Gibson, Michel Dumontier, and David Shotton for their peer reviewing and constructive criticism: it is a much better article for their input. I also reviewed a couple of articles (1,2) by Helen Parkinson and James Malone, which should be moved to a Reviewed status soon.

Ok, but what’s the downside?

Well, it is new, and there are some kinks to work out. This workshop highlighted a number of them, such as the difficulty people unfamiliar with WordPress had using its UI. Sean has posted a useful summary of his thoughts on the pluses and minuses, which I encourage you to have a read of and comment on. Here are a few thoughts on how to improve the experience in future, as mentioned during the meeting:

  • Enable the plugin for autogenerating related articles to improve cross-links.
  • The Table of Contents has been started, but different “pathways” for different intended readerships to help guide them through the articles would be helpful.
  • Reviewers should be able to change Categories in any article so they can mark when it is Under Review, rather than waiting for the Authors to do this.
  • The article-specific Tables of Contents are very helpful, but it might be better to move them to a different location in the post (e.g. the top rather than the bottom).
  • Have a way to mark yourself as willing to accept papers to review, for instance if you have some time in your schedule that week: authors could then preferentially choose you.
  • The ability for your name in the byline of an article to link to your profile on Ontogenesis. Currently, the profiles are private and some authors have put their profiles into the article text as a temporary alternative.
  • Add the WordPress Stats plugin.
  • Comments do not show their author; e.g. pingbacks to reviews have to be clicked through to find out who wrote the review.
  • Dealing with references/citations will be done better in future, when an appropriate plugin is found. Currently, basic HTML links to DOIs are the standard way to go.

Conclusions? Be an author yourself, and try it out!

This method of publishing is new, interesting, and quick. If you have a topic you’d like to write about, are interested in peer reviewing, or are just interested in reading the articles then please visit Ontogenesis and have a go, and then let us know what you think!

Please note: as mentioned in the main text, I am one of the authors of articles and peer reviews in Ontogenesis.


SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.
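
To give a feel for the mapping, here is a minimal sketch in Python. This is not MFO's actual implementation: the `mfo:` prefix, the toy model, and the plain-tuple triple representation are all invented for the example. It just illustrates the idea that SBML elements map to OWL classes, SBML attributes to OWL properties, and each element of a concrete model to an individual.

```python
# A minimal sketch (not MFO's actual code) of the SBML -> OWL mapping:
# SBML elements become OWL classes, SBML attributes become properties,
# and each element of a concrete model becomes an individual.
import xml.etree.ElementTree as ET

SBML_NS = "{http://www.sbml.org/sbml/level2/version4}"
MFO = "mfo:"  # stand-in prefix for the ontology's namespace

sbml_doc = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4">
  <model id="toy">
    <listOfSpecies>
      <species id="ATP" compartment="cytosol"/>
      <species id="ADP" compartment="cytosol"/>
    </listOfSpecies>
  </model>
</sbml>"""

def sbml_to_triples(xml_text):
    """Turn each SBML <species> element into an individual of
    mfo:Species, with one property assertion per XML attribute."""
    triples = []
    root = ET.fromstring(xml_text)
    for species in root.iter(SBML_NS + "species"):
        individual = MFO + species.get("id")
        triples.append((individual, "rdf:type", MFO + "Species"))
        for attr, value in species.attrib.items():
            triples.append((individual, MFO + attr, value))
    return triples

triples = sbml_to_triples(sbml_doc)
```

In the real MFO, of course, the classes and properties carry the logical restrictions from SBO and the SBML specification, which is what makes the OWL version more than a format conversion.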

A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals!1 It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Pellet 2.0.0 rc7 performs the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO Subversion project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies, of the formats in which your data sources are available. By layering your ontologies like this, you can separate out the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels. This is a database of simulatable, biological models whose primary format is SBML. I’m especially interested in BioModels, as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. In BioModels, the “native” format for the models is SBML, though other formats are available. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. Its purpose is to be a direct translation of SBML, SBO, and the SBML specification document into OWL format. Using an editor such as Protege, you can manipulate and create models, and then using the MFO code you can export the completed model back to SBML (the import feature is complete; the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features. However, I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathway types. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that – it stores logical restrictions and axioms that are only otherwise stored either in SBO itself or, more importantly, in the purely human-readable content of the SBML specification document2. MFO is therefore more than SBML: it carries a bunch of extra constraints that aren’t present in the BioPAX version of SBML, and so I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!


PTO6: Ontology Quality Assurance Through Analysis of Term Transformations (ISMB 2009)

Karin Verspoor

This work came out of a meeting about OBO quality assurance in GO, though the approach described here is applicable to any controlled vocabulary. The key quality concern is univocality, or a shared interpretation of the nature of reality, a term originally coined by Spinoza in 1677. David Hill intended it to mean something slightly different: consistency of expression of concepts within an ontology. This consistency facilitates human usability, and computational tools can exploit the regularity.

They try to identify cases where there are violations of univocality: two semantically similar terms with different structure in their term labels. GO is generally of very high quality, so computational tools are needed to identify the rare inconsistencies. They chose a simple approach of term transformation and clustering, as it’s good to start with the simplest stuff first. The first step is abstraction: substitution of embedded GO and ChEBI terms with the variables GTERM and CTERM, respectively. Then comes stopword removal (high-frequency words like “the”, “of”, “via”). Next is alphabetic reordering (to deal with word-order variation in the terms). They tried all the different combinations of transformation orderings, to see how the results differed.
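
The pipeline as I understood it can be sketched roughly like this (the embedded-term lists, the stopword list, and the example labels are all illustrative, not the real GO/ChEBI data):

```python
# Rough sketch of the term-transformation pipeline from the talk:
# abstraction, stopword removal, then alphabetic reordering.
# The term lists and stopwords here are invented for the example.
GO_TERMS = {"glucose metabolic process"}   # embedded GO terms -> GTERM
CHEBI_TERMS = {"glucose"}                  # embedded ChEBI terms -> CTERM
STOPWORDS = {"the", "of", "via", "in"}

def abstract(label):
    """Substitute embedded GO/ChEBI terms with variables (longest first)."""
    for t in sorted(GO_TERMS, key=len, reverse=True):
        label = label.replace(t, "GTERM")
    for t in sorted(CHEBI_TERMS, key=len, reverse=True):
        label = label.replace(t, "CTERM")
    return label

def transform(label):
    words = abstract(label).split()
    words = [w for w in words if w not in STOPWORDS]  # stopword removal
    return " ".join(sorted(words))                    # alphabetic reordering

# Two labels that differ only in structure normalise to the same string,
# so they land in the same cluster afterwards:
print(transform("regulation of glucose metabolic process"))  # GTERM regulation
print(transform("glucose metabolic process regulation"))     # GTERM regulation
```

Clustering is then just grouping term labels by their transformed string.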

20% of abstractions were due to CTERMs, and 30% due to GTERMs. The distribution of cluster sizes changed radically after transformation: the largest cluster before transformation had 29 terms, and afterwards the largest had ~3000. They looked for terms that were in different clusters after abstraction but that merged together after one of the other transformations; this yielded 237 clusters that may contain a univocality violation. A further 190 clusters had to be manually assessed; even so, this reduced the number of things that had to be looked at manually. They discovered 67 true-positive violations of univocality (35%), and already have ideas for improving this step.
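
The flagging step amounts to clustering the labels twice and reporting groups that only merge under the later transformations. A small sketch with invented labels (not the real GO terms):

```python
# Sketch of the flagging step: terms in different clusters after
# abstraction that merge under a later transformation are candidate
# univocality violations. Labels and stopwords are invented examples.
from collections import defaultdict

def cluster(labels, key):
    """Group labels by a normalisation function."""
    groups = defaultdict(list)
    for label in labels:
        groups[key(label)].append(label)
    return groups

labels = ["Y of X", "Y in X", "X Y level"]

def norm(s):
    # stopword removal + alphabetic reordering
    return " ".join(sorted(w for w in s.split() if w not in {"of", "in"}))

after_abstraction = cluster(labels, key=lambda s: s)  # 3 singletons: no merging
after_all = cluster(labels, key=norm)                 # "Y of X" / "Y in X" merge

candidates = [g for g in after_all.values() if len(g) > 1]
print(candidates)  # [['Y of X', 'Y in X']]
```

This matches the dominant true-positive pattern reported below, the {Y of X} | {Y in X} variation.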

The 67 clusters comprise 317 GO terms. 45% of the true-positive inconsistencies were of the form {Y of X} | {Y in X}. A further 16% of TPs had determiners (e.g. “the”) in one version and not in the other. A smaller number of TPs dealt with inverses, etc. 50% of FPs were due to the semantic import of a stopword: some stopwords actually carry meaning and shouldn’t have been removed, and removing them erased the real difference between the two terms.

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!