Semantics and Ontologies, Software and Tools

Software Ontology – a New Release and a Shiny New Build Procedure

I had noticed that it had been a while since we had last updated SWO – the Software Ontology. To be honest, it was a little more than “a while”, but…

  • we’re a merry band of volunteers, primarily Robert Stevens (blog, Computer Science at Manchester), Helen Parkinson (EBI), and James Malone (blog, SciBite), which means our time is limited
  • our build process was outdated, slow, and tricky. I’ll admit, I had to ask James to finish our 1.6 release as it just wasn’t working for me!
A small snippet of SWO – see EBI’s OLS for the full graph

Does your release spark joy?

We all enjoy talking about software, and I have particularly enjoyed beginning work on the lovely Licence Hierarchy within SWO, which has been coming along nicely. But every time I thought about updating the external ontologies we import, or building the release files, I got a bit of a sinking feeling. Then, feeling like I was the last in the class to notice, I had a good read about ROBOT (website, publication), an ontology build tool that lots of projects have been using. I say build tool, but it does all sorts of lovely things. I use it for the following purposes:

  • SPARQL queries: I use SELECT to create summary statistics of quite complex subdivisions of my ontology
  • Bulk annotation: UPDATE commands can also be run, allowing me to add bulk annotations to my file.
  • Bulk imports via spreadsheets: a separate project I’ve been involved in began their ontology development with a spreadsheet and then we bulk converted it to OWL with ROBOT.
  • Merging imports – going from a development file with multiple imports to a single release file
  • Release building – checking and building a release file with appropriate annotation and versioning.

And to top it all off, ROBOT suggests that you use a Makefile to control your build. What joy! The last time I used one was during my time at the EBI, and I really do enjoy using them. They are a lightweight, fun way to control a set of commands and dependencies that you need to run, and it was awesome to get back to it.
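
To give a flavour of what such a build looks like, here is a small, purely illustrative Makefile fragment in ROBOT's style. The file names, IRIs, and targets are my own assumptions for this sketch, not SWO's actual build configuration:

```makefile
ROBOT := robot

# Merge the development file and its imports into a single release file,
# then stamp it with an ontology IRI and a version IRI.
swo-release.owl: swo-edit.owl
	$(ROBOT) merge --input swo-edit.owl \
	  annotate --ontology-iri http://www.ebi.ac.uk/swo/swo.owl \
	           --version-iri http://www.ebi.ac.uk/swo/releases/1.7/swo.owl \
	  --output $@

# Run a SPARQL SELECT over the release to produce summary statistics as CSV.
stats.csv: swo-release.owl stats.sparql
	$(ROBOT) query --input swo-release.owl --query stats.sparql $@
```

The nice thing about driving ROBOT from make is that the dependency graph does the thinking: touch the edit file and only the release and its statistics get rebuilt.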


As it had been a while since we released SWO, it needed a spring clean. With MIREOT and Ontofox, I wasn’t tied to a simple (but crowded) import of entire ontologies. In previous versions, ontologies like EDAM were imported en masse, and this caused major versioning issues when releases got out of step. MIREOT solves that by outlining a procedure (implemented by Ontofox) that allows for the selective import of classes and hierarchies of interest from external ontologies.

So, we stripped out all of our external classes, and re-imported just the ones we needed. We also took the opportunity to resolve a number of inconsistencies with our IRI naming scheme (and a bunch of other housekeeping issues listed in our GitHub milestone).

Release and Indexing

We released 1.7 at the end of October, and our lovely friends at OLS, BioPortal and Ontobee quickly indexed it. Please feel free to browse it at any of these locations, or to say hello over at our GitHub repo (you’ll always find our latest release here). And with our build procedure now as streamlined as our ontology, updates will be easier and quicker – so let us know what you’d like!

Semantics and Ontologies, Software and Tools

Summer Ontology Work: The Software Ontology

In the summer of 2012, I worked with Robert Stevens and Duncan Hull (among others) on additions to the Software Ontology. As it was a short-term appointment, I kept detailed notes on the Software Ontology blog. A summary of those notes is available as its own post.

I had loads of fun doing the work, and would love to head back over there and do some more work on it. We had gotten most of the way through a merge of EDAM and SWO, and it would have been nice to finish that off, if time constraints had not been what they were.  Thanks very much to Robert Stevens for giving me the chance to do such interesting work!

Papers, Software and Tools

Google Scholar Citations and Microsoft Academic Search: first impressions

There’s been a lot of chat in my scientific circles lately about the recent improvements in freely-available, easily-accessible web applications to organise and store your publications. Google Scholar Citations (my profile) and Microsoft Academic Search (my profile) are the two main contenders, but there are many other resources to use which are mighty fine in their own right (see my publication list on Citeulike for an example of this). Some useful recent blog posts and articles on this subject are:

  • Egon Willighagen’s post includes lots of good questions about the future of free services like GSC and MAS, and how they relate to his use of more traditional services such as the Web Of Science.
  • Alan Cann’s impressions of GSC, including a nice breakdown of his citations by type.
  • Jane Tinkler’s comparison of GSC and MAS, together with a nice description of what happens when a GSC account is made (HT Chris Taylor @memotypic for the link to the article).
  • Nature News’ comparison of GSC and MAS, and impressions of how these players might change the balance of power between free and non-free services.

While I do like each of these technologies for a number of reasons, there are also reasons to be less happy with them. First, their similarities (please be aware I am not trying to make an exhaustive list – just my impressions after using each product for a few days). Both allow the merging of authors, a feature that was very useful to me as I changed my name when I got married. Neither service has a fantastic interface for the merge, but it worked. Both provide some basic metrics: GSC has the h-index and the i10-index, while MAS uses the g-index and h-index. Both tell you how many other papers have cited each of your publications. Both seem to get things mostly right (and a little bit wrong) when assigning publications to me – I had to manually fix both apps. Both provide links to co-authors, though GSC’s is rather limited, as you have to actively create a profile there, while with MAS profiles are built automatically.

Things I like about Microsoft Academic Search:

  1. Categorisation of publications. You can look down the left-hand side and see your papers categorised by type, keyword, etc.
  2. Looks nicer. Yes, I like Google. But Microsoft’s offering is just a lot better looking.
  3. Found more ancillary stuff. It found my work page (though the URL has since changed), and from there a picture of me. Links out to Bing (of course) and nice organisation of basic info really just make it look more professional than GSC.
  4. Bulk import of citations in Bibtex format. I really like this feature – I was able to bulk add the missing citations in one fell swoop using a bibtex upload. Shiny!

Things I don’t like about Microsoft Academic Search:

  1. Really slow update time. It insists on vetting each change with a mysterious Microsoftian in the sky. I’ve made a bunch of changes to the profile, updated and added publications, and days later it still hasn’t shown those changes. It’s got to get better if it doesn’t want to irritate me. Sure, do a confirmation step to ensure I am who I say I am, but then give me free rein to change things!
  2. Silverlight. I’ve tried installing Moonlight, which seemed to install just fine, but then the co-author graph just showed up empty. Is that a fault with Moonlight, or with the website?
  3. Did I mention the really slow update time?

Things I like about Google Scholar Citations:

  1. Changes are immediately visible. Yes, if I merge authors or remove publications or anything else, it shows up immediately on my profile.
  2. No incompatibilities with Linux. All links work, no plugins required.

Things I don’t like about Google Scholar Citations:

  1. Lack of interesting extras. The graphs, fancy categorisations etc. you get with MAS you don’t (yet) get with the Google service.
  2. No connection with the Google profile. Why can’t I provide a link to my Google profile, and then get integration with Google+, e.g. announcements when new publications are added? This is a common complaint with Google+, as many other Google services (e.g. Google Reader) aren’t yet linked with it, but hopefully this will come eventually.
  3. Not as pretty. Also, I’m not sure if it’s just my types of papers, but the links in GSC to the individual citations are difficult to read, and it’s hard to determine the ultimate source of the article (e.g. journal or conference name).

I will still use Citeulike as my main publication application. I use it to maintain the library of my own papers and other people’s papers. Its import and export features for bibtex are great, and it can convert web pages to citations with just one click (or via a bookmarklet). It has loads of other bells and whistles as well. While I’m writing up my thesis, I visit it virtually every day to add citations and export bibtex.

So, between Google and Microsoft, which do I like better? They’re very similar, but Microsoft Academic Search wins right now. But both services are improving daily, and we’ll have to see how things change in future.

And the thing that really annoys me? I now feel the need to keep my publications up to date on three systems: Citeulike (it’s the thing I actually use when writing papers etc.), Microsoft Academic Search, and Google Scholar Citations. No, I don’t *have* to maintain all three, but people can find out about me from all of them, and I want to try to ensure what they see is correct. Irritating. Can we just have some sensible researcher IDs in common use, and from that an unambiguous way to discover which publications are mine? I know efforts are under way, but it can’t come soon enough.

Data Integration, Semantics and Ontologies, Software and Tools

Current Research into Reasoning over BioPAX and SBML

What’s going on these days in the world of reasoning and systems biology modelling? What were people’s experiences when trying to reason over systems biology data in BioPAX and/or SBML format? These were the questions that Andrea Splendiani wanted to answer, and so he collected three of us with some experience in the field to give 10-minute presentations to interested parties at a BioPAX telecon. About 15 people turned up for the call, and there were some very interesting talks. I’ll leave you to decide for yourselves if you’d class my presentation as interesting: it was my first talk since getting back from leave, and so I may have been a little rusty!

The first talk was given by Michel Dumontier, and covered some recent work that he and colleagues performed on converting SBML to OWL and reasoning over the resulting entities.

Essentially, with the SBMLHarvester project, the entities in the resulting OWL file can be divided into two broad categories: in silico entities covering the model elements themselves, and in vivo entities covering the (generally biological) subjects the model elements represent. They converted all of BioModels into the OWL format and performed reasoning and analysis over the resulting information. Inconsistencies were found in the annotation of some of the models, and queries can additionally be performed over the resulting data set.

I gave the second talk about my experiences a few years ago converting SBML to OWL using Model Format OWL (MFO) (paper) and then, more recently, using MFO as part of a larger semantic data integration project whose ultimate aim is to annotate systems biology models as well as create skeleton (sub)models.

I first started working on MFO in 2007, and started applying that work to the wider data integration methodology called rule-based mediation (RBM) (paper) in 2009. As with SBMLHarvester, libSBML and the OWLAPI are used in the creation of the OWL files based on BioModels entries. All MFO entries can be reasoned over, and constraints present within MFO (drawn from the SBML XSD, the SBML Manual, and from SBO) provide some useful checks on converted SBML entries. The semantics of SBMLHarvester are more advanced than those of MFO; however, MFO is intended to be a conversion of a format only, so that SWRL mappings can be used to move data between MFO and the core of the rule-based mediation. Slide 8 of the above presentation provides a graphic of how rule-based mediation works. In summary, you start with a core ontology, which should be small and tightly scoped to your biological domain of interest. Data is fed to the core from multiple syntactic ontologies using SWRL mappings. These syntactic ontologies can be either direct format conversions from other, non-OWL, formats or pre-existing ontologies in their own right. I use BioPAX in this integration work, and while I have mainly reasoned over MFO (and therefore SBML), I also work with BioPAX and plan to work more with it in the near future.
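
As a concrete illustration of such a mapping, a SWRL rule feeding data from a syntactic ontology into the core might look like the following. Every class and property name here is invented for illustration; these are not the actual terms from MFO or the core ontology:

```
mfo:Species(?x) ^ mfo:hasName(?x, ?name) -> core:Substance(?x) ^ core:hasLabel(?x, ?name)
```

When the reasoner runs the rule, every individual classified as an mfo:Species in the syntactic ontology is also asserted to be a core:Substance, carrying its name across, which is exactly the "data fed to the core" step described above.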

The final presenter was Ismael Navas Delgado, whose presentation is available from Dropbox. His talk covered two topics: reasoning over BioPAX data taken from Reactome, and the use of a database back-end called DBOWL for the OWL data. By simply performing reasoning over a large number of BioPAX entries, Ismael and colleagues were able to discover not just inconsistencies in the data entries themselves, but also in the structure of BioPAX. It was a very interesting summary of their work, and I highly recommend looking over the slides.

And what is the result of this TC? Andrea has suggested that we continue the discussion on the mailing list (contact Andrea Splendiani if you are not on it and want to be added) and then have another TC in a couple of weeks. Andrea has also suggested that it would be nice to “setup a task force within this group to prepare a proof of concept of reasoning on BioPAX, across BioPAX/SBML, or across information resources (BioPAX/OMIM…)”. I think that would be a lot of fun. Join us if you do too!

Software and Tools

Using the glossaries package in LaTeX, Linux, and Kile

I was recently frustrated by the limitations of the acronym and glossary packages: I wanted to have something that joined the functionality of both together. Luckily, I found that with the glossaries package, which actually states that it is the replacement for the now-obsolete glossary package.

In order to make this tutorial, I have used the following resources, which you may also find useful: the CTAN glossaries page; the glossaries INSTALL file; (one, two) links from the latex community pages; and a page from the Cambridge University Engineering Department. These instructions work for Ubuntu Karmic Koala: please modify where necessary for your system.

Installing glossaries

Note for Windows users: While the makeglossaries command is a perl script for Unix users, there is also a .bat version of the file for Windows users. However, I don’t know how to set up MIKTex or equivalent to use this package. Feel free to add a comment if you can add information about this step.

  1. Get and unzip the glossaries package. I downloaded it from here. Though you can download the source and compile, I found it much easier to simply download the TeX directory structure (TDS) zip file.  Unfortunately, the texlive-latex-extra package available in Ubuntu or Kubuntu does not contain the glossaries package – it only contains glossary and acronym. I unzipped the contents of the zip file into a directory called “texmf” in my home directory. You’ll also want to run “texhash ~/texmf/” to update the latex database, according to the INSTALL instructions.
  2. (Optionally) get the xfor package. If your system is like mine, after you’ve installed the glossaries package latex will complain that it doesn’t have the xfor package (which also is not available via apt-get in Ubuntu). Download this package from here.
  3. Open the glossaries zip as root in a nautilus window, terminal window, or equivalent. You’ll be copying the contents to various locations in the root directory structure, and will need root access to do this.
  4. Find the location of your root texmf directory. In Karmic, this is /usr/share/texmf/, though it may be in another location on your system.
  5. Copy the contents of the tex and doc directories from the glossaries zip into the matching directory structure in your texmf directory. For me, this meant copying the “doc/latex/glossaries” subdirectory in the zip file to “/usr/share/texmf/doc/latex/”, and the same for the tex directory (copy “tex/latex/glossaries” subdirectory in the zip file to “/usr/share/texmf/tex/latex/”). In theory, you can also copy the scripts/ directory in the same way, but I did step 6 instead, as this is what was suggested in the INSTALL document.
  6. Update the master latex database. Simply run the command “sudo mktexlsr”
  7. Add the location of your scripts/glossaries directory to your $PATH. This gives programs access to makeglossaries, the perl script you will be using (if you’re in linux/unix). If you followed my default instructions in step 1, this location will be “/home/yourname/texmf/scripts/glossaries”.
  8. Test the installation. Change into the directory you created in step 1, into the “doc/latex/glossaries/samples/” subdirectory. There, run “latex minimalgls”. If you get an error about xfor, please see step 9. Otherwise, run “makeglossaries” and then “latex minimalgls” again. If everything works, the package is set up for command-line use. You may wish to modify your Kile setup to use glossaries – go to step 10 if this is the case.
  9. Set up the xfor package. Run steps 3-6 again, but with the xfor zip file instead of the glossaries zip file. This package is simpler than glossaries, and does not contain a scripts/ subdirectory, so you will not need to do step 7. After installation, try running step 8 again: everything should work.
  10. Setting up Kile. Though I’m using Ubuntu, I find the Kubuntu latex editor Kile to be my favourite (just “sudo apt-get install kile” on Ubuntu). To set up Kile for using glossaries, you need to add another build tool that runs makeglossaries.
    1. Go to Settings -> Configure Kile
    2. Select the “Build” choice, which is a submenu of “Tools” on the left-hand menu.
    3. This brings up a “Select a Tool” pane and a “Choose a configuration for the tool …” pane.
    4. Click on the “New” button at the bottom of the “Select a Tool” pane.
    5. Type in a descriptive name for the tool such as “MakeGlossaries”, and click “Next”.
    6. The next prompt will be to select the default behaviour (or class) of the tool. I based MakeGlossaries on MakeIndex, as they both run in similar ways. Click “Finish” to finish.
    7. For some reason, Kile wasn’t initially picking up the changes to my $PATH, so in the General tab of the “Choose a configuration for the tool MakeGlossaries” pane, I put the full path plus the name of the “makeglossaries” script in the “Command” field. You may only need to put in “makeglossaries”.
    8. In the “Options” field of the same tab and pane as step 7, just put in ‘%S’.
    9. Change the selected tab from “General” to “Menu”. In the “Add tool to Build menu:” field, select “Compile” from the pull-down menu. This allows it to appear in the quick compile menu in the main editor window.
    10. I didn’t change any other options. Press “OK” in the main Settings window.
    11. You should now be able to access MakeGlossaries within Kile. Remember, you have to run latex (e.g. PDFLatex) as normal first, to generate the initial file; then run MakeGlossaries; then run PDFLatex or similar again.
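
For quick reference, the command-line portion of the installation (steps 1-7) boils down to something like the following sketch. It assumes the default paths used above, that the TDS zip is named glossaries.tds.zip (check your download's actual name), and that you run the copy commands from the directory containing the unzipped files:

```
# User-level install of the glossaries package (steps 1-2)
mkdir -p ~/texmf
unzip glossaries.tds.zip -d ~/texmf
texhash ~/texmf

# System-wide copy of the tex/ and doc/ trees (steps 3-6)
sudo cp -r tex/latex/glossaries /usr/share/texmf/tex/latex/
sudo cp -r doc/latex/glossaries /usr/share/texmf/doc/latex/
sudo mktexlsr

# Make the makeglossaries script findable (step 7)
export PATH="$HOME/texmf/scripts/glossaries:$PATH"
```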

Good luck, and I hope this helps people!

Tips on using glossaries

I usually keep all of my acronym and glossary entries in a glossaries-file.tex or similar, and use “\include” to pull it into my main tex file. The links I posted at the top of this tutorial contain a number of useful examples, and included below are my favourites from those locations as well as a few of my own.

Note on usage within your document: to reference these entries, use \gls{entrylabel}, whether the entry is an acronym or a glossary entry. Further, to access the plural version of either, use \glspl{entrylabel}. By default, you do NOT need to put in a plural form of an acronym: latex will add an “s” to the expanded form and to the short form when you reference the acronym with \glspl{TLA} rather than \gls{TLA}.
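
Putting the pieces together, a minimal complete document might look like the sketch below. This is just one way to set it up; the “acronym” package option shown here gives you a separate list of acronyms alongside the main glossary:

```latex
\documentclass{article}
% 'acronym' produces a separate acronym list in addition to the glossary
\usepackage[acronym]{glossaries}
\makeglossaries

% Entries can live here or in an \include'd glossaries-file.tex
\newglossaryentry{sample}{name={sample},description={a sample entry}}
\newacronym{TLA}{TLA}{Three-letter acronym}

\begin{document}
The first use expands the acronym: \gls{TLA}; later uses stay short: \gls{TLA}.
A glossary term: \gls{sample}, and its plural: \glspl{sample}.
\printglossaries
\end{document}
```

Compile with latex, then makeglossaries, then latex again, exactly as in the installation test above.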

A plain glossary entry that is not also an acronym. The first “sample” is the label used to reference this entry. The second “name={sample}” is the name of the glossary entry, as viewed once the glossary is compiled. The description is the actual definition for the glossary entry:

\newglossaryentry{sample}{name={sample},description={a sample entry}}

A plain acronym entry that is not also a glossary entry. The TLA acronym below illustrates the very basic acronym form. The “aca” example after it illustrates how to add non-normal plurals to the short and long form of the acronym. Then, again, the first instance of “aca” is the label with which to reference the acronym, and the second instance is the name as viewed in the compiled document. The final {} section is the expanded form of the acronym:

\newacronym[]{TLA}{TLA}{Three-letter acronym}
\newacronym[\glsshortpluralkey=cas,\glslongpluralkey=contrivedacronyms]{aca}{aca}{a contrived acronym}

Using an acronym within the glossary definition of a glossary entry. If you wish to make use of an acronym within the glossary definition, and have that acronym indexed properly within the glossary as well as the main text, here is what you do. First, make the acronym. Note that there is nothing special about this acronym:

\newacronym[]{DL}{DL}{Description Logics}

Second, make a normal glossary entry, and reference the acronym as normal. No special work necessary! Please also note that you can put in \cite references within a glossary entry with no problem at all:

\newglossaryentry{TBox}{name={TBox},description={This component of a \gls{DL}-based ontology describes
"intensional knowledge", or knowledge about the problem domain in general. The "T" in TBox could,
therefore, mean "Terminology" or "Taxonomy". It is considered to be unchanging
knowledge~\cite[pg. 12]{2003Description}. Deductions, or \textit{logical implications},
for TBoxes can be done by verifying that a generic relationship follows logically
from TBox declarations~\cite[pg. 14]{2003Description}.}}

Using an acronym as the name of a glossary entry. You sometimes want to use a defined acronym as the name for a glossary entry – this allows you to create a definition for an acronym. In this case, build your acronym as follows. Note that you need to add a “description” field to the square brackets:

\newacronym[description={\glslink{pos}{Part of Speech}}]{POS}{POS}{Part Of Speech}

Then, reference the acronym in the glossary entry as follows (notice the different label for this entry):

\newglossaryentry{pos}{name=\glslink{POS}{Part Of Speech},text=Part of Speech,
description={``Part of Speech'' description}}

Good luck, and have fun. 🙂

Data Integration, Semantics and Ontologies, Software and Tools

Short Tutorial on using Pellet + OWLAPI + SWRL Rules

I’ve been looking through Pellet and OWLAPI documentation over the past few days, looking for a good example of running existing SWRL rules via the OWLAPI using Pellet’s built-in DL-safe SWRL support. SWRL is used in ontology mapping, and is a powerful tool. Up until now, I’ve just used the SWRLTab, but needed to start running my rules via plain Java programs, and so needed to code the running of the mapping rules in the OWLAPI (which I’m more familiar with than Jena). Once I clean up the test code, I’ll link it from here so others can take a look if they feel like it.

This example uses the following versions of the software:

Pre-existing Examples

Pellet provides a SWRL rule example (in the Pellet download), but only for Jena, and not the OWLAPI. The OWLAPI documentation covers the creation of SWRL rules, but not their running. Therefore, to help others who may be walking the same path as I did, a short example of OWLAPI + Pellet + SWRL follows.

New Example

This example assumes that you already have the classes, individuals, and rules mentioned below in an OWL file or files. Here is how the test ontology looks, before running the rule (you can use reasoner.getKB().printClassTree() to get this sort of output):

source:SourceA - (source:indvSourceA)
source:SourceB - (source:indvSourceB)

The example SWRL rule is this (the rule.toString() method prints this kind of output, while iterating over ontology.getRules()):

Rule( antecedent(SourceA(?x)) consequent(TargetA(?x)) )

Please note that if you want to modularise your OWL files, as I do (I have different files for the source classes, the target classes, the source individuals, the target individuals, and the rules), then make sure your owl:imports in the primary OWL ontology are correct, and that you’ve mapped them correctly with the SimpleURIMapper class and the manager.addURIMapper(mapper) method. I will update this post with some unit tests of this setup once I’ve cleaned up the code for public consumption.
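
That mapping step looks roughly like the following with the OWLAPI of that era. The URIs are placeholders, and you would repeat the mapper call once per modular file:

```java
// Map the ontology's logical URI to its physical location on disk,
// so that owl:imports in the primary ontology resolve to local files.
// Both URIs here are placeholders for illustration.
URI logicalUri  = URI.create( "http://example.org/source.owl" );
URI physicalUri = URI.create( "file:/home/me/ontologies/source.owl" );

OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
manager.addURIMapper( new SimpleURIMapper( logicalUri, physicalUri ) );

// Loading by logical URI now finds the local file via the mapper.
OWLOntology ontology = manager.loadOntology( logicalUri );
```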

Once you have your ontology properly loaded into an OWLAPI OWLOntology class, you should simply realize the ontology to run the SWRL rules. In Pellet’s OWLAPI wrapper, realisation is a single call:

reasoner.realise();

After this command, all that’s left to do is save the new inferences. In this simple case, one individual is asserted to also be a child of the TargetA class, as follows:

source:SourceA - (source:indvSourceA)
source:SourceB - (source:indvSourceB)
target:TargetA - (source:indvSourceA)

You can do this by using the following code to explicitly save the new inferences to a separate ontology file. You can modify InferredOntologyGenerator to just save a subset of the inferences, if you like. Have a look in the OWLAPI code or javadoc for more information. Alternatively, you could just iterate over the ABox and just save the new individuals to a file. Here’s the code for saving the ontology to a new location:

// create an empty ontology to hold the inferred axioms
OWLOntology exportedOntology = manager.createOntology( URI.create( outputLogicalUri ) );
// collect the inferences the reasoner produced...
InferredOntologyGenerator generator = new InferredOntologyGenerator( reasoner );
// ...and add them as axioms to the new ontology
generator.fillOntology( manager, exportedOntology );
// write the result out as RDF/XML at the given physical location
manager.saveOntology( exportedOntology, new RDFXMLOntologyFormat(), URI.create( outputPhysicalUri ) );

I hope this helps some people!

Housekeeping & Self References, Science Online, Software and Tools

Live blogging with Wave: not so live when you can’t make the Wave public

I live blogged Cameron Neylon‘s talk today at Newcastle University, and I did it in a Wave. There were a few pluses, and a number of minuses. Still, it’s early days yet and I’m willing to take a few hits and see if things get better (perhaps by trying to write my own robots, who knows?). In effect, today was just an exercise, and what I wrote in the Wave could have equally well been written directly in this blog.

(You’ll get the context of this post if you read my previous post on trying to play around with Google Wave. Others, since, have had a similar experience to mine. Even so, I’m still smiling – most of the time 🙂 )

Pluses: The Wave was easy to write in, and easy to create. It was a very similar experience to my normal WordPress blogging experience.

Minuses: I wanted to make the Wave public from the start, but have yet to succeed in this. Adding or just didn’t work: nothing I tried was effective. Also, the copying and pasting simply failed to work when copying the content of the Wave from Iron into my WordPress post in Firefox: while I could copy into other windows and editors, I simply couldn’t copy into WordPress. When I logged into Wave via Firefox, the copy-and-paste worked, but automatically included the highlighting that occurred due to my selecting the text, and then I couldn’t un-highlight the wave! What follows is a very colorful copy of my notes. I’ve removed the highlighting now, to make it more readable.

I’d like to embed the Wave here directly. In theory, I can do this with the following command:

[wave id=”!w%252BtZ-uDfrYA.2″]

Unfortunately, it seems this Wavr plugin is not available via the setup. So, I’ll just post the content of the Wave below, so you can all read about Cameron Neylon’s fantastic presentation today, even if my first experiment in Wave wasn’t quite what I expected. Use the Wave id above to add this Wave to your inbox, if you’d like to discuss his presentation or fix any mistakes of mine. It should be public, but I’m having some issues with that, too!

Cameron Neylon’s talk on Capturing Process and Science Online. Newcastle University, 15 October 2009.

Please note that all the mistakes are mine, and no-one else’s. I’m happy to fix anything people spot!

We’re either on top of a dam about to burst, or under it about to get flooded. He showed a graph of data entering GenBank. Interestingly, the graph is no longer exponential, and this is because most of the sequence data isn’t going into GenBank, but is being put elsewhere.

The human scientist does not scale. But the web does scale! The scientist needs help with their data, with their analysis etc. They’ll go to a computer scientist to help them out. The CS person gives them a load of technological mumbo jumbo that they are suspicious of. What they need is someone to mediate between the computer stuff and the biologist. They may try an ontologist; however, that also isn’t always too productive: the message they’re getting is that they’re being told how to do stuff, which doesn’t go down very well. People are shouting, but not communicating. This is because all the people might want different things (scientists want to record what’s happening in the lab, the ontologist wants to ensure that communication works, and the CS person wants to be able to take the data and do cool stuff with it).

Scientists are worried that other people might want to use their work. Let’s just assume they think that sharing data is exciting. Science wants to capture first and communicate second, ontologists want to communicate, and CS wants to process. There are lots of ways to publish on the web, in an appropriate way. However, useful sharing is harder than publishing. We need the agreed structure to do the communication, because machines need structure. However, that’s not the way humans work: humans tell stories. We’ve created a disconnect between these two things. The journal article is the story, but isn’t necessarily providing access to all the science.

So, we need to capture research objects, publish those objects, and capture the structure through the storytelling. Use the MyTea project as an example/story: a fully semantic (RDF-backed) laboratory record for synthetic chemistry. This is a structured discipline which has very consistent workflows. This system was tablet-based. It is effective and is still being used. However, what it didn’t work for was molecular biology / bioengineering etc — a much wider range of things than just chemistry. So Cameron and others got some money to modify the system: take MyTea (highly structured and specific system) and extend it into molecular biology. Could they make it more general, more unstructured? One thing that immediately stands out for unstructured/flexible is blogs. So, they thought that they could make a blog into a lab notebook. Blogs already have time stamps and authors, but there isn’t much revision history, therefore that got built into the new system.

However, was this unstructured system a recipe for disaster? Well, yes it is — to start with. What warrants a post, for example? Should a day be one post? An experiment? There was little in the way of context or links. People who also kept a physical lab book ended up having huge lists of lab book references. So, even though there was a decent amount of good things (Google indexing etc.) it was still too messy. However, as more information was added, help came from an unexpected source: post metadata. They found that pull-down menus for templates were being populated by the titles of the posts. They used the metadata from the posts and used that to generate the pull-down menu. In the act of choosing that post, a link is created from that post to the new page made by the template. The templates depend on the metadata, and because the templates are labor saving, users will put in metadata! Templates feed on metadata, which feed the templates, and so on: a reinforcing system.

An ontology was “self-assembled” out of this research work and the metadata used for the templates. They compared their terms to the Sequence Ontology and found some exact matches, as well as some places where they identified possible errors in the Sequence Ontology (e.g. conflation of purpose into one term). They’re capturing first, and then the structure gets added afterwards. They can then map their process and ontologies onto agreed vocabularies for the purpose of a particular story. They do this because they want to communicate with other communities and researchers that are interested in their work.

So, you need tools to do this. Luckily, there are tools available that exploit structure where it already exists (like they’ve done in their templates, aka workflows). You can imagine instruments as bloggers (take the human out of the loop). However, we also need tools to tell stories: to wire up the research objects into particular stories / journal articles. This allows people who are telling different stories to connect to the same objects. You could aggregate a set of web objects into one feed, and link them together with specific predicates such as vocabs, relationships, etc. This isn’t very narrative, though. So, we need tools that interact with people while they’re doing things – hence Google Wave.

An example is Igor, the Google Wave citation robot. You’re having a “conversation” with this Robot: it’s offering you links, choices, etc., while keeping the look and feel of writing a document. There is also the ChemSpider Robot, written by Cameron. Here, you can create linked data without knowing you’ve done it. The Robots will automatically link your story to the research objects behind it. Robots can work off of each other, even if they aren’t intended to work together. Example: Janey-robot plus Graphy. If you pull the result from a series of robots into a new Wave, the entire provenance from the original wave is retained, and is retained over time. Workflows, data, or workflows+data can be shared.

Where does this take us? Let’s say we type “the new rt-pcr sample”. The system could check for previous rt-pcr samples and choose the most recent one to link to in the text (after asking you if you’re sure). As a result of typing this (and agreeing with the robot), another robot will talk to a MIBBI standard to get the required minimum information checklist and create a table based on that checklist. And always, adding links as you type. Capture the structure – it’s coming from the knowledge that you’re talking about an rt-pcr reaction. This is easier than writing it out by hand. As you get a primer, you drop it into your database of primers (which is also a Wave), and then it can be automatically linked in your text. This allows you to tell a structured story.
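The behaviour described above (spotting a term like “rt-pcr” in what you type and expanding it into a minimum-information table) is easy to sketch. Here is a toy version; the trigger terms and checklist fields are invented placeholders, not a real MIBBI checklist or a real Wave robot:

```python
# Toy sketch of a checklist robot: spot a known assay term in free text
# and emit a minimum-information table for it. The checklist content
# below is an invented placeholder, not a real MIBBI checklist.

CHECKLISTS = {
    "rt-pcr": ["sample source", "primer sequences", "cycle conditions"],
}

def expand_checklist(text):
    """Return (term, table rows) for the first known term found in text."""
    lowered = text.lower()
    for term, fields in CHECKLISTS.items():
        if term in lowered:
            rows = [(field, "") for field in fields]  # empty cells to fill in
            return term, rows
    return None, []

term, rows = expand_checklist("Set up the new rt-PCR sample this morning")
```

A real robot would of course do this interactively inside the Wave, but the core step is just this: recognize the term, fetch the checklist, and lay out a table for the author to complete.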

Natural user interaction: easy user interaction with web services and databases. You have to be careful: you don’t want to be going back to the chemical database every time you type “He”, “is”, etc. In the Wave, you could somehow state that you’re NOT doing arsenic chemistry (the robot could learn and save your preferences on a per-user, per-wave basis). There are problems with Wave: one is the client interface, another is user understanding. In the client, some strange decisions have been made – it seems to have been made the way that people in Google think. However, the client is just a client. Specialized clients, or just better clients, will be some of the first useful tools. In terms of user understanding, none of us quite understands yet what Wave is.

We’re not getting any smarter. Experimentalists need help, and many recognize this and are hoping to use these new technologies. To provide help, we need structure so machines can understand things. However, we need to recognize and leverage the fact that humans tell stories. We need to have structure, but we need to use that structure in a narrative. Try to remember that capturing and communication are two different things.

Science Online Software and Tools

The sound of two hands Waving

The Life Scientists Wave in Iron

I got a Google Wave account (grin) via Cameron Neylon on Monday morning (thanks, Cameron!). I’m trying not to get caught up in all the hype, but I can’t help grinning when I’m using it, even though I don’t really know what I’m doing, and even after seeing the Science Online demo and a couple of Google videos.

But where and how will we get the benefit of the Wave?

I’ve read a few articles, played around a little, and chatted with people, but I’m still a complete novice. So, I’m not going to talk about the technical aspects of waving here. However, even now I can see that the power of Wave will not be in what’s available by default (as was the case with Gmail – you got an account, started using it, and that was pretty much it). The most value will be in the new applications, interfaces, and most especially the Robots that will be riding the Wave with us. OK, so I’ve only had an account for one day, but I think even as a beginner I can see that what we create for ourselves and our communities to use will make or break this new thing. And, as ‘we‘ are so much a requirement for this to work, my next point becomes pretty important.

What will it really take to get the best out of Wave for us researchers and scientists?

It will take many, many scientists participating. Social networking will need to become much more important to people who currently may just use e-mail and web browsing. This is exciting, but we’ll need their help. A very good slideshow about this by Sacha Chua can be found on Slideshare. Use it to convince your friends!

First steps.

As for me, I’ll be waving with both hands this Thursday at 2pm, when Cameron Neylon comes to talk about open science, Google Wave, and more. Unless Cameron is a fantastic multitasker, I may be the only one at the presentation with an account. Not sure how interesting it will be if I am the only one waving. I’ll post my experience of live blogging with Wave here, and let you know how it goes.

I’m also hoping that I can get some of my research out there into the wider world via Wave robots. I have an interest in structured information (ontologies, data standards etc) and think this may lead to some interesting things.

So, the sound of two hands waving? Pretty quiet, I think. But add another few hundred pairs of hands, and things may get a lot louder.

CISBAN Semantics and Ontologies Software and Tools

SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.

A partial overview of the classes (and number of individuals) in MFO.
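The mapping just described (elements to classes, attributes to properties, model data to individuals) can be illustrated with a toy converter. This is only a sketch of the idea, using stdlib XML parsing, a tiny hand-written SBML fragment, and a made-up namespace; it is not MFO’s actual code:

```python
import xml.etree.ElementTree as ET

# Toy illustration of the MFO-style mapping: SBML elements -> OWL classes,
# SBML attributes -> OWL properties, model data -> OWL individuals.
# The namespace and the SBML fragment are made up for the example.

SBML = """<sbml><model id="m1">
  <listOfSpecies>
    <species id="s1" name="glucose" compartment="cytosol"/>
  </listOfSpecies>
</model></sbml>"""

MFO = "http://example.org/mfo#"  # placeholder namespace, not MFO's real one

def sbml_to_triples(sbml_text):
    triples = []
    root = ET.fromstring(sbml_text)
    for species in root.iter("species"):
        individual = MFO + species.get("id")
        # the element name determines the OWL class of the individual
        triples.append((individual, "rdf:type", MFO + "Species"))
        # each XML attribute becomes a (datatype) property assertion
        for attr, value in species.attrib.items():
            triples.append((individual, MFO + attr, value))
    return triples

triples = sbml_to_triples(SBML)
```

The real MFO does far more than this (SBO cross-links, axioms from the specification document), but the element/attribute/individual split is the heart of the translation.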

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals!1 It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Pellet 2.0.0 rc7 performs the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO Subversion project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies, of the formats in which your data sources are available. By layering your ontologies like this, you can separate the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.
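That layering can be sketched as two explicit steps: a syntactic pass that converts each satellite source into one common record shape, then a semantic pass that maps those records onto core-ontology terms. All the record shapes, term names, and mapping rules below are invented to show the separation of concerns; they are not my actual pipeline:

```python
# Two-layer integration sketch: syntactic conversion first, semantic
# mapping second. Record shapes, terms, and rules are all invented.

# "Satellite" sources, each in its own native shape.
source_a = [("s1", "glucose")]                      # (id, label) tuples
source_b = [{"identifier": "x9", "label": "ATP"}]   # dicts with other keys

def syntactic_layer(source_a, source_b):
    """Convert every satellite source into one common record format."""
    records = [{"id": i, "label": lbl} for i, lbl in source_a]
    records += [{"id": d["identifier"], "label": d["label"]} for d in source_b]
    return records

# Toy mapping from labels to core-ontology terms.
CORE_TERMS = {"glucose": "core:Metabolite", "ATP": "core:Metabolite"}

def semantic_layer(records):
    """Map common-format records onto core-ontology terms."""
    return [(r["id"], CORE_TERMS.get(r["label"], "core:Unknown")) for r in records]

core_facts = semantic_layer(syntactic_layer(source_a, source_b))
```

The point of the split is that the semantic layer never sees the satellites’ native formats: by the time it runs, format differences have already been flattened away.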

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels, a database of simulatable biological models. I’m especially interested in BioModels as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. In BioModels, the “native” format for the models is SBML, though other formats are available. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.

How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it; let me know if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. Its purpose is to be a direct translation of SBML, SBO, and the SBML specification document into OWL format. Using an editor such as Protege, you can manipulate and create models, and then, using the MFO code, you can export the completed model back to SBML (the import feature is complete; the export feature isn’t finished yet, but will be shortly).
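To give a feel for the export direction, here is a toy sketch that serializes a species individual (represented as a plain dict of property assertions) back into SBML-shaped XML. It is deliberately simplified and is not MFO’s real exporter:

```python
import xml.etree.ElementTree as ET

# Sketch of the export direction: an OWL individual (here just a dict of
# property assertions) serialized back to SBML-shaped XML. Deliberately
# simplified; not MFO's actual export code.

individual = {"id": "s1", "name": "glucose", "compartment": "cytosol"}

def species_to_sbml(individual):
    sbml = ET.Element("sbml")
    model = ET.SubElement(sbml, "model")
    species_list = ET.SubElement(model, "listOfSpecies")
    # each property assertion becomes an XML attribute again
    ET.SubElement(species_list, "species", attrib=individual)
    return ET.tostring(sbml, encoding="unicode")

xml_text = species_to_sbml(individual)
```

The round trip is what matters: the same element/attribute mapping that got the model into OWL is simply run in reverse to get valid SBML back out.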

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a number of quite interesting features. BioPAX is a generic description of biological pathways that can handle many different types of interactions and pathway types, and it’s already in OWL. BioModels exports its models in BioPAX as well as SBML. (I’m not forgetting about BioPAX, either: it plays a large role in rule-based mediation for model annotation, but more on that in another post, perhaps.) So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that: it stores logical restrictions and axioms that are otherwise only found either in SBO itself or, more importantly, in the purely human-readable content of the SBML specification document2. MFO is therefore more than SBML: it carries extra constraints that aren’t present in the BioPAX version of SBML, and so I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

Meetings & Conferences Semantics and Ontologies Software and Tools

TT47: Semantic Data Integration for Systems Biology Research (ISMB 2009)

Chris Rawlings. Also speaking: Catherine Canevet and Paul Fisher

BBSRC-funded research collaboration between Newcastle, Manchester, and Rothamsted: ONDEX and Taverna. Demo: integration and augmentation of the yeast metabolome model (Nature Biotech, October 2008, 26(10)). Presented: Taverna and ONDEX. In ONDEX, everything can be seen as a network. To help with this, ONDEX contains an ontology of concept classes, relation types, and additional properties. Their example is the yeast jamboree data integration. They have both specific (e.g. KEGG) and generic (e.g. tab-delimited) parsers to load in data.

When ONDEX works with Taverna, instead of using the pipeline manager you use the ONDEX web services and access ONDEX from Taverna. This means you can use Taverna to pull data into ONDEX. So, first parse the jamboree data into ONDEX and remove currency metabolites (e.g. ATP, NAD). Add publications to the graph, from which domain experts can view and manually curate the data. Then annotate the graph using network analysis results. Next, switch to Taverna and identify the orphans discovered in ONDEX. Retrieve the enzymes relating to the orphans, assemble the PubMed query, and then add hits back to the ONDEX graph. Finally, have a look at the completed visualization. Use the ONDEX pipeline manager to upload data – it’s all in a GUI, which is good.
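The currency-metabolite step in that pipeline is easy to picture on a toy graph. A sketch, using an invented edge-list representation rather than ONDEX’s actual data model:

```python
# Toy version of the "remove currency metabolites" step: drop highly
# connected cofactors (ATP, NAD, ...) so they don't link everything to
# everything. Edge-list graph representation invented for the example.

CURRENCY = {"ATP", "ADP", "NAD", "NADH"}

edges = [
    ("hexokinase", "glucose"),
    ("hexokinase", "ATP"),
    ("dehydrogenase", "NAD"),
    ("dehydrogenase", "lactate"),
]

def remove_currency(edges, currency=CURRENCY):
    """Keep only edges that touch no currency metabolite."""
    return [(a, b) for a, b in edges if a not in currency and b not in currency]

filtered = remove_currency(edges)
```

Without this filter, cofactors like ATP would act as hubs connecting nearly every reaction, drowning the biologically interesting structure of the network.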

Then followed a live demo.


Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!