Archive for the ‘Semantics and Ontologies’ Category

SBML in OWL: some thoughts on Model Format OWL (MFO)

August 21, 2009

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.
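To make the element-to-class and attribute-to-property idea concrete, here is a minimal sketch in Python. It is not the MFO code: the toy SBML fragment, the REFERENCE_ATTRIBUTES set, the capitalised class names, and the "object:"/"datatype:" prefixes are all assumptions made for illustration. It only shows how each SBML element with an id could become an individual of a class named after the element, with its attributes turned into property assertions.

```python
import xml.etree.ElementTree as ET

# Toy SBML fragment (hypothetical); real BioModels entries are much larger.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy">
    <listOfCompartments>
      <compartment id="cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cell" initialAmount="10"/>
    </listOfSpecies>
  </model>
</sbml>"""

# Assumption for this sketch: attributes that point at another element become
# object properties; every other attribute becomes a datatype property.
REFERENCE_ATTRIBUTES = {"compartment"}


def local_name(tag):
    """Strip the XML namespace from a tag such as '{ns}species'."""
    return tag.rsplit("}", 1)[-1]


def sbml_to_assertions(xml_text):
    """Return (subject, predicate, object) triples for every element with an id."""
    triples = []
    for element in ET.fromstring(xml_text).iter():
        identifier = element.get("id")
        if identifier is None:
            continue
        # Element name -> class name (capitalised purely by convention here).
        name = local_name(element.tag)
        owl_class = name[0].upper() + name[1:]
        triples.append((identifier, "rdf:type", owl_class))
        for attribute, value in element.attrib.items():
            if attribute == "id":
                continue
            kind = "object" if attribute in REFERENCE_ATTRIBUTES else "datatype"
            triples.append((identifier, f"{kind}:{attribute}", value))
    return triples


for triple in sbml_to_assertions(SBML):
    print(triple)
```

In this toy version the species "glucose" ends up typed as Species, linked to the individual "cell" through an object property, and given its initialAmount through a datatype property, which is roughly the shape of the individuals described above.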

A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals! [1] It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Of the reasoners tried so far, Pellet 2.0.0 rc7 is the slowest over MFO and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO Subversion project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies, of the format in which your data sources are available. By layering your ontologies like this, you can separate the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels, a database of simulatable biological models whose “native” format is SBML, though other formats are available. I’m especially interested in BioModels because the ultimate point of this research is to help the modellers where I work annotate and create new models. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.

How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. Its purpose is to be a direct translation of SBML, SBO, and the SBML Specification document into OWL format. Using an editor such as Protege, you can manipulate and create models, and then using the MFO code you can export the completed model back to SBML (the import feature is complete; the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features. However, I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathway types. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that: it stores logical restrictions and axioms that are otherwise stored only in SBO itself or, more importantly, in the purely human-readable content of the SBML specification document [2]. MFO is therefore SBML plus a set of extra constraints that aren’t present in the BioPAX version of SBML, so I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

Attribution vs Citation: Do you know the difference?

July 10, 2009

This is a cross-posted, two-author item available both from my and Frank Gibson’s blog (his post).

Often the words “attribution” and “citation” are used interchangeably. However, in the context of ensuring your work gets the referencing it deserves when others make use of it, it is important that the differences between these two concepts are clear. This article outlines the differences between attribution and citation, and suggests that what most scientists are interested in is not attribution, which can be ensured via licensing restrictions, but instead citation, which is a much tougher nut to crack.

From xkcd, at

At ISMB last week, there were a number of conversations about the difference between attribution and citation. This topic was brought up again yesterday in a conversation between the authors of this post. It is an important distinction which is explored in this post.

First, some definitions for attribution and citation. These are not the only definitions possible, but for the purposes of this discussion, please keep these in mind.

Attribution: Acknowledgement of the use of someone else’s information, data, or other work. Crucially, while Wikipedia has a fairly straightforward definition of citation, it does NOT mention even common ways that attribution should be implemented (see Wikipedia attribution page).

Citation: When you publish a paper that makes use of someone else’s information (data, ontology, etc.), you include in that paper a reference to the work of that other person or group. Wikipedia states that it is a “reference to a published or unpublished source” whose prime purpose is of “intellectual honesty”.

Distinguishing between attribution and citation.
You can imagine that citation is a specific type of attribution, but attribution itself can be performed in any number of ways. For scientists, citation is much more useful to their careers as a result of the publish or perish environment.

So, what could attribution consist of? First, let’s take as an example the re-use of someone else’s ontology or specific sub-parts or classes of that ontology. Each class in an ontology is identified by a URI. Therefore, is importing the URI enough? With a URI, is it clear where you got the class from? If it’s not enough, where do you put the reference or statement that you are re-using other classes: within the overall metadata of your own ontology? Alternatively, when attributing data, is a reference to the originating paper or to the URL from which you downloaded the data enough? Where do you put that reference: within the metadata of your own document? As a citation? How much is enough attribution?

These questions cannot easily be answered.

A common-sense answer to the question of properly fulfilling attribution requirements is to, at a minimum, first cite the other person’s information in your paper, and second include the URL(s)/URI(s) in your metadata. But here we get to the crux of the matter: we’ve now stated that a useful way to ensure attribution is to cite the other person. But, if you think carefully, what’s more important for your impact assessments, and your work? It’s actually the citation itself. Sure, acknowledgement via extra referencing in the metadata of the person using your information is great, but what you really need is a citation in their work. If we aren’t careful, we will all make the easy mistake of conflating citation in papers with importing a licensed piece of information and marking its inclusion: the former is what we often are scored on and what we would really like, while the latter is the only thing a license enforces. Licensing with attribution requirements is not citation; you can make use of a licensed ontology, but this does not require you to cite it in a paper.

Attribution: the legal entity.

Important point: It’s easy to use a license such as the CC-BY, thinking that you’ll ensure citation, when in fact all you’re doing is ensuring attribution.

What are the implications of attribution? It can quickly get out-of-control and difficult to manage.
By requiring attribution in an ontology or data file, if someone imports information (such as a class from an ontology) into their own document, the new one must attribute the original. Continuing the ontology analogy, if there are 20-30 ontologies being used for a single project (which is not inconceivable in the coming years), there could be great difficulty in maintaining attribution for them all.

Important point: While licenses such as the CC-BY allow the attribution to be performed “in the manner specified by the author or licensor”, this could lead to 30 different licensors requiring potentially 30 different methods of attribution, and attribution stacking isn’t pretty.

Citation: the gentlemen’s club.

Can citation be assured? No. Well, maybe.
You can imagine citation as a gentlemen’s club, as propriety dictates that you should cite another’s work that you use, but there is no legal requirement to do so. Indeed, many believe that citation should not be enforced anyway. In contrast, attribution as required by licenses is a legal statement. However, let’s revisit the clause in CC-BY that states the author/licensor can specify the manner in which the attribution is given.

Important point: Could you use a license such as CC-BY, and state that the attribution must come in the form of, at a minimum, citation in the paper which describes the work being performed by the licensee?

Bottom line: which one is more important to you, as a scientist? Depends on the context.
This is difficult to answer. There aren’t very many guidelines available for us to analyse. The OBO Foundry does have a set of principles, the first of which states that “their [the ontology(ies) and their classes] original source is always credited and that after any external alterations, they must never be redistributed under the same name or with the same identifiers”. However, how this credit is attained is unclear, as described in various blog posts (Allyson, Frank, Melanie). As a result, the following conclusions came out of the OBO Foundry workshop this summer (Monday outcomes): it is “unclear if each ontology should develop their own bespoke license or use develop ‘CC-by’; how to give attribution? Generally use own judgment, here MIREOT mechanism can help when importing external terms into an ontology, giving class level attribution” (MIREOT web page, see also OWLED 2008 paper). Therefore, while they are aware of the problem, they don’t offer a consensus solution(s).

The flipside of this is that in order to use an ontology, you first have to write a paper and cite the classes you wish to import, then get on with the work. If you never get a paper and therefore a citation, is your ontology/data illegal? If you take the example of OBI, which imports several other ontologies and is an open community of developers, would a license restriction requiring citation actually prevent the work starting? This is probably a bit of a chicken-and-egg scenario, if it were ever to become a reality. In short, while there are some tempting possibilities, there doesn’t yet seem to be a useful solution.

In summary, it’s generally not attribution that people want (which can be licensed, even if you don’t like the layers of attribution that will require once you’re using multiple sources) but citation, which isn’t so easily licensed – yet. When deciding what sort of license to use (e.g. an open one like CC0 or an attribution-based one like CC-BY), you need to take into account expected usage. In some cases, for a leaf ontology, perhaps CC-BY is appropriate, as it isn’t intended to be imported by others, but you never know when your leaf will turn into something others import. Science Commons also believes that attribution is a very different beast, and shouldn’t be required when licensing data. They provided me with an answer to how to license ontologies recently that favored CC0.

So, if you really want citation and not attribution, consider an open license such as CC0 and make a gentlemanly (gentle-science-person-ly) request that if someone uses it AND publishes a paper on it, please cite it in the way you suggest. Alternatively, I’d be interested to hear if it would be possible to use an attribution-based license such as CC-BY and then require the attribution method be citation in a paper. Would this method work, and would it be polite? Your comments, please.

TT47: Semantic Data Integration for Systems Biology Research (ISMB 2009)

July 2, 2009

Chris Rawlings. Also speaking: Catherine Canevet and Paul Fisher

BBSRC-funded research collaboration in Newcastle, Manchester, and Rothamsted: ONDEX and Taverna. Demo: Integration and augmentation of yeast metabolome model (Nature Biotech October 2008 26(10)). Presented: Taverna and ONDEX. In ONDEX, everything can be seen as a network. To help with this, ONDEX contains an ontology of concept classes, relation types, and additional properties. Their example is yeast jamboree data integration. They have both specific (e.g. KEGG) and generic (e.g. tab delimited) parsers to load in data.

When ONDEX works with Taverna, instead of using the pipeline manager you use the ONDEX web services and access ONDEX from Taverna. This means you can use Taverna to pull data into ONDEX. So, first parse jamboree data into ONDEX and remove currency metabolites (e.g. ATP, NAD). Add publications to the graph, from which domain experts can view and manually curate that data. Finally, annotate the graph using network analysis results. Then switch to Taverna and identify orphans discovered in ONDEX. Retrieve the enzymes relating to the orphans, assemble the PubMed query, and then add hits back to the ONDEX graph. Finally, have a look at the completed visualization. Use the ONDEX pipeline manager to upload data – it’s all in a GUI, which is good.

Then followed a live demo.


Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

TT16: Ontology Services for Semantic Applications in Healthcare and Life Sciences (ISMB 2009)

June 30, 2009

Patricia Whetzel, Outreach Coordinator for NCBO

Trish has recorded her talk as a screencast as she wanted to do a demo, and she can’t trust the wireless – true enough! RESTful web services have been developed at the NCBO within BioPortal. (Note that the services URL is just the prefix for all services; if you go to that URL on its own there isn’t anything visible.) They chose RESTful services as they are lightweight and easy to use. All information on the main BioPortal website is retrieved using those web services. Ontologies can be stored in OWL, OBO and Protege frames formats.

You can search ontologies based on a number of parameters. Much help information is available via mouseover text. You can also download ontologies that are available on BioPortal. When browsing your ontologies you can see the structure, the metadata, definitions and more. There are also ontology widgets that you can put on your own site, including jump-to feature and term selection widget. This latter one is very useful because it allows your web app to use term auto-complete without having to code it yourself!

To go into the search web services a little bit more, take for instance a search for “protocol”. The search can be parameterized and filtered in many ways: which ontology to use, exact or non-exact matching, etc. The search function is especially important for ontology re-use. For instance, if you’re developing a new domain ontology, then you want to make sure you don’t reinvent the wheel and this is a good way to find out what’s out there. The next bit of the video showed using these searches via programmatic means.

BioPortal also allows you to annotate, or add notes, to ontologies. There is also an annotation tag/term cloud in the interface, which is nice :) You may see duplicates in the tag cloud: it is designed this way to show that more than one ontology has that term. There are also hierarchy services. You can view the parent terms of a particular term, and do other sorts of queries that allow you to explore the hierarchy around a term programmatically. On the web app, they have a dynamic visualization of the hierarchy that you can play with.

HL13: The Human Phenotype Ontology (ISMB 2009)

June 29, 2009

Peter Robinson

MIM started in 1966 and has been online (as OMIM) for over a decade. It has been extremely difficult to use computationally in a large-scale fashion. The hierarchical structure of OMIM does not reflect that two terms are more closely related than a third. In constructing the HPO, all descriptions used at least twice (~7000) were assigned to HPO. It now has about 9000 terms and annotations for 4813 diseases. They have a procedure which calculates the phenotypic similarity of two terms by finding their most-specific common ancestor.
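As a rough illustration of similarity via a most-specific common ancestor, here is a small Python sketch. It is not the HPO group's implementation: the toy ontology, the term names, and the frequencies are made up, and scoring specificity by information content (the negative log of how often a term is used) is my assumption about one common way to do this, not something stated in the talk.

```python
import math

# Hypothetical toy ontology: term -> list of parents (not real HPO terms).
PARENTS = {
    "abnormality": [],
    "skeletal": ["abnormality"],
    "cardiac": ["abnormality"],
    "limb": ["skeletal"],
    "short_finger": ["limb"],
    "long_finger": ["limb"],
}

# Hypothetical fraction of diseases annotated to each term or its descendants.
FREQUENCY = {
    "abnormality": 1.0,
    "skeletal": 0.4,
    "cardiac": 0.3,
    "limb": 0.2,
    "short_finger": 0.05,
    "long_finger": 0.1,
}


def ancestors(term):
    """Return the term itself plus every term reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Rarer terms are more specific, so they score higher."""
    return -math.log(FREQUENCY[term])


def most_specific_common_ancestor(a, b):
    """Pick the shared ancestor with the highest information content."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=information_content)


def similarity(a, b):
    """Score two terms by how specific their best shared ancestor is."""
    return information_content(most_specific_common_ancestor(a, b))


print(most_specific_common_ancestor("short_finger", "long_finger"))  # limb
print(most_specific_common_ancestor("short_finger", "cardiac"))      # abnormality
```

Two finger terms share the fairly specific ancestor "limb" and so score higher than a finger term paired with a cardiac term, whose only shared ancestor is the root.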

You can visualize the human phenome using HPO. They also have a query system that allows physicians to query what’s in the ontology. Also there is the Phenomizer, which is “next-generation diagnostics”. You can get a prioritized list of candidates. To validate the approach, they took 44 syndromes and went to the literature to look at their frequency, then generated patients at random using the features of the disease. For each simulated patient, queries were generated using HPO terms. Ranks of the disease returned by the Phenomizer were compared to the original diagnosis. Comparisons were performed with phenotypic noise. In an ideal situation (no noise or imprecision), their approach has some advantage. When noise or imprecision is added, the p-value stays ok but other measures drop. They also use the information to get disease-gene families.

HPO and PATO are talking to each other. HPO is being used as a link between cellular networks and HP. They also want you to annotate your data with HPO. If you’re interested, find out more about the HPO Consortium.

PTO6: Ontology Quality Assurance Through Analysis of Term Transformations (ISMB 2009)

June 29, 2009

Karin Verspoor

This work came out of a meeting talking about OBO quality assurance in GO. The work described here is applicable to any controlled vocabulary. The key quality concern is univocality, or a shared interpretation of the nature of reality; the term was originally coined by Spinoza in 1677. David Hill intended it to mean something slightly different: consistency of expression of concepts within an ontology. This facilitates human usability, and computational tools can utilize this regularity.

Try to identify cases where there were violations of univocality: two semantically similar terms with different structure in their term labels. GO is generally very high quality: need computational tools to identify inconsistencies. They chose a simplistic approach of term transformation and clustering, as it’s good to start with the simplest stuff first. First step is abstraction, which is substitution of embedded GO and ChEBI terms with variables GTERM and CTERM, respectively. Then there was stopword removal (high frequency words like the, of, via). Next is alphabetic reordering (to deal with word order variation in the terms). They tried all different combinations of transformation ordering, to see how they were different.
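The transformation steps above can be sketched in a few lines of Python. This is only my reading of the talk, not the authors' code: the term lists, the stopword set, and the example labels are hypothetical, and the clustering here simply groups labels that become identical after the chosen transformations.

```python
from collections import defaultdict

# Hypothetical stand-ins for the embedded GO and ChEBI term lists.
GO_TERMS = ["cell growth", "transcription"]
CHEBI_TERMS = ["glucose", "calcium"]
STOPWORDS = {"the", "of", "via", "in"}

# Hypothetical term labels, written to resemble GO-style labels.
LABELS = [
    "regulation of cell growth",
    "cell growth regulation",
    "regulation of transcription",
    "transport of glucose",
    "glucose transport",
    "calcium transport",
]


def abstract(label):
    """Replace embedded GO/ChEBI terms with the variables GTERM/CTERM."""
    for term in GO_TERMS:
        label = label.replace(term, "GTERM")
    for term in CHEBI_TERMS:
        label = label.replace(term, "CTERM")
    return label


def remove_stopwords(label):
    """Drop high-frequency words such as 'the', 'of', 'via'."""
    return " ".join(word for word in label.split() if word not in STOPWORDS)


def reorder(label):
    """Sort the words so that word-order variation no longer matters."""
    return " ".join(sorted(label.split()))


def cluster(labels, transformations):
    """Group labels that are identical after applying the transformations in order."""
    clusters = defaultdict(list)
    for label in labels:
        key = label
        for transform in transformations:
            key = transform(key)
        clusters[key].append(label)
    return dict(clusters)


after_abstraction = cluster(LABELS, [abstract])
after_everything = cluster(LABELS, [abstract, remove_stopwords, reorder])
print(len(after_abstraction), "clusters after abstraction only")
print(len(after_everything), "clusters after all three transformations")
```

In this toy data, "regulation of cell growth" and "cell growth regulation" sit in different clusters after abstraction alone but fall into the same cluster once stopwords are removed and words reordered, which is the kind of pair a curator would then inspect by hand.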

20% of abstraction was due to CTERM, and 30% due to GTERMs. The distribution of cluster sizes changed radically after transformation: the maximum cluster size was 29 before and ~3000 after. In the end, they found 237 clusters that may contain a univocality violation, by looking for terms that were in different clusters after abstraction but merged together after one of the other transformations. A further 190 clusters had to be manually assessed; this has reduced the number of things that had to be looked at manually. They discovered 67 true positive violations (35%) of univocality, and already have ideas for improving this step.

The 67 clusters constitute 317 GO terms. 45% of true positive inconsistencies were {Y of X} | {Y in X}. There were a further 16% of TP where there were determiners in one version (e.g. “the”) and not in another version. A smaller number of TP dealt with inverses, etc. 50% of FP were due to the semantic import of a stopword: some of the stopwords actually carry meaning and shouldn’t have been removed, and removing them erased the difference between the two terms.

PTO4: Alignment of the UMLS Semantic Network with BioTop Methodology and Assessment (ISMB 2009)

June 29, 2009

Stefan Schulz

Ontology alignment is the linking of two ontologies by detecting semantic correspondences in their representational units (RUs), e.g. classes. Mainly done via equivalence and subsumption. BioTop is a recent development created to provide formal definitions of upper-level types and relations for the biomedical domain. It is compatible with both BFO and DOLCE lite. It links to OBO ontologies. UMLS Semantic Network (SN) is an upper-level semantic categorization framework for all concepts of the UMLS Metathesaurus. It is mainly unchanged in the last 20 years: a tree of 135 semantic types.

If you compare the two, the main difference is in the semantics, as the BioTop semantics are explicit and use Description Logics (DL), which means you’re also subscribing to the open-world assumption (OWA). The semantics of UMLS-SN is more implicit, frame-like and may be closed world. It also has the possibility to block relation inheritance, which isn’t possible with DL.

The methodology is first to provide DL semantics to the UMLS SN, and second to build the bridge between BioTop and UMLS SN. How do we do the first step? For semantic types: types extend to classes of individuals; subsumption hierarchies are assumed to be is_a hierarchies; and there are no explicit disjoint partitions. For semantic relations: reified as classes, NOT represented as OWL object properties. For triples: transformed into OWL classes with domain and range restrictions. Why did we convert relations to classes? Didn’t want to inflate the number of BioTop relations, and there are other structural reasons. If you reify the relation, you can provide complex restrictions on that relation. Also, it means you can formally represent the UMLS SN tags such as “defined not inherited” in a more rigorous way.

Mapping is fully manual using Protege 4, with consistency checks using FaCT++ and Pellet supported by the explanation plugin (Horridge ISWC 2008) – they spent most of their time fighting against inconsistent TBoxes. It was an iterative process. Assessment is next. Using SN alone there is very low agreement with expert rating. Using SN+BioTop there were very few rejections (only 3), and the results agreed with all expert ratings. Possible reasons could be to do with the DL’s OWA and, for the false positives, that the expert rating was done on NE but system judgments were done on something else. There were inconsistent categorizations of UMLS SN objects which exposed hidden ambiguities (e.g. that Hospital was both a building and an organisation).

Allyson’s questions: Why decide to create BioTop and not use BFO or DOLCE lite? It’s not that I would necessarily suggest that these be used, I am just curious. Also, subsumption hierarchies are assumed to be is_a hierarchies, but is that a safe assumption in UMLS SN? For instance, in older versions of GO this would have been a problem (some things marked as subsumption were not in fact is_a, though I am pretty sure GO has fixed all of this now).

