PTO6: Ontology Quality Assurance Through Analysis of Term Transformations (ISMB 2009)

Karin Verspoor

This work came out of a meeting about OBO quality assurance in GO, but it is applicable to any controlled vocabulary. The key quality concern is univocality, or a shared interpretation of the nature of reality, a term that goes back to Spinoza in 1677. David Hill intended it to mean something slightly different: consistency of expression of concepts within an ontology. This consistency facilitates human usability, and computational tools can make use of the regularity.

The aim was to identify violations of univocality: two semantically similar terms with different structure in their term labels. GO is generally very high quality, so computational tools are needed to find the remaining inconsistencies. They chose a simple approach of term transformation and clustering, as it’s good to start with the simplest stuff first. The first step is abstraction, which is the substitution of embedded GO and ChEBI terms with the variables GTERM and CTERM, respectively. Then comes stopword removal (high-frequency words like the, of, via). Next is alphabetic reordering (to deal with word order variation in the terms). They tried all the different combinations of transformation ordering to see how the results differed.
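To make the three transformations concrete, here is a minimal sketch of how they might be chained and then used for clustering. This is my own illustration, not the authors' code: the tiny `GO_TERMS`, `CHEBI_TERMS` and `STOPWORDS` lists and the function names are all assumptions, and the real work used the full GO and ChEBI vocabularies.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabularies standing in for the real GO and ChEBI term lists.
GO_TERMS = ["cell differentiation", "transport"]
CHEBI_TERMS = ["glucose", "calcium ion"]
# Illustrative stopword list; the talk mentioned words like "the", "of", "via".
STOPWORDS = {"the", "of", "via", "in"}


def abstract(label, embedded_terms, placeholder):
    """Replace embedded vocabulary terms in a label with a placeholder variable."""
    # Longest terms first, so a short term does not break up a longer one.
    for term in sorted(embedded_terms, key=len, reverse=True):
        if term != label:  # a label is not treated as embedded in itself
            label = re.sub(r"\b" + re.escape(term) + r"\b", placeholder, label)
    return label


def transform(label):
    """Abstraction, then stopword removal, then alphabetic reordering."""
    label = abstract(label, CHEBI_TERMS, "CTERM")
    label = abstract(label, GO_TERMS, "GTERM")
    words = [w for w in label.split() if w.lower() not in STOPWORDS]
    return " ".join(sorted(words))


def cluster(labels):
    """Group the original labels by their transformed form."""
    clusters = defaultdict(list)
    for label in labels:
        clusters[transform(label)].append(label)
    return dict(clusters)
```

With these toy lists, "regulation of glucose transport" and "glucose transport regulation" both transform to the same key and so land in the same cluster, which is the kind of pair a curator would then inspect.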

20% of the abstraction was due to CTERMs and 30% due to GTERMs. The distribution of cluster sizes changed radically with transformation: the maximum cluster size was 29 before transformation and ~3000 after. In the end, they found 237 clusters that may contain a univocality violation. These were found by looking for terms that were in different clusters after abstraction but merged together after one of the other transformations. A further 190 clusters had to be manually assessed; this reduced the number of things that had to be looked at manually. They discovered 67 true positive violations of univocality (35%, which is consistent with 67 of the 190 assessed clusters). They already have ideas for improving this step.
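The candidate-selection idea (labels that differ after abstraction but merge after a further transformation) can be sketched as follows. Again, this is only my illustration under stated assumptions: the input is a list of labels that have already been abstracted, and `STOPWORDS`, `normalise` and `candidate_clusters` are hypothetical names, not anything from the talk.

```python
from collections import defaultdict

# Illustrative stopword list (an assumption, not the list used in the talk).
STOPWORDS = {"the", "of", "via", "in"}


def normalise(abstracted_label):
    """Apply stopword removal and alphabetic reordering to an abstracted label."""
    words = [w for w in abstracted_label.split() if w.lower() not in STOPWORDS]
    return " ".join(sorted(words))


def candidate_clusters(abstracted_labels):
    """Return clusters that merge two or more distinct abstracted forms.

    Labels that were already identical after abstraction are not interesting;
    a cluster is a candidate only if the later transformations merged
    labels that abstraction alone had kept apart.
    """
    merged = defaultdict(set)
    for label in abstracted_labels:
        merged[normalise(label)].add(label)
    return {key: sorted(forms) for key, forms in merged.items() if len(forms) > 1}
```

Each returned cluster would still need a human to decide whether it is a true violation or a false positive.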

The 67 clusters comprise 317 GO terms. 45% of true positive (TP) inconsistencies were {Y of X} | {Y in X}. A further 16% of TPs were cases where there was a determiner (e.g. “the”) in one version and not in another. A smaller number of TPs dealt with inverses, etc. 50% of false positives (FPs) were due to the semantic import of a stopword: some of the stopwords actually carry meaning and shouldn’t have been removed, and removing them erased the difference between the two terms.

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
