
Patrick Lambrix: Ontology alignment and applications

1 February 2010, Workshop on Modeling biological systems: Bio-model similarities and differences based on semantic information
Presentation by Patrick Lambrix

Abstract: “The use of ontologies is a key technology for the Semantic Web and in particular in the biomedical field many ontologies have already been developed. Many of these ontologies, however, contain overlapping information and to make full use of the advantages of ontologies it is important to know the inter-ontology relationships, i.e. we need to align the ontologies. Knowledge of these alignments would lead to improvements in search, integration and analysis of biomedical data. In this talk we give an overview of techniques for ontology alignment with a focus on approaches that compute similarity values between terms in the different ontologies. Further, we discuss the results of evaluations of these techniques using biomedical ontologies. We also give examples of the use of ontology alignment in applications such as literature search.”

Ontologies are used for communication between people and organisations (to paraphrase Michael Ashburner, it’s easier for biologists to share their toothbrushes than to share their definition of a gene). Ontologies are also used for enabling knowledge re-use and sharing, and as a basis for interoperability between systems. They are repositories of information, and as such can be queried or used as a query model for information sources. They are considered a key technology for the Semantic Web.

(He then described GO and OBO.) Examples of large-scale biomedical ontology efforts are the OBO Foundry and the NCBO. In systems biology, there are the SBO, BioPAX and PSI efforts too.

There are many biomedical ontologies, and a large number of those are used by researchers. As often happens when multiple ways of describing things are in use, many ontologies have overlapping concepts. For instance, two ontologies may have different views of the same domain or different levels of granularity; in other cases, you may have built an in-house custom ontology that you wish to align with a standard one; finally, you may be building a bottom-up ontology that you wish to attach to an upper-level ontology.

Lambrix and others have developed a standard alignment framework. A number of different matchers are used to compare ontologies. Suggestions are validated by the user, and accepted and rejected choices are stored and used in later iterations. Preprocessing of the ontologies begins with the selection of relevant features of the ontologies and the selection of the search space (you can suggest appropriate “highly mappable” areas).

The matchers have a number of strategies. The first depends on linguistic matching, e.g. of names, definitions and synonyms. These linguistic matchers look at the edit distance (the number of character changes needed to turn one string into the other) and at n-gram sets. An n-gram is a run of n consecutive characters in a string, and similarity is based on a comparison of the n-grams of the two strings. Next come the structure-based strategies. Here, if it is known that two concepts in different ontologies are equivalent, then chances are that their children and parents are similar as well. This is called propagation of similarities. Anchored matching is where two pairs of concepts at different levels of the hierarchy are already known to be equivalent, so the concepts lying between them are likely to match too. The third type of matcher strategy is the constraint-based approach. On their own, these approaches aren’t good at finding new mappings, but they are good at helping to reject prospective matches. A fourth type is the instance-based matcher, where you could use life-science literature as instances: you can try to use entities annotated with a particular concept as instances of that concept. You could define a similarity measure between concepts using a basic naïve Bayes matcher (one per ontology). Here, each concept was used as a query term in PubMed, and the most recent PubMed abstracts were retrieved. Each Bayes classifier took the abstracts related to the concepts of one ontology and classified each of them under the concept in the other ontology with the highest posterior probability. A final type of matcher is one that uses auxiliary information. For example, you could use WordNet to find synonyms.
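As a rough illustration of the two linguistic measures mentioned above, here is a minimal Python sketch. It is not SAMBO’s code: I am assuming Levenshtein distance for the edit distance, character trigrams for the n-grams, and the Dice coefficient for comparing n-gram sets, none of which was specified in the talk. The function names are my own.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def ngrams(s: str, n: int = 3) -> set:
    """Set of all runs of n consecutive characters in s (case-insensitive)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the n-gram sets of a and b (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(ngram_similarity("nasal cavity", "Nasal_Cavity"))
```

A real matcher would normalise the edit distance by string length and combine the scores for names, synonyms and definitions, but the weighting is a design choice I have no details on.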

Most systems that perform ontology alignment combine different matcher methodologies. Lambrix and colleagues have written a paper listing many of these systems and, more recently, a book chapter. Most systems use single-threshold filtering, where pairs of concepts with a similarity greater than or equal to the threshold become mapping suggestions. With this method, if you set a high threshold you don’t find many suggestions, but those you do find are good; with a low threshold you find more suggestions, but more of them are wrong. How do you find the best threshold level?

Lambrix and others suggest a method called double-threshold filtering. Here, pairs of concepts with a similarity greater than or equal to the upper threshold are suggestions. Pairs between the lower and upper thresholds are mapping suggestions only if they make sense with respect to the structure of the ontologies and the suggestions above the upper threshold. This approach works very well if your structure is good (by which he means complete). From this, you can build an alignment system: theirs is called SAMBO.
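To make the difference between the two filtering methods concrete, here is a small Python sketch. The talk did not spell out how “makes sense with respect to the structure” is checked, so the `consistent` test below is my own simplified assumption: a candidate pair is kept only if it sits on the same side (above or below) of every high-confidence pair in both is-a hierarchies. The toy ontologies, similarity values and thresholds are all made up for illustration.

```python
def ancestors(node, parent):
    """All is-a ancestors of node, given a child -> parent dict
    (single inheritance assumed, for simplicity)."""
    seen = set()
    while node in parent:
        node = parent[node]
        if node in seen:  # guard against accidental cycles
            break
        seen.add(node)
    return seen


def single_threshold(sims, threshold):
    """Keep every pair whose similarity is >= threshold."""
    return {pair for pair, s in sims.items() if s >= threshold}


def consistent(a, b, anchors, parent1, parent2):
    """Hypothetical structural check: (a, b) must not 'cross' any anchor."""
    anc_a, anc_b = ancestors(a, parent1), ancestors(b, parent2)
    for x, y in anchors:
        # a lies below x exactly when b lies below y ...
        if (x in anc_a) != (y in anc_b):
            return False
        # ... and a lies above x exactly when b lies above y.
        if (a in ancestors(x, parent1)) != (b in ancestors(y, parent2)):
            return False
    return True


def double_threshold(sims, lower, upper, parent1, parent2):
    """Pairs >= upper are accepted outright; pairs in [lower, upper)
    are accepted only if structurally consistent with those."""
    anchors = single_threshold(sims, upper)
    accepted = set(anchors)
    for (a, b), s in sims.items():
        if lower <= s < upper and consistent(a, b, anchors, parent1, parent2):
            accepted.add((a, b))
    return accepted


# Toy is-a hierarchies (child -> parent) and matcher scores, all invented.
parent1 = {"nose": "head", "nasal cavity": "nose", "foot": "leg"}
parent2 = {"Nose": "Head", "Nasal_Cavity": "Nose", "Foot": "Leg"}
sims = {
    ("nose", "Nose"): 0.95,
    ("nasal cavity", "Nasal_Cavity"): 0.60,
    ("nasal cavity", "Foot"): 0.55,
    ("foot", "Foot"): 0.30,
}

if __name__ == "__main__":
    print(sorted(single_threshold(sims, 0.5)))
    print(sorted(double_threshold(sims, 0.5, 0.8, parent1, parent2)))
```

In this toy case a single threshold of 0.5 lets the wrong pair (“nasal cavity”, “Foot”) through, while the double-threshold version rejects it because it conflicts with the anchor (“nose”, “Nose”). It also shows why incomplete structure hurts: with missing is-a links, the check has less to go on.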

The biggest evaluation done for alignments is via the Ontology Alignment Evaluation Initiative (OAEI), which has been around since 2004. There are different tracks. The track in which most groups participate is the comparison/benchmark track (open: you know what the result should be). Most of the others are blind: you don’t know what the answer should be ahead of time. In 2007, 17 groups participated; AOAS (from the UMLS people) came first in anatomy and SAMBO came second (though it was better than AOAS in terms of non-trivial matches). Still, 40% of the non-trivial matches were not found.

In the 2008 anatomy track, the idea was to align the mouse anatomy and the NCI anatomy. The organisers say that there are 1544 mappings, of which 934 are trivial, as these two groups have been working together already. The tasks were to align and optimise, and to see if you could use already-given mappings to improve your results. SAMBO didn’t participate, but the organisers think that it would have won if it had been entered. Interestingly, their “improved” SAMBO (SAMBOdtf) didn’t perform well. They think this was because the structure wasn’t complete, and the SAMBO people found a number of missing is-a relationships in the starting ontology(ies). About 50% of the mappings were found both by systems using background knowledge (BK) and by those not using it. 13% were found by each type of system but not by the other. The rest (approx. 25%) were not found by either type of system. SAMBOdtf uses the structure, and they found that it could improve both precision and recall in other circumstances.
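For readers unfamiliar with the measures mentioned above, precision is the fraction of suggested mappings that are correct, and recall is the fraction of the reference mappings that were found. A minimal Python sketch follows; the function name and the toy mappings are mine, not from the talk or the OAEI tooling.

```python
def evaluate(suggested, reference):
    """Return (precision, recall, f_measure) of a suggested alignment
    against a reference alignment, both given as sets of concept pairs."""
    correct = len(suggested & reference)
    precision = correct / len(suggested) if suggested else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: 3 suggestions, 2 of them correct, 4 reference mappings.
suggested = {("a", "A"), ("b", "B"), ("c", "X")}
reference = {("a", "A"), ("b", "B"), ("d", "D"), ("e", "E")}

if __name__ == "__main__":
    precision, recall, f_measure = evaluate(suggested, reference)
    print(f"precision={precision:.2f} recall={recall:.2f} f={f_measure:.2f}")
```

Using the organisers’ figures, 1544 - 934 = 610 of the reference mappings are non-trivial, and it is recall on those 610 that separates the systems.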

Current issues in this field include: complex ontologies, the use of instance-based techniques, the use of alignment types (determining the type of relation rather than just equivalence), complex mappings (1-n, m-n), and determining which alignment strategies work best for which types of ontologies. Evaluation of alignment strategies will be done in future with SEALS (Semantic Evaluation At Large Scale). Other topics for current/future discussion include recommending ‘best’ alignment strategies, and the use of a partial reference alignment. Also, they’ve just started on the integration of ontology alignment and repair of the structure of the ontologies (they published something on repair last year, and are working on the integration now).

Ontology alignment can be used to aid literature searches. How do you know what is in the repository (if you lack knowledge of the domain)? How do you compose an expressive query, if you don’t know the language/method to do it? Commonly now, you do a keyword search and what you get back is not knowledge, but documents. If you’re interested in relationships, then you use multiple keywords with or without boolean expressions. Still, you just get back documents. What would be better is to get back knowledge together with its provenance documents. For multiple search terms, you could constrain the query to only allow terms with a certain degree of relatedness.

To do this type of improved querying, you need to first define what a relevant query is: in an ontology, this would be a subgraph of your ontology which contains both/all of your query terms. All of the relevant queries are called a “slice” (there may be different routes encompassing both terms).
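As a sketch of what finding one such route might look like, the Python below treats an ontology as a set of (subject, relation, object) triples and uses breadth-first search to find a shortest path linking two query terms. The talk mentioned shortest paths but gave no algorithm, so the use of BFS, the undirected treatment of relations, and the toy triples are all my assumptions.

```python
from collections import deque


def build_graph(triples):
    """Undirected adjacency sets from (subject, relation, object) triples."""
    graph = {}
    for subject, _relation, obj in triples:
        graph.setdefault(subject, set()).add(obj)
        graph.setdefault(obj, set()).add(subject)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of concepts on a shortest
    path from start to goal, or None if they are not connected."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Invented triples, loosely in the spirit of the lipid example.
triples = [
    ("Lipid", "interacts_with", "Protein"),
    ("Protein", "implicated_in", "Disease"),
    ("Lipid", "is_a", "Molecule"),
]

if __name__ == "__main__":
    graph = build_graph(triples)
    print(shortest_path(graph, "Lipid", "Disease"))
```

A full slice would collect all such connecting subgraphs rather than a single path, which is where the question of how many queries to generate comes in.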

In a framework implementing these ideas, there would be a number of instantiated ontologies connected to the knowledge base, which in turn is connected to a query formulator. There is also an associated natural language query generator, which is used once a user enters keywords. The integrated system doesn’t exist yet, though they’ve implemented each of the components separately. They tested the components independently (non-integrated, sending data manually to each of the participating groups). They looked at information concerning lipids in ovarian cancer. The input ontologies are the lipid ontology and alignment ontology (Allyson: ? Not sure I got the second ontology name right).

To instantiate the KB, they got the content of each document, extracted sentences, detected relevant sentences, recognised appropriate entities (term identification), normalised the results, extracted the relations, classified them to identify ontology classes, and populated an OWL ontology using the Jena API. After KB instantiation comes slice generation and alignment (ontologies connected using shortest path). Then these slices are used to generate nRQL queries, and from there natural language queries (using triples to generate sentences). Once you click on one of the NL sentences, the nRQL query is sent to the database, and the result retrieved. There is a tradeoff in terms of query generation: how many queries do you generate? Which ones do you show to the user? You probably need to perform some relevance matching and query ranking, which isn’t done yet.

More information:

How is the mapping implemented/executed in SAMBO? Have you used SWRL at all? They just use what is used in the competition, which is essentially OWL-like subsumption / equivalence / is-a statements.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!