Categories: Meetings & Conferences, Semantics and Ontologies

UKON 2016: The Use of Reformation to Repair Faulty Analogical Blends

These are my notes for Alan Bundy and Ewen Maclean’s talk at the UK Ontology Network Meeting on 14 April, 2016.

This talk is divided into two parts: Merging Ontologies via Analogical Blending, and Repairing Faulty Ontologies using Reformation.

Can you merge ontologies successfully using analogical blending? It would be quite easy to get things wrong, so they use the reformation technique to repair any mistakes made in the merging process.

T1 and T2 are the parent theories, and B is the blend between them. Suppose T1 and T2 are two retailer ontologies. T1 has an owning relationship and part numbers; product A and product B have the same part number because they are different instances of the same product. In T2 the relationship is sold_to, and there are serial numbers rather than part numbers. So things are similar but not identical, and it would be easy to align these concepts incorrectly automatically. When the ontologies are merged, the two products are incorrectly given the same serial number (when they only share a part number). This makes the blended ontology inconsistent.

How can the reformation technique help you recover? Reformation works from reasoning failures; here, we are looking at inconsistencies. Using the proof of inconsistency, reformation tries to break the proof so that it can no longer reach the inconsistency, thereby creating a suggested repair – in this case, renaming the two occurrences of the serial number. The resulting new blended ontology replaces the serial number type with a part number type, so part number and serial number become two distinct types, correcting the ontology.
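To make the repair idea concrete, here is a minimal, hypothetical Python sketch – not the actual Reformation system, which works on proofs of inconsistency in first-order logic. It only illustrates the intuition: detect the clash created by the faulty blend (two distinct instances sharing a “serial number”) and apply the suggested repair of renaming the offending occurrences so that part numbers and serial numbers become distinct again. All names and data are invented.

```python
# Toy illustration of the faulty blend: T1 uses part numbers (shared by
# instances of the same product), T2 uses serial numbers (unique per
# instance). A naive blend maps both onto "serial_number", violating the
# uniqueness constraint on serial numbers.

from collections import defaultdict

# Blended facts: (instance, attribute, value) -- hypothetical data
blend = [
    ("product_a", "serial_number", "12345"),
    ("product_b", "serial_number", "12345"),  # clash: two instances, one serial
]

def find_uniqueness_violations(facts, attribute):
    """Return values of `attribute` that are shared by more than one instance."""
    holders = defaultdict(set)
    for instance, attr, value in facts:
        if attr == attribute:
            holders[value].add(instance)
    return {v: owners for v, owners in holders.items() if len(owners) > 1}

def repair_by_renaming(facts, bad_attribute, new_attribute, violations):
    """Suggested repair: rename the offending occurrences so the blended
    attribute splits back into two distinct ones (serial vs part number)."""
    repaired = []
    for instance, attr, value in facts:
        if attr == bad_attribute and value in violations:
            repaired.append((instance, new_attribute, value))
        else:
            repaired.append((instance, attr, value))
    return repaired

violations = find_uniqueness_violations(blend, "serial_number")
print("Inconsistency detected:", violations)
print("Repaired blend:", repair_by_renaming(blend, "serial_number", "part_number", violations))
```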

Ontologies can be merged by analogical blending, but some blends can be faulty. Faults can be revealed by reasoning failures. Reformation uses such failures to diagnose and repair faulty ontologies. This work is still in the early stages.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Categories: Meetings & Conferences, Semantics and Ontologies

Patrick Lambrix: Ontology alignment and applications

1 February 2010, Workshop on Modeling biological systems: Bio-model similarities and differences based on semantic information
Presentation by Patrick Lambrix

Abstract: “The use of ontologies is a key technology for the Semantic Web and in particular in the biomedical field many ontologies have already been developed. Many of these ontologies, however, contain overlapping information and to make full use of the advantages of ontologies it is important to know the inter-ontology relationships, i.e. we need to align the ontologies. Knowledge of these alignments would lead to improvements in search, integration and analysis of biomedical data. In this talk we give an overview of techniques for ontology alignment with a focus on approaches that compute similarity values between terms in the different ontologies. Further, we discuss the results of evaluations of these techniques using biomedical ontologies. We also give examples of the use of ontology alignment in applications such as literature search.”

Ontologies are used for communication between people and organisations (to paraphrase Michael Ashburner, it’s easier for biologists to share their toothbrushes than to share their definition of a gene). Ontologies are also used for enabling knowledge re-use and sharing, and as a basis for interoperability between systems. They are repositories of information, and as such can be queried or used as a query model for information sources. They are considered a key technology for the Semantic Web.

(He then described GO and OBO.) Examples of large-scale biomedical ontology efforts are the OBO Foundry and the NCBO; in systems biology there are the SBO, BioPAX, and PSI efforts.

There are many biomedical ontologies, and a large number of them are used by researchers. As often happens when there are multiple ways of describing things, many ontologies have overlapping concepts. For instance, two ontologies may have different views on the same domain or different levels of granularity; in other cases, you may have built an in-house custom ontology that you wish to align with a standard one; finally, you may be building a bottom-up ontology that you wish to attach to an upper-level ontology.

Lambrix and others have developed a standard alignment framework. A number of different matchers are used to compare ontologies. Suggestions are validated by the user, and accepted and rejected choices are stored and used in later iterations. Preprocessing of the ontologies begins with the selection of relevant features of the ontologies and the selection of the search space (you can suggest appropriate “highly mappable” areas).

The matchers use a number of strategies. The first is linguistic matching, e.g. on names, definitions and synonyms. These linguistic matchers look at the edit distance (the number of character changes needed to turn one string into the other) and at n-grams (an n-gram is n consecutive characters of a string; similarity is based on comparing the n-gram sets). Next come the structure-based strategies: if two concepts in different ontologies are known to be equivalent, then chances are that their children and parents have similarities as well. This is called propagation of similarities. Anchored matching is where two pairs at different levels in the hierarchy can be declared equivalent. The third type of matcher strategy is the constraint-based approaches. On their own, these aren’t good at finding new mappings, but they are good at helping to reject prospective matches. A fourth type is instance-based matchers, where you could use the life-science literature as instances: entities annotated with a particular concept are treated as instances of that concept. You could define a similarity measure between concepts using basic naïve Bayes classifiers (one per ontology). Here, each concept was used as a query term in PubMed and the most recent abstracts were retrieved; each Bayes classifier took the abstracts related to one ontology and classified them according to the concept in the other ontology with the highest posterior probability. A final type of matcher uses auxiliary information, for example WordNet to find synonyms.
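As a concrete illustration of the two linguistic measures mentioned above, here is a small Python sketch (my own illustration, not SAMBO’s implementation) of a normalised edit-distance similarity and an n-gram set similarity:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Normalise edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def ngrams(s: str, n: int = 3) -> set:
    """Set of n consecutive characters drawn from the string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Similarity based on comparing the two n-gram sets (Jaccard overlap)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

print(edit_similarity("myocardium", "myocard"))      # 0.7
print(ngram_similarity("heart valve", "cardiac valve"))
```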

Most ontology alignment systems use several different matcher methodologies. Lambrix and colleagues have written a paper listing many of these systems, and more recently a book chapter. Most systems use single-value threshold filtering, where pairs of concepts with a similarity higher than or equal to the threshold become mapping suggestions. With a high threshold you don’t get that many suggestions, but those you do get are good. How do you find the best threshold level?

Lambrix and others suggest a method called double-threshold filtering. Here, pairs of concepts with a similarity higher than or equal to the upper threshold become suggestions outright. Pairs between the lower and upper thresholds become suggestions only if they make sense with respect to the structure of the ontologies and the suggestions already above the upper threshold. This approach works very well if your structure is good (by which he means complete). From this you can build an alignment system: theirs is called SAMBO.
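A rough Python sketch of the two filtering schemes might look as follows. This is my own simplification, not SAMBOdtf: it uses “both parents are already mapped above the upper threshold” as a stand-in for the structural check, and the concept names, similarity values and parent maps are made up.

```python
def single_threshold(similarities, threshold):
    """Pairs with similarity >= threshold become mapping suggestions."""
    return {pair for pair, sim in similarities.items() if sim >= threshold}

def double_threshold(similarities, lower, upper, parent1, parent2):
    """Pairs above the upper threshold are suggestions outright; pairs between
    the two thresholds are kept only if their parents are themselves a
    suggestion above the upper threshold (a simple stand-in for 'makes sense
    with respect to the structure of the ontologies')."""
    anchors = single_threshold(similarities, upper)
    suggestions = set(anchors)
    for (c1, c2), sim in similarities.items():
        if lower <= sim < upper and (parent1.get(c1), parent2.get(c2)) in anchors:
            suggestions.add((c1, c2))
    return suggestions

# Hypothetical similarity values between concepts of ontologies O1 and O2
similarities = {
    ("o1:Heart", "o2:Heart"): 0.95,
    ("o1:HeartValve", "o2:CardiacValve"): 0.62,
    ("o1:HeartValve", "o2:Kidney"): 0.61,
}
parent1 = {"o1:HeartValve": "o1:Heart"}
parent2 = {"o2:CardiacValve": "o2:Heart", "o2:Kidney": "o2:Abdomen"}

print(single_threshold(similarities, 0.8))
print(double_threshold(similarities, 0.5, 0.8, parent1, parent2))
```

In this toy example the HeartValve/CardiacValve pair survives double-threshold filtering because its parents are already an anchor suggestion, while the HeartValve/Kidney pair, with a nearly identical similarity score, is rejected.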

The biggest evaluation effort for alignments is the Ontology Alignment Evaluation Initiative (OAEI), which has been around since 2004. There are different tracks. One track in which most groups participate is the comparison/benchmark track, which is open – you know what the result should be ahead of time. Most of the other tracks are blind – you don’t know the answer ahead of time. In 2007, 17 groups participated; AOAS (from the UMLS people) came first in the anatomy track and SAMBO came second (though SAMBO was better than AOAS in terms of non-trivial matches). Still, 40% of the non-trivial matches were not found.

In the 2008 anatomy track, the idea was to align the mouse anatomy and the NCI anatomy. The organisers say there are 1544 mappings, of which 934 are trivial, as these two groups have already been working together. The tasks included aligning and optimising, and seeing whether already-given mappings could be used to improve the results. SAMBO didn’t participate, but the organisers think it would have won if it had entered. Interestingly, their “improved” SAMBO (SAMBOdtf) didn’t perform well. They think this was because the structure wasn’t complete, and the SAMBO people found a number of missing is-a relationships in the starting ontology(ies). About 50% of the mappings were found both by systems using background knowledge (BK) and by those not using it; about 13% were found by one type of system but not by the other. The rest (approx. 25%) were not found by either type of system. SAMBOdtf relies on the structure, and in other circumstances they found it could improve both precision and recall.

Current issues in this field include: complex ontologies, the use of instance-based techniques, the use of alignment types (determining the type of relation rather than just equivalence), complex mappings (1-n, m-n), and determining which alignment strategies work best for particular types of ontologies. Evaluation of alignment strategies will in future be done with SEALS (Semantic Evaluation At Large Scale). Other topics for current/future work include recommending the ‘best’ alignment strategies and the use of a partial reference alignment. They have also just started on integrating ontology alignment with repair of the structure of the ontologies (they published something on repair last year, and are working on the integration now).

Ontology alignment can be used to aid literature searches. How do you know what is in the repository if you lack knowledge of the domain? How do you compose an expressive query if you don’t know the language or method for doing so? Commonly you do a keyword search, and what you get back is not knowledge but documents. If you’re interested in relationships, you use multiple keywords, with or without boolean expressions – but you still just get back documents. It would be better to get back knowledge together with the documents that provide its provenance. For multiple search terms, you could constrain the query to only allow terms with a certain degree of relatedness.

To do this type of improved querying, you need to first define what a relevant query is: in an ontology, this would be a subgraph of your ontology which contains both/all of your query terms. All of the relevant queries are called a “slice” (there may be different routes encompassing both terms).
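A minimal sketch of the idea, assuming a toy ontology graph and a plain breadth-first search: a relevant query here is represented by just one shortest path connecting the two query terms, whereas the real slice would collect all such connecting subgraphs. All concept names and relations below are invented for illustration.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search for one shortest path between two concepts."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour, _relation in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Hypothetical ontology fragment: concept -> [(neighbour, relation), ...]
graph = {
    "lipid":               [("membrane_lipid", "has_subclass")],
    "membrane_lipid":      [("lipid", "is_a"), ("cell_membrane", "part_of")],
    "cell_membrane":       [("membrane_lipid", "has_part"), ("ovarian_cancer_cell", "part_of")],
    "ovarian_cancer_cell": [("cell_membrane", "has_part")],
}

# A "relevant query" is a subgraph containing both query terms; the slice
# would collect all such connecting subgraphs (here, just one shortest path).
print(shortest_path(graph, "lipid", "ovarian_cancer_cell"))
```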

In a framework implementing these, there would be a number of instantiated ontologies which are connected to the knowledge base, which in turn is connected to a query formulator. There is also an associated natural language query generator, which is used once a user enters keywords. This system doesn’t exist yet, though they’ve implemented each of the components separately. They tested the system independently (non-integrated, sending data manually to each of the participating groups). They looked at information concerning lipids in ovarian cancer. Input ontologies are the lipid ontology and alignment ontology (Allyson: ? Not sure I got the second ontology name right).

To instantiate the KB, they took the content of the documents, extracted sentences, detected the relevant sentences, recognised the appropriate entities (term identification), normalised the results, extracted the relations, classified them to identify ontology classes, and populated an OWL ontology using the Jena API. After KB instantiation comes slice generation and alignment (the ontologies are connected using shortest paths). These slices are then used to generate nRQL queries, and from those natural language queries (triples are used to generate sentences). Once you click on one of the NL sentences, the nRQL query is sent to the database and the result retrieved. There is a tradeoff in query generation: how many queries do you generate, and which ones do you show to the user? You probably need to perform some relevance matching and query ranking, which isn’t done yet.
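To illustrate the “use triples to generate sentences” step, here is a tiny hypothetical sketch that paraphrases slice triples as candidate natural-language queries. The relation-to-phrase templates and triples are invented, and the real system generates nRQL queries alongside the sentences.

```python
# Hypothetical relation-to-phrase templates; not the project's actual generator.
TEMPLATES = {
    "part_of":    "{s} is part of {o}",
    "regulates":  "{s} regulates {o}",
    "located_in": "{s} is located in {o}",
}

def triples_to_sentences(triples):
    """Turn (subject, relation, object) triples into candidate NL query sentences."""
    sentences = []
    for s, p, o in triples:
        template = TEMPLATES.get(p, "{s} " + p.replace("_", " ") + " {o}")
        sentences.append(template.format(s=s.replace("_", " "),
                                         o=o.replace("_", " ")) + "?")
    return sentences

slice_triples = [
    ("membrane_lipid", "part_of", "cell_membrane"),
    ("lipid_x", "regulates", "ovarian_cancer_progression"),
]
for sentence in triples_to_sentences(slice_triples):
    print("Which documents support:", sentence)
```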

More information:

How is the mapping implemented/executed in SAMBO? Have you used SWRL at all? They just use what is used in the competition, which is essentially OWL-like subsumption / equivalence / is-a statements.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Categories: Meetings & Conferences, Semantics and Ontologies

PTO4: Alignment of the UMLS Semantic Network with BioTop: Methodology and Assessment (ISMB 2009)

Stefan Schulz

Ontology alignment is the linking of two ontologies by detecting semantic correspondences between their representational units (RUs), e.g. classes. This is mainly done via equivalence and subsumption. BioTop is a recent development created to provide formal definitions of upper-level types and relations for the biomedical domain. It is compatible with both BFO and DOLCE lite, and it links to OBO ontologies. The UMLS Semantic Network (SN) is an upper-level semantic categorization framework for all concepts of the UMLS Metathesaurus. It has remained largely unchanged over the last 20 years: a tree of 135 semantic types.

If you compare the two, the main difference is in the semantics: BioTop’s semantics are explicit and use Description Logics (DL), which means you are also subscribing to the open-world assumption (OWA). The semantics of the UMLS SN are more implicit, frame-like, and arguably closed-world. The SN also makes it possible to block relation inheritance, which isn’t possible in DL.

The methodology is, first, to provide DL semantics for the UMLS SN and, second, to build the bridge between BioTop and the UMLS SN. How was the first step done? For semantic types: types are treated as classes of individuals; subsumption hierarchies are assumed to be is_a hierarchies; and there are no explicit disjoint partitions. For semantic relations: they are reified as classes, NOT represented as OWL object properties. Triples are transformed into OWL classes with domain and range restrictions. Why convert relations to classes? They didn’t want to inflate the number of BioTop relations, and there were other structural reasons. If you reify a relation, you can provide complex restrictions on it, and you can formally represent UMLS SN tags such as “defined not inherited” in a more rigorous way.
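To make the reification pattern more concrete, here is a small sketch using Python and rdflib (rather than the Protégé/Jena-based toolchain used in the work itself). The namespace, the hasSubject/hasObject property names, and the exact axiom shape are my assumptions about one way to express “an OWL class with domain and range restrictions”; the paper’s actual modelling pattern may differ.

```python
# Reifying a UMLS SN triple such as
#   "Pharmacologic Substance  TREATS  Disease or Syndrome"
# as an OWL class with restrictions, instead of as an object property.
# Requires the rdflib package (serialize() returns a str in rdflib >= 6).
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/umls-sn/")  # illustrative namespace only
g = Graph()
g.bind("ex", EX)

treats = EX.Treats                      # the reified relation, as a class
g.add((treats, RDF.type, OWL.Class))
g.add((treats, RDFS.subClassOf, EX.SemanticRelation))

def some_values_restriction(graph, on_property, filler):
    """Build an owl:someValuesFrom restriction node."""
    r = BNode()
    graph.add((r, RDF.type, OWL.Restriction))
    graph.add((r, OWL.onProperty, on_property))
    graph.add((r, OWL.someValuesFrom, filler))
    return r

# "Domain" and "range" of the reified relation, expressed as restrictions on
# hypothetical hasSubject / hasObject properties linking the relation
# instance to its two arguments.
g.add((treats, RDFS.subClassOf,
       some_values_restriction(g, EX.hasSubject, EX.PharmacologicSubstance)))
g.add((treats, RDFS.subClassOf,
       some_values_restriction(g, EX.hasObject, EX.DiseaseOrSyndrome)))

print(g.serialize(format="turtle"))
```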

Mapping was fully manual using Protégé 4, with consistency checking by FaCT++ and Pellet supported by the explanation plugin (Horridge, ISWC 2008); they spent most of their time fighting inconsistent TBoxes. It was an iterative process. Next comes assessment. Using the SN alone there was very low agreement with the expert rating. Using SN+BioTop there were very few rejections (only 3), and those agreed with all the expert ratings. Possible reasons could be to do with the DL’s OWA and, for the false positives, that the expert rating was done on NE while the system judgments were done on something else. There were inconsistent categorizations of UMLS SN objects, which exposed hidden ambiguities (e.g. that Hospital was both a building and an organisation).

Allyson’s questions: Why decide to create BioTop and not use BFO or DOLCE lite? It’s not that I would necessarily suggest that these be used, I am just curious. Also, subsumption hierarchies are assumed to be is_a hierarchies, but is that a safe assumption in UMLS SN? For instance, in older versions of GO this would have been a problem (some things marked as subsumption were not in fact is_a, though I am pretty sure GO has fixed all of this now).

FriendFeed Discussion

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Categories: CISBAN, Meetings & Conferences, Semantics and Ontologies, Standards

Pre-Building an Ontology: What to think about before you start

There are a few big questions that need to be kept firmly in mind when starting down the road of ontology building. These are questions of:

  1. Goals: What are you trying to achieve with this ontology?
  2. Competency/Scope: What are you trying to describe?
  3. Granularity: To what depth will you need to go?

The rest of this post relates directly to and is organised around these three topics. These topics have a lot of overlap, and aren’t intended to be mutually exclusive: they’re just ideas to get the brain going. I use the upcoming Cell Behavior Ontology (CBO) workshop to illustrate the points. The questions I single out below may already have been answered by the workshop organizers, but the answers haven’t yet been published on the CBO wiki. I’ll be attending this workshop, and will aim to post my notes each day. It should be fun!

Goals

If a main goal is eventual incorporation within another ontology (e.g. Gene Ontology (GO) for the case of CBO) or even just alignment with the other ontology’s tenets, you have to be sure you’re happy with the limitations this may put on your own ontology. It may be that these limitations are not acceptable, and as a result you choose to reduce the dependencies on the other ontologies.

For CBO, the important questions relate to possible alignment to GO and therefore, ultimately, Basic Formal Ontology (BFO):

Question: Do you wish to ultimately include some CBO terms under, for example, biological processes of GO? GO contains only canonical/non-pathological terms. How does this fit with the goals of CBO?

GO has the express intent of creating terms covering only canonical / non-pathological biology. Therefore, would cell behavior during cancer (e.g. uncontrolled cell proliferation or metastasis, which aren’t in GO) be appropriate if CBO is meant, in its entirety, to be included within GO? These are important terms, so if some amount of incorporation with GO is appropriate, would it only end up being a partial alignment?

Question: Are there any plans to use an Upper Level Ontology (ULO) such as the OBO Foundry-recommended BFO? Though BFO may not need to be considered immediately, it does place certain restrictions on an ontology. Are you happy with those restrictions?

One example of the restrictions placed by the use of BFO is that within BFO, qualities cannot be linked via the Relations Ontology to processes. That is, if you have a property called has_rate which is a child property of “bears”, then you are not allowed to make a statement such as “cell division has_rate some_rate”, where cell division is a process, and some_rate is a quality. There is a good post available about ULOs by Phil Lord.

Question: How richly do we want to describe cell behaviors?

Another important general goal is the level of richness needed for CBO. Competency questions, discussed later, will answer this to some extent. We can think about richness using GO as an example. The goal of the GO developers is the integration of multiple resources through the use of shared terms, and GO does this very well. But rich descriptions and semantic interoperability are not goals of GO.

Competency/Scope

While it is often tempting to start from the top of an ontology and work downward, consideration should be given to first listing the leaf terms you are sure you need in the ontology. Not only does this ensure you have the terms people need from the start, but the bagging and grouping exercises you then go through to create the hierarchy will often highlight potential problems with your expected hierarchy. If you have clear use-cases, then a bottom-up approach, at least in the early stages, can be useful in figuring out the scope of your ontology.

This brings us to the importance of having scope – and a set of competency questions – ready from the beginning of ontology development. What do you want to describe?

Question: What is the definition of cell behavior in the context of CBO?

For instance, for CBO, what is meant by the word “behavior”? A specific description of what is, and isn’t, a behavior that the CBO is interested in, is an important first step.

The last thing that would be relevant to the overall goals (but which could equally well be considered in the Granularity section below) is the type of terms to be added:

Question: Should the terms be biological terms only, or also bioinformatics/clinical terms?

To better explain the above question, consider the stages of cancer progression. “Stage 2” is a fictitious name for a clinical/bioinformatics description of a stage of a cancer; it is not a biological term. Which type of term should go into CBO? I would guess that the biological term describing the biology of a cell at “stage 2” should go in, with synonyms perhaps used to link to the bioinformatics/clinical terms. There probably shouldn’t be a mix of the two types of terms as the primary labels.

Additionally, competency questions can help determine the scope. You can make a list of descriptive sentences that you want the ontology to be able to describe, such as “The behavior of asymmetric division (e.g. stem cell division)”. By listing a number of such sentences, you can determine which are out of scope and which must be included, thus building up a clear definition of the scope.

Granularity

For me the granularity question has two aspects: first, and more generally, how fine-grained do you want your terms to be; second, and more interestingly in the context of CBO, are we interested in the behavior of cells and/or the behavior in cells? The examples given in the workshop material seem to come from both of these areas (see http://cbo.compucell3d.org/index.php/Breakout_Session_Groups).

Question: Should CBO deal with the behavior OF cells and/or the behavior IN cells?

For the above question we can use cell polarization and cell movement as examples. Both are listed in the wiki link just above, so both are considered within the scope of CBO. However, cell movement is a characteristic behavior of a cell, while polarization is something that happens in a cell (e.g. polarization within an S. cerevisiae cell with respect to the bud scar). Both types of behavior are relevant, but they are different classes of behavior, and this may be an appropriate separation within the CBO hierarchy.

As an aside, is cell division a behavior? It is covered in the CBO material, so with respect to CBO, it is. I think that the CBO is intended to deal with single cells, so I’m not sure where cell division fits in.

These questions should be considered, but try not to let them reduce the effectiveness and efficiency of ontology development. As with many biological domains, if you ensure that everyone is on the same page with their goals, scope, and granularity, there will be (I believe!) fewer arguments and more results.

Also, I am positive I’ve missed stuff out, so please add your suggestions in the comments!

With special thanks to Phil Lord for the useful discussions surrounding ontology building that formed the basis for this post.