Categories: Meetings & Conferences

3rd Integrative Bioinformatics Workshop Day 1 (4 September 2006)

The 3rd Integrative Bioinformatics Workshop began yesterday at Rothamsted Research in Harpenden, England. Rothamsted is a lovely campus, reminiscent of the Wellcome Trust Genome Campus just south of Cambridge. It has a long history of plant research and was also the workplace of the mathematician and statistician Ronald Fisher.
Keynote speech
Day One was a single afternoon session with one keynote and seven 25-minute talks. The keynote, given by Steve Gardner of Biowisdom (the new owners of SRS now that Lion Biosciences has folded), was an entertaining overview of current methods in semantic integration and an update on the status of SRS. His classification of integration methods is:

  • rules-based (eg SRS)
  • data warehousing – his opinion is that this method is best for repetitive analysis, or “same analysis, different data”. As this is not the sort of work that is often done in bioinformatics, he believes this technique is not as useful as the others. However, it is a common method of integration in bioinformatics anyway.
  • federated middleware
  • ad-hoc query optimization – query normalization and distribution across multiple databases

What these methods are not, in his view, is scalable: none of them allow you to understand the meaning of the data. If you don’t understand the concepts or relationships inherent in the data, then it is difficult to do a useful integration.

Semantics is about 1) disambiguation and 2) aggregation. He says that relationships should be more than is_equivalent_of, is_same_as, is_a and is_part_of, and that more descriptive relationships should be used. He also posits that current tools no longer give useful search results: when you search PubMed you are not getting meaningful answers – you get loads of hits, but only take a few out of the “top 10”. He ended with a self-explanatory statement about the benefits of semantic integration: “If you can build on semantic consistency, you can get quite a lot for free.”
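As a toy illustration of the kind of more descriptive relationships he was advocating, here is a sketch of my own (the gene names and predicates are invented for illustration, not taken from the talk): assertions are stored as subject–predicate–object triples whose predicates carry real biological meaning, which makes aggregation across sources straightforward.

```python
# A toy triple store with relationships richer than is_a / part_of.
# Entities and predicates are purely illustrative.
triples = [
    ("TP53", "is_a", "tumour suppressor gene"),
    ("TP53", "positively_regulates", "CDKN1A"),
    ("MDM2", "negatively_regulates", "TP53"),
    ("apoptosis", "is_triggered_by", "TP53"),
]

def about(entity, predicate=None):
    """Aggregate every assertion mentioning an entity, optionally filtered by predicate."""
    return [t for t in triples
            if entity in (t[0], t[2]) and (predicate is None or t[1] == predicate)]

# Aggregation: everything asserted about TP53, in one query.
for statement in about("TP53"):
    print(statement)
```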

It was an interesting talk, but I am not sure I agree with everything he says. For one thing, I believe that data warehousing does have a place in modern bioinformatics: look at ONDEX, for example. However, various discussions during day 2 of this workshop made it clear that his definition of “data warehousing” and mine were actually different. The sort of data warehousing that he described as not useful to bioinformatics is the sort where all data sources are forced into a single, but NOT unified, schema. There is no attempt to actually integrate these sources into a unified schema where the semantically identical terms are stored in the same location. My definition of data warehousing has always been multiple data sources integrated into a unified schema, of which ONDEX is an example. So, with these revised definitions, I don’t see as much controversy.
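To make those two definitions of data warehousing concrete, here is a minimal and entirely hypothetical sketch (the table and column names are my own): the first style loads each source into its own table in a shared database, while the second maps semantically identical fields from both sources into one unified table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Single but NOT unified" schema: each source keeps its own table, so the
# same concept (a gene identifier) hides in differently named columns.
conn.executescript("""
    CREATE TABLE source_a_genes (gene_acc TEXT, organism TEXT, descr TEXT);
    CREATE TABLE source_b_loci  (locus_id TEXT, species TEXT, annotation TEXT);
""")

# Unified schema: semantically identical fields from both sources are mapped
# into the same columns, with provenance recorded alongside.
conn.executescript("""
    CREATE TABLE gene (gene_id TEXT, organism TEXT, description TEXT, source TEXT);
    INSERT INTO gene SELECT gene_acc, organism, descr, 'source_a' FROM source_a_genes;
    INSERT INTO gene SELECT locus_id, species, annotation, 'source_b' FROM source_b_loci;
""")

# Only the unified schema lets one query answer "which genes do we know about?"
for row in conn.execute("SELECT gene_id, organism, source FROM gene"):
    print(row)
```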

He is also very positive about OWL Full, the most expressive, fully semantic variant of OWL. However, reasoning over OWL Full is not guaranteed to be computationally complete, which is one of the reasons why OWL-DL is considered a good compromise.

My lack of understanding of his definition of data warehousing leads me to a final point: there are many data integration methods out there, but even more synonyms for data integration methods. It seems many papers create a new term rather than using currently available ones, and some of those which do re-use terms don’t always agree on the definition. Perhaps an ontology of data integration terms is required? 🙂

Categories: Meetings & Conferences, Standards

FuGO Workshop Days 2-3

Ever since Monday (Day 1), there has been change afoot in the secret depths of the FuGO workshop. Not only were the discussions stimulating (as my previous post indicated), but there were ideas of redefinitions and term shuffling that grew, and then grew again, during the evening of beer and revelry at the Red Lion in Hinxton. Days 2 and 3 continued in this vein, and while I am being deliberately vague in order to tantalize the reader with our goings-on, there was the smell of change in the air (and luckily, in this low-30s heat wave for Britain, that was all – our meeting room is one of the few in the entirety of the EBI where 15+ people can sit in air-conditioned comfort).

I think we are all starting to feel comfortable about where FuGO is headed, and while there was probably a little “analysis paralysis” (a term which Chris was the first, but not the last, to gently use at this meeting), the top-level decisions that need to be made at this stage do require serious discussion, and I believe the balance was about right. Everyone was contributing, and the daily (local to the workshop) update of the OWL file looked significantly different after yesterday’s changes. I shall wait to comment on any specifics until everything is up on the FuGO website, but I look forward with interest to the final day of discussion, and will probably have a sufficiently tired brain that the talks on upcoming FuGO tools on Friday will be a balm.

Categories: Meetings & Conferences, Standards

FuGO Workshop Day 1

Today was day 1 of my first (but in reality the 2nd) FuGO Workshop. It was full of interesting ontology ideas for the realm of functional genomics and beyond. There were talks concerning new ontological communities (GCP by Martin Senger, BIRNLex by Bill Bug, Immunology Database and Analysis Portal by Richard Scheuerman, and IEDB by Bjoern Peters) first thing, followed by a very interesting discussion on OBO Foundry Principles and developments in a phenotype ontology for model organisms and a unit/measurement ontology.

The most interesting points made this morning (to me!) were:

  1. Richard Scheuerman: the Immunology Database and Analysis Portal group has been thinking carefully about data relationships – how much do you capture in the ontology and how much in the corresponding data model? Their current answer is to use only is_a relationships in the ontology and to capture anything more specific in the data model. (As I understood it, they would like to make the ontology more complex at a later date.) With their ontology, they emphasized modeling the data based on how it is going to be used (don’t go into too much detail about the robots, for example). In choosing which data fields to require, they recognize that experimentalists will only be happy to fill out forms with perhaps a dozen data fields, so it is important to choose fields that anticipate how users will want to query the database.
    I really like this idea of requiring only those fields which would be of most use to the biologist, rather than those that would make us bioinformaticians most happy. Hopefully, with time, biologists will be happy to fill out more of a FuGE data structure. (A small sketch of this split follows this list.)
  2. The MIcheck Foundry, which will create a “suite of self-consistent, clearly bounded, orthogonal, integrable checklist modules”. This is coming out in a paper (currently in press at Nature Biotech, by CF Taylor et al.). It will contain MICheck – a “common resource for minimum information checklists” analogous to the OBO/NCBO BioPortal, acting as a shop window for displaying these checklists. There are many minimum information checklists out there, and the number will only grow, so it makes a lot of sense.
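As a sketch of the “ontology for terms, data model for everything else” split described in point 1 (the field names and term IDs below are my own inventions, not theirs), a submission record might require only a handful of query-oriented fields, each annotated with an ontology term identifier:

```python
from dataclasses import dataclass

# Hypothetical submission record: a dozen-or-fewer required fields chosen
# around how biologists will query the database.  Ontology terms label *what*
# each value is; relationships between values live in the data model itself.
@dataclass
class ExperimentSubmission:
    study_title: str
    organism: str            # free-text label, e.g. "Mus musculus"
    organism_term: str       # ontology term ID (illustrative)
    assay_type: str          # e.g. "flow cytometry"
    assay_term: str          # ontology term ID (illustrative)
    treatment: str
    timepoint_hours: float   # the number stays in the data, not in the ontology

record = ExperimentSubmission(
    study_title="Response to vaccine X",
    organism="Mus musculus",
    organism_term="NCBITaxon:10090",
    assay_type="flow cytometry",
    assay_term="OBI:flow_cytometry",   # placeholder ID, not a real accession
    treatment="vaccine X",
    timepoint_hours=24.0,
)
print(record)
```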

The afternoon was characterized by good-natured “discussion”. Here’s a summary of the mild-mannered (but – seriously – quite interesting) discussions:

  1. Argument against multiple inheritance (MI) in application ontologies, by Barry Smith.
    The root of the problem is that one shouldn’t combine two different classifications, e.g. color and car type (if trying to have a red Cadillac inherit from both red car and Cadillac). Instead there should be a color ontology and a car ontology. Many ontologies were originally built to support administrative classifications, but his opinion is that when you’re doing science you’re interested in capturing the law-like structures in reality, not administrative information. Barry says every instance in reality can fall into many types of classes – the issue is how to build the ontologies: you can capture these instances by having either separate single inheritance (SI) ontologies, or one single “messy” MI ontology. Further, if you have MI, you may lose reusability (i.e. why have colors just in a car ontology, when they could be reused if they were separate?). In response, Robert Stevens suggested that you could have MI and let a machine deal with the differences – take the car and color ontologies and combine them mechanically through the relationships you put between them: people shouldn’t be scared of multiple inheritance, just use it carefully.
    The final conclusion was that normalization can unpack MI into SI as long as it is a “good” MI ontology, and therefore this can be a reasonable way of ensuring your MI ontology is still a strong one – most errors are associated with misuse or overuse of MI. Another way to think of it would be to have a set of SI reference ontologies, but use MI in your application ontology. SI is very useful, as it supports all sorts of reasoning algorithms and statistical tools that MI does not allow. So an MI ontology ONLY works if you can break it down into a set of SIs: normalize an MI ontology in order to use the checking mechanism, and only use MI if it can be normalized. (A minimal sketch of this follows this list.)
  2. I learnt today about fiat types (no – not a car!), and how they relate to a putative measurement/unit ontology. A unit is a fiat universal/type in the dimension we use it to measure; in other words, measurements involve fiat units. A measurement ontology should be about storing units, and therefore should NOT contain numbers – numbers are part of the data, not the ontology. (Also sketched below.)
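To make the red Cadillac argument from point 1 concrete, here is a minimal sketch of my own (Python classes standing in for ontology classes, nothing from the talk itself): the “messy” version multiply inherits from a color hierarchy and a car-type hierarchy, while the normalized version keeps two single-inheritance hierarchies and joins them with an explicit relationship.

```python
# "Messy" multiple inheritance: one class tangles two orthogonal
# classifications (color and car type) together.
class RedCar: ...
class Cadillac: ...
class RedCadillac(RedCar, Cadillac): ...

# Normalized alternative: two single-inheritance hierarchies, linked by an
# explicit has_color-style relationship instead of inheritance.
class Color: ...
class Red(Color): ...

class CarType: ...
class CadillacType(CarType): ...

class Car:
    def __init__(self, car_type: CarType, color: Color):
        self.car_type = car_type   # link into the car-type hierarchy
        self.color = color         # link into the color hierarchy

my_car = Car(CadillacType(), Red())
print(type(my_car.car_type).__name__, type(my_car.color).__name__)
```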

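And for point 2, a toy sketch of the “units in the ontology, numbers in the data” split (again my own illustration; the term IDs are placeholders, not real Unit Ontology accessions):

```python
from dataclasses import dataclass

# Ontology side: fiat unit types only - no numbers live here.
UNIT_TERMS = {
    "UO:gram": "gram",
    "UO:millilitre": "millilitre",
    "UO:hour": "hour",
}

# Data side: the measured numbers, each pointing at a unit term.
@dataclass
class Measurement:
    value: float     # the number belongs to the data...
    unit_term: str   # ...the unit is a reference into the ontology

m = Measurement(value=37.5, unit_term="UO:gram")
print(m.value, UNIT_TERMS[m.unit_term])
```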
…And that was only the first day! Wowsers…