The 3rd Integrative Bioinformatics Workshop began yesterday at Rothamsted Research Centre in Harpenden, England. Rothamsted Research is a lovely campus reminiscent of the Wellcome Trust Genome Campus just south of Cambridge. It has a long history of plant research and was the workplace of the mathematician and statistician Ronald Fisher.
Day One was a single afternoon session with one keynote and seven 25-minute talks. The keynote was an entertaining overview of current methods in semantic integration, and an update on the status of SRS by Steve Gardner from Biowisdom, the new owners of SRS (Lion Biosciences has folded). His classifications of integration methods are:
- rules-based (eg SRS)
- data warehousing – his opinion is that this method is best for repetitive analysis, or “same analysis, different data”. As this is not the sort of work that is often done in bioinformatics, he believes this technique is not as useful as the others. However, it is a common method of integration in bioinformatics anyway.
- federated middleware
- ad-hoc query optimization – query normalization and distribution across multiple databases
What these methods are not is scalable: in his opinion, none of them allows you to understand the meaning of the data. If you don’t understand the concepts or relationships inherent in the data, then it is difficult to do a useful integration.
Semantics is about 1) disambiguation and 2) aggregation. He says that relationships should be more than is_equivalent_of, is_same_as, is_a and is_part_of, and that more descriptive relationships should be used. He posits that current tools no longer return useful search results. When you search PubMed, you are not getting meaningful answers: you get loads of hits, but only take a few out of the “top 10”. He ended with a self-explanatory statement about the benefits of semantic integration: “If you can build on semantic consistency, you can get quite a lot for free.”
It was an interesting talk, but I am not sure I agree with everything he says. For one thing, I believe that data warehousing does have a place in modern bioinformatics: look at ONDEX, for example. However, various discussions during day 2 of this workshop made it clear that his definition of “data warehousing” and mine were actually different. The sort of data warehousing that he described as not useful to bioinformatics is the sort where all data sources are forced into a single, but NOT unified, schema. There is no attempt to actually integrate these sources into a unified schema where the semantically identical terms are stored in the same location. My definition of data warehousing has always been multiple data sources integrated into a unified schema, of which ONDEX is an example. So, with these revised definitions, I don’t see as much controversy.
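To make the two definitions concrete, here is a minimal sketch of the difference as I understand it. Everything in it is hypothetical: the two sources, their field names, and the mapping table are invented for illustration and have nothing to do with how ONDEX or any real warehouse is actually implemented.

```python
# Hypothetical records from two sources that name the same concepts differently.
source_a = [{"gene_symbol": "TP53", "organism": "Homo sapiens"}]
source_b = [{"symbol": "BRCA1", "species": "Homo sapiens"}]

# Hypothetical mapping from each source's field names to shared field names.
FIELD_MAP = {
    "source_a": {"gene_symbol": "gene", "organism": "taxon"},
    "source_b": {"symbol": "gene", "species": "taxon"},
}


def load_side_by_side(sources):
    """Single but NOT unified: rows are copied as-is and tagged with their origin."""
    return [dict(row, source=name) for name, rows in sources.items() for row in rows]


def load_unified(sources, field_map):
    """Unified: semantically identical fields are stored under one shared name."""
    unified = []
    for name, rows in sources.items():
        for row in rows:
            record = {field_map[name][key]: value for key, value in row.items()}
            record["source"] = name
            unified.append(record)
    return unified


sources = {"source_a": source_a, "source_b": source_b}
side_by_side = load_side_by_side(sources)
unified = load_unified(sources, FIELD_MAP)

# Asking for genes via one field name only reaches both sources in the unified store.
genes_side_by_side = [r["gene_symbol"] for r in side_by_side if "gene_symbol" in r]
genes_unified = [r["gene"] for r in unified]
print(genes_side_by_side)
print(genes_unified)
```

In the side-by-side store, a query has to know every source's own field names. In the unified store, one field name covers both sources, which is the property I had always assumed "data warehousing" implied.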
He is also very positive about OWL Full, the most expressive of the OWL sublanguages. However, reasoning over OWL Full is not guaranteed to be computationally complete or decidable, which is one of the reasons why OWL DL is considered a good compromise.
My lack of understanding of his definition of data warehousing leads me to a final point: there are many data integration methods out there, but even more synonyms for data integration methods. It seems many papers create a new term rather than using currently available ones, and some of those which do re-use terms don’t always agree on the definition. Perhaps an ontology of data integration terms is required? 🙂