This is a presentation given on 29 April, 2010, at the Link-Age / LifeSpan Workshop on Data Handling for Biogerontology Research held by CISBAN, Newcastle University.
Data integration: one definition is to combine data residing in different sources providing users a unified view of these data. Questions of relevance for the data integration field: scope (all, datasets), type (same, different), implementation (federation, centralisation), access (programmatic i.e. computer to computer, web i.e. interactive) and ownership (public, private). Henning covers federated, mainly programmatic techniques using data of the same type in this talk.
To take an example, if you start with a sample (e.g. from a mouse). Observations of this sample results in one or more (overlapping or non-overlapping) publications. Then, the publication information can be used to annotate interaction databases and sent to PSICQUIC servers. PSICQUIC should allow the user to reconstruct an idealised view of the original system from the interaction data.
The molecular interaction standard is the PSI-MI standard, whose first XML version was produced in February 2004. There have been updates and extensions since then, and has been widely implemented by the major interaction databases including DIP, MINT, MIPS, IntAct, HPRD, etc. (http://www.psidev.info/MI)
The PSI-MI XML format is full featured, but complex. This complexity is both its strength and its weakness. Therefore, due to user request, they developed a simplified tabular format called MITAB where one row equals one binary interaction. You loose a lot of information, such as whether a binary interaction is part of a more complex reaction, but it has proven popular.
PSICQUIC is one API which is implemented by many databases such as those mentioned earlier. Its purpose is for querying molecular interaction databases, and uses a common query language (MIQL, which is based on Lucene) for this data. Can be used for PPIs, drug-target interactions and simplified pathway data. The simple PSICQUIC viewer is at http://www.ebi.ac.uk/Tools/webservices/psicquic/view. The PSICQUIC viewer can also point to other resources such as IntAct and many other non-EBI databases. The viewer also has a more fancy, graphics-based implementation where there is an overlay of molecular interactions on Reactome pathways.
MIQL can query every field available within MITAB in a precise way. SOAP and REST interfaces are available and documented at http://code.google.com/p/psicquic.
The challenge is to move PSICQUIC from simple access to all the resources to a real integrated view of all those resources. How to determine if two sources really are talking about the same interaction? Also, the compute time quickly moves beyond suitable interactive times.
PSICQUIC is a technical solution, whereas IMEx is the social/collaborative answer. IMEx is the International Molecular Exchange Consortium. The aims of its members include: avoiding redundant curation, providing a consistent body of public data using detailed joint curation guidelines, and providing a network of stable and comprehensive resources for MI data. This work is now in production phase since February 2010. The work is split up into the different databases by journal type. You can find out more information about IMEx at http://imex.sf.net. Each interaction has its own database’s identifier, but also an identifier from a common IMEx identifier space. The hardest part was harmonizing curation procedures, and they now have a common curation manual across all databases.
Looking at another aspect of his work, EnCore, which is based on different data types integrated using a federated, programmatic approach. EnCore is an ENFIN platform to enable mining data across various domains, sources, formats and types. It integrates database resources and analysis tools across different disciplines. The first focus is on developing an EnXML format. Access interfaces include Perl API, Java API, ftp, SOAP, REST, GUI, etc. The return formats are in a variety of flavours, e.g. XML, CSV, plain text, JSON, etc. All of this must be squeezed into one consistent format. This is done by putting wrappers around the various programs.
The EnXML structure is set oriented – not only does it tell you about one thing (e.g. protein), but also about a set of them. In this structure, an experiment is run which identifies the results. Each experiment references a Set structure, which contains the structure of the result. Sets can hold further nested sets. There are a number of other further sub-structures. The EnCORE results always include both a positive and negative result set (in the case of the negative result, it lists all identifiers for which *no* hit was found). Negative results allow you to track why you might not have gotten a response, and how you “lost” some identifiers from the result.
EnVision is an end-user tool for the above EnCORE work based on the EnXML format. It provides an initial, integrative view for Sets of molecular entities without the need for programming. It also allows the possibility for further local processing. It allows you to save the status/analysis of your material on a particular date, and use that for, e.g., supplementary materials. You can also download your sub-results in a tabular format. Further information and the ability to run this GUI is available from http://www.enfin.org, where you can play with an EnCORE tutorial.
All of this can be quite laborious – web services that are used by EnCORE can change without warning, so it’s a constant challenge to maintain all of these wrappers. A partial answer is to use, wherever possible, underlying standards for the individual services. Such standards include PSICQUIC for MI data. DAS will be used to access protein annotation and information.
Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!