On 19-20th February 2007, there was a NEBC–EBI Developer’s Workshop at CEH Oxford, which included presentations and discussions of the CISBAN DPI, omixed, MIBBI, OBI, MSI, PSI, and BioMAP, among others. Participants were from NEBC and other sections of Oxford University, EBI, University of Manchester, and Newcastle University. Once there are official minutes of the workshop, I will post the link here. Those minutes should also include links to the presentations given over the course of the two days.
MIGS + FuGE-OM
The most important result, in relation to my work, to come out of the workshop, was the decision we made to implement the MIGS model as an extension of the FuGE-OM. By the time of the next workshop on the 6-8th June 2007, we will have a draft version of the MIGS-OM. Dawn Field, Allyson Lister, Andy Jones, and others are the main developers of the new OM.
Now, back to the day-by-day notes….
Below are some notes that I made during this workshop, which you may wish to peruse. It was originally designed as an internal NEBC-EBI workshop; however, so many external people were invited to give presentations on current data integration and data archiving techniques within the bioinformatics community that it became something much more. It was two days of exploring possible collaborations and partnerships as well as learning about the interesting and complementary work going on at the four centers of bioinformatics research.
Initially, there was some discussion of the proposal made by the OBI community to the OBO Foundry on a minimal amount of ontology term metadata. Specifically, there was some concern over the general use of alternative_terms in an ontology. The worry is that two different communities will use the same alternative_term for two different terms. If they don’t also refer to the ID, there could be problems, as they will not be talking about the same thing. Thinking about a practical case such as GO does reassure me in some ways. Programmatically, there is a general consensus that GO IDs should be used in preference to their unstable names (a name may change, but as long as it represents the same concept, the ID will remain stable). Further, databases such as UniProt cross-reference to the IDs, and not to the terms. However, there may still be some cause for concern. When you see GO terms referenced in papers, for instance in a table or figure, you usually do not see the GO ID. While I think it is laudable to allow community-specific labels for each OBI term, and I believe this will make uptake of the ontology easier, authors who use a community label without also giving the OBI ID may create problems. Certainly, it is something to keep in mind.
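The collision worry can be made concrete with a small sketch. The term IDs, names, and community labels below are hypothetical, made up purely for illustration; they are not real OBI or GO entries.

```python
# Hypothetical ontology terms: stable IDs mapped to a preferred name and
# community-specific alternative labels. None of these are real OBI entries.
TERMS = {
    "EX:0000001": {"name": "sample preparation", "alt": {"proteomics": "prep"}},
    "EX:0000002": {"name": "data preprocessing", "alt": {"transcriptomics": "prep"}},
}


def ids_for_label(label):
    """Return every term ID that is named or alternatively labelled `label`."""
    return sorted(
        term_id
        for term_id, term in TERMS.items()
        if label == term["name"] or label in term["alt"].values()
    )


def name_for_id(term_id):
    """Look a term up by its stable ID; this is never ambiguous."""
    return TERMS[term_id]["name"]


print(ids_for_label("prep"))      # two IDs: the label alone is ambiguous
print(name_for_id("EX:0000001"))  # one term: the ID pins down the concept
```

A paper that prints only the label "prep" leaves the reader with both candidates, whereas printing the ID alongside the label removes the ambiguity.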
The CISBAN DPI project, which I have spoken about at length in the past (as it is one of my projects!), was also presented here. It fits in very well with this group of people, who are interested in learning about ways to store and archive multiple omics data types. A full description of the current status of the project is available from the DPI project page.
Chris Taylor from the EBI provided a useful overview of the current status of the MIBBI project. This project has two functions: the first is to provide a central location and minimal information for any interested MI* project, and the second is to draw together a checklist of high-level requirements for all biological experiments, on which specific communities can base their own checklists. Chris mentioned that, at least initially, it is important to separate those pieces of minimal information that are both useful and practical from those which are “only” useful. The importance of MIBBI and other MI* projects is clear: poorly reported experimental work has almost no chance of being reproduced. The value of a shared checklist such as MIBBI is also straightforward, as there is clear crossover in many areas. For instance, MIACA and MIFlowCyt are both interested in flow cytometry, though MIACA has a broader remit. Metabolomics and proteomics also use a lot of the same equipment, yet over the past months and years these two communities have chosen different priorities in their checklists: metabolomics focuses mainly on sampling and statistics, while the proteomics checklist has spent more time on describing equipment. In the past, MIAME has been criticized as vague, and some databases have said that they do not know whether they are MIAME compliant. However, this vagueness is in almost all cases necessary, as consensus generally leads to compromise in definitions and requirements. Chris’ aim with MIBBI is to keep it away from overly formal representations, as MIBBI is more about producing a usable set of guidelines. Over time, the MI* modules would become orthogonal to one another, so that a research group could take the modules it needs and end up with a bespoke set of reporting guidelines for its own work.
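The idea of picking orthogonal modules to build a bespoke checklist can be sketched roughly as follows. The module names and reporting items are invented for illustration and are not taken from MIBBI or any real MI* checklist.

```python
# Hypothetical MI* modules, each a set of reporting items. The module names
# and items are invented for illustration and are not taken from MIBBI.
MODULES = {
    "core": {"investigator", "organism", "protocol description"},
    "sampling": {"sample source", "collection date"},
    "mass_spec": {"instrument model", "ionisation method"},
    "statistics": {"replicate count", "normalisation method"},
}


def bespoke_checklist(module_names):
    """Combine the chosen modules into one sorted checklist."""
    items = set()
    for name in module_names:
        items |= MODULES[name]
    return sorted(items)


def overlapping_items(modules):
    """Report items claimed by more than one module (i.e. not orthogonal)."""
    seen, overlaps = set(), set()
    for items in modules.values():
        overlaps |= items & seen
        seen |= items
    return sorted(overlaps)


print(bespoke_checklist(["core", "sampling", "mass_spec"]))
print(overlapping_items(MODULES))  # empty when the modules are orthogonal
```

In this toy setting, a proteomics group and a metabolomics group would share the "core" module and differ only in which of the other modules they add.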
Tim Booth then spent some time talking about implementing MIBBI in the same forms environment (a web GUI for filling in forms) as that used for GCat. The GCat system takes an xsd as input and can create a series of web forms that will produce XML validated against that xsd, and is currently used to run the front-end to the MIGS database.
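To give a feel for the "schema in, forms out" idea, here is a minimal sketch that reads a tiny XSD and lists one form field per typed element. It is not GCat's actual implementation: the schema fragment, the type mapping, and the `form_fields` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# A made-up schema fragment; it is not the real MIGS or GCat schema.
EXAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="report">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="organism" type="xs:string"/>
        <xs:element name="collection_date" type="xs:date"/>
        <xs:element name="depth_m" type="xs:decimal" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Map a few XSD simple types to HTML input types (illustrative only).
INPUT_TYPES = {"xs:string": "text", "xs:date": "date", "xs:decimal": "number"}


def form_fields(xsd_text):
    """List one form field per typed xs:element in the schema."""
    root = ET.fromstring(xsd_text)
    fields = []
    for element in root.iter(XS + "element"):
        xsd_type = element.get("type")
        if xsd_type is None:
            continue  # container element with an inline complex type
        fields.append({
            "name": element.get("name"),
            "input": INPUT_TYPES.get(xsd_type, "text"),
            "required": element.get("minOccurs", "1") != "0",
        })
    return fields


for field in form_fields(EXAMPLE_XSD):
    print(field)
```

A real system would also need to handle nesting, repeated elements, and enumerations, and would validate the submitted XML against the same schema.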
David Shotton and Graham Kline then spoke on their shared interests with NEBC. They are researching the best way to store metadata on images, as well as how to store the images themselves. Firstly, the concept of a single entry in an image database is different from that in a sequence database: where a given sequence of a protein or gene is identical no matter where it is sequenced (as long as you are using the same sample, strain and species as the other group), an image of assay results from an Affymetrix chip, for example, will always be slightly different each time the assay is performed. Therefore the number of images will grow very quickly, and storage will be problematic. Their answer is to use a “data web”: put the image on the web (on your own research group’s pages, for example) and then add sufficient metadata that the information can be harvested into a central registry, where both programmatic and GUI access will be available. Shared interests with the NEBC and their partners include linking experimental data acquisition to ontologies and data stores, generating links to genomic and other omics databases, using the web for data acquisition and display, and exploring the interaction between informal labeling and the more formally defined ontological concepts. However, they are not trying to solve all the social problems that arise from distributed storage, such as the permanence of links.
omixed and its precursor maxd were then discussed by Giles Velarde and Dave Hancock from Manchester University. First, Giles described StreptoBase, which is an instance of maxd. Then he went on to a new project called omixed, which is currently “vaporware”, i.e. at a very early stage. Their website (available above) is up to give people an idea of what omixed may look like when complete. Everything, including the client web page, will work via web services, which makes it easy to integrate tools like Taverna. Tagging and wikis will be supported and kept separate from the main database, though all will be presented in the same view. They’re hoping that, with enough use, tag clouds could lead to a study of well-used terms (and perhaps, in an ideal world, to a starting place for term collection for an ontology, existing or new?). The main thing from the omixed developers’ point of view is the “user experience”.
Tiwari described a MIAME-compliant Excel worksheet. After that, Philippe from the EBI described MAGE-tab, which is a template that can be loaded into Excel to store information compliant with MIAME. However, it cannot export MAGE-ML: it just provides the correct headers. He does have someone working on a converter that will take the output of MAGE-tab and convert it to MAGE-ML. Philippe is also working on extending this tool for proteomics, specifically for Henning Hermjakob and Pride. Already available from the Pride website is the Pride harvester, which uses connections to the ontology lookup service and which has a macro that triggers conversion to the XML, and from there you can submit to Pride. However, Pride does not contain anything about experimental design or experimental [...] only uses a subset of the MAGE-OM, but the part it doesn’t use hasn’t been used at all so far.
Tim Booth then spoke on handlebar, which is a database and web application to store and create barcodes from the very start of the sampling process. It also creates basic spreadsheets based on requirements for each barcode type, and allows printing of barcodes directly to barcode printers. It was created to “tame” all the data put into spreadsheets by different groups, etc. He says there should be a paper on it in Biotechniques next month.
Andy Jones from the University of Manchester then spoke about the current status of FuGE. In the FuGE-OM, an investigation (there may be multiple investigations in a single experiment) is just a way of organizing things: investigation components contain associations to all protocol applications run for that technology. Types of protocol applications include treatments, data acquisition, and data treatments. More information on the FuGE project can be found by following the above link.
Sansone from the EBI discussed BioMAP, a new project in the works and due to start in about a month, and how it will relate to ArrayExpress, Pride and FuGE. ArrayExpress has a variety of parts above and beyond the main database itself: it has MageTAB, a MAGE-ML pipeline, and the MiameExpress tool. Beyond the Pride database, Pride also produces Pride Harvest (Excel), and has a repository and warehouse under development. She has collaborators using more than one omics technology, so how can they all benefit the most from these “independent” systems? How can they submit to and query both systems? The answer is in the EBI’s NET project (NET stands for Nutrigenomic, Environmental genomics and Toxicogenomics) for describing, storing and sharing such data. Part of this project is BioMAP. In BioMAP, investigations use multiple technologies and contain many different data [...] assays may share the same study. BioMAP will build only what is missing from the already existing EBI projects of ArrayExpress and Pride. They are only a small group of 4 people, so first they will make shorter-term solutions, for example CVs rather than ontologies, and tab-formatted inputs rather than XML.