Beta Release: CISBAN Data Portal and Integrator

The Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN) has developed a Data Portal
and Integrator (CISBAN DPI) that archives, stores, and retrieves raw high-throughput data. It is based on
Milestone 3 of the Functional Genomics Experiment (FuGE) Object Model (FuGE-OM).
We are pleased to announce that the CISBAN Data Portal and Integrator is now available in a
public sandbox version.

Please note that this release is still at an early beta stage, and any data you upload to the
server may be deleted at any time. You will need a logon to access this database, which you may request
from the helpdesk. This is a low level of security that serves only to prevent anonymous load on the
database and to keep your sandbox area separate from others. For more information on the sandbox DPI,
please visit the DPI's technical documentation.

Until now, few published systems have successfully integrated
multiple omics data types and information about experiments in a single database. The CISBAN DPI is the first
published implementation of FuGE that includes a database back-end and both expert and standard interfaces, and
that uses a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide
programmatic access to the database. Having a central data repository prevents deletion, loss, or accidental
modification of primary data, while giving convenient access to the data for publication and analysis. It also
provides a central location for storage of metadata for the high-throughput data sets, and will facilitate
subsequent data integration strategies.
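
As a rough illustration of how an LSID identifies an object, here is a minimal Python sketch that splits an LSID into its parts, following the general `urn:lsid:authority:namespace:object[:revision]` layout. The authority, namespace and object values in the usage example are made up for illustration and are not real DPI identifiers. The sketch only parses the string and does not contact any resolution service.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of a Life Science Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    # An LSID has five colon-separated parts, or six when a revision is given.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    # Hypothetical identifier, used only to show the call.
    print(parse_lsid("urn:lsid:example.org:experiment:42:1"))
```

Keeping the revision as a separate, optional field means that client code can ask for a specific version of an object or simply refer to the object as a whole.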

We encourage you to upload data and create as many experiments as you like
so that you can determine whether this application may be of use to your own research group. We would also
appreciate hearing any comments or questions you may have.



CISBAN Standards

NEBC-EBI Developer’s Workshop

On 19-20th February 2007, there was an NEBC-EBI Developer’s Workshop at CEH Oxford, which included presentations and discussions of the CISBAN DPI, omixed, MIBBI, OBI, MSI, PSI, and BioMAP, among others. Participants were from NEBC and other sections of Oxford University, EBI, University of Manchester, and Newcastle University. Once there are official minutes of the workshop, I will post the link here. Those minutes should also include links to the presentations given over the course of the two days.

The most important result of the workshop, in relation to my work, was the decision we made to implement the MIGS model as an extension of the FuGE-OM. By the time of the next workshop on the 6-8th June 2007, we will have a draft version of the MIGS-OM. Dawn Field, Allyson Lister, Andy Jones, and others are the main developers of the new OM.

Now, back to the day-by-day notes….

Below are some notes that I made during this workshop, which you may wish to peruse. This workshop was originally designed as an internal NEBC-EBI workshop; however, so many external people were invited to give presentations on current data integration and data archiving techniques within the bioinformatics community that it became something much more. It was two days of exploring possible collaborations and partnerships as well as learning about the interesting and complementary work going on at the four centers of bioinformatics research.

Day 1

Initially, there was some discussion over the proposal made from the OBI community to the OBO Foundry on a minimal amount of ontology term metadata. Specifically, there was some concern over the general use of alternative_terms in an ontology. The worry is that two different communities will call two terms by the same alternative_term. If they don’t also refer to the ID, there could be problems, as they will not be talking about the same thing. Thinking about a practical case for the moment, such as GO, does reassure me in some ways. Programmatically there is a general consensus that the GO IDs should be used in preference to their unstable names (a name may change, but as long as it is representing the same concept, the ID will remain stable). Further, databases such as UniProt cross-reference to the IDs, and not to the terms. However, there may be some cause for concern. When you see GO terms referenced in papers, for instance in a table or figure, you usually do not see the GO ID. While I think it is laudable to allow community-specific labels for each OBI term and I believe this will make uptake of the ontology easier, cases where the OBI ID is not used may create problems if they are also using a community label. Certainly, it is something to keep in mind.
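
To make the worry about shared alternative_terms concrete, here is a small Python sketch. The term IDs, names and alternative terms are invented for illustration and are not real OBI or GO entries. It indexes terms by every label they carry and reports the labels that point at more than one ID, which is exactly the case where quoting a label without its ID becomes ambiguous.

```python
from collections import defaultdict

# Hypothetical ontology terms: the IDs and labels are made up for this sketch.
TERMS = [
    {"id": "EX:0000001", "name": "cell sorting", "alternative_terms": ["sorting"]},
    {"id": "EX:0000002", "name": "data sorting", "alternative_terms": ["sorting"]},
    {"id": "EX:0000003", "name": "staining", "alternative_terms": []},
]


def index_by_label(terms):
    """Map each lower-cased label (name or alternative term) to the IDs using it."""
    index = defaultdict(set)
    for term in terms:
        for label in [term["name"], *term["alternative_terms"]]:
            index[label.lower()].add(term["id"])
    return index


def ambiguous_labels(terms):
    """Return the labels that refer to more than one term ID."""
    return {
        label: sorted(ids)
        for label, ids in index_by_label(terms).items()
        if len(ids) > 1
    }


if __name__ == "__main__":
    for label, ids in ambiguous_labels(TERMS).items():
        print(f"'{label}' is used by {', '.join(ids)}")
```

A lookup keyed on the ID never has this problem, which is why cross-referencing by ID, as UniProt does for GO, is the safer habit.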

The CISBAN DPI project, which I have spoken about at length in the past (as it is one of my projects!), was also presented here. It fits in very well with this group of people, who are interested in learning of ways to store and archive multiple omics data types. A full description of the current status of the project is available from the DPI project page.

Chris Taylor from the EBI provided a useful overview of the current status of the MIBBI project. This project has two functions: the first is to provide a central location and minimal information for any interested MI* project, and the second is to draw together a checklist of high-level requirements for all biological experiments on which specific communities can base their checklists. Chris mentioned that, at least initially, it is important to separate out those pieces of minimal information that are both useful and practical from those which are “only” useful.

The importance of MIBBI and other MI* projects is clear: poor reporting of experimental work leads to an almost zero chance of reproducibility. The importance of a shared checklist such as MIBBI is also straightforward: there is clear crossover in many areas. For instance, MIACA and MIFlowCyt are both interested in flow cytometry, though MIACA has a broader remit. Metabolomics and proteomics also use a lot of the same equipment. However, over the past months and years, these two communities have chosen different priorities in their checklists: metabolomics focuses mainly on sampling and statistics, while the proteomics checklist has spent more time on describing equipment.

In the past, MIAME has been criticized as vague, and some databases have said that they do not know whether they are MIAME compliant. However, this is in almost all cases a necessary vagueness, as consensus generally leads to compromise in definitions and requirements. Chris’ aim with MIBBI is to keep it from overly formal representations, as MIBBI is more about producing a usable set of guidelines. Each MI* module would, over time, become orthogonal. A research group would take those modules they need and end up with a bespoke set of reporting guidelines for their own work.

Tim Booth then spent some time talking about implementing MIBBI in the same forms environment (a web GUI for filling in forms) as that used for GCat. The GCat system takes an XSD as input and can create a series of web forms that will produce XML validated against that XSD, and is currently used to run the front-end to the MIGS database.
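
The general idea of driving a form from a schema can be sketched in a few lines of Python. This is not GCat's actual implementation, which I have not seen; the schema below is a made-up example, and the sketch only handles simple named elements with a `type` attribute. It reads an XSD with the standard library and lists the fields a form generator would need to render, including whether each one is required.

```python
import xml.etree.ElementTree as ET

# XML Schema namespace, in the {uri}tag form that ElementTree uses.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, invented for this sketch.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="sample">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="organism" type="xs:string"/>
        <xs:element name="collection_date" type="xs:date"/>
        <xs:element name="notes" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def form_fields(xsd_text):
    """List the simple elements in an XSD as form field descriptions."""
    root = ET.fromstring(xsd_text.strip())
    fields = []
    for element in root.iter(XS + "element"):
        type_name = element.get("type")
        if type_name is None:
            # Container element with an inline complex type: not a form field.
            continue
        fields.append({
            "name": element.get("name"),
            "type": type_name,
            # In XSD, minOccurs defaults to 1, so a field is required unless it is 0.
            "required": element.get("minOccurs", "1") != "0",
        })
    return fields


if __name__ == "__main__":
    for field in form_fields(SCHEMA):
        print(field)
```

A real generator would also need to handle nested types, enumerations and repeated elements, and would validate the submitted XML against the same schema.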

David Shotton and Graham then spoke on their shared interests with NEBC. They are researching the best way to store metadata on images, as well as how to store the images themselves. Firstly, the concept of a single entry in an image database is different from that in a sequence database: where a given sequence of a protein or gene is identical no matter where it is sequenced (as long as you are using the same sample, strain and species as the other group), an image of assay results from an Affymetrix chip, for example, will always be slightly different each time the assay is performed. Therefore the number of images will grow very quickly, and storage will be problematic. Their answer is to use a “data web”: put the image on the web (on your own research group’s pages, for example) and then add sufficient metadata that the information can be harvested into a central registry, where both programmatic and GUI access will be available. Shared interests with the NEBC and their partners include linking experimental data acquisition to ontologies and data stores, generating links to genomic and other omics databases, using the web for data acquisition and display, and exploring the interaction between informal labeling and the more formally defined ontological concepts. However, they are not trying to solve all the social problems that arise from distributed storage, such as permanence of links.

omixed and its precursor maxd were then discussed by Giles Velarde and Dave Hancock from Manchester University. First, Giles described StreptoBase, which is an instance of maxd. Then he went on to a new project called omixed, which is currently “vaporware”, i.e. at a very early stage. Their website (available above) is up to give people an idea of what omixed may look like when complete. Everything, including the client web page, will work via web services. This makes it easy to integrate things like Taverna. Tagging and wikis will be supported and kept separate from the main database, though all will be presented in the same view. They’re hoping that, with enough use, tag clouds could lead to a study of well-used terms (and perhaps, in an ideal world, to a starting place for term collection for an ontology, existing or new?). The main thing from the omixed developers’ point of view is the “user experience”.

Day 2

Tiwari described a MIAME-compliant Excel worksheet. After that, Philippe from the EBI described MAGE-TAB, which is a template that can be loaded into Excel to store information compliant with MIAME. It cannot export MAGE-ML, however: it just provides the correct headers. He does have someone working on a converter that will take the output of MAGE-TAB and convert it to MAGE-ML. Philippe is also working on extending this tool for proteomics: specifically for Henning Hermjakob and Pride. Already available from the Pride website is the Pride harvester, which uses connections to the ontology lookup service and has a macro that triggers conversion to XML; from there you can submit to Pride. However, Pride does not contain anything about experimental design or experimental factors. MAGE-TAB only uses a subset of the MAGE-OM, but the part it doesn’t use hasn’t been used at all so far.

Booth then spoke on handlebar, which is a database and web application to store and create barcodes from the very start of the sampling process. It also creates basic spreadsheets based on requirements for each barcode type, and allows printing of barcodes directly to barcode printers. It was created to “tame” all the data put into spreadsheets by different groups. He says there should be a paper on it in Biotechniques next month.

Jones from University of Manchester then spoke about the current status of FuGE. In the FuGE-OM, an investigation (there may be multiple investigations in a single experiment) is just a way of organizing things: investigation components contain associations to all protocol applications run for that technology. Types of protocol applications include treatments, data acquisition, data treatments. More information on the FuGE project can be found by following the above link.
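
The organizing role of an investigation, as described in the talk, can be pictured with a few Python dataclasses. This is only a sketch of the relationships mentioned above: the class and attribute names are my own and are not the real FuGE-OM class names, and the real model is considerably richer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ApplicationKind(Enum):
    """The kinds of protocol application mentioned in the talk."""
    TREATMENT = "treatment"
    DATA_ACQUISITION = "data acquisition"
    DATA_TREATMENT = "data treatment"


@dataclass
class ProtocolApplication:
    """One run of a protocol (hypothetical class, not a FuGE-OM name)."""
    name: str
    kind: ApplicationKind


@dataclass
class InvestigationComponent:
    """Groups the protocol applications run for one technology."""
    technology: str
    applications: List[ProtocolApplication] = field(default_factory=list)


@dataclass
class Investigation:
    """A purely organizational grouping of components."""
    name: str
    components: List[InvestigationComponent] = field(default_factory=list)

    def applications_of(self, kind: ApplicationKind) -> List[str]:
        """Names of all protocol applications of one kind, across technologies."""
        return [
            application.name
            for component in self.components
            for application in component.applications
            if application.kind is kind
        ]
```

The investigation itself holds no data here. It only points at components, which in turn point at the protocol applications, matching the "just a way of organizing things" description.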

Sansone from the EBI discussed BioMAP, a new project in the works and due to start in about a month, and how it will relate to ArrayExpress, Pride and FuGE. ArrayExpress has a variety of parts above and beyond the main database itself: it has MAGE-TAB, a MAGE-ML pipeline, and the MiameExpress tool. Beyond the Pride database, Pride also produces Pride Harvest (Excel), and has a repository and warehouse under development. She has collaborators using more than one omics technology, so how can they all benefit the most from these “independent” systems? How can they submit to and query both systems? The answer is in the EBI’s NET project (NET stands for Nutrigenomic, Environmental genomics and toxicogenomics) for describing, storing and sharing such data. Part of this project is BioMAP. In BioMAP, investigations use multiple technologies and contain many different data types. Several assays may share the same study. BioMAP will build only what is missing from the already existing EBI projects of ArrayExpress and Pride. They are only a small group of 4 people, so first they will make shorter-term solutions, for example CVs versus ontology, tab-formatted inputs versus XML.



CISBAN Meetings & Conferences

North East Regional e-Science Centre/Digital Curation Centre Collaborative Workshop

The North East Regional e-Science Centre/Digital Curation Centre Collaborative Workshop was held today, the 5th of February, at Newcastle University. The DCC's main role is to "support and promote continuing improvement in the quality of data curation and of associated digital preservation". The aim of the NEReSC is to identify, fund and support high-quality projects with leading industrial and academic partners. The NEReSC was established in July 2001, funded by the DTI through the UK Core e-Science programme, to provide expertise in e-Science and to instigate and run a set of industrially focused projects.

The first two speakers, Paul Watson and Liz Lyon, gave short introductions about their respective organizations. Paul Watson is the head of NEReSC, and Liz Lyon is the Associate Director for Community Development at the DCC.

Liz spoke of how the DCC are interested in seeing what work is being done at Newcastle University in the context of digital curation and preservation, and perhaps developing partnerships with like-minded projects at the University. The DCC has already held 2 conferences on the subject of digital curation, the last one being last November (2006) in Glasgow. At that conference they also launched the electronic journal "International Journal of Digital Curation". It is a good move, as curation and data preservation can be difficult to publish on in the more standard biology journals.

Paul Watson outlined the incredible need of the scientific community to have reliable archives of published data. He mentioned his so-called "Bowker's Standard" Scientific Data Life-Cycle, which is less of a life-cycle and really more of a gradual tailing-off. Step one is to collect data, step two is to publish the data, and step three is to gradually lose the original data as machines get turned off and students leave for greener pastures. It is humorous, but does show a real problem in the life sciences. Data for published articles should be preserved: otherwise, published papers draw conclusions from unpublished data, other groups are unable to reproduce an experiment, and the data cannot be re-used.

After these introductory speeches, there were 3 talks from Newcastle researchers on projects that involve archiving and curation. First, I spoke on the CISBAN data management strategy, which included an introduction to the CISBAN Data Portal and Integrator (slides for the DPI are available through that link). Then, Paul Watson spoke again, this time on CARMEN. There is a multitude of neuroscience data (molecular, anatomical, neurophysiological, and behavioural, to name just a few categories) in many different locations with a variety of restrictions on their publishing and availability. There are a few efforts underway to try to unify data formats and archiving, but it is difficult to overcome the cultural barriers (multiple communities acting independently; concerns from researchers about the consequences of sharing data) as well as the technical ones (multiple proprietary data formats; the great volume of data; the need for standardized detailed metadata). Hopefully CARMEN and sister efforts such as BIRN and Neuro Commons (via Scientific Commons I believe, but don't quote me on it!) will be able to make real strides in this area in the coming years. Then, Patrick Olivier spoke on his work at the Culture Lab, part of the Institute of Ageing and Health at Newcastle University. They research ways of having the humanities, social sciences and the arts inform and aid computing, and vice versa.

The afternoon was scheduled for presentations from the DCC and general discussion. Unfortunately, I had a prior engagement with another meeting, and had to bow out. However, there was lots of energy in the morning, with many people from both groups asking questions and getting involved. Digital curation, archiving and preservation together form an area in which every research group should be interested. It is very easy to forget that, unless you have some sort of data policy in your group, chances are that the data sitting on your computer is JUST on your computer, and is therefore precariously stored indeed.
