BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

SyMBA Demo. The lunch hour was also the demo hour. People came to visit me at the SyMBA demo desk for the whole hour, and we had some interesting conversations. There is one particular question I would like to relate from that hour: what should a bioinformatician choose as an output/export format for multi-omics data? This post relates my thoughts about this challenge. It's not meant to be comprehensive: just some ramblings.

I solve this challenge in SyMBA by storing everything as FuGE objects, which can be exported to FuGE-ML. FuGE-ML can be converted into ISA-TAB and into an HTML format that mimics ISA-TAB using an XSLT. Because of this link between FuGE and ISA-TAB, you can make use of two complementary formats.
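To make the XML-to-tabular idea a little more concrete, here is a minimal Python sketch. It is not the XSLT mentioned above: it uses the standard library's ElementTree instead, and the element and attribute names (Investigation, Assay, technology, dataFile) are made up for the example and are far simpler than real FuGE-ML. The output is only a tab-delimited table in the spirit of ISA-TAB, not a valid ISA-TAB file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XML. Real FuGE-ML is much richer;
# these element and attribute names are assumptions for illustration.
SAMPLE_XML = """
<Investigation name="Yeast stress study">
  <Assay name="microarray-1" technology="transcriptomics" dataFile="array1.cel"/>
  <Assay name="ms-run-7" technology="proteomics" dataFile="run7.mzML"/>
</Investigation>
"""


def xml_to_tab(xml_text):
    """Flatten each <Assay> element into one tab-delimited row."""
    root = ET.fromstring(xml_text)
    header = ["Investigation", "Assay Name", "Technology", "Data File"]
    rows = ["\t".join(header)]
    for assay in root.findall("Assay"):
        rows.append("\t".join([
            root.get("name", ""),
            assay.get("name", ""),
            assay.get("technology", ""),
            assay.get("dataFile", ""),
        ]))
    return "\n".join(rows)


if __name__ == "__main__":
    print(xml_to_tab(SAMPLE_XML))
```

The point is only that once the metadata lives in a structured XML document, producing a spreadsheet-friendly view of it is a small, mechanical step.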

But how does a bioinformatician who has just been tasked with building an application (generally on a short time-scale) choose which export format to use, e.g. FuGE or ISA-TAB? There are considerations of:

  • scale: lightweight or heavyweight implementation. A lightweight implementation might favor your own version of ISACreator and the use of ISA-TAB, or a FuGE-based archive (but not a full-blown LIMS) like SyMBA. A heavyweight solution might be a full LIMS such as PIMS, or another FuGE implementation in development called SysFusion.
  • intent: what is the purpose of storing this data? Is it for later analysis? For later deposition to a public database, e.g. at the EBI? Is it archiving? Is it a combination of these things? Your intent will shape what type of application you build, and what formats you focus your effort on. If your intent is storage only, choose whatever is most convenient for your users. However, these days there is always some aspect of data sharing or publishing. If you need further analysis of the data, then you probably want to be able to produce a computationally-friendly format such as XML. If your intent is submission to public databases, you need to ensure you export in a format they import.

Unfortunately, what this means is that the decision depends on the circumstances. FuGE and ISA-TAB are linked, and so you really get two for the price of one with those. I see this sort of thing as a positive – you have a choice as to the representation, storage and export of your data – a choice of formats! And many, like FuGE and ISA-TAB, are going to be easily convertible. The choice depends on your needs, but there is one easy choice: use something that's already been developed – don't reinvent the wheel!

Anyone else have any further suggestions?


BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

Systems Biology of Microorganisms. 11 projects from 91 institutes, whose aim is to record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way. These projects have no one concept of experimentation or modelling, which makes it tough for information exchange. Further, there are issues of people having their own solutions, suspicions (about sharing data, for instance), data issues (many don't have data or don't store it in a standard way) and resource issues (no extra resources). SysMO-DB started in July 2008, and is a 3 year funded effort (3+3 people in 3 teams over 3 sites). Provide a web-based solution to exchange, search, and disseminate data. Need to retrofit data access, model handling and data integration platform. Because of the large number of groups and projects, they are going to aim for low-hanging fruit and early wins: be realistic, not reinvent, sustainable, and encourage standards adoption.

Just like at CISBAN, where we have implemented a web-based data integration, storage, exchange, and dissemination platform in a standards-compliant way (SyMBA), they have three users: experimentalists, bioinformaticians, and modellers. They're lucky, though, in that they have 6 people to develop SysMO-DB, when CISBAN only has 1. :) And, as with CISBAN and many other data integration efforts, much of the work is social: that is, encouraging those three user types to collaborate and understand each other's work. The social solutions include questionnaires, "PALS" (postdocs and PhD students), and audits and sharing of methods, data, and models. They discuss things like what people need or don't need from MIAME. (Personal opinion and question: MIAME is intended as a minimal information checklist. What kind of things, then, don't they need? And would it be worth taking this information back to the MIAME people to possibly modify the guidelines if some aspects of it aren't truly minimal? End personal questions.)

Discovery is done via SysMO-SEEK. How to catalog the metadata, and then have mechanisms for accessing the data from locations other than the host site? There is a single search point over "yellow pages" and an assets catalogue. They store metadata on results, not the results themselves (again, just like SyMBA, which stores the metadata in a database, and the results in a remote file store). They use myExperiment for linking both the people and the assets. For models, they're using a local installation of JWS Online, which is a database of curated models and a model simulator. There are also some links to semantic SBML from the TRANSLUCENT project.
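The "metadata in a database, results elsewhere" pattern can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the SysMO-DB or SyMBA schema: the table name, column names and the example file-store URI are all hypothetical, and an in-memory SQLite database stands in for whatever database is really used.

```python
import sqlite3


def create_catalogue():
    """Create an in-memory catalogue that holds metadata only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asset ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " technology TEXT NOT NULL,"
        " data_uri TEXT NOT NULL)"  # pointer to the result file, not the file itself
    )
    return conn


def register_asset(conn, title, technology, data_uri):
    """Record where a result file lives, together with its description."""
    cur = conn.execute(
        "INSERT INTO asset (title, technology, data_uri) VALUES (?, ?, ?)",
        (title, technology, data_uri),
    )
    conn.commit()
    return cur.lastrowid


def find_assets(conn, technology):
    """Search the metadata; callers fetch the actual data via data_uri."""
    cur = conn.execute(
        "SELECT title, data_uri FROM asset WHERE technology = ? ORDER BY id",
        (technology,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    catalogue = create_catalogue()
    # The URI below is a made-up example of a remote file store location.
    register_asset(catalogue, "Run 7", "proteomics",
                   "sftp://files.example.org/run7.mzML")
    print(find_assets(catalogue, "proteomics"))
```

Searching never touches the large result files; only the small metadata records are queried, and the data stays wherever its owners keep it.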

There are two kinds of processes to store. The first is experimental processes, e.g. SOPs and protocols. They use the Nature Protocols format, with the addition of high-level classification through tags. (Personal note: What is the underlying format for storing protocols?) The second type of process is bioinformatics processes, which are stored as workflows. (Question: Why don't you store protocols as workflows? They can be chained in the same way.) Taverna is used for this work. One bit of work was using libSBML inside Taverna for collaborative model development (Peter Li et al.). Another automated (definition of automated in this context?) workflow goes from microarray to pathways and published abstracts. Their consortium wants to exchange information from public data sources, SysMO itself, and Excel spreadsheets.

(Another personal aside. FuGE (object model for experimental metadata) and ISA-TAB (tabular format, e.g. spreadsheets) are becoming interchangeable – work is going on between FuGE and ISA-TAB people right now – the most recent workshop was last week. This is important, as it was mentioned that bioinformaticians have to deal with spreadsheets (which is true enough!). So, you get the best of both worlds with FuGE / ISA-TAB, without having to define yet another schema. A personal question would be: Why build these various metadata schemas and parsers for spreadsheets (e.g. whatever is used for the Assets catalogue and JERM parsing of spreadsheets) rather than use pre-existing models such as FuGE and formats such as ISA-TAB? Using the FuGE object model does not mean that you have to use all aspects of it – you can just take what you need. Perhaps it was due to the maturity of ISA-TAB at the time the project started, though the specification is now in version 1.0. Will SysMO-DB export and import these formats? There was no time for questions at the end of the talk, so I will try to find out during the lunch period. End aside.)

Trying to map to the relevant MIBBI standard. There is a nice feature that reads spreadsheets from specific locations and automatically loads them into the Assets catalogue. (You can still load them directly into that catalogue.) They are performing a 4-site JERM exchange pilot scheme in Spring 2009.

Great talk – thanks :)

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. :)


We have just opened up the Systems and Molecular Biology Data and
Metadata Archive (SyMBA, formerly known as the CISBAN DPI) to the
community under the terms of the GNU LGPL. Its new home is SourceForge,
and there is a subversion repository, installation instructions,
mailing list, issue tracker etc available.

The main URL is:

http://symba.sourceforge.net

The Project Page on SF (where you can get to screenshots, subversion browsing etc) is here:

http://sourceforge.net/projects/symba/

We changed the name of the project to reflect the wider diversity of
developers now contributing to the project. I will be sending
information, announcements, and answers to questions on the symba
developers mailing list, which everyone can subscribe to:

symba-devel@lists.sourceforge.net

If you wish to subscribe, please go here:

http://lists.sourceforge.net/mailman/listinfo/symba-devel

We'd be happy to have additional developers on the project,
and if there is any feature or bug you'd like to report, please use our
issue trackers:

http://sourceforge.net/tracker/?group_id=202680

If you'd like to take a more hands-on approach, then please email me your
SourceForge user id, and I'll add you as a developer on the project.

SyMBA was initially developed (and is still mainly developed) by the Integrative Bioinformatics Group, headed by Neil Wipat and part of CISBAN. Many thanks to all who have helped, via code or comment, and also to the current SourceForge Developers listed below:

Allyson Lister (CISBAN)
Olly Shaw (CISBAN)
Dan Swan (Newcastle Bioinformatics Support Unit)
Frank Gibson (CARMEN Neuroscience Project)

SyMBA is also being evaluated by other members of the CARMEN project and by CSBE.

The sandbox to play around with SyMBA is now up again at
http://bsu.ncl.ac.uk:8081/symba after a major disk malfunction on the
old server. I'll transfer all logins from the old system to the new
one, and if you'd like a new login, please drop me a line. In the
meantime, please have a look around the new SourceForge site and
also the code, if you like! All comments and suggestions welcome.

Some general information:
The Centre for Integrated Systems Biology of Ageing and Nutrition has
developed a data archive and integrator (SyMBA) based on Milestone 3 of
the Functional Genomics Experiment (FuGE) Object Model (FuGE-OM), and
which archives, stores, and retrieves raw high-throughput data. Until
now, few published systems have successfully integrated multiple omics
data types and information about experiments in a single database.
SyMBA is the first published implementation of FuGE that includes a
database back-end, expert and standard interfaces, and a Life Science
Identifier (LSID) Resolution and Assigning service to identify objects
and provide programmatic access to the database. Having a central data
repository prevents deletion, loss, or accidental modification of
primary data, while giving convenient access to the data for
publication and analysis. It also provides a central location for
storage of metadata for the high-throughput data sets, and will
facilitate subsequent data integration strategies.
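
For readers who haven't met LSIDs before: an LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional revision on the end. The Python sketch below only splits such a string into its parts; it is not SyMBA's LSID service and does no resolution. The example identifier (example.org, experiment, 12345) is invented for illustration.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    # The 'urn:lsid' prefix is matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("authority, namespace and object must be non-empty")
    # An absent or empty revision is reported as None.
    revision = (parts[5] or None) if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


if __name__ == "__main__":
    # Hypothetical identifier, not one issued by any real authority.
    print(parse_lsid("urn:lsid:example.org:experiment:12345:2"))
```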

http://symba.sf.net


A Technical Report for the School of Computing Science of Newcastle University was released last month describing the CISBAN DPI, an implementation of the FuGE Milestone 3 STK. You can find and download that technical report here:
http://www.cs.ncl.ac.uk/research/pubs/trs/abstract.php?number=1016

The Abstract follows:

The Centre for Integrated Systems Biology of Ageing and Nutrition has
developed a Data Portal and Integrator (CISBAN DPI) that is based on
the FuGE Object Model and which archives, stores, and retrieves raw
high-throughput data. Until now, few published systems have
successfully integrated multiple omics data types and information about
experiments in a single database. The CISBAN DPI is the first published
implementation of FuGE that includes a database back-end, expert and
standard interfaces, and utilizes a Life Science Identifier (LSID)
Resolution and Assigning service to identify objects and provide
programmatic access to the database. Having a central data
repository prevents deletion, loss, or accidental modification of
primary data, while giving convenient access to the data for
publication and analysis. It also provides a central location for
storage of metadata for the high-throughput data sets, and will
facilitate subsequent data integration strategies.

Keywords

Functional Genomics, High-Throughput
Experiments, FuGE, LSID, Experimental Workflows, Databases, Data
Standards, Data Sharing, Metadata, Data Integration.


CS-TR: 1016 Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator
Lister, A. L., Jones, A. R., Pocock, M., Shaw, O., Wipat, A.
School of Computing Science, Newcastle University, Apr 2007
