Henning Hermjakob: PSICQUIC and EnVision

This is a presentation given on 29 April, 2010, at the Link-Age / LifeSpan Workshop on Data Handling for Biogerontology Research held by CISBAN, Newcastle University.

Data integration: one definition is the combination of data residing in different sources to provide users with a unified view of these data. Questions of relevance for the data integration field include scope (all, datasets), type (same, different), implementation (federation, centralisation), access (programmatic, i.e. computer to computer; web, i.e. interactive) and ownership (public, private). In this talk, Henning covers federated, mainly programmatic techniques using data of the same type.

To take an example, suppose you start with a sample (e.g. from a mouse). Observations of this sample result in one or more (overlapping or non-overlapping) publications. The publication information can then be used to annotate interaction databases and sent to PSICQUIC servers. PSICQUIC should allow the user to reconstruct an idealised view of the original system from the interaction data.

The molecular interaction standard is the PSI-MI standard, whose first XML version was produced in February 2004. There have been updates and extensions since then, and it has been widely implemented by the major interaction databases including DIP, MINT, MIPS, IntAct, HPRD, etc. (http://www.psidev.info/MI)

The PSI-MI XML format is full featured, but complex. This complexity is both its strength and its weakness. In response to user requests, they therefore developed a simplified tabular format called MITAB, where one row equals one binary interaction. You lose a lot of information, such as whether a binary interaction is part of a more complex reaction, but it has proven popular.
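As a rough illustration of the "one row equals one binary interaction" idea (my own sketch, not something shown in the talk), here is how a single MITAB 2.5 row could be split into its 15 tab-separated columns in Python. The short column labels are my own names for the columns, the splitting of multi-valued fields on "|" is deliberately naive, and the example row is made up rather than taken from any database.

```python
# Short, informal labels for the 15 columns of MITAB 2.5 (the names are my own).
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab_line(line):
    """Split one MITAB 2.5 row into a dict; '-' marks an empty field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MITAB25_COLUMNS):
        raise ValueError(
            f"expected {len(MITAB25_COLUMNS)} columns, got {len(fields)}"
        )
    # Multi-valued fields are separated by '|'; this naive split ignores quoting.
    return {
        name: (None if value == "-" else value.split("|"))
        for name, value in zip(MITAB25_COLUMNS, fields)
    }


# A made-up row: every identifier here is illustrative only.
EXAMPLE_ROW = "\t".join([
    "uniprotkb:P00001", "uniprotkb:P00002", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Example et al. (2010)", "pubmed:00000000",
    "taxid:10090(mouse)", "taxid:10090(mouse)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])

interaction = parse_mitab_line(EXAMPLE_ROW)
```

Everything known about the interaction sits in that one flat dictionary, which is exactly why there is no room to say that the pair belongs to a larger complex.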

PSICQUIC is a single API which is implemented by many databases, such as those mentioned earlier. Its purpose is to query molecular interaction databases, and it uses a common query language (MIQL, which is based on Lucene) to do so. It can be used for PPIs, drug-target interactions and simplified pathway data. The simple PSICQUIC viewer is at http://www.ebi.ac.uk/Tools/webservices/psicquic/view. The viewer can also point to other resources such as IntAct and many non-EBI databases. It also has a fancier, graphics-based implementation in which molecular interactions are overlaid on Reactome pathways.

MIQL can query every field available within MITAB in a precise way. SOAP and REST interfaces are available and documented at http://code.google.com/p/psicquic.
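To give a feel for what a programmatic query might look like, here is a minimal Python sketch that assembles a Lucene-style MIQL string and turns it into a REST URL. Please treat the details as assumptions on my part: the base URL is a placeholder, the "/query/<MIQL>" path layout and the field names (species, detmethod) are from my memory of the interface rather than from the talk, and the project documentation is the place to check them. Nothing is sent over the network.

```python
from urllib.parse import quote

# Placeholder base URL: substitute the REST endpoint of a real PSICQUIC service.
BASE_URL = "http://example.org/psicquic/webservices/current/search"


def miql_and(**terms):
    """Join field:value pairs into one MIQL (Lucene-style) query with AND."""
    parts = []
    for field, value in terms.items():
        # Quote values containing spaces so they are treated as a phrase.
        text = f'"{value}"' if " " in value else value
        parts.append(f"{field}:{text}")
    return " AND ".join(parts)


def query_url(miql, base_url=BASE_URL):
    """Build a REST URL for a MIQL query (assumed layout: <base>/query/<miql>)."""
    return f"{base_url}/query/{quote(miql, safe='')}"


miql = miql_and(species="mouse", detmethod="two hybrid")
url = query_url(miql)
```

Because every service implements the same API, the same query string could in principle be sent to each service in turn by changing only the base URL.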

The challenge is to move PSICQUIC from simple access to all the resources to a real integrated view of all those resources. How do you determine whether two sources really are talking about the same interaction? Also, the compute time quickly moves beyond what is suitable for interactive use.
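A very naive first step towards the "same interaction?" question can be sketched in Python: treat A-B and B-A as the same pair and record which sources report each pair. This is my own illustration under strong assumptions (both sources use the same identifier scheme, and the records are hypothetical); real merging would also have to deal with identifier mapping, detection methods and publications.

```python
from collections import defaultdict


def pair_key(id_a, id_b):
    """Order-independent key: A-B and B-A describe the same binary pair."""
    return tuple(sorted((id_a, id_b)))


def group_by_pair(records):
    """Map each interactor pair to the set of sources reporting it."""
    sources = defaultdict(set)
    for source, id_a, id_b in records:
        sources[pair_key(id_a, id_b)].add(source)
    return dict(sources)


# Hypothetical records: (source name, interactor A, interactor B).
records = [
    ("source1", "P00001", "P00002"),
    ("source2", "P00002", "P00001"),  # same pair, reversed order
    ("source2", "P00001", "P00003"),
]
grouped = group_by_pair(records)
```

Even this toy version has to touch every record from every source, which hints at why compute time grows quickly once many services are queried at once.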

PSICQUIC is a technical solution, whereas IMEx is the social/collaborative answer. IMEx is the International Molecular Exchange Consortium. The aims of its members include: avoiding redundant curation, providing a consistent body of public data using detailed joint curation guidelines, and providing a network of stable and comprehensive resources for MI data. This work has been in its production phase since February 2010. The curation work is divided among the different databases by journal. You can find out more information about IMEx at http://imex.sf.net. Each interaction has its own database’s identifier, but also an identifier from a common IMEx identifier space. The hardest part was harmonizing curation procedures, and they now have a common curation manual across all databases.

Another aspect of his work is EnCORE, which integrates different data types using a federated, programmatic approach. EnCORE is an ENFIN platform to enable mining data across various domains, sources, formats and types. It integrates database resources and analysis tools across different disciplines. The first focus is on developing an EnXML format. Access interfaces include Perl API, Java API, ftp, SOAP, REST, GUI, etc. The return formats come in a variety of flavours, e.g. XML, CSV, plain text, JSON, etc. All of this must be squeezed into one consistent format. This is done by putting wrappers around the various programs.
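The wrapper idea can be sketched in a few lines of Python. This is only my illustration of the general pattern, not EnCORE's code: two hypothetical services return the same kind of record as CSV and as JSON, and a small wrapper per format converts each response into one common shape (here a list of dictionaries, standing in for EnXML).

```python
import csv
import io
import json


def from_csv(text):
    """Wrap a CSV-returning service: one dict per data row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Wrap a JSON-returning service that yields a list of objects."""
    return [
        {key: str(value) for key, value in item.items()}
        for item in json.loads(text)
    ]


# Hypothetical responses from two services describing the same kind of record.
csv_response = "id,name\nP00001,protein one\n"
json_response = '[{"id": "P00002", "name": "protein two"}]'

records = from_csv(csv_response) + from_json(json_response)
```

Code downstream of the wrappers only ever sees the common shape, so a change in one service's output format should only require changing that service's wrapper.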

The EnXML structure is set oriented – not only does it tell you about one thing (e.g. protein), but also about a set of them. In this structure, an experiment is run which identifies the results. Each experiment references a Set structure, which contains the structure of the result. Sets can hold further nested sets. There are a number of other further sub-structures. The EnCORE results always include both a positive and negative result set (in the case of the negative result, it lists all identifiers for which *no* hit was found). Negative results allow you to track why you might not have gotten a response, and how you “lost” some identifiers from the result.
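The positive/negative result idea is easy to mimic in Python. The sketch below is my own illustration with a made-up lookup table standing in for a remote service; it is not the EnXML structure itself, just the bookkeeping it implies.

```python
def map_identifiers(identifiers, lookup):
    """Return (positive, negative): the hits found, and identifiers with no hit."""
    positive = {}
    negative = []
    for identifier in identifiers:
        if identifier in lookup:
            positive[identifier] = lookup[identifier]
        else:
            negative.append(identifier)
    return positive, negative


# Hypothetical mapping table standing in for a remote service.
lookup = {"P00001": "GENE1", "P00003": "GENE3"}
positive, negative = map_identifiers(["P00001", "P00002", "P00003"], lookup)
```

Every input identifier ends up in exactly one of the two results, so nothing can silently disappear between one step of an analysis and the next.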

EnVision is an end-user tool for the above EnCORE work based on the EnXML format. It provides an initial, integrative view for Sets of molecular entities without the need for programming. It also allows the possibility for further local processing. It allows you to save the status/analysis of your material on a particular date, and use that for, e.g., supplementary materials. You can also download your sub-results in a tabular format. Further information and the ability to run this GUI is available from http://www.enfin.org, where you can play with an EnCORE tutorial.

All of this can be quite laborious – web services that are used by EnCORE can change without warning, so it’s a constant challenge to maintain all of these wrappers. A partial answer is to use, wherever possible, underlying standards for the individual services. Such standards include PSICQUIC for MI data. DAS will be used to access protein annotation and information.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.
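To make the "elements become classes, attributes become properties, model data become individuals" idea a little more concrete, here is a small Python sketch. It is only an illustration of the mapping and not the MFO code: the XML fragment is hand-written in the style of SBML Level 2, and the output is a plain list of (individual, property, value) statements rather than real OWL.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written fragment in the style of SBML Level 2 (not a full model).
SBML_FRAGMENT = """
<listOfSpecies>
  <species id="s1" name="glucose" compartment="cell"/>
  <species id="s2" name="ATP" compartment="cell"/>
</listOfSpecies>
"""


def to_statements(xml_text):
    """Emit (individual, 'type', Class) for each element with an id, plus one
    (individual, property, value) statement per remaining attribute."""
    statements = []
    for element in ET.fromstring(xml_text).iter():
        individual = element.get("id")
        if individual is None:
            continue  # this sketch only handles elements that carry an id
        statements.append((individual, "type", element.tag.capitalize()))
        for attribute, value in element.attrib.items():
            if attribute != "id":
                statements.append((individual, attribute, value))
    return statements


statements = to_statements(SBML_FRAGMENT)
```

In OWL terms, an attribute whose value is the id of another element (such as compartment) would naturally be an object property, while a plain value such as name would be a datatype property.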

A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals! [1] It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Of the reasoners tried so far, Pellet 2.0.0 rc7 is the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO subversion project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies, of the format in which your data sources are available. By layering your ontologies like this, you can separate out the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels. This is a database of simulatable, biological models whose primary format is SBML. I’m especially interested in BioModels, as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. In BioModels, the “native” format for the models is SBML, though other formats are available. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. Its purpose is to be a direct translation of SBML, SBO, and the SBML Specification document into OWL. Using an editor such as Protege, you can manipulate and create models, and then use the MFO code to export the completed model back to SBML (while the import feature is complete, the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features. However, I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathway types. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that – it stores logical restrictions and axioms that are otherwise stored only in SBO itself or, more importantly, in the purely human-readable content of the SBML specification document [2]. MFO is therefore more than SBML: it adds a set of extra constraints that aren’t present in the BioPAX version of SBML, which is why I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.
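As an example of the kind of rule that lives in the prose of the SBML specification rather than in the XSD, a species must refer to a compartment that is actually declared in the model. MFO would express a rule like this as an OWL axiom and leave the checking to a reasoner; the plain-Python sketch below is only meant to show what such a constraint amounts to, and the model content in it is hypothetical.

```python
def undefined_compartments(species, compartments):
    """Return ids of species whose compartment is not a declared compartment."""
    declared = set(compartments)
    return [sid for sid, comp in species.items() if comp not in declared]


# Hypothetical model content: species id -> compartment id.
species = {"s1": "cell", "s2": "nucleus"}
problems = undefined_compartments(species, compartments=["cell"])
```

A direct format-to-format conversion would carry both species across without complaint; it is the extra constraint that flags the second one.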

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

Pre-Building an Ontology: What to think about before you start

There are a few big questions that need to be kept firmly in mind when starting down the road of ontology building. These are questions of:

  1. Goals: What are you trying to achieve with this ontology?
  2. Competency/Scope: What are you trying to describe?
  3. Granularity: To what depth will you need to go?

The rest of this post relates directly to and is organised around these three topics. These topics have a lot of overlap, and aren’t intended to be mutually exclusive: they’re just ideas to get the brain going. I use the upcoming Cell Behavior Ontology (CBO) workshop to illustrate the points. The questions I single out below may already have been answered by the workshop organizers, but haven’t been published on the CBO wiki yet. I’ll be attending this workshop, and will aim to post my notes each day. It should be fun!


Goals

If a main goal is eventual incorporation within another ontology (e.g. Gene Ontology (GO) for the case of CBO) or even just alignment with the other ontology’s tenets, you have to be sure you’re happy with the limitations this may put on your own ontology. It may be that these limitations are not acceptable, and as a result you choose to reduce the dependencies on the other ontologies.

For CBO, the important questions relate to possible alignment to GO and therefore, ultimately, Basic Formal Ontology (BFO):

Question: Do you wish to ultimately include some CBO terms under, for example, biological processes of GO? GO contains only canonical/non-pathological terms. How does this fit with the goals of CBO?

GO has the express intent of creating terms covering only canonical / non-pathological biology. Therefore, would cell behavior during cancer (e.g. uncontrolled cell proliferation or metastasis, which aren’t in GO) be appropriate if CBO is meant to be included, in its entirety, within GO? They are important terms, so if some amount of incorporation with GO is appropriate, would it only end up being a partial alignment?

Question: Are there any plans to use an Upper Level Ontology (ULO) such as the OBO Foundry-recommended BFO? Though BFO may not need to be considered immediately, it does place certain restrictions on an ontology. Are you happy with those restrictions?

One example of the restrictions placed by the use of BFO is that within BFO, qualities cannot be linked via the Relations Ontology to processes. That is, if you have a property called has_rate which is a child property of “bears”, then you are not allowed to make a statement such as “cell division has_rate some_rate”, where cell division is a process, and some_rate is a quality. There is a good post available about ULOs by Phil Lord.

Question: How richly do we want to describe cell behaviors?

Another important general goal is the level of richness that is needed in CBO. Competency questions, discussed later, will answer this to some extent. We can think about richness using GO as an example. The goal of the GO developers is the integration of multiple resources through the use of shared terms. GO does this very well. Rich descriptions and semantic interoperability, however, are not goals of GO.


Competency/Scope

While it is often a tempting idea to start from the top of an ontology and work downward, consideration should be given to an initial listing of leaf terms that you are sure that you need in the ontology. Not only does this ensure you have terms that people need from the start, the bagging and grouping exercises you would then go through to create the hierarchy will often highlight any potential problems with your expected hierarchy. If you have clear use-cases, then a bottom-up approach, at least in the early stages, can be useful in figuring out what the scope of your ontology is.

This brings us to the importance of having scope – and a set of competency questions – ready from the beginning of ontology development. What do you want to describe?

Question: What is the definition of cell behavior in the context of CBO?

For instance, for CBO, what is meant by the word “behavior”? A specific description of what is, and isn’t, a behavior that the CBO is interested in, is an important first step.

The last thing that would be relevant to the overall goals (but which could equally well be considered in the Granularity section below) is the type of terms to be added:

Question: Should the terms be biological terms only, or also bioinformatics/clinical terms?

To better explain the above question, you could consider the stages of cancer progression. “Stage 2” is a fictitious name for a clinical/bioinformatics description of a stage of a cancer. This is not a biological term. Which type of term should go into CBO? I would guess that the biological term describing the biology of a cell at “stage 2” should go in, perhaps with synonyms linking it to the bioinformatics/clinical terms. There probably shouldn’t be a mix of the two types of terms as the primary labels.

Additionally, competency questions can help determine the scope. You can make a list of descriptive sentences that you want the ontology to be able to describe, such as “The behavior of asymmetric division (e.g. stem cell division)”. By listing a number of such sentences, you can determine which are out of scope and which must be included, thus building up a clear definition of the scope.


Granularity

For me the granularity question has two aspects. The first, more general one is how fine-grained you want your terms to be. The second, more interesting one is whether, in the context of CBO, we are interested in the behavior of cells and/or the behavior in cells. The examples given in the workshop material seem to come from both of these areas (see http://cbo.compucell3d.org/index.php/Breakout_Session_Groups).

Question: Should CBO deal with the behavior OF cells and/or the behavior IN cells?

For the above question we can use cell polarization and cell movement as examples. Both are listed on the wiki page linked just above, so both are considered within the scope of CBO. However, cell movement is a characteristic behavior of a cell, while polarization is something that happens in a cell (e.g. polarization within a S. cerevisiae cell with regard to the bud scar). Both of these types of behavior are relevant, but they are different classes of behavior, and this may be an appropriate separation within the CBO hierarchy.

As an aside, is cell division a behavior? It is covered in the CBO material, so with respect to CBO, it is. I think that the CBO is intended to deal with single cells, so I’m not sure where cell division fits in.

These questions should be considered, but you should also try not to let them reduce the effectiveness and efficiency of ontology development. As in many biological domains, if you ensure that everyone is on the same page with their goals, scope, and granularity, there will be (I believe!) fewer arguments and more results.

Also, I am positive I’ve missed stuff out, so please add your suggestions in the comments!

With special thanks to Phil Lord for the useful discussions surrounding ontology building that formed the basis for this post.

CISBAN and telomere maintenance and shortening, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

Amanda Greenall: Telomere binding proteins are conserved between yeast and higher eukaryotes. The capping proteins are very important, because they prevent the telomeres from being recognized as double-strand breaks. They work on cdc13, which is the functional homologue of POT1 in humans. A point mutation, cdc13-1, allows them to study telomere uncapping. When cells are grown above 27 degrees Celsius, the cdc13-1 protein becomes non-functional and falls off. This uncapping causes telomere loss and cell-cycle arrest. So, they do further study into the checkpoint response that happens when telomeres are uncapped. Yeast is a good model, as many of the proteins involved in humans have direct analogs in yeast. They did a series of transcriptomics experiments to determine how gene expression is affected when telomeres are uncapped. They did 30 arrays, and the data was analysed using limma. 647 differentially-expressed genes were identified (418 upregulated (carbohydrate metabolism, energy generation, response to OS), and 229 downregulated (amino acid and ribosome biogenesis, RNA metabolism, etc)). The number of differentially-expressed genes increases with time. For example, 259 of the genes were involved in DNA damage response.

They became quite interested in BNA2, which is an enzyme that catalyses de novo NAD+ biosynthesis. Why is it upregulated? It seems over-expression of BNA2 enhances survival of cdc13-1 strains (using spot tests). Nicotinamide biosynthetic genes are altered when telomeres are uncapped in yeast and humans. The second screen was a robotic screen to identify ExoX and/or pathways affecting responses to telomere uncapping. Robots were used to do large-scale screens that can measure systematic cdc13-1 genetic interactions. One of the tests was the up-down assay, which allows them to distinguish Exo1-like and Rad9-like suppressors. They carry on with the spot tests until they have worked through the entire library of strains.

Darren Wilkinson: A discrete stochastic kinetic model has been built to model the cellular response to uncapping (J Royal Soc Interface, 4(12):73-90); it is also available in BioModels. It is encoded in SBML and simulated in BASIS (a web-based simulation engine). You can use the microarray data to infer networks of interactions. Such top-down modelling can often be done with Dynamic Bayesian Networks (DBNs) for discretised data and sparse Dynamic Linear Models (DLMs) for (normalized) continuous data. A special case of the DLM is the sparse vector auto-regressive model of order 1, known as the sparse VAR(1) model, and this appears to be effective for uncovering dynamic network interactions (see Opgen-Rhein and Strimmer, 2007). They use a simple version of this model, with an RJ-MCMC algorithm to explore both graphical structure and model parameters. The RJ-MCMC output is quite hard to visualize, so they plot the marginal probability that each edge exists. This can also be summarised by choosing an arbitrary threshold and then plotting the resulting network. You can change the thickness of the edges so they match the marginal probability associated with each edge. This picture is then easier for biologists to analyse, and allows them to narrow down their search for important genes. He also performed analysis of the robotic genetic screens. There are usually about 1000 images per experiment, each with 384 spots, and therefore image analysis needs to be automated. They want to pick out those strains that are genetically interacting with the query mutation. For interactions to be a useful concept in practice, you need the networks to be sparse. With HTP data, we have sufficient data to be able to re-scale the data in order to enforce this sparsity. A scatter-plot of double against single will show them all lying along a straight line (under a model of genetic independence). Points above and below the regression line are phenotypic enhancers and suppressors, respectively.
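The scatter-plot idea at the end can be sketched in plain Python: fit a straight line to double-mutant against single-mutant measurements and look at which strains sit well away from it. This is my own minimal illustration and not the analysis presented in the talk; the fitness values are made up, ordinary least squares stands in for whatever model was really used, and the 0.2 cut-off is arbitrary.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Signed vertical distance of each point from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up fitness values: single-mutant (x) and double-mutant (y) per strain.
single = [1.0, 2.0, 3.0, 4.0, 5.0]
double = [0.5, 1.0, 2.5, 2.0, 2.5]

# Indices of strains lying clearly above the line (arbitrary cut-off of 0.2).
above_line = [i for i, r in enumerate(residuals(single, double)) if r > 0.2]
```

Strains close to the line behave as genetic independence predicts; the ones with large residuals in either direction are the candidates for genetic interaction with the query mutation.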

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



Introduction and Directors’ Updates: BBSRC SB Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 15 December 2008.

This is the third BBSRC grantholders' workshop for systems biology. The participants include grantholders and their staff; people connected with the activities of the 6 SB centres; guests from BBSRC-sponsored institutes; guests from outside of these institutes; and finally, members of the BBSRC peer-review panels. Why is this workshop happening? It's an opportunity for sponsors and participants to find out and report how things are going, to share experiences with colleagues, and, of course, to talk about new ways to solve biological problems. I'll be providing my notes on most of the talks that are given over the next few days.

Now for my notes on the updates from the directors of the six Centres for Integrated Systems Biology (CISBs).

CISBAN (Tom Kirkwood): Why and how do we age? Limited investment in cell maintenance (the disposable soma theory). Ageing is caused primarily by damage, and longevity is regulated by resistance and repair. There are multiple mechanisms: it is a highly complex system which is inherently stochastic. There are a number of interesting questions that revolve around optimality, plasticity, and trade-offs. CISBAN is now involved as an academic partner in the EU network of excellence, LifeSpan. One area of major research within CISBAN is that of telomere length homeostasis. Work includes high-throughput (HT) yeast robot experiments, which have led to the identification of a large number of genes that affect telomere maintenance. CISBAN also studies the intrinsic ageing of mammalian cells, where there is considerable cell-to-cell heterogeneity. Research has shown that it is much more than just telomere erosion: there is significant crosstalk between telomeres and, for example, mitochondrial dysfunction. He also gave an overview of BASIS, a web application that allows the researcher to upload SBML models and run simulations on the BASIS server. There is a large amount of research going on within CISBAN that has a statistical focus, and there is also work going on with respect to data integration.

CISBIC (Brian Robertson): CISBIC focuses on host-pathogen interactions, and contains 3 exemplar sub-projects. They have core facilities in glycoproteomics, metabolomics, transcriptomics and cell imaging. They are interested in combating disease in both plants and animals. They use infection as a perturbation to study host biology, mainly focusing on the interaction between the microbe and the innate immune response. They have tried very hard to create a community of systems biologists, and also have collaborations with people outside the centre. Top questions from experimental biologists: "How do I start in SB?", "How do I start modelling?", "Which modelling approach should I take?". They're considering starting a summer school to train people.

CPIB (Charlie Hodgman): Began with a nice screencast/automated slideshow of pictures and work going on at CPIB. The main research is on plant roots, and how they respond to external stimuli. The research program is split into 4 strands: root cell elongation, root apical meristem, lateral root development, and an integrated root model. Models they are developing include: static and dynamic network modelling, biomechanical modelling, tissue-scale computational modelling, and more. Technological developments include vertical-stage confocal imaging and automated focusing time-lapse microscopy. Outreach work includes an advanced hormone-transport workshop in May 2008 and their first plant modelling summer school in September 2008. In December 2007 they had a Maths and Plant Sciences Study Group.

CSBE (Igor Goryanin): This centre is focused on dynamic biological systems rather than on any particular biological project. They are concentrating on cell-to-cell interactions, and their main aim is to streamline the modelling process. Mentioned the Bio-PEPA Eclipse Plug-in, which can run simulations within Eclipse.

MCISB (Hans Westerhoff): Focus on one project, using S. cerevisiae: they want to make SB work, in a complete sense. This means from the individual sequences right up to modelling a complete cell.

OCISB (Judy Armitage): Their remit is to build upon the strengths of groups within Oxford to provide a focus for the catalytic development of a wide range of systems approaches. They are trying to understand, predict and control physiological behaviour by integrating knowledge of interactions at the molecular, cellular and population levels. Their initial core testbeds are: bacterial chemotaxis and sensory transduction, cell cycle, hypoxia and the Trypanosome flagellum. New projects include: plant metabolomics and signalling, extracellular matrix, T-cell signalling, diversity of host responses in malarial infection, meiosis, biofilm development and synthetic signalling networks. They're also doing more work in outreach, including an open seminar series and one-day workshops.

I really liked how each of the directors thought it was important to point out the outreach – it is an opinion I share. Additionally, what was interesting is that every director made a specific point of how their Centre was quite unusual in the amount of interchange and multidisciplinary research they do. This type of collaboration has traditionally been very rare, and the Centres were originally quite unusual in that regard. However, what is gratifying is that, in order to be successful in SB, outreach and collaboration have become a necessity, and I am hearing them talked about much more these days. Perhaps those directors can now all remove the word "unusual" from the next versions of those presentations. Wouldn't that be nice? 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



Pre-workshop post on the FuGE / ISA-TAB Workshop, 8-9 December

Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)

ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format acceptable by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.

Each day I'll provide a summary of what's gone on at the workshop, which centers around the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore the conversion in the FuGE-ML to ISA-TAB direction may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaining whatever concrete format they wish.
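Why might a tabular export be lossy at all? The toy Python sketch below flattens a nested record into a single table row. It is purely illustrative: the field names are made up and are not FuGE's or ISA-TAB's, and it says nothing about which details the real conversion keeps or drops. It only shows how fixed columns can leave no room for nested detail.

```python
def flatten(protocol):
    """Flatten a nested record into one table row, keeping the protocol name
    and only the names (not the values or units) of its nested parameters."""
    return {
        "protocol": protocol["name"],
        "parameters": ";".join(p["name"] for p in protocol["parameters"]),
    }


# Hypothetical nested record; the field names are illustrative only.
protocol = {
    "name": "hybridisation",
    "parameters": [
        {"name": "temperature", "value": 42, "unit": "C"},
        {"name": "duration", "value": 16, "unit": "h"},
    ],
}
row = flatten(protocol)
```

Going back from the row to the nested record is then impossible without extra columns or conventions, which is the sort of gap a careful mapping between the two formats has to close.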

Specific topics of the workshop include:

  • Advance and possibly finalize XSLT rendering of FuGE documents into ISA-TAB. This includes the finishing-off of the generic FuGE XSL stylesheet.
  • Work on some of the extensions, including FCM, Gel-ML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.

Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and to everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumvirate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!
