CISBAN Data Integration Meetings & Conferences

Henning Hermjakob: PSICQUIC and EnVision

This is a presentation given on 29 April, 2010, at the Link-Age / LifeSpan Workshop on Data Handling for Biogerontology Research held by CISBAN, Newcastle University.

Data integration: one definition is combining data that reside in different sources to give users a unified view of them. Questions of relevance for the data integration field: scope (all, datasets), type (same, different), implementation (federation, centralisation), access (programmatic i.e. computer to computer, web i.e. interactive) and ownership (public, private). In this talk, Henning covers federated, mainly programmatic techniques using data of the same type.

To take an example, suppose you start with a sample (e.g. from a mouse). Observations of this sample result in one or more (overlapping or non-overlapping) publications. The publication information can then be used to annotate interaction databases and sent to PSICQUIC servers. PSICQUIC should allow the user to reconstruct an idealised view of the original system from the interaction data.

The molecular interaction standard is the PSI-MI standard, whose first XML version was produced in February 2004. There have been updates and extensions since then, and it has been widely implemented by the major interaction databases including DIP, MINT, MIPS, IntAct, HPRD, etc.

The PSI-MI XML format is full featured, but complex. This complexity is both its strength and its weakness. In response to user requests, they therefore developed a simplified tabular format called MITAB, where one row equals one binary interaction. You lose a lot of information, such as whether a binary interaction is part of a more complex reaction, but it has proven popular.
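To make the "one row equals one binary interaction" idea concrete, here is a minimal sketch of reading MITAB rows in Python. It was not part of the talk. The column names are my own labels for what I understand to be the 15-column MITAB 2.5 layout, and the example row is entirely made up, so check both against the PSI-MI documentation before relying on them.

```python
import csv
import io

# Column labels assumed from the 15-column MITAB 2.5 layout (an assumption,
# not taken from the talk); verify against the PSI-MI documentation.
MITAB_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab(text):
    """Yield one dict per binary interaction (one per non-comment row)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header comments
        # zip() stops at the shorter sequence, so extra columns are ignored
        yield dict(zip(MITAB_COLUMNS, row))


# Made-up example row with placeholder identifiers, for illustration only.
example = "\t".join([
    "uniprotkb:PROT_A", "uniprotkb:PROT_B", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:0000000",
    "taxid:10090(mouse)", "taxid:10090(mouse)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])
interactions = list(parse_mitab(example))
print(interactions[0]["id_a"], "interacts with", interactions[0]["id_b"])
```

Each row stands alone, which is what makes the format easy to handle and also why information about larger complexes is lost.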

PSICQUIC is a single API that is implemented by many databases, such as those mentioned earlier. Its purpose is to query molecular interaction databases, and it uses a common query language (MIQL, which is based on Lucene) to do so. It can be used for PPIs, drug-target interactions and simplified pathway data. There is a simple PSICQUIC viewer, which can also point to other resources such as IntAct and many other non-EBI databases. The viewer also has a fancier, graphics-based implementation in which molecular interactions are overlaid on Reactome pathways.

MIQL can query every field available within MITAB in a precise way. SOAP and REST interfaces are available and documented.
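As a rough illustration of what a fielded MIQL query sent over REST might look like, the sketch below only builds a query string and a URL; it does not contact any server. The base URL is a hypothetical placeholder, the `/query/<MIQL>` path is my assumption about the REST layout, and the field names `species` and `detmethod` are assumed MIQL fields, so consult the PSICQUIC documentation for the real details.

```python
from urllib.parse import quote

# Hypothetical base URL: substitute the address of a real PSICQUIC service.
BASE_URL = "https://psicquic.example.org/webservices/current/search"


def miql_and(**fields):
    """Combine field:value pairs into a single MIQL (Lucene-style) query."""
    terms = []
    for name, value in fields.items():
        if " " in value:
            value = f'"{value}"'  # phrases must be quoted in Lucene syntax
        terms.append(f"{name}:{value}")
    return " AND ".join(terms)


def query_url(miql, base_url=BASE_URL):
    """Build a REST URL of the assumed form <base>/query/<encoded MIQL>."""
    return f"{base_url}/query/{quote(miql, safe='')}"


miql = miql_and(species="mouse", detmethod="two hybrid")
print(miql)
print(query_url(miql))
```

Because every PSICQUIC service implements the same interface, the same query string could in principle be sent to each database just by swapping the base URL.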

The challenge is to move PSICQUIC from simple access to all the resources to a truly integrated view of them. How do you determine whether two sources really are talking about the same interaction? Also, the compute time quickly moves beyond what is suitable for interactive use.
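The simplest possible take on the "same interaction" question is to treat two records as the same if they name the same pair of interactors, regardless of order. The sketch below is my own illustration, not something presented in the talk. It assumes identifiers have already been mapped to a common namespace, which is the hard part in practice, and it ignores detection method, publication and everything else that might distinguish two reports.

```python
from collections import defaultdict


def interaction_key(id_a, id_b):
    """Order-independent key: A-B and B-A describe the same binary pair."""
    return frozenset((id_a, id_b))


def group_by_pair(records):
    """Map each interactor pair to the set of sources reporting it.

    `records` is an iterable of (source, id_a, id_b) tuples.
    """
    sources = defaultdict(set)
    for source, id_a, id_b in records:
        sources[interaction_key(id_a, id_b)].add(source)
    return sources


# Made-up records from two hypothetical databases.
records = [
    ("db1", "PROT_A", "PROT_B"),
    ("db2", "PROT_B", "PROT_A"),  # same pair, reversed order
    ("db2", "PROT_A", "PROT_C"),
]
grouped = group_by_pair(records)
shared = [sorted(pair) for pair, dbs in grouped.items() if len(dbs) > 1]
print(shared)
```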

PSICQUIC is a technical solution, whereas IMEx is the social/collaborative answer. IMEx is the International Molecular Exchange Consortium. The aims of its members include: avoiding redundant curation, providing a consistent body of public data using detailed joint curation guidelines, and providing a network of stable and comprehensive resources for MI data. This work has been in production phase since February 2010. The work is split among the different databases by journal type. More information about IMEx is available online. Each interaction has its own database’s identifier, but also an identifier from a common IMEx identifier space. The hardest part was harmonizing curation procedures, and they now have a common curation manual across all databases.

Another aspect of his work is EnCORE, which integrates different data types using a federated, programmatic approach. EnCORE is an ENFIN platform that enables mining of data across various domains, sources, formats and types. It integrates database resources and analysis tools across different disciplines. The first focus is on developing an EnXML format. Access interfaces include Perl API, Java API, ftp, SOAP, REST, GUI, etc. The return formats come in a variety of flavours, e.g. XML, CSV, plain text, JSON, etc. All of this must be squeezed into one consistent format, which is done by putting wrappers around the various programs.

The EnXML structure is set oriented – not only does it tell you about one thing (e.g. protein), but also about a set of them. In this structure, an experiment is run which identifies the results. Each experiment references a Set structure, which contains the structure of the result. Sets can hold further nested sets. There are a number of other further sub-structures. The EnCORE results always include both a positive and negative result set (in the case of the negative result, it lists all identifiers for which *no* hit was found). Negative results allow you to track why you might not have gotten a response, and how you “lost” some identifiers from the result.
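The positive/negative result idea can be sketched in a few lines. This is only my illustration of the concept, not the EnXML structure itself: the function name and the dictionary standing in for a wrapped service's response are both hypothetical.

```python
def run_lookup(identifiers, annotations):
    """Split query identifiers into a positive and a negative result set.

    `annotations` stands in for whatever a wrapped service returns: a dict
    from identifier to a list of hits.
    """
    positive = {i: annotations[i] for i in identifiers if annotations.get(i)}
    negative = [i for i in identifiers if i not in positive]
    return positive, negative


# Made-up identifiers and hits, for illustration only.
query = ["PROT_A", "PROT_B", "PROT_C"]
service_hits = {"PROT_A": ["pathway-1"], "PROT_C": []}
positive, negative = run_lookup(query, service_hits)
print(positive)
print(negative)
```

Here `PROT_B` (unknown to the service) and `PROT_C` (known, but with no hits) both end up in the negative set, so no identifier silently disappears between one step and the next.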

EnVision is an end-user tool for the above EnCORE work based on the EnXML format. It provides an initial, integrative view for Sets of molecular entities without the need for programming. It also allows the possibility of further local processing. It allows you to save the status/analysis of your material on a particular date, and use that for, e.g., supplementary materials. You can also download your sub-results in a tabular format. Further information and the ability to run this GUI are available online, along with an EnCORE tutorial you can play with.

All of this can be quite laborious – web services that are used by EnCORE can change without warning, so it’s a constant challenge to maintain all of these wrappers. A partial answer is to use, wherever possible, underlying standards for the individual services. Such standards include PSICQUIC for MI data. DAS will be used to access protein annotation and information.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

CISBAN Semantics and Ontologies Software and Tools

SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.
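To give a feel for the "elements become classes, attributes become properties" mapping, here is a small Python sketch that walks a toy SBML fragment and prints subject/predicate/object statements. It is not the MFO code and does not use MFO's actual class or property names: the `hasPart` property, the generated individual names and the toy model are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy SBML Level 2 Version 4 fragment, made up for illustration.
SBML = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfSpecies>
      <species id="S1" compartment="cell" initialAmount="10"/>
    </listOfSpecies>
  </model>
</sbml>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def to_assertions(element, parent=None, out=None, counter=None):
    """Walk the XML tree, emitting (subject, predicate, object) tuples."""
    out = [] if out is None else out
    counter = {} if counter is None else counter
    cls = local_name(element.tag)
    counter[cls] = counter.get(cls, 0) + 1
    # Use the SBML id when present, otherwise a generated individual name.
    individual = element.get("id") or f"{cls}_{counter[cls]}"
    out.append((individual, "rdf:type", cls))  # element -> class membership
    for name, value in element.attrib.items():
        # attribute -> property; a real mapping would decide here between a
        # datatype property and an object property (e.g. for 'compartment')
        out.append((individual, name, value))
    if parent is not None:
        out.append((parent, "hasPart", individual))  # hypothetical property
    for child in element:
        to_assertions(child, individual, out, counter)
    return out


assertions = to_assertions(ET.fromstring(SBML))
for triple in assertions:
    print(triple)
```

A real conversion would write these statements out as OWL individuals in a file that imports the ontology, rather than printing tuples.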

A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals! [1] It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Of the reasoners tried so far, Pellet 2.0.0 rc7 is the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO Subversion project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies, of the format in which your data sources are available. By layering your ontologies like this, you can separate out the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels, a database of simulatable biological models whose primary (“native”) format is SBML, though other formats are available. I’m especially interested in BioModels, as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. Its purpose is to be a direct translation of SBML, SBO, and the SBML Specification document in OWL format. Using an editor such as Protege, you can manipulate and create models, and then using the MFO code you can export the completed model back to SBML (while the import feature is complete, the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features, and I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathways. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that: it stores logical restrictions and axioms that are otherwise only stored in either SBO itself or, more importantly, the purely human-readable content of the SBML specification document [2]. MFO therefore adds a set of extra constraints that aren’t present in the BioPAX version of SBML, which is why I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

CISBAN Meetings & Conferences Semantics and Ontologies Standards

Pre-Building an Ontology: What to think about before you start

There are a few big questions that need to be kept firmly in mind when starting down the road of ontology building. These are questions of:

  1. Goals: What are you trying to achieve with this ontology?
  2. Competency/Scope: What are you trying to describe?
  3. Granularity: To what depth will you need to go?

The rest of this post relates directly to and is organised around these three topics. These topics have a lot of overlap, and aren’t intended to be mutually exclusive: they’re just ideas to get the brain going. I use the upcoming Cell Behavior Ontology (CBO) workshop to illustrate the points. The questions I single out below may already have been answered by the workshop organizers, but haven’t been published on the CBO wiki yet. I’ll be attending this workshop, and will aim to post my notes each day. It should be fun!

Goals


If a main goal is eventual incorporation within another ontology (e.g. Gene Ontology (GO) for the case of CBO) or even just alignment with the other ontology’s tenets, you have to be sure you’re happy with the limitations this may put on your own ontology. It may be that these limitations are not acceptable, and as a result you choose to reduce the dependencies on the other ontologies.

For CBO, the important questions relate to possible alignment to GO and therefore, ultimately, Basic Formal Ontology (BFO):

Question: Do you wish to ultimately include some CBO terms under, for example, biological processes of GO? GO contains only canonical/non-pathological terms. How does this fit with the goals of CBO?

GO has the express intent of creating terms covering only canonical / non-pathological biology. Therefore, would cell behavior during cancer (e.g. uncontrolled cell proliferation or metastasis, which aren’t in GO) be appropriate if CBO is meant to, in its entirety, be included within GO? They are important terms, so if some amount of incorporation with GO is appropriate, would it only end up being a partial alignment?

Question: Are there any plans to use an Upper Level Ontology (ULO) such as the OBO Foundry-recommended BFO? Though BFO may not need to be considered immediately, it does place certain restrictions on an ontology. Are you happy with those restrictions?

One example of the restrictions placed by the use of BFO is that within BFO, qualities cannot be linked via the Relations Ontology to processes. That is, if you have a property called has_rate which is a child property of “bears”, then you are not allowed to make a statement such as “cell division has_rate some_rate”, where cell division is a process, and some_rate is a quality. There is a good post available about ULOs by Phil Lord.

Question: How richly do we want to describe cell behaviors?

Another important general goal is the level of richness that is needed with CBO. Competency questions, discussed later, will answer this to some extent. We can think about richness using GO as an example. The goal of the GO developers is the integration of multiple resources through the use of shared terms. GO does this very well. However, if you want rich descriptions and semantic interoperability, bear in mind that these are not goals of GO.

Competency/Scope


While it is often a tempting idea to start from the top of an ontology and work downward, consideration should be given to an initial listing of leaf terms that you are sure that you need in the ontology. Not only does this ensure you have terms that people need from the start, the bagging and grouping exercises you would then go through to create the hierarchy will often highlight any potential problems with your expected hierarchy. If you have clear use-cases, then a bottom-up approach, at least in the early stages, can be useful in figuring out what the scope of your ontology is.

This brings us to the importance of having scope – and a set of competency questions – ready from the beginning of ontology development. What do you want to describe?

Question: What is the definition of cell behavior in the context of CBO?

For instance, for CBO, what is meant by the word “behavior”? A specific description of what is, and isn’t, a behavior that the CBO is interested in, is an important first step.

The last thing that would be relevant to the overall goals (but which could equally well be considered in the Granularity section below) is the type of terms to be added:

Question: Should the terms be biological terms only, or also bioinformatics/clinical terms?

To better explain the above question, you could consider the stages of cancer progression. “Stage 2” is a fictitious name for a clinical/bioinformatics description of a stage of a cancer. This is not a biological term. Which type of term should go into CBO? I would guess that the biological term should go in which describes the biology of a cell at “stage 2”, and then perhaps use synonyms to link to bioinformatics/clinical terms. There probably shouldn’t be a mix of the two types of terms as the primary labels.

Additionally, competency questions can help determine the scope. You can make a list of descriptive sentences that you want the ontology to be able to describe, such as “The behavior of asymmetric division (e.g. stem cell division)”. By listing a number of such sentences, you can determine which are out of scope and which must be included, thus building up a clear definition of the scope.

Granularity


For me the granularity question has two aspects. The first, more general aspect is how fine-grained you want your terms to be. The second, more interesting aspect is whether, in the context of CBO, we are interested in the behavior of cells and/or the behavior in cells. The examples given in the workshop material seem to come from both of these areas (see the CBO wiki).

Question: Should CBO deal with the behavior OF cells and/or the behavior IN cells?

For the above question we can use as examples cell polarization and cell movement. Both are listed on the CBO wiki, so both are considered within the scope of CBO. However, cell movement is a characteristic behavior of a cell, while polarization is something that happens in a cell (e.g. polarization within a S. cerevisiae cell with regards to the bud scar). Both of these types of behaviors are relevant, but they are different classes of behavior, and this may be an appropriate separation within the CBO hierarchy.

As an aside, is cell division a behavior? It is covered in the CBO material, so with respect to CBO, it is. I think that the CBO is intended to deal with single cells, so I’m not sure where cell division fits in.

These questions should be considered, but you should also try not to let them reduce the effectiveness and efficiency of ontology development. However, as with many biological domains, try to ensure that everyone is on the same page with their goals, scope, and granularity and there will be (I believe!) fewer arguments and more results.

Also, I am positive I’ve missed stuff out, so please add your suggestions in the comments!

With special thanks to Phil Lord for the useful discussions surrounding ontology building that formed the basis for this post.

CISBAN Meetings & Conferences

CISBAN and telomere maintenance and shortening, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

Amanda Greenall: Telomere binding proteins are conserved between yeast and higher eukaryotes. The capping proteins are very important, because they prevent the telomeres from being recognized as double-strand breaks. They work on cdc13, which is the functional homologue of POT1 in humans. A point mutation, cdc13-1, allows them to study telomere uncapping. When grown above 27 degrees Celsius, the cdc13-1 protein becomes non-functional and falls off. This uncapping causes telomere loss and cell-cycle arrest. So, they do further study into the checkpoint response that happens when telomeres are uncapped. Yeast is a good model, as many of the proteins involved in humans have direct analogs in yeast. They did a series of transcriptomics experiments to determine how gene expression is affected when telomeres are uncapped. They did 30 arrays, and the data was analysed using limma. 647 differentially-expressed genes were identified: 418 upregulated (carbohydrate metabolism, energy generation, response to OS) and 229 downregulated (amino acid and ribosome biogenesis, RNA metabolism, etc). The number of differentially-expressed genes increases with time. For example, 259 of the genes were involved in DNA damage response.

They became quite interested in BNA2, an enzyme that catalyses de novo NAD+ biosynthesis. Why is it upregulated? It seems over-expression of BNA2 enhances survival of cdc13-1 strains (using spot tests). Nicotinamide biosynthetic genes are altered when telomeres are uncapped in yeast and humans. The second screen was a robotic screen to identify ExoX and/or pathways affecting responses to telomere uncapping. Robots were used to do large-scale screens that can measure systematic cdc13-1 genetic interactions. One of the tests was the up-down assay, which allows them to distinguish Exo1-like and Rad9-like suppressors. They carry on with the spot tests until they have worked through the entire library of strains.

Darren Wilkinson: A discrete stochastic kinetic model has been built to model the cellular response to uncapping (J Royal Soc Interface, 4(12):73-90); it is also in BioModels. It is encoded in SBML and simulated in BASIS (a web-based simulation engine). You can use the microarray data to infer networks of interactions. Such top-down modelling can often be done with Dynamic Bayesian Networks (DBNs) for discretised data and sparse Dynamic Linear Models (DLMs) for (normalized) continuous data. A special case of DLM is the sparse vector auto-regressive model of order 1, known as the sparse VAR(1) model, and this appears to be effective for uncovering dynamic network interactions (see Opgen-Rhein and Strimmer, 2007). They use a simple version of this model, with an RJ-MCMC algorithm to explore both graphical structure and model parameters. The output of the RJ-MCMC is quite hard to visualize, so they plot the marginal probability that each edge exists. This can also be summarised by choosing an arbitrary threshold and then plotting the resulting network. You can change the thickness of the edges so they match the marginal probability associated with each edge. This picture is then easier for biologists to analyse, and allows them to narrow down their search for important genes.

He also performed analysis over the robotic genetic screens. There are usually about 1000 images per experiment, each with 384 spots, and therefore image analysis needs to be automated. The aim is to pick out those strains that are genetically interacting with the query mutation. For interactions to be a useful concept in practice, you need the networks to be sparse. With HTP data, there is sufficient data to re-scale it in order to enforce this sparsity. Under a model of genetic independence, a scatter-plot of double mutants against single mutants will show them all lying along a straight line. Points above and below the regression line are phenotypic enhancers and suppressors, respectively.
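As a small illustration of the edge-probability summary described in these notes, the sketch below estimates the marginal probability of each edge as the fraction of sampled graphs that contain it, then applies a threshold. This is only my reading of the idea, not the speaker's code: the sampled graphs and gene names are made up, and the mapping from probability to line width is an arbitrary choice.

```python
from collections import Counter


def edge_marginals(sampled_graphs):
    """Fraction of sampled graphs in which each directed edge appears."""
    counts = Counter(edge for graph in sampled_graphs for edge in graph)
    n = len(sampled_graphs)
    return {edge: count / n for edge, count in counts.items()}


def threshold_network(marginals, threshold=0.5):
    """Keep edges whose marginal probability reaches the chosen threshold."""
    return {e: p for e, p in marginals.items() if p >= threshold}


# Four made-up posterior samples, each a set of (regulator, target) edges.
samples = [
    {("g1", "g2"), ("g2", "g3")},
    {("g1", "g2")},
    {("g1", "g2"), ("g3", "g1")},
    {("g1", "g2"), ("g2", "g3")},
]
marginals = edge_marginals(samples)
network = threshold_network(marginals, threshold=0.5)
for (source, target), p in sorted(network.items()):
    # Arbitrary scaling so that more probable edges are drawn thicker.
    print(f"{source} -> {target}: line width {1 + 4 * p:.1f}")
```

Lowering the threshold adds less certain edges to the picture, which is why the choice of cut-off is described above as arbitrary.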

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



CISBAN Meetings & Conferences

Introduction and Directors’ Updates: BBSRC SB Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 15 December 2008.

This is the third BBSRC grantholders' workshop for systems biology. The participants include grantholders and their staff; people connected with the activities of the 6 SB centres; guests from BBSRC-sponsored institutes; guests from outside of these institutes; and finally, members of the BBSRC peer-review panels. Why is this workshop happening? It's an opportunity for sponsors and participants to discover and say how things are going; to share experiences with colleagues; and, of course, to talk about new ways to solve biological problems. I'll be providing my notes on most of the talks that are given over the next few days.

Now for my notes on the updates from the directors of the six Centres for Integrated Systems Biology (CISBs).

CISBAN (Tom Kirkwood): Why and how do we age? Limited investment in cell maintenance (the disposable soma theory). Ageing is caused primarily by damage, and longevity is regulated by resistance and repair. There are multiple mechanisms: it is a highly complex system which is inherently stochastic. There are a number of interesting questions that revolve around optimality, plasticity, and trade-offs. Now involved as an academic partner in the EU network of excellence, LifeSpan. One area of major research within CISBAN is that of telomere length homeostasis. Work includes high-throughput (HT) yeast robot experiments, which has led to the identification of a large number of genes that affect telomere maintenance. CISBAN also studies the intrinsic ageing of mammalian cells, where there is considerable cell-to-cell heterogeneity. Research has shown that it is much more than just telomere erosion: there is significant crosstalk between telomere and, for example, mitochondrial dysfunction. He also gave an overview of BASIS, a web application that allows the researcher to upload SBML models and run simulations on the BASIS server. There is a large amount of research going on within CISBAN that has a statistical focus, and there is also work going on with respect to data integration.

CISBIC (Brian Robertson): CISBIC focuses on host-pathogen interactions, and contains 3 exemplar sub-projects. They have core facilities in glycoproteomics, metabolomics, transcriptomics and cell imaging. They are interested in combating disease in both plants and animals. They use infection as a perturbation to study host biology, mainly focusing on the interaction between the microbe and the innate immune response. They have tried very hard to create a community of systems biologists, and also have collaborations with people outside the centre. Top questions from experimental biologists: "How do I start in SB?", "How do I start modelling?", "Which modelling approach should I take?". They're considering starting a summer school to train people.

CPIB (Charlie Hodgman): Began with a nice screencast/automated slideshow of pictures and work going on at CPIB. The main research is on plant roots and how they respond to external stimuli. The research program is split into 4 strands: root cell elongation, root apical meristem, lateral root development, and an integrated root model. Models they are developing include: static and dynamic network modelling, biomechanical modelling, tissue-scale computational modelling, and more. Technological developments include vertical-stage confocal imaging and automated focusing time-lapse microscopy. Outreach work includes an advanced hormone-transport workshop in May 2008 and their first plant modelling summer school in September 2008. In December 2007 they had a Maths and Plant Sciences Study Group.

CSBE (Igor Goryanin): This centre is focused on dynamic biological systems rather than on any particular biological project. They are concentrating on cell-to-cell interactions, and their main aim is to streamline the modelling process. Mentioned the Bio-PEPA Eclipse Plug-in, which can run simulations within Eclipse.

MCISB (Hans Westerhoff): Focus on one project, using S. cerevisiae: they want to make SB work, in a complete sense. This means from the individual sequences right up to modelling a complete cell.

OCISB (Judy Armitage): Their remit is to build upon the strengths of groups within Oxford to provide a focus for the catalytic development of a wide range of systems approaches. They are trying to understand, predict and control physiological behaviour by integrating knowledge of interactions at the molecular, cellular and population levels. Their initial core testbeds are: bacterial chemotaxis and sensory transduction, cell cycle, hypoxia and the Trypanosome flagellum. New projects include: plant metabolomics and signalling, extracellular matrix, T-cell signalling, diversity of host responses in malarial infection, meiosis, biofilm development and synthetic signalling networks. They're also doing more work in outreach, including open seminar series and one-day workshops.

I really liked how each of the directors thought it was important to point out the outreach – it is an opinion I share. Additionally, what was interesting is that every director made a specific point of how their Centre was quite unusual in the amount of interchange and multidisciplinary research they do. This type of collaboration has traditionally been very rare, and the Centres were originally quite unusual in that regard. However, what is gratifying is that, in order to be successful in SB, outreach and collaboration have become a necessity, and I am hearing them talked about much more these days. Perhaps those directors can now all remove the word "unusual" from the next versions of those presentations. Wouldn't that be nice? 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



CISBAN Data Integration Meetings & Conferences Software and Tools Standards

Pre-workshop post on the FuGE / ISA-TAB Workshop, 8-9 December

Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)

ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format acceptable by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.

Each day I'll provide a summary of what's gone on at the workshop, which centers around the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore the conversion in the FuGE-ML to ISA-TAB direction may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaining whatever concrete format they wish.
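To make the "not entirely lossless" point concrete, here is a minimal sketch of what happens when a nested record is squeezed into tab-delimited rows. The record below is a made-up toy and is neither real FuGE-ML nor real ISA-TAB; the field and column names are invented purely for illustration.

```python
import csv
import io

# Hypothetical nested record (NOT real FuGE-ML): one protocol with two
# ordered steps, each carrying its own set of parameters.
experiment = {
    "protocol": "2D gel",
    "steps": [
        {"name": "first dimension", "parameters": {"voltage": "200 V"}},
        {"name": "second dimension",
         "parameters": {"voltage": "150 V", "buffer": "Tris"}},
    ],
}


def flatten(record):
    """Turn the nested record into one flat row per step."""
    rows = []
    for step in record["steps"]:
        rows.append({
            "Protocol": record["protocol"],
            "Step": step["name"],
            # All parameters are squeezed into one cell, so their
            # structure survives only as a naming convention.
            "Parameters": ";".join(
                f"{key}={value}" for key, value in step["parameters"].items()
            ),
        })
    return rows


def to_tsv(rows):
    """Serialise the flat rows as tab-delimited text (invented columns)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["Protocol", "Step", "Parameters"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = flatten(experiment)
print(to_tsv(rows))
```

Going back from the table to the nested form would need extra conventions (for example, how to split the Parameters cell), which is roughly the kind of detail a round-trip conversion has to pin down.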

Specific topics of the workshop include:

  • Advance and possibly finalize XSLT rendering of FuGE documents into ISA-TAB. This includes the finishing-off of the generic FuGE XSL stylesheet.
  • Work on some of the extensions, including FCM, Gel-ML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.

Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and to everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumvirate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!



CISBAN Outreach

Scientist Meets Small Children, and doesn’t stop talking (and listening) all day!

[Update: You can get the slides for this presentation now. See my related post.]

This past Monday, one day before fantastic things happened in the voting booths of America (I had already submitted my absentee ballot), I spent a day at a local primary school. Names and locations for the school will not be mentioned, as I am unsure about rules regarding child protection, but the day was great, and I’ll tell you all about that. I had contacted the Teacher Scientist Network (information about volunteering is at the end of the post), which pairs teachers with scientists, about a year or so ago. As far as I know, prior to me, most of the scientists who had volunteered had been wet-lab-based scientists, and the partnerships were generally geared around that sort of work. I am a bioinformatician/computational biologist, which means I spend all my work time in front of a computer. Figuring out what to do with me in the TSN, and who to pair me with, took some effort. However, with the wonderful help of people like Deborah Herridge, and now Claire Willis, from the TSN, I was eventually paired with a teacher. She has a biology background, and now teaches 10-11 year olds. So, she is very aware of what sort of science she wants the kids to learn, and also understands how important it is for kids to interact with a “real” scientist. And what she organized for my visit was just great. Words like funny, wonderful, crazy, surprising, cute, interesting, intelligent, curious, shy, proud, and many others come to my mind when I think of those kids, and I’ll try to explain why.

My teacher partner has organized all this week as the school’s Science Week. And, for the first day, I was to visit and give them all a talk or two, and answer questions, about 1) what it is like being a scientist, and 2) what genes are and how they are used in research and medicine. In the case of the older kids, I also was able to talk a little about ethics, which was really good.

I saw 9 separate classes, speaking at each one, and also gave a short talk at the two assemblies (one for the juniors, and one for the infants). The ages of the children ranged from 5 to 11. In my short talk I had prepared some slides about how I had become a scientist, and the longer talk centered around the theme that my teacher partner and I had decided on: genes, and how they are used in scientific research. Yes, this meant talking about “GM”, but there are so many aspects of it that the media never really touch on, that the talk was wonderfully diverse. I spoke on historical domestication of animals, making medicine for humans, encouraging hardier crops, the similarities and differences between lab-based genetic manipulation and “traditional” selective breeding, the obligatory glowing fish and glowing mice, and the Nobel Laureates in Chemistry for this year, one of whom released a picture of a petri dish of fluorescing bacteria “drawn” with the picture of a sunset.

By the end of the day, my throat was sore from talking. People who know me would not be surprised to hear I talked a lot – that’s a standing joke. However, the kids talked (in a good way!) almost as much as I did. They always wanted to tell me about their experiences, and how they related to the slides, and were full of interesting questions. It was fabulous.

Top Questions Asked, in order of remembered frequency

1. Why do scientists wear white coats?

2. Is that seahorse real? (From a picture from the GloFish website)

3. Can scientists mix more than one gene together?

4. How many colors can you make? (with respect to fluorescent proteins)

5. What happens when you mix colors? (another fluorescent protein question)

6. Do you like America? / Do you like America more than the UK?

7. Does that one have eyes? (When looking at this picture of E. coli.)

8. They look like hot dogs! (Ok, so not a question, but hilariously accurate – of the E. coli, again.)

9. Why does one of my eyes have a bit of brown in it? (This was a child with blue eyes, except for a wedge of brown in one eye.)

10. Is the science in CSI (the TV program) like real science?

11. Are some scientists going to destroy the world with black holes? (A little off-topic, but I didn’t mind at all!)

As you can see, not all of them were strictly on-topic, but they just kept asking question after question. They also did the classic kid trick of raising their hand as if to ask a question, when their real purpose was telling you all about their vacation/pet dog/pet hamster/pet cat/Uncle who used to be a scientist but now works at Asda (yes, really!). But I didn’t mind those at all. I kept on running over time in each classroom, as they had so many things to ask me. It didn’t matter what age – pitch the talk in the right way, and they really seemed to enjoy it!

Some top tips if I were to do it again (which I really would like to do – the teacher I’m paired with and I have some ideas for next time) include:

1. My method of using no text on the vast majority of the slides really worked. It was especially useful as it meant I could stop anywhere in my slides if I was running out of time, and the littlest ones were not distracted by trying to read the words rather than listening to me.

2. Pictures of fluffy, pretty, cute, or “gross” animals were very, very popular. The number of “Awwwws” I got when showing pictures of cats was astounding. Equally, all the older ones wanted to see my pictures of the newborn mice (pretty gross with no hair!), and all ages enjoyed trying to figure out what the photo of E. coli was.

3. As soon as you ask a question, they all raise their hands to answer it. Not sure when this stops, but I know that by the time I was in high school the teachers had a hard time prying any answers out of the majority of us! 😉 However, on Monday I was at a school where the eldest was 11, and they all wanted to contribute. So, ask them questions. I found there were two types: the question where I wanted to get an answer (such as “What traits make a good horse?” or “What do you think makes these two cats different?”) and the type where I just wanted them to feel included in the talk, and just wanted a show of hands (such as “How many of you have a cat?” or “Who has heard of diabetes?”).

4. Introduce some ethics, and show how scientists think very carefully before doing research. We talked about genes a lot, and how putting new genes in bugs like E. coli can help us, e.g. putting the human insulin gene into E. coli to help with diabetes. I told all the older kids that it wasn’t the tool that is a problem: a tool is neither good nor evil. It’s how that tool is used, and people need to make a fresh decision, and think about the benefits and downsides each time that tool is used. I said genetic modification is like a knife: it is neither good nor bad, and that scientists try very hard to make sure that it is used for the right reasons, and in a safe way.

5. Visually-arresting analogies. Even though DNA is a double-helix and not a spiral staircase, I found it a very useful analogy, especially for the younger ones.

6. My partnered teacher had prepared some slides to show the kids prior to my arrival. They dealt with Mr. Green Genes, the GFP-glowing cat. Some of the other teachers also talked to their kids about inheriting some of your traits from your mom, and some from your dad, and used the labradoodle as a visual aid. This prepped them for my talk, which I think was really helpful.

7. Make your talk inclusive. It keeps their interest, I think. When I showed pictures of cats, I included one picture of my own cat, and told them a little about her. I often asked them questions about if they had pets, or scientists in the family, or liked the look of a picture, or knew what something was.

These are probably things that most people are well aware of; however, I thought I’d just share my experiences!

Things That Surprised Me

1. How many of them knew the word “bacteria” before I could even say it. It was, strangely enough, the top answer to my “What do you think this is?” question when I showed them the picture of E. coli.

2. How many of the kids, without prompting, came up to me after my talks and said that they really enjoyed it.

3. How many came up to me after my talks to ask more questions, or tell me about a scientist in the family.

4. How much enjoyment I got out of giving my talks, and from listening to what the kids had to say.

5. One of the kids made an immediate connection between adding “glowing genes” (GFP etc) to fish, and Jurassic Park. Ok, so it isn’t an exact analogy, but that was really great to hear. It also brought forth a discussion, led by the kids, about saving endangered animals.

And, in one direct appeal to my vanity, a little 5 or 6 year old girl told me as I was leaving her class that she thought I was pretty! Wow, what a nice way to finish a talk, and it definitely helped the ego 😉 I thanked her, and the teacher heard her and told her that her house could have a point. Then I realized that their school, just like Hogwarts, had houses that got points! Too cool 🙂 And finally, a very great compliment from the teacher I’m paired with: “the kids are so much more enthusiastic about science and a lot of them have asked when you are coming back! Your work was perfectly pitched to the children’s needs and was explained in a way that was so easy to understand.” Thanks!

I highly recommend the Teacher Scientist Network. If you are interested in registering with the Teacher Scientist Network in my area (operated by Science Learning Centre North East), please visit and register at as a scientist. Claire Willis will then receive your application and arrange a mutually convenient time to meet up. If you’re interested, but aren’t sure where to go for your area, then have a look at that page – you can send questions to Claire from there. That website also has more information about the TSN. They don’t ask for very much time from scientists at all: from a day or two per year, to anything that the teacher and scientist agree to. In my case, the head of my Centre told all of us employees that if we wanted to volunteer for the TSN, we could do so, and do it on work time. He is most generous, and definitely sees the benefit of science outreach to schools.

Thanks to the TSN, my bosses, my partnered teacher, and most especially all those kids! 🙂




3 Bioinformatics Research Associate Positions: Newcastle University

There are three bioinformatics jobs (one in pure bioinformatics, one in network analysis, and another in modelling/mathematical biology) currently available within CISBAN, an interdisciplinary centre studying the systems biology of ageing and nutrition. The full particulars are posted both on Nature Jobs and on the Newcastle University Job Vacancies web pages.

Below are links to the various job advertisements, as well as summaries of the jobs themselves. This is a summary of the three Nature Jobs postings, put together on a single page for easy perusal. The closing date for all of these positions is 11 January 2008. This is a great opportunity, though I may be speaking from a biased perspective as I work at CISBAN and find it an interesting and challenging workplace.

  1. Centre for Integrated Systems Biology of Ageing and Nutrition, Institute for Ageing and Health

    Research Positions

    Level F £25,134 – £32,796 p.a.
    Level G: £33,779 – £40,335 p.a.

    We seek scientists to join CISBAN, an exciting new research centre established following a major award (£6.4m) from BBSRC and EPSRC,
    to participate in studies of the mechanisms responsible for ageing and
    how they are affected by nutrition. Ageing is recognised
    internationally as a ‘grand challenge’ and is a field prioritised for
    growth. This post offers opportunities to work in an intensely
    multidisciplinary, world-class centre and contribute to the development
    and application of systems science.

    Research Associate (Bioinformation/Computing Scientist – Applications)

    develop and maintain the computing software and hardware infrastructure
    for systems biology, including a central web portal integrating
    applications for data capture, storage and visualisation and high
    performance computing systems and databases, including a large Linux

    Job reference: A1091R

    Posts are tenable until 30 September 2010.

    Enquiries for the post may be directed to Dr Anil Wipat, School of Computing Science (email:
    Further particulars for this post can be found on the University’s web page at

    Applications should be submitted by 11 January 2008 to Professor Tom Kirkwood, CISBAN Director,
    Institute for Ageing and Health, Henry Wellcome Laboratory for
    Biogerontology Research, Newcastle University, Newcastle upon Tyne NE4 6BE (email:
    Committed to Equal Opportunities

  2. Centre for Integrated Systems Biology of Ageing and Nutrition, Institute for Ageing and Health

    Research Positions

    Level F £25,134 – £32,796 p.a.
    Level G: £33,779 – £40,335 p.a.

    We seek scientists to join CISBAN, an exciting new research centre established following a major award (£6.4m) from BBSRC and EPSRC,
    to participate in studies of the mechanisms responsible for ageing and
    how they are affected by nutrition. Ageing is recognised
    internationally as a ‘grand challenge’ and is a field prioritised for
    growth. This post offers opportunities to work in an intensely
    multidisciplinary, world-class centre and contribute to the development
    and application of systems science.

    Research Associate (Bioinformatician – Network Analysis)

    research and develop novel methods of representing and integrating
    molecular and cellular data as networks and apply this methodology to
    identify novel proteins and elucidate novel pathways involved in the
    process of cellular ageing and senescence.

    Job reference: A1090R

    Posts are tenable until 30 September 2010.

    Enquiries for the post may be directed to Dr Anil Wipat, School of Computing Science (email:
    Further particulars for this post can be found on the University’s web page at

    Applications should be submitted by 11 January 2008 to Professor Tom Kirkwood, CISBAN Director,
    Institute for Ageing and Health, Henry Wellcome Laboratory for
    Biogerontology Research, Newcastle University, Newcastle upon Tyne NE4 6BE (email:

    Committed to Equal Opportunities

  3. Centre for Integrated Systems Biology of Ageing and Nutrition, Institute for Ageing and Health

    Research Positions

    Level F £25,134 – £32,796 p.a.
    Level G: £33,779 – £40,335 p.a.

    We seek scientists to join CISBAN, an exciting new research centre established following a major award (£6.4m) from BBSRC and EPSRC,
    to participate in studies of the mechanisms responsible for ageing and
    how they are affected by nutrition. Ageing is recognised
    internationally as a ‘grand challenge’ and is a field prioritised for
    growth. This post offers opportunities to work in an intensely
    multidisciplinary, world-class centre and contribute to the development
    and application of systems science.

    Research Associate (Modeller/Mathematical Biologist)

    develop models of molecular and cellular mechanisms of ageing and to
    explore links between ageing, development and evolution from a
    life-course perspective. This post will also involve collaboration
    within the EU Network of Excellence LifeSpan, linking development and ageing.

    Job Ref: A1092R

    Posts are tenable until 30 September 2010.

    Enquiries for the post may be directed to Professor Tom Kirkwood, Institute for Ageing and Health (email: Further particulars for this post can be found on the University’s web page.

    Applications should be submitted by 11 January 2008 to Professor Tom Kirkwood, CISBAN Director,
    Institute for Ageing and Health, Henry Wellcome Laboratory for
    Biogerontology Research, Newcastle University, Newcastle upon Tyne NE4 6BE (email:

    Committed to Equal Opportunities



CISBAN Meetings & Conferences Standards

FuGE Users’ Workshop: 13-14 December, 2007

The two-day FuGE Users’ workshop was organized by Norman Paton and held at the University of Manchester. It was great fun, and if you just want the short summary of my time there, then just know that there was loads of enthusiasm for FuGE as well as interesting talks, both by communities who were already extending FuGE, and by developers who were already building tools and databases based on it. There were only a dozen or so people, which kept the discussions lively but neither too long nor too divergent. The workshop dinner was great, though the trip to the restaurant was correctly described by one of the attendees as an Odyssey. For more information on the social aspect of the FuGE workshop, please have a look at Phil Lord’s humorous posting on the matter. For another post on the workshop, see the peanutbutter Bioinformatics blog by Frank Gibson.

If you wish to read the longer notes rather than the short summary, then please read on!

Please note that these are my own notes, and are in no way considered to be an “official” FuGE report on the workshop. As such, any errors or inconsistencies are entirely my own. However, if you see a problem with this post, then please let me know, and I’ll fix it!

The objectives of the workshop were to share and document experiences in the use of FuGE, to identify good practices, to document guidelines, and to make known these experiences and guidelines. Hopefully, the result will be a paper that documents the current users’ experiences and increases communities’ understanding of FuGE. It will hopefully help people who have read the Nature Biotechnology paper and want to use FuGE, but aren’t completely sure what to do next.

Attendees were:

Peter Wilkinson, from Montreal, who was interested in FuGE for flow cytometry.

Khalid Belhajjame: works with Norman Paton in Manchester, and who may soon be a full-time developer of FuGE

Javier Santoyo: University of Edinburgh, part of a consortium trying to develop standards for RNAi work

Andy Jones: one of the original developers of FuGE, from Liverpool, developed GelML with Frank Gibson.

Heiko Rosenfelder: German Cancer Centre at Heidelberg, here as part of MIACA, and wants to use FuGE for the cellular assay format.

Martin Eisenacher: Proteome Centre (mzML and analysisXML) and wants to use FuGE

Phil Lord, Frank Gibson: via CARMEN, wants to use FuGE. Frank also developed GelML with Andy

Neil Wipat, Matt Pocock, Allyson Lister: We use FuGE in our internal application for storing HT data. Matt and Allyson also involved in OBI.

Leandro Hermida: SIB, they’re part of a group that is making SystemsX. Want to use FuGE to store and manage the data. Also want to make an extension of FuGE for deep sequencing technologies.

Norman Paton: originally from proteomics field, but developer of FuGE and organizer of the

Session 1: Experiences Using and Extending FuGE

GelML – Frank Gibson and Andy Jones

GelML is a FuGE extension that has passed the PSI approval process. PSI defines community standards for data representation in proteomics. There are a variety of working groups, including gel electrophoresis, mass spectrometry, protein modifications, etc. Within the Gel WG there are three specifications: MIAPE-GE (minimum checklist for reporting gel electrophoresis experiments), sepCV (controlled vocabulary), and GelML (data transfer format, based on FuGE).

GelML covers the model of a gel, 1-D and 2-D GE, other (N-dimensional) GE’s, sample loading, electrophoresis, detection, image acquisition, the excision of locations on gels, and SubstanceMixtureProtocol and SubstanceAction.

The first extended FuGE class described was the Material abstract class. The first such extension is the Gel class. A Gel has Dimensions, MeasuredMaterial, and others. You use the “Measurement” package to describe the characteristics of the Gel. Measurements include PercentageAcrylamide, while information about the gel (i.e. if purchased, from where), the Dimensional Units and CrossLinkerRatio are all FuGE OntologyTerms. MeasuredMaterial was not originally in FuGE because it was planned that such substances could be captured by ontology terms. Rather than using named associations to GenericParameter, they tended either to use GenericParameter (with a CV term) or to extend the Parameter class. This was just a design decision, and he would like to see how others do it.

Another extended FuGE class is the Protocol abstract class. The GelML SampleLoadingProtocol has an AddBufferAction which points to a SubstanceMixtureProtocol. 2DGelProtocol has a SampleLoadingAction, a FirstDimensionAction and a SecondDimensionAction (both Electrophoresis protocols), an InterDimensionAction (for when something happens between the first and second dimension actions), and a DetectionAction.

Within the Electrophoresis protocols there is the ElectrophoresisStop (an Action) which contains a StopTime, which is a TimeParameter, which has Duration and TimePoint. They’d be really interested to see how others have modelled, or would like to model, time. It was also a design decision to guide people with the structure of the XML to help them know what to fill out, e.g. you must have a 2DGelProtocol. For each case, should we extend the FuGE model or add experiment-specific semantics through the use of ontologies? I think this is a case of using both, depending on the circumstances.
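As a rough way of picturing the time modelling question, the containment described above could be sketched like this. It is only an illustration, not the GelML definition: the class names follow my notes, but the field names, the types, and the assumption that either Duration or TimePoint may be left empty are my own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TimeParameter:
    # Assumption: either field may be absent. The notes only say that a
    # TimeParameter "has Duration and TimePoint".
    duration: Optional[timedelta] = None
    time_point: Optional[datetime] = None


@dataclass
class ElectrophoresisStop:
    """An Action recording when an electrophoresis run was stopped."""

    stop_time: TimeParameter

    def describe(self) -> str:
        parts = []
        if self.stop_time.duration is not None:
            parts.append(f"after {self.stop_time.duration}")
        if self.stop_time.time_point is not None:
            parts.append(f"at {self.stop_time.time_point.isoformat()}")
        if not parts:
            return "stop time not recorded"
        return "stopped " + " ".join(parts)


# A run stopped after a fixed duration, with no absolute time recorded.
stop = ElectrophoresisStop(TimeParameter(duration=timedelta(hours=2)))
print(stop.describe())
```

Keeping both a relative and an absolute time in one parameter is one possible answer to "how do you model time"; other extensions might prefer separate parameter types.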

They have used standard XML references within the documents. But, for instance, do we still need internal document identifiers when the ontologyURI is a globally-unique identifier anyway? Maybe they are required if the terms are created ad hoc within the group making the XML file. What is the best way to use ontologies?


– Martin Eisenacher

He is a member of the ProDaC Consortium. ProDaC is a funded consortium that is meant as a “coordination action” within the 6th EU Framework Programme. Its aims are the development of international standards, standardized data submission pipelines, systematic data collection in public standards-compliant repositories, and data access for community and publication. There was a kick-off meeting of ProDaC in Long Beach in October 2006, and there have been two workshops since. Proteomics data includes spectra (peak lists), and peptide lists. He works specifically with the MS (for peak lists and instruments, mzML) and Proteomics Informatics (for “results”, analysisXML) PSI WG’s.

mzML is a merger of mzData and mzXML. Perhaps this merger is one of the reasons that it is not currently FuGE-based. AnalysisXML includes annotation of search databases, search, algorithms, search parameters, instrument characteristics, peptides (peptide-spectrum link, peptide scores), proteins (protein-peptide link, protein scores, significance values, false-discovery estimation) and quantisation. In September 2007 they added comments into the UML that are passed into the XML.

They use the MagicDraw Community Edition, which is available for free. The Analysis package is subdivided into process, quantisation, and search. Process contains things that aren’t directly related to the search protocol applications, but other steps such as ProteinDetermination and PolypeptideProcess. Some of the classes they have made that inherit from the Data class inside the search package include AnalysisResultSet (a set of spectra), AnalysisResult (a spectrum), and AnalysisResultItem (all peptides found for that spectrum). These are all abstract classes, whose concrete subclasses include PolypeptideResultSet, PolypeptideSearchResult, and PolypeptideResultItem.
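The abstract/concrete split described above can be pictured with a small sketch. The class names come from my notes, but everything else (the describe method, the sequence and score fields, the constructor arguments) is invented for illustration and is not taken from analysisXML itself.

```python
from abc import ABC, abstractmethod
from typing import List


class AnalysisResultItem(ABC):
    """Abstract: something reported for one spectrum."""

    @abstractmethod
    def describe(self) -> str:
        """Return a one-line description of this item."""


class AnalysisResult(ABC):
    """Abstract: the result for one spectrum, holding its items."""

    def __init__(self, spectrum_id: str, items: List[AnalysisResultItem]):
        self.spectrum_id = spectrum_id
        self.items = items

    @abstractmethod
    def describe(self) -> str:
        """Return a one-line description of this result."""


class AnalysisResultSet(ABC):
    """Abstract: the results for a set of spectra."""

    def __init__(self, results: List[AnalysisResult]):
        self.results = results

    @abstractmethod
    def describe(self) -> str:
        """Return a description of the whole set."""


class PolypeptideResultItem(AnalysisResultItem):
    def __init__(self, sequence: str, score: float):
        # Hypothetical fields, chosen only to have something to print.
        self.sequence = sequence
        self.score = score

    def describe(self) -> str:
        return f"{self.sequence} (score {self.score})"


class PolypeptideSearchResult(AnalysisResult):
    def describe(self) -> str:
        items = ", ".join(item.describe() for item in self.items)
        return f"{self.spectrum_id}: {items}"


class PolypeptideResultSet(AnalysisResultSet):
    def describe(self) -> str:
        return "\n".join(result.describe() for result in self.results)


result_set = PolypeptideResultSet([
    PolypeptideSearchResult("spectrum1", [PolypeptideResultItem("PEPTIDE", 42.0)]),
])
print(result_set.describe())
```

The abstract layer cannot be instantiated directly, so another analysis type would add its own concrete subclasses alongside the Polypeptide ones.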

At the moment they are assembling their own CV (to include search parameters that are most commonly used in search engines like MASCOT), but they can also use Pride CV. They use the ontology classes directly from FuGE, without extending it. This means that it fits what they need without modifications.

In analysisXML, peptides and sequences are listed only once. Different types of analyses can be stored in one file, or in separate files with external cross-references. Also, the AnalysisProtocol could be used as parameter input for search engines. However, there are many cross-references and unique identifiers that are not validated by the FuGE Schema. Further, there are external cross-references to mzML, which can be difficult if you have only local files and not public URI’s. Also, sequences (just the letters) are not polypeptides (“real” molecules with modifications). Therefore, the ConceptualMolecule FuGE class is not appropriate for polypeptides, though it is suitable for sequences (though they are still able to use that class). Additionally, the ResultSet-ResultItem hierarchy does not fit all analysis types. Finally, many FuGE elements seem to have very long names that aren’t always useful (but you shouldn’t be typing XML manually!).

All items of the collections have unique identifiers. References to them are attributes called “…_ref”. Schema validation does not consider whether _ref links to an allowed section (or that used CV’s are allowed). In mzML, for example, “semantic validation” of CV’s is possible (suggested/implemented by the EBI). Are identifier checks possible? ProDaC has an online validator for mzML, analysisXML, mzData and prideXML that performs semantic validation, though the extra ontology/CV checks are only supported for mzML.
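On the "are identifier checks possible?" question, a check of this kind is easy to sketch outside the schema. The snippet below is a minimal illustration and not the ProDaC validator: it assumes identifiers live in an attribute called id and that references are attributes whose names end in "_ref", and the sample document is invented rather than real analysisXML.

```python
import xml.etree.ElementTree as ET


def find_dangling_refs(xml_text):
    """Return (element tag, attribute name, value) for every *_ref
    attribute whose value matches no id attribute in the document."""
    root = ET.fromstring(xml_text.strip())
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    dangling = []
    for el in root.iter():
        for name, value in el.attrib.items():
            if name.endswith("_ref") and value not in known_ids:
                dangling.append((el.tag, name, value))
    return dangling


# Invented sample document: Protein_ref points at an id that does not exist.
SAMPLE = """
<doc>
  <Peptide id="pep1"/>
  <Protein id="prot1"/>
  <Hit Peptide_ref="pep1" Protein_ref="prot2"/>
</doc>
"""

print(find_dangling_refs(SAMPLE))
```

This only checks that each reference resolves to something. Checking that it resolves to an allowed section, or that a CV term is permitted in that position, would need extra rules about which element types each reference may point to.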

Still to do is the finalization of analysisXML, which is a deliverable for last October! They also want to provide “Quality Determination” as a process. They also want to make some use-cases and instance documents. They will have some from Matrix Science, MPC. Also, they need to finalize the CV they are using.


– Allyson Lister

I gave this talk, so I didn’t write anything about it! Instead, have a look at the SourceForge website (

The Integrative Bioinformatics Group, headed by Neil Wipat and part of The Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN), has developed a data archive and integrator (SyMBA) based on Milestone 3 of the Functional Genomics Experiment (FuGE) Object Model (FuGE-OM), and which archives, stores, and retrieves raw high-throughput data. Until now, few published systems have successfully integrated multiple omics data types and information about experiments in a single database. SyMBA is the first published implementation of FuGE that includes a database back-end, expert and standard interfaces, and a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide programmatic access to the database. Having a central data repository prevents deletion, loss, or accidental modification of primary data, while giving convenient access to the data for publication and analysis. It also provides a central location for storage of metadata for the high-throughput data sets, and will facilitate subsequent data integration strategies.

Flow Cytometry FuGE extensions – Peter Wilkinson

Developing MIFlowCyt. Originally, they stored the metadata and data in a single file, but their latest format (ACS) will separate these two types. They are considering having some of their data formats be in RDF as well as XML, even for those formats that will be built on FuGE – is there a good XML to RDF converter? I suppose so, as I’ve been able to save OWL/RDF as OWL/XML in Protege 4.

One example of their extension is Cytometer, which is a subclass of Equipment. How descriptive should they get with their samples? Should it be at the entity or attribute level? For instance, there is a conceptual difference between prepared samples and “generic” materials. But why not draw an association to Material and call it “sample”? They can’t do that because sample has a lot of associations itself that aren’t present in Material. For things like buffers and solutions, spML doesn’t seem to view them as things that exist – you just talk about them in the protocol. This way you don’t have to list them 1000s of times. In FC, you have to know exactly which thing is used in the protocol (e.g. they must record batch numbers). However, you could have a single buffer instance, and then in the ProtocolApplication you have a specific parameter that is modified in that particular application of the Protocol, such as the batch number.
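The "single buffer instance plus a per-application parameter" idea from the end of that discussion might look roughly like this. It is a sketch under my own assumptions: the class names echo FuGE terms, but the fields, the parameter key and the batch numbers are all made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Material:
    """A reagent described once, e.g. a buffer."""

    name: str


@dataclass
class Protocol:
    """A reusable description of a procedure and the materials it uses."""

    name: str
    materials: Tuple[Material, ...]


@dataclass
class ProtocolApplication:
    """One actual run of a Protocol."""

    protocol: Protocol
    # Values that change from run to run, such as the batch number of
    # the buffer that was actually used (hypothetical key names).
    parameter_values: Dict[str, str] = field(default_factory=dict)


pbs = Material("PBS buffer")
staining = Protocol("staining", (pbs,))

runs = [
    ProtocolApplication(staining, {"PBS buffer batch": "B-001"}),
    ProtocolApplication(staining, {"PBS buffer batch": "B-002"}),
]

for run in runs:
    print(run.protocol.name, run.parameter_values["PBS buffer batch"])
```

Both runs share the same buffer object, so the buffer is described only once, while the batch number still differs between applications.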

Open issues include: FuGE should reference a stable version of AndroMDA, there should be a best-practice for deciding when a Generic* class is replaced by a specific omics-type class, how is the OntologyTerm abstract class intended to be used for specific controlled lists, fitting the organism into FuGE::Bio somewhere, and versioning. He’s also trying to write a FuGE database by hand, rather than using what is generated by AndroMDA, as he needs to squeeze as much performance out of the system as he can. Much more difficult, but could conceivably be much more efficient.

and Custom Extensions – Andy Jones


is for sample processing. SubstanceMixtureProtocol is for describing a mixture of substances, e.g. buffers and solutions and the method of their creation. Actions relate to constituents. Timings relate to constituents and volume, concentration, or mass. SetPropertyAction is a generic model to be used in conjunction with protocols where parameters may be set with associated ActionText. Their chromatography extension comprises extensions of Protocol, Equipment, and ProtocolApplication. The ChromatographyProtocol contains extensions of Parameter, has a child protocol for sample injection, and makes various uses of GenericActions. With ChromatographyEquipment, there are column-associated sub-components. All extensions of ChromatographyEquipment can have additional parameters, including specific named parameters where they are always required. Uses Equipment:make. The mobile phase of LCProtocol is described using the SubstanceMixtureProtocol. Inputs are defined with GenericMaterialMeasurement, and the outputs are either Chromatogram (ExternalData) or SeparationFraction. You can also have two-dimensional chromatography.

GenericSeparation is a protocol that uses generic models for defining the substances used to create a separation gradient and the parameters applied. In this case, the equipment defines the type of separation and criteria using ontology terms – but how do you communicate how this should be used to all of the developers? In contrast, we don’t want to have huge models. Inputs are also defined using GenericMaterialMeasurement, and the outputs are either SeparationLog (ExternalData) or SeparationFraction.

TreatmentProtocol is a simple model for treatments, intended for labelling, mixing, splitting, and washing, for example. The treatment IO in TreatmentProtocolApplication is restricted to having material inputs and material outputs only. There seem to be three sorts of models: column-oriented, category-oriented, and completely generic protocols. Much of what is in spML might be useful for the “library of models” we’ve been discussing.

The generic model is very flexible for different types of separation, and could be used for LC, GC, capillary electrophoresis, rotofors etc. It is also unlikely to break if a new type of experiment is defined, and the Treatments model could potentially be useful in the context of any experiment type. Also, the generic model is much smaller, and can be used in various ways. However, this last one could be a “con” as well, because different users/implementers are likely to encode the same information in different ways. Further, a specific model can guide the user to provide specific details, e.g. for MIAPE compliance.

spML units are derived from the OBO Unit Ontology. Should FuGE extensions be allowed to have user-defined terms? It would be useful for the creation of in-house lists to populate drop-down menus.

Below is a list of questions and suggestions that we came up with while the initial talks progressed in the first couple of sessions. Many were discussed, and some were answered, in breakout sessions later. Notes from the discussions I was a part of are included below. The unanswered points in the list may have been discussed at other breakout sessions, or may still be untouched.

Discussion on Semantic Validation and Identifiers: Khalid Belhajjame, Norman Paton, Allyson Lister, Martin

Identifiers and Auxiliary/Semantic Validation: Types of Validation and How Simple Support can be done.

in Document
Property Checked by XML Tooling
Property Checked by XML Tooling
Property Checked by XML Tooling
Property Checked by XML Tooling
of Identifiable
yes GP yes GP
(+) no yes no See
(#) n/a yes no yes no yes
(not in UML)
($) n/a yes no yes no yes (*)

(*) want to know it’s a file of a particular kind

GP: Can be checked with a generic program.

All things marked GP or X could be attacked by people wanting to write a semantic validation tool.

(+) Only for some types of Globally-unique identifiers would we be able to check that they were truly unique and well-formed.

(#): Should OntologyTerm elements be unique (irrespective of their identifiers, which must be unique)? If people compare OT identifiers they may think two terms are different when in fact they are the same, and someone was sloppy when making OT elements. However, if they have linked their OT to an OntologySource then it can be checked if it is both unique in document and globally unique (if it is a logical/physical URI)? In that case, why should OS be optional at all, if custom CV’s can be included in the OS?

(^): This is where the ontology mapping files come in.

($): The same argument for uniqueness of ED applies as that in OT.
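As an illustration of the kind of "generic program" check discussed above, here is a minimal sketch that looks for identifiers used more than once within a document, and for OntologyTerm-like elements that describe the same term under different identifiers. The attribute names (`identifier`, `termAccession`, `OntologySource_ref`) and the element name `OntologyIndividual` are assumptions made for the example, as is the sample document; a real FuGE-ML file may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def duplicate_identifiers(xml_text, id_attr="identifier"):
    """Return identifiers that occur on more than one element."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    return sorted(ident for ident, n in counts.items() if n > 1)


def repeated_terms(xml_text, tag="OntologyIndividual",
                   key_attrs=("termAccession", "OntologySource_ref"),
                   id_attr="identifier"):
    """Group term elements that share accession and source but not identifier."""
    root = ET.fromstring(xml_text)
    groups = {}
    for el in root.iter():
        if local_name(el.tag) != tag:
            continue
        key = tuple(el.get(attr) for attr in key_attrs)
        groups.setdefault(key, []).append(el.get(id_attr))
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up document: 'ot1' is reused, and ot1/ot2 describe the same term.
sample = """<FuGE>
  <OntologySource identifier="os1"/>
  <OntologyIndividual identifier="ot1" termAccession="EX:0001" OntologySource_ref="os1"/>
  <OntologyIndividual identifier="ot2" termAccession="EX:0001" OntologySource_ref="os1"/>
  <GenericMaterial identifier="ot1"/>
</FuGE>"""

reused = duplicate_identifiers(sample)
same_term = repeated_terms(sample)
```

Global uniqueness cannot be checked this way, since it depends on the identifier scheme; the sketch only covers uniqueness within one document.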

(1) Will we suggest a type of identifier to use with FuGE as a best-practice?

Do we still need internal document identifiers when the ontologyURI is a globally-unique identifier anyway?

Should identifiers be human readable?

Do community extensions automatically have their own namespace/prefix? That is, if “sample” is used in the FC community and also in another extension, will it be problematic if you try to create a multi-omics FuGE-ML file? This is all about linkage between different FuGE-based instances (unique identifiers, both within a single document and between documents). What is the identifier an identifier of? Is every Identifiable object a “first-class citizen”? We shouldn’t force all (any!) identifiers to be URI’s.

Should you use a logical or physical naming scheme?

Physical naming schemes:

  • Are fragile
  • May not work for all users (i.e. if the URI points to a laptop that isn’t publicly accessible)

Logical naming schemes:

  • Are robust, but require a greater investment of time, as they need tools that provide resolving facilities.
  • If locally-unique identifiers are used:
    • it means that you may get into trouble in the long run
  • If globally-unique identifiers are used:
    • clashes between different FuGE-ML files will be avoided
  • People should look this over and discover which is the best setup for their situation.

If we use URI’s, should URI’s be resolvable? What is the scope?

  • Martin has a URI that points to a data file, and a (possibly locally unique) identifier that points to a spectrum within the data file. How to deal with this? Do we have a best-practice for it?

Schema Validation

  • Schema validation does not consider whether _ref links to an allowed section (or that used CV’s are allowed). Native XML validation does not do this, but you could make a tool. In theory, the prefix before the _ref is always the name of the class. FuGE needs semantic validation.
  • How should user-defined ontology terms be validated in the XML?
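The _ref check mentioned above could be done by a small tool along these lines. This is a sketch under stated assumptions: identifiers live in an `identifier` attribute, references are attributes ending in `_ref`, and the prefix of the reference attribute is expected to appear at the end of the referenced element's name (so `Protocol_ref` may point at a `GenericProtocol`). That last rule is my loose reading of "the prefix before the _ref is always the name of the class" and would need refining against the real model, since it cannot see inheritance.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_refs(xml_text, id_attr="identifier", ref_suffix="_ref"):
    """Return (attribute, target, problem) tuples for suspicious references."""
    root = ET.fromstring(xml_text)

    # Map every identifier in the document to the name of its element.
    element_by_id = {}
    for el in root.iter():
        ident = el.get(id_attr)
        if ident is not None:
            element_by_id[ident] = local_name(el.tag)

    problems = []
    for el in root.iter():
        for attr, target in el.attrib.items():
            if not attr.endswith(ref_suffix):
                continue
            expected = attr[: -len(ref_suffix)]
            found = element_by_id.get(target)
            if found is None:
                problems.append((attr, target, "dangling reference"))
            elif not found.endswith(expected):
                problems.append(
                    (attr, target, f"points at {found}, expected {expected}")
                )
    return problems


# Made-up document: pa1 is fine, pa2 points at a material, pa3 points nowhere.
doc = """<FuGE>
  <GenericProtocol identifier="p1"/>
  <GenericMaterial identifier="m1"/>
  <GenericProtocolApplication identifier="pa1" Protocol_ref="p1"/>
  <GenericProtocolApplication identifier="pa2" Protocol_ref="m1"/>
  <GenericProtocolApplication identifier="pa3" Protocol_ref="missing"/>
</FuGE>"""

problems = check_refs(doc)
```

Plain schema validation would accept all three applications; only the extra pass over the references flags the last two.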

Discussion on Versioning: Khalid Belhajjame, Norman Paton, Allyson Lister, Phil Lord, Matt Pocock, Leandro Hermida

  • Is there a best way to implement versioning?

Characteristics of (SyMBA) Versioning:

  1. Complete History of Atomic Changes
  2. Low Cost of Updates –No Cascades
  3. Higher Cost of Retrieval

This is actually a transaction-time database with tuple-level timestamps. In a transaction-time database, the recorded time is the time in the world of the database (when the change was stored), not the time in the real world. This contrasts with a valid-time database, where the recorded time is the real-world time the fact applies to, which need not match the time at which it was inserted. If you don’t put the timestamp in the tuple, you put it in the attribute. In this context, people have looked at the properties of update operations.
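To make the three characteristics concrete, here is a minimal in-memory sketch of an append-only store with tuple-level transaction timestamps. It is not SyMBA's implementation; the class name, method names, and the use of a simple counter in place of a real database timestamp are all assumptions for illustration.

```python
import itertools


class VersionedStore:
    """Append-only store: every change is a new (identifier, tx_time, value) row."""

    def __init__(self):
        self._rows = []
        # A counter stands in for the database's own transaction clock.
        self._clock = itertools.count(1)

    def put(self, identifier, value):
        # Cheap update: one append, nothing else is touched (no cascades).
        tx_time = next(self._clock)
        self._rows.append((identifier, tx_time, value))
        return tx_time

    def history(self, identifier):
        # Complete history: nothing is ever overwritten or deleted.
        return [(t, v) for i, t, v in self._rows if i == identifier]

    def get(self, identifier, as_of=None):
        # Costlier retrieval: the rows must be scanned to find the right version.
        candidates = [
            (t, v) for t, v in self.history(identifier)
            if as_of is None or t <= as_of
        ]
        if not candidates:
            raise KeyError(identifier)
        return max(candidates, key=lambda row: row[0])[1]


store = VersionedStore()
t1 = store.put("exp1", "draft description")
t2 = store.put("exp1", "final description")
store.put("exp2", "another experiment")
```

Here `store.get("exp1")` returns the latest value, while `store.get("exp1", as_of=t1)` returns exactly the version that was current at the first transaction.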

Can’t just use LSID versioning because there is no specification of how the version should be updated.
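For reference, an LSID has the general shape urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below simply splits one apart; the function name and the example LSID are made up. It shows why the revision alone is not enough: it is an opaque string, and nothing in the identifier says what the next revision should be.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # opaque; no ordering or update rule is implied


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if (len(parts) not in (5, 6)
            or parts[0].lower() != "urn"
            or parts[1].lower() != "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


with_revision = parse_lsid("urn:lsid:example.org:experiment:42:3")
without_revision = parse_lsid("urn:lsid:example.org:experiment:42")
```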

SyMBA Versioning Requirements:

  1. Preserving the semantics of the LSID
  2. Getting exactly the version requested, and getting all versions
  3. Nothing should disappear

This isn’t necessarily versioning, or what versioning in FuGE should be.

Leandro’s Versioning Requirements:

  1. Getting exactly the version requested, and getting all versions
  2. Nothing should disappear

Should this be done in FuGE, or in the FuGE-OM specifically? Perhaps just in the Maven build? We could put hooks in FuGE that would allow fine-grained logging. The current Audit setup does not allow linking back to previous versions unless you put the delta in free text somehow. The Audit classes may be suitable for XML, and you could make a log of such changes and roll-back (in a non-RDBMS way) to whatever version you want.

While it is clear we could make an STK that could have versioning of some type, whether or not this should be an (optionally-used) change to the OM is a much bigger thing. It is certainly a worry that versioning has to be dealt with at the application level. However, versioning at the file or XML level means multiple files, otherwise you’d have to apply a diff to a very large file.

We haven’t really had the time to scan the space of options here. We could circulate a general document, and then outline what’s actually been done so far. A paper would, in any case, be centered around pros/cons, and a bit less on current implementations, but definitely not say which is the “right” way to do things, as there is no single right way.

There are different technical solutions, and not all of these solutions should necessarily be provided in the model.

Discussion on Tools – Leandro Hermida, Heiko Rosenfelder, Neil Wipat, Phil Lord, Allyson Lister

  • What about trying to get some automatic mapping between the XML classes and the Hibernate/Spring classes?
  • There is a disconnect between the XSD that is generated from the XML schema cartridge and the code generated from your persistence cartridge.
  • This means you have to write your mapping manually.
  • There is a possibility that we could get hyperjaxb3 to work for this (Allyson had tried with an earlier version but it didn’t work properly). Hyperjaxb3 generates both Entity POJO’s and the JAXB2 classes. So, in theory you could use only the AndroMDA XSD cartridge, and hyperjaxb3 for the rest. However, then you lose all the information that is present in the UML but not in the XSD.
  • Hyperjaxb3 uses both Hibernate and ejb3 natively (you can choose). Leandro wants to work on a merged persistence/hyperjaxb3 extended cartridge, or perhaps its own cartridge. So perhaps the generation of a hyperjaxb3 cartridge is possible in future.
  • Is there an XSLT that could be made to have a “standard” way of viewing a FuGE experiment?
    • Khalid mentioned that it is important to allow input from the programmer in such a tool, so they can show as little or as much of the FuGE structure as they wish to present to the user.
    • Leandro is working on an ejb3 cartridge from the AndroMDA plugins project (not part of the AndroMDA core yet), and has used FuGE as a test-case. What this cartridge does is generate a mapping file, which you load into any application server running Hibernate to generate your database. The Hibernate+Spring cartridge, by contrast, generates 1) Entity POJO’s + mapping files 2) Spring DAO + DAOException + DAOImpl. With the ejb3 cartridge you get 1) ejb3-annotated Entity POJO’s + DAO*. You can use Spring, if you wish, to build your web framework. Leandro decided to instead use Seam, which is the business layer of a web framework that builds on top of ejb3. Seam then uses JSF (Facelets) and JBoss RichFaces for the actual web UI. To get the Seam classes, you model <> classes and then draw dependencies, which then auto-generate Seam-enabled ejb service beans. However, the XHTML files for Facelets and RichFaces have to be written manually, though AndroMDA creates the entire web/ structure and base Seam classes for you. This doesn’t answer our simple UI question.
    • The ejb3 cartridge has a web service (jax-ws, via soap) to your DAO’s and Entity POJO’s.
    • With MAGE, someone wrote a regular Java Swing program where you download the jar, which opens a little tabbed client that views MAGE. We could do something similar. (A J2SE app to write/read FuGE-ML with a nice wizard interface)
    • The GSC has a lightweight XSD-to-web-form software app.
    • An XSLT is a style-sheet that is richer than CSS, but it is a tough language to use (convert XML to “HTML”). XSLT’s don’t have first-class functions, so you can’t do anything generic.
    • We would also like to have a simple jar that takes XML as input and outputs HTML. This means three tool types here: 1) heavyweight (already existing in SyMBA and SystemsX) 2) midweight (J2SE app to read/write with a wizard-like interface) 3) lightweight (input XML, output HTML with some simple options).
  • Tool support for FuGE STK version 1, including a validator
    • The MAGE STK includes a validator.
    • XML validation can be done with JAXB2 as is with the Milestone 3 STK, but longer-term we need the semantic validation tool.
    • Perhaps have some ontology lookup helper classes (OLS from the EBI?) to help users and developers add terms from (a certain set of?) ontologies. This may help people populate their databases, choose a term from a list on a front-end tool, etc.
  • Tool support for database schema / AndroMDA / Alternatives.
    • Dealt with in the other sections
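The "lightweight" tool type above (input XML, output HTML with some simple options) could look something like the following sketch. It is written in Python rather than as a jar purely for brevity, and the function names and the `max_depth` option are assumptions; the option is one simple way of letting the programmer show as little or as much of the structure as they wish.

```python
import xml.etree.ElementTree as ET
from html import escape


def element_to_html(el, max_depth=None, depth=0):
    """Render one element (and optionally its children) as a nested HTML list."""
    name = el.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
    attrs = ", ".join(f"{escape(k)}={escape(v)}" for k, v in el.attrib.items())
    label = f"<b>{escape(name)}</b>" + (f" ({attrs})" if attrs else "")
    children = list(el)
    if children and (max_depth is None or depth < max_depth):
        inner = "".join(
            element_to_html(child, max_depth, depth + 1) for child in children
        )
        return f"<li>{label}<ul>{inner}</ul></li>"
    return f"<li>{label}</li>"


def xml_to_html(xml_text, max_depth=None):
    """Convert an XML document into an HTML fragment of nested lists."""
    root = ET.fromstring(xml_text)
    return f"<ul>{element_to_html(root, max_depth)}</ul>"


full_view = xml_to_html('<FuGE><Protocol identifier="p1"/></FuGE>')
top_only = xml_to_html('<FuGE><Protocol identifier="p1"/></FuGE>', max_depth=0)
```

This is deliberately generic: it knows nothing about FuGE, so it would show every element in the same way, which is exactly the limitation a purpose-built viewer would need to improve on.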

Discussion on Challenging Constructs, including Investigation Package, Abstract Associations, and the Ontology Package – the entire group

  • What is the real meaning of the Investigation package? It’s one of the few parts of FuGE that isn’t meant to be extended.
  • How is the OntologyTerm abstract class intended to be used for specific controlled lists? One example is taxonomies as opposed to ontologies.

The intention is that this package would not be changed or extended by communities. Each technology would be reported in the InvestigationComponent. The Factor class actually is meant as a summary report of the factors used in the experiment. There is currently no direct link between the Factors and the protocol workflows – the detail can be recorded in other places in FuGE. It’s a summary and duplication of the factor information.

So, if you want to say that your Investigation compared two different values for a single factor, the Investigation has the factor type, but not the data for the factor or the values themselves. However, you can connect to the data made from the various omics technologies via the DataPartition class. There could be a problem where it is a set of factors that only together make a particular set of data useful, for example if your important aggregate of factors is time1.mouse1.foodstuff1. Each of these factors would have to be named separately, and you would get a different slice of data for time1 than you would for mouse1. How do you join them up? Perhaps allow multiple FactorValues (and OntologyTerms) for a single Factor. Not a very nice solution, though. Perhaps you don’t need to change it at all, as you would only add Factors that are relevant to a particular

How do you describe which combinations of Factors are the combinations you’re interested in?
Norman did this by seeing an IC as a particular run of an experiment.

Dimensions are used in FuGE as a way of naming coordinates in a matrix. This does not mean that the data has to be stored here. You can store the data internally via the InternalData class, or you can reference it externally via the ExternalData class (or, of course, create your own subclass of Data).

There are 21 <>‘s in FuGE, and all but 6 have identical concrete associations. Some auto-generated AndroMDA code mistakenly ignores the “abstract” parts and incorrectly generates the methods etc. In this case, you can just delete the abstract association in your copy of the UML and re-generate the code. It should be fixed within AndroMDA, though.

For multi-dimensional data, DataPartitions are meant as a mechanism to relate back to the data from the Investigation, but many groups will choose not to use DataPartition. Very big, regularly-shaped data sets will be good things to use DataPartition with (e.g. Flow Cytometry). In the case of proteomics data, this may be more of a challenge. A best-practices document should contain information on which data types are best-suited to this system, and which aren’t. It should also include any alternatives to this system. One alternative solution to using DataPartition and its associated coordinate system for dimensional data is to build an association from the data of interest back to FactorValue.

What is PartitionPair? It is for the case where you have two data files, and you wish to associate a particular row of one data file (for example) with a particular spectrum in another (to continue the example). So, it is a “shortcut” to linking particular data sets.

How should users of FuGE build CV lists using the Ontology Package? An OntologyTerm has an OntologyProperty, which contains both a DataProperty and an ObjectProperty (these are the relationships within an ontology). Also inside OntologyTerm is OntologyIndividual. OI is the individual itself. Why not just provide the term – why try to recreate the structure of an ontology in UML? However, in OWL, every single class, relationship etc. has a URI, so why not use those in UML? An example use: you have in an ontology the concept of age, which has an initial time point and a unit. How do you pull that concept into the UML? We’re essentially creating a cut-down version of the ontology to allow extensibility in FuGE. But why would you want this? To create an individual of an ontology within the UML. It also allows restrictions of the name-values (left and right-hand side of a relationship) to those that are allowed within the ontology. One opinion is that there shouldn’t be a purpose-built extensibility point in UML, as the entire purpose of UML is that it is extensible everywhere. It also means that users of your FuGE file don’t need to parse both that file and the ontology file. However, the users of your file must understand the extensibility point that you’ve made, which isn’t useful. The extra knowledge should be stored in that ontology, in the same way as analysisXML links to mzML. One solution is to have a Property class with term “height” and a Value class with term “meters”, and a PV class with associations to both Property and Value that provides the link. In the end, this is optional. In the guidelines, these concerns should be expressed.
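The Property/Value/PV suggestion can be pictured with a few small classes. This is a hypothetical sketch only: the class and field names, the `describe` helper, and the example.org URIs are all invented for illustration and are not part of FuGE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """Left-hand side of the relationship, e.g. the term 'height'."""
    term: str
    term_uri: str = ""  # optional pointer to the term in its ontology


@dataclass(frozen=True)
class Value:
    """Right-hand side of the relationship, e.g. the term 'meters'."""
    term: str
    term_uri: str = ""


@dataclass(frozen=True)
class PropertyValue:
    """The 'PV' class: it only provides the link between the two sides."""
    prop: Property
    value: Value

    def describe(self):
        return f"{self.prop.term} in {self.value.term}"


height_in_meters = PropertyValue(
    Property("height", "http://example.org/onto#height"),
    Value("meters", "http://example.org/onto#meters"),
)
```

The model stays tiny because the knowledge about what "height" and "meters" mean lives in the ontology behind the URIs, not in the UML.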

Other questions not fully addressed:

  • How do we find out when Generic* classes should be replaced by a specific omics-type class? Rather than using named associations to
    GenericParameter, GelML tended to use either GenericParameter (with a CV term) or extended the Parameter class. Is there a best way to
    use Parameter/GenericParameter?

    If it is the same shape as the Generic class, and you are just renaming it, that is a good argument for using an ontology term. However, there is less of a learning curve for users if you subclass GenericParameter with your own name. Subclassing can lock you in, and may make life more difficult further down the line if your requirements change. Remember though, hardly anyone will write XML by hand, and we shouldn’t worry too much about tool implementation. Still want to make it easier for tool developers, though!
  • How should we model time?
  • For experiment-specific semantics, when should we extend the FuGE model rather than add information through the use of ontologies?
  • How descriptive should extending communities get with their samples? Should it be at the entity or attribute level? Is there a
    best-practice that should be documented?
  • How do we find out if two classes from two different communities are actually the same? Recurring model requirements, e.g. a library of
    model fragments, e.g. time and sample.
  • Could organism be fitted into FuGE::Bio somewhere?
  • There is no date of the Action in the ActionApplication.
    You could have a time parameter that comes in when you add it to your own subclass of Action/ActionApplication, and then provide a
    different value for that parameter in ActionApplication.
  • Somewhere, the distinction between Action and Protocol should be defined.
  • In general, we should describe a modelling best-practice to tell what is considered “standard” procedure.
  • Data package: internal versus external data
  • There may be an issue with describing physical materials within Protocols versus ProtocolApplications (theoretical materials vs physical
    materials, SubstanceMixtureProtocol was designed to account for this problem)


CISBAN Meetings & Conferences Standards

1st RSBI Workshop, 6-8 December 2007

Last week I attended the first RSBI (Reporting Structure for Biological Investigations) Workshop, carrying with me a multitude of hats. RSBI is a working group committed to the progression of standardization in multi-omics investigations. The purpose of the workshop was to examine and offer suggestions on the initial draft of ISA-TAB (more on that in a moment).

My first hat was a FuGE-user's hat, as the triumvirate of standards upon which RSBI is built is the Functional Genomics Experiment Model (FuGE), the Minimum Information for Biological and Biomedical Investigations (MIBBI) Project, and the Ontology for Biomedical Investigations (OBI). I was asked to give a current status update on FuGE itself, and on any communities that have already built extensions to FuGE. Andy Jones from Liverpool provided me with all of the hot-off-the-press information (my FuGE slides) – thanks Andy!

My second hat was a SyMBA-developer's hat. SyMBA uses FuGE to build a database and web front-end for storing data and experimental metadata. We use it in-house to store all of our large, high-throughput 'omics data. The use of FuGE in the system made it relevant for the workshop (my SyMBA slides, more SyMBA slides).

My final hat was a CISBAN-employee's hat. I work in the Wipat group there, and CISBAN is one of the "leading groups" involved in RSBI. As such, I was CISBAN's representative to the workshop.

The reason for the workshop, as stated earlier, was the evaluation of ISA-TAB, a proposed tabular format whose purpose is to provide a standard format for data and metadata submission into the formative BioMAP database at the EBI. ISA-TAB would have two uses:

  1. Humans: As a tabular format, it is quite easy for people to view and manipulate such templates within spreadsheet software such as Excel.
  2. Computers: As an interim solution only, ISA-TAB would be used as a computational exchange format until such time as each of the FuGE-based community extensions are complete for Metabolomics, Proteomics, and Transcriptomics. At this time, ISA-TAB would remain available for human use, but there would be a conversion step into "FuGE-ML".

The scope for ISA-TAB is large, and this was reflected in the attendees of the meeting. Representatives from ArrayExpress, Pride, and BioMAP were of course present, but also attending were people from the Metabolomics community, the MIACA project, toxico- and environmental genomics, and the FDA's NCTR.

A full write-up of the results of the workshop will soon be available online at the project's RSBI Google Group, so I'll leave it there. It was an exciting meeting, with fantastic food and even better discussions on getting public databases organized quickly for simple, straightforward multi-omics investigation data and metadata submission.

You can contact the RSBI via
