
The more things change, the more they stay the same

…also known as Day 1 of the BBSRC Synthetic Biology Standards Workshop at Newcastle University, and musings arising from the day’s experiences.

In my relatively short career (approximately 12 years – wait, how long?) in bioinformatics, I have been involved to a greater or lesser degree in a number of standards efforts. It started in 1999 at the EBI, where I worked on the production of the protein sequence database UniProt. Now, I’m working with systems biology data and beginning to look into synthetic biology. I’ve been involved in the development (or maintenance) of a standard syntax for protein sequence data; standardized biological investigation semantics and syntax; standardized content for genomics and metagenomics information; and standardized systems biology modelling and simulation semantics.

(Bear with me – the reason for this wander down memory lane becomes apparent soon.)

How many standards have you worked on? How can there be multiple standards, and why do we insist on creating new ones? Doesn’t the definition of a standard mean that we should only need one? Not exactly. Take the field of systems biology as an example. Some people are interested in describing a mathematical model, but have no need to store either the details of how to simulate that model or the results of multiple simulation runs. These are logically separate activities, yet they fall within a single community (systems biology) and are broadly connected: a model is used in a simulation, which then produces results. So, when building standards, you end up with the same separation: one standard for the modelling, another for describing a simulation, and a third for structuring the results of a simulation. All that information does not need to be stored in a single location all the time. The separation becomes even clearer when you move across fields.

But this isn’t completely clear cut. Some types of information overlap within standards of a single domain and even among domains, and this is where it gets interesting. Not only do you need a single community talking to each other about standard ways of doing things, but you also need cross-community participation. Such efforts result in higher-level standards which many different communities can use. This is where work such as OBI and FuGE sits: with such standards, you can describe virtually any experiment. The interconnectedness of standards is a whole job (or jobs) in itself – just look at the BioSharing and MIBBI projects. And sometimes standards that seem (at least mostly) orthogonal do share a common ground. Just today, Oliver Ruebenacker posted some thoughts on the biopax-discuss mailing list where he suggests that at least some of BioPAX and SBML share a common ground and might usefully be “COMBINE”d more formally (yes, I’d like to go to COMBINE; no, I don’t think I’ll be able to this year!). (Scroll down that thread for a response by Nicolas Le Novère as to why that isn’t necessarily correct.) So, orthogonality, or the extent to which two or more standards overlap, is sometimes a hard thing to determine.

So, what have I learnt? As always, we must be practical. We should try to develop an elegant solution, but it really, really should be one which is easy to use and intuitive to understand. It’s hard to get to that point, especially as I think that point is (and should be) a moving target. From my perspective, group standards begin with islands of initial research in a field, which then gradually develop into a nascent community. As a field evolves, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with your peers becomes more and more important, and it becomes imperative that standards are developed.

This may sound obvious, but the practicalities of creating a community standard mean that such work requires a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months – or years – of meetings, document revisions and conference calls to hash out a working standard. This isn’t necessarily a bad thing, though. All voices do need to be heard, and you cannot have a viable standard without input from the community you are creating that standard for. You can have the best structure or semantics in the world, but if it’s been developed without the input of others, you’ll find people strangely reluctant to use it.

Every time I take part in a new standard, I see others like me who have themselves been involved in the creation of standards. It’s refreshing and encouraging. Hopefully the time it takes to create standards will drop as the science community as a whole gets more used to the idea. When I started, the only real standards in biological data (at least that I had heard of) were the structures defined by SWISS-PROT and EMBL/GenBank/DDBJ. By the time I left the EBI in 2006, I could have given you a list a foot long (GO, PSI, and many others), and that list continues to grow. Community engagement and cross-community discussions continue to be popular.

In this context, I can now add synthetic biology standards to my list of standards I’ve been involved in. And, as much as I’ve seen new communities and new standards, I’ve also seen a large overlap in the standardization efforts and an even greater willingness for lots of different researchers to work together, even taking into account the sometimes violent disagreements I’ve witnessed! The more things change, the more they stay the same…

My involvement is limited at this stage, but the BBSRC Synthetic Biology Standards Workshop I’m attending today and tomorrow is a good place to start with synthetic biology. I describe most of today’s talks in this post, and will continue with another blog post tomorrow. Enjoy!

For those with less time, here is a single sentence from each talk that most resonated with me:

  1. Mike Cooling: Emphasise the ‘re’ in reusable, and make it easier to build and understand large models from reusable components.
  2. Neil Wipat: For a standard to be useful, it must be computationally amenable as well as useful for humans.
  3. Herbert Sauro: Currently there is no formal ontology for synthetic biology, but one will need to be developed.

This meeting is organized by Jen Hallinan and Neil Wipat of Newcastle University. Its purpose is to set up key relationships in the synthetic biology community to aid the development of a standard for that community. Today, I listened to talks by Mike Cooling, Neil Wipat, and Herbert Sauro. I was – unfortunately – unable to be present for the last couple of talks, but will be around again for the second – and final – day of the workshop tomorrow.

Mike Cooling – Auckland Bioengineering Institute, New Zealand

Mike uses CellML (it’s made where he works, but that’s not the only reason…) in his work with systems and synthetic biology models. Among other things, it wraps MathML and partitions the maths, variables and units into reusable pieces. Although many of the parts seem domain specific, CellML itself is actually not domain specific. Further, unlike other modelling languages such as SBML, components in CellML are reusable and can be imported into other models. (Yes, a new package called comp in SBML Level 3 is being created to allow the importing of models into other models, but it isn’t mature – yet.)

How are models stored? There is the CellML repository, but what is out there for synthetic biology? The Registry of Standard Biological Parts was available, but only described physical parts. Therefore they created a Registry of Standard Virtual Parts (SVPs) to complement the original registry. This was developed as a group effort with a number of people including Neil Wipat and Goksel Misirli at Newcastle University.

They start with template mathematical structures (which are little parts of CellML), and then use the import functionality available as part of CellML to combine the templates into larger physical things/processes (‘SVPs’) and ultimately to combine things into system models.
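
To make the import mechanism a little more concrete, here is a tiny sketch of my own (not from the talk, and with made-up file and component names) showing how one CellML 1.1 model can pull a component in from another file rather than copying its maths; I’ve written it in Python with lxml purely for illustration.

    # Illustrative only: a minimal CellML 1.1 <import> built with lxml.
    # The file name "promoter_template.cellml" and the component names
    # are hypothetical stand-ins for an SVP template.
    from lxml import etree

    CELLML_NS = "http://www.cellml.org/cellml/1.1#"
    XLINK_NS = "http://www.w3.org/1999/xlink"
    NSMAP = {None: CELLML_NS, "xlink": XLINK_NS}

    model = etree.Element(f"{{{CELLML_NS}}}model", nsmap=NSMAP, name="system_model")

    # Pull a promoter template (an SVP) in from a separate CellML file.
    imp = etree.SubElement(model, f"{{{CELLML_NS}}}import")
    imp.set(f"{{{XLINK_NS}}}href", "promoter_template.cellml")
    etree.SubElement(imp, f"{{{CELLML_NS}}}component",
                     name="promoter_1", component_ref="promoter_template")

    print(etree.tostring(model, pretty_print=True).decode())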

They extended the CellML Repository to hold the resulting larger multi-file models, which included adding distributed version control and allowing the sharing of models between projects through embedded workspaces.

What can these pieces be used for? Some of this work included creating a CellML model of the biology represented in Levskaya et al. 2005 and depositing all of the pieces of the model in the CellML repository. Another example is a model he’s working on about shear stress and multi-scale modelling for aneurysms.

Modules are being used and are growing in number, which is great, but he wants to concentrate more at the moment on the ‘re’ of the reusable goal, and make it easier to build and understand large models from reusable components. Some of the integrated services he’d like to have: search and retrieval, (semi-automated) visualization, semantically-meaningful metadata and annotations, and semi-automated composition.

All this work converges on the importance of metadata. Not many people used the CellML Metadata Framework 1.0. With version 2.0 they have developed a core specification which is very simple, and then provide many additional satellite specifications. For example, there is a biological information satellite, where you use the BioModels qualifiers as relationships between your data and MIRIAM URNs. The main challenge is to find a database that is at the right level of abstraction (e.g. canonical forms of your concept of interest).
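
As an illustration of what that biological-information satellite might look like in practice, here is a small sketch of my own (not from the talk; the model URI and the ChEBI term are made up for the example) that attaches a BioModels qualifier and a MIRIAM URN to a model variable using Python’s rdflib.

    # Hypothetical example: link a model variable to a MIRIAM URN
    # using a BioModels biology qualifier.
    from rdflib import Graph, Namespace, URIRef

    BQBIOL = Namespace("http://biomodels.net/biology-qualifiers/")

    g = Graph()
    g.bind("bqbiol", BQBIOL)

    # Made-up identifiers: a variable in a CellML model, and ATP in ChEBI.
    variable = URIRef("http://example.org/models/my_model.cellml#ATP")
    chebi_atp = URIRef("urn:miriam:obo.chebi:CHEBI%3A15422")

    g.add((variable, BQBIOL.isVersionOf, chebi_atp))

    print(g.serialize(format="turtle"))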

Neil Wipat – Newcastle University

Please note Neil Wipat is my PhD supervisor.

Speaking about data standards, tool interoperability, data integration and synthetic biology, a.k.a. “Why we need standards”. They would like to promote interoperability and data exchange between their own tools (important!) as well as with other tools. They’d also like to facilitate data integration to inform the design of biological systems, both from a manual designer’s perspective and from the POV of what is necessary for computational tool use. They’d also like to enable the iterative exchange of data and experimental protocols in the synthetic biology life cycle.

A description of some of the tools developed in Neil’s group (and elsewhere) exemplifies the differences in data structures present within synthetic biology. BacilloBricks was created to help get, filter and understand the information from the MIT Registry of Standard Parts. They also created the Repository of Standard Virtual Biological Parts. This SVP repository was then extended with parts from Bacillus and with support for SBML as well as CellML; this project is called BacilloBricks Virtual. All of these tools use different formats.

It’s great having a database of SVPs, but you need a way of accessing and utilizing the database. Hallinan and Wipat have started a collaboration with the people at Microsoft Research who created GEC, a programming language and simulator for the genetic engineering of living cells. A summer student’s project produced a GEC compiler for SVPs from BacilloBricks Virtual. Goksel has also created the MoSeC system, where you can automatically go from a model to a graph to an EMBL file.

They also have BacillusRegNet, which is an information repository about transcription factors for Bacillus spp. It is also a source of orthogonal transcription factors for use in B. subtilis and Geobacillus. Again, it is very important to allow these tools to communicate efficiently.

The data warehouse they’re using is ONDEX. They feed information from the ONDEX data store to the biological parts database. ONDEX was created for systems biology to combine large experimental datasets. ONDEX views everything as a network, and is therefore a graph-based data warehouse. ONDEX has a “mini-ontology” to describe the nodes and edges within it, which makes querying the data (and understanding how the data is structured) much easier. However, it doesn’t include any information about the synthetic biology side of things. Ultimately, they’d like an integrated knowledgebase using ONDEX to provide information about biological virtual parts. Therefore they need a rich data model for synthetic biology data integration (perhaps including an RDF triplestore).
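
To give a feel for the “everything is a network” view, here is a toy sketch of my own (not from the talk; the part names and relation types are invented) of a typed graph in the spirit of ONDEX’s node/edge mini-ontology, using Python’s networkx.

    # Toy example: parts and their relationships as a typed, directed graph.
    import networkx as nx

    g = nx.DiGraph()

    # Nodes are typed via a 'concept_class' attribute (cf. ONDEX concept classes).
    g.add_node("PspaS", concept_class="Promoter")
    g.add_node("spaS", concept_class="CDS")
    g.add_node("SpaRK", concept_class="Protein")

    # Edges are typed via a 'relation' attribute (cf. ONDEX relation types).
    g.add_edge("SpaRK", "PspaS", relation="activates")
    g.add_edge("PspaS", "spaS", relation="regulates_expression_of")

    # A query then becomes a graph traversal: find everything that activates a promoter.
    for u, v, data in g.edges(data=True):
        if data["relation"] == "activates" and g.nodes[v]["concept_class"] == "Promoter":
            print(f"{u} activates promoter {v}")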

Interoperability, Design and Automation: why we need standards.

  1. There needs to be interoperability and data exchange among these tools, as well as between these tools and other external tools.
  2. Standards for data integration aid the design of synthetic systems. The format must be both computationally amenable and useful for humans.
  3. Automation of the design and characterization of synthetic systems also requires standards.

The requirements of synthetic biology research labs such as Neil Wipat’s make it clear that standards are needed.

KEYNOTE: Herbert Sauro – University of Washington, US

Herbert Sauro described the developing community within synthetic biology, the work on standards that has already begun, and the Synthetic Biology Open Language (SBOL).

He asks us to remember that Synthetic Biology is not biology – it’s engineering! Beware of sending synthetic biology grant proposals to a biology panel! It is a workflow of design-build-test. He’s mainly interested in the bit between building and testing, where verification and debugging happens.

What’s so important about standards? Standardization is critical in engineering, where it increases productivity and lowers costs. In order to identify the requirements you must describe a need. There is one immediate need: store everything you need to reconstruct an experiment within a paper (for more on this see the Nature Biotech paper by Peccoud et al. 2011: Essential information for synthetic DNA sequences). Currently, it’s almost impossible to reconstruct a synthetic biology experiment from a paper.

There are many areas requiring standards to support the synthetic biology workflow: assembly, design, distributed repositories, laboratory parts management, and simulation/analysis. From a practical POV, the standards effort needs to allow researchers to electronically exchange designs with round tripping, and much more.

The standardization effort for synthetic biology began with a grant from Microsoft in 2008 and the first meeting was in Seattle. The first draft proposal was called PoBoL but was renamed to SBOL. It is a largely unfunded project. In this way, it is very similar to other standardization projects such as OBI.

DARPA mandated 2 weeks ago that all projects funded from Living Foundries must use SBOL.

SBOL is involved in the specification, design and build part of the synthetic biology life cycle (but not in the analysis stage). There are a lot of tools and information resources in the community where communication is desperately needed.

SBOL has three parts: SBOL Semantic, SBOL Visual, and SBOL Script. SBOL Semantic is the one that’s going to be doing all of the exchange between people and tools. SBOL Visual is a controlled vocabulary and set of symbols for sequence features.

Have you been able to learn anything from SBML/SBGN, as you have a foot in both worlds? SBGN doesn’t address any of the genetic side, and is pretty complicated. You ideally want a very minimalistic design. SBOL semantic is written in UML and is relatively small, though has taken three years to get to this point. But you need host context above and beyond what’s modelled in SBOL Semantic. Without it, you cannot recreate the experiment.

Feature types such as operator sites, promoter sites, terminators, and restriction sites can go into the Sequence Ontology (SO). The SO people are quite happy to add these things into their ontology.

SBOLr is a web front end for a knowledgebase of standard biological parts that they used for testing (not publicly accessible yet). TinkerCell is a drag-and-drop CAD tool for design and simulation. There is a lot of semantic information underneath to determine what is and isn’t possible, though there is no formal ontology. However, you can semantically annotate all parts within TinkerCell, allowing the plugins to interpret a given design. A TinkerCell model can be composed of sub-models, which makes it easy to swap in new bits of models to see what happens.

WikiDust is a TinkerCell plugin written in Python which searches SBPkb for design components, and ultimately uploads them to a wiki. LibSBOLj is a library for developers to help them connect software to SBOL.

The physical and host context must be modelled to make all of this useful. By using semantic web standards, SBOL becomes extensible.

Currently there is no formal ontology for synthetic biology but one will need to be developed.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!


Standards and infrastructure for managing experimental metadata (ISMB DAM SIG 2009)

Philippe Rocca-Serra, EBI

Metagenomics and metatranscriptomics experiments are growing in size and complexity; Gilbert et al., PLoS One, 2008 is one example. There are many different domains of science, and they all share some common problems. Consistent reporting of the experimental metadata along with the resulting data has a positive and long-lasting impact on the value of collective scientific outputs. To help solve these problems, many communities have developed reporting standards covering the minimal information to be reported about an experiment type (MIBBI).

At the EBI, there are separate submission systems for proteomics data, transcriptomics data, sequence data etc. This is frustrating for the researcher who may have all of this information in one experiment. Therefore, BII is being developed to simplify the submission process. More generally, work is underway to promote synergies among standards initiatives. The common efforts include: MIBBI for scope, FuGE and ISA-TAB for syntax, and OBO foundry ontologies and terminologies for semantics. There is a MIBBI talk later this week at ISMB by Chris Taylor.

There are a number of components that are part of the ISA infrastructure (for more information, see http://isatab.sf.net). The ISAconfigurator allows you to set metadata fields and allowed values; ISAcreator itself is used to describe and upload the experimental metadata (ontologies are accessed in real time via the Ontology Lookup Service and BioPortal, and groups of samples are colour coded); ISAcreator also has a nice visualisation of the various group types that gives you an overview of their size and relative importance to other groups.

My thoughts: Firstly, you should know that I contributed to both the FuGE projects and the ISA-TAB projects, and helped develop the ISA-TAB specification. Therefore, I have an interest in this. Moving on… Overall, it really looks like isacreator is coming along nicely from its early incarnations. It looks nice, which is very important for user uptake. It also is compatible with FuGE (though development is still ongoing to increase compatibility, I think).

FriendFeed discussion: http://ff.im/4vcW5

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


Modeling and Managing Experimental Data Using FuGE


Want to share your umpteen multi-omics data sets and experimental protocols with one common format? Encourage collaboration! Speak a common language! Share your work! How, you might ask? With FuGE, and this latest paper (citation at the end of the post) tells you how.

In 2007, FuGE version 1 was released (website, Nature Biotechnology paper). FuGE allows biologists and bioinformaticians to describe any life science experiment using a single format, making collaboration and repeatability of experiments easier and more efficient. However, if you wanted to start using FuGE, until now it was difficult to know where to start. Do you use FuGE as it stands? Do you create an extension of FuGE that specifically meets your needs? What do the developers of FuGE suggest when taking your first steps using it? This paper focuses on best practices for using FuGE to model and manage your experimental data. Read this paper, and you’ll be taking your first steps with confidence!

[Aside: Please note that I am one of the authors of this paper.]

What is FuGE? I’ll leave it to the authors to define:

The approach of the Functional Genomics Experiment (FuGE) model is different, in that it attempts to generalize the modeling constructs that are shared across many omics techniques. The model is designed for three purposes: (1) to represent basic laboratory workflows, (2) to supplement existing data formats with metadata to give them context within larger workflows, and (3) to facilitate the development of new technology-specific formats. To support (3), FuGE provides extension points where developers wishing to create a data format for a specific technique can add constraints or additional properties.

A number of groups have started using FuGE, including MGED, PSI (for GelML and AnalysisXML), MSI, flow cytometry, RNA interference and e-Neuroscience (full details in the paper). This paper helps you get a handle on how to use FuGE by presenting two running examples of capturing experimental metadata: one from flow cytometry and one from gel electrophoresis. Part of Figure 2 from the paper is shown on the right, and describes one section of the flow cytometry FuGE extension from FICCS.

The flow cytometry equipment created as subclasses of the FuGE equipment class.

FuGE covers many areas of experimental metadata including the investigations, the protocols, the materials and the data. The paper starts by describing how protocols are designed in FuGE and how those protocols are applied. In doing so, it describes not just the protocols but also parameterization, materials, data, conceptual molecules, and ontology usage.

Examples of each of these FuGE packages are provided in the form of either the flow cytometry or the GelML extensions. Further, clear scenarios are provided to help the user determine when it is best to extend FuGE and when it is best to re-use existing FuGE classes. For instance, it is best to extend the Protocol class with an application-specific subclass when all of the following are true: when you wish to describe a complex Protocol that references specific sub-protocols, when the Protocol must be linked to specific classes of Equipment or Software, and when specific types of Parameter must be captured. I refer you to the paper for scenarios for each of the other FuGE packages such as Material and Protocol Application.
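
To illustrate the shape of that extension pattern, here is a loose sketch in Python (my own illustration; the real FuGE model is defined in UML and typically realised as XML Schema or Java classes, and the GelElectrophoresisProtocol subclass and its fields are hypothetical).

    # A loose sketch of the Protocol extension scenario, not the actual FuGE object model.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Parameter:
        name: str
        value: str

    @dataclass
    class Equipment:
        name: str

    @dataclass
    class Protocol:
        # Generic FuGE-style protocol, which may reference sub-protocols.
        name: str
        sub_protocols: List["Protocol"] = field(default_factory=list)

    @dataclass
    class GelElectrophoresisProtocol(Protocol):
        # Application-specific subclass: links specific Equipment and Parameters.
        equipment: List[Equipment] = field(default_factory=list)
        parameters: List[Parameter] = field(default_factory=list)

    # All three conditions from the scenario hold here: named sub-protocols,
    # specific equipment, and specific parameters.
    protocol = GelElectrophoresisProtocol(
        name="2-D gel electrophoresis",
        sub_protocols=[Protocol(name="Sample loading")],
        equipment=[Equipment(name="Electrophoresis tank")],
        parameters=[Parameter(name="voltage", value="200 V")],
    )
    print(protocol)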

The paper makes liberal use of UML diagrams to help you understand the relationship between the generic FuGE classes and the specific sub-classes generated by extensions. A large part of the paper is concerned expressly with helping the user understand how to model an experiment type using FuGE, and also to understand when FuGE on its own is enough. But it also does more than that: it discusses the current tools that are already available for developers wishing to use FuGE, and it discusses the applicability of other implementations of FuGE that might be useful but do not yet exist. Validation of FuGE-ML and the storage of version information within the format are also described. Implementations of FuGE, including SyMBA and sysFusion for the XML format and ISA-TAB for compatibility with a spreadsheet (tab-delimited) format, are also summarised.

I strongly believe that the best way to solve the challenges in data integration faced by the biological community is to constantly strive to simply use the same (or compatible) formats for data and for metadata. FuGE succeeds in providing a common format for experimental metadata that can be used in many different ways, and with many different levels of uptake. You don’t have to use one of the provided STKs in order to make use of FuGE: you can simply offer your data as a FuGE export in addition to any other omics formats you might use. You could also choose to accept FuGE files as input. No changes need to be made to the underlying infrastructure of a project in order to become FuGE compatible. Hopefully this paper will flatten the learning curve associated for developers, and get them on the road to a common format. Just one thing to remember: formats are not something that the end user should see. We developers do all this hard work, but if it works correctly, the biologist won’t know about all the underpinnings! Don’t sell your biologists on a common format by describing the intricacies of FuGE to them (unless they want to know!), just remind them of the benefits of a common metadata standard: cooperation, collaboration, and sharing.

Jones, A., Lister, A.L., Hermida, L., Wilkinson, P., Eisenacher, M., Belhajjame, K., Gibson, F., Lord, P., Pocock, M., Rosenfelder, H., Santoyo-Lopez, J., Wipat, A., & Paton, N. (2009). Modeling and Managing Experimental Data Using FuGE. OMICS: A Journal of Integrative Biology, 13. DOI: 10.1089/omi.2008.0080


SyMBA Demo causes pondering: how should a bioinformatician choose their output format(s)?

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

SyMBA Demo. The lunch hour was also the demo hour. People came to visit me at the SyMBA demo desk for the whole hour, and we had some interesting conversations. There is one particular question I would like to relate from that hour: what should a bioinformatician choose as an output/export format for multi-omics data? This post relates my thoughts about this challenge. It's not meant to be comprehensive: just some ramblings.

I solve this challenge in SyMBA by storing everything as FuGE objects, which can be exported to FuGE-ML. FuGE-ML can be converted, using XSLTs, into ISA-TAB and into an HTML format that mimics ISA-TAB. Therefore, because of this link between FuGE and ISA-TAB, you can leverage two complementary formats.
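
For the curious, the conversion step itself amounts to running an XSLT over the FuGE-ML export. A minimal sketch in Python with lxml might look like the following (the file names are placeholders, and the real stylesheet is the one produced by the FuGE/ISA-TAB work, not reproduced here).

    # Minimal sketch: apply an XSLT stylesheet to a FuGE-ML document.
    # Both file names are placeholders.
    from lxml import etree

    fuge_doc = etree.parse("experiment.fuge.xml")      # a FuGE-ML export, e.g. from SyMBA
    stylesheet = etree.parse("fuge-to-isatab.xsl")     # placeholder name for the conversion XSLT
    transform = etree.XSLT(stylesheet)

    isatab_doc = transform(fuge_doc)
    print(str(isatab_doc))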

But how does a bioinformatician who has just been tasked with building an application (generally on a short time-scale) choose which export format to use, e.g. FuGE or ISA-TAB? There are considerations of:

  • scale: lightweight or heavyweight implementation. A lightweight implementation might favor your own version of ISACreator and the use of ISA-TAB, or a FuGE-based archive (but not a full-blown LIMS) like SyMBA. A heavyweight solution might be a full LIMS such as PIMS, or another FuGE implementation in development called SysFusion.
  • intent: what is the purpose of storing this data? Is it for later analysis? For later deposition to a public database, e.g. at the EBI? Is it archiving? Is it a combination of these things? Your intent will shape what type of application you build, and what formats you focus your effort on. If your intent is storage only, choose whatever is most convenient for your users. However, these days there is always some aspect of data sharing or publishing. If you need further analysis of the data, then you probably want to be able to produce a computationally-friendly format such as XML. If your intent is submission to public databases, you need to ensure you export in a format they import.

Unfortunately, what this means is that the decision depends on the circumstances. FuGE and ISA-TAB are linked, and so you really get two for the price of one with those. I see this sort of thing as a positive – you have a choice as to the representation, storage and export of your data – a choice of formats! And many, like FuGE and ISA-TAB, are going to be easily convertible. The choice depends on your needs, but there is one easy choice: use something that's already been developed – don't reinvent the wheel!

Anyone else have any further suggestions?



SysMO-DB and Carole Goble, BBSRC Systems Biology Workshop

BBSRC Systems Biology Grantholder Workshop, University of Nottingham, 16 December 2008.

Systems Biology of Microorganisms (SysMO): 11 projects from 91 institutes, whose aim is to record and describe the dynamic molecular processes occurring in microorganisms in a comprehensive way. These projects have no single concept of experimentation or modelling, which makes information exchange tough. Further, there are issues of people having their own solutions, suspicions (about sharing data, for instance), data issues (many don't have data or don't store it in a standard way) and resource issues (no extra resources). SysMO-DB started in July 2008, and is a 3-year funded effort (3+3 people in 3 teams over 3 sites). Its aim is to provide a web-based solution to exchange, search, and disseminate data, and to retrofit a data access, model handling and data integration platform. Because of the large number of groups and projects, they are going to aim for low-hanging fruit and early wins: be realistic, don't reinvent, stay sustainable, and encourage standards adoption.

Just like at CISBAN, where we have implemented a web-based data integration, storage, exchange, and dissemination platform in a standards-compliant way (SyMBA), they have three user groups: experimentalists, bioinformaticians, and modellers. They're lucky, though, in that they have 6 people to develop SysMO-DB, when CISBAN only has 1. 🙂 And, as with CISBAN and many other data integration efforts, much of the work is social: that is, encouraging those three user types to collaborate and understand each other's work. The social solutions include questionnaires, "PALS" (postdocs and PhD students), and audits and sharing of methods, data, and models. They discuss things like what people need or don't need from MIAME. (Personal opinion and question: MIAME is intended as a minimal information checklist. What kind of things, then, don't they need? And would it be worth taking this information back to the MIAME people to possibly modify the guidelines if some aspects of it aren't truly minimal? End personal questions.)

Discovery is done via SysMO-SEEK. How do you catalogue the metadata, and then have mechanisms for accessing the data from locations other than the host site? There is a single search point over "yellow pages" and an assets catalogue. They store metadata on results, not the results themselves (again, just like SyMBA, which stores the metadata in a database and the results in a remote file store). They use myExperiment both for linking the people and for the assets. For models, they're using a local installation of JWS Online, which is a database of curated models and a model simulator. There are also some links to semantic SBML from the TRANSLUCENT project.

There are two kinds of processes to store. The first is experimental processes, e.g. SOPs and protocols. They use the Nature Protocols format, with the addition of high-level classification through tags. (Personal note: what is the underlying format for storing protocols?) The second type of process is bioinformatics processes, which are stored as workflows. (Question: why don't you store protocols as workflows? They can be chained in the same way.) Taverna is used for this work. One bit of work was using libSBML inside Taverna for collaborative model development (Peter Li et al.). Another automated (definition of automated in this context?) workflow goes from microarray to pathways and published abstracts. Their consortium wants to exchange information from public data sources, SysMO itself, and Excel spreadsheets.

(Another personal aside. FuGE (object model for experimental metadata) and ISA-TAB (tabular format, e.g. spreadsheets) are becoming interchangeable – work is going on between the FuGE and ISA-TAB people right now – the most recent workshop was last week. This is important, as it was mentioned that bioinformaticians have to deal with spreadsheets (which is true enough!). So, you get the best of both worlds with FuGE / ISA-TAB, without having to define yet another schema. A personal question would be: why build these various metadata schemas and parsers for spreadsheets (e.g. whatever is used for the Assets catalogue and JERM parsing of spreadsheets) rather than use pre-existing models such as FuGE and formats such as ISA-TAB? Using the FuGE object model does not mean that you have to use all aspects of it – you can just take what you need. Perhaps it was due to the maturity of ISA-TAB at the time the project started, though the specification is now in version 1.0. Will SysMO-DB export and import these formats? There was no time for questions at the end of the talk, so I will try to find out during the lunch period. End aside.)

Trying to map to the relevant MIBBI standard. There is a nice feature that reads spreadsheets from specific locations and automatically loads them into the Assets catalogue. (You can still load them directly into that catalogue.) They are performing a 4-site JERM exchange pilot scheme in Spring 2009.

Great talk – thanks 🙂

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



Adding informative metadata to bioinformatics services


Carole Goble and the other authors of “Data curation + process curation = data integration + science” have written a paper on the importance of curating not just the services used in bioinformatics, but also how they are used. Just as more and more biologists are becoming convinced of the importance of storing and annotating their data in a common format, so should bioinformaticians take a little of their own medicine and ensure that the services they produce and use are annotated properly. I personally feel that it is just as important to ensure that in silico work is properly curated as it is in the more traditional, wet-lab biological fields.

They mention a common feature of web services and workflows: namely, that they are generally badly documented. Just as the majority of programmers leave it until the last possible minute to comment their code (if they comment at all!), so also are many web services annotated very sparsely, and not necessarily in a way that is useful to either humans or computers. I remember that my first experience with C code was trying to take over a bunch of code written by a C genius, who had but one flaw: a complete lack of commenting. Yes, I learnt a lot about writing efficient C code from his files, but it took me many hours more than it would have done if there had been comments in there!

They touch briefly on how semantic web services (SWS) could help, e.g. using formats such as OWL-S and SAWSDL. I recently read an article in the Journal of Biomedical Informatics (Garcia-Sanchez et al. 2008, citation at the end of this post) that had a good introduction to both semantic web services and, to a lesser extent, multi-agent systems that could autonomously interact with such services. While the Goble et al. paper did not go into as much detail as the Garcia-Sanchez paper did on this point, it was nice to learn a little more about what is going on in the bioinformatics world with respect to SWS.

Their summary of the pitfalls to be aware of due to the lack of curated processes was good, as was their review of currently-existing catalogues and workflow and WS aggregators. The term “Web 2.0” was used, in my opinion correctly, but I was once again left with the feeling that I haven’t seen a good definition of what Web 2.0 is. I must hear it talked about every day, and haven’t come across any better definition than Tim O’Reilly’s. Does anyone reading this want to share their “favorite” definition? This isn’t a failing of this paper – more of my own lack of understanding. It’s a bit like trying to define “gene” (this is my favorite) or “systems biology” succinctly and in a way that pleases most people – it’s a very difficult undertaking! Another thing I would have liked to have seen in this paper, but which probably wasn’t suitable for the granularity level at which this paper was written, is a description and short analysis of the traffic and usage stats for myExperiment. Not a big deal – I’m just curious.

As with anything in standards development, even though there are proposed minimal information guidelines for web services out there (see MIAOWS), the main problem will always be lack of uptake and getting a critical mass (also important in community curation efforts, by the way). In my opinion, a more important consideration for this point is that getting a MIA* guideline to be followed does not guarantee any standard format. All it guarantees is a minimal amount of information to be provided.

They announce the BioCatalogue in the discussion section of this paper, which seems to be a welcome addition to the attempts to get people to annotate and curate their services in a standard way, and store them in a single location. It isn’t up and running yet, but is described in the paper as a web interface to more easily allow people to annotate their WSDL files, whereas previous efforts have mainly focused on the registry aspects. Further information can be associated with these files once they are uploaded to the website. However, I do have some questions about this service. What format is the further information (ontology terms, mappings) stored in? Are the ontology terms somehow put back into the WSDL file? How will information about the running of a WS or workflow be stored, if at all? Does it use a SWS format? I would like to see performances of Bioinformatics workflows stored publicly, just as performances of biological workflows (eg running a microarray experiment) can be. But I suppose many of these questions would be answered once BioCatalogue is in a state suitable for publishing on its own.

In keeping with this idea of storing the applications of in silico protocols and software in a standard format, I’d like to mention one syntax standard that might be of use in storing both descriptions of services and their implementation in specific in silico experiments: FuGE. While it does not currently have the structures required to implement everything mentioned in this paper (such as operational capability and usage/popularity scores) in a completely explicit way, many of the other metadata items that this paper suggests can already be stored within the FuGE object model (e.g. provenance, curation provenance, and functional capability). Further, FuGE is built as a model that can easily be extended. There is no reason why we cannot, for example, describe a variety of web service protocols and software within the FuGE structure. One downside of this method would be that information would be stored in the FuGE objects (e.g. a FuGE database or XML file) and not in the WSDL or Taverna workflow file. Further, there is no way to “execute” FuGE XML files, as there is with Taverna files or WSs. However, if your in silico experiment is stored in FuGE, you immediately have your computational data stored in a format that can also store all of the wet-lab information, protocols, and applications of the protocols. The integration of your analyses with your wet-lab metadata would be immediate.

In conclusion, this paper presents a summary of a vital area of bioinformatics research: how, in order to aid data integration, it is imperative that we annotate not just wet-lab data and how they were generated, but also our in silico data and how they were generated. Imagine storing your web services in BioCatalogue and then sharing your entire experimental workflows, data and metadata with other bioinformaticians quickly and easily (perhaps using FuGE to integrate in silico analyses with wet-lab metadata, producing a full experimental metadata file that stores all the work of an experiment from test tube to final analysis).

Goble, C., Stevens, R., Hull, D., Wolstencroft, K., & Lopez, R. (2008). Data curation + process curation = data integration + science. Briefings in Bioinformatics. PMID: 19060304

Garcia-Sanchez, F., Fernandez-Breis, J., Valencia-Garcia, R., Gomez, J., & Martinez-Bejar, R. (2008). Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources. Journal of Biomedical Informatics, 41(5), 848-859. DOI: 10.1016/j.jbi.2008.05.007


FuGE / ISA-TAB Workshop, Day 1

Today was the first day of the workshop – back at the good old EBI, though it isn't as recognizable as it used to be. Sure, there is the new EBI extension, but I am used to that now. However, they're renovating the inside of the old EBI building as well, reducing many of my friends to portakabin living over the winter months: better them than me!

Today definitely had an emphasis on the "work" part of "workshop". While a large part of the work on the XSLT for converting between FuGE and ISA-TAB is complete, some of the slightly stickier areas of the conversion are still being worked on. We spent today trying to iron out some of the difficulties that arise when converting the rich tree structure that you get from the XML implementation of FuGE (FuGE-ML) into the flatter tabular format of ISA-TAB. Below are some of the more general ideas that we were throwing around as a result. (Some are more directly related to the conversion process than others – but all raise interesting points to me.)

  • One of the column names in the ISA-TAB Assay file is currently named "Raw Data File" in the 1.0 Specification. This caused a large amount of discussion as to what "raw" meant, and that many people would have a different idea of what a raw data file was. It was originally named this way to act as a foil against another (optional) column name, "Derived Data File". However, derived data files have a more precise definition in ISA-TAB – such a column can only be used to name files resulting from data transformations or processing. In the end, we are considering a name change, from "Raw Data File" to "Data File".
  • In the end, there will be a few simple ways to format your FuGE-ML files in a way that will aid the conversion into ISA-TAB. It would be useful to eventually produce a set of guidelines to aid in interoperability.
  • Some of the developers already using FuGE (myself included) are using the <Description> element within a FuGE-ML file as a way to allow our biologists to give a free-text description to both materials and data files. There is no specific element in these objects to add such information, and therefore the generic Description element is the best location. This isn't exactly as per FuGE best-practices, where the default Description elements are really only meant for private comments within a local FuGE implementation, and can normally be ignored by external bioinformaticians making use of your FuGE-ML. Such material and data descriptions can be copied into the ISA-TAB file as free text within the Comment[] columns, where what sits within the "[]" is the material or data identifier. We'll have to see if this idea turns out to be useful.
  • The main challenge in collapsing FuGE-ML into ISA-TAB is ensuring that the multi-level protocol application structures (for more information, see the GenericProtocolApplication and GenericProtocol objects within the FuGE Object Model) are correctly converted. We spent the majority of today trying to figure out an elegant way of doing this. We'll work on it again tomorrow, and will hopefully have a new version of the XSLT with a first-bash solution tomorrow evening!



Pre-workshop post on the FuGE / ISA-TAB Workshop, 8-9 December

Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)

ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format acceptable by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.

Each day I'll provide a summary of what's gone on at the workshop, which centers around the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore in the FuGE-ML to ISA-TAB direction, the conversion may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaining whatever concrete format they wish.

Specific topics of the workshop include:

  • Advance and possibly finalize the XSLT rendering of FuGE documents into ISA-TAB. This includes the finishing-off of the generic FuGE XSL stylesheet.
  • Work on some of the extensions, including FCM, GelML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.

Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and for everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumvirate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!



Introduction and update on MGED Standards

Chris Stoeckert
Afternoon Session, 2 September (11th MGED Meeting, 1-4 September, 2008)

How do we tie together the various "silos" of communities and data? There is a real ecosystem of biomedical standards. Not just MGED, but also PSI, MSI, OBO, BIRN, etc. Each community generates its own list of acronyms etc 🙂

We need to bring community standards together into common, integrative standards. MGED is working on MIBBI, FuGE, ISA-TAB, OBI, and MINSEQE. But having standards is only the first step: we need tools to make use of these standards.

MINSEQE is meant to help prepare for datasets based on ultra-high-throughput sequencing (UHTS) related to research typically done with microarrays. There is crossover with communities primarily concerned with sequence data (e.g. GSC), and with existing formats such as SRF. Where should such data get deposited?

Examples of UHTS experiments: chromatin modifications from normal versus disease cells, or meta-genomic analysis of a microbial culture. UHTS requires standardization at multiple levels: from sequence reads to interpreting results.

These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂



FuGE Users’ Workshop: 13-14 December, 2007

The two-day FuGE Users’ workshop was organized by Norman Paton and held at the University of Manchester. It was great fun, and if you just want the short summary of my time there, then just know that there was loads of enthusiasm for FuGE as well as interesting talks, both by communities who were already extending FuGE, and by  developers who were already building tools and databases based on it. There were only a dozen or so people, which kept the discussions lively but neither too long nor too divergent. The workshop dinner was great, though the trip to the restaurant was correctly described by one of the attendees as an Odyssey. For more information on the social aspect of the FuGE workshop, please have a look at Phil Lord’s humorous posting on the matter. For another post on the workshop, see the peanutbutter Bioinformatics blog by Frank Gibson.

If you wish to read the longer notes rather than the short summary, then please read on!

Please note that these are my own notes, and are in no way considered to be an “official” FuGE report on the workshop. As such, any errors or inconsistencies are entirely my own. However, if you see a problem with this post, then please let me know, and I’ll fix it!

The objectives of the workshop were to share and document experiences in the use of FuGE, to identify good practices, to document guidelines, and to make these experiences and guidelines known. Hopefully, the result will be a paper that documents the current users’ experiences and increases communities’ understanding of FuGE. It will hopefully help people who have read the Nature Biotechnology paper and want to use FuGE, but aren’t completely sure what to do next.

Attendees were:

  • Peter Wilkinson, from Montreal, who was interested in FuGE for flow cytometry.
  • Khalid Belhajjame: works with Norman Paton in Manchester, and who may soon be a full-time developer of FuGE.
  • Javier Santoyo: University of Edinburgh, part of a consortium trying to develop standards for RNAi work.
  • Andy Jones: one of the original developers of FuGE, from Liverpool; developed GelML with Frank Gibson.
  • Heiko Rosenfelder: German Cancer Centre at Heidelberg, here as part of MIACA, and wants to use FuGE for the cellular assay format.
  • Martin Eisenacher: Proteome Centre (mzML and analysisXML) and wants to use FuGE.
  • Phil Lord, Frank Gibson: via CARMEN, want to use FuGE. Frank also developed GelML with Andy Jones.
  • Neil Wipat, Matt Pocock, Allyson Lister: we use FuGE in our internal application for storing HT data. Matt and Allyson are also involved in OBI.
  • Leandro Hermida: SIB; they’re part of a group that is making SystemsX. They want to use FuGE to store and manage the data, and also want to make an extension of FuGE for deep sequencing technologies.
  • Norman Paton: originally from the proteomics field, but a developer of FuGE and the organizer of the workshop.

Session 1: Experiences Using and Extending FuGE

GelML – Frank Gibson and Andy Jones

GelML is a FuGE extension that has passed the PSI approval process. PSI defines community standards for data representation in proteomics. There are a variety of working groups, including gel electrophoresis, mass spectrometry, protein modifications, etc. Within the Gel WG there are three specifications: MIAPE-GE (a minimum checklist for reporting gel electrophoresis experiments), sepCV (a controlled vocabulary), and GelML (a data transfer format, based on FuGE).

GelML covers the model of a gel, 1-D and 2-D GE, other (N-dimensional) GEs, sample loading, electrophoresis, detection, image acquisition, the excision of locations on gels, and SubstanceMixtureProtocol and SubstanceAction.

The first extended FuGE class described was the Material abstract class. The first of such classes is the Gel class. A Gel has Dimensions, MeasuredMaterial, and others. You use the “Measurement” package to describe the characteristics of the Gel. Measurements include PercentageAcrylamide, while information about the gel (i.e. if purchased, from where), information on the Dimensional Units, and the CrossLinkerRatio are all FuGE OntologyTerms. MeasuredMaterial was not originally in FuGE because it was planned that such substances could be captured by ontology terms. Rather than using named associations to GenericParameter, they tended to use either GenericParameter (with a CV term) or to extend the Parameter class. This was just a design decision, and he would like to see how others do it.

Another extended FuGE class is the Protocol abstract class. The GelML SampleLoadingProtocol has an AddBufferAction, which points to a SubstanceMixtureProtocol. 2DGelProtocol has a SampleLoadingAction, a FirstDimensionAction and a SecondDimensionAction (both Electrophoresis protocols), an InterDimensionAction (for when something happens between the first and second dimension actions), and a DetectionAction.

Within the Electrophoresis protocols there is the ElectrophoresisStop (an Action), which contains a StopTime, which is a TimeParameter with a Duration and a TimePoint. They’d be really interested to see how others have modelled, or would like to model, time. It was also a design decision to guide people with the structure of the XML to help them know what to fill out, e.g. you must have a 2DGelProtocol. For each case, should we extend the FuGE model or add experiment-specific semantics through the use of ontologies? I think this is a case of using both, depending on the circumstances.

They have used standard XML references within the documents. But, for instance, do we still need internal document identifiers when the ontologyURI is a globally-unique identifier anyway? Maybe they are required if the terms are created ad hoc within the group making the XML file. What is the best way to use ontologies?

AnalysisXML – Martin Eisenacher

http://www.fp6-prodac.eu

He is a member of the ProDaC Consortium. ProDaC is a funded consortium that is meant as a “coordination action” within the 6th EU Framework Programme. Its aims are the development of international standards, standardized data submission pipelines, systematic data collection in public standards-compliant repositories, and data access for the community and publication. There was a kick-off meeting of ProDaC in Long Beach in October 2006, and there have been two workshops since. Proteomics data includes spectra (peak lists) and peptide lists. He works specifically with the MS (for peak lists and instruments, mzML) and Proteomics Informatics (for “results”, analysisXML) PSI WGs.

mzML is a merger of mzData and mzXML. Perhaps this merger is one of the reasons that it is not currently FuGE-based. AnalysisXML includes annotation of search databases, search algorithms, search parameters, instrument characteristics, peptides (peptide-spectrum links, peptide scores), proteins (protein-peptide links, protein scores, significance values, false-discovery estimation) and quantitation. In September 2007 they added comments into the UML that are passed into the XML.

They use MagicDraw Community Edition, which is available for free. The Analysis package is subdivided into process, quantitation, and search.
Process contains things that aren’t directly related to the search protocol applications, but other steps such as ProteinDetermination and PolypeptideProcess. Some of the classes they have made that inherit from the Data class inside the search package include AnalysisResultSet (a set of spectra), AnalysisResult (a spectrum), and AnalysisResultItem (all peptides found for that spectrum). These are all abstract classes, whose concrete subclasses include PolypeptideResultSet, PolypeptideSearchResult, and PolypeptideResultItem.
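As a rough illustration of how that hierarchy might nest in an instance document (the element names here just mirror the class names above; they are assumptions, not the published analysisXML schema):

```xml
<!-- Illustrative only: one result set, one result per spectrum, one item per peptide match. -->
<PolypeptideResultSet identifier="resultset.1">
  <PolypeptideSearchResult identifier="result.scan42" spectrum_ref="spectrum.scan42">
    <PolypeptideResultItem identifier="item.scan42.1" peptide_ref="peptide.1" score="56.2"/>
    <PolypeptideResultItem identifier="item.scan42.2" peptide_ref="peptide.2" score="12.7"/>
  </PolypeptideSearchResult>
</PolypeptideResultSet>
```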

At the moment they are assembling their own CV (to include search parameters that are most commonly used in search engines like MASCOT), but they can also use the PRIDE CV. They use the ontology classes directly from FuGE, without extending them: FuGE fits what they need without modification.

In analysisXML, peptides and sequences are listed only once. Different types of analyses can go in one file or in separate files with external cross-references. Also, the AnalysisProtocol could be used as parameter input for search engines. However, there are many cross-references and unique identifiers that are not validated by the FuGE Schema. Further, there are external cross-references to mzML, which can be difficult if you have only local files and not public URI’s. Also, sequences (just the letters) are not polypeptides (“real” molecules with modifications); therefore, the ConceptualMolecule FuGE class is not appropriate for polypeptides, though it is suitable for sequences (and they are still able to use that class). Additionally, the ResultSet-ResultItem hierarchy does not fit all analysis types. Finally, many FuGE elements seem to have very long names that aren’t always useful (but you shouldn’t be typing XML manually!).

All items of the collections have unique identifiers. References to them are attributes called “…_ref”. Schema validation does not consider whether _ref links to an allowed section (or that used CV’s are allowed). In mzML, for example, “semantic validation” of CV’s is possible (suggested/implemented by the EBI). Are identifier checks possible? ProDaC has an online validator for mzML, analysisXML, mzData and prideXML that performs semantic validation, though the extra ontology/CV checks are only supported for mzML.
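A toy example of the pattern being described, and of what plain schema validation misses. The attribute names follow the “…_ref” convention above but are otherwise invented; the second reference below dangles, yet an XML Schema validator would accept it.

```xml
<!-- Illustrative only. -->
<Protocol identifier="protocol.digestion.1" name="Trypsin digestion"/>

<ProtocolApplication identifier="protapp.1" protocol_ref="protocol.digestion.1"/>  <!-- resolves -->
<ProtocolApplication identifier="protapp.2" protocol_ref="protocol.digestion.99"/> <!-- dangling -->
```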

Still to do is the finalization of analysisXML, which was a deliverable for last October! They also want to provide “Quality Determination” as a process, and to make some use-cases and instance documents; they will have some from Matrix Science and MPC. Also, they need to finalize the CV they are using.

SyMBA

– Allyson Lister

I gave this talk, so I didn’t write anything about it! Instead, have a look at the SourceForge website (http://symba.sf.net):

The Integrative Bioinformatics Group, headed by Neil Wipat and part of The Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN), has developed a data archive and integrator (SyMBA) based on Milestone 3 of the Functional Genomics Experiment (FuGE) Object Model (FuGE-OM), and which archives, stores, and retrieves raw high-throughput data. Until now, few published systems have successfully integrated multiple omics
data types and information about experiments in a single database. SyMBA is the first published implementation of FuGE that includes a database back-end, expert and standard interfaces, and a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide programmatic access to the database. Having a central data repository prevents deletion, loss, or accidental modification of primary data, while giving convenient access to the data for publication and analysis. It also provides a central location for storage of metadata for the high-throughput data sets,
and will facilitate subsequent data integration strategies.

Developing Flow Cytometry FuGE extensions – Peter Wilkinson

Developing MIFlowCyt. Originally, they stored the metadata and data in a single file, but their latest format (ACS) will separate these two types. They are considering having some of their data formats be in RDF as well as XML, even for those formats that will be built on FuGE – is there a good XML to
RDF converter? I suppose so, as I’ve been able to save OWL/RDF as OWL/XML in Protege 4.

One example of their extension is Cytometer, which is a subclass of Equipment. How descriptive should they get with their samples? Should it be at the entity or attribute level? For instance, there is a conceptual difference between prepared samples and “generic” materials. But why not draw an association to Material and call it “sample”? They can’t do that because sample has a lot of associations itself that aren’t present in Material. For things like buffers and solutions, spML doesn’t seem to view them as things that exist – you just talk about them in the protocol. This way you don’t have to list them thousands of times. In flow cytometry, you have to know exactly which thing is used in the protocol (e.g. they must record batch numbers). However, you could have a single buffer instance, and then in the ProtocolApplication you have a specific parameter that is modified in that particular application of the Protocol, such as the batch number.
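A minimal sketch of that suggestion, with invented element names in the FuGE-ML style: the buffer and the batch-number parameter are declared once in the Protocol, and the value that changes from run to run is recorded in the ProtocolApplication.

```xml
<!-- Illustrative only: not from any published flow cytometry extension. -->
<Protocol identifier="protocol.staining.1" name="Staining">
  <GenericMaterialMeasurement material_ref="material.buffer.pbs"/>
  <GenericParameter identifier="param.buffer.batch" name="buffer batch number"/>
</Protocol>

<ProtocolApplication identifier="protapp.staining.20080612" protocol_ref="protocol.staining.1">
  <!-- the batch actually used in this particular application of the protocol -->
  <ParameterValue parameter_ref="param.buffer.batch" value="LOT-0452"/>
</ProtocolApplication>
```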

Open issues include: FuGE should reference a stable version of AndroMDA; there should be a best practice for deciding when a Generic* class is replaced by a specific omics-type class; how the OntologyTerm abstract class is intended to be used for specific controlled lists; fitting the organism into FuGE::Bio somewhere; and versioning. He’s also trying to write a FuGE database by hand, rather than using what is generated by AndroMDA, as he needs to squeeze as much performance out of the system as he can. This is much more difficult, but could conceivably be much more efficient.

Generic and Custom Extensions – Andy Jones

spML is for sample processing. SubstanceMixtureProtocol is for describing a mixture of substances, e.g. buffers and solutions, and the method of their creation. Actions relate to constituents; timings relate to constituents and volume, concentration, or mass. SetPropertyAction is a generic model to be used in conjunction with protocols where parameters may be set with associated ActionText. Their chromatography extension comprises extensions of Protocol, Equipment, and ProtocolApplication. The ChromatographyProtocol contains extensions of Parameter, has a child protocol for sample injection, and makes various uses of GenericActions. ChromatographyEquipment has column-associated sub-components. All extensions of Chromatography equipment can have additional parameters, including specific named parameters where they are always required; it uses Equipment:make. The mobile phase of LCProtocol is described using the SubstanceMixtureProtocol. Inputs are defined with GenericMaterialMeasurement, and the outputs are a Chromatogram (ExternalData) and SeparationFraction. You can also have two-dimensional chromatography.
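To give a flavour of the SubstanceMixtureProtocol idea mentioned at the start of this paragraph, here is a hedged sketch of a buffer recipe; the element names are my own illustrative inventions rather than the spML schema.

```xml
<!-- Illustrative only: a mixture protocol whose actions relate to constituents. -->
<SubstanceMixtureProtocol identifier="protocol.buffer.lysis" name="Lysis buffer preparation">
  <GenericAction actionText="Dissolve urea in water">
    <GenericMaterialMeasurement material_ref="material.urea" amount="8" unit_ref="unit.molar"/>
  </GenericAction>
  <GenericAction actionText="Add CHAPS and mix for ten minutes">
    <GenericMaterialMeasurement material_ref="material.chaps" amount="4" unit_ref="unit.percent.w.v"/>
  </GenericAction>
</SubstanceMixtureProtocol>
```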

GenericSeparation is a protocol that uses generic models for defining the substance used to create a separation gradient and the parameters applied. In this case, the equipment defines the type of separation and criteria using ontology terms – but how do you communicate how this should be used to all of the developers? On the other hand, we don’t want to have huge models. Inputs are also defined using GenericMaterialMeasurement, and the outputs are a SeparationLog (ExternalData) and SeparationFraction.

TreatmentProtocol is a simple model for treatments, intended for labelling, mixing, splitting, and washing, for example. The treatment IO in TreatmentProtocolApplication is restricted to having material inputs and material outputs only. There seem to be three sorts of models: column-oriented, category-oriented, and completely generic protocols. Much of what is in spML might be useful for the “library of models” we’ve been discussing.

The generic model is very flexible for different types of separation, and could be used for LC, GC, capillary electrophoresis, rotofors, etc. It is also unlikely to break if a new type of experiment is defined, and the Treatments model could potentially be useful in the context of any experiment type. Also, the generic model is much smaller, and can be used in various ways. However, this last point could be a “con” as well, because different users/implementers are likely to encode the same information in different ways. Further, a specific model can guide the user to provide specific details, e.g. for MIAPE compliance.

spML units are derived from the OBO Unit Ontology. Should FuGE extensions be allowed to have user-defined terms? It would be useful for the creation of in-house lists to populate drop-down menus.

Below is a list of questions and suggestions that we came up with while the initial talks progressed in the first couple of sessions. Many were discussed, and some were answered, in breakout sessions later. Notes from the discussions I was a part of are included below. The unanswered points in the list may have been discussed at other breakout sessions, or may still be untouched.

Discussion on Semantic Validation and Identifiers: Khalid Belhajjame, Norman Paton, Allyson Lister, Martin Eisenacher

Identifiers and Auxiliary/Semantic Validation: Types of Validation and How Simple Support Can Be Done.

(Each cell shows whether the property should hold for that type of element and, after the slash, whether the property is checked by XML tooling.)

| | Unique in Document | Not Dangling | Globally Unique | Type Correct | Notes |
|---|---|---|---|---|---|
| Instances of Identifiable | yes / GP | yes / GP | (yes?) (+) / no | yes / no | See (1) |
| Ontology Terms (#) | n/a | yes / no | yes / no | yes (not in UML) / no | (^) |
| External Data ($) | n/a | yes / no | yes / no | yes / (*) | (*) want to know it’s a file of a particular kind |
GP: Can be checked with a generic program.

All things marked GP or X could be attacked by people wanting to write a semantic validation tool.

(+) Only for some types of Globally-unique identifiers would we be able to check that they were truly unique and well-formed.

(#): Should OntologyTerm elements be unique (irrespective of their identifiers, which must be unique)? If people compare OT identifiers, they may think two terms are different when in fact they are the same and someone was just sloppy when making the OT elements. However, if an OT is linked to an OntologySource, then it can be checked whether it is both unique in the document and globally unique (if it is a logical/physical URI). In that case, why should OS be optional at all, if custom CV’s can be included in the OS?

(^): This is where the ontology mapping files come in.

($): The same argument for uniqueness of ED applies as that in OT.

(1) Will we suggest a type of identifier to use with FuGE as a best-practice?

Do we still need internal document identifiers when the ontologyURI is a globally-unique identifier anyway?

Should identifiers be human readable?

Do community extensions automatically have their own namespace/prefix? That is, if “sample” is used in the FC community and also in another extension, will it be problematic if you try to create a multi-omics FuGE-ML file? This is all about linkage between different FuGE-based instances (unique identifiers, both within a single document and between documents). What is the identifier an identifier of? Is every Identifiable object a “first-class citizen”? We shouldn’t force all (any!) identifiers to be URI’s.

Should you use a logical or physical naming scheme?

Physical naming schemes:

  • Are fragile
  • May not work for all users (e.g. if the URI points to a laptop that isn’t publicly accessible)

Logical naming schemes:

  • Are robust, but require a greater investment of time, as they need tools that provide resolving facilities.
  • If locally-unique identifiers are used:
    • it means that you may get into trouble in the long run
  • If globally-unique identifiers are used:
    • clashes between different FuGE-ML files will be avoided
  • People should look this over and discover which is the best setup for their situation.

If we use URI’s, should URI’s be resolvable? What is the scope?

  • Martin has a URI that points to a data file, and a (possibly locally unique) identifier that points to a spectrum within the data file. How do we deal with this? Do we have a best-practice for it? (A rough sketch of the naming styles involved follows below.)
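For concreteness, a hedged sketch of the styles under discussion: the LSID is a made-up example of a logical identifier, the URL a physical one, and the “scan=” pointer into the file is purely invented, not a standard convention.

```xml
<!-- Illustrative only. -->

<!-- Logical naming: resolvable only via an LSID resolution service -->
<ExternalData identifier="urn:lsid:example.org:proteomics:run42"/>

<!-- Physical naming: fragile if the host is, say, someone's laptop -->
<ExternalData identifier="http://my-laptop.example.org/data/run42.mzML"/>

<!-- Martin's case: a URI for the whole file plus a locally-unique pointer into it -->
<SpectrumReference externalData_ref="http://my-laptop.example.org/data/run42.mzML" localRef="scan=1503"/>
```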

Schema Validation

  • Schema validation does not consider whether _ref links to an allowed section (or that the CV’s used are allowed). Native XML validation does not do this, but you could make a tool; a minimal sketch of such a check follows this list. In theory, the prefix before the _ref is always the name of the class. FuGE needs semantic validation.
  • How should user-defined ontology terms be validated in the XML?
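As a sketch of the sort of tool meant in the first bullet above (written for these notes, not part of any FuGE STK), the following XSLT 1.0 stylesheet reports every attribute whose name ends in “_ref” but whose value matches no “identifier” attribute in the same document. It assumes the FuGE-ML convention that identifiers live in attributes named “identifier”, and it only handles references within a single file.

```xml
<?xml version="1.0"?>
<!-- check-refs.xsl: report dangling *_ref attributes within one document.
     Assumes identifiers are stored in attributes called 'identifier'. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- every attribute whose name ends in '_ref' -->
    <xsl:for-each select="//@*[substring(name(), string-length(name()) - 3) = '_ref']">
      <!-- flag it if no identifier attribute anywhere in the document has the same value -->
      <xsl:if test="not(//@identifier[. = current()])">
        <xsl:text>Dangling reference: </xsl:text>
        <xsl:value-of select="name(..)"/>
        <xsl:text>/@</xsl:text>
        <xsl:value-of select="name()"/>
        <xsl:text> = </xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

It could be run with any XSLT 1.0 processor, e.g. xsltproc check-refs.xsl instance.xml; checking that a _ref points at an element of the allowed class would additionally need the class-name-prefix convention mentioned above, or schema awareness.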

Discussion on Versioning: Khalid Belhajjame, Norman Paton, Allyson Lister, Phil Lord, Matt Pocock, Leandro Hermida

  • Is there a best way to implement versioning?

Characteristics of (SyMBA) Versioning:

  1. Complete History of Atomic Changes
  2. Low Cost of Updates – No Cascades
  3. Higher Cost of Retrieval

This is actually a transaction-time database with tuple-level timestamps. In a transaction-time database, the time recorded is the time in the world of the database, not the time in the real world (as opposed to a valid-time database, where the time recorded is the time the fact holds in the real world, which need not match the time it was inserted). If you don’t put the timestamp in the tuple, you put it in the attribute. In this context, people have looked at the properties of update operations.

Can’t just use LSID versioning because there is no specification of how the version should be updated.

SyMBA Versioning Requirements:

  1. Preserving the semantics of the LSID
  2. Getting exactly the version requested, and getting all versions
  3. Nothing should disappear

This isn’t necessarily versioning, or what versioning in FuGE should be.

Leandro’s Versioning Requirements:

  1. Getting exactly the version requested, and getting all versions
  2. Nothing should disappear

Should this be done in FuGE, or in the FuGE-OM specifically? Perhaps just in the Maven build? We could put hooks in FuGE that would allow fine-grained logging. The current Audit setup does not allow linking back to previous versions unless you put the delta in free text somehow. The Audit classes may
be suitable for XML, and you could make a log of such changes and roll-back (in a non-RDBMS way) to whatever version you want.

While it is clear we could make an STK that could have versioning of some type, whether or not this should be an (optionally-used) change to the OM is a much bigger thing. It is certainly a worry that versioning has to be dealt with at the application level. However, versioning at the file or XML level means multiple files; otherwise you’d have to apply a diff to a very large file.

We haven’t really had the time to scan the space of options here. We could circulate a general document, and then outline what’s actually been done so far. A paper would, in any case, be centered around pros/cons, and a bit less on current implementations, but definitely not say which is the “right” way to do things, as there is no single right way.

There are different technical solutions, and not all of these solutions should necessarily be provided in the model.

Discussion on Tools – Leandro Hermida, Heiko Rosenfelder, Neil Wipat, Phil Lord, Allyson Lister

  • What about trying to get some automatic mapping between the XML classes and the Hibernate/Spring classes?
  • There is a disconnect between the XSD that is generated from the XML schema cartridge and the code generated from your persistence cartridge.
  • This means you have to write your mapping manually.
  • There is a possibility that we could get hyperjaxb3 to work for this (Allyson had tried with an earlier version, but it didn’t work properly). Hyperjaxb3 generates both Entity POJO’s and the jaxb2 classes. So, in theory, you could use only the AndroMDA XSD cartridge and hyperjaxb3 for the rest. However, you then lose all the information that is present in the UML but not in the XSD.
  • Hyperjaxb3 uses both hibernate and ejb3 natively (you can choose). Leandro wants to work on a merged persistence/hyperjaxb3 extended cartridge, or perhaps its own cartridge. So perhaps the generation of a hyperjaxb3 cartridge is possible in future.
  • Is there an XSLT that could be made to provide a “standard” way of viewing a FuGE experiment?
    • Khalid mentioned that it is important to allow input from the programmer in such a tool, so that they can present as little or as much of the FuGE structure to the user as they wish.
    • Leandro is working on an ejb3 cartridge from the AndroMDA plugins project (not part of the AndroMDA core yet), and has used FuGE as a test-case. What this cartridge does is generate a mapping file; load it into any application server running Hibernate and it will generate your database. The Hibernate+Spring cartridge generates 1) Entity POJO’s + mapping files and 2) Spring DAO + DAOException + DAOImpl, whereas with the ejb3 cartridge you get 1) ejb3-annotated Entity POJO’s + DAO*. You can use Spring, if you wish, to build your web framework. Leandro decided instead to use Seam, which is the business layer of a web framework that builds on top of ejb3. Seam then uses JSF (Facelets) and JBoss RichFaces for the actual web UI. To get the Seam classes, you model <> classes and then draw dependencies, which then auto-generate Seam-enabled ejb service beans. However, the Facelets and RichFaces xhtml files are written manually, though AndroMDA creates the entire web/ structure and base Seam classes for you. This doesn’t answer our simple UI question.
    • The ejb3 cartridge has a web service (jax-ws, via soap) to your DAO’s and Entity POJO’s.
    • With MAGE, someone wrote a regular Java Swing program where you download the jar, which opens a little tabbed client that views MAGE. We could do something similar (a J2SE app to write/read FuGE-ML with a nice wizard interface).
    • The GSC has a lightweight XSD-to-web-form software app.
    • An XSLT (a style-sheet language that is richer than CSS) could convert the XML to “HTML”, but it is a tough language to use. XSLT’s don’t have first-class functions, so you can’t do anything generic.
    • Also would like to have simple jar that has input XML, output HTML. This means three tool types here: 1) heavyweight (already existing in SyMBA and SystemsX) 2) midweight (J2SE app to read/write with a wizard-like interface) 3) lightweight (input XML, output HTML with some simple options).
  • Tool support for FuGE STK version 1, including a validator
    • The MAGE STK includes a validator.
    • XML validation can be done with JAXB2 as is with the Milestone 3 STK, but longer-term we need the semantic validation tool.
    • Perhaps have some ontology lookup helper classes (OLS from the EBI?) to help users and developers add terms from (a certain set of?) ontologies. This may help people populate their databases, choose a term from a list on a front-end tool, etc.
  • Tool support for database schema / AndroMDA / Alternatives.
    • Dealt with in the other sections

Discussion on Challenging Constructs, including Investigation Package, Abstract Associations, and the Ontology Package – the entire group

  • What is the real meaning of the Investigation package? It’s one of the few parts of FuGE that isn’t meant to be extended.
  • How is the OntologyTerm abstract class intended to be used for specific controlled lists? One example is taxonomies as opposed to ontologies.

The intention is that this package would not be changed or extended by communities. Each technology would be reported in the InvestigationComponent. The Factor class actually is meant as a summary report of the factors used in the experiment. There is currently no direct link between the Factors and the protocol workflows – the detail can be recorded in other places in FuGE. It’s a summary and duplication of the factor information.

So, if you want to say that your Investigation compared two different values for a single factor, the Investigation has the factor type, but not the data for the factor or the values themselves. However, you can connect to the data made from the various omics technologies via the DataPartition class. There could be a problem where it is a set of factors that only together make a particular set of data useful, for example if your important aggregate of factors is time1.mouse1.foodstuff1. Each of these factors would have to be named separately, and you would get a different slice of data for time1 than you would for mouse1. How do you join them up? Perhaps allow multiple FactorValues (and OntologyTerms) for a single Factor. Not a very nice solution, though. Perhaps you don’t need to change it at all, as you would only add Factors that are relevant to a particular InvestigationComponent.
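As a rough illustration of the “multiple FactorValues for a single Factor” idea (element names invented, not from the FuGE schema):

```xml
<!-- Illustrative only: one Factor carrying the whole aggregate as several values. -->
<Factor identifier="factor.aggregate.1" name="time x mouse x foodstuff">
  <FactorValue value_ref="ot.time1"/>
  <FactorValue value_ref="ot.mouse1"/>
  <FactorValue value_ref="ot.foodstuff1"/>
</Factor>
```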

How do you describe which combinations of Factors are the combinations you’re interested in?
Norman did this by seeing an IC as a particular run of an experiment.

Dimensions are used in FuGE as a way of naming coordinates in a matrix. This does not mean that the data has to be stored here. You can store the
data internally via the
InternalData class, or you can reference it externally via the ExternalData class (or, of course, create your own subclass of Data).

There are 21 abstract associations in FuGE, and all but 6 have identical concrete associations. Some auto-generated AndroMDA code mistakenly ignores the “abstract” parts and incorrectly generates the methods etc. In this case, you can just delete the abstract association in your copy of the UML and re-generate the code. It should be fixed within AndroMDA, though.

For multi-dimensional data, DataPartitions are meant as a mechanism to relate back to the data from the Investigation, but many groups will choose not to use DataPartition. Very big, regularly-shaped data sets will be good candidates for DataPartition (e.g. Flow Cytometry); in the case of proteomics data, this may be more of a challenge. A best-practices document should contain information on which data types are best-suited to this system, and which aren’t. It should also include any alternatives to this system. One alternative to using DataPartition and its associated coordination system for dimensional data is to build an association from the data of interest back to FactorValue.

What is PartitionPair? It is for the case where you have two data files and you wish to associate, for example, a particular row of one data file with a particular spectrum in another. So, it is a “shortcut” for linking particular data sets.

How should users of FuGE build CV lists using the Ontology Package? An OntologyTerm has an OntologyProperty, which contains both a DataProperty and an ObjectProperty (these are the relationships within an ontology). Also inside OntologyTerm is OntologyIndividual; the OI is the individual itself. Why not just provide the term – why try to recreate the structure of an ontology in UML? However, in OWL every single class, relationship, etc. has a URI, so why not use those in UML? An example use: you have in an ontology the concept of age, which has an initial time point and a unit. How do you pull that concept into the UML? We’re essentially creating a cut-down version of the ontology to allow extensibility in FuGE. But why would you want this? To create an individual of an ontology within the UML. It also allows restriction of the name-values (the left- and right-hand sides of a relationship) to those that are allowed within the ontology. One opinion is that there shouldn’t be a purpose-built extensibility point in UML, as the entire purpose of UML is that it is extensible everywhere. It also means that users of your FuGE file don’t need to parse both that file and the ontology file. However, the users of your file must understand the extensibility point that you’ve made, which isn’t useful. The extra knowledge should be stored in the ontology, in the same way as analysisXML links to mzML. One solution is to have a Property class with the term “height”, a Value class with the term “meters”, and a PV class with associations to both Property and Value that provides the link. In the end, this is optional. In the guidelines, these concerns should be expressed.
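A rough sketch of that last Property/Value/PV suggestion, with invented element names (this is not the actual FuGE Ontology package syntax):

```xml
<!-- Illustrative only: encoding 'height is measured in meters' as linked terms. -->
<OntologyTerm identifier="ot.property.height" term="height" ontologySource_ref="os.pato"/>
<OntologyTerm identifier="ot.value.meters" term="meters" ontologySource_ref="os.unit"/>

<!-- The PV element provides the link between the property and its value -->
<PropertyValue identifier="pv.height.meters" property_ref="ot.property.height" value_ref="ot.value.meters"/>
```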

Other questions not fully addressed:

  • How do we find out when Generic* classes should be replaced by a specific omics-type class? Rather than using named associations to
    GenericParameter, GelML tended to use either GenericParameter (with a CV term) or extended the Parameter class. Is there a best way to
    use Parameter/GenericParameter?

    If it is the same shape as the Generic class, and you are just renaming it, that is a good argument for using an ontology term. However, there is less of a learning curve for users if you subclass GenericParameter with your own name. Subclassing can lock you in, and may make life more difficult further down the line if your requirements change. Remember though, hardly anyone will write XML by hand, and we shouldn’t worry too much about tool implementation. Still want to make it easier for tool developers, though!
  • How should we model time?
  • For experiment-specific semantics, when should we extend the FuGE model rather than add information through the use of ontologies?
  • How descriptive should extending communities get with their samples? Should it be at the entity or attribute level? Is there a
    best-practice that should be documented?
  • How do we find out if two classes from two different communities are actually the same? There are recurring model requirements, e.g. a library of model fragments for things like time and sample.
  • Could organism be fitted into FuGE::Bio somewhere?
  • There is no date of the Action in the ActionApplication.
    You could have a time parameter that comes in when you add it to your own subclass of Action/ActionApplication, and then provide a
    different value for that parameter in ActionApplication.
  • Somewhere, the distinction between Action and Protocol should be defined.
  • In general, we should describe a modelling best-practice to tell what is considered “standard” procedure.
  • Data package: internal versus external data
  • There may be an issue with describing physical materials within Protocols versus ProtocolApplications (theoretical materials vs physical materials; SubstanceMixtureProtocol was designed to account for this problem)
