CISBAN Semantics and Ontologies Software and Tools

SBML in OWL: some thoughts on Model Format OWL (MFO)

What is SBML in OWL?

I’ve created a set of OWL axioms that represent the different parts of the Systems Biology Markup Language (SBML) Level 2 XSD combined with information from the SBML Level 2 Version 4 specification document and from the Systems Biology Ontology (SBO). This OWL file is called Model Format OWL (MFO) (follow that link to find out more information about downloading and manipulating the various files associated with the MFO project). The version I’ve just released is Version 2, as it is much improved on the original version first published at the end of 2007. Broadly, SBML elements have become OWL classes, and SBML attributes have become OWL properties (either datatype or object properties, as appropriate). Then, when actual SBML models are loaded, their data is stored as individuals/instances in an OWL file that can be imported into MFO itself.

A partial overview of the classes (and number of individuals) in MFO.
A partial overview of the classes (and number of individuals) in MFO.

In the past week, I’ve loaded all curated BioModels from the June release into MFO: that’s over 84,000 individuals!1 It takes a few minutes, but it is possible to view all of those files in Protege 3.4 or higher. However, I’m still trying to work out the fastest way to reason over all those individuals at once. Pellet 2.0.0 rc7 performs the slowest over MFO, and FaCT++ the fastest. I’ve got a few more reasoners to try out, too. Details of reasoning times can be found in the MFO subverison project.

Why SBML in OWL?

Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.
Jupiter and its biggest moons (not shown to scale). Public Domain, NASA.

For my PhD, I’ve been working on a semantic data integration. Imagine a planet and its satellites: the planet is your specific domain of biological interest, and the satellites are the data sources you want to pull information from. Then, replace the planet with a core ontology that richly describes your domain of biology in a semantically-meaningful way. Finally, replace each of those satellite data sources with OWL representations, or syntactic ontologies of the format in which your data sources are available. By layering your ontologies like this, you can separate out the process of syntactic integration (the conversion of satellite data into a single format) from the semantic integration, which is the exciting part. Then you can reason over, query, and browse that core ontology without needing to think about the format all that data was once stored in. It’s all presented in a nice, logical package for you to explore. It’s actually very fun. And slowly, very slowly, it’s all coming together.

Really, why SBML in OWL?

As one of my data sources, I’m using BioModels. This is a database of simulatable, biological models whose primary format is SBML. I’m especially interested in BioModels, as the ultimate point of this research is to aid the modellers where I work in annotating and creating new models. In BioModels, the “native” format for the models is SBML, though other formats are available. Because of the importance of SBML in my work, MFO is one of the most important of my syntactic “satellite” ontologies for rule-based mediation.

How a single reaction looks in MFO when viewed with Protege 3.4.
How a single reaction looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.
How a single species looks in MFO when viewed with Protege 3.4.

Is this all MFO is good for?

No, you don’t need to be interested in data integration to get a kick out of SBML in OWL: just download the MFO software package, pick your favorite BioModels curated model from the src/main/resources/owl/curated-sbml/singletons directory, and have a play with the file in Protege or some other OWL editor. All the details to get you started are available from the MFO website. I’d love to hear what you think about it, and if you have any questions or comments.

MFO is an alternative format for viewing (though not yet simulating) SBML models. It provides logical connections between the various parts of a model. It’s purpose is to be a direct translation of SBML, SBO, and the SBML Specification document in OWL format. Using an editor such as Protege, you can manipulate and create models, and then using the MFO code you can export the completed model back to SBML (while the import feature is complete, the export feature is not yet finished, but will be shortly).

For even more uses of MFO, see the next section.

Why not BioPAX?

All BioModels are available in it, and it’s OWL!

BioPAX Level 3, which isn’t broadly used yet, has a large number of quite interesting features. However, I’m not forgetting about BioPAX: it plays a large role in rule-based mediation for model annotation (more on that in another post, perhaps). It is a generic description of biological pathways and can handle many different types of interactions and pathway types. It’s already in OWL. BioModels exports its models in BioPAX as well as SBML. So, why don’t I just use the BioPAX export? There are a few reasons:

  1. Most importantly, MFO is more than just SBML, and the BioPAX export isn’t. As far as I can tell, the BioModels BioPAX export is a direct conversion from the SBML format. This means it should capture all of the information in an SBML model. But MFO does more than that – it stores logical restrictions and axioms that are only otherwise stored in either SBO itself or, more importantly, the purely human-readable content from the SBML specification document2. Therefore MFO is more than SBML, it is a bunch of extra constraints that aren’t present in the BioPAX version of SBML, and therefore, I need MFO as well as BioPAX.
  2. I’m making all this for modellers, especially those who are still building their models. None of the modellers at CISBAN, where I work, natively use BioPAX. The simulators accept SBML. They develop and test their models in SBML. Therefore I need to be able to fully parse and manipulate SBML models to be able to automatically or semi-automatically add new information to those models.
  3. Export of data from my rule-based mediation project needs to be done in SBML. The end result of my PhD work is a procedure that can create or add annotation to models. Therefore I need to export the newly-integrated data back to SBML. I can use MFO for this, but not BioPAX.
  4. For people familiar with SBML, MFO is a much more accessible view of models than BioPAX. If you wish to start understanding OWL and its benefits, using MFO (if you’re already familiar with SBML) is much easier to get your head around.

What about CellML?

You call MFO “Model” Format OWL, yet it only covers SBML.

Yes, there are other model formats out there. However, as you now know, I have special plans for BioPAX. But there’s also CellML. When I started work on MFO more than a year ago, I did have plans to make a CellML equivalent. However, Sarala Wimalaratne has since done some really nice work on that front. I am currently integrating her work on the CellML Ontology Framework. She’s got a CellML/OWL file that does for CellML what MFO does for SBML. This should allow me to access CellML models in the same way as I can access SBML models, pushing data from both sources into my “planet”-level core ontology.

It’s good times in my small “planet” of semantic data integration for model annotation. I’ll keep you all updated.


1. Thanks to Michael Hucka for adding the announcement of MFO 2 to the front page of the SBML website!.
2. Of course, not all restrictions and rules present in the SBML specification are present in MFO yet. Some are, though. I’m working on it!

Data Integration Research Blogging

One way for RDF to help a bioinformatician build a database: S3DB

This post is part of the PLoS One syncroblogging day, as part of the PLoS ONE @ Two birthday celebrations. Happy Synchroblogging! Here’s a link to the paper on the PLoS One website.

Biological data: vitally important, determinedly unruly. This challenge facing the life-science community has been present for decades, as witnessed by the often exponential growth of biological databases (see the classic curve in the current graphs of UniProt1 and EMBL if you don’t believe me). It’s important to me, as a bioinformatics researcher whose main focus is semantic data integration, but it should be important to everyone. Without manageable data that can be easily integrated, all of our work suffers. Nature thinks it’s important: it recently devoted an entire issue to Big Data. Similarly, the Journal of Biomedical Informatics just had a Semantic Mashup special issue. Deus et al. (the paper I’m blogging about, published in PLoS One this summer) agree, beginning with “Data, data everywhere”, nicely encapsulating both the joy and the challenge in one sentence.

This paper describes work on a distributed management system that can link disparate data sources using methodologies commonly associated with the semantic web (or is that Web 3.0?). I’m a little concerned (not at the paper, just in general) at the fact that we seem to already have a 3.0 version of the web, especially as I have yet to figure out a useful definition for semantic web vs Web 2.0 vs Web 3.0. Definitions of Web 3.0 seems to vary wildly: is it the semantic web? Is it the -rwx- to Web 1.0’s -r– and Web 2.0’s -rw– (as described here, and cited below)? Are these two definitions one and the same? Perhaps these are discussions for another day… Ultimately, however, I have to agree with the authors that “Web 3.0” is an unimaginative designation2.

So, how can the semantic web help manage our data? That would be a post in itself, and is the focus of many PhD projects (including mine). Perhaps a better question is how does the management model proposed by Deus et al. use the semantic web, and is it a useful example of integrative bioinformatics?

Their introduction focuses on two types of integration: data integration as an aid to holistic approaches such as mathematical modelling, and software integration which could provide tighter interoperability between data and services. They espouse (and I agree) the semantic web as a technology which will allow the semantically-meaningful specification of desired properties of data in a search, rather than retrieving data in a fixed way from fixed locations. They want to extend semantic data integration from the world of bioinformatics into clinical applications. Indeed, they want to move past “clandestine and inefficient flurry of datasets exchanged as spreadsheets through email”, a laudable goal.

Their focus is on a common data management and analysis infrastructure that does not place any restrictions on the data stored. This also means multiple instances of light-weight applications are part of the model, rather than a single central application. The storage format is of a more general, flexible nature. Their way of getting the data into a common format, they say, is to break down the “interoperable elements” of the data structures into RDF triples (subject-predicate-object statements). At its most basic, their data structure has two types of triples: Rules and Statements. Rules are phrases like “sky has_color”, while statements add a value to the phrase, e.g. “today’s_sky has_color blue”.

They make the interesting point that the reclassification of data from flat files to XML to RDF to Description Logics starts to dilute “the distinction between data management and data analysis”. While it is true that if you are able to store your data in formats such as OWL-DL3, the format is much more amenable to direct computational reasoning and inference, perhaps a more precise statement would be that the distinction between performance of data management tasks and data analysis tasks will blur with richer semantic descriptions of both the data and their applications. As they say later in the paper, once the data and the applications are described in a way that is meaningful for computation, new data being deposited online could automatically trigger a series of appropriate analysis steps without any human input.

A large focus of the paper was on identity, both of the people using it (and therefore addressing the user requirement of a strong permissions system) and of the entities in the model and database (each identified with some type of URI). This theme is core to ensuring that only those with the correct permissions may access possibly-sensitive data, and that each item of information can be unambiguously defined. I like that the sharing of “permissions between data elements in distinct S3DB deployments happens through the sharing the membership in external Collections and Rules…not through extending the permission inheritance beyond the local deployment”. It seems a useful and straightforward method of passing permissions.

I enjoyed the introduction, background, and conclusions. Their description of the Semantic Web and how it could be employed in the life sciences is well-written and useful for newcomers to this area of research. Their description of the management model as composed of subject-predicate-object RDF triples plus membership and access layers was interesting. Their website was clear and clean, and they had a demo that worked even when I was on the train4. It’s also rather charming that “S3DB” stands for Simple Sloppy Semantic Database – they have to get points for that one5! However, the description of their S3DB prototype was not extensive, and as a result I have some basic questions, which can be summarized as follows:

  • How do they determine what the interoperable elements of different data structures are? Manually? Computationally? Is this methodology generic, or does it have to be done with each new data type?
  • The determination of the maturity of a data format is not described, other than that it should be a “stable representation which remains useful to specialized tools”. For instance, the mzXML format is considered mature enough to use as the object of an RDF triple. What quality control is there in such cases: in theory, someone could make a bad mzXML file. Or is it not the format which is considered mature, but instead specific data sets that are known to be high quality?
  • I would have like to have seen more detail in their practical example. Their user testing was performed together with the Lung Cancer SPORE user community. How long did the trial last? Was there some qualitative measurement of how happy they were with it (e.g. a questionnaire)? The only requirement gathered seems to have been that of high-quality access control.
  • Putting information into RDF statements and rules in an unregulated way will not guarantee a data sets that can be integrated with other S3DB implementations, even if they are of the same experiment type. This problem is exemplified by a quote from the paper (p. 8): “The distinct domains are therefore integrated in an interoperable framework in spite of the fact that they are maintained, and regularly edited, by different communities of researchers.” The framework might be identical, but that doesn’t ensure that people will use the same terms and share the same rules and statements. Different communities could build different statements and rules, and use different terms to describe the same concept. Distributed implementations of S3DB databases, where each group can build their own data descriptions, do not lend themselves well to later integration unless they start by sharing the same ontology/terms and core rules. And, as the authors encourage the “incubation of experimental ontologies” within the S3DB framework, chances are that there will be multiple terms describing the same concept, or even one word that has multiple definitions in different implementations. While they state that data elements can be shared across implementations, it isn’t a requirement and could lead to the problems mentioned. I have the feeling I may have gotten the wrong end of the stick here, and it would be great to hear if I’ve gotten something wrong.
  • Their use of the rdfs:subClassOf relation is not ideal. A subclass relation is a bit like saying “is a”, (defined here as a transitive property where “all the instances of one class are instances of another”) therefore what their core model is saying with the statement “User rdfs:subClassOf Group” is “User is a Group”. The same thing happens with the other uses of this relation, e.g. Item is a Collection.  A user is not a group, in the same way that a single item is not a collection. There are relations between these classes of object, but rdfs:subClassOf is simply not semantically correct. A SKOS relation such as skos:narrower (defined here as “used to assert a direct hierarchical link between two SKOS concepts”) would be more suitable, if they wished to use a “standard” relationship. I particularly feel that I probably misinterpreted this section of their paper, but couldn’t immediately find any extra information on their website. I would really like to hear if I’ve gotten something wrong here, too.

Also, although this is not something that should have been included in the paper, I would be curious to discover what use they think they could make of OBI, which would seem to suit them very well6. An ontology for biological and biomedical investigations would seem a boon to them. Further, such a connection could be two-way: the S3DB people probably have a large number of terms, gathered from the various users who created terms to use within the system. It would be great to work with the S3DB people to add these to the OBI ontology. Let’s talk! 🙂

Thanks for an interesting read!

1. Yes, I’ve mentioned to the UniProt gang that they need to re-jig their axes in the first graph in this link. They’re aware of it! 🙂
2. Although I shouldn’t talk, I am horrible at naming things, as the title of this blog shows
3. A format for ontologies using Description Logics that may be saved as RDF. See the official OWL docs.
4. Which is a really flaky connection, believe me!
5. Note that this expanded acronym is *not* present in this PloS One paper, but is on their website.
6. Note on personal bias: I am one of the core developers of OBI 🙂

Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, Jonas S. Almeida (2008). A Semantic Web Management Model for Integrative Biomedical Informatics PLoS ONE, 3 (8) DOI: 10.1371/journal.pone.0002946
Z. Zhang, K.-H. Cheung, J. P. Townsend (2008). Bringing Web 2.0 to bioinformatics Briefings in Bioinformatics DOI: 10.1093/bib/bbn041

Data Integration Research Blogging

Adding informative metadata to bioinformatics services

Carole Goble and the other authors of “Data curation + process curation = data integration + science” have written a paper on the importance of curating not just the services used in bioinformatics, but also how they are used. Just as more and more biologists are becoming convinced of the importance of storing and annotating their data in a common format, so should bioinformaticians take a little of their own medicine and ensure that the services they produce and use are annotated properly. I personally feel that it is just as important to ensure that in silico work is properly curated as it is in the more traditional, wet-lab biological fields.

They mention a common feature of web services and workflows: namely, that they are generally badly documented. Just as the majority of programmers leave it until the last possible minute to comment their code (if they comment at all!), so also are many web services annotated very sparsely, and not necessarily in a way that is useful to either humans or computers. I remember that my first experience with C code was trying to take over a bunch of code written by a C genius, who had but one flaw: a complete lack of commenting. Yes, I learnt a lot about writing efficient C code from his files, but it took me many hours more than it would have done if there had been comments in there!

They touch briefly on how semantic web services (SWS) could help, e.g. using formats such as OWL-S and SAWSDL. I recently read an article in the Journal of Biomedical Informatics (Garcia-Sanchez et al. 2008, citation at the end of the paper) that had a good introduction to both semantic web services and, to a lesser extent, multi-agent systems that could autonomously interact with such services. While the Goble et al. paper did not go into as much detail as the Garcia-Sanchez paper did on this point, it was nice to learn a little more about what was going on in the bioinformatics word with respect to SWS.

Their summary of the pitfalls to be aware of due to the lack of curated processes was good, as was their review of currently-existing catalogues and workflow and WS aggregators. The term “Web 2.0” was used, in my opinion correctly, but I was once again left with the feeling that I haven’t seen a good definition of what Web 2.0 is. I must hear it talked about every day, and haven’t come across any better definition than Tim O’Reilly’s. Does anyone reading this want to share their “favorite” definition? This isn’t a failing of this paper – more of my own lack of understanding. It’s a bit like trying to define “gene” (this is my favorite) or “systems biology” succinctly and in a way that pleases most people – it’s a very difficult undertaking! Another thing I would have liked to have seen in this paper, but which probably wasn’t suitable for the granularity level at which this paper was written, is a description and short analysis of the traffic and usage stats for myExperiment. Not a big deal – I’m just curious.

As with anything in standards development, even though there are proposed minimal information guidelines for web services out there (see MIAOWS), the main problem will always be lack of uptake and getting a critical mass (also important in community curation efforts, by the way). In my opinion, a more important consideration for this point is that getting a MIA* guideline to be followed does not guarantee any standard format. All it guarantees is a minimal amount of information to be provided.

They announce the BioCatalogue in the discussion section of this paper, which seems to be a welcome addition to the attempts to get people to annotate and curate their services in a standard way, and store them in a single location. It isn’t up and running yet, but is described in the paper as a web interface to more easily allow people to annotate their WSDL files, whereas previous efforts have mainly focused on the registry aspects. Further information can be associated with these files once they are uploaded to the website. However, I do have some questions about this service. What format is the further information (ontology terms, mappings) stored in? Are the ontology terms somehow put back into the WSDL file? How will information about the running of a WS or workflow be stored, if at all? Does it use a SWS format? I would like to see performances of Bioinformatics workflows stored publicly, just as performances of biological workflows (eg running a microarray experiment) can be. But I suppose many of these questions would be answered once BioCatalogue is in a state suitable for publishing on its own.

In keeping with this idea of storing the applications of in silico protocols and software in a standard format, I’d like to mention one syntax standard that might be of use in storing both descriptions of services and their implementation in specific in silico experiments: FuGE. While it does not currently have the structures required to implement everything mentioned in this paper (such as operational capability and usage/popularity scores) in a completely explicit way, many of the other metadata items that this paper suggests can already be stored within the FuGE object model (e.g. provenance, curation provenance, and functional capability). Further, FuGE is built as a model that can easily be extended. There is no reason why we cannot, for example, build a variety of Web services protocols and software within the FuGE structure. One downside of this method would be that information would be stored in the FuGE objects (e.g. a FuGE database or XML file) and not in the WSDL or Taverna workflow file. Further, there is no way to “execute” FuGE XML files, as there is with taverna files or WSs. However, if your in silico experiment is stored in FuGE, you immediately have your computational data stored in a format that can also store all of the wet-lab information, protocols, and applications of the protocols. The integration of your analyses with your wet-lab metadata would be immediate.

In conclusion, this paper presents a summary of a vital area of bioinformatics research: how, in order to aid data integration, it is imperative that we annotate not just wet-lab data and how they were generated, but also our in silico data and how they were generated. Imagine storing your web services in BioCatalogue and then sharing your entire experimental workflows, data and metadata with other bioinformaticians quickly and easily (perhaps using FuGE to integrate in silico analyses with wet-lab metadata, producing a full experimental metadata file that stores all the work of an experiment from test tube to final analysis).

Goble C, Stevens R, Hull D, Wolstencroft K, Lopez R. (2008). Data curation + process curation=data integration + science. Briefings in bioinformatics DOI: 19060304

F GARCIASANCHEZ, J FERNANDEZBREIS, R VALENCIAGARCIA, J GOMEZ, R MARTINEZBEJAR (2008). Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources Journal of Biomedical Informatics, 41 (5), 848-859 DOI: 10.1016/j.jbi.2008.05.007

Data Integration Meetings & Conferences

FuGE / ISA-TAB Workshop, Day 1

Today was the first day of the workshop – back at the good old EBI, though it isn't as recognizable as it used to be. Sure, there is the new EBI extension, but I am used to that now. However, they're renovating the inside of the old EBI building as well, reducing many of my friends to portakabin living over the winter months: better them than me!

Today definitely had an emphasis on the "work" part of "workshop". While a large part of the work on the XSLT for converting between FuGE and ISA-TAB is complete, some of the slightly stickier areas of the conversion are still being worked on. We spent today on trying to iron out some of the difficulties that arise from trying to convert the sort of rich tree structure that you get from the XML implementation of FuGE (FuGE-ML) into the flatter tabular format of ISA-TAB. Below are some of the more general ideas that we were throwing around as a result. (Some are more directly related to the conversion process than others – but all raise interesting points to me.)

  • One of the column names in the ISA-TAB Assay file is currently named "Raw Data File" in the 1.0 Specification. This caused a large amount of discussion as to what "raw" meant, and that many people would have a different idea of what a raw data file was. It was originally named this way to act as a foil against another (optional) column name, "Derived Data File". However, derived data files have a more precise definition in ISA-TAB – such a column can only be used to name files resulting from data transformations or processing. In the end, we are considering a name change, from "Raw Data File" to "Data File".
  • In the end, there will be a few simple ways to format your FuGE-ML files in a way that will aid the conversion into ISA-TAB. It would be useful to eventually produce a set of guidelines to aid in interoperability.
  • Some of the developers already using FuGE (myself included) are using the <Description> element within a FuGE-ML file as a way to allow our biologists to give a free-text description to both materials and data files. There is no specific element in these objects to add such information, and therefore the generic Description element is the best location. This isn't exactly as per FuGE best-practices, where the default Description elements are really only meant for private comments within a local FuGE implementation, and can normally be ignored by external bioinformaticians making use of your FuGE-ML. Such material and data descriptions can be copied into the ISA-TAB file as free text within the Comment[] columns, where what sits within the "[]" is the material or data identifier. We'll have to see if this idea turns out to be useful.
  • The main challenge in collapsing FuGE-ML into ISA-TAB is ensuring that the multi-level protocol application structures (for more information, see the GenericProtocolApplication and GenericProtocol objects within the FuGE Object Model) are correctly converted. We spent the majority of today trying to figure out an elegant way of doing this. We'll work on it again tomorrow, and will hopefully have a new version of the XSLT with a first-bash solution tomorrow evening!

Read and post comments |
Send to a friend


CISBAN Data Integration Meetings & Conferences Software and Tools Standards

Pre-workshop post on the FuGE / ISA-TAB Workshop, 8-9 December

Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)

ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format acceptable by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.

Each day I'll provide a summary of what's gone on at the workshop, which centers around the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore in the FuGE-ML to ISA-TAB direction, it is possible that it may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaing whatever concrete format they wish.

Specific topics of the workshop include:

  • Advance and possibly finalize XSLT rendering of FUGE Documents into ISA-TAB. This includes the finishing-off of the generic FuGE XSL stylesheet.
  • Work on some of the extensions, including FCM, Gel-ML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.

Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and for everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumverate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!

Read and post comments |
Send to a friend


Data Integration In The News

Two Journal Special Issues: Big Data, and Semantic Mashups for Bioinformatics

Both of these special issues are worth a look, as some of the papers look pretty interesting. I'll spend a little time in a later post on any articles I find particularly relevant.

  • Semantic Mashup of Biomedical Data Special Issue of the Journal of Biomedical Informatics. This includes a review article by Carole Goble and Robert Stevens: State of the nation in data integration for bioinformatics
  • Nature's Big Data Sepcial Issue. The article entitled "How do your data grow?" was one of the many articles in this issue that I enjoyed. It's interesting to note that these problems in management and curation of big data are only now getting special attention in Nature. When I worked at the EBI, it was common knowledge among the database curators that 1) it would be very difficult for them to find other work as curators if they left the EBI, and 2) the time and high skill level it takes to annotate and curate biological database entries means that it is very difficult to get high coverage in such databases. It's nice to finally see some recognition of all the work the biocurators do by a journal such as Nature. Finally, there are high-profile articles stating that curation begins at home, with the researcher, and that curation needs much more support from researcher-level all the way up to the level of the database curators.

Read and post comments |
Send to a friend


CISBAN Data Integration Semantics and Ontologies Software and Tools Standards

Of GelML and MFO

A couple of papers from here at Newcastle University have appeared over the past couple of weeks. Here's a summary of them both.

  • Data Standards
    From "An Update on Data Standards for Gel Electrophoresis" in Practical Proteomics Issue 1, September 2007, and by Andrew R. Jones and Frank Gibson.
    From the abstract: "We report on standards development by the Gel Analysis Workgroup of the
    Proteomics Standards Initiative. The workgroup develops reporting
    requirements, data formats and controlled vocabularies for experimental
    gel electrophoresis, and informatics performed on gel images. We
    present a tutorial on how such resources can be used and how the
    community should get involved with the on-going projects. Finally, we
    present a roadmap for future developments in this area."
    Provides a summary of ongoing work in the Gel electrophoresis and Gel informatics fields in terms of data and metadata standardization. This includes work on MIAPE GE and MIAPE GI, two checklists for minimal information required on these types of experiments and analyses. For both GE and GI, there are data formats (GelML and GelInfoML, respectively, both extensions of FuGE) and a suggested controlled vocabulary (sepCV). More information can be found on
    Frank works in the CARMEN neuroscience project here at Newcastle, and Andy is in Liverpool and works on, among other things, FuGE. CARMEN collaborates with the SyMBA project, which was originally developed by me and a few others within Neil Wipat's Integrative Bioinformatics Group here at Newcastle but which is now a sourceforge project at Andy Jones is a co-author with me, Neil Wipat, Matt Pocock and Olly Shaw on an upcoming SyMBA paper.
  • Semantic Data Integration
    A paper that was presented at the Integrative Bioinformatics Conference 2007 by me and my co-authors, Matt Pocock and Neil Wipat, is now available from the Journal of Integrative Bioinformatics website.
    Allyson L. Lister, Matthew Pocock, Anil Wipat. Integration of
    constraints documented in SBML, SBO, and the SBML Manual facilitates
    validation of biological models
    . Journal of Integrative Bioinformatics,
    4(3):80, 2007.

Read and post comments |
Send to a friend


CISBAN Papers Software and Tools Standards

Newcastle University Technical Report: CISBAN DPI

A Technical Report for the School of Computing Science of Newcastle University was released last month describing the CISBAN DPI, an implementation of the FuGE Milestone 3 STK. You can find and download that technical report here:

The Abstract follows:

The Centre for Integrated Systems Biology of Ageing and Nutrition has
developed a Data Portal and Integrator (CISBAN DPI) that is based on
the FuGE Object Model and which archives, stores, and retrieves raw
high-throughput data. Until now, few published systems have
successfully integrated multiple omics data types and information about
experiments in a single database. The CISBAN DPI is the first published
implementation of FuGE that includes a database back-end, expert and
standard interfaces, and utilizes a Life Science Identifier (LSID)
Resolution and Assigning service to identify objects and provide
programmatic access to the database. Having a central data
repository prevents deletion, loss, or accidental modification of
primary data, while giving convenient access to the data for
publication and analysis. It also provides a central location for
storage of metadata for the high-throughput data sets, and will
facilitate subsequent data integration strategies.


Functional Genomics, High-Throughput
Experiments, FuGE, LSID, Experimental Workflows, Databases, Data
Standards, Data Sharing, Metadata, Data Integration.

CS-TR: 1016 Implementing the FuGE Object Model: a Systems Biology Data Portal and Integrator
Lister, A. L., Jones, A. R., Pocock, M., Shaw, O., Wipat, A.
School of Computing Science, Newcastle University, Apr 2007

Read and post comments |
Send to a friend



Beta Release: CISBAN Data Portal and Integrator

The Centre for Integrated Systems Biology of Ageing and Nutrition has developed a Data Portal and Integrator
(CISBAN DPI) based on Milestone 3 of the Functional Genomics
Experiment (FuGE)
Object Model (FuGE-OM), and which archives,
stores, and retrieves raw high-throughput data.
We are pleased to announce that the CISBAN Data Portal and Integrator is now available in a
public sandbox version.

Please note that this release is still at an early beta stage, and any data you may upload to the
server may be deleted at any time. You will need a logon to access this database, which you may request
from the helpdesk. This is a low-level of security that will
only serve to prevent anonymous load on the database and to keep your sandbox area separate from others.
For more information on the sandbox DPI, please visit the DPI's
technical documentation.

Until now, few published systems have successfully integrated
multiple omics data types and information about experiments in a single database. The CISBAN DPI is the first
published implementation of FuGE that includes a database back-end, expert and standard interfaces, and utilizes
a Life Science Identifier (LSID) Resolution and Assigning service to identify objects and provide programmatic
access to the database. Having a central data repository prevents deletion, loss, or accidental modification of
primary data, while giving convenient access to the data for publication and analysis. It also provides a
central location for storage of metadata for the high-throughput data sets, and will facilitate subsequent data
integration strategies.

We encourage you to upload data and create as many experiments as you like
so that you may determine if this application may be of use to your own research group. We also appreciate
you contacting us with any comments or questions you may have.

Useful links:

Read and post comments |
Send to a friend