Modeling and Managing Experimental Data Using FuGE

ResearchBlogging.org

Want to share your umpteen multi-omics data sets and experimental protocols with one common format? Encourage collaboration! Speak a common language! Share your work! How, you might ask? With FuGE, and this latest paper (citation at the end of the post) tells you how.

In 2007, FuGE version 1 was released (website, Nature Biotechnology paper). FuGE allows biologists and bioinformaticians to describe any life science experiment using a single format, making collaboration and repeatability of experiments easier and more efficient. However, if you wanted to start using FuGE, until now it was difficult to know where to start. Do you use FuGE as it stands? Do you create an extension of FuGE that specifically meets your needs? What do the developers of FuGE suggest when taking your first steps using it? This paper focuses on best practices for using FuGE to model and manage your experimental data. Read this paper, and you’ll be taking your first steps with confidence!

[Aside: Please note that I am one of the authors of this paper.]

What is FuGE? I’ll leave it to the authors to define:

The approach of the Functional Genomics Experiment (FuGE) model is different, in that it attempts to generalize the modeling constructs that are shared across many omics techniques. The model is designed for three purposes: (1) to represent basic laboratory workflows, (2) to supplement existing data formats with metadata to give them context within larger workflows, and (3) to facilitate the development of new technology-specific formats. To support (3), FuGE provides extension points where developers wishing to create a data format for a specific technique can add constraints or additional properties.

A number of groups have started using FuGE, including MGED, PSI (for GelML and AnalysisXML), MSI, flow cytometry, RNA interference and e-Neuroscience (full details in the paper). This paper helps you get a handle on how to use FuGE by presenting two running examples of capturing experimental metadata, one from flow cytometry and one from gel-based proteomics. Part of Figure 2 from the paper is shown below, and describes one section of the flow cytometry FuGE extension from FICCS.

The flow cytometry equipment created as subclasses of the FuGE Equipment class.

FuGE covers many areas of experimental metadata including the investigations, the protocols, the materials and the data. The paper starts by describing how protocols are designed in FuGE and how those protocols are applied. In doing so, it describes not just the protocols but also parameterization, materials, data, conceptual molecules, and ontology usage.

Examples of each of these FuGE packages are provided in the form of either the flow cytometry or the GelML extensions. Further, clear scenarios are provided to help the user determine when it is best to extend FuGE and when it is best to re-use existing FuGE classes. For instance, it is best to extend the Protocol class with an application-specific subclass when all of the following are true: you wish to describe a complex Protocol that references specific sub-protocols, the Protocol must be linked to specific classes of Equipment or Software, and specific types of Parameter must be captured. I refer you to the paper for the scenarios for each of the other FuGE packages, such as Material and Protocol Application.
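To make the extension-by-subclassing idea concrete, here is a minimal Python sketch of the pattern the paper describes. The class and attribute names below are hypothetical simplifications of my own; they are not the actual FuGE object model or anything generated by the STKs.

# Minimal sketch of the "extend by subclassing" pattern described above.
# Class and attribute names are illustrative only; they are not the real
# FuGE object model or any STK-generated code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Equipment:           # stands in for FuGE's generic Equipment class
    name: str


@dataclass
class Parameter:           # stands in for FuGE's generic Parameter class
    name: str
    value: str


@dataclass
class Protocol:            # stands in for FuGE's generic Protocol class
    name: str
    sub_protocols: List["Protocol"] = field(default_factory=list)


# A technology-specific extension: a gel electrophoresis protocol that must
# reference particular Equipment and capture particular Parameters.
@dataclass
class GelElectrophoresisProtocol(Protocol):
    equipment: List[Equipment] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)


protocol = GelElectrophoresisProtocol(
    name="2D gel separation",
    equipment=[Equipment("gel tank"), Equipment("power supply")],
    parameters=[Parameter("voltage", "200 V"), Parameter("run time", "45 min")],
)

The generic classes stay untouched; the extension only adds the constraints and properties the specific technique needs, which is exactly the decision the scenarios above are meant to guide.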

The paper makes liberal use of UML diagrams to help you understand the relationship between the generic FuGE classes and the specific sub-classes generated by extensions. A large part of the paper is concerned expressly with helping the user understand how to model an experiment type using FuGE, and also to understand when FuGE on its own is enough. But it also does more than that: it discusses the tools already available for developers wishing to use FuGE, and it considers other implementations of FuGE that would be useful but do not yet exist. Validation of FuGE-ML and the storage of version information within the format are also described. Implementations of FuGE, including SyMBA and sysFusion for the XML format and ISA-TAB for compatibility with a spreadsheet (tab-delimited) format, are also summarised.

I strongly believe that the best way to solve the challenges in data integration faced by the biological community is to constantly strive to use the same (or compatible) formats for data and for metadata. FuGE succeeds in providing a common format for experimental metadata that can be used in many different ways, and with many different levels of uptake. You don’t have to use one of the provided STKs in order to make use of FuGE: you can simply offer your data as a FuGE export in addition to any other omics formats you might use. You could also choose to accept FuGE files as input. No changes need to be made to the underlying infrastructure of a project in order to become FuGE compatible. Hopefully this paper will flatten the learning curve for developers, and get them on the road to a common format. Just one thing to remember: formats are not something that the end user should see. We developers do all this hard work, but if it works correctly, the biologist won’t know about all the underpinnings! Don’t sell your biologists on a common format by describing the intricacies of FuGE to them (unless they want to know!); just remind them of the benefits of a common metadata standard: cooperation, collaboration, and sharing.

Jones, A., Lister, A.L., Hermida, L., Wilkinson, P., Eisenacher, M., Belhajjame, K., Gibson, F., Lord, P., Pocock, M., Rosenfelder, H., Santoyo-Lopez, J., Wipat, A., & Paton, N. (2009). Modeling and Managing Experimental Data Using FuGE. OMICS: A Journal of Integrative Biology. DOI: 10.1089/omi.2008.0080

Informal Knowledge Sharing in Science via Social Networking

This is a cross-posted item available both from this, my home blog, and http://biosharing.org, a new blog specifically concerned with “news and information about activities related to the development of data policies and standards in the biological domain, in particular for the area of ‘omics”. You can find the post on biosharing.org at: http://biosharing.org/2009/04/informal-knowledge-sharing-in-science.html .

Recently, more and more biologists, bioinformaticians, and scientists in general have been discovering the usefulness of social networking, microblogging, and blogging for their work. Increasingly, social networking applications such as FriendFeed and Twitter are becoming popular for the discovery of new research in a timely manner, for interactions and possible collaborations with like-minded researchers, and for announcing work that you’re doing. Sharing data and knowledge in biology should not just be limited to formal publications and databases. “Biosharing” can also be informal, and social networking is an important tool in informally conveying scientific knowledge. But how should you get started in this new world? Here are my experiences, together with some links and thoughts to help you get started.

ResearchBlogging.org

I created my FriendFeed account in Fall 2008 and my Twitter account last month. Why did I start using these social networking sites? Well, with FriendFeed, I had noticed many of my work colleagues starting to use it, but had no real understanding as to why they were so evangelical about it. With Twitter, I held out longer but eventually realised it was a really quick and easy way to get my messages across. Why I joined, and why these sites are useful to me, comes down to a simple answer.

I am interested in sharing knowledge. Social networking promotes an informal sharing of knowledge in a way complementary to more traditional, formal methods of knowledge sharing.

And if you’re interested in knowledge sharing, then you should look into social networking. My research focuses on semantic data integration. I have a further interest in common data formats to enable data / knowledge sharing. As I am quite vocal about getting people interested in formal methods of knowledge sharing such as the triumvirate of MIBBI, OBI, and FuGE / ISA-TAB for experimental data1 (and many, many more), it behooved me to learn about the informal methods.

Social Networking: Day-To-Day

But what convinced me that social networking for science was useful? By December I had a realisation: this social networking stuff was giving me more information directly useful to my research than any other resource I had used in the past. Period. You can see my happiness in this blog post from December, where I showed how, these days, I get more useful citations of papers I’m interested in via my friends’ citeulike feeds on FriendFeed than I ever have managed from the PubMed email alerts. What convinced me is not a what, but a who.

Social networking for science is an informal data integrator because of the people that are in that network.

It’s all about the people. I have met loads of new friends that have similar research interests via the “big 2” (FriendFeed and Twitter). I get knowledge and stay up to date on what’s happening in my area of the research world. I make connections.

What is FriendFeed? At its most basic, it is a “personal” RSS aggregator that allows comments on each item that is aggregated. For instance, I’ve added slides, citations, my blogs, my SourceForge activity and more to FriendFeed:

A screenshot of the services page of my FriendFeed account

There are loads of other RSS feeds you can add to FriendFeed. Then, when people add your feed to their accounts, they can see your activity and comment on each item. You gradually build up a network of like-minded people. Additionally, you can post questions and statements directly to FriendFeed. This is useful as a form of microblogging, or posting short pieces of useful information quickly.

A screenshot of my Twitter feed

What is Twitter? It’s a bit like instant messaging to the world. You can say whatever you like in 140 characters or fewer, and it is published on your page (here’s mine). Just like with FriendFeed, you can follow anyone else’s Twitter feed. You can even put your Twitter feed into FriendFeed. People have a tendency to over-tweet, and write loads of stuff. I use it, but only for work, and only for things that I think might be relevant for quick announcements. If Doug Kell tweets, shouldn’t you? 😉

Other people have posted on how FriendFeed is useful to them in their scientific work, such as Cameron Neylon (who has some practical advice too), Deepak Singh and Neil Saunders who talk about specific examples, and Simon Cockell who has written about his experiences with FriendFeed and Twitter. I encourage you to have a read of their posts.

You don’t have to spend ages on FriendFeed and Twitter to get useful information out of it. Start simply and don’t get social networking burnout.

Ask questions about science you can’t answer in your own physical network at the office (Andrew Clegg did it, and have a look at the resulting discussion on FriendFeed and blog summary from Frank Gibson!). Post interesting articles. Ignore it for a week or more if you want: interesting stuff will be “liked” by other people in your network and will stay at the top of the feed. Trust the people in your network, and make use of their expertise in sending the best stuff to the top, if you don’t have the time to read everything. Don’t be afraid to read everything, or to read just the top two or three items in your feed.

Social Networking: Conferences and Workshops

These “big 2” social networking apps are really useful when it comes to conferences, where they are used to microblog talks, discussions, and breakout sessions. For detailed information on how they are used in such situations, see the conference report for ISMB 2008 in PLoS Computational Biology by Saunders et al. BioSysBio 2009 also used these techniques (conference report, FriendFeed room, Twitter hashtag).

Social Networking: What should I use?

Other social networking sites, billed as “especially for scientists”, have been cropping up left, right and centre in the past year or two. There are “Facebooks for Scientists”2 (there are more than 20 listed here, just to get you started, and other efforts more directed at linking data workflows such as myExperiment). So, should we be using these scientist-specific sites? I certainly haven’t tried them all, so I cannot give you anything other than my personal experience.

As you can see from my FriendFeed screenshot, I belong to “Rooms” in FriendFeed as well as connecting directly with people’s feeds. Rooms such as The Life Scientists, with over 800 subscribers, get me answers to sticky questions I wouldn’t otherwise know how or where to ask (see here for an example). These, and the people I choose to link with directly, give me all of the science-specific discussions I could want.

The more general the social networking application is and the larger the user-base it has, the more likely it is to be around next year.

Right now, I don’t need any of the specialty features I’d get with a scientist-specific social networking application. I think the big names are more likely to reach a wider audience of like-minded folk.

Final Thoughts

Remember you’re broadcasting to the world. Only put stuff in that you think others will be interested in. This is a public face for you and your career.

I am a strong believer in keeping the personal parts of my life private (the entire world doesn’t need – or want – to know about my cat or see the pictures of my nephew)  while at the same time making sure that I am really easy to reach for work-related discussions and collaborations. Through my blog, and my social networking, I am gaining a fuller appreciation of the work going on in the research community around me and contributing to the resulting large experiment in informal data integration.

It is fun: I meet new people and have interesting conversations. It is useful to my career: my blogging has resulted in an invitation to co-author two conference reports, and shows me new things happening in my field earlier than before. I’m all about sharing biological knowledge. I’m researching the formal side of data integration and sharing, and I’m using informal knowledge sharing to help me do my work.

I hope to see you there soon! Look me up!

Footnotes

  1. For a very nice overview of these standards, see Frank Gibson’s blog.
  2. While I am on Facebook, I do not use it for work purposes, and therefore cannot comment on its applicability for scientists.

Lister A, Charoensawan V, De S, James K, Janga SC, & Huppert J (2009). Interfacing systems biology and synthetic biology. Genome biology, 10 (6) PMID: 19591648
Saunders, N., Beltrão, P., Jensen, L., Jurczak, D., Krause, R., Kuhn, M., & Wu, S. (2009). Microblogging the ISMB: A New Approach to Conference Reporting PLoS Computational Biology, 5 (1) DOI: 10.1371/journal.pcbi.1000263

One way for RDF to help a bioinformatician build a database: S3DB

ResearchBlogging.org

This post is part of the PLoS One syncroblogging day, as part of the PLoS ONE @ Two birthday celebrations. Happy Synchroblogging! Here’s a link to the paper on the PLoS One website.

Biological data: vitally important, determinedly unruly. This challenge facing the life-science community has been present for decades, as witnessed by the often exponential growth of biological databases (see the classic curve in the current graphs of UniProt1 and EMBL if you don’t believe me). It’s important to me, as a bioinformatics researcher whose main focus is semantic data integration, but it should be important to everyone. Without manageable data that can be easily integrated, all of our work suffers. Nature thinks it’s important: it recently devoted an entire issue to Big Data. Similarly, the Journal of Biomedical Informatics just had a Semantic Mashup special issue. Deus et al. (the paper I’m blogging about, published in PLoS One this summer) agree, beginning with “Data, data everywhere”, nicely encapsulating both the joy and the challenge in one sentence.

This paper describes work on a distributed management system that can link disparate data sources using methodologies commonly associated with the semantic web (or is that Web 3.0?). I’m a little concerned (not at the paper, just in general) at the fact that we seem to already have a 3.0 version of the web, especially as I have yet to figure out a useful definition for semantic web vs Web 2.0 vs Web 3.0. Definitions of Web 3.0 seem to vary wildly: is it the semantic web? Is it the -rwx- to Web 1.0’s -r-- and Web 2.0’s -rw- (as described here, and cited below)? Are these two definitions one and the same? Perhaps these are discussions for another day… Ultimately, however, I have to agree with the authors that “Web 3.0” is an unimaginative designation2.

So, how can the semantic web help manage our data? That would be a post in itself, and is the focus of many PhD projects (including mine). Perhaps a better question is how does the management model proposed by Deus et al. use the semantic web, and is it a useful example of integrative bioinformatics?

Their introduction focuses on two types of integration: data integration as an aid to holistic approaches such as mathematical modelling, and software integration which could provide tighter interoperability between data and services. They espouse (and I agree) the semantic web as a technology which will allow the semantically-meaningful specification of desired properties of data in a search, rather than retrieving data in a fixed way from fixed locations. They want to extend semantic data integration from the world of bioinformatics into clinical applications. Indeed, they want to move past the “clandestine and inefficient flurry of datasets exchanged as spreadsheets through email”, a laudable goal.

Their focus is on a common data management and analysis infrastructure that does not place any restrictions on the data stored. This also means multiple instances of light-weight applications are part of the model, rather than a single central application. The storage format is of a more general, flexible nature. Their way of getting the data into a common format, they say, is to break down the “interoperable elements” of the data structures into RDF triples (subject-predicate-object statements). At its most basic, their data structure has two types of triples: Rules and Statements. Rules are phrases like “sky has_color”, while statements add a value to the phrase, e.g. “today’s_sky has_color blue”.
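As a toy illustration of that rule/statement split, the two example phrases could be written as RDF triples with rdflib. This is a sketch only: the namespace and property names are invented for illustration and are not the URIs or schema S3DB actually uses.

# Sketch of the rule/statement idea as plain RDF triples. The namespace
# and term names are invented; they are not S3DB's actual schema.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/s3db-sketch/")
g = Graph()

# Rule: "sky has_color" - encoded here by pointing the predicate at a
# placeholder class, i.e. "things of type sky can have a colour".
g.add((EX.sky, EX.has_color, EX.Color))

# Statement: "today's_sky has_color blue" - the rule instantiated with a value.
g.add((EX.todays_sky, EX.has_color, Literal("blue")))

print(g.serialize(format="turtle"))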

They make the interesting point that the reclassification of data from flat files to XML to RDF to Description Logics starts to dilute “the distinction between data management and data analysis”. While it is true that if you are able to store your data in formats such as OWL-DL3, the format is much more amenable to direct computational reasoning and inference, perhaps a more precise statement would be that the distinction between performance of data management tasks and data analysis tasks will blur with richer semantic descriptions of both the data and their applications. As they say later in the paper, once the data and the applications are described in a way that is meaningful for computation, new data being deposited online could automatically trigger a series of appropriate analysis steps without any human input.

A large focus of the paper was on identity, both of the people using it (and therefore addressing the user requirement of a strong permissions system) and of the entities in the model and database (each identified with some type of URI). This theme is core to ensuring that only those with the correct permissions may access possibly-sensitive data, and that each item of information can be unambiguously defined. I like that the sharing of “permissions between data elements in distinct S3DB deployments happens through the sharing the membership in external Collections and Rules…not through extending the permission inheritance beyond the local deployment”. It seems a useful and straightforward method of passing permissions.

I enjoyed the introduction, background, and conclusions. Their description of the Semantic Web and how it could be employed in the life sciences is well-written and useful for newcomers to this area of research. Their description of the management model as composed of subject-predicate-object RDF triples plus membership and access layers was interesting. Their website was clear and clean, and they had a demo that worked even when I was on the train4. It’s also rather charming that “S3DB” stands for Simple Sloppy Semantic Database – they have to get points for that one5! However, the description of their S3DB prototype was not extensive, and as a result I have some basic questions, which can be summarized as follows:

  • How do they determine what the interoperable elements of different data structures are? Manually? Computationally? Is this methodology generic, or does it have to be done with each new data type?
  • The determination of the maturity of a data format is not described, other than that it should be a “stable representation which remains useful to specialized tools”. For instance, the mzXML format is considered mature enough to use as the object of an RDF triple. What quality control is there in such cases? In theory, someone could make a bad mzXML file. Or is it not the format which is considered mature, but instead specific data sets that are known to be high quality?
  • I would have liked to see more detail in their practical example. Their user testing was performed together with the Lung Cancer SPORE user community. How long did the trial last? Was there some qualitative measurement of how happy they were with it (e.g. a questionnaire)? The only requirement gathered seems to have been that of high-quality access control.
  • Putting information into RDF statements and rules in an unregulated way will not guarantee data sets that can be integrated with other S3DB implementations, even if they are of the same experiment type. This problem is exemplified by a quote from the paper (p. 8): “The distinct domains are therefore integrated in an interoperable framework in spite of the fact that they are maintained, and regularly edited, by different communities of researchers.” The framework might be identical, but that doesn’t ensure that people will use the same terms and share the same rules and statements. Different communities could build different statements and rules, and use different terms to describe the same concept. Distributed implementations of S3DB databases, where each group can build their own data descriptions, do not lend themselves well to later integration unless they start by sharing the same ontology/terms and core rules. And, as the authors encourage the “incubation of experimental ontologies” within the S3DB framework, chances are that there will be multiple terms describing the same concept, or even one word that has multiple definitions in different implementations. While they state that data elements can be shared across implementations, it isn’t a requirement and could lead to the problems mentioned. I have the feeling I may have gotten the wrong end of the stick here, and it would be great to hear if I’ve gotten something wrong.
  • Their use of the rdfs:subClassOf relation is not ideal. A subclass relation is a bit like saying “is a” (defined here as a transitive property where “all the instances of one class are instances of another”), therefore what their core model is saying with the statement “User rdfs:subClassOf Group” is “User is a Group”. The same thing happens with the other uses of this relation, e.g. Item is a Collection. A user is not a group, in the same way that a single item is not a collection. There are relations between these classes of object, but rdfs:subClassOf is simply not semantically correct. A SKOS relation such as skos:narrower (defined here as “used to assert a direct hierarchical link between two SKOS concepts”) would be more suitable, if they wished to use a “standard” relationship; a small sketch of the difference follows this list. I particularly feel that I probably misinterpreted this section of their paper, but couldn’t immediately find any extra information on their website. I would really like to hear if I’ve gotten something wrong here, too.
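Here is the minimal sketch promised above, again with rdflib and invented URIs rather than the actual S3DB core model. The first triple entails, under RDFS semantics, that every instance of User is also an instance of Group, whereas the SKOS alternative only records a hierarchical link between the two concepts.

# Sketch of the two modelling choices; URIs are invented for illustration
# and are not taken from the S3DB core model itself.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, SKOS

EX = Namespace("http://example.org/s3db-sketch/")
g = Graph()

# The paper's choice: an RDFS reasoner will now infer (alice, rdf:type, Group),
# i.e. "a user is a group".
g.add((EX.User, RDFS.subClassOf, EX.Group))
g.add((EX.alice, RDF.type, EX.User))

# A SKOS alternative: a hierarchical link between the two concepts, with no
# "is a" entailment about their instances.
g.add((EX.Group, SKOS.narrower, EX.User))

print(g.serialize(format="turtle"))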

Also, although this is not something that should have been included in the paper, I would be curious to discover what use they think they could make of OBI, which would seem to suit them very well6. An ontology for biological and biomedical investigations would seem a boon to them. Further, such a connection could be two-way: the S3DB people probably have a large number of terms, gathered from the various users who created terms to use within the system. It would be great to work with the S3DB people to add these to the OBI ontology. Let’s talk! 🙂

Thanks for an interesting read!

Footnotes:
1. Yes, I’ve mentioned to the UniProt gang that they need to re-jig their axes in the first graph in this link. They’re aware of it! 🙂
2. Although I shouldn’t talk, I am horrible at naming things, as the title of this blog shows
3. A format for ontologies using Description Logics that may be saved as RDF. See the official OWL docs.
4. Which is a really flaky connection, believe me!
5. Note that this expanded acronym is *not* present in this PloS One paper, but is on their website.
6. Note on personal bias: I am one of the core developers of OBI 🙂

Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, Jonas S. Almeida (2008). A Semantic Web Management Model for Integrative Biomedical Informatics PLoS ONE, 3 (8) DOI: 10.1371/journal.pone.0002946
Z. Zhang, K.-H. Cheung, J. P. Townsend (2008). Bringing Web 2.0 to bioinformatics Briefings in Bioinformatics DOI: 10.1093/bib/bbn041

Adding informative metadata to bioinformatics services

ResearchBlogging.org

Carole Goble and the other authors of “Data curation + process curation = data integration + science” have written a paper on the importance of curating not just the services used in bioinformatics, but also how they are used. Just as more and more biologists are becoming convinced of the importance of storing and annotating their data in a common format, so should bioinformaticians take a little of their own medicine and ensure that the services they produce and use are annotated properly. I personally feel that it is just as important to ensure that in silico work is properly curated as it is in the more traditional, wet-lab biological fields.

They mention a common feature of web services and workflows: namely, that they are generally badly documented. Just as the majority of programmers leave it until the last possible minute to comment their code (if they comment at all!), so also are many web services annotated very sparsely, and not necessarily in a way that is useful to either humans or computers. I remember that my first experience with C code was trying to take over a bunch of code written by a C genius, who had but one flaw: a complete lack of commenting. Yes, I learnt a lot about writing efficient C code from his files, but it took me many hours more than it would have done if there had been comments in there!

They touch briefly on how semantic web services (SWS) could help, e.g. using formats such as OWL-S and SAWSDL. I recently read an article in the Journal of Biomedical Informatics (Garcia-Sanchez et al. 2008, citation at the end of the post) that had a good introduction to both semantic web services and, to a lesser extent, multi-agent systems that could autonomously interact with such services. While the Goble et al. paper did not go into as much detail as the Garcia-Sanchez paper did on this point, it was nice to learn a little more about what was going on in the bioinformatics world with respect to SWS.

Their summary of the pitfalls to be aware of due to the lack of curated processes was good, as was their review of currently-existing catalogues and workflow and WS aggregators. The term “Web 2.0” was used, in my opinion correctly, but I was once again left with the feeling that I haven’t seen a good definition of what Web 2.0 is. I must hear it talked about every day, and haven’t come across any better definition than Tim O’Reilly’s. Does anyone reading this want to share their “favorite” definition? This isn’t a failing of this paper – more of my own lack of understanding. It’s a bit like trying to define “gene” (this is my favorite) or “systems biology” succinctly and in a way that pleases most people – it’s a very difficult undertaking! Another thing I would have liked to have seen in this paper, but which probably wasn’t suitable for the granularity level at which this paper was written, is a description and short analysis of the traffic and usage stats for myExperiment. Not a big deal – I’m just curious.

As with anything in standards development, even though there are proposed minimal information guidelines for web services out there (see MIAOWS), the main problem will always be lack of uptake and getting a critical mass (also important in community curation efforts, by the way). In my opinion, a more important consideration for this point is that getting a MIA* guideline to be followed does not guarantee any standard format. All it guarantees is a minimal amount of information to be provided.

They announce the BioCatalogue in the discussion section of this paper, which seems to be a welcome addition to the attempts to get people to annotate and curate their services in a standard way, and store them in a single location. It isn’t up and running yet, but is described in the paper as a web interface to more easily allow people to annotate their WSDL files, whereas previous efforts have mainly focused on the registry aspects. Further information can be associated with these files once they are uploaded to the website. However, I do have some questions about this service. What format is the further information (ontology terms, mappings) stored in? Are the ontology terms somehow put back into the WSDL file? How will information about the running of a WS or workflow be stored, if at all? Does it use a SWS format? I would like to see performances of Bioinformatics workflows stored publicly, just as performances of biological workflows (eg running a microarray experiment) can be. But I suppose many of these questions would be answered once BioCatalogue is in a state suitable for publishing on its own.

In keeping with this idea of storing the applications of in silico protocols and software in a standard format, I’d like to mention one syntax standard that might be of use in storing both descriptions of services and their implementation in specific in silico experiments: FuGE. While it does not currently have the structures required to implement everything mentioned in this paper (such as operational capability and usage/popularity scores) in a completely explicit way, many of the other metadata items that this paper suggests can already be stored within the FuGE object model (e.g. provenance, curation provenance, and functional capability). Further, FuGE is built as a model that can easily be extended. There is no reason why we cannot, for example, describe a variety of Web service protocols and software within the FuGE structure. One downside of this method would be that information would be stored in the FuGE objects (e.g. a FuGE database or XML file) and not in the WSDL or Taverna workflow file. Further, there is no way to “execute” FuGE XML files, as there is with Taverna files or WSs. However, if your in silico experiment is stored in FuGE, you immediately have your computational data stored in a format that can also store all of the wet-lab information, protocols, and applications of the protocols. The integration of your analyses with your wet-lab metadata would be immediate.
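As a rough sketch of what I mean (plain Python with invented names, deliberately not the real FuGE object model or FuGE-ML), an in silico analysis could be recorded in the same shape as a wet-lab one: a protocol describing the software, and an application of that protocol linking parameter values, inputs, and outputs.

# Rough sketch of recording an in silico analysis as a protocol application.
# The class and attribute names are invented simplifications; they are not
# the real FuGE object model, FuGE-ML, or any STK-generated code.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class SoftwareProtocol:                 # description of the tool and how it is run
    software_name: str
    version: str
    default_parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class ProtocolApplication:              # one concrete run of that protocol
    protocol: SoftwareProtocol
    performed_at: datetime
    parameter_values: Dict[str, str]
    input_data: List[str]               # identifiers/paths of input data sets
    output_data: List[str]              # identifiers/paths of produced data sets


blast_protocol = SoftwareProtocol("blastp", "2.2.18", {"evalue": "1e-5"})
run = ProtocolApplication(
    protocol=blast_protocol,
    performed_at=datetime(2009, 1, 15, 14, 30),
    parameter_values={"evalue": "1e-10"},
    input_data=["proteins.fasta"],
    output_data=["hits.xml"],
)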

In conclusion, this paper presents a summary of a vital area of bioinformatics research: how, in order to aid data integration, it is imperative that we annotate not just wet-lab data and how they were generated, but also our in silico data and how they were generated. Imagine storing your web services in BioCatalogue and then sharing your entire experimental workflows, data and metadata with other bioinformaticians quickly and easily (perhaps using FuGE to integrate in silico analyses with wet-lab metadata, producing a full experimental metadata file that stores all the work of an experiment from test tube to final analysis).

Goble C, Stevens R, Hull D, Wolstencroft K, Lopez R. (2008). Data curation + process curation = data integration + science. Briefings in Bioinformatics. PMID: 19060304

Garcia-Sanchez, F., Fernandez-Breis, J., Valencia-Garcia, R., Gomez, J., & Martinez-Bejar, R. (2008). Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources. Journal of Biomedical Informatics, 41 (5), 848-859. DOI: 10.1016/j.jbi.2008.05.007

From OBO to OWL and Back Again: OBO capabilities of the OWL API

ResearchBlogging.org

Golbreich et al describe a formal method of converting OBO to OWL 1.1 files, and vice versa. Their code has been integrated into the OWL API, a set of classes that is well-used within the OWL community. For instance, Protege 4 is built on the OWL API. While there have been other efforts in the past to map between the OBO flat-file format and OWL (they specifically mention Chris Mungall’s work on an XSLT used as a plugin within Protege that can perform the conversion), none were done in a formal or rigorous manner. By defining an exact relationship between OBO and OWL constructs using consensus information provided by the OBO community, the authors have provided a more robust method of mapping than has been available to date. Consequently, the entire library of tools, reasoners and editors available to the OWL community are now also available to OBO developers in a way that does not force them to permanently leave the format and environment that they are used to.

OBO ontologies are ontologies generated within the biological and biomedical domain and which follow a standard, if often non-rigorously-defined, syntax and semantics. The most well-known of the OBO ontologies is the Gene Ontology (GO). Not only do you subscribe to the format when you choose OBO, you are also subscribing to the ideas behind the OBO Foundry, which aims to limit overlap of ontologies in related fields, and which provides a communal environment (mailing lists, websites, etc) in which to develop. OWL (the Web Ontology Language) has three dialects, of which OWL-DL (DL stands for Description Logics) is the most commonly used. OWL-DL is favored by ontologists wishing to perform computational analyses over ontologies as it has not just rigorously-defined formal semantics, but also a wide user-base and a suite of reasoning tools developed by multiple groups.

OBO is composed of stanzas describing elements of the ontology. Below is an example of a term in its stanza, which describes its location in the larger ontology:

[Term]
id: GO:0001555
name: oocyte growth
is_a: GO:0016049 ! cell growth
relationship: part_of GO:0048601 ! oocyte morphogenesis
intersection_of: GO:0040007 ! growth
intersection_of: has_central_participant CL:0000023 ! oocyte

Before they could start writing the parsing and mapping programs, they had to formalize both the semantics and the syntax of OBO. This is something that would normally be done by the developers of the format rather than by its users, but both the syntax and semantics of OBO are only defined in natural language. These natural language definitions often lead to imprecision and, in extreme cases, no consensus was reached for some of the OBO constructs. However, the diligence of the authors in getting consensus from the OBO community should be rewarded in future by the OBO community feeling confident in the mapping, and therefore also in using the OWL tools now available to them. An example of the natural language definitions in the OBO User Guide follows:

This tag describes a typed relationship between this term and another term. [...] The necessary modifier allows a relationship to be marked as “not necessarily true”. [...]

Neither “necessarily true” nor “relationship” has been defined. You can, in fact, computationally define a relation in three different ways (taking their stanza example from above):

  • existentially, where each instance of GO:0001555 must have at least one part_of relationship to an instance of the term GO:0048601;
  • universally, where instances of GO:0001555 can *only* be connected to instances of GO:0048601;
  • via a constraint interpretation, where the endpoints of the relationship *must* be known, but which cannot in any case be expressed with DL, so is not useful to this discussion.

OBO-Edit does not always infer what should be inferred if all of the rules of its User Guide are followed. There is a good example of this in the text.

In their formal representation of the OBO syntax they used BNF, which is backwards-compatible with OBO. Many of the mappings are quite straightforward: OBO terms become OWL classes, OBO relationship types become OWL properties, OBO instances become OWL individuals, OBO ids are the URIs in OWL, and the OBO names become the OWL labels. is_a, disjoint_from, domain and range have direct OWL equivalents. There had to be some more complex mapping in other places, such as trying to map OBO relationship types to either OWL object or datatype properties.
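To make the mapping concrete, here is a small rdflib sketch of how the GO:0001555 stanza shown earlier might come out in OWL, reading the part_of relationship existentially. The URI prefix is a placeholder and the intersection_of lines are omitted, so this is not necessarily the exact output of the OWL API.

# Sketch of the OBO -> OWL mapping for the stanza above, with the part_of
# relationship read existentially (owl:someValuesFrom). The prefix is a
# placeholder and may differ from what the OWL API actually generates.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

OBO = Namespace("http://purl.org/obo/owl/")
g = Graph()

oocyte_growth = OBO["GO_0001555"]
cell_growth = OBO["GO_0016049"]
oocyte_morphogenesis = OBO["GO_0048601"]
part_of = OBO["part_of"]

# OBO term -> OWL class; OBO name -> rdfs:label; is_a -> rdfs:subClassOf
g.add((oocyte_growth, RDF.type, OWL.Class))
g.add((oocyte_growth, RDFS.label, Literal("oocyte growth")))
g.add((oocyte_growth, RDFS.subClassOf, cell_growth))

# relationship: part_of GO:0048601, read existentially: every instance of
# oocyte growth has at least one part_of link to an oocyte morphogenesis.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, part_of))
g.add((restriction, OWL.someValuesFrom, oocyte_morphogenesis))
g.add((oocyte_growth, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))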

Using OWL reasoners over OBO ontologies not only works but, in the case of the Sequence Ontology (SO), it found a term that had only a single intersection_of statement, and was thus illegal according to OBO rules, but which hadn’t been found by OBO-Edit.

Up until now, I’ve been unsure as to how the OWL files are created from files in the OBO format. This was a paper that was clear and to the point. Thanks very much!

Update December 2008: I originally posted this without the BPR3 / ResearchBlogging.org tag, as I was unsure where conference proceedings came in the “peer-reviewed research” part of the guidelines. However, as I’m now getting back into the whole researchblogging thing, I feel (having read many of the posts of my fellow research bloggers) that this would be suitable. If anyone has any opinions, I’d be most interested!

Golbreich, C., Horridge, M., Horrocks, I., Motik, B., Shearer, R. (2008). OBO and OWL: Leveraging Semantic Web Technologies for the Life Sciences Lecture Notes in Computer Science, 4825/2008, 169-182 DOI: 10.1007/978-3-540-76298-0_13

Whole-Genome Reference Networks for the Community

ResearchBlogging.org

Srinivasan et al use this paper as a call to the community to begin the development of whole-genome reference networks for key model organisms. This paper is a combination of a review (in that it summarizes methods of network generation and analysis) and a call to arms, stating that reference networks are needed. It begins by describing systems biology as “the science of quantitatively defining and analyzing” functional modules, or components of biological systems.

There are many different definitions of systems biology (see here, here, here, here and here, just to name a few), but generally it seems the twin pillars of data integration and study – at various levels of granularity – of biological systems are present in most of them. A focus on integration and top-down research rather than the more traditional reductionist point of view is also often mentioned.

The authors then divide systems biology into three broad categories: high-level networks of the interactome or metabolome, deterministic models of kinetics and diffusion, and finally stochastic models of variation in cell lines. This division would be slightly clearer if they specified continuous deterministic models and discrete stochastic models. I realize that these adjectives are generally assumed for these model types, but as it is their discrete- or continuous-ness that increases the complexity of the models, it is something that would be useful to include.

They collapse many different types of network data into a single global interaction network, stating that it would be prohibitively expensive to try to prise out all of the sub-graphs, as variables such as time or sub-cellular location are often not simple to pull out on their own. This “lowest common denominator” method of network generation is not ideal but, they attest, still provides more information than a simple genome sequence. In their networks, nodes represent proteins and edge weights are probabilities of association between proteins.

Noise is a real problem in most of these high-throughput data sets, and such data sets are not all created equal: one group may make a very good gene expression data set, and another may not. How can variable quality of data be dealt with? Early efforts focused on integrating multiple networks and only taking those nodes and edges that were present in more than one network. After that, methods of network generation that used “gold standards” created better integrated networks.
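A toy version of that early integration approach, in plain Python with networkx, keeps only the interactions seen in at least two of the input networks. The data and the threshold are invented for illustration; this is not the authors’ actual pipeline.

# Toy sketch of the early integration approach described above: keep only
# the edges observed in at least two independent networks.
from collections import Counter
import networkx as nx

# Three hypothetical evidence networks (e.g. yeast two-hybrid, co-expression,
# literature curation), each just a set of protein-protein edges.
y2h = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
coexpr = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P6", "P7")])
curated = nx.Graph([("P2", "P3"), ("P4", "P5")])

# Count how many networks support each (undirected) edge.
support = Counter()
for network in (y2h, coexpr, curated):
    for u, v in network.edges():
        support[frozenset((u, v))] += 1

# Keep edges supported by two or more networks; the count doubles as a
# crude confidence weight on the integrated network.
integrated = nx.Graph()
for edge, count in support.items():
    if count >= 2:
        u, v = tuple(edge)
        integrated.add_edge(u, v, weight=count)

print(list(integrated.edges(data=True)))

The later “gold standard” methods mentioned next replace that simple vote with weights calibrated against trusted reference interactions.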

Descriptions of network analyses (rather than network creation) focus on network alignment and experiment prioritization. The latter is a general term for pulling out elements of the network that haven’t been experimentally verified, such as likely additions to known pathways or important disease genes. The former is an interesting extrapolation to networks of sequence alignments for genomes. In network alignments, conserved modules of nodes are identified if they have “both conserved primary sequences and conserved pair-wise interactions between species”. They specifically mention Graemlin, which is a tool they have developed that can identify conserved functional modules across multiple networks.

Finally, they suggest that the reference networks should show only those reactions present in the “‘average cell’ of a given organism near the median of the norm of reaction”.

While they acknowledge that, like the reference human genome sequence, such a creation is a “useful fiction”, it is my opinion that finding the average cell will be much more difficult, and perhaps less illuminating, than its equivalent in the sequencing world. Further, describing what is “normal” is something that is truly difficult, and will vary from species to species. The PATO / quality ontology people (http://obofoundry.org/cgi-bin/detail.cgi?quality) have known about the problems facing the “average” phenotype for a while now. I do, however, like their idea of storing the reference networks using RDF, as that seems a fitting format for networks. Overall, a laudable goal but one which will need some more thinking about. I’ve tried to run Graemlin using one of their example searches, and it didn’t run (at least today), and the main author’s website won’t load for me today, though one of the other authors’ pages did work.

All in all, a useful review of recent network methods in bioinformatics, and an interesting goal described. Low-noise reference networks for key model organisms, together with annotation tracks that would describe deviations from the norm, would be a good idea.

Topics for discussion (aka leading questions): More fine-grained reference implementations are available, such as Reactome. Reactome provides a curated database of human biological pathways, with inferred orthologous events for 22 other organisms. Do we need reference networks when we’re gradually growing our knowledge of reference pathways? Are reference networks of “normal” organism states helpful? How do we define average? Would the median of the norm of a reaction be different under different environmental conditions? What if what one group considers an average cell differs from another group’s average cell? Having reference networks would mean easier comparisons of different network analysis programs. Would this end up being a major use of the networks? Would such comparisons just lead to network analysis programs that fit the reference network, but do not work in a generic manner? What do others think?

Srinivasan, B.S., Shah, N.H., Flannick, J.A., Abeliuk, E., Novak, A.F., Batzoglou, S. (2007). Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinformatics, 8(5), 318-332. DOI: 10.1093/bib/bbm038