Data Integration Meetings & Conferences

PT11: Domain-oriented edge-based alignment of protein interaction networks (ISMB 2009)

Xin Guo

Introducing… the DOMain-oriented Alignment of Interaction Networks (DOMAIN). Previous paradigms include the node-then-edge-alignment paradigm and direct-edge-alignment paradigm. In the latter, interactions are more likely to be conserved. Many studies have suggested that direct PPIs can be mediated by interactions of their domains.

Their method follows the direct-edge-alignment paradigm. First, find a set of alignable pairs of edges (APEs); then add edges between the APEs; finally, search for high-scoring alignments. Step 1 (finding APEs) rests on two assumptions: first, that two proteins interact if at least one pair of their constituent domains interacts, and second, that any two domain-domain interactions (DDIs) are independent of each other. An APE is a pair of cross-species PPIs sharing at least one pair of DDIs. The DDIs in common are plausibly responsible for the PPIs. To score an APE, you estimate species-specific DDI probabilities, and then calculate a mean as their joint probability. Over all common DDIs they use a noisy-or formulation to calculate the score. The APE graph is the aligned network graph. It is motivated by duplication-divergence models, and has two parts: link dynamics and gene duplication.
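As a rough illustration of the scoring step in my notes above, here is a minimal Python sketch. It assumes the mean is a plain arithmetic mean of the two species-specific DDI probabilities (the talk did not say which mean was used). The function names, protein names, domain names and probabilities are all made up for the example, and this is not the authors' implementation.

```python
from itertools import product
from math import prod


def candidate_ddis(domains_a, domains_b):
    """Unordered domain pairs that could mediate an interaction between two proteins."""
    return {frozenset((da, db)) for da, db in product(domains_a, domains_b)}


def ape_score(ppi_1, ppi_2, domains, p_ddi_1, p_ddi_2):
    """Noisy-or score for a pair of cross-species PPIs, or None if they share no DDI.

    ppi_1, ppi_2 : (protein, protein) tuples, one PPI per species
    domains      : protein -> set of domain names (hypothetical mapping)
    p_ddi_1/2    : species-specific DDI probabilities keyed by frozenset of domains
    """
    shared = candidate_ddis(domains[ppi_1[0]], domains[ppi_1[1]]) & candidate_ddis(
        domains[ppi_2[0]], domains[ppi_2[1]]
    )
    # Keep only the shared DDIs that have an estimate in both species.
    shared = [d for d in shared if d in p_ddi_1 and d in p_ddi_2]
    if not shared:
        return None  # not an alignable pair of edges
    # ASSUMPTION: arithmetic mean as the joint probability of each shared DDI.
    joint = [(p_ddi_1[d] + p_ddi_2[d]) / 2 for d in shared]
    # Noisy-or: the pair is supported if at least one shared DDI is "on".
    return 1.0 - prod(1.0 - p for p in joint)


# Toy data: the two PPIs share exactly one candidate DDI, (D1, D3).
domains = {"A1": {"D1", "D2"}, "A2": {"D3"}, "B1": {"D1"}, "B2": {"D3", "D4"}}
d13 = frozenset(("D1", "D3"))
score = ape_score(("A1", "A2"), ("B1", "B2"), domains, {d13: 0.6}, {d13: 0.4})
```

The noisy-or form leans on the second assumption (independent DDIs): each additional shared DDI can only raise the score, and a single high-probability DDI is enough for a high score.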

To evaluate their method, they used data from DIP for 3 different species (PPI networks), and Pfam-A domains (protein-to-domain mapping), and the backbone DIP network (a subset of DIP). Two other similar methods are NetworkBLAST and MaWISh. In all of their metrics except one, they came out best. NetworkBLAST was the second best.

DOMAIN is the first algorithm to introduce domains to the PPI network alignment problem, and the first attempt to align PPIs directly. Its performance is better than or similar to that of the other methods. It can only be applied to a subset of PPIs, although most functionally-annotated proteins are involved.


Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Data Integration Housekeeping & Self References Meetings & Conferences Semantics and Ontologies

Annotation of SBML Models Through Rule-Based Semantic Integration (ISMB Bio-Ont SIG 2009)

Allyson Lister et al.

I didn’t take any notes on this talk, as it was my own talk and I was giving it. However, I can link you out to the paper on Nature Precedings and the Bio-Ontologies programme on the ISMB website. Let me know if you have questions!

You can download the slides for this presentation from SlideShare.


Data Integration Research Blogging Semantics and Ontologies Standards

Informal Knowledge Sharing in Science via Social Networking

This is a cross-posted item available both from this, my home blog, and, a new blog specifically concerned with “news and information about activities related to the development of data policies and standards in the biological domain, in particular for the area of ‘omics”. You can find the post on at: .

Recently, more and more biologists, bioinformaticians, and scientists in general have been discovering the usefulness of social networking, microblogging, and blogging for their work. Increasingly, social networking applications such as FriendFeed and Twitter are becoming popular for the discovery of new research in a timely manner, for interactions and possible collaborations with like-minded researchers, and for announcing work that you’re doing. Sharing data and knowledge in biology should not just be limited to formal publications and databases. “Biosharing” can also be informal, and social networking is an important tool in informally conveying scientific knowledge. But how should you get started in this new world? Here are my experiences with it, together with some links and thoughts to help you get started.

I created my FriendFeed account in Fall 2008 and my Twitter account last month. Why did I start using these social networking sites? Well, with FriendFeed, I had noticed many of my work colleagues starting to use it, but had no real understanding as to why they were so evangelical about it. With Twitter, I held out longer but eventually realised it was a really quick and easy way to get my messages across. The reason I did it and why they are useful to me comes down to a simple answer.

I am interested in sharing knowledge. Social networking promotes an informal sharing of knowledge in a way complementary to more traditional, formal methods of knowledge sharing.

And if you’re interested in knowledge sharing, then you should look into social networking. My research focuses on semantic data integration. I have a further interest in common data formats to enable data / knowledge sharing. As I am quite vocal about getting people interested in formal methods of knowledge sharing such as the triumvirate of MIBBI, OBI, and FuGE / ISA-TAB for experimental data1 (and many, many more), it behooved me to learn about the informal methods.

Social Networking: Day-To-Day

But what convinced me that social networking for science was useful? By December I had a realisation: this social networking stuff was giving me more information directly useful to my research than any other resource I had used in the past. Period. You can see my happiness in this blog post from December, where I showed how, these days, I get more useful citations of papers I’m interested in via my friends’ citeulike feeds on FriendFeed than I ever have managed from the PubMed email alerts. What convinced me is not a what, but a who.

Social networking for science is an informal data integrator because of the people that are in that network.

It’s all about the people. I have met loads of new friends that have similar research interests via the “big 2” (FriendFeed and Twitter). I get knowledge and stay up to date on what’s happening in my area of the research world. I make connections.

What is FriendFeed? At its most basic definition, it is a “personal” RSS aggregator that allows comments on each item that is aggregated. For instance, I’ve added slides, citations, my blogs, my SourceForge activity and more to FriendFeed:

A screenshot of the services page of my FriendFeed account

There are loads of other RSS feeds you can add to FriendFeed. Then, when people add your feed to their accounts, they can see your activity and comment on each item. You gradually build up a network of like-minded people. Additionally, you can post questions and statements directly to FriendFeed. This is useful as a form of microblogging, or posting short pieces of useful information quickly.

A screenshot of my Twitter feed

What is Twitter? It’s a bit like instant messaging to the world. You can say whatever you like in 140 characters or fewer, and it is published on your page (here’s mine). Just like with FriendFeed, you can follow anyone else’s Twitter feed. You can even put your Twitter feed into FriendFeed. People have a tendency to over-tweet, and write loads of stuff. I use it, but only for work, and only for things that I think might be relevant for quick announcements. If Doug Kell tweets, shouldn’t you? 😉

Other people have posted on how FriendFeed is useful to them in their scientific work, such as Cameron Neylon (who has some practical advice too), Deepak Singh and Neil Saunders who talk about specific examples, and Simon Cockell who has written about his experiences with FriendFeed and Twitter. I encourage you to have a read of their posts.

You don’t have to spend ages on FriendFeed and Twitter to get useful information out of it. Start simply and don’t get social networking burnout.

Ask questions about science you can’t answer in your own physical network at the office (Andrew Clegg did it, and have a look at the resulting discussion on FriendFeed and blog summary from Frank Gibson!). Post interesting articles. Ignore it for a week or more if you want: interesting stuff will be “liked” by other people in your network and will stay at the top of the feed. Trust the people in your network, and make use of their expertise in sending the best stuff to the top, if you don’t have the time to read everything. Don’t be afraid to read everything, or to read just the top two or three items in your feed.

Social Networking: Conferences and Workshops

These “big 2” social networking apps are really useful when it comes to conferences, where they are used to microblog talks, discussions, and breakout sessions. For detailed information on how they are used in such situations, see the conference report for ISMB 2008 in PLoS Computational Biology by Saunders et al. BioSysBio 2009 also used these techniques (conference report, FriendFeed room, Twitter hashtag).

Social Networking: What should I use?

Other social networking sites, billed as “especially for scientists”, have been cropping up left, right and centre in the past year or two. There are “Facebooks for Scientists”2 (there are more than 20 listed here, just to get you started, and other efforts more directed at linking data workflows such as myExperiment). So, should we be using these scientist-specific sites? I certainly haven’t tried them all, so I cannot give you anything other than my personal experience.

As you can see from my FriendFeed screenshot, I belong to “Rooms” in FriendFeed as well as connecting directly with people’s feeds. Rooms such as The Life Scientists, with over 800 subscribers, get me answers to sticky questions I wouldn’t otherwise know how or where to ask (see here for an example). These, and the people I choose to link with directly, give me all of the science-specific discussions I could want.

The more general the social networking application is and the larger the user-base it has, the more likely it is to be around next year.

Right now, I don’t need any of the specialty features I’d get with a scientist-specific social networking application. I think the big names are more likely to reach a wider audience of like-minded folk.

Final Thoughts

Remember you’re broadcasting to the world. Only put stuff in that you think others will be interested in. This is a public face for you and your career.

I am a strong believer in keeping the personal parts of my life private (the entire world doesn’t need – or want – to know about my cat or see the pictures of my nephew)  while at the same time making sure that I am really easy to reach for work-related discussions and collaborations. Through my blog, and my social networking, I am gaining a fuller appreciation of the work going on in the research community around me and contributing to the resulting large experiment in informal data integration.

It is fun: I meet new people and have interesting conversations. It is useful to my career: my blogging has resulted in an invitation to co-author two conference reports, and shows me new things happening in my field earlier than before. I’m all about sharing biological knowledge. I’m researching the formal side of data integration and sharing, and I’m using informal knowledge sharing to help me do my work.

I hope to see you there soon! Look me up!


  1. For a very nice overview of these standards, see Frank Gibson’s blog.
  2. While I am on Facebook, I do not use it for work purposes, and therefore cannot comment on its applicability for scientists.

Lister A, Charoensawan V, De S, James K, Janga SC, & Huppert J (2009). Interfacing systems biology and synthetic biology. Genome biology, 10 (6) PMID: 19591648
Saunders, N., Beltrão, P., Jensen, L., Jurczak, D., Krause, R., Kuhn, M., & Wu, S. (2009). Microblogging the ISMB: A New Approach to Conference Reporting PLoS Computational Biology, 5 (1) DOI: 10.1371/journal.pcbi.1000263

Data Integration Research Blogging

One way for RDF to help a bioinformatician build a database: S3DB

This post is part of the PLoS One synchroblogging day, as part of the PLoS ONE @ Two birthday celebrations. Happy Synchroblogging! Here’s a link to the paper on the PLoS One website.

Biological data: vitally important, determinedly unruly. This challenge facing the life-science community has been present for decades, as witnessed by the often exponential growth of biological databases (see the classic curve in the current graphs of UniProt1 and EMBL if you don’t believe me). It’s important to me, as a bioinformatics researcher whose main focus is semantic data integration, but it should be important to everyone. Without manageable data that can be easily integrated, all of our work suffers. Nature thinks it’s important: it recently devoted an entire issue to Big Data. Similarly, the Journal of Biomedical Informatics just had a Semantic Mashup special issue. Deus et al. (the paper I’m blogging about, published in PLoS One this summer) agree, beginning with “Data, data everywhere”, nicely encapsulating both the joy and the challenge in one sentence.

This paper describes work on a distributed management system that can link disparate data sources using methodologies commonly associated with the semantic web (or is that Web 3.0?). I’m a little concerned (not at the paper, just in general) at the fact that we seem to already have a 3.0 version of the web, especially as I have yet to figure out a useful definition for semantic web vs Web 2.0 vs Web 3.0. Definitions of Web 3.0 seem to vary wildly: is it the semantic web? Is it the -rwx- to Web 1.0’s -r-- and Web 2.0’s -rw-- (as described here, and cited below)? Are these two definitions one and the same? Perhaps these are discussions for another day… Ultimately, however, I have to agree with the authors that “Web 3.0” is an unimaginative designation2.

So, how can the semantic web help manage our data? That would be a post in itself, and is the focus of many PhD projects (including mine). Perhaps a better question is how does the management model proposed by Deus et al. use the semantic web, and is it a useful example of integrative bioinformatics?

Their introduction focuses on two types of integration: data integration as an aid to holistic approaches such as mathematical modelling, and software integration which could provide tighter interoperability between data and services. They espouse (and I agree) the semantic web as a technology which will allow the semantically-meaningful specification of desired properties of data in a search, rather than retrieving data in a fixed way from fixed locations. They want to extend semantic data integration from the world of bioinformatics into clinical applications. Indeed, they want to move past “clandestine and inefficient flurry of datasets exchanged as spreadsheets through email”, a laudable goal.

Their focus is on a common data management and analysis infrastructure that does not place any restrictions on the data stored. This also means multiple instances of light-weight applications are part of the model, rather than a single central application. The storage format is of a more general, flexible nature. Their way of getting the data into a common format, they say, is to break down the “interoperable elements” of the data structures into RDF triples (subject-predicate-object statements). At its most basic, their data structure has two types of triples: Rules and Statements. Rules are phrases like “sky has_color”, while statements add a value to the phrase, e.g. “today’s_sky has_color blue”.
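To make the Rule / Statement distinction a little more concrete, here is a small Python sketch of how I picture it. The class and field names are my own invention for illustration, and this is not the actual S3DB schema or API: a Rule is the reusable "subject predicate" phrase, and a Statement fills in a value for one particular item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A reusable phrase such as "sky has_color" (hypothetical structure)."""
    subject: str
    predicate: str


@dataclass(frozen=True)
class Statement:
    """A Rule applied to one item, with a value filled in."""
    item: str
    rule: Rule
    value: str

    def as_triple(self):
        # Subject-predicate-object form of the statement.
        return (self.item, self.rule.predicate, self.value)


def statements_for(rule, statements):
    """All statements that instantiate the given rule."""
    return [s for s in statements if s.rule == rule]


has_color = Rule("sky", "has_color")
today = Statement("today's_sky", has_color, "blue")
triple = today.as_triple()
```

In this picture the Rules act as a lightweight, user-editable schema, while the Statements carry the data; both end up as plain triples.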

They make the interesting point that the reclassification of data from flat files to XML to RDF to Description Logics starts to dilute “the distinction between data management and data analysis”. While it is true that if you are able to store your data in formats such as OWL-DL3, the format is much more amenable to direct computational reasoning and inference, perhaps a more precise statement would be that the distinction between performance of data management tasks and data analysis tasks will blur with richer semantic descriptions of both the data and their applications. As they say later in the paper, once the data and the applications are described in a way that is meaningful for computation, new data being deposited online could automatically trigger a series of appropriate analysis steps without any human input.

A large focus of the paper was on identity, both of the people using it (and therefore addressing the user requirement of a strong permissions system) and of the entities in the model and database (each identified with some type of URI). This theme is core to ensuring that only those with the correct permissions may access possibly-sensitive data, and that each item of information can be unambiguously defined. I like that the sharing of “permissions between data elements in distinct S3DB deployments happens through the sharing the membership in external Collections and Rules…not through extending the permission inheritance beyond the local deployment”. It seems a useful and straightforward method of passing permissions.

I enjoyed the introduction, background, and conclusions. Their description of the Semantic Web and how it could be employed in the life sciences is well-written and useful for newcomers to this area of research. Their description of the management model as composed of subject-predicate-object RDF triples plus membership and access layers was interesting. Their website was clear and clean, and they had a demo that worked even when I was on the train4. It’s also rather charming that “S3DB” stands for Simple Sloppy Semantic Database – they have to get points for that one5! However, the description of their S3DB prototype was not extensive, and as a result I have some basic questions, which can be summarized as follows:

  • How do they determine what the interoperable elements of different data structures are? Manually? Computationally? Is this methodology generic, or does it have to be done with each new data type?
  • The determination of the maturity of a data format is not described, other than that it should be a “stable representation which remains useful to specialized tools”. For instance, the mzXML format is considered mature enough to use as the object of an RDF triple. What quality control is there in such cases: in theory, someone could make a bad mzXML file. Or is it not the format which is considered mature, but instead specific data sets that are known to be high quality?
  • I would have liked to see more detail in their practical example. Their user testing was performed together with the Lung Cancer SPORE user community. How long did the trial last? Was there some qualitative measurement of how happy they were with it (e.g. a questionnaire)? The only requirement gathered seems to have been that of high-quality access control.
  • Putting information into RDF statements and rules in an unregulated way will not guarantee a data set that can be integrated with other S3DB implementations, even if they are of the same experiment type. This problem is exemplified by a quote from the paper (p. 8): “The distinct domains are therefore integrated in an interoperable framework in spite of the fact that they are maintained, and regularly edited, by different communities of researchers.” The framework might be identical, but that doesn’t ensure that people will use the same terms and share the same rules and statements. Different communities could build different statements and rules, and use different terms to describe the same concept. Distributed implementations of S3DB databases, where each group can build their own data descriptions, do not lend themselves well to later integration unless they start by sharing the same ontology/terms and core rules. And, as the authors encourage the “incubation of experimental ontologies” within the S3DB framework, chances are that there will be multiple terms describing the same concept, or even one word that has multiple definitions in different implementations. While they state that data elements can be shared across implementations, it isn’t a requirement and could lead to the problems mentioned. I have the feeling I may have gotten the wrong end of the stick here, and it would be great to hear if I’ve gotten something wrong.
  • Their use of the rdfs:subClassOf relation is not ideal. A subclass relation is a bit like saying “is a”, (defined here as a transitive property where “all the instances of one class are instances of another”) therefore what their core model is saying with the statement “User rdfs:subClassOf Group” is “User is a Group”. The same thing happens with the other uses of this relation, e.g. Item is a Collection.  A user is not a group, in the same way that a single item is not a collection. There are relations between these classes of object, but rdfs:subClassOf is simply not semantically correct. A SKOS relation such as skos:narrower (defined here as “used to assert a direct hierarchical link between two SKOS concepts”) would be more suitable, if they wished to use a “standard” relationship. I particularly feel that I probably misinterpreted this section of their paper, but couldn’t immediately find any extra information on their website. I would really like to hear if I’ve gotten something wrong here, too.
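The last point in the list above is easier to see with a toy example. The following Python sketch hand-rolls just two of the standard RDFS entailment patterns (class membership propagating along rdfs:subClassOf, and the transitivity of rdfs:subClassOf) over plain tuples. The individual "alice" is made up, and this is only an illustration of why the relation matters, not a full RDFS reasoner and not anything taken from S3DB itself.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Apply two RDFS entailment patterns until nothing new is inferred."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in inferred:
                if o2 != s:
                    continue
                if p2 == TYPE:
                    # x rdf:type s, s rdfs:subClassOf o  =>  x rdf:type o
                    new.add((s2, TYPE, o))
                elif p2 == SUBCLASS:
                    # s2 rdfs:subClassOf s, s rdfs:subClassOf o  =>  s2 rdfs:subClassOf o
                    new.add((s2, SUBCLASS, o))
        if new <= inferred:
            return inferred
        inferred |= new


# With rdfs:subClassOf, every User is inferred to be a Group.
with_subclass = rdfs_closure({("User", SUBCLASS, "Group"), ("alice", TYPE, "User")})

# With a SKOS link instead, these patterns infer nothing about membership.
with_skos = rdfs_closure({("Group", "skos:narrower", "User"), ("alice", TYPE, "User")})
```

In the first case the closure contains the triple saying that alice is a Group, which is exactly the "User is a Group" reading I am objecting to; in the second case no such triple appears.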

Also, although this is not something that should have been included in the paper, I would be curious to discover what use they think they could make of OBI, which would seem to suit them very well6. An ontology for biological and biomedical investigations would seem a boon to them. Further, such a connection could be two-way: the S3DB people probably have a large number of terms, gathered from the various users who created terms to use within the system. It would be great to work with the S3DB people to add these to the OBI ontology. Let’s talk! 🙂

Thanks for an interesting read!

1. Yes, I’ve mentioned to the UniProt gang that they need to re-jig their axes in the first graph in this link. They’re aware of it! 🙂
2. Although I shouldn’t talk: I am horrible at naming things, as the title of this blog shows.
3. A format for ontologies using Description Logics that may be saved as RDF. See the official OWL docs.
4. Which is a really flaky connection, believe me!
5. Note that this expanded acronym is *not* present in this PLoS One paper, but is on their website.
6. Note on personal bias: I am one of the core developers of OBI 🙂

Helena F. Deus, Romesh Stanislaus, Diogo F. Veiga, Carmen Behrens, Ignacio I. Wistuba, John D. Minna, Harold R. Garner, Stephen G. Swisher, Jack A. Roth, Arlene M. Correa, Bradley Broom, Kevin Coombes, Allen Chang, Lynn H. Vogel, Jonas S. Almeida (2008). A Semantic Web Management Model for Integrative Biomedical Informatics PLoS ONE, 3 (8) DOI: 10.1371/journal.pone.0002946
Z. Zhang, K.-H. Cheung, J. P. Townsend (2008). Bringing Web 2.0 to bioinformatics Briefings in Bioinformatics DOI: 10.1093/bib/bbn041

Data Integration Research Blogging

Adding informative metadata to bioinformatics services

Carole Goble and the other authors of “Data curation + process curation = data integration + science” have written a paper on the importance of curating not just the services used in bioinformatics, but also how they are used. Just as more and more biologists are becoming convinced of the importance of storing and annotating their data in a common format, so should bioinformaticians take a little of their own medicine and ensure that the services they produce and use are annotated properly. I personally feel that it is just as important to ensure that in silico work is properly curated as it is in the more traditional, wet-lab biological fields.

They mention a common feature of web services and workflows: namely, that they are generally badly documented. Just as the majority of programmers leave it until the last possible minute to comment their code (if they comment at all!), so also are many web services annotated very sparsely, and not necessarily in a way that is useful to either humans or computers. I remember that my first experience with C code was trying to take over a bunch of code written by a C genius, who had but one flaw: a complete lack of commenting. Yes, I learnt a lot about writing efficient C code from his files, but it took me many hours more than it would have done if there had been comments in there!

They touch briefly on how semantic web services (SWS) could help, e.g. using formats such as OWL-S and SAWSDL. I recently read an article in the Journal of Biomedical Informatics (Garcia-Sanchez et al. 2008, citation at the end of this post) that had a good introduction to both semantic web services and, to a lesser extent, multi-agent systems that could autonomously interact with such services. While the Goble et al. paper did not go into as much detail as the Garcia-Sanchez paper did on this point, it was nice to learn a little more about what was going on in the bioinformatics world with respect to SWS.

Their summary of the pitfalls to be aware of due to the lack of curated processes was good, as was their review of currently-existing catalogues and workflow and WS aggregators. The term “Web 2.0” was used, in my opinion correctly, but I was once again left with the feeling that I haven’t seen a good definition of what Web 2.0 is. I must hear it talked about every day, and haven’t come across any better definition than Tim O’Reilly’s. Does anyone reading this want to share their “favorite” definition? This isn’t a failing of this paper – more of my own lack of understanding. It’s a bit like trying to define “gene” (this is my favorite) or “systems biology” succinctly and in a way that pleases most people – it’s a very difficult undertaking! Another thing I would have liked to have seen in this paper, but which probably wasn’t suitable for the granularity level at which this paper was written, is a description and short analysis of the traffic and usage stats for myExperiment. Not a big deal – I’m just curious.

As with anything in standards development, even though there are proposed minimal information guidelines for web services out there (see MIAOWS), the main problem will always be lack of uptake and getting a critical mass (also important in community curation efforts, by the way). In my opinion, a more important consideration for this point is that getting a MIA* guideline to be followed does not guarantee any standard format. All it guarantees is a minimal amount of information to be provided.

They announce the BioCatalogue in the discussion section of this paper, which seems to be a welcome addition to the attempts to get people to annotate and curate their services in a standard way, and store them in a single location. It isn’t up and running yet, but is described in the paper as a web interface to more easily allow people to annotate their WSDL files, whereas previous efforts have mainly focused on the registry aspects. Further information can be associated with these files once they are uploaded to the website. However, I do have some questions about this service. What format is the further information (ontology terms, mappings) stored in? Are the ontology terms somehow put back into the WSDL file? How will information about the running of a WS or workflow be stored, if at all? Does it use a SWS format? I would like to see performances of bioinformatics workflows stored publicly, just as performances of biological workflows (e.g. running a microarray experiment) can be. But I suppose many of these questions would be answered once BioCatalogue is in a state suitable for publishing on its own.

In keeping with this idea of storing the applications of in silico protocols and software in a standard format, I’d like to mention one syntax standard that might be of use in storing both descriptions of services and their implementation in specific in silico experiments: FuGE. While it does not currently have the structures required to implement everything mentioned in this paper (such as operational capability and usage/popularity scores) in a completely explicit way, many of the other metadata items that this paper suggests can already be stored within the FuGE object model (e.g. provenance, curation provenance, and functional capability). Further, FuGE is built as a model that can easily be extended. There is no reason why we cannot, for example, build a variety of Web services protocols and software within the FuGE structure. One downside of this method would be that information would be stored in the FuGE objects (e.g. a FuGE database or XML file) and not in the WSDL or Taverna workflow file. Further, there is no way to “execute” FuGE XML files, as there is with Taverna files or WSs. However, if your in silico experiment is stored in FuGE, you immediately have your computational data stored in a format that can also store all of the wet-lab information, protocols, and applications of the protocols. The integration of your analyses with your wet-lab metadata would be immediate.

In conclusion, this paper presents a summary of a vital area of bioinformatics research: how, in order to aid data integration, it is imperative that we annotate not just wet-lab data and how they were generated, but also our in silico data and how they were generated. Imagine storing your web services in BioCatalogue and then sharing your entire experimental workflows, data and metadata with other bioinformaticians quickly and easily (perhaps using FuGE to integrate in silico analyses with wet-lab metadata, producing a full experimental metadata file that stores all the work of an experiment from test tube to final analysis).

Goble C, Stevens R, Hull D, Wolstencroft K, Lopez R. (2008). Data curation + process curation=data integration + science. Briefings in bioinformatics PMID: 19060304

Garcia-Sanchez F, Fernandez-Breis J, Valencia-Garcia R, Gomez J, Martinez-Bejar R (2008). Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources Journal of Biomedical Informatics, 41 (5), 848-859 DOI: 10.1016/j.jbi.2008.05.007

Data Integration Meetings & Conferences

FuGE / ISA-TAB Workshop, Day 1

Today was the first day of the workshop – back at the good old EBI, though it isn't as recognizable as it used to be. Sure, there is the new EBI extension, but I am used to that now. However, they're renovating the inside of the old EBI building as well, reducing many of my friends to portakabin living over the winter months: better them than me!

Today definitely had an emphasis on the "work" part of "workshop". While a large part of the work on the XSLT for converting between FuGE and ISA-TAB is complete, some of the slightly stickier areas of the conversion are still being worked on. We spent today on trying to iron out some of the difficulties that arise from trying to convert the sort of rich tree structure that you get from the XML implementation of FuGE (FuGE-ML) into the flatter tabular format of ISA-TAB. Below are some of the more general ideas that we were throwing around as a result. (Some are more directly related to the conversion process than others – but all raise interesting points to me.)

  • One of the columns in the ISA-TAB Assay file is currently named "Raw Data File" in the 1.0 Specification. This caused a large amount of discussion about what "raw" means, since many people have different ideas of what a raw data file is. It was originally named this way to act as a foil to another (optional) column, "Derived Data File". However, derived data files have a more precise definition in ISA-TAB: such a column can only be used to name files resulting from data transformations or processing. In the end, we are considering a name change, from "Raw Data File" to "Data File".
  • In the end, there will be a few simple ways to format your FuGE-ML files in a way that will aid the conversion into ISA-TAB. It would be useful to eventually produce a set of guidelines to aid in interoperability.
  • Some of the developers already using FuGE (myself included) use the <Description> element within a FuGE-ML file to let our biologists give a free-text description to both materials and data files. These objects have no specific element for such information, so the generic Description element is the best location. This isn't exactly in line with FuGE best practices, where the default Description elements are really only meant for private comments within a local FuGE implementation, and can normally be ignored by external bioinformaticians making use of your FuGE-ML. Such material and data descriptions can be copied into the ISA-TAB file as free text within the Comment[] columns, where what sits within the "[]" is the material or data identifier. We'll have to see if this idea turns out to be useful.
  • The main challenge in collapsing FuGE-ML into ISA-TAB is ensuring that the multi-level protocol application structures (for more information, see the GenericProtocolApplication and GenericProtocol objects within the FuGE Object Model) are correctly converted. We spent the majority of today trying to figure out an elegant way of doing this. We'll work on it again tomorrow, and will hopefully have a new version of the XSLT with a first-bash solution tomorrow evening!
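To make the tree-to-table problem a little more concrete, here is a minimal Python sketch of one possible way to collapse nested protocol applications into flat rows. It is not the workshop's XSLT and it does not use the real FuGE-ML schema: the element and attribute names in the toy XML are invented for illustration, and the "Input" and "Output" column labels are simplified stand-ins rather than actual ISA-TAB column names. Each root-to-leaf chain of protocol applications becomes one row, and any Description found along the way is copied into a Comment[identifier] column.

```python
import xml.etree.ElementTree as ET

# Toy XML only: these element and attribute names are hypothetical
# and are NOT the real FuGE-ML schema.
TOY_XML = """
<Investigation>
  <ProtocolApplication protocol="growth" input="culture1" output="sample1">
    <Description>Grown at 30C overnight</Description>
    <ProtocolApplication protocol="extraction" input="sample1" output="extract1">
      <ProtocolApplication protocol="scan" input="extract1" output="scan1.dat"/>
      <ProtocolApplication protocol="scan" input="extract1" output="scan2.dat"/>
    </ProtocolApplication>
  </ProtocolApplication>
</Investigation>
"""


def leaf_paths(node, path=()):
    """Yield every root-to-leaf chain of nested protocol applications."""
    path = path + (node,)
    children = node.findall("ProtocolApplication")  # direct children only
    if not children:
        yield path
        return
    for child in children:
        yield from leaf_paths(child, path)


def to_rows(xml_text):
    """Flatten the toy tree into rows of (column, value) pairs.

    Pairs are used instead of a dict because a tabular layout like this
    repeats column names (for example "Protocol REF") within one row.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for top in root.findall("ProtocolApplication"):
        for path in leaf_paths(top):
            row = [("Input", path[0].get("input"))]
            for app in path:
                row.append(("Protocol REF", app.get("protocol")))
                row.append(("Output", app.get("output")))
                description = app.findtext("Description")
                if description:
                    # Free text goes into a Comment[identifier] column.
                    column = f"Comment[{app.get('output')}]"
                    row.append((column, description.strip()))
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in to_rows(TOY_XML):
        print("\t".join(f"{column}={value}" for column, value in row))
```

Note that the shared upstream steps (growth and extraction) are repeated in both rows. That repetition is the price of the flatter format, and it is also why a row-based file cannot always be turned back into exactly the same tree.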



CISBAN Data Integration Meetings & Conferences Software and Tools Standards

Pre-workshop post on the FuGE / ISA-TAB Workshop, 8-9 December

Tomorrow is the first day of a two-day workshop set up to continue the integration process between the ISA-TAB format and the FuGE standard. (Well, technically, it starts tonight with a workshop dinner, where I'll get to catch up with the people in the workshop, many of whom I haven't seen since the MGED 11 meeting in Italy this past summer. Should be fun!)

ISA-TAB can be seen as the next generation of MAGE-TAB, a very popular format with biologists who need to get their data and metadata into a common format accepted by public repositories such as ArrayExpress. ISA-TAB goes one step further, and does for tabular formats what FuGE does for object models and XML formats: that is, it is able to represent multi-omics experiments rather than just the transcriptomics experiments of MAGE-TAB. I encourage you to find out more about both FuGE and ISA-TAB by looking at their respective project pages. The FuGE group also has a very nice introduction to the model in their Nature Biotechnology article.

Each day I'll provide a summary of what's gone on at the workshop, which centers on the current status of both ISA-TAB and some relevant FuGE extensions, as well as the production of a seamless conversion from FuGE-ML to ISA-TAB and back again. ISA-TAB necessarily cannot handle as much detail as the FuGE model can (being limited by the tabular format), and therefore the conversion in the FuGE-ML to ISA-TAB direction may not be entirely lossless. However, this workshop and all the work that's gone on around it aims to reconcile the two formats as much as possible. And, even though I have mentioned a caveat or two, this reconciliation is entirely possible: both ISA-TAB and FuGE share the same high-level structures. Indeed, ISA-TAB was created with FuGE in mind, to ensure that such a useful undertaking used all it could of the FuGE Object Model. It is important to remember that FuGE is an abstract model which can be converted into many formats, including XML. Because it is an abstract model, many projects can make use of its structures while maintaining whatever concrete format they wish.

Specific topics of the workshop include:

  • Advance and possibly finalize the XSLT rendering of FuGE documents into ISA-TAB. This includes finishing off the generic FuGE XSL stylesheet.
  • Work on some of the extensions, including FCM, Gel-ML, and MAGE2. MAGE2 is the most interesting for me for this workshop, as I've heard that it's almost complete. This is the XML format that is a direct extension of the FuGE model, and will be very useful for bioinformaticians wishing to store, share and search their transcriptomics data using a multi-omics standard like FuGE.

Thanks to Philippe Rocca-Serra and Susanna-Assunta Sansone for the hard work they've done on the format specification, and to everyone who's coming today. It's a deliberately small group so that we can spend our time in technical discussion rather than in presentations. I'm a bit of a nut about data and metadata standards (and am in complete agreement with Frank over at peanutbutter on the triumvirate of experimental standards) and so I love these types of meetings. It's going to be fun, and I'll keep you updated!



Data Integration In The News

Two Journal Special Issues: Big Data, and Semantic Mashups for Bioinformatics

Both of these special issues are worth a look, as some of the papers look pretty interesting. I'll spend a little time in a later post on any articles I find particularly relevant.

  • Semantic Mashup of Biomedical Data Special Issue of the Journal of Biomedical Informatics. This includes a review article by Carole Goble and Robert Stevens: State of the nation in data integration for bioinformatics
  • Nature's Big Data Special Issue. The article entitled "How do your data grow?" was one of the many articles in this issue that I enjoyed. It's interesting to note that these problems in the management and curation of big data are only now getting special attention in Nature. When I worked at the EBI, it was common knowledge among the database curators that 1) it would be very difficult for them to find other work as curators if they left the EBI, and 2) the time and high skill level it takes to annotate and curate biological database entries mean that it is very difficult to get high coverage in such databases. It's nice to finally see some recognition of all the work the biocurators do by a journal such as Nature. Finally, there are high-profile articles stating that curation begins at home, with the researcher, and that curation needs much more support from researcher level all the way up to the level of the database curators.



CISBAN Data Integration Semantics and Ontologies Software and Tools Standards

Of GelML and MFO

A couple of papers from here at Newcastle University have appeared over the past couple of weeks. Here's a summary of them both.

  • Data Standards
    From "An Update on Data Standards for Gel Electrophoresis" in Practical Proteomics Issue 1, September 2007, and by Andrew R. Jones and Frank Gibson.
    From the abstract: "We report on standards development by the Gel Analysis Workgroup of the Proteomics Standards Initiative. The workgroup develops reporting requirements, data formats and controlled vocabularies for experimental gel electrophoresis, and informatics performed on gel images. We present a tutorial on how such resources can be used and how the community should get involved with the on-going projects. Finally, we present a roadmap for future developments in this area."
    Provides a summary of ongoing work in the Gel electrophoresis and Gel informatics fields in terms of data and metadata standardization. This includes work on MIAPE GE and MIAPE GI, two checklists for minimal information required on these types of experiments and analyses. For both GE and GI, there are data formats (GelML and GelInfoML, respectively, both extensions of FuGE) and a suggested controlled vocabulary (sepCV). More information can be found on
    Frank works in the CARMEN neuroscience project here at Newcastle, and Andy is in Liverpool and works on, among other things, FuGE. CARMEN collaborates with the SyMBA project, which was originally developed by me and a few others within Neil Wipat's Integrative Bioinformatics Group here at Newcastle but which is now a SourceForge project. Andy Jones is a co-author with me, Neil Wipat, Matt Pocock and Olly Shaw on an upcoming SyMBA paper.
  • Semantic Data Integration
    A paper that was presented at the Integrative Bioinformatics Conference 2007 by me and my co-authors, Matt Pocock and Neil Wipat, is now available from the Journal of Integrative Bioinformatics website.
    Allyson L. Lister, Matthew Pocock, Anil Wipat. Integration of constraints documented in SBML, SBO, and the SBML Manual facilitates validation of biological models. Journal of Integrative Bioinformatics, 4(3):80, 2007.
