Papers Standards

BioSharing is Caring: Being FAIR

FAIR: Findable, Accessible, Interoperable, Reusable
Source: Scientific Data via March 16, 2016.

In my work for BioSharing, I get to see a lot of biological data standards. Although you might laugh at the odd dichotomy of multiple standards (rather than One Standard to Rule Them All), there are reasons for it. Some of those reasons are historical, such as a lack of cross-community involvement during inception of standards, and some are technical, such as vastly different requirements in different communities. The FAIR paper, published yesterday by Wilkinson et al. (and by a number of my colleagues at BioSharing) in Scientific Data, helps guide researchers towards the correct standards and databases by clarifying data stewardship and management requirements. If the principles are applied correctly, a researcher can be assured that as long as a resource is FAIR, it’s fine.

This article describes four foundational principles—Findability, Accessibility, Interoperability, and Reusability—that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added-value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to ‘data’ in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects—from data to analytical pipelines—benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability. (doi:10.1038/sdata.2016.18)

This isn’t the first time curators, bioinformaticians and other researchers have shouted out the importance of being able to find, understand, copy and use data. But any help in spreading the message is more than welcome.


Need more help finding the right standard or database for your work? Visit BioSharing!

Further information:

Data Integration Housekeeping & Self References Papers

PhD Thesis: Table of Contents

Below you can find a complete table of contents for all thesis-related posts (you can also get to the posts via the “thesis” tag I have used for each). Enjoy!

Thesis Posts

And, if you’re interested in how I performed the conversion, I’ve written about that too.

Papers Software and Tools

Google Scholar Citations and Microsoft Academic Search: first impressions

There’s been a lot of chat in my scientific circles lately about the recent improvements in freely-available, easily-accessible web applications to organise and store your publications. Google Scholar Citations (my profile) and Microsoft Academic Search (my profile) are the two main contenders, but there are many other resources to use which are mighty fine in their own right (see my publication list on Citeulike for an example of this). Some useful recent blog posts and articles on this subject are:

  • Egon Willighagen’s post includes lots of good questions about the future of free services like GSC and MAS, and how they relate to his use of more traditional services such as the Web Of Science.
  • Alan Cann’s impressions of GSC, including a nice breakdown of his citations by type.
  • Jane Tinkler’s comparison of GSC and MAS, together with a nice description of what happens when a GSC account is made (HT Chris Taylor @memotypic for the link to the article).
  • Nature News’ comparison of GSC and MAS, and impressions of how these players might change the balance of power between free and non-free services.

While I do like each of these technologies for a number of reasons, there are also reasons to be less happy with them. Firstly, their similarities (please be aware I am not trying to make an exhaustive list – just my impressions after using each product for a few days). They both allow the merging of authors, a feature that was very useful to me as I changed my name when I got married. Neither service has a fantastic interface for the merge, but it worked. Both provided some basic metrics: GSC has the h-index and the i10 index, while MAS uses the g and h indexes. Both tell you how many other papers have cited each of your publications. Both seem to get things mostly right (and a little bit wrong) when assigning publications to me – I had to manually fix both apps. Both provide links to co-authors, though GSC’s is rather limited, as you have to actively create a profile there while with MAS profiles are built automatically.
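For anyone unsure what these three metrics actually measure, here is a minimal Python sketch using the standard textbook definitions. Neither service explains here exactly how it computes its numbers, so the function names and the sample citation counts are my own illustrative assumptions, not anything taken from GSC or MAS.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations.

    This variant caps g at the number of papers in the list.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Made-up citation counts for five papers (illustrative only).
sample = [10, 8, 5, 4, 3]
print(h_index(sample), i10_index(sample), g_index(sample))
```

The g-index rewards a few highly cited papers more than the h-index does, which is one reason the two services can paint slightly different pictures of the same publication list.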

Things I like about Microsoft Academic Search:

  1. Categorisation of publications. You can look down the left-hand side and see your papers categorised by type, keyword, etc.
  2. Looks nicer. Yes, I like Google. But Microsoft’s offering is just a lot better looking.
  3. Found more ancillary stuff. It found my work page (though the URL has since changed), and from there a picture of me. Links out to Bing (of course) and nice organisation of basic info really just make it look more professional than GSC.
  4. Bulk import of citations in Bibtex format. I really like this feature – I was able to bulk add the missing citations in one fell swoop using a bibtex upload. Shiny!

Things I don’t like about Microsoft Academic Search:

  1. Really slow update time. It insists on vetting each change with a mysterious Microsoftian in the sky. I’ve made a bunch of changes to the profile, updated and added publications, and days later it still hasn’t shown those changes. It’s got to get better if it doesn’t want to irritate me. Sure, do a confirmation step to ensure I am who I say I am, but then give me free rein to change things!
  2. Silverlight. I’ve tried installing moonlight, which seemed to install just fine, but then the co-author graph just showed up empty. Is that a fault with moonlight, or with the website?
  3. Did I mention the really slow update time?

Things I like about Google Scholar Citations:

  1. Changes are immediately visible. Yes, if I merge authors or remove publications or anything else, it shows up immediately on my profile.
  2. No incompatibilities with Linux. All links work, no plugins required.

Things I don’t like about Google Scholar Citations:

  1. Lack of interesting extras. The graphs, fancy categorisations etc. you get with MAS you don’t (yet) get with the Google service.
  2. No connection with the Google profile. Why can’t I provide a link to my Google profile, and then get integration with Google+, e.g. announcements when new publications are added? This is a common complaint with Google+, as many other Google services (e.g. Google reader) aren’t yet linked with it, but hopefully this will come eventually.
  3. Not as pretty. Also, I’m not sure if it’s just my types of papers, but the links in GSC to the individual citations are difficult to read, and it’s hard to determine the ultimate source of the article (e.g. journal or conference name).

I will still use Citeulike as my main publication application. I use it to maintain the library of my own papers and other people’s papers. Its import and export features for bibtex are great, and it can convert web pages to citations with just one click (or via a bookmarklet). It has loads of other bells and whistles as well. While I’m writing up my thesis, I visit it virtually every day to add citations and export bibtex.

So, between Google and Microsoft, which do I like better? They’re very similar, but Microsoft Academic Search wins right now. But both services are improving daily, and we’ll have to see how things change in future.

And the thing that really annoys me? I now feel the need to keep my publications up to date on three systems: Citeulike (it’s the thing I actually use when writing papers etc.), Microsoft Academic Search, and Google Scholar Citations. No, I don’t *have* to maintain all three, but people can find out about me from all of them, and I want to try to ensure what they see is correct. Irritating. Can we just have some sensible researcher IDs in common use, and from that an unambiguous way to discover which publications are mine? I know efforts are under way, but it can’t come soon enough.

Housekeeping & Self References Meetings & Conferences Papers Science Online

Social Networking and Guidelines for Life Science Conferences
I had a great time in Sweden this past summer, at ISMB 2009 (ISMB/ECCB 2009 FriendFeed room). I listened to a lot of interesting talks, reconnected with old friends and met new ones. I went to an ice bar, explored a 17th-century ship that had been dragged from the bottom of the sea, and visited the building where the Nobel Prizes are handed out.

While there, many of us took notes and provided commentary through live blogging either on our own blogs or via FriendFeed and Twitter. The ISCB were very helpful, having announced and advertised the live blogging possibilities prior to the event. Once at the conference, they provided internet access, and even provided extension cords where necessary so that we could continue blogging on mains power.

Those of us who spent a large proportion of our time live blogging were asked to write a paper about our experiences. This quickly became two papers, as there were two clear subjects on our minds: firstly, how the live blogging went in the context of ISMB 2009 specifically; and secondly, how our experiences (and that of the organisers) might form the basis of a set of guidelines to conference organisers trying to create live blogging policies. The first paper became the conference report, a Message from ISCB published today in PLoS Computational Biology. This was published in conjunction with the second paper, a Perspective published jointly today in PLoS Computational Biology, that aims to help organisers create policies of their own. Particularly, it provides “top ten”(-ish) lists for organisers, bloggers and presenters.

So, thanks again to my co-authors:
Ruchira S. Datta: Blog FriendFeed
Oliver Hofmann: Blog FriendFeed Twitter
Roland Krause: Blog FriendFeed Twitter
Michael Kuhn: Blog FriendFeed Twitter
Bettina Roth
Reinhard Schneider: Blog FriendFeed
(you can find links to my social networking accounts on the About page on this blog)

If you have any questions or comments about either of these articles, please comment on the PLoS articles themselves, so there can be a record of the discussion.

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Scientific Conferences Using Web Technologies PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000563

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) 2009 PLoS Computational Biology, 6 (1) DOI: 10.1371/journal.pcbi.1000640

Housekeeping & Self References Papers Research Blogging Software and Tools Standards

Modeling and Managing Experimental Data Using FuGE

Want to share your umpteen multi-omics data sets and experimental protocols with one common format? Encourage collaboration! Speak a common language! Share your work! How, you might ask? With FuGE, and this latest paper (citation at the end of the post) tells you how.

In 2007, FuGE version 1 was released (website, Nature Biotechnology paper). FuGE allows biologists and bioinformaticians to describe any life science experiment using a single format, making collaboration and repeatability of experiments easier and more efficient. However, if you wanted to start using FuGE, until now it was difficult to know where to start. Do you use FuGE as it stands? Do you create an extension of FuGE that specifically meets your needs? What do the developers of FuGE suggest when taking your first steps using it? This paper focuses on best practices for using FuGE to model and manage your experimental data. Read this paper, and you’ll be taking your first steps with confidence!

[Aside: Please note that I am one of the authors of this paper.]

What is FuGE? I’ll leave it to the authors to define:

The approach of the Functional Genomics Experiment (FuGE) model is different, in that it attempts to generalize the modeling constructs that are shared across many omics techniques. The model is designed for three purposes: (1) to represent basic laboratory workflows, (2) to supplement existing data formats with metadata to give them context within larger workflows, and (3) to facilitate the development of new technology-specific formats. To support (3), FuGE provides extension points where developers wishing to create a data format for a specific technique can add constraints or additional properties.

A number of groups have started using FuGE, including MGED, PSI (for GelML and AnalysisXML), MSI, flow cytometry, RNA interference and e-Neuroscience (full details in the paper). This paper helps you get a handle on how to use FuGE by presenting two running examples of capturing experimental metadata, one from flow cytometry and one from proteomics (gel electrophoresis). Part of Figure 2 from the paper is shown on the right, and describes one section of the flow cytometry FuGE extension from FICCS.

The flow cytometry equipment created as subclasses of the FuGE equipment class.

FuGE covers many areas of experimental metadata including the investigations, the protocols, the materials and the data. The paper starts by describing how protocols are designed in FuGE and how those protocols are applied. In doing so, it describes not just the protocols but also parameterization, materials, data, conceptual molecules, and ontology usage.

Examples of each of these FuGE packages are provided in the form of either the flow cytometry or the GelML extensions. Further, clear scenarios are provided to help the user determine when it is best to extend FuGE and when it is best to re-use existing FuGE classes. For instance, it is best to extend the Protocol class with an application-specific subclass when all of the following are true: when you wish to describe a complex Protocol that references specific sub-protocols, when the Protocol must be linked to specific classes of Equipment or Software, and when specific types of Parameter must be captured. I refer you to the paper for scenarios for each of the other FuGE packages such as Material and Protocol Application.
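To make the extend-or-reuse decision a little more concrete, here is a rough Python sketch. FuGE itself is defined as a UML model with an XML format, not as Python classes, so everything below (the class names, the `laser_count` field, the `should_extend_protocol` helper) is a hypothetical illustration of the idea rather than the FuGE API or anything from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Equipment:
    """Stand-in for a generic, reusable equipment description."""
    name: str


@dataclass
class Parameter:
    """Stand-in for a generic named parameter."""
    name: str
    value: float
    unit: str


@dataclass
class Protocol:
    """Stand-in for a generic protocol: any equipment, parameters, sub-protocols."""
    name: str
    equipment: List[Equipment] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)
    sub_protocols: List["Protocol"] = field(default_factory=list)


@dataclass
class FlowCytometer(Equipment):
    """Hypothetical extension class adding a technique-specific property."""
    laser_count: int = 1


@dataclass
class FlowCytometryProtocol(Protocol):
    """Hypothetical extension that only accepts flow cytometry equipment."""

    def __post_init__(self):
        for item in self.equipment:
            if not isinstance(item, FlowCytometer):
                raise TypeError(f"{item.name!r} is not a FlowCytometer")


def should_extend_protocol(has_specific_sub_protocols,
                           needs_specific_equipment_or_software,
                           needs_specific_parameters):
    """Extend only when all three conditions from the paragraph above hold."""
    return all([has_specific_sub_protocols,
                needs_specific_equipment_or_software,
                needs_specific_parameters])


acquisition = FlowCytometryProtocol(
    name="acquisition",
    equipment=[FlowCytometer(name="cytometer-1", laser_count=3)],
)
print(acquisition.equipment[0].laser_count)
```

The point of the subclass is the extra constraint: a generic `Protocol` would happily accept any equipment, while the extension rejects anything that is not the technique-specific type.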

The paper makes liberal use of UML diagrams to help you understand the relationship between the generic FuGE classes and the specific sub-classes generated by extensions. A large part of the paper is concerned expressly with helping the user understand how to model an experiment type using FuGE, and also to understand when FuGE on its own is enough. But it also does more than that: it discusses the current tools that are already available for developers wishing to use FuGE, and it discusses the applicability of other implementations of FuGE that might be useful but do not yet exist. Validation of FuGE-ML and the storage of version information within the format are also described. Implementations of FuGE, including SyMBA and sysFusion for the XML format and ISA-TAB for compatibility with a spreadsheet (tab-delimited) format, are also summarised.

I strongly believe that the best way to solve the challenges in data integration faced by the biological community is to constantly strive to simply use the same (or compatible) formats for data and for metadata. FuGE succeeds in providing a common format for experimental metadata that can be used in many different ways, and with many different levels of uptake. You don’t have to use one of the provided STKs in order to make use of FuGE: you can simply offer your data as a FuGE export in addition to any other omics formats you might use. You could also choose to accept FuGE files as input. No changes need to be made to the underlying infrastructure of a project in order to become FuGE compatible. Hopefully this paper will flatten the learning curve for developers, and get them on the road to a common format. Just one thing to remember: formats are not something that the end user should see. We developers do all this hard work, but if it works correctly, the biologist won’t know about all the underpinnings! Don’t sell your biologists on a common format by describing the intricacies of FuGE to them (unless they want to know!), just remind them of the benefits of a common metadata standard: cooperation, collaboration, and sharing.

Jones, A., Lister, A.L., Hermida, L., Wilkinson, P., Eisenacher, M., Belhajjame, K., Gibson, F., Lord, P., Pocock, M., Rosenfelder, H., Santoyo-Lopez, J., Wipat, A., & Paton, N. (2009). Modeling and Managing Experimental Data Using FuGE. OMICS: A Journal of Integrative Biology. DOI: 10.1089/omi.2008.0080

Meetings & Conferences Papers Semantics and Ontologies

ISMB Bio-Ontologies SIG 2009: Let’s talk about ontologies

I can’t resist posting a short announcement about two papers I’m an author on which have been accepted to this year’s Bio-Ontologies SIG at ISMB. 🙂 I’ll post more about both papers during or just before the SIG, which is on Sunday, June 28, 2009. However, here’s a taster of both.

I am first author on one of the papers, which covers the current state of work on my PhD: “Annotation of SBML Models Through Rule-Based Semantic Integration”, by Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Here’s the abstract:

Motivation: The creation of accurate quantitative Systems Biology Markup Language (SBML) models is a time-intensive, manual process often complicated by the many data sources and formats required to annotate even a small and well-scoped model. Ideally, the retrieval and integration of biological knowledge for model annotation should be performed quickly, precisely, and with a minimum of manual effort. Here, we present a method using off-the-shelf semantic web technology which enables this process: the heterogeneous data sources are first syntactically converted into ontologies; these are then aligned to a small domain ontology by applying a rule base. Integrating resources in this way can accommodate multiple formats with different semantics; it provides richly modelled biological knowledge suitable for annotation of SBML models.
Results: We demonstrate proof-of-principle for this rule-based mediation with two use cases for SBML model annotation. This was implemented with existing tools, decreasing development time and increasing reusability. This initial work establishes the feasibility of this approach as part of an automated SBML model annotation system.

And to whet the appetite a little further, here’s an overview diagram from the paper describing the overall flow through the data integration process:

Rule-based mediation in the context of SBML model annotation.

The second paper discusses the Ontology for Biomedical Investigations (OBI) (OWL file, website): “Modeling biomedical experimental processes with OBI”, by the OBI Consortium (of which I am a part). You can read the full paper, and here is the abstract:

Motivation: Experimental metadata are stored in many different formats and styles, creating challenges in comparison, reproduction and analysis. These difficulties impose severe limitations on the usability of such metadata in a wider context. The Ontology for Biomedical Investigations (OBI), developed as part of a global, cross-community effort, provides an approach to represent biological and clinical investigations in an explicit and integrative framework which facilitates computational processing and semantic web compatibility. Here we detail two real-world applications of OBI and show how OBI satisfies such use cases.

Papers Semantics and Ontologies

Distributed Ontology Development

Last Friday, while I was discussing ontologies and decisions that need to be made in ontology development with some work colleagues, one of the phrases that cropped up more than once was “be sensible”. Being sensible isn’t always as easy as it seems, but one way to be sensible is to choose an ontology development methodology and make use of it before you even write down your first ontology class name. If you want lots of people to use an ontology, you need to involve at least some of those people in its development.

As a timely accompaniment to this thought, in the past week Frank Gibson has published a pre-print version of a methodology for distributed ontology development called Developing ontologies in decentralised settings (by Alexander Garcia, Kieran O’Neill, Leyla J. Garcia, Phillip Lord, Robert Stevens, Oscar Corcho, & Frank Gibson).

While Frank himself has referred to it as “dry”, I think that does it a disservice (but perhaps I’m biased because I know him and also because I like methodologies and standards!). This paper would better be described as comprehensive. I’d like to cover a few sections of the paper that I found the most interesting, to whet your appetite for reading the whole thing.

Firstly, Garcia et al. mention one overriding focus of the bio-ontology community: ontology development without any accompanying ontology development methodology:

‘The research focus for the bio-ontology community to date has typically centred on the development of domain specific ontologies for particular applications, as opposed to the actual “how to” of building the ontology or the “materials and methods”[…] This has resulted in a proliferation of bio-ontologies, developed in different ways, often presenting overlap in terminology or application domain.’

Both in programming and in ontology development, I find it very hard not to head straight for working on the “interesting” bits without thinking through the best way to go about it. However, even though I find it difficult to follow a particular methodology, the benefits outweigh the downsides.

Garcia et al. also list a kind of minimal set of requirements for an ontology methodology:

‘A general purpose methodology should aim to provide ontology engineers with a sufficient perspective of the stages of the development process and the components of the ontology life cycle, and account for community development. In addition, detailed examples of use should be included for those stages, outcomes, deliverables, methods and techniques; all of which form part of the ontology life cycle.’

So far, these are useful statements for anyone building an ontology, but this paper concentrates on distributed ontology development, and presents Melting Point (MP), an ontology methodology specifically designed for distributed, community-driven ontology development. It was created as a “convergence of existing methodologies, with the addition of new aspects” as “no methodology completely satisfies all the criteria for collaborative development” (pg. 2). A useful overview of MP is available from Figure 3 in the paper, which describes the life cycle of the MP methodology including its processes and activities.

This paper has a thorough review of nine existing ontology and knowledge engineering methodologies (see Table 1 and Section 4.2 particularly), and clearly explains why MP was important to develop. I encourage anyone interested in building ontologies to read this paper for its background information, and especially encourage anyone interested in distributed, community-driven development of ontologies to read this and determine if MP might be the right methodology for you.

I’ll finish as Garcia et al. have, with their concluding paragraph. Enjoy!

‘As we increasingly build large ontologies against complex domain knowledge in a community and collaborative manner there is an identified need for a methodology to provide a framework for this process. A shared methodology tailored for the decentralized development environment, facilitated by the internet should increasingly enable and encourage the development of ontologies fit for purpose. The Melting point methodology provides this framework which should enable the ontology community to cope with the escalating demands for scalability and repeatability in the representation of community derived knowledge bases, such as those in biomedicine and the semantic web.’


Writing Groups: U can haz gud riting

Every Thursday morning, we get together and go over a portion of a thesis, a short paper, a conference submission, an abstract or suchlike that is in the process of being written by a member of the group. This morning was my turn: 4 pages, 1 hour. In the end, we only got through 2 1/2 pages. It's not that there were huge changes: it's more that we cover both grammar and meaning. And it was fantastically useful (see Matt, I can get the word "fantastically" into a work document! Or was it supposed to be fantastical?).

This writing group has been helpful to so many of us in our group, and my thanks go to Jen for starting it up. I highly recommend such a practice anyway. Not only does it encourage people to write more to ensure that each week is filled up, it also means that what comes out of our group is (hopefully) more consistent and more polished. Consistent because we have all started adopting group-wide conventions for dashes ("group-wide conventions"), commas (Oxford comma or no?), mixed casing (Web Services or web services?) and more; polished because we get an end result which is more cohesive, reads better, and is generally more succinct.

So, what would my tips be after getting a good set of comments this morning from the group?

  • Try very hard to give 18-24 hours' notice to the rest of your group. Often the paper doesn't get sent around until 5pm the day before (as happened with my paper this week), so don't go too hard on the author if that's the case. The fact remains, however: earlier is better. You'll get more complete feedback from your peers.
  • Always bring along a copy of Strunk and White. Masters of concise writing and the answering of tricky grammar questions.
  • You might also like Truss' Eats, Shoots & Leaves. Humorous, and gives you the right idea about punctuation.
  • Send an editable version of the file if you are also sending a PDF. If you are a latex person, ALWAYS send the tex version around as well as the PDF. Not everyone can comment directly onto PDF, and making it possible for them to easily edit the tex file (and track changes) is really helpful.
  • A writing group is always useful, at any stage of the writing process. It doesn't matter how early on or late into the paper-writing procedure you are: other people always catch things that you haven't seen, no matter how often you've read it!
  • No more than 4 single-spaced pages, max! And, as I've said, even that is often too long. Takes less time for the group members to read, and it means you can get a greater depth of responses for the section you've sent out.

Summary: It's great. Go for it!

And the result? U can haz gud riting too. But maybe, you'd prefer a cheezburger.



Papers Research Blogging Semantics and Ontologies Software and Tools

From OBO to OWL and Back Again: OBO capabilities of the OWL API

Golbreich et al describe a formal method of converting OBO to OWL
1.1 files, and vice versa. Their code has been integrated into the OWL
API, a set of classes that is well-used within the OWL community. For
instance, Protege 4 is built on the OWL API. While there have been
other efforts in the past to map between the OBO flat-file format and
OWL (they specifically mention Chris Mungall’s work on an XSLT used as
a plugin within Protege that can perform the conversion), none were
done in a formal or rigorous manner. By defining an exact relationship
between OBO and OWL constructs using consensus information provided by
the OBO community, the authors have provided a more robust method of
mapping than has been available to date.

Consequently, the entire library of tools, reasoners and editors
available to the OWL community is now also available to OBO developers
in a way that does not force them to permanently leave the format and
environment that they are used to.

OBO ontologies are ontologies generated within the biological and
biomedical domain and which follow a standard, if often
non-rigorously-defined, syntax and semantics. The most well-known of
the OBO ontologies is the Gene Ontology (GO). Not only do you subscribe
to the format when you choose OBO, you are also subscribing to the
ideas behind the OBO Foundry, which aims to limit overlap of ontologies
in related fields, and which provides a communal environment (mailing
lists, websites, etc) in which to develop. OWL (the Web Ontology
Language) has three dialects, of which OWL-DL (DL stands for
Description Logics) is the most commonly used. OWL-DL is favored by
ontologists wishing to perform computational analyses over ontologies
as it has not just rigorously-defined formal semantics, but also a wide
user-base and a suite of reasoning tools developed by multiple groups.

OBO is composed of stanzas describing elements of the ontology.
Below is an example of a term in its stanza, which describes its
location in the larger ontology:

id: GO:0001555
name: oocyte growth
is_a: GO:0016049 ! cell growth
relationship: part_of GO:0048601 ! oocyte morphogenesis
intersection_of: GO:0040007 ! growth
intersection_of: has_central_participant CL:0000023 ! oocyte

Before they could start writing the parsing and mapping programs,
they had to formalize both the semantics and the syntax of OBO. This is
something that would normally be done by the developers of the
format, not the users of the format, but both the syntax and semantics
of OBO are only defined in natural language. These natural language
definitions often lead to imprecision and, in extreme cases, no
consensus was reached for some of the OBO constructs. However, the
diligence of the authors in getting consensus from the OBO community
should be rewarded in future by the OBO community feeling confident in
the mapping, and therefore also in using the OWL tools now available to
them. Here is an example of the natural language definitions in the OBO User Guide:

This tag describes a typed relationship between this term and
another term. […] The necessary modifier allows a relationship to be
marked as “not necessarily true”. […]

Neither “necessarily true” nor “relationship” has been defined. You
can, in fact, computationally define a relation in three different ways
(taking their stanza example from above):

  • existentially, where each instance of GO:0001555 must have at least
    one part_of relationship to an instance of the term GO:0048601;
  • universally, where instances of GO:0001555 can *only* be connected to instances of GO:0048601;
  • via a constraint interpretation, where the endpoints of the
    relationship *must* be known, but which cannot in any case be expressed
    with DL, so is not useful to this discussion.
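The first two readings can be written down in standard description logic notation. This is my own rendering for illustration, not a formula copied from the paper, and it assumes the universal reading is restricted to the part_of relation:

```latex
\begin{align*}
% Existential: every oocyte growth is part_of at least one oocyte morphogenesis
\mathrm{GO{:}0001555} &\sqsubseteq \exists\, \mathit{part\_of}.\,\mathrm{GO{:}0048601} \\
% Universal: anything an oocyte growth is part_of must be an oocyte morphogenesis
\mathrm{GO{:}0001555} &\sqsubseteq \forall\, \mathit{part\_of}.\,\mathrm{GO{:}0048601}
\end{align*}
```

Note that the universal form is satisfied even by an instance with no part_of relationship at all, which is why the two readings are genuinely different.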

OBO-Edit does not always infer what should be inferred if all of the
rules of its User Guide are followed. There is a good example of this
in the text.

In their formal representation of the OBO syntax they used
BNF, which is backwards-compatible with OBO. Many of the mappings are
quite straightforward: OBO terms become OWL classes, OBO relationship
types become OWL properties, OBO instances become OWL individuals, OBO
ids are the URIs in OWL, and the OBO names become the OWL labels. is_a,
disjoint_from, domain and range have direct OWL equivalents. There had
to be some more complex mapping in other places, such as trying to map
OBO relationship types to either OWL object or datatype properties.
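To give a feel for the straightforward part of the mapping, here is a small Python sketch that turns the example term into axiom strings in OWL 2 functional-style syntax. Several things here are my assumptions rather than the paper's method: the base URI is a made-up placeholder, the relationship tag is read existentially, intersection_of is rendered as an equivalent-class intersection, and the functions are hypothetical and unrelated to the OWL API.

```python
BASE = "http://example.org/obo#"  # hypothetical base URI, for illustration only

# The example term, written out as tag -> list of values.
TERM = {
    "id": ["GO:0001555"],
    "name": ["oocyte growth"],
    "is_a": ["GO:0016049"],
    "relationship": ["part_of GO:0048601"],
    "intersection_of": ["GO:0040007", "has_central_participant CL:0000023"],
}


def iri(obo_id):
    """Turn an OBO id into a bracketed IRI under the placeholder base."""
    return "<" + BASE + obo_id.replace(":", "_") + ">"


def class_expression(value):
    """'GO:1' is a named class; 'rel GO:1' is an existential restriction (assumed)."""
    parts = value.split()
    if len(parts) == 1:
        return iri(parts[0])
    relation, target = parts
    return f"ObjectSomeValuesFrom({iri(relation)} {iri(target)})"


def term_to_owl(term):
    """Return a list of axiom strings for one OBO term."""
    cls = iri(term["id"][0])
    axioms = [f"Declaration(Class({cls}))"]
    for name in term.get("name", []):
        axioms.append(f'AnnotationAssertion(rdfs:label {cls} "{name}")')
    for parent in term.get("is_a", []):
        axioms.append(f"SubClassOf({cls} {iri(parent)})")
    for rel in term.get("relationship", []):
        axioms.append(f"SubClassOf({cls} {class_expression(rel)})")
    parts = term.get("intersection_of", [])
    if len(parts) >= 2:
        operands = " ".join(class_expression(p) for p in parts)
        axioms.append(f"EquivalentClasses({cls} ObjectIntersectionOf({operands}))")
    return axioms


for axiom in term_to_owl(TERM):
    print(axiom)
```

Each OBO tag ends up as one axiom, which matches the one-to-one flavour of the simple mappings listed above; the harder cases, such as deciding between object and datatype properties, are not attempted here.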

Using OWL reasoners over OBO ontologies not only works but, in the
case of the Sequence Ontology (SO), found a term that had only a single
intersection_of statement and was thus illegal according to OBO rules,
yet had not been caught by OBO-Edit.
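
The rule itself is simple enough to check syntactically. The sketch below is a plain text scan in Python, not what the OWL reasoner did, and the three stanzas with EX: identifiers are made up for illustration.

```python
from typing import Dict, List


def terms_with_single_intersection(obo_text: str) -> List[str]:
    """Return ids of [Term] stanzas that have exactly one intersection_of tag."""
    stanzas: List[Dict] = []
    current = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza header such as [Term] or [Typedef].
            current = {"type": line, "id": None, "count": 0}
            stanzas.append(current)
        elif current is not None:
            if line.startswith("id:"):
                current["id"] = line.split(":", 1)[1].strip()
            elif line.startswith("intersection_of:"):
                current["count"] += 1
    return [
        s["id"]
        for s in stanzas
        if s["type"] == "[Term]" and s["count"] == 1 and s["id"]
    ]


# Made-up ontology fragment for illustration only.
sample = """
[Term]
id: EX:0000001
intersection_of: EX:0000010
intersection_of: part_of EX:0000011

[Term]
id: EX:0000002
intersection_of: EX:0000010

[Term]
id: EX:0000003
is_a: EX:0000010
"""

flagged = terms_with_single_intersection(sample)
print(flagged)
```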

Up until now, I’ve been unsure as to how the OWL files are created
from files in the OBO format. This was a paper that was clear and to
the point. Thanks very much!

Update December 2008: I originally posted this without the BPR3 / tag, as I was unsure where conference proceedings
came in the “peer-reviewed research” part of the guidelines. However,
as I’m now getting back into the whole researchblogging thing, I feel
(having read many of the posts of my fellow research bloggers) that
this would be suitable. If anyone has any opinions, I’d be most

Golbreich, C., Horridge, M., Horrocks, I., Motik, B., Shearer, R. (2008). OBO and OWL: Leveraging Semantic Web Technologies for the Life Sciences. Lecture Notes in Computer Science, 4825/2008, 169-182. DOI: 10.1007/978-3-540-76298-0_13




Whole-Genome Reference Networks for the Community

Srinivasan et al. use this paper as a call to the community to begin
the development of whole-genome reference networks for key model
organisms. The paper is a combination of a review (in that it
summarizes methods of network generation and analysis) and a call to
arms, stating that reference networks are needed. It begins by
describing systems biology as "the science of quantitatively defining
and analyzing" functional modules, or components of biological systems.

There are many different definitions of systems biology, but
generally it seems the twin pillars of data
integration and study – at various levels of granularity – of
biological systems are present in most of them. A focus on integration
and top-down research rather than the more traditional reductionist
point of view is also often mentioned.

The authors then divide systems
biology into three broad categories: high-level networks of the
interactome or metabolome, deterministic models of kinetics and
diffusion, and finally stochastic models of variation in cell lines.
This division would be slightly clearer if they specified continuous deterministic models and discrete
stochastic models. I realize that these adjectives are generally
assumed for these model types, but as it is their discrete- or
continuous-ness that increases the complexity of the models, it is
something that would be useful to include.

They collapse many
different types of network data into a single global interaction
network, stating that it would be prohibitively expensive to try to
prise out all of the sub-graphs, as variables such as time or
sub-cellular location are often not simple to pull out on their own.
This "lowest common denominator" method of network generation is not
ideal, but does provide more information than, they attest, a simple
genome sequence. In their networks, nodes represent proteins and edge
weights are probabilities of association between proteins.

Noise is
a real problem in most of these high-throughput data sets, and such
data sets are not all created equal: one group may make a very good
gene expression data set, and another may not. How can variable quality
of data be dealt with? Early efforts focused on integrating multiple
networks and only taking those nodes and edges that were present in
more than one network. After that, methods of network generation that
used "gold standards" created better integrated networks.
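
The early "present in more than one network" approach is easy to sketch in Python. This is my own minimal illustration, assuming undirected, unweighted edges and made-up protein names; it is not taken from the paper.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set

Edge = FrozenSet[str]


def edge(a: str, b: str) -> Edge:
    """An undirected edge between two proteins."""
    return frozenset((a, b))


def consensus_edges(networks: Iterable[Set[Edge]], min_support: int = 2) -> Set[Edge]:
    """Keep only edges that appear in at least min_support networks."""
    counts = Counter(e for net in networks for e in net)
    return {e for e, n in counts.items() if n >= min_support}


# Three hypothetical networks from different experiments.
net1 = {edge("P1", "P2"), edge("P2", "P3")}
net2 = {edge("P2", "P1"), edge("P3", "P4")}
net3 = {edge("P2", "P3"), edge("P4", "P5")}

kept = consensus_edges([net1, net2, net3])
print(sorted(sorted(e) for e in kept))
```

Because each network is a set, an edge is counted at most once per network, so min_support really does mean "number of independent data sets that agree".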

Methods of network analysis (rather than network creation) focus on network
alignment and experiment prioritization. The latter is a general term
for pulling out elements of the network that haven't been
experimentally verified, such as likely additions to known pathways or
important disease genes. The former is an interesting extrapolation of
genome sequence alignment to networks. In network alignments,
conserved modules of nodes are identified if they have "both conserved
primary sequences and conserved pair-wise interactions
between species". They specifically mention Graemlin, which is a tool
they have developed that can identify conserved functional modules across multiple networks.
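
To show what "conserved pair-wise interactions" means in the simplest case, here is a toy Python sketch. It is emphatically not Graemlin's algorithm: I assume a ready-made one-to-one ortholog table as a stand-in for conserved primary sequence, and all protein names are made up.

```python
from typing import Dict, FrozenSet, Set

Edge = FrozenSet[str]


def conserved_interactions(
    edges_a: Set[Edge], edges_b: Set[Edge], orthologs: Dict[str, str]
) -> Set[Edge]:
    """Edges of species A whose two endpoints map to interacting proteins in species B."""
    conserved: Set[Edge] = set()
    for e in edges_a:
        if len(e) != 2:
            continue  # ignore self-interactions in this toy version
        p, q = tuple(e)
        if p in orthologs and q in orthologs:
            mapped = frozenset((orthologs[p], orthologs[q]))
            if len(mapped) == 2 and mapped in edges_b:
                conserved.add(e)
    return conserved


# Hypothetical interaction networks for two species.
species_a = {
    frozenset(("A1", "A2")),
    frozenset(("A2", "A3")),
    frozenset(("A3", "A4")),
}
species_b = {frozenset(("B1", "B2")), frozenset(("B3", "B4"))}

# Hypothetical ortholog table; A4 has no ortholog in species B.
orthologs = {"A1": "B1", "A2": "B2", "A3": "B3"}

conserved = conserved_interactions(species_a, species_b, orthologs)
print(conserved)
```

A real aligner has to score noisy, many-to-many matches and grow whole modules rather than test single edges, which is where the difficulty lies.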

Finally, they suggest that the reference networks should show only those reactions present in the "‘average cell’ of a given organism near the median of the norm of reaction".

While they acknowledge that, like the reference human genome sequence,
such a creation is a "useful fiction", it is my opinion that finding
the average cell will be much more difficult, and perhaps less
illuminating, than its equivalent in the sequencing world. Further,
describing what is "normal" is something that is truly difficult, and
will vary from species to species. The PATO / quality ontology people
have known about the problems facing the "average" phenotype for a
while now. I do, however, like their
idea of storing the reference networks using RDF, as that seems a
fitting format for networks. Overall, a laudable goal but one which
will need some more thinking about. I've tried to run Graemlin
using one of their example searches, and it didn't run (at least
today), and the main author's website won't load for me today, though
one of the other authors' pages did work.

All-in-all, a useful review of recent network methods in bioinformatics, and an interesting goal described. Low-noise reference networks for key model organisms, together with the annotation tracks that would describe deviations from the norm, are a good idea.

Topics for discussion (aka leading questions): More fine-grained reference implementations are available, such as Reactome. Reactome provides a curated database of human biological pathways, with inferred orthologous events for 22 other organisms. Do we need reference networks when we're gradually growing our knowledge of reference pathways? Are reference networks of "normal" organism states helpful? How do we define average? Would the median of the norm of a reaction be different under different environmental conditions? What if what one group considers an average cell differs from another group's average cell? Having reference networks would mean easier comparisons of different network analysis programs. Would this end up being a major use of the networks? Would such comparisons just lead to network analysis programs that fit the reference network, but do not work in a generic manner? What do others think?

Srinivasan, B.S., Shah, N.H., Flannick, J.A., Abeliuk, E., Novak, A.F., Batzoglou, S. (2007). Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinformatics, 8(5), 318-332. DOI: 10.1093/bib/bbm038
