BioSharing is Caring: Being FAIR

FAIR: Findable, Accessible, Interoperable, Reusable
Source: Scientific Data via http://www.isa-tools.org/join-the-funfair/ March 16, 2016.

In my work for BioSharing, I get to see a lot of biological data standards. Although you might laugh at the oddity of multiple overlapping standards (rather than One Standard to Rule Them All), there are reasons for it. Some of those reasons are historical, such as a lack of cross-community involvement when the standards were first conceived, and some are technical, such as vastly different requirements in different communities. The FAIR paper, published yesterday in Scientific Data by Wilkinson et al. (including a number of my colleagues at BioSharing), helps guide researchers towards appropriate standards and databases by clarifying data stewardship and management requirements. If used correctly, a researcher can be assured that as long as a resource is FAIR, it’s fine.

This article describes four foundational principles—Findability, Accessibility, Interoperability, and Reusability—that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added-value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to ‘data’ in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects—from data to analytical pipelines—benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability. (doi:10.1038/sdata.2016.18)

This isn’t the first time curators, bioinformaticians and other researchers have shouted out the importance of being able to find, understand, copy and use data. But any help in spreading the message is more than welcome.

Standards
Source: https://xkcd.com/927/

Need more help finding the right standard or database for your work? Visit BioSharing!


PhD Thesis: Table of Contents

Below you can find a complete table of contents for all thesis-related posts (you can also get to the posts via the “thesis” tag I have used for each). Enjoy!

Thesis Posts

And, if you’re interested in how I performed the conversion, I’ve written about that too.

Google Scholar Citations and Microsoft Academic Search: first impressions

There’s been a lot of chat in my scientific circles lately about the recent improvements in freely-available, easily-accessible web applications to organise and store your publications. Google Scholar Citations (my profile) and Microsoft Academic Search (my profile) are the two main contenders, but there are many other resources to use which are mighty fine in their own right (see my publication list on Citeulike for an example of this). Some useful recent blog posts and articles on this subject are:

  • Egon Willighagen’s post includes lots of good questions about the future of free services like GSC and MAS, and how they relate to his use of more traditional services such as the Web Of Science.
  • Alan Cann’s impressions of GSC, including a nice breakdown of his citations by type.
  • Jane Tinkler’s comparison of GSC and MAS, together with a nice description of what happens when a GSC account is made (HT Chris Taylor @memotypic for the link to the article).
  • Nature News’ comparison of GSC and MAS, and impressions of how these players might change the balance of power between free and non-free services.

While I do like each of these technologies for a number of reasons, there are also reasons to be less happy with them. First, their similarities (please be aware I am not trying to make an exhaustive list – just my impressions after using each product for a few days). Both allow the merging of authors, a feature that was very useful to me as I changed my name when I got married; neither service has a fantastic interface for the merge, but it worked. Both provide some basic metrics: GSC has the h-index and the i10-index, while MAS uses the g- and h-indexes. Both tell you how many other papers have cited each of your publications. Both seem to get things mostly right (and a little bit wrong) when assigning publications to me – I had to manually fix both apps. Both provide links to co-authors, though GSC’s are rather limited, as you have to actively create a profile there, while MAS profiles are built automatically.
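For anyone curious how the metrics mentioned above are defined, they are easy to compute from a list of per-paper citation counts. The sketch below is my own illustration (the citation counts are invented, not from either service): the h-index is the largest h such that h papers each have at least h citations, and the i10-index is simply the number of papers with at least 10 citations.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)

papers = [25, 18, 12, 7, 3, 1]   # invented example citation counts
print(h_index(papers))    # → 4
print(i10_index(papers))  # → 3
```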

Things I like about Microsoft Academic Search:

  1. Categorisation of publications. You can look down the left-hand side and see your papers categorised by type, keyword, etc.
  2. Looks nicer. Yes, I like Google. But Microsoft’s offering is just a lot better looking.
  3. Found more ancillary stuff. It found my work page (though the URL has since changed), and from there a picture of me. Links out to Bing (of course) and nice organisation of basic info really just makes it look more professional than GSC.
  4. Bulk import of citations in Bibtex format. I really like this feature – I was able to bulk add the missing citations in one fell swoop using a bibtex upload. Shiny!
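That bulk upload just takes a plain BibTeX file. As a purely illustrative sketch (the entry keys, titles, and authors below are invented, not my real publication list), assembling one programmatically might look like this:

```python
# Illustrative only: build a BibTeX file suitable for a bulk citation
# upload. All entry details here are made-up examples.

def make_bibtex_entry(key, title, author, journal, year):
    """Render one @article entry as a BibTeX string."""
    return (
        f"@article{{{key},\n"
        f"  title   = {{{title}}},\n"
        f"  author  = {{{author}}},\n"
        f"  journal = {{{journal}}},\n"
        f"  year    = {{{year}}}\n"
        f"}}\n"
    )

entries = [
    make_bibtex_entry("smith2010", "An Example Paper", "Smith, J.",
                      "Journal of Examples", 2010),
    make_bibtex_entry("jones2011", "Another Example", "Jones, A.",
                      "Example Letters", 2011),
]

# Concatenate into the contents of a single .bib file ready to upload.
bibtex_file = "\n".join(entries)
print(bibtex_file)
```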

Things I don’t like about Microsoft Academic Search:

  1. Really slow update time. It insists on vetting each change with a mysterious Microsoftian in the sky. I’ve made a bunch of changes to the profile, updated and added publications, and days later it still hasn’t shown those changes. It’s got to get better if it doesn’t want to irritate me. Sure, do a confirmation step to ensure I am who I say I am, but then give me free rein to change things!
  2. Silverlight. I’ve tried installing moonlight, which seemed to install just fine, but then the co-author graph just showed up empty. Is that a fault with moonlight, or with the website?
  3. Did I mention the really slow update time?

Things I like about Google Scholar Citations:

  1. Changes are immediately visible. Yes, if I merge authors or remove publications or anything else, it shows up immediately on my profile.
  2. No incompatibilities with Linux. All links work, no plugins required.

Things I don’t like about Google Scholar Citations:

  1. Lack of interesting extras. The graphs, fancy categorisations etc. you get with MAS you don’t (yet) get with the Google service.
  2. No connection with the Google profile. Why can’t I provide a link to my Google profile, and then get integration with Google+, e.g. announcements when new publications are added? This is a common complaint with Google+, as many other Google services (e.g. Google Reader) aren’t yet linked with it, but hopefully this will come eventually.
  3. Not as pretty. Also, I’m not sure if it’s just my types of papers, but the links in GSC to the individual citations are difficult to read, and it’s hard to determine the ultimate source of the article (e.g. journal or conference name).

I will still use Citeulike as my main publication application. I use it to maintain the library of my own papers and other people’s papers. Its import and export features for bibtex are great, and it can convert web pages to citations with just one click (or via a bookmarklet). It has loads of other bells and whistles as well. While I’m writing up my thesis, I visit it virtually every day to add citations and export bibtex.

So, between Google and Microsoft, which do I like better? They’re very similar, but Microsoft Academic Search wins right now. But both services are improving daily, and we’ll have to see how things change in future.

And the thing that really annoys me? I now feel the need to keep my publications up to date on three systems: Citeulike (it’s the thing I actually use when writing papers etc.), Microsoft Academic Search, and Google Scholar Citations. No, I don’t *have* to maintain all three, but people can find out about me from all of them, and I want to try to ensure what they see is correct. Irritating. Can we just have some sensible researcher IDs in common use, and from that an unambiguous way to discover which publications are mine? I know efforts are under way, but it can’t come soon enough.

Social Networking and Guidelines for Life Science Conferences

ResearchBlogging.org
I had a great time in Sweden this past summer, at ISMB 2009 (ISMB/ECCB 2009 FriendFeed room). I listened to a lot of interesting talks, reconnected with old friends and met new ones. I went to an ice bar, explored a 17th-century ship that had been dragged from the bottom of the sea, and visited the building where the Nobel Prizes are handed out.

While there, many of us took notes and provided commentary through live blogging either on our own blogs or via FriendFeed and Twitter. The ISCB were very helpful, having announced and advertised the live blogging possibilities prior to the event. Once at the conference, they provided internet access, and even provided extension cords where necessary so that we could continue blogging on mains power.

Those of us who spent a large proportion of our time live blogging were asked to write a paper about our experiences. This quickly became two papers, as there were two clear subjects on our minds: firstly, how the live blogging went in the context of ISMB 2009 specifically; and secondly, how our experiences (and those of the organisers) might form the basis of a set of guidelines for conference organisers trying to create live blogging policies. The first paper became the conference report, a Message from ISCB published today in PLoS Computational Biology. This was published in conjunction with the second paper, a Perspective published jointly today in PLoS Computational Biology, that aims to help organisers create policies of their own. In particular, it provides “top ten”(-ish) lists for organisers, bloggers and presenters.

So, thanks again to my co-authors:
Ruchira S. Datta: Blog FriendFeed
Oliver Hofmann: Blog FriendFeed Twitter
Roland Krause: Blog FriendFeed Twitter
Michael Kuhn: Blog FriendFeed Twitter
Bettina Roth
Reinhard Schneider: Blog FriendFeed
(you can find links to my social networking accounts on the About page on this blog)

If you have any questions or comments about either of these articles, please comment on the PLoS articles themselves, so there can be a record of the discussion.

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Scientific Conferences Using Web Technologies. PLoS Computational Biology, 6(1). DOI: 10.1371/journal.pcbi.1000563

Lister, A., Datta, R., Hofmann, O., Krause, R., Kuhn, M., Roth, B., & Schneider, R. (2010). Live Coverage of Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) 2009. PLoS Computational Biology, 6(1). DOI: 10.1371/journal.pcbi.1000640

Modeling and Managing Experimental Data Using FuGE


ResearchBlogging.org

Want to share your umpteen multi-omics data sets and experimental protocols with one common format? Encourage collaboration! Speak a common language! Share your work! How, you might ask? With FuGE, and this latest paper (citation at the end of the post) tells you how.

In 2007, FuGE version 1 was released (website, Nature Biotechnology paper). FuGE allows biologists and bioinformaticians to describe any life science experiment using a single format, making collaboration and repeatability of experiments easier and more efficient. However, until now it has been difficult to know where to start with FuGE. Do you use FuGE as it stands? Do you create an extension of FuGE that specifically meets your needs? What do the developers of FuGE suggest when taking your first steps? This paper focuses on best practices for using FuGE to model and manage your experimental data. Read this paper, and you’ll be taking your first steps with confidence!

[Aside: Please note that I am one of the authors of this paper.]

What is FuGE? I’ll leave it to the authors to define:

The approach of the Functional Genomics Experiment (FuGE) model is different, in that it attempts to generalize the modeling constructs that are shared across many omics techniques. The model is designed for three purposes: (1) to represent basic laboratory workflows, (2) to supplement existing data formats with metadata to give them context within larger workflows, and (3) to facilitate the development of new technology-specific formats. To support (3), FuGE provides extension points where developers wishing to create a data format for a specific technique can add constraints or additional properties.
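To make purpose (3) a little more concrete, an extension point can be pictured as ordinary subclassing: a generic construct with a technique-specific subclass adding constrained properties. The sketch below is purely illustrative Python of my own devising – the class and attribute names are invented and are not FuGE’s actual schema (FuGE itself is defined via UML and serialised as XML):

```python
# Illustrative sketch only: these classes mimic the *idea* of FuGE's
# generic constructs and extension points. The names and attributes
# are invented for this example and do not come from FuGE's schema.

class Equipment:
    """A generic piece of laboratory equipment (the shared construct)."""
    def __init__(self, name):
        self.name = name

class FlowCytometer(Equipment):
    """A technique-specific extension: reuses the generic construct and
    adds an additional, constrained property."""
    def __init__(self, name, laser_count):
        super().__init__(name)
        self.laser_count = laser_count  # property specific to this technique

cytometer = FlowCytometer("ExampleCytometer", laser_count=2)
print(cytometer.name, cytometer.laser_count)
```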

A number of groups have started using FuGE, including MGED, PSI (for GelML and AnalysisXML), MSI, flow cytometry, RNA interference and e-Neuroscience (full details in the paper). This paper helps you get a handle on how to use FuGE by presenting two running examples of capturing experimental metadata, in the fields of flow cytometry and gel electrophoresis. Part of Figure 2 from the paper is shown on the right, and describes one section of the flow cytometry FuGE extension from FICCS.

The flow cytometry equipment created as subclasses of the FuGE equipment class.

FuGE covers many areas of experimental metadata, including investigations, protocols, materials and data. The paper starts by describing how protocols are designed in FuGE and how those protocols are applied. In doing so, it describes not just the protocols themselves but also parameterization, materials, data, conceptual molecules, and ontology usage.

Examples of each of these FuGE packages are provided in the form of either the flow cytometry or the GelML extensions. Further, clear scenarios are provided to help the user determine when it is best to extend FuGE and when it is best to re-use existing FuGE classes. For instance, it is best to extend the Protocol class with an application-specific subclass only when all of the following are true: you wish to describe a complex Protocol that references specific sub-protocols; the Protocol must be linked to specific classes of Equipment or Software; and specific types of Parameter must be captured. I refer you to the paper for scenarios for each of the other FuGE packages, such as Material and Protocol Application.
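That Protocol scenario amounts to a three-part checklist: extend only when all three conditions hold. The snippet below is simply my own paraphrase of the paper’s criteria as code, not anything from FuGE itself:

```python
# Illustrative only: encode the "when to subclass Protocol" scenario
# as a checklist. The function name and arguments are my own invention.

def should_extend_protocol(has_specific_subprotocols,
                           needs_specific_equipment_or_software,
                           needs_specific_parameter_types):
    """Extend Protocol with an application-specific subclass only when
    all three conditions hold; otherwise reuse the generic class."""
    return (has_specific_subprotocols
            and needs_specific_equipment_or_software
            and needs_specific_parameter_types)

# A complex protocol tied to particular equipment and parameters:
print(should_extend_protocol(True, True, True))   # → True
# A protocol with sub-protocols but no special equipment needs:
print(should_extend_protocol(True, False, True))  # → False
```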

The paper makes liberal use of UML diagrams to help you understand the relationship between the generic FuGE classes and the specific sub-classes generated by extensions. A large part of the paper is concerned expressly with helping the user understand how to model an experiment type using FuGE, and also to understand when FuGE on its own is enough. But it also does more than that: it discusses the current tools that are already available for developers wishing to use FuGE, and it discusses the applicability of other implementations of FuGE that might be useful but do not yet exist. Validation of FuGE-ML and the storage of version information within the format are also described. Implementations of FuGE, including SyMBA and sysFusion for the XML format and ISA-TAB for compatibility with a spreadsheet (tab-delimited) format, are also summarised.

I strongly believe that the best way to solve the data integration challenges faced by the biological community is to strive, wherever possible, to use the same (or compatible) formats for data and for metadata. FuGE succeeds in providing a common format for experimental metadata that can be used in many different ways, and with many different levels of uptake. You don’t have to use one of the provided STKs in order to make use of FuGE: you can simply offer your data as a FuGE export in addition to any other omics formats you might use, or choose to accept FuGE files as input. No changes need to be made to the underlying infrastructure of a project in order to become FuGE compatible. Hopefully this paper will flatten the learning curve for developers, and get them on the road to a common format.

Just one thing to remember: formats are not something that the end user should see. We developers do all this hard work, but if it works correctly, the biologist won’t know about the underpinnings! Don’t sell your biologists on a common format by describing the intricacies of FuGE to them (unless they want to know!); just remind them of the benefits of a common metadata standard: cooperation, collaboration, and sharing.

Jones, A., Lister, A.L., Hermida, L., Wilkinson, P., Eisenacher, M., Belhajjame, K., Gibson, F., Lord, P., Pocock, M., Rosenfelder, H., Santoyo-Lopez, J., Wipat, A., & Paton, N. (2009). Modeling and Managing Experimental Data Using FuGE. OMICS: A Journal of Integrative Biology. DOI: 10.1089/omi.2008.0080

ISMB Bio-Ontologies SIG 2009: Let’s talk about ontologies

I can’t resist posting a short announcement about two papers I’m an author on which have been accepted to this year’s Bio-Ontologies SIG at ISMB. 🙂 I’ll post more about both papers during or just before the SIG, which is on Sunday, June 28, 2009. However, here’s a taster of both.

I am first author on one of the papers, which covers the current state of work on my PhD: “Annotation of SBML Models Through Rule-Based Semantic Integration”, by Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Here’s the abstract:

Motivation: The creation of accurate quantitative Systems Biology Markup Language (SBML) models is a time-intensive, manual process often complicated by the many data sources and formats required to annotate even a small and well-scoped model. Ideally, the retrieval and integration of biological knowledge for model annotation should be performed quickly, precisely, and with a minimum of manual effort. Here, we present a method using off-the-shelf semantic web technology which enables this process: the heterogeneous data sources are first syntactically converted into ontologies; these are then aligned to a small domain ontology by applying a rule base. Integrating resources in this way can accommodate multiple formats with different semantics; it provides richly modelled biological knowledge suitable for annotation of SBML models.
Results: We demonstrate proof-of-principle for this rule-based mediation with two use cases for SBML model annotation. This was implemented with existing tools, decreasing development time and increasing reusability. This initial work establishes the feasibility of this approach as part of an automated SBML model annotation system.

And to whet the appetite a little further, here’s an overview diagram from the paper describing the overall flow through the data integration process:

Rule-based mediation in the context of SBML model annotation.

The second paper discusses the Ontology for Biomedical Investigations (OBI) (OWL file, website): “Modeling biomedical experimental processes with OBI”, by the OBI Consortium (of which I am a part). You can read the full paper, and here is the abstract:

Motivation: Experimental metadata are stored in many different formats and styles, creating challenges in comparison, reproduction and analysis. These difficulties impose severe limitations on the usability of such metadata in a wider context. The Ontology for Biomedical Investigations (OBI), developed as part of a global, cross-community effort, provides an approach to represent biological and clinical investigations in an explicit and integrative framework which facilitates computational processing and semantic web compatibility. Here we detail two real-world applications of OBI and show how OBI satisfies such use cases.

Distributed Ontology Development

Last Friday, while I was discussing ontologies and the decisions that need to be made in ontology development with some work colleagues, one of the phrases that cropped up more than once was “be sensible”. Being sensible isn’t always as easy as it seems, but one way to be sensible is to choose an ontology development methodology and make use of it before you even write down your first ontology class name. If you want lots of people to use an ontology, you need to involve at least some of those people in its development.

As a timely accompaniment to this thought, in the past week Frank Gibson has published a pre-print version of a methodology for distributed ontology development called Developing ontologies in decentralised settings (by Alexander Garcia, Kieran O’Neill, Leyla J. Garcia, Phillip Lord, Robert Stevens, Oscar Corcho, & Frank Gibson).


While Frank himself has referred to it as “dry”, I think that does it a disservice (but perhaps I’m biased because I know him and also because I like methodologies and standards!). This paper would better be described as comprehensive. I’d like to cover a few sections of the paper that I found the most interesting, to whet your appetite for reading the whole thing.

Firstly, Garcia et al. mention one overriding focus of the bio-ontology community: ontology development without any accompanying ontology development methodology:

‘The research focus for the bio-ontology community to date has typically centred on the development of domain specific ontologies for particular applications, as opposed to the actual “how to” of building the ontology or the “materials and methods”[…] This has resulted in a proliferation of bio-ontologies, developed in different ways, often presenting overlap in terminology or application domain.’

Both in programming and in ontology development, I find it very hard not to head straight for working on the “interesting” bits without thinking through the best way to go about it. However, even though I find it difficult to follow a particular methodology, the benefits outweigh the downsides.

Garcia et al also list a kind of minimal set of requirements for an ontology methodology:

‘A general purpose methodology should aim to provide ontology engineers with a sufficient perspective of the stages of the development process and the components of the ontology life cycle, and account for community development. In addition, detailed examples of use should be included for those stages, outcomes, deliverables, methods and techniques; all of which form part of the ontology life cycle.’

So far, these are useful statements for anyone building an ontology, but this paper concentrates on distributed ontology development, and presents Melting Point (MP), an ontology methodology specifically designed for distributed, community-driven ontology development. It was created as a “convergence of existing methodologies, with the addition of new aspects” as “no methodology completely satisfies all the criteria for collaborative development” (pg. 2). A useful overview of MP is available from Figure 3 in the paper, which describes the life cycle of the MP methodology including its processes and activities.

This paper has a thorough review of nine existing ontology and knowledge engineering methodologies (see Table 1 and Section 4.2 particularly), and clearly explains why MP was important to develop. I encourage anyone interested in building ontologies to read this paper for its background information, and especially encourage anyone interested in distributed, community-driven development of ontologies to read this and determine if MP might be the right methodology for you.

I’ll finish as Garcia et al. have, with their concluding paragraph. Enjoy!

‘As we increasingly build large ontologies against complex domain knowledge in a community and collaborative manner there is an identified need for a methodology to provide a framework for this process. A shared methodology tailored for the decentralized development environment, facilitated by the internet should increasingly enable and encourage the development of ontologies fit for purpose. The Melting point methodology provides this framework which should enable the ontology community to cope with the escalating demands for scalability and repeatability in the representation of community derived knowledge bases, such as those in biomedicine and the semantic web.’