Google Scholar Citations and Microsoft Academic Search: first impressions

August 3, 2011

There’s been a lot of chat in my scientific circles lately about the recent improvements in freely-available, easily-accessible web applications to organise and store your publications. Google Scholar Citations (my profile) and Microsoft Academic Search (my profile) are the two main contenders, but there are many other resources to use which are mighty fine in their own right (see my publication list on Citeulike for an example of this). Some useful recent blog posts and articles on this subject are:

  • Egon Willighagen’s post includes lots of good questions about the future of free services like GSC and MAS, and how they relate to his use of more traditional services such as the Web Of Science.
  • Alan Cann’s impressions of GSC, including a nice breakdown of his citations by type.
  • Jane Tinkler’s comparison of GSC and MAS, together with a nice description of what happens when a GSC account is made (HT Chris Taylor @memotypic for the link to the article).
  • Nature News’ comparison of GSC and MAS, and impressions of how these players might change the balance of power between free and non-free services.

While I do like each of these technologies for a number of reasons, there are also reasons to be less happy with them. Firstly, their similarities (please be aware I am not trying to make an exhaustive list – just my impressions after using each product for a few days). They both allow the merging of authors, a feature that was very useful to me as I changed my name when I got married. Neither service has a fantastic interface for the merge, but it worked. Both provide some basic metrics: GSC has the h-index and the i10-index, while MAS uses the g- and h-indexes. Both tell you how many other papers have cited each of your publications. Both seem to get things mostly right (and a little bit wrong) when assigning publications to me – I had to manually fix both apps. Both provide links to co-authors, though GSC’s is rather limited, as co-authors have to actively create a profile there, while with MAS profiles are built automatically.
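For anyone unsure what those metrics actually measure, here is a minimal Python sketch of the usual textbook definitions. It is not how either GSC or MAS computes its numbers internally (I don't know that), and the citation counts in the usage note are made up purely for illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


def i10_index(citations):
    """Number of papers with at least ten citations."""
    return sum(1 for count in citations if count >= 10)
```

With the invented list `[12, 5, 3, 1, 0, 0]`, these functions give an h-index of 3, a g-index of 4 and an i10-index of 1, which shows how one well-cited paper lifts the g-index but not the h-index.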

Things I like about Microsoft Academic Search:

  1. Categorisation of publications. You can look down the left-hand side and see your papers categorised by type, keyword, etc.
  2. Looks nicer. Yes, I like Google. But Microsoft’s offering is just a lot better looking.
  3. Found more ancillary stuff. It found my work page (though the URL has since changed), and from there a picture of me. Links out to Bing (of course) and nice organisation of basic info really just makes it look more professional than GSC.
  4. Bulk import of citations in Bibtex format. I really like this feature – I was able to bulk add the missing citations in one fell swoop using a bibtex upload. Shiny!

Things I don’t like about Microsoft Academic Search:

  1. Really slow update time. It insists on vetting each change with a mysterious Microsoftian in the sky. I’ve made a bunch of changes to the profile, updated and added publications, and days later it still hasn’t shown those changes. It’s got to get better if it doesn’t want to irritate me. Sure, do a confirmation step to ensure I am who I say I am, but then give me free rein to change things!
  2. Silverlight. I’ve tried installing Moonlight, which seemed to install just fine, but then the co-author graph just showed up empty. Is that a fault with Moonlight, or with the website?
  3. Did I mention the really slow update time?

Things I like about Google Scholar Citations:

  1. Changes are immediately visible. Yes, if I merge authors or remove publications or anything else, it shows up immediately on my profile.
  2. No incompatibilities with Linux. All links work, no plugins required.

Things I don’t like about Google Scholar Citations:

  1. Lack of interesting extras. The graphs, fancy categorisations etc. you get with MAS you don’t (yet) get with the Google service.
  2. No connection with the Google profile. Why can’t I provide a link to my Google profile, and then get integration with Google+, e.g. announcements when new publications are added? This is a common complaint with Google+, as many other Google services (e.g. Google reader) aren’t yet linked with it, but hopefully this will come eventually.
  3. Not as pretty. Also, I’m not sure if it’s just my types of papers, but the links in GSC to the individual citations are difficult to read, and it’s hard to determine the ultimate source of the article (e.g. journal or conference name).

I will still use Citeulike as my main publication application. I use it to maintain the library of my own papers and other people’s papers. Its import and export features for bibtex are great, and it can convert web pages to citations with just one click (or via a bookmarklet). It has loads of other bells and whistles as well. While I’m writing up my thesis, I visit it virtually every day to add citations and export bibtex.

So, between Google and Microsoft, which do I like better? They’re very similar, but Microsoft Academic Search wins right now. But both services are improving daily, and we’ll have to see how things change in future.

And the thing that really annoys me? I now feel the need to keep my publications up to date on three systems: Citeulike (it’s the thing I actually use when writing papers etc.), Microsoft Academic Search, and Google Scholar Citations. No, I don’t *have* to maintain all three, but people can find out about me from all of them, and I want to try to ensure what they see is correct. Irritating. Can we just have some sensible researcher IDs in common use, and from that an unambiguous way to discover which publications are mine? I know efforts are under way, but it can’t come soon enough.

A new journal, a new bogus review: again, the culprits are banned

July 19, 2011

This is an update on an earlier post, A case of stolen professional identity, which is just a little too long to add directly to the original post.

Last month I received a request for review. But it wasn’t just any request: the title looked suspiciously similar to the title of a paper whose review I had been asked to confirm back in March. In the original incident (described here), the culprits were found to have created a false gmail account for me and submitted my name as a reviewer with that false email account. They were banned from that journal (let’s call it Journal1). I really didn’t think they would try again, at least not with the same fake review setup (specifically, my name and a false email account).

But they did.

Journal2 sent me a standard review request in June. It turns out that behind the scenes, a bogus email address was used, though I’m not sure it was the same one used for Journal1. The affiliation that was provided to Editor2 by the author of the submission appeared strange to him, and so Editor2 searched for me online and found my institutional address. And, as with last time, mine wasn’t the only identity used fraudulently. I noticed the similarities and put Editor1 in touch with Editor2. They compared notes and discovered it was the same paper, the same authors, and the same trick being attempted again.

Turns out fake emails weren’t the only fake thing about them. Though I don’t know the details, I believe there were also fake phone numbers and perhaps fake affiliations.

The authors have been formally banned from Journal2 (as happened with Journal1), and I have to say I wanted to cheer when I saw the words “Please note that this type of behavior is not acceptable in science and will not be tolerated.” Editor2 is going to move things forward, including bringing it to the attention of the Committee on Publication Ethics. I am also looking into ways to take away the false gmail account from whoever owns it, so that hopefully at least that permutation of my name cannot be used again. However, that isn’t a practical solution in the long term, as there are many, many possible permutations.

I had hoped it wouldn’t happen again, but it has, and quickly. Seems like the single thing that would help the most, while needing the smallest change to the existing system, would be to require institutional addresses. Additionally, open peer review might help, though you’d have to do a regular search to ensure that someone didn’t publish an open peer review pretending to be you. My thanks go to both Editor1 and Editor2 for allowing me to, once again, write about these experiences. With knowledge comes great responsibility, yes, but also forewarning.

And I hope it doesn’t happen again. Again.

What should you think about when you think about standards?

July 12, 2011

The creation of a new standard is very exciting (yes, really). You can easily get caught up in the fun of the moment, and just start creating requirements and minimal checklists and formats and ontologies…. But what should you be thinking about when you start down this road? Today, the second and final day of the BBSRC Synthetic Biology Standards Workshop, was about discussing what parts of a synthetic biology standard are unique to that standard, and what can be drawn from other sources. And, ultimately, it was about reminding ourselves not to reinvent the wheel and not to require more information than the community was willing to provide.

Matthew Pocock gave a great introduction to this topic when he summarized what he thinks about when he thinks about standards. Make sure you don’t miss my notes on his presentation further down this post.

(If you’re interested, have a look at yesterday’s blog post on the first day of this workshop: The more things change, the more they stay the same.)

Half a day was a perfect amount of time to get the ball rolling, but we could have talked all day and into the next. Other workshops are planned for the coming months, and it will be very interesting to see what happens as things progress, both in person and via remote discussions.

Once again, for the time constrained among us, here are my favorite sentences from the presentations and discussions of the day:

  1. Dick Kitney: Synthetic biology is already important in industry, and if you want to work with major industrial companies, you need to get acceptance for your standards, making the existing standard (DICOM) very relevant to what we do here.
  2. Matthew Pocock: Divide your nascent standard into a continuum of uniqueness, from the components of your standard which are completely unique to your field, through to those which are important but have overlap with a few other related fields, and finally to the components which are integral to the standard but which are also almost completely generic.
  3. Discussion 1: Modelling for the purposes of design is very different from modelling for the purposes of analysis and explanation of existing biology.
  4. Discussion 2: I learnt that, just as in every other field I’ve been involved in, there are terms in synthetic biology so overloaded with meaning (for example, “part”) it is better to use a new word when you want to add those concepts to an ontology or controlled vocabulary.

Dick Kitney – Imperial College London: “Systematic Design and Standards in Synthetic Biology”

Dick Kitney discussed how SynBIS, a synthetic biology web-based information system with an integrated BioCAD and modelling suite, was developed and how it is currently used. There are three parts to the CAD in SynBIS: DNA assembly, characterization, and chassis (data for SynBIS). They are using automation in the lab as much as possible. With BioCAD, you can use a parallel strategy for both computer modelling and the synthetic biology itself.

With SynBIS, you can get inputs from other systems as well as part descriptions, models and model data from internal sources. SynBIS has four layers: an interface/HTML layer, a communication layer, an application layer and a database layer.

Information can be structured into four types: the biological “continuum” (or the squishy stuff), modalities (experimental types, standards relating to such), (sorry – missed this one), and ontologies. SynBIS incorporates the DICOM standard for its biological information. DICOM can be used and modified to store/send parts and associated metadata, related images, and related/collected data. They are interested in DICOM because of the industrialization of synthetic biology. Most major industries and companies already use the DICOM standard. If you want to work with major industrial companies, you need to get acceptance for your standards, making DICOM very important. The large number of users of DICOM is a result of the large amount of effort that went into the creation of this modular, modality-friendly standard.

Images are getting more and more important for synthetic biology. If you rely on GFP fluorescence, for example, then you need high levels of accuracy in order to replicate results. DICOM helps you do this. It isn’t just a file format, and includes transfer protocols etc. Each image in DICOM has its own metadata.

What are the downsides of DICOM? DICOM is very complex, and most academics might not have the resources to make use of it (the specification is a huge 3,000-page document). In actuality, however, it is a lot easier to use than you might think. There are libraries, viewers and standard packages that hide most of the complexity. What is the most popular use of DICOM right now? MRCT, ultrasound, light microscopy, lab data, and many other modalities. In a hospital, most machines’ outputs are compliant with DICOM.
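To give a flavour of the file-format side of DICOM (this is my own illustration, not something shown in the talk): a DICOM Part 10 file begins with a 128-byte preamble followed by the four bytes `DICM`. The sketch below only checks for that marker using the Python standard library; the function names are hypothetical, and for anything real you would reach for one of the existing libraries rather than parse the format by hand.

```python
DICOM_PREAMBLE_LENGTH = 128   # bytes of preamble before the marker
DICOM_MAGIC = b"DICM"         # four-byte marker that follows the preamble


def looks_like_dicom(data: bytes) -> bool:
    """Return True if the bytes carry the DICM marker after the preamble."""
    start = DICOM_PREAMBLE_LENGTH
    return data[start:start + len(DICOM_MAGIC)] == DICOM_MAGIC


def file_looks_like_dicom(path: str) -> bool:
    """Read just enough of a file to apply the same check."""
    with open(path, "rb") as handle:
        header = handle.read(DICOM_PREAMBLE_LENGTH + len(DICOM_MAGIC))
    return looks_like_dicom(header)
```

Everything after that marker (the per-image metadata mentioned above, the transfer syntax, the pixel data) is where the 3,000 pages come in, which is exactly why the libraries exist.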

As SBOL develops and expands, they plan to incorporate it into SynBIS.

Issues relating to the standard – Run by Matthew Pocock

The rest of the workshop was structured discussion on the practical aspects of building this standard. Matthew Pocock corralled us all and made sure we remained useful, and also provided the discussion points.

To start, Matt provided some background. What does he ponder when he thinks about standards? Adoption of the standard for one, and who your adopters might be. Such people would be providers of data, consumers of data, or both. Also, both machines and humans will interact with the standard. The standard should be easy to implement, with a low buy-in.

You need to think about copyright and licensing issues: who owns the standard, and who maintains it? Are people allowed to change it for their own or public use? Your standard needs to have a clearly-defined scope: you don’t want it to force you to think about what you’re not interested in. To do this, you should have a list of competency questions.

You want the standard to be orthogonal to other standards, and to compose with any related standards you wish to use but which don’t belong in your new standard. You should define a minimal level of compliance that data must meet in order to be accepted.

Finally, above all, users of your standard would like it to be lightweight and agile.

What are the technical areas that standards often cover? You should have domain-specific models of what you’re interested in (terminologies, ontologies, UML): essentially, what your data looks like. You also need to have a method of data persistence and protocols, e.g. how you write it down (format, XML, etc.). You also need to think about transport of the data, or how you move it about (SOAP, REST, etc.). Access has to be thought about as well, or how you query for some of the data (SQL, DAS, custom API, etc.).
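As a rough illustration of three of those areas (my own sketch, not something presented at the workshop), here is a toy Python example with a data model, a persistence format and an access function. The `Feature` class, the XML element names and the `find_by_kind` helper are all hypothetical, and the sequences are toy strings rather than real parts. Transport is left out, since it is mostly a matter of which protocol carries the serialized text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Feature:
    """Domain model: what the data looks like."""
    identifier: str
    kind: str
    sequence: str


def to_xml(features):
    """Persistence: write the model down as XML text."""
    root = ET.Element("features")
    for feature in features:
        node = ET.SubElement(root, "feature",
                             id=feature.identifier, kind=feature.kind)
        node.text = feature.sequence
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Persistence: read the XML text back into model objects."""
    root = ET.fromstring(text)
    return [Feature(node.get("id"), node.get("kind"), node.text or "")
            for node in root.findall("feature")]


def find_by_kind(features, kind):
    """Access: query for some of the data."""
    return [feature for feature in features if feature.kind == kind]
```

The point of keeping the three functions apart is that you could swap the XML for another format, or the list comprehension for SQL, without touching the model itself.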

Within synthetic biology, there is a continuum from incredibly generic, useful standards through to things that are absolutely unique to our (synthetic biology) use case, and then in between is stuff that’s really important, but which might be shared with some other areas such as systems biology. For example, LIMS, and generic metadata are completely generic and can be taken care of by things like Dublin Core. DNA sequence and features are important to synthetic biology, but are not unique to it. Synthetic biology’s peculiar constraints include things like a chassis. You could say that host is synonymous with chassis, but in fact they are completely different roles. Chassis is a term used to describe something very specific in synthetic biology.

Some fields relevant to synthetic biology: microscopy, all the ‘omics, genetic and metabolic engineering, bioinformatics.

Discussion 1

Consider the unique ↔ generic continuum: where do activities in the synthetic biology lifecycle lie on the diagram? What standards already exist for these? What standards are missing?

The notes that follow are a merge of the results from the two groups, but it may be an imperfect merge and as a consequence, there may be some overlap.

UNIQUE (to synthetic biology)

  • design (the composition of behaviour (rather than of DNA, for example)).
    • modelling a novel design is different than modelling for systems biology, which seeks to discover information about existing pathways and interactions
    • quantification for design
  • Desired behaviour: higher-level design, intention. I am of the opinion that other fields also have an intention when performing an experiment, which may or may not be realized during the course of an experiment. I may be wrong in this, however. And I don’t mean an expected outcome – that is something different again.
  • Device (reusable) / parts / components
  • Multi-component, multiple-stage assembly
    • biobricks
    • assembly and machine-automated characterization, experiments and protocols (some of this might be covered in more generic standards such as OBI)
  • Scale and scaling of design
  • engineering approaches
  • characterization
  • computational accessibility
  • positional information
  • metabolic load (burden)
  • evolutionary stability

IMPORTANT

  • modelling (from systems biology): some aspects of both types of modelling are common.
    • you use modelling tools in different ways when you are starting from a synbio viewpoint
    • SBML, CellML, BioPAX
  • module/motifs/components – reusable models
  • Biological interfaces (RiPS, PoPS)
  • parts catalogues
  • interactions between parts (and hosts)
  • sequence information
  • robustness to various conditions
  • scaling of production

GENERIC

  • Experimental (Data, Protocols)
    • OBI + FuGE
  • sequence and feature metadata
    • SO, GO
  • LIMS
  • success/performance metrics (comparison with specs)
  • manufacturing/production cost

Discussion 2

From the components of a synthetic biology standard identified in discussion 1, choose two and answer:

  • what data must be captured by the standard?
  • What existing standards should it leverage?
  • Where do the boundaries lie?

Parts and Devices

What data must be captured by the standard? Part/device definition/nomenclature, sequence data, type (enumerated list), relationships between parts (enumerated list / ontology), part aggregation (ordering and composition of nested parts), incompatibilities/contraindications (including range of hosts where the chassis is viable), part buffers and interfaces/input/output (as a sub-type of part), provenance, curation level. Any improvements (include what changes were made, and why they were made, e.g. mCherry with the linkers removed); versioning information (version number, release notes, feature list, and known issues); equivalent parts which are customized for other chassis (codon optimization and usage, chassis-agnostic part); provenance information including authorship, originating lab, and the date/age of the part (much covered by the SBOL-seq standard); the derivation of the part from other parts or other biological sequence databases, and a human- and machine-readable description of the derivation.

What existing standards? SBOL, DICOM, SO, EMBL, MIBBI

Boundaries: Device efficiency (only works in the biological contexts it’s been described in), chassis and its environment, related parts could be organized into part ‘families’ (perhaps use GO for some of this), also might be able to attach other quantitative information that could be common across some parts.
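To make the list above a little more concrete, here is a minimal Python sketch of what a part record with nesting, versioning, derivation and a minimal compliance check might look like. This is my own illustration and not SBOL or any other existing format: every field name, the allowed type list and the rules in `compliance_problems` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical enumerated list of part types.
ALLOWED_TYPES = {"promoter", "rbs", "cds", "terminator", "device"}


@dataclass
class Part:
    identifier: str
    part_type: str
    sequence: str = ""
    version: str = "1"
    originating_lab: str = ""
    derived_from: Optional[str] = None   # identifier of the parent part
    change_note: str = ""                # what was changed, and why
    subparts: List["Part"] = field(default_factory=list)

    def full_sequence(self) -> str:
        """Aggregate nested parts in order; leaf parts return their own sequence."""
        if self.subparts:
            return "".join(part.full_sequence() for part in self.subparts)
        return self.sequence


def compliance_problems(part: Part) -> List[str]:
    """Return the reasons a record fails the (assumed) minimal checklist."""
    problems = []
    if not part.identifier:
        problems.append("missing identifier")
    if part.part_type not in ALLOWED_TYPES:
        problems.append("unknown part type")
    if not part.full_sequence():
        problems.append("missing sequence")
    if part.derived_from and not part.change_note:
        problems.append("derived part lacks a description of the change")
    return problems
```

A device would then be a `Part` of type `device` whose `subparts` list holds its promoter, RBS, CDS and terminator in order, and an improved part would point back at its parent through `derived_from`.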

Characterization

We need to state the type of the device, and we would need a new specification for each type of device, e.g. a promoter is not a GFP. We need to know some measurement information such as statistics, experimental conditions required to record, lab, protocols. Another important value is whether or not you’re using a reference part or device. The context information would include the chassis, in vitro/in vivo, conditions, half-life, and interactions with other devices/hosts.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

The more things change, the more they stay the same

July 11, 2011

…also known as Day 1 of the BBSRC Synthetic Biology Standards Workshop at Newcastle University, and musings arising from the day’s experiences.

In my relatively short career (approximately 12 years – wait, how long?) in bioinformatics, I have been involved to a greater or lesser degree in a number of standards efforts. It started in 1999 at the EBI, where I worked on the production of the protein sequence database UniProt. Now, I’m working with systems biology data and beginning to look into synthetic biology. I’ve been involved in the development (or maintenance) of a standard syntax for protein sequence data; standardized biological investigation semantics and syntax; standardized content for genomics and metagenomics information; and standardized systems biology modelling and simulation semantics.

(Bear with me – the reason for this wander through memory lane becomes apparent soon.)

How many standards have you worked on? How can there be multiple standards, and why do we insist on creating new ones? Doesn’t the definition of a standard mean that we would only need one? Not exactly. Take the field of systems biology as an example. Some people are interested in describing a mathematical model, but have no need for storing either the details of how to simulate that model or the results of multiple simulation runs. These are logically separate activities, yet they fall within a single community (systems biology) and are broadly connected. A model is used in a simulation, which then produces results. So, when building a standard, you end up with the same separation: have one standard for the modelling, another for describing a simulation, and a third for structuring the results of a simulation. All that information does not need to be stored in a single location all the time. The separation becomes even more clear when you move across fields.
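A toy Python sketch of that three-way separation, assuming nothing about any real standard: the model, the simulation description and the results are separate records that refer to each other only by identifier. The class names, the exponential decay model and the simple fixed-step update are all my own illustrative choices.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DecayModel:
    """The model: what the system is, with no mention of how to run it."""
    identifier: str
    initial_amount: float
    decay_rate: float


@dataclass(frozen=True)
class SimulationSetup:
    """The simulation description: how to run a model, referenced by id."""
    identifier: str
    model_id: str
    step: float
    steps: int


@dataclass(frozen=True)
class SimulationResult:
    """The results: numbers produced by one simulation, referenced by id."""
    simulation_id: str
    amounts: Tuple[float, ...]


def run(model: DecayModel, setup: SimulationSetup) -> SimulationResult:
    """Apply a fixed-step update of d(amount)/dt = -decay_rate * amount."""
    if setup.model_id != model.identifier:
        raise ValueError("simulation setup refers to a different model")
    amount = model.initial_amount
    amounts = [amount]
    for _ in range(setup.steps):
        amount += -model.decay_rate * amount * setup.step
        amounts.append(amount)
    return SimulationResult(setup.identifier, tuple(amounts))
```

Someone who only cares about the model can store and exchange `DecayModel` records alone, while someone comparing runs needs only the setup and result records, which is the same argument for having three standards rather than one.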

But this isn’t completely clear cut. Some types of information overlap within standards of a single domain and even among domains, and this is where it gets interesting. Not only do you need a single community talking to each other about standard ways of doing things, but you also need cross-community participation. Such efforts result in even more high-level standards which many different communities can utilize. This is where work such as OBI and FuGE sit: with such standards, you can describe virtually any experiment. The interconnectedness of standards is a whole job (or jobs) in itself – just look at the BioSharing and MIBBI projects. And sometimes standards that seem (at least mostly) orthogonal do share a common ground. Just today, Oliver Ruebenacker posted some thoughts on the biopax-discuss mailing list where he suggests that at least some of BioPAX and SBML share a common ground and might be usefully “COMBINE“d more formally (yes, I’d like to go to COMBINE; no, I don’t think I’ll be able to this year!). (Scroll down that thread for a response by Nicolas Le Novère as to why that isn’t necessarily correct.) So, orthogonality, or the extent to which two or more standards overlap, is sometimes a hard thing to determine.

So, what have I learnt? As always, we must be practical. We should try to develop an elegant solution, but it really, really should be one which is easy to use and intuitive to understand. It’s hard to get to that point, especially as I think that point is (and should be) a moving target. From my perspective, group standards begin with islands of initial research in a field, which then gradually develop into a nascent community. As a field evolves, ‘just-enough’ strategies for storing and structuring data become ‘nowhere-near-enough’. Communication with your peers becomes more and more important, and it becomes imperative that standards are developed.

This may sound obvious, but the practicalities of creating a community standard mean such work requires a large amount of effort and continued goodwill. Even with the best of intentions, with every participant working towards the same goal, it can take months – or years – of meetings, document revisions and conference calls to hash out a working standard. This isn’t necessarily a bad thing, though. All voices do need to be heard, and you cannot have a viable standard without input from the community you are creating that standard for. You can have the best structure or semantics in the world, but if it’s been developed without the input of others, you’ll find people strangely reluctant to use it.

Every time I take part in a new standard, I see others like me who have themselves been involved in the creation of standards. It’s refreshing and encouraging. Hopefully the time it takes to create standards will drop as the science community as a whole gets more used to the idea. When I started, the only real standards in biological data (at least that I had heard of) were the structures defined by SWISS-PROT and EMBL/GenBank/DDBJ. By the time I left the EBI in 2006, I could have given you a list a foot long (GO, PSI, and many others), and that list continues to grow. Community engagement and cross-community discussions continue to be popular.

In this context, I can now add synthetic biology standards to my list of standards I’ve been involved in. And, as much as I’ve seen new communities and new standards, I’ve also seen a large overlap in the standardization efforts and an even greater willingness for lots of different researchers to work together, even taking into account the sometimes violent disagreements I’ve witnessed! The more things change, the more they stay the same…

At this stage, it is just a limited involvement, but the BBSRC Synthetic Biology Standards Workshop I’m involved in today and tomorrow is a good place to start with synthetic biology. I describe most of today’s talks in this post, and will continue with another blog post tomorrow. Enjoy!

For those with less time, here is a single sentence for each talk that most resounded with me:

  1. Mike Cooling: Emphasise the ‘re’ in reusable, and make it easier to build and understand large models from reusable components.
  2. Neil Wipat: For a standard to be useful, it must be computationally amenable as well as useful for humans.
  3. Herbert Sauro: Currently there is no formal ontology for synthetic biology, but one will need to be developed.

This meeting is organized by Jen Hallinan and Neil Wipat of Newcastle University. Its purpose is to set up key relationships in the synthetic biology community to aid the development of a standard for that community. Today, I listened to talks by Mike Cooling, Neil Wipat, and Herbert Sauro. I was – unfortunately – unable to be present for the last couple of talks, but will be around again for the second – and final – day of the workshop tomorrow.

Mike Cooling – Bioengineering Institute Auckland, New Zealand

Mike uses CellML (it’s made where he works, but that’s not the only reason…) in his work with systems and synthetic biology models. Among other things, it wraps MathML and partitions the maths, variables and units into reusable pieces. Although many of the parts seem domain specific, CellML itself is actually not domain specific. Further, unlike other modelling languages such as SBML, components in CellML are reusable and can be imported into other models. (Yes, a new package called comp in SBML Level 3 is being created to allow the importing of models into other models, but it isn’t mature – yet.)

How are models stored? There is the CellML repository, but what is out there for synthetic biology? The Registry of Standard Biological Parts was available, but only described physical parts. Therefore they created a Registry of Standard Virtual Parts (SVPs) to complement the original registry. This was developed as a group effort with a number of people including Neil Wipat and Goksel Misirli at Newcastle University.

They start with template mathematical structures (which are little parts of CellML), and then use the import functionality available as part of CellML to combine the templates into larger physical things/processes (‘SVPs’) and ultimately to combine things into system models.

They extended the CellMLRepository to hold the resulting larger multi-file models, which included adding a method of distributed version control and allowing the sharing of models between projects through embedded workspaces.

What can these pieces be used for? Some of this work included creating a CellML model of the biology represented in Levskaya et al. 2005 and depositing all of the pieces of the model in the CellML repository. Another example is a model he’s working on about shear stress and multi-scale modelling for aneurysms.

Modules are being used and are growing in number, which is great, but he wants to concentrate more at the moment on the ‘re’ of the reusable goal, and make it easier to build and understand large models from reusable components. Some of the integrated services he’d like to have: search and retrieval, (semi-automated) visualization, semantically-meaningful metadata and annotations, and semi-automated composition.

All of the work above converges on the importance of metadata. Not many people used version 1.0 of the CellML Metadata Framework. With version 2.0 they have developed a core specification which is very simple, and then provide many additional satellite specifications. For example, there is a biological information satellite, where you use the BioModels qualifiers as relationships between your data and MIRIAM URNs. The main challenge is to find a database that is at the right level of abstraction (e.g. canonical forms of your concept of interest).
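As a rough illustration of that kind of annotation (mine, not taken from the talk or from the CellML metadata specification), the Python sketch below builds a subject/qualifier/object triple. It assumes the `urn:miriam:<collection>:<identifier>` form with the identifier percent-encoded, and the BioModels biology qualifier namespace as I understand them; the subject name `#apoptosis_component` and the function names are hypothetical.

```python
from urllib.parse import quote

# Namespace of the BioModels biology qualifiers (assumed here).
BQBIOL = "http://biomodels.net/biology-qualifiers/"


def miriam_urn(collection: str, identifier: str) -> str:
    """Build a MIRIAM-style URN, percent-encoding the identifier."""
    return "urn:miriam:{}:{}".format(collection, quote(identifier, safe=""))


def annotate(subject: str, qualifier: str, collection: str, identifier: str):
    """Return a (subject, predicate, object) triple linking data to a URN."""
    return (subject, BQBIOL + qualifier, miriam_urn(collection, identifier))


# Hypothetical model component annotated with a Gene Ontology term.
triple = annotate("#apoptosis_component", "isVersionOf", "obo.go", "GO:0006915")
```

The triple itself is trivial to produce; the hard part is the one named above, choosing a database entry that sits at the right level of abstraction for the thing you are annotating.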

Neil Wipat – Newcastle University

Please note Neil Wipat is my PhD supervisor.

Neil spoke about data standards, tool interoperability, data integration and synthetic biology, a.k.a. “Why we need standards”. They would like to promote interoperability and data exchange between their own tools (important!) as well as other tools. They’d also like to facilitate data integration to inform the design of biological systems both from a manual designer’s perspective and from the POV of what is necessary for computational tool use. They’d also like to enable the iterative exchange of data and experimental protocols in the synthetic biology life cycle.

A description of some of the tools developed in Neil’s group (and elsewhere) exemplifies the differences in data structures present within synthetic biology. BacilloBricks was created to help get, filter and understand the information from the MIT registry of standard parts. They also created the Repository of Standard Virtual Biological Parts. This SVP repository was then extended with parts from Bacillus and was extended to make use of SBML as well as CellML. This project is called BacilloBricks Virtual. All of these tools use different formats.

It’s great having a database of SVPs, but you need a way of accessing and utilizing the database. Hallinan and Wipat have started a collaboration with the people at Microsoft Research who created a programming language for genetic engineering of living cells, called the genetic engineering of cells (GEC) simulator. A summer student created a GEC compiler for SVPs from BacilloBricks Virtual. Goksel has also created the MoSeC system, where you can automatically go from a model to a graph to an EMBL file.

They also have BacillusRegNet, which is an information repository about transcription factors for Bacillus spp. It is also a source of orthogonal transcription factors for use in B. subtilis and Geobacillus. Again, it is very important to allow these tools to communicate efficiently.

The data warehouse they’re using is ONDEX. They feed information from the ONDEX data store to the biological parts database. ONDEX was created for systems biology to combine large experimental datasets. ONDEX views everything as a network, and is therefore a graph-based data warehouse. ONDEX has a “mini-ontology” to describe the nodes and edges within it, which makes querying the data (and understanding how the data is structured) much easier. However, it doesn’t include any information about the synthetic biology side of things. Ultimately, they’d like an integrated knowledgebase using ONDEX to provide information about biological virtual parts. Therefore they need a rich data model for synthetic biology data integration (perhaps including an RDF triplestore).
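To make the "everything is a network, described by a mini-ontology" idea concrete, here is a minimal sketch of a graph whose nodes and edges are checked against a small set of allowed types. This is not the ONDEX API or its data model: the class, the type names and the example gene are all hypothetical.

```python
# Hypothetical mini-ontology: these type names are illustrative, not ONDEX's own.
NODE_TYPES = {"Gene", "Protein"}
EDGE_TYPES = {"encodes": ("Gene", "Protein"), "regulates": ("Protein", "Gene")}


class TypedGraph:
    def __init__(self):
        self.nodes = {}   # node id -> node type
        self.edges = []   # (source id, edge type, target id)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, edge_type, target):
        # An edge is only accepted if the mini-ontology allows it between these node types.
        expected = EDGE_TYPES.get(edge_type)
        actual = (self.nodes[source], self.nodes[target])
        if expected != actual:
            raise ValueError(f"{edge_type} cannot link {actual[0]} to {actual[1]}")
        self.edges.append((source, edge_type, target))

    def neighbours(self, source, edge_type):
        return [t for s, e, t in self.edges if s == source and e == edge_type]


graph = TypedGraph()
graph.add_node("sigA", "Gene")
graph.add_node("SigA", "Protein")
graph.add_edge("sigA", "encodes", "SigA")
print(graph.neighbours("sigA", "encodes"))
```

Because every edge is typed, a query such as "which proteins does this gene encode?" needs no knowledge of where the data originally came from. Extending the type sets is where synthetic biology concepts would have to be added.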

Interoperability, Design and Automation: why we need standards.

Requirement 1. There needs to be interoperability and data exchange among these tools as well as among these tools and other external tools. Requirement 2. Standards for data integration aid the design of synthetic systems. The format must be both computationally amenable and useful for humans. Requirement 3. Automation of the design and characterization of synthetic systems also requires standards.

The requirements of synthetic biology research labs such as Neil Wipat’s make it clear that standards are needed.

KEYNOTE: Herbert Sauro – University of Washington, US

Herbert Sauro described the developing community within synthetic biology, the work on standards that has already begun, and the Synthetic Biology Open Language (SBOL).

He asks us to remember that Synthetic Biology is not biology – it’s engineering! Beware of sending synthetic biology grant proposals to a biology panel! It is a workflow of design-build-test. He’s mainly interested in the bit between building and testing, where verification and debugging happens.

What’s so important about standards? They’re critical in engineering, where they increase productivity and lower costs. In order to identify the requirement you must describe a need. There is one immediate need: store everything you need to reconstruct an experiment within a paper (for more on this see the Nature Biotech paper by Peccoud et al. 2011: Essential information for synthetic DNA sequences). Currently, it’s almost impossible to reconstruct a synthetic biology experiment from a paper.

There are many areas requiring standards to support the synthetic biology workflow: assembly, design, distributed repositories, laboratory parts management, and simulation/analysis. From a practical POV, the standards effort needs to allow researchers to electronically exchange designs with round tripping, and much more.

The standardization effort for synthetic biology began with a grant from Microsoft in 2008 and the first meeting was in Seattle. The first draft proposal was called PoBoL but was renamed to SBOL. It is a largely unfunded project. In this way, it is very similar to other standardization projects such as OBI.

DARPA mandated 2 weeks ago that all projects funded from Living Foundries must use SBOL.

SBOL is involved in the specification, design and build part of the synthetic biology life cycle (but not in the analysis stage). There are a lot of tools and information resources in the community where communication is desperately needed.

There are three parts to SBOL: SBOL Semantic, SBOL Visual, and SBOL Script. SBOL Semantic is the one that’s going to be doing all of the exchange between people and tools. SBOL Visual is a controlled vocabulary and symbols for sequence features.

Have you been able to learn anything from SBML/SBGN, as you have a foot in both worlds? SBGN doesn’t address any of the genetic side, and is pretty complicated. You ideally want a very minimalistic design. SBOL Semantic is written in UML and is relatively small, though it has taken three years to get to this point. But you need host context above and beyond what’s modelled in SBOL Semantic. Without it, you cannot recreate the experiment.

Feature types such as operator sites, promoter sites, terminators, restriction sites etc can go into the sequence ontology (SO). The SO people are quite happy to add these things into their ontology.

SBOLr is a web front end for a knowledgebase of standard biological parts that they used for testing (not publicly accessible yet). TinkerCell is a drag and drop CAD tool for design and simulation. There is a lot of semantic information underneath to determine what is/isn’t possible, though there is no formal ontology. However, you can semantically annotate all parts within TinkerCell, allowing the plugins to interpret a given design. A TinkerCell model can be composed of sub-models. This makes it easy to swap in new bits of models to see what happens.

WikiDust is a TinkerCell plugin written in Python which searches SBPkb for design components, and ultimately uploads them to a wiki. LibSBOLj is a library for developers to help them connect software to SBOL.

The physical and host context must be modelled to make all of this useful. By using semantic web standards, SBOL becomes extensible.

Currently there is no formal ontology for synthetic biology but one will need to be developed.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

How can socks facilitate scientific outreach?

July 4, 2011

OK, it seems odd, at least on the surface. How can socks help science generally, and science outreach specifically? I asked myself the same question a few months ago when I found an email lurking in my inbox, hidden since just before my maternity leave started. It seems a sock company called Sock It To Me socks featured a different Cool Girl each month, and they wished to feature me. I had a lot of questions. Was this for real? Was it an appropriate platform to be talking about myself, and about what I do? Would it seem as if I was trading my online work persona for socks?

Well, the other Cool Girls’ profiles seemed eclectic and interesting: dancers, astrophysicists, mathematicians and many others. So, not bad company to be in, and it was a genuine request. And, before you ask, I will be getting two pairs of socks for my efforts – ah, temptation. But ultimately, I need to take care of my work/public online persona, and I had to decide whether this was a good addition. But then I realized I could talk about ontologies to people who may have never even heard of bioinformatics and, for me, that was too exciting an opportunity to miss. True, it was limited to 600 words, and a writer used the information I gave her to write the final piece, but I think it was all worth it. She did a great job, and within the confines of the article format, I’m happy about how my field of research is portrayed. I really feel strongly about science outreach, and I do think that novel methods of information dissemination shouldn’t necessarily be ruled out.

So, here it is: Ms Cool Girl of the Month, July 2011. What do you think? Did I benefit science or just myself (well, maybe not just myself – I namechecked my high school biology teacher, and mentioned Cameron too)?

Current Research into Reasoning over BioPAX and SBML

June 30, 2011

What’s going on these days in the world of reasoning and systems biology modelling? What were people’s experiences when trying to reason over systems biology data in BioPAX and/or SBML format? These were the questions that Andrea Splendiani wanted to answer, and so he collected three of us with some experience in the field to give 10-minute presentations to interested parties at a BioPAX telecon. About 15 people turned up for the call, and there were some very interesting talks. I’ll leave you to decide for yourselves if you’d class my presentation as interesting: it was my first talk since getting back from leave, and so I may have been a little rusty!

The first talk was given by Michel Dumontier, and covered some recent work that he and colleagues performed on converting SBML to OWL and reasoning over the resulting entities.

Essentially, with the SBMLHarvester project, the entities in the resulting OWL file can be divided into two broad categories: in silico entities covering the model elements themselves, and in vivo entities covering the (generally biological) subjects the model elements represent. They copied all of BioModels into the OWL format and performed reasoning and analysis over the resulting information. Inconsistencies were found in the annotation of some of the models, and additionally queries can be performed over the resulting data set.

I gave the second talk about my experiences a few years ago converting SBML to OWL using Model Format OWL (MFO) (paper) and then, more recently, using MFO as part of a larger semantic data integration project whose ultimate aim is to annotate systems biology models as well as create skeleton (sub)models.

I first started working on MFO in 2007, and started applying that work to the wider data integration methodology called rule-based mediation (RBM) (paper) in 2009. As with SBMLHarvester, libSBML and the OWLAPI are used in the creation of the OWL files based on BioModels entries. All MFO entries can be reasoned over, and constraints present within MFO from the SBML XSD, the SBML Manual, and from SBO do provide some useful checks on converted SBML entries. The semantics of SBMLHarvester are more advanced than those of MFO; however, MFO is intended to be a conversion of a format only, so that SWRL mappings can be used to input/output data from MFO to/from the core of the rule-based mediation. Slide 8 of the above presentation provides a graphic of how rule-based mediation works. In summary, you start with a core ontology which should be small and tightly-scoped to your biological domain of interest. Data is fed to the core from multiple syntactic ontologies using SWRL mappings. These syntactic ontologies can be either direct format conversions from other, non-OWL, formats or pre-existing ontologies in their own right. I use BioPAX in this integration work, and while I have mainly reasoned over MFO (and therefore SBML), I do also work with BioPAX and plan to work more with it in the near future.
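The flow of data from syntactic ontologies into the core can be sketched in a few lines. This is a deliberately simplified stand-in, not the real MFO, BioPAX or SWRL machinery: dictionaries play the part of OWL individuals, each entry in `MAPPINGS` plays the part of one mapping rule, and every class and property name is hypothetical.

```python
# Sketch only: dictionaries stand in for OWL individuals, and each mapping
# plays the role that a SWRL rule plays in rule-based mediation.
# All class and property names below are hypothetical.

MAPPINGS = [
    # (class in a syntactic ontology, class in the core ontology, property renames)
    ("mfo:Species", "core:Protein", {"mfo:name": "core:label"}),
    ("biopax:Protein", "core:Protein", {"biopax:displayName": "core:label"}),
]


def mediate(individual):
    """Map one syntactic-ontology individual into the core, or return None if no rule fires."""
    for source_class, core_class, renames in MAPPINGS:
        if individual["type"] == source_class:
            mapped = {"type": core_class}
            for source_prop, core_prop in renames.items():
                if source_prop in individual:
                    mapped[core_prop] = individual[source_prop]
            return mapped
    return None


sources = [
    {"type": "mfo:Species", "mfo:name": "Cdc13"},
    {"type": "biopax:Protein", "biopax:displayName": "Cdc13"},
]
core = [mediate(i) for i in sources]
print(core)
```

Both source records end up in the same shape in the core, which is the point of keeping the core small: queries are written once against the core rather than once per format.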

The final presenter was Ismael Navas Delgado, whose presentation is available from Dropbox. His talk covered two topics: reasoning over BioPAX data taken from Reactome, and the use of a database back-end called DBOWL for the OWL data. By simply performing reasoning over a large number of BioPAX entries, Ismael and colleagues were able to discover not just inconsistencies in the data entries themselves, but also in the structure of BioPAX. It was a very interesting summary of their work, and I highly recommend looking over the slides.

And what is the result of this TC? Andrea has suggested that we continue the discussion on the mailing list (contact Andrea Splendiani if you are not on it and want to be added) and then have another TC in a couple of weeks. Andrea has also suggested that it would be nice to “setup a task force within this group to prepare a proof of concept of reasoning on BioPAX, across BioPAX/SBML, or across information resources (BioPAX/OMIM…)”. I think that would be a lot of fun. Join us if you do too!

A Case of Stolen Professional Identity

May 4, 2011

…or when a bogus review with my name on it was submitted to a journal.

Update: Please read how the same people tried it again, with another journal, in this post.

At the beginning of last month, just as I was starting to work part-time on my PhD again after maternity leave, a curious and worrying thing happened. My name – and, as such, my professional identity – was stolen and used in a bogus review for a journal submission.

How it Happened

The story begins with an email I received from an Editor asking me to confirm that I had recently submitted a review to his Journal. He was suspicious, as the email address provided for me was a gmail account rather than my institutional address. While my name and affiliation in the review were correct, the gmail address was not mine. A little while later, the Editor let me know that other reviewer email addresses were equally dubious, and at least one other person had confirmed that their professional identity had also been stolen and used to create one of the other reviews for the same submission.

The review itself was badly written and very short, and I am indebted to the Editor for catching the oddness of the email address and for delving into this situation so deeply. Despite all the help provided by author and reviewer databases, a little personal attention by editors goes a long way. This Journal’s rules for reviewing are pretty standard, and as with many journals, they allow authors to submit reviewer suggestions. I don’t think this practice should be stopped, as many research communities are relatively specialised or small, and you are more likely to get suitable reviewers if the authors are able to suggest options. However, abuse of this system is possible, and I would be very surprised if nothing like this scam has happened before.

The Outcome

It was caught early here, just after the reviews were submitted. The culprits were banned. Though I’m not privy to whether or not any further legal action can or will be taken, at least there was a positive result for the Journal. The only way it could have been caught earlier is if the odd email addresses were noticed at the point the reviewer names were suggested, rather than once the reviews came in. I sincerely hope there aren’t other bogus reviews out there in other journals using anyone else’s name.

Personally Speaking…

I’d like to compliment the Editor and his Journal for discovering this unprofessional behaviour early on and for taking action. While it is a kind of dubious honour to be selected for such a scam (the scammers must think I’m a good reviewer choice?), it has been an uncomfortable experience for me personally. I expend a reasonable amount of effort on maintaining my professional online appearance. A search on my name retrieves mainly work-related hits, and this is a useful aid for both sharing work and finding other like-minded researchers. I assume this is how the scammers came up with my name, and the names of the others whose professional personae were misused in the same way. Such sub-standard reviews could harm the perception of the real researcher in the eyes of the journals concerned, and this is a worry.

Catching the Crooks

This isn’t a post on the purpose or usefulness of peer review. Whatever your views (and some are quite negative), the process is firmly entrenched in our community, at least for now. But how should we be working to prevent such scams in future?

Should journals require institutional email addresses? Should journals not accept email addresses from authors at all, and search for reviewers’ addresses independently? Certainly there are few reasons why honest reviewers would be using a non-institutional address, but is it a little too much to force such a constraint?

Additionally, there are many proponents of getting rid of anonymity in the refereeing process. Indeed, PLoS journals encourage reviewers to name themselves. Would it be more difficult to perform this kind of a scam if the name of the reviewer were visible? What if the scammers managed to succeed, and the wronged party never noticed their name on that review, visible for all to see? It could be a real blow for professional reputations.

A Final Note

I’m happy that the wrongdoers were caught, and that the Journal and Editor were open enough about what happened to encourage me to write about it: they hope that this openness will make it harder for people to perform the same stunt again. Bad reviews lead to substandard papers being accepted, which lowers the standing of whatever journal publishes them: a bad outcome for the whole community.

Hopefully this will be a timely warning to others, as I’ve never heard of it happening before. Please let me know if you’ve ever had a similar experience, as I’d be interested to hear about it.

One final thought: having written a review in my name, do you think these scammers could write my PhD thesis for me too? Hmmm, perhaps not such a good idea after all….

What are your ideas? How could such a scam be prevented in future? Let me know about your suggestions on this topic, or your own experiences. Is this more common than we think? You can contact me via the comments on this post or via the various social networking methods I use. Further information is available from my About page.

Pause in Posts: New Arrival

July 28, 2010

Just to let you know that over the coming months my posts will be infrequent, as my husband and I have a new addition to the family:

He’s great, but he’s definitely a little generator of time warps – haven’t had much time for anything else! I’ll be back in the blogging game as I have the time for it – promise!

Henning Hermjakob: PSICQUIC and EnVision

April 29, 2010

This is a presentation given on 29 April, 2010, at the Link-Age / LifeSpan Workshop on Data Handling for Biogerontology Research held by CISBAN, Newcastle University.

Data integration: one definition is to combine data residing in different sources providing users a unified view of these data. Questions of relevance for the data integration field: scope (all, datasets), type (same, different), implementation (federation, centralisation), access (programmatic i.e. computer to computer, web i.e. interactive) and ownership (public, private). Henning covers federated, mainly programmatic techniques using data of the same type in this talk.

To take an example, suppose you start with a sample (e.g. from a mouse). Observations of this sample result in one or more (overlapping or non-overlapping) publications. Then, the publication information can be used to annotate interaction databases and sent to PSICQUIC servers. PSICQUIC should allow the user to reconstruct an idealised view of the original system from the interaction data.

The molecular interaction standard is the PSI-MI standard, whose first XML version was produced in February 2004. There have been updates and extensions since then, and it has been widely implemented by the major interaction databases including DIP, MINT, MIPS, IntAct, HPRD, etc. (http://www.psidev.info/MI)

The PSI-MI XML format is full featured, but complex. This complexity is both its strength and its weakness. Therefore, due to user request, they developed a simplified tabular format called MITAB where one row equals one binary interaction. You lose a lot of information, such as whether a binary interaction is part of a more complex reaction, but it has proven popular.
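To show what "one row equals one binary interaction" looks like in practice, here is a minimal sketch that splits a single tab-separated row into named fields. It assumes the 15-column MITAB 2.5 layout; the short column names are my own labels, the sample row is made up, and the naive split on `|` would need more care for values that contain a pipe inside quotes.

```python
# Column labels follow the 15-column MITAB 2.5 layout; the sample row is made up.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab_row(line):
    """Split one tab-separated MITAB row into a dict; '-' marks an empty field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # Multiple values within one column are separated by '|'.
    return {
        name: (None if value == "-" else value.split("|"))
        for name, value in zip(MITAB25_COLUMNS, fields)
    }


sample = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Smith et al. (2009)", "pubmed:0000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])
row = parse_mitab_row(sample)
print(row["id_a"], row["id_b"], row["detection_method"])
```

Nothing in the row says whether the pair belongs to a larger complex, which is exactly the information that is lost compared with the XML format.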

PSICQUIC is one API which is implemented by many databases such as those mentioned earlier. Its purpose is querying molecular interaction databases, and it uses a common query language (MIQL, which is based on Lucene) for this data. It can be used for PPIs, drug-target interactions and simplified pathway data. The simple PSICQUIC viewer is at http://www.ebi.ac.uk/Tools/webservices/psicquic/view. The PSICQUIC viewer can also point to other resources such as IntAct and many other non-EBI databases. The viewer also has a fancier, graphics-based implementation where there is an overlay of molecular interactions on Reactome pathways.

MIQL can query every field available within MITAB in a precise way. SOAP and REST interfaces are available and documented at http://code.google.com/p/psicquic.
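As a sketch of how a field-level MIQL query might be turned into a REST request, the code below builds a Lucene-style query string and percent-encodes it into a URL. No request is sent. The base URL is a placeholder, and the `search/query/` path, the `format=tab25` parameter and the `species` and `detmethod` field names are assumptions on my part that should be checked against the PSICQUIC documentation.

```python
from urllib.parse import quote

# Placeholder base URL: substitute the REST endpoint of a real PSICQUIC service.
# The "search/query/" path segment is an assumption about the REST layout.
BASE_URL = "https://example.org/psicquic/webservices/current/search/query/"


def miql(**fields):
    """Join field:value pairs with AND; values containing spaces are quoted."""
    terms = []
    for field, value in fields.items():
        if " " in value:
            value = f'"{value}"'
        terms.append(f"{field}:{value}")
    return " AND ".join(terms)


def rest_url(query, fmt="tab25"):
    """Percent-encode the MIQL query and append it to the base URL."""
    return f"{BASE_URL}{quote(query, safe='')}?format={fmt}"


# Assumed field names, used here only to show the shape of a query.
query = miql(species="human", detmethod="two hybrid")
print(query)
print(rest_url(query))
```

Since every service implements the same interface, the same query string could in principle be sent to each database by changing only the base URL.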

The challenge is to move PSICQUIC from simple access to all the resources to a real integrated view of all those resources. How to determine if two sources really are talking about the same interaction? Also, the compute time quickly moves beyond suitable interactive times.

PSICQUIC is a technical solution, whereas IMEx is the social/collaborative answer. IMEx is the International Molecular Exchange Consortium. The aims of its members include: avoiding redundant curation, providing a consistent body of public data using detailed joint curation guidelines, and providing a network of stable and comprehensive resources for MI data. This work has been in production phase since February 2010. The work is split up into the different databases by journal type. You can find out more information about IMEx at http://imex.sf.net. Each interaction has its own database’s identifier, but also an identifier from a common IMEx identifier space. The hardest part was harmonizing curation procedures, and they now have a common curation manual across all databases.

Another aspect of his work is EnCORE, which integrates different data types using a federated, programmatic approach. EnCORE is an ENFIN platform to enable mining data across various domains, sources, formats and types. It integrates database resources and analysis tools across different disciplines. The first focus is on developing an EnXML format. Access interfaces include Perl API, Java API, ftp, SOAP, REST, GUI, etc. The return formats are in a variety of flavours, e.g. XML, CSV, plain text, JSON, etc. All of this must be squeezed into one consistent format. This is done by putting wrappers around the various programs.

The EnXML structure is set oriented – not only does it tell you about one thing (e.g. protein), but also about a set of them. In this structure, an experiment is run which identifies the results. Each experiment references a Set structure, which contains the structure of the result. Sets can hold further nested sets. There are a number of other further sub-structures. The EnCORE results always include both a positive and negative result set (in the case of the negative result, it lists all identifiers for which *no* hit was found). Negative results allow you to track why you might not have gotten a response, and how you “lost” some identifiers from the result.
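The positive/negative result set idea is easy to sketch outside of EnXML itself. The function below is a hypothetical illustration, not the EnXML schema or an EnCORE wrapper: it splits the input identifiers into those that got a hit and those that did not, so that nothing silently disappears. The identifiers and annotations are invented.

```python
def run_lookup(identifiers, annotations):
    """Split identifiers into those with a hit and those without, keeping input order."""
    positive = [(i, annotations[i]) for i in identifiers if i in annotations]
    negative = [i for i in identifiers if i not in annotations]
    return {"positive": positive, "negative": negative}


# Invented identifiers and annotations, for illustration only.
known = {"P1": "kinase", "P3": "transporter"}
result = run_lookup(["P1", "P2", "P3", "P4"], known)
print(result["positive"])
print(result["negative"])
```

When several such steps are chained, the negative set from each step records where an identifier was "lost", which is the tracking described above.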

EnVision is an end-user tool for the above EnCORE work based on the EnXML format. It provides an initial, integrative view for Sets of molecular entities without the need for programming. It also allows the possibility for further local processing. It allows you to save the status/analysis of your material on a particular date, and use that for, e.g., supplementary materials. You can also download your sub-results in a tabular format. Further information and the ability to run this GUI is available from http://www.enfin.org, where you can play with an EnCORE tutorial.

All of this can be quite laborious – web services that are used by EnCORE can change without warning, so it’s a constant challenge to maintain all of these wrappers. A partial answer is to use, wherever possible, underlying standards for the individual services. Such standards include PSICQUIC for MI data. DAS will be used to access protein annotation and information.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

BioSharing and M3 Special Interest Group at ISMB 2010

March 31, 2010

Last year, I attended ISMB 2009 in Stockholm. Prior to the main ISMB conference, there were a number of Special Interest Groups, which are a little like mini-conferences lasting 1-2 days. I really enjoyed going to a number of these SIGs, and they were one of the highlights of the conference. I even presented a paper at one of them.

Last year also saw the first M3 SIG. This year they’re expanding the scope of the SIG to include sessions on data sharing and standards in biology, highlighting the BioSharing community and those interested in open data and data standards. Further, during this SIG, the organisers:

…aim to launch the BioSharing forum, as discussed at M3 2009 to enable a broader dialogue among funders, journals, standards developers, technology developers and researchers on the critical issue of data sharing within the metagenomics community and beyond. (source)

Although I will not be able to attend ISMB or the SIGs this year, I highly recommend attending this SIG, and the conference in general. For more information, please see the BioSharing post advertising the SIG.
