Workflow development and reuse in Taverna (ISMB DAM SIG 2009)

Carole Goble, Manchester University

There are lots of different data services and tools, and many ways to integrate data sets: warehouses, post/pre-docs (the hard way), bunches of scripts, workflows, and applications like Gaggle. Carole et al. concentrate on workflows, which are automated, coordinated sequences of tasks such as bioinformatics protocols. For these, you need a programmatic way of accessing each data set, and web services work well for this purpose. In a workflow, data flows between components, and the arrival of data triggers the next set of components.
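As a rough illustration of that last point, here is a minimal Python sketch of data-driven execution. It is my own toy example, not Taverna code: the function `run_workflow`, the task names and the stubbed "services" are all hypothetical, and a real engine would call remote services rather than local functions.

```python
from collections import deque


def run_workflow(tasks, inputs):
    """Run tasks in a data-driven way.

    tasks  : dict mapping task name -> (function, list of names it needs)
    inputs : dict of workflow input name -> value
    A task fires as soon as every value it needs has arrived.
    """
    values = dict(inputs)          # everything that has arrived so far
    arrivals = deque(inputs)       # names whose data has just arrived
    fired = []                     # order in which tasks actually ran
    while arrivals:
        name = arrivals.popleft()
        for task, (func, needs) in tasks.items():
            if task in values or name not in needs:
                continue           # already ran, or not a consumer of this data
            if all(n in values for n in needs):
                values[task] = func(*(values[n] for n in needs))
                fired.append(task)
                arrivals.append(task)   # its output may trigger other tasks
    return values, fired


# Hypothetical stand-ins for remote services (no network access here).
def fetch_sequence(accession):
    return ">" + accession + "\nACGTTGCA"


def sequence_length(fasta):
    return len(fasta.splitlines()[1])


def make_report(accession, length):
    return f"{accession}: {length} bp"


TASKS = {
    "report": (make_report, ["accession", "length"]),
    "length": (sequence_length, ["fetch"]),
    "fetch": (fetch_sequence, ["accession"]),
}

values, fired = run_workflow(TASKS, {"accession": "X1"})
```

Note that the tasks are declared in "wrong" order on purpose: the execution order comes from when the data becomes available, not from the order in which the tasks are written down.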

Taverna is open source and free, and release 2 has been out since December 2008. It is split into an enactment engine and a front-end workbench. It has about 8000 downloads per version. A workflow is a specification, and the workflow engine takes over from the application all of the error handling, service invocation, data movement, data streaming, data provenance tracking, process auditing, execution monitoring, security access etc. This means that the people building the workflows don’t need to deal with any of that themselves. Taverna uses “just in time” data integration, similar to query translation, but no translation is required as you are actually manipulating a graphical view of the underlying external services.

You can incorporate a new service into Taverna without coding, as long as it has an appropriate WSDL document. Taverna 2 offers: iteration over data collections (which they already had); native looping for asynchronous services; native data streaming and reference management; integration with the established computing platforms caGrid, EGEE, KnowArc, and the Dutch e-Science Grid; compliance of its provenance format with the Open Provenance Model; and integration with the OSGi, Spring and Eclipse platforms. There is a demo on Thursday at ISMB if you’re interested.
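To make "iteration over data collections" a bit more concrete, here is a small Python sketch of the general idea: a service written for single items gets applied automatically when it is handed lists. This is my own illustration and not Taverna's implementation; the `iterate` helper, the `strategy` parameter and the `label` service are hypothetical names, and the two strategies ("cross" for all combinations, "dot" for pairing items by position) are my assumption about how such iteration is usually configured.

```python
from itertools import product


def iterate(service, *inputs, strategy="cross"):
    """Apply a single-item service over list inputs.

    strategy="cross": call the service on every combination of items.
    strategy="dot":   call the service on items paired by position.
    Inputs that are not lists are treated as one-item lists.
    """
    lists = [x if isinstance(x, list) else [x] for x in inputs]
    if strategy == "cross":
        combos = product(*lists)
    elif strategy == "dot":
        if len({len(items) for items in lists}) > 1:
            raise ValueError("dot iteration needs lists of equal length")
        combos = zip(*lists)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [service(*combo) for combo in combos]


# Hypothetical single-item service.
def label(gene, organism):
    return f"{gene}@{organism}"


all_pairs = iterate(label, ["brca1", "tp53"], ["human", "mouse"])
by_position = iterate(label, ["brca1", "tp53"], ["human", "mouse"], strategy="dot")
```

The point is that the person writing `label` never has to write the loop; the engine decides how to fan the data out.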

Taverna is used by caBIG, which is an initiative in the US to make sure that people use the same format in their research. Taverna has also been used in astronomy (AstroGrid, I believe). Paul Fisher has also used Taverna for a trypanosomiasis study. He thinks that using the Taverna workflows helped him manage the scale and complexity of data and literature, and he feels it eliminated user bias: before using the workflows they were triaging data and therefore missing the most interesting data, which was actually outside their normal area of interest.

Carole then moved on to a description of www.myexperiment.org. Also, they are part of the development of BioCatalogue (demo on Wednesday) which is a catalog of web services. Utopia also uses Taverna, and there is a demo of that on Tuesday.

The design of Taverna centered around thinking about how to help lone bioinformaticians deal with big science. They also wanted to use an open architecture to allow for future additions. Workflow development tends to be an iterative process. Workflows are only as good as their components. Writing reusable services is hard, as it requires planning for the unknown and writing services that may be beyond the scope of the originating group's current funding. BioCatalogue tries to sort this out by providing tags etc. to annotate information about the web services. Taverna tries to help by grouping services into families and having special plugins for those families.

There are, of course, data incompatibilities. This is a common problem in the life sciences 😉 . 62% of respondents in the Elixir questionnaire didn’t use any sort of minimal information guidelines. To help, Carole says you must: make services compatible; describe “the hell out of them” (e.g. BioMOBY, Taverna); use shim services and in-workflow scripting (flexible and transparent, but beanshell scripting means programming); and make compatible components (combine the service and manipulation services together and create sub-workflows).
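For anyone who hasn't met the term, a shim is just a small adapter that sits between two services whose formats don't line up. Here is a minimal Python sketch of the idea (in Taverna this would more likely be a beanshell script; the function names and the FASTA-to-records example are my own assumptions, not something from the talk): one service emits FASTA text, the next wants plain sequences, and the shim converts between them.

```python
def fasta_to_records(fasta_text):
    """Shim: turn FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the one we were building.
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)    # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_fraction(sequence):
    """Hypothetical downstream service that only accepts a bare sequence."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


# Upstream output -> shim -> downstream input.
upstream_output = ">seq1 example record\nGGCC\nGGCC\n>seq2\nTTAA\n"
results = {name: gc_fraction(seq) for name, seq in fasta_to_records(upstream_output)}
```

Neither service had to change; all of the format knowledge lives in the shim, which is why shims are flexible but also why somebody still has to program them.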

myExperiment has a “just-enough sharing” model, so you don’t have to share workflows you don’t want to. There are also incentives to share, such as credit for the people who made the workflows. As you can see, the design principles for these apps are based on a stepwise Just Enough, Just in Time principle.

FriendFeed discussion: http://ff.im/4shPC

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
