…Open binary file formats for large-scale data management, aka Toward Scalable Bioinformatics infrastructures

Mark Welsh,

Measuring gene expression is much easier these days. HDF stands for Hierarchical Data Format; HDF5 is a data model and file format for large, complex data. Complexity limits scale and productivity when data are unstructured and have no consistent data model. There’s also a tendency to solve problems with redundant, incremental processing, filtering the data at each stage, so if you have a new question you often have to re-run the steps. This makes getting answers difficult and comparing between samples hard. They want a scalable system with smooth user interfaces, among other things. Hence the BioHDF project, which aims to deliver core tools to the community and to get feedback.

Benefits include: separating the model, implementation and view of the data; combining data from multiple samples; compression and chunking; a rapid prototyping environment; a significant reduction in development time; and the ability to approach NGS analysis differently.
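To make the "compression and chunking" benefit a little more concrete, here is a toy sketch of the idea in plain Python. It is not HDF5, not the BioHDF API, and not anything shown in the talk: the `ChunkedStore` class, the chunk size and the made-up coverage values are all my own assumptions for illustration. The point is only that when data are stored as independently compressed chunks, a new question about one small region can be answered by decompressing the chunks that overlap it, rather than re-reading everything.

```python
import zlib
from array import array


class ChunkedStore:
    """Toy chunked, compressed store for a 1-D series of floats.

    Hypothetical illustration only; real HDF5 files handle chunking and
    compression inside the library.
    """

    def __init__(self, values, chunk_size=1000):
        self.chunk_size = chunk_size
        self.length = len(values)
        self.chunks_read = 0  # counts how many chunks were decompressed
        self.chunks = []
        # Split the series into fixed-size chunks and compress each one separately.
        for start in range(0, self.length, chunk_size):
            raw = array("d", values[start:start + chunk_size]).tobytes()
            self.chunks.append(zlib.compress(raw))

    def read(self, start, stop):
        """Return values[start:stop], decompressing only the overlapping chunks."""
        if not 0 <= start <= stop <= self.length:
            raise IndexError("requested range is outside the stored data")
        out = []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        for idx in range(first, last + 1):
            block = array("d")
            block.frombytes(zlib.decompress(self.chunks[idx]))
            self.chunks_read += 1
            offset = idx * self.chunk_size
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(block))
            out.extend(block[lo:hi])
        return out


# Made-up per-position coverage values for one sample (assumption, not real data).
coverage = [float(i % 50) for i in range(10_000)]
store = ChunkedStore(coverage, chunk_size=1000)

# A query over a small window touches a single chunk out of ten.
window = store.read(2500, 2510)
print(window)
print("chunks decompressed:", store.chunks_read)
```

The chunk size is the main design choice here: smaller chunks mean less wasted decompression per query but more per-chunk overhead, which is the same trade-off you tune when choosing a chunk shape in a real chunked file format.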

He has a bagful of thumb drives with the HDF software pre-loaded. Good idea!

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
