Open binary file formats for large-scale data management, aka Toward Scalable Bioinformatics Infrastructures
Mark Welsh, geospiza.com
Measuring gene expression is much easier these days. HDF = Hierarchical Data Format; HDF5 is a data model and file format for large, complex data. Complexity limits scale and productivity when data are unstructured, with no consistent data model. There's also a tendency to solve problems with redundant data processing: incremental steps with data filtering at each stage, so a new question often means re-running all the steps. This makes getting answers difficult and comparing between samples hard. They want a scalable system with smooth user interfaces, among other things. Hence the BioHDF project, which aims to deliver core tools to the community and to gather feedback.
Benefits include: separating the model, implementation and view of the data; combining data from multiple samples; compression and chunking; a rapid prototyping environment; a significant reduction in development time; and the ability to approach NGS analysis differently.
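To make the "combines data from multiple samples" and "compression and chunking" points concrete, here is a minimal sketch using the h5py Python bindings for HDF5. This is my own illustration, not anything from the talk or from BioHDF itself; the file name, sample names and placeholder counts are all made up.

```python
# Sketch: store several samples' expression counts in one HDF5 file,
# chunked and gzip-compressed, then read back a slice of one sample.
# (h5py bindings assumed; names and data are hypothetical.)
import h5py

with h5py.File("expression_demo.h5", "w") as f:
    for sample in ("sample_A", "sample_B"):   # hypothetical sample names
        grp = f.create_group(sample)          # one group per sample
        counts = list(range(10000))           # placeholder count data
        # Chunking + compression let you later read only the parts you need.
        grp.create_dataset("counts", data=counts,
                           chunks=(1000,), compression="gzip")

# A new question can be answered by reading just a slice,
# without re-running an upstream processing pipeline:
with h5py.File("expression_demo.h5", "r") as f:
    first_hundred = f["sample_A/counts"][:100]
```

Because the hierarchy ("group per sample, dataset per measurement") is stored in the file itself, the data model travels with the data, which is part of the "separates model, implementation and view" point above.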
He has a bagful of thumb drives with the HDF software pre-loaded. Good idea!
FriendFeed discussion: http://ff.im/4vsfK
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!