Grid/cloud data storage and computing for 3D image databases (ISMB DAM SIG 2009)

Christoph Best, EBI

Bioimage informatics applies informatics to biological imaging, and helps make imaging data reliable and accessible. Structural databases at the EBI include the PDB and the EMDB, the latter of which is part of the PDB at the EBI and Rutgers. For electron microscopy, samples get “plunged”: they are embedded in amorphous ice that is transparent to electrons. The resulting picture is an x-ray-like projection of the protein. The single-particle method yields molecular structures by computationally combining many images: you can reconstruct 3D images from the 2D ones, and you can increase the resolution by averaging. It is very much an iterative process. The resolution has been pushed to 4 angstroms.
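To illustrate why averaging helps, here is a minimal sketch (not the speaker's pipeline) that averages many noisy copies of a toy 1D "particle". It assumes the copies are already perfectly aligned and that the noise is independent Gaussian noise; real single-particle processing also has to align and classify the images. All names and numbers are made up for illustration.

```python
import math
import random

def make_noisy_copy(signal, noise_sigma, rng):
    """Return the signal with independent Gaussian noise added to each pixel."""
    return [value + rng.gauss(0.0, noise_sigma) for value in signal]

def average_images(images):
    """Pixel-wise mean of a list of equally sized (already aligned) images."""
    count = len(images)
    return [sum(pixels) / count for pixels in zip(*images)]

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    total = sum((e - t) ** 2 for e, t in zip(estimate, truth))
    return math.sqrt(total / len(truth))

rng = random.Random(0)                        # fixed seed for repeatability
truth = [0.0] * 20 + [1.0] * 10 + [0.0] * 20  # toy 1D "particle"
copies = [make_noisy_copy(truth, 2.0, rng) for _ in range(400)]

single_err = rms_error(copies[0], truth)             # one noisy image
avg_err = rms_error(average_images(copies), truth)   # average of 400 images
print(f"single image RMS error: {single_err:.3f}")
print(f"averaged RMS error:     {avg_err:.3f}")
```

With independent noise, the error of the average shrinks roughly with the square root of the number of images, which is why so many particle images get combined.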

Data management issues: the initial images come to 10-20 gigabytes, while the final data set is about 1 MVoxel, which is considered small. Processing takes roughly a number of weeks on a lab-owned cluster with a few hundred cores. The software is a mixture of 1970s Fortran code and 1990s C code: it’s very fragmented and there is a definite lack of standards. 3D reconstruction (tomography) works by taking a series of images from different angles, which makes it possible to see a eukaryotic cell at “molecular resolution”. They do a lot of processing on ensembles of images as well as on the individual images. “Visual proteomics” aims to identify proteins in cryo-electron tomograms of intact cells, which involves pattern matching.
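The pattern-matching idea can be sketched as template matching with normalized cross-correlation. This is only a toy 2D illustration under my own assumptions: actual visual proteomics would search 3D tomograms with 3D templates over many orientations, and the talk did not specify the method used. The function names and the example image are hypothetical.

```python
import math

def normalized_cross_correlation(patch, template):
    """Correlation coefficient between two equally long flat pixel lists."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in patch)
        * sum((t - mean_t) ** 2 for t in template)
    )
    return num / den if den > 0 else 0.0  # flat patches carry no pattern

def best_match(image, template):
    """Slide the template over the image; return (best score, (row, col))."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    flat_template = [v for row in template for v in row]
    best_score, best_pos = -2.0, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = [image[y + dy][x + dx] for dy in range(th) for dx in range(tw)]
            score = normalized_cross_correlation(patch, flat_template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos

# Toy example: hide a cross-shaped "protein" in an empty 8x8 image.
template = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
image = [[0] * 8 for _ in range(8)]
for dy in range(3):
    for dx in range(3):
        image[2 + dy][4 + dx] = template[dy][dx]

score, position = best_match(image, template)
print(score, position)
```

The brute-force scan is quadratic in image size; at tomogram scale this kind of search is one reason the processing needs clusters.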

EMDB has about 600 entries, and is currently growing by approximately 15-20 per month. Metadata management is difficult: even after many rounds of consultation with the community, most fields remain empty. Submissions consist of maps (increasingly more than one per entry), and there are relations between data sets. They’re looking into XML-based standards for representing relationships between data. They’d like to get the community to submit all their original data. To hold all this, they need grid/cloud computing to help with data upload, distribution, etc.
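As a rough idea of what XML-encoded relationships between data sets could look like, here is a small sketch using Python's standard library. The element and attribute names (`entry`, `maps`, `map`, `relations`, `relation`, `derivedFrom`) and the identifiers are invented for illustration; this is not the EMDB schema, and the talk did not describe a concrete format.

```python
import xml.etree.ElementTree as ET

def build_entry(entry_id, maps, relations):
    """Build a hypothetical entry holding several maps and their relations.

    maps:      list of (map_id, filename) tuples
    relations: list of (source_id, relation_type, target_id) tuples
    """
    entry = ET.Element("entry", id=entry_id)
    maps_el = ET.SubElement(entry, "maps")
    for map_id, filename in maps:
        ET.SubElement(maps_el, "map", id=map_id, file=filename)
    relations_el = ET.SubElement(entry, "relations")
    for source, kind, target in relations:
        ET.SubElement(relations_el, "relation",
                      source=source, type=kind, target=target)
    return entry

# Hypothetical submission: a masked map derived from an averaged map.
entry = build_entry(
    "EXAMPLE-0001",
    maps=[("map1", "average.map"), ("map2", "masked.map")],
    relations=[("map2", "derivedFrom", "map1")],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping relations as separate elements that point at map identifiers, rather than nesting maps inside each other, lets one data set take part in several relationships.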

