XMLPipeDB (ISMB DAM/BOSC 2009)
June 27, 2009
Kam Dahlquist, Loyola Marymount University
The original motivation for this project was GenMAPP, a tool for viewing DNA microarray data on biological pathways. GenMAPP was written a while ago and is basically a legacy program these days. XMLPipeDB is a reusable open source tool chain for building relational databases from XML sources. The original requirements were to load proteomes from UniProt XML, GO XML, and others. First, the XSD is converted into a database schema using hyperjaxb from Apache (I think). You still need to do some basic post-processing of the result, such as changing data types or renaming SQL reserved words (why doesn’t hyperjaxb do the latter?). Then the XML files are broken down into 25-record chunks for import (hyperjaxb couldn’t handle the big files), and the TallyEngine counts the records in both the XML and the relational database. Finally, the GenMAPP Builder exports the data into Microsoft Access format.
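To make the chunk-and-tally step concrete for myself, here is a minimal Python sketch. It is my own illustration, not XMLPipeDB code (which is a Java tool chain): the function names, the `entry` tag, and the single-column SQLite table are all assumptions, and a real importer would stream the file rather than parse it in one go.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice


def split_records(xml_text, record_tag, chunk_size=25):
    """Yield lists of at most chunk_size serialized record elements."""
    root = ET.fromstring(xml_text)
    records = iter(root.iter(record_tag))
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield [ET.tostring(rec, encoding="unicode") for rec in chunk]


def load_and_tally(xml_text, record_tag, chunk_size=25):
    """Import records chunk by chunk, then return (xml_count, db_count)."""
    # Hypothetical one-column table; a real schema would come from the XSD.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE record (body TEXT)")
    for chunk in split_records(xml_text, record_tag, chunk_size):
        conn.executemany(
            "INSERT INTO record (body) VALUES (?)", [(rec,) for rec in chunk]
        )
    # Count records independently on each side, as a tally check would.
    xml_count = sum(1 for _ in ET.fromstring(xml_text).iter(record_tag))
    db_count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
    conn.close()
    return xml_count, db_count
```

If the two counts returned by `load_and_tally` differ, some records were lost or duplicated during import, which is the kind of check I understood the TallyEngine to perform.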
How robust is the system? The data-driven design allowed it to pick up RefSeq and NCBI Gene IDs from cross-references in the UniProt (UP) XML. The UP and GO XML schemas have both changed, and the changes were handled mostly automatically. However, this relies on the XML sources keeping their own XSDs updated, and the XSDs on the site can be older than the XML… Also, each new species does require additional coding to handle the vagaries of its own gene ID system.
FriendFeed discussion: http://ff.im/4vvIi
My thoughts: I would like to hear her opinions on XML databases, and why her group prefers relational databases.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!
Brad Chapman, biopython.org
Reusable libraries for parsing file formats, running programs, building analysis pipelines, etc. are important. Python examples are biopython, pygr, etc. BioSQL and Chado are examples of database schemas that represent biological data and help you move away from flat files. We should integrate bioinformatics libraries, database schemas, and web development frameworks (among others). An example implementation, deployed with Google App Engine, is at http://biosqlweb.appspot.com.
Challenges: how do we provide plug-in components, leverage existing code, make reuse easier, and communicate about these issues?
FriendFeed discussion: http://ff.im/4vtZi
BioHDF (ISMB DAM/BOSC 2009)
June 27, 2009
…Open binary file formats for large-scale data management, aka Toward Scalable Bioinformatics infrastructures
Mark Welsh, geospiza.com
Measuring gene expression is much easier these days. HDF = hierarchical data format. HDF5 is a model and file format for large, complex data. Complexity limits scale and productivity, as data are unstructured with no consistent data model. Also, there’s a tendency to solve problems with redundant, incremental data processing, filtering the data at each stage. If you have a new question, you often have to re-run the steps. This makes getting answers difficult, and comparing between samples hard. They want a scalable system with smooth user interfaces, among other things. Hence the BioHDF project, which aims to deliver core tools to the community and to get feedback.
Benefits include: separating the model, implementation and view of the data; combining data from multiple samples; compression and chunking; a rapid prototyping environment; a significant reduction in development time; and a different approach to NGS analysis.
He has a bagful of thumb drives with the HDF software pre-loaded. Good idea!
FriendFeed discussion: http://ff.im/4vsfK
PSLID (ISMB DAM/BOSC 2009)
June 27, 2009
…the Protein Subcellular Location Image Database: Subcellular location assignments, annotated image collections, image analysis tools, and generative models of protein distributions
Robert Murphy, Carnegie Mellon
Everything he describes is open source and available on their website.
These are tools that analyze images of proteins and their distributions within cells. SLIC = subcellular location image classification. The challenges include: cells have different shapes, sizes, and orientations, and structures are not found in fixed locations within the cell. Instead of relying on position, they describe each image numerically and operate on the descriptors, known as SLF or subcellular location features. The tools within SLIC are: segmentation, feature classification, clustering and comparison. You can do the analysis at many different levels of granularity. Computational classification of subcellular location is very high quality. SLIC is available in Matlab and in Python, and some of it has been ported to C++/ITK.
Decomposing mixture patterns involves sorting out proteins that are in more than one place. PUnMix either learns to unmix given instances of the pure patterns, or will use a previous instance of a pattern. You learn the object types by clustering using object features. For instance, if the computer knows what a lysosome pattern and a Golgi pattern look like and is given a mixture, it can tell you what fraction of each is present. But how do you test something like that? Create real images that are mixtures of two different probes.
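As a toy version of the unmixing idea, the sketch below estimates what fraction of a mixed pattern comes from each of two known pure patterns. This is my own simplification and not the PUnMix algorithm: I assume each pattern is summarized as a vector of object-type frequencies and that the mixture is a linear blend of the two, and the function name `unmix_two` is made up.

```python
def unmix_two(mixture, pure_a, pure_b):
    """Least-squares fraction of pattern A in a two-pattern mixture.

    Assumes mixture ~= f * pure_a + (1 - f) * pure_b, where each argument
    is a sequence of object-type frequencies of the same length.
    """
    diff = [a - b for a, b in zip(pure_a, pure_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("the two pure patterns are identical")
    num = sum((m - b) * d for m, b, d in zip(mixture, pure_b, diff))
    # Clip to [0, 1] so that noise cannot produce an impossible fraction.
    return min(1.0, max(0.0, num / denom))
```

The fraction of pattern B is simply one minus the returned value. Testing on images deliberately made from two probes, as described in the talk, would tell you whether the linear-blend assumption holds for real data.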
To determine nuclear shape you can use the medial axis model. 11 parameters allow you to synthesize one nucleus: you learn those 11 parameters over 1000s of nuclei and you get a distribution. The model for the cell shape is based on a distance ratio, and captures variation as a principal components model, typically using 10 principal components. For models of protein-containing objects, you treat them as a mixture of Gaussian objects and learn the distributions. The SLML model toolbox is all about storing these models. If you want to do cell simulations, then you can combine models together. This is interesting for the virtual cell, as you can model the proteins inside the cell with this.
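The "learn a distribution over parameters, then synthesize" idea can be sketched in a few lines of Python. This is only my illustration, under the strong assumption that each shape parameter follows an independent Gaussian; the real models described in the talk are richer (for example, principal components for cell shape), and the function names here are hypothetical.

```python
import random
import statistics


def fit_parameter_model(samples):
    """Fit one (mean, standard deviation) pair per parameter.

    samples is a list of parameter vectors, one per measured nucleus.
    """
    columns = list(zip(*samples))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_parameters(model, rng):
    """Draw one synthetic parameter vector from the fitted model."""
    return [rng.gauss(mu, sigma) for mu, sigma in model]
```

Calling `sample_parameters(model, random.Random(0))` after fitting on measured nuclei gives a new parameter vector that could then be passed to whatever routine turns parameters into a shape.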
They also distribute lots of annotated data sets where they’ve collected images of different proteins, both in 2D and 3D.
FriendFeed discussion: http://ff.im/4vq84
…for genotype and phenotype experiments
Morris Swertz, Groningen
MOLGENIS is a database generator: a free toolbox of automated best practices and more. They separate out the parts that differ from the parts that are common across their applications. The blueprint is written in XML, and templates are then written which translate that XML into working software. The first step in MOLGENIS is modelling (though you can also extract a model from an existing database that isn’t yet part of MOLGENIS). Then a set of generators is run that will build the appropriate apps. The generators are implemented in Freemarker. The programmer can then build on the generated code if they like.
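To illustrate the blueprint-plus-template approach, here is a small Python sketch that turns an XML model into SQL table definitions. It is not MOLGENIS code: the element and attribute names in the blueprint, the type mapping, and the use of Python’s `string.Template` in place of Freemarker are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical blueprint; real MOLGENIS models may look different.
BLUEPRINT = """
<model>
  <entity name="Sample">
    <field name="id" type="int"/>
    <field name="label" type="string"/>
  </entity>
</model>
"""

# Assumed mapping from blueprint types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "string": "TEXT"}

# The template holds the common part; the blueprint supplies what differs.
TABLE_TEMPLATE = Template("CREATE TABLE $name (\n$columns\n);")


def generate_ddl(blueprint_xml):
    """Return one CREATE TABLE statement per entity in the blueprint."""
    root = ET.fromstring(blueprint_xml)
    statements = []
    for entity in root.iter("entity"):
        columns = ",\n".join(
            f"  {field.get('name')} {SQL_TYPES[field.get('type')]}"
            for field in entity.iter("field")
        )
        statements.append(
            TABLE_TEMPLATE.substitute(name=entity.get("name"), columns=columns)
        )
    return statements
```

Adding a new entity to the blueprint produces a new table without touching the generator, which is the separation of "different" from "common" parts that the talk described.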
He spent some of the time on a demo of MOLGENIS itself.
XGAP (eXtensible Genotype and Phenotype data platform): the DAM challenges include sharing data between QTL collaborators, handling a variety of species and methods, and ensuring that ad-hoc analysis protocols are reused. XGAP extends the FuGE standard. Their project plus 6 others are described in XGAP format.
The next step is to add more semantics into it. They would also like to do some federation and cloud computing.
FriendFeed discussion: http://ff.im/4vnGO