Kam Dahlquist, Loyola Marymount University
The original motivation for this project was GenMAPP, a tool for viewing DNA microarray data on biological pathways. It was written a while ago and is basically a legacy program these days. XMLPipeDB is a reusable open-source tool chain for building relational databases from XML sources. The original requirements were to import proteomes from UniProt XML, GO XML, and other sources. First, the XSD is converted into a database schema using Hyperjaxb (from Apache, I think). You still need to do some basic post-processing (changing data types or SQL reserved words – why doesn’t Hyperjaxb do the latter?). Then the XML files are broken down into 25-record chunks for import (Hyperjaxb couldn’t handle the big files), and the TallyEngine counts the records in the XML and in the relational database. Finally, the GenMAPP Builder exports the data into Microsoft Access format.
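To make the chunk-and-tally idea concrete for myself, here is a minimal Python sketch. It is not XMLPipeDB's code (which is Java and maps records onto a generated schema). The function names, the single `record` table holding raw XML, and the use of SQLite are all my own assumptions for illustration. Only the 25-record chunk size and the idea of counting records on both sides come from the talk.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice

CHUNK_SIZE = 25  # chunk size mentioned in the talk


def iter_records(xml_path, record_tag):
    """Stream record elements as XML strings without loading the whole file."""
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == record_tag:
            # Serialize before clearing, so the caller gets the full record.
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free the memory held by this record


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def import_in_chunks(xml_path, record_tag, conn):
    """Insert records chunk by chunk (hypothetical stand-in for the real import)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record (id INTEGER PRIMARY KEY, xml TEXT)"
    )
    for chunk in chunked(iter_records(xml_path, record_tag), CHUNK_SIZE):
        with conn:  # one transaction per chunk
            conn.executemany(
                "INSERT INTO record (xml) VALUES (?)", [(r,) for r in chunk]
            )


def tally(xml_path, record_tag, conn):
    """Count records in the XML source and in the database."""
    xml_count = sum(1 for _ in iter_records(xml_path, record_tag))
    db_count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
    return xml_count, db_count
```

If the two numbers returned by `tally` differ, something was lost or duplicated during import, which is presumably the kind of check the TallyEngine exists to make.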
How robust is the system? The data-driven design allowed them to pick up RefSeq and NCBI Gene IDs from cross-references in the UniProt XML. The UniProt and GO XML schemas have changed, and those changes were handled mostly automatically. However, XML sources need to keep their own XSDs updated, and the XSDs on a source's site can be older than the XML… Also, each new species does require additional coding to handle the vagaries of its own gene ID system.
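As a rough illustration of what "data-driven" could mean here, the sketch below collects whatever cross-reference types appear in a UniProt-style XML entry instead of hard-coding a list of databases. It is my own Python, not the project's code. The function name is hypothetical, and I am assuming the namespace `http://uniprot.org/uniprot` and `dbReference` elements with `type` and `id` attributes sitting directly under each `entry`. The identifiers in the sample are placeholders, not real records.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

UNIPROT_NS = "{http://uniprot.org/uniprot}"  # assumed namespace


def collect_cross_references(xml_text, wanted_types=None):
    """Map each entry's first accession to {database type: [ids]}.

    With wanted_types=None, every type found in the data is kept, so a
    newly appearing database needs no code change.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter(f"{UNIPROT_NS}entry"):
        accession = entry.findtext(f"{UNIPROT_NS}accession")
        refs = defaultdict(list)
        # Only direct children, so references nested elsewhere are skipped.
        for ref in entry.findall(f"{UNIPROT_NS}dbReference"):
            db_type = ref.get("type")
            if wanted_types is None or db_type in wanted_types:
                refs[db_type].append(ref.get("id"))
        result[accession] = dict(refs)
    return result


# Placeholder data for illustration only.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00000</accession>
    <dbReference type="RefSeq" id="NP_000000.1"/>
    <dbReference type="GeneID" id="000000"/>
    <dbReference type="GO" id="GO:0000000"/>
  </entry>
</uniprot>"""

if __name__ == "__main__":
    print(collect_cross_references(SAMPLE, wanted_types={"RefSeq", "GeneID"}))
```

The species-specific part mentioned in the talk would sit on top of something like this: deciding which of the collected IDs counts as the primary gene ID for a given organism still needs hand-written rules.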
FriendFeed discussion: http://ff.im/4vvIi
My thoughts: I would like to hear her opinions on XML databases, and why her group prefers relational databases.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!