Exploring the future of bioinformatics data sharing and mining (ISMB DAM SIG 2009)

…with Pygr and Worldbase.
Christopher Lee, UCLA

Graph databases as knowledge maps – mapping how knowledge interconnects. Hypergraphs are a general model for bioinformatics. You can have, for example, nodes as sequence letters or annotations, and edges as links between sequences and annotations.

Pygr is extensible, and stands for the Python Graph Database Framework. You can use it to map any set of inputs onto any set of outputs. A node can have multiple mappings, and the mappings are indexed. That’s essentially everything required to understand Pygr. Pygr is simpler than many systems because the data really is in a graph format: SQL schemas are not a very good representation of graphs.
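To make the "map any input onto any output" idea concrete, here is a minimal sketch in plain Python. It is not Pygr's actual API: the MappingGraph class and the node names ("exon1", "seq:100-250", "gene:ABC") are made up for illustration, and the only assumption is that a graph can be treated as an indexed mapping of source node to target node to edge information.

```python
from collections import defaultdict


class MappingGraph:
    """Toy graph stored as a mapping: source -> {target: edge_info}.

    Hypothetical illustration only; this is not a Pygr class.
    """

    def __init__(self):
        self._edges = defaultdict(dict)

    def add_edge(self, source, target, edge_info=None):
        # A node may map to many targets; each mapping carries edge info.
        self._edges[source][target] = edge_info

    def __getitem__(self, source):
        # Indexed lookup of all mappings for a node (empty if none).
        return self._edges.get(source, {})


graph = MappingGraph()
graph.add_edge("exon1", "seq:100-250", "annotates")
graph.add_edge("exon1", "gene:ABC", "part_of")

# One lookup returns every mapping for the node.
targets = graph["exon1"]
```

Here a single indexed lookup returns all of a node's mappings, which is the kind of access that takes joins across link tables in a relational schema.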

Other advantages of Pygr include the fact that both data and queries are graphs. Graph queries are also simple: where SQL would require several different queries, Pygr requires one. You can have multiple sequence alignments as a graph database. Is Pygr scalable? Yes, if you develop good indexing schemes, e.g. NLMSA. A single NLMSA stores an alignment on disk, and provides a fast C extension (Pyrex + C) to perform all queries. You can also store and query annotations using NLMSA.
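As a rough picture of why indexing matters for this kind of query, here is a small interval-overlap index in plain Python. It is not NLMSA and not its algorithm: it is a much simpler sorted-list-plus-bisect scheme, and the class name, the half-open coordinates and the payload strings are all assumptions made for the example.

```python
import bisect


class IntervalIndex:
    """Toy overlap index over half-open intervals [start, end).

    Hypothetical stand-in for a real alignment index such as NLMSA.
    """

    def __init__(self, intervals):
        # Each interval is (start, end, payload); sort by coordinates only.
        self._intervals = sorted(intervals, key=lambda iv: (iv[0], iv[1]))
        self._starts = [iv[0] for iv in self._intervals]

    def overlapping(self, start, end):
        # Only intervals that begin before `end` can overlap the query,
        # so bisect on the sorted starts to cut the candidate list.
        upper = bisect.bisect_left(self._starts, end)
        return [iv for iv in self._intervals[:upper] if iv[1] > start]


index = IntervalIndex([
    (30, 40, "block_c"),
    (0, 10, "block_a"),
    (5, 20, "block_b"),
])

hits = index.overlapping(8, 12)
no_hits = index.overlapping(20, 30)
```

The same structure could hold annotations instead of alignment blocks, since both are just intervals with something attached.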

How can we access each other’s datasets in a general way? That’s where Worldbase comes in: it is a namespace for scientific data and schemas. They would like to make obtaining a specific dataset as easy as a Python import such as “import foo.bar.you”. To work with the dataset, all you need to know is its name. With the name and the import system, you don’t need to know how the data got there – you just have access to it.
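A sketch of what "all you need to know is its name" could look like, in plain Python. This is a toy and not the real Worldbase implementation: the Namespace class, the registry dictionary and the loader function are hypothetical, and the dataset name foo.bar.you is simply the placeholder used in the talk.

```python
class Namespace:
    """Toy dotted-name resolver: world.foo.bar.you() loads a dataset.

    Hypothetical illustration; not the Worldbase implementation.
    """

    def __init__(self, registry, prefix=""):
        self._registry = registry  # maps dotted names to loader functions
        self._prefix = prefix

    def __getattr__(self, part):
        # Only reached for attributes that are not set in __init__.
        if part.startswith("_"):
            raise AttributeError(part)
        dotted = f"{self._prefix}.{part}" if self._prefix else part
        return Namespace(self._registry, dotted)

    def __call__(self):
        # The caller never sees where or how the data is stored.
        loader = self._registry[self._prefix]
        return loader()


def load_example_sequences():
    # Stand-in for reading a file, a database or a remote server.
    return ["ACGT", "TTGA"]


registry = {"foo.bar.you": load_example_sequences}
world = Namespace(registry)

data = world.foo.bar.you()
```

The point of the design is that the loader can change (local file today, remote server tomorrow) without the calling code changing at all.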

In Worldbase, nodes are databases, and edges are relations/mappings as in an ER diagram. You need a CLASSPATH-like configuration path called WORLDBASEPATH, which contains the locations of your metabases. Order in the path is relevant, and determines precedence. They’ve set it up such that with 3 lines of code you can serve up your data using XML-RPC. It is totally automatic, with zero config. He feels that when discussing the scalability principles of integration, the conversation must be in code, i.e. working prototypes that show a pattern for solving a problem, followed by making these things work together. He feels that if you want to solve the problem of data integration, you should reduce the problem to abstract graph operations. You’ll need a standard way to convert any dataset to and from a platform-independent/portable form.
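The precedence rule can be sketched in a few lines of plain Python. This only illustrates "first location in the path wins": the resolve function, the two dictionaries standing in for metabases and the dataset names are all hypothetical, and nothing here reflects how Worldbase actually parses WORLDBASEPATH.

```python
def resolve(name, metabases):
    """Return the first entry for `name`, searching metabases in path order.

    Hypothetical helper; each metabase is modelled as a plain dict.
    """
    for metabase in metabases:
        if name in metabase:
            return metabase[name]
    raise KeyError(name)


# Stand-ins for two metabase locations, listed in path order.
local_metabase = {"foo.bar.you": "local copy"}
remote_metabase = {
    "foo.bar.you": "remote copy",
    "foo.bar.other": "remote only",
}
search_path = [local_metabase, remote_metabase]

preferred = resolve("foo.bar.you", search_path)
fallback = resolve("foo.bar.other", search_path)
```

Putting a local metabase first means a local copy shadows the remote one, while names that only exist remotely are still found.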

The set of data types and transformations themselves form a graph. Graph queries can then automatically find a path in this graph from a desired source to a desired target, and automatically perform the transformations, much as Make does for simpler things. In other words: don’t write scripts, write make rules (and let the system find the right path for you).
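Here is a minimal sketch of that idea in plain Python, assuming the data types are just string labels and each transformation is a function keyed by a (from, to) pair. The type names, the two converters and the use of breadth-first search are my own choices for the example, not something shown in the talk.

```python
from collections import deque


def find_path(converters, source, target):
    """Breadth-first search for a chain of data types from source to target."""
    neighbours = {}
    for src, dst in converters:
        neighbours.setdefault(src, []).append(dst)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def convert(value, converters, source, target):
    """Apply each transformation along the discovered path, in order."""
    path = find_path(converters, source, target)
    if path is None:
        raise ValueError(f"no route from {source} to {target}")
    for src, dst in zip(path, path[1:]):
        value = converters[(src, dst)](value)
    return value


# Hypothetical "make rules": each entry says how to build one type from another.
converters = {
    ("fasta_text", "sequence"): lambda text: "".join(text.splitlines()[1:]),
    ("sequence", "length"): len,
}

route = find_path(converters, "fasta_text", "length")
result = convert(">id\nACGT\nAC", converters, "fasta_text", "length")
```

Nobody wrote a "FASTA text to length" script here; the route is found from the two rules, which is the Make analogy in miniature.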

Need to think about: Worldbase name service; WB Transformations Kernel; WB Repositories (“Git for data publishing and construction”); WB interfaces (standard interfaces for standard datatypes, but unlimited implementations).

My thoughts: would you need to set up a registry for people to put their identifiers for others to use? How to ensure it works as it should? It doesn’t make it easy to discover new data unless you have a global registry of locations. This isn’t data integration so much as data discovery, which is useful in itself. And there is already a lot of work going on with integrating large-scale networks. I think what he’s saying is do that, but not just for the data itself (which is happening loads), but also the data files – have the files themselves as part of a network to be analysed.

Carole’s observation: you’ve just described the Semantic Web (RDF triples etc). This graph model is like an RDF model. You’ve got the equivalent of SPARQL etc. Christopher: this is a client environment. Summary: more talk later this week required..!

FriendFeed discussion: http://ff.im/4uX7e

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

