
XMLPipeDB (ISMB DAM/BOSC 2009)

Kam Dahlquist, Loyola Marymount University

The original motivation for this project was GenMAPP, a tool for viewing DNA microarray data on biological pathways, which is basically a legacy program these days. XMLPipeDB is a reusable open-source tool chain for building relational databases from XML sources. The original requirements were proteomes from UniProt XML, GO XML, and others. First, the XSD is converted into a database schema using hyperjaxb (I think). You still need to do some basic post-processing (changing data types or renaming SQL reserved words – why doesn’t hyperjaxb do the latter?). Then the XML files are broken down into 25-record chunks for import (hyperjaxb couldn’t handle the big files), and the TallyEngine counts records in both the XML and the relational database as a sanity check. Finally, GenMAPP Builder exports the data into Microsoft Access format.
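
As a concrete illustration of the chunked import and the TallyEngine check, here is a minimal Python sketch of my own – not XMLPipeDB’s actual code (which, as I understand it, is Java-based with hyperjaxb-generated schemas); the SQLite target, table layout, and file name are all assumptions:

    # Sketch only: stream a big UniProt XML file, insert entries in 25-record
    # batches, then compare the XML record count against the database row count.
    import sqlite3
    import xml.etree.ElementTree as ET

    NS = "{http://uniprot.org/uniprot}"
    CHUNK_SIZE = 25  # XMLPipeDB reportedly imports in 25-record chunks

    def import_entries(xml_path, conn):
        batch, xml_count = [], 0
        for _, elem in ET.iterparse(xml_path):
            if elem.tag == NS + "entry":
                batch.append((elem.findtext(NS + "accession"),))
                xml_count += 1
                if len(batch) >= CHUNK_SIZE:
                    conn.executemany("INSERT INTO entry(accession) VALUES (?)", batch)
                    conn.commit()
                    batch.clear()
                elem.clear()  # free memory: the reason for streaming in chunks
        if batch:
            conn.executemany("INSERT INTO entry(accession) VALUES (?)", batch)
            conn.commit()
        return xml_count

    conn = sqlite3.connect("uniprot.db")  # assumed target; the real tool differs
    conn.execute("CREATE TABLE IF NOT EXISTS entry (accession TEXT)")
    xml_count = import_entries("uniprot_sprot.xml", conn)  # assumed file name
    db_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
    print("XML:", xml_count, "DB:", db_count, "match:", xml_count == db_count)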

How robust is the system? The data-driven design allowed RefSeq and NCBI Gene IDs to be picked up from cross-references in the UniProt XML. The UniProt and GO XML schemas have changed over time, and the changes were handled mostly automatically. However, XML sources need to keep their own XSDs updated – and the XSDs on a site can be older than the XML… Also, each new species does require additional coding to handle the vagaries of its own gene ID system.
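
The data-driven cross-reference pick-up is easy to picture with a small sketch: UniProt XML carries dbReference elements, and “GeneID” is UniProt’s label for NCBI Gene. This is my own illustration, not project code, and the file name is assumed:

    # Sketch only: harvest RefSeq and NCBI Gene cross-references from UniProt XML.
    import xml.etree.ElementTree as ET

    NS = "{http://uniprot.org/uniprot}"
    WANTED = {"RefSeq", "GeneID"}  # "GeneID" is UniProt's name for NCBI Gene

    def cross_references(xml_path):
        for _, entry in ET.iterparse(xml_path):
            if entry.tag == NS + "entry":
                acc = entry.findtext(NS + "accession")
                for ref in entry.findall(NS + "dbReference"):
                    if ref.get("type") in WANTED:
                        yield acc, ref.get("type"), ref.get("id")
                entry.clear()

    for acc, db, xref in cross_references("uniprot_sprot.xml"):  # assumed file
        print(acc, db, xref)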

FriendFeed discussion: http://ff.im/4vvIi

My thoughts: I would like to hear her opinions on XML databases, and why her group prefers relational databases.

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


Lowering barriers to publishing biological data on the web (ISMB DAM/BOSC 2009)

Brad Chapman, biopython.org

Reusable libraries for parsing file formats, running programs, building analysis pipelines, etc. are important; Python examples include biopython and pygr. For representing biological data, BioSQL and Chado are database schemas that help you move away from flat files. We should integrate bioinformatics libraries, database schemas, and web development frameworks (among others). An example implementation, deployed with Google App Engine, is at http://biosqlweb.appspot.com.
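
As a hedged sketch of the library-plus-schema combination (not Brad’s code), Biopython can load sequence records straight into a BioSQL database; the connection details and input file here are placeholders:

    # Sketch only: load GenBank records into BioSQL with Biopython.
    from Bio import SeqIO
    from BioSQL import BioSeqDatabase

    # Placeholder connection details for an existing BioSQL MySQL instance
    server = BioSeqDatabase.open_database(driver="MySQLdb", user="biosql",
                                          passwd="secret", host="localhost",
                                          db="biosql")
    db = server.new_database("demo", description="example sequences")
    count = db.load(SeqIO.parse("example.gbk", "genbank"))  # assumed input file
    server.commit()
    print("loaded", count, "records")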

Challenges: how do we provide plug-in components, leverage existing code, make reuse easier, and communicate about these issues?

FriendFeed discussion: http://ff.im/4vtZi

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


BioHDF (ISMB DAM/BOSC 2009)

…Open binary file formats for large-scale data management, aka Toward Scalable Bioinformatics Infrastructures

Mark Welsh, geospiza.com

Measuring gene expression is much easier these days. HDF = Hierarchical Data Format; HDF5 is a data model and file format for large, complex data. Complexity limits scale and productivity when data are unstructured with no consistent data model. There is also a tendency to solve problems with redundant, incremental processing pipelines that filter the data at each stage, so if you have a new question you often have to re-run all the steps. This makes getting answers difficult, and comparing between samples hard. They want a scalable system with smooth user interfaces, among other things. Hence the BioHDF project, which aims to deliver core tools to the community and to get feedback.

Benefits include: separating the model, implementation, and view of the data; combining data from multiple samples; compression and chunking; a rapid prototyping environment; a significant reduction in development time; and the ability to approach NGS analysis differently.
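
A small h5py sketch of the chunking and compression points – my own toy example, not BioHDF’s actual API, with made-up data:

    # Sketch only: chunked, compressed per-sample counts in HDF5 via h5py.
    import h5py
    import numpy as np

    counts = np.random.poisson(10, size=(4, 1_000_000))  # 4 samples, fake data

    with h5py.File("expression.h5", "w") as f:
        dset = f.create_dataset("samples/read_counts", data=counts,
                                chunks=(1, 65536),        # chunk along positions
                                compression="gzip", compression_opts=4)
        dset.attrs["description"] = "per-position read counts, one row per sample"

    with h5py.File("expression.h5", "r") as f:
        window = f["samples/read_counts"][2, 10_000:20_000]  # read one slice only
        print(window.mean())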

He has a bagful of thumb drives with the HDF software pre-loaded. Good idea!

FriendFeed discussion: http://ff.im/4vsfK

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


PSLID (ISMB DAM/BOSC 2009)

…the Protein Subcellular Location Image Database: Subcellular location assignments, annotated image collections, image analysis tools, and generative models of protein distributions

Robert Murphy, Carnegie Mellon

Everything he describes is open source and available on their website.

Tools that analyze images of proteins and their distributions within cells. SLIC = subcellular location image classification. The challenges: cells have different shapes, sizes, and orientations, and structures are not found in fixed locations within the cell. Instead of working on raw pixels, they describe each image numerically and operate on the descriptors, known as SLFs (subcellular location features). The tools within SLIC cover segmentation, feature classification, clustering, and comparison, and you can do the analysis at many different levels of granularity. Computational classification of subcellular location is very high quality. SLIC is available in Matlab and in Python, and some of it has been ported to C++/ITK.
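
To make “describe each image numerically” concrete, here is a toy sketch using scikit-image; these are generic shape descriptors of my own choosing, not the actual SLF definitions:

    # Sketch only: segment an image and compute per-object shape descriptors.
    import numpy as np
    from skimage import filters, measure

    image = np.random.rand(256, 256)  # stand-in for a fluorescence image
    mask = image > filters.threshold_otsu(image)
    labels = measure.label(mask)

    features = [(r.area, r.eccentricity, r.solidity)  # orientation-independent
                for r in measure.regionprops(labels)]
    print(len(features), "objects described numerically")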

Decomposing mixture patterns involves sorting out proteins that are found in more than one place. PUnMix either learns to unmix given instances of the pure patterns, or will use a previous instance of a pattern; you learn the pattern types by clustering on object features. For instance, if the system knows what a lysosomal and what a Golgi pattern look like and is given a mixture, it can tell you the fraction of each. But how do you test something like that? Create real images that are mixtures of two different probes.
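
One way to picture the unmixing step is non-negative least squares over pure-pattern feature signatures; this is my own toy sketch with invented numbers, not PUnMix itself:

    # Sketch only: estimate mixing fractions by non-negative least squares.
    import numpy as np
    from scipy.optimize import nnls

    pure = np.array([[0.9, 0.1, 0.3],     # invented lysosome-like signature
                     [0.2, 0.8, 0.5]]).T  # invented Golgi-like signature
    mixture = 0.7 * pure[:, 0] + 0.3 * pure[:, 1]  # synthetic 70/30 mix

    coeffs, _ = nnls(pure, mixture)
    print("estimated fractions:", coeffs / coeffs.sum())  # ~[0.7, 0.3]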

To determine nuclear shape you can use a medial axis model: 11 parameters allow you to synthesize one nucleus, and learning those 11 parameters over thousands of nuclei gives you a distribution. The cell shape model is based on distance ratios, with variation captured as a principal components model, typically using 10 principal components. Protein-containing objects are modelled as a mixture of Gaussian objects with learned distributions. The SLML toolbox is all about storing these models. If you want to do cell simulations, you can combine models together – interesting for the virtual cell, as you can model the proteins inside the cell this way.
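
A hedged sketch of the “mixture of Gaussian objects” idea using scikit-learn (a stand-in, not their generative-model code); the positions are synthetic:

    # Sketch only: fit and sample a Gaussian mixture over object positions.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal([30, 40], 5, size=(200, 2)),   # fake blob 1
                        rng.normal([70, 60], 8, size=(300, 2))])  # fake blob 2

    gmm = GaussianMixture(n_components=2).fit(points)
    print("learned means:\n", gmm.means_)
    synthetic, _ = gmm.sample(100)  # generate new object positions from the model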

They also distribute lots of annotated data sets where they’ve collected images of different proteins, both in 2d and 3d.

FriendFeed Discussion: http://ff.im/4vq84

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


MOLGENIS by example: generating an extensible platform (ISMB DAM/BOSC 2009)

…for genotype and phenotype experiments

Morris Swertz, Groningen

MOLGENIS is a database generator – a free toolbox of automated best practices and more. The idea is to separate the parts that differ between applications from the parts that are common. The blueprint is written in XML, and templates translate that XML into working software. The first step in MOLGENIS is modelling (though you can also extract a model from an existing database that isn’t yet part of MOLGENIS). Then a set of generators is run to build the appropriate apps; the generators are implemented in Freemarker. The programmer can then build on the generated code if they like.
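
The blueprint-to-code step is easy to mimic in miniature. MOLGENIS uses Freemarker templates in Java; this Python toy, with an invented XML model format, just shows the shape of the idea:

    # Sketch only: turn an XML blueprint into DDL with a template.
    import xml.etree.ElementTree as ET
    from string import Template

    blueprint = ET.fromstring("""
    <model>
      <entity name="sample">
        <field name="id" type="INTEGER"/>
        <field name="species" type="TEXT"/>
      </entity>
    </model>
    """)

    ddl = Template("CREATE TABLE $name ($columns);")
    for entity in blueprint.findall("entity"):
        columns = ", ".join(f"{f.get('name')} {f.get('type')}"
                            for f in entity.findall("field"))
        print(ddl.substitute(name=entity.get("name"), columns=columns))
    # -> CREATE TABLE sample (id INTEGER, species TEXT);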

He spent some of the time on a demo of MOLGENIS itself.

XGAP (eXtensible Genotype And Phenotype data platform) – DAM challenges include sharing data between QTL collaborators, handling a variety of species and methods, and ensuring reuse of ad-hoc analysis protocols. XGAP extends the FuGE standard. Their project plus six others are described in XGAP format.

The next step is to add more semantics. They would also like to do some federation and cloud computing.

FriendFeed discussion: http://ff.im/4vnGO

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


SysMO-DB: Sharing and Exchanging SB Data and Models (ISMB DAM SIG 2009)

Kathy Wolstencroft, University of Manchester

91 groups are part of the SysMO consortium, which started in 2008, a year after the other projects. They want to retrofit a data access and modelling platform over the existing resources used by the consortium. Their web interface looks similar to myExperiment. Sharing policies are present: your stuff is private until you say it isn’t. How are the assets exchanged? They’re divided into experimental and bioinformatics processes. For models (as opposed to data), SBML is the recommended format, and for those who model with SBML they have integrated with JWS Online.

Their data comes mostly from Excel spreadsheets, plus SABIO-RK, iChiP, MeMo, and others. To exchange data, they have the “just enough results model” (JERM), which lists a minimal amount of information for exchanging results: for each data type, you define a JERM, then apply that JERM to your data (see the sketch below). Keeping the data at the project sites has challenges: reliability, support, and archiving. Final thoughts: find a solution that fits current practices; start simple, show benefits, add more; engage with the people actually doing the work; let the scientists retain control over their data and who can see it; don’t reinvent; and help prevent duplicated work by linking the people as well as the resources.
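
A toy sketch of “define a JERM and apply it to your data” – the field names and CSV export are my own invention, purely to show the minimal-information check:

    # Sketch only: a minimal required-field check for one invented data type.
    import csv

    JERM_FIELDS = {"sample_id", "organism", "assay_type", "data_file"}  # invented

    def check_jerm(csv_path, required=JERM_FIELDS):
        with open(csv_path, newline="") as handle:
            reader = csv.DictReader(handle)
            missing = required - set(reader.fieldnames or [])
            if missing:
                return "missing columns: %s" % sorted(missing)
            for i, row in enumerate(reader, start=2):
                empty = [c for c in required if not (row[c] or "").strip()]
                if empty:
                    return "row %d: empty required fields %s" % (i, empty)
        return "conforms to the JERM"

    print(check_jerm("results.csv"))  # assumed spreadsheet export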

FriendFeed discussion: http://ff.im/4vigi

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


Grid/cloud data storage and computing for 3D image databases (ISMB DAM SIG 2009)

Christoph Best, EBI

Bioimage informatics is informatics for biological imaging, and helps make that data reliable and accessible. Structural databases at the EBI include the PDB and the EMDB, the latter part of the PDB effort at the EBI and Rutgers. For electron microscopy, objects get “plunged”: embedded in amorphous ice that is transparent to electrons, so the picture you get is an X-ray-style projection of the protein. The single-particle method computationally combines many such images to get molecular structures; you can build 3D images from the 2D ones, and you can increase the resolution by averaging. It is very much an iterative process. The resolution has been pushed up to 4 ångströms.

Data management issues: initial images run to 10-20 gigabytes, while the final data set is 1 MVoxel, which is considered small. Processing takes roughly a number of weeks on lab-owned clusters with a few hundred cores. The software is a mixture of 1970s Fortran code and 1990s C code: very fragmented, with a definite lack of standards. 3D reconstruction works by taking a series of images from different angles, making it possible to see a cell at “molecular resolution” (tomography of eukaryotic cells). They do a lot of processing on ensembles of images as well as on the images themselves. “Visual proteomics” aims to identify proteins in cryo-electron tomograms of intact cells, which involves pattern matching.
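
The averaging point is worth a tiny numerical sketch (mine, with synthetic data): noise in aligned particle images averages out roughly as 1/sqrt(N):

    # Sketch only: averaging N noisy copies shrinks the error roughly as 1/sqrt(N).
    import numpy as np

    rng = np.random.default_rng(1)
    truth = np.zeros((64, 64))
    truth[24:40, 24:40] = 1.0  # a fake "particle"

    for n in (1, 25, 400):
        acc = np.zeros_like(truth)
        for _ in range(n):
            acc += truth + rng.normal(0, 2.0, truth.shape)  # one noisy image
        err = np.abs(acc / n - truth).mean()
        print("N=%4d  mean abs error=%.3f" % (n, err))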

EMDB has about 600 entries, currently growing at approximately 15-20 per month. Metadata management is difficult – there have been many rounds of consultation with the community, and still most fields remain empty. Submissions consist of maps (increasingly more than one) with relations between data sets, and they’re looking into XML-based standards for representing those relationships. They’d like to get the community to submit all their original data. To hold all this, they need help from grid/cloud computing with data upload, distribution, etc.

My thoughts: A highly visual, very good talk.

FriendFeed discussion: http://ff.im/4vfm9

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


Standards and infrastructure for managing experimental metadata (ISMB DAM SIG 2009)

Philippe Rocca-Serra, EBI

Metagenomics and metatranscriptomics experiments are growing in size and complexity; Gilbert et al., PLoS ONE, 2008 is one example. There are many different domains of science, and they all share some common problems. Consistent reporting of experimental metadata along with the resulting data has a positive and long-lasting impact on the value of collective scientific outputs. To help solve these problems, many communities have developed reporting standards covering the minimal information to be reported about an experiment type (see MIBBI).

At the EBI, there are separate submission systems for proteomics data, transcriptomics data, sequence data, etc. This is frustrating for a researcher who may have all of this information in one experiment, so BII is being developed to simplify the submission process. More generally, work is underway to promote synergies among standards initiatives. The common efforts include MIBBI for scope, FuGE and ISA-TAB for syntax, and OBO Foundry ontologies and terminologies for semantics. There is a MIBBI talk later this week at ISMB by Chris Taylor.

There are a number of components in the ISA infrastructure (for more information, see http://isatab.sf.net). The ISAcreator Configurator allows you to set metadata fields and allowed values. ISAcreator itself is used to describe and upload the experimental metadata; ontologies are accessed in real time via the Ontology Lookup Service and BioPortal, and groups of samples are colour-coded. ISAcreator also has a nice visualisation of the various group types that gives you an overview of each group’s size and relative importance.
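
Since ISA-TAB study files are plain tab-delimited text, a group overview like ISAcreator’s can be approximated in a few lines; the file name and factor column here are assumed:

    # Sketch only: summarise sample groups from a tab-delimited ISA-TAB study file.
    import csv
    from collections import Counter

    with open("s_study.txt", newline="") as handle:  # assumed file name
        rows = list(csv.DictReader(handle, delimiter="\t"))

    # "Factor Value[...]" follows ISA-TAB column naming; the factor is assumed
    groups = Counter(row.get("Factor Value[treatment]", "?") for row in rows)
    for group, size in groups.items():
        print(group, ":", size, "samples")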

My thoughts: Firstly, you should know that I contributed to both the FuGE and ISA-TAB projects, and helped develop the ISA-TAB specification, so I have an interest in this. Moving on… Overall, ISAcreator really looks like it is coming along nicely from its early incarnations. It looks nice, which is very important for user uptake. It is also compatible with FuGE (though development is still ongoing to increase compatibility, I think).

FriendFeed discussion: http://ff.im/4vcW5

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


BioExtract Server (ISMB DAM SIG 2009)

Carol Lushbough, University of South Dakota

The BioExtract Server allows querying, exporting and saving data extracts, applying analytic tools, and saving and executing workflows. User queries can be constructed through a web UI. The data source implementation includes web services representing data sources, relational databases, and Apache Lucene data extracts. They use DTDs to describe a flat-file format (see the sketch below).
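
For the DTD point, a hedged sketch of validating a record against a DTD with lxml – the DTD itself is invented, since the BioExtract ones aren’t shown in the talk:

    # Sketch only: validate a record against an invented DTD with lxml.
    from io import StringIO
    from lxml import etree

    dtd = etree.DTD(StringIO("""
    <!ELEMENT record (id, sequence)>
    <!ELEMENT id (#PCDATA)>
    <!ELEMENT sequence (#PCDATA)>
    """))

    record = etree.fromstring(
        "<record><id>P12345</id><sequence>MKT</sequence></record>")
    print("valid:", dtd.validate(record))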

We were then taken through the use of BioExtract via a screenshot-based demo, so I didn’t get as many notes as I normally would.

FriendFeed discussion: http://ff.im/4vb2l

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!


Adapting the Galaxy Bioinformatics Tool (ISMB DAM SIG 2009)

…to Support Semantic Web Service Composition

Eileen Kramer, University of Georgia

They are working to extend Galaxy. Galaxy supports workflows through a Yahoo Pipes-style interface. It’s an open-source project and relatively easy to use, though some limitations exist: for instance, although it can be used via any browser, all tools must be installed on the server. They’re adding improvements such as web-service access to the tools, working mainly on web service composition and on adding semantic web services via WSDL-S and SAWSDL. Essentially, they want to be able to invoke web services from within Galaxy. A number of movies then demoed how these services work.

Galaxy has a number of different data types and provides some tools for converting between them. They want to perform data mediation, using XML data types for the web services. Their data mediation takes a top-down approach: a lifting mapping converts data to the common ontology, and a lowering mapping converts from the common ontology back to the target format (see the toy illustration below).
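
A toy illustration of lifting and lowering (the real system annotates services with WSDL-S/SAWSDL; these dicts and field names are invented):

    # Sketch only: lifting and lowering mappings as plain functions.
    galaxy_record = {"name": "BRCA1", "seq": "ATGGAT"}  # invented record

    def lift(record):
        """Galaxy-native record -> common-ontology representation."""
        return {"gene:symbol": record["name"], "gene:sequence": record["seq"]}

    def lower(common):
        """Common-ontology representation -> a target service's format."""
        return {"geneSymbol": common["gene:symbol"],
                "dnaSequence": common["gene:sequence"]}

    print(lower(lift(galaxy_record)))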

They want to combine Galaxy workflows with web service composition, and plan to extend the Galaxy workflow engine to support web services. One issue: should BPEL be used, or another process definition language? Another issue is process mediation – providing suggestions while a user composes a workflow, for a semi-automatic approach.

FriendFeed discussion: http://ff.im/4v2Wm

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!