Last week I had another volunteer session as a STEM Ambassador with a local primary school. I was one of three STEM Ambassadors (as well as the lovely STEM coordinator Catherine Brophy) who spoke for an hour with a variety of Key Stage 2 children (around 7 to 10 years old, I think) about being a scientist. They interviewed all four of us about what it means to be a scientist, why it is fun and important, and more. When I wanted to say a bit more about what it means to “do biology with a computer”, I referred occasionally to these slides I had made for a similar purpose a few years back:
The slides are full of pretty pictures (the notes to accompany these slides are also available on my blog) and they gave me a way to focus the children’s attention. However, the kids last week really didn’t need much focusing, and asked lots of great questions. They asked if I had invented anything (does inventing biological data standards count? I think so…!), who my inspirations were (my high school Biology teacher of course, among others), if I had any pets (slightly off topic!), and what my greatest accomplishment was (my thesis – phew!). They all seemed really animated, which is one of the best perks of going to a primary rather than a secondary school. I love the fact that secondary school students can handle more complex discussions, but the enthusiasm of primary school kids is just stupendous.
This trip to a primary school was very timely, as I had recently been to a parents’ meeting at my own son’s primary school about the change in the English school system away from levels. The head teacher was fantastic at explaining the changes, but one of the things I noticed about the new system was a seeming lack of guidance for schools in the science curriculum. It worries me that primary schools are being edged away from teaching science due to a large emphasis on English and Mathematics.
However, this is an issue that the STEM network can help solve. As ambassadors, we can come into your school and talk about science, how we became scientists, and why we love it. Last week it was three scientists talking, but the STEM network can provide experiments and other visual aids too, as well as helping schools enter science contests and fairs such as the Big Bang.
We had fun last week. We had a chemist (with a background in maths) who brought props: a cow bone, jelly babies, hair dye and other things. She asked the kids what the cow bone and jelly babies had in common (gelatin!), and quite a few of them knew. The other biologist had done field work with butterflies, and had the children imagine how you could mark or safely trap them. I talked about the structure of DNA (one child knew the comparison with a spiral staircase!), and how science was great because you don’t have to accept what anyone says “just because”. You don’t believe them? Test it! Science is imagination, testing, and reproducibility.
You don’t have to take my word for it “just because” I say it’s great – become a STEM Ambassador yourself, and test my statement that it is a completely awesome thing to do :)
By the end of last year, I had finished my work with both Manchester and Newcastle. Happily, I’ve found a new (working) home with Susanna, Philippe and the gang at the OERC in Oxford. I’ll be working part time on the BioSharing.org project, and will be doing all sorts of things related to biological data standards, policies, and databases.
I have a history in biological data standards and in developing community-driven standard formats, checklists and ontologies. I look forward to devoting some time to this collaborative, integrative project, and to helping people structure and manage their data well. To finish, here is a little bit about BioSharing, taken from the website itself:
“BioSharing works to map the landscape of community developed standards in the life sciences, broadly covering biological, natural and biomedical sciences. […] As part of the growing movement for reproducible research, a growing number of community-driven standardization efforts are working to make data along with the experimental details available in a standardized manner. BioSharing works to serve those seeking information on the existing standards, identify areas where duplications or gaps in coverage exist and promote harmonization to stop wasteful reinvention, and developing criteria to be used in evaluating standards for adoption.”
You may have noticed a pause in my posts (here, and on Twitter, and on G+ etc. etc.) – this is due to an 8 lb 15 1/2 ounce (4.07 kg) bouncing baby boy coming into our lives this past July :) So my apologies, I do plan to post more about bioinformatics and ontologies in the near future, but as this is a work blog (and not a baby blog!) there will be a little break now. Normal service will resume, eventually!
In a previous post, I talked about what kind of visualizations would make sense for large-scale epigenomic and related data. Now, I’d like to introduce the kind of data structures we’re building in the Newcastle ARIES project to support the creation of such visualizations.
Entanglement is a software platform that provides a generic, scalable graph framework suitable for data integration applications that require embarrassing scalability. With the data sets we are currently importing, write times for Entanglement scale linearly over millions of database entities. A poster (Entanglement: Embarrassingly Scalable Graphs) and abstract for Entanglement were presented last month at the Integrative Bioinformatics 2013 symposium at IPK-Gatersleben, Germany.
Included below is the text of the abstract, together with a summary of the poster’s contents.
Epigenetics is becoming a major focus of research in the area of human genetics. In addition to contributing to our knowledge of inheritance, epigenetic profiles can be used as prognostic or predictive biomarkers. Methylation of DNA in leukocytes is one of the most commonly measured forms of epigenetic modification. The ARIES project generates epigenomic information for a range of human tissues at multiple points in time. ARIES uses both Illumina Infinium 450K methylation arrays and BS-seq approaches to generate epigenetic data on a number of samples from the Avon Longitudinal Study of Parents and their Children (ALSPAC) cohort. ALSPAC is a unique resource for studying how methylation patterns change over time. The ARIES project also provides tools and resources to the community for the interpretation of epigenomic data in the form of an integrated dataset and associated Web portal for browsing and integrative analysis of experimental methylation data. The integrated dataset includes a range of data types such as phenotypic, transcriptomic and methylation data from rodents, together with data generated by studies such as the ENCODE project.
The integration of these data is a considerable bioinformatics challenge. To meet this challenge we are developing a graph-based data integration platform, extending our previous work with the ONDEX system. We have developed a scalable, parallel graph storage system that exploits Cloud computing infrastructures for integrating data to produce graphs of entities and the relationships between them. This system, called Entanglement, has been designed to tackle the problem of scalability that is inherent in most graph-based approaches to bioinformatics data integration.
Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.
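To make the branching idea a little more concrete, here is a minimal in-memory sketch in Python. It is not Entanglement's actual API or storage model: the `Branch` class, its method names, and the node identifiers (`cg0001`, `geneA`, `geneB`) are all hypothetical, chosen only to show grouped updates, forking from a common origin, and querying revisions from any branch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Branch:
    """A named, append-only revision history (hypothetical illustration only)."""
    name: str
    parent: Optional["Branch"] = None
    fork_point: int = 0  # how many of the parent's visible transactions we inherit
    transactions: List[List[dict]] = field(default_factory=list)

    def commit(self, updates: List[dict]) -> None:
        """Group several graph updates into one transaction and append it."""
        self.transactions.append(list(updates))

    def fork(self, name: str) -> "Branch":
        """Start an independent history sharing everything visible so far."""
        return Branch(name=name, parent=self, fork_point=len(self.history()))

    def history(self) -> List[List[dict]]:
        """All transactions visible from this branch, oldest first."""
        inherited = self.parent.history()[: self.fork_point] if self.parent else []
        return inherited + self.transactions

    def snapshot(self) -> Dict[str, dict]:
        """Replay the visible history to get each node's current properties."""
        nodes: Dict[str, dict] = {}
        for transaction in self.history():
            for update in transaction:
                nodes.setdefault(update["node"], {}).update(update["props"])
        return nodes


# Initial data import: two updates grouped into a single transaction.
main = Branch("import")
main.commit([
    {"node": "cg0001", "props": {"chromosome": "1"}},
    {"node": "geneA", "props": {"type": "gene"}},
])

# One branch per analysis, both starting from the same imported data.
analysis_a = main.fork("analysis-a")
analysis_b = main.fork("analysis-b")
analysis_a.commit([{"node": "cg0001", "props": {"score": 0.9}}])
analysis_b.commit([{"node": "cg0001", "props": {"score": 0.2}}])

# Later work on the original history is invisible to the existing branches.
main.commit([{"node": "geneB", "props": {"type": "gene"}}])

print(analysis_a.snapshot()["cg0001"])
print(analysis_b.snapshot()["cg0001"])
```

Because each branch only appends to its own list and records where it forked, the two analyses never overwrite each other, and the full chain of transactions behind any result can be recovered by walking `history()`.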
Data is distributed across a MongoDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed out across local computing clusters as well as cloud resources from commodity providers. Larger problems can be tackled by increasing the CPU and storage resources in a scalable fashion. An API provides access to a range of graph operations, including rapidly cloning or merging existing graphs to form new graphs. Although the ultimate aim is a fully integrated dataset, intentionally storing different data sources in different graphs provides a large amount of flexibility.
Multiple ad-hoc integrated views can be composed by importing references to the nodes and edges in various individual dataset graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.
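As a rough illustration of composing a view from references rather than copies, the Python sketch below keeps each dataset in its own graph and builds a view by looking up `(graph name, node id)` pairs. The graph names, node identifiers, and the `compose_view` function are assumptions made up for this example; they do not reflect how Entanglement stores or resolves references.

```python
from typing import Dict, Iterable, Tuple

# Each source dataset lives in its own graph: graph name -> node id -> properties.
Graphs = Dict[str, Dict[str, dict]]
NodeRef = Tuple[str, str]  # (graph name, node id)


def compose_view(graphs: Graphs, refs: Iterable[NodeRef]) -> Dict[NodeRef, dict]:
    """Build a view by resolving references; node data is shared, not copied."""
    return {(graph, node): graphs[graph][node] for graph, node in refs}


# Hypothetical dataset graphs, kept separate on purpose.
graphs: Graphs = {
    "methylation": {"cg0001": {"median_beta": 0.2}},
    "genes": {"geneA": {"chromosome": "1"}},
    "expression": {"probe7": {"level": 5.4}},
}

# An ad-hoc view that pulls in just the nodes of interest from two datasets.
view = compose_view(graphs, [("methylation", "cg0001"), ("genes", "geneA")])

# The view points at the original node, so updates to the source show through.
graphs["methylation"]["cg0001"]["median_beta"] = 0.25
print(view[("methylation", "cg0001")])
```

Keeping the key as a `(graph, node)` pair means the same node id can appear in several datasets without clashing, and any number of views can be built over the same underlying graphs.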
Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly available data resources. A data model with project-specific node and edge definitions has also been developed.
Data integration for ARIES will ultimately require graphs containing hundreds of millions, if not billions, of graph entities. Entanglement has been shown to scale linearly with our initial ARIES datasets involving graphs with up to 50 million nodes and edges. Our results suggest that the system will scale to much larger graph sizes. Data storage capacity can be expanded by adding MongoDB servers to an existing cluster. Indexes required for efficient querying are designed to fit in memory, as long as enough machines are available to the cluster.
Entanglement is available under the Apache license at http://www.entanglementgraph.com.
I am currently working in Prof. Neil Wipat’s group at Newcastle University on the ARIES project. This involves working with large amounts of epigenomics data from the ARIES project itself, as well as with all sorts of related information from external data sources.
As well as the production of an integrated data set for ARIES’ nascent genome track browser, an indispensable tool for this type of data, we’re working on something else very exciting: graph data. Specifically, we’re trying out all sorts of visualizations for relevant data sets (including the ARIES data). Here’s one I’ve been playing with recently.
The pink nodes represent the median beta values for methylation sites along the human genome. The lighter the pink, the less likely this particular point on the chromosome is methylated, and vice versa. At a glance, the user can see all integrated beta values (and therefore all experiments containing methylation information) for a particular chromosome location. This is a small, gene-centric graph (the gene is in green) intended for people who would like to see an overview of known experimental results for a particular gene of interest.
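For the curious, the colouring idea can be sketched in a few lines of Python. This is only an illustration: the site identifiers, the beta values, and the two end-point colours are invented for the example, and the real graphs are styled in a visualization tool rather than with this code.

```python
from statistics import median
from typing import Dict, List

LIGHT_PINK = (255, 228, 235)  # assumed colour for beta = 0 (unmethylated)
DEEP_PINK = (199, 21, 133)    # assumed colour for beta = 1 (fully methylated)


def beta_to_pink(beta: float) -> str:
    """Linearly interpolate a hex colour: the higher the beta, the darker the pink."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta value must be between 0 and 1, got {beta}")
    channels = (
        round(light + (dark - light) * beta)
        for light, dark in zip(LIGHT_PINK, DEEP_PINK)
    )
    return "#" + "".join(f"{channel:02x}" for channel in channels)


# Hypothetical beta values from several experiments for two methylation sites.
betas_by_site: Dict[str, List[float]] = {
    "cg0001": [0.10, 0.20, 0.90],
    "cg0002": [0.80, 0.85],
}

# One node colour per site, based on the median across experiments.
node_colours = {
    site: beta_to_pink(median(values)) for site, values in betas_by_site.items()
}
print(node_colours)
```

Using the median rather than the mean keeps a single outlying experiment (such as the 0.90 above) from dragging the node colour away from what most experiments report.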
This is just the start; we have lots of other visualization ideas, as well as lots of ideas for the creation of novel (and interesting) types of subgraphs. Our database is huge, and the hairball of the entire thing (or even of just one chromosome) is unlikely to be as informative as subgraphs like this one, created around a particular area of interest.
But we’re not just working on the export of interesting subgraphs from our graph database: my colleague Keith Flanagan has been developing a highly scalable and incredibly neat graph database built on MongoDB.
You’ll probably see a lot of pictures like the one above in the coming weeks on this blog, as we experiment with visualizations and views of our data. If you’re into epigenomics, and have always wanted to view your data in a particular way, please leave a comment below. I’d also love to hear your ideas about this particular visualization type. Your input would be most welcome.
Below you can find a complete table of contents for all thesis-related posts (you can also get to the posts via the “thesis” tag I have used for each). Enjoy!
- Additional Front Material
- Chapter 1: Background
- Chapter 2: SyMBA: a Web-based metadata and data archive for biological experiments
- Chapter 3: Saint: a Web application for model annotation utilising syntactic integration
- Chapter 4: Model Format OWL integrates multiple SBML specification documents
- Chapter 5: Rule-based mediation for the semantic integration of systems biology data
- Chapter 6: General discussion
And, if you’re interested in how I performed the conversion, I’ve written about that too.