Home > ARIES, Data Integration, Meetings & Conferences > Entanglement: Embarassingly-scalable graphs

Entanglement: Embarassingly-scalable graphs

In a previous post, I talked about what kind of visualizations would make sense for large-scale epigenomic and related data. Now, I’d like to introduce the kind of data structures we’re building in the Newcastle ARIES project to support the creation of such visualizations.

Entanglement is a software platform that provides a generic, scalable, graph framework suitable for data integration applications that require embarrassing scalability. With the data sets we are currently importing, write times for Entanglement scale linearly over millions of database entities. A poster (Entanglement: Embarrassingly Scalable Graphs) and abstract for Entanglement were presented last month at the Integrative Bioinformatics 2013 symposium in IPK-Gatersleben, Germany.

Included below is the text of the abstract, together with a summary of the poster’s contents.

Epigenetics is becoming a major focus of research in the area of human genetics. In addition to contributing to our knowledge of inheritance, epigenetic profiles can be used as prognostic or predictive biomarkers. Methylation of DNA in leukocytes is one of the most commonly measured forms of epigenetic modification. The ARIES project generates epigenomic information for a range of human tissues at multiple points in time. ARIES uses both Illumina Infinium 450K methylation arrays and BS-seq approaches to generate epigenetic data on a number of samples from the Avon Longitudinal Study of Parents and their Children (ALSPAC) cohort. ALSPAC is a unique resource for studying how methylation patterns change over time. The ARIES project also provides tools and resources to the community for the interpretation of epigenomic data in the form of an integrated dataset and associated Web portal for browsing and integrative analysis of experimental methylation data. The integrated dataset includes a range of data types such as phenotypic, transcriptomic and methylation data from rodents, together with data generated by studies such as the ENCODE project.

The integration of these data is a considerable bioinformatics challenge. To meet this challenge we are developing a graph-based data integration platform, extending our previous work with the ONDEX system. We have developed a scalable, parallel graph storage system that exploits Cloud computing infrastructures for integrating data to produce graphs of entities and the relationships between them. This system, called Entanglement, has been designed to tackle the problem of scalability that is inherent in most graph-based approaches to bioinformatics data integration.

Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.

Data is distributed across a MongoDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed across local computing clusters as well as to cloud computing from commodity providers. Larger problems can be tackled by increasing the CPU and storage resources in a scalable fashion. An API provides access to a range of graph operations including rapidly cloning or merging existing graphs to form new graphs. Although the ultimate aim is a fully integrated dataset, by intentionally storing different data sources in different graphs a large amount of flexibility can be obtained.
Multiple ad-hoc integrated views can be composed by importing references to the nodes and edges in various individual dataset graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.

Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publically-available data resources. A data model with project-specific node and edge definitions has also been developed.

Entanglement Architecture Overview

Entanglement (http://www.entanglementgraph.com) Architecture Overview

Data integration for ARIES will ultimately require graphs containing hundreds of millions, if not billions, of graph entities. Entanglement has been shown to scale linearly with our initial ARIES datasets involving graphs with up to 50 million nodes and edges. Our results suggest that the system will scale to much larger graph sizes. Data storage capacity can be expanded by adding MongoDB servers to an existing cluster. Indexes required for efficient querying are designed to fit in memory, as long as enough machines are available to the cluster.

Entanglement is available under the Apache license at http://www.entanglementgraph.com.

About these ads
  1. dwoolfenden
    May 4, 2013 at 14:15

    Hello, very interesting, I would like to take a closer look at Entanglement but I can’t seem to find the download link at http://www.entanglementgraph.com ?
    Thanks.
    dave

    • May 7, 2013 at 13:55

      Thanks for your comment. While Entanglement is hosted on bitbucket, we have it currently set to a private project pending finalization of the API. We’re happy to take any questions in the interim, and in the next week or two we should have a date scheduled for the release of the code. I hope this helps.

  2. dwoolfenden
    May 13, 2013 at 19:55

    Allyson, I do have a couple questions, but, I will wait until the project is made public to ask them. Look forward to having access in the near future!
    Thanks.
    dave

  3. dwoolfenden
    July 5, 2013 at 12:30

    Allyson, Hello, is there any update here? Thanks. dave

  1. No trackbacks yet.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 494 other followers

%d bloggers like this: