Entanglement: Embarrassingly-scalable graphs
In a previous post, I talked about what kind of visualizations would make sense for large-scale epigenomic and related data. Now, I’d like to introduce the kind of data structures we’re building in the Newcastle ARIES project to support the creation of such visualizations.
Entanglement is a software platform that provides a generic, scalable graph framework suitable for data integration applications that require embarrassing scalability. With the data sets we are currently importing, write times for Entanglement scale linearly over millions of database entities. A poster (Entanglement: Embarrassingly Scalable Graphs) and abstract for Entanglement were presented last month at the Integrative Bioinformatics 2013 symposium at IPK-Gatersleben, Germany.
Included below is the text of the abstract, together with a summary of the poster’s contents.
Epigenetics is becoming a major focus of research in the area of human genetics. In addition to contributing to our knowledge of inheritance, epigenetic profiles can be used as prognostic or predictive biomarkers. Methylation of DNA in leukocytes is one of the most commonly measured forms of epigenetic modification. The ARIES project generates epigenomic information for a range of human tissues at multiple points in time. ARIES uses both Illumina Infinium 450K methylation arrays and BS-seq approaches to generate epigenetic data on a number of samples from the Avon Longitudinal Study of Parents and their Children (ALSPAC) cohort. ALSPAC is a unique resource for studying how methylation patterns change over time. The ARIES project also provides tools and resources to the community for the interpretation of epigenomic data in the form of an integrated dataset and associated Web portal for browsing and integrative analysis of experimental methylation data. The integrated dataset includes a range of data types such as phenotypic, transcriptomic and methylation data from rodents, together with data generated by studies such as the ENCODE project.
The integration of these data is a considerable bioinformatics challenge. To meet this challenge we are developing a graph-based data integration platform, extending our previous work with the ONDEX system. We have developed a scalable, parallel graph storage system that exploits Cloud computing infrastructures for integrating data to produce graphs of entities and the relationships between them. This system, called Entanglement, has been designed to tackle the problem of scalability that is inherent in most graph-based approaches to bioinformatics data integration.
Entanglement has a number of unique features. A revision history component maintains a provenance trail that records every update to every graph entity stored in the database. Multiple graph update operations submitted to the revision history may be grouped together to form transactions. Furthermore, the revision history may be forked at arbitrary points. Branching is a powerful feature that enables one or more independent revision histories to diverge from a common origin. The branch feature is useful in situations where a set of different analyses must be performed using the same input data as a starting point. After an initial data import operation, a graph can be branched multiple times, once for each analysis that needs to be performed. Each analysis is performed within its own independent graph branch, and is potentially executed in parallel. Subsequent analyses could then create further sub-branches as required. The provenance of multiple chains of analyses (workflows) is stored as part of the graph revision history. Node and edge revisions from any branch can be queried at any time.
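The revision history and branching behaviour described above can be illustrated with a minimal in-memory sketch. This is not Entanglement's actual API: the names `Revision`, `Branch`, `commit`, `fork`, `history` and `current` are hypothetical, and the sketch assumes that a branch is an append-only log which sees its parent's revisions only up to the point where it was forked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """One recorded update to a single graph entity (hypothetical shape)."""
    entity_id: str
    properties: dict
    transaction_id: int  # scoped to the branch that recorded it


@dataclass
class Branch:
    """An append-only revision log that may fork from a parent branch."""
    name: str
    parent: Optional["Branch"] = None
    fork_point: int = 0  # how many of the parent's revisions are visible here
    log: List[Revision] = field(default_factory=list)
    next_txn: int = 0

    def commit(self, updates: Dict[str, dict]) -> int:
        """Record a group of entity updates as one transaction."""
        txn = self.next_txn
        self.next_txn += 1
        for entity_id, props in updates.items():
            self.log.append(Revision(entity_id, dict(props), txn))
        return txn

    def fork(self, name: str) -> "Branch":
        """Start an independent history sharing everything recorded so far."""
        return Branch(name=name, parent=self,
                      fork_point=len(self.history()), next_txn=self.next_txn)

    def history(self) -> List[Revision]:
        """All revisions visible from this branch, oldest first."""
        inherited = self.parent.history()[: self.fork_point] if self.parent else []
        return inherited + self.log

    def current(self, entity_id: str) -> Optional[dict]:
        """Latest state of an entity, rebuilt by replaying its revisions."""
        state = None
        for rev in self.history():
            if rev.entity_id == entity_id:
                state = {**(state or {}), **rev.properties}
        return state


# Import once, then branch once per analysis (identifiers are made up).
main = Branch("main")
main.commit({"node-1": {"kind": "gene"}, "node-2": {"kind": "probe"}})
analysis_a = main.fork("analysis-a")
analysis_b = main.fork("analysis-b")
analysis_a.commit({"node-1": {"score": 0.9}})
analysis_b.commit({"node-1": {"score": 0.2}})
```

In this sketch each analysis branch sees the shared import plus its own updates, while `main` is unaffected by either. Because nothing is ever overwritten, the full provenance of every entity remains queryable from any branch.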
Data is distributed across a MongoDB cluster to provide arbitrary-scale data storage. As a result, data storage and retrieval procedures scale linearly with graph size. Graphs can be populated in parallel on multiple worker compute nodes, allowing large jobs to be farmed out to local computing clusters as well as to cloud infrastructure from commodity providers. Larger problems can be tackled by adding CPU and storage resources as needed. An API provides access to a range of graph operations, including rapidly cloning or merging existing graphs to form new graphs. Although the ultimate aim is a fully integrated dataset, intentionally storing different data sources in different graphs provides a large amount of flexibility.
Multiple ad-hoc integrated views can be composed by importing references to the nodes and edges in various individual dataset graphs. Entanglement also provides export utilities allowing graphs or subgraphs to be visualised and analysed in existing tools such as ONDEX or Gephi.
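As a rough illustration of composing an integrated view from references, the sketch below keeps each data source in its own graph and builds a view that stores only pointers to nodes in those graphs. The classes `Graph`, `NodeRef` and `IntegratedView` and their methods are hypothetical names chosen for this example, not part of Entanglement's API. Edges are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Graph:
    """A named per-source graph that owns its node data."""
    name: str
    nodes: Dict[str, dict] = field(default_factory=dict)


@dataclass(frozen=True)
class NodeRef:
    """Pointer to a node stored in another graph; no data is copied."""
    graph_name: str
    node_id: str


class IntegratedView:
    """An ad-hoc view assembled from references into per-source graphs."""

    def __init__(self, graphs: Dict[str, Graph]):
        self._graphs = graphs
        self._refs: List[NodeRef] = []

    def import_node(self, graph_name: str, node_id: str) -> NodeRef:
        """Add a reference to an existing node; fail if it does not exist."""
        if node_id not in self._graphs[graph_name].nodes:
            raise KeyError(f"{node_id!r} not found in graph {graph_name!r}")
        ref = NodeRef(graph_name, node_id)
        self._refs.append(ref)
        return ref

    def resolve(self, ref: NodeRef) -> dict:
        """Look the node up in its home graph at query time."""
        return self._graphs[ref.graph_name].nodes[ref.node_id]

    def nodes(self) -> Iterator[Tuple[NodeRef, dict]]:
        """Iterate over every referenced node with its current data."""
        for ref in self._refs:
            yield ref, self.resolve(ref)
```

Because the view resolves references at query time, later updates to a source graph are visible through every view that references it, and several views can share the same underlying nodes without duplicating them.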
Domain-specific data models and queries can be built on top of the generic API provided by Entanglement. We have developed a number of data import components for parsing both ARIES-specific and publicly available data resources. A data model with project-specific node and edge definitions has also been developed.
Data integration for ARIES will ultimately require graphs containing hundreds of millions, if not billions, of graph entities. Entanglement has been shown to scale linearly with our initial ARIES datasets involving graphs with up to 50 million nodes and edges. Our results suggest that the system will scale to much larger graph sizes. Data storage capacity can be expanded by adding MongoDB servers to an existing cluster. Indexes required for efficient querying are designed to fit in memory, as long as enough machines are available to the cluster.
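To give a feel for why capacity can grow by adding servers, here is a generic sketch of hash-based partitioning, in which each graph entity is assigned to one of several storage servers by hashing its identifier. It is an assumption-laden illustration of the general idea only: it does not describe how Entanglement or MongoDB actually place data, and the function names `shard_for` and `distribution` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def shard_for(entity_id: str, shard_count: int) -> int:
    """Map an entity id to a shard index using a stable hash."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribution(entity_ids: List[str], shard_count: int) -> Counter:
    """Count how many entities land on each shard."""
    return Counter(shard_for(e, shard_count) for e in entity_ids)


# Synthetic ids, used only to show how the load spreads across shards.
ids = [f"entity-{i}" for i in range(10_000)]
per_shard = distribution(ids, shard_count=4)
```

With a well-mixed hash, each shard receives roughly the same share of entities, so the data and index held by any single machine shrink as machines are added. Note that simple modulo placement like this moves most keys whenever the shard count changes, which is one reason production systems use more elaborate schemes.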
Entanglement is available under the Apache license at http://www.entanglementgraph.com.