ENCODE – Understanding our genome

Ewan Birney
Keynote Talk, Afternoon Session, 3 September (11th MGED Meeting, 1-4 September, 2008)


There are 4418 TSSs with multiple lines of evidence supporting them, which is ~10-fold more than the number of genes. Only 38% of these would be traditional ones. Having many more predicted TSSs is consistent with the considerable diversity of transcripts. Independently, integrating the ChIP-chip data suggested ~1000 "regulatory clusters". Sequence-specific factors are distributed symmetrically around the TSS. Histone information is highly correlated with gene on/off status.

What about the distal sites, and finding them? ChIP-chip isn't "great": most sites look close to one of these new TSSs, and there could be factor bias. DNaseI hypersensitive sites (DHSs) are an alternative, as all factors give a DHS signal, and 55% of DHSs are distal to any TSS.
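To make the "distal to any TSS" figure concrete, here is a minimal sketch of how such a fraction could be computed. It is not the ENCODE analysis itself: the function names, the single-chromosome coordinates and the 2500 bp cutoff are all assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_tss_distance(pos, sorted_tss):
    """Distance from a position to the closest TSS in a sorted list."""
    i = bisect_left(sorted_tss, pos)
    candidates = []
    if i < len(sorted_tss):
        candidates.append(sorted_tss[i] - pos)      # nearest TSS at or after pos
    if i > 0:
        candidates.append(pos - sorted_tss[i - 1])  # nearest TSS before pos
    return min(candidates, default=float("inf"))    # no TSS at all: infinitely far


def fraction_distal(dhs_positions, tss_positions, cutoff=2500):
    """Fraction of DHS positions farther than `cutoff` bp from every TSS.

    The cutoff is a hypothetical value, not one given in the talk.
    """
    tss = sorted(tss_positions)
    distal = sum(1 for p in dhs_positions
                 if nearest_tss_distance(p, tss) > cutoff)
    return distal / len(dhs_positions)


# Toy example on one chromosome (made-up coordinates)
toy_tss = [1000, 10000]
toy_dhs = [1200, 5000, 9900, 20000]
print(fraction_distal(toy_dhs, toy_tss))
```

The sorted list plus binary search keeps each lookup cheap, which matters once there are thousands of TSSs and many more DHSs.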

Evolutionary conservation and ENCODE. The 44 ENCODE (pilot) regions together contain 29.998 million bases. Of that, 4.9% are constrained, and of that, 40% are unannotated, 20% are other ENCODE Experimental Annotations, 8% are UTRs, and 32% are coding. Most of the genome is unconserved, which is to be expected. But not everything is constrained. For instance, ancient repeats (ARs) have a very small fraction of experimental annotation overlapping a constrained sequence (i.e. they genuinely look like they're evolving neutrally). About 90% of coding exons are constrained in some way. Under 50% of the DHSs are unconstrained.

Why is there this discrepancy? False positives in the experiments? Not likely: the experiments validate at >80% and cross-validate each other. What about false negatives in the constraint detection? Not likely again: it can detect up to 8bp elements, and within the "neutral" zone of alignability.

Ok, what about the neutral turnover model… There could be functional conservation, where an ancestor starts out with two promoter sites, and after a speciation event each new species keeps a different one of the two. Then it could look like there is an unconserved region, when the function is actually conserved.
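The constraint figures above can be turned into approximate base counts with a few lines of arithmetic. This is only a worked example using the percentages as noted; the variable and function names are made up for illustration.

```python
TOTAL_BASES = 29_998_000      # 29.998 million bases in the pilot regions
CONSTRAINED_FRACTION = 0.049  # 4.9% of those bases are constrained

# Share of the constrained bases falling in each annotation class
BREAKDOWN = {
    "unannotated": 0.40,
    "other ENCODE experimental annotations": 0.20,
    "UTRs": 0.08,
    "coding": 0.32,
}


def constrained_breakdown(total=TOTAL_BASES,
                          fraction=CONSTRAINED_FRACTION,
                          breakdown=BREAKDOWN):
    """Return (constrained bases, bases per annotation class)."""
    constrained = total * fraction
    per_class = {name: constrained * share
                 for name, share in breakdown.items()}
    return constrained, per_class


constrained, per_class = constrained_breakdown()
print(f"constrained: {constrained:,.0f} bases")
for name, bases in per_class.items():
    print(f"  {name}: {bases:,.0f} bases")
```

So roughly 1.47 million bases are constrained, and the largest single class among them is sequence with no annotation at all.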

What we should learn from ENCODE. "Whacky" transcription is real (but we don't know what it does), and there are unconventional transcripts. There are lots more TSSs than we understand (many "distal" regions are actually close to promoters). Broad-specificity marks are more useful. Neutral model: the fact that things happen reproducibly in multiple tissues does not imply selection (this is not the same as experimental variance). It could imply "functional" conservation outside of orthologous bases (comparative genomic sequencing is not enough: comparative functional investigation is needed).

ENCODE scale-up: 7 grants spanning all the main types of data generated for the pilot. There will be coordinated data collection (UCSC) and integrative analysis. There is also far tighter coordination (cell lines, standards, growth behaviour).


How to handle ENCODE data in Ensembl? In the gene build, add supporting evidence and annotation; from there, you get classification either manually or automatically. In a Regulatory build, declare sections of the genome as regulatory features using the union of many experiments. Then there will be predicted binding of Myc, information about promoter elements, and many cell lines. They are trying to break down the problem into 2 axes: the elements, and the status/annotation of those elements. There are point sources (PS) and annotating sources/broad marks (AS). PS are DNaseI sites and TF binding. AS will be histone marks and methylation status. Each feature will be represented with a small box for the PS and "whiskers" for the area of the AS.
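The box-and-whiskers idea can be sketched as a small data structure. This is not Ensembl's actual Regulatory build code: the class name, the function names and the rule for attaching broad marks (any overlap with the core) are assumptions made purely to illustrate the two axes described above.

```python
from dataclasses import dataclass


@dataclass
class RegulatoryFeature:
    core_start: int      # the "box": union of overlapping point sources
    core_end: int
    whisker_start: int   # the "whiskers": extent of overlapping broad marks
    whisker_end: int


def merge_point_sources(peaks):
    """Union overlapping (start, end) peaks from many experiments."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current core
        else:
            merged.append([start, end])              # start a new core
    return [tuple(m) for m in merged]


def build_features(point_peaks, broad_marks):
    """Build one feature per merged core, with whiskers from broad marks."""
    features = []
    for cs, ce in merge_point_sources(point_peaks):
        ws, we = cs, ce
        for bs, be in broad_marks:
            if bs <= ce and be >= cs:                # broad mark overlaps the core
                ws, we = min(ws, bs), max(we, be)
        features.append(RegulatoryFeature(cs, ce, ws, we))
    return features


# Toy data: DNaseI/TF peaks (point sources) and histone marks (broad marks)
toy_peaks = [(100, 200), (150, 250), (1000, 1100)]
toy_marks = [(50, 400), (5000, 6000)]
for feature in build_features(toy_peaks, toy_marks):
    print(feature)
```

Keeping the core and the whiskers as separate coordinates mirrors the split between "what the element is" and "what status it has", so the same core can carry different whiskers in different cell lines.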

Initial Regulatory Build (headaches): need to consistently re-call peak data (they have their own in-house caller, SWEmbl, and hope to work via ENCODE to harmonize). There are also genomic headaches (mitochondrial repeats in the genome, centromeres, etc.) and very long regions; it's reminiscent of gene builds. The build has 55 different datasets and 172112 elements…

Names of the very broad classification types for Regulatory features: genic (generic), promoter (cell-specific), genic (cell-specific), promoter (>1 cell line), unclassified (>1 cell line), unclassified (cell-specific).

These are just my notes and are not guaranteed to be correct.
Please feel free to let me know about any errors, which are all my
fault and not the fault of the speaker. 🙂
