Categories
Meetings & Conferences

PT47: Predicting and Understanding the Stability of G-Quadruplexes (ISMB 2009)

Oliver Stegle

Quadruplexes are stable structures that can form in both DNA and RNA. They play a role in the regulation of transcription. A stable fold-back structure can emerge in the presence of a cation. Are these patterns really stable? Which of them carry a functional role? The first indicator of functionality is stability, and melting temperature is used as a proxy for stability. UV-melt experiments are low-throughput. Further, the rules for quadruplex stability are complicated and involve non-linear relationships.

They want to solve the regression problem of mapping a quadruplex input to its melting temperature in a supervised setting. They use Gaussian process regression, which gives a marginal posterior mean together with error bars, so each prediction comes with an estimate of its uncertainty.
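To make the "posterior mean and error bars" idea concrete, here is a minimal sketch of Gaussian process regression in plain Python. It is not the speaker's implementation: it assumes a single scalar input feature, a squared exponential kernel with fixed hyperparameters, and ordinary Gaussian noise, and all function and parameter names (`gp_predict`, `noise_var`, `length_scale`) are my own.

```python
import math


def sq_exp_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_cholesky(low, b):
    """Solve (low * low^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_predict(train_x, train_y, test_x, noise_var=0.1, length_scale=1.0):
    """Return a (posterior mean, posterior std) pair for each test input.

    The std describes the latent function, so it excludes observation noise.
    """
    n = len(train_x)
    mean_y = sum(train_y) / n  # constant prior mean, set to the training average
    k = [[sq_exp_kernel(a, b, length_scale) + (noise_var if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    low = cholesky(k)
    alpha = solve_cholesky(low, [y - mean_y for y in train_y])
    preds = []
    for x in test_x:
        k_star = [sq_exp_kernel(x, a, length_scale) for a in train_x]
        mean = mean_y + sum(ks * al for ks, al in zip(k_star, alpha))
        v = solve_cholesky(low, k_star)
        var = sq_exp_kernel(x, x, length_scale) - sum(
            ks * vi for ks, vi in zip(k_star, v))
        preds.append((mean, math.sqrt(max(var, 0.0))))
    return preds
```

Calling `gp_predict` with a handful of training points returns one (mean, standard deviation) pair per test point. A test point far from all the training inputs falls back to the prior mean with a wide error bar, which is the behaviour that makes the error bars useful.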

Sequence is handled with a spectrum kernel, which captures local sequence differences, together with a squared exponential kernel for the candidate features. This regression method was chosen because the training data is noisy, with outliers. Also, some of these structures are so stable that they never melt. They therefore use a non-Gaussian likelihood model: a robust mixture likelihood that accounts for outliers, plus a step function to incorporate observations that are only bounds.
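For anyone unfamiliar with spectrum kernels, the basic idea is to compare two sequences by counting the short subsequences (k-mers) they share. The sketch below is a generic version in Python, assuming k = 3 and simple unweighted counts; the talk did not say which k or which normalisation was used, so treat those choices and the function names as illustrative only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared k-mers count.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalized_spectrum_kernel(a, b, k=3):
    """Spectrum kernel rescaled so that a sequence compared with itself gives 1."""
    denom = (spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k)) ** 0.5
    return spectrum_kernel(a, b, k) / denom if denom else 0.0
```

Two sequences that differ only in a loop base still share most of their k-mers, so the kernel value stays high. Sequences with no k-mer in common get a value of zero.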

Predictive accuracy: they look at 260 quadruplexes (one of the first data sets available for quadruplexes). They compared linear regression, SVM, GP, and GP with robust noise. As the problem gets harder, the error goes up (as expected). The linear model is not adequate. The robust GP was the best of the lot. It gains significantly over the other models: it is better at determining confidence levels. With a 50/50 training split, the predictions (with the error bars) always overlap with the “truth” line, sometimes with a large uncertainty. Everything is predicted within 10 degrees C.

Genome-wide quadruplex candidates. Structures are taken from quadruplex.org (360,000 candidate sequences). Can we predict them? Is there any relation between location in the genome and melting temperature? Quadruplexes are overrepresented in promoter regions by an order of magnitude compared with anywhere else.

The current training dataset with 260 observations is not very big. Also, what sequences should be tested next? Active selection has been applied to 10,987 quadruplexes in promoter regions, with 30 measurements selected actively and at random. He presented a Gaussian process regression scheme for quadruplex stability, with good estimates of predictive uncertainty.
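One common way to do active selection is uncertainty sampling: measure next whichever candidates the model is least sure about. The notes do not record which selection criterion was actually used, so the sketch below is only an assumption. It takes a mapping from candidate name to predictive standard deviation, and the random picker is there as the baseline to compare against.

```python
import random


def select_actively(predictive_std, n):
    """Pick the n candidates with the largest predictive standard deviation."""
    return sorted(predictive_std, key=predictive_std.get, reverse=True)[:n]


def select_randomly(predictive_std, n, seed=0):
    """Pick n candidates uniformly at random, as a baseline."""
    rng = random.Random(seed)
    return rng.sample(sorted(predictive_std), n)
```

In a real loop you would refit the model after each new measurement and recompute the uncertainties, rather than taking the top n in one go.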

Allyson’s notes: Some of this was a little over my head, so there may be more than a normal chance of me getting some of these notes wrong! 🙂

Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!

Four-Stranded DNA: How G-Quadruplexes Control Transcription and Translation (BioSysBio 2009)

JL Huppert et al.
University of Cambridge

They do both computational and experimental work to try to understand these structures. The classical base pair arrangements are not the only structures you can have. You can arrange them in tetrads with a phosphate backbone and potassium ion in the center. This allows you to have a single strand that falls back on itself to form a loop. This 4-stranded DNA could be associated with the human telomeric repeat. Telomerase is responsible for elongating telomeres and keeps them going in things like stem cells, and is also active in 85% of cancers.

These can attach themselves to the promoter and cause altered transcription. A drug or other protein could shift the state of the DNA from having an accessible promoter or not. Many genes involved in cancer have G-quadruplexes in their promoters. He asks: can we predict structure from sequence? Can we get information about their stability, for example? Where are G-quadruplexes found? What do they do? What can we do to them? The Quadparser algorithm was developed, and it looks like there are 379,000 G-quadruplexes encoded in the human genome. This algorithm is not perfect: it doesn't tell us anything about stability, among other things. So, they've developed a non-linear Bayesian predictor, with a Gaussian noise model. It uses a list of possible features, fits to these using a non-linear model, tolerates outliers and bounds, learns the relevance of inputs, and gives predictions and error bars. They tested with 256 datapoints, with a 70/30 split for learning/testing sets. It performed better than linear regression and simpler Gaussian processes.
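The talk did not spell out the rule Quadparser uses, but the folding pattern usually quoted for it is four runs of three or more guanines separated by loops of one to seven bases. Assuming that pattern, a rough sketch of the sequence scan in Python looks like this. The function name is my own, it only scans the strand it is given, and it reports non-overlapping matches, so it will not reproduce the published genome counts.

```python
import re

# Assumed folding rule: G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}
QUADRUPLEX_PATTERN = re.compile(r"(?:G{3,}[ACGTN]{1,7}){3}G{3,}")


def find_quadruplex_candidates(seq):
    """Return (start, matched sequence) for each non-overlapping candidate."""
    return [(m.start(), m.group())
            for m in QUADRUPLEX_PATTERN.finditer(seq.upper())]
```

Running this on a few copies of the human telomeric repeat (GGGTTA) gives a match once there are four G-runs, while three G-runs are not enough. As the notes say, a match says nothing about how stable the structure is.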

Over 40% of all known genes have a G-quadruplex motif in a 1kb promoter region. They are more stable than most. It's a really common regulatory element. It depends on the type of gene, whether or not it has this type of interaction. Oncogenes are enriched: 69% have such motifs.

They looked at one of these interesting proteins, N-ras, which is a GTPase involved in cell signalling. They found that when you remove the quadruplex, you get 4x as much of the protein. Others have taken this further and found a correlation between the amount of repression and the stability of the quadruplex. The quadruplex can also act as a pause between two closely-spaced genes.

Quadruplexes are extremely well conserved. We can split quadruplexes into the loops and non-loop areas, and find that the variation is localized in the loops rather than the core, non-loop areas by examining SNPs. What is the evolutionary direction of the changes? Are quadruplexes arising or being removed? There are very few mutations that introduce new quadruplexes, and many that cause them to be lost. Where they do arise, they spread through the population.

See http://www.quadruplex.org

Personal Comments: It was quite interesting to hear new things about telomeres, as they're of much interest to those of us researching ageing at CISBAN. As the chatter in the biosysbio FF states, he's very clear with his examples of equations, machine learning types and graphs. He talks very fast, but has so much to fit in! Manages to make it clear as well as fast.

Monday Session 1
http://friendfeed.com/rooms/biosysbio
http://conferences.theiet.org/biosysbio
