Quadruplexes are stable structures that can form in both DNA and RNA. They play a role in the regulation of transcription. A stable fold-back structure can emerge in the presence of a cation. Are these patterns really stable? Which carry a functional role? The first indicator of functionality is stability, and melting temperature is used as a proxy for stability. UV-melt experiments are low-throughput. Further, the rules for quadruplex stability are complicated and involve non-linear relationships.
They want to solve the regression problem of mapping a quadruplex input to its melting temperature in a supervised setting. They use Gaussian process regression, which gives a marginal posterior mean with error bars, i.e. an estimate of predictive uncertainty.
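To make the "posterior mean with error bars" idea concrete, here is a minimal sketch of plain GP regression with Gaussian noise. It is my own illustration, not the speaker's code: the scalar inputs, the toy numbers, the zero-mean prior and the hyperparameter values (`length_scale`, `signal_var`, `noise_var`) are all assumptions, and the real model works on sequences rather than single numbers.

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def backward_sub(L, z):
    """Solve L^T x = z for lower-triangular L."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and variance of the latent function at each test input.

    Assumes a zero-mean prior, so real targets would be centred first.
    """
    n = len(x_train)
    # Kernel matrix of the training inputs with noise added on the diagonal.
    K = [[sq_exp_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, y_train))  # K^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [sq_exp_kernel(xs, xt) for xt in x_train]
        v = forward_sub(L, k_star)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        variances.append(sq_exp_kernel(xs, xs) - sum(vi * vi for vi in v))
    return means, variances

# Toy, made-up data: one test point on top of the training data, one far away.
means, variances = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.5], [1.0, 10.0])
error_bars = [2.0 * math.sqrt(v) for v in variances]  # roughly 95% intervals
```

The error bar is narrow near the training points and widens back to the prior far from them, which is the behaviour the talk relied on when reporting predictions with uncertainty.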
The sequence is handled with a spectrum kernel to capture local sequence differences, followed by a squared exponential kernel for candidate features. This regression method was chosen because the data used for training is noisy, with outliers. Also, some of these structures are so stable that they never melt. Both problems are dealt with using a non-Gaussian likelihood model: a robust mixture-model likelihood that accounts for outliers, and a step function to incorporate observations that are only bounds.
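A rough sketch of the two ingredients mentioned here, as I understood them. The spectrum kernel below is the standard k-mer count inner product. The likelihood terms are my guess at the general shape only: the talk did not give the exact form, so the broad-Gaussian outlier component and every parameter value (`k`, `noise_std`, `outlier_std`, `outlier_prob`) are assumptions.

```python
import math
from collections import Counter

def kmer_counts(seq, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def robust_likelihood(y, f, noise_std=2.0, outlier_std=20.0, outlier_prob=0.1):
    """Mixture likelihood of observing y given latent temperature f.

    A narrow Gaussian models ordinary measurements and a broad Gaussian
    (an assumed choice) absorbs outliers.
    """
    inlier = (1.0 - outlier_prob) * normal_pdf(y, f, noise_std)
    outlier = outlier_prob * normal_pdf(y, f, outlier_std)
    return inlier + outlier

def step_likelihood(lower_bound, f):
    """Step function for a structure that never melted below lower_bound:
    only latent temperatures at or above the bound are allowed."""
    return 1.0 if f >= lower_bound else 0.0

# Hypothetical sequences, used only to exercise the kernel.
similarity = spectrum_kernel("GGGTTAGGG", "GGGTTAGGGTTAGGG", k=3)
```

With the mixture, a measurement 40 degrees away from the latent value still has a small but non-negligible likelihood, so a single bad measurement does not drag the fit around the way it would under a purely Gaussian noise model.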
Predictive accuracy: they look at 260 quadruplexes (one of the first data sets available for quadruplexes) and compare linear regression, SVM, GP, and GP + robust noise. As the task gets harder, the error goes up (as expected). The linear model is not adequate. The robust GP was the best of the lot: it gains significantly over the other models and is better at determining confidence levels. With a 50/50 training split, the predictions (with their error bars) always overlap with the “truth” line, sometimes with a large uncertainty. Everything is predicted within 10 degrees C.
Genome-wide quadruplex candidates. Structures are taken from quadruplex.org (360,000 candidate sequences). Can we predict them? Is there any relation between location in the genome and melting temperature? Quadruplexes are overrepresented in promoter regions by an order of magnitude compared with anywhere else.
The current training dataset, with 260 observations, is not very big. Also, which sequences should be tested next? Active selection has been applied to 10987 quadruplexes in promoter regions, with 30 measurements selected actively and at random. He presented a Gaussian process scheme for regression of quadruplex stability, with good estimates of predictive uncertainty.
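The talk did not spell out how the active selection picks the next sequence. One common choice, and purely my assumption here, is to measure whichever candidate the GP is currently most uncertain about. A minimal sketch of that idea, with scalar stand-ins for sequences and made-up values for `length_scale` and `noise_var`:

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def select_actively(candidates, n_select, noise_var=0.1):
    """Greedily pick the candidates with the largest predictive variance.

    Under Gaussian noise the posterior variance depends only on which inputs
    were measured, not on the measured values, so the whole selection can be
    planned up front. With a non-Gaussian likelihood this is only a
    simplification.
    """
    n = len(candidates)
    # Prior covariance between all candidate inputs.
    cov = [[sq_exp_kernel(candidates[i], candidates[j]) for j in range(n)]
           for i in range(n)]
    chosen = []
    for _ in range(n_select):
        remaining = [i for i in range(n) if i not in chosen]
        j = max(remaining, key=lambda i: cov[i][i])
        chosen.append(j)
        # Condition the covariance on a noisy measurement at candidate j.
        denom = cov[j][j] + noise_var
        col = [cov[i][j] for i in range(n)]
        cov = [[cov[a][b] - col[a] * col[b] / denom for b in range(n)]
               for a in range(n)]
    return chosen

# Hypothetical candidates forming three clusters.
candidates = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
picked = [candidates[i] for i in select_actively(candidates, 3)]
```

On this toy set the three picks land in three different clusters, because measuring one point makes its near neighbours much less informative. Random selection has no such tendency, which is presumably why the two were compared.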
Allyson’s notes: Some of this was a little over my head, so there may be more than a normal chance of me getting some of these notes wrong! 🙂
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!