Daphne Koller, Stanford University
aka Understanding Gene Regulation: From Networks to Mechanisms. The title changed due to the introduction of new results, some of which only happened last night. There are many mechanisms (chromatin modification, RNA degradation) that modify gene regulation (GR). There are also perturbations such as environment, drugs, and genetic changes.
Regulatory networks for gene expression. You can measure RNA levels in a cell at any time, so you can model the RNA level of a gene as a variable in a model. The first assumption is that the mRNA level of a regulator can indicate its activity level: a target's expression is predicted by the expression of its regulators. Then use the expression of regulatory genes (TFs, signal transduction proteins, mRNA processing factors, etc., whether direct or indirect) as regulators. Another assumption of the model is that co-regulated genes share a similar regulatory program. This allows us to exploit modularity and predict the expression of an entire module, and allows uncovering complex regulatory programs. A critical aspect is the structure of the regulatory program.
Regulatory trees are really good to use (see Segal et al., Nature Genetics 2003). But they have disadvantages: as you try to learn this type of program from data, once you make an initial decision on an activator you've split things into two groups, and as you go further down the tree you lose statistical power. Since regulatory programs are very complex in most organisms, there isn't enough statistical power in this method to pick up all the complexity. An (often arbitrary) regulator gets selected from among a group of correlated regulators and may not be the right one.
Therefore you could instead model regulation as linear regression. However, this can be problematic: there can be hundreds or thousands of candidate regulators, linear regression gives them all nonzero weight, and the result is highly uninterpretable to biologists.
Therefore you end up with two things they've played with: Lasso (L1) regression (Tibshirani 1996) and Elastic Net regression (Zou and Hastie 2005). Lasso's L1 penalty pushes all non-relevant regulators to 0. This is a convex optimization problem with a unique global optimum, but it STILL arbitrarily chooses between two highly-correlated regulators. Elastic Net regression solves some of these problems, and is what they'll be talking about for most of the rest of this section.
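To make the contrast concrete, here is a minimal numpy sketch (my own illustration, not code from the talk) of the proximal/shrinkage operators behind the two penalties: the L1 soft-threshold sets weak regulators exactly to zero, while the elastic-net step adds an L2 term that shrinks correlated regulators together instead of arbitrarily keeping one.

```python
import numpy as np

def soft_threshold(w, lam):
    # Lasso's proximal operator: shrinks weights toward zero and sets
    # any weight with |w| <= lam exactly to zero (hence sparsity).
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def elastic_net_shrink(w, lam, alpha):
    # Elastic-net proximal step: the L1 part zeroes weak regulators,
    # the L2 part (the 1 + lam*(1-alpha) divisor) shrinks correlated
    # strong regulators together rather than picking one arbitrarily.
    return soft_threshold(w, lam * alpha) / (1.0 + lam * (1 - alpha))

w = np.array([2.0, 0.05, -0.3, 1.9])  # w[0] and w[3]: two strong, correlated regulators
print(soft_threshold(w, 0.1))         # weak regulators pushed to exactly 0
print(elastic_net_shrink(w, 0.1, 0.5))
```

Note how the weight 0.05 is eliminated entirely under both penalties, while the two large weights survive and, under the elastic net, stay close to each other.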
Cluster genes into modules and then learn a regulatory program for each module. Then you iterate: reassign genes to modules, using regulation (as opposed to pure expression) to put genes into modules, which helps things a lot. This is a special kind of Bayesian network.
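As I understand it, the iteration resembles the following toy numpy sketch (a schematic of the cluster-then-refit loop, not the actual module-network algorithm, which uses Bayesian scoring): fit one regulatory program per module, here plain least squares, then reassign each gene to the module whose program predicts it best.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: expression of 30 target genes and 3 regulators across 50 samples.
regulators = rng.normal(0, 1, size=(50, 3))
true_module = np.repeat([0, 1], 15)  # two hidden modules of 15 genes each
true_w = np.array([[2.0, 0.0, 0.0], [0.0, -2.0, 0.0]])
targets = regulators @ true_w[true_module].T + rng.normal(0, 0.3, (50, 30))

# Start from a random assignment of genes to 2 modules, then iterate:
# learn a program per module, reassign genes by prediction error.
assign = rng.integers(0, 2, size=30)
for _ in range(5):
    programs = [np.linalg.lstsq(regulators,
                                targets[:, assign == m].mean(axis=1),
                                rcond=None)[0]
                for m in range(2)]
    errors = np.stack([((targets - (regulators @ w)[:, None]) ** 2).sum(axis=0)
                       for w in programs])
    assign = errors.argmin(axis=0)
print(assign)
```

On this toy data the loop recovers the two planted modules within a couple of iterations; the key point is that genes move between modules based on how well a *regulatory program* explains them, not raw expression similarity.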
Individual genetic variation and gene regulation (original main point of talk before adding cell differentiation and gene regulation).
eQTL dataset (the biggest data set, from Brem et al. 2002, Science). Two strains of yeast were crossed; the expression of 6,000 genes was measured along with 3,000 markers on the chromosomes. With an F2 protocol, this is good enough to work out the genotype for most genes. The traditional approach (single marker) treats each expression profile as a single quantitative trait (such as height or weight). If the profile of a gene is highly correlated with a given marker, then you have something useful.
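The single-marker approach can be sketched in a few lines of numpy (a toy stand-in for the Brem et al. setup, with made-up data sizes): treat one gene's expression as a quantitative trait and correlate it with the genotype at each marker across the segregants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cross: binary genotypes at 20 markers for 100 F2 segregants,
# plus one expression trait that is genuinely linked to marker 7.
n_segregants, n_markers = 100, 20
genotypes = rng.integers(0, 2, size=(n_segregants, n_markers))
expression = 1.5 * genotypes[:, 7] + rng.normal(0, 0.5, n_segregants)

# Single-marker test: correlate the expression trait with each marker
# and report the strongest linkage.
corrs = np.array([np.corrcoef(genotypes[:, m], expression)[0, 1]
                  for m in range(n_markers)])
best = int(np.argmax(np.abs(corrs)))
print(best)  # marker 7 should show by far the strongest correlation
```

In the real analysis one would test thousands of genes against thousands of markers (with the attendant multiple-testing issues), but the per-gene test is essentially this correlation.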
However, their method is different: there are new binary variables that represent the genotypes at markers along the chromosome. How do markers affect things? They'll have eRegulators (expression, same as before) and gRegulators (genotypes). She then shows us one of the modules that comes out, the telomere module (40/42 of the genes are in the telomere). This module is enriched for telomere maintenance. Another module deals with chromatin modification: for 4/5 consecutive genes the highest-ranking regulator is Sir1, and the genes are known Sir1 targets, which is nice.
So, chromatin as a mechanism. Of 23 modules, 16 were automatically classified as having “chromosomal features”. Chromatin modification was the biggest type of mechanism explaining the variation between the two strains. Another mechanism came out of the Puf3 module. One of its characteristics is that 147/153 genes are pulldown targets for the mRNA-binding protein Puf3. The Puf3 family is part of the sequence-specific mRNA-binding proteins (binding the 3′ UTR) that regulate degradation of mRNAs and/or repress translation. The regulatory program came up with several genes, including KEN1 and DHH1, which are part of the P-body complex. P-bodies are places where mRNAs are stored temporarily, and while they are there they are translationally repressed. Dhh1 regulates mRNA decapping in P-bodies. The model then suggests that Puf3 is part of this process of decapping. They did a microscopy experiment to test this, which demonstrated that Puf3 specifically localizes to P-bodies.
What regulates the P-bodies? What is one level higher up in the hierarchy? A locus on chromosome 14, but this is a large region covering 30 genes and 300 polymorphisms. Therefore she came up with the idea of regulatory potential. The motivation is that not all SNPs are equally likely to be causal. You can create a list of regulatory features, but how important is each one, and what weight should it get? They used Bayesian L1 regularization, where the prior distribution is a Laplacian. With a broader prior, a weight can more easily deviate from 0 and that regulator is more likely to be selected. Each regulator gets its own prior, i.e. its own probability distribution for how likely its weight is to diverge from 0. You end up with a metaprior model (hierarchical Bayes).
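The link between the Laplacian prior and L1 regularization can be shown in a tiny sketch (my own illustration): the negative log-density of a Laplace(0, b) prior is |w|/b plus a constant, so MAP estimation with this prior is exactly an L1 penalty, and a larger per-regulator scale b (higher regulatory potential) penalizes that regulator's weight less.

```python
import numpy as np

def neg_log_laplace(w, b):
    # Negative log-density of a Laplacian prior with scale b:
    # |w|/b + log(2b), i.e. an L1 penalty whose strength is 1/b.
    # A larger b means a looser prior, so the weight can more
    # easily deviate from 0 and the regulator is easier to select.
    return np.abs(w) / b + np.log(2 * b)

# Same candidate weight, two different per-regulator priors:
w = 0.5
penalty_low_potential = neg_log_laplace(w, b=0.1)   # tight prior, strong pull to 0
penalty_high_potential = neg_log_laplace(w, b=1.0)  # loose prior, weight can escape 0
print(penalty_low_potential, penalty_high_potential)
```

Giving each regulator its own scale b is what turns a single global L1 penalty into the per-regulator prior described above.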
Start by learning regulatory programs as described earlier. Second, learn the regulatory weights (betas). Then compute the regulatory potential of each SNP in the genome. Then iterate. What do regulatory potentials do? They don't change the selection of strong regulators (those where the prediction of targets is clear). However, they do help disambiguate between weak ones. The strong regulators help teach the model what to look for, a form of transfer learning. The highest-ranking features are stop codons, and the model figured out on its own that these are things to look for. Cis regulation (correlation between a SNP and its adjacent gene) is very highly ranked. Also conservation. (All of these were determined automatically.) Many others (Lee et al., PLoS Genetics 2009).
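One plausible reading of the "regulatory potential" step is a weighted sum of per-SNP features, with the weights learned from the strong regulators. The feature names and numbers below are made up for illustration; only the overall shape (features × metaprior weights → per-SNP prior scale) reflects the talk.

```python
import numpy as np

# Hypothetical feature matrix for 4 SNPs; columns are regulatory
# features such as [introduces_stop_codon, cis_correlation, conservation].
snp_features = np.array([
    [1.0, 0.2, 0.9],   # premature stop codon, well conserved
    [0.0, 0.8, 0.1],   # strong cis correlation
    [0.0, 0.1, 0.2],   # nothing notable
    [0.0, 0.0, 0.0],
])

# Metaprior weights learned from the strong regulators (values invented
# here to reflect the talk's finding that stop codons rank highest).
meta_weights = np.array([2.0, 1.0, 0.5])

# Regulatory potential of each SNP: its features weighted by the metaprior.
potential = snp_features @ meta_weights
# Map potential to a per-SNP Laplacian prior scale: higher potential,
# looser prior, so that SNP's weight can more easily move away from 0.
prior_scale = np.exp(potential)
print(potential)
```

The iteration then alternates: regulatory programs give betas, betas give metaprior weights, metaprior weights give new per-SNP priors, and the programs are relearned under those priors.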
Statistical evaluation of this method came next. It uses PGV: the % of genetic variation explained by the predicted regulatory program for each gene. It's a form of test-data validation. The model explained about 50% of the variation in about 50% of the genes. How many predicted interactions have support elsewhere? Up to 80% in some modules had support. Predicting causal regulators: find these regulators for 13 “chromosomal hotspots”, i.e. what, within those large regions, are the causal regulators? Learning regulatory priors is good: it can use any set of regulatory features, including sequence features.
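A simplified stand-in for a PGV-style score (my own sketch; the paper's definition restricts to the *genetic* component of variance, which I gloss over here) is the fraction of held-out trait variance explained by the program's predictions:

```python
import numpy as np

def pgv(y_true, y_pred):
    # Simplified variance-explained score: 1 - Var(residual) / Var(trait),
    # computed on held-out data. 1.0 means perfect prediction, ~0 means
    # the program explains none of the variation.
    resid = y_true - y_pred
    return 1.0 - np.var(resid) / np.var(y_true)

rng = np.random.default_rng(1)
signal = rng.normal(0, 1, 200)      # part of the trait the program captures
noise = rng.normal(0, 1, 200)       # unexplained variation
trait = signal + noise
print(round(pgv(trait, signal), 2))
```

With equal signal and noise variance this lands near 0.5, i.e. the "explained about 50% of the variation" regime quoted above.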
Cell differentiation and gene regulation.
The ImmGen dataset of 63 mouse immune cell types with 203 arrays. Goal: learn the regulatory network involved in the cell differentiation process. With gRegulators you allowed programs to depend on genetic variation; here you don't have gRegulators, you have cell types. For regulation in ontogeny, you want a bias towards shared regulation while still allowing transitions. Therefore use the ontogeny to guide conserved regulation. You do this by looking at differences and penalizing every place the regulatory program changes, i.e. penalizing changes/divergences in the regulatory program along the lineage. They tested this by looking at 6 cell types and then making predictions. The naive model's error is significantly higher than the lineage-aware model's.
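The lineage penalty described above can be sketched as a fused-lasso-style term (my own illustration; the cell types, weights, and tree below are invented): sum, over each parent→child edge in the ontogeny, the absolute change in the regulatory weights.

```python
import numpy as np

def lineage_penalty(weights, parent, lam):
    # Penalize divergence of each cell type's regulatory weights from
    # its parent in the ontogeny, biasing the model toward shared
    # regulation while still allowing transitions where the data insist.
    total = 0.0
    for child, par in parent.items():
        total += lam * np.abs(weights[child] - weights[par]).sum()
    return total

# Toy lineage: stem -> progenitor -> (t_cell, b_cell); one vector of
# regulator coefficients per cell type (values made up).
weights = {
    "stem":       np.array([1.0, 0.0]),
    "progenitor": np.array([1.0, 0.2]),
    "t_cell":     np.array([0.9, 1.5]),  # regulatory shift on regulator 2
    "b_cell":     np.array([1.0, 0.2]),  # inherits the parental program
}
parent = {"progenitor": "stem", "t_cell": "progenitor", "b_cell": "progenitor"}
print(lineage_penalty(weights, parent, lam=1.0))
```

A naive model would fit each cell type independently (lam = 0); the lineage-aware model pays for every divergence, which is why shifts like the T-cell one stand out.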
You see shifts in the regulatory program within T cells at Helios, which is known to block T-cell proliferation, but their conclusion is that the blocking can happen higher in the lineage than was previously known. JARID1B is a histone H3 lysine 4 demethylase. It had not previously been associated with immune cells at all, but in a paper from last week in Nature its paralog was discovered to play a role in hematopoietic stem cells.
Expression changes and underlying phenotype. What are the mechanisms underlying them? Example: transformation of FL to DLBCL (diffuse large B-cell lymphoma; thanks to Oliver Hofmann for the acronym expansion!) occurs in 40–60% of patients, and diverse mechanisms seem to drive transformation. Gene-based analysis was inconclusive, so they generated a regulatory network. How is this used to understand phenotype? Represent each module as a metagene expression profile, and use machine learning to identify modules distinguishing FL-t (pre-transformation) from transformed DLBCL. Construct a multivariate linear regression model for FL survival using “stemness” modules as predictors. It performs well on both the training and validation sets, and even for DLBCL survival, even though the two diseases have completely different characteristics. There are therapeutic implications via a connectivity map analysis.
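The metagene step can be sketched simply (my own illustration; the talk did not specify the exact summary used, so I assume the common choice of averaging module members per sample):

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = patient samples.
rng = np.random.default_rng(2)
expression = rng.normal(0, 1, size=(6, 4))

# Hypothetical module membership: indices of the genes belonging to a
# "stemness" module whose metagene will serve as a survival predictor.
stemness_module = [0, 2, 5]

# Metagene: one summary value per patient, here the mean expression of
# the module's member genes. This collapses a gene-level matrix into a
# handful of module-level predictors for the regression model.
metagene = expression[stemness_module].mean(axis=0)
print(metagene.shape)  # one value per patient sample
```

The survival model then regresses outcome on a few such metagenes instead of thousands of individual genes, which is what makes the gene-based-vs-module-based contrast above matter.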
Can you use a module-based approach to understand metabolic syndrome? They used mice with ApoE present or mutated. They end up with what they call a phenotype network, where the nodes are modules and the edges are learned regulatory programs. An important module in this case is the liver biosynthesis module: the gene sets are almost disjoint but all end up in the same module, so you would have missed it without the module-based approach.
Summary: a framework for modeling gene regulation; it uncovers diverse regulatory mechanisms; and it can be used to understand the effect of gene regulation on phenotype.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!