Plenary Talk, Morning Session, 3 September (11th MGED Meeting, 1-4 September, 2008)
Modeling conditional independence: the idea that the transcription level of TFs drives the expression of other genes is oversimplified, but it is the one most easily converted into a statistical model: you model the conditional independence structure of the expression data as a graph. You explain the expression level of certain genes by the expression level of other genes, which isn't really biologically exact.
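As a toy illustration (my own, not from the talk): in a Gaussian graphical model the graph lives in the precision (inverse covariance) matrix – a zero off-diagonal entry means two genes are conditionally independent given the rest. The three-gene example below, with invented numbers, shows what a "TF explains its targets" structure looks like:

```python
import numpy as np

# Invented 3-gene example: a TF regulating geneA and geneB.
# In a Gaussian graphical model, a zero off-diagonal entry in the
# precision (inverse covariance) matrix encodes conditional independence.
# Variable order: [TF, geneA, geneB]
precision = np.array([
    [ 2.0, -0.8, -0.8],  # TF is coupled to both targets
    [-0.8,  1.5,  0.0],  # geneA is independent of geneB given TF
    [-0.8,  0.0,  1.5],
])
cov = np.linalg.inv(precision)

# The covariance itself is dense: geneA and geneB are marginally
# correlated even though they are conditionally independent.
print(np.round(cov, 3))
```

Marginally, `cov[1, 2]` is nonzero, so a naive correlation analysis would link geneA and geneB directly; the graph structure only falls out of the conditional dependencies.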
Model identification: different graphical models encode very similar covariance structures, and some even identical ones. You need thousands of microarrays to identify a network reliably (the example provided had 2000 arrays). Why do you need so many? You can easily generate many different graphical models, but the data they generate are very similar – this is why you need lots of arrays.
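To see why identification is hard, here is a small simulation (my own sketch, not from the talk): the chains A → B → C and C → B → A are Markov equivalent, so with matched parameters they generate data with the same covariance structure and cannot be told apart from observational data alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # even this many samples cannot separate the two models

# Chain 1: A -> B -> C with edge weight 0.8 and noise scaled so
# every variable has unit variance (0.8**2 + 0.6**2 == 1).
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 0.8 * b + 0.6 * rng.normal(size=n)
cov_forward = np.cov(np.vstack([a, b, c]))

# Chain 2: the reversed cascade C -> B -> A with the same parameters.
c2 = rng.normal(size=n)
b2 = 0.8 * c2 + 0.6 * rng.normal(size=n)
a2 = 0.8 * b2 + 0.6 * rng.normal(size=n)
cov_reverse = np.cov(np.vstack([a2, b2, c2]))

# Both models imply the same covariance (up to sampling noise), so no
# amount of observational data picks out the true edge directions.
print(np.allclose(cov_forward, cov_reverse, atol=0.03))
```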
Given many models for a biological network: in order to decide which one is best supported by the data, the models must generate sufficiently different data. If two models generate similar data, their biological interpretations must be similar, too. The model space must adapt to the limited complexity of the data, not to the high complexity of biology: it must be sufficiently coarse.
A coarse model must ignore aspects of biological complexity (George Box: "All models are wrong, but some are useful"). Which aspects should be ignored and which should be modeled? This depends on the type of data and on your motivation for modeling a network. They want to analyze how the flow of information in molecular signalling pathways is disturbed in human tumors, and use this information for novel molecular classifications of tumors.
You cannot observe things like a phosphorylation or dephosphorylation step in microarrays directly. For instance, signalling is related to cancer not via what microarrays measure, but via mutations, which can introduce constitutively active signals or block the signal flow. These mutations may yield changes in gene expression profiles downstream. So how can we get back from the gene expression data to the actual cause? You can mimic loss-of-function mutations using RNAi experiments.
Nested effects models (NEMs). Negative controls (C-), positive controls (C+), interventions in S-genes (RNAi), observations in E-genes (microarray). A silencing effect is when an e-gene goes from the C+ level back to the C- level. What do the data generated by the model look like, given a certain structure of a linear cascade? Model assumptions: the core model is transitive, every e-gene is connected to exactly one s-gene, and there is independent binary noise.
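A minimal sketch of the silencing scheme as I understood it (gene names invented for illustration): in a transitively closed cascade, silencing an s-gene removes the signal from it and from everything downstream, so every e-gene attached at or below it shows an effect – and the predicted effect sets are nested:

```python
# Invented linear cascade S1 -> S2 -> S3, stored as its transitive
# closure: each s-gene maps to itself plus everything downstream.
cascade = {
    "S1": {"S1", "S2", "S3"},
    "S2": {"S2", "S3"},
    "S3": {"S3"},
}
# Each e-gene is attached to exactly one s-gene (a model assumption).
attachment = {"E1": "S1", "E2": "S2", "E3": "S3", "E4": "S3"}

def predicted_effects(knocked_down):
    """E-genes predicted to fall back to the C- level when the
    given s-gene is silenced by RNAi."""
    downstream = cascade[knocked_down]
    return {e for e, s in attachment.items() if s in downstream}

print(sorted(predicted_effects("S1")))  # all four e-genes
print(sorted(predicted_effects("S3")))  # only E3 and E4
```

The nesting (effects of S3 ⊆ effects of S2 ⊆ effects of S1) is exactly the signature the model looks for in the data.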
Scoring observed silencing effects. The silencing scheme allows prediction of e-gene states (when the attachment position is known). We expect a number of false positive and false negative observations, and the likelihood is based on these. They then compute a likelihood for each core model and find the maximum likelihood model. The model search space is large, and gets very, very large beyond 8 or 9 genes. One shortcut: for each pair of genes, fit all four possible pairwise models and pick the best one. The problems with this: it works fairly well but not really well, and it loses transitivity (only 2 genes at a time). Alternatively, you can do triplets – these usually get you much closer to transitivity.
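A hedged sketch of the scoring step under the independent-binary-noise assumption (error rates and gene names invented): each predicted/observed effect pair contributes a hit, miss, false-positive, or true-negative term to the log-likelihood:

```python
import math

# Assumed error rates (invented for illustration):
# alpha = false-positive rate, beta = false-negative rate.
alpha, beta = 0.05, 0.1

def log_likelihood(predicted, observed):
    """Score one knockdown: predicted/observed map each e-gene to a
    binary silencing effect (1 = effect, 0 = no effect)."""
    ll = 0.0
    for e, pred in predicted.items():
        obs = observed[e]
        if pred == 1:
            # effect predicted: observing it has prob 1-beta, missing it beta
            ll += math.log(1 - beta) if obs == 1 else math.log(beta)
        else:
            # no effect predicted: seeing one anyway has prob alpha
            ll += math.log(alpha) if obs == 1 else math.log(1 - alpha)
    return ll

predicted = {"E1": 1, "E2": 1, "E3": 0}
observed  = {"E1": 1, "E2": 0, "E3": 0}  # E2 is one false negative
print(log_likelihood(predicted, observed))
```

Summing this score over all knockdowns gives the likelihood of one core model; the search then picks the maximum-likelihood model.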
How well does this work on simulated data? Pretty well: even when the number of genes goes up to 32, there is still 90% precision with 5 replicates for triples (only about 82% for pairs).
Limitations: hardly any statistical theory on uncertainties, unstable with respect to the included s-genes, and unstable with respect to data discretization (but this one can be fixed). Applications: NEMs can be used to derive a signalling hierarchy and a signalling-consistent clustering, and might be used to classify tumors.
These are just my notes and are not guaranteed to be correct. Please feel free to let me know about any errors, which are all my fault and not the fault of the speaker. 🙂