Song et al. 2008, PLoS Computational Biology 4(5)
Genes that share common ancestry tend to have similar structure and function. As a consequence, evolutionary, functional and structural properties of genes and proteins can be inferred through sequence comparison. Genes with common descent also tend to be located in related chromosomal regions, so sequence comparison can be used to identify chromosomal regions that share common ancestry: look for pairs of homologous genes, treating sequences that are similar as homologous. Hence, comparative genomics. HOWEVER, if the sequences are multi-domain, we can run into problems. The insulin receptor is a good example of a multi-domain sequence. Multi-domain sequences evolve via gene duplication and domain shuffling, which confounds plain sequence comparison. Even so, these genes are still descended from an ancestral gene even if they no longer share similar sequences.
Key question: given two sequences with significant similarity, are they related by vertical descent (VD) or domain insertion (DI)? The inferences that can be drawn differ: similar molecular functions for vertical descent, binding partners for domain insertion. Ex: PRKG1B and PDGFRB are both kinases. NCAM2 is a neural cell adhesion molecule that is unrelated to the kinases. We want to distinguish between the two cases. With classical methods, both the VD pair and the DI pair have the same e-value, so pairs related by VD and pairs related by DI need to be separated programmatically. They curated a gold standard, then evaluated existing methods, and finally proposed a new method.
The goal of this method is to identify sequence pairs related by VD and DI, and it should work on a broad range of families. To test, they looked at 20 well-studied families related by vertical descent. Could they find evidence for the families at all? Another test set was a combined set. To evaluate the method, they assigned a score to every pair of sequences. They had a much larger set of negative examples (40,000). She fixed the number of FP (0.03%) and then asked how many TP would be missed at that FP level. All methods do well with conserved multidomain proteins. They were more challenged by variable multidomain proteins, where PSI-BLAST doesn’t do as well as BLAST. Both methods are extremely challenged when all the sequences are put into the analysis together.
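The evaluation described above (fix the false positives, then count the missed true positives) can be sketched roughly as follows. This is my own minimal illustration, not the authors' code: the function name, the "higher score means more likely related" convention, and the toy scores are all assumptions.

```python
def missed_tp_fraction(pos_scores, neg_scores, max_fp):
    """Fraction of true pairs missed when at most max_fp negatives are accepted.

    Assumes a higher score means "more likely related" (an assumption,
    not something stated in the talk).
    """
    ranked_neg = sorted(neg_scores, reverse=True)
    if max_fp < len(ranked_neg):
        # A pair is accepted only if its score is strictly above the cutoff,
        # so at most max_fp negatives can get through.
        cutoff = ranked_neg[max_fp]
    else:
        cutoff = float("-inf")  # every negative is allowed through
    missed = sum(1 for s in pos_scores if s <= cutoff)
    return missed / len(pos_scores)


# Toy scores, invented purely for illustration.
positives = [0.9, 0.8, 0.4, 0.2]
negatives = [0.7, 0.5, 0.3, 0.1]
print(missed_tp_fraction(positives, negatives, max_fp=1))  # 0.5
```

If the 40,000 figure is the number of negative pairs, then 0.03% would correspond to allowing 12 false positives (40,000 × 0.0003 = 12), i.e. max_fp=12 in this sketch.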
All of these methods are pairwise methods, so perhaps we need more information. The new method gets its information from the structure of the sequence similarity network. The neighborhood of a sequence is its near neighbors (distance = 1). Then you can compare neighborhoods. Sequences that match more than one domain are near each other in the network. Domain architecture is implicitly present in the network.
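To make the neighborhood idea concrete, here is a rough sketch of comparing distance-1 neighborhoods in a toy similarity network. It uses plain Jaccard overlap as a stand-in, which is my assumption and not necessarily the score the authors use. The sequence names and edges are invented, and counting a sequence as part of its own neighborhood is also my choice.

```python
from collections import defaultdict


def build_network(edges):
    """Undirected similarity network: sequence -> set of direct neighbors."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return network


def neighborhood(network, seq):
    """The sequence itself plus everything at distance 1 from it."""
    return set(network.get(seq, set())) | {seq}


def neighborhood_overlap(network, a, b):
    """Jaccard overlap of two neighborhoods (a stand-in comparison score)."""
    na, nb = neighborhood(network, a), neighborhood(network, b)
    return len(na & nb) / len(na | nb)


# Hypothetical network: three kinases that all hit each other, plus one
# kinase that also hits an adhesion protein through a shared inserted domain.
edges = [
    ("kinase_A", "kinase_B"), ("kinase_A", "kinase_C"), ("kinase_B", "kinase_C"),
    ("kinase_B", "adhesion_X"),
    ("adhesion_X", "adhesion_Y"), ("adhesion_X", "adhesion_Z"),
]
net = build_network(edges)
print(neighborhood_overlap(net, "kinase_A", "kinase_B"))    # 0.75
print(neighborhood_overlap(net, "kinase_B", "adhesion_X"))  # 2 shared of 6 total
```

In this toy case the two kinases share most of their neighbors, while the kinase and the adhesion protein share only each other, even though both pairs are directly connected by a similarity hit.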
From Allyson: Sorry, she was running over and I needed to get to the next talk. Missed the very end.
Please note that this post is merely my notes on the presentation. They are not guaranteed to be correct, and unless explicitly stated are not my opinions. They do not reflect the opinions of my employers. Any errors you can happily assume to be mine and no-one else’s. I’m happy to correct any errors you may spot – just let me know!