Shirley Liu, Dana-Farber Cancer Institute, Harvard School of Public Health
Morning Session 1 September (11th MGED Meeting, 1-4 September, 2008)
She describes a peak-finding algorithm called MACS (Model-based Analysis of ChIPSeq). If you look at the tags, they usually lie to the left or the right of the actual binding site. To locate that site, you have to shift each tag away from its mapped position, but most people don't know how much to shift it by. They use the peaks with the most confidence to calculate the shift size. The mode of this shift size was smaller than expected. This could be because, of the whole population you give to the sequencer, it prefers the shorter fragments. Alternatively, it could be that there is a binding site right in the middle, and perhaps there are hypersensitive regions on either side of the binding site where the breaks occur instead. So you shift according to the tag size (about half of the size). The tag distribution along the genome should follow a Poisson distribution. However, ChIPSeq shows local biases in the genome, thought to be due to both chromatin and sequencing bias.
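To make the shifting step concrete, here is a minimal sketch of how I understand it, not the actual MACS code. The function names, the use of medians, and the reading of the "size" as an estimated fragment size d are my assumptions: d is estimated from the offset between plus-strand and minus-strand tags in high-confidence peaks, and every tag is then moved by d/2 towards its 3' end.

```python
from statistics import median


def estimate_fragment_size(peaks):
    """Estimate d from high-confidence peaks (illustrative only).

    peaks: list of (plus_positions, minus_positions) pairs, one per peak.
    For each peak, take the offset between the median minus-strand tag
    and the median plus-strand tag, then take the median over all peaks.
    """
    offsets = [median(minus) - median(plus) for plus, minus in peaks]
    return median(offsets)


def shift_tags(tags, d):
    """Shift each tag by half of d towards its 3' end.

    tags: list of (position, strand) with strand "+" or "-".
    Plus-strand tags move right, minus-strand tags move left.
    """
    half = int(d) // 2
    return [pos + half if strand == "+" else pos - half
            for pos, strand in tags]
```

With d = 100, a plus-strand tag at 100 and a minus-strand tag at 200 both end up at position 150, which is where the binding site would be expected.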
In a 300bp region, in a control there will only be 1-3 tags, which means there are simply too few tags to give a good estimate. So, rather than just looking at the bases at the binding site, they also look at 1kb, 5kb, and 10kb regions around it (local lambda). If a global measure is used instead of the local one, the results don't come out very well. With MACS, you get a higher motif occurrence in peak centers, and an improved spatial resolution. If FDR (False Discovery Rate) = control peaks / ChIP peaks, then with the MACS method the FDR is 0.4%, while with no control and using only the background lambda, it's 41% (much worse).
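A rough sketch of the local lambda idea and the FDR ratio follows. The talk as I noted it did not say how the window estimates are combined, so taking the maximum of the scaled window rates and the genome-wide rate is my assumption, as are all the function and parameter names.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def local_lambda(control_counts, peak_width, genome_lambda):
    """Expected control tags in a candidate region of peak_width bp.

    control_counts: dict mapping window size in bp (e.g. 1000, 5000,
    10000) to the control tag count in a window of that size centred
    on the candidate. Each count is scaled down to the peak width;
    the largest rate is used (assumption).
    """
    rates = [genome_lambda]
    for window, count in control_counts.items():
        rates.append(count * peak_width / window)
    return max(rates)


def empirical_fdr(n_control_peaks, n_chip_peaks):
    """FDR as defined in the talk: control peaks / ChIP peaks."""
    return n_control_peaks / n_chip_peaks if n_chip_peaks else 0.0
```

For example, `local_lambda({1000: 10, 5000: 20, 10000: 30}, 300, 0.5)` picks the 1kb estimate of 3.0 expected tags, and `poisson_sf(observed_tags, 3.0)` then gives the chance of seeing at least that many ChIP tags under the local background.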
You shouldn't use a random sampling of tags to give you your FDR, as the distribution is not random. Further, there can be a problem with unbalanced tags. If you have two channels, whichever channel has more tags will give you more peaks, even after normalization.
They worked with some data for nucleosome positioning in humans. They extended the original tags for each nucleosome and checked the tag count across the genome. They also de-noised the nucleosome data using Coiflet wavelet de-noising. This is done by decomposing the original signal in steps, and then removing the wavelets with high frequency and small coefficients (the noisy ones). They also removed peaks with unbalanced tags.
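The de-noising recipe (decompose in steps, drop the small high-frequency coefficients, reconstruct) can be sketched in a few lines. Note that this uses the much simpler Haar wavelet rather than the Coiflet wavelet mentioned in the talk, purely to keep the example self-contained; the hard threshold and the function names are also my own choices.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One decomposition step: (approximation, detail) coefficients."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Undo one decomposition step."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def denoise(signal, levels, threshold):
    """Zero small detail coefficients at the finest `levels` scales.

    The signal length must be divisible by 2 ** levels.
    """
    if len(signal) % (2 ** levels):
        raise ValueError("signal length must be divisible by 2 ** levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_forward(approx)
        # Hard threshold: small detail coefficients are treated as noise.
        details.append([c if abs(c) >= threshold else 0.0 for c in detail])
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx
```

Calling `denoise([4.9, 5.1, 4.9, 5.1], 1, 1.0)` flattens the small wiggle to a profile of about 5.0 everywhere, while a threshold of 0 returns the input unchanged (up to rounding).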
ChIPSeq may be ineffective at mapping inactive histone marks. The percentage of tags located at identified nucleosomes is much higher for active histone marks. The percentage of isolated tags is much higher for inactive histone marks. Active marks tend to bind to sharper (more localized) regions than the inactive ones. Differentiation impairs the ChIPSeq efficiency of inactive marks but not active marks. Also, closed chromatin is harder to sonicate, so the resulting fragments are larger. ChIPSeq library construction is biased towards shorter fragments.
Is there a nucleosome sequence preference in humans? There isn't as much as expected. You need to compare in vivo with in vitro nucleosome sequencing. A 10bp periodicity is observed in vitro. However, this periodicity isn't there in vivo. So, people are doing better at predicting the in vitro nucleosome positions than the in vivo ones. To compare the two, they extended each tag by 146 bp for the nucleosome profile, took the middle 73 bp, and then calculated the correlation coefficient. Only 10% of the in vitro and in vivo overlap. Only 50% of in vitro and in vivo data agrees with each other. There are definitely intrinsic sequence features for nucleosomes, but they don't predict in vivo nucleosome patterns very well.
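The profile comparison could look something like the sketch below. I am assuming plus-strand tags only, a Pearson correlation, and a simple per-base count; none of those details were given in the talk, and the function names are mine.

```python
import math


def nucleosome_profile(tag_starts, genome_length, extend=146, core=73):
    """Per-base count using only the middle `core` bp of each extended tag.

    Each tag is extended to `extend` bp from its start position
    (plus-strand only, for simplicity).
    """
    profile = [0] * genome_length
    offset = (extend - core) // 2
    for start in tag_starts:
        lo = start + offset
        hi = lo + core
        for pos in range(max(lo, 0), min(hi, genome_length)):
            profile[pos] += 1
    return profile


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

Building one profile from the in vitro tags and one from the in vivo tags over the same region, `pearson(in_vitro_profile, in_vivo_profile)` then gives a single number for how well the two maps agree.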
These are just my notes and are not guaranteed to be correct.
Please feel free to let me know about any errors, which are all my
fault and not the fault of the speaker. 🙂