Description
Inside the nucleus, DNA is wrapped into a complex molecular structure
called chromatin, whose fundamental unit is approximately 150 bp of DNA
organized around the eight-histone protein complex known as the
nucleosome. These tracks contain predicted nucleosome occupancy scores
produced by three different computational models. Each model is a
support vector machine classifier trained using microarray data from
an MNase cleavage assay. Each SVM is trained to discriminate between
50 bp DNA sequences that show the strongest and weakest signals in the
MNase assay. Although each model can predict regions of high and low
nucleosome occupancy, one model (MEC) excels at recognizing regions of
low nucleosome occupancy, whereas the other two (A375 and Dennis) are
better at recognizing regions of high nucleosome occupancy.
The three models are as follows:
- A375 - This model was trained using data from the A375 cell
line from Ozsolak et al. (2007). This cell line was prepared with
weak MNase digestion. The A375 model excels at recognizing regions of
strong protection from MNase cleavage; i.e., positions that are
frequently occupied by a nucleosome.
- Dennis - This model was trained using data from the MDA-kb2 cell
line from Dennis et al. (2007). This cell line was
prepared with weak MNase digestion. Hence, like the A375 model, the
Dennis model excels at recognizing regions that are frequently
occupied by a nucleosome.
- MEC - This model was trained using data from the MEC cell line
from Ozsolak et al. (2007). This cell line was prepared with strong
MNase digestion. This model excels at recognizing regions of high
accessibility to MNase cleavage; i.e., positions that are frequently
nucleosome-free.
Display Conventions and Configuration
The output of the SVM is a unitless discriminant score. In the
browser, the score of a 50-mer is assigned to its 26th base.
Canonically, a score of 0 indicates an uncertain assignment; a score
of 1.0 corresponds to a confident prediction for being in the positive
class (i.e., a position of frequent nucleosome occupancy), and a score
of -1.0 corresponds to a confident prediction for being in the
negative class.
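The coordinate convention above can be illustrated with a minimal sketch. It assumes 1-based genomic coordinates and a list holding one score per consecutive 50-mer; the names `assign_scores` and `describe` are hypothetical and are not part of the track's pipeline.

```python
WINDOW = 50   # length of each scored 50-mer
OFFSET = 25   # the 26th base lies 25 positions after the window start


def assign_scores(window_scores, first_window_start=1):
    """Attach each 50-mer score to the position of the window's 26th base.

    window_scores[i] is assumed to be the score of the 50-mer that starts at
    1-based position first_window_start + i.
    """
    return [(first_window_start + i + OFFSET, score)
            for i, score in enumerate(window_scores)]


def describe(score):
    """Read a discriminant score by its sign only (a simplification)."""
    if score > 0:
        return "leans toward the positive class (frequent occupancy)"
    if score < 0:
        return "leans toward the negative class"
    return "uncertain"


# Two consecutive windows starting at positions 101 and 102.
for position, score in assign_scores([0.8, -0.3], first_window_start=101):
    print(position, score, describe(score))
```

With a window starting at position 101, the score is displayed at position 126.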
Methods
For a given microarray experiment, we identify the 1000 50 bp probes
with the highest log intensity ratios. These comprise our positive
training samples. In a similar fashion, we take the probes with the
lowest log intensity ratios as negative training samples. Each 50-mer in the
training set is converted into a 2772-element vector of k-mer
frequencies for k=1 up to 6 (collapsing reverse complements).
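The feature representation can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the track: each k-mer is collapsed with its reverse complement by keeping the lexicographically smaller of the two, and a frequency is taken to be a count divided by the number of k-mers of that length in the sequence. The normalization and the function names are assumptions. Counting the collapsed k-mers for k=1 to 6 gives 2 + 10 + 32 + 136 + 512 + 2080 = 2772 elements.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_index(max_k=6):
    """Map every canonical k-mer (k = 1..max_k) to a position in the vector."""
    index = {}
    for k in range(1, max_k + 1):
        for letters in product("ACGT", repeat=k):
            key = canonical("".join(letters))
            if key not in index:
                index[key] = len(index)
    return index


def kmer_frequencies(seq, index, max_k=6):
    """Convert a DNA sequence into a vector of canonical k-mer frequencies."""
    seq = seq.upper()
    vector = [0.0] * len(index)
    for k in range(1, max_k + 1):
        n_kmers = len(seq) - k + 1
        for i in range(max(n_kmers, 0)):
            position = index.get(canonical(seq[i:i + k]))
            if position is not None:  # k-mers containing N are skipped
                vector[position] += 1.0 / n_kmers
    return vector


index = build_index()
print(len(index))  # 2772
features = kmer_frequencies("ACGT" * 12 + "AC", index)  # an arbitrary 50-mer
print(len(features))
```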
A linear SVM is then trained to discriminate between the two classes.
The SVM regularization parameter is selected by evaluating the entire
regularization path on a held-out portion of the training data set.
After training, each 50-mer in the human genome is converted to the
2772-element representation and scored using the trained SVM.
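Genome-wide scoring amounts to sliding a 50 bp window along the sequence and applying the linear decision function w·x + b to each window. The sketch below shows only that loop. To stay short it uses a hypothetical one-element stand-in featurizer (`gc_fraction`) in place of the 2772-element representation, and the weight and bias values are made up for illustration.

```python
def gc_fraction(window):
    """Hypothetical stand-in featurizer: a one-element vector with the GC fraction."""
    return [sum(base in "GC" for base in window) / len(window)]


def score_sequence(seq, featurize, weights, bias, window=50):
    """Score every window of a sequence with a linear decision function.

    Returns (start, score) pairs, where start is the 0-based window start and
    score = dot(weights, featurize(window)) + bias.
    """
    seq = seq.upper()
    results = []
    for start in range(len(seq) - window + 1):
        x = featurize(seq[start:start + window])
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        results.append((start, score))
    return results


# A 60 bp toy sequence yields 60 - 50 + 1 = 11 windows.
toy_sequence = "G" * 50 + "A" * 10
for start, score in score_sequence(toy_sequence, gc_fraction, [2.0], -1.0):
    print(start, round(score, 3))
```

Passing the featurizer as an argument keeps the scoring loop independent of the feature representation, so a full k-mer frequency function and a trained weight vector could be substituted without changing the loop.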
Detailed methods are given in Gupta et al. (2008), and
supplementary data is available
here.
Credits
This track was produced at the University of Washington by Shobhit
Gupta and William Stafford Noble.
References
Ozsolak F, Song JS, Liu XS, Fisher DE.
High-throughput mapping of the chromatin structure of human
promoters. Nat Biotechnol. 2007 Feb;25(2):244-8.
Dennis JH, Fan HY, Reynolds SM, Yuan G, Meldrim JC, Richter DJ,
Peterson DG, Rando OJ, Noble WS, Kingston RE.
Independent and complementary methods for large-scale structural
analysis of mammalian chromatin. Genome Res.
2007 Jun;17(6):928-39.
Gupta S, Dennis J, Thurman RE, Kingston R, Stamatoyannopoulos JA, Noble WS.
Predicting human nucleosome occupancy from primary sequence.
PLoS Comput Biol. 2008 Aug 22;4(8):e1000134.