Description
These tracks display the level of sequence uniqueness of the reference NCBI36/hg18
genome assembly. They were generated using different window sizes, and the
signal is high in areas where the sequence is unique.
Methods
The Broad alignability track displays whether a region is made up of
mostly unique or mostly non-unique sequence. To generate the track, every
36-mer in the genome was marked as "unique" if the most similar 36-mer
elsewhere in the genome has at most 2 mismatches, and as "non-unique"
otherwise. Position X in the alignable track is marked by 1 if >50% of the
bases in [X-200,X+200] are "unique" and by 0 otherwise. Every point in the
alignable track has a corresponding position in each of the ChIP signal
tracks. The Broad alignability track was generated for the ENCODE project as a
tool for development of the Broad Histone tracks.
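The windowing rule above can be sketched as follows. This is a minimal illustration, not the Broad pipeline: the function name `alignability_track`, the per-base list of 0/1 "unique" flags used as input, and the truncation of windows at the ends of the sequence are all assumptions made for the example.

```python
def alignability_track(unique_flags, half_window=200):
    """Return 1 at position X if more than 50% of the bases in
    [X - half_window, X + half_window] are flagged unique, else 0.

    unique_flags is a hypothetical per-base list of 0/1 values.
    Windows are truncated at the sequence ends (an assumption; the
    track description does not say how edges were handled).
    """
    n = len(unique_flags)
    # Prefix sums let each window total be read in constant time.
    prefix = [0]
    for flag in unique_flags:
        prefix.append(prefix[-1] + (1 if flag else 0))

    track = []
    for x in range(n):
        lo = max(0, x - half_window)
        hi = min(n - 1, x + half_window)
        width = hi - lo + 1
        unique = prefix[hi + 1] - prefix[lo]
        # Strictly more than half of the window must be unique.
        track.append(1 if 2 * unique > width else 0)
    return track
```

With a toy half-window of 1, `alignability_track([1, 1, 1, 0, 0, 0, 0], half_window=1)` marks the first three positions with 1 and the rest with 0. A window that is exactly 50% unique is marked 0, because the rule requires more than 50%.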
The Duke uniqueness tracks display how unique each sequence is on the
positive strand, starting at a particular base and of a particular length.
Thus, the 20 bp track reflects the uniqueness of all 20 base sequences with
the score being assigned to the first base of the sequence. Scores are
normalized to between 0 and 1, with 1 representing a completely unique
sequence and 0 indicating that the sequence occurs more than 4 times in the genome
(excluding chrN_random and alternative haplotypes). A score of 0.5
indicates the sequence occurs exactly twice, likewise 0.33 for three times
and 0.25 for four times. The Duke uniqueness tracks were generated
for the ENCODE project as tools in the development of the
Open Chromatin tracks.
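The scoring scheme described for the Duke uniqueness tracks can be illustrated on a toy sequence. This is a sketch under stated assumptions, not the Duke code: the function names are hypothetical, and occurrences are counted only on the strand given, within the single input string.

```python
from collections import Counter


def score_from_count(occurrences):
    """Map an occurrence count to a score: 1/count for 1 to 4
    occurrences, and 0 for more than 4."""
    if occurrences < 1:
        raise ValueError("a sequence present in the genome occurs at least once")
    if occurrences > 4:
        return 0.0
    return 1.0 / occurrences


def uniqueness_track(sequence, k):
    """Score every k-mer of `sequence`, assigning each score to the
    first base of the k-mer (hypothetical helper for illustration)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [score_from_count(counts[kmer]) for kmer in kmers]
```

For example, `uniqueness_track("ACGTACG", 3)` scores the two `ACG` start positions as 0.5 and the three k-mers in between as 1.0. The published tracks report three occurrences as 0.33, whereas this sketch leaves 1/3 unrounded.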
The Duke excluded regions track displays genomic regions for which mapped
sequence tags were filtered out before signal generation and peak calling for
Duke/UNC/UTA's Open Chromatin tracks. This track contains problematic regions
for short sequence tag signal detection (such as satellites and rRNA genes).
The Duke excluded regions track was generated for the ENCODE project.
The Rosetta uniqueness track uses sequence 'tiles' of 35 bp.
Each tile was aligned to the genome using the BWA aligner. Tiles that align
uniquely and perfectly in hg18 receive a p-value of 1e-37, while those that
align perfectly in multiple locations receive a p-value of 0. For each tile,
the oligo midpoint coordinate was recorded along with the -log_10 p-value:
37 (unambiguous) to 0 (ambiguous). The Rosetta uniqueness track was
generated independently of the ENCODE project.
The UMass uniqueness track displays a uniqueness signal for each
base which represents the sum of both plus and minus strand 15-mer occurrences
of that particular 5'->3' (plus strand) sequence throughout the genome. Scores
are normalized between 0 and 1 by calculating ( 1 / N ) where N is the number
of genome wide occurrences of the 15-mer starting at position X. A score of 1
represents a single genome wide occurrence of that 15-mer. A 0.5 would
represent either 2 plus strand occurrences or 1 plus and 1 minus strand
occurrence, and so on. Ratios are rounded to 3 decimal places, so a score of
0.000 represents more than 2000 occurrences. A 0 is reserved for a given 15-mer
that is either not assembled or contains at least one N at position X. The
UMass uniqueness track was generated for the ENCODE project.
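The both-strand counting and 1/N normalization can be sketched as below. This is an illustrative example only: the names `reverse_complement` and `umass_style_track` are hypothetical, the input is a single toy string rather than a genome, and minus-strand occurrences of a k-mer are counted as plus-strand occurrences of its reverse complement.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def umass_style_track(sequence, k=15):
    """Per-position score 1/N, rounded to 3 decimal places, where N is
    the number of plus- and minus-strand occurrences of the k-mer that
    starts at that position. K-mers containing N score 0."""
    sequence = sequence.upper()
    starts = range(len(sequence) - k + 1)

    # Count plus-strand occurrences of every fully assembled k-mer.
    plus_counts = Counter()
    for i in starts:
        kmer = sequence[i:i + k]
        if "N" not in kmer:
            plus_counts[kmer] += 1

    track = []
    for i in starts:
        kmer = sequence[i:i + k]
        if "N" in kmer:
            track.append(0.0)
            continue
        # A minus-strand hit of kmer is a plus-strand hit of its
        # reverse complement.
        n = plus_counts[kmer] + plus_counts[reverse_complement(kmer)]
        track.append(round(1.0 / n, 3))
    return track
```

On the toy input `"ACGTT"` with `k=3`, the k-mers `ACG` and `CGT` are reverse complements of each other, so each has one plus and one minus occurrence and scores 0.5, while `GTT` scores 1.0. The rounding step is what turns very frequent k-mers into 0.000: `round(1 / 2001, 3)` is `0.0`.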
The CRG Alignability tracks display how uniquely k-mer sequences align
to a region of the genome. The data were generated with the GEM-mappability
program. The method is equivalent to mapping sliding windows of k-mers
(where k has been set to 36, 40, 50, 75 or 100 nts to produce these tracks)
back to the genome using the GEM mapper aligner (up to 2 mismatches were
allowed in this case). For each window, a mappability score was computed
(S = 1/(number of matches found in the genome): S=1 means one match in the
genome, S=0.5 is two matches in the genome, and so on). The CRG Alignability
tracks were generated independently of the ENCODE project, in the framework of
the GEM (GEnome Multitool) project.
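The score S = 1/(number of matches) with a mismatch allowance can be demonstrated by brute force on a short string. This sketch does not use or reproduce GEM: the function names are hypothetical, only the forward strand of the toy input is searched, and each window counts itself as one match.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mappability(sequence, k, max_mismatches=2):
    """Score each k-mer window as 1 / (number of windows in `sequence`
    within `max_mismatches` of it, including the window itself).

    Quadratic in the number of windows, so suitable for toy inputs only.
    """
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    scores = []
    for window in windows:
        matches = sum(
            1 for other in windows if hamming(window, other) <= max_mismatches
        )
        scores.append(1.0 / matches)
    return scores
```

With exact matching, `mappability("AAAAAAAT", 4, max_mismatches=0)` gives 0.25 for each of the four `AAAA` windows and 1.0 for `AAAT`. Allowing one mismatch makes `AAAT` match every `AAAA` window as well, so all five windows drop to 0.2, which shows how the mismatch allowance lowers scores in near-repetitive sequence.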
Credits
The Broad alignability track was created by the Broad Institute.
Data generation and analysis were supported by funds from the NHGRI (the ENCODE
project), the Burroughs Wellcome Fund, Massachusetts General Hospital and the
Broad Institute.
The Duke uniqueness and Duke excluded regions tracks were created by Terry
Furey and Debbie Winter at Duke University's Institute for Genome Sciences &
Policy (IGSP), and Stefan Graf at the European Bioinformatics Institute (EBI).
We thank NHGRI for ENCODE funding support.
The Rosetta uniqueness track was created by John Castle at Rosetta
Inpharmatics (Merck), with assistance from Melissa Cline at UCSC.
The UMass uniqueness track was created by Bryan Lajoie in Job Dekker's Lab at
the University of Massachusetts Medical School. Funding Support: NIH grant
HG003143 to JD. Keck Distinguished Young Scholar Award to JD. This track was
generated as part of the ENCODE project funded by the NHGRI.
The CRG Alignability tracks were created by Thomas Derrien and Paolo Ribeca in
Roderic Guigo's lab at the Centre for Genomic Regulation (CRG), Barcelona,
Spain. Thomas Derrien was supported by funds from NHGRI for
the ENCODE project, while Paolo Ribeca was funded by a Consolider grant CDS2007-00050
from the Spanish Ministerio de Educación y Ciencia.
References
Derrien T, Estelle J, Marco Sola S, Knowles DG, Raineri E, Guigo R, Ribeca P.
Fast computation and applications of genome mappability.
PLoS One. 2012;7(1):e30377.
Data Release Policy
Data users may freely use all data in this track. ENCODE labs that
contributed annotations have exempted the data displayed here from the
ENCODE data release policy restrictions.