These tracks display chromatin immunoprecipitation followed by sequencing
(ChIP-seq) evidence as part of the four Open Chromatin track sets (see below).
ChIP-seq is a method to identify the specific location of proteins that are
directly or indirectly bound to genomic DNA. By identifying the binding location
of sequence-specific transcription factors, general transcription
machinery components, and chromatin factors, ChIP can help in the functional
annotation of the open chromatin regions identified by DNaseI HS mapping and
FAIRE.
Together with DNaseI HS and FAIRE experiments, these tracks display the
locations of active regulatory elements identified as open chromatin in
multiple cell types by the Duke, UNC-Chapel Hill, UT-Austin, and EBI
ENCODE group.
Within this project, open chromatin was identified using two
independent and complementary methods: DNaseI hypersensitivity (HS)
and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE),
combined with these ChIP-seq assays for select
regulatory factors. DNaseI HS and FAIRE provide assay
cross-validation with commonly identified regions delineating the
highest confidence areas of open chromatin. These ChIP assays provide
functional validation and preliminary annotation of a subset of
open chromatin sites. Each method employed Illumina (formerly Solexa)
sequencing by synthesis as the detection platform.
The Tier 1 and Tier 2 cell types were additionally verified by a
second platform, high-resolution 1% ENCODE tiled microarrays supplied by NimbleGen.
As a background control experiment, the input genomic DNA sample that
was used for ChIP was sequenced. Crosslinked chromatin
was sheared and the crosslinks were reversed without carrying out the
immunoprecipitation step. This sample was otherwise processed in a manner
identical to the ChIP sample as described below. The input track is
useful in revealing potential artifacts arising from the sequence
alignment process such as copy number differences between the
reference genome and the sequenced samples, as well as regions of
poor sequence alignability. For cell lines with no input experiment
available, peaks were called using the generic_male or generic_female
control, which attempts to provide a general background based on input data
from several cell types. These control files are in "iff" format, which
F-Seq uses when calling peaks, and can be downloaded directly from the
production lab, under the section titled "Copy number / karyotype correction."
Other Open Chromatin track sets:
Data for the DNaseI HS experiments can be found in
Duke DNaseI HS.
Data for the FAIRE experiments can be found in
UNC FAIRE.
A synthesis of all the open chromatin assays for select cell lines can
be found in
Open Chrom Synth.
Display Conventions and Configuration
This track is a multi-view composite track that contains a single data type
with multiple levels of annotation (views). For each view, there are
multiple subtracks representing different cell types that display individually
on the browser. Instructions for configuring multi-view tracks are
here.
ChIP data displayed here represents a continuum of signal intensities.
The Iyer lab recommends setting the "Data view scaling: auto-scale"
option when viewing signal data in full mode to see the full dynamic
range of the data. Note that in regions without open chromatin sites,
auto-scale rescales the data and inflates the background signal, making the
regions appear noisy. Switching back to a fixed scale alleviates this issue.
In general, for each experiment in each of the cell types, the
UTA TFBS tracks contain the following views:
Peaks
Regions of enriched signal in ChIP experiments.
Peaks were called based on signals created using F-Seq, a software program
developed at Duke (Boyle et al., 2008b). Significant regions were
determined by fitting the data to a gamma distribution to calculate p-values.
Contiguous regions where p-values were below a 0.05/0.01 threshold were
considered significant. The solid vertical line in the peak represents the
point with highest signal.
F-Seq Density Signal
Density graph (wiggle) of signal
enrichment calculated using F-Seq for the combined set of sequences from all
replicates. F-Seq employs Parzen kernel density estimation to create base pair
scores (Boyle et al., 2008b). This method does not look at fixed-length
windows but rather weights the contributions of nearby sequences according to
their distance from that base. It considers only sequences that align to 4 or
fewer locations in the genome and uses an alignability background model to
help correct for regions where sequences cannot be aligned. For each cell
type, a model based on control input data was also used to help correct for
amplifications and deletions; this is especially important for cell types
with an abnormal karyotype.
Base Overlap Signal
An alternative version of the
F-Seq Density Signal track annotation that provides a higher resolution
view of the raw sequence data. This track also includes the combined set of
sequences from all replicates. For each sequence, the aligned read is
extended 5 bp in both directions from its 5' aligned end. The score at each
base pair represents the number of extended fragments that overlap the base
pair.
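As a rough illustration of the Base Overlap Signal view, the following Python sketch extends each read 5 bp in both directions from its 5' aligned end and counts, for each base, how many extended fragments overlap it. The function name, the 0-based coordinates, the input (a plain list of 5' end positions), and the choice to include the 5' base itself in the extended fragment are all assumptions made for this example; it is not the production pipeline.

```python
from collections import Counter

def base_overlap_signal(five_prime_ends, extension=5):
    """Count how many extended fragments cover each base.

    five_prime_ends: 0-based positions of each read's 5' aligned end
    (hypothetical input). Each read is extended `extension` bp in both
    directions, so strand does not matter here.
    """
    coverage = Counter()
    for pos in five_prime_ends:
        start = max(0, pos - extension)  # do not run off the chromosome start
        for base in range(start, pos + extension + 1):
            coverage[base] += 1
    return dict(coverage)

# Three reads; the first two are close enough for their fragments to overlap.
signal = base_overlap_signal([100, 103, 120])
for base in sorted(signal):
    print(base, signal[base])
```

Bases that no extended fragment covers are simply absent from the returned dictionary, which keeps the sketch sparse for genome-sized coordinates.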
Peaks and signals displayed in this track are the results of pooled replicates. The raw
sequence and alignment files for each replicate are available for
download.
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
To perform ChIP, proteins were cross-linked to DNA in vivo
using 1% formaldehyde solution (Bhinge et al., 2007; ENCODE
Project Consortium, 2007). Cross-linked chromatin was sheared by sonication
and immunoprecipitated using a specific antibody against the protein of
interest. After reversal of the cross-links, the immunoprecipitated DNA was
used to identify the genomic location of transcription factor binding. This
was accomplished by sequencing of the ends of the immunoprecipitated DNA
(ChIP-seq) using the Illumina (Solexa) sequencing system. ChIP data for
Tier 1 and Tier 2 cell lines were verified by comparing multiple independent
growths (replicates) and determining the reproducibility of the data. For
some cell types, additional verification was performed using the same
immunoprecipitated DNA by labeling and hybridizing to NimbleGen Human
ENCODE tiling arrays (1% of the genome) along with the input DNA as reference
(ChIP-chip). A more detailed protocol is available
here.
DNA fragments isolated by ChIP are 100-200 bp in length, with the average
length being 134 bp. Sequences from each experiment were aligned to the
GRCh37 (hg19) genome assembly using the Burrows-Wheeler Aligner (Li et al.,
2010). In the alignment command, genome.fa is the whole genome sequence and
s_1.sequence.txt.bfq is one lane of sequences converted into the required bfq
format.
Sequences from multiple lanes were combined for a single replicate using the
bwa samse command and converted to SAM/BAM format using SAMtools.
Only sequences that aligned to 4 or fewer locations were retained. Other sequences
were also filtered based on their alignment to problematic regions
(such as satellites and rRNA genes - see
supplemental materials).
The mappings of these short reads to the genome are available for
download.
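The "4 or fewer locations" filter can be sketched in Python as follows. The sketch assumes that the number of alignment locations is read from the X0 optional tag (number of best hits) that BWA writes in its samse output; whether the production pipeline used this tag, or also counted suboptimal hits, is not stated here, so treat that as an assumption. The function names and example records are hypothetical.

```python
def best_hit_count(sam_line):
    """Return the X0 (number of best hits) value of a SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("X0:i:"):
            return int(tag[len("X0:i:"):])
    return None

def keep_alignment(sam_line, max_locations=4):
    """Keep header lines and reads that align to at most max_locations places."""
    if sam_line.startswith("@"):
        return True
    hits = best_hit_count(sam_line)
    # Records without an X0 tag (for example unmapped reads) are dropped.
    return hits is not None and hits <= max_locations

def make_record(name, x0):
    """Build a minimal, made-up SAM record for the demonstration."""
    core = [name, "0", "chr1", "100", "37", "4M", "*", "0", "0", "ACGT", "IIII"]
    return "\t".join(core + ["XT:A:U", "X0:i:%d" % x0])

records = [make_record("read_a", 1), make_record("read_b", 4), make_record("read_c", 9)]
kept = [r.split("\t")[0] for r in records if keep_alignment(r)]
print(kept)
```

Filtering against problematic regions such as satellites and rRNA genes would be a separate step and is not shown.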
The resulting digital signal was converted to a continuous wiggle track using
F-Seq, which employs Parzen kernel density estimation to create base pair scores
(Boyle et al., 2008b). Input data have been generated for several
cell lines and are used directly to create the control/background model that
F-Seq uses when generating signal annotations for those cell lines.
These models are meant to correct for sequencing biases, alignment artifacts,
and copy number changes in these cell lines. Input data have not been generated
directly for the other cell lines. Instead, a general background model was derived
from the available input data sets. This should provide corrections for
sequencing biases and alignment artifacts, but will not correct for cell
type-specific copy number changes.
In the F-Seq command, the bff files are the background files based on
alignability, the iff files are the background files based on the input
experiments, and alignments.bed is a BED file of filtered sequence alignments.
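To make the kernel density step more concrete, here is a minimal Python sketch of Parzen kernel density estimation over read positions. It assumes a Gaussian kernel and an arbitrary bandwidth, and it omits the alignability (bff) and input (iff) background corrections entirely. The function name and parameters are hypothetical, and this is not the F-Seq implementation.

```python
import math

def kernel_density(read_positions, start, end, bandwidth=30.0):
    """Parzen (kernel) density estimate at each base in [start, end).

    Every read contributes to every base, with a weight that falls off
    with distance according to a Gaussian kernel (an assumption here).
    """
    n = len(read_positions)
    if n == 0:
        return [0.0] * (end - start)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    scores = []
    for base in range(start, end):
        total = 0.0
        for pos in read_positions:
            z = (base - pos) / bandwidth
            total += math.exp(-0.5 * z * z)
        scores.append(norm * total)
    return scores

# A small cluster of reads near position 50 and a single read at 90.
scores = kernel_density([50, 50, 52, 90], start=0, end=150, bandwidth=5.0)
peak_base = scores.index(max(scores))
print(peak_base, round(scores[peak_base], 4))
```

Unlike a fixed-window count, the score changes smoothly from base to base, and the bandwidth controls how far a read's influence extends.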
Discrete ChIP sites (peaks) were identified from the ChIP-seq F-Seq
density signal. Significant regions were determined by fitting the
data to a gamma distribution to calculate p-values. Contiguous regions
where p-values were below a 0.05/0.01 threshold were considered significant.
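The peak-calling idea can be sketched in Python as below: fit a gamma distribution to the signal, convert each score to an upper-tail p-value, and report contiguous runs of bases whose p-values fall below a threshold. How the fit was actually performed is not described here, so the method-of-moments fit, the series used for the gamma tail probability, and all function names are assumptions for illustration only.

```python
import math

def fit_gamma_moments(values):
    """Method-of-moments gamma fit; returns (shape, scale).

    Assumes positive values that are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean * mean / var, var / mean

def gamma_sf(x, shape, scale):
    """Upper-tail probability P(X >= x) for a gamma distribution.

    Uses the power series for the lower incomplete gamma function,
    which is adequate for moderate x / scale but not for extreme tails.
    """
    if x <= 0:
        return 1.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-16 * total and n < 100000:
        term *= z / (shape + n)
        total += term
        n += 1
    cdf = total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))
    return max(0.0, 1.0 - cdf)

def significant_regions(signal, shape, scale, threshold=0.05):
    """Return half-open (start, end) index ranges where p < threshold."""
    regions = []
    start = None
    for i, value in enumerate(signal):
        if gamma_sf(value, shape, scale) < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions

# Made-up density values with one obvious bump.
signal = [0.4, 0.5, 0.6, 0.5, 6.0, 7.0, 6.5, 0.5, 0.4, 0.6, 0.5, 0.4]
shape, scale = fit_gamma_moments(signal)
print(significant_regions(signal, shape, scale, threshold=0.05))
```

Passing 0.01 instead of 0.05 as the threshold gives the stricter of the two cutoffs mentioned above.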
Data from the high-resolution 1% ENCODE tiled microarrays supplied by
NimbleGen were normalized using the Tukey biweight normalization, and peaks
were called using ChIPOTle (Buck et al., 2005) at multiple levels
of significance. Regions matched on size to these peaks that were devoid of
any significant signal were also created as a null model. These data were used
for additional verification of Tier 1 and Tier 2 cell lines by ROC analysis.
Files labeled Validation view containing this data are available for
download.
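One way to picture this kind of ROC analysis is sketched below in Python: array-derived peak regions are treated as positives, the size-matched null regions as negatives, and each region is given a score from the sequencing data. The area under the ROC curve is then the probability that a randomly chosen positive outscores a randomly chosen negative. Which score was used and how regions were compared are not specified here, so the inputs and the function name are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve from two lists of region scores.

    Computed as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Made-up scores for array peak regions and null regions.
peak_region_scores = [3.0, 4.0, 5.0]
null_region_scores = [1.0, 2.0, 3.0]
print(roc_auc(peak_region_scores, null_region_scores))
```

An AUC near 1 would indicate that the sequencing signal separates array peaks from null regions well, while a value near 0.5 would indicate no separation.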
Release Notes
Release 2 (August 2011) of this track adds 34 new experiments including 17 new cell lines.
Enhancer and Insulator Functional assays: A subset of DNase and FAIRE
regions were cloned into functional tissue culture reporter assays to test for
enhancer and insulator activity. Coordinates and results from these
experiments can be found
here.
Credits
These data and annotations were created by a collaboration of multiple
institutions (contact:
Terry Furey)
Data users may freely use ENCODE data, but may not, without prior consent,
submit publications that use an unpublished ENCODE dataset until nine months
following the release of the dataset. This date is listed in the Restricted
Until column on the track configuration page and the download page. The
full data release policy for ENCODE is available here.