Description
This track depicts high-throughput sequencing of long RNAs (>200 nt)
in RNA samples from tissues or subcellular compartments of ENCODE cell lines.
The overall goal of the ENCODE project is to identify and characterize all
functional elements in the sequence of the human genome.
Display Conventions and Configuration
This track is a multi-view composite track that contains the following
views:
- Alignments
- The Alignments view shows reads mapped to the genome. Sequences determined
to be transcribed on the positive strand are shown in blue.
Sequences determined to be transcribed on the negative strand are shown in
orange. Sequences for which the direction of
transcription could not be determined are shown in black.
- Raw Signals
- The Raw Signal views show the density of aligned tags on the plus strand, on the minus strand, and on both strands.
Methods
Cells were grown according to the approved
ENCODE cell culture protocols.
Sample preparation and sequencing
K562 and GM12878 total cell, total RNA
Libraries were prepared with the standard Illumina paired-end kit, with the sole
exception that a "tagged" random hexamer was used to prime first-strand synthesis:
5′-ACTGTAGGN6-3′.
The addition of this tag is what permits us to make strand assignments for the reads.
The sequence of the tag is reported at the 5′ end of the read. Asymmetric PCR
can place the tag on either the first or the second read, depending on which strand
is used as a template. Strand assignments are made by looking for the tag at the
5′ end of either read 1 or read 2. Read 1 is physically linked to read 2. Therefore,
if a tag is present on one end, strand assignments are made for both ends. We noted
during analysis that the tags are generally 5′ truncated. We "strand" only reads that
contain ACTGTAGG, CTGTAGG, TGTAGG, or GTAGG. Between 63% and 68% of reads could be
stranded in these libraries. It is possible to cull additional stranded reads that
contain non-templated TAGG, AGG, GG, or G sequences at their 5′ end. The peak in
insert size distribution is between 200 and 250 nucleotides.
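The tag lookup described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and treating a pair in which both reads carry a tag as unassignable is an assumption. The sketch reports only which read of the pair carries the tag, since the mapping from tagged read to genomic strand is not spelled out here.

```python
TAG = "ACTGTAGG"
MIN_TAG_LEN = 5  # shortest accepted form is GTAGG

# Full tag plus its accepted 5'-truncated forms, longest first:
# ACTGTAGG, CTGTAGG, TGTAGG, GTAGG
ACCEPTED_TAGS = [TAG[i:] for i in range(len(TAG) - MIN_TAG_LEN + 1)]


def tag_length(read):
    """Return the length of the accepted tag at the 5' end of a read, or 0."""
    for tag in ACCEPTED_TAGS:
        if read.startswith(tag):
            return len(tag)
    return 0


def find_tagged_read(read1, read2):
    """Return (read_number, tag_length) for the read carrying the tag.

    Returns None when neither read carries a tag, or when both do
    (assumption: such pairs are treated as ambiguous).
    """
    len1, len2 = tag_length(read1), tag_length(read2)
    if len1 and not len2:
        return 1, len1
    if len2 and not len1:
        return 2, len2
    return None
```

Because the two reads of a pair are physically linked, a single call per pair is enough to strand both ends.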
K562 cytosol, polyA+ RNA
Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the
manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline
pyrophosphatase to eliminate any 5′ cap structures and hydrolyzed to ~200
bases via alkaline hydrolysis. The 3′ end was repaired using calf
intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the
addition of Cs to the 3′ end. The 5′ end was phosphorylated using T4
PNK, and an RNA linker was ligated onto the 5′ end. Reverse transcription
was carried out using a poly-G oligo with a defined 5′ extension. The
inserts were then amplified using oligos targeting the 5′ linker and
poly-G extension. This cloning protocol generated stranded reads that were read
from the 5′ ends of the inserts. The library was sequenced on a Solexa
platform for a total of 36 cycles; however, the reads underwent post-processing,
resulting in trimming of their 3′ ends. Consequently, the mapped read
lengths are variable.
Analysis
K562 and GM12878 total cell, total RNA
Tags were removed from the 5′ ends of the reads according to their
lengths, and strand assignments were made. Subsequently, the reads were trimmed
from their 3′ ends to a final length of 50 nucleotides and were mapped
using Nexalign, a program developed by Timo Lassmann, RIKEN. We allowed up to
2 mismatches across the entire length and report only reads that mapped to a
single, unique locus in the assembled hg18 genome.
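As a rough illustration of the read preparation and reporting rule just described, the Python sketch below strips an accepted tag, trims to 50 nucleotides, and decides whether a read is reportable. The function names and the `(locus, mismatches)` hit representation are hypothetical and do not reflect Nexalign's output format; counting uniqueness only among hits with at most 2 mismatches is also an assumption.

```python
ACCEPTED_TAGS = ("ACTGTAGG", "CTGTAGG", "TGTAGG", "GTAGG")  # longest first
FINAL_LEN = 50
MAX_MISMATCHES = 2


def strip_tag_and_trim(read, final_len=FINAL_LEN):
    """Remove an accepted 5' tag if present, then trim the 3' end."""
    for tag in ACCEPTED_TAGS:
        if read.startswith(tag):
            read = read[len(tag):]
            break
    # Reads already shorter than final_len are returned unchanged.
    return read[:final_len]


def is_reportable(hits, max_mismatches=MAX_MISMATCHES):
    """Return True if exactly one hit falls within the mismatch limit.

    `hits` is a hypothetical list of (locus, mismatches) tuples.
    """
    good = [locus for locus, mismatches in hits if mismatches <= max_mismatches]
    return len(good) == 1
```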
K562 cytosol, polyA+ RNA
Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only
uniquely mapping (one locus), exactly matching (no mismatches) aligned reads reported in
the processed files, as follows:
- Collect the read sequences from Illumina non-filtered output files.
- Filter out all reads that contain undefined nucleotides ('N').
- Perform iterative alignment/C-tail chopping algorithm (below). On each
alignment step, the reads are aligned to the genome with 100% identity.
All reads that align to a single locus are withdrawn from the alignment pool and
only the reads that could not be aligned continue to the next step.
  a. Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without
  chopping off any nucleotides
  b. Chop off any C-blocks (until the first non-C) at the ends of the reads
  c. Align to the genome -> remove and save those that align
  d. Chop off any non-Cs until the next C
  e. Chop off C-block until the next non-C
  f. Align to the genome -> remove and save those that align
  g. Repeat steps d, e, and f until the reads align to the genome, or chopping
  results in the reduction of the reads' lengths to below 16 (default), or
  there are no non-Cs left.
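The iterative alignment and C-tail chopping loop can be sketched for a single read as below. This is an illustration under stated assumptions, not the code that was run: `align_unique` is a hypothetical callable standing in for an exact-match Nexalign run (it should return a locus when the sequence matches exactly one place in the genome, and None otherwise), and chopping is assumed to happen at the 3′ end of the read, where the C-tail was added.

```python
MIN_LEN = 16  # default minimum read length after chopping


def strip_3prime(seq, strip_c):
    """Strip a trailing run of Cs (strip_c=True) or of non-Cs (strip_c=False)."""
    end = len(seq)
    while end > 0 and (seq[end - 1] == "C") == strip_c:
        end -= 1
    return seq[:end]


def chop_and_align(read, align_unique, min_len=MIN_LEN):
    """Return (aligned_sequence, locus), or None if the read never aligns.

    align_unique is a hypothetical exact-match, unique-locus aligner.
    """
    if "N" in read:
        return None  # reads with undefined nucleotides are filtered out

    # First attempt: align without chopping any nucleotides.
    hit = align_unique(read)
    if hit is not None:
        return read, hit

    # Chop the C-block at the 3' end (assumed position of the C-tail).
    seq = strip_3prime(read, strip_c=True)

    # Stop when nothing is left or the read falls below the minimum length.
    while seq and len(seq) >= min_len:
        hit = align_unique(seq)
        if hit is not None:
            return seq, hit
        seq = strip_3prime(seq, strip_c=False)  # chop non-Cs until the next C
        seq = strip_3prime(seq, strip_c=True)   # chop the C-block that follows
    return None
```

One consequence worth noting: an insert that genuinely ends in C loses those bases along with the tail, so it can only align after further chopping, if at all.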
Verification
Verification was done by comparison of referential data generated from 8
individual sequencing lanes (Illumina technology).
Release Notes
This is Release 2 (Nov 2009) of this track. It includes data from additional
experiments, and changes in formatting for the existing data described below.
The K562 cytosol alignments are exactly the same data as Release 1, but the
alignments are now formatted in the bed14 format described below.
These data have the string submittedDataVersion="V2 - file format change" in
their metadata and the table names are appended with the string "V2".
The alignments in this track are provided in bigBed format.
Each record is in bed14 format with the first 12 fields described
here.
The final two fields are the two paired sequences, or in the case of single
alignments, the 13th field is the sequence and the 14th field is a single N.
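A record in this layout could be read as sketched below, once the bigBed file has been converted to tab-separated text (for example with a converter such as bigBedToBed). The `Bed14` class and `parse_bed14` function are hypothetical names for illustration; the first 12 fields follow the standard BED12 order, and only a subset is kept here for brevity.

```python
from typing import List, NamedTuple, Optional


class Bed14(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    block_sizes: List[int]
    block_starts: List[int]
    seq1: str
    seq2: Optional[str]  # None for single (unpaired) alignments


def parse_ints(field):
    """Parse a comma-separated BED list, tolerating a trailing comma."""
    return [int(x) for x in field.split(",") if x]


def parse_bed14(line):
    """Parse one tab-separated bed14 line into a Bed14 record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 14:
        raise ValueError("expected 14 fields, got %d" % len(fields))
    return Bed14(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        strand=fields[5],
        block_sizes=parse_ints(fields[10]),
        block_starts=parse_ints(fields[11]),
        seq1=fields[12],
        # A single N in field 14 marks a single alignment.
        seq2=None if fields[13] == "N" else fields[13],
    )
```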
Credits
K562 cytosol, polyA+ RNA
These data were generated and analyzed by the transcriptome group at
Cold Spring Harbor Laboratory and the Center for Genomic
Regulation (Barcelona), who are participants in the ENCODE Transcriptome Group.
K562 and GM12878 total cell, total RNA
Credits: Carrie A. Davis, Jorg Drenkow, Huaien Wang, Alex Dobin and Tom Gingeras
Contacts: Carrie Davis and Tom Gingeras (CSHL).
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset until
nine months following the release of the dataset. This date is listed in
the Restricted Until column, above. The full data release policy
for ENCODE is available here.