Description
This track shows the locations of transcriptionally active regions
(TARs)/transcribed fragments (transfrags) hybridized to an oligonucleotide
microarray with a design based on human assembly hg13 (NCBI Build 31)
(Bertone et al., 2004).
Methods
Microarrays were designed using sequence from the human hg13 assembly. The
genome sequence was screened for repetitive elements and low-complexity DNA
using RepeatMasker in the sensitive mode. Additional low-complexity filtering
was performed using the NSEG (segment sequence(s) by local complexity)
program using a minimum segment length of 21 nucleotides to determine
low complexity segments of lowest probability. After filtering, 1.5 Gb of
nonrepetitive DNA remained and microarray probes were chosen using the NASA
Oligonucleotide Probe Selection Algorithm (NOPSA).
NOPSA is designed to find the optimal probes for hybridization. A
database of the frequency of every 18-mer in the genome is created using a
hash algorithm, with chaining used to resolve collisions. The average
frequency of each 36-mer in the genome is determined from the frequencies of
each 18-mer subsequence in the 36-mer and its reverse complement. 36-mer
oligonucleotides with a frequency equal to one are selected as potential
probes for the microarray (from supporting online material for Stolc et al.,
2004).
This resulted in probe selection based on several criteria:
- Every selected 36-mer is unique in the genome.
- Sequences that could form a loop with a stem of > 7 bp were excluded.
- Factors such as sequence length, extent of complementarity and base
composition were also considered.
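The frequency-based selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the NOPSA implementation: the function names are hypothetical, a Python Counter (a hash table) stands in for the 18-mer database, k-mers are counted on both strands (an assumption), and the stem-loop and base-composition criteria are not modelled.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(genome, k):
    """Count every k-mer on both strands (assumption: both strands count)."""
    counts = Counter()
    for strand in (genome, reverse_complement(genome)):
        for i in range(len(strand) - k + 1):
            counts[strand[i:i + k]] += 1
    return counts


def mean_subkmer_frequency(probe, counts, k):
    """Average frequency of the k-mers in a probe and its reverse complement."""
    freqs = []
    for seq in (probe, reverse_complement(probe)):
        for i in range(len(seq) - k + 1):
            freqs.append(counts[seq[i:i + k]])
    return sum(freqs) / len(freqs)


def candidate_probes(genome, probe_len=36, k=18):
    """Return (start, sequence) for probes whose average frequency equals one."""
    counts = kmer_counts(genome, k)
    selected = []
    for start in range(len(genome) - probe_len + 1):
        probe = genome[start:start + probe_len]
        if mean_subkmer_frequency(probe, counts, k) == 1:
            selected.append((start, probe))
    return selected
```

With the toy call candidate_probes("AACAAG", probe_len=3, k=2), only the probe "ACA" at offset 1 is kept, because every other 3-mer contains the repeated 2-mer "AA". A whole-genome run would need a far more memory-efficient k-mer index than this sketch provides.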
A total of 51,874,388 36-mer
oligonucleotide probes were selected from both the sense and antisense strands
at an average resolution of 46 bp to cover the non-repetitive sequence from
the whole genome. Probes were spaced every 10 nucleotides on average. The
probes were synthesized via maskless photolithography at a feature density of
approximately 390,000 probes per slide.
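As a rough consistency check on these figures, the arithmetic can be written out as follows. The reading that the 10-nucleotide spacing is the gap between adjacent probes, so that probe starts fall about 36 + 10 = 46 bp apart, is an interpretation rather than something stated in the text, and the slide count follows only from the approximate feature density given above.

```python
import math

TOTAL_PROBES = 51_874_388
PROBES_PER_SLIDE = 390_000   # approximate feature density from the text
PROBE_LENGTH = 36            # nucleotides
MEAN_GAP = 10                # assumed to be the gap between adjacent probes


def slides_needed(total_probes, probes_per_slide):
    """Smallest number of slides that can hold all probes."""
    return math.ceil(total_probes / probes_per_slide)


def mean_start_to_start(probe_length, gap):
    """Average distance between the start positions of adjacent probes."""
    return probe_length + gap
```

Under these assumptions, slides_needed(TOTAL_PROBES, PROBES_PER_SLIDE) gives 134 and mean_start_to_start(PROBE_LENGTH, MEAN_GAP) gives 46.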
Biological samples that were hybridized to the arrays consisted of
triple-selected human liver poly(A)+ RNA pooled from several individuals
(supplied by Ambion). One biological replicate was carried out.
See this NCBI
GEO accession for details of experimental protocols.
The TARs identified for hg13 (NCBI Build 31) were mapped to this
assembly using Blat. The program pslCDnaFilter was used to filter
alignments using the parameters
-minId=0.96, -minCover=0.25, -localNearBest=0.001, -minQSize=20,
-minNonRepSize=16, -ignoreNs, -bestOverlap.
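For readers who want to reproduce the filtering step, the options above can be assembled into a command line as sketched below. The input and output file names are hypothetical, and the argument order (options, then input PSL, then output PSL) reflects the usual pslCDnaFilter usage as understood here; check it against the tool's own usage message before relying on it.

```python
def psl_cdna_filter_command(in_psl, out_psl):
    """Build the pslCDnaFilter argument list with the options listed above."""
    return [
        "pslCDnaFilter",
        "-minId=0.96",
        "-minCover=0.25",
        "-localNearBest=0.001",
        "-minQSize=20",
        "-minNonRepSize=16",
        "-ignoreNs",
        "-bestOverlap",
        in_psl,    # hypothetical input: Blat alignments of the hg13 TARs
        out_psl,   # hypothetical output: filtered alignments
    ]
```

The resulting list can be passed to subprocess.run, for example subprocess.run(psl_cdna_filter_command("tars.blat.psl", "tars.filtered.psl"), check=True), where both file names are placeholders.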
Display Conventions
TARs are represented by blocks in the graphical display. The numeric part of
the ID displayed when the track has pack or full visibility is the ID used
by the Yale Database for Active Regions with Tools
(DART). A link to
this database is provided on the details page for each TAR.
Data Analysis
Two groups of TARs were identified: Normal and Poly(A)-associated.
Normal TARs:
Clusters of transcription units were identified that consisted of at least
five consecutive probes with fluorescence intensities in the top
90th intensity percentile and with genomic coordinates within a 250-nt
window. After collecting these regions genome-wide, their locations were
compared to those of annotated components of genes. As a result, a
total of 13,889 transcription units, ranging in size from 209 to 3,438
nucleotides, were identified. Under the null hypothesis of zero transcription,
only 400
were expected to be found. Of those regions identified, one-third (4,931)
correspond to previously annotated exons while the other 8,958 are new
transcribed sequences that are referred to as TARs.
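The run-finding criterion for Normal TARs can be illustrated with a short Python sketch. It is not the published pipeline: the function names are hypothetical, the intensity cut-off for the 90th percentile is assumed to have been computed beforehand, and a run of above-threshold probes is assumed to qualify when at least one group of five consecutive probes has start coordinates within 250 nt.

```python
def _qualifies(starts, min_probes, window):
    """True if some min_probes consecutive probe starts span <= window nt."""
    return any(
        starts[i + min_probes - 1] - starts[i] <= window
        for i in range(len(starts) - min_probes + 1)
    )


def find_transcription_units(probes, threshold, min_probes=5, window=250):
    """Find runs of consecutive high-intensity probes.

    probes: list of (start, intensity) tuples sorted by genomic start.
    threshold: intensity cut-off (assumed to be the percentile cut-off).
    Returns a list of (first_start, last_start, probe_count) tuples.
    """
    units = []
    run = []
    for start, intensity in probes:
        if intensity >= threshold:
            run.append(start)
            continue
        # A below-threshold probe ends the current run.
        if _qualifies(run, min_probes, window):
            units.append((run[0], run[-1], len(run)))
        run = []
    # Flush the final run, if any.
    if _qualifies(run, min_probes, window):
        units.append((run[0], run[-1], len(run)))
    return units
```

For probes spaced 46 bp apart, five consecutive above-threshold probes span 184 nt and are reported as one unit, whereas five probes spaced 100 bp apart span 400 nt and are not.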
Poly(A)-associated TARs:
Another set of criteria was used to find TARs in which the probe hybridization
intensities were correlated with the presence of a polyadenylation signal 3'
to the TAR. Here, transcription units were defined as five consecutive probes
with fluorescence intensities in the top 80th intensity percentile and within
a window of 250 nucleotides. The 3' region must also contain or be close to a
polyadenylation signal.
Transcription units with an associated polyadenylation signal of
"AATAAA" were assigned to a type I group, while those with
"ATTAAA" were type II. Only 100 of
these should occur at random in the genome under the null hypothesis of zero
transcription. The majority (1,991) were found to be within annotated exons,
and 952 were located more than 10 kb from an annotated gene. A total of 1,371
type I and 674 type II poly(A) sequences were identified within exons of
known genes. Of these, 1,289 (94%) of type I and 607 (90%) of type II instances were
found to be in the 3' exon of the gene.
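The type I / type II assignment can be sketched as a simple hexamer lookup. This is an illustration only: the function name is hypothetical, the caller is assumed to supply the sequence of the 3' region (how far downstream "close to" extends is not specified here), and the rule that the earliest signal wins when both hexamers are present is an assumption.

```python
POLYA_SIGNALS = {"AATAAA": "type I", "ATTAAA": "type II"}


def classify_polya_signal(three_prime_seq):
    """Return 'type I', 'type II', or None for a 3' region sequence.

    Assumption: if both hexamers occur, the one nearest the start of the
    supplied sequence decides the type.
    """
    seq = three_prime_seq.upper()
    hits = [
        (seq.find(signal), label)
        for signal, label in POLYA_SIGNALS.items()
        if signal in seq
    ]
    if not hits:
        return None
    # The two hexamers cannot start at the same offset, so min() is unambiguous.
    return min(hits)[1]
```

For example, classify_polya_signal("GGGAATAAAGG") returns "type I", and a sequence containing neither hexamer returns None.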
Verification
The TARs were validated using RT-PCR on human liver poly(A)+ RNA. Forty-eight
poly(A)-associated and 48 non-poly(A)-associated TARs were investigated.
In 94% (90/96) of cases, the PCR products were found to be of the expected
size in a single-pass assay.
Credits
These data were generated and analyzed by a collaboration between the labs of
Michael Snyder,
Mark Gerstein,
and Sherman Weissman at Yale University, together with
NASA Ames Research Center (Moffett Field, California) and Eloret Corporation
(Sunnyvale, California).
References
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S et al.
Global identification of human transcribed sequences with
genome tiling arrays.
Science. 2004 Dec 24;306(5705):2242-6.
Stolc V, Gauhar Z, Mason C, Halasz G, van Batenburg MF, Rifkin SA, Hua S,
Herreman T, Tongprasit W, Barbano PE et al.
A gene expression map for the euchromatic genome of Drosophila melanogaster.
Science. 2004 Oct 22;306(5696):655-60.