Description
This track shows AceView gene models constructed from
cDNA by Danielle and Jean Thierry-Mieg at NCBI, using their AceView program.
AceView is unique in that it defines genes genome-wide using only,
but exhaustively, the experimental cDNA sequences from the species itself.
The analysis exploits sophisticated cDNA-to-genome co-alignment algorithms
and the quality of the genome sequence to provide a comprehensive and
non-redundant representation of the GenBank, dbEST, GSS, Trace and RefSeq
cDNA sequences. The next release, later in 2011, will also include the data
deposited in SRA (or assimilated public repository) as part of the SEQC
collaborative project led by Leming Shi from FDA and involving
high throughput RNA sequences provided by Helicos, Illumina,
LifeTech SOLiD and Roche 454, which greatly refine and enrich
the gene models.
In a way, the AceView transcripts represent a fully annotated non-redundant
"nr" view of the public RNAs, minus cloning artifacts, contaminants and
poor-quality sequences. AceView transcripts currently represent a tenfold
compaction relative to the raw data, with minimal loss of sequence
information.
87% of the public RNA sequences are coalesced into AceView alternative
transcripts and genes, thereby identifying close to twice as many main
genes as there are "known genes" in both human and mouse. 18% to 25% of
the spliced genes appear non-coding, in mouse and human respectively.
Alternative transcripts are prominent in both species. On average, a human gene
produces eight distinct alternatively spliced forms from three promoters,
with three non-overlapping terminal exons, three cassette exons and four
internal donor or acceptor sites.
The AceView site further proposes a thorough biological annotation
of the reconstructed genes, including association to diseases and tissue
specificity of the alternative transcripts.
AceView combines respect for the experimental data with extensive quality
control. Evaluated in the ENCODE regions, AceView transcripts are close to
indistinguishable from the manually curated Gencode reference genes
(see Thierry-Mieg, 2006, or compare the two tracks in the Genome Browser),
but over the entire genome the number of transcripts exceeds Havana/Vega
by a factor of three and RefSeq by a factor of six.
For more information on the different gene tracks, see our Genes FAQ.
Display Conventions and Configuration
This track follows the display conventions for gene tracks.
All gene models displayed at UCSC are in the "cDNA-supported" class
and are displayed in pink.
The track description page offers the following filter and configuration
options:
- Color track by codons: Select the genomic codons option
to color and label each codon in a zoomed-in display to facilitate validation
and comparison to gene predictions. Click the Codon coloring help link
on the track description page for more information about this feature.
Click the "AceView Gene Summary" link on an individual transcript's
details page to access the gene on the NCBI AceView website.
Methods
The millions of cDNA sequences available from the public databases
(GenBank, dbEST, GSS, Traces, etc.) are aligned cooperatively on the
genome sequence, taking care to keep the paired 5' and 3' reads from
single clones associated in the same transcript.
Useful information about tissue, stage, publications, isolation procedure
and so on is gathered. AceView alignments on the genome use knowledge
of sequencing errors gained from analyzing sequencing traces and cooperative
refinements. They are usually obtained over the entire length of the EST
or mRNA (average 98.8% aligned, 0.2% mismatches in mRNAs or 95.5% aligned,
1.4% mismatches in ESTs).
Multiple alignments are evaluated and the sequences are stringently
kept only in their best position genome-wide.
Less than 1% of the mRNAs and less than 2% of the ESTs will ultimately be
aligned in more than one gene, usually among the ~1% of genes that are
closely repeated.
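The "best position genome-wide" rule can be illustrated with a minimal sketch. The Alignment record and its score field are hypothetical and are not taken from AceView; the source does not say how alignments are ranked, and this sketch keeps exactly one position per accession, whereas AceView retains a small fraction of sequences in more than one gene.

```python
from collections import namedtuple

# Hypothetical record: "score" stands in for whatever quality measure
# is used to rank competing alignments of the same sequence.
Alignment = namedtuple("Alignment", "accession chrom start end score")


def best_positions(alignments):
    """Keep only the highest-scoring alignment for each accession."""
    best = {}
    for aln in alignments:
        current = best.get(aln.accession)
        # The first alignment seen wins ties in this simplified sketch.
        if current is None or aln.score > current.score:
            best[aln.accession] = aln
    return best


candidates = [
    Alignment("EST_1", "chr1", 1000, 1500, 480),
    Alignment("EST_1", "chr7", 9000, 9400, 310),
    Alignment("mRNA_1", "chr2", 200, 2200, 1990),
]
kept = best_positions(candidates)
```

Here EST_1 is retained only at its chr1 position, because that alignment has the higher hypothetical score.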
The cDNA sequences are then processed and cleaned: the vectors and polyA are
clipped, the reads presumably submitted on the wrong strand are flipped,
and the small insertion or deletion polymorphisms are identified.
Possible cDNA clone rearrangements or anomalous alignments are flagged
and filtered, in a manner akin to manual curation, so as not to lose unique
valuable information while avoiding pollution of the database with poorly
supported anomalous data.
Unfortunately, cDNA libraries are still far from saturation,
because until the advent of high-throughput sequencing, cDNA sequences were
difficult to obtain. Yet they are the cleanest and most reliable
information for defining the molecular genes. For this reason, a single
good-quality cDNA sequence, aligned with standard introns on the genome,
is considered sufficient evidence for a given spliced mRNA fragment.
In contrast, un-spliced alignments could reflect genomic contamination
of cDNA libraries, and non-coding single exon genes are reported
only if they are supported by six or more accessions.
The numerous single-exon TARs supported by five or fewer cDNAs belong to what
is termed ‘the cloud’ (not displayed on the UCSC Genome Browser,
but annotated in AceView and downloadable separately from the
ftp site).
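The evidence thresholds described above can be summarized in a short sketch. The function name and return labels are illustrative only. The sketch assumes that every model without a standard intron is a non-coding single-exon candidate; the text does not state how single-exon coding genes are handled, so they are outside its scope.

```python
MIN_UNSPLICED_SUPPORT = 6  # "six or more accessions" in the text above


def evidence_class(has_standard_intron, n_accessions):
    """Classify a model as 'reported' or 'cloud' from its cDNA support.

    Assumption: a model without a standard intron is treated as a
    non-coding single-exon candidate.
    """
    if n_accessions < 1:
        raise ValueError("a model needs at least one supporting accession")
    if has_standard_intron:
        # One good-quality spliced cDNA is considered sufficient evidence.
        return "reported"
    if n_accessions >= MIN_UNSPLICED_SUPPORT:
        return "reported"
    # Unspliced and supported by five or fewer cDNAs.
    return "cloud"
```

For example, evidence_class(True, 1) and evidence_class(False, 6) both give "reported", while evidence_class(False, 5) gives "cloud".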
The cDNA sequences are clustered into a minimal number of alternative
transcript variants, preferring partial transcripts to artificially
completed ones. Sequences are concatenated by simple contact,
but the combinatorics are avoided by allowing each cDNA accession
to contribute to a single alternative variant, preferably one
where it merges silently without bringing any new sequence information.
As a result, all shorter reads compatible with a full-length mRNA will be
absorbed in that transcript and will not be used to extend other incompatible
transcripts.
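The absorption of shorter reads into a compatible longer transcript can be sketched as follows. This is a deliberately simplified illustration and not the AceView algorithm: it covers only the "silent merge" case, in which a read adds no new sequence, and omits extension by simple contact. The data layout (start, end, introns), with half-open coordinates, is an assumption.

```python
def is_contained(read, model):
    """True if the read lies inside the model and agrees with its introns."""
    r_start, r_end, r_introns = read
    m_start, m_end, m_introns = model
    if r_start < m_start or r_end > m_end:
        return False
    # Model introns overlapping the read's span must match the read exactly.
    inside = tuple(i for i in m_introns if i[1] > r_start and i[0] < r_end)
    return inside == tuple(r_introns)


def assign_reads(reads):
    """Assign each accession to exactly one variant, longest reads first."""
    ordered = sorted(reads.items(),
                     key=lambda item: item[1][1] - item[1][0],
                     reverse=True)
    variants = []  # list of (model, [accessions])
    for accession, read in ordered:
        for model, members in variants:
            if is_contained(read, model):
                members.append(accession)  # merges silently
                break
        else:
            variants.append((read, [accession]))  # seeds a new variant
    return variants


reads = {
    "mRNA_1": (100, 1000, ((200, 300), (500, 600))),
    "EST_1": (150, 400, ((200, 300),)),   # compatible with mRNA_1
    "EST_2": (150, 400, ((200, 350),)),   # different acceptor site
}
variants = assign_reads(reads)
```

Processing the longest sequences first lets a full-length mRNA absorb every shorter compatible read, so those reads are no longer available to extend an incompatible transcript.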
About 70% of the variants, clearly identified on the Acembly site,
have their entire protein coding region supported by a single cDNA;
the others may be illicit concatenations that may be split and associated
differently when more data become available. The main sequence of the
transcript used in the annotation is that of the footprint of the transcript
on the genome, which is of better quality than the mRNAs: this procedure
corrects up to 2% of sequencing errors. Single-base insertions, deletions,
transitions and transversions are shown graphically in the mRNA view,
where frequent SNPs become evident.
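As an illustration of the substitution classes named above, the following sketch labels a single-base difference between the genome footprint and a cDNA read as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion. The function name is illustrative and is not part of AceView.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(genome_base, cdna_base):
    """Label a single-base difference between genome and cDNA."""
    g, c = genome_base.upper(), cdna_base.upper()
    valid = PURINES | PYRIMIDINES
    if g not in valid or c not in valid:
        raise ValueError("expected single bases from A, C, G, T")
    if g == c:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = {g, c} <= PURINES or {g, c} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, A to G and C to T are transitions, while A to C is a transversion.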
Putative protein-coding regions are predicted from the mRNA sequence and
annotated using BlastP, PFAM, Psort2, and comparison to AceView proteins
from other species. The best proteins are scored (see the
AceView Overview on the Acembly site) and transcripts
are proposed to be protein-coding or non-coding.
Expression, cDNA support, tissue specificity, sequences of alternative
transcripts, introns and exons, alternative promoters, alternative
exons and alternative polyadenylation sites are evaluated and annotated
in rich tables on the Acembly web site.
The reconstructed alternative transcripts are then grouped into genes
if they share at least one exact intron boundary or if they have
substantial sequence overlap (80% of the sequence of one included
in the other). Coding and non-coding genes are defined, and genes
in antisense are flagged.
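The grouping rule above can be sketched with a small union-find. The sketch makes several assumptions that the text does not confirm: all transcripts lie on the same chromosome and strand, each transcript is a list of half-open (start, end) exons, and "sequence overlap" is measured in exonic bases relative to the shorter transcript.

```python
def boundaries(exons):
    """Exon ends and starts that flank introns, tagged by kind."""
    exons = sorted(exons)
    ends = {("exon_end", e) for _, e in exons[:-1]}
    starts = {("exon_start", s) for s, _ in exons[1:]}
    return ends | starts


def exonic_overlap(exons_a, exons_b):
    """Number of bases shared by the exons of two transcripts."""
    return sum(max(0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)


def same_gene(exons_a, exons_b, min_fraction=0.8):
    if boundaries(exons_a) & boundaries(exons_b):
        return True  # at least one exact intron boundary in common
    shorter = min(sum(e - s for s, e in exons_a),
                  sum(e - s for s, e in exons_b))
    return exonic_overlap(exons_a, exons_b) >= min_fraction * shorter


def group_into_genes(transcripts):
    """Union-find over transcript names; returns sorted groups of names."""
    names = list(transcripts)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if same_gene(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)
    genes = {}
    for name in names:
        genes.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in genes.values())


transcripts = {
    "t1": [(100, 200), (300, 400)],
    "t2": [(150, 200), (300, 450)],  # shares intron boundaries with t1
    "t3": [(1000, 1100)],            # isolated
    "t4": [(310, 390)],              # fully contained in an exon of t1
}
genes = group_into_genes(transcripts)
```

Because grouping is transitive in this sketch, two transcripts that share nothing directly can still fall in the same gene through a third transcript that links them.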
AceView genes are matched by molecular contact to Entrez genes and
named according to the Entrez Gene nomenclature. For novel genes not
in Entrez, AceView creates new gene names that are maintained from
release to release until the genes receive an official or Entrez gene
name.
Knowledge about each gene is annotated wherever there is PubMed support.
Selected functional annotations are gathered from other sources,
including Entrez. Candidate tested disease associations
are extracted directly from PubMed as well as from OMIM and GAD.
Finally, lists of the most closely related genes by function, pathway,
protein complex, GO annotation, disease, cellular localization or all
criteria taken together are proposed, to stimulate research and
development.
Credits
Thanks to
Danielle and Jean Thierry-Mieg at NCBI for providing this track
for human, worm and mouse.
References
Thierry-Mieg D, Thierry-Mieg J.
AceView: a comprehensive cDNA-supported gene and transcripts annotation.
Genome Biol. 2006;7 Suppl 1:S12.1-14.
PMID: 16925834; PMC: PMC1810549
AceView web site:
https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly