Description
This track shows AceView gene models constructed from
cDNA and genomic evidence by Danielle and Jean Thierry-Mieg
using the Acembly program.
AceView is the only database that defines the genes
genome-wide by using only, but exhaustively, the public experimental
cDNA sequences from the same species. The analysis relies on the
quality of the genome sequence and exploits sophisticated cDNA-to-genome
co-alignment algorithms to provide a comprehensive and
non-redundant representation of the GenBank, dbEST, GSS, Trace and
RefSeq cDNA sequences. In a way, the AceView transcripts represent a
fully annotated non-redundant ‘nr’ view of the public
RNAs, minus cloning artefacts, contaminations and bad quality
sequences. AceView transcripts represent a tenfold compaction
relative to the raw data, with minimal loss of sequence
information.
87% of the public RNA sequences are coalesced into AceView alternative
transcripts and genes, thereby identifying close to twice as many main
genes as there are "known genes" in both human and
mouse. Of the spliced genes, 18% in mouse and 25% in human appear
non-coding. Alternative transcripts are prominent in both
species. The typical human gene produces on average eight distinct
alternatively spliced forms from three promoters and with three
non-overlapping terminal exons. It has on average three cassette exons
and four internal donor or acceptor sites. The AceView site further
proposes a thorough biological annotation of the reconstructed genes,
including association to diseases and tissue specificity of the
alternative transcripts.
AceView combines respect for the experimental data with extensive
quality control. Evaluated in the ENCODE regions, AceView transcripts
are close to indistinguishable from the manually curated Gencode
reference genes (see Thierry-Mieg, 2006, or compare the two tracks in the
Genome Browser), but over the entire genome the number of transcripts exceeds
Havana/Vega by a factor of three and RefSeq by a factor of six.
Display Conventions and Configuration
This track follows the display conventions for gene tracks.
Gene models that fall into the "main" class are displayed in purple;
"putative" genes are displayed in pink.
The main genes include at least one transcript which is spliced or
putatively protein coding. Spliced genes contain at least one
well-defined standard intron, i.e., an intron with a GT-AG or GC-AG
boundary, supported by at least one clone matching exactly, with no
ambiguous bases, 8 bases of the genome on each side of the intron.
The putative genes have no standard intron and do not encode good
proteins, yet are supported by more than six cDNA clones.
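The class definitions above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading, not the Acembly implementation: the function and field names are hypothetical, coordinates are assumed to be 0-based on the forward strand, and the clone threshold is a parameter because this page states it both as "more than six" clones and as "six or more" accessions.

```python
from dataclasses import dataclass

# Intron boundaries that the text calls "standard".
STANDARD_BOUNDARIES = {("GT", "AG"), ("GC", "AG")}


def is_standard_intron(genome: str, start: int, end: int) -> bool:
    """True if genome[start:end] (the intron) begins with GT or GC and ends with AG."""
    if end - start < 4:
        return False
    donor = genome[start:start + 2].upper()
    acceptor = genome[end - 2:end].upper()
    return (donor, acceptor) in STANDARD_BOUNDARIES


def clone_supports_intron(genome: str, start: int, end: int,
                          clone: str, junction: int, flank: int = 8) -> bool:
    """True if the spliced clone matches the genome exactly, with no ambiguous
    bases, over `flank` bases on each side of the intron.

    `junction` is the index in `clone` where the intron was spliced out.
    """
    if start < flank or junction < flank:
        return False
    genome_left = genome[start - flank:start].upper()
    genome_right = genome[end:end + flank].upper()
    clone_left = clone[junction - flank:junction].upper()
    clone_right = clone[junction:junction + flank].upper()
    if len(genome_right) < flank or len(clone_right) < flank:
        return False
    if not set(clone_left + clone_right) <= set("ACGT"):
        return False
    return clone_left == genome_left and clone_right == genome_right


@dataclass
class GeneEvidence:
    has_supported_standard_intron: bool  # at least one transcript is spliced
    has_good_protein: bool               # at least one transcript looks coding
    supporting_clones: int


def classify_gene(ev: GeneEvidence, min_clones: int = 6) -> str:
    """Return "main", "putative" or "cloud"; min_clones=6 (inclusive) is an assumption."""
    if ev.has_supported_standard_intron or ev.has_good_protein:
        return "main"
    if ev.supporting_clones >= min_clones:
        return "putative"
    return "cloud"
```

For example, a gene with no standard intron, no good protein and seven supporting clones would be classified as "putative" under these assumptions.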
The track description page offers the following filter and configuration
options:
- Gene Class filter: Select the main or putative
option to filter the display.
- Color track by codons: Select the genomic codons option
to color and label each codon in a zoomed-in display to facilitate validation
and comparison to gene predictions. Click the Codon coloring help link
on the track description page for more information about this feature.
Methods
The millions of cDNA sequences available from the public databases
(GenBank, dbEST, GSS, Traces, etc.) are aligned cooperatively on the
genome sequence, taking care to keep the paired 5' and 3' reads from
single clones associated in the same transcript.
Useful information about tissue, stage, publications, isolation
procedure and so on is gathered.
AceView alignments on the genome use knowledge of sequencing errors
gained from analyzing sequencing traces and cooperative
refinements. They usually cover the entire length of the
EST or mRNA (on average 98.8% aligned with 0.2% mismatches for mRNAs,
and 95.5% aligned with 1.4% mismatches for ESTs).
Multiple alignments are evaluated and the sequences are stringently
kept only in their best position genome-wide. Less than 1% of the
mRNAs and less than 2% of the ESTs will ultimately be aligned in more
than one gene, usually in the ~1% closely repeated genes.
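As a rough illustration of keeping each sequence only in its best position, the Python sketch below scores every candidate alignment of a read and retains the top-scoring one, keeping several only on an exact tie (as might happen in closely repeated genes). The scoring formula, the mismatch penalty and all names are assumptions made for this example; the page does not describe how Acembly ranks alignments.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    read_id: str
    locus: str           # hypothetical label for the genomic position
    aligned_bases: int
    mismatches: int


def score(hit: Hit, mismatch_penalty: int = 4) -> int:
    """Assumed score: aligned length minus a fixed penalty per mismatch."""
    return hit.aligned_bases - mismatch_penalty * hit.mismatches


def best_positions(hits: list[Hit]) -> dict[str, list[Hit]]:
    """Keep, for each read, only the hit(s) with the highest score."""
    best: dict[str, tuple[int, list[Hit]]] = {}
    for hit in hits:
        s = score(hit)
        kept = best.get(hit.read_id)
        if kept is None or s > kept[0]:
            best[hit.read_id] = (s, [hit])
        elif s == kept[0]:
            kept[1].append(hit)   # exact tie: keep both positions
    return {read_id: kept_hits for read_id, (_, kept_hits) in best.items()}
```

A read with one clearly better alignment ends up in a single locus, while a read that aligns equally well to two repeated genes is kept in both.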
The cDNA sequences are then processed and cleaned: the vectors and
polyA are clipped, the reads submitted on the wrong strand are
flipped, and the small insertion or deletion polymorphisms are
identified.
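Two of the cleaning steps listed above, flipping reads submitted on the wrong strand and clipping the polyA tail, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the minimum tail length of 10 is arbitrary, the strand flag is assumed to be known in advance, and vector clipping and indel detection are not shown.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def clip_poly_a(seq: str, min_tail: int = 10) -> str:
    """Remove a trailing run of A's if it is at least min_tail bases long."""
    stripped = seq.rstrip("Aa")
    if len(seq) - len(stripped) >= min_tail:
        return stripped
    return seq


def clean_read(seq: str, on_wrong_strand: bool, min_tail: int = 10) -> str:
    """Flip the read if needed, then clip its polyA tail.

    Flipping first matters: a polyA tail on a wrong-strand read appears
    as a leading polyT run until the read is reverse-complemented.
    """
    if on_wrong_strand:
        seq = reverse_complement(seq)
    return clip_poly_a(seq, min_tail)
```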
Possible cDNA clone rearrangements or anomalous alignments are
flagged and filtered, in a way akin to manual curation, so as not to
lose unique valuable information while avoiding pollution of the
database with poorly supported anomalous data.
Unfortunately, cDNA libraries are still far from saturation, so
after 20% of the suspicious entries have been removed, a single good-quality
cDNA sequence, aligned with standard introns on the genome, is
considered sufficient evidence for a given mRNA structure. That is
because cDNA sequences are difficult to obtain, but they remain the
cleanest and most reliable information to best define the molecular
genes. Unspliced non-coding genes are however reported (in the
putative class) only if they are supported by six or more accessions. Others
belong to what is termed ‘the cloud’ (not displayed on
the UCSC Genome Browser).
The cDNA sequences are clustered into a minimal number of
alternative transcript variants, preferring partial transcripts to
artificially extended ones. Sequences are concatenated by simple
contact, but a combinatorial explosion is avoided by allowing each cDNA
accession to contribute to a single alternative variant, preferably
one where it merges silently without bringing any new sequence
information. As a result, for instance, all shorter reads compatible
with a full-length mRNA will be absorbed in that transcript and will not
be available to allow for extensions on other incompatible
transcripts.
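The absorption rule described above can be illustrated with a greedy Python sketch: reads are processed from longest to shortest, and each read joins the first existing variant into which it merges silently, otherwise it founds a new variant. This covers only the silent-absorption case, not extension by contact, and the data model (a read as a span plus a set of intron coordinates on one strand) and all names are assumptions for illustration, not the Acembly algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    start: int
    end: int
    introns: frozenset  # set of (intron_start, intron_end) pairs


@dataclass
class Variant:
    start: int
    end: int
    introns: frozenset
    members: list = field(default_factory=list)


def merges_silently(read: Read, variant: Variant) -> bool:
    """True if the read lies within the variant and adds no new structure."""
    if read.start < variant.start or read.end > variant.end:
        return False
    # Variant introns overlapping the read's span must be exactly the read's introns.
    overlapping = frozenset(
        (s, e) for (s, e) in variant.introns if s < read.end and read.start < e
    )
    return overlapping == read.introns


def assign_reads(reads: list[Read]) -> list[Variant]:
    """Assign every read to exactly one variant, longest reads first."""
    variants: list[Variant] = []
    for read in sorted(reads, key=lambda r: r.end - r.start, reverse=True):
        for variant in variants:
            if merges_silently(read, variant):
                variant.members.append(read.name)
                break
        else:
            variants.append(
                Variant(read.start, read.end, read.introns, [read.name])
            )
    return variants
```

In this sketch a short EST that is compatible with a full-length mRNA is absorbed by it and is therefore no longer available to support any other variant, even one it would also fit.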
About 70% of the variants, clearly identified on the Acembly site, have
their entire coding region supported by a single cDNA; the others may
be illicit concatenations that could be split when more data become
available.
For each transcript, the consensus sequence of the cDNAs most
compatible with the genome sequence is generated. Single-base insertions,
deletions, transitions and transversions are shown graphically in the mRNA
view, where frequent SNPs become evident.
The main sequence of the transcript used in the annotation is that
of the footprint of the transcript on the genome, which is of better
quality than the mRNAs: this procedure corrects up to 2% sequencing
errors.
Putative protein-coding regions are predicted from the mRNA
sequence and annotated using BlastP, PFAM, Psort2, and comparison to
AceView proteins from other species. Best proteins are scored (see the
FAQ on the Acembly site) and transcripts are putatively proposed to be
protein-coding or non-coding.
Expression, cDNA support, tissue specificity, sequences of
alternative transcripts, introns and exons, alternative promoters,
alternative exons and alternative polyadenylation sites are evaluated
and annotated on the Acembly web site.
The reconstructed alternative transcripts are then grouped into
genes if they share at least one exact intron boundary or if they have
substantial sequence overlap.
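Grouping by shared intron boundaries is a connected-components problem, which the Python sketch below solves with a small union-find structure. It handles only the intron-boundary criterion: the "substantial sequence overlap" criterion is omitted because this page gives no threshold for it. All transcripts are assumed to lie on the same chromosome and strand, and the names are hypothetical.

```python
def group_into_genes(transcripts: dict[str, list[tuple[int, int]]]) -> list[list[str]]:
    """Group transcripts that share at least one exact intron boundary.

    `transcripts` maps a transcript name to its introns as (start, end) pairs.
    Returns a sorted list of genes, each a sorted list of transcript names.
    """
    parent = {name: name for name in transcripts}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[tuple[str, int], str] = {}
    for name, introns in transcripts.items():
        for start, end in introns:
            for boundary in (("left", start), ("right", end)):
                other = first_seen.setdefault(boundary, name)
                parent[find(name)] = find(other)  # merge the two groups

    genes: dict[str, set[str]] = {}
    for name in transcripts:
        genes.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in genes.values())
```

Because the grouping is transitive, two transcripts with no boundary in common still end up in the same gene if a third transcript shares a boundary with each of them.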
Coding and non-coding genes are defined, and genes in antisense are
flagged.
AceView genes are matched molecularly to Entrez genes and named
according to the official nomenclature or the Entrez Gene
nomenclature. For novel genes not in Entrez, AceView creates new gene
names that are maintained from release to release until the genes receive
an official or Entrez gene name.
Each gene is annotated in depth, with the intention of AceView serving
as a one-stop knowledgebase for systems biology. Selected functional
annotations are gathered from various sources, including expression
data, protein interactions and GO annotations. In particular, possible
disease associations are extracted directly from PubMed, in addition
to OMIM and GAD, and the users can help refine those annotations.
Finally, lists of the most closely related genes by function,
pathway, protein complex, GO annotation, disease, cellular
localization or all criteria taken together are proposed, to
stimulate research and development.
Click the "AceView Gene Summary" link on an individual transcript's
details page to access the gene on the NCBI AceView website.
Credits
Thanks to Danielle and Jean Thierry-Mieg at NCBI for providing this track
for human, worm and mouse.
References
Thierry-Mieg D, Thierry-Mieg J.
AceView: a comprehensive cDNA-supported gene and transcripts annotation.
Genome Biol. 2006;7 Suppl 1:S12.1-14.
PMID: 16925834; PMC: PMC1810549
AceView web site:
https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly