Description
The Gencode Gene track shows high-quality manual annotations in the
ENCODE regions generated by the
GENCODE project.
A companion track, Gencode Introns, shows experimental gene structure
validations for these annotations.
The gene annotations are colored based on the Havana annotation type.
Known and validated transcripts
are colored dark green,
putative and unconfirmed are light green,
pseudogenes are blue,
and artifacts are grey.
The transcript types are defined in more detail in the accompanying table.
The Gencode project recommends that the annotations
with known and validated transcripts (i.e., the types Known,
Novel_CDS, Novel_transcript_gencode_conf, and
Putative_gencode_conf, which are colored dark green in the track display)
be used as the reference annotation.
Type | Color | Description
Known | dark green | Known protein coding genes (referenced in Entrez Gene, NCBI)
Novel_CDS | dark green | Novel protein coding genes annotated by Havana (not referenced in Entrez Gene, NCBI)
Novel_transcript_gencode_conf | dark green | Novel transcripts annotated by Havana (no ORF assigned) with at least one junction validated by RT-PCR
Putative_gencode_conf | dark green | Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) with at least one junction validated by RT-PCR
Novel_transcript | light green | Novel transcripts annotated by Havana (no ORF assigned) not validated by RT-PCR
Putative | light green | Putative transcripts (similar to "novel transcripts", EST supported, short, no viable ORF) not validated by RT-PCR
TEC | light green | Single exon objects (supported by multiple ESTs with polyA sites and signals) undergoing experimental validation/extension
Processed_pseudogene | blue | Pseudogenes arising via retrotransposition (exon structure of parent gene lost)
Unprocessed_pseudogene | blue | Pseudogenes arising via gene duplication (exon structure of parent gene retained)
Artifact | grey | Transcript evidence and/or its translation equivocal
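For readers who want to apply the recommendation above programmatically, the following is a minimal Python sketch. The type names and colors are transcribed from the table; the record layout (a list of dicts with a "type" key) and the function names are hypothetical and are not part of any Gencode or browser API.

```python
# Type-to-color mapping transcribed from the table above.
TYPE_COLORS = {
    "Known": "dark green",
    "Novel_CDS": "dark green",
    "Novel_transcript_gencode_conf": "dark green",
    "Putative_gencode_conf": "dark green",
    "Novel_transcript": "light green",
    "Putative": "light green",
    "TEC": "light green",
    "Processed_pseudogene": "blue",
    "Unprocessed_pseudogene": "blue",
    "Artifact": "grey",
}


def reference_types(type_colors=TYPE_COLORS):
    """Return the types displayed in dark green, i.e. the recommended reference set."""
    return [t for t, color in type_colors.items() if color == "dark green"]


def select_reference(records, type_key="type"):
    """Keep only records whose type belongs to the reference set.

    `records` is a hypothetical list of dicts; the key holding the
    transcript type is an assumption and can be changed via `type_key`.
    """
    keep = set(reference_types())
    return [r for r in records if r.get(type_key) in keep]


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [
        {"name": "tx1", "type": "Known"},
        {"name": "tx2", "type": "Putative"},
        {"name": "tx3", "type": "Putative_gencode_conf"},
    ]
    print([r["name"] for r in select_reference(example)])
```

Deriving the reference set from the color column, rather than listing it separately, keeps the filter consistent with the table if a type is ever recolored.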
Methods
The Human and Vertebrate Analysis and Annotation manual curation process
(HAVANA) was
used to produce these annotations.
Finished genomic sequence was analyzed on a clone-by-clone basis using a
combination of similarity searches against DNA and protein databases, as
well as a series of ab initio gene predictions. Nucleotide sequence
databases were searched with WUBLASTN and significant hits were realigned
to the unmasked genomic sequence by EST2GENOME. WUBLASTX was used to search
the Uniprot protein database, and the accession numbers of significant hits
were retrieved from the Pfam database. Hidden Markov models for Pfam protein
domains were aligned against the genomic sequence using Genewise to provide
annotation of protein domains.
A number of ab initio
prediction algorithms were also run: Genscan and Fgenesh for genes, tRNAscan
to find tRNA genes, and Eponine TSS for transcription start site predictions.
The annotators used the (AceDB-based) Otterlace interface to create and
edit gene objects, which were then stored in a local database named
Otter. In cases where predicted transcript structures from Ensembl
are available, these can be viewed from within the Otterlace interface and
may be used as starting templates for gene curation. Annotation in the Otter
database is submitted to the EMBL/Genbank/DDBJ nucleotide database.
Verification
The gene objects selected for verification came from various
computational prediction methods and HAVANA annotations.
RT-PCR and RACE experiments were performed on them, using a variety of human
tissues, to confirm their structure. Human cDNAs from 24 different
tissues (brain, heart, kidney, spleen, liver, colon, small intestine,
muscle, lung, stomach, testis, placenta, skin, peripheral blood
leucocytes, bone marrow, fetal brain, fetal liver, fetal kidney, fetal
heart, fetal lung, thymus, pancreas, mammary gland, prostate) were
synthesized using 12 poly(A)+ RNAs from Origene, eight from Clemente
Associates/Quantum Magnetics and four from BD Biosciences as described in
[Reymond et al., 2002a,b]. The relative amount of each cDNA was
normalized by quantitative PCR using SYBR Green as the intercalator and an
ABI Prism 7700 Sequence Detection System.
Predicted human gene junctions were assayed experimentally by
RT-PCR as previously described and modified [Reymond, 2002b;
Mouse Genome Sequencing Consortium, 2002; Guigo, 2003].
Similar amounts of Homo
sapiens cDNAs were mixed with JumpStart REDTaq ReadyMix (Sigma) and 4
ng/µl primers (Sigma-Genosys) with a BioMek 2000 robot (Beckman). The
first ten cycles of PCR amplification used touchdown annealing, with the
temperature decreasing from 60 to 50°C; the next 30 cycles used an
annealing temperature of 50°C. Amplimers
were separated on "Ready to Run" precast gels (Pharmacia) and
sequenced. RACE experiments were performed with the BD SMART RACE cDNA
Amplification Kit following the manufacturer's instructions (BD
Biosciences).
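The touchdown annealing scheme described above can be illustrated with a short Python sketch that lists the annealing temperature for each cycle. The text gives only the endpoints (60 to 50°C over the first ten cycles, then 30 cycles at 50°C); the even 1°C-per-cycle decrement used here is an assumption, and the function name and parameters are hypothetical.

```python
def touchdown_schedule(start_c=60.0, end_c=50.0,
                       touchdown_cycles=10, plateau_cycles=30):
    """Return the annealing temperature (in °C) for each PCR cycle.

    Assumption: the temperature drops by an equal step on every
    touchdown cycle, so that the first plateau cycle is at `end_c`.
    The source states only the start and end temperatures.
    """
    step = (start_c - end_c) / touchdown_cycles
    ramp = [start_c - i * step for i in range(touchdown_cycles)]
    return ramp + [end_c] * plateau_cycles


if __name__ == "__main__":
    schedule = touchdown_schedule()
    print(len(schedule))   # total number of cycles
    print(schedule[:12])   # touchdown phase and the first plateau cycles
```

With the default values the schedule covers 40 cycles in total: ten touchdown cycles followed by 30 cycles at the final annealing temperature.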
Credits
Click here for a complete list of people who participated in the
GENCODE project.
References
Ashurst, J.L. et al.
The Vertebrate Genome Annotation (Vega) database.
Nucleic Acids Res 33 (Database Issue), D459-65
(2005).
Guigo, R. et al.
Comparison of mouse and human genomes followed by experimental
verification yields an estimated 1,019 additional genes.
Proc Natl Acad Sci U S A 100(3), 1140-5 (2003).
Mouse Genome Sequencing Consortium.
Initial sequencing and comparative analysis of the mouse
genome. Nature 420(6915), 520-62 (2002).
Reymond, A. et al.
Human chromosome 21 gene expression atlas in the mouse.
Nature 420(6915), 582-6 (2002).
Reymond, A. et al.
Nineteen additional unpredicted transcripts from human
chromosome 21. Genomics 79(6), 824-32 (2002).