Description
The GENCODE Genes track (version 7, May 2011) shows high-quality manual
annotations merged with evidence-based automated annotations across the entire
human genome generated by the
GENCODE project.
The GENCODE gene set presents a full merge
between HAVANA manual annotation and ENSEMBL automatic annotation.
Priority is given to the manually curated HAVANA annotation, using predicted
ENSEMBL annotations when there are no corresponding manual annotations.
The annotation was carried out on genome assembly GRCh37 (hg19).
NOTE: Due to UCSC Genome Browser using the NC_001807 mitochondrial genome sequence
(chrM) and GENCODE annotating the NC_012920 mitochondrial sequence, the
GENCODE mitochondrial sequences are not available in the UCSC Genome Browser.
These annotations are available for download in the GENCODE GTF files.
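As a rough illustration of pulling the mitochondrial annotations out of a downloaded GTF file, the sketch below parses standard 9-column GTF records and keeps those on one chromosome. The function names are hypothetical, the chromosome label `chrM` is an assumption about how the GTF names the mitochondrial sequence, and the attribute parsing is simplified (repeated keys overwrite each other, and semicolons inside quoted values are not handled).

```python
def parse_gtf_line(line):
    """Split one GTF record into selected fixed columns plus an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    attributes = {}
    # Column 9 holds 'key "value";' pairs separated by semicolons.
    for part in fields[8].split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    return {
        "chrom": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": attributes,
    }


def mitochondrial_records(lines, chrom="chrM"):
    """Yield parsed records on `chrom`, skipping comment and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        record = parse_gtf_line(line)
        if record["chrom"] == chrom:
            yield record
```

Passing an open file handle as `lines` streams the file without loading it all into memory.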
NOTE: We try to synchronize the release cycles for GENCODE, HAVANA and
Ensembl. This GENCODE version 7 corresponds to Ensembl 62 from 13 April 2011
and Vega 23-03-2011. Also see:
GENCODE project.
Display Conventions and Configuration
The annotations are divided into separate tracks based on type of annotation.
The basic set of coding and non-coding transcripts is a subset of
the comprehensive set, selected to provide a simplified view of the transcript
set that suits the needs of most users. The selection algorithm is described in
the next section.
The available tracks are:
- GENCODE Basic set - subset of the GENCODE coding and non-coding
transcript annotations, including polymorphic pseudogenes. This includes
both manual and automatic annotations. The selection criteria are
described in the next section.
This is a subset of the comprehensive set.
- GENCODE Comprehensive set - all GENCODE coding and non-coding transcript annotations,
including polymorphic pseudogenes. This includes both manual and
automatic annotations, except pseudogenes. This is a superset of
the basic set.
- GENCODE Pseudogenes - all pseudogene annotations except polymorphic pseudogenes
- GENCODE 2-way Pseudogenes - pseudogenes predicted by both the Yale
Pseudopipe and UCSC Retrofinder pipelines. The set was derived by looking
for an overlap of 50 base pairs between pseudogenes from the two sets,
based on their genomic locations, i.e. chromosomal coordinates. When multiple
Pseudopipe predictions map to a single Retrofinder prediction, only one match is kept
for the 2-way consensus set.
- GENCODE PolyA - polyA signals and sites manually annotated on
the genome based on transcribed evidence (ESTs and cDNAs) of the 3' end of
transcripts containing at least 3 A's that do not match the genome.
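The 2-way consensus rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `Interval` type, the function names, the half-open coordinate convention, and the tie-breaking choice (keep the Pseudopipe prediction with the largest overlap) are all hypothetical. The description states only that a 50 bp overlap is required and that one match is kept.

```python
from typing import NamedTuple

MIN_OVERLAP = 50  # base pairs, as stated in the track description


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, half-open coordinates (assumption)
    end: int


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by a and b (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def two_way_consensus(retrofinder, pseudopipe):
    """Pair each Retrofinder prediction with at most one Pseudopipe prediction."""
    pairs = []
    for retro in retrofinder:
        best, best_len = None, 0
        for pp in pseudopipe:
            shared = overlap(retro, pp)
            # Assumption: among qualifying matches, keep the largest overlap.
            if shared >= MIN_OVERLAP and shared > best_len:
                best, best_len = pp, shared
        if best is not None:
            pairs.append((retro.name, best.name))
    return pairs
```

The nested loop is quadratic and is meant only to make the rule explicit; a genome-scale comparison would normally sort the intervals or use an interval index.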
The GENCODE basic set is intended to provide a simplified subset of
the GENCODE transcript annotations that will be useful to the majority of
users. Annotations are selected for the basic set one locus at a time,
with coding and non-coding transcripts within each locus handled
separately. The goal is to use the higher-quality
transcript annotations while still having some annotation present for
each locus.
The selection criteria for a given locus are:
- Coding transcripts (including polymorphic pseudogenes):
  - If there are any full length coding transcripts that are not
    nonsense-mediated decay or problem transcripts, then only they are
    included in the basic set.
  - Otherwise, use the coding transcript with the largest CDS.
- Non-coding transcripts:
  - If there are any full length non-coding transcripts and they have a
    well characterized BioType (see below), then only they are included in the
    basic set.
  - Otherwise, use the largest non-coding transcript.
Non-coding transcript categories
Non-coding transcripts are categorized using
their BioType
and the following criteria:
- well characterized: antisense, lincRNA, miRNA, Mt_rRNA, Mt_tRNA,
rRNA, snoRNA, snRNA
- poorly characterized: non_coding, processed_transcript,
retrotransposed, misc_RNA
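The per-locus selection rules and the BioType categories above can be put together in a short sketch. It is a hedged reading of the prose, not the code used to build the track: the `Transcript` fields and the `select_basic` name are hypothetical, the well characterized test is applied to each transcript individually (the text is ambiguous on this point), and the caller is assumed to pass only the coding and non-coding transcripts of a single locus.

```python
from dataclasses import dataclass

WELL_CHARACTERIZED = {
    "antisense", "lincRNA", "miRNA", "Mt_rRNA", "Mt_tRNA",
    "rRNA", "snoRNA", "snRNA",
}


@dataclass
class Transcript:
    tid: str
    coding: bool
    biotype: str
    full_length: bool
    length: int            # CDS length if coding, transcript length otherwise
    nmd: bool = False      # nonsense-mediated decay
    problem: bool = False  # problem transcript


def select_basic(locus):
    """Return the transcripts of one locus that go into the basic set."""
    selected = []

    # Coding: prefer full length transcripts that are neither NMD nor problem.
    coding = [t for t in locus if t.coding]
    good = [t for t in coding if t.full_length and not t.nmd and not t.problem]
    if good:
        selected.extend(good)
    elif coding:
        selected.append(max(coding, key=lambda t: t.length))  # largest CDS

    # Non-coding: prefer full length transcripts with a well characterized BioType.
    noncoding = [t for t in locus if not t.coding]
    good = [t for t in noncoding
            if t.full_length and t.biotype in WELL_CHARACTERIZED]
    if good:
        selected.extend(good)
    elif noncoding:
        selected.append(max(noncoding, key=lambda t: t.length))  # largest transcript

    return selected
```

The fallback branches are what guarantee that a locus with only lower-quality transcripts still contributes one annotation of each kind it contains.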
Filtering
Items in the GENCODE Basic, Comprehensive and Pseudogene tracks
can be filtered using the following criteria:
- Transcript class: Filter by the basic biological function of a transcript
  annotation.
  - All - Don't filter by transcript class.
  - coding - Display protein coding transcripts, including polymorphic pseudogenes.
  - nonCoding - Display non-protein coding transcripts.
  - pseudo - Display pseudogene transcript annotations.
  - problem - Display problem transcripts (BioTypes of retained_intron, TEC, or disrupted_domain).
- Annotation Method: Filter by the method used to create the annotation.
  - All - Don't filter by annotation method.
  - manual - Display manually created annotations, including those that are
    also created automatically.
  - automatic - Display automatically created annotations, including those that are
    also created manually.
  - manual_only - Display manually created annotations that were
    not annotated by the automatic method.
  - automatic_only - Display automatically created annotations that were
    not annotated by the manual method.
- Transcript Type: Filter transcripts by BioType.
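The difference between the inclusive (manual, automatic) and exclusive (manual_only, automatic_only) Annotation Method options can be made concrete with a small sketch. The table and function names here are hypothetical and say nothing about how the Genome Browser implements the filter; each transcript is assumed to be a tuple of an identifier and two flags recording whether it was annotated manually and automatically.

```python
# Each predicate takes (manual, automatic) flags and decides whether to display.
METHOD_FILTERS = {
    "All": lambda manual, automatic: True,
    "manual": lambda manual, automatic: manual,
    "automatic": lambda manual, automatic: automatic,
    "manual_only": lambda manual, automatic: manual and not automatic,
    "automatic_only": lambda manual, automatic: automatic and not manual,
}


def filter_by_method(transcripts, method):
    """Return the IDs of (id, manual, automatic) tuples passing the filter."""
    keep = METHOD_FILTERS[method]
    return [tid for tid, manual, automatic in transcripts
            if keep(manual, automatic)]
```

A transcript annotated by both methods therefore appears under manual and automatic, but under neither of the two exclusive options.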
Coloring
The gene annotations are colored based on the annotation type:
Manual and automatic | coding | non-coding | pseudogene | problem |
2-way pseudogene | all |
PolyA annotations | all |
Methods
We aim to annotate all evidence-based gene features at high accuracy on
the human reference sequence. This includes identifying all
protein-coding loci with associated alternative variants, non-coding
loci which have transcript evidence, and pseudogenes. We integrate
computational approaches (including comparative methods), manual
annotation and targeted experimental verification.
For a detailed description of the methods and references used, see
Harrow et al. (2006).
Verification
See Harrow et al. (2006) for information on verification
techniques.
Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing.
Those experiments can be found at GEO:
- GSE34797:[E-MTAB-684] - Batch IV is based on chromosomes 3, 4 and 5 annotations from GENCODE 4 (January 2010).
- GSE34820:[E-MTAB-737] - Batch V is based on annotations from GENCODE 6 (November 2010).
- GSE34821:[E-MTAB-831] - Batch VI is based on annotations from GENCODE 6 (November 2010) as well as transcript models predicted by the Ensembl Genebuild group based on the Illumina Human BodyMap 2.0 data.
Credits
This GENCODE release is the result of a collaborative effort among
the following laboratories (contact:
GENCODE at the Sanger Institute):
Lab/Institution | Contributors |
GENCODE Principal Investigator | Tim Hubbard |
HAVANA manual annotation group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK | Adam Frankish, Jose Manuel Gonzalez, Mike Kay, Alexandra Bignell, Gloria Despacio-Reyes, Garaub Mukherjee, Gary Sanders, Veronika Boychenko, Jennifer Harrow |
Genome Bioinformatics Lab (CRG), Barcelona, Spain | Thomas Derrien, Tyler Alioto, Andrea Tanzer, Roderic Guigó |
Genome Bioinformatics, University of California Santa Cruz (UCSC), USA | Rachel Harte, Mark Diekhans, Robert Baertsch, David Haussler |
Comp. Genomics Lab, Washington University St. Louis (WUSTL), USA | Jeltje van Baren, Charlie Comstock, David Lu, Michael Brent |
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA | Mike Lin, Manolis Kellis |
Computational Biology and Bioinformatics, Yale University (Yale), USA | Philip Cayting, Suganthi Balasubramanian, Baikang Pei, Cristina Sisu, Mark Gerstein |
Center for Integrative Genomics, University of Lausanne, Switzerland | Cedric Howald, Alexandre Reymond |
ENSEMBL genebuild group, Wellcome Trust Sanger Institute (WTSI), Hinxton, UK | Steve Searle, Bronwen Aken, Amonida Zadissa, Daniel Barrell |
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain | José Manuel Rodríguez, Michael Tress, Alfonso Valencia |
References
Flicek et al. Ensembl 2011. Nucleic Acids Research. 2011;39 Database issue:D800-D806.
Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9.
Data Release Policy
GENCODE data are available for use without restrictions.
The full data release policy for ENCODE is available
here.