UCSC Genes Track Settings
 
UCSC Genes (RefSeq, GenBank, tRNAs & Comparative Genomics)

Data last updated: 2015-04-08

Description

The UCSC Genes track is a set of gene predictions based on data from RefSeq, GenBank, and the tRNA Genes track. The track includes both protein-coding genes and non-coding RNA genes. Both types of genes can produce non-coding transcripts, but non-coding RNA genes do not produce protein-coding transcripts. This is a moderately conservative set of predictions. Transcripts of protein-coding genes require the support of one RefSeq RNA, or of one GenBank RNA sequence plus at least one additional line of evidence. Transcripts of non-coding RNA genes require the support of one Rfam or tRNA prediction. Compared to RefSeq, this gene set has about 10% more protein-coding genes, approximately four times as many putative non-coding genes, and about twice as many splice variants.

Display Conventions and Configuration

In general, this track follows the display conventions for gene prediction tracks. The exons for putative non-coding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker. The following color key is used:

  • Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
  • Dark blue -- transcript has been reviewed or validated by the RefSeq, SwissProt, or CCDS staff
  • Medium blue -- other RefSeq transcripts
  • Light blue -- non-RefSeq transcripts
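
As a reading aid, the color key can be expressed as a small lookup. This is only a sketch: the function name, its boolean parameters, and the precedence among categories (checked in the order listed above) are assumptions, not the browser's actual rendering code.

```python
def transcript_color(in_pdb, reviewed, is_refseq):
    """Return the display color for a transcript under the key above.

    in_pdb:    feature has a corresponding PDB entry
    reviewed:  reviewed or validated by RefSeq, SwissProt, or CCDS staff
    is_refseq: transcript is a RefSeq transcript
    Assumption: when several conditions hold, the first one listed wins.
    """
    if in_pdb:
        return "black"
    if reviewed:
        return "dark blue"
    if is_refseq:
        return "medium blue"
    return "light blue"
```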

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions.

Methods

The UCSC Genes are built using a multi-step pipeline:

  1. RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping only the best alignments for each RNA. Alignments are discarded if they do not meet certain sequence identity and coverage filters. All sequences must align with high (98%) identity. The sequence coverage must be at least 90% for shorter sequences (those with 2500 or fewer bases), with the coverage threshold progressively relaxed for longer sequences.
  2. Alignments are broken up at non-intronic gaps, with small isolated fragments thrown out.
  3. A splicing graph is created for each set of overlapping alignments. This graph has an edge for each exon or intron, and a vertex for each splice site, start, and end. Each RNA that contributes to an edge is kept as evidence for that edge. Gene models from the Consensus CDS project (CCDS) are also added to the graph.
  4. A similar splicing graph is created for human, based on human RNAs and ESTs. If the human graph has an edge that is orthologous to an edge in the mouse graph, the human edge is added to the evidence for the mouse edge.
  5. If an edge in the splicing graph is supported by two or more mouse ESTs, those ESTs are added as evidence for the edge.
  6. If there is an Exoniphy prediction for an exon, that is added as evidence.
  7. The graph is traversed to generate all unique transcripts. The traversal is guided by the initial RNAs to avoid a combinatorial explosion in alternative splicing. All RefSeq transcripts are output. For other multi-exon transcripts to be output, an edge supported by at least one additional line of evidence beyond the RNA is required. Single-exon genes require either two RNAs or two additional lines of evidence beyond the single RNA.
  8. Alignments are merged in from the mm10 tRNA Genes track and from Rfam in regions that are syntenic with the hg19 human genome.
  9. Protein predictions are generated. For non-RefSeq transcripts, we use the txCdsPredict program to determine if the transcript is protein-coding, and if so, the locations of the start and stop codons. The program weighs as positive evidence the length of the protein, the presence of a Kozak consensus sequence at the start codon, and the length of the orthologous predicted protein in other species. As negative evidence it considers nonsense-mediated decay and start codons in any frame upstream of the predicted start codon. For RefSeq transcripts, the RefSeq protein prediction is used directly instead of this procedure. For CCDS proteins, the CCDS protein is used directly.
  10. The corresponding UniProt protein is found, if any.
  11. The transcript is assigned a permanent "uc" accession. If the transcript was not in the previous release of UCSC Genes, the accession ends with the suffix ".1" indicating that this is the first version of this transcript. If the transcript is identical to some transcript in the previous release of UCSC Genes, the accession is re-used with the same version number. If the transcript is not identical to any transcript in the previous release but it overlaps a similar transcript with a compatible structure, the previous accession is re-used with the version number incremented.
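
Step 3 can be illustrated with a minimal sketch of the splicing graph. It assumes each alignment is given as a sorted list of half-open (start, end) exon blocks in genome coordinates. The function names, the edge representation, and the example accessions are hypothetical and are not taken from the UCSC pipeline.

```python
from collections import defaultdict


def build_splicing_graph(alignments):
    """Build a splicing graph from a set of overlapping RNA alignments.

    alignments: dict mapping an RNA accession to a sorted list of
    (start, end) exon blocks.
    Returns a dict mapping each edge (from_pos, to_pos, kind) to the set
    of RNA accessions kept as evidence for that edge.
    """
    evidence = defaultdict(set)
    for rna, exons in alignments.items():
        for i, (start, end) in enumerate(exons):
            # Exon edge: connects the exon's start vertex to its end vertex.
            evidence[(start, end, "exon")].add(rna)
            if i + 1 < len(exons):
                # Intron edge: connects this exon's end to the next exon's start.
                next_start = exons[i + 1][0]
                evidence[(end, next_start, "intron")].add(rna)
    return dict(evidence)


def graph_vertices(graph):
    """Collect the vertex positions (splice sites, starts, and ends)."""
    return sorted({pos for (a, b, _kind) in graph for pos in (a, b)})


# Hypothetical example: two RNAs sharing their first two exons.
alignments = {
    "rnaA": [(100, 200), (300, 400), (500, 600)],
    "rnaB": [(100, 200), (300, 400)],
}
graph = build_splicing_graph(alignments)
```

In this example the shared exon edge (100, 200) carries evidence from both RNAs, while the intron edge (400, 500) is supported by rnaA alone. Further evidence, such as the orthologous human edges, ESTs, or Exoniphy predictions of steps 4 to 6, could be recorded by adding more labels to the same per-edge sets.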

Related Data

The UCSC Genes transcripts are annotated in numerous tables, each of which is also available as a downloadable file. These include tables that link UCSC Genes transcripts to external datasets (such as knownToLocusLink, which maps UCSC Genes transcripts to Entrez identifiers, previously known as Locus Link identifiers), and tables that detail some property of UCSC Genes transcript sequences (such as knownToPfam, which identifies any Pfam domains found in the UCSC Genes protein-coding transcripts). To see a full list of the associated tables, select UCSC Genes in the track menu of the Table Browser; the list then appears in the table menu. Note that some of these tables refer to UCSC Genes by its former name of Known Genes, sometimes abbreviated as known or kg. The complete set of annotation tables is too long to describe here, but some of the more important tables are described below.

  • kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA sequences (if any) on which each transcript was based.
  • knownToRefSeq identifies the RefSeq transcript that each UCSC Genes transcript is most closely associated with. That RefSeq transcript is either the RefSeq on which the UCSC Genes transcript was based, if there is one, or the RefSeq transcript that the UCSC Genes transcript overlaps at the most bases.
  • knownGeneMrna contains the mRNA sequence that represents each UCSC Genes transcript. If the transcript is based on a RefSeq transcript, then this table contains the RefSeq transcript, including any portions that do not align to the genome.
  • knownGeneTxMrna contains mRNA sequences for each UCSC Genes transcript. In contrast to the sequences in knownGeneMrna, these sequences are derived by obtaining the sequence of each exon from the reference genome and concatenating these exonic sequences.
  • knownGenePep contains the protein sequences derived from the knownGeneMrna transcript sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based on these sequences.
  • knownGeneTxPep contains the protein translation (if any) of each mRNA sequence in knownGeneTxMrna.
  • knownIsoforms maps each transcript to a cluster ID, which identifies a cluster of isoforms of the same gene.
  • knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
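
The relationship between knownIsoforms and knownCanonical can be sketched as follows. This is a simplified illustration, assuming the canonical isoform is always the longest one in its cluster (the text above says only that this is generally the case) and that the caller supplies a length for each transcript. The function name, the tie-breaking rule, and the transcript IDs are hypothetical.

```python
def pick_canonical(isoforms, lengths):
    """Choose one canonical transcript per cluster.

    isoforms: dict mapping transcript ID -> cluster ID (as in knownIsoforms)
    lengths:  dict mapping transcript ID -> transcript length
    Returns a dict mapping cluster ID -> ID of the longest transcript.
    Assumption: on a tie, the transcript seen first is kept.
    """
    canonical = {}
    for transcript, cluster in isoforms.items():
        best = canonical.get(cluster)
        if best is None or lengths[transcript] > lengths[best]:
            canonical[cluster] = transcript
    return canonical


# Hypothetical example: cluster 1 has two isoforms, cluster 2 has one.
isoforms = {"txA": 1, "txB": 1, "txC": 2}
lengths = {"txA": 1200, "txB": 3400, "txC": 900}
canonical = pick_canonical(isoforms, lengths)
```

Here txB is chosen for cluster 1 because it is longer than txA, and txC is chosen for cluster 2 as its only member.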

Credits

The UCSC Genes track was produced at UCSC using a computational pipeline developed by Jim Kent, Chuck Sugnet, Melissa Cline and Mark Diekhans. It is based on data from NCBI RefSeq, UniProt (including TrEMBL and TrEMBL-NEW), CCDS, and GenBank as well as data from Rfam and the Todd Lowe lab. Our thanks to the people running these databases and to the scientists worldwide who have made contributions to them.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6. PMID: 14681350; PMC: PMC308779

Chan PP, Lowe TM. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 2009 Jan;37(Database issue):D93-7. PMID: 18984615; PMC: PMC2686519

Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR et al. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res. 2011 Jan;39(Database issue):D141-5. PMID: 21062808; PMC: PMC3013711

Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46. PMID: 16500937

Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64. PMID: 11932250; PMC: PMC187518

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64. PMID: 9023104; PMC: PMC146525

UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5. PMID: 22102590; PMC: PMC3245120