Description
This track shows full sets of gene predictions covering all 44 ENCODE regions
originally submitted for the ENCODE Gene Annotation Assessment Project
(EGASP) Gene Prediction Workshop 2005.
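The following gene predictions are included: AceView, DOGFISH-C, Ensembl,
ExonHunter, Exogean, Fgenesh Pseudogenes, Fgenesh++, GeneID-U12, GeneMark,
JIGSAW, Pairagon/N-SCAN, SGP2-U12, SPIDA and Twinscan-MARS.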
The EGASP Partial companion track shows original gene prediction submissions
for a partial set of the 44 ENCODE regions; the EGASP Update track
shows updated versions of the submitted predictions. These annotations
were originally produced using the hg17 assembly.
Display Conventions and Configuration
Data for each gene prediction method within this composite annotation track
are displayed in a separate subtrack. See the top of the track description
page for configuration options allowing display of selected subsets of gene
predictions. To remove a subtrack from the display,
uncheck the appropriate box.
The individual subtracks within this annotation follow the display conventions
for gene prediction
tracks. Display characteristics specific to individual subtracks are
described in the Methods section. The track description page offers the option
to color and label codons in a zoomed-in display of the subtracks to facilitate
validation and comparison of gene predictions. To enable this feature, select
the genomic codons option from the "Color track by codons"
menu. Click the
Help on codon coloring
link for more information about this feature.
Color differences among the subtracks are arbitrary. They provide a
visual cue for distinguishing the different gene prediction methods.
Methods
AceView
These annotations were generated using AceView. All mRNAs
and cDNAs available in GenBank, excluding RefSeq NM entries, were co-aligned on
the Gencode sections. The results were then examined and filtered to resemble
the Havana annotations. Havana's very restrictive criteria for annotating CDS
were not reproduced, owing to a lack of experimental data.
DOGFISH-C
Candidate splice sites and coding starts/stops were evaluated using DNA
alignments between the human assembly and seven other vertebrate species
(UCSC multiz alignments, adding the frog and removing the chimp). Genes
(single transcripts only) were then predicted using dynamic programming.
Ensembl
The Ensembl annotation includes two types of predictions: protein-coding
genes (the Ensembl Gene Predictions subtrack)
and pseudogenes of protein-coding genes
(the Ensembl Pseudogene Predictions subtrack).
The Ensembl Pseudogene Predictions subtrack is not intended as a comprehensive
annotation of pseudogenes, but rather
an attempt to identify and label those gene predictions made by the Ensembl
pipeline that have pseudogene characteristics. Exons that lie partially outside
the ENCODE region are not included in the data set. The "Alternate
Name" field on the subtrack details page shows the Ensembl ID for the
selected gene or transcript.
ExonHunter
ExonHunter is a comprehensive gene-finder based on hidden Markov models (HMMs)
allowing the use of a variety of additional sources of information (ESTs,
proteins, genome-genome comparisons).
Exogean
Exogean annotates protein coding genes by combining mRNA and cross-species
protein alignments in directed acyclic colored multigraphs where nodes and
edges respectively represent biological objects and human expertise.
Additional predictions and methods for this subtrack are available in the
EGASP Update track.
Fgenesh Pseudogenes
Fgenesh is an HMM gene structure prediction program.
This data set shows predictions of potential pseudogenes.
Fgenesh++
These gene predictions were generated by Fgenesh++, a gene-finding program that
uses both HMMs and protein similarity to find
genes in a completely automated manner.
GeneID-U12
The GeneID-U12 gene prediction set, generated using a version of GeneID modified
to detect U12-dependent introns (both GT-AG and AT-AC subtypes) when present,
employs a single-genome ab initio method.
This modified version of GeneID uses matrices for U12 donor,
acceptor and branch sites constructed from examples of published U12
intron splice junctions
(both experimentally confirmed and expressed-sequence-validated predictions).
Two GeneID-U12 subtracks are
included: GeneID Gene Predictions and GeneID U12 Intron Predictions. The U12
splice sites for features in the U12 Intron Predictions track are displayed
on the track details pages.
Additional predictions and methods for this subtrack are available in the
EGASP Update track.
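The idea of scoring a candidate splice site against a matrix built from known
examples can be sketched as follows. This is a minimal illustration of a
log-odds position weight matrix, not the GeneID-U12 implementation: the example
donor strings, the pseudocount, the uniform background and the function names
build_pwm and score_site are all assumptions made for the sketch.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length example sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all example sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in sites)
        # log2 of (smoothed observed frequency / background frequency)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score_site(pwm, candidate):
    """Sum the per-position log-odds for a candidate of matching length."""
    if len(candidate) != len(pwm):
        raise ValueError("candidate length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, candidate))

# Hypothetical aligned donor-site examples (illustrative strings only).
EXAMPLE_DONORS = ["ATATCCTT", "ATATCCTT", "GTATCCTT", "ATATCCTC"]
pwm = build_pwm(EXAMPLE_DONORS)
consensus_score = score_site(pwm, "ATATCCTT")
unrelated_score = score_site(pwm, "CGCGAAGG")
```

A candidate that matches the examples receives a positive score, while an
unrelated sequence scores below zero; a gene finder would compare such scores
against a threshold or combine them with other signals.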
GeneMark
The eukaryotic version of the GeneMark.hmm (release 2.2) gene prediction
program uses an HMM with explicit state duration, also known as a hidden
semi-Markov model (HSMM). The HMM includes hidden states for initial,
internal and terminal exons, introns, intergenic regions and single exon genes.
It also includes the "border" states, such as start site (initiation
codon), stop site (termination codons), and donor and acceptor splice sites.
Sequences of all protein-coding regions were modeled by three-periodic
inhomogeneous Markov chains; sequences of non-coding regions were modeled by
homogeneous Markov chains. Nucleotide sequences corresponding to the site
states were modeled by position-specific inhomogeneous Markov chains.
Parameters of the gene models were derived from the set of genes obtained by
cDNA mapping to genomic DNA. To reflect variations in G+C composition of the
genome, the gene model parameters were estimated separately for each of three
G+C content ranges.
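The contrast between a three-periodic chain for coding sequence and a
homogeneous chain for non-coding sequence can be illustrated with a small
sketch. It assumes a first-order chain, add-one smoothing and a toy in-frame
training sequence; none of these choices are taken from GeneMark.hmm, and the
names train_chain and log_likelihood are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_chain(seqs, period, pseudocount=1.0):
    """Estimate first-order transition tables, one per position modulo `period`.

    period=3 gives a three-periodic inhomogeneous chain (one table per codon
    phase); period=1 gives an ordinary homogeneous chain.
    Sequences are assumed to start in frame and contain only A, C, G, T.
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(period)]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % period][seq[i - 1]][seq[i]] += 1.0
    model = []
    for phase in range(period):
        table = {}
        for prev in BASES:
            total = sum(counts[phase][prev][b] for b in BASES)
            total += pseudocount * len(BASES)
            table[prev] = {
                b: (counts[phase][prev][b] + pseudocount) / total for b in BASES
            }
        model.append(table)
    return model

def log_likelihood(model, seq):
    """Natural-log likelihood of the transitions in seq under the model."""
    period = len(model)
    return sum(
        math.log(model[i % period][seq[i - 1]][seq[i]])
        for i in range(1, len(seq))
    )

# Toy sequence of repeated GCG codons: the base following G depends on phase.
TOY_CODING = ["GCGGCGGCGGCGGCGGCG"]
coding_model = train_chain(TOY_CODING, period=3)
noncoding_model = train_chain(TOY_CODING, period=1)
periodic_ll = log_likelihood(coding_model, TOY_CODING[0])
homogeneous_ll = log_likelihood(noncoding_model, TOY_CODING[0])
```

On this toy input the periodic model fits better than the homogeneous one,
because only the periodic model can tell a G in the first codon position from a
G in the third. The difference of the two log-likelihoods is the kind of
quantity a gene finder can use as a coding-potential signal.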
JIGSAW
JIGSAW uses the output from gene-finders, splice-site prediction programs and
sequence alignments to predict gene models. Annotation data downloaded from
the UCSC Genome Browser and TIGR gene-finder output were used as input for
these predictions. JIGSAW predicts both partial and complete genes.
Additional predictions and methods for this subtrack are available in the
EGASP Update track.
Pairagon/N-SCAN
The pairHMM-based alignment program, Pairagon, was used to align
high-quality mRNA sequences to the ENCODE regions. These were
supplemented with N-SCAN EST predictions which are displayed in the
Pairgn/NSCAN-E subtrack, and extended further with additional
transcripts from the Brent Lab to produce the predictions
displayed as the Pairgn/NSCAN-E/+ subtrack. The NSCAN subtrack
contains only predictions from the N-SCAN program.
SGP2-U12
The SGP2-U12 gene prediction set, generated using a version of GeneID modified
to detect U12-dependent introns (both AT-AC and GT-AG subtypes) when present,
employs a dual-genome method (SGP2) that utilizes similarity (tblastx) to
mouse genomic sequence syntenic to the ENCODE regions (Oct. 2004 MSA freeze).
This modified version of GeneID uses matrices for U12
donor, acceptor and branch sites constructed from examples of published U12
intron splice junctions (both experimentally confirmed and
expressed-sequence-validated predictions). Two SGP2-U12 subtracks are
included: SGP2 Gene Predictions and SGP2 U12 Intron Predictions.
The U12 splice sites for features in the U12 Intron Predictions track are
displayed on the track details pages.
Additional predictions and methods for this subtrack are available in the
EGASP Update track.
SPIDA
This exon-only prediction set was produced using SPIDA (Substitution Periodicity
Index and Domain Analysis). Exons derived by mapping ESTs to the genome were
validated by seeking periodic substitution patterns in the aligned informant
DNA sequences. First, all
available ESTs were mapped to the genome using Exonerate. The resulting
transcript structures were "flattened" to remove redundancy. Each
exon of the flattened transcripts was subjected to SPI analysis, which involves
identifying periodicity in the pattern of mutations occurring between the human
and an informant species DNA sequence (the informant sequences and their TBA
alignments were provided by Elliott Margulies). SPI was calculated for all
available human-informant pairs for whole exons and in a sliding 48 bp window.
SPI analysis requires that a threshold level of periodicity be identified in at
least two of the informant species if the exon is to be accepted. If accepted,
SPI provides the correct frame for translation of the exon. The accepted exon
was then used as a starting point for extending the ORF of the flattened
transcript from which it came. This gave a full or partial CDS; different exons
may give different CDSs. The CDSs were translated and searched for domains using
hmmpfam and Pfam_fs. Only transcripts with a domain hit with an E-value below 1.0 were
retained. Heuristics were applied to the retained CDSs to identify problems with
the transcript structure, particularly frame-shifts. Many transcripts may
identify the same exon, but only a single instance of each exon has been
retained.
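The acceptance rule described above (periodic substitutions seen in at least
two informant species) can be sketched as follows. The score used here, the
share of substitutions falling in the most frequently substituted position
modulo 3, is a hypothetical stand-in for SPI, whose actual formula is not given
in this description; the threshold, the toy alignments and all function names
are likewise assumptions.

```python
def substitution_phase_counts(human, informant):
    """Count mismatches by human position modulo 3, skipping gapped columns."""
    counts = [0, 0, 0]
    pos = 0  # index in the ungapped human sequence
    for h, o in zip(human, informant):
        if h == "-":
            continue
        if o != "-" and h != o:
            counts[pos % 3] += 1
        pos += 1
    return counts

def periodicity_score(counts):
    """Hypothetical SPI stand-in: fraction of mismatches in the busiest phase."""
    total = sum(counts)
    return max(counts) / total if total else 0.0

def accept_exon(human, informants, threshold=0.6, min_informants=2):
    """Return (accepted, phase), where phase is the most substituted position."""
    passing = []
    for seq in informants.values():
        counts = substitution_phase_counts(human, seq)
        if periodicity_score(counts) >= threshold:
            passing.append(counts.index(max(counts)))
    if len(passing) < min_informants:
        return False, None
    return True, max(set(passing), key=passing.count)

# Toy aligned sequences (illustrative only, not real informant data).
HUMAN = "GCTGCCGCAGCGGCTGCC"
INFORMANTS = {
    "mouse":   "GCAGCTGCCGCAGCGGCA",  # substitutions only at third positions
    "dog":     "GCCGCCGCTGCGGCAGCC",  # substitutions only at third positions
    "chicken": "ACTGACGCCGCGGCTGCC",  # substitutions spread over all phases
}
result_all = accept_exon(HUMAN, INFORMANTS)
result_one = accept_exon(
    HUMAN, {k: INFORMANTS[k] for k in ("mouse", "chicken")}
)
```

With all three informants the exon is accepted and the substitutions
concentrate at offset 2, which suggests the exon starts in frame; with only one
periodic informant the exon is rejected. A sliding-window variant would apply
the same counting to each 48 bp window instead of the whole exon.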
Twinscan-MARS
This gene prediction set was produced by a version of Twinscan that employs
multiple pairwise genome comparisons to identify protein-coding genes (including
alternative splices) using nucleotide homology information. No expression or
protein data were used.
Credits
The following individuals and institutions provided the data for the subtracks
in this annotation:
- AceView: Danielle and Jean Thierry-Mieg, NCBI, National Institutes of Health.
- DOGFISH-C: David Carter, Informatics Dept., Wellcome Trust Sanger Institute.
- Ensembl: Stephen Searle, Wellcome Trust Sanger Institute (joint Sanger/EBI project).
- Exogean: Sarah Djebali, Dyogen Lab, Ecole Normale Supérieure (Paris, France).
- ExonHunter: Tomas Vinar, Waterloo Bioinformatics, School of Computer Science, University of Waterloo.
- Fgenesh, Fgenesh++: Victor Solovyev, Department of Computer Science, Royal Holloway, London University.
- GeneID-U12, SGP2-U12: Tyler Alioto, Grup de Recerca en Informàtica Biomèdica (GRIB) at the Institut Municipal d'Investigació Mèdica (IMIM), Barcelona.
- GeneMark: Mark Borodovsky, Alex Lomsadze and Alexander Lukashin, Department of Biology, Georgia Institute of Technology.
- JIGSAW: Jonathan Allen, Steven Salzberg group, The Institute for Genomic Research (TIGR) and the Center for Bioinformatics and Computational Biology (CBCB) at the University of Maryland, College Park.
- Pairagon/N-SCAN: Randall Brown, Laboratory for Computational Genomics, Washington University in St. Louis.
- SPIDA: Damian Keefe, Birney Group, EMBL-EBI.
- Twinscan: Paul Flicek, Brent Lab, Washington University in St. Louis.