Description
This track shows updated versions of gene predictions submitted for the
ENCODE Gene Annotation Assessment Project
(EGASP) Gene Prediction Workshop 2005.
The following gene predictions are included: Augustus, Exogean, FGenesh++,
GeneID-U12, Jigsaw, SGP2-U12 and Yale Pseudogenes.
The original EGASP submissions are displayed in the companion tracks,
EGASP Full and EGASP Partial.
Display Conventions and Configuration
Data for each gene prediction method within this composite annotation track
are displayed in separate subtracks. See the top of the track description page
for a complete list of the subtracks available for this annotation. To display
only selected subtracks, uncheck the boxes next to the tracks you wish to
hide.
The individual subtracks within this annotation follow the display conventions
for gene prediction
tracks. Display characteristics specific to individual subtracks are
described in the Methods section. The track description page offers the option
to color and label codons in a zoomed-in display of the subtracks to facilitate
validation and comparison of gene predictions. To enable this feature, select
the genomic codons option from the "Color track by codons"
menu. Click the
Help on codon coloring
link for more information about this feature.
Color differences among the subtracks are arbitrary. They provide a
visual cue for distinguishing the different gene prediction methods.
Methods
Augustus
Augustus uses a generalized hidden Markov model (GHMM) that models
coding and non-coding sequence, splice sites, the branch point region,
the translation start and end, and the lengths of exons and introns.
This version has been trained on a set of 1284 human genes.
The track contains four sets of predictions: ab initio;
EST- and protein-based; mouse homology-based; and predictions that use both
EST/protein and mouse homology evidence as additional input to Augustus.
The EST and protein evidence was generated by aligning sequences from the dbEST
and nr databases to the ENCODE region using wublastn and wublastx.
The resulting alignments were used to generate hints about putative splice
sites, exons, coding regions, introns, translation start and
translation stop.
The mouse homology evidence was generated by aligning pairs of human and
mouse genomic sequences using the program
DIALIGN. Regions conserved at the peptide level were used to
generate hints about coding regions.
Exogean
Exogean produces alternative transcripts by combining mRNA and cross-species
sequence alignments using heuristic rules. The program implements a generic
framework based on directed acyclic colored multigraphs (DACMs). In Exogean,
DACM nodes represent biological objects (mRNA or protein HSPs/transcripts) and
multiple edges between nodes represent known relationships between these
objects derived from human expertise. Exogean DACMs are successively built and
reduced, leading to increasingly complex objects. This process
enables the production of alternative transcripts from initial HSPs.
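As an illustration only, the build-and-reduce idea can be sketched with a small
colored multigraph in Python. This is not Exogean's implementation: the class
name, the relationship names used as edge colors, and the merge rule in
reduce() are all hypothetical, and the sketch does not check that the reduced
graph stays acyclic.

```python
from collections import defaultdict


class ColoredMultigraph:
    """Directed multigraph; parallel edges are told apart by a color label."""

    def __init__(self):
        self.nodes = set()
        self.edges = defaultdict(set)  # (source, target) -> set of colors

    def add_edge(self, source, target, color):
        self.nodes.update((source, target))
        self.edges[(source, target)].add(color)

    def reduce(self, color):
        """Return a new graph in which nodes linked by `color` edges are
        merged into one composite node (a sorted tuple of its members)."""
        parent = {n: n for n in self.nodes}  # union-find forest

        def find(n):
            while parent[n] != n:
                parent[n] = parent[parent[n]]  # path halving
                n = parent[n]
            return n

        # Union the endpoints of every edge that carries the chosen color.
        for (s, t), colors in self.edges.items():
            if color in colors:
                parent[find(s)] = find(t)

        members = defaultdict(list)
        for n in self.nodes:
            members[find(n)].append(n)
        label = {n: tuple(sorted(members[find(n)])) for n in self.nodes}

        reduced = ColoredMultigraph()
        reduced.nodes = set(label.values())
        # Keep only edges that still connect two different composite nodes.
        for (s, t), colors in self.edges.items():
            if label[s] != label[t]:
                for c in colors:
                    reduced.add_edge(label[s], label[t], c)
        return reduced
```

Here a node could stand for an HSP and a composite node for a candidate
transcript; repeated calls to reduce() with different colors would yield
increasingly complex objects, in the spirit of the description above.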
FGenesh++
FGenesh++ predictions are based on hidden Markov models and protein similarity to
the NR database. For more information, see the reference below.
GeneID-U12
The GeneID program predicts genes in anonymous genomic sequences. The program
is designed with a hierarchical structure.
In the first step, splice sites, start and stop codons are predicted and scored
along the sequence using position weight arrays (PWAs).
Next, exons are built from the sites. Exons are scored as the sum of the scores
of the defining sites plus the log-likelihood ratio of a Markov model for
coding DNA.
Finally, the gene structure is assembled from the set of predicted exons,
maximizing the sum of the scores of the assembled exons.
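The three steps above can be sketched in Python as follows. This is a
simplified illustration, not the GeneID code: the function and type names are
hypothetical, site_score() uses a plain position weight matrix as a stand-in
for position weight arrays, and assemble() ignores reading-frame compatibility
and intron length limits. It also runs in quadratic time, unlike the
linear-time method cited in the References.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int    # 0-based, inclusive
    end: int      # exclusive
    score: float


def site_score(window: str, pwm: list[dict[str, float]]) -> float:
    """Step 1: sum position-specific weights over a candidate site window."""
    return sum(column[base] for column, base in zip(pwm, window))


def exon_score(acceptor: float, donor: float, coding_llr: float) -> float:
    """Step 2: scores of the two defining sites plus the coding
    log-likelihood ratio (assumed to be computed elsewhere)."""
    return acceptor + donor + coding_llr


def assemble(exons: list[Exon]) -> tuple[float, list[Exon]]:
    """Step 3: choose non-overlapping exons that maximize the summed score."""
    exons = sorted(exons, key=lambda e: e.end)
    best = []  # best[i] = (score, chain) of the best chain ending at exons[i]
    for i, exon in enumerate(exons):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # A predecessor must end before this exon starts.
            if exons[j].end <= exon.start and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + exon.score, top_chain + [exon]))
    return max(best, key=lambda b: b[0], default=(0.0, []))
```

Calling assemble() on a list of scored candidate exons returns the total score
and the chosen chain of exons in genomic order.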
The modified version of GeneID used to generate the predictions in this track
incorporates models for U12-dependent splice signals in addition to U2 splice
signals.
The GeneID subtrack shows all GeneID genes. Only U12 introns
and their flanking exons are displayed in the GeneID U12 subtrack.
Exons flanking predicted U12-dependent introns are assigned a type
attribute reflecting their splice sites, displayed on
the details page of the GeneID U12 subtrack as the "Alternate Name"
of the item composed of the intron plus flanking exons.
Jigsaw
Jigsaw is a gene prediction program that predicts genes from the
target genomic sequence and output from a gene structure annotation database.
Data downloaded from UCSC's annotation database is
used as input and includes the following tracks of evidence:
Known Genes, Ensembl, RefSeq, GeneID, Genscan, SGP, Twinscan, Human mRNAs,
TIGR Gene Index, UniGene, Most Conserved Elements and Non-human RefSeq Genes.
GlimmerHMM and GeneZilla, two open source ab initio gene-finding
programs based on GHMMs, are also used.
SGP2-U12
To predict genes in a genomic query, SGP2 combines GeneID predictions with
tblastx comparisons of the genomic query against other genomic sequences.
This modified version of SGP2 uses models for U12-dependent splice signals
in addition to U2 splice signals. The reference genomic sequence for this data
set is the Oct. 2004 release of mouse sequence syntenic to ENCODE regions.
The SGP2 and SGP2 U12 tracks follow the same display conventions as the
GeneID and GeneID U12 subtracks described above.
Yale Pseudogenes
For this analysis, pseudogenes were defined as genomic sequences similar
to known human genes and with various disablements (premature stop codons or
frameshifts) in their "putative" protein-coding regions.
The protein sequences of known human genes (as annotated by ENSEMBL) were used
to search for similar nongenic sequences in ENCODE regions. The matching
sequences were assessed as disabled copies of genes based on the occurrences of
premature stop codons or frameshifts. The intron-exon structure of the
functional gene was further used to infer whether a pseudogene was duplicated
or processed (a duplicated pseudogene keeps the intron-exon structure of its
parent functional gene). Small pseudogene sequences were labeled as fragments or
other types.
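The disablement checks described above can be illustrated with a short Python
sketch. It is not the Yale pipeline: the function names are hypothetical, and
it works on a nucleotide-level pairwise alignment for simplicity, whereas the
analysis described here searched with protein sequences.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame stop codons that occur
    before the final codon of a putative coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def frameshift_gaps(aligned_parent: str, aligned_candidate: str) -> list[tuple[int, int]]:
    """Return (column, length) for each gap run, in either aligned row,
    whose length is not a multiple of three."""
    shifts = []
    for row in (aligned_parent, aligned_candidate):
        col = 0
        while col < len(row):
            if row[col] == "-":
                start = col
                while col < len(row) and row[col] == "-":
                    col += 1
                if (col - start) % 3:
                    shifts.append((start, col - start))
            else:
                col += 1
    return sorted(shifts)


def pseudogene_type(retains_parent_introns: bool) -> str:
    """A duplicated pseudogene keeps the parent's intron-exon structure;
    otherwise the copy is treated as processed."""
    return "duplicated" if retains_parent_introns else "processed"
```

A candidate sequence with a non-empty result from either premature_stops() or
frameshift_gaps() would count as disabled in this simplified view.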
All pseudogenes in this track were manually curated.
In the browser, the track details page shows the pseudogene type.
Credits
Augustus was written by Mario Stanke at the
Department of
Bioinformatics of the University of Göttingen in Germany.
Exogean was developed by Sarah Djebali and Hugues Roest Crollius from the
Dyogen Lab, Ecole
Normale Supérieure (Paris, France) and Franck Delaplace
from the Laboratoire de Méthodes Informatiques
(LaMI) (Evry,
France).
The FGenesh++ gene predictions were provided by Victor Solovyev of
Softberry Inc.
The GeneID-U12 and SGP2-U12 programs were developed by the
Grup de Recerca en Informàtica Biomèdica
(GRIB) at
the Institut Municipal d'Investigació Mèdica (IMIM) in Barcelona.
The version of GeneID on which GeneID-U12 is based (geneid_v1.2) was written by
Enrique Blanco and Roderic Guigó.
The parameter files were constructed by Genis Parra and Francisco Camara.
Additional contributions were made by Josep F. Abril, Moises Burset and Xavier
Messeguer. Modifications to GeneID that allow for the prediction of
U12-dependent splice sites and incorporation of U12 introns into gene models
were made by Tyler Alioto.
Jigsaw was developed at The Institute for Genomic Research
(TIGR)
by Jonathan Allen and Steven Salzberg,
with computational gene-finder contributions from Mihaela Pertea and William
Majoros. Continued maintenance and development of Jigsaw will
be provided by the Salzberg group at the Center for Bioinformatics
and Computational Biology
(CBCB) at the
University of Maryland, College Park.
The Yale Pseudogenes were generated by the pseudogene annotation group of
Mark Gerstein at Yale
University.
References
Augustus
Stanke, M.
Gene prediction with a hidden Markov model.
Ph.D. thesis, Universität Göttingen, Germany (2004).
Stanke, M. and Waack, S.
Gene prediction with a hidden Markov model and a new intron
submodel.
Bioinformatics, 19(Suppl. 2), ii215-ii225 (2003).
Stanke, M., Steinkamp, R., Waack, S. and Morgenstern, B.
AUGUSTUS: a web server for gene finding in eukaryotes.
Nucl. Acids Res., 32, W309-W312 (2004).
FGenesh++
Solovyev, V.V.
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding, D. et al.), pp. 83-127
(John Wiley & Sons, Inc., 2001).
GeneID
Blanco, E., Parra, G. and Guigó, R.
Using geneid to identify genes.
In Current Protocols in Bioinformatics, Unit 4.3. (ed. Baxevanis, A.D.)
(John Wiley & Sons, Inc., 2002).
Guigó, R.
Assembling genes from predicted exons in linear time with
dynamic programming.
J Comput Biol. 5(4), 681-702 (1998).
Guigó, R., Knudsen, S., Drake, N. and Smith, T.
Prediction of gene structure.
J Mol Biol. 226(1), 141-57 (1992).
Parra, G., Blanco, E. and Guigó, R.
GeneID in Drosophila.
Genome Research 10(4), 511-515 (2000).
Jigsaw
Allen, J.E., Pertea, M. and Salzberg, S.L.
Computational gene prediction using multiple sources of
evidence.
Genome Res., 14(1), 142-8 (2004).
Allen, J.E. and Salzberg, S.L.
JIGSAW: integration of multiple sources of evidence for gene
prediction.
Bioinformatics 21(18), 3596-3603 (2005).
SGP2
Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G.,
Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C. et al.
Comparison of mouse and human genomes followed by experimental
verification yields an estimated 1,019 additional genes.
Proc Natl Acad Sci U S A 100(3), 1140-5 (2003).
Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W. and Guigó, R.
Comparative gene prediction in human and mouse.
Genome Res. 13(1), 108-17 (2003).