Description
This track shows gene predictions generated by the N-SCAN gene structure prediction
software provided by the Computational Genomics Lab at Washington University
in St. Louis, MO, USA.
Methods
N-SCAN
N-SCAN combines biological-signal modeling in the target genome sequence with
information from a multiple-genome alignment to generate de novo gene
predictions. It extends the TWINSCAN target-informant genome pair to allow for
an arbitrary number of informant sequences as well as richer models of
sequence evolution. N-SCAN models the phylogenetic relationships between the
aligned genome sequences, context-dependent substitution rates, insertions,
and deletions.
Human N-SCAN uses mouse (mm7) as the informant genome and applies iterative
pseudogene masking.
N-SCAN PASA-EST
N-SCAN PASA-EST incorporates EST alignments into N-SCAN. Similar to the conservation
sequence models in TWINSCAN, separate probability models are developed for EST
alignments to genomic sequence in exons, introns, splice sites and UTRs,
reflecting the EST alignment patterns in these regions. N-SCAN PASA-EST is more
accurate than N-SCAN while retaining the ability to discover novel genes to
which no ESTs align.
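As a rough illustration of the region-specific EST models described above, the
following Python sketch scores a string of per-base EST alignment symbols under
separate zero-order models for exons and introns. The three-symbol alphabet
(E, I, N), the probability values and the function names are all assumptions
made for this example. They are not taken from N-SCAN PASA-EST, whose actual
models are richer.

```python
import math

# Hypothetical per-base symbols: E = covered by an EST alignment block,
# I = inside an EST-supported intron gap, N = no EST alignment.
# The probabilities below are invented for illustration only.
EMISSIONS = {
    "exon":   {"E": 0.70, "I": 0.05, "N": 0.25},
    "intron": {"E": 0.05, "I": 0.60, "N": 0.35},
}

def log_likelihood(symbols: str, region: str) -> float:
    """Sum of log emission probabilities of the symbols under one region model."""
    probs = EMISSIONS[region]
    return sum(math.log(probs[s]) for s in symbols)

def best_region(symbols: str) -> str:
    """Return the region whose model gives the symbols the highest likelihood."""
    return max(EMISSIONS, key=lambda region: log_likelihood(symbols, region))
```

Calling best_region on a stretch of mostly E symbols favors the exon model, and
a stretch of mostly I symbols favors the intron model. A stretch of N symbols
remains scoreable under both models, which is consistent with predicting genes
in regions to which no ESTs align.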
For N-SCAN PASA-EST, cDNA sequences were first clustered using the PASA
program. PASA, the Program to Assemble Spliced Alignments, was created by
Brian Haas at TIGR. The algorithm assembles clusters of overlapping transcript
alignments (ESTs and full-length cDNAs) into maximal alignment assemblies,
thereby comprehensively incorporating all available transcript data and
capturing subtle splicing variations.
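The grouping of overlapping transcript alignments can be sketched as a simple
interval-clustering step. This Python example is not the PASA algorithm, which
also checks splice compatibility when building maximal assemblies. It only
shows how alignments on one strand of one chromosome might be grouped by
genomic overlap. The tuple layout, the half-open coordinates and the function
name are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical alignment record: (name, start, end), half-open genomic
# coordinates, all on the same chromosome and strand.
Alignment = Tuple[str, int, int]

def cluster_overlapping(alignments: List[Alignment]) -> List[List[Alignment]]:
    """Group alignments whose genomic spans overlap, directly or transitively."""
    clusters: List[List[Alignment]] = []
    current: List[Alignment] = []
    current_end = 0
    for aln in sorted(alignments, key=lambda a: (a[1], a[2])):
        _, start, end = aln
        if current and start < current_end:
            # Overlaps the running cluster: add it and extend the span.
            current.append(aln)
            current_end = max(current_end, end)
        else:
            # No overlap: close the running cluster and start a new one.
            if current:
                clusters.append(current)
            current = [aln]
            current_end = end
    if current:
        clusters.append(current)
    return clusters
```

For example, alignments spanning 100-200, 150-300 and 290-310 fall into one
cluster because each overlaps the span accumulated so far, while an alignment
at 500-700 starts a separate cluster.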
The PASA clusters were used as 'EST' sequences in N-SCAN PASA-EST. The
resulting gene models were updated with the input PASA clusters using the
assembly tool of the PASA pipeline. These updates consist of automatically
generated alternative splices, UTR features and, occasionally, the merging of two
gene models. In addition, PASA assigned open reading frames to clusters that did
not overlap a gene prediction but did contain a full-length cDNA, and output
them as 'novel genes'. Note that PASA does not use any cDNA annotation from the
input; it assigns the ORF itself.
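Assigning an ORF directly from sequence, without relying on cDNA annotation,
can be illustrated with a longest-ORF search. The Python sketch below is an
assumption about the general technique, not PASA's actual procedure: it scans
only the forward strand, requires an ATG start and an in-frame stop codon, and
uses a hypothetical function name.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest ATG-to-stop ORF on the forward strand.

    Coordinates are 0-based and half-open, and the stop codon is included.
    Returns (0, 0) if no complete ORF is found.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                # First in-frame ATG since the last stop opens a candidate ORF.
                start = i
            elif start is not None and codon in STOP_CODONS:
                # In-frame stop closes the ORF; keep it if it is the longest so far.
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None
    return best
```

A production tool would typically also scan the reverse complement and apply
a minimum ORF length; both are omitted here to keep the sketch short.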
No manual annotation was performed to generate any of the gene models. The
high accuracy of the set is in part due to the large number of available ESTs
and full-length cDNAs.
Credits
Thanks to Michael Brent's Computational Genomics Group at Washington
University in St. Louis for providing these data.
Special thanks for this implementation of N-SCAN to Aaron Tenney in
the Brent lab, and Robert Zimmermann, currently at Max F. Perutz
Laboratories in Vienna, Austria.
References
Gross SS, Brent MR.
Using multiple alignments to improve gene prediction.
In Proc. 9th Int'l Conf. on Research in Computational Molecular Biology
(RECOMB '05):374-388 and J Comput Biol. 2006 Mar;13(2):379-93.
Korf I, Flicek P, Duan D, Brent MR.
Integrating genomic homology into gene structure prediction.
Bioinformatics. 2001 Jun 1;17 Suppl 1:S140-8.
van Baren MJ, Brent MR.
Iterative gene prediction and pseudogene removal improves genome annotation.
Genome Res. 2006 May;16(5):678-85.
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM,
Rusch DB, Town CD et al.
Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.
Nucleic Acids Res. 2003 Oct 1;31(19):5654-66.