Description
This track displays human-centric multiple sequence alignments
and conserved elements in the
ENCODE regions for the 36 vertebrates included in the
December 2007 ENCODE MSA freeze.
The alignments in this track were generated using the
Threaded Blockset Aligner (TBA).
The conservation subtracks display conserved elements generated by two methods: BinCons,
a binomial-based method that calculates a conservation score in
sliding windows with normalization for phylogenetic bias, and Chai Cons,
a DNA structure-informed constraint detection algorithm that
uses hydroxyl radical cleavage patterns as a measure of DNA structure.
The multiple alignments are based on comparative sequence data generated for the ENCODE project
by the NIH Intramural Sequencing Center (NISC),
as well as whole-genome assemblies residing at UCSC, as listed:
Organism | Species | Version |
Human | Homo sapiens | UCSC hg18 |
Armadillo | Dasypus novemcinctus | NISC |
Baboon | Papio anubis | NISC |
Bat (rfbat) | Rhinolophus ferrumequinum | NISC |
Bat (sbbat) | Myotis lucifugus | NISC |
Cat | Felis catus | NISC |
Chicken | Gallus gallus | UCSC galGal3 |
Chimpanzee | Pan troglodytes | UCSC panTro2 |
Colobus Monkey | Colobus guereza | NISC |
Cow | Bos taurus | UCSC bosTau3 |
Dog | Canis familiaris | UCSC canFam2 |
Dusky titi | Callicebus moloch | NISC |
Elephant | Loxodonta africana | NISC |
Flying Fox | Pteropus vampyrus | NISC |
Galago | Otolemur garnettii | NISC |
Gibbon | Nomascus leucogenys leucogenys | NISC |
Guinea pig | Cavia porcellus | NISC |
Hedgehog | Atelerix albiventris | NISC |
Horse | Equus caballus | NISC |
Macaque | Macaca mulatta | UCSC rheMac2 |
Marmoset | Callithrix jacchus | NISC |
Mouse | Mus musculus | UCSC mm9 |
Mouse Lemur | Microcebus murinus | NISC |
Opossum | Monodelphis domestica | UCSC monDom4 |
Orangutan | Pongo abelii | UCSC ponAbe2 |
Owl Monkey | Aotus nancymaae | NISC |
Platypus | Ornithorhynchus anatinus | NISC |
Rabbit | Oryctolagus cuniculus | NISC |
Rat | Rattus norvegicus | UCSC rn4 |
Rock hyrax | Procavia capensis | NISC |
Shrew | Sorex araneus | NISC |
Squirrel monkey | Saimiri boliviensis boliviensis | NISC |
Squirrel | Spermophilus tridecemlineatus | NISC |
Tenrec | Echinops telfairi | NISC |
Tree shrew | Tupaia belangeri | NISC |
Vervet monkey | Chlorocebus aethiops | NISC |
Display Conventions and Configuration
In full display mode, this track shows pairwise alignments
of each species aligned to the human genome.
In dense mode, the alignments are depicted using a gray-scale
density gradient. The checkboxes in the track configuration section allow
the exclusion of species from the pairwise display. To view detailed
information about the alignments at a specific position, zoom the display in
to 30,000 or fewer bases, then click on the alignment.
Gap Annotation
The Display chains between alignments configuration option
enables display of gaps between alignment blocks in the pairwise alignments in
a manner similar to the Chain track display. The following
conventions are used:
- Single line: no bases in the aligned species. Possibly due to a
lineage-specific insertion between the aligned blocks in the human genome
or a lineage-specific deletion between the aligned blocks in the aligning
species.
- Double line: aligning species has one or more unalignable bases in
the gap region. Possibly due to excessive evolutionary distance between
species or independent indels in the region between the aligned blocks in both
species.
- Pale yellow coloring: aligning species has Ns in the gap region.
Reflects uncertainty in the relationship between the DNA of both species, due
to lack of sequence in relevant portions of the aligning species.
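The three gap conventions above can be read as a small decision rule. The following is a minimal sketch, not the browser's actual drawing code: the function name `gap_style` and the idea of passing the aligning species' gap sequence as a string are assumptions made for illustration, as is the choice to let Ns take precedence over other unalignable bases.

```python
def gap_style(gap_bases: str) -> str:
    """Pick a display convention for a gap between two alignment blocks.

    gap_bases is the aligning species' sequence that falls in the gap
    (hypothetical input; an empty string means the species has no bases there).
    """
    if not gap_bases:
        # No bases in the aligning species.
        return "single line"
    if "N" in gap_bases.upper():
        # Assumption: any N in the gap region triggers the pale yellow style.
        return "pale yellow"
    # One or more unalignable bases in the gap region.
    return "double line"


print(gap_style(""))        # no bases in the aligning species
print(gap_style("acgtac"))  # unalignable bases
print(gap_style("acNNNt"))  # missing sequence (Ns)
```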
Genomic Breaks
Discontinuities in the genomic context (chromosome, scaffold or region) of the
aligned DNA in the aligning species are shown as follows:
- Vertical blue bar: represents a discontinuity that persists indefinitely
on either side, e.g. a large region of DNA on either side of the bar
comes from a different chromosome in the aligned species due to a large scale
rearrangement.
- Green square brackets: enclose shorter alignments consisting of DNA from
one genomic context in the aligned species nested inside a larger chain of
alignments from a different genomic context. The alignment within the
brackets may represent a short misalignment, a lineage-specific insertion of a
transposon in the human genome that aligns to a paralogous copy somewhere
else in the aligned species, or other similar occurrence.
Base Level
When zoomed in to the base-level display, the track shows the base
composition of each alignment.
The numbers and symbols on the Gaps
line indicate the lengths of gaps in the human sequence at those
alignment positions relative to the longest non-human sequence.
If there is sufficient space in the display, the size of the gap is shown.
If the space is insufficient and the gap size is a multiple of 3, a
"*" is displayed; other gap sizes are indicated by "+".
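The labeling rule for the Gaps line can be sketched as follows. This is an illustration only: the function name `gap_label` and the notion of measuring available space in character columns are assumptions, not details taken from the browser.

```python
def gap_label(gap_size: int, columns_available: int) -> str:
    """Return the text shown on the Gaps line for one gap in the human sequence.

    columns_available is a hypothetical measure of display space,
    counted in characters.
    """
    if gap_size <= 0:
        return ""  # no gap at this position
    text = str(gap_size)
    if len(text) <= columns_available:
        return text  # enough room: show the gap size itself
    # Not enough room: "*" marks a multiple of 3, "+" any other size.
    return "*" if gap_size % 3 == 0 else "+"


print(gap_label(12, 2))  # 12
print(gap_label(12, 1))  # *
print(gap_label(10, 1))  # +
```

A gap whose length is a multiple of 3 keeps the reading frame intact, which is presumably why it is marked differently from other gap sizes.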
Codon translation is available in base-level display mode if the
displayed region is identified as a coding segment. To display this annotation,
select the species for translation from the pull-down menu in the Codon
Translation configuration section at the top of the page. Then, select one of
the following modes:
- No codon translation: the gene annotation is not used; the bases are
displayed without translation.
- Use default species reading frames for translation: the annotations from the genome
displayed in the Default species for translation pull-down menu are used to
translate all the aligned species present in the alignment.
- Use reading frames for species if available, otherwise no translation: codon
translation is performed only for those species where the region is
annotated as protein coding.
- Use reading frames for species if available, otherwise use default species:
codon translation is done on those species that are annotated as being protein
coding over the aligned region using species-specific annotation; the remaining
species are translated using the default species annotation.
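The four translation modes differ only in which species' annotation supplies the reading frames for a given aligned species. A minimal sketch of that choice is shown below. The mode constants, the function name `frame_source`, and the `annotated` set (species with a protein-coding annotation over the region) are all hypothetical names introduced for illustration.

```python
from typing import Optional, Set

# Hypothetical identifiers for the four modes listed above.
NO_TRANSLATION = "no_translation"
DEFAULT_ONLY = "default_only"
SPECIES_OR_NONE = "species_or_none"
SPECIES_OR_DEFAULT = "species_or_default"


def frame_source(mode: str, species: str, default_species: str,
                 annotated: Set[str]) -> Optional[str]:
    """Return the species whose annotation provides reading frames for
    `species`, or None if its bases are shown without translation."""
    if mode == NO_TRANSLATION:
        return None
    if mode == DEFAULT_ONLY:
        return default_species
    if mode not in (SPECIES_OR_NONE, SPECIES_OR_DEFAULT):
        raise ValueError(f"unknown mode: {mode}")
    if species in annotated:
        return species  # species-specific reading frames are available
    return default_species if mode == SPECIES_OR_DEFAULT else None


annotated = {"human", "mouse"}
print(frame_source(SPECIES_OR_DEFAULT, "cow", "human", annotated))   # human
print(frame_source(SPECIES_OR_NONE, "cow", "human", annotated))      # None
print(frame_source(SPECIES_OR_NONE, "mouse", "human", annotated))    # mouse
```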
Codon translation uses the following gene tracks as the basis for
translation, depending on the species chosen.
Species listed in the row labeled "None" do not have
species-specific reading frames for gene translation.
Gene Track | Species |
Gencode Genes | human |
UCSC Genes | mouse |
Known Genes | rat |
RefSeq Genes | chimp |
Ensembl Genes | rhesus, opossum |
None | the remaining 30 species |
Methods
TBA
TBA was used to align sequences in the December 2007 ENCODE sequence data
freeze. Multiple alignments were seeded from a series of combinatorial pairwise
blastz alignments (not referenced to any one species). The specific
combinations were determined by the
species guide tree.
The resulting multiple alignments were projected onto the human reference
sequence.
BinCons
The binCons score is based on the cumulative binomial probability of
detecting the observed number of identical bases (or greater) in
sliding 25 bp windows (moving one bp at a time) between the
reference sequence and each other species, given the neutral rate
at four-fold degenerate sites. Neutral rates are calculated
separately at each targeted region. For targets with no gene annotations,
the average percent identity across all alignable sequence was instead used
to weight the individual species binomial scores; this latter
weighting scheme was found to closely match 4D weights.
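The per-window calculation can be illustrated for a single pair of aligned sequences. The sketch below is not the binCons implementation: it assumes that the neutral rate is supplied as `p_identity`, the probability that two aligned bases are identical under neutral evolution, and that gap characters count as mismatches. It also omits the per-species weighting and the combination of scores across species. The function names are hypothetical.

```python
from math import comb


def binomial_tail(identical: int, window: int, p_identity: float) -> float:
    """P(X >= identical) for X ~ Binomial(window, p_identity)."""
    return sum(
        comb(window, k) * p_identity**k * (1.0 - p_identity) ** (window - k)
        for k in range(identical, window + 1)
    )


def window_tail_probabilities(ref: str, other: str, p_identity: float,
                              window: int = 25) -> list:
    """Slide a window one base at a time over two aligned sequences and
    return, for each window, the probability of seeing at least the
    observed number of identical bases."""
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: a gap ('-') in either sequence counts as a mismatch.
    same = [a == b and a != "-" for a, b in zip(ref.upper(), other.upper())]
    return [
        binomial_tail(sum(same[i:i + window]), window, p_identity)
        for i in range(len(same) - window + 1)
    ]


probs = window_tail_probabilities("ACGT" * 10, "ACGT" * 10, p_identity=0.5)
print(len(probs), probs[0])  # 16 windows, each with 25 of 25 identical bases
```

A small tail probability means the observed identity is unlikely under the neutral rate, which is the sense in which a window looks conserved.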
Clusters of bases
that exceeded the given conservation score threshold were designated
as conserved elements.
The minimum length of a conserved element is 25
bases. Strict cutoffs were used: if even one base fell below the
conservation score threshold, the element was split into two distinct
regions.
Regions reported here exceed a 5% False Discovery Rate
threshold, using a window size of 7 bases.
More details on binCons can be found in Margulies et al. (2003),
cited below.
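The element-calling rule described in this section (strict cutoff, minimum length of 25 bases) can be sketched as a single pass over per-base scores. This is an illustration under stated assumptions rather than the published code: the function name `call_elements` is hypothetical, scores are assumed to be one number per base with higher meaning more conserved, and a score exactly equal to the threshold is treated as not exceeding it.

```python
def call_elements(scores, threshold, min_length=25):
    """Return half-open (start, end) intervals of consecutive bases whose
    score exceeds the threshold and whose run length is at least min_length."""
    elements = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new run of conserved bases begins
        else:
            # Strict cutoff: a single base at or below the threshold ends the run.
            if start is not None and i - start >= min_length:
                elements.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        elements.append((start, len(scores)))
    return elements


# 30 high-scoring bases, one low base, 25 high bases, one low base, 10 high bases.
scores = [1.0] * 30 + [0.0] + [1.0] * 25 + [0.0] + [1.0] * 10
print(call_elements(scores, threshold=0.5))  # [(0, 30), (31, 56)]
```

The single low-scoring base at position 30 splits what would otherwise be one 56-base element into two, and the final 10-base run is dropped for being shorter than the minimum length.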
Chai
Chai is a DNA structure-informed evolutionary conservation algorithm that works in a
manner analogous to the primary sequence-based binCons.
Instead of computing the binomial probability of observed base substitutions between species,
Chai calculates the difference between DNA structural profiles as a measure of similarity.
Single-nucleotide-resolution structure profiles for genomic DNA are predicted using the
algorithm described in Greenbaum et al. (2007), cited below.
Regions reported here exceed a 5% False Discovery Rate threshold.
Credits
The TBA multiple alignments were created by Gayle McEwen & Elliott Margulies of NHGRI.
BinCons was developed by Elliott Margulies (Margulies et al. 2003).
Chai was developed by Steve Parker & Tom Tullius (Boston University), Elliott Margulies (NHGRI) and Loren Hansen (NCBI).
The programs Blastz and TBA, which were used to generate the alignments, were
provided by Minmei Hou, Scott Schwartz and Webb Miller of the
Penn State Bioinformatics
Group.
The phylogenetic tree is based on Murphy et al. (2001).
References
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM,
Baertsch R, Rosenbloom K, Clawson H, Green ED, et al.
Aligning multiple genomic sequences with the threaded blockset aligner.
Genome Res. 2004 Apr;14(4):708-15.
Chiaromonte F, Yap VB, Miller W.
Scoring pairwise genomic sequence alignments.
Pac Symp Biocomput. 2002;:115-26.
Greenbaum JA, Pang B, Tullius TD.
Construction of a genome-scale structural map at single-nucleotide resolution.
Genome Res. 2007 Jun;17(6):947-53.
Margulies EH, Blanchette M, NISC Comparative Sequencing Program,
Haussler D, Green ED.
Identification and characterization of multi-species conserved sequences.
Genome Res. 2003 Dec;13(12):2507-18.
Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E,
Ryder OA, Stanhope MJ, de Jong WW, Springer MS.
Resolution of the early placental mammal radiation using Bayesian phylogenetics.
Science. 2001 Dec 14;294(5550):2348-51.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W.
Human-mouse alignments with BLASTZ.
Genome Res. 2003 Jan;13(1):103-7.