Description
This track displays regions showing evidence for conservation with respect
to mutations involving sequence insertions and deletions (indels).
These “indel-purified sequences” (IPSs) were obtained by comparing the
predictions of a neutral model of indel evolution with data obtained from
human (hg18), mouse (mm8) and dog (canFam2) alignments (Lunter et al., 2006).
The evidence for conservation is statistical, and each region is annotated
with a posterior probability. This may be interpreted as the probability that
the segment's paucity of indels is due to selection rather than to random
chance.
Apart from the underlying alignment, these data are independent of the
conservation of the nucleotide sequence itself. Any inferred conservation
of the sequence, e.g. as shown by phastCons, is therefore independent evidence
for selection. It may happen that sequence is conserved with respect to
indel mutations without concomitant evidence of conservation of the
nucleotide sequence. The opposite may also happen.
Display Conventions
The score (based on the false discovery rate, FDR) is reflected in the
bluescale density gradient coloring the track items. Lighter colours reflect a
higher FDR.
Methods
In the absence of selection, indels have a certain predicted distribution
over the genome. The actual distribution shows an over-abundance of ungapped
regions, due to selection purifying functional sequence from the deleterious
effects of indels. Neutrally evolving sequence, such as (by and large)
ancestral repeats, shows an exceedingly good fit to the neutral predictions.
This accurate fit allows the identification of a good proportion of conserved
sequence at a relatively low false discovery rate (FDR). For example, at
an FDR of 10%, the predicted sensitivity is 75%.
Each identified indel-purified sequence (IPS) is annotated by two numbers: a
false discovery rate (FDR), and a posterior probability (p).
The FDR refers to a set of segments, not a given segment by itself. In this
case, it refers to the minimum FDR of all sets that include the segment of
interest. For example, a segment annotated with a 10% FDR also belongs to a
set with a 15% FDR, but not a set with a 5% FDR.
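The nesting of FDR sets described above can be sketched in a few lines of Python. This is only an illustration: the `Segment` type, its field names, and the example values are hypothetical and are not part of the track data format. It assumes each segment carries its annotated FDR as a fraction.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    name: str   # hypothetical identifier, for illustration only
    fdr: float  # annotated FDR as a fraction (0.10 means 10%)


def segments_at_fdr(segments: List[Segment], threshold: float) -> List[Segment]:
    """Return the segments that belong to the set with the given FDR.

    A segment's annotated FDR is the minimum FDR of all sets containing it,
    so it belongs to every set whose FDR is at least that value.
    """
    return [s for s in segments if s.fdr <= threshold]


example = [Segment("a", 0.01), Segment("b", 0.10), Segment("c", 0.40)]
names_at_15 = [s.name for s in segments_at_fdr(example, 0.15)]
names_at_05 = [s.name for s in segments_at_fdr(example, 0.05)]
```

Here segment "b", annotated with a 10% FDR, is included in the 15% set but not in the 5% set.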
The posterior probability does refer to the single segment by itself. It has
a frequentist interpretation, namely, as the proportion of regions, annotated
with the same posterior probability, that have been under purifying selection,
rather than showing the observed lack of indels by random chance.
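Under this interpretation, a segment with posterior probability p is a false discovery with probability 1 - p, so the expected proportion of false discoveries in a set of segments is the mean of 1 - p over that set. The sketch below illustrates this arithmetic only. It is an assumption made for illustration, not a statement of how the FDR values in this track were computed, and the function name is hypothetical.

```python
from typing import Sequence


def expected_false_fraction(posteriors: Sequence[float]) -> float:
    """Expected proportion of false discoveries in a set of segments.

    Each posterior p is read as the probability that the segment is under
    purifying selection, so 1 - p is the probability that it is not.
    Illustrative assumption; not taken from the track's own pipeline.
    """
    if not posteriors:
        raise ValueError("need at least one posterior probability")
    return sum(1.0 - p for p in posteriors) / len(posteriors)


# Two segments with posteriors 1.0 and 0.8: mean of (0.0, 0.2).
fraction = expected_false_fraction([1.0, 0.8])
```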
The data include segments with a false discovery rate of up to 50%. The
score is derived directly from the FDR, expressed as a fraction (e.g. 0.10
for 10%), through the following formula:
score = 90 / (FDR + 0.08)
This maps FDR 1% (the most restrictive category) to 999, and FDR 10% to 500.
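As a worked example, the formula and its inverse can be written as a short Python sketch. The function names are hypothetical, and the FDR is taken as a fraction, which is the reading that reproduces the value of 500 at FDR 10%. For FDR 1% the unrounded formula evaluates to approximately 1000, which the text above gives as 999.

```python
def fdr_to_score(fdr: float) -> float:
    """Convert an FDR (as a fraction, e.g. 0.10) to the track score."""
    return 90.0 / (fdr + 0.08)


def score_to_fdr(score: float) -> float:
    """Invert the formula: recover the FDR (as a fraction) from a score."""
    return 90.0 / score - 0.08


score_10 = fdr_to_score(0.10)   # 90 / 0.18
score_50 = fdr_to_score(0.50)   # 90 / 0.58, the lowest score in the data
fdr_back = score_to_fdr(500.0)  # recovers approximately 0.10
```

Because the score falls as the FDR rises, filtering the track on a minimum score of 500 would correspond to keeping segments with an FDR of 10% or lower.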
For further details of the methods, see Lunter et al., 2006.
Verification
The neutral indel model was calibrated using ancestral repeats, against which
it showed an excellent fit. Among the identified IPSs at 10% FDR and
predicted sensitivity of 75%, we found 75% of annotated protein-coding
exons (weighted by length), and 75% of the 222 microRNAs that were
annotated at the time. Ancestral repeats were heavily depleted among the
identified segments.
Credits
These data were generated by Gerton Lunter and Chris Ponting, MRC Functional
Genetics Unit, University of Oxford, United Kingdom and Jotun Hein,
Department of Statistics, University of Oxford, United Kingdom.
References
Lunter G, Ponting CP, Hein J.
Genome-wide identification of human functional
DNA using a neutral indel model. PLoS Comput Biol. 2006 Jan;2(1):e5.
The data may also be browsed here.