Cons Indels MmCf Track Settings
 
Indel-based Conservation for human hg18, mouse mm8 and dog canFam2

Data last updated: 2007-09-12

Description

This track displays regions showing evidence for conservation with respect to mutations involving sequence insertions and deletions (indels). These “indel-purified sequences” (IPSs) were obtained by comparing the predictions of a neutral model of indel evolution with data obtained from human (hg18), mouse (mm8) and dog (canFam2) alignments (Lunter et al., 2006). The evidence for conservation is statistical, and each region is annotated with a posterior probability. This may be interpreted as the probability that the segment's paucity of indels is due to selection rather than random chance.

Apart from the underlying alignment, these data are independent of the conservation of the nucleotide sequence itself. Any inferred conservation of the sequence, e.g. as shown by phastCons, is therefore independent evidence for selection. A sequence may be conserved with respect to indel mutations without concomitant evidence of conservation of the nucleotide sequence, and the opposite may also happen.

Display Conventions

The score (based on the false discovery rate, FDR) is reflected in the bluescale density gradient coloring the track items. Lighter colours reflect a higher FDR, and hence a lower score.

Methods

In the absence of selection, indels have a certain predicted distribution over the genome. The actual distribution shows an over-abundance of ungapped regions, due to selection purifying functional sequence from the deleterious effects of indels. Neutrally evolving sequence, such as (by and large) ancestral repeats, shows an exceedingly good fit to the neutral predictions. This accurate fit allows the identification of a good proportion of conserved sequence at a relatively low false discovery rate (FDR). For example, at an FDR of 10%, the predicted sensitivity is 75%.

Each identified indel-purified sequence (IPS) is annotated with two numbers: a false discovery rate (FDR) and a posterior probability (p). The FDR refers to a set of segments, not to a given segment by itself. Here, it is the minimum FDR over all sets that include the segment of interest. For example, a segment annotated with a 10% FDR also belongs to a set with a 15% FDR, but not to a set with a 5% FDR.

The posterior probability, by contrast, does refer to the single segment by itself. It has a frequentist interpretation: it is the proportion of regions annotated with the same posterior probability that have been under purifying selection, rather than showing the observed lack of indels by random chance.

The data include segments for a false discovery rate of up to 50%. The score directly reflects the FDR, through the following formula:

score = 90 / (FDR + 0.08)
Here the FDR is expressed as a fraction (0.10 for 10%). This maps FDR 1% (the most restrictive category) to 999 (the formula itself gives 1000), and FDR 10% to 500. For further details of the Methods, see Lunter et al., 2006.
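
The score formula can be applied in both directions, for instance to find the score threshold that corresponds to a chosen FDR. The following is a minimal Python sketch, assuming the FDR is given as a fraction; the function names and the example items are hypothetical and not part of the track data. At the 50% upper limit of the data the formula gives a score of about 155.

```python
def fdr_to_score(fdr: float) -> float:
    """Track score for an FDR given as a fraction (0.10 means 10%)."""
    if not 0.0 <= fdr <= 1.0:
        raise ValueError("fdr must be a fraction between 0 and 1")
    return 90.0 / (fdr + 0.08)


def score_to_fdr(score: float) -> float:
    """Invert the formula: the FDR fraction that corresponds to a score."""
    if score <= 0:
        raise ValueError("score must be positive")
    return 90.0 / score - 0.08


def select_at_fdr(items, max_fdr):
    """Keep (name, score) items whose annotated FDR is at most max_fdr.

    A lower FDR means a higher score, so this is a minimum-score filter.
    """
    threshold = fdr_to_score(max_fdr)
    # A small tolerance absorbs floating-point error at the boundary.
    return [(name, s) for name, s in items if s >= threshold - 1e-9]


# Hypothetical items, for illustration only.
items = [("ips1", 999), ("ips2", 500), ("ips3", 391)]
print(select_at_fdr(items, 0.10))
```

Because the sets are nested, any item kept at a 10% FDR is also kept at a 15% FDR, which mirrors the set-based definition of the FDR annotation given above.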

Verification

The neutral indel model was calibrated using ancestral repeats, against which it showed an excellent fit. Among the identified IPSs at 10% FDR and predicted sensitivity of 75%, we found 75% of annotated protein-coding exons (weighted by length), and 75% of the 222 microRNAs that were annotated at the time. Ancestral repeats were heavily depleted among the identified segments.
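
One plausible reading of "weighted by length" is the fraction of exonic bases that fall inside an identified segment. The sketch below computes that quantity for half-open (start, end) intervals on a single chromosome. It is an illustration under that assumption, not the procedure used by Lunter et al.; the function names and coordinates are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(exons, ips):
    """Fraction of exon bases that overlap any IPS interval."""
    exons, ips = merge(exons), merge(ips)
    total = sum(end - start for start, end in exons)
    covered = 0
    j = 0
    for start, end in exons:
        # Skip IPS intervals that end before this exon begins.
        while j < len(ips) and ips[j][1] <= start:
            j += 1
        k = j
        # Add the overlap with every IPS interval that starts inside the exon span.
        while k < len(ips) and ips[k][0] < end:
            covered += min(end, ips[k][1]) - max(start, ips[k][0])
            k += 1
    return covered / total if total else 0.0


# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400)]
ips = [(150, 250), (300, 350)]
print(covered_fraction(exons, ips))
```

Merging both interval lists first ensures that overlapping annotations are not counted twice.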

Credits

These data were generated by Gerton Lunter and Chris Ponting, MRC Functional Genetics Unit, University of Oxford, United Kingdom and Jotun Hein, Department of Statistics, University of Oxford, United Kingdom.

References

Lunter G, Ponting CP, Hein J. Genome-wide identification of human functional DNA using a neutral indel model. PLoS Comput Biol. 2006 Jan;2(1):e5.