Description
This track shows which genome regions are more or less accessible to
next-generation sequencing methods that use short, paired-end reads. It summarizes whole
genome sequencing data from Phase 1 of the
1000 Genomes Project and shows two
levels of stringency: "pilot" stringency regions (see below) cover 94% of non-N bases
in the genome and "strict" regions cover 72% of non-N bases. Every site that meets
the "strict" criteria also passes the "pilot" criteria.
This track does not show a mask of regions in which variant calls can
or cannot be made.
Some 1000 Genomes Phase 1 variant calls are in regions that do not meet the
"strict" criteria.
Phase 1 variant calls are filtered using the
Variant Quality Score Recalibrator (VQSR)
method (implemented in the
Genome Analysis Toolkit (GATK))
without regard to the thresholds applied here. VQSR assesses only sites where there
is evidence for variation; it says nothing about the remaining
sites.
These regions will be useful for (a) comparing accessibility using current technologies
to accessibility in the 1000 Genomes Pilot Project, and (b) population genetic
analyses (such as estimates of mutation rate) that must focus on genomic regions
with very low false positive and false negative rates.
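For use (b), an analysis would typically keep only positions that fall inside the "strict" regions. The following is a minimal sketch of such a lookup, assuming the regions for one chromosome have already been loaded as sorted, non-overlapping, 0-based half-open intervals (the .bed convention). The function name `in_regions` and the sample intervals are hypothetical and are not taken from the track data.

```python
from bisect import bisect_right

def in_regions(starts, ends, pos):
    """Return True if 0-based position `pos` lies in any [start, end) interval.

    `starts` and `ends` are parallel lists for one chromosome, sorted by start,
    with no overlapping intervals (illustrative input, not real track data).
    """
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]

# Hypothetical intervals for a single chromosome.
starts = [100, 500, 900]
ends = [200, 650, 1000]

variant_positions = [150, 200, 499, 500, 999, 1000]
kept = [p for p in variant_positions if in_regions(starts, ends, p)]
```

Because interval ends are exclusive, positions 200 and 1000 are dropped while 150, 500 and 999 are kept.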
Methods
The total depth of mapped sequence reads, the average mapping quality score,
and the fraction of reads with mapping quality zero (meaning that the read maps
equally well to more than one location in the genome) are tabulated from 1,103
.bam files in the 1000 Genomes Phase 1 low coverage
data release.
This combines low-coverage whole-genome sequence information from 1,092
individuals, giving a genome-wide average total depth of coverage of 5,132 reads.
Both "pilot" and "strict" tracks are .bed file conversions of the
"pass" regions from
.fasta mask files.
See the
Genome Masks README file in that directory for details.
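The conversion from a per-base mask to .bed intervals can be sketched as follows. This is an illustration only, not the script used to build the track. It assumes the mask holds one character per reference base and that passing bases are marked with the character "P"; the README is the authoritative source for the actual encoding. The function name `mask_to_bed` and the example string are hypothetical.

```python
from itertools import groupby

def mask_to_bed(chrom, mask, pass_char="P"):
    """Yield (chrom, start, end) for each run of `pass_char` in `mask`.

    Coordinates are 0-based and half-open, as in .bed format.
    `pass_char="P"` is an assumption about the mask encoding.
    """
    pos = 0
    for char, run in groupby(mask):
        length = sum(1 for _ in run)
        if char == pass_char:
            yield (chrom, pos, pos + length)
        pos += length

# Hypothetical 12-base mask; any character other than "P" is treated as non-pass.
example = "NNPPPLLPPHPP"
intervals = list(mask_to_bed("chr1", example))
```

For the example string, the three runs of "P" become the intervals 2-5, 7-9 and 10-12 on the hypothetical chromosome "chr1".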
The "pilot" criteria require a depth of coverage between 2,566 and 10,264 inclusive
(between one-half and twice the average depth) and that no more than 20% of
covering reads have mapping quality zero. These are equivalent to the criteria
used for analyses in the 1000 Genomes Pilot paper (2010). The "strict" criteria
require a depth of coverage between 2,566 and 7,698 inclusive, no more than 0.1%
of reads with mapping quality zero, and an average mapping quality of 56 or greater.
This definition is quite stringent and focuses on the most uniquely mappable regions of the
genome. In regions that passed the strict criteria, only ~2% of sites called in an
initial analysis were rejected as likely false positives by VQSR. Because approximately
one-half of 1000 Genomes Project individuals are male, the depth of coverage
is generally lower on the X chromosome. Coverage thresholds on the X chromosome were
adjusted by a factor of 3/4 and on the Y chromosome by a factor of 1/2. The "pilot"
criteria were not evaluated for the Y chromosome.
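The thresholds above can be expressed as a small per-site classifier. This is a sketch under stated assumptions, not the code used to generate the masks. The 1.5x upper bound for "strict" is inferred from 7,698 / 5,132; the sex-chromosome factor is assumed to scale both the lower and upper depth bounds; and the function name `classify_site` and its parameters are hypothetical.

```python
MEAN_DEPTH = 5132  # genome-wide average total depth from the text

# Depth multipliers for the sex chromosomes; applying them to both
# bounds is an assumption.
CHROM_FACTOR = {"X": 0.75, "Y": 0.5}

def classify_site(depth, mq0_fraction, mean_mapq, chrom="1"):
    """Return "strict", "pilot", or "fail" for one site."""
    factor = CHROM_FACTOR.get(chrom, 1.0)
    low = 0.5 * MEAN_DEPTH * factor          # 2,566 on autosomes
    pilot_high = 2.0 * MEAN_DEPTH * factor   # 10,264 on autosomes
    strict_high = 1.5 * MEAN_DEPTH * factor  # 7,698 on autosomes

    strict = (low <= depth <= strict_high
              and mq0_fraction <= 0.001      # no more than 0.1% MQ0 reads
              and mean_mapq >= 56)
    if strict:
        return "strict"
    # The "pilot" criteria were not evaluated for the Y chromosome.
    if chrom != "Y" and low <= depth <= pilot_high and mq0_fraction <= 0.20:
        return "pilot"
    return "fail"
```

Checking "strict" first reflects the statement that every "strict" site also passes "pilot", so a site labelled "strict" here belongs to both tracks (except on the Y chromosome, where only "strict" is defined).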
1000 Genomes Phase 1 sequencing was done between 2008 and 2010 using Illumina
(86.4%), AB SOLiD (13%) and Roche LS 454 (0.6%) sequencing technologies.
Of the Illumina coverage, 45% is in approximately 100 bp paired-end reads, 31.5% is in 76 bp
reads, 15% is in 51 bp reads and 8.5% is in 36 bp reads. All AB SOLiD data are 50 bp
mate-paired reads. Paired-end sequence reads were mapped against the GRCh37 human
genome reference sequence using
BWA version 0.5.5 (Illumina),
BFAST version 0.6.4e (AB SOLiD) and
SSAHA version 2.5 (Roche LS 454).
Full details are in a
README file
and in supplementary materials to the
Phase 1 paper (2012).
The mapping target consists of the 22 autosomes plus the X and
Y chromosomes (both pseudo-autosomal regions on the Y are masked by Ns), the
revised CRS mitochondrial sequence (NC_012920), and 59 unplaced contigs. It does
not include the human herpesvirus 4 sequence (used for cell line transformation)
or the approximately 5 Mb of additional "decoy" sequence compiled from other human
entries in GenBank. Those sequences were added to the mapping target in July 2011 and
will be included in the mapping during 1000 Genomes Phase 2. Full details are in the
Human Decoy Sequences README file.
Credits
Mary Kate Trost and Gonçalo Abecasis at the
University of Michigan Center for Statistical Genetics
generated the original masked fasta files.
Tom Blackwell at UM converted those to .bed format and edited the
description and methods.
Thanks to the
1000 Genomes Project
for making Phase 1 data available in advance of publication.
References
1000 Genomes Pilot Project:
1000 Genomes Project Consortium.
A map of human genome variation from population-scale sequencing.
Nature. 2010 Oct 28;467(7319):1061-73.
Phase 1 of the 1000 Genomes Project:
1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA.
An integrated map of genetic variation from 1,092 human genomes.
Nature. 2012 Nov 1;491(7422):56-65.
1000 Genomes Frequently Asked Questions (FAQ)