Description
This track shows which genome regions are more or less accessible to
next-generation sequencing methods that use short, paired-end reads. It summarizes whole
genome sequencing data from Phase 3 of the
1000 Genomes Project and shows two
levels of stringency: "pilot" stringency regions (see below) cover 94.5%
of non-N bases in the genome (excluding alternate haplotype sequences and unplaced contigs;
95.9% on autosomes)
and "strict" regions cover 75.5% (76.9% on autosomes). Every site that meets the
"strict" criteria also passes the "pilot" criteria.
This track does not show a mask of regions in which variant calls can
or cannot be made.
Some 1000 Genomes Phase 3 variant calls are in regions that do not meet the
"strict" criteria.
Phase 3 variant calls are filtered using various tools such as the
Variant Quality Score Recalibrator (VQSR)
method (implemented in the
Genome Analysis Toolkit (GATK))
without regard to the thresholds applied here.
VQSR and similar tools assess the evidence for variation at sites where a variant is called,
but say nothing about the remaining sites.
The 1000 Genomes Project Phase 3 variant calls combine information from low coverage sequencing,
exome sequencing and array genotyping for improved sensitivity and specificity.
The coverage masks are based on low coverage sequencing only.
These regions will be useful for (a) comparing accessibility using current technologies
to accessibility in the 1000 Genomes Pilot Project, and (b) population genetic
analyses (such as estimates of mutation rate) that must focus on genomic regions
with very low false positive and false negative rates.
Methods
The total depth of mapped sequence reads, the average mapping quality score
and the fraction of reads with mapping quality zero (meaning that the read maps
equally well to more than one location in the genome) are tabulated from
1000 Genomes Project Phase 3 .bam files.
This combines whole genome sequence data from 2,504 individuals,
giving a genome-wide average depth of coverage of 17,920 reads, summed across all individuals.
Both "pilot" and "strict" tracks are .bed file conversions of the
"pass" regions from
.fasta mask files.
See the
README file in that directory and
Supplementary Information (section 9.2)
of 1000 Genomes Project Consortium, et al. (2015) for more details.
The "pilot" criteria require a depth of coverage between 8,960 and 35,840 inclusive
(between one-half and twice the average depth) and that no more than 20% of
covering reads have mapping quality zero. These are equivalent to the criteria
used for analyses in the 1000 Genomes Pilot paper (2010). The "strict" criteria
require a depth of coverage between 8,960 and 26,880 inclusive, no more than 0.1%
of reads with mapping quality zero, and an average mapping quality of 56 or greater.
This definition is quite stringent and focuses on the most uniquely mappable regions of the
genome.
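The two sets of criteria can be sketched as simple predicates on per-site summary statistics. This is only an illustration of the thresholds quoted above for autosomes, not the code used to build the masks; the names SiteStats, passes_pilot and passes_strict are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteStats:
    """Per-site summary values (hypothetical container, not from the source)."""
    depth: int            # total depth of mapped reads, summed over all samples
    mq0_fraction: float   # fraction of covering reads with mapping quality zero
    mean_mapq: float      # average mapping quality of covering reads


def passes_pilot(site: SiteStats) -> bool:
    # Depth between one-half and twice the average (17,920), inclusive,
    # and no more than 20% of reads with mapping quality zero.
    return 8_960 <= site.depth <= 35_840 and site.mq0_fraction <= 0.20


def passes_strict(site: SiteStats) -> bool:
    # Narrower depth window, at most 0.1% of reads with mapping quality zero,
    # and an average mapping quality of 56 or greater.
    return (
        8_960 <= site.depth <= 26_880
        and site.mq0_fraction <= 0.001
        and site.mean_mapq >= 56
    )


# Made-up example values, for illustration only.
well_covered = SiteStats(depth=17_920, mq0_fraction=0.0, mean_mapq=60.0)
deep_repeat = SiteStats(depth=30_000, mq0_fraction=0.10, mean_mapq=40.0)
```

Because every "strict" threshold lies inside the corresponding "pilot" threshold, any site accepted by passes_strict is also accepted by passes_pilot.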
Since approximately
one-half of 1000 Genomes Project individuals are male and carry a single X chromosome, the depth of coverage
is generally lower on the X chromosome than on the autosomes. Coverage thresholds on the X chromosome were
adjusted by a factor of 3/4 and on the Y chromosome by a factor of 1/2.
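As a worked example of this adjustment, the sketch below scales the autosomal depth thresholds by the stated factors. It assumes that both the lower and upper bounds are multiplied by the same factor, which the text does not spell out; the names AUTOSOME_BOUNDS, SCALE and depth_bounds are hypothetical.

```python
from fractions import Fraction

# Autosomal depth bounds (inclusive) quoted in the text.
AUTOSOME_BOUNDS = {
    "pilot": (8_960, 35_840),
    "strict": (8_960, 26_880),
}

# Scaling factors stated for the sex chromosomes; all other chromosomes use 1.
SCALE = {
    "chrX": Fraction(3, 4),
    "chrY": Fraction(1, 2),
}


def depth_bounds(chrom: str, level: str) -> tuple[int, int]:
    """Return (low, high) depth bounds for a chromosome and stringency level.

    Assumption: both bounds are scaled by the same factor.
    """
    factor = SCALE.get(chrom, Fraction(1))
    low, high = AUTOSOME_BOUNDS[level]
    return int(low * factor), int(high * factor)


for chrom in ("chr1", "chrX", "chrY"):
    for level in ("pilot", "strict"):
        print(chrom, level, depth_bounds(chrom, level))
```

Under that assumption the "pilot" window becomes 6,720 to 26,880 on the X chromosome and 4,480 to 17,920 on the Y chromosome.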
Data Access
The raw data can be explored interactively with the
Table Browser, or the
Data Integrator.
For automated analysis, the genome annotation is stored in a bigBed file that can be downloaded from the
download server.
The underlying data files for this track are called
20141020.pilot_mask.whole_genome.bb and 20141020.strict_mask.whole_genome.bb.
Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed,
which can be compiled from the source code or downloaded as a precompiled binary
for your system. Instructions for downloading source code and binaries can be found
here.
The tool can also be used to obtain only features within a given range, for example:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg19/1000Genomes/phase3/20141020.strict_mask.whole_genome.bb -chrom=chr6 -start=0 -end=1000000 stdout
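The BED text that the command writes to stdout can then be summarized with a short script. The sketch below is one possible way to count how many bases in a window fall inside mask regions. It assumes tab-separated lines whose first three fields are chromosome, start and end (0-based, half-open, as in the BED format) and that regions do not overlap one another; the function name covered_bases and the example lines are hypothetical.

```python
from typing import Iterable


def covered_bases(bed_lines: Iterable[str], start: int, end: int) -> int:
    """Sum the bases of BED features that fall inside [start, end).

    Assumes features do not overlap each other, so lengths can be added.
    """
    total = 0
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        feat_start, feat_end = int(fields[1]), int(fields[2])
        # Clip each feature to the requested window before counting.
        overlap = min(feat_end, end) - max(feat_start, start)
        if overlap > 0:
            total += overlap
    return total


# Made-up lines in the assumed layout, for illustration only.
example = [
    "chr6\t100\t250\tstrict",
    "chr6\t900\t1200\tstrict",
]
print(covered_bases(example, 0, 1000))  # 150 + 100 = 250
```

To use it on real output, the lines could be read from a file or from sys.stdin after piping the bigBedToBed command into the script.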
Please refer to our mailing list archives
for questions, or our Data Access FAQ for more information.
Credits
Thank you to
Mary Kate Wing at the
University of Michigan Center for Statistical Genetics
for providing the track data files.
Thank you to Tom Blackwell and Mary Kate Wing at UM for editing the description and methods.
References
1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA,
Hurles ME, McVean GA.
A map of human genome variation from population-scale sequencing.
Nature. 2010 Oct 28;467(7319):1061-73.
PMID: 20981092; PMC: PMC3042601
1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO,
Marchini JL, McCarthy S, McVean GA et al.
A global reference for human genetic variation.
Nature. 2015 Oct 1;526(7571):68-74.
PMID: 26432245