Description
The goal of the
NHLBI GO Exome Sequencing Project (ESP)
is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by
pioneering the application of next-generation sequencing of the protein coding regions of the
human genome across diverse, richly-phenotyped populations and to share these datasets and
findings with the scientific community to extend and enrich the diagnosis, management and
treatment of heart, lung and blood disorders. The current data release (ESP6500SI-V2-SSA137),
available through the
EVS website,
is taken from 6,503 samples drawn from multiple
ESP cohorts
and represents all of the ESP exome variant data.
Data in this track were obtained from
EVS release version v.0.0.25 (Feb. 7, 2014).
Display Conventions
In "dense" mode, a vertical line is drawn at the position of each
variant.
In "pack" and "full" modes, in addition to the vertical line, a label to
the left shows the reference allele first and variant alleles below
(A = red, C = blue,
G = green, T = magenta,
Indels = black).
Hovering the pointer over any variant displays the number of occurrences of each
allele in the Exome Sequencing Project's database. Clicking on any variant displays
full details of that variant, along with links to the ESP and dbSNP
databases where available.
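As a rough illustration of these conventions, the sketch below maps alleles to their label colors and turns per-allele occurrence counts into frequencies. The function names and the example counts are hypothetical and are not taken from the browser code or the ESP data.

```python
# Label colors as listed in the display conventions above.
ALLELE_COLORS = {"A": "red", "C": "blue", "G": "green", "T": "magenta"}


def allele_color(allele):
    """Return the label color for an allele.

    Anything that is not a single A/C/G/T base is treated as an indel
    and drawn in black.
    """
    return ALLELE_COLORS.get(allele.upper(), "black")


def allele_frequencies(counts):
    """Convert per-allele occurrence counts into frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {allele: 0.0 for allele in counts}
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical counts for one variant (illustrative only).
counts = {"G": 12990, "A": 16}
for allele, freq in allele_frequencies(counts).items():
    print(f"{allele}\t{allele_color(allele)}\t{counts[allele]}\t{freq:.4%}")
```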
Methods
Sequences were aligned to the NCBI build 37 human genome reference using BWA. PCR duplicates
were removed using Picard. Alignments were recalibrated using GATK. Lane-level indel realignments
and base alignment quality (BAQ) adjustments were applied.
All data were simultaneously analyzed for exome SNP variants at the University of Michigan
(by the Abecasis Laboratory). SNPs were called using a two-step approach. First, genotype
likelihood files (GLFs) were generated using samtools pileup on individual BAM files. Next,
we used glfMultiples, a multi-sample variant caller, to generate initial SNP calls. Details of
the likelihood model implemented in glfMultiples are given in Li, et al., 2011
(in the section entitled "Identifying Potential Polymorphic Sites"). The Michigan SNP calling pipeline
is available at:
http://genome.sph.umich.edu/wiki/UMAKE.
This pipeline makes diploid calls for pseudo-autosomal regions of male samples and haploid
calls for the rest of the chromosome. Female samples have diploid calls for all regions on
the X chromosome. SNPs were filtered by a machine-learning technique called support
vector machine (SVM) classification (for a detailed description, see
Filter Status).
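To make the idea of a genotype likelihood concrete, here is a minimal sketch of a textbook diploid model computed from the bases observed at one site. It is not the samtools GLF computation, nor the glfMultiples likelihood model described in Li, et al., 2011; the function names and the example pileup are assumptions made for illustration.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(qual):
    """Convert a Phred base quality to an error probability."""
    # Clamp at 0.75, the point where a base call carries no information,
    # so that very low qualities never produce a zero likelihood.
    return min(10 ** (-qual / 10), 0.75)


def genotype_log_likelihoods(pileup):
    """Return log10 likelihoods for the 10 unordered diploid genotypes.

    pileup is a list of (base, phred_quality) observations at one site.
    Each read is assumed to sample either chromosome with probability 0.5,
    and errors are assumed to be spread evenly over the other three bases.
    """
    result = {}
    for a1, a2 in combinations_with_replacement(BASES, 2):
        log_lik = 0.0
        for base, qual in pileup:
            err = phred_to_error(qual)
            p1 = 1 - err if base == a1 else err / 3
            p2 = 1 - err if base == a2 else err / 3
            log_lik += math.log10(0.5 * p1 + 0.5 * p2)
        result[a1 + a2] = log_lik
    return result


# Hypothetical pileup: five A reads and five G reads, all at Q30.
example = [("A", 30)] * 5 + [("G", 30)] * 5
likelihoods = genotype_log_likelihoods(example)
best = max(likelihoods, key=likelihoods.get)
print(best, round(likelihoods[best], 2))
```

A multi-sample caller combines per-sample likelihoods like these across all individuals before deciding whether a site is polymorphic; that step is not shown here.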
Small INDEL variants were analyzed at the Broad Institute (by the Genome Sequencing and
Analysis group) using the
GATK
variation discovery pipeline following the guidelines in the
GATK best practices v4.
More specifically, each BAM was reduced to create a Reduced BAM, and then INDELs were
discovered by analyzing all samples simultaneously with the GATK
UnifiedGenotyper,
and subsequently filtered by the GATK Variant Quality Score Recalibration (VQSR) filtering
model, again following the V4 best practices. The INDEL genotypes for X and Y chromosomes
were adjusted to be consistent with the samples' genders. Female samples have diploid calls
for all regions on the X chromosome. Male samples have diploid calls for pseudo-autosomal
regions on the X chromosome and haploid calls for the rest of the X chromosome and on the
Y chromosome as well. Note that the INDEL calls for the ESP data are preliminary and not as
robust as the SNP calls at this point. Users are advised to keep this difference in mind
when applying the ESP data to research studies.
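The sex-aware ploidy rules described above can be summarized in a short sketch. The pseudo-autosomal coordinates are the commonly cited GRCh37 chrX boundaries and are an assumption here, since this page does not list them. The function name, the "M"/"F" encoding, and the choice to return 0 for Y in female samples are likewise illustrative assumptions.

```python
# GRCh37 chrX pseudo-autosomal regions, 1-based inclusive.
# Assumed coordinates; they are not stated in the track description.
PAR_X_GRCH37 = [(60001, 2699520), (154931044, 155260560)]


def in_par_x(pos):
    """Return True if a chrX position falls inside PAR1 or PAR2."""
    return any(start <= pos <= end for start, end in PAR_X_GRCH37)


def expected_ploidy(sex, chrom, pos):
    """Return the expected call ploidy for a sample at a position.

    sex is "F" or "M"; chrom may be given with or without a "chr" prefix.
    """
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom == "X":
        if sex == "F":
            return 2  # females: diploid everywhere on X
        return 2 if in_par_x(pos) else 1  # males: diploid only in the PARs
    if chrom == "Y":
        # Males: haploid on Y. Females: no Y calls expected (assumption).
        return 1 if sex == "M" else 0
    return 2  # autosomes


print(expected_ploidy("M", "chrX", 100000))   # inside PAR1
print(expected_ploidy("M", "chrX", 5000000))  # outside the PARs
```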
All SNPs and INDELs were further annotated by
SeattleSeqAnnotation137,
and the variant annotations at the coding-DNA and protein levels mostly follow
HGVS
notation.
The SNP calls are included in the release of dbSNP build 138. The full dataset is described in
Fu, et al., 2013, and a subset of the data (i.e., 2,500 exomes) was published by the ESP Population
Genetics and Statistical Analysis Working Group in Tennessen, et al., 2012.
Credits
The authors would like to thank the
NHLBI GO Exome Sequencing Project
and its ongoing studies, which produced and provided exome variant calls for comparison: the
Lung GO Sequencing Project (HL-102923),
the
WHI Sequencing Project (HL-102924),
the
Broad GO Sequencing Project (HL-102925),
the
Seattle GO Sequencing Project (HL-102926),
and the
Heart GO Sequencing Project (HL-103010).
Contact: evsserver@uw.edu
References
Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ, Altshuler D, Shendure J et al.
Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants.
Nature. 2013 Jan 10;493(7431):216-20.
PMID: 23201682; PMC: PMC3676746
Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR.
Low-coverage sequencing: implications for design of complex trait association studies.
Genome Res. 2011 Jun;21(6):940-51.
PMID: 21460063; PMC: PMC3106327
Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G et al.
Evolution and functional impact of rare coding variation from deep sequencing of human exomes.
Science. 2012 Jul 6;337(6090):64-9.
PMID: 22604720; PMC: PMC3708544