Genotype Imputation

Abstract

A missing data problem arises in genetic epidemiological studies when genotypes of particular markers are unavailable for analysis for reasons of data quality, cost efficiency or technical design. In such instances, imputation methods can be used to extend scientific inference from the typed to the un‐typed markers. The information required to infer unobserved genotypes from observed genotypes is provided by the so‐called imputation base, an internal or external set of comprehensively typed individuals (often taken from HapMap or the 1000 Genomes Project) that is representative of the study population as a whole. The most popular genotype imputation methods, including IMPUTE, fastPHASE, MaCH and BEAGLE, employ a Markov chain model of the haplotype distribution in the population of interest. Although these frameworks have been shown to provide accurate and efficient tools for ‘in silico genotyping’ under certain conditions, they should not be used uncritically.

Key Concepts

  • Relevant genotype data may be missing in genetic epidemiological studies for technical or efficiency reasons.
  • Scientific inference that takes missing genotype data properly into account can be made using data imputation methods.
  • Genotype imputation requires an imputation base, that is, a population‐representative set of individuals (such as the HapMap or 1000 Genomes Project samples) who are genotyped for all markers of interest.
  • Established genotype imputation methods employ a Markov chain model of the haplotype distribution in the population under study.
  • Genotype imputation may achieve 90% accuracy for highly polymorphic markers but performs less well for rare variants.
  • While genotype imputation may provide valid statistical tests of genotype–phenotype association, its use for effect size estimation and significance assessment must proceed with caution.
  • Genotype imputation needs to follow the same rules of good scientific practice as laboratory‐based data generation.
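
The dependence of accuracy on marker polymorphism noted above can be checked in practice by masking known genotypes, imputing them and comparing the result with the truth. The following is a minimal sketch that assumes accuracy is measured as genotype concordance and that genotypes are coded as allele counts (0, 1 or 2); the function name `concordance_by_maf` and the 5% frequency threshold are illustrative choices, not taken from any of the imputation packages named in this article.

```python
import numpy as np

def concordance_by_maf(true_geno, imputed_geno, maf_threshold=0.05):
    """Mean genotype concordance for rare and common markers.

    true_geno, imputed_geno : (n_individuals, n_markers) arrays of
        allele counts coded 0, 1 or 2 (an assumed coding).
    maf_threshold : markers with a minor allele frequency below this
        value are treated as rare (an illustrative cut-off).
    """
    true_geno = np.asarray(true_geno)
    imputed_geno = np.asarray(imputed_geno)

    # Minor allele frequency per marker, estimated from the true genotypes.
    freq = true_geno.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)

    # Fraction of individuals whose imputed genotype equals the true one.
    per_marker = (true_geno == imputed_geno).mean(axis=0)

    rare = maf < maf_threshold

    def mean_or_nan(values):
        # An empty group has no defined concordance.
        return float(values.mean()) if values.size else float("nan")

    return {
        "rare": mean_or_nan(per_marker[rare]),
        "common": mean_or_nan(per_marker[~rare]),
    }
```

Reporting the two groups separately avoids a single overall figure that is dominated by common markers and hides poor performance for rare variants.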

Keywords: missing data; allelic association; Markov chain; haplotype distribution; maximum likelihood; linkage disequilibrium; microarray; population history; recombination; HapMap; 1000 Genomes Project

Figure 1. Genotype imputation with IMPUTE. The method is illustrated for 10 linked SNPs with alleles encoded by 0 and 1 (indicating, e.g. the presence or absence of a reference allele), using an imputation base that is phased and that comprises four different haplotypes. (a) Every population haplotype is assumed to be a mosaic of haplotypes from the imputation base. The respective haplotype distribution is defined by a Markov chain model with transition probabilities depending on the population history and the local recombination map. High transition probabilities are indicated by bold arrows; thin arrows indicate lower transition probabilities. Possible mutations are highlighted in red. (b) The probability of an un‐phased genotype with missing data is evaluated by considering all possible pairs of mosaic haplotypes that would be compatible with the observed data. The most probable pair determines the most probable genotypes at un‐typed SNPs (i.e. 0–1, 1–1 and 0–0 rather than 1–1, 0–1 and 0–1 in the present example).
Figure 2. Genotype imputation with fastPHASE. Haplotypes in the population (and therefore the imputation base, too) are assumed to cluster around a few frequent haplotypes. These clusters (labelled A to D in the example) define the different states of the Markov chain generating the population haplotype distribution. Alleles at different marker positions are colour‐coded for illustration purposes alone, and different colours may correspond to identical alleles at a given position (e.g. in the case of SNPs).
Figure 3. Genotype imputation with BEAGLE. The BEAGLE method defines clusters at the allele rather than the haplotype level so that, for SNPs, the cluster number equals two at each position (labelled green or yellow). The transition probabilities of the haplotype distribution‐generating Markov chain depend on the local level of inter‐marker allelic association in the imputation base and study sample combined. Intermediate transition probabilities are indicated by broken arrows.
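
The mosaic model described for Figure 1 can be illustrated with a small hidden Markov model. The sketch below is a deliberate simplification and not the IMPUTE implementation: it treats a single, already phased haplotype rather than pairs of haplotypes for an un‐phased genotype, and it uses one constant switch probability and one constant mismatch probability in place of values derived from the population history and the recombination map. The function name `impute_haplotype` and both default parameter values are assumptions made for illustration.

```python
import numpy as np

def impute_haplotype(panel, observed, switch=0.05, mismatch=0.01):
    """Probability of allele 1 at every site of one haplotype.

    panel    : (K, L) array of 0/1 reference haplotypes (imputation base).
    observed : length-L sequence of 0, 1 or None (None = un-typed site).
    switch   : assumed probability, per marker interval, of jumping to a
               randomly chosen reference haplotype.
    mismatch : assumed probability that the copied allele is altered
               (mutation or genotyping error).
    """
    panel = np.asarray(panel)
    K, L = panel.shape

    # Emission probabilities: uninformative (1) at un-typed sites,
    # otherwise match or mismatch against each reference haplotype.
    emit = np.ones((L, K))
    for j, allele in enumerate(observed):
        if allele is not None:
            emit[j] = np.where(panel[:, j] == allele, 1.0 - mismatch, mismatch)

    # Transitions: keep copying the same haplotype, or jump uniformly.
    trans = np.full((K, K), switch / K) + np.eye(K) * (1.0 - switch)

    # Forward pass, rescaled at each site for numerical stability.
    fwd = np.zeros((L, K))
    fwd[0] = emit[0] / K
    fwd[0] /= fwd[0].sum()
    for j in range(1, L):
        fwd[j] = emit[j] * (fwd[j - 1] @ trans)
        fwd[j] /= fwd[j].sum()

    # Backward pass, rescaled in the same way.
    bwd = np.ones((L, K))
    for j in range(L - 2, -1, -1):
        bwd[j] = trans @ (emit[j + 1] * bwd[j + 1])
        bwd[j] /= bwd[j].sum()

    # Posterior probability of copying each reference haplotype per site.
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)

    # Probability of allele 1 given the copied haplotype, averaged over
    # the posterior; typed sites simply keep their observed allele.
    allele_one = panel.T * (1.0 - mismatch) + (1 - panel.T) * mismatch
    prob_one = (post * allele_one).sum(axis=1)
    for j, allele in enumerate(observed):
        if allele is not None:
            prob_one[j] = allele
    return prob_one

# Example: two reference haplotypes, un-typed sites at positions 1 and 3.
reference = [[0, 0, 0, 0, 0],
             [1, 1, 1, 1, 1]]
print(impute_haplotype(reference, [1, None, 1, None, 1]))
```

The forward and backward passes return a probability for each un‐typed allele rather than only the single most probable mosaic, which is one way of carrying imputation uncertainty into later analyses.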

References

Anderson CA, Pettersson FH, Clarke GM, et al. (2010) Data quality control in genetic case–control association studies. Nature Protocols 5: 1564–1573.

Browning SR (2006) Multilocus association mapping using variable‐length Markov chains. American Journal of Human Genetics 78: 273–280.

Clark AG and Li J (2007) Conjuring SNPs to detect associations. Nature Genetics 39: 815–816.

Graham JW (2009) Missing data analysis: making it work in the real world. Annual Review of Psychology 60: 549–576.

Howie BN, Donnelly P and Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome‐wide association studies. PLoS Genetics 5: e1000529.

Li Y, Willer CJ, Ding J, Scheet P and Abecasis GR (2010) MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic Epidemiology 34: 816–834.

Liu Q, Cirulli ET, Han Y, et al. (2014) Systematic assessment of imputation performance using the 1000 Genomes reference panels. Briefings in Bioinformatics DOI: 10.1093/bib/bbu035. (in press)

Marchini J, Howie B, Myers S, McVean G and Donnelly P (2007) A new multipoint method for genome‐wide association studies by imputation of genotypes. Nature Genetics 39: 906–913.

Marchini J and Howie B (2010) Genotype imputation for genome‐wide association studies. Nature Reviews. Genetics 11: 499–511.

Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M and Franke A (2009) A comprehensive evaluation of SNP genotype imputation. Human Genetics 125: 163–171.

Rubin DB (1976) Inference and missing data. Biometrika 63: 581–592.

Scheet P and Stephens M (2006) A fast and flexible statistical model for large‐scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics 78: 629–644.

Servin B and Stephens M (2007) Imputation‐based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3: e114.

The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789–796.

The 1000 Genomes Project Consortium (2010) A map of human genome variation from population‐scale sequencing. Nature 467: 1061–1073.

The Oxford Dictionaries (2015) Impute. http://www.oxforddictionaries.com/definition/english/impute

Further Reading

Baraldi AN and Enders CK (2010) An introduction to modern missing data analyses. Journal of School Psychology 48: 5–37.

Browning SR (2008) Missing data imputation and haplotype phase inference for genome‐wide association studies. Human Genetics 124: 439–450.

Burkett K and Greenwood C (2013) A sequence of methodological changes due to sequencing. Current Opinion in Allergy and Clinical Immunology 13: 470–477.

Browning BL and Browning SR (2009) A unified approach to genotype imputation and haplotype‐phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics 84: 210–223.

Delaneau O, Marchini J, 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications 5: 3934.

Donders AR, van der Heijden GJ, Stijnen T and Moons KG (2006) Review: a gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 59: 1087–1091.

Howie B, Fuchsberger C, Stephens M, Marchini J and Abecasis GR (2012) Fast and accurate genotype imputation in genome‐wide association studies through pre‐phasing. Nature Genetics 44: 955–959.

Lee S, Abecasis GR, Boehnke M and Lin X (2014) Rare‐variant association analysis: study designs and statistical tests. American Journal of Human Genetics 95: 5–23.

Li Y, Willer C, Sanna S and Abecasis G (2009) Genotype imputation. Annual Review of Genomics and Human Genetics 10: 387–406.

Neale BM (2010) Introduction to linkage disequilibrium, the HapMap, and imputation. Cold Spring Harbor Protocols 2010: pdb.top74.

Porcu E, Sanna S, Fuchsberger C and Fritsche LG (2013) Genotype imputation in genome‐wide association studies. Current Protocols in Human Genetics 78: 1.25.1–1.25.12.

Zheng H‐F, Rong J‐J, Liu M, et al. (2015) Performance of genotype imputation for low frequency and rare variants from the 1000 Genomes. PLoS One 10: e0116487.

How to Cite

Krawczak, Michael (May 2015) Genotype Imputation. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0022399]