Population Stratification, Adjustment for


Population stratification (PS) is a major concern in genetic association studies. Failure to control it effectively can produce excess false‐positive results and can mask true associations. Many methods have been designed to adjust for PS; they fall mainly into the following categories: (1) genomic control for the inflation of test statistics, (2) structured association, (3) principal component analysis and multidimensional scaling, (4) mixed‐model approaches, (5) adjustment in admixed populations and (6) other approaches. Large and diverse genomic datasets are now available, such as meta‐analysis test statistics from consortia, extensive cohort data and genomic data from admixed populations; they challenge traditional PS methods and provide opportunities for new approaches. No method is likely to be superior in all situations. Care needs to be taken to ensure that the assumptions of the method are met and that the method is used for its intended purpose.

Key Concepts

  • Population stratification is a major source of potential confounding in genetic association studies.
  • PS can be corrected using many different statistical methods.
  • The inflation of test statistics may be due to polygenicity rather than PS.
  • Linear‐mixed models are able to correct for both PS and cryptic relatedness.
  • PS‐correction methods may be sensitive to genetic marker selection.
  • Both global ancestry and local ancestry can be ascertained and accounted for in admixed populations.
  • Care needs to be taken to make sure that the assumptions of the different statistical methods used are met.
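
As a rough illustration of the first category, genomic control, the sketch below estimates an inflation factor from a set of 1‐degree‐of‐freedom chi‐square association statistics and rescales them. It assumes the common median‐based estimator (observed median divided by the null median, about 0.455); the function names and the simulated statistics are hypothetical and are not taken from any particular software package. As the key concepts note, an inflation factor above 1 may reflect polygenicity rather than PS, so the value should not be read as a direct measure of stratification.

```python
import random
import statistics

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(chi2_stats):
    """Return lambda_GC: observed median statistic over the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda_GC.

    lambda_GC is floored at 1 so the correction never makes tests more liberal.
    """
    lam = max(1.0, genomic_inflation_factor(chi2_stats))
    return [s / lam for s in chi2_stats], lam


# Toy data (hypothetical): null 1-df chi-square statistics, uniformly
# inflated by a factor of 1.3 to mimic confounding.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
inflated = [1.3 * s for s in null_stats]

adjusted, lam = genomic_control_adjust(inflated)
print("estimated lambda_GC:", round(lam, 2))
print("lambda_GC after adjustment:", round(genomic_inflation_factor(adjusted), 2))
```

The single rescaling factor is what makes this approach simple, and also what limits it: every marker is deflated by the same amount regardless of how strongly its allele frequency differs across subpopulations.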

Keywords: population stratification; genomic control; principal component analysis; multidimensional scaling; mixed models; LD score regression; local ancestry; admixture mapping

Figure 1. Multidimensional scaling (MDS) versus principal component analysis (PCA). These figures show the clustering results using MDS and PCA with 5000 genome‐wide random autosomal SNPs (single nucleotide polymorphisms) from the HapMap project Phase I data. Panels (a–c) are generated using the PCA approach as implemented in Eigenanalysis. Panels (d–f) are generated using the MDS approach with allele sharing distance. Pairwise plots of the first three dimensions are presented. There is no apparent difference in their ability to visualise the ancestral differences in these populations. Multiple runs gave similar results. The signs of dimensions 1 and 2 in the MDS plots have been reversed (this does not change the relative location of each cluster) to match the geographical locations of the PCA clusters. CEU, CEPH Utah residents with ancestry from northern and western Europe; CHB, Han Chinese from Beijing, China; JPT, Japanese from Tokyo, Japan; YRI, Yoruba in Ibadan, Nigeria; MDS, multidimensional scaling; PCA, principal component analysis.
Figure 2. Local ancestry plots for two Latinos.
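
The allele sharing distance that feeds the MDS panels of Figure 1 can be sketched in a few lines. The example below assumes genotypes coded as 0, 1 or 2 copies of a reference allele and one common definition of the distance, in which a pair of individuals scores 0, 1 or 2 at a SNP when they share two, one or no alleles, averaged over SNPs. The simulated populations, allele frequencies and function names are hypothetical; this is an illustration of the distance itself, not a reproduction of the HapMap analysis.

```python
import random
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """Mean per-SNP allele sharing distance for two 0/1/2 genotype vectors.

    With 0/1/2 coding, |g1 - g2| is 0, 1 or 2 when the pair shares
    two, one or no alleles at a SNP.
    """
    if len(g1) != len(g2):
        raise ValueError("genotype vectors must cover the same SNPs")
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def simulate_genotypes(freqs, n_individuals, rng):
    """Draw genotypes as the sum of two independent allele draws per SNP."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]


# Toy data (hypothetical): two populations with unrelated allele frequencies.
rng = random.Random(1)
n_snps = 200
freqs_a = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
freqs_b = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
pop_a = simulate_genotypes(freqs_a, 10, rng)
pop_b = simulate_genotypes(freqs_b, 10, rng)

# Distances between pairs from the same population and from different ones.
within = [
    allele_sharing_distance(x, y)
    for pop in (pop_a, pop_b)
    for x, y in combinations(pop, 2)
]
between = [allele_sharing_distance(x, y) for x in pop_a for y in pop_b]

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print("mean within-population distance:", round(mean_within, 3))
print("mean between-population distance:", round(mean_between, 3))
```

MDS then places individuals in a low‐dimensional space so that these pairwise distances are preserved as closely as possible, which is why pairs from the same population end up in the same cluster.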


Further Reading

Hartl DL and Clark AG (2007) Principles of Population Genetics, 4th edn. Sunderland: Sinauer Associates Inc.

Seldin MF, Pasaniuc B and Price AL (2011) New approaches to disease mapping in admixed populations. Nature Reviews Genetics 12: 523–528.

Shriner D (2013) Overview of admixture mapping. Current Protocols in Human Genetics. Chapter 1: Unit 1.23. doi:10.1002/0471142905.hg0123s76.

Weir BS (1996) Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sunderland: Sinauer Associates Inc.

Weir BS and Hill WG (2002) Estimating F‐statistics. Annual Review of Genetics 36: 721–750.

Winkler CA, Nelson GW and Smith MW (2010) Admixture mapping comes of age. Annual Review of Genomics and Human Genetics 11: 65–89.

How to Cite
Gao, Xiaoyi, and Edwards, Todd L (Dec 2017) Population Stratification, Adjustment for. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0020384.pub2]