Role of Bioinformatics in Genome‐wide Association Studies


A central goal of human genetics is to identify genetic variants that are associated with disease. A better understanding of how genetic and environmental factors shape disease risk will likely improve diagnosis, prevention and treatment. It is now technically and economically feasible to conduct genome‐wide association studies (GWAS) with over a million single‐nucleotide polymorphisms (SNPs) distributed across the genome. Although GWAS have led to many discoveries, the genetic underpinnings of most common diseases remain largely unexplained. One likely explanation for this “missing heritability” is that traditional GWAS approaches test one SNP at a time and so fail to account for the complexity of many genotype–phenotype relationships, which are characterised by substantial heterogeneity and by gene–gene and gene–environment interactions. This underlying genetic complexity creates bioinformatics challenges in modelling, attribute selection and biological interpretation that must be addressed to realise the full potential of GWAS. The benefits of meeting these challenges will also extend to whole‐genome and whole‐exome sequence analysis.

Key Concepts:

  • Genetic complexity is likely to underlie many common diseases.

  • Bioinformatics tools will be necessary to uncover nonlinear genetic predictors of common diseases.

  • Data mining and machine learning methods can increase power to discover nonlinear genetic predictors of common disease.

  • Filter and wrapper algorithms are necessary to limit the number of attributes examined so modelling strategies are powerful and computationally practical.

  • Prior biological knowledge can improve the analysis and interpretation of GWAS data.

  • Powerful and intuitive software packages are necessary to enable collaboration between biologists, biostatisticians and bioinformaticists.

Keywords: GWAS; genome; genetics; bioinformatics; statistics; epistasis

Figure 1.

Overview of the random forest (RF) algorithm. Feature selection using a RF classifier for the integrated analysis of multiple data types. Adapted from Reif et al. (2006). Reproduced from Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. Washington D.C., pp. 171–178. Copyright © 2006, IEEE.

Figure 2.

Summary of the constructive induction process for multifactor dimensionality reduction (MDR). In each cell, the left and right bars represent the number of cases and controls, respectively. Dark‐shaded cells are high risk whereas light‐shaded cells are low risk. Prediction using any classifier can be carried out using the final constructed attribute. Image reproduced from Moore et al. (2010) Bioinformatics challenges for genome‐wide association studies. Bioinformatics 26(4): 445–455. Copyright Oxford University Press.
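
The constructive induction step summarised in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the published MDR software: the function name `mdr_construct`, the toy data and the default threshold (the overall case:control ratio in the sample) are assumptions made for the example.

```python
from collections import defaultdict


def mdr_construct(genotypes, status, threshold=None):
    """Collapse multilocus genotypes into a single binary attribute.

    genotypes: list of tuples, one tuple of SNP genotypes per individual
    status:    list of 1 (case) or 0 (control), in the same order
    threshold: case:control ratio above which a cell is called high risk;
               defaults to the overall ratio in the data (an assumption)
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    if threshold is None:
        threshold = n_cases / n_controls

    # Count cases and controls in every multilocus genotype cell.
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, controls]
    for cell, s in zip(genotypes, status):
        counts[cell][0 if s == 1 else 1] += 1

    # Label each observed cell as high risk or low risk.
    high_risk = set()
    for cell, (cases, controls) in counts.items():
        ratio = float("inf") if controls == 0 else cases / controls
        if ratio > threshold:
            high_risk.add(cell)

    # The constructed attribute is 1 for high-risk cells and 0 otherwise.
    attribute = [1 if cell in high_risk else 0 for cell in genotypes]
    return attribute, high_risk


# Hypothetical two-SNP data set with an interaction pattern and some noise.
toy = [
    ((0, 1), 1), ((0, 1), 1), ((0, 1), 0),
    ((1, 0), 1), ((1, 0), 1), ((1, 0), 0),
    ((0, 0), 0), ((0, 0), 0), ((0, 0), 1),
    ((1, 1), 0), ((1, 1), 0), ((1, 1), 1),
]
toy_genotypes = [g for g, _ in toy]
toy_status = [s for _, s in toy]
toy_attribute, toy_high_risk = mdr_construct(toy_genotypes, toy_status)
print(sorted(toy_high_risk))
print(toy_attribute)
```

In this toy set the two cells in which the SNP genotypes differ contain more cases than controls, so they are labelled high risk. The returned one-dimensional attribute can then be passed to any classifier, as the caption notes.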

Figure 3.

Summary of the neighbour selection process of Relief, ReliefF and spatially uniform ReliefF (SURF). Each panel shows cases and controls distributed by their genotypes for two continuous markers. Real data are analysed in the same way, except that there are thousands of discrete‐valued markers (SNPs), each represented by one of thousands of dimensions. A randomly selected instance (R) is shown by the filled red circle. The neighbours that are used for weighting are highlighted in blue. The three algorithms differ in how they select neighbours. Relief (a) selects the nearest individual of the same case/control status (blue circle) and the nearest neighbour of the opposite case/control status (blue cross). ReliefF (b) selects a user‐specified number of individuals (two in this example) to use for weighting. SURF (c) uses all individuals within a distance threshold (represented by the dotted line). Image reproduced from Moore et al. (2010) Bioinformatics challenges for genome‐wide association studies. Bioinformatics 26(4): 445–455. Copyright Oxford University Press.
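
The three neighbour selection rules in Figure 3 can be compared side by side in a short Python sketch. It covers only the choice of neighbours, not the weight update that follows. The function names, the mismatch-count distance, the inclusive radius test and the toy data are all assumptions made for illustration; for ReliefF the sketch takes k neighbours of each case/control status.

```python
def distance(a, b):
    """Number of SNPs at which two genotype vectors differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def _ranked(target, genotypes, status, same):
    """Other individuals with the same (or opposite) status, nearest first."""
    idx = [i for i in range(len(genotypes))
           if i != target and (status[i] == status[target]) == same]
    # Ties in distance are broken by index so the result is deterministic.
    return sorted(idx, key=lambda i: (distance(genotypes[target], genotypes[i]), i))


def relieff_neighbours(target, genotypes, status, k):
    """ReliefF: the k nearest hits and the k nearest misses."""
    hits = _ranked(target, genotypes, status, same=True)[:k]
    misses = _ranked(target, genotypes, status, same=False)[:k]
    return hits, misses


def relief_neighbours(target, genotypes, status):
    """Relief: the single nearest hit and the single nearest miss."""
    return relieff_neighbours(target, genotypes, status, k=1)


def surf_neighbours(target, genotypes, status, radius):
    """SURF: every hit and miss within the distance threshold."""
    def within(indices):
        return [i for i in indices
                if distance(genotypes[target], genotypes[i]) <= radius]
    hits = within(_ranked(target, genotypes, status, same=True))
    misses = within(_ranked(target, genotypes, status, same=False))
    return hits, misses


# Hypothetical data: four SNPs per individual, index 0 plays the role of R.
toy_genotypes = [
    (0, 0, 0, 0),  # 0: case (the selected instance R)
    (0, 0, 0, 1),  # 1: case
    (0, 0, 1, 1),  # 2: case
    (2, 2, 2, 2),  # 3: case
    (0, 1, 0, 0),  # 4: control
    (1, 1, 0, 0),  # 5: control
    (1, 1, 1, 0),  # 6: control
]
toy_status = [1, 1, 1, 1, 0, 0, 0]

print(relief_neighbours(0, toy_genotypes, toy_status))
print(relieff_neighbours(0, toy_genotypes, toy_status, k=2))
print(surf_neighbours(0, toy_genotypes, toy_status, radius=3))
```

Each call returns a pair of index lists, hits (same status as R) and misses (opposite status). Only the size of the neighbourhood changes between the three rules: one of each for Relief, k of each for ReliefF, and a data-dependent number for SURF.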

Figure 4.

Suggested flowchart for bioinformatics analysis of GWAS data. In addition to parametric statistical methods, filter and wrapper algorithms are used in conjunction with computational modelling approaches. Biological knowledge from public databases plays a very important role at all levels of analysis and interpretation. Image reproduced from Moore et al. (2010) Bioinformatics challenges for genome‐wide association studies. Bioinformatics 26(4): 445–455. Copyright Oxford University Press.

References

Ahmed S, Thomas G, Ghoussaini M et al. (2009) Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nature Genetics 41(5): 585–590.

Amundadottir L, Kraft P, Stolzenberg‐Solomon RZ et al. (2009) Genome‐wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nature Genetics 41(9): 986–990.

Andrew AS, Karagas MR, Nelson HH et al. (2008) DNA repair polymorphisms modify bladder cancer risk: a multifactor analytic strategy. Human Heredity 65(2): 105–118.

Askland K, Read C and Moore JH (2009) Pathways‐based analyses of whole‐genome association study data in bipolar disorder reveal genes mediating ion channel activity and synaptic neurotransmission. Human Genetics 125(1): 63–79.

Banzhaf W, Nordin P, Keller RE et al. (1998) Genetic Programming – An Introduction; On the Automatic Evolution of Computer Programs and its Applications, 1st edn. San Francisco: Morgan Kaufmann.

Breiman L (2001) Random forests. Machine Learning 45(1): 5–32.

Bureau A, Dupuis J, Falls K et al. (2005) Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology 28(2): 171–182.

Bush WS, Dudek SM and Ritchie MD (2009) Biofilter: a knowledge‐integration system for the multi‐locus analysis of genome‐wide association studies. Pacific Symposium on Biocomputing 14: 368–379.

Bush WS, Edwards TL, Dudek SM et al. (2008) Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics 9: 238.

Calle ML, Urrea V, Vellalta G et al. (2008) Improving strategies for detecting genetic patterns of susceptibility in association studies. Statistics in Medicine 27(30): 6532–6546.

Cattaert T, Urrea V, Naj AC et al. (2010) FAM‐MDR: a flexible family based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS ONE 5(4): e10304.

Chung Y, Lee SY, Elston RC et al. (2007) Odds ratio based multifactor‐dimensionality reduction method for detecting gene–gene interactions. Bioinformatics 23(1): 71–76.

Combarros O, van Duijn CM, Hammond N et al. (2009) Replication by the Epistasis Project of the interaction between the genes for IL‐6 and IL‐10 in the risk of Alzheimer's disease. Journal of Neuroinflammation 6: 22.

Cook NR, Zee RY and Ridker PM (2004) Tree and spline based association analysis of gene–gene interaction models for ischemic stroke. Statistics in Medicine 23(9): 1439–1453.

Cordell HJ (2009) Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics 10(6): 392–404.

Cowper‐Sal·lari R, Cole MD, Karagas MR et al. (2010) Layers of epistasis: genome‐wide regulatory networks and network approaches to genome‐wide association studies. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 3(5): 513–526.

Donnelly P (2008) Progress and challenges in genome‐wide association studies in humans. Nature 456(7223): 728–731.

Easton DF and Eeles RA (2008) Genome‐wide association studies in cancer. Human Molecular Genetics 17(R2): R109–R115.

Easton DF, Pooley KA, Dunning AM et al. (2007) Genome‐wide association study identifies novel breast cancer susceptibility loci. Nature 447(7148): 1087–1093.

Emily M, Mailund T, Hein J et al. (2009) Using biological networks to search for interacting loci in genome‐wide association studies. European Journal of Human Genetics 17(10): 1231–1240.

Fogel GB and Corne DW (2003) Evolutionary Computation in Bioinformatics, 1st edn. Boston: Morgan Kaufmann Publishers.

Freitas A (2002) Data Mining and Knowledge Discovery with Evolutionary Algorithms, 1st edn. New York: Springer.

Greene CS, Hill DP and Moore JH (2010a) Environmental sensing using expert knowledge in a computational evolution system for complex problem solving in human genetics. In: Riolo RL et al. (eds) Genetic Programming Theory and Practice VII. Ann Arbor: Springer.

Greene CS, Himmelstein DS, Nelson HH et al. (2010b) Enabling personal genomics with an explicit test of epistasis. The Pacific Symposium on Biocomputing 15: 327–336.

Greene CS, Kiralis J and Moore JH (2009a) Nature‐inspired Algorithms for the Genetic Analysis of Epistasis in Common Human Diseases: a Theoretical Assessment of Wrapper vs. Filter Approaches. Proceedings of the IEEE Congress on Evolutionary Computation, pp. 800–807, Trondheim, Norway.

Greene CS, Penrod NM, Kiralis J et al. (2009b) Spatially uniform ReliefF (SURF) for computationally efficient filtering of gene–gene interactions. BioData Mining 2(1): 5.

Greene CS, White BC and Moore JH (2007) An expert knowledge‐guided mutation operator for genome wide genetic analysis using genetic programming. Lecture Notes in Bioinformatics 4774: 30–40.

Greene CS, White BC and Moore JH (2009c) Sensible Initialization Using Expert Knowledge for Genome Wide Analysis of Epistasis Using Genetic Programming. Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1289–1296, Trondheim, Norway.

Gui J, Andrew AS, Andrews P et al. (2011) A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility. Annals of Human Genetics 75(1): 20–28.

Hahn LW, Ritchie MD and Moore JH (2003) Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics 19(3): 376–382.

Hastie T, Tibshirani R and Friedman J (2009) The Elements of Statistical Learning – Data Mining, Inference and Prediction, 2nd edn. New York: Springer.

Herold C, Steffens M, Brockschmidt FF et al. (2009) INTERSNP: genome‐wide interaction analysis guided by a priori information. Bioinformatics 25(24): 3275–3281.

Hirschhorn JN and Daly MJ (2005) Genome‐wide association studies for common diseases and complex traits. Nature Reviews Genetics 6(2): 95–108.

Infante J, Sanz C, Fernandez‐Luna JL et al. (2004) Gene–gene interaction between interleukin‐1A and interleukin‐8 increases Alzheimer's disease risk. Journal of Neurology 251(4): 482–483.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431(7011): 931–945.

Jakobsdottir J, Gorin MB, Conley YP et al. (2009) Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genetics 5(2): e1000337.

Kira K and Rendell LA (1992) A Practical Approach to Feature Selection. In Machine Learning: Proceedings of the American Association for Artificial Intelligence Meeting'92, San Francisco.

Kononenko I (1994) Estimating Attributes: Analysis and Extension of Relief. Machine Learning: ECML‐94, New York, pp. 171–182.

Kooperberg C, Ruczinski I, LeBlanc ML et al. (2001) Sequence analysis using logic regression. Genetic Epidemiology 21(suppl. 1): S626–S631.

Lee SY, Chung Y, Elston RC et al. (2007) Log‐linear model‐based multifactor dimensionality reduction method to detect gene–gene interactions. Bioinformatics 23(19): 2589–2595.

Li M, Ye C, Fu W et al. (2011) Detecting genetic interactions for quantitative traits with U‐statistics. Genetic Epidemiology 35(6): 457–468.

Lou XY, Chen GB, Yan L et al. (2007) A generalized combinatorial approach for detecting gene‐by‐gene and gene‐by‐environment interactions with application to nicotine dependence. American Journal of Human Genetics 80(6): 1125–1137.

Lou XY, Chen GB, Yan L et al. (2008) A combinatorial approach to detecting gene–gene and gene–environment interactions in family studies. American Journal of Human Genetics 83(4): 457–467.

Lunetta KL, Hayward LB, Segal J et al. (2004) Screening large‐scale association study data: exploiting interactions using random forests. BMC Genetics 5: 32.

McKinney BA, Crowe JE, Guo J et al. (2009) Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genetics 5(3): e1000432.

McKinney BA, Reif DM, White BC et al. (2007) Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics 23(16): 2113–2120.

Mei H, Cuccaro ML and Martin ER (2007) Multifactor dimensionality reduction‐phenomics: a novel method to capture genetic heterogeneity with use of phenotypic variables. American Journal of Human Genetics 81(6): 1251–1261.

Michalski RS (1983) A theory and methodology of inductive learning. Artificial Intelligence 20: 111–161.

Millstein J, Conti DV, Gilliland FD et al. (2006) A testing framework for identifying susceptibility genes in the presence of epistasis. American Journal of Human Genetics 78(1): 15–27.

Mitchell T (1997) Machine Learning. New York: McGraw‐Hill.

Moore JH, Gilbert JC, Tsai CT et al. (2006) A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology 241(2): 252–261.

Moore JH and Ritchie MD (2004) The challenges of whole‐genome approaches to common diseases. JAMA 291(13): 1642–1643.

Moore JH and White BC (2006) Exploiting expert knowledge in genetic programming for genome‐wide genetic analysis. Lecture Notes in Computer Science 4193: 969–977.

Moore JH and White BC (2007a) Tuning ReliefF for genome‐wide genetic analysis. Lecture Notes in Computer Science 4447: 166–175.

Moore JH and White BC (2007b) Genome‐wide genetic analysis using genetic programming: the critical need for expert knowledge. In: Riolo R, Soule T and Worzel B (eds) Genetic Programming Theory and Practice IV. New York: Springer.

Moore JH and Williams SM (2002) New strategies for identifying gene–gene interactions in hypertension. Annals of Medicine 34(2): 88–95.

Moore JH and Williams SM (2005) Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. Bioessays 27(6): 637–646.

Namkung J, Kim K, Yi S et al. (2009) New evaluation measures for multifactor dimensionality reduction classifiers in gene–gene interaction analysis. Bioinformatics 25(3): 338–345.

Nelson MR, Kardia SL, Ferrell RE et al. (2001) A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Research 11(3): 458–470.

Pattin KA and Moore JH (2009) Role for protein–protein interaction databases in human genetics. Expert Review of Proteomics 6(6): 647–659.

Pattin KA, White BC, Barney N et al. (2009) A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genetic Epidemiology 33(1): 87–94.

Reich M, Liefeld T, Gould J et al. (2006) GenePattern 2.0. Nature Genetics 38(5): 500–501.

Reif DM, Dudek SM, Shaffer CM et al. (2005) Exploratory visual analysis of pharmacogenomic results. The Pacific Symposium on Biocomputing 10: 296–307.

Reif DM, Motsinger AA, McKinney BA et al. (2006) Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types. Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, Washington, DC, pp. 171–178.

Reif DM, Motsinger‐Reif AA, McKinney BA et al. (2009) Integrated analysis of genetic and proteomic data identifies biomarkers associated with adverse events following smallpox vaccination. Genes and Immunity 10(2): 112–119.

Ritchie MD, Hahn LW, Roodi N et al. (2001) Multifactor dimensionality reduction reveals high‐order interactions among estrogen metabolism genes in sporadic breast cancer. American Journal of Human Genetics 69(1): 138–147.

Ritchie MD, White BC, Parker JS et al. (2003) Optimization of neural network architecture using genetic programming improves detection and modeling of gene–gene interactions in studies of human diseases. BMC Bioinformatics 4: 28.

Robnik‐Šikonja M and Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53: 23–69.

Sinnott‐Armstrong NA, Greene CS, Cancare F et al. (2009) Accelerating epistasis analysis in human genetics with consumer graphics hardware. BMC Research Notes 2: 149.

Sun YV, Cai Z, Desai K et al. (2007) Classification of rheumatoid arthritis status with candidate gene and genome‐wide single‐nucleotide polymorphisms using random forests. BMC Proceedings 1(suppl. 1): S62.

The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437(7063): 1299–1320.

Thornton‐Wells TA, Moore JH and Haines JL (2004) Genetics, statistics and human disease: analytical retooling for complexity. Trends in Genetics 20(12): 640–647.

Torkamani A, Topol EJ and Schork NJ (2008) Pathway analysis of seven common diseases assessed by genome‐wide association. Genomics 92(5): 265–272.

Tsoi LC, Boehnke M, Klein RL et al. (2009) Evaluation of genome‐wide association study results through development of ontology fingerprints. Bioinformatics 25(10): 1314–1320.

Velez DR, White BC, Motsinger AA et al. (2007) A balanced accuracy metric for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology 31(4): 306–315.

Wahlsten D (1990) Insensitivity of the analysis of variance to heredity–environment interactions. Behavioral and Brain Sciences 13: 109–161.

Wilke R, Reif DM and Moore JH (2005) Combinatorial pharmacogenetics. Nature Reviews Drug Discovery 4(11): 911–918.

Williams SM, Canter JA, Crawford DC et al. (2007) Problems with genome‐wide association studies. Science 316(5833): 1840–1842.

Williams SM, Ritchie MD, Phillips JA et al. (2004) Multilocus analysis of hypertension: a hierarchical approach. Human Heredity 57(1): 28–38.

Yu W, Wulf A, Liu T et al. (2008) Gene Prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases. BMC Bioinformatics 9: 528.

Zamar D, Tripp B, Ellis G et al. (2009) Path: a tool to facilitate pathway‐based genetic association analysis. Bioinformatics 25(18): 2444–2446.

Zhang H, Wang M and Chen X (2009) Willows: a memory efficient tree and forest construction package. BMC Bioinformatics 10: 130.

Further Reading

Kraft P and Cox DG (2008) Study designs for genome‐wide association studies. Advances in Genetics 60: 465–504.

Manolio TA, Collins FS, Cox NJ et al. (2009) Finding the missing heritability of complex diseases. Nature 461(7265): 747–753.

Marchini J, Donnelly P and Cardon LR (2005) Genome‐wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 37(4): 413–417.

McKinney BA, Reif DM, Ritchie MD et al. (2006) Machine learning for detecting gene–gene interactions: a review. Applied Bioinformatics 5(2): 77–88.

Ritchie MD (2010) Using biological knowledge to uncover the mystery in the search for epistasis in genome‐wide association studies. Annals of Human Genetics 75(1): 172–182.

Ritchie MD, Hahn LW and Moore JH (2003) Power of multifactor dimensionality reduction for detecting gene–gene interactions in the presence of genotyping error, phenocopy, and genetic heterogeneity. Genetic Epidemiology 24(2): 150–157.

Wilke RA, Mareedu RK and Moore JH (2008) The pathway less traveled: moving from candidate genes to candidate pathways in the analysis of genome‐wide data from large scale pharmacogenetic association studies. Current Pharmacogenomics and Personalized Medicine 6(3): 150–159.

Yu K, Li Q, Bergen AW et al. (2009) Pathway analysis by adaptive combination of P‐values. Genetic Epidemiology 33(8): 700–709.

How to Cite
Gilbert‐Diamond, Diane, Asselbergs, Folkert W, Williams, Scott M, and Moore, Jason H (Oct 2011) Role of Bioinformatics in Genome‐wide Association Studies. In: eLS. John Wiley & Sons Ltd, Chichester. [doi: 10.1002/9780470015902.a0023578]