Hidden Markov Models

Abstract

A hidden Markov model (HMM) is a statistical approach that is frequently used for modelling biological sequences. In applying it, a sequence is modelled as an output of a discrete stochastic process, which progresses through a series of states that are ‘hidden’ from the observer. Each such hidden state emits a symbol representing an elementary unit of the modelled data, for example, in case of a protein sequence – an amino acid. The parameters of a hidden Markov model can be estimated by learning from training data. Efficient algorithms are available to infer the most likely paths of states for given sequence data, which often lead to biological predictions and interpretations. Thanks to the well developed theories and algorithms, hidden Markov models have found wide applications in diverse areas of computational molecular biology.

Key Concepts:

  • Hidden Markov model is a statistical approach for modelling sequences with broad applications in computational biology.

  • In an HMM, a biological sequence is modelled as being generated by a stochastic process moving from one state to the next state, where each state emits one element of the sequence according to some emission probability distribution which, in general, is different in different states.

  • Training of an HMM is a process in which the parameters of the model are computed based on a training set of representative examples.

  • Overfitting/overtraining the model occurs when model parameters correctly represent the training set but the model cannot generalise the training data to a larger set.

  • Gene finding is a process of computational identification of genes, including exon/intron structure, in a genome.

Keywords: HMM; gene finding; profile HMM; training HMM; protein structure prediction; epigenetics; phylogenetics

Figure 1.

A simple hidden Markov model. The boxes correspond to states where the emission probabilities for each state are given inside each box. The transition probabilities are given above the corresponding arrows. Note that there are two state paths that can be used to generate the sequence GAGCGCT: 0,1,2,4,4,4,4,5,6 and 0,1,2,3,3,3,3,5,6. The probability of generating the sequence using the first path is 1.06×10–4 and using the second path is 1.35×10−7. The probability of generating the sequences by the model is the sum of these probabilities.

Figure 2.

Topology of a simple HMM for prokaryotic gene recognition. In practice, the topology is more complex (e.g. Krogh et al., ; Henderson et al., ).

Figure 3.

Topology of a profile HMM for a sequence family. The states labelled with M correspond to matches, the states labelled with I correspond to insertions and (silent) circle states correspond to deletions. Adopted with permission from Durbin et al., .

Figure 4.

Two related HMMs modelling two different biological processes. (a) Topology of an HMM for recognising of regions with epigenetic makers. The simple HMM model consist with the states: enriched region and background regions. (b) Topology of a phylogenetic HMM (Siepel et al., ) for the prediction of conserved genomic elements. The states labelled with C and N corresponding to conserved and nonconserved regions respectively. The block of input alignment in the box illustrates a conserved region and the corresponding alignment columns are assumed to be emitted by state C.

close

References

Bagos PG, Liakopoulos TD, Spyropoulos IC and Hamodrakas SJ (2004) A hidden Markov model method, capable of predicting and discriminating beta‐barrel outer membrane proteins. BMC Bioinformatics 5: 29.

Bateman A, Birney E, Cerruti L et al. (2002) The Pfam protein families database. Nucleic Acids Research 30(1): 276–280.

Boitard S, Schlotterer C and Futschik A (2009) Detecting selective sweeps: a new approach based on hidden Markov models. Genetics 181(4): 1567–1578.

Burge C and Karlin S (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268(1): 78–94.

Bystroff C, Thorsson V and Baker D (2000) HMMSTR: a hidden Markov model for local sequence‐structure correlations in proteins. Journal of Molecular Biology 301(1): 173–190.

Chen X, Hoffman MM, Bilmes JA, Hesselberth JR and Noble WS (2010) A dynamic Bayesian network for identifying protein‐binding footprints from single molecule‐based sequencing data. Bioinformatics 26(12): i334–i342.

Colella S, Yau C, Taylor JM et al. (2007) QuantiSNP: an objective Bayes Hidden‐Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Research 35(6): 2013–2025.

Di Francesco V, Garnier J and Munson PJ (1997a) Protein topology recognition from secondary structure sequences: application of the hidden Markov models to the alpha class proteins. Journal of Molecular Biology 267(2): 446–463.

Di Francesco V, Geetha V, Garnier J and Munson PJ (1997b) Fold recognition using predicted secondary structure sequences and hidden Markov models of protein folds. Proteins Suppl 1: 123–128.

Durbin R, Eddy SR, Krogh A and Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.

Felsenstein J and Churchill GA (1996) A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13(1): 93–104.

Freeman JL, Perry GH, Feuk L et al. (2006) Copy number variation: new insights in genome diversity. Genome Research 16(8): 949–961.

Hargbo J and Elofsson A (1999) hidden Markov models that use predicted secondary structures for fold recognition. Proteins 36(1): 68–76.

Henderson J, Salzberg S and Fasman KH (1997) Finding genes in DNA with a hidden Markov model. Journal of Computational Biology 4(2): 127–141.

Humburg P, Bulger D and Stone G (2008) Parameter estimation for robust HMM analysis of ChIP‐chip data. BMC Bioinformatics 9: 343.

Husmeier D and Wright F (2001) Detection of recombination in DNA multiple alignments with hidden Markov models. Journal of Computational Biology 8(4): 401–427.

Karchin R, Cline M, Mandel‐Gutfreund Y and Karplus K (2003) Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins 51(4): 504–514.

Karplus K (2009) SAM‐T08, HMM‐based protein structure prediction. Nucleic Acids Research 37(Web Server issue): W492–W497.

Kern AD and Haussler D (2010) A population genetic hidden Markov model for detecting genomic regions under selection. Molecular Biology and Evolution 27(7): 1673–1685.

Krogh A, Brown M, Mian IS, Sjolander K and Haussler D (1994a) Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology 235(5): 1501–1531.

Krogh A, Larsson B, von Heijne G and Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology 305(3): 567–580.

Krogh A, Mian IS and Haussler D (1994b) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research 22(22): 4768–4778.

Ku M, Koche RP, Rheinbay E et al. (2008) Genomewide analysis of PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoS Genetics 4(10): e1000242.

Li SC, Bu D, Xu J and Li M (2008) Fragment‐HMM: a new approach to protein structure prediction. Protein Science 17(11): 1925–1934.

Li W, Meyer CA and Liu XS (2005) A hidden Markov model for analyzing ChIP‐chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21(Suppl 1): i274–i282.

Lian H, Thompson WA, Thurman R et al. (2008) Automated mapping of large‐scale chromatin structure in ENCODE. Bioinformatics 24(17): 1911–1916.

Lukashin AV and Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research 26(4): 1107–1115.

Martelli PL, Fariselli P, Krogh A and Casadio R (2002) A sequence‐profile‐based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 18(Suppl 1): S46–S53.

Pedersen JS and Hein J (2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2): 219–227.

Qin ZS, Yu J, Shen J et al. (2010) HPeak: an HMM‐based algorithm for defining read‐enriched regions in ChIP‐Seq data. BMC Bioinformatics 11: 369.

Rabiner LR (1989) A tutorial on Hidden Markov–models and selected applications in speech recognition. Proceedings of the IEEE 77(2): 257–286.

Razzaki T and Bukhari AI (1975) Events following prophage Mu induction. Journal of Bacteriology 122(2): 437–442.

Salzberg SL, Delcher AL, Kasif S and White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26(2): 544–548.

Siepel A, Bejerano G, Pedersen JS et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 15(8): 1034–1050.

Siepel A and Haussler D (2005) In: Nielsen R (ed.) Phylogenetic Hidden Markov Models. Statistical Methods in Molecular Evolution, vol. 4, pp. 325–351. Springer.

Siepel A, Pollard KS and Haussler D (2006) New methods for detecting lineage‐specific selection. Research in Computational Molecular Biology, Proceedings 3909: 190–205.

Simpson JT, McIntyre RE, Adams DJ and Durbin R (2010) Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26(4): 565–567.

Sonnhammer EL, von Heijne G and Krogh A (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proceedings of the International Conference on Intelligent System for Molecular Biology 6: 175–182.

Wang K, Li MY, Hadley D et al. (2007) PennCNV: an integrated hidden Markov model designed for high‐resolution copy number variation detection in whole‐genome SNP genotyping data. Genome Research 17(11): 1665–1674.

Won KJ, Chepelev I, Ren B and Wang W (2008) Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics 9: 547.

Won KJ, Ren B and Wang W (2010) Genome‐wide prediction of transcription factor binding sites using an integrated model. Genome Biology 11(1): R7.

Xu H, Wei CL, Lin F and Sung WK (2008) An HMM approach to genome‐wide identification of differential histone modification sites from ChIP‐seq data. Bioinformatics 24(20): 2344–2349.

Yang Z (1995) A space–time process model for the evolution of DNA sequences. Genetics 139(2): 993–1005.

Zhou H and Zhou Y (2003) Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden‐Markov‐model‐based method. Protein Science 12(7): 1547–1555.

Further Reading

Clote P and Backofen R (2000) Computational Molecular Biology: An Introduction. Chichester, UK: John Wiley & Sons.

Durbin R, Eddy SR, Krogh A and Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.

Rasmus N (2005) Statistical Methods in Molecular Evolution. New York: Springer Verlag.

Contact Editor close
Submit a note to the editor about this article by filling in the form below.

* Required Field

How to Cite close
Przytycka, Teresa M, and Zheng, Jie(May 2011) Hidden Markov Models. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005267.pub2]