Substitution Matrices

Abstract

A substitution matrix is a collection of scores for aligning nucleotides or amino acids with one another. These scores generally represent the relative ease with which one nucleotide or amino acid may mutate into or substitute for another, and they are used to measure similarity in sequence alignments. Substitution matrices are used in conjunction with gap costs to construct local or global alignments. A local alignment substitution matrix implicitly corresponds to a set of target frequencies for aligned letters, which characterise the alignments the matrix may most easily distinguish from chance. Thus different substitution matrices are optimal for comparing related sequences that have diverged to differing degrees. The point accepted mutation and blocks substitution matrix (BLOSUM) series of matrices for protein sequence comparison have been derived from collections of related sequences. Special‐purpose matrices may be appropriate for comparing sequences with biased nucleotide or amino acid composition.

Key Concepts:

  • Match/mismatch scores generally are suboptimal for comparing protein sequences.

  • A log‐odds substitution matrix is characterised by its target frequencies and scale.

  • All local‐alignment substitution matrices are implicitly log‐odds matrices.

  • Different substitution matrices are tailored to different degrees of evolutionary divergence.

  • Amino acid substitution matrices are constructed from curated sets of alignments.

  • The sequence divergence for which a substitution matrix is tailored may be measured by relative entropy.

  • Special matrices may be preferred for comparing sequences with nonstandard nucleotide or amino acid composition.

  • For gapped local alignments, statistical parameters must be estimated by random simulation.

  • For protein‐coding DNA, distant relationships are more easily detected at the protein than at the DNA level.

  • For global alignments, an equivalent scoring system results from adding 2x to all substitution scores, and x to the score for each letter aligned with a null.

Keywords: alignment score; sequence similarity; gap costs; PAM matrices; BLOSUM matrices; local alignment statistics

References

Ali J, Paila U and Ranjan A (2011) ApicoAlign: an alignment and sequence search tool for apicomplexan proteins. BMC Genomics 12(Suppl. 3): S6.

Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology 219: 555–565.

Altschul SF, Bundschuh R, Olsen R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research 29: 351–361.

Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.

Altschul SF, Wootton JC, Gertz EM et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS Journal 272: 5101–5109.

Chiaromonte F, Yap VB and Miller W (2002) Scoring pairwise genomic sequence alignments. In: Altman RB, Dunker AK, Hunter L, Lauderdale K and Klein TE (eds) Pacific Symposium on Biocomputing, pp. 115–126. Singapore: World Scientific Publishing.

Dayhoff MO, Schwartz RM and Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff MO (ed.) Atlas of Protein Sequence and Structure, vol. 5 (Suppl. 3), pp. 345–352. Washington, DC: National Biomedical Research Foundation.

Dembo A, Karlin S and Zeitouni O (1994) Limit distribution of maximal non‐aligned two‐sequence segmental score. Annals of Probability 22: 2022–2039.

Henikoff S and Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the USA 89: 10915–10919.

Henikoff S and Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins 17: 49–61.

Jones DT, Taylor WR and Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences 8: 275–282.

Karlin S and Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA 87: 2264–2268.

Muller T, Rahmann S and Rehmsmeier M (2001) Non‐symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 7(Suppl. 1): S182–S189.

Muller T and Vingron M (2000) Modeling amino acid replacement. Journal of Computational Biology 7: 761–776.

Ng PC, Henikoff JG and Henikoff S (2000) PHAT: a transmembrane‐specific substitution matrix. Bioinformatics 16: 760–766.

Park Y, Sheetlin S and Spouge JL (2009) Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Annals of Statistics 37: 3697–3714.

Pearson WR (1995) Comparison of methods for searching protein sequence databases. Protein Science 4: 1145–1160.

Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA 85: 2444–2448.

Sardiu ME, Alves G and Yu Y‐K (2005) Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem. Physical Review E: Statistical, Nonlinear and Soft Matter Physics 72: 061917.

Schneider A, Cannarozzi GM and Gonnet GH (2005) Empirical codon substitution matrix. BMC Bioinformatics 6: 134.

Schwartz RM and Dayhoff MO (1978) Matrices for detecting distant relationships. In: Dayhoff MO (ed.) Atlas of Protein Sequence and Structure, vol. 5 (Suppl. 3), pp. 353–358. Washington, DC: National Biomedical Research Foundation.

Smith TF and Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

States DJ, Gish W and Altschul SF (1991) Improved sensitivity of nucleic acid database searches using application‐specific scoring matrices. Methods 3: 66–70.

Wan H and Wootton JC (2000) A global compositional complexity measure for biological sequences: AT‐rich and GC‐rich genomes encode less complex proteins. Computers and Chemistry 24: 71–94.

Yu Y‐K and Altschul SF (2005) The construction of amino acid substitution matrices for the comparison of proteins with non‐standard compositions. Bioinformatics 21: 902–911.

Yu Y‐K, Wootton JC and Altschul SF (2003) The compositional adjustment of amino acid substitution matrices. Proceedings of the National Academy of Sciences of the USA 100: 15688–15693.

Further Reading

Altschul SF (1993) A protein alignment scoring system sensitive at all evolutionary distances. Journal of Molecular Evolution 36: 290–300.

Altschul SF, Wootton JC, Zaslavsky E and Yu Y‐K (2010) The construction and use of log‐odds substitution scores for multiple sequence alignment. PLoS Computational Biology 6: e1000852.

Claverie J‐M (1993) Detecting frameshifts by amino acid sequence comparison. Journal of Molecular Biology 234: 1140–1157.

Durbin R, Eddy S, Krogh A and Mitchison G (1998) Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.

Ewens WJ and Grant GR (2001) Statistical Methods in Bioinformatics. New York, NY: Springer.

Gonnet GH, Cohen MA and Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256: 1443–1445.

Henikoff S and Henikoff JG (2000) Amino acid substitution matrices. Advances in Protein Chemistry 54: 73–97.

Vingron M and Waterman MS (1994) Sequence alignment and penalty choice. Review of concepts, case studies and implications. Journal of Molecular Biology 235: 1–12.

Contact Editor close
Submit a note to the editor about this article by filling in the form below.

* Required Field

How to Cite close
Altschul, Stephen F(Apr 2013) Substitution Matrices. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005265.pub3]