Sequence Similarity


Sequence similarity is a measure of an empirical relationship between sequences. A common objective of sequence similarity calculations is to establish the likelihood of sequence homology: the chance that sequences have evolved from a common ancestor. A similarity score is therefore intended to approximate the evolutionary distance between a pair of nucleotide or protein sequences. Many methods for measuring sequence similarity exist, and a general aim is to infer structural or functional characteristics of an unannotated molecular sequence.
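As a rough illustration of the simplest kind of similarity measure, the sketch below computes percentage sequence identity for two sequences that have already been aligned. The function name `percent_identity`, the gap character `-`, and the choice to ignore gapped columns are assumptions made for this example; other conventions for the denominator exist and give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the aligned, gap-free columns.

    Assumes both inputs are rows of the same pairwise alignment, so they
    must have equal length. Columns with a gap in either row are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns where both sequences have a residue
    identical = 0  # compared columns where the residues match
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1

    return 100.0 * identical / compared if compared else 0.0
```

For example, `percent_identity("ACGT", "ACGA")` compares four columns, three of which match. Identity alone does not weight conservative substitutions or penalise gaps, which is why scoring matrices and gap penalties are normally used for similarity scores.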

Keywords: sequence similarity; sequence identity; homology; alignment; homology searching; E‐value

Figure 1.

Distribution of 66 066 global similarity scores versus the length of the shorter sequence in each pairwise alignment. The scores come from pairwise global alignments over an artificial database of sequences generated with a random mutation and insertion/deletion protocol. The alignments were performed using the PRALINE method (Heringa, ), with alignment scores calculated using the BLOSUM62 matrix and gap penalties of 12 and 1 for gap initiation and extension, respectively. A clearly linear lower band of alignment scores of unrelated sequences is visible. The correlation coefficient of the random scores within the lower band is 0.99, and the slope of the regression line is 7.864. The higher scores of putatively related sequences above the lower band are also correlated: the correlation coefficient is 0.98 and the slope of the linear regression line is 12.50. Random and real scores were separated by the line y=9.334x.
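The separating line in the caption can be read as a simple length-dependent threshold. The sketch below is an illustration only: the function name `above_random_band` is hypothetical, and the slope 9.334 is specific to the scoring scheme and artificial database described in the caption, so it should not be assumed to carry over to other matrices, gap penalties or data sets.

```python
# Slope of the separating line y = 9.334x quoted in the Figure 1 caption.
# It applies only to the scoring setup described there.
SEPARATING_SLOPE = 9.334


def above_random_band(score: float, shortest_length: int,
                      slope: float = SEPARATING_SLOPE) -> bool:
    """Return True if a global alignment score lies above the line
    y = slope * x, where x is the length of the shorter sequence."""
    if shortest_length <= 0:
        raise ValueError("shortest_length must be positive")
    return score > slope * shortest_length
```

With a shorter sequence of 100 residues the threshold is 933.4, so a score of 1250 falls above the line and a score of 786 falls below it.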

Figure 2.

Probability density function for the extreme value distribution (EVD) resulting from parameter values μ=0 and λ=1, where μ is the characteristic value and λ the decay constant.
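A minimal sketch of the density plotted in Figure 2, assuming the standard Gumbel-type form of the EVD, f(x) = λ exp(−λ(x − μ) − exp(−λ(x − μ))). The function name `evd_pdf` and the overflow guard are choices made for this example, not part of the article.

```python
import math


def evd_pdf(x: float, mu: float = 0.0, lam: float = 1.0) -> float:
    """Density of the extreme value distribution with characteristic
    value mu and decay constant lam (assumed Gumbel-type form)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    z = lam * (x - mu)
    if z < -700.0:
        # exp(-z) would overflow a float; the density is effectively zero here.
        return 0.0
    return lam * math.exp(-z - math.exp(-z))
```

With μ=0 and λ=1 the density peaks at x=0, where it equals exp(−1), and it decays more slowly to the right of the peak than to the left, which gives the distribution its characteristic long upper tail.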

References

Abagyan RA and Batalov S (1997) Do aligned sequences share the same fold? Journal of Molecular Biology 273: 355–368.

Altschul SF and Gish W (1996) Local alignment statistics. In: Doolittle RF (ed.) Methods in Enzymology, vol. 266, pp. 460–480. San Diego, CA: Academic Press.

Altschul SF, Gish W, Miller W, Meyers EW and Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.

Altschul SF, Madden TL, Schäffer AA et al. (1997) Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.

Arratia R and Waterman MS (1994) A phase transition for the score in matching random sequences allowing deletions. Annals of Applied Probability 4: 200–225.

Cartwright RA (2007) Ngila: global pairwise alignments with logarithmic and affine gap costs. Bioinformatics 23: 1427–1428.

Dembo A and Karlin S (1991) Strong limit theorems of empirical functionals for large exceedances of partial sums of i.i.d. variables. Annals of Probability 19: 1737.

Dembo A, Karlin S and Zeitouni O (1994) Limit distributions of maximal non‐aligned two‐sequence segmental score. Annals of Probability 22: 2022.

Doolittle RF (1981) Similar amino acid sequences: chance or common ancestry? Science 214: 149–159.

Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14: 755–763.

Edgar RC and Sjölander K (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20: 1301–1308.

George RA and Heringa J (2002) Protein domain identification and improved sequence searching using PSI‐BLAST. Proteins – Structure Function and Genetics 48: 672–681.

Gumbel EJ (1958) Statistics of Extremes. New York, NY: Columbia University Press.

Heringa J (1999) Two strategies for sequence comparison: profile‐preprocessed and secondary structure‐induced multiple alignment. Computers and Chemistry 23: 341–364.

Heringa J (2002) Local weighting schemes for protein multiple sequence alignment. Computers and Chemistry 26: 459–477.

Jaroszewski L, Rychlewski L, Li Z, Li W and Godzik A (2005) FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Research 33: W284–W288.

Karlin S and Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA 87: 2264–2268.

Karplus K, Barrett C and Hughey R (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856.

Kent WJ (2002) BLAT – The BLAST‐like alignment tool. Genome Research 12: 656–664.

Karplus K, Karchin R, Barrett C et al. (2001) What is the value added by human intervention in protein structure prediction? Proteins: Structure, Function, and Genetics 45(S5): 86–91.

Lawless JF (1982) Statistical Models and Methods for Lifetime Data pp. 141–202. New York, NY: Wiley.

May AC (2001) Related problems. Nature 413: 453.

Mott R (1992) Maximum‐likelihood estimation of the statistical distribution of Smith–Waterman local sequence similarity scores. Bulletin of Mathematical Biology 54: 59–75.

Needleman SB and Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443–453.

von Ohsen N, Sommer I, Zimmer R and Lengauer T (2004) Arby: automatic protein structure prediction using profile–profile alignment and confidence measures. Bioinformatics 20: 2228–2235.

Pascarella S and Argos P (1992) A data bank merging related protein structures and sequences. Protein Engineering 5: 121–137.

Pearson WR (1996) Effective protein sequence comparison. In: Doolittle RF (ed.) Methods in Enzymology, vol. 266, pp. 227–258. San Diego, CA: Academic Press.

Pearson WR (1998) Empirical statistical estimates for sequence similarity searches. Journal of Molecular Biology 276: 71–84.

Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA 85: 2444–2448.

Przybylski D and Rost B (2007) Consensus sequences improve PSI‐BLAST through mimicking profile–profile alignments searches. Nucleic Acids Research 35(7): 2238–2246.

Rost B (2002) Enzyme function is less conserved than anticipated. Journal of Molecular Biology 318: 595–608.

Sander C and Schneider R (1991) Database of homology derived protein structures and the structural meaning of sequence alignment. Proteins – Structure Function and Genetics 9: 56–68.

Schäffer AA, Aravind L, Madden TL et al. (2001) Improving the accuracy of PSI‐BLAST protein database searches with composition‐based statistics and other refinements. Nucleic Acids Research 29: 2994–3005.

Sharon I, Birkland A, Chang K, El‐Yaniv R and Yona G (2005) Correcting BLAST e‐values for low‐complexity segments. Journal of Computational Biology 12: 978–1001.

Simossis VA, Kleinjung J and Heringa J (2005) Homology‐extended sequence alignment. Nucleic Acids Research 33: 816–824.

Smith TF and Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

Smith TF, Waterman MS and Burks C (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Research 13: 645.

Tomii K and Akiyama Y (2004) FORTE: a profile–profile comparison tool for protein fold recognition. Bioinformatics 20: 594–595.

Waterman MS and Eggert M (1987) A new algorithm for best subsequences alignment with applications to the tRNA–rRNA comparisons. Journal of Molecular Biology 197: 723–728.

Waterman MS and Vingron M (1994) Rapid and accurate estimates of statistical significance for sequence data base searches. Proceedings of the National Academy of Sciences of the USA 91: 4625.

Wootton JC and Federhen S (1996) Analysis of compositionally biased regions in sequence databases. In: Doolittle RF (ed.) Methods in Enzymology, vol. 266, pp. 554–571. San Diego, CA: Academic Press.

Yona G and Levitt M (2002) Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. Journal of Molecular Biology 315: 1257–1275.

Yu YK, Wootton JC, Altschul SF et al. (2003) The compositional adjustment of amino acid substitution matrices. Proceedings of the National Academy of Sciences of the USA 100: 15688–15693.

Further Reading

Doolittle RF (ed.) (1996) Methods in Enzymology, vol. 266, p. 711. San Diego, CA: Academic Press.

Higgins D and Taylor WR (eds) (2000) Bioinformatics: Sequence, Structure and Databanks, p. 249. Oxford: Oxford University Press.

How to Cite

Heringa, Jaap (Jul 2008) Sequence Similarity. In: eLS. John Wiley & Sons Ltd, Chichester. [doi: 10.1002/9780470015902.a0005317.pub2]