BLAST Algorithm

Abstract

BLAST is an acronym for ‘Basic Local Alignment Search Tool’. Members of the BLAST family of database search programs take as input a query deoxyribonucleic acid (DNA) or protein sequence, and search a DNA or protein sequence database for similarities that may indicate homology. The programs implement variations of the BLAST algorithm, which is a heuristic method for rapidly finding local alignments with scores sufficiently high to be statistically significant. The BLAST algorithm first seeks near‐perfect matches to words within a query sequence, and then extends these matches to determine whether they are contained within longer, high‐scoring local alignments. BLAST approximates the rigorous Smith–Waterman local alignment algorithm; it permits a chance of missing weak sequence similarities in exchange for greatly increased speed.

Key Concepts:

  • Local sequence alignment is used for similarity searches of sequence databases.

  • The Smith–Waterman algorithm is a rigorous dynamic programming method for finding optimal local alignments.

  • BLAST is a fast, heuristic approximation to the Smith–Waterman algorithm.

  • An analytic theory describes the optimal scores of ungapped local alignments.

  • The statistical parameters for BLAST's gapped local alignments are precomputed by random simulation.

  • Very efficient algorithms exist for finding perfect or near‐perfect word matches.

  • BLAST finds weak local alignments if they contain near‐perfect word matches.

  • BLAST provides a trade‐off between speed and the probability of missing statistically significant alignments.

  • A variety of specialized BLAST programs are adapted to searching DNA or protein sequence databases with DNA or protein query sequences.

  • PSI‐BLAST increases the sensitivity of protein database searches by constructing a protein profile from the results of a regular BLAST search.

Keywords: local alignment; sequence similarity; PSI‐BLAST

Figure 1.

The BLAST algorithm. One length‐3 word, ‘MVD’, from the query sequence is shown in red. Given a threshold T=11, the neighbourhood of ‘MVD’ consists of all words that align to it with score ≥T, computed using the BLOSUM‐62 amino acid substitution matrix (Henikoff and Henikoff, ). Any such word found in a database or ‘subject’ sequence constitutes a hit, and triggers an ungapped extension in both directions. An extension is abandoned when the score of the resulting alignment falls X, below the best score yet found. The highest‐scoring local alignment is reported if its E‐value is sufficiently low. If T were increased to 12 in this example, the number of neighbourhood words for ‘MVD’ would fall from 11 to 5, and the hit shown in red would be lost. However, the original BLAST algorithm would still find the illustrated local alignment, due to a hit involving the later query word, ‘QWI’, shown in blue. The alignment's central line echoes matching letters, and indicates with a ‘+’ any nonidentical letters that align with positive score.

close

References

Acland A, Agarwala R, Barrett T et al. (2014) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 42: D7–D17.

Altschul SF, Bundschuh R, Olsen R and Hwa T (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic acids Research 29: 351–361.

Altschul SF and Gish W (1996) Local alignment statistics. Methods in Enzymology 266: 460–480.

Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.

Altschul SF, Madden TL, Schäffer AA et al. (1997) Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Research 25: 3389–3402.

Altschul SF, Wootton JC, Gertz EM et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS Journal 272: 5101–5109.

Boratyn GM, Schäffer AA, Agarwala R et al. (2012) Domain enhanced lookup time accelerated BLAST. Biology Direct 7: 12.

Gertz EM, Yu Y‐K, Agarwala R, Schäffer AA and Altschul SF (2006) Composition‐based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biology 4: 41.

Gish W and States DJ (1993) Identification of protein coding regions by database similarity search. Nature Genetics 3: 266–272.

Gusfield D (1997) Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. New York, NY: Press Syndicate of the University of Cambridge.

Henikoff S and Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the USA 89: 10915–10919.

Karlin S and Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA 87: 2264–2268.

Marchler‐Bauer A, Zheng C, Chitsaz F et al. (2013) CDD: conserved domains and protein three‐dimensional structure. Nucleic Acids Research 41: D348–D352.

Morgulis A, Coulouris G, Raytselis Y et al. (2008) Database indexing for production MegaBLAST searches. Bioinformatics 24: 1757–1764.

Park Y, Sheetlin SL, Ma N, Madden TL and Spouge JL (2012) BMC Research Notes 5: 286.

Pearson WR and Lipman DJ (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA 85: 2444–2448.

Schäffer AA, Aravind L, Madden TL et al. (2001) Improving the accuracy of PSI‐BLAST protein database searches with composition‐based statistics and other refinements. Nucleic Acids Research 29: 2994–3005.

Schäffer AA, Wolf YI, Ponting CP et al. (1999) IMPALA: matching a protein sequence against a collection of PSI‐BLAST‐constructed position‐specific score matrices. Bioinformatics 15: 1000–1011.

Sheetlin S, Park Y and Spouge JL (2005) The Gumbel pre‐factor k for gapped local alignment can beestimated from simulations of global alignment. Nucleic Acids Research 33: 4987–4994.

Smith TF and Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

Yu Y‐K, Wootton JC and Altschul SF (2003) The compositional adjustment of amino acid substitution matrices. Proceedings of the National Academy of Sciences of the USA 100: 15688–15693.

Zhang Z, Schäffer AA, Miller W et al. (1998) Protein sequence similarity searches using patterns as seeds. Nucleic Acids Research 26: 3986–3990.

Zhang Z, Schwartz S, Wagner L and Miller W (2000) A greedy algorithm for aligning DNA sequences. Journal of Computational Biology 7: 203–214.

Further Reading

Altschul SF, Boguski MS, Gish W and Wootton JC (1994) Issues in searching molecular sequence databases. Nature Genetics 6: 119–129.

Altschul SF and Koonin EV (1998) Iterated profile searches with PSI‐BLAST: a tool for discovery in protein databases. Trends in Biochemical Sciences 23: 444–447.

Baxevanis AD and Ouellette BFF (2005) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd edn. New York, NY: John Wiley.

Dembo A, Karlin S and Zeitouni O (1994) Limit distribution of maximal non‐aligned two‐sequence segmental score. Annals of Probability 22: 2022–2039.

Ewens WJ and Grant GR (2010) Statistical Methods in Bioinformatics: An Introduction, 2nd edn. New York, NY: Springer Science+Business Media.

Korf I, Yandell M and Bedell J (2003) BLAST. Sebastopol, CA: O'Reilly.

Smith TF, Waterman MS and Burks C (1985) The statistical distribution of nucleic acid similarities. Nucleic Acids Research 13: 645–656.

Wilson AC, Carlson SS and White TJ (1977) Biochemical evolution. Annual Review of Biochemistry 46: 573–639.

Web Link

National Center for Biotechnology Information BLAST server. http://blast.ncbi.nlm.nih.gov

Contact Editor close
Submit a note to the editor about this article by filling in the form below.

* Required Field

How to Cite close
Altschul, Stephen F(Jun 2014) BLAST Algorithm. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005253.pub2]