FASTA Search Programs

Abstract

The FASTA programs search protein and deoxyribonucleic acid (DNA) databases for sequences with statistically significant similarity. The programs compare proteins, DNA, short peptides and oligonucleotides, and run on most popular computers. FASTA and BLAST both seek to identify homologous proteins or DNA sequences. BLAST is faster, but FASTA is more flexible, providing both rigorous (SSEARCH, LALIGN, GGSEARCH and GLSEARCH) and heuristic (FASTA, FASTX/Y, TFASTX/Y and FASTS/M/F) algorithms, a wider range of scoring matrices and different approaches for estimating statistical significance. In addition, the FASTA programs offer options to search a small, representative database and then report results from a larger sequence set linked to the initial significant hits. The FASTA programs can also annotate alignments to show the conservation state of aligned functional residues, such as active sites, and the subalignment scores associated with domain boundaries. The FASTA programs provide flexible and rigorous alternatives to BLAST for protein, translated‐DNA and DNA alignment.

Key Concepts:

  • The FASTA program uses a heuristic (approximate) strategy for finding similar sequences, but the FASTA package includes SSEARCH and GGSEARCH, which provide rigorous algorithms.

  • Homologs can be identified because they share excess (statistically significant) sequence similarity.

  • E()‐values (expect‐values) report the significance (expectation) of a sequence similarity score.

  • Sequence alignments are more accurate when the scoring matrix matches the evolutionary distance of the aligned sequences.

  • The FASTA programs can align against sequences not included in the initial search using library expansion.

  • The FASTA programs can use external annotations to modify aligned sequences and to partition similarity scores.
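
As a rough illustration of the E()‐value concept listed above, the expectation can be thought of as the probability p of seeing a score by chance in one pairwise comparison, multiplied by the number of library sequences compared, D. The sketch below assumes that simple relationship (E = p × D). The function name and the example numbers are hypothetical, and how the programs estimate p from the search is not shown here.

```python
def expect_value(p: float, library_size: int) -> float:
    """Expected number of chance hits with a score at least this good.

    Assumes E = p * D, where p is the probability of the score in a
    single comparison and D is the number of library sequences searched.
    This is an illustrative helper, not part of the FASTA package.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if library_size < 1:
        raise ValueError("library_size must be a positive integer")
    return p * library_size


if __name__ == "__main__":
    # Hypothetical numbers: the same p is less impressive in a larger library.
    for d in (10_000, 500_000):
        print(f"p=1e-06, D={d}: E() = {expect_value(1e-6, d):.3g}")
```

Under this assumption, the same similarity score yields a larger (less significant) E()‐value when a larger library is searched, which is one reason for searching a small, representative database first.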

Keywords: sequence similarity; homology; statistical significance; protein sequence comparison; DNA sequence comparison

Figure 1.

FASTA alignment showing annotated functional sites and domains. Alignment of UniProt proteins AKT2_HUMAN and KS6A5_HUMAN, using an annotation script to highlight functional residues and domains. ‘qSite’ and ‘Site’ locations specify residues annotated in the query and library sequence, respectively. Symbols (#‐binding, *‐modified residue) indicate the type of functional residue. The numbers, letters and symbol after a ‘Site’ annotation, e.g. 181 K=81 K, indicate the query residue coordinate, query residue, conservation state, library residue coordinate and library residue. Where the residue has changed, the state of the replacement residue is indicated with ‘<’ for nonconservative changes (the functional symbol is highlighted in red, e.g. #) and ‘>’ for conservative changes (highlighted in green, e.g. *). ‘Region’ annotations indicate the boundaries and scores associated with annotated domains. After the query start‐end and library start‐end for a region alignment, the raw similarity score, bit score, fraction identical and Q‐value (−10 log10 p) for the score are shown. Regions with Q‐values > 30 have probabilities < 0.001.
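
The Q‐value scale in the caption can be checked with a few lines of arithmetic. The sketch below assumes the logarithm is base 10, which is what the caption's statement that Q > 30 corresponds to p < 0.001 implies. The function names are hypothetical and are not part of the FASTA programs.

```python
import math


def q_value(p: float) -> float:
    """Convert a probability to a Q-value, Q = -10 * log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def probability_from_q(q: float) -> float:
    """Invert the Q-value scale, p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Each additional 10 Q-value units is a tenfold drop in probability.
    for q in (10, 20, 30, 40):
        print(f"Q = {q}: p = {probability_from_q(q):.0e}")
```

On this scale a larger Q‐value means a less probable (more significant) domain score.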

Further Reading

Pearson WR (2013) An introduction to similarity (“homology”) searching. Current Protocols in Bioinformatics 42: 3.1.1–3.1.8. doi:10.1002/0471250953.bi0301s42.

Pearson WR (2014) BLAST and FASTA similarity searching for multiple sequence alignment. Methods in Molecular Biology 1079: 75–101.

Pearson WR and Sierk ML (2005) The limits of protein sequence comparison? Current Opinion in Structural Biology 15: 254–260.

Weblinks

EMBL‐EBI European Bioinformatics Institute. FASTA similarity searching. http://www.ebi.ac.uk/Tools/sss/

FASTA Programs at the University of Virginia. FASTA server. http://fasta.bioch.virginia.edu

Site for downloading current versions of the FASTA programs. http://faculty.virginia.edu/wrpearson/fasta

How to Cite
Pearson, William R (Apr 2014) FASTA Search Programs. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005255.pub2]