FASTA Search Programs
William R Pearson, University of Virginia, Charlottesville, Virginia, USA
Published online: April 2014
DOI: 10.1002/9780470015902.a0005255.pub2
Abstract
The FASTA programs search protein and deoxyribonucleic acid (DNA) databases for sequences with statistically significant similarity. The programs compare proteins, DNA, short peptides and oligonucleotides, and run on most popular computers. FASTA and BLAST both seek to identify homologous protein or DNA sequences. BLAST is faster, but FASTA is more flexible: it provides both rigorous (SSEARCH, LALIGN, GGSEARCH and GLSEARCH) and heuristic (FASTA, FASTX/Y, TFASTX/Y and FASTS/M/F) algorithms, a wider range of scoring matrices and different approaches for estimating statistical significance. In addition, the FASTA programs offer options to search a small, representative database and then report results from a larger sequence set linked to the initial significant hits. The FASTA programs can also annotate the alignments to include the conservation state of aligned functional residues, such as active sites, and subalignment scores associated with domain boundaries. The FASTA programs provide flexible and rigorous alternatives to BLAST for protein, translated‐DNA and DNA alignment.
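The "rigorous" local alignment referred to above is the Smith–Waterman calculation, which scores every possible pairing of residues instead of sampling likely regions heuristically. The following is a minimal sketch of that calculation, not the SSEARCH implementation: the function name, the match/mismatch scores and the linear gap penalty are illustrative assumptions, whereas real searches use a scoring matrix and separate gap-open and gap-extend penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative assumptions: identity scoring (match/mismatch) in place of
    a substitution matrix, and a single linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because every cell of the matrix is evaluated, the run time grows with the product of the two sequence lengths. That cost is what heuristic programs try to avoid.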
Key Concepts:
- The FASTA program uses a heuristic (approximate) strategy for finding similar sequences, but the FASTA package includes SSEARCH and GGSEARCH, which provide rigorous algorithms.
- Homologs can be identified because they share excess (statistically significant) sequence similarity.
- E()‐values (expect‐values) report the significance (expectation) of a sequence similarity score.
- Sequence alignments are more accurate when the scoring matrix matches the evolutionary distance of the aligned sequences.
- The FASTA programs can align against sequences not included in the initial search using library expansion.
- The FASTA programs can use external annotations to modify aligned sequences and to partition similarity scores.
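As an illustration of the E()‐value concept listed above, an expectation can be sketched as the probability of a score arising by chance in one comparison, multiplied by the number of sequences searched. The sketch below assumes that chance scores follow an extreme value (Gumbel) distribution. The function name and the parameters `mu`, `lam` and `db_size` are hypothetical placeholders, not values or names taken from the FASTA programs, which estimate their own statistical parameters.

```python
import math


def expect_value(score, mu, lam, db_size):
    """Expected number of chance scores >= score in a search of db_size sequences.

    Assumes chance scores follow a Gumbel distribution with location mu and
    scale parameter lam (both hypothetical, for illustration only).
    """
    # P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))
    # expm1 keeps precision when the probability is very small
    p = -math.expm1(-math.exp(-lam * (score - mu)))
    return db_size * p


if __name__ == "__main__":
    for s in (50, 80, 120):
        print(s, expect_value(s, mu=50.0, lam=0.2, db_size=100_000))
```

Under these assumptions the same score becomes less significant as the database grows, because the expectation scales linearly with `db_size`.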
Keywords: sequence similarity; homology; statistical significance; protein sequence comparison; DNA sequence comparison
Further Reading
Pearson WR (2013) An introduction to similarity (“homology”) searching. Current Protocols in Bioinformatics 42: 3.1.1–3.1.8. doi:10.1002/0471250953.bi0301s42.
Pearson WR (2014) BLAST and FASTA similarity searching for multiple sequence alignment. Methods in Molecular Biology 1079: 75–101.
Pearson WR and Sierk ML (2005) The limits of protein sequence comparison? Current Opinion in Structural Biology 15: 254–260.
Weblinks
EMBL‐EBI European Bioinformatics Institute. FASTA similarity searching. http://www.ebi.ac.uk/Tools/sss/
FASTA Programs at the University of Virginia. FASTA server. http://fasta.bioch.virginia.edu
Site for downloading current versions of the FASTA programs. http://faculty.virginia.edu/wrpearson/fasta