Pattern Searches in Protein Sequences

Abstract

Protein families are characterised by shared amino acid patterns. Automated searches for these patterns are used to annotate protein structure and function and to explore evolutionary relationships. As high‐throughput technologies generate ever more deoxyribonucleic acid (DNA) and protein sequences, pattern searching has become a routine step in identifying new protein functions and elucidating biological processes. A wide array of pattern matching methods has been implemented, all aiming to identify the constraints that govern the occurrence of amino acids in protein regions. These constraints are expressed as probabilities, as templates, or as both, and form the basis of automated searches.

Key Concepts:

  • Protein families are structured on the basis of common sequence patterns.

  • Patterns constrain the nature and the position of amino acids.

  • Patterns are matched to templates called signatures or profiles.

  • Accurate identification by pattern matching requires both realistic predictive models and statistically significant scores.
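
To illustrate how a pattern constrains both the nature and the position of amino acids, the following minimal sketch translates a PROSITE‐style pattern into a Python regular expression and scans a sequence with it. The function names prosite_to_regex and scan are hypothetical helpers introduced here, the test sequence is made up, and only the core pattern syntax is handled (x, [..], {..}, repeat counts and the < and > anchors); this is not the implementation used by any particular database.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        # "[ST]" (allowed residues) and single letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(pattern: str, sequence: str):
    """Return (start, end, match) for every hit, using 1-based inclusive positions."""
    # The lookahead lets overlapping occurrences be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1) + 1, m.end(1), m.group(1)) for m in regex.finditer(sequence)]

# N-glycosylation site pattern applied to a made-up sequence.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")   # "N[^P][ST][^P]"
glyco_hits = scan("N-{P}-[ST]-{P}.", "MNGTLNPSA")   # [(2, 5, "NGTL")]
```

A template of this kind is purely qualitative: a sequence either satisfies every positional constraint or it does not, so no score is attached to a hit.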

Keywords: motif; pattern; fingerprint; signature; profile; domain; protein family; pattern matching

Figure 1.

Motifs are representative of protein families and domains. They are translated into motif descriptors for automatic detection. The methods described in the text fall into two main categories: deterministic and probabilistic. In general, deterministic methods reflect the constraints on the occurrence of amino acids (qualitative), whereas probabilistic methods rely mainly on frequency calculations (quantitative).
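
As an illustration of the probabilistic category, the sketch below derives a position weight matrix of log‐odds scores from residue frequencies in a set of aligned motif instances and then scores every window of a query sequence. It is a minimal example under stated assumptions: the motif instances and the query are invented, the background distribution is assumed uniform over the 20 amino acids, a pseudocount of 1 is an arbitrary choice, and the names build_pwm and score_windows are hypothetical. Real profile methods use more elaborate weighting, gap handling and significance estimates.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(instances, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per motif column."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all motif instances must have the same length")
    background = 1.0 / len(AMINO_ACIDS)      # assumption: uniform background
    total = len(instances) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for column in zip(*instances):
        counts = Counter(column)
        # log2 of (smoothed column frequency / background frequency)
        pwm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pwm

def score_windows(pwm, sequence):
    """Score each window of the sequence; return (1-based start, score) pairs.

    Residues outside the 20 standard amino acids (e.g. X) are not handled.
    """
    width = len(pwm)
    return [
        (start + 1,
         sum(col[aa] for col, aa in zip(pwm, sequence[start:start + width])))
        for start in range(len(sequence) - width + 1)
    ]

# Made-up aligned motif instances and query sequence.
instances = ["GDSL", "GDSI", "GESL", "ADSL"]
pwm = build_pwm(instances)
hits = score_windows(pwm, "MKTGDSLAA")
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

Unlike a deterministic template, every window receives a score, so a threshold (ideally backed by a significance estimate) is needed to decide which windows count as matches.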

Figure 2.

Use of motif detection for proteome annotation. The ypbE gene product of Bacillus subtilis strain 168 is of unknown function. It resembles no other protein sequence besides a few found in other B. subtilis strains; however, it can be assigned to the Lysin motif protein family (accession number PF01476 in the PFAM database) because positions 191–236 delineate a pattern known as the Lysin motif. Features common to proteins in this family can provide hints for further understanding of the ypbE protein.

Further Reading

Baldi P and Brunak S (1998) Bioinformatics: The Learning Approach. Cambridge, MA: MIT Press.

Durbin R, Eddy SR, Krogh A and Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.

Web Links

Gibbs sampler http://bayesweb.wadsworth.org/gibbs/gibbs.html

HMMER http://hmmer.janelia.org/

MEME http://meme.sdsc.edu/meme/

Nested MICA http://www.sanger.ac.uk/Software/analysis/nmica/

PFAM (database of protein families) http://www.sanger.ac.uk/Pfam/

PROSITE http://prosite.expasy.org/

SLiMFinder http://bioware.ucd.ie/slimfinder.html

TEIRESIAS and MUSCA http://cbcsrv.watson.ibm.com/Ttwpd.html

How to Cite

Koua, Dominique, and Lisacek, Frédérique (Jun 2012) Pattern Searches in Protein Sequences. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0006222.pub2]