Protein Structure Classification

Abstract

To understand and map the universe of protein structures, it is necessary to collate, annotate and classify these structures in a rational scheme. While individual protein structures reveal much about the specific molecular mechanisms that underlie a particular biological function, taken together this body of data also allows biologists to explore the evolution of structure and function. As structure is much better conserved than sequence, these data also facilitate the recognition of evolutionary relationships that are hidden at the sequence level. The approaches that have been taken to tackle this problem include the identification of protein domains, phylogenetic and phenetic classification, and hierarchical and nearest‐neighbour clustering. Powerful sequence searching methods then allow these structural assignments to be transferred to genomic sequence data.
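As a rough illustration of the nearest‐neighbour clustering mentioned above, the following minimal sketch groups domains by single linkage: a domain joins a cluster if it lies within a distance cutoff of at least one existing member. The domain names, the distance values and the cutoffs are all made up for illustration, and the function `single_linkage_clusters` is a hypothetical helper, not the procedure used by any particular classification database.

```python
from itertools import combinations


def single_linkage_clusters(names, distance, cutoff):
    """Group items whose pairwise distance is at or below `cutoff`.

    Single linkage: an item joins a cluster if it is close enough to
    at least one member (its nearest neighbour), not to every member.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links up to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if distance[frozenset((a, b))] <= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical domains and invented structural distances (0 = identical).
domains = ["d1", "d2", "d3", "d4"]
toy_distance = {
    frozenset(("d1", "d2")): 0.2,
    frozenset(("d1", "d3")): 0.9,
    frozenset(("d1", "d4")): 0.8,
    frozenset(("d2", "d3")): 0.85,
    frozenset(("d2", "d4")): 0.9,
    frozenset(("d3", "d4")): 0.3,
}

# A loose cutoff followed by a strict one gives two levels of a hierarchy.
for cutoff in (0.5, 0.25):
    print(cutoff, single_linkage_clusters(domains, toy_distance, cutoff))
```

Applying progressively stricter cutoffs to the same distances yields nested groupings, which is one simple way to think about how a hierarchical scheme can be built from pairwise structural comparisons.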

Key Concepts

  • Proteins comprise recognisable smaller sequence domains.
  • Domains usually consist of secondary and supersecondary structures and have an average size of 150 ± 50 residues.
  • These domains may be thought of as units of evolution, which recur in many proteins in various combinations.
  • Structural classifications have been developed that group domains into fold and sequence families.
  • The phenetic approach groups the proteins according to their structural characteristics.
  • The phylogenetic approach groups proteins into families according to their evolutionary history.

Keywords: protein structure classification; common folds; protein architecture; structural comparison; genomes

Figure 1. Schematic representation of the Class (C), Architecture (A) and Topology/fold (T) levels in the CATH database.
Figure 2. There are over 2700 recognised homologous superfamilies in the CATH database; however, more than half of all known protein domains are found within just 100 of the most highly populated superfamilies.
Figure 3. This figure shows the structural superposition of 120 nonredundant CATH domains from the Trypsin‐like Serine Proteases CATH Superfamily (id: 2.40.10.10). These superpositions are calculated for all superfamilies in CATH and can be used to identify the highly conserved structural core that acts like a fingerprint for each superfamily.
Figure 4. Domain motions.
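The superfamily identifier quoted in Figure 3 (2.40.10.10) encodes the levels of the hierarchy shown in Figure 1, with the fourth number denoting the homologous superfamily. The sketch below splits such an identifier into its levels; the names `CathId`, `parse_cath_superfamily_id` and `ancestors` are hypothetical helpers written for this example and are not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """The four numbers of a CATH superfamily identifier."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_superfamily_id(text: str) -> CathId:
    """Parse an identifier such as '2.40.10.10' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {text!r}")
    return CathId(*(int(p) for p in parts))


def ancestors(cath_id: CathId) -> list:
    """Return the identifiers of the C, A, T and H levels, most general first."""
    numbers = [str(n) for n in cath_id]
    return [".".join(numbers[:depth]) for depth in range(1, 5)]


serine_proteases = parse_cath_superfamily_id("2.40.10.10")
print(serine_proteases)
print(ancestors(serine_proteases))
```

Keeping the levels as separate fields makes it straightforward to group domains at any depth of the hierarchy, for example by comparing only the first three numbers to find domains that share a fold.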

Further Reading

Branden C and Tooze J (1999) Introduction to Protein Structure, 2nd edn. New York: Garland Publishing.

Gu J and Bourne PE (eds) (2009) Structural Bioinformatics, 2nd edn. Hoboken, NJ: John Wiley & Sons Inc.

Lesk AM (2010) Introduction to Protein Science: Architecture, Function and Genomics, 2nd edn. Oxford: Oxford University Press.

Orengo CA, Thornton JM and Jones DT (eds) (2003) Bioinformatics: Genes, Proteins & Computers. Oxford: BIOS Scientific.

Williamson M (2012) How Proteins Work. New York: Garland Science, Taylor & Francis Group, LLC.

How to Cite
Pearl, Frances MG, Sillitoe, Ian, and Orengo, Christine A (Oct 2015) Protein Structure Classification. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0003033.pub3]