Protein Superfamilies


The concept of ‘superfamily’ is key to understanding the organisation of the protein universe and extracting useful information from it. As a set of proteins with a common ancestor, a superfamily contains rich information on the structural, functional and evolutionary features of its members. It represents a good balance between conservation and variability. On the one hand, the members of a superfamily have the same global 3D structure, can be aligned, and share many functional aspects. On the other hand, they are variable enough to encode important information on the functional diversity of their members. This variability also makes it possible to generate informative profiles, which can be used to detect distant relatives (e.g. to predict their function or their 3D structure). For these reasons, the concept of superfamily underlies many modern approaches for studying and predicting protein function and structure. The large body of data available on protein superfamilies is accessible through a number of web resources.
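The idea of a profile built from the variability within a superfamily can be illustrated with a minimal sketch. It assumes a toy gap-free alignment, a uniform background frequency and a simple pseudocount; the sequences and the function names `build_profile` and `score_sequence` are hypothetical and are not taken from any specific resource. Real profile methods use sequence weighting, substitution-based pseudocounts and gap handling.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build a log-odds profile (one dict per alignment column).

    Assumptions (illustrative only): the alignment has no gaps, all
    sequences have the same length, and the background frequency of
    every amino acid is uniform (1/20).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive score: residue is enriched at this position.
            column_scores[aa] = math.log2(freq / background)
        profile.append(column_scores)
    return profile


def score_sequence(profile, sequence):
    """Sum the per-column log-odds scores of an ungapped candidate."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(col.get(res, 0.0) for col, res in zip(profile, sequence))


# Hypothetical aligned members of one superfamily (toy data).
members = ["DKDGDG", "DKDGNG", "DADGDG", "DKNGDG"]
profile = build_profile(members)

distant_relative = "DRDGSG"  # keeps the conserved positions only
unrelated = "LLVVAA"         # shares no conserved position

relative_score = score_sequence(profile, distant_relative)
unrelated_score = score_sequence(profile, unrelated)
print(f"distant relative: {relative_score:.2f}")
print(f"unrelated:        {unrelated_score:.2f}")
```

In this toy case the candidate that keeps the conserved positions scores higher than the unrelated one, even though it differs from every member at the variable positions, which is the behaviour that lets profiles pick up distant relatives.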

Key Concepts:

  • A superfamily is a group of protein domains that arose from a common ancestor, whether or not this common origin is evident from their sequence identities.

  • The superfamily is the most informative level of protein classification hierarchies.

  • The concept of superfamily is behind most methods for predicting protein structure and function.

  • Information on protein superfamilies is available online through dedicated websites.

  • The rate of ‘discovery’ of new superfamilies is declining, and some researchers think that we might be close to having ‘touched’ most superfamilies of the protein universe.

Keywords: protein sequence; protein three‐dimensional structure; protein domain; protein function; protein structure prediction; protein function prediction; protein functional sites; protein evolution

Figure 1.

(a) Hierarchical classification of the protein universe. The main levels (structural class, fold, superfamily, etc.) are depicted, with some examples for each. Some variations of this general schema are shown in grey at the top. (b) Illustration of some elements of the hierarchical classification (fold, superfamily (SF) and family (F)) in a 2D representation of the sequence space. The different (solid) colours represent different functions. ‘?’ represents a protein of unknown structure (target) that can be modelled based on different templates (red dots), using different modelling approaches (in red) depending on the target–template relationship (dotted red lines): within the same family, same superfamily, or same fold. Example structural alignments of calmodulin (PDBId/chain: 1c7vA) with troponin (1topA, same family), polcalcin (1k9uA, same superfamily) and THP12 carrier protein (1c3zA, same fold) are shown in backbone representation, coloured according to amino‐acid type. Calmodulin is always the thin backbone. The percentages of sequence identity are shown.
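As a rough illustration of how a percentage of sequence identity such as those in Figure 1 can be obtained from a pairwise alignment, the sketch below counts identical residues over the aligned (non-gap) columns. The choice of denominator is an assumption: other conventions divide by the length of the shorter sequence or by the full alignment length, and the article does not state which one was used. The function name `percent_identity` and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over columns where neither
    sequence has a gap (one of several possible conventions)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy aligned pair (hypothetical): 5 compared columns, 4 identical.
identity = percent_identity("ACDE-G", "ACDFHG")
print(f"{identity:.1f}%")  # prints "80.0%"
```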

Figure 2.

Yearly increment in the number of structures deposited in PDB@RCSB compared with the number of different superfamilies they represent (as defined in SCOP and CATH). The Y axis is on a logarithmic scale. Generated from the data available at the RCSB statistics page.



Further Reading

Chothia C, Gough J, Vogel C and Teichmann S (2003) Evolution of the protein repertoire. Science 300: 1701–1703.

Creighton TE (1993) Proteins, Structures and Molecular Properties. New York, NY: W. H. Freeman and Company.

Koonin EV and Galperin MY (2003) Sequence – Evolution – Function: Computational Approaches in Comparative Genomics. Boston, MA: Kluwer Academic.

Lee D, Redfern O and Orengo C (2007) Predicting protein function from sequence and structure. Nature Reviews Molecular Cell Biology 8: 995–1005.

Ouzounis C, Coulson RMR, Enright AJ, Kunin V and Pereira‐Leal JB (2003) Classification schemes for protein structure and function. Nature Reviews Genetics 4: 508–519.

Vendruscolo M and Dobson CM (2005) A glimpse at the organization of the protein universe. Proceedings of the National Academy of Sciences of the USA 102: 5641–5642.

Wallace IM, Blackshields G and Higgins DG (2005) Multiple sequence alignments. Current Opinion in Structural Biology 15: 261–266.

How to Cite

Pazos, Florencio, and Sánchez‐Pulido, Luis (Aug 2014) Protein Superfamilies. In: eLS. John Wiley & Sons Ltd, Chichester. [doi: 10.1002/9780470015902.a0025587]