Protein Family Databases


As advances in sequencing technologies continue to flood public databases with new protein sequences, protein family databases become increasingly important for the automatic functional classification of proteins. These databases are developed independently; each has its own methods and areas of interest, as well as its own strengths and weaknesses. To simplify access for users, many of them have also been amalgamated into integrated protein family resources, which vary in their level of manual curation. Protein family databases and integrated resources have a number of applications in modern biology and bioinformatics, including protein functional annotation, orthologue prediction, protein–protein interaction prediction, gene set enrichment analysis and the provision of datasets for evaluating mathematical models of biological systems or networks.

Key Concepts:

  • Protein signatures are mathematical descriptions of the sequence characteristics of members of the same protein family or domain.

  • Profiles and hidden Markov models are tools for characterising protein families or domains.

  • Regular expressions or patterns are used for describing short highly conserved motifs.

  • Protein family data has a number of applications, notably for the functional classification of new protein sequences.
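
As an illustration of the two simplest signature types listed above, the following Python sketch translates a PROSITE-style pattern into a regular expression and builds a toy ungapped profile of log-odds scores. It is a minimal sketch under stated assumptions, not the method used by any particular database: the function names (prosite_to_regex, find_motifs, build_profile, score_segment), the uniform background frequencies, the pseudocount and the three-sequence alignment are all hypothetical choices made for illustration, and only the common subset of the pattern syntax is handled. Profiles used in practice typically also model substitutions and gaps, and hidden Markov models add explicit insert and delete states.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles x (any residue), [ABC] (allowed), {ABC} (forbidden),
    repeats such as x(2) or x(2,4), and the < and > terminus anchors.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, repeat = m.group(1), m.group(2)
        if residue == "x":
            residue = "."
        elif residue.startswith("{") and residue.endswith("}"):
            residue = "[^" + residue[1:-1] + "]"
        # Single letters and [ABC] sets are already valid regex syntax.
        if repeat:
            residue += "{" + repeat + "}"
        parts.append(prefix + residue + suffix)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


def build_profile(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Assumes equal-length sequences and a uniform background (illustrative).
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical toy data: the N-glycosylation site pattern and a made-up alignment.
hits = find_motifs("N-{P}-[ST]-{P}", "MKNGSAANPTQ")
toy_profile = build_profile(["NGSA", "NASV", "NGTA"])
```

Here hits is [(3, 'NGSA')]: the second asparagine is followed by proline and is therefore rejected by the pattern. The pattern gives a yes-or-no answer, whereas score_segment(toy_profile, ...) ranks candidate segments by how closely they resemble the aligned family members, which is the property that makes profiles better suited to divergent families.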

Keywords: protein family; domain; annotation; functional classification; profiles; hidden Markov models

Figure 1.

Example output for the query protein O54689, the mouse C‐C chemokine receptor type 6: (a) InterPro output and (b) CDD output.



Further Reading

Geer LY, Domrachev M, Lipman DJ and Bryant SH (2002) CDART: protein homology by domain architecture. Genome Research 12(10): 1619–1623.

Goodsell DS (2010) The protein data bank: exploring biomolecular structure. Nature Education 3(9): 39.

Jones P, Binns D, McMenamin C, McAnulla C and Hunter S (2011) The InterPro BioMart: federated query and web service access to the InterPro Resource. Database 10.1093/database/bar033.

Redfern O, Grant A, Maibaum M and Orengo C (2005) Survey of current protein family databases and their application in comparative, structural and functional genomics. Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences 815(1–2): 97–107.

How to Cite
Mulder, Nicola J (Oct 2012) Protein Family Databases. In: eLS. John Wiley & Sons Ltd, Chichester. [doi: 10.1002/9780470015902.a0003058.pub3]