Protein Tertiary Structures: Prediction from Amino Acid Sequences


Proteins play a crucial role in all cellular processes. Because the function of a protein is intrinsically linked to its structure, determining protein tertiary structures is key to understanding the biological functions of proteins. Resolving protein three‐dimensional (3D) structures experimentally is complicated and time‐consuming, so despite recent efforts to determine representative structures for each protein family, the number of known protein structures is dwarfed by the number of known protein sequences. Computational methods that predict protein tertiary structures directly from their amino acid sequences have been developed to bridge this gap. These structure prediction algorithms are based on the observation that the number of protein folds is limited and that most protein sequences will fold into one of these globular structures. The field of structure prediction is now quite mature, and many stable methods exist for generating alignments between sequences and structures and for building 3D models.

Key Concepts:

  • Structure can be predicted from sequence because protein folds are relatively stable.

  • A wide variety of methods exist for the prediction of 3D structure from protein sequence.

  • The easiest targets are those for which it is possible to detect an evolutionarily related template structure that aligns with more than 30% sequence identity and few gaps.

  • In these cases, structure prediction is trivial and the emphasis should be placed on all‐atom refinement measures.

  • Where template structures are more remotely related, maximum effort should be put into obtaining an alignment between template and target sequences.

  • Structural domains, disorder, secondary structure and important functional residues need to be considered when building a model of a target protein.
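
The 30% identity rule of thumb in the Key Concepts can be illustrated with a minimal Python sketch. It assumes that the target-template alignment is supplied as two equal-length gapped strings, and it defines identity as identical residues divided by aligned (ungapped) columns, which is only one of several conventions in use. The function names, the example sequences and the max_gap_openings cut-off are hypothetical and chosen for illustration; the text above says only "few gaps".

```python
def alignment_stats(target_aln: str, template_aln: str) -> dict:
    """Summarise a pairwise alignment given as two equal-length gapped strings."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")

    aligned = identical = gap_columns = gap_openings = 0
    in_gap = False
    for a, b in zip(target_aln.upper(), template_aln.upper()):
        if a == "-" and b == "-":
            continue  # column is empty in both sequences; ignore it
        if a == "-" or b == "-":
            gap_columns += 1
            if not in_gap:
                gap_openings += 1  # a new run of gap columns starts here
            in_gap = True
        else:
            aligned += 1
            identical += a == b
            in_gap = False

    # Assumed convention: identity over aligned (ungapped) columns only.
    identity = 100.0 * identical / aligned if aligned else 0.0
    return {
        "aligned": aligned,
        "identical": identical,
        "identity": identity,
        "gap_columns": gap_columns,
        "gap_openings": gap_openings,
    }


def is_easy_target(stats: dict, min_identity: float = 30.0,
                   max_gap_openings: int = 3) -> bool:
    """Apply the 'more than 30% identity and few gaps' rule of thumb.

    max_gap_openings is an illustrative assumption, not a published threshold.
    """
    return (stats["identity"] > min_identity
            and stats["gap_openings"] <= max_gap_openings)


if __name__ == "__main__":
    # Hypothetical aligned fragments, not taken from any real protein.
    stats = alignment_stats("MKTAYIAKQR-QISFVK", "MKSAYLAKQRGQ-SFIK")
    print(stats, is_easy_target(stats))
```

A target that passes this check would fall into the "easy" category, where effort is better spent on refinement than on the alignment itself; one that fails would call for more careful alignment work.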

Keywords: protein structure prediction; protein folding; homology modelling; fold recognition; ab initio prediction

Figure 1.

The crystal structure of a putative nitroreductase from Mycobacterium smegmatis (PDB code: 2ymv) showing beta‐strands in red and alpha helices in teal.

Figure 2.

View of an alignment between target sequence and structural template (PDB code: 3tac) from the HHpred web server. HHpred is a particularly useful server for alignment editing. In the figure, predicted helix‐forming residues for the target (first line) and template (last line), as well as the real secondary structure for the template (penultimate line), are indicated in red with an ‘H’. There are five single‐residue gaps in the alignment and three broken helices in the template (each marked by a red arrow). Here, the predictor should consider shifting the gaps so that they fall in an adjacent loop region (marked as ‘C’ in the secondary structure notation). However, where possible the predictor should try not to disturb the conserved loop regions (indicated by the symbols ‘|’ and ‘+’ in the fourth line).
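
The check described in this caption, finding gaps that break a template helix and locating a nearby loop to move them to, can be sketched in Python. This is a simplified illustration under stated assumptions, not the procedure used by HHpred: the alignment and the template secondary structure are assumed to be plain strings with one character per alignment column ('H' helix, 'E' strand, 'C' loop, '-' where the template is gapped), and all function names and example strings are hypothetical.

```python
from typing import List, Optional


def _nearest_state(ss: str, start: int, step: int) -> Optional[str]:
    """Walk left (step=-1) or right (step=+1) to the next annotated column."""
    i = start + step
    while 0 <= i < len(ss):
        if ss[i] != "-":
            return ss[i]
        i += step
    return None


def helix_breaking_gaps(target_aln: str, template_aln: str,
                        template_ss: str) -> List[int]:
    """Return alignment columns where a gap falls inside a template helix."""
    n = len(target_aln)
    if not (len(template_aln) == n and len(template_ss) == n):
        raise ValueError("all three strings must have the same length")

    flagged = []
    for i in range(n):
        if target_aln[i] == "-" and template_aln[i] != "-":
            # Deletion in the target: the skipped template residue is helical.
            if template_ss[i] == "H":
                flagged.append(i)
        elif template_aln[i] == "-" and target_aln[i] != "-":
            # Insertion in the target: helical template residues on both sides.
            left = _nearest_state(template_ss, i, -1)
            right = _nearest_state(template_ss, i, +1)
            if left == "H" and right == "H":
                flagged.append(i)
    return flagged


def nearest_loop_column(template_ss: str, column: int) -> Optional[int]:
    """Return the closest column annotated as loop ('C'), or None."""
    loops = [i for i, s in enumerate(template_ss) if s == "C"]
    return min(loops, key=lambda i: abs(i - column)) if loops else None


if __name__ == "__main__":
    # Hypothetical 14-column alignment; not the alignment shown in the figure.
    target = "MKLAE-LLKRGSPD"
    template = "MKIAEELLRRG-PD"
    ss = "CCHHHHHHHHC-CC"
    for col in helix_breaking_gaps(target, template, ss):
        print("gap at column", col, "-> nearest loop column",
              nearest_loop_column(ss, col))
```

The sketch only reports candidate positions. Deciding whether to move a gap, and checking that the move does not disturb a conserved loop, is left to the modeller.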

Figure 3.

First panel: the crystal structure of domain 1 of a hypothetical protein from Bacteroides eggerthii (in red, PDB code: 4FTD) superimposed on one of the best predictions from CASP10 (light blue, from HHpred). The nearest template had just 17% identity with the target. While the positions of many of the secondary structure elements are similar in the model and target structures, there are large differences in the loop regions, especially clear in the loops on the left. Second panel: the crystal structure of a putative lipoprotein from Parabacteroides distasonis (in red, PDB code: 4FVS) superimposed on a good model (light blue, from HHpred). Here, the nearest template was 43% identical to the target. With the better alignment, all the helices, strands and conserved loops in the model and target structures superimpose well, and the largest differences between target and model are in nonconserved loop regions. In the first panel, where the target and template structures had just 17% identity, the positions of the side chains (not shown) are not at all similar. In the second panel, where the target and template had 43% identity, the side chains (not shown) are more or less correctly positioned in the model, with the exception of the variable loops.

Figure 4.

The crystal structure of a guide‐strand‐containing Argonaute protein silencing complex (PDB code: 3dlb). The protein has five separate structural domains, each shown in a different colour.



Further Reading

Baker D and Sali A (2001) Protein structure prediction and structural genomics. Science 294: 93–96.

Ginalski K, Grishin NV, Godzik A and Rychlewski L (2005) Practical lessons from protein structure prediction. Nucleic Acids Research 33: 1874–1891.

Levitt M (2009) Nature of the protein universe. Proceedings of the National Academy of Sciences of the USA 106: 11079–11084.

Zhang Y (2009) Protein structure prediction: when is it useful? Current Opinion in Structural Biology 19: 145–155.

How to Cite
Tress, Michael (Oct 2013) Protein Tertiary Structures: Prediction from Amino Acid Sequences. In: eLS. John Wiley & Sons Ltd, Chichester. [doi: 10.1002/9780470015902.a0003040.pub2]