Protein Databases: Operations, Possibilities and Challenges

Abstract

Information on protein function is essential for the development of biology and biomedicine, because proteins are essential components of biological systems. This information is organised in databases that store protein sequences, domain organisation, three‐dimensional structures, posttranslational modifications, interactions, molecular functions and other protein features. It is combined with appropriate bioinformatics methods to address complex biological problems. The annotation of splice isoforms serves here as a current example of how information from the central protein databases is used. Instead of offering a mere catalogue of the many protein databases, whose status is periodically reviewed and maintained in ‘The Molecular Biology Database Collection’ (http://www.oxfordjournals.org/nar/database/c/), this article reviews the operations necessary to organise their information and make it openly accessible as part of the complex ecosystem of bioinformatics infrastructures. Finally, this article discusses the main challenges that their sustainability poses.

Key Concepts:

  • Functional annotation is the process of assigning a function to a protein, based either on experimental evidence published in the literature or on annotations transferred from other proteins with similar sequences.

  • Database curation is the process by which groups of experts, typically associated with large databases, add information to database entries.

  • Information transfer is the process of assigning functions or functional characteristics to protein sequences based on the annotations of similar (in most cases orthologous) sequences.

  • Information integration is the process of combining annotations from different publications and databases to produce a representative description of a protein's characteristics and functional properties.

  • Bioinformatics infrastructure is the collection of databases, bioinformatics methods, and computational resources that provide the essential support to the work of biologists.
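The concept of information transfer defined above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Entry` class, the function names, the 0.8 threshold and the position-by-position identity measure are hypothetical, and real annotation pipelines rely on proper sequence alignment and orthology analysis rather than this crude comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical database entry: accession, sequence, and function terms."""
    accession: str
    sequence: str
    # Maps a function term to a label describing the evidence behind it.
    functions: dict = field(default_factory=dict)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions, normalised by the longer sequence.

    Crude stand-in for an alignment-based similarity score (assumption).
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def transfer_annotations(query: Entry, annotated: list, threshold: float = 0.8) -> Entry:
    """Return a copy of `query` with terms copied from sufficiently similar entries.

    Transferred terms are labelled with their source accession, so they stay
    distinguishable from annotations the query already carried.
    """
    result = Entry(query.accession, query.sequence, dict(query.functions))
    for other in annotated:
        if identity(query.sequence, other.sequence) >= threshold:
            for term in other.functions:
                # setdefault keeps any annotation already present on the query.
                result.functions.setdefault(term, f"transferred from {other.accession}")
    return result
```

Recording the source of each transferred term is a deliberate choice in this sketch: it keeps inferred annotations separate from experimentally supported ones, which matters when annotations from several sources are later integrated.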

Keywords: protein; database; annotation; text mining; protein function; protein structure; protein domains; posttranslational modifications

Figure 1.

APPRIS constitutive protein isoform annotation system. APPRIS is an example of a secondary database built on the information provided by the core protein databases. It contains information on human splice isoforms, annotating them with structural, functional and evolutionary features, and is part of the infrastructure behind the ongoing annotation of the human genome. The database integrates eight annotation modules: Matador3D is based on structural homologies extracted from the PDB database (http://www.rcsb.org/pdb/); firestar (http://firedb.bioinfo.cnio.es) predicts conserved, functionally important amino acid residues using information from protein structure comparison databases (for example, http://www.cathdb.info); SPADE uses information from a protein domain database (pfam.sanger.ac.uk/); INERTIA detects unusual evolutionary rates, CRASH makes conservative predictions of signal peptides and THUMP predicts transmembrane helices, all three using information provided by the UniProt central database (http://www.uniprot.org/); and CExonic and CORSAIR determine patterns of conservation of exonic structures using different types of alignment strategies, similar to those provided by databases of orthologs and interspecies alignments (see http://questfororthologs.org/orthology_databases). Adapted with permission from Rodriguez et al. (2013).
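The dependence of a secondary database on the core resources, as described in the caption, can be written down as a small lookup table. This is only an illustrative sketch: the module names and their upstream resources are taken from the caption, while the dictionary, the function name and the short resource labels are assumptions and are not part of APPRIS itself.

```python
from collections import defaultdict

# Module -> upstream resource, as listed in the figure caption.
# The resource labels are shorthand chosen for this sketch.
MODULE_SOURCES = {
    "Matador3D": "PDB",
    "firestar": "protein structure comparison databases",
    "SPADE": "protein domain database",
    "INERTIA": "UniProt",
    "CRASH": "UniProt",
    "THUMP": "UniProt",
    # The caption describes these alignments as similar to those offered by
    # orthology databases, not as taken directly from them.
    "CExonic": "alignments similar to orthology databases",
    "CORSAIR": "alignments similar to orthology databases",
}


def modules_by_source(mapping: dict) -> dict:
    """Group module names by the upstream resource they rely on."""
    grouped = defaultdict(list)
    for module, source in mapping.items():
        grouped[source].append(module)
    return dict(grouped)
```

Grouping the modules this way shows at a glance which annotation modules would be affected if a given core database changed or became unavailable.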


References

Krallinger M, Valencia A and Hirschman L (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biology 9(suppl. 2): S8.

Leitner F, Chatr‐aryamontri A, Mardis SA et al. (2010) The FEBS Letters/BioCreative II.5 experiment: making biological information accessible. Nature Biotechnology 28: 897–899.

Rodriguez JM, Maietta P, Ezkurdia I et al. (2013) APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Research 41(Database issue): D110–D117.

Further Reading

Fernández‐Suárez XM, Rigden DJ and Galperin MY (2014) The 2014 nucleic acids research database issue and an updated NAR online molecular biology database collection. Nucleic Acids Research 42(Database issue): D1–D6.

Gannon F (2006) Life science infrastructures are different. EMBO Reports 7(4): 347.

Gu J and Bourne PE (eds) (2009) Structural Bioinformatics. Weinheim: John Wiley & Sons, Inc.

Krallinger M, Leitner F and Valencia A (2010) Analysis of biological processes and diseases using text mining approaches. Methods in Molecular Biology 593: 341–382.

Orengo C and Bateman A (eds) (2013) Protein Families: Relating Protein Sequence, Structure, and Function. Weinheim: John Wiley & Sons Inc.

Valencia A (2002) Search and retrieve. Large‐scale data generation is becoming increasingly important in biological research. But how good are the tools to make sense of the data? EMBO Reports 3: 396–400.

Web Links

1000 human genomes project. http://www.1000genomes.org

Cytoscape. www.cytoscape.org

ELIXIR. www.elixir‐europe.org

ENCODE. https://www.genome.gov/encode/

Gencode/ENCODE. http://www.gencodegenes.org

International Cancer Genome Consortium. https://icgc.org

String. string.embl.de

Swissmodel. http://swissmodel.expasy.org

How to Cite
Valencia, Alfonso (Dec 2014) Protein Databases: Operations, Possibilities and Challenges. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005251.pub2]