Genetic Databases

Abstract

Internationally available databases contain the sequences of genes, together with related information concerning the proteins they encode, their functions, disease associations and so on. These repositories capture the data emerging from the worldwide genome sequencing projects, and make them freely available to the community for all aspects of modern biological and biomedical research.

Keywords: EMBL; GenBank; DDBJ; Swiss‐Prot; UniProt; Internet

Figure 1.

Landmark events in the dawning of the genomic era. The top line immediately below the time line charts the events leading from manual peptide sequencing of the first hormone and enzyme, through the appearance of automated peptide sequencers in the 1960s to deoxyribonucleic acid (DNA) sequencing in the 1970s, with high‐throughput (HT) techniques appearing around 1992, and the resultant flood of completed genomes in the mid‐1990s (Haemophilus influenzae, Methanococcus jannaschii, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens). In parallel with the information boom (bottom line), databases began to proliferate in the 1980s, starting with relatively simple sequence repositories, through individual family databases, organism‐specific databases, molecular interaction and pathway databases, to integrated databases of protein families and functional sites. The birth of the Web (WWW) in the early 1990s provided a vehicle for rapid dissemination of the data deluge throughout the world.

Figure 2.

Format of a typical EMBL entry: mouse prion protein. The two‐letter code at the beginning of each line indicates the type of information contained in that line. The sequence field (SQ) contains the nucleotide sequence, and the coding sequence (CDS) field of the feature table (FT) contains its protein translation. Note the cross‐references to Swiss‐Prot (Figure 3) and the Mouse Genome Database (MGD) in the database cross‐reference (DR) lines in the centre of the figure.
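The line-code convention described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: the sample entry is invented and heavily abbreviated, the function names are hypothetical, and the layout assumed here (a two-letter code, three spaces, then the content; XX as a spacer line; sequence lines indented with spaces; // as the terminator) is a simplification of the real flat-file format.

```python
from collections import defaultdict

# Invented, heavily abbreviated entry for illustration only; not a real record.
SAMPLE_ENTRY = """\
ID   EXAMPLE1; SV 1; linear; mRNA; STD; MUS; 24 BP.
XX
AC   X00000;
XX
DE   Hypothetical example entry
XX
DR   ExampleDB; E0001; example.
XX
FT   CDS             1..24
FT                   /translation="MANLGYWL"
XX
SQ   Sequence 24 BP;
     atggcgaacc ttggctactg gctg                                        24
//
"""


def group_by_line_code(entry_text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue                      # ignore empty lines
        if line.startswith("//"):
            break                         # end-of-entry terminator
        code = line[:2]
        if code == "XX":
            continue                      # spacer line, carries no data
        if code == "  ":
            # Indented lines after SQ hold the sequence itself.
            fields["SQ"].append(line.strip())
        else:
            fields[code].append(line[5:].rstrip())
    return fields


def extract_sequence(fields):
    """Join the sequence lines, dropping spaces and the position counter."""
    sequence_lines = fields["SQ"][1:]     # first SQ line is the summary header
    return "".join(ch for line in sequence_lines for ch in line if ch.isalpha())


if __name__ == "__main__":
    parsed = group_by_line_code(SAMPLE_ENTRY)
    print(sorted(parsed))
    print(extract_sequence(parsed))
```

Grouping by line code keeps the sketch independent of which fields a given entry happens to contain, which matters because different databases populate different subsets of the same codes.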

Figure 3.

Format of a typical Swiss‐Prot entry: mouse prion protein. The two‐letter code at the beginning of each line adheres to the EMBL format and indicates the type of information contained in that line. Note the increased amount of annotation by comparison with the EMBL entry in Figure 2: Swiss‐Prot includes more literature references (for convenience, references 2–5 have been replaced by ellipses); an extensive comment field (CC), describing the function, structure and disease associations of the protein; an increased number of database cross‐references (DR); and an extended feature table (FT), describing features of the sequence, such as potential carbohydrate binding sites, lipid attachment sites and internal repeats. Note the reciprocal cross‐reference to EMBL/GenBank/DDBJ in the DR lines in the centre of the figure.
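The database cross‐reference (DR) lines mentioned in this caption lend themselves to a short worked example. The sketch below assumes that each DR line lists the database name first, followed by semicolon-separated identifiers and a closing full stop; the identifiers shown are invented placeholders, and the function name is hypothetical rather than part of any database toolkit.

```python
# Invented DR lines for illustration only; the identifiers are placeholders.
SAMPLE_DR_LINES = """\
DR   EMBL; X00000; CAA00000.1; -; mRNA.
DR   PDB; 0XXX; NMR; -; A=1-8.
"""


def parse_dr_lines(entry_text):
    """Return (database, identifiers) tuples for every DR line in an entry."""
    references = []
    for line in entry_text.splitlines():
        if not line.startswith("DR   "):
            continue                      # only cross-reference lines matter here
        # Drop the line code, then the closing full stop, then split on ';'.
        content = line[5:].rstrip().rstrip(".")
        parts = [part.strip() for part in content.split(";")]
        parts = [part for part in parts if part]
        if parts:
            references.append((parts[0], parts[1:]))
    return references


if __name__ == "__main__":
    for database, identifiers in parse_dr_lines(SAMPLE_DR_LINES):
        print(database, identifiers)
```

Collecting the database names in this way makes it straightforward to check, for instance, whether a protein entry points back to a nucleotide entry and whether that entry points to it in turn.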

Figure 4.

Format of a typical TrEMBL entry: dog prion protein. The two‐letter code at the beginning of each line adheres to the EMBL format and indicates the type of information contained in that line. The information included in the entry is generated automatically and is consequently limited by comparison with the Swiss‐Prot entry in Figure 3. Notably absent are a free‐text comment field (CC) and a feature table (FT). Note the cross‐reference to EMBL/GenBank/DDBJ in the DR lines in the centre of the figure.

Figure 5.

Computer‐generated image of the structure of the prion protein, whose related EMBL, Swiss‐Prot and TrEMBL entries are illustrated in Figures 2, 3 and 4, respectively. The extended coils indicate local helical structures (α‐helices) and the short arrows denote stretches of β‐strand within the overall protein fold. The structure is drawn from the PDB link (DR lines) within Swiss‐Prot.

Further Reading

Attwood TK (2000) The quest to deduce protein function from sequence: the role of pattern databases. International Journal of Biochemistry and Cell Biology 32: 139–155.

Benson DA, Karsch‐Mizrachi I, Lipman DJ, Ostell J and Wheeler DL (2007) GenBank. Nucleic Acids Research 35: D21–D25.

Berners‐Lee T and Fischetti M (1999) Weaving the Web. San Francisco, CA: Harper.

Dayhoff MO (ed.) (1965) Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation.

Galperin MY (2007) The molecular biology database collection: 2007 update. Nucleic Acids Research 35: D3–D4.

Kulikova T, Akhtar R, Aldebert P et al. (2007) EMBL nucleotide sequence database in 2006. Nucleic Acids Research 35: D16–D20.

Special History Issue (2000) Bioinformatics 16: 1–75.

Sugawara H, Abe T, Gojobori T and Tateno Y (2007) DDBJ working on evaluation and classification of bacterial genes in INSDC. Nucleic Acids Research 35: D13–D15.

The UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Research 35: D193–D197.

Web Links

DDBJ. The homepage of the DNA Data Bank of Japan. http://www.ddbj.nig.ac.jp/

EMBL. The homepage of Europe's EMBL nucleotide database. http://www.ebi.ac.uk/embl/

GenBank. The homepage of NCBI's GenBank database. http://www.ncbi.nlm.nih.gov/GenBank/

INSDC. The homepage of the International Nucleotide Sequence Database Collaboration. http://www.insdc.org/

UniProt. The homepage of the Universal Protein Resource. http://www.uniprot.org

How to Cite

Attwood, Teresa K (Sep 2007) Genetic Databases. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005312.pub2]