Genetic Sequence Databases

Abstract

From elucidation of the first protein sequence to publication of the human genome, sequence information has been transformative. High‐throughput technologies created torrents of genomic data, which needed to be annotated and stored for use in research. The sequencing revolution was remarkable; what made it a ‘game changer’ was the simultaneous innovation in Web technologies that opened the Internet to mass audiences and gave scientists the unique ability to collect, organise and immediately share information – the impact was huge. The cottage industry of sequence collection assumed industrial proportions, requiring worldwide cooperation to harness the deluge. Today, internationally available databases house the sequences of genes, information about their encoded proteins, their functions, disease associations and so on. Capturing data from worldwide genome projects, and making them freely available for research, these repositories continue to support the ongoing quest to understand our genetic ancestry and to address major challenges in the health, pharmaceutical and agricultural industries.

Key Concepts

  • Sequence information became available slowly, from pioneering work on the manual sequencing of proteins.
  • Elucidating nucleotide sequences was technically more difficult because of the size of DNA molecules.
  • Once molecular sequences were published, enthusiasts began to collect them in databases (the first ‘database’ was actually a book!).
  • Automatic, and then high‐throughput, technologies changed the pace of sequencing, and made whole‐genome sequencing feasible for the first time.
  • Landmark genome‐sequencing projects ushered in a new era of data generation.
  • The arrival of Web technologies coincided with the emergence of high‐throughput sequencing capabilities, and led to a proliferation of biological databases.
  • Sequence data are now generated on such a scale that the information has to be gathered via international consortia.
  • The first nucleotide sequence databases were the EMBL data library, GenBank and DDBJ, which now cooperate under the auspices of the INSDC.
  • Amongst the first protein sequence databases were the PIR‐PSD, Swiss‐Prot and TrEMBL, which now pool resources under the umbrella resource, UniProt.
  • Sequence databases play pivotal roles in all aspects of life‐science research, and will continue to make important contributions to research in the health, pharmaceutical and agricultural sciences.

Keywords: ENA; GenBank; DDBJ; Swiss‐Prot; TrEMBL; PIR; UniProt; Internet; sequence; genome

Figure 1. Landmark events in the dawning of the genomic era. Flags beneath the timeline show how changes in sequencing technology (from the appearance of automated peptide sequencers in the 1960s, to DNA sequencing in the 1970s and high‐throughput (HT) techniques in the early 1990s) led from manual peptide sequencing of the first hormone in 1955 and enzyme in 1965 to the flood of completed genomes in the mid‐1990s. Echoing the information boom (flags above the timeline), databases began to proliferate in the 1980s, starting with relatively simple sequence repositories, through individual family databases, organism‐specific databases, molecular interaction and pathway databases, to integrated databases of protein families and functional sites. The birth of the Web (WWW) in the early 1990s provided a vehicle for rapid dissemination of the data deluge throughout the world.
Figure 2. Format of a typical EMBL entry: the luxF gene from Photobacterium phosphoreum. The two‐letter code at the beginning of each line indicates the type of information contained in that line. The entry contains an identifier (ID), an accession number (AC) – here, both M22128 – and a literature cross‐reference (RN, RP, RX, etc.); the sequence field (SQ) contains the nucleotide sequence, and the coding sequence (CDS) field of the feature table (FT) its protein translation. Note the cross‐references to UniProtKB/Swiss‐Prot (Figure 3) and the PDB (Figure 5) in the database cross‐reference (db_xref) lines within the feature table. Ellipses denote lines deleted for brevity.
Figure 3. Format of a typical UniProtKB/Swiss‐Prot entry: the nonfluorescent flavoprotein from P. phosphoreum (whose gene sequence is shown in the EMBL entry in Figure 2). The two‐letter code at the beginning of each line adheres to the EMBL format and indicates the type of information contained in that line. Note the detailed annotations: the entry includes the identifier (ID) and the accession number (AC) lines – here, LUXF_PHOPO and P12745, respectively; several literature cross‐references (RN, RP, RX, etc.); a free‐text comment field (CC), describing the protein's cofactor, subunit structure and family relationships; a large number of database cross‐references (DR); a feature table (FT), describing elements of the 3D structure and their locations; and the amino acid sequence itself (SQ). Note the reciprocal cross‐reference to EMBL, plus the link to the PDB, in the DR lines in the centre of the figure. Ellipses denote lines deleted for brevity.
Figure 4. Format of a typical UniProtKB/TrEMBL entry: the nonfluorescent flavoprotein from Photobacterium leiognathi. The two‐letter code at the beginning of each line adheres to the EMBL format and indicates the type of information contained in that line. The information included in the entry is generated automatically and is hence more limited than that typically found in UniProtKB/Swiss‐Prot entries (Figure 3). Notable absences are the free‐text comment field (CC) and the substantial feature table (FT). Note the reciprocal cross‐reference to EMBL in the DR lines in the centre of the figure.
Figure 5. Computer‐generated image of the structure of flavoprotein 390 from P. phosphoreum (1FVP), whose related EMBL and UniProtKB/Swiss‐Prot entries are illustrated in Figures 2 and 3, respectively. Local helical structures (α‐helices) are red, and extended ribbons (β‐strands) are yellow. The image was generated using the NGL Viewer (Rose and Hildebrand, 2015).
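The two‐letter line codes described in the captions for Figures 2–4 lend themselves to simple programmatic parsing. The following is a minimal Python sketch, not the method used by any of the databases themselves. It assumes a heavily abbreviated, illustrative record: the identifiers (LUXF_PHOPO, P12745, M22128, 1FVP) come from the captions, but the line contents are shortened and the residues are placeholders, not the real sequence. The function names parse_flat_file and cross_references are hypothetical.

```python
from collections import defaultdict

# Abbreviated, illustrative record. Real entries carry many more lines and
# longer DR lines; the residues below are placeholders, not the real protein.
SAMPLE_ENTRY = """\
ID   LUXF_PHOPO
AC   P12745;
DR   EMBL; M22128;
DR   PDB; 1FVP;
SQ   SEQUENCE (placeholder residues)
     ACDEFGHIKL MNPQRSTVWY
//
"""


def parse_flat_file(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            break  # '//' terminates an entry
        code, content = line[:2], line[5:].strip()
        if in_sequence and code == "  ":
            # Sequence lines follow SQ and have a blank line code.
            sequence_parts.append(content.replace(" ", ""))
            continue
        in_sequence = code == "SQ"
        fields[code].append(content)
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


def cross_references(fields):
    """Return (database, identifier) pairs taken from the DR lines."""
    refs = []
    for dr in fields.get("DR", []):
        parts = [p.strip() for p in dr.rstrip(".").split(";") if p.strip()]
        if len(parts) >= 2:
            refs.append((parts[0], parts[1]))
    return refs


if __name__ == "__main__":
    entry = parse_flat_file(SAMPLE_ENTRY)
    print(entry["ID"], entry["AC"])
    print(cross_references(entry))
    print(entry["sequence"])
```

Keying every line on its two‐letter code is what allows one simple loop to handle identifier, accession, cross‐reference and sequence lines alike. Nucleotide entries in EMBL format end each sequence line with a running base count, which this sketch does not strip.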

References

Adams MD, Celniker SE, Holt RA, et al. (2000) The genome sequence of Drosophila melanogaster. Science 287 (5461): 2185–2195.

Appel RD, Bairoch A and Hochstrasser DF (1994) A new generation of information retrieval tools for biologists: the example of the ExPASy WWW server. Trends in Biochemical Sciences 19 (6): 258–260.

Apweiler R, Attwood TK, Bairoch A, et al. (2001a) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research 29 (1): 37–40.

Apweiler R, Kersey P, Junker V and Bairoch A (2001b) Technical comment to ‘Database verification studies of SWISS‐PROT and GenBank’, by Karp et al. Bioinformatics 17: 533–534.

Apweiler R, Bairoch A, Wu CH, et al. (2004) UniProt: the universal protein knowledgebase. Nucleic Acids Research 32 (Database issue): D115–D119.

Attwood TK (2000) The Babel of bioinformatics. Science 290: 471–473.

Attwood TK, Beck ME, Bleasby AJ and Parry‐Smith DJ (1994) PRINTS – a database of protein motif fingerprints. Nucleic Acids Research 22 (17): 3590–3596.

Attwood TK and Miller CJ (2001) Which craft is best in bioinformatics? Computers & Chemistry 25: 329–339.

Bairoch A (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 19: 2241–2244.

Bairoch A and Apweiler R (1996) The SWISS‐PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Research 24 (1): 21–25.

Bairoch A and Boeckmann B (1991) The SWISS‐PROT protein sequence data bank. Nucleic Acids Research 19 (Suppl): 2247–2249.

Barker WC, George DG, Mewes HW and Tsugita A (1992) The PIR‐International Protein Sequence Database. Nucleic Acids Research 20: 2023–2026.

Benson DA, Cavanaugh M, Clark K, et al. (2017) GenBank. Nucleic Acids Research 45 (Database issue): D37–D42.

Bork P (2000) Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Research 10: 398–400.

Bork P and Bairoch A (1996) Go hunting in sequence databases but watch out for the traps. Trends in Genetics 12: 425–427.

Brenner SE (1999) Genome analysis: errors in genome annotation. Trends in Genetics 15: 132–133.

Bult CJ, White O, Olsen GJ, et al. (1996) Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273 (5278): 1058–1073.

Burks C, Fickett JW, Goad WB, et al. (1985) The GenBank nucleic acid sequence database. Computer Applications in the Biosciences 1 (4): 225–233.

C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018.

Cherry JM, Adler C, Ball C, et al. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Research 26 (1): 73–79.

Cochrane G, Karsch‐Mizrachi I, Takagi T and International Nucleotide Sequence Database Collaboration (2016) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Research 44 (Database issue): D48–D50.

Dayhoff MO, Eck RV, Chang MA and Sochard MR (eds) (1965) Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation.

Fleischmann RD, Adams MD, White O, et al. (1995) Whole‐genome random sequencing and assembly of Haemophilus influenzae. Science 269 (5223): 496–512.

Fraser CM, Gocayne JD, White O, et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270 (5235): 397–403.

George DG, Barker WC and Hunt LT (1986) The protein identification resource (PIR). Nucleic Acids Research 14 (1): 11–15.

Goffeau A, Barrell BG, Bussey H, et al. (1996) Life with 6000 genes. Science 274 (5287): 546–567.

Hamm GH and Cameron GN (1986) The EMBL data library. Nucleic Acids Research 14 (1): 5–9.

Henikoff S and Henikoff JG (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Research 19 (23): 6565–6572.

Hirs CHW, Moore S and Stein WH (1960) The sequence of the amino acid residues in performic acid‐oxidised ribonuclease. Journal of Biological Chemistry 235: 633–647.

International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.

Kanehisa M and Goto S (2000) KEGG: Kyoto encyclopaedia of genes and genomes. Nucleic Acids Research 28: 27–30.

Karp P (1998) What we do not know about sequence analysis and sequence databases. Bioinformatics 14: 753–754.

Karp P (2000) An ontology for biological function based on molecular interactions. Bioinformatics 16: 269–285.

Mashima J, Kodama Y, Fujisawa T, et al. (2017) DNA Data Bank of Japan. Nucleic Acids Research 45 (Database issue): D25–D31.

Michie AD, Jones ML and Attwood TK (1996) DbBrowser: integrated access to databases worldwide. Trends in Biochemical Sciences 21 (5): 191.

Mullis KB and Faloona FA (1987) Specific synthesis of DNA in vitro via a polymerase‐catalyzed chain reaction. Methods in Enzymology 155: 335–350.

Overbeek R, Larsen N and Pusch GD (2000) WIT: integrated system for high‐throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research 28: 123–125.

Pearson P, Francomano C, Foster P, et al. (1994) The status of online Mendelian inheritance in man (OMIM) medio 1994. Nucleic Acids Research 22 (17): 3470–3473.

Rose AS and Hildebrand PW (2015) NGL Viewer: a web application for molecular visualization. Nucleic Acids Research 43 (W1): W576–W579.

Ryle AP, Sanger F, Smith LF and Kitai R (1955) The disulphide bonds of insulin. The Biochemical Journal 60 (4): 541–556.

Sanger F (1949) The terminal peptides of insulin. The Biochemical Journal 45: 563–574.

Schuler GD, Epstein JA, Ohkawa H and Kans JA (1996) Entrez: molecular biology database and retrieval system. Methods in Enzymology 266: 141–162.

Sonnhammer EL, Eddy SR and Durbin R (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28 (3): 405–420.

Stoesser G, Baker W, van den Broek A, et al. (2002) The EMBL Nucleotide Sequence Database. Nucleic Acids Research 30 (1): 21–26.

Tateno Y, Fukami‐Kobayashi K, Miyazaki S, Sugawara H and Gojobori T (1998) DNA Data Bank of Japan at work on genome sequence data. Nucleic Acids Research 26 (1): 16–20.

The FlyBase Consortium (1999) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Research 27: 85–88.

The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Research 45 (Database issue): D158–D169.

Toribio AL, Alako B, Amid C, et al. (2017) European nucleotide archive in 2016. Nucleic Acids Research 45 (Database issue): D32–D36.

Walsh S, Anderson M and Cartinhour SW (1998) ACEDB: a database for genome information. Methods of Biochemical Analysis 39: 299–318.

Wheelan SJ and Boguski MS (1998) Late night thoughts on the sequence annotation problem. Genome Research 8: 168–169.

Wu C, Yeh LS, Huang H, et al. (2003) The Protein Information Resource. Nucleic Acids Research 31 (1): 345–347.

Further Reading

Attwood TK and Miller CJ (2002) Progress in bioinformatics and the importance of being earnest. Biotechnology Annual Review 8: 1–54.

Attwood TK, Pettifer SR and Thorne D (2016) Bioinformatics Challenges at the Interface of Biology and Computer Science: Mind the Gap. Chichester: John Wiley & Sons, Ltd. ISBN: 978-0-470-03548-1.

Bairoch A (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics 16 (1): 48–64.

Berners‐Lee T and Fischetti M (1999) Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. New York: Harper Collins. ISBN: 0-06-251587-X.

Galperin MY, Fernandez‐Suarez XM and Rigden DJ (2017) The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Research 45 (Database issue): D1–D11.

Heather JM and Chain B (2016) The sequence of sequencers: the history of sequencing DNA. Genomics 107: 1–8.

Special History Issue (2000) Bioinformatics 16: 1–75.

Sanger F (1988) Sequences, sequences, and sequences. Annual Review of Biochemistry 57: 1–18.

Strasser BJ (2008) GenBank – natural history in the 21st century? Science 322: 537–538.

Stretton AOW (2002) The first sequence: Fred Sanger and insulin. Genetics 162: 527–532.

How to Cite
Attwood, Teresa K (Apr 2018) Genetic Sequence Databases. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005312.pub3]