DNA Sequence Analysis


Recent advances in deoxyribonucleic acid (DNA) sequencing technology have produced a massive number of nucleotide sequences, which are stored in DNA databanks and genomic data repositories. Furthermore, comprehensive analyses of transcriptional and genomic elements have uncovered an elaborate system of gene expression that broadens our understanding of fundamental biological phenomena. The analysis of DNA data has therefore become essential for predicting gene function and detecting regulatory motifs through comparative studies. In this article, DNA databases, homology search tools and sequence alignment methods are surveyed. The concept of distance between genes and its calculation from DNA or amino acid sequences are also described, together with several commonly used techniques for phylogenetic analysis and tree evaluation.

Key concepts

  • Advances in DNA sequencing technology have produced an unprecedented amount of sequence data.

  • The DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) are the three major sequence data repositories. They exchange data periodically, and maintain various services for data search and retrieval.

  • Similarity searching, alignment of sequences, prediction of function and reconstruction of the evolutionary history (phylogenetic tree) of a group of species are among the most commonly used techniques for sequence analysis.

  • BLAST (similarity searching), ClustalW (sequence alignment), Pfam (protein domains) and TRANSFAC (transcription factors) are popular tools and resources.

  • The genetic distance, a measure of evolutionary divergence, is usually estimated from the number of nucleotide or amino acid differences (substitutions) between sequences. Nucleotide substitutions in coding sequences are either synonymous (leaving the encoded amino acid unchanged) or nonsynonymous (causing an amino acid change).

  • Distance‐ and character‐based methods can be used to reconstruct phylogenetic trees. Distance‐based methods reconstruct the tree from an estimation of the evolutionary distance among taxa. Character‐based methods derive the phylogeny directly from the observable state of characters in the taxa.

  • The bootstrap method is commonly used to determine the quality of an inferred phylogeny.
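
The distance and bootstrap concepts listed above can be illustrated with a minimal Python sketch. It is not taken from the article: the function names are hypothetical, the sequences are assumed to be already aligned and of equal length, and the Jukes-Cantor correction is used only as one simple example of a substitution model. The bootstrap step shows only the column resampling; a full analysis would rebuild a tree from each replicate.

```python
import math
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ; columns with a gap are skipped."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site from a p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def is_synonymous(codon1, codon2):
    """True if both codons encode the same amino acid (or are both stops)."""
    return CODON_TABLE[codon1] == CODON_TABLE[codon2]


def bootstrap_columns(alignment, rng):
    """One bootstrap replicate: resample alignment columns with replacement."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTACGAAC"      # toy aligned sequences
    p = p_distance(a, b)
    print(p, jc69_distance(p))
    print(is_synonymous("CTT", "CTC"), is_synonymous("GAA", "GTA"))
    print(bootstrap_columns([a, b], random.Random(0)))
```

In this toy example the two sequences differ at one of ten sites, so the p-distance is 0.1 and the corrected distance is slightly larger (about 0.107), reflecting substitutions that may have been hidden by multiple changes at the same site.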

Keywords: DNA databank; genome projects; similarity search; evolutionary distance; molecular phylogeny

Figure 1. Growth of the number of completely sequenced genomes. Data obtained from GOLD.



Further Reading

Altschul SF, Boguski MS, Gish W and Wootton JC (1994) Issues in searching molecular sequence databases. Nature Genetics 6: 119–129.

Cooper GM and Brown CD (2008) Qualifying the relationship between sequence conservation and molecular function. Genome Research 18: 201–205.

Database issue (2008) Nucleic Acids Research 36(1).

DNA Data Bank of Japan (DDBJ) [www.ddbj.nig.ac.jp].

European Bioinformatics Institute (EBI) [www.ebi.ac.uk].

Fitch WM (2000) Homology. Trends in Genetics 16: 227–231.

Gojobori T, Moriyama EN and Kimura M (1990) Statistical methods for estimating sequence divergence. Methods in Enzymology 183: 531–550.

GOLD [http://www.genomesonline.org/].

Graur D and Li W‐H (2000) Fundamentals of Molecular Evolution, 2nd edn. Sunderland, MA: Sinauer Associates.

Mount DW (2004) Bioinformatics: Sequence and Genome Analysis, 2nd edn. Woodbury, NY: Cold Spring Harbor Laboratory Press.

National Center for Biotechnology Information (NCBI) [www.ncbi.nlm.nih.gov].

Nei M (1996) Phylogenetic analysis in molecular evolutionary genetics. Annual Review of Genetics 30: 371–403.

Nei M and Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford: Oxford University Press.

Smith TF and Waterman MS (1981) Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

How to Cite
Gojobori, Takashi, Nakagawa, So, and Clemente, Jose C (Sep 2009) DNA Sequence Analysis. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0001798.pub2]