Sequence Complexity and Composition

Abstract

Local compositional complexity is a numerical measure of repetitiveness of sequences of symbols from a finite alphabet. Highly repetitive sequences are considered simple, whereas highly nonrepetitive sequences are considered complex.

Keywords: alphabet; local compositional complexity; pattern; sequence analysis; sequence annotation

Figure 1.

Examples of complexity charts used for DNA sequence segmentation and approximate functional annotation. Both charts were generated with a window of width W=200 nucleotides moved one nucleotide at a time (window step, s=1). The positional accuracy of the annotation (100 nucleotides) is too low for exact gene structure determination, but compositional complexity is clearly correlated with gene structure. (a) Modified compositional complexity chart (z‐score of MCC) for the region of the halophilic archaeon Halobacterium halobium that is analogous to the α‐operon of Escherichia coli. Arrows with transparent points indicate probable intergenic regions between protein‐coding sequences. (b) Local compositional complexity and modified compositional complexity in chicken ovalbumin gene X. Arrows with filled points indicate probable positions of introns. Arrows with transparent points show false‐positive indications of intergenic spacers. (Determining the number of different genes in a putative protein‐coding region is a serious problem for all computer‐assisted gene prediction methods, and compositional complexity charts are no exception.)
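The sliding-window scheme behind such charts can be sketched in Python. The caption does not give the complexity formula, so the block below assumes a multinomial-coefficient measure, K = (1/W) log_N(W!/∏ n_i!), where n_i is the count of symbol i in the window and N is the alphabet size. This is only one plausible choice. The function names are illustrative, and the modified compositional complexity (MCC) and its z‐score are not reproduced.

```python
from collections import Counter
from math import lgamma, log


def block_complexity(block, alphabet="ACGT"):
    """Assumed measure for one block: (1/L) * log_N( L! / prod(n_i!) )."""
    length = len(block)
    if length == 0:
        raise ValueError("block must not be empty")
    counts = Counter(block)
    # lgamma(n + 1) equals ln(n!), which avoids huge factorials for W = 200.
    log_multinomial = lgamma(length + 1) - sum(
        lgamma(n + 1) for n in counts.values()
    )
    # Dividing by ln(N) converts the natural log to base N (alphabet size).
    return log_multinomial / (length * log(len(alphabet)))


def complexity_chart(sequence, width=200, step=1, alphabet="ACGT"):
    """Return (window_start, complexity) pairs for a window slid along sequence."""
    return [
        (start, block_complexity(sequence[start:start + width], alphabet))
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

Under this assumed measure a homopolymer window scores 0 and a window with equal counts of all four nucleotides scores just below 1, which matches the abstract's reading of repetitive sequences as simple and nonrepetitive sequences as complex.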

Figure 2.

Slopes of straight‐line fits to surprisal versus complexity data for short oligonucleotides (regular and patchy) in large samples of human exons of confirmed protein‐coding genes, introns, 3′ untranslated regions (UTRs) and 5′ UTRs of these genes. The x coordinate of each plot in the ‘matrix’ represents patchiness (k=0, 1,…,9). The y coordinate represents the slope of the surprisal versus complexity regression line. (All slope values are significant at the 5% level or better.) In every figure panel, the lines from top to bottom correspond to block lengths L=5, 9, 13 and 20, respectively. Figure panels in the ‘exons’ column of the matrix show that exons display a clear three‐base periodicity in the occurrence of short oligonucleotides at all levels of patchiness, in all four alphabets. Comparison of the ‘introns’ and ‘3′ UTRs’ columns shows that the complexity‐related properties of introns and 3′ UTRs are remarkably similar in most cases. This explains known difficulties with determining the number of protein‐coding genes in computationally predicted ‘coding regions’. The only significant differences, which are valuable for practical gene identification, are found for 20‐grams in the {A, C, G, T}, {K, M} and {R, Y} alphabets. Comparison of the ‘exons’ and ‘5′ UTRs’ columns also shows that the complexity‐related properties of exons and 5′ UTRs are similar enough to cause problems with computational identification of the 5′ ends of protein‐coding genes. Comparison of the figure panels in the bottom right and bottom left corners of the matrix suggests that statistics of 20‐grams in the {S, W} alphabet should help to identify 5′ UTRs correctly.
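The two‐letter alphabets named in this caption correspond to the standard IUPAC nucleotide codes: R = A or G, Y = C or T; K = G or T, M = A or C; S = C or G, W = A or T. The following is a minimal Python sketch of recoding a DNA sequence into one of these alphabets and counting contiguous n‐grams. The names `REDUCED_ALPHABETS`, `recode` and `count_ngrams` are illustrative, and the ‘patchy’ (non‐contiguous) oligonucleotides and the surprisal regression are not modelled.

```python
from collections import Counter

# IUPAC two-letter groupings of the four nucleotides.
REDUCED_ALPHABETS = {
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},  # keto / amino
    "SW": {"C": "S", "G": "S", "A": "W", "T": "W"},  # strong / weak pairing
}


def recode(sequence, alphabet):
    """Rewrite an A/C/G/T sequence in one of the two-letter alphabets."""
    table = str.maketrans(REDUCED_ALPHABETS[alphabet])
    return sequence.upper().translate(table)


def count_ngrams(sequence, n):
    """Count every contiguous n-gram (overlapping) in the sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
```

For example, `recode("ACGT", "RY")` gives `"RYRY"`, and `count_ngrams` applied to the recoded sequence with `n=20` would yield the kind of 20‐gram counts the caption refers to.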

Further Reading

Bell GI and Torney DC (1993) Repetitive DNA sequences: some considerations for simple sequence repeats. Computers and Chemistry 17(2): 185–190.

Britten RJ and Kohne DE (1968) Repeated sequences in DNA. Science 161: 529–540.

Konopka AK (1993) Plausible classification codes and local compositional complexity of nucleotide sequences. In: Lim HA, Fickett JW, Cantor CR and Robbins RJ (eds.) The Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis, pp. 69–87. New York: World Scientific Publishing.

Rosen R (2000) Essays on Life Itself. New York: Columbia University Press.

Shannon CE (1951) Prediction and entropy of printed English. Bell System Technical Journal 30: 50–64.

Tautz D, Trick M and Dover GA (1986) Cryptic simplicity in DNA is a major source of genetic variation. Nature 322: 652–656.

Wootton JC (1997) Simple sequences of protein and DNA. In: Bishop MJ and Rawlings CJ (eds.) DNA and Protein Sequence Analysis, pp. 169–183. Oxford: IRL Press.

How to Cite

Konopka, Andrzej K (Sep 2005) Sequence Complexity and Composition. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1038/npg.els.0005260]