Biomolecular Sequence Analysis: Pattern Acquisition and Frequency Counts

Abstract

Nucleic acid and protein sequence analysis draws on methods for studying the morphology of printed texts in letter‐based writing systems. In particular, it explores the art of frequency counts of contiguous and noncontiguous patterns.

Keywords: sequence analysis; pattern; alphabet; frequency; statistics; contiguous; noncontiguous

Figure 1.

Simplified classification of sequence analysis tasks. One can clearly distinguish three major activities of sequence analysts. (1) Pattern searches: sequence alignment and searches for known patterns expressed in known alphabets of sequence (or structure‐related) motifs. (2) Database design: creation of new data structures or inclusion of new knowledge into existing data structures. The ‘new knowledge’ can be inferred from sequence analyses but it can also be acquired from research outside sequence analysis. (3) Pattern acquisition: determining new alphabets and function‐associated patterns. The shaded content of blocks indicates activities that pertain to pattern acquisition. Dashed borders and connector lines indicate pattern acquisition proper while dotted lines and connectors indicate partial relation to pattern acquisition simultaneously with partial relation to sequence alignment‐based database searches.

Figure 2.

Pragmatic inference and sequence pattern acquisition. Broadly specified biological knowledge, both about biomolecular sequences and structures and from outside them, is involved in every step of alphabet determination. Once sequence motifs (meaningful pattern‐generating descriptors of sequences) are determined, the alphabets of patterns can be used for sequence analysis and comparison. The most important classes of sequence patterns are shown in shaded boxes with broken borders. Dotted borders on shaded boxes indicate classes of patterns that are often used in sequence analysis but are usually less significant as functional ‘signatures’ (correlates) of sequences.

Figure 3.

Examples of periodic and quasiperiodic (nucleotide) sequence patterns. Motifs taken into account are printed in bold.

Figure 4.

Example of pattern acquisition via analysis of short oligonucleotide distance charts. All four distance charts were made for large collections of sufficiently long intron and exon sequences from several eukaryotic genomes (primarily human, mouse and fruit fly). (a) Distance charts between two instances of the same nonhomopolymeric dinucleotide AC (charts are almost identical for the 11 other nonhomopolymeric dinucleotides). (b) Nearest (shortest) distance charts between two instances of the nonhomopolymeric dinucleotide AC. (c) Distance charts between two instances of the same mirror‐symmetric trinucleotide that begins with a nonhomopolymeric dinucleotide whose distance and shortest distance charts are shown in (a) and (b), respectively. In our example the trinucleotide is ACA, but all other mirror‐symmetric trinucleotides display very similar charts. (d) Shortest distance charts between two instances of the same mirror‐symmetric trinucleotide. A host of significant defining patterns (descriptors) can be inferred from studying charts (a)–(d). For instance, the fact that dinucleotides in introns occur at preferred distances of 0, 2, 4, … (generally 2n, with n = 0, 1, 2, …) while the preferred shortest distance is 0 indicates that tandem repeats of nonhomopolymeric dinucleotides are a predominant, defining pattern in introns. The fact that mirror‐symmetric, nonhomopolymeric trinucleotides display 3‐base quasiperiodicity in exons but not in introns is a clear confirmation of the triplet nature of the genetic code (avoidance of stop codons in the protein‐encoding reading frame). Even if the genetic code were not known, we could easily conclude that trinucleotides and their 3‐base quasiperiodicity constitute defining, function‐associated patterns in exons but not in introns.
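The two kinds of chart described in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used to produce the figure. It assumes that "distance" means the number of bases between the end of one occurrence and the start of a later one (so tandem copies are at distance 0, as the caption implies), and that overlapping occurrences are ignored. The function names and the `max_distance` cutoff are hypothetical.

```python
from collections import Counter


def find_occurrences(seq, motif):
    """Return start positions of all (possibly overlapping) occurrences of motif."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def distance_chart(seq, motif, max_distance=50):
    """Count gaps between every pair of non-overlapping occurrences.

    Assumption: gap = bases between the end of the first occurrence
    and the start of the second, so tandem copies have gap 0.
    """
    k = len(motif)
    positions = find_occurrences(seq, motif)
    chart = Counter()
    for i, p in enumerate(positions):
        for q in positions[i + 1:]:
            gap = q - p - k
            if gap > max_distance:
                break  # positions are sorted, so later gaps are larger still
            if gap >= 0:  # skip overlapping occurrences
                chart[gap] += 1
    return chart


def nearest_distance_chart(seq, motif):
    """Count only the gap from each occurrence to the next non-overlapping one."""
    k = len(motif)
    positions = find_occurrences(seq, motif)
    chart = Counter()
    for i, p in enumerate(positions):
        for q in positions[i + 1:]:
            gap = q - p - k
            if gap >= 0:
                chart[gap] += 1
                break  # only the nearest downstream occurrence counts
    return chart


if __name__ == "__main__":
    toy = "ACACACGTAC"  # made-up sequence, for illustration only
    print(sorted(distance_chart(toy, "AC").items()))
    print(sorted(nearest_distance_chart(toy, "AC").items()))
```

On real data the counts would be accumulated over a whole collection of intron or exon sequences and compared with what is expected by chance. A peak at distance 0 in the nearest-distance chart then points to tandem repeats, and peaks spaced 3 bases apart point to 3‐base quasiperiodicity.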

Figure 5.

Lengths of sequences that are required for statistical analyses of k‐grams (k=1, 2,…, 10) over three elementary alphabets of sizes 2, 4 and 20 letters. The alphabets of sizes 2 and 4 correspond to typical representations of nucleotide sequences, while the alphabet of size 20 is used for the primary structure of proteins. The lengths of the sequences have been evaluated from the so‐called statistics of blanks (Kullback, 1976) for Bernoulli texts. According to this statistic, the expected number of ‘blanks’ (absent symbols) in a text of length L over an alphabet of size n is

$$B(L) = \sum_{i=1}^{n} [1 - p(i)]^{L}$$

where p(i) stands for the probability of the ith letter. In Bernoulli texts, the probabilities p(i) are all equal to 1/n, and therefore the number of letters absent by chance alone is $B'(L) = n[1 - (1/n)]^{L}$. One can set $B'(L)$ close to 0 (we used the value 0.1 for this figure) and then calculate the minimum length L0 at which the expected number of letters missing by chance alone falls to that value. Only those sequences with lengths greater than or equal to L0 are sufficiently long to be suitable for statistical analysis based on k‐gram frequency counts.
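The calculation of L0 described in this caption can be sketched as follows. This is an illustrative reading and not the code behind the figure. It assumes that each k‐gram is treated as one letter of an alphabet of n^k equiprobable symbols, and that L counts k‐gram positions in the text. The function name `min_length` and its parameters are hypothetical.

```python
import math


def min_length(alphabet_size, k, max_blanks=0.1):
    """Smallest L0 such that N * (1 - 1/N)**L0 <= max_blanks, with N = alphabet_size**k.

    Assumptions: k-grams are treated as equiprobable letters of a
    Bernoulli text, and L0 counts k-gram positions.
    """
    n_kgrams = alphabet_size ** k  # number of distinct k-grams
    # Solve N * (1 - 1/N)**L <= max_blanks for L; both logarithms are negative.
    exact = math.log(max_blanks / n_kgrams) / math.log1p(-1.0 / n_kgrams)
    return math.ceil(exact)


if __name__ == "__main__":
    for n in (2, 4, 20):
        for k in range(1, 11):
            print(f"n={n:2d} k={k:2d} L0={min_length(n, k)}")
```

If overlapping k‐grams are counted, a text with L0 k‐gram positions has L0 + k − 1 letters. Real sequences are not Bernoulli texts, so these values are best read as rough lower bounds.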

References

Abramson N (1968) Information Theory and Coding. New York: McGraw‐Hill.

Bucher P and Trifonov EN (1987) On Nussinov's compilation of eukaryotic transcription initiation sites. Journal of Theoretical Biology 126: 373–375.

Gentleman JF and Mullin RC (1989) The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics 45: 35–52.

Guibas LJ and Odlyzko AM (1981) String overlaps, pattern matching and nontransitive games. Journal of Combinatorial Theory, Series A 30: 183–208.

Klaerr‐Blanchard M, Chiapello H and Coward E (2000) Detecting localized repeats in genomic sequences: a new strategy and its application to Bacillus subtilis and Arabidopsis thaliana sequences. Computers and Chemistry 24: 57–70.

Konopka AK (1994) Sequences and codes: fundamentals of biomolecular cryptology. In: Smith D (ed.) Biocomputing: Informatics and Genome Projects, pp. 119–174. San Diego, CA: Academic Press.

Konopka AK (2002) Grand metaphors of biology in the genome era. Computers and Chemistry 26: 397–401.

Kullback S (1976) Statistical Methods in Cryptanalysis. Laguna Hills, CA: Aegean Park Press.

Lewontin R (2000) The Triple Helix: Gene, Organism, and Environment. Cambridge, MA: Harvard University Press.

Morange M (2001) The Misunderstood Gene. Cambridge, MA: Harvard University Press.

Régnier M (1998) A unified approach to word statistics. In: Istrail S, Pevzner P and Waterman M (eds.) Proceedings of the Second Annual International Conference on Computational Molecular Biology (RECOMB 98), pp. 207–213. New York: ACM Press.

Watson JD (2001) Genes, Girls and Gamow. Oxford, UK: Oxford University Press.

Further Reading

Arques DG and Michel CJ (1987) Periodicities in introns. Nucleic Acids Research 15: 7581–7592.

Blaisdell BE (1983) A prevalent persistent nonrandomness that distinguishes coding and noncoding eukaryotic nuclear DNA sequences. Journal of Molecular Evolution 19: 122–133.

Konopka AK (1993) Plausible classification codes and local compositional complexity of nucleotide sequences. In: Lim HA, Fickett JW, Cantor CR and Robbins RJ (eds.) The Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis, pp. 69–87. New York: World Scientific Press.

Konopka AK (1997) Theoretical molecular biology. In: Meyers RA (ed.) Encyclopedia of Molecular Biology and Molecular Medicine, vol. 6, pp. 37–53. Weinheim: VCH.

Konopka AK and Smythers GW (1987) DISTAN: a program which detects significant distances between short oligonucleotides. Computer Applications in the Biosciences 3: 193–201.

Martindale C and Konopka AK (1996) Oligonucleotide frequencies in DNA follow a Yule distribution. Computers and Chemistry 20: 35–38.

Pevzner PA, Yu Borodovsky M and Mironov AA (1989) Linguistics of nucleotide sequences. II. Stationary words in genetic texts and the zonal structure of DNA. Journal of Biomolecular Structure and Dynamics 6: 1027–1038.

Shulman MJ, Steinberg CM and Westmoreland N (1981) The coding function of nucleotide sequences can be discerned by statistical analysis. Journal of Theoretical Biology 88: 409–420.

Trifonov EN (1989) The multiple codes of nucleotide sequences. Bulletin of Mathematical Biology 51: 417–432.

How to Cite

Konopka, Andrzej K (Jan 2006) Biomolecular Sequence Analysis: Pattern Acquisition and Frequency Counts. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1038/npg.els.0005926]