Repetitive Elements: Bioinformatic Identification, Classification and Analysis

Abstract

Multicopy, or repetitive, deoxyribonucleic acid (DNA) is routinely detected and analysed by computer‐assisted comparison of genomic DNA with reference databases of repeats. The most representative collection of repetitive elements is ‘Repbase Update’ (RU), which currently contains >15 000 unique entries from diverse eukaryotic species. Most transposable elements (TEs) in RU are represented by consensus sequences built from multiple alignments of individual repeats. Consensus sequences approximate the active TEs that generated the multiple mutated copies found in the genome. The two major current repeat detection and annotation programs, RepeatMasker and CENSOR, both use RU to annotate repeats in eukaryotic genomes. RU is also increasingly used as a master reference library from which custom libraries are created for detecting repeats in newly sequenced genomes. Finally, a combination of different routines can be used to detect repeats that are not similar to those already present in the reference libraries (de novo approach).
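
The idea of a consensus built from a multiple alignment of individual repeats can be illustrated with a minimal sketch. It assumes the copies are already aligned to equal length and applies a plain majority rule per column; the function name `majority_consensus` and the toy sequences are hypothetical, and this is not the procedure used to build Repbase Update entries.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-"):
    """Return a majority-rule consensus of equal-length aligned sequences.

    Columns whose most frequent symbol is the gap character are dropped.
    Ties are broken by first occurrence in the column (an arbitrary choice).
    """
    if not aligned_copies:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned_copies):
        counts = Counter(symbol.upper() for symbol in column)
        symbol, _ = counts.most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Four toy copies of one family, each carrying a single private mutation.
copies = [
    "ACGTTGCA",
    "ACGATGCA",
    "ACGTTGTA",
    "AC-TTGCA",
]
print(majority_consensus(copies))  # prints ACGTTGCA
```

Because each copy mutates independently, a change seen in one copy is outvoted by the others, which is why the consensus tends to recover the sequence of the source element.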

Key Concepts:

  • Active transposable elements (TEs) produce families and subfamilies of multiple copies in the genome, called ‘interspersed repetitive elements’ or ‘repeats’.

  • Consensus sequences derived from aligned families and subfamilies of repeats are excellent approximations of the active TEs from which they were derived.

  • Consensus sequences are also the preferred reference sequences for screening and annotating repetitive elements, especially the most divergent ones.

  • RepeatMasker and CENSOR are basic repeat screening and annotation programs using reference sequence libraries.

  • In the absence of reference sequences, repetitive DNA can be detected by screening for multiple copies and characteristic structural features (de novo approach).
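
Screening for multiple copies without a reference library, as in the last key concept, can be sketched by counting exact k‐mers. This is a deliberately simplified illustration: the function name `repeated_kmers`, the parameters `k` and `min_copies`, and the toy sequence are all hypothetical, and the sketch ignores mismatches, the reverse strand and structural features that practical de novo tools take into account.

```python
from collections import defaultdict


def repeated_kmers(sequence, k=12, min_copies=3):
    """Map each k-mer occurring at least min_copies times to its start positions."""
    positions = defaultdict(list)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start)
    return {kmer: hits for kmer, hits in positions.items() if len(hits) >= min_copies}


# Toy sequence: one 12-bp unit inserted three times between unrelated spacers.
repeat_unit = "GATTACAGGCTA"
genome = "TTTT" + repeat_unit + "CCCCC" + repeat_unit + "AGAGA" + repeat_unit + "TT"

hits = repeated_kmers(genome, k=12, min_copies=3)
print(hits)  # prints {'GATTACAGGCTA': [4, 21, 38]}
```

Overlapping k‐mers that pass the copy threshold would then be merged and extended into candidate repeat families, which can be aligned to derive consensus sequences.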

Keywords: transposable elements (TEs); simple sequence repeats (SSRs); repeat maps; computational biology; reference databases

Figure 1.

Basic prototypes of human repetitive DNA: (a) tandemly repeated DNA (minisatellites, microsatellites, centromeric and telomeric repeats); (b) LINE retro(trans)posons; (c) SINE retro(trans)posons; (d) and (e) autonomous and nonautonomous endogenous retroviral elements; (f) and (g) autonomous and nonautonomous DNA transposons. ORF1 and ORF2 denote open reading frames 1 and 2 in the human L1 (LINE1) element. ORF2 encodes an enzyme with endonuclease (EN) and reverse transcriptase (RT) activities. LTR, long terminal repeat and TIR, terminal inverted repeat.

Figure 2.

General scheme for computer‐assisted identification of repetitive DNA. Full and broken arrows indicate major and alternative steps in the process, respectively. The first step is to identify and mask simple sequence repeats (SSRs), using any of a variety of programs described in the text or by sequence alignment against a reference collection of simple repeats. Complex repeats are then identified by aligning the masked sequence against the respective reference collection. Several types of output file are possible: a list of alignments against the reference sequences, or a copy of the input file with simple repeats, complex repeats or both masked. Output files listing repeat locations and other characteristics can be organised in the form of a map similar to that in Table . S–W, Smith–Waterman algorithm.
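
The first step of this scheme, masking simple sequence repeats, can be sketched with a regular expression that looks for short units repeated in tandem. This is an illustrative assumption only and not the method used by RepeatMasker or CENSOR; the function name `mask_simple_repeats` and the thresholds `max_unit` and `min_length` are hypothetical, and only perfect tandem repeats are found.

```python
import re


def mask_simple_repeats(sequence, max_unit=6, min_length=12, mask_char="N"):
    """Replace perfect tandem repeats of short units with mask_char.

    A run is masked when a unit of 1..max_unit bases is repeated in tandem
    and the whole run spans at least min_length bases. Output is upper case.
    """
    seq = sequence.upper()
    masked = list(seq)
    for unit_len in range(1, max_unit + 1):
        # The lookahead lets every start position be tested, so a short
        # chance repeat cannot hide a longer run that overlaps it.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2+))" % unit_len)
        for match in pattern.finditer(seq):
            start, end = match.span(1)
            if end - start >= min_length:
                masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy input: a (CA)8 microsatellite and a poly(A) tract in unique sequence.
query = "GATTACAG" + "CA" * 8 + "TTGACCGT" + "A" * 15 + "GGCTAGCT"
print(mask_simple_repeats(query))
```

The masked sequence, with SSRs replaced by N, would then be passed to the second step, alignment against a reference collection of complex repeats.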

References

Baldi P and Baisnee P‐F (2000) Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths. Bioinformatics 16: 865–889.

Bedell JA, Korf I and Gish W (2000) Maskeraid: a performance enhancement to RepeatMasker. Bioinformatics 16: 1040–1041.

Bergman CM and Quesneville H (2007) Discovering and detecting transposable elements in genome sequences. Briefings in Bioinformatics 8: 382–392.

Buisine N, Quesneville H and Colot V (2008) Improved detection and annotation of transposable elements in sequenced genomes using multiple reference sequence sets. Genomics 91: 467–475.

Jurka J (1994) Approaches to identification and analysis of interspersed repetitive DNA sequences. In: Adams MD, Fields C and Venter JC (eds) Automated DNA Sequencing and Analysis, pp. 294–298. San Diego, CA: Academic Press.

Jurka J (1998) Repeats in genomic DNA: mining and meaning. Current Opinion in Structural Biology 8: 333–337.

Jurka J (2000) Repbase Update: a database and an electronic journal of repetitive elements. Trends in Genetics 16: 418–420.

Jurka J, Kapitonov VV, Pavlicek A et al. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 110: 462–467.

Jurka J, Klonowski P, Dagman V and Pelton P (1996) CENSOR – a program for identification and elimination of repetitive elements from DNA sequences. Computers & Chemistry 20: 119–121.

Jurka J, Walichiewicz J and Milosavljevic A (1992) Prototypic sequences for human repetitive DNA. Journal of Molecular Evolution 35: 286–291.

Kohany O, Gentles AJ, Hankus L and Jurka J (2006) Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 7: 474.

Lander ES, Linton LM, Birren B et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.

Lerat E (2010) Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104: 520–533.

Quesneville H, Bergman CM, Andrieu O et al. (2005) Combined evidence annotation of transposable elements in genome sequences. PLoS Computational Biology 1: 166–175.

Wootton JC and Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Methods in Enzymology 266: 554–571.

Further Reading

Brosius J (1999) Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica 107: 209–238.

Jurka J (2003) Repetitive DNA: detection, annotation, and analysis. In: Krawetz SA and Womble DD (eds) Introduction to Bioinformatics: A Theoretical and Practical Approach, chapter 8, pp. 151–167. Totowa, NJ: Humana Press.

Jurka J, Kapitonov VV, Kohany O and Jurka MV (2007) Repetitive sequences in complex genomes: structure and evolution. Annual Review of Genomics and Human Genetics 8: 241–259.

Kapitonov VV and Jurka J (2008) Universal classification of eukaryotic transposable elements implemented in Repbase. Nature Reviews. Genetics 9: 411–412.

Prak ET and Kazazian HHJ (2000) Mobile elements and the human genome. Nature Reviews. Genetics 1: 134–144.

Smit AFA (1996) The origin of interspersed repeats in the human genome. Current Opinion in Genetics & Development 6: 743–748.

Smit AFA (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Current Opinion in Genetics & Development 9: 657–663.

Web Links

Genetic Information Research Institute (http://www.girinst.org). Provides the CENSOR server, which screens DNA sequences for simple and interspersed repeats using the current version of Repbase Update (Jurka, 2000; Jurka et al., 2005), including the content of the electronic journal Repbase Reports. It also provides a server for detailed classification of non‐LTR retrotransposons (http://www.girinst.org/RTphylogeny/RTclass1).

Repeat Masker Server at the University of Washington (http://www.repeatmasker.org). RepeatMasker screens DNA sequences for interspersed repeats and low complexity DNA sequences using Repbase Update (Jurka, 2000) as well as custom libraries.

REPET package (http://urgi.versailles.inra.fr/index.php/urgi/Tools/REPET).

How to Cite

Jurka, Jerzy, Bao, Weidong, Kojima, Kenji, and Kapitonov, Vladimir V (Feb 2011) Repetitive Elements: Bioinformatic Identification, Classification and Analysis. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005270.pub2]