Bioinformatics in Genome Sequencing Projects

Abstract

Genome sequencing and analysis have evolved very rapidly in the 10 years since a final draft of the human genome sequence was published in 2003. Having first obtained a full genome sequence from a representative individual or strain of a small number of species, the genomics community has moved on to documenting genetic diversity within species, with an emphasis on humans, and to sequencing the genomes of a rapidly growing number of species. The advent of low‐cost, very high throughput sequencing techniques has also made it possible to sample transcriptomes (the part of the genome transcribed into ribonucleic acid), genomic regions bound by proteins, bacterial communities or even entire ecosystems. This has spawned a new generation of software tools designed to handle the very large numbers of short sequences, commonly referred to as reads, produced by the new machines. It has also propelled computational genomics into the realm of Big Data, which requires large and sophisticated computer systems for management and analysis.

Key Concepts:

  • Rapidly evolving sequencing technologies have revolutionised the analysis of genomes.

  • The management and analysis of billions of short sequence reads require specialised software and access to high‐end computer hardware.

  • Clusters of commodity servers are the preferred infrastructure for genome sequencing projects, but machines with large amounts of memory are required for assembly.

  • Cloud computing is expected to have a big impact on the field, but issues of data transfer are not fully resolved.

  • Large distributed resources supporting a Map/Reduce programming paradigm and distributed storage, such as Apache Hadoop, will probably become standard.
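
As a rough illustration of the Map/Reduce paradigm mentioned in the last key concept, the sketch below counts k‐mers (substrings of length k) across a set of reads on a single machine. It is only a toy under stated assumptions: the function names, the reads and the choice of k‐mer counting are hypothetical and do not come from the article, and nothing here uses Apache Hadoop itself. The map step processes each read independently and the reduce step merges the partial results, which is the property that lets a framework spread the work over many servers.

```python
from collections import Counter
from functools import reduce


def map_kmers(read, k):
    """Map step (hypothetical helper): count the k-mers in one read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def reduce_counts(left, right):
    """Reduce step (hypothetical helper): merge two partial k-mer counts."""
    return left + right


def count_kmers(reads, k=3):
    """Apply the map step to every read, then fold the partial counts."""
    partials = (map_kmers(read, k) for read in reads)
    return reduce(reduce_counts, partials, Counter())


# Toy reads, made up for illustration only.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, k=3)
```

Because each call to map_kmers depends only on its own read, the reads could in principle be split across machines and the partial counts merged in any order.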

Keywords: computer infrastructure; hardware; networking; processors; software; genome projects; genome sequencing

Web Links

Amazon Web Services. http://aws.amazon.com/

Apache Hadoop. http://hadoop.apache.org/

Broad Institute. http://www.broadinstitute.org/

Ensembl. http://www.ensembl.org/

European Bioinformatics Institute. http://www.ebi.ac.uk/

HIPAA privacy rules. http://www.hhs.gov/ocr/privacy/hipaa/understanding/

National Center for Biotechnology Information (NCBI). http://www.ncbi.nlm.nih.gov/

SeqAnswers list of bioinformatics software. http://seqanswers.com/wiki/Software

UCSC Genome Browser. http://genome.ucsc.edu

Wellcome Trust Sanger Institute. http://www.sanger.ac.uk/

How to Cite

Jongeneel, Cornelis Victor (Mar 2014) Bioinformatics in Genome Sequencing Projects. In: eLS. John Wiley & Sons Ltd, Chichester. http://www.els.net [doi: 10.1002/9780470015902.a0005311.pub3]