Stockholm Bioinformatics Center, SBC
Lecture notes: Molecular Bioinformatics 2001, Uppsala University

Lecture 26 Jan 2001 Per Kraulis

The genomes

3. The genomes: how?

The methods available today for determining the sequence of DNA can produce sequence data for at most about 1000 bases in a single run. It is therefore necessary to split up the total DNA of an organism, sequence the parts, and reassemble the known sequence segments based on similarity (identity) between their overlapping parts.
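The reassembly step described above can be illustrated with a minimal sketch. It assumes error-free reads and exact matching, which real assemblers cannot rely on; the function names and the `min_overlap` threshold are hypothetical and chosen only for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is
    identical to a prefix of `right`, or 0 if it is shorter than
    `min_overlap` (an illustrative threshold, not a standard value)."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads into one longer segment if they overlap;
    return None when no sufficient overlap exists."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    # Keep `left` whole and append the non-overlapping tail of `right`.
    return left + right[n:]


print(merge_reads("ACGTTGCA", "TGCAGGT"))  # ACGTTGCAGGT
```

Here the last four bases of the first read (TGCA) are identical to the first four bases of the second, so the two reads are joined into a single segment of 11 bases.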

The two main variants of this 'divide-and-conquer' approach are:

Clone-based sequencing

A good illustration of this method is the description of the procedure used for the Caenorhabditis elegans genome as published by the C. elegans sequencing consortium, Science (1998) 282, 2012-2018.

Shotgun sequencing

An important aspect of genome sequencing is that it was long believed that the shotgun strategy would not work on DNA molecules as large as chromosomes. The frequency of gaps in the final assembly (due to uncovered segments) and the problems caused by repetitive DNA sequences were expected to render shotgun sequencing useless for large DNA molecules. However, Craig Venter formed The Institute for Genomic Research (TIGR) in order to use shotgun sequencing in a systematic fashion for determining entire bacterial genomes.

The basic idea is that by sequencing a large number of randomly chosen fragments in an industrialized fashion, it should be possible to reassemble the complete genome computationally. Oversampling is critical. This is shown schematically in the figure. Simulations indicate that at a coverage of 10x, it should be possible to reassemble a complete genome with very few gaps left over. A coverage of 10x means that enough fragments are sequenced for each base in the genome to be represented in 10 fragments on average.
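The arithmetic behind the coverage figure can be sketched as follows. The number of fragments follows directly from the definition of coverage. The estimate of the uncovered fraction assumes an idealized model in which fragments start at uniformly random positions, so that the fraction of bases covered by no fragment is approximately e^(-c) at coverage c. This model is an assumption added here, not the simulation referred to above, and the genome size and fragment length in the example are illustrative values only.

```python
import math


def reads_needed(genome_length: int, read_length: int, coverage: float) -> int:
    """Number of fragments required so that each base is represented
    in `coverage` fragments on average."""
    return math.ceil(coverage * genome_length / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Approximate fraction of bases covered by no fragment, assuming
    fragments start at uniformly random positions (idealized model)."""
    return math.exp(-coverage)


# Illustrative values: a 1.8 Mb bacterial genome and 500-base fragments.
genome_length = 1_800_000
read_length = 500

print(reads_needed(genome_length, read_length, 10))      # 36000
print(expected_uncovered_fraction(10) * genome_length)   # about 82 bases
print(expected_uncovered_fraction(1))                    # about 0.37
```

Under these assumptions, 1x coverage would leave more than a third of the genome unsequenced, while 10x leaves only on the order of a hundred bases uncovered. Repetitive sequences, which the model ignores, cause additional assembly problems that coverage alone does not solve.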

The strategy developed by TIGR, and subsequently used by Celera for the Drosophila melanogaster project (and for their human genome project), has another important component. The DNA fragments are generated as BAC clones (100-150 kb), of which some are randomly chosen to be fully sequenced by the shotgun method. Other BAC clones are then sequenced only at their ends, so that it becomes possible to choose which of them should be sequenced next. This extra information ('mates') helps considerably in reassembling the entire genome. A description of the entire procedure used to assemble the Drosophila melanogaster genome is given in the paper by Eugene Myers et al., Science (2000) 287, 2196-2204.
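One way in which end sequences ('mates') can help reassembly is sketched below: if the two end reads of a clone of roughly known length fall in two different assembled segments, the distance between those segments can be estimated. This is a simplified illustration, not the procedure of the cited paper. The function name, the coordinate conventions and the numbers are hypothetical, and it is assumed that both segments are in the same orientation and that the clone length is known exactly.

```python
def estimate_gap(segment_a_length: int,
                 mate_start_in_a: int,
                 mate_end_in_b: int,
                 clone_length: int) -> int:
    """Estimate the number of unsequenced bases between segment A and
    the following segment B from one pair of clone end reads.

    mate_start_in_a: 0-based position in A where the first end read starts.
    mate_end_in_b:   position in B, counted from the start of B, where the
                     second end read stops (exclusive).
    clone_length:    assumed length of the clone spanning both reads.
    """
    # Part of the clone lying inside A, from the read start to the end of A.
    span_in_a = segment_a_length - mate_start_in_a
    # Part of the clone lying inside B, from the start of B to the read end.
    span_in_b = mate_end_in_b
    # Whatever remains of the clone must lie in the gap between A and B.
    return clone_length - span_in_a - span_in_b


# Illustrative numbers: a 120 kb clone bridging two assembled segments.
print(estimate_gap(80_000, 10_000, 45_000, 120_000))  # 5000
```

A negative result would indicate that the two segments overlap or that the assumed clone length or orientation is wrong, which is itself useful evidence when checking an assembly.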

There are two important problems inherent in any known sequencing strategy:

Copyright © 2001 Per Kraulis $Date: 2001/01/26 14:59:06 $