Lecture 14

Bioinformatics Lecture 14
• Genome sequencing projects
• Hierarchical and Shotgun approaches
• Genome assembly
• TIGR Assembler
• Ensembl

Genome size
Mammalian genome ~ 3 gigabases = 3 × 10^9 base pairs
How many books are needed to print the entire mammalian genome?
1,500 letters per page × 1,000 pages per book × 2,000 books
Assuming 5 cm per book, this shelf is ~100 meters long!
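The arithmetic on this slide can be checked in a few lines. This is only a sketch of the slide's own numbers; the constant names are illustrative and not part of the lecture.

```python
# Figures taken from the slide; names are illustrative only.
GENOME_BP = 3 * 10**9        # ~3 gigabases, one letter per base pair
LETTERS_PER_PAGE = 1_500
PAGES_PER_BOOK = 1_000
BOOK_THICKNESS_CM = 5

letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK   # 1.5 million letters
books_needed = GENOME_BP // letters_per_book           # 2000 books
shelf_length_m = books_needed * BOOK_THICKNESS_CM / 100  # 100.0 meters

print(books_needed, shelf_length_m)
```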

Genome sequencing: the problem
• Sequencing read lengths vary with several parameters, but 600 to 800 nucleotides is a good estimate. To sequence much larger fragments, or even whole genomes, essentially two strategies have been designed.
• a) The hierarchical approach. Depending on the cloning vector used, BAC, YAC, cosmid, or other libraries of cloned inserts are created. The insert size may vary from tens to hundreds of thousands of base pairs. Collections of sub-fragments obtained by restriction digestion are mapped to build a single contig, from which a minimal set of sub-fragments can be selected and sequenced, thus limiting sequence redundancy.
• b) The shotgun approach. This can be applied to a DNA sequence of any size, including a whole genome. DNA is randomly fragmented by sonication or shearing. Following fragmentation and enzymatic end repair, the DNA fragments are ligated into a plasmid vector and a bacterial host is transformed to produce a library. Clones taken at random from the library are then sequenced from both ends using two universal primers. At this stage a shotgun is characterised by its depth, i.e. the cumulative length of sequence determined divided by the length of the fragment or genome to be sequenced. For example, for an estimated size of 4 Mb, a 10X shotgun would correspond to the assembly of about 60,000 reads with a mean size of 650 nt. The resulting sequences are assembled into a single contig representing the whole fragment by sequence comparison, using appropriate bioinformatics programs. The final stage, or “polishing stage”, corresponds to the elimination of gaps and other possible problems.
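The definition of shotgun depth above (cumulative read length divided by target length) can be turned into a small calculation. The sketch below uses hypothetical function names and simply reproduces the slide's 4 Mb, 10X, 650 nt example.

```python
import math

def shotgun_depth(n_reads: int, mean_read_bp: float, target_bp: int) -> float:
    """Depth = cumulative length of sequence determined / length of the target."""
    return n_reads * mean_read_bp / target_bp

def reads_needed(target_bp: int, depth: float, mean_read_bp: float) -> int:
    """Smallest number of reads whose cumulative length reaches the given depth."""
    return math.ceil(depth * target_bp / mean_read_bp)

# Slide example: 4 Mb target, 10X shotgun, 650 nt mean read length.
print(reads_needed(4_000_000, 10, 650))        # 61539, i.e. "about 60,000 reads"
print(shotgun_depth(60_000, 650, 4_000_000))   # 9.75, i.e. roughly 10X
```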

Shotgun approach

Genome assembly

Assembly of contiguous DNA sequences
• Sequencing projects have rapidly moved to using the two approaches sequentially.
• For example, the construction of a BAC map covering an entire genome or chromosome is followed by a shotgun strategy to sequence a minimal set of BACs.
• The change introduced by J. C. Venter was the size of the DNA fragment or genome that was directly shotgunned. The possibility of increasing the size of shotgun projects depended on the development of robots adapted to high-throughput projects and of bioinformatics programs that solve two major problems.
• The first is a quantitative problem: the capacity to store, compare, and retrieve millions of reads corresponding to billions of nucleotides. DB problem.
• The second is related to the presence of numerous repeat sequences that are often longer than the mean read length, complicating correct assembly. Assembly problem.

Fragment assembly problem
• The Shortest Superstring Problem, while already a challenge, is a simplified abstraction of fragment assembly, which must also take three other difficulties into account.
• 1. Sequence data are not perfect and erroneous reads are possible.
• 2. Presence of numerous repeats. There are about a million copies of the ~300 bp Alu repeat in the human genome, and many other repeats. Fortunately, some repeat copies differ slightly from each other because of mutation.
• 3. As DNA is double-stranded, the orientation of the substrings is unknown: it is not known which strand each fragment comes from.
• Most fragment assembly algorithms include the following three steps:
• Overlap. The problem is to find the best match between the suffix of one sequence and the prefix of another. The difficulties above force the use of variations of the dynamic programming algorithm plus filtration methods.
• Layout. This is the hardest step in DNA assembly, and it becomes even more computationally demanding as the number of fragments increases. The most difficult part is deciding whether two fragments with a good overlap really overlap or represent a repeat or something else.
• Consensus. This step finds the most frequent character at each position of the substring layout constructed in the layout step. More sophisticated algorithms align substrings in small windows along the layout, or use a mosaic of the best (highest probabilistic score) segments from the layout.
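The three steps above can be sketched on a toy example. This is a deliberately simplified illustration, not a real assembler: it assumes error-free reads for the overlap step, exact suffix/prefix matches, a single strand (no reverse complements), and no repeats. The function names and the toy reads are hypothetical.

```python
from collections import Counter
from itertools import permutations

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Overlap step: longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_layout(reads, min_len: int = 3):
    """Layout step (greedy): repeatedly merge the pair with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(frags, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:          # nothing left to merge: several contigs remain
            break
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_len:])
    return frags

def consensus(layout):
    """Consensus step: most frequent base in each column of a (offset, read) layout."""
    length = max(off + len(read) for off, read in layout)
    columns = [Counter() for _ in range(length)]
    for off, read in layout:
        for i, base in enumerate(read):
            columns[off + i][base] += 1
    # Uncovered columns are reported as "N".
    return "".join(c.most_common(1)[0][0] if c else "N" for c in columns)

reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(greedy_layout(reads))   # ['ATGCGTACGTTAGC']

# One read carries a sequencing error (G instead of C at layout position 7);
# the majority vote in that column recovers the correct base.
layout = [(0, "ATGCGTAC"), (4, "GTACGTTA"), (4, "GTAGGTTA"), (7, "CGTTAGC")]
print(consensus(layout))      # ATGCGTACGTTAGC
```

A greedy merge like this is exactly where repeats cause trouble: two fragments from different copies of a repeat can show a long overlap and be merged incorrectly, which is why the layout step is described as the hardest.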

Genome assembly from smaller sequence fragments

TIGR Assembler
• TIGR Assembler is open-source software.
• It is a sequence fragment assembly program that builds contigs from small sequence reads.
• It is versatile, offering a wide variety of options for tuning the assembly process and analyzing sequence data. The current assembly engine uses a greedy algorithm and heuristics to build contigs, find repeat regions, and target alignment regions.
• Sequence overlaps are detected and scored using a 32-mer hash.
• Sequence alignment and merging are done using a Smith-Waterman dynamic programming algorithm.
• Gap penalties and score values corresponding to the bases and their quality values are predefined and hard-coded into the program.
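The combination described above, a k-mer hash to find candidate overlaps followed by Smith-Waterman alignment, can be illustrated with a short sketch. This is not TIGR Assembler's code: the function names are hypothetical, k is set to 4 so that the toy reads share k-mers (the slide cites 32-mers), quality values are ignored, and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Hash every k-mer to the reads containing it; count distinct shared k-mers per pair."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    shared = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(sorted(rids), 2):
            shared[(a, b)] += 1
    return dict(shared)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (linear gap penalty), keeping only two DP rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC", "TTTTTTTT"]
pairs = candidate_pairs(reads, k=4)
print(pairs)   # {(0, 1): 1, (1, 2): 2} in some order; read 3 shares no 4-mer

# Only the candidate pairs are passed on to the more expensive alignment step.
for (a, b) in pairs:
    print(a, b, smith_waterman_score(reads[a], reads[b]))
```

The point of the hash is filtration: the quadratic dynamic programming step runs only on read pairs that already share at least one k-mer, rather than on every pair of reads.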

Genome assembly – contigs and supercontigs alignment
• It is very difficult to produce a finished continuous sequence for a genome with the level of redundancy typical of many higher eukaryotes.
• Instead, a draft sequence of about 150,000 contigs will be generated, which could be combined to give a few thousand supercontigs.
• The production, in parallel, of a dense radiation hybrid (RH) map will not only facilitate the assembly of the contigs into supercontigs, but will also make it possible to order the supercontigs, a necessary step for understanding genome rearrangements and synteny.

[Figure: integrated map of chromosome CFA5 aligning the RH map (650.2 cR5000), the meiotic map (85 cM), and the FISH/cytogenetic map (99 Mb), with corresponding HSA regions including 11q22, 11q23, 1p32, and 16q24. Marker labels (e.g. THY1, SLC2A4, CD3E, DIO1, FH2140, FH2383, FH2594, CPH14, CPH18, ZUBECA6) are not reproduced here.]

Mouse Genome: sequencing and assembly
• The mouse genome is about 14% smaller than the human genome (2.5 Gb compared with 2.9 Gb), probably due to a higher rate of deletions.
• Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny.
• The sequencing strategy included four approaches: 1) construction of a BAC-based physical map by fingerprinting the clones and sequencing their ends, 2) Whole-Genome Shotgun (WGS) sequencing to ~7-fold coverage and assembly to generate an initial draft, 3) hierarchical shotgun sequencing of BAC clones combined with WGS to create a hybrid WGS-BAC assembly, 4) production of finished sequence by using the BAC clones as templates for direct finishing.
• About 41 million reads were generated by the project participants, of which 33.6 million passed quality checks and 29.7 million were paired (from opposite ends of the same clone). Clone inserts provide ~47-fold physical coverage of the genome.
• Genome assembly was achieved using two newly developed programs, Arachne and Phusion.
• The assembly contains 224,713 contigs, connected into 7,418 supercontigs. The 200 largest supercontigs span more than 98% of the assembled sequence, of which 3% is within sequence gaps.
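A few ratios follow directly from the figures on this slide. The sketch below only rearranges those numbers. The implied mean insert size is a back-of-the-envelope estimate under two loud assumptions that the slide does not state: that each pair of paired reads comes from a distinct clone, and that the ~47-fold physical coverage is computed from exactly those clones.

```python
# Figures from the slide; variable names are illustrative.
HUMAN_GB, MOUSE_GB = 2.9, 2.5
READS_TOTAL, READS_PASSED, READS_PAIRED = 41_000_000, 33_600_000, 29_700_000
PHYSICAL_COVERAGE = 47
CONTIGS, SUPERCONTIGS = 224_713, 7_418

size_reduction = 1 - MOUSE_GB / HUMAN_GB        # ~0.138, the "about 14% smaller"
pass_fraction = READS_PASSED / READS_TOTAL      # ~0.82 of reads passed quality checks
clones = READS_PAIRED // 2                      # two end reads per clone (assumption)

# ASSUMPTION: physical coverage = clones * mean insert size / genome size.
implied_mean_insert_bp = PHYSICAL_COVERAGE * MOUSE_GB * 1e9 / clones

contigs_per_supercontig = CONTIGS / SUPERCONTIGS  # ~30 contigs per supercontig

print(f"{size_reduction:.1%}, {pass_fraction:.1%}, {clones:,} clones")
print(f"~{implied_mean_insert_bp:,.0f} bp implied mean insert")
print(f"{contigs_per_supercontig:.1f} contigs per supercontig")
```

The contrast between ~7-fold sequence coverage and ~47-fold physical coverage reflects the fact that only the two ends of each clone insert are sequenced, while the whole insert spans the genome and links contigs into supercontigs.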

Ensembl: An Open-Source Tool
• Ensembl consists of two main parts:
• 1) The analysis pipeline, which regularly adds new data and analyses to the core database. The DB contains DNA sequences, predicted features on the sequences, and a complete body of evidence supporting these predictions. Ensembl known genes are therefore those predicted genes that have high similarity to genes confirmed by experimental evidence.
• 2) The API (application programming interface), which gives structured access to the data. The ease of retrieving information in a meaningful form makes the API an extremely powerful tool. The initial implementation of the API is in Perl, built upon a layer of BioPerl objects. Implementations in other languages, such as Java and Python, are also in use.
• Ensembl is based around two ideas: a golden path (the pathway through the data containing nonredundant sequence) and a virtual contig (a contig determined by the user, i.e. an arbitrary region of a chromosome).
• The NCBI and UCSC web sites contain systems similar to Ensembl.
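The golden path and virtual contig ideas can be illustrated with a toy data structure. This is not the Ensembl API or schema: the `Tile` class, the `virtual_contig` function, and the three made-up contigs are all hypothetical, chosen only to show how a nonredundant path through overlapping contigs lets a user request an arbitrary chromosomal region.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tile:
    """One golden-path segment: which slice of which contig covers the chromosome."""
    chrom_start: int    # 0-based start on the chromosome
    contig_id: str
    contig_start: int   # 0-based offset inside the contig
    length: int

# Toy contigs whose ends overlap ("GGG" and "AC" are each present twice).
CONTIGS = {"c1": "AAACCCGGG", "c2": "GGGTTTAAAC", "c3": "ACGTACGT"}

# Golden path: every chromosome position is taken from exactly one contig.
GOLDEN_PATH = [
    Tile(0, "c1", 0, 9),    # AAACCCGGG
    Tile(9, "c2", 3, 7),    # skips the redundant "GGG": TTTAAAC
    Tile(16, "c3", 2, 6),   # skips the redundant "AC": GTACGT
]

def virtual_contig(start: int, end: int) -> str:
    """Sequence of the half-open chromosome region [start, end), stitched from tiles."""
    pieces = []
    for tile in GOLDEN_PATH:
        lo = max(start, tile.chrom_start)
        hi = min(end, tile.chrom_start + tile.length)
        if lo < hi:  # the tile intersects the requested region
            offset = tile.contig_start + (lo - tile.chrom_start)
            pieces.append(CONTIGS[tile.contig_id][offset:offset + (hi - lo)])
    return "".join(pieces)

print(virtual_contig(0, 22))   # AAACCCGGGTTTAAACGTACGT
print(virtual_contig(6, 18))   # GGGTTTAAACGT (spans all three contigs)
```

The user of a virtual contig works in chromosome coordinates and never needs to know where the underlying contig boundaries fall.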
