Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

Presentation Transcript

  1. Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

  2. OUTLINE • Assembly Process Overview • Assembly algorithms • Repeats • Scaffolding • Phred/Phrap/Consed • Assembly pipelines

  3. Assembly process overview

  4. A Genome Sequencing Project

  5. Building a Library • Break DNA into random fragments (8-10x coverage)

  6. Shotguns • Whole Genome Shotgun • BAC-by-BAC Shotgun • Size of inserts: -- BAC insert: ~150 kb -- Fosmid insert: ~30 kb -- Normal insert: ~3 kb

  7. Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54

  8. Building a Library • Break DNA into random fragments (~10x) -- Amplify the fragments in a vector -- Sequence 800-1000 bases at each end

  9. Assembling the fragments

  10. Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends

  11. Forward-reverse constraints • The sequenced ends face towards each other • The distance between the two sequenced ends is known
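
The two constraints on slide 11 can be checked mechanically once both mated reads are placed on the same contig. The following is only a minimal sketch: the `PlacedRead` type, the coordinate convention, and the `insert_mean` / `insert_tol` parameters are assumptions made for illustration and are not part of the presentation.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read on the contig
    end: int       # rightmost coordinate of the read (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a, b, insert_mean, insert_tol):
    """Return True if two mated reads satisfy the forward-reverse constraints."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Constraint 1: the reads face each other (left is forward, right is reverse).
    if not (left.forward and not right.forward):
        return False
    # Constraint 2: the span covered by the pair matches the known insert size.
    span = right.end - left.start
    return abs(span - insert_mean) <= insert_tol


# Hypothetical example: a ~3 kb insert with 800 bp reads at each end.
read_a = PlacedRead(0, 800, True)
read_b = PlacedRead(2200, 3000, False)
ok = mates_consistent(read_a, read_b, insert_mean=3000, insert_tol=300)
```

A pair that fails either test is a hint of a misassembly or a chimeric clone rather than proof of one.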

  12. Building Scaffolds

  13. Assembly Gaps --sequencing gap: know the order & orientation of the contigs and have at least one clone spanning the gap --physical gap: no information about adjacent contigs, nor about the DNA spanning the gap

  14. Finishing the Project

  15. Unifying View of Assembly

  16. Assembly Algorithms

  17. Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...) – graph-based (Euler)

  18. Phrap/CAP3 Greedy • Build a rough map of fragment overlaps • Pick the highest-scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!
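
The greedy loop on slide 18 can be sketched in a few lines for the ideal case it describes. This is a toy illustration, not how Phrap or CAP3 are implemented: it assumes error-free reads on a single strand, scores an overlap simply by its exact-match length, and the function names and `min_len` parameter are invented for the example.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        # "Rough map" of overlaps: score every ordered pair of sequences.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no more merges can be done
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
contigs = greedy_assemble(reads)
```

With sequencing errors, repeats, or reads from both strands (the problems listed on the next slide), exact suffix-prefix matching no longer suffices and the greedy choice can merge the wrong pair.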

  19. Real World Problems • Sequencing errors • Chimera • Repeats • Contaminants • Polymorphism • Orientation

  20. Error Correction

  21. Overlap b/w two sequences

  22. All pairs alignment • Try all pairs – must consider ~n^2 pairs • Smarter solution: only n × coverage (e.g. 8) pairs are possible – Build a table of k-mers contained in the sequences (single pass through the reads) – Generate the pairs from the k-mer table (single pass through the k-mer table)
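
The k-mer table idea on slide 22 can be sketched as follows. It is a simplified illustration under stated assumptions: reads are plain strings on one strand (reverse complements are ignored), any single shared k-mer makes a pair a candidate, and the function name and default `k` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=4):
    """Return index pairs of reads that share at least one k-mer."""
    # Pass 1: one pass over the reads builds the k-mer -> read-index table.
    table = defaultdict(set)
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].add(idx)
    # Pass 2: one pass over the table emits the candidate pairs.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


reads = ["ACGTACGG", "TACGGTTT", "CCCCCCCC"]
pairs = candidate_pairs(reads)
```

Only the candidate pairs then need a full alignment, which is what brings the work down from roughly n^2 comparisons. A k-mer that occurs in very many reads produces a quadratic number of pairs on its own, which is one reason frequent k-mers are usually skipped.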

  23. Repeats

  24. Repeat sequence: The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54

  25. Repeat sequences • ■ Classified by repeat frequency • Interspersed repeats • Short interspersed element (SINE), e.g. Alu, <300 bp • Long interspersed element (LINE), ca. 5 kb • Tandem repeats • Satellite DNA • Minisatellites and variable number of tandem repeats (VNTRs) • Microsatellites: mono-, di-, tri-, tetra-nucleotide • ■ Classified by repeat orientation • Direct repeats • Inverted repeats

  26. Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (RepeatMasker)

  27. Statistical repeat detection • Significant deviations from average coverage are flagged as repeats. - frequent k-mers are ignored - "arrival" rate of reads in contigs is compared with the theoretical value (e.g., with 800 bp reads and 8x coverage, reads "arrive" every 100 bp) • Problem 1: assumes a uniform distribution of fragments - leads to false positives (non-random libraries, regions with poor clonability) • Problem 2: repeats with low copy number are missed
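
The arrival-rate arithmetic on slide 27 (800 bp reads at 8x coverage give one read start every 100 bp) can be written out as a small check. The flagging rule below is an assumption made for illustration: a contig is flagged when its reads arrive more than `factor` times as densely as expected. Real assemblers use a proper statistical test, and the names and the threshold are hypothetical.

```python
def expected_arrival_interval(read_len, coverage):
    """Expected spacing in bp between consecutive read starts."""
    return read_len / coverage


def flag_repeat_contigs(contigs, read_len, coverage, factor=2.0):
    """contigs maps a contig name to (length_bp, number_of_reads)."""
    expected = expected_arrival_interval(read_len, coverage)
    flagged = []
    for name, (length, n_reads) in contigs.items():
        observed = length / n_reads  # observed spacing between read starts
        # Reads arriving much more densely than expected suggest a collapsed repeat.
        if observed < expected / factor:
            flagged.append(name)
    return flagged


interval = expected_arrival_interval(800, 8)
# Hypothetical contigs: c2 holds four times as many reads per bp as expected.
contigs = {"c1": (10000, 100), "c2": (5000, 200)}
suspects = flag_repeat_contigs(contigs, read_len=800, coverage=8)
```

A two-copy repeat only doubles the read density, so with `factor=2.0` it sits right at the threshold, which illustrates why low-copy repeats are missed (Problem 2).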

  28. Scaffolding

  29. Sequencing hierarchy • Random sequencing – unrelated reads, ~700 base pairs • Assembly – unrelated contigs, 5K-10K base pairs • Scaffolding – unrelated scaffolds, 30K-50K base pairs • Finishing/gap closure – completed genomes, millions to billions of base pairs

  30. Definition

  31. Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps

  32. Clone-mates

  33. Linking information

  34. Hierarchical scaffolding

  35. Ambiguous scaffold

  36. Phred/Phrap/Consed Analysis

  37. What is Phred/Phrap/Consed? Phred/Phrap/Consed is a widely distributed software package for: a. Trace file (chromatogram) reading; b. Quality (confidence) assignment to each individual base; c. Vector and repeat sequence identification and masking; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

  38. How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

  39. Phred Genome Research 8: 175-194

  40. Phred Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.

  41. Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

  42. File Directories • chromat_dir/ • edit_dir/ • phd_dir/

  43. Trace File: High quality region – no ambiguities (Ns)

  44. Trace File: Medium quality region – some ambiguities (Ns)

  45. Trace File: Poor quality region – low confidence

  46. Phred value formula: q = -10 × log10(p), where q is the quality value and p is the estimated error probability for a base call. Examples: q = 20 means p = 10^-2 (1 error in 100 bases); q = 40 means p = 10^-4 (1 error in 10,000 bases)
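
The formula on slide 46 and its inverse are easy to check numerically. A minimal sketch, with function names chosen for the example:

```python
import math


def phred_quality(p):
    """Quality value q for an estimated error probability p."""
    return -10 * math.log10(p)


def error_probability(q):
    """Estimated error probability p for a quality value q."""
    return 10 ** (-q / 10)


q20 = phred_quality(0.01)     # 1 error in 100 bases
p40 = error_probability(40)   # 1 error in 10,000 bases
```

Because the scale is logarithmic, every 10 points of quality corresponds to a tenfold drop in the estimated error probability.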

  47. Base Calling • phred -id . -p -pd ../phd_dir • phred -view pf84c05.s1

  48. The structure of a phd file (excerpts shown in file order; in the file itself each base call occupies its own line: base, quality value, peak position)
      BEGIN_SEQUENCE 01EBV10201A02.g
      BEGIN_COMMENT
      CHROMAT_FILE: EBV10201A02.g
      ABI_THUMBPRINT:
      PHRED_VERSION: 0.990722.g
      CALL_METHOD: phred
      QUALITY_LEVELS: 99
      TIME: Thu May 24 00:18:58 2001
      TRACE_ARRAY_MIN_INDEX: 0
      TRACE_ARRAY_MAX_INDEX: 12153
      TRIM:
      CHEM: term
      DYE: big
      END_COMMENT
      BEGIN_DNA
      t 8 5  c 13 17  a 19 26  c 19 32
      ...
      t 24 2221  a 24 2232  a 22 2245  a 27 2261  g 25 2272  c 19 2286  c 12 2302  t 19 2314  g 12 2324  g 15 2331  g 19 2346  g 23 2363  t 33 2378  g 36 2390  c 44 2404  c 44 2419  t 39 2433  a 39 2446  a 34 2460  t 35 2470  g 34 2482
      ...
      t 16 8191  g 19 8200  t 13 8211  c 13 8229  g 4 8241  n 4 8253  c 4 8263  t 10 8276  t 9 8286  c 12 8301  t 16 8313  c 12 8329  c 12 8336  c 15 8343  t 19 8356  c 9 8371  g 13 8386  g 14 8397  a 7 8417  g 9 8427  g 4 8445
      ...
      t 6 11908  a 6 11921  g 6 11927  t 6 11947  c 6 11953  a 6 11964  g 6 11981  c 4 11994  n 4 12015  c 4 12037  n 4 12044  n 4 12058  n 4 12071  n 4 12085  n 4 12098  n 4 12111  n 4 12124  c 4 12144  n 4 12151
      END_DNA
      END_SEQUENCE
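
The layout on slide 48 is simple enough to read with a short parser. The sketch below assumes a single-sequence file with one "base quality position" triple per line between BEGIN_DNA and END_DNA; the function name and return shape are choices made for this example, and optional sections that real phd files may contain are not handled.

```python
def parse_phd(text):
    """Parse one phd record into (header dict, list of (base, quality, position))."""
    header, calls = {}, []
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            header["name"] = line.split(None, 1)[1]
        elif line in ("BEGIN_COMMENT", "BEGIN_DNA"):
            section = line
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif section == "BEGIN_COMMENT":
            # Split on the first colon only, so values such as times stay intact.
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
        elif section == "BEGIN_DNA":
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return header, calls


sample = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
TIME: Thu May 24 00:18:58 2001
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
END_DNA
END_SEQUENCE
"""
header, calls = parse_phd(sample)
```

From `calls`, the sequence and its quality string follow directly, which is essentially the information that phd2fasta writes to its two output files.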

  49. phd2fasta • phd2fasta program – converts .phd files to sequences in multi-FASTA format – writes .qual file (quality scores) for each trace file – phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual • Output: – fasta.seq contains FASTA sequences – fasta.seq.qual contains quality scores

  50. Vector Sequence Cleaning (1) • DNA sequence cleaning (quality trimming and vector removal) with Lucy • Lucy steps: • Read input seq#, seq info, and quality info • Chop off splice sites • Remove vector insert • Produce output seq for fragment assembly.