Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

Presentation Transcript


  1. Genome sequence assembly concepts and methods Shih-Jon Wang May 13, 2008

  2. OUTLINE • Assembly Process Overview • Assembly algorithms • Repeats • Scaffolding • Phred/Phrap/Consed • Assembly pipelines

  3. Assembly process overview

  4. A Genome Sequencing Project

  5. Building a Library • Break DNA into random fragments (8-10x)

  6. Shotguns • Whole-genome shotgun • BAC-by-BAC shotgun • Size of inserts: • -- BAC insert: ~150 kb • -- Fosmid insert: ~30 kb • -- Normal insert: ~3 kb

  7. Clone and scaffold (a) Clone inserts are sequenced from both ends, yielding mated sequence reads. (b) A scaffold uses linking information provided by the clone-pairing data to order and orient contiguous sequences, or contigs, in the genome under assembly. Computer 35 (7):47-54

  8. Building a Library • Break DNA into random fragments (~10x) -- Amplify the fragments in a vector -- Sequence 800-1000 bases at each end
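The coverage figure on this slide can be turned into a read count with simple arithmetic. The sketch below assumes that "10x" means total sequenced bases divided by genome size; the genome size used in the example is hypothetical, and the uncovered-fraction estimate assumes reads land uniformly at random (the Lander-Waterman/Poisson approximation), which real libraries only approximate.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Reads required so that total sequenced bases = coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read at all."""
    return math.exp(-coverage)


# Hypothetical 2 Mb genome, 800 bp reads, 8x coverage.
n_reads = reads_needed(genome_size_bp=2_000_000, read_length_bp=800, coverage=8)
print(n_reads, expected_uncovered_fraction(8))
```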

  9. Assembling the fragments

  10. Assembling the fragments • Break DNA into random fragments • Sequence the ends of the fragments • Assemble the sequenced ends

  11. Forward-reverse constraints • The sequenced ends are facing towards each other • The distance between the two fragments is known
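The two constraints above can be checked mechanically once both mates are placed on the same contig. This is a minimal sketch, not any particular assembler's test: the `PlacedRead` type, the coordinate convention (half-open, contig-relative) and the `tolerance` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost contig coordinate of the read
    end: int       # rightmost contig coordinate (exclusive)
    forward: bool  # True if the read aligns to the forward strand


def mates_consistent(a, b, mean_insert, tolerance):
    """True if the mates face each other and span roughly one insert length."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    facing = left.forward and not right.forward
    span = right.end - left.start
    return facing and abs(span - mean_insert) <= tolerance


# Two 800 bp reads from the ends of a ~3 kb insert.
r1 = PlacedRead(start=0, end=800, forward=True)
r2 = PlacedRead(start=2200, end=3000, forward=False)
print(mates_consistent(r1, r2, mean_insert=3000, tolerance=300))
```

A pair that fails this check is a hint of a misassembly, a chimeric clone, or a collapsed repeat.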

  12. Building Scaffolds

  13. Assembly Gaps -- Sequencing gap: the order and orientation of the contigs are known, and at least one clone spans the gap -- Physical gap: no information is available about the adjacent contigs or about the DNA spanning the gap

  14. Finishing the Project

  15. Unifying View of Assembly

  16. Assembly Algorithms

  17. Assembly Methods • Overlap-layout-consensus – greedy (Phrap, CAP3, TIGR...) – graph-based (Euler)

  18. Phrap/CAP3 Greedy • Build a rough map of fragment overlaps • Pick the largest scoring overlap • Merge the two fragments • Repeat until no more merges can be done !!! IDEAL CASE !!!
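The greedy loop on this slide can be sketched in a few lines for the ideal case it describes. This is a toy illustration, not Phrap's or CAP3's actual implementation: it assumes error-free reads, all on the same strand, and scores an overlap simply by its exact-match length. The `min_len` parameter is a hypothetical threshold.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # no more merges can be done
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))
```

Because the best overlap is always taken first, a repeat longer than a read can pull together fragments that do not belong next to each other, which is why the slide flags this as the ideal case only.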

  19. Real World Problems • Sequencing errors • Chimera • Repeats • Contaminants • Polymorphism • Orientation

  20. Error Correction

  21. Overlap between two sequences

  22. All pairs alignment • Try all pairs – must consider ~ n^2 pairs • Smarter solution: only about n x coverage (e.g. 8n) pairs can actually overlap – Build a table of k-mers contained in the sequences (single pass through the reads) – Generate the pairs from the k-mer table (single pass through the k-mer table)
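The k-mer table idea can be sketched as follows. This is an illustrative version under simplifying assumptions: exact k-mer matches only, forward strand only (reverse complements are ignored), and no filtering of very frequent k-mers. The function name `candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return index pairs of reads that share at least one exact k-mer."""
    # Pass 1: record which reads contain each k-mer.
    table = defaultdict(set)
    for idx, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(idx)
    # Pass 2: every k-mer shared by two or more reads yields candidate pairs.
    pairs = set()
    for ids in table.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


reads = ["ACGTACGG", "TACGGTTA", "CCCCCCCC"]
print(candidate_pairs(reads, k=4))
```

Only the candidate pairs are then passed to the expensive alignment step, instead of all ~n^2 pairs.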

  23. Repeats

  24. Repeat sequence The top represents the correct layout of three DNA sequences. The bottom shows a repeat collapsed in a misassembly. Computer 35 (7):47-54

  25. Repeat sequences • ■ Classified by repeat frequency • Interspersed repeats • Short interspersed element (SINE), • e.g. Alu, <300 bp • Long interspersed element (LINE), ca. 5 kb • Tandem repeats • Satellite DNA • Minisatellites and variable number of tandem repeats (VNTRs) • Microsatellites: mono-, di-, tri-, tetranucleotide • ■ Classified by repeat orientation • Direct repeats • Inverted repeats

  26. Repeat detection Pre-assembly: find fragments that belong to repeats • statistically (Reps) • repeat database (RepeatMasker)

  27. Statistical repeat detection • Significant deviations from average coverage are flagged as repeats. - Frequent k-mers are ignored - The "arrival" rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads at 8x coverage: a read "arrives" every 100 bp) • Problem 1: assumes a uniform distribution of fragments - leads to false positives (non-random libraries, regions with poor clonability) • Problem 2: repeats with low copy number are missed
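The arrival-rate arithmetic in the example (800 bp reads at 8x coverage, one read every 100 bp) can be sketched as below. This is a deliberately crude ratio test, not the statistic used by any specific assembler; the function names and the `threshold` of 2.0 are assumptions chosen for illustration.

```python
def expected_arrival_interval(read_length_bp, coverage):
    """Average distance between read starts under uniform sampling."""
    return read_length_bp / coverage


def coverage_ratio(contig_length_bp, n_reads, read_length_bp, coverage):
    """Observed read count divided by the count expected for a unique region."""
    interval = expected_arrival_interval(read_length_bp, coverage)
    expected_reads = contig_length_bp / interval
    return n_reads / expected_reads


def looks_repetitive(contig_length_bp, n_reads, read_length_bp, coverage,
                     threshold=2.0):
    """Flag contigs whose read density is well above the expected density."""
    ratio = coverage_ratio(contig_length_bp, n_reads, read_length_bp, coverage)
    return ratio >= threshold


print(expected_arrival_interval(800, 8))
print(looks_repetitive(10_000, 300, 800, 8))
```

A two-copy repeat sits right at a ratio of about 2, so with ordinary sampling noise it is easy to miss, which is the second problem noted on the slide.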

  28. Scaffolding

  29. Sequencing hierarchy • Random sequencing – unrelated reads, ~700 base pairs • Assembly – unrelated contigs, 5K-10K base pairs • Scaffolding – unrelated scaffolds, 30K-50K base pairs • Finishing/gap closure – completed genomes, millions to billions of base pairs

  30. Definition

  31. Scaffolder output • order and orientation of contigs • size of gaps between contigs • linking evidence: mate-pairs spanning gaps
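The gap size in the scaffolder output can be estimated from the mate-pairs that span the gap. The sketch below assumes a simple geometry that is not spelled out on the slide: contig A lies to the left of contig B, the forward mate starts at a known position on A, the reverse mate ends at a known position on B, and the library insert size is known. All names are hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, len_a, read_start_on_a, read_end_on_b):
    """Gap implied by one mate-pair: insert minus the parts lying on A and B."""
    on_a = len_a - read_start_on_a  # bases of the insert that lie on contig A
    on_b = read_end_on_b            # bases of the insert that lie on contig B
    return insert_size - on_a - on_b


def estimate_gap(insert_size, len_a, links):
    """Average the per-pair estimates over all links spanning the gap."""
    return mean(gap_from_mate_pair(insert_size, len_a, s, e) for s, e in links)


# Two links from a 3 kb library between a 10 kb contig A and contig B.
links = [(9000, 1500), (9200, 1700)]
print(estimate_gap(3000, 10_000, links))
```

A negative estimate would suggest that the two contigs actually overlap rather than being separated by a gap.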

  32. Clone-mates

  33. Linking information

  34. Hierarchical scaffolding

  35. Ambiguous scaffold

  36. Phred/Phrap/Consed Analysis

  37. What is Phred/Phrap/Consed? Phred/Phrap/Consed is a package distributed worldwide for: a. Reading trace files (chromatograms); b. Assigning a quality (confidence) value to each individual base; c. Identifying and masking vector and repeat sequences; d. Sequence assembly; e. Assembly visualization and editing; f. Automatic finishing.

  38. How to deal with the enormous amount of reads generated by the high throughput DNA sequencers?

  39. Phred Genome Research 8: 175-194

  40. Phred Phred is a program that performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI, ESD (MegaBACE) and LI-COR. b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.

  41. Phred c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

  42. File Directories • chromat_dir/ • edit_dir/ • phd_dir/

  43. Trace File: High quality region – no ambiguities (Ns)

  44. Trace File: Medium quality region – some ambiguities (Ns)

  45. Trace File: Poor quality region – low confidence

  46. Phred value formula q = -10 x log10(p) where q = quality value, p = estimated error probability for a base call. Examples: q = 20 means p = 10^-2 (1 error in 100 bases); q = 40 means p = 10^-4 (1 error in 10,000 bases)
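The formula and both examples can be checked directly. The helper names below are hypothetical; only the relation q = -10 x log10(p) comes from the slide.

```python
import math


def phred_quality(p_error):
    """Quality value q for an estimated base-call error probability p."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Inverse of the formula: error probability for a quality value q."""
    return 10 ** (-q / 10)


for q in (20, 40):
    print(q, error_probability(q))
```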

  47. Base Calling • phred -id . -p -pd ../phd_dir • phred -view pf84c05.s1

  48. The structure of a phd file (header, then base / quality / peak position for each call; "..." marks calls not shown on the slide)
      BEGIN_SEQUENCE 01EBV10201A02.g
      BEGIN_COMMENT
      CHROMAT_FILE: EBV10201A02.g
      ABI_THUMBPRINT:
      PHRED_VERSION: 0.990722.g
      CALL_METHOD: phred
      QUALITY_LEVELS: 99
      TIME: Thu May 24 00:18:58 2001
      TRACE_ARRAY_MIN_INDEX: 0
      TRACE_ARRAY_MAX_INDEX: 12153
      TRIM:
      CHEM: term
      DYE: big
      END_COMMENT
      BEGIN_DNA
      t 8 5, c 13 17, a 19 26, c 19 32
      ...
      t 24 2221, a 24 2232, a 22 2245, a 27 2261, g 25 2272, c 19 2286, c 12 2302, t 19 2314, g 12 2324, g 15 2331, g 19 2346, g 23 2363, t 33 2378, g 36 2390, c 44 2404, c 44 2419, t 39 2433, a 39 2446, a 34 2460, t 35 2470, g 34 2482
      ...
      t 16 8191, g 19 8200, t 13 8211, c 13 8229, g 4 8241, n 4 8253, c 4 8263, t 10 8276, t 9 8286, c 12 8301, t 16 8313, c 12 8329, c 12 8336, c 15 8343, t 19 8356, c 9 8371, g 13 8386, g 14 8397, a 7 8417, g 9 8427, g 4 8445
      ...
      t 6 11908, a 6 11921, g 6 11927, t 6 11947, c 6 11953, a 6 11964, g 6 11981, c 4 11994, n 4 12015, c 4 12037, n 4 12044, n 4 12058, n 4 12071, n 4 12085, n 4 12098, n 4 12111, n 4 12124, c 4 12144, n 4 12151
      END_DNA
      END_SEQUENCE
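A phd file with the sections shown on this slide can be read with a small hand-written parser. This is a minimal sketch, assuming the usual on-disk layout of one "base quality position" triple per line between BEGIN_DNA and END_DNA and one "KEY: value" pair per line in the comment block; `parse_phd` is a hypothetical helper, not part of the Phred distribution.

```python
def parse_phd(text):
    """Parse one phd record into (name, comments dict, list of base calls)."""
    name, comments, calls = None, {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(maxsplit=1)
            name = parts[1] if len(parts) > 1 else None
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            key, _, value = line.partition(":")  # split on the first colon only
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, qual, pos = line.split()
            calls.append((base, int(qual), int(pos)))
    return name, comments, calls


sample = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CALL_METHOD: phred
TIME: Thu May 24 00:18:58 2001
END_COMMENT
BEGIN_DNA
t 8 5
c 13 17
END_DNA
END_SEQUENCE
"""
name, comments, calls = parse_phd(sample)
print(name, comments["CALL_METHOD"], calls)
```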

  49. phd2fasta • phd2fasta program • – converts .phd files to sequence in multifasta format • – writes .qual file (quality scores) for each trace file • – phd2fasta -id ../phd_dir -os CLONE.fasta -oq CLONE.fasta.qual • Output: • – fasta.seq contains fasta sequences • – fasta.seq.qual contains quality scores

  50. Vector Sequence Cleaning (1) • DNA sequence cleaning: quality trimming and vector removal---Lucy: • Lucy Steps: • Read input seq#, seq info, and quality info • Chop off splice sites • Remove vector insert • Produce output seq for fragment assembly.
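To make the idea of quality trimming concrete, here is a simple maximum-scoring-segment heuristic over Phred quality values. It is explicitly not Lucy's algorithm, which is not described on the slide; the function name and the `min_q` cutoff of 20 are assumptions for illustration.

```python
def trim_by_quality(quals, min_q=20):
    """Return (start, end) of the segment maximizing sum(q - min_q).

    Coordinates are half-open; (0, 0) means no base passed the cutoff.
    """
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, q in enumerate(quals):
        if cur_sum <= 0:
            # A non-positive running score cannot help: restart here.
            cur_sum, cur_start = 0, i
        cur_sum += q - min_q
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


quals = [4, 8, 30, 35, 40, 33, 9, 4]
start, end = trim_by_quality(quals)
print(start, end, quals[start:end])
```

Low-quality bases at both ends are dropped, while an isolated low-quality base inside a good region can survive if its neighbours score well enough.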
