
DNA Sequencing and Assembly


bronwyn

Presentation Transcript


  1. DNA Sequencing and Assembly

  2. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

  3. Which representative of the species? Which human? Answer one: Answer two: it doesn’t matter. Polymorphism rate: number of letter changes between two different members of a species. Humans: ~1/1,000 – 1/10,000. Other organisms have much higher polymorphism rates.

  4. DNA sequencing – vectors. Shake DNA into DNA fragments. Vector: a circular genome (bacterium, plasmid) with a known location (restriction site). DNA fragment + vector = vector carrying the fragment as an insert.

  5. Different types of vectors

  6. DNA sequencing – gel electrophoresis. Start at primer (restriction site). Grow DNA chain. Include dideoxynucleoside (modified a, c, g, t), which stops the reaction at all possible points. Separate products by length, using gel electrophoresis.

  7. Electrophoresis diagrams

  8. Output of gel electrophoresis: a read. A read: 500-700 nucleotides, e.g. bases A C G A A T C A G … A with quality scores 16 18 21 23 25 15 28 30 32 21. Quality scores: $-10 \log_{10} \mathrm{Prob}(\mathrm{Error})$. Reads can be obtained from the leftmost and rightmost ends of the insert. Double-barreled sequencing: both leftmost & rightmost ends are sequenced.
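
To make the quality-score formula on this slide concrete, here is a minimal Python sketch that converts between a score and an error probability. The function names are illustrative and not from the slides; the example scores are the ones listed above.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by quality score q = -10 * log10(P(error))."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score for a base whose probability of being wrong is p."""
    return -10 * math.log10(p)

# Quality scores listed on the slide for the example read.
scores = [16, 18, 21, 23, 25, 15, 28, 30, 32, 21]
error_probs = [phred_to_error_prob(q) for q in scores]

# Expected number of wrong bases among these ten positions.
expected_errors = sum(error_probs)
```

A score of 20 corresponds to a 1% chance that the base is wrong, and a score of 30 to 0.1%.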

  9. Method to sequence segments longer than 500 bp: cut the genomic segment many times at random (shotgun), then get one or two reads (~500 bp each) from each segment.

  10. Reconstructing the Sequence (Fragment Assembly). Cover the region with reads at ~7-fold redundancy (7X). Overlap reads and extend to reconstruct the original genomic region.

  11. Definition of Coverage. Length of genomic segment: L. Number of reads: n. Length of each read: l. Definition: coverage $C = nl/L$. How much coverage is enough? (Lander-Waterman model): assuming a uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides.
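
The coverage definition and the Lander-Waterman estimate can be checked numerically. The sketch below assumes the simplest form of the model, in which any overlap between reads is detectable, so the expected number of gaps is about n·e^(-C); the function names and the example read length of 500 are illustrative assumptions.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len

def expected_gaps(n_reads, read_len, genome_len):
    """Expected number of gaps under the simplified Lander-Waterman model.

    Assumption: reads start uniformly at random and any overlap is
    detectable, giving roughly n * exp(-C) gaps.
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)

# 20,000 reads of 500 bp over 1,000,000 bp gives C = 10.
c = coverage(20_000, 500, 1_000_000)
gaps = expected_gaps(20_000, 500, 1_000_000)
```

With these numbers the estimate comes out at a little under one gap per million nucleotides, in line with the figure quoted on the slide.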

  12. Challenges with Fragment Assembly • Sequencing errors: ~1-2% of bases are wrong • Repeats: false overlap due to repeat • Computation: ~ $O(N^2)$ where N = # reads

  13. Repeats. Bacterial genomes: 5%. Mammals: 50%. Repeat types: • Low-complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats: $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Common repeat families: SINE (Short Interspersed Nuclear Elements; e.g. ALU: ~300-long, $10^6$ copies); LINE (Long Interspersed Nuclear Elements; ~500-5,000-long, 200,000 copies); MIR; LTR/Retroviral • Other: genes that are duplicated & then diverge (paralogs); recent duplications, ~100,000-long, very similar copies

  14. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads

  15. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads

  16. Strategies for sequencing a whole genome. (1) Hierarchical – Clone-by-clone • Break genome into many long pieces • Map each long piece onto the genome • Sequence each piece with shotgun. Example: Yeast, Worm, Human, Rat. (2) Online version of (1) – Walking • Break genome into many long pieces • Start sequencing each piece with shotgun • Construct map as you go. Example: Rice genome. (3) Whole genome shotgun: one large shotgun pass on the whole genome. Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu.

  17. Hierarchical Sequencing

  18. Hierarchical Sequencing Strategy • Obtain a large collection of BAC clones • Map them onto the genome (Physical Mapping) • Select a minimum tiling path • Sequence each clone in the path with shotgun • Assemble • Put everything together
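
The "select a minimum tiling path" step can be sketched as a greedy interval cover. This is a simplified illustration, not the procedure used by any particular project: it assumes each clone has already been mapped to a known half-open interval `(start, end)` on the genome, and the function name is hypothetical.

```python
def minimal_tiling_path(clones, genome_len):
    """Greedy cover of [0, genome_len) with as few clones as possible.

    clones: list of (start, end) half-open intervals (assumed known
    from physical mapping). Raises ValueError if a position is uncovered.
    """
    clones = sorted(clones)
    path = []
    covered = 0  # everything in [0, covered) is already tiled
    i = 0
    while covered < genome_len:
        best = None
        # Among clones starting at or before the frontier, keep the one
        # that reaches furthest to the right.
        while i < len(clones) and clones[i][0] <= covered:
            if best is None or clones[i][1] > best[1]:
                best = clones[i]
            i += 1
        if best is None or best[1] <= covered:
            raise ValueError("no clone covers position %d" % covered)
        path.append(best)
        covered = best[1]
    return path

clones = [(0, 100), (50, 180), (90, 200), (150, 300), (190, 310)]
path = minimal_tiling_path(clones, 310)
```

In practice adjacent clones would also need enough overlap to be joined reliably; that requirement is left out here.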

  19. Methods of physical mapping Goal: Make a map of the locations of each clone relative to one another Use the map to select a minimal set of clones to sequence Methods: • Hybridization • Digestion

  20. 1. Hybridization. Short words, the probes (p1 … pn), attach to complementary words • Construct many probes • Treat each BAC with all probes • Record which ones attach to it • Same words attaching to BACs X, Y ⇒ X and Y overlap

  21. Hybridization – Computational Challenge. Matrix: m probes × n clones, with entry (i, j) = 1 if $p_i$ hybridizes to $C_j$, and 0 otherwise. Definition: a consecutive-ones matrix is a matrix in which the 1s are consecutive. Computational problem: reorder the probes so that the matrix is in consecutive-ones form. Can be solved in $O(m^3)$ time (m >> n). Unfortunately, data is not perfect. [Figure: the hybridization matrix for probes $p_1 \ldots p_m$ and clones $C_1 \ldots C_n$, shown before and after reordering into consecutive-ones form]
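
To illustrate what "consecutive-ones form" means, the sketch below checks the property and searches for a probe order by brute force. It is not the $O(m^3)$ algorithm mentioned on the slide: trying every permutation is factorial in the number of probes and only works for toy inputs. Rows are taken to be probes and columns clones, following the (i, j) indexing above; the function names are illustrative.

```python
from itertools import permutations

def columns_have_consecutive_ones(matrix):
    """True if, in every column (clone), the rows holding a 1 are contiguous."""
    for col in zip(*matrix):
        ones = [i for i, v in enumerate(col) if v == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True

def find_probe_order(matrix):
    """Brute-force search for a row (probe) order giving consecutive ones.

    Returns a list of row indices, or None if no order works.
    """
    for order in permutations(range(len(matrix))):
        if columns_have_consecutive_ones([matrix[i] for i in order]):
            return list(order)
    return None

# Toy data: 4 probes (rows) x 3 clones (columns), perfect hybridization.
hyb = [
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
order = find_probe_order(hyb)
```

With noisy data (false positives or negatives) no exact order may exist, which is the difficulty the slide points to.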

  22. 2. Digestion. Restriction enzymes cut DNA where specific words appear • Cut each clone separately with an enzyme • Run fragments on a gel and measure length • Clones Ca, Cb both have fragments of length { li, lj, lk } ⇒ overlap. Double digestion: cut with enzyme A, enzyme B, then enzymes A + B
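
The idea that shared fragment lengths suggest overlapping clones can be sketched as follows. The tolerance parameter (gel measurement error), the threshold of three shared fragments, and the function names are assumptions for illustration; the matching is a simple greedy pass, not an optimal assignment.

```python
def shared_fragments(frags_a, frags_b, tolerance=0):
    """Pair up fragment lengths from two clones that agree within tolerance.

    Each fragment of clone B is matched at most once (greedy matching).
    """
    remaining = sorted(frags_b)
    shared = []
    for length in sorted(frags_a):
        for other in remaining:
            if abs(length - other) <= tolerance:
                shared.append((length, other))
                remaining.remove(other)
                break
    return shared

def likely_overlap(frags_a, frags_b, min_shared=3, tolerance=0):
    """Call two clones overlapping if they share enough fragment lengths."""
    return len(shared_fragments(frags_a, frags_b, tolerance)) >= min_shared

clone_a = [1200, 3400, 560, 8000]
clone_b = [560, 3400, 1200, 150]
pairs = shared_fragments(clone_a, clone_b)
```

Different fragments can have the same length by chance, so a real pipeline would weigh the evidence statistically rather than use a fixed count.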

  23. Whole-Genome Shotgun Sequencing

  24. Whole Genome Shotgun Sequencing. The genome is cut many times at random into plasmids (2 – 10 Kbp) and cosmids (40 Kbp). Each insert yields forward-reverse linked reads (~500 bp each) separated by a known distance.

  25. The Overlap-Layout-Consensus approach 1. Find overlapping reads 2. Merge good pairs of reads into longer contigs 3. Link contigs to form supercontigs 4. Derive consensus sequence (..ACGATTACAATAGGTT..) + many heuristics

  26. 1. Find Overlapping Reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment; throw away if not >95% similar. [Figure: two reads aligned over a shared k-mer, e.g. TAGATTACACAGATTAC against TAGATTACACAGATTAC]
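
The first two bullets can be sketched in a few lines of Python. Instead of sorting all k-mers, this sketch uses a hash-based index, which finds the same read pairs; the small k, the function names, and the toy reads are assumptions for illustration. Reverse complements and the final alignment with the 95% similarity filter are not covered.

```python
from collections import defaultdict
from itertools import combinations

def kmer_index(reads, k):
    """Map each k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index

def candidate_pairs(reads, k):
    """Pairs of read ids (smaller id first) that share at least one k-mer."""
    pairs = set()
    for ids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["TAGATTACACAG", "TTACACAGATTAC", "GGGGCCCCGGGG"]
pairs = candidate_pairs(reads, k=6)
```

Each candidate pair would then be extended to a full alignment and kept only if it passes the similarity threshold.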

  27. 1. Find Overlapping Reads. One caveat: repeats. A k-mer that appears N times initiates $N^2$ comparisons (ALU: 1,000,000 times). Solution: discard all k-mers that appear more than c × Coverage times (c ~ 10)
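
The repeat filter described here amounts to counting k-mers and dropping the frequent ones. A minimal sketch, with illustrative function names and toy parameters:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def usable_kmers(reads, k, coverage, c=10):
    """k-mers that occur at most c * coverage times; the rest are treated as repeats."""
    limit = c * coverage
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n <= limit}

reads = ["ATATATATAT", "ATATATATAT", "ACGTACGGTC"]
# Toy settings so the low-complexity k-mers exceed the limit of 2.
kept = usable_kmers(reads, k=4, coverage=1, c=2)
```

Only the surviving k-mers would be used to seed read-pair comparisons, which bounds the work done inside high-copy repeats.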

  28. 1. Find Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

  29. 1. Find Overlapping Reads (cont’d) • Correct errors using multiple alignment • Score alignments • Accept alignments with good scores. [Figure: multiple alignment of reads (TAGATTACACAGATTACTGA, with one read reading TAG TTACACAGATTATTGA) annotated with per-base quality scores, e.g. C: 20, C: 35, C: 40 against T: 30 in one column, and A: 15, A: 25, A: 40 against a gap in another]

  30. Basic principle of assembly. Repeats confuse us. The ability to merge two reads is tied to the ability to detect repeats. We can dismiss as repeat any overlap of < t% similarity. Role of error correction: discards ~90% of single-letter sequencing errors ⇒ threshold t% increases

  31. 2. Merge Reads into Contigs (cont’d) • Merge reads up to potential repeat boundaries (Myers, 1995). [Figure: reads around a repeat region]

  32. 2. Merge Reads into Contigs (cont’d) • Ignore non-maximal reads • Merge only maximal reads into contigs. [Figure: reads around a repeat region]

  33. 2. Merge Reads into Contigs (cont’d) • Ignore “hanging” reads when detecting repeat boundaries. [Figure: reads a and b; is the disagreement a repeat boundary or a sequencing error?]

  34. 2. Merge Reads into Contigs (cont’d) • Insert non-maximal reads whenever unambiguous. [Figure: ambiguous and unambiguous placements]

  35. 3. Link Contigs into Supercontigs. Normal density. Too dense: overcollapsed? (Myers et al. 2000) Inconsistent links: overcollapsed?

  36. 3. Link Contigs into Supercontigs (cont’d). Find all links between unique contigs. Connect contigs incrementally, if ≥ 2 links
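
The link-counting rule can be sketched as below, assuming the condition is "at least two links" between a pair of contigs. Each link stands for a forward-reverse read pair whose two ends fall in different contigs; the data layout and function name are hypothetical.

```python
from collections import Counter

def supported_joins(links, min_links=2):
    """Contig pairs connected by at least min_links read-pair links.

    links: iterable of (contig_x, contig_y) tuples, one per linking read
    pair, in either order.
    """
    counts = Counter(tuple(sorted(pair)) for pair in links)
    return {pair for pair, n in counts.items() if n >= min_links}

links = [("A", "B"), ("B", "A"), ("B", "C"), ("A", "B")]
joins = supported_joins(links)
```

Requiring more than one link guards against a single chimeric or misplaced read pair joining two unrelated contigs.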

  37. 3. Link Contigs into Supercontigs Fill gaps in supercontigs with paths of overcollapsed contigs

  38. 3. Link Contigs into Supercontigs. For contigs A and B at distance d(A, B): • Define G = (V, E) • V := contigs • E := (A, B) such that d(A, B) < C • Reason to do so: efficiency; full shortest paths cannot be computed

  39. 3. Link Contigs into Supercontigs. Define T: contigs linked to either A or B. Fill the gap between A and B if there is a path in G passing only through contigs in T
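
The graph definition and the gap-filling rule from the last two slides can be sketched together: build G from pairwise distances below a cutoff, then look for a path from A to B whose interior contigs all lie in T. The breadth-first search, the dictionary layout, and the names are assumptions for illustration, not the assembler's actual implementation.

```python
from collections import deque

def build_graph(distances, max_dist):
    """Undirected graph with an edge (a, b) whenever d(a, b) < max_dist."""
    graph = {}
    for (a, b), d in distances.items():
        if d < max_dist:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

def gap_path(graph, a, b, allowed):
    """Shortest (fewest hops) path from a to b using only contigs in allowed
    as intermediate steps; returns None if there is no such path."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent and (nxt == b or nxt in allowed):
                parent[nxt] = node
                queue.append(nxt)
    return None

distances = {("A", "X"): 100, ("X", "Y"): 200, ("Y", "B"): 150,
             ("A", "Z"): 50, ("Z", "B"): 50, ("A", "B"): 99999}
g = build_graph(distances, max_dist=1000)
path = gap_path(g, "A", "B", allowed={"X", "Y"})
```

Restricting the search to T keeps it local, which matches the efficiency argument given for defining G with a distance cutoff.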

  40. 4. Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting
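
Weighted voting for the consensus can be sketched column by column. The slide does not say what the weights are; this sketch assumes each base votes with its quality score and that the reads are already aligned to equal length with `-` marking gaps. The function name and toy data are illustrative.

```python
from collections import defaultdict

def consensus(aligned_reads, qualities):
    """Quality-weighted majority vote per alignment column.

    aligned_reads: equal-length strings, '-' for a gap.
    qualities: one list of per-column weights for each read.
    """
    result = []
    for col in range(len(aligned_reads[0])):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            votes[read[col]] += quals[col]
        # Pick the symbol with the largest total weight in this column.
        result.append(max(votes, key=votes.get))
    return "".join(result)

reads = ["TAGATTACA", "TAGATTACA", "TAG-TTACA", "TAGCTTACA"]
quals = [
    [30] * 9,
    [30] * 9,
    [30, 30, 30, 5, 30, 30, 30, 30, 30],
    [30, 30, 30, 10, 30, 30, 30, 30, 30],
]
seq = consensus(reads, quals)
```

In the fourth column the two high-quality A calls outweigh the gap and the low-quality C, so the consensus keeps A.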

  41. Mouse Genome. Several heuristics applied iteratively: breaking supercontigs that are suspicious; rejoining supercontigs. Size of problem: 32,000,000 reads. Time: 15 days, 1 processor. Memory: 28 Gb. N50 contig size: 16.3 Kb → 24.8 Kb. N50 supercontig size: 0.265 Mb → 16.9 Mb
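
The N50 statistic quoted here is the length such that contigs of at least that length account for half or more of the total assembled bases. A short sketch of the usual computation (the function name is illustrative):

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least
    half of the total assembled bases. Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
value = n50(contig_lengths)
```

For the toy list the total is 54, and the three longest contigs (10 + 9 + 8 = 27) already reach half of it, so the N50 is 8.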

  42. Mouse Assembly

  43. Sequencing in the (near) future. [Figure: a microfluidic chip with inlet and outlet, mounted over a CMOS chip with photodiodes; dimensions on the order of 4 to 10 mm]
