CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly


  1. CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly http://cs.brown.edu/courses/csci2950-c/

  2. DNA sequencing How we obtain the sequence of nucleotides of a species 5’ 3’ …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT… 3’ 5’

  3. Outline • DNA Sequencing Technology • Fragment Assembly Problem • Overlap-Layout-Consensus • Sequencing by Hybridization • Eulerian and Hamiltonian Graphs • Next-generation DNA Sequencing

  4. DNA Replication

  5. DNA Sequencing – gel electrophoresis • Start at fixed location (primer) • Grow DNA chain • Include dideoxynucleoside (modified A, C, G, T) • Stops reaction at all possible points • Separate products by length, using gel electrophoresis

  6. Technical Limitations • Need a lot of DNA • Reaction only works for ~500 bp Solutions: Biology, Computer Science

  7. DNA Sequencing – Cloning DNA Shake DNA fragments Known location (restriction site) Vector Circular genome (bacterium, plasmid) + = Many host cells → DNA amplification

  8. Different types of vectors

  9. DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in DNA Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output Can only sequence ~500 letters at a time

  10. Electrophoresis diagrams

  11. Challenging to read answer

  12. Challenging to read answer

  13. Challenging to read answer

  14. Reading an electropherogram • Filtering • Smoothing • Correction for length compressions • A method for calling the letters – PHRED (PHil’s Read EDitor, by Phil Green) Several better methods exist, but labs are reluctant to change

  15. Output of PHRED: a read A read: 500-700 nucleotides A C G A A T C A G …A 16 18 21 23 25 15 28 30 32 …21 Quality scores: $Q = -10 \log_{10} P(\text{error})$ Reads can be obtained from leftmost, rightmost ends of the insert Double-barreled sequencing: (1990) Both leftmost & rightmost ends are sequenced, reads are paired
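
The quality-score formula on this slide can be checked numerically. The sketch below converts between a Phred score and an error probability; the function names are illustrative and not part of PHRED itself, and the score list is the one shown on the slide.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P): error probability implied by a quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score for an estimated per-base error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Quality scores listed on the slide for the first bases of a read.
slide_scores = [16, 18, 21, 23, 25, 15, 28, 30, 32]
for q in slide_scores:
    print(f"Q={q:2d}  P(error)={phred_to_error_prob(q):.5f}")
```

Under this definition, a score of 20 corresponds to a 1% chance that the base call is wrong, and a score of 30 to 0.1%.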

  16. Method to sequence longer regions genomic segment cut many times at random (Shotgun) Get one or two reads from each segment ~500 bp ~500 bp

  17. Whole Genome Shotgun Sequencing genome → cut many times at random → plasmids (2 – 10 Kbp), cosmids (40 Kbp) → forward-reverse paired reads (mate pair) of ~500 bp each, a known distance apart

  18. Sequencing and Fragment Assembly AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT $3 \times 10^9$ nucleotides

  19. Fragment Assembly • Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“superstring”)

  20. Shortest Common Superstring Problem (SCS) • Problem: Given a set of strings, find a shortest string that contains all of them • Input: Strings $s_1, s_2, \ldots, s_n$ • Output: A string s that contains all strings $s_1, s_2, \ldots, s_n$ as substrings, such that the length of s is minimized Note: this formulation does not take into account sequencing errors

  21. Shortest Common Superstring Problem: Example

  22. Greedy Algorithm for SCS • Find the pair of strings si, sj with the longest overlap • Merge(si, sj) • Recurse on the reduced set until one string remains How “good” is the greedy algorithm?
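
The greedy procedure above can be sketched in a few lines. This is a brute-force illustration, not an assembler: it assumes error-free strings, compares all pairs in every round, and breaks ties by taking the first best pair it finds. The names `overlap` and `greedy_scs` are made up for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_scs(strings):
    """Greedy superstring: repeatedly merge the pair with the longest overlap."""
    # Drop duplicates and strings already contained in another string.
    reads = list(dict.fromkeys(strings))
    reads = [s for s in reads if not any(s != t and s in t for t in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0] if reads else ""


print(greedy_scs(["ATC", "CCA", "CAG", "TCC", "AGT"]))
```

On this small input the greedy result happens to be optimal. In general it is only guaranteed to be within the approximation bound discussed on the next slide.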

  23. Algorithm Analysis S = {s1, s2, …, sn} Opt(S) = length of SCS Claim: Greedy solution ≤ 4 Opt(S). Conjecture: Greedy solution ≤ 2 Opt(S). [Best known bound 2.5] Theorem: SCS is NP-hard (its decision version is NP-complete)

  24. SCS and Overlap Graph Build directed graph G = (V,E) V = {s1, s2, …, sn} e = (si, sj) if a prefix of sj matches a suffix of si w(si, sj) = −(length of overlap between si and sj) Goal: Find a minimum weight path visiting every VERTEX exactly once in the OVERLAP graph: Travelling Salesman (Hamiltonian path) problem

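  25. SCS to TSP: An Example S = { ATC, CCA, CAG, TCC, AGT } SCS: ATCCAGT, obtained by overlapping ATC, TCC, CCA, CAG, AGT in that order TSP: [Figure: overlap graph on ATC, CCA, CAG, TCC, AGT with overlap lengths 0, 1, 2 as edge labels]; the path ATC → TCC → CCA → CAG → AGT spells ATCCAGT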
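
The reduction to a Hamiltonian path problem can be made concrete on the five strings of this example. The sketch below tries every vertex ordering of the overlap graph and keeps the shortest spelled string. It is exponential in the number of strings, so it only illustrates the formulation; it also assumes that no input string is a substring of another. The function names are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Edge label in the overlap graph: longest suffix of a equal to a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def scs_by_hamiltonian_path(strings):
    """Try every vertex order; the path with the largest total overlap is the SCS."""
    best = None
    for order in permutations(strings):
        spelled = order[0]
        for prev, nxt in zip(order, order[1:]):
            spelled += nxt[overlap(prev, nxt):]
        if best is None or len(spelled) < len(best):
            best = spelled
    return best


S = ["ATC", "CCA", "CAG", "TCC", "AGT"]
print(scs_by_hamiltonian_path(S))
```

Maximizing total overlap is the same as minimizing the path weight when each edge weight is the negated overlap length, which is how the previous slide states the goal.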

  26. Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm

  27. Strategies for whole-genome sequencing (1) Hierarchical – Clone-by-clone • Break genome into many long pieces • Map each long piece onto the genome • Sequence each piece with shotgun Example: Yeast, Worm, Human, Rat (2) Online version of (1) – Walking • Break genome into many long pieces • Start sequencing each piece with shotgun • Construct map as you go Example: Rice genome (3) Whole genome shotgun One large shotgun pass on the whole genome Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

  28. Reconstructing the Sequence (Fragment Assembly) reads Cover region with ~7-fold redundancy (7X) Overlap reads and extend to reconstruct the original genomic region

  29. Definition of Coverage C Length of genomic segment: G Number of reads: N Length of each read: L Definition: Coverage C = N L / G How much coverage is enough? Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region /1,000,000 nucleotides
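
The figure of about one gap per 1,000,000 nucleotides at C=10 can be reproduced with a short calculation. The sketch below assumes 500 bp reads (the read length quoted earlier in the lecture) and uses the standard Lander-Waterman estimate that the expected number of gaps is roughly N·e^(−C) when the minimum detectable overlap is ignored. That formula is not stated on the slide, and the function names are illustrative.

```python
import math


def reads_needed(genome_len: int, read_len: int, target_coverage: float) -> int:
    """Number of reads N such that N * L / G reaches the target coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate expected number of uncovered gaps: N * exp(-C), C = N * L / G."""
    c = n_reads * read_len / genome_len
    return n_reads * math.exp(-c)


G, L = 1_000_000, 500          # 1 Mbp segment, 500 bp reads (assumed)
N = reads_needed(G, L, 10)     # reads needed for C = 10
print(N, round(expected_gaps(N, L, G), 3))
```

With these assumptions, 20,000 reads give C=10 and an expected gap count just under one, consistent with the slide.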

  30. Lander-Waterman Statistics Given: N reads of length L from a genome of size G $P(\zeta \text{ covered by a read}) = 1 - (1 - L/G)^N \approx 1 - e^{-C}$, where $C = N L / G$ is the coverage
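
To see how close the exponential approximation is, the two expressions can be evaluated side by side. The parameter values below (a 1 Mbp genome, 500 bp reads, 7-fold coverage) are assumptions chosen for illustration.

```python
import math


def prob_covered_exact(n_reads: int, read_len: int, genome_len: int) -> float:
    """1 - (1 - L/G)^N: chance that at least one of N reads covers a position."""
    return 1 - (1 - read_len / genome_len) ** n_reads


def prob_covered_approx(coverage: float) -> float:
    """1 - e^(-C): the approximation used when L/G is small."""
    return 1 - math.exp(-coverage)


G, L, N = 1_000_000, 500, 14_000   # assumed values giving C = 7
C = N * L / G
print(C, prob_covered_exact(N, L, G), prob_covered_approx(C))
```

For realistic genome sizes L/G is tiny, so the two values agree to several decimal places.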

  31. Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..

  32. Challenges in Fragment Assembly • Repeats: A major problem for fragment assembly • More than 50% of the human genome consists of repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) [Figure: three copies of a repeat] Green and yellow fragments are interchangeable when assembling repetitive DNA

  33. Repeat Types Bacterial genomes: 5% Mammals: 50% Repeat types: • Low-Complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons • SINE (Short Interspersed Nuclear Elements) e.g., ALU: ~300-long, $10^6$ copies • LINE (Long Interspersed Nuclear Elements) ~4000-long, 200,000 copies • LTR retroposons (Long Terminal Repeats (~700 bp) at each end) cousins of HIV • Gene Families genes duplicate & then diverge (paralogs) • Recent duplications ~100,000-long, very similar copies

  34. Triazzle: A Fun Example The puzzle looks simple BUT there are repeats!!! The repeats make it very difficult. Try it – only $7.99 at www.triazzle.com

  35. Sequencing and Fragment Assembly AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT $3 \times 10^9$ nucleotides 50% of human DNA is composed of repeats Error! Glued together two distant regions

  36. What can we do about repeats? Two main approaches: • Cluster the reads • Link the reads

  37. Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..

  38. Overlap • Find the best match between the suffix of one read and the prefix of another • Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment: $O(N^2 L^2)$ for N reads of length L • Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
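
The dynamic program for a single pair of reads can be sketched as follows. It scores the best alignment between some suffix of read a and some prefix of read b, so that mismatches and indels from sequencing errors are tolerated. The scoring values (+1 match, −1 mismatch, −1 gap) and the function name are assumptions made for this illustration; real assemblers use their own scoring and banding.

```python
def overlap_alignment(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Best suffix(a)-vs-prefix(b) alignment.

    Returns (score, j) where j is the length of the prefix of b that is aligned.
    Runs in O(len(a) * len(b)) time and O(len(b)) memory.
    """
    n, m = len(a), len(b)
    # Row 0: no characters of a used yet, so a prefix of b can only be gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: skipping a prefix of a is free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The alignment must end at the last character of a; pick the best prefix of b.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(overlap_alignment("TAGATTACA", "TTACAGGG"))
```

Each pair costs on the order of L² operations, and there are on the order of N² pairs, which gives the total quoted on the slide and motivates the filtration step.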

  39. Overlapping Reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer via hashing • Extend to full alignment – throw away if not >95% similar [Figure: example alignments T GA / TAGA, TACA / TAGT, and TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC]
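
The k-mer filtration step can be sketched with a hash index from k-mers to the reads that contain them. Only pairs that share at least one k-mer are passed on to the expensive alignment step. This is a simplified illustration: it ignores reverse complements, uses a small k for readability, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return the set of read-index pairs (i, j), i < j, that share some k-mer."""
    index = defaultdict(set)                  # k-mer -> ids of reads containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


reads = ["TAGATTACACAG", "TTACACAGATTAC", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))
```

Each candidate pair would then be extended to a full alignment and discarded if it falls below the similarity threshold given on the slide.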

  40. Overlapping Reads and Repeats • A k-mer that appears M times initiates $M^2$ comparisons • For an Alu that appears $10^6$ times → $10^{12}$ comparisons – too much • Solution: Discard all k-mers that appear more than t × Coverage times (t ~ 10)
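
The frequency cutoff can be illustrated by counting k-mer occurrences and keeping only those at or below t × Coverage. The toy reads, the small k, and the function names below are assumptions for the sake of a short example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def usable_kmers(reads, k, coverage, t=10):
    """Keep k-mers seen at most t * coverage times; more frequent ones look like repeats."""
    limit = t * coverage
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c <= limit}


# Five copies of a "repeat" read and one unique read.
toy_reads = ["ACGTAC"] * 5 + ["TTTTGG"]
print(sorted(usable_kmers(toy_reads, k=4, coverage=1, t=2)))
```

A k-mer from a unique region is expected to occur about Coverage times, so a count far above that suggests a repeat, and skipping it avoids the quadratic blow-up in comparisons.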

  41. Finding Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

  42. Find Overlapping Reads • Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTATTGA TAG-TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG-TTACACAGATTACTGA TAG-TTACACAGATTATTGA insert A correlated errors: probably caused by repeats → disentangle overlaps replace T with C TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA In practice, error correction removes up to 98% of the errors TAG-TTACACAGATTATTGA TAG-TTACACAGATTATTGA

  43. Layout • Repeats are a major challenge • Do two aligned fragments really overlap, or are they from two copies of a repeat? • Solution: repeat masking – hide the repeats!!! • Masking results in a high rate of misassembly (up to 20%) • Misassembly means a lot more work at the finishing step

  44. Merge Reads into Contigs We want to merge reads up to potential repeat boundaries [Figure labels: repeat region, Unique Contig, Overcollapsed Contig]

  45. Repeats, errors, and contig lengths • Repeats shorter than read length are OK • Read that spans across a repeat disambiguates order of flanking regions • Repeats with more base pair diffs than sequencing error rate are OK • We throw out overlaps between two reads in different copies of the repeat • To make the genome appear less repetitive, try to: • Increase read length • Decrease sequencing error rate Role of error correction: Discards up to 98% of single-letter sequencing errors → decreases error rate → decreases effective repeat content → increases contig length

  46. Removing Repeats • A-statistic (Myers) used in Celera assembler • Use Lander-Waterman statistics to estimate likelihood ratio of unique region vs. over-collapsed repeat Normal density Too dense → Overcollapsed
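
One way to make the likelihood ratio concrete is to model read starts as a Poisson process. If a contig of length ρ is unique, it should contain about ρN/G reads; if it is a collapsed two-copy repeat, about twice that. For k observed reads, the log of the ratio of the two Poisson probabilities simplifies to ρN/G − k·ln 2. The sketch below computes that quantity. It is a derivation under these stated assumptions, and the exact form and thresholds used in the Celera assembler may differ.

```python
import math


def log_likelihood_ratio(contig_len, reads_in_contig, total_reads, genome_len):
    """ln[ P(k reads | unique) / P(k reads | two-copy collapse) ] under a Poisson model.

    With expected count m = contig_len * total_reads / genome_len:
      ln[ e^-m m^k / (e^-2m (2m)^k) ] = m - k * ln 2
    """
    expected = contig_len * total_reads / genome_len
    return expected - reads_in_contig * math.log(2)


G, N = 1_000_000, 20_000        # assumed genome size and read count
print(log_likelihood_ratio(10_000, 200, N, G))   # normal density: positive
print(log_likelihood_ratio(10_000, 400, N, G))   # too dense: negative
```

A clearly positive value supports a unique region, while a clearly negative value points to an overcollapsed repeat.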

  47. Link Contigs into Supercontigs Normal density Too dense → Overcollapsed Inconsistent links Overcollapsed? Scaffolding Problem: various heuristic approaches

  48. Link Contigs into Supercontigs (cont’d) Find all links between unique contigs Connect contigs incrementally, if ≥ 2 links supercontig (aka scaffold)
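
The incremental linking step can be sketched as grouping contigs that are joined by enough mate-pair links. The threshold of two links comes from the slide; treating it as "at least two" is an assumption, as are the contig names and the function name. The sketch only groups contigs into scaffolds and ignores their order, orientation, and gap sizes, which a real scaffolder must also resolve.

```python
from collections import Counter


def scaffolds(contigs, mate_links, min_links=2):
    """Group contigs connected by at least min_links mate-pair links (union-find)."""
    link_counts = Counter(frozenset(link) for link in mate_links if link[0] != link[1])
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for pair, count in link_counts.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


links = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("C", "D")]
print(scaffolds(["A", "B", "C", "D"], links))
```

Requiring more than one link guards against a single chimeric or misplaced mate pair joining two unrelated contigs.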

  49. Link Contigs into Supercontigs Fill gaps in supercontigs with paths of repeat contigs Use length/distance constraints

  50. Consensus • A consensus sequence is derived from a profile of the assembled fragments • A sufficient number of reads is required to ensure a statistically significant consensus • Reading errors are corrected
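
A minimal way to derive a consensus from a profile is a per-column majority vote over reads that have already been aligned. The sketch below assumes the rows are padded to a common coordinate system, with "-" marking a gap and a space marking positions a read does not cover. The function name and these conventions are assumptions for illustration, and quality scores are ignored.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per alignment column; columns won by a gap are dropped."""
    width = max(len(r) for r in aligned_reads)
    out = []
    for col in range(width):
        column = [r[col] for r in aligned_reads if col < len(r) and r[col] != " "]
        if not column:
            continue                      # no read covers this column
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)


rows = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(rows))
```

With only three reads a single error is outvoted, but two errors in the same column would win, which is why sufficient coverage is needed for a reliable consensus.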