CSE182-L10

CSE182-L10. LW statistics/Assembly. Whole Genome Shotgun: break up the entire genome into pieces, sequence the ends, and assemble using a computer. LW statistics and repeats argue against the success of such an approach.

albert
Presentation Transcript


  1. CSE182-L10 LW statistics/Assembly

  2. Whole Genome Shotgun • Break up the entire genome into pieces • Sequence ends, and assemble using a computer • LW statistics & Repeats argue against the success of such an approach • Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together.

  3. Questions • Algorithmic: How do you put the genome back together from the pieces? • Statistical: How many pieces do you need to sequence, etc.? • The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.

  4. Lander Waterman Statistics • The fragments are falling randomly on the genome • Overlapping fragments form islands of contiguous sequence. • Ideally, we want one island for each chromosome. How many fragments should we sequence? (Figure: fragments of length L falling on a genome of length G.)

  5. Lander Waterman Statistics (Figure: fragments of length L falling on a genome of length G.)

  6. LW statistics: questions • As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. • Q1: What is the expected number of islands? • Ans: N exp(-cσ), where σ = 1 - T/L and T is the minimum overlap needed to detect that two fragments overlap. • As c increases, the number of islands increases at first, and then gradually decreases.

  7. Analysis: Expected Number Islands • Computing Expected # islands. • Let Xi=1 if an island ends at position i, Xi=0 otherwise. • Number of islands = ∑i Xi • Expected # islands = E(∑i Xi) = ∑i E(Xi)

  8. Prob. of an island ending at i • E(Xi) = Prob(island ends at position i) • = Prob(a clone began at position i-L+1 AND no clone began in the next L-T positions) (Figure: a clone of length L ending at position i, with minimum overlap T.)
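The probability above can be worked through to the expected island count. The following is a sketch under the usual Lander-Waterman assumptions, none of which are spelled out on the slide: clones start independently at each position with probability α = N/G, coverage is c = LN/G, and σ = 1 - T/L.

```latex
\begin{align*}
E(X_i) &= \Pr[\text{a clone begins at } i-L+1] \cdot
          \Pr[\text{no clone begins in the next } L-T \text{ positions}] \\
       &= \alpha\,(1-\alpha)^{L-T}
        \;\approx\; \alpha\, e^{-\alpha (L-T)}
        \;=\; \alpha\, e^{-c\sigma},
        \qquad \text{since } \alpha (L-T) = \tfrac{N}{G}\, L \left(1-\tfrac{T}{L}\right) = c\sigma, \\
E(\#\text{islands}) &= \sum_{i=1}^{G} E(X_i)
        \;\approx\; G\,\alpha\, e^{-c\sigma}
        \;=\; N e^{-c\sigma}.
\end{align*}
```

The approximation (1-α)^(L-T) ≈ e^(-α(L-T)) is reasonable because α is very small for a genome-sized G.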

  9. LW statistics • Pr[Island contains exactly j clones]? • Consider an island that has already begun. With probability e^{-cσ} (σ = 1 - T/L), it will never be continued. Therefore • Pr[Island contains exactly j clones] = (1 - e^{-cσ})^{j-1} e^{-cσ} • Expected # j-clone islands = N e^{-2cσ} (1 - e^{-cσ})^{j-1}

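  10. Expected # of clones in an island • Expected # of clones in an island = e^{cσ}, with σ = 1 - T/L (N clones divided among N e^{-cσ} islands) • Q: How? Why do we care? • Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length: the observed number of clones per island gives an estimate of c, and G = LN/c.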
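The quantities on the last few slides can be checked numerically. This is a minimal sketch, assuming the standard Lander-Waterman expressions with σ = 1 - T/L; the function names are illustrative and do not come from the lecture.

```python
import math


def expected_islands(n_clones, coverage, sigma):
    """Expected number of islands: N * exp(-c * sigma)."""
    return n_clones * math.exp(-coverage * sigma)


def prob_island_has_j_clones(j, coverage, sigma):
    """Geometric law: j-1 continuations, then the island stops."""
    stop = math.exp(-coverage * sigma)
    return (1.0 - stop) ** (j - 1) * stop


def expected_clones_per_island(coverage, sigma):
    """Mean of the geometric law above: exp(c * sigma)."""
    return math.exp(coverage * sigma)


def estimate_genome_length(n_clones, clone_len, clones_per_island, sigma):
    """Invert exp(c * sigma) to recover c, then use c = L * N / G."""
    coverage = math.log(clones_per_island) / sigma
    return n_clones * clone_len / coverage
```

For example, `expected_islands(6e7, 8, 0.9)` comes out near 45 thousand, and `estimate_genome_length` recovers G when the average number of clones per island has been observed.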

  11. Expected length of an island

  12. Whole Genome Sequencing & Assembly

  13. Whole Genome Shotgun • Break up the entire genome into pieces • Sequence ends, and assemble using a computer • LW statistics & Repeats argue against the success of such an approach

  14. Assembly Basics • Three main components: • Overlap • Layout • Consensus

  15. Overlap • Given a pair of fragments s1 and s2, do they belong together? • Yes, if a prefix of s2 matches a suffix of s1. • How would you compute such a match?

  16. Overlap • S[i,j] = optimum score of an alignment of s1[1..i] against a suffix of s2[1..j] • The best prefix-suffix alignment is given by: • max_i { S[i,n] }, where n is the length of s2
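The recurrence for S[i,j] can be sketched as a small dynamic program. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration; the slide does not specify them.

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, i): the best alignment of the prefix s1[:i]
    against some suffix of the whole of s2."""
    m, n = len(s1), len(s2)
    # S[i][j]: best score of aligning s1[:i] against a suffix of s2[:j].
    # Row 0 is all zeros: the alignment may start anywhere in s2 for free.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap  # s1[:i] aligned against nothing but gaps
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align the two characters
                          S[i - 1][j] + gap,      # gap in s2
                          S[i][j - 1] + gap)      # gap in s1
    # The overlap must run to the end of s2, so only column n is scanned.
    best_i = max(range(m + 1), key=lambda i: S[i][n])
    return S[best_i][n], best_i


score, prefix_len = best_prefix_suffix_overlap("TGCAGGAT", "ACGTTGCA")
```

Here the prefix `TGCA` of the first string matches the suffix `TGCA` of the second, so the call yields a score of 4 with a prefix length of 4. The table costs O(|s1| |s2|) time per pair, which is what makes all-pairs comparison expensive.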

  17. Overlap Detection • Compute the best prefix-suffix alignments between each pair of fragments. • Keep the “high-scoring” ones as evidence of true overlap. • What is the problem?

  18. Overlap detection problem • Consider the number of fragments. The LW statistics say that we need good coverage (c = 8 to 10) to get most of the base-pairs. • G = 3000 Mb, L = 500 • Coverage LN/G = 10 • N = 10 × 3×10^9 / 500 = 6×10^7 • Number of pairwise comparisons needed ≈ N^2 = 3.6×10^15 • Not good! (Only a small fraction are true overlaps)

  19. k-mer based overlap (Pigeonhole principle again) • Consider a 25bp sequence. • Expected number of occurrences in the genome • 3×10^9 × 4^-25 ≈ 2.7×10^-6 • A 25-bp sequence is essentially unique in the genome! • Two overlapping sequences should share a 25-mer • Two non-overlapping sequences should not!
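The back-of-the-envelope numbers on this slide and the previous one can be reproduced directly. This sketch uses only the values given on the slides (G = 3×10^9, L = 500, c = 10, k = 25) and assumes a uniform random genome for the k-mer estimate; the variable names are illustrative.

```python
genome_len = 3_000_000_000   # G, in base pairs
read_len = 500               # L
coverage = 10                # c = L * N / G

# Number of reads needed for the target coverage.
n_reads = coverage * genome_len // read_len

# All-against-all overlap detection compares on the order of N^2 pairs.
n_comparisons = n_reads * n_reads

# Expected occurrences of one fixed 25-mer in a uniform random genome.
k = 25
expected_hits = genome_len * 4.0 ** -k

print(n_reads, n_comparisons, expected_hits)
```

The expected count is far below 1, which is why a shared 25-mer is strong evidence of a true overlap (repeats aside).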

  20. Sorting k-mers • Build a list of k-mers that appear in the sequences and their reverse complements • Create a record with 4 entries: • K-mer • Sequence number • Position in the sequence • Reverse complementation flag • Sort a vector of these according to k-mer • How many records per k-mer are expected? • If number of records exceeds threshold, discard (why?)
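The record-and-sort step can be sketched as follows. The function names, the tuple layout and the threshold parameter are illustrative assumptions, not the layout used by any particular assembler.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_records(reads, k):
    """Build (k-mer, read id, position, is_reverse_complement) records
    for every read and its reverse complement, sorted by k-mer."""
    records = []
    for read_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], read_id, pos, is_rc))
    records.sort()  # equal k-mers become adjacent
    return records


def candidate_pairs(records, max_records_per_kmer):
    """Pairs of reads sharing a k-mer, skipping over-represented k-mers."""
    pairs = set()
    for _, group in groupby(records, key=lambda rec: rec[0]):
        read_ids = [rec[1] for rec in group]
        if len(read_ids) > max_records_per_kmer:
            continue  # too many records: the k-mer is probably from a repeat
        pairs.update(combinations(sorted(set(read_ids)), 2))
    return pairs


reads = ["ACGTACGGA", "ACGGATTTC", "GGGGGGGGG"]
records = kmer_records(reads, k=5)
pairs = candidate_pairs(records, max_records_per_kmer=4)
```

Because the records are sorted, each k-mer's records form one contiguous run, so a single pass finds all candidate pairs. Only those pairs go on to the expensive alignment step, and discarding over-full runs keeps repeat k-mers from generating a quadratic number of false candidates.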

  21. Alignment module • Coalesce k-mer hits into longer, gap-free partial alignments. • These extended k-mer hits are saved. • For each pair of sequences, form a directed graph. • For each maximal path in the graph, construct an alignment. • Refine alignment via banded DP

  22. Problem 2: Size • Islands might simply be too small in length • σ = (1-T/L) = (1-50/500) = 0.9, c = 8. • #Islands = N e^{-cσ} = 45K • Size of an island = 54K • Not enough to make it an acceptable assembly! • PLUS, there is the problem of Repeats, Chimerism etc.

  23. Solution 2: Clones can have mate-pairs • Recall that we sequence about 1000bp of the end of a clone. • If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.

  24. Mate Pairs • Mate-pairs allow you to merge islands (contigs) into super-contigs.

  25. Super-contigs are quite large • Make clones of truly predictable length. Example: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small. • Use the mate-pairs to order and orient the contigs, and make super-contigs.

  26. Whole genome shotgun • Input: • Shotgun sequence fragments (reads) • Mate pairs • Output: • A single sequence created by consensus of overlapping reads • First generation of assemblers did not include mate-pairs (Phrap, CAP..) • Second generation: CA, Arachne, Euler • We will discuss Arachne, a freely available sequence assembler (2nd generation)

  27. Problem 3: Repeats

  28. Repeats & Chimerisms • 40-50% of the human genome is made up of repetitive elements. • Repeats can cause great problems in the assembly! • Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.

  29. Repeat detection • Lander Waterman strikes again! • The expected number of clones in a Repeat containing island is MUCH larger than in a non-repeat containing island (contig). • Thus, every contig can be marked as Unique, or non-unique. In the first step, throw away the non-unique islands.

  30. Detecting Repeat Contigs 1: Read Density • Compute the log-odds ratio of two hypotheses: • H1: The contig is from a unique region of the genome. • H2: The contig is from a region that is repeated at least twice.
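One way to make the log-odds concrete is sketched below. It assumes read starts follow a Poisson process and simplifies H2 to exactly two collapsed copies; both are modelling assumptions for illustration, and the function names are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """Natural log of the Poisson probability of seeing k events."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def unique_vs_repeat_log_odds(contig_len, reads_in_contig,
                              total_reads, genome_len):
    """ln( P(reads | H1: unique) / P(reads | H2: two collapsed copies) ).

    Under H1 the expected number of reads starting in the contig is
    lam = contig_len * N / G; under H2 it is 2 * lam.  The ratio of the
    two Poisson probabilities simplifies to lam - k * ln(2).
    """
    lam = contig_len * total_reads / genome_len
    return lam - reads_in_contig * math.log(2)


# A 10 kb contig with N = 6e7 reads on a 3e9 bp genome expects 200 reads.
print(unique_vs_repeat_log_odds(10_000, 200, 6e7, 3e9))  # positive: unique
print(unique_vs_repeat_log_odds(10_000, 400, 6e7, 3e9))  # negative: repeat
```

A contig with a clearly positive score can be marked unique, and one with a clearly negative score is a repeat candidate. The cutoff between the two is a design choice that the slide does not fix.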

  31. Detecting Chimeric reads • Chimeric reads: Reads that contain sequence from two genomic locations. • Good overlaps: G(a,b) if a and b overlap with a high score • Transitive overlap: T(a,c) if G(a,b) and G(b,c) • Find a point x in a read across which only transitive overlaps occur. Such an x is a point of chimerism.

  32. Contig assembly • Reads are merged into contigs up to repeat boundaries. • If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also, • shift(a,c) = shift(a,b) + shift(b,c) • Most of the contigs are unique pieces of the genome, and end at some Repeat boundary. • Some contigs might be entirely within repeats. These must be detected.

  33. Creating Super Contigs

  34. Supercontig assembly • Supercontigs are built incrementally • Initially, each contig is a supercontig. • In each round, a pair of super-contigs is merged, until no more merges can be performed. • Create a Priority Queue with a score for every pair of ‘mergeable supercontigs’. • Score has two terms: • A reward for multiple mate-pair links • A penalty for distance between the links.

  35. Supercontig merging • Remove the top scoring pair (S1,S2) from the priority queue. • Merge (S1,S2) to form supercontig T. • Remove all pairs in Q containing S1 or S2 • Find all supercontigs W that share mate-pair links with T and insert (T,W) into the priority queue. • Detect Repeated Supercontigs and remove
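The merging loop can be sketched with a binary heap. Everything specific here is an assumption made for illustration: the score `reward * links - penalty * gap`, the rule for combining links after a merge (sum the link counts, keep the smaller gap), the `min_score` cutoff, and the generated `super0`, `super1` names. Orientation, ordering within a supercontig, and repeat detection are left out.

```python
import heapq
from collections import defaultdict
from itertools import count


def merge_supercontigs(contigs, links, reward=1.0, penalty=0.001,
                       min_score=0.0):
    """contigs: iterable of names; links: {(a, b): (n_links, gap)}."""
    members = {c: [c] for c in contigs}   # live supercontig -> its contigs
    nbrs = defaultdict(dict)              # symmetric mate-pair link table
    for (a, b), link in links.items():
        nbrs[a][b] = nbrs[b][a] = link

    def score(link):
        n_links, gap = link
        return reward * n_links - penalty * abs(gap)

    tie = count()                         # tie-breaker for equal scores
    heap = [(-score(link), next(tie), a, b) for (a, b), link in links.items()]
    heapq.heapify(heap)
    new_ids = count()

    while heap:
        neg_score, _, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue                      # stale pair: S1 or S2 already merged
        if -neg_score < min_score:
            break                         # best remaining pair is too weak
        t = f"super{next(new_ids)}"
        members[t] = members.pop(a) + members.pop(b)
        # Re-link every live neighbour W of S1 or S2 to the new supercontig T.
        for w in (set(nbrs[a]) | set(nbrs[b])) - {a, b}:
            if w not in members:
                continue
            parts = [nbrs[s][w] for s in (a, b) if w in nbrs[s]]
            link = (sum(n for n, _ in parts), min(g for _, g in parts))
            nbrs[t][w] = nbrs[w][t] = link
            heapq.heappush(heap, (-score(link), next(tie), t, w))
    return sorted(members.values())


result = merge_supercontigs(
    ["A", "B", "C", "D"],
    {("A", "B"): (5, 100), ("B", "C"): (2, 1000), ("C", "D"): (1, 50000)},
)
```

Instead of physically removing every pair that contains S1 or S2, the sketch leaves them in the heap and skips them when popped, a common way to get the same effect with `heapq`. In the example, A and B merge first, C joins next, and D stays separate because its only link is too distant to score above the cutoff.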

  36. Repeat Supercontigs • If the distance between two super-contigs is not correct, they are marked as Repeated • If transitivity is not maintained, then there is a Repeat

  37. Filling gaps in Supercontigs

  38. Consensus Derivation • Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
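Once the reads are laid out in a multiple alignment, one simple way to call the consensus is a per-column majority vote. This is a swapped-in simplification for illustration, not necessarily the method the lecture has in mind; the row format (spaces where a read does not reach, `-` for alignment gaps) and the function name are assumptions.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-", uncovered=" "):
    """Majority-vote consensus over the columns of a multiple alignment."""
    width = max(len(row) for row in aligned_reads)
    consensus = []
    for col in range(width):
        column = [row[col] for row in aligned_reads
                  if col < len(row) and row[col] != uncovered]
        if not column:
            continue                      # no read covers this column
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:                 # a majority gap emits nothing
            consensus.append(symbol)
    return "".join(consensus)


layout = [
    "ACGT-ACG    ",
    "  GTTACGTT  ",
    "    -ACGTTAG",
]
print(column_consensus(layout))
```

In this layout the extra `T` in the second read is outvoted by gaps in the other two rows, so it is treated as an insertion error and dropped from the consensus.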

  39. Summary • Whole genome shotgun is now routine: • Human, Mouse, Rat, Dog, Chimpanzee.. • Many Prokaryotes (One can be sequenced in a day) • Plant genomes: Arabidopsis, Rice • Model organisms: Worm, Fly, Yeast • A lot is not known about genome structure, organization and function. • Comparative genomics offers low hanging fruit
