
CSCI2950-C Lecture 10 Cancer Genomics: Duplications

October 21, 2008. http://cs.brown.edu/courses/csci2950-c/




  1. CSCI2950-C Lecture 10: Cancer Genomics: Duplications • October 21, 2008 • http://cs.brown.edu/courses/csci2950-c/

  2. Outline • Cancer Genomes • Duplications and End Sequence Profiling • Comparative Genomic Hybridization • Cancer Progression Models

  3. End Sequence Profiling (ESP), C. Collins and S. Volik (2003) • Pieces of cancer genome: clones (100-250 kb). • Sequence ends of clones (500 bp). • Map end sequences to the human genome. • Each clone corresponds to a pair of end sequences (ES pair) (x, y). • Retain clones that correspond to a unique ES pair.

  4. Valid ES pairs • Lmin ≤ y - x ≤ Lmax, where Lmin (Lmax) is the minimum (maximum) clone size. • Convergent orientation.
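As a sketch, the validity test above can be written as a small predicate. The bounds and the (x, y) convention follow the slide; the function name and interface are illustrative, not from the lecture:

```python
# Clone size bounds from the slide (100-250 kb); names are illustrative.
L_MIN = 100_000
L_MAX = 250_000

def is_valid_es_pair(x, y, convergent, l_min=L_MIN, l_max=L_MAX):
    """An ES pair (x, y) is valid when the ends map in convergent
    orientation and the implied clone length y - x is within bounds."""
    return convergent and l_min <= y - x <= l_max

# A 150 kb convergent pair is valid; a 400 kb span suggests a rearrangement.
print(is_valid_es_pair(1_000_000, 1_150_000, convergent=True))
print(is_valid_es_pair(1_000_000, 1_400_000, convergent=True))
```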

  5. Invalid ES pairs • Indicate a putative rearrangement in cancer. • ES directions point toward the breakpoints (a, b): Lmin ≤ |x - a| + |y - b| ≤ Lmax.

  6. ESP of a Normal Cell • All ES pairs valid: Lmin ≤ y - x ≤ Lmax. • 2D representation: each ES pair is plotted as a point (x, y), with both axes in genome coordinates.

  7. ESP of a Tumor Cell • Valid ES pairs satisfy the length/direction constraints: Lmin ≤ y - x ≤ Lmax. • Invalid ES pairs indicate rearrangements or experimental errors.

  8. What about duplications? • 11,240 ES pairs: 10,453 valid (black) and 737 invalid, of which 489 are isolated (red) and 248 form 70 clusters (blue). • 33/70 clusters; total length: 31 Mb.

  9. Structure of Duplications in Tumors? • Mechanisms not well understood. • Duplicated segments may co-localize (Guan et al., Nat. Gen. 1994).

  10. Structure of Duplications: Approaches • Model-free: assemble the tumor genome. • Model-based: use knowledge of duplication mechanisms.

  11. Duplication by Amplisome (Maurer, et al. 1987; Wahl, 1989…) • Other terms: • Episome • Amplicon • Double-minute

  12. Amplisome Reconstruction Problem • Assume: the tumor genome sequence is known; insertions are independent, i.e. no insertions within insertions. • Approach: identify the duplicated sequences A1, …, Am; the amplisome is the shortest common superstring of A1, …, Am.
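Shortest common superstring is NP-hard in general, so a heuristic is typically used in practice. Below is a minimal sketch of the standard greedy merge heuristic, for illustration only (it is not necessarily the algorithm used in the lecture's work):

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    """Greedy shortest-common-superstring approximation: repeatedly
    merge the pair of strings with the largest overlap."""
    # Drop strings contained in another string; they add nothing.
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0]

# Duplicated segments BCD and CDE merge into the candidate amplisome BCDE.
print(greedy_scs(["BCD", "CDE"]))
```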

  13. Analyzing Duplications [Figure: tumor genome vs. human genome, with segments A-E and clones u, v, w; the structure of the duplication is ambiguous (????)]

  14. Analyzing Duplications [Figure: tumor vs. human genome; the same ES pairs are consistent with either a duplication or an overlapping duplication and rearrangement]

  15. Analyzing Duplications [Figure: tumor vs. human genome; an additional ES pair resolves the duplication from the overlapping duplication-and-rearrangement alternative]

  16. Duples and Boundary Elements • Call this configuration a duple with boundary elements v and w. [Figure: duplication with segments A-E and clones u, v, w]

  17. Duplications in the ESP Graph • The boundary elements v, w are vertices in the ESP graph. [Figure: duple represented in the ESP graph over segments A-E]

  18. Duplications in the ESP Graph • The boundary elements v, w are vertices in the ESP graph. • A path between the boundary elements resolves the duple.

  19. Duplication Complications • These configurations are frequent in the MCF7 data. [Figure: configuration with segments A, B, C, E and clones u, v, w whose structure is ambiguous (????)]

  20. Resolving Duplications as Paths • A path between the duple's boundary elements resolves the duple. [Figure: segments A-E with clones u, v, w]

  21. Resolving Duplications as Paths • Multiple paths may exist between the duple's boundary elements. [Figure: segments A, B, C, E with clones u, v, w]

  22. ESP Amplisome Reconstruction Problem • Assume: insertions are independent, i.e. no insertions within insertions. • Approach: identify the endpoints of duplications (v1, w1), …, (vm, wm); the amplisome is the shortest common superpath in the ESP graph containing the subpaths v1…w1, v2…w2, …, vm…wm.

  23. Reconstructed MCF7 Amplisome • Chromosomes 1, 3, 17, 20. • The amplisome model explains 24 of the 33 invalid clusters (total length: 31 Mb). • Raphael and Pevzner (2004) Bioinformatics.

  24. DNA Basepairing

  25. DNA Microarrays

  26. Measuring Mutations in Cancer Comparative Genomic Hybridization (CGH)

  27. CGH Analysis (1) • Divide the genome into segments of equal copy number. [Figure: log2(R/G) vs. genomic position, showing a deletion and an amplification]

  28. CGH Analysis (2) • Identify aberrations common to multiple samples. • Example: chromosome 3 of 26 lung tumor samples on a mid-density cDNA array; a common deletion is located in 3p21 and a common amplification in 3q.

  29. CGH Analysis (1) • Divide the genome into segments of equal copy number (a copy number profile). • Numerous segmentation methods exist (e.g. clustering, hidden Markov models, Bayesian methods). • Segmentation: Input: yi = log2(Ti/Ri), clone i = 1, …, N. Output: an assignment s(yi) ∈ {S1, …, SK}, where the Si represent copy number states.

  30. An Approach to CGH Segmentation • Circular Binary Segmentation (CBS), Olshen et al. 2004. • Uses a t-test to compare the means of two intervals. [Figure: log2 ratio vs. genomic position, showing a deletion and an amplification]
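The core statistic is easy to sketch. CBS proper tests arc-based splits of a circularized segment and assesses significance by permutation; the pooled two-sample t statistic below is only the comparison-of-means ingredient, with illustrative data:

```python
import math

def t_statistic(segment, rest):
    """Pooled-variance two-sample t statistic comparing the mean
    log-ratio inside a candidate interval against the rest of the
    chromosome (the comparison underlying a CBS-style split test)."""
    n1, n2 = len(segment), len(rest)
    m1, m2 = sum(segment) / n1, sum(rest) / n2
    v1 = sum((x - m1) ** 2 for x in segment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in rest) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# An amplified run of probes against a flat background gives a large t.
amplified = [0.50, 0.55, 0.45, 0.50, 0.52]
background = [0.00, 0.02, -0.02, 0.01, -0.01, 0.00]
print(t_statistic(amplified, background))
```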

  31. Interval Score (Lipson et al., J. Computational Biology, 2006) • Assume: the Xi are independent and normally distributed; µ and σ denote the mean and standard deviation of the normal genomic data. • Given an interval I spanning k probes, we define its score as S(I) = (Σi∈I Xi - kµ) / (σ√k).

  32. Significance of the Interval Score (Lipson et al., J. Computational Biology, 2006) • Assume: Xi ~ N(µ, σ).

  33. The MaxInterval Problem • Input: a vector X = (X1, …, Xn). • Output: an interval I ⊆ [1…n] that maximizes S(I). • Other intervals with high scores may be found by calling this function recursively. • Exhaustive algorithm: O(n^2).
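A sketch of the exhaustive O(n^2) search, assuming the Lipson-style score S(I) = (Σi∈I Xi - kµ)/(σ√k) for an interval of k probes; prefix sums make each interval's score O(1):

```python
import math

def max_interval_exhaustive(x, mu=0.0, sigma=1.0):
    """Try all O(n^2) intervals and return the one maximizing
    S(I) = (sum(x[i:j]) - k*mu) / (sigma * sqrt(k)), with k = j - i."""
    n = len(x)
    prefix = [0.0] * (n + 1)
    for i, v in enumerate(x):
        prefix[i + 1] = prefix[i] + v
    best_score, best_interval = float("-inf"), None
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            score = (prefix[j] - prefix[i] - k * mu) / (sigma * math.sqrt(k))
            if score > best_score:
                best_score, best_interval = score, (i, j)
    return best_interval, best_score

# A run of elevated probes stands out against a zero background.
data = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
print(max_interval_exhaustive(data))   # the interval covering the 1.0 run
```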

  34. MaxInterval Algorithm I: LookAhead • Assume given: m, an upper bound on the value of a single element Xi; t, a lower bound on the maximum score. • If the current interval I = [i, …, i+k-1] has sum s = Σj∈I Xj, then an extended interval I′ = [i, …, i+k+r-1] has sum at most s + mr and length k + r, which bounds its score. • Solve for the first r for which S(I′) may exceed t; shorter extensions need not be scored. • Complexity: expected O(n^1.5) (unproved).
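A minimal illustration of the LookAhead bound: with the current sum s over k elements, any extension by r elements adds at most m·r, so the extended score is at most (s + mr - (k+r)µ)/(σ√(k+r)); we can skip to the first r where this optimistic bound reaches the threshold t. The linear scan below is only for clarity; the actual algorithm solves for r in closed form:

```python
import math

def lookahead_skip(s, k, m, t, mu=0.0, sigma=1.0, max_r=1_000_000):
    """Smallest extension r at which the optimistic score bound
    (s + m*r - (k + r)*mu) / (sigma * sqrt(k + r)) could reach t.
    Extensions shorter than r cannot beat t and are skipped."""
    for r in range(max_r + 1):
        bound = (s + m * r - (k + r) * mu) / (sigma * math.sqrt(k + r))
        if bound >= t:
            return r
    return None  # the bound never reaches t within max_r

# Starting from a single element with sum 0, elements bounded by m = 1,
# a score of t = 2 is out of reach until at least 5 more elements.
print(lookahead_skip(0, 1, 1, 2))
```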

  35. MaxInterval Algorithm II: Geometric Family Approximation (GFA) • For ε > 0, define a geometric family of intervals whose lengths grow geometrically. • Theorem: let I* be the optimal scoring interval, and let J be the leftmost longest interval of the family fully contained in I*. Then S(J) ≥ S(I*)/α, where α is a constant determined by ε. • Complexity: O(n).

  36. Benchmarking • Benchmarking results of the Exhaustive, LookAhead and GFA algorithms on synthetic vectors of varying lengths. • Linear regression suggests that their complexities are O(n^2), O(n^1.5), and O(n), respectively.

  37. Applications: Single Samples • Chromosome 16 of the HCT116 colon carcinoma cell line on a high-density oligo array (n = 5,464); highlighted regions include A2BP1 and FRA16B. • Chromosome 17 of several breast carcinoma cell lines on a mid-density cDNA array (n = 364); highlighted regions include ERBB2. [Figures: log2(ratio) vs. position, 0-75 Mbp]

  38. Another Approach to CGH Segmentation • Use a hidden Markov model (HMM) to "parse" the sequence of probes into copy number states. [Figure: log2 ratio vs. genomic position, showing a deletion and an amplification]

  39. Hidden Markov Models [Figure: HMM trellis with states 1…K in each column and emitted observations x1, x2, x3, …, xK]

  40. Example: The Dishonest Casino • A casino has two dice: • Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2 • The casino player switches back-and-forth between the fair and loaded die about once every 20 turns. • Game: • You bet $1 • You roll (always with a fair die) • The casino player rolls (maybe with the fair die, maybe with the loaded die) • Highest number wins $2
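The generative process described above is easy to simulate; this sketch (function name and seed are illustrative) makes the hidden-state structure concrete:

```python
import random

def simulate_casino(n_rolls, switch_p=0.05, seed=0):
    """Simulate the dishonest casino: a hidden die (F = fair, L = loaded)
    that switches with probability switch_p = 1/20 per turn, emitting a
    roll from the current die."""
    rng = random.Random(seed)
    fair = [1, 2, 3, 4, 5, 6]                 # each face has prob 1/6
    loaded = [1, 2, 3, 4, 5] + [6] * 5        # 6 has prob 1/2, others 1/10
    state, states, rolls = "F", [], []
    for _ in range(n_rolls):
        if rng.random() < switch_p:
            state = "L" if state == "F" else "F"
        states.append(state)
        rolls.append(rng.choice(fair if state == "F" else loaded))
    return states, rolls

states, rolls = simulate_casino(100)
print("".join(states))
print("".join(str(r) for r in rolls))
```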

  41. Question #1 - Evaluation • GIVEN: a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 • QUESTION: how likely is this sequence, given our model of how the casino works? • This is the EVALUATION problem in HMMs. • Prob = 1.3 × 10^-35

  42. Question #2 - Decoding • GIVEN: a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 • QUESTION: what portion of the sequence was generated with the fair die, and what portion with the loaded die? • This is the DECODING question in HMMs, and it is what we want to solve for CGH analysis. • Answer: FAIR, then LOADED, then FAIR.
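Decoding is solved by the Viterbi algorithm (dynamic programming over states). A compact sketch applied to the casino model, with illustrative parameter names:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an HMM (the DECODING problem)."""
    # V[t][s] = (best probability of a path ending in state s at time t,
    #            predecessor state at time t - 1)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for x in obs[1:]:
        prev = V[-1]
        V.append({s: max(((prev[r][0] * trans_p[r][s] * emit_p[s][x], r)
                          for r in states), key=lambda pr: pr[0])
                  for s in states})
    # Backtrack from the best final state.
    state = max(V[-1], key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(V) - 1, 0, -1):
        state = V[t][state][1]
        path.append(state)
    return path[::-1]

# The dishonest casino model: F = fair die, L = loaded die.
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

# A run of sixes in the middle should be decoded as the loaded die.
rolls = "12143512" + "666666" + "21534512"
print("".join(viterbi(rolls, states, start, trans, emit)))
```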

  43. Question # 3 – Learning GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs Prob(6) = 64%

  44. Definition of a Hidden Markov Model • Alphabet Σ = {b1, b2, …, bM} • Set of states Q = {1, …, K} • Transition probabilities between any two states: aij = transition probability from state i to state j, with ai1 + … + aiK = 1 for all states i = 1…K • Start probabilities a0i, with a01 + … + a0K = 1 • Emission probabilities within each state: ek(b) = P(xi = b | πi = k), with ek(b1) + … + ek(bM) = 1 for all states k = 1…K

  45. The Dishonest Casino Model • Two states, FAIR and LOADED, with transition probabilities 0.95 (stay) and 0.05 (switch) in each direction. • Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.

  46. An HMM Is Memory-less • At each time step t, the only thing that affects future states is the current state πt: P(πt+1 = k | "whatever happened so far") = P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt) = P(πt+1 = k | πt)

  47. A Parse of a Sequence • Given a sequence x = x1……xN, a parse of x is a sequence of states π = π1, ……, πN. [Figure: HMM trellis with the parse path highlighted]

  48. Likelihood of a Parse • Simply multiply all the transition and emission probabilities along the parse path (the orange arrows in the trellis figure).

  49. Likelihood of a Parse • Given a sequence x = x1……xN and a parse π = π1, ……, πN, the likelihood of the parse given our HMM is: P(x, π) = P(x1, …, xN, π1, ……, πN) = P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1) = P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1) = a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN). • A compact way to write this: number all parameters aij and ei(b) as θ1, …, θn (example: a0Fair = θ1; a0Loaded = θ2; …; eLoaded(6) = θ18). Count in (x, π) the number of times each parameter j = 1, …, n occurs: F(j, x, π) = # of times parameter θj occurs in (x, π) (call F(·,·,·) the feature counts). Then P(x, π) = Πj=1…n θj^F(j, x, π) = exp[Σj=1…n log(θj) F(j, x, π)].

  50. Example: The Dishonest Casino • Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 • What is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (Say the initial probabilities are a0Fair = ½, a0Loaded = ½.) • ½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5 × 10^-9
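The computation on this slide can be checked directly. The helper below implements the parse-likelihood formula P(x, π) = a0π1 Πt aπt-1πt eπt(xt) for the casino model (the dictionary names are illustrative):

```python
def parse_likelihood(x, pi, start_p, trans_p, emit_p):
    """P(x, pi): product of the start, transition and emission
    probabilities along the parse pi."""
    p = start_p[pi[0]] * emit_p[pi[0]][x[0]]
    for t in range(1, len(x)):
        p *= trans_p[pi[t - 1]][pi[t]] * emit_p[pi[t]][x[t]]
    return p

# The dishonest casino model: F = fair die, L = loaded die.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

rolls = "1215621524"           # the slide's sequence 1,2,1,5,6,2,1,5,2,4
p = parse_likelihood(rolls, ["F"] * 10, start, trans, emit)
print(p)                       # about 5.2e-9, matching 1/2 * (1/6)^10 * 0.95^9
```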
