
CSCI2950-C Lecture 10 Cancer Genomics: Duplications

October 21, 2008. http://cs.brown.edu/courses/csci2950-c/




  1. CSCI2950-C Lecture 10: Cancer Genomics: Duplications • October 21, 2008 • http://cs.brown.edu/courses/csci2950-c/

  2. Outline • Cancer Genomes • Duplications and End Sequence Profiling • Comparative Genomic Hybridization • Cancer Progression Models

  3. End Sequence Profiling (ESP), C. Collins and S. Volik (2003) • Pieces of cancer genome: clones (100-250 kb). • Sequence ends of clones (500 bp). • Map end sequences to the human genome. • Each clone corresponds to a pair of end sequences (ES pair) (x, y). • Retain clones that correspond to a unique ES pair.

  4. Valid ES pairs • Lmin ≤ y - x ≤ Lmax, where Lmin (Lmax) is the minimum (maximum) clone size. • Convergent orientation.
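As a sketch, the validity test above can be written as a small predicate. The bounds and the (x, y) convention follow the slide; the function name and interface are illustrative, not from the lecture:

```python
# Clone size bounds from the slide (100-250 kb); names are illustrative.
L_MIN = 100_000
L_MAX = 250_000

def is_valid_es_pair(x, y, convergent, l_min=L_MIN, l_max=L_MAX):
    """An ES pair (x, y) is valid when the ends map in convergent
    orientation and the implied clone length y - x is within bounds."""
    return convergent and l_min <= y - x <= l_max

# A 150 kb convergent pair is valid; a 400 kb span suggests a rearrangement.
print(is_valid_es_pair(1_000_000, 1_150_000, convergent=True))
print(is_valid_es_pair(1_000_000, 1_400_000, convergent=True))
```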

  5. Invalid ES pairs • Indicate a putative rearrangement in cancer. • ES directions point toward the breakpoints (a, b): Lmin ≤ |x - a| + |y - b| ≤ Lmax.

  6. ESP of a Normal Cell • All ES pairs valid: Lmin ≤ y - x ≤ Lmax. • 2D representation: each ES pair is plotted as a point (x, y), with both axes in genome coordinates.

  7. ESP of a Tumor Cell • Valid ES pairs satisfy the length/direction constraints: Lmin ≤ y - x ≤ Lmax. • Invalid ES pairs indicate rearrangements or experimental errors.

  8. What about duplications? • 11,240 ES pairs: 10,453 valid (black) and 737 invalid, of which 489 are isolated (red) and 248 form 70 clusters (blue). • 33/70 clusters; total length: 31 Mb.

  9. Structure of Duplications in Tumors? • Mechanisms not well understood. • Duplicated segments may co-localize (Guan et al., Nat. Gen. 1994).

  10. Structure of Duplications: Approaches • Model-free: assemble the tumor genome. • Model-based: use knowledge of duplication mechanisms.

  11. Duplication by Amplisome (Maurer, et al. 1987; Wahl, 1989…) • Other terms: • Episome • Amplicon • Double-minute

  12. Amplisome Reconstruction Problem • Assume: the tumor genome sequence is known; insertions are independent, i.e. no insertions within insertions. • Approach: identify the duplicated sequences A1, …, Am; the amplisome is the shortest common superstring of A1, …, Am.
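Shortest common superstring is NP-hard in general, so a heuristic is typically used in practice. Below is a minimal sketch of the standard greedy merge heuristic, for illustration only (it is not necessarily the algorithm used in the lecture's work):

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(strings):
    """Greedy shortest-common-superstring approximation: repeatedly
    merge the pair of strings with the largest overlap."""
    # Drop strings contained in another string; they add nothing.
    strings = [s for s in strings
               if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best_k:
                    best_k, best_i, best_j = overlap(a, b), i, j
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0]

# Duplicated segments BCD and CDE merge into the candidate amplisome BCDE.
print(greedy_scs(["BCD", "CDE"]))
```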

  13. Analyzing Duplications [Figure: tumor genome vs. human genome, with segments A-E and clones u, v, w; the structure of the duplication is ambiguous (????)]

  14. Analyzing Duplications [Figure: tumor vs. human genome; the same ES pairs are consistent with either a duplication or an overlapping duplication and rearrangement]

  15. Analyzing Duplications [Figure: tumor vs. human genome; an additional ES pair resolves the duplication from the overlapping duplication-and-rearrangement alternative]

  16. Duples and Boundary Elements • Call this configuration a duple with boundary elements v and w. [Figure: duplication with segments A-E and clones u, v, w]

  17. Duplications in the ESP Graph • The boundary elements v, w are vertices in the ESP graph. [Figure: duple represented in the ESP graph over segments A-E]

  18. Duplications in the ESP Graph • The boundary elements v, w are vertices in the ESP graph. • A path between the boundary elements resolves the duple.

  19. Duplication Complications • These configurations are frequent in the MCF7 data. [Figure: configuration with segments A, B, C, E and clones u, v, w whose structure is ambiguous (????)]

  20. Resolving Duplications as Paths • A path between the duple's boundary elements resolves the duple. [Figure: segments A-E with clones u, v, w]

  21. Resolving Duplications as Paths • Multiple paths may exist between the duple's boundary elements. [Figure: segments A, B, C, E with clones u, v, w]

  22. ESP Amplisome Reconstruction Problem • Assume: insertions are independent, i.e. no insertions within insertions. • Approach: identify the endpoints of duplications (v1, w1), …, (vm, wm); the amplisome is the shortest common superpath in the ESP graph containing the subpaths v1…w1, v2…w2, …, vm…wm.

  23. Reconstructed MCF7 Amplisome • Chromosomes 1, 3, 17, 20. • The amplisome model explains 24 of the 33 invalid clusters (total length: 31 Mb). • Raphael and Pevzner (2004) Bioinformatics.

  24. DNA Basepairing

  25. DNA Microarrays

  26. Measuring Mutations in Cancer Comparative Genomic Hybridization (CGH)

  27. CGH Analysis (1) • Divide the genome into segments of equal copy number. [Figure: log2(R/G) vs. genomic position, showing a deletion and an amplification]

  28. CGH Analysis (2) • Identify aberrations common to multiple samples. • Example: chromosome 3 of 26 lung tumor samples on a mid-density cDNA array; a common deletion is located in 3p21 and a common amplification in 3q.

  29. CGH Analysis (1) • Divide the genome into segments of equal copy number (a copy number profile). • Numerous segmentation methods exist (e.g. clustering, hidden Markov models, Bayesian methods). • Segmentation: Input: yi = log2(Ti/Ri), clone i = 1, …, N. Output: an assignment s(yi) ∈ {S1, …, SK}, where the Si represent copy number states.

  30. An Approach to CGH Segmentation • Circular Binary Segmentation (CBS), Olshen et al. 2004. • Uses a t-test to compare the means of two intervals. [Figure: log2 ratio vs. genomic position, showing a deletion and an amplification]
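The core statistic is easy to sketch. CBS proper tests arc-based splits of a circularized segment and assesses significance by permutation; the pooled two-sample t statistic below is only the comparison-of-means ingredient, with illustrative data:

```python
import math

def t_statistic(segment, rest):
    """Pooled-variance two-sample t statistic comparing the mean
    log-ratio inside a candidate interval against the rest of the
    chromosome (the comparison underlying a CBS-style split test)."""
    n1, n2 = len(segment), len(rest)
    m1, m2 = sum(segment) / n1, sum(rest) / n2
    v1 = sum((x - m1) ** 2 for x in segment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in rest) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# An amplified run of probes against a flat background gives a large t.
amplified = [0.50, 0.55, 0.45, 0.50, 0.52]
background = [0.00, 0.02, -0.02, 0.01, -0.01, 0.00]
print(t_statistic(amplified, background))
```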

  31. Interval Score (Lipson et al., J. Computational Biology, 2006) • Assume: the Xi are independent and normally distributed; µ and σ denote the mean and standard deviation of the normal genomic data. • Given an interval I spanning k probes, we define its score as S(I) = (Σi∈I Xi - kµ) / (σ√k).

  32. Significance of the Interval Score (Lipson et al., J. Computational Biology, 2006) • Assume: Xi ~ N(µ, σ).

  33. The MaxInterval Problem • Input: a vector X = (X1, …, Xn). • Output: an interval I ⊆ [1…n] that maximizes S(I). • Other intervals with high scores may be found by calling this function recursively. • Exhaustive algorithm: O(n^2).
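A sketch of the exhaustive O(n^2) search, assuming the Lipson-style score S(I) = (Σi∈I Xi - kµ)/(σ√k) for an interval of k probes; prefix sums make each interval's score O(1):

```python
import math

def max_interval_exhaustive(x, mu=0.0, sigma=1.0):
    """Try all O(n^2) intervals and return the one maximizing
    S(I) = (sum(x[i:j]) - k*mu) / (sigma * sqrt(k)), with k = j - i."""
    n = len(x)
    prefix = [0.0] * (n + 1)
    for i, v in enumerate(x):
        prefix[i + 1] = prefix[i] + v
    best_score, best_interval = float("-inf"), None
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            score = (prefix[j] - prefix[i] - k * mu) / (sigma * math.sqrt(k))
            if score > best_score:
                best_score, best_interval = score, (i, j)
    return best_interval, best_score

# A run of elevated probes stands out against a zero background.
data = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
print(max_interval_exhaustive(data))   # the interval covering the 1.0 run
```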

  34. MaxInterval Algorithm I: LookAhead • Assume given: m, an upper bound on the value of a single element Xi; t, a lower bound on the maximum score. • If the current interval I = [i, …, i+k-1] has sum s = Σj∈I Xj, then an extended interval I′ = [i, …, i+k+r-1] has sum at most s + mr and length k + r, which bounds its score. • Solve for the first r for which S(I′) may exceed t; shorter extensions need not be scored. • Complexity: expected O(n^1.5) (unproved).
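A minimal illustration of the LookAhead bound: with the current sum s over k elements, any extension by r elements adds at most m·r, so the extended score is at most (s + mr - (k+r)µ)/(σ√(k+r)); we can skip to the first r where this optimistic bound reaches the threshold t. The linear scan below is only for clarity; the actual algorithm solves for r in closed form:

```python
import math

def lookahead_skip(s, k, m, t, mu=0.0, sigma=1.0, max_r=1_000_000):
    """Smallest extension r at which the optimistic score bound
    (s + m*r - (k + r)*mu) / (sigma * sqrt(k + r)) could reach t.
    Extensions shorter than r cannot beat t and are skipped."""
    for r in range(max_r + 1):
        bound = (s + m * r - (k + r) * mu) / (sigma * math.sqrt(k + r))
        if bound >= t:
            return r
    return None  # the bound never reaches t within max_r

# Starting from a single element with sum 0, elements bounded by m = 1,
# a score of t = 2 is out of reach until at least 5 more elements.
print(lookahead_skip(0, 1, 1, 2))
```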

  35. MaxInterval Algorithm II: Geometric Family Approximation (GFA) • For ε > 0, define a geometric family of intervals whose lengths grow geometrically. • Theorem: let I* be the optimal scoring interval, and let J be the leftmost longest interval of the family fully contained in I*. Then S(J) ≥ S(I*)/α, where α is a constant determined by ε. • Complexity: O(n).

  36. Benchmarking • Benchmarking results of the Exhaustive, LookAhead and GFA algorithms on synthetic vectors of varying lengths. • Linear regression suggests that their complexities are O(n^2), O(n^1.5), and O(n), respectively.

  37. Applications: Single Samples • Chromosome 16 of the HCT116 colon carcinoma cell line on a high-density oligo array (n = 5,464); highlighted regions include A2BP1 and FRA16B. • Chromosome 17 of several breast carcinoma cell lines on a mid-density cDNA array (n = 364); highlighted regions include ERBB2. [Figures: log2(ratio) vs. position, 0-75 Mbp]

  38. Another Approach to CGH Segmentation • Use a hidden Markov model (HMM) to "parse" the sequence of probes into copy number states. [Figure: log2 ratio vs. genomic position, showing a deletion and an amplification]

  39. Hidden Markov Models [Figure: HMM trellis with states 1…K in each column and emitted observations x1, x2, x3, …, xK]

  40. Example: The Dishonest Casino • A casino has two dice: • Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6 • Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2 • The casino player switches back-and-forth between the fair and loaded die about once every 20 turns. • Game: • You bet $1 • You roll (always with a fair die) • The casino player rolls (maybe with the fair die, maybe with the loaded die) • Highest number wins $2
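The generative process described above is easy to simulate; this sketch (function name and seed are illustrative) makes the hidden-state structure concrete:

```python
import random

def simulate_casino(n_rolls, switch_p=0.05, seed=0):
    """Simulate the dishonest casino: a hidden die (F = fair, L = loaded)
    that switches with probability switch_p = 1/20 per turn, emitting a
    roll from the current die."""
    rng = random.Random(seed)
    fair = [1, 2, 3, 4, 5, 6]                 # each face has prob 1/6
    loaded = [1, 2, 3, 4, 5] + [6] * 5        # 6 has prob 1/2, others 1/10
    state, states, rolls = "F", [], []
    for _ in range(n_rolls):
        if rng.random() < switch_p:
            state = "L" if state == "F" else "F"
        states.append(state)
        rolls.append(rng.choice(fair if state == "F" else loaded))
    return states, rolls

states, rolls = simulate_casino(100)
print("".join(states))
print("".join(str(r) for r in rolls))
```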

  41. Question #1 - Evaluation • GIVEN: a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 • QUESTION: how likely is this sequence, given our model of how the casino works? • This is the EVALUATION problem in HMMs. • Prob = 1.3 × 10^-35

  42. Question #2 - Decoding • GIVEN: a sequence of rolls by the casino player: 1245526462146146136136661664661636616366163616515615115146123562344 • QUESTION: what portion of the sequence was generated with the fair die, and what portion with the loaded die? • This is the DECODING question in HMMs, and it is what we want to solve for CGH analysis. • Answer: FAIR, then LOADED, then FAIR.
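Decoding is solved by the Viterbi algorithm (dynamic programming over states). A compact sketch applied to the casino model, with illustrative parameter names:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an HMM (the DECODING problem)."""
    # V[t][s] = (best probability of a path ending in state s at time t,
    #            predecessor state at time t - 1)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for x in obs[1:]:
        prev = V[-1]
        V.append({s: max(((prev[r][0] * trans_p[r][s] * emit_p[s][x], r)
                          for r in states), key=lambda pr: pr[0])
                  for s in states})
    # Backtrack from the best final state.
    state = max(V[-1], key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(V) - 1, 0, -1):
        state = V[t][state][1]
        path.append(state)
    return path[::-1]

# The dishonest casino model: F = fair die, L = loaded die.
states = ("F", "L")
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

# A run of sixes in the middle should be decoded as the loaded die.
rolls = "12143512" + "666666" + "21534512"
print("".join(viterbi(rolls, states, start, trans, emit)))
```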

  43. Question # 3 – Learning GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs Prob(6) = 64%

  44. Definition of a Hidden Markov Model • Alphabet Σ = {b1, b2, …, bM} • Set of states Q = {1, …, K} • Transition probabilities between any two states: aij = transition probability from state i to state j, with ai1 + … + aiK = 1 for all states i = 1…K • Start probabilities a0i, with a01 + … + a0K = 1 • Emission probabilities within each state: ek(b) = P(xi = b | πi = k), with ek(b1) + … + ek(bM) = 1 for all states k = 1…K

  45. The Dishonest Casino Model • Two states, FAIR and LOADED, with transition probabilities 0.95 (stay) and 0.05 (switch) in each direction. • Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.

  46. An HMM Is Memory-less • At each time step t, the only thing that affects future states is the current state πt: P(πt+1 = k | "whatever happened so far") = P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt) = P(πt+1 = k | πt)

  47. A Parse of a Sequence • Given a sequence x = x1……xN, a parse of x is a sequence of states π = π1, ……, πN. [Figure: HMM trellis with the parse path highlighted]

  48. Likelihood of a Parse • Simply multiply all the transition and emission probabilities along the parse path (the orange arrows in the trellis figure).

  49. Likelihood of a Parse • Given a sequence x = x1……xN and a parse π = π1, ……, πN, the likelihood of the parse given our HMM is: P(x, π) = P(x1, …, xN, π1, ……, πN) = P(xN, πN | πN-1) P(xN-1, πN-1 | πN-2) …… P(x2, π2 | π1) P(x1, π1) = P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1) = a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN). • A compact way to write this: number all parameters aij and ei(b) as θ1, …, θn (example: a0Fair = θ1; a0Loaded = θ2; …; eLoaded(6) = θ18). Count in (x, π) the number of times each parameter j = 1, …, n occurs: F(j, x, π) = # of times parameter θj occurs in (x, π) (call F(·,·,·) the feature counts). Then P(x, π) = Πj=1…n θj^F(j, x, π) = exp[Σj=1…n log(θj) F(j, x, π)].

  50. Example: The Dishonest Casino • Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4 • What is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (Say the initial probabilities are a0Fair = ½, a0Loaded = ½.) • ½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5 × 10^-9
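The computation on this slide can be checked directly. The helper below implements the parse-likelihood formula P(x, π) = a0π1 Πt aπt-1πt eπt(xt) for the casino model (the dictionary names are illustrative):

```python
def parse_likelihood(x, pi, start_p, trans_p, emit_p):
    """P(x, pi): product of the start, transition and emission
    probabilities along the parse pi."""
    p = start_p[pi[0]] * emit_p[pi[0]][x[0]]
    for t in range(1, len(x)):
        p *= trans_p[pi[t - 1]][pi[t]] * emit_p[pi[t]][x[t]]
    return p

# The dishonest casino model: F = fair die, L = loaded die.
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2}}

rolls = "1215621524"           # the slide's sequence 1,2,1,5,6,2,1,5,2,4
p = parse_likelihood(rolls, ["F"] * 10, start, trans, emit)
print(p)                       # about 5.2e-9, matching 1/2 * (1/6)^10 * 0.95^9
```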
