Algorithms for Smoothing Array CGH data



**Algorithms for Smoothing Array CGH data**

Kees Jong (VU, CS and Mathematics), Elena Marchiori (VU, Computer Science), Aad van der Vaart (VU, Mathematics), Gerrit Meijer (VUMC), Bauke Ylstra (VUMC), Marjan Weiss (VUMC)

**Tumor Cell**

[Figure: chromosomes of tumor cell]

**CGH Data**

[Figure: copy number (y-axis) against clones/chromosomes (x-axis)]

**“Discrete” Smoothing**

Copy numbers are integers.

**Why Smoothing?**

- Noise reduction
- Detection of Loss, Normal, Gain, Amplification
- Breakpoint analysis
- Recurrent (over tumors) aberrations may indicate:
  - an oncogene or
  - a tumor suppressor gene

**Is Smoothing Easy?**

Measurements are relative to a reference sample. Printing, labeling and hybridization may be uneven. Tumor sample is inhomogeneous.

- vertical scale is relative
- do expect only few levels

**Problem Formalization**

A smoothing can be described by

- a number of breakpoints
- corresponding levels

A fitness function scores each smoothing according to fitness to the data. An algorithm finds the smoothing with the highest fitness score.

**Smoothing**

[Figure: a smoothing with its breakpoints, variance and levels marked]

**Fitness Function**

We assume that data are a realization of a Gaussian noise process and use the maximum likelihood criterion, adjusted with a penalization term for taking into account model complexity.

We could use better models given insight in tumor pathogenesis.

**Fitness Function (2)**

- CGH values: $x_1, \dots, x_n$
- breakpoints: $0 < y_1 < \dots < y_N < x_N$
- levels: $\mu_1, \dots, \mu_N$
- error variances: $\sigma_1^2, \dots, \sigma_N^2$
- likelihood: [formula not preserved]

**Fitness Function (3)**

Maximum likelihood estimators of $\mu$ and $\sigma^2$ can be found explicitly.

Need to add a penalty to log likelihood to control number $N$ of breakpoints.

[Formula not preserved; one term is labeled “penalty”]

**Algorithms**

Maximizing Fitness is computationally hard. Use genetic algorithm + local search to find approximation to the optimum.

**Algorithms: Local Search**

choose N breakpoints at random

while (improvement)

- randomly select a breakpoint
- move the breakpoint one position to left or to the right

**Genetic Algorithm**

Given a “population” of candidate smoothings, create a new smoothing by

- select two “parents” at random from population
- generate “offspring” by combining parents (e.g. “uniform crossover” or “union”)
- apply mutation to each offspring
- apply local search to each offspring
- replace the two worst individuals with the offspring

**Experiments**

- Comparison of
  - GLS
  - GLSo
  - Multi Start Local Search (mLS)
  - Multi Start Simulated Annealing (mSA)
- GLS is significantly better than the other algorithms.

**Comparison to Expert**

[Figure: panels labeled “algorithm” and “expert”]

**Conclusion**

- Breakpoint identification as model fitting to search for most-likely-fit model given the data
- Genetic algorithms + local search perform well
- Results comparable to those produced by hand by the local expert
- Future work:
  - Analyse the relationship between Chromosomal aberrations and Gene Expression

**Example of a-CGH Tumor**

[Figure: value (y-axis) against clones/chromosomes (x-axis)]

**a-CGH vs. Expression**

| a-CGH | Expression |
| --- | --- |
| DNA | RNA |
| In Nucleus | In Cytoplasm |
| Same for every cell | Different per cell |
| DNA on slide | cDNA on slide |
| Measure Copy Number Variation | Measure Gene Expression |

**Breakpoint Detection**

- Identify possibly damaged genes:
  - These genes will not be expressed anymore
- Identify recurrent breakpoint locations:
  - Indicates fragile pieces of the chromosome
- Accuracy is important:
  - Important genes may be located in a region with (recurrent) breakpoints

**Experiments**

- Both GAs are Robust:
  - Over different randomly initialized runs breakpoints are (mostly) placed on the same location
- Both GAs Converge:
  - The “individuals” in the pool are very similar
- Final result looks very much like (mean error = 0.0513) smoothing conducted by the local expert

**Genetic Algorithm 1 (GLS)**

initialize population of candidate solutions randomly

while (termination criterion not satisfied)

- select two parents using roulette wheel
- generate offspring using uniform crossover
- apply mutation to each offspring
- apply local search to each offspring
- replace the two worst individuals with the offspring

**Genetic Algorithm 2 (GLSo)**

initialize population of candidate solutions randomly

while (termination criterion not satisfied)

- select 2 parents using roulette wheel
- generate offspring using OR crossover
- apply local search to offspring
- apply “join” to offspring
- replace worst individual with offspring
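The fitness function and local search slides can be illustrated with a minimal Python sketch. It is not the authors' implementation: the slides' likelihood and penalty formulas are not preserved, so the code assumes a standard piecewise-constant Gaussian model with a separate maximum likelihood level and variance per segment, a penalty that is linear in the number of breakpoints, and a variance floor to keep very short segments from dominating. The names `segments`, `fitness`, `local_search`, `penalty` and `min_var` are hypothetical, and the loop is a pass-based variant of "while (improvement)" that stops once no single one-position move raises the fitness.

```python
import math
import random


def segments(n, breakpoints):
    """Split indices 0..n-1 into half-open intervals at the sorted breakpoints."""
    edges = [0] + sorted(breakpoints) + [n]
    return list(zip(edges[:-1], edges[1:]))


def fitness(x, breakpoints, penalty=2.0, min_var=1e-2):
    """Penalized Gaussian log likelihood of a piecewise-constant smoothing.

    Assumptions (not taken from the slides): the penalty is
    penalty * number_of_breakpoints, and variances are floored at min_var.
    """
    loglik = 0.0
    for lo, hi in segments(len(x), breakpoints):
        seg = x[lo:hi]
        k = len(seg)
        mean = sum(seg) / k                      # ML estimate of the level
        var = sum((v - mean) ** 2 for v in seg) / k  # ML estimate of the variance
        var = max(var, min_var)
        # Gaussian log likelihood evaluated at the ML estimates.
        loglik += -0.5 * k * (math.log(2.0 * math.pi * var) + 1.0)
    return loglik - penalty * len(breakpoints)


def local_search(x, n_breaks, rng):
    """Start from random breakpoints; move one breakpoint by one position
    at a time, keeping a move only if it strictly improves the fitness."""
    n = len(x)
    breaks = sorted(rng.sample(range(1, n), n_breaks))
    best = fitness(x, breaks)
    improved = True
    while improved:
        improved = False
        order = list(range(n_breaks))
        rng.shuffle(order)  # visit breakpoints in random order
        for i in order:
            for step in (-1, 1):
                cand = list(breaks)
                cand[i] += step
                # Breakpoints must stay inside 1..n-1 and remain distinct.
                if not 1 <= cand[i] <= n - 1 or len(set(cand)) < n_breaks:
                    continue
                score = fitness(x, cand)
                if score > best:
                    breaks, best = cand, score
                    improved = True
                    break
    return breaks, best
```

For a noise-free profile of 20 zeros followed by 20 ones, `local_search(x, 1, random.Random(0))` should settle on the single breakpoint at index 20. On real, noisy profiles a search like this can stop in a local optimum, which is the motivation the slides give for combining it with a genetic algorithm.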