STAC: A multi-experiment method for analyzing array-based genomic copy number data

Sharon J. Diskin, Thomas Eck, Joel P. Greshock, Yael P. Mosse, Tara Naylor, Christian J. Stoeckert, Jr., Barbara L. Weber, John M. Maris, Gregory R. Grant. University of Pennsylvania; Children’s Hospital of Philadelphia. MGED 8 Meeting, Bergen, Norway, September 11-13, 2005

Background • Gain and loss of chromosomal DNA occur in many cancers • Regions of recurrent gain or loss contain genes critical to the genesis and/or progression of cancer • Accurate identification of such regions is essential for prioritizing follow-up efforts • Array Comparative Genomic Hybridization (aCGH) is a method for detecting genomic copy number variation on a genome-wide scale with high resolution • Platforms: BAC, cDNA, ROMA, Affymetrix SNP chips, Agilent technology

Selecting significant aberrations across samples • Researchers traditionally rely on a simple frequency threshold to identify “significant” regions of gain/loss • This is followed by tedious manual review of the regions to define boundaries • This process is time consuming at best, lacks statistical control, is subject to investigator bias, and may miss essential regions (Figure: samples plotted along chromosome 8)

Research Goal • Develop a statistical method for assessing the significance of consistent copy number aberrations across multiple samples • Validate this method using known biology and comparison to traditional methods

Example Data and Terminology • A location is a fixed-width stretch of genomic DNA (e.g. 1 Mb) • Experiments/samples are plotted along the vertical axis, one per row • A sequence of one or more aberrant locations is called an aberrant interval • We call the set of intervals for a given sample the profile for that sample
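The terminology above can be illustrated with a minimal sketch. It assumes each sample is given as a list of 0/1 calls, one per location; the function name `aberrant_intervals` and the inclusive `(start, end)` convention are illustrative choices, not part of STAC.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) location indices, end inclusive


def aberrant_intervals(calls: List[int]) -> List[Interval]:
    """Return the maximal runs of aberrant (1) locations as (start, end) pairs."""
    intervals: List[Interval] = []
    start: Optional[int] = None
    for i, call in enumerate(calls):
        if call and start is None:
            start = i                      # an aberrant interval opens here
        elif not call and start is not None:
            intervals.append((start, i - 1))  # interval closed at previous location
            start = None
    if start is not None:                  # interval runs to the end of the row
        intervals.append((start, len(calls) - 1))
    return intervals


# One sample row; each entry is one fixed-width location (e.g. 1 Mb).
sample_calls = [0, 1, 1, 0, 0, 1, 0, 0]
profile = aberrant_intervals(sample_calls)  # the profile for this sample
print(profile)  # [(1, 2), (5, 5)]
```

Here the profile contains two aberrant intervals: one spanning two locations and one spanning a single location.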

The Problem • Find locations that have more intervals (gains/losses) covering them than would be expected by chance • The true underlying aberration rate is unknown • Take the observed aberrations as given and test for the significance of consistent aberrations across samples

Statistical Approach • Null Model: observed intervals of aberration are equally likely to occur anywhere in the stretch of the genome being considered • General Approach: (1) Choose an appropriate statistic (2) Apply a permutation procedure under the null model to estimate a null distribution of the statistic (3) Assess the (multiple testing corrected) significance of observed values of the statistic by comparing to the null distribution • Permutation: random rearrangement of intervals within each profile
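The three-step approach can be sketched as follows for the simplest statistic, the per-location frequency. This is an illustration under stated assumptions, not the STAC implementation: the token-shuffling scheme, the use of the per-permutation maximum frequency as a multiple-testing correction, and the names `permute_profile` and `permutation_pvalues` are all assumptions made for the example.

```python
import random
from typing import List


def permute_profile(calls: List[int], rng: random.Random) -> List[int]:
    """Randomly rearrange the aberrant intervals of one profile.

    Each maximal run of 1s is kept intact as one token and each 0 is its own
    token, so shuffling moves intervals without splitting them.
    Assumption: two intervals may land next to each other after shuffling.
    """
    tokens: List[List[int]] = []
    i = 0
    while i < len(calls):
        j = i + 1
        if calls[i]:
            while j < len(calls) and calls[j]:
                j += 1                     # extend over the whole run of 1s
        tokens.append(list(calls[i:j]))
        i = j
    rng.shuffle(tokens)
    return [c for token in tokens for c in token]


def frequency(matrix: List[List[int]]) -> List[int]:
    """Number of samples that are aberrant at each location."""
    return [sum(col) for col in zip(*matrix)]


def permutation_pvalues(matrix: List[List[int]], n_perm: int, seed: int) -> List[float]:
    """Per-location p-values against the null distribution of the maximum frequency."""
    rng = random.Random(seed)
    observed = frequency(matrix)
    null_max = []
    for _ in range(n_perm):
        permuted = [permute_profile(row, rng) for row in matrix]
        null_max.append(max(frequency(permuted)))
    # Add-one estimate: fraction of permutations whose maximum reaches the observed value.
    return [(1 + sum(m >= f for m in null_max)) / (n_perm + 1) for f in observed]


# Toy data: 8 samples, 40 locations, every sample aberrant at location 10 only.
matrix = [[1 if loc == 10 else 0 for loc in range(40)] for _ in range(8)]
pvals = permutation_pvalues(matrix, n_perm=200, seed=1)
print(pvals[10], pvals[0])
```

Comparing each observed value with the maximum over all locations in each permutation is one common way to control for testing many locations at once; the slides do not state which correction STAC uses.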

Frequency statistic results (figure annotation: freq = 9) • Need a statistic sensitive to tight alignment, even if the aberration is not significantly frequent

The footprint statistic • Stack: a set S of aligned intervals containing at most one interval per profile and with at least one location common to all intervals • Footprint: F(S) = the number of locations c such that c is contained in some interval of stack S • In practice, F(S) is normalized: NF(S) = F(S)/E(F(S)) • Null distributions: find the minimal NF(S) for each (sample) subset size using a heuristic search • Use the distributions to assign (multiple testing corrected) p-values to locations (details omitted)
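The stack and footprint definitions can be made concrete with a short sketch. The representation of an interval as `(profile id, start, end)` and the function names are assumptions for illustration. The expected footprint E(F(S)) is passed in as a number because the slides do not say how it is computed; the heuristic search for minimal NF(S) is likewise not shown.

```python
from typing import List, Tuple

StackInterval = Tuple[str, int, int]  # (profile id, start, end), end inclusive


def is_stack(intervals: List[StackInterval]) -> bool:
    """Check the two stack conditions: one interval per profile, one shared location."""
    if not intervals:
        return False
    profiles = [p for p, _, _ in intervals]
    if len(set(profiles)) != len(profiles):
        return False                       # more than one interval from some profile
    latest_start = max(s for _, s, _ in intervals)
    earliest_end = min(e for _, _, e in intervals)
    return latest_start <= earliest_end    # true iff all intervals share a location


def footprint(intervals: List[StackInterval]) -> int:
    """F(S): number of locations contained in at least one interval of the stack."""
    covered = set()
    for _, start, end in intervals:
        covered.update(range(start, end + 1))
    return len(covered)


def normalized_footprint(intervals: List[StackInterval], expected_footprint: float) -> float:
    """NF(S) = F(S) / E(F(S)); the expectation is supplied by the caller."""
    return footprint(intervals) / expected_footprint


loose = [("s1", 3, 6), ("s2", 4, 5), ("s3", 4, 8)]
tight = [("s1", 4, 5), ("s2", 4, 5), ("s3", 4, 5)]
print(footprint(loose), footprint(tight))  # 6 2
```

Both stacks have the same frequency (three samples), but the tightly aligned one has the smaller footprint, which is the property the footprint statistic is meant to reward.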

Footprint statistic results (figure annotations: p-value = 0.0050, p-value = 0.0001) • The footprint statistic, coupled with the search strategy, reveals locations significantly consistent within subsets

STAC Algorithm Specification • INPUT: matrix of binary gain/no change (or loss/no change) calls for each location along a chromosome arm • OUTPUT: for each location along the chromosome arm: a) the best stack covering that location b) two p-values for that location (one for each statistic)

Validation Data • Publicly available data sets: 42 neuroblastoma cell lines (Mosse et al. 2005, Genes Chr Cancer); 47 primary sporadic breast tumors (Naylor et al. 2005, submitted) • UPenn BAC Array (Greshock et al. 2004, Gen. Res.) • ~4,200 BAC clones: 1. 69% BAC end sequenced 2. 28% STS mapping 3. 3% full BAC sequence • Spacing: ~0.91 Mb (chrs 1-X) (Figure: aCGH BAC coverage, chr13)

Traditional Processing – Many Samples 1. Define regions of aberration for each sample 2. Determine frequency of aberration at each location 3. Threshold frequency (e.g. NBL = 25%, Breast = 30%)
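For comparison, the traditional frequency-threshold procedure can be sketched in a few lines. The input is assumed to be a samples-by-locations matrix of 0/1 calls; the function name `recurrent_regions` and the toy data are illustrative only.

```python
from typing import List, Optional, Tuple


def recurrent_regions(matrix: List[List[int]], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) runs of locations whose aberration frequency meets the threshold."""
    n_samples = len(matrix)
    # Step 2: fraction of samples that are aberrant at each location.
    freq = [sum(col) / n_samples for col in zip(*matrix)]
    # Step 3: keep contiguous runs of locations at or above the threshold.
    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, f in enumerate(freq):
        if f >= threshold and start is None:
            start = i
        elif f < threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(freq) - 1))
    return regions


# Toy data: 4 samples (rows) by 6 locations (columns).
calls = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
print(recurrent_regions(calls, threshold=0.5))   # [(1, 2), (5, 5)]
print(recurrent_regions(calls, threshold=0.25))  # [(1, 3), (5, 5)]
```

The region boundaries shift with the chosen threshold, which illustrates why this procedure needs manual review and offers no statistical control.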

Traditional Processing – Many Samples (continued) • Example common regions of aberration (figure; regions labeled 70%, 90%, 60%, 90%)

Validation • Neuroblastoma: 83% (19/22) gain regions, 100% (12/12) loss regions; avg p-value: gain 0.00447, loss 0.00719 • Breast Cancer: 92% (11/12) gain regions, 85% (11/13) loss regions, also 86% (47/55) of the gains (suppl. data); avg p-value: gain 0.00549, loss 0.00899 • STAC identifies prognostically relevant regions in neuroblastoma (figure: 2p gain, showing MYCN amplification at 2p24) • Boundaries differ by < 1 Mb on average and in several cases are narrowed by STAC

Additional Regions Identified • Neuroblastoma: 94 gains covering 341 Mb; 80 losses covering 305 Mb • Breast Cancer: 149 gains covering 525 Mb; 124 losses covering 384 Mb

Regions segregate with known biology • Neuroblastoma cell lines • 646 Mb of significant locations scored (gain, loss, no change) • Clustering: agglomerative hierarchical, Pearson correlation, complete linkage • Evidence for 2 sample clusters - Cluster 1 characterized by pattern of loss - Cluster 2 characterized by pattern of gain * missed by traditional method
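The clustering setup named on this slide can be sketched in plain Python. This is a toy illustration, not the analysis behind the slide: samples are assumed to be scored -1 (loss), 0 (no change), 1 (gain) per location, distance is taken as 1 minus the Pearson correlation, and the sample names and data are invented for the example. Treating a constant vector as having zero correlation is also an assumption.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def complete_linkage(samples: Dict[str, List[float]], n_clusters: int) -> List[List[str]]:
    """Agglomerative clustering with distance 1 - r and complete linkage."""
    names = list(samples)
    dist = {(a, b): 1 - pearson(samples[a], samples[b]) for a in names for b in names}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Hypothetical scores: -1 = loss, 0 = no change, 1 = gain at each significant location.
scores = {
    "G1": [1, 1, 0, 0, 0, 0],
    "G2": [1, 1, 1, 0, 0, 0],
    "L1": [-1, -1, 0, 0, 0, 0],
    "L2": [-1, -1, -1, 0, 0, 0],
}
print(complete_linkage(scores, n_clusters=2))
```

In this toy data the gain-pattern samples and the loss-pattern samples end up in separate clusters, mirroring the two-cluster structure described on the slide.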

Future Plans • Release a stand-alone Java version of STAC • Extend STAC to account for high-level gains and homozygous deletions • Extend STAC to handle stacks with 2 or more intervals per profile (co-occurring aberrations) • More information: http://www.cbil.upenn.edu/STAC
