Shantanu Dutt, Yang Dai, Huan Ren, Joel Fontanarosa
University of Illinois at Chicago

Selection of Multiple SNPs in Case-Control Association Study Using a Discretized Network Flow Approach

Outline
• Background: Genome Wide Association Study
• Problem Definition
• Previous Work
• Our Work:
  • MIP Formulations
  • Discretized Network Flow (DNF) Opt. Method
  • DNF Solutions for k-SNP Selection w/ Clustering/Classification
• Experimental Results
• Conclusions

Genetic Association Studies
• Goal: Find markers of variation that reliably distinguish individuals with a disease from a healthy population.
• Single Nucleotide Polymorphisms (SNPs) are the simplest and most common form of variation in the human genome.
• Each chromosome has one of two alleles for each SNP, so the possible genotypes are {0/0, 0/1, 1/1}.
• Variations measured at specific SNP loci have been shown to be associated with numerous traits and diseases.
[Figure: one SNP shown on chrom 1 and chrom 2 of Person 1, Person 2, and Person 3]

Genetic Association Studies (contd)
Genomic Variation → Gene, Protein, or Cellular Alteration/Regulation → Altered Phenotype
- Individual traits (e.g., height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease
Images: PDB (www.rcsb.org); Robbins and Cotran, 7th Ed., 2005

Genetic Association Studies (contd)
• Complex traits cannot be mapped to a single genetic locus.
• Multiple interacting genetic influences combine with environmental factors to produce an outcome.
[Figure: gene networks (A, B, ..., X) and environment jointly leading to disease]

Genetic Association Studies (contd)
• Genome Wide Association Study (GWAS): Measure a large number of SNPs (typically 500K-1M) across the genome in a large case-control study (often >1000 patients).
• Results are commonly reported based on individual χ² values, ignoring potentially powerful interaction effects.
• It remains an open computational and statistical challenge to reliably analyze epistasis, or gene-gene interactions, in large-scale GWAS.
• Different genetic variations → common complex disease
Problem Definition: For a given set P of cases and a set Q of controls, classify the cases into different clusters and simultaneously select k significant marker SNPs for each cluster (those that strongly distinguish its cases from the set Q).
In this paper, we present a new optimization technique called discretized network flow (DNF) for the above problem.
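To make the single-SNP reporting mentioned above concrete, here is a minimal sketch of a Pearson χ² statistic for one SNP, computed from genotype counts (0/0, 0/1, 1/1) in cases and controls. The function name and the counts are hypothetical and not taken from the slides; the point is only that each SNP is scored on its own, with no interaction term.

```python
def chi_square_genotype(case_counts, control_counts):
    """Pearson chi-square statistic for a 2 x 3 table of genotype counts.

    case_counts / control_counts: counts of genotypes (0/0, 0/1, 1/1).
    """
    rows = [list(case_counts), list(control_counts)]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:  # skip genotypes absent from the whole sample
                stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts for two SNPs (illustrative only)
same = chi_square_genotype((10, 20, 10), (10, 20, 10))
skewed = chi_square_genotype((30, 10, 0), (0, 10, 30))
```

A SNP with identical genotype distributions in cases and controls scores zero, while a strongly skewed one scores high; neither score says anything about combinations of SNPs.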

Examples of Epistasis Methods
• Combinatorial
  • MDR = multifactor dimensionality reduction
  • CSP = combinatorial search based prediction
  • CPM = combinatorial partitioning method
• Probabilistic
  • BEAM = Bayesian Epistasis Association Mapping: Bayesian partitioning model resolved by Markov Chain Monte Carlo (MCMC) methods
• megaSNPhunter
  • Hierarchical learning algorithm (regression trees)
  • Primarily considers local interaction effects
References: MDR: Ritchie et al., Gen Epid, 2003; CSP: Brinza et al., WABI’06; CPM: Nelson et al., Genome Research, 2001; BEAM: Zhang and Liu, Nature Genetics, 2007; megaSNPhunter: Wan et al., BMC Bioinformatics, 2009

MDR
1. Divide data into training and testing sets.
2. Select a set of N factors.
3. For each multifactor genotype cell: if (affected/unaffected) > T (e.g., T = 1.0) → high risk; otherwise low risk.
4. Select the model with the best misclassification error.
5-6. Estimate the model prediction error using the testing data set.
Repeat these steps for each cross-validation iteration and for each possible combination of factors.
Adapted from Ritchie et al., Gen Epid, 2003
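The MDR steps can be sketched as follows. This is a simplified illustration, not the reference MDR implementation: it scores models on the training data only and omits the cross-validation loop. The function names and the toy data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mdr_classifier(genotypes, labels, snps, threshold=1.0):
    """Label each multi-SNP genotype cell as high risk (True) or low risk (False)."""
    cases, controls = Counter(), Counter()
    for row, label in zip(genotypes, labels):
        cell = tuple(row[s] for s in snps)
        (cases if label == 1 else controls)[cell] += 1
    # affected/unaffected > T, written without division so empty cells are safe
    return {cell: cases[cell] > threshold * controls[cell]
            for cell in set(cases) | set(controls)}


def misclassification_error(genotypes, labels, snps, high_risk):
    """Fraction of individuals whose cell label disagrees with their status."""
    errors = 0
    for row, label in zip(genotypes, labels):
        cell = tuple(row[s] for s in snps)
        predicted = 1 if high_risk.get(cell, False) else 0
        errors += predicted != label
    return errors / len(labels)


def best_model(genotypes, labels, n_factors, threshold=1.0):
    """Try every combination of n_factors SNPs and keep the lowest error."""
    best = None
    for snps in combinations(range(len(genotypes[0])), n_factors):
        model = mdr_classifier(genotypes, labels, snps, threshold)
        err = misclassification_error(genotypes, labels, snps, model)
        if best is None or err < best[0]:
            best = (err, snps)
    return best


# Toy data (hypothetical): disease status is SNP0 XOR SNP2; SNP1 is noise.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ c for a, b, c in genotypes]
best = best_model(genotypes, labels, n_factors=2)
```

On this toy data no single SNP separates cases from controls, but the pair (SNP0, SNP2) does, which is the kind of interaction effect that per-SNP statistics miss. The exhaustive loop over combinations is also why the approach becomes expensive as the number of SNPs grows.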

CSP: Combinatorial Methods for Disease Association Search and Susceptibility Prediction
• Risk/resistance factor → multi-SNP combination (MSC)
• Problem: Find all MSCs significantly associated with the disease
  • Cluster C: subset of S with an MSC; S: the original SNP set
  • d(C): # of diseased; h(C): # of non-diseased
• Combinatorial search
  • Definition: The disease-closure of a multi-SNP combination C is a multi-SNP combination C’, with the maximum number of SNPs, which consists of the same set of disease individuals and the minimum number of non-disease individuals.
  • Searches only closed clusters: closure of cluster C = C’, where d(C’) = d(C) and h(C’) is minimized
  • Avoids checking of trivial MSCs
  • Small d(C) implies not looking in subclusters
  • Finds associated MSCs faster, but is still too slow
• Tagging: compress the SNP set by extracting the most informative SNPs
  • Restore other SNPs from tag SNPs
  • Multiple regression method for tagging
Brinza, D., Zelikovsky, WABI’06
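The disease-closure definition can be illustrated with a short sketch. It assumes an MSC is represented as a mapping from SNP index to required genotype value, which is a choice made for this example and not necessarily the representation used by CSP. The closure keeps every SNP value shared by all diseased individuals that match the original MSC, so d stays the same while h can only shrink.

```python
def matches(individual, msc):
    """True if the individual carries every (SNP, value) pair of the MSC."""
    return all(individual[snp] == value for snp, value in msc.items())


def disease_closure(msc, diseased, healthy):
    """Return (closure, d, h) for an MSC given diseased and healthy genotypes."""
    d_cluster = [x for x in diseased if matches(x, msc)]
    if not d_cluster:
        return dict(msc), 0, sum(matches(x, msc) for x in healthy)
    first = d_cluster[0]
    # Keep every SNP value on which all matching diseased individuals agree.
    closure = {snp: value for snp, value in enumerate(first)
               if all(x[snp] == value for x in d_cluster)}
    h = sum(matches(x, closure) for x in healthy)
    return closure, len(d_cluster), h


# Hypothetical genotypes over three SNPs
diseased = [(1, 0, 2), (1, 0, 1)]
healthy = [(1, 1, 2), (1, 0, 0), (0, 0, 2)]
closure, d, h = disease_closure({0: 1}, diseased, healthy)
h_before = sum(matches(x, {0: 1}) for x in healthy)
```

Here the MSC {SNP0 = 1} closes to {SNP0 = 1, SNP1 = 0}: both diseased individuals still match, but one fewer healthy individual does.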

Our Work: MIP Formulation
• Notations:
  • p_{i,j}(x) (0 ≤ j ≤ 2): = 1 if allele j is present on SNP i for individual x; = 0 otherwise.
  • Marker m_{i,j}^{val} (val = 0, 1): m_{i,j}^{1} means presence of allele j in SNP i; m_{i,j}^{0} means absence of allele j in SNP i.
• Per-case benefit function b_{i,j}(x) of SNP i and allele j, where n_c is the # of controls: [Equation: definition of b_{i,j}(x)]
• Claim: b_{i,j}(x) is consistent with the specificity provided by selecting marker m_{i,j}^{p_{i,j}(x)}.
  • When p_{i,j}(x) = 1: higher b_{i,j}(x) → lower fraction of non-patients have p_{i,j} = 1 = p_{i,j}(x) → higher fraction of non-patients have p_{i,j} = 0 ≠ p_{i,j}(x).
  • When p_{i,j}(x) = 0: higher b_{i,j}(x) → higher fraction of non-patients have p_{i,j} = 1 ≠ p_{i,j}(x).

MIP Formulation
• Benefit-based case-pair similarity metric: [Equation: piecewise definition of the similarity; the "otherwise" branch indicates that the marker is not a common marker for patients x and y]
• The similarity definition ensures that only common markers among patients will be selected.
• MIP formulation for selecting one marker set for all patients: [Equation: objective function and constraints]
  • d(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected; n_p is the # of patients/cases.
  • At most k markers will be selected.
• Linear MIP; it can be solved with commercial tools such as CPLEX/LINGO, but this is very time consuming.

MIP Formulation (contd)
• Issue 1:
  • The genetic causes of a disease can differ between patient sets (e.g., patients of different ethnicity).
  • Hence, selecting only one marker set is not appropriate (it artificially forces one marker set on the entire patient population).
• Solution: Simultaneously cluster patients and select different markers for different clusters.
  • b_{x,g} = 1 if x is in cluster g; d_g(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected for cluster g.
  • At most G clusters will be generated.
• Cubic MIP!

MIP Formulation (contd)
• Issue 2: The sum of benefits is not consistent with the specificity of a set of markers.
• Essentially, the previous formulation will select five common markers with the highest benefit.
• However, this is not optimal.
[Figure: control set with the overlapping mismatch sets of markers 1, 2, 3, and 4]
  • Individually, markers 1 and 2 provide larger specificity than markers 3 and 4 (they mismatch more controls).
  • However, the mismatch sets of markers 1 and 2 have a larger overlap.
  • Selecting markers 3 and 4 as the marker set gives overall higher specificity.
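The overlap issue can be reproduced with a few hypothetical mismatch sets. In this sketch each marker maps to the set of controls it mismatches, the size of that set stands in for the marker's individual benefit (an assumption made for illustration), and a control counts as excluded if it mismatches at least one selected marker.

```python
from itertools import combinations

# Hypothetical controls 1..8 and the controls each marker mismatches
mismatch = {
    1: {1, 2, 3, 4, 5, 6},
    2: {1, 2, 3, 4, 5, 7},
    3: {1, 2, 3, 8},
    4: {4, 5, 6, 7},
}

pairs = list(combinations(mismatch, 2))

# Summing individual mismatch counts favors markers 1 and 2 ...
by_sum = max(pairs, key=lambda p: len(mismatch[p[0]]) + len(mismatch[p[1]]))

# ... but the number of controls actually excluded is the size of the union.
by_union = max(pairs, key=lambda p: len(mismatch[p[0]] | mismatch[p[1]]))

excluded_by_sum_choice = len(mismatch[by_sum[0]] | mismatch[by_sum[1]])
excluded_by_union_choice = len(mismatch[by_union[0]] | mismatch[by_union[1]])
```

Markers 1 and 2 win on summed counts but exclude only 7 of the 8 controls because their mismatch sets overlap heavily; markers 3 and 4 together exclude all 8.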

MIP Formulation (contd)
• Adding accurate specificity terms to the obj. func. for each control z:
  • M_i(z): whether control z matches the marker set selected for cluster i; M_i(z) is a Boolean OR of various 0/1 variables.
  • g_mis: objective function gain for mismatching a control.
• Final objective function: [Equation]
• At least cubic MIP (if G <= 3)
• g_mis is determined so that specificity and sensitivity are given the same weight.
  • Average gain for a patient matching a marker set: 2 k b_avg α (n_p/G), where n_p is the number of patients and G is the number of groups.
  • g_mis = 2 k b_avg α (n_p/G) * n_p/n_c
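A small worked example of the g_mis formula, under stated assumptions: the values of k, b_avg, and α are hypothetical (the slide does not define α further, so it is treated as a plain scaling factor here), G = 2 is chosen for illustration, and n_p = 144 and n_c = 243 are the Crohn's disease counts reported in the experimental results. The check at the end shows the balancing effect: the total gain available from mismatching all controls equals the total average gain over all patients.

```python
def mismatch_gain(k, b_avg, alpha, n_patients, n_controls, n_groups):
    """g_mis = 2 k b_avg alpha (n_p / G) * n_p / n_c, as given on the slide."""
    avg_patient_gain = 2 * k * b_avg * alpha * (n_patients / n_groups)
    return avg_patient_gain * n_patients / n_controls


# k, b_avg, alpha, and G are illustrative assumptions, not values from the slides.
k, b_avg, alpha, n_groups = 10, 0.5, 1.0, 2
n_patients, n_controls = 144, 243

avg_patient_gain = 2 * k * b_avg * alpha * (n_patients / n_groups)
g_mis = mismatch_gain(k, b_avg, alpha, n_patients, n_controls, n_groups)

total_control_gain = n_controls * g_mis
total_patient_gain = n_patients * avg_patient_gain
```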

Discretized Network Flow (DNF)
• Standard min-cost network flow
  • Find a min-cost way to send a certain amount of flow from the source node (S) to the sink node (T).
  • Solves certain LP problems (continuous solutions).
• Some discrete constraints have to be satisfied in order to solve discrete optimization problems like MIP.
• One such constraint: mutually exclusive arc set (MEA): at most one arc in such a set can have flow on it.
[Figure: example network from S to T with (capacity, cost) arc labels and an MEA set, showing a valid flow and an invalid flow for f = 1]
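A minimal sketch of the MEA condition as a check on a given flow. The arc names, the dictionary representation of a flow, and the `max_arcs` parameter are assumptions made for this example; the check only tests a flow after the fact and is not the DNF solution method itself.

```python
def violates_mea(flow, mea_sets, max_arcs=1, eps=1e-9):
    """Return True if any MEA set has flow on more than max_arcs of its arcs."""
    for arcs in mea_sets:
        used = sum(1 for arc in arcs if flow.get(arc, 0.0) > eps)
        if used > max_arcs:
            return True
    return False


# Hypothetical network: S reaches T through node a or node b; the two arcs
# leaving S form one MEA set.
mea_sets = [{("s", "a"), ("s", "b")}]

# All flow on one arc of the set: valid.
valid_flow = {("s", "a"): 1.0, ("a", "t"): 1.0}

# Flow split across both arcs of the set: a legal continuous flow, but invalid.
invalid_flow = {("s", "a"): 0.5, ("s", "b"): 0.5,
                ("a", "t"): 0.5, ("b", "t"): 0.5}

valid_ok = not violates_mea(valid_flow, mea_sets)
invalid_caught = violates_mea(invalid_flow, mea_sets)
```

The split flow is exactly what a standard continuous min-cost flow solver is allowed to return, which is why the extra discretization mechanism is needed.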

Discretized Network Flow (contd)
• Satisfying MEA requirements
  • Add a flow-amount-independent cost C’ to each arc in the MEA set.
  • A constant cost C’ is incurred whenever there is flow on the arc.
• C’_inv: total C’-related cost for an invalid flow; C’_val: total C’-related cost for a valid flow
• C’_inv ≥ C’_val + C’
[Figure: arc cost vs. flow f up to Cap(e), once as the standard linear flow cost with slope c and once with the constant C’ cost added; MEA sets with C’ on each arc]

Discretized Network Flow (contd)
• Determining C’:
  • In the standard network flow graph, obtain the min-cost flow of cost C_invmin w/o discretization constraints.
  • Heuristically select a valid flow & determine its cost C_val.
  • Set C’ = C_val - C_invmin + 1.
• Since C’_inv ≥ C’_val + C’, a valid flow is guaranteed to have a smaller cost than any invalid flow.
• Theorem [Ren et al., ICCAD’08]: A min-cost flow with C’-costs on MEA arcs ensures MEA satisfaction.
[Figure: cost axis comparing C_inv, C_valmin, and C_val without C’ against C_valmin + C’_val, C_val + C’_val, and C_inv + C’_inv with C’]
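The choice of C’ can be checked on a tiny hypothetical instance: two parallel arcs from S to T that form one MEA set, with 2 units of flow to send. The arc data and the brute-force enumeration of integer flows are assumptions made for illustration; a real instance would use a min-cost flow solver instead of enumeration.

```python
def flow_cost(flow, arcs, fixed_cost=0):
    """Linear flow cost plus a constant fixed_cost for every arc carrying flow."""
    total = 0
    for name, amount in flow.items():
        cap, unit_cost = arcs[name]
        if not 0 <= amount <= cap:
            raise ValueError("flow exceeds capacity on arc " + name)
        total += unit_cost * amount
        if amount > 0:
            total += fixed_cost
    return total


def is_valid(flow):
    """MEA condition for this instance: at most one arc carries flow."""
    return sum(1 for amount in flow.values() if amount > 0) <= 1


# Hypothetical arcs: name -> (capacity, cost per unit of flow)
arcs = {"a": (1, 1), "b": (2, 3)}
demand = 2

# All feasible integer flows that send `demand` units over the two arcs
candidates = [{"a": x, "b": demand - x} for x in range(arcs["a"][0] + 1)
              if demand - x <= arcs["b"][0]]

c_invmin = min(flow_cost(f, arcs) for f in candidates)   # ignores MEA
heuristic_valid = next(f for f in candidates if is_valid(f))
c_val = flow_cost(heuristic_valid, arcs)
c_prime = c_val - c_invmin + 1

best = min(candidates, key=lambda f: flow_cost(f, arcs, fixed_cost=c_prime))
```

Without C’ the cheapest flow splits across both arcs (cost 4) and violates the MEA condition. With C’ = 6 - 4 + 1 = 3 the split flow costs 4 + 2·3 = 10, while the valid flow costs 6 + 3 = 9, so the min-cost flow is now the valid one.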

Discretized Network Flow (contd)
• Discretized network flow has been applied to VLSI CAD problems [Ren et al., ICCAD’08], [Ren et al., IWLS’08], [Dutt et al., ICCAD’06].
• Good run time and scalability.
• 10x to 60x faster than CPLEX with similar quality.
• Example: determine optimal cell sizes in a circuit under an area constraint.
  • Four sizes are available, so the number of 0/1 variables is about four times the number of cells considered.
[Figure: run time vs. the number of cells, from [Ren et al., IWLS’08]]

DNF Model for Single-Cluster Marker Selection
• Flow through a p_{i,j} node in P_x means d(m_{i,j}^{p_{i,j}(x)}) = 1.
• Pairwise connections between p_{i,j} nodes ensure that the same marker set is selected for all P_x.
• The flow cost incurred for selecting a common marker between two patients is -s(x, y, m_{i,j}^{p_{i,j}(x)}).
[Figure: network from S to T with (cap, cost) arc labels such as (np*k, 0) and (np, 0); SNP nodes S_1, ..., S_N and marker nodes p_{1,1}, ..., p_{N,3} for each patient P_1, ..., P_m; MEA: only k arcs can have flow; complete bipartite graph with meta arcs between P_x and P_y, with an arc (1, -s(x, y, p_{i,j}^{c_{i,j}(x)})) if c_{i,j}(x) = c_{i,j}(y) and no connection otherwise]

Marker Selection for Multiple Clusters
• Use multiple copies of the single-cluster network model.
• For G clusters, there are G copies of the 2-level complete bipartite graph; not all G clusters may be formed.
• Type 1 invalid flow: the flow puts P1 in both cluster 1 and cluster 2.
• Type 2 invalid flow: flow through P1 passes through P2, which is not in the same cluster, incurring false costs.
• MEA prevents invalid flows.
• Example valid flow: puts patients {1,4} in cluster 1 and {3,2} in cluster 2.
[Figure: S, choice nodes, and two complete bipartite copies (Cluster 1, Cluster 2) over patients P1-P4 with MEA sets, leading to T]

Marker Selection for Multiple Clusters
• Issue: When G is large, the network flow graph becomes very complex.
• We use iterative bi-partitioning instead.
  • This is a much harder bi-partitioning problem than standard bi-partitioning; the bi-partitioning criterion needs to be selected simultaneously with the bi-partition!
  • Condition for stopping the bi-partitioning of a cluster: the spec+sens deteriorates.
• Another run-time reduction technique: patient pre-clustering
  • Group patients before using DNF.
  • Greedy iterative grouping method:
    • Initially, each patient is a subgroup.
    • Each time, merge the two subgroups with the most common SNP-allele pairs.
    • Termination condition: patients in one group must have at least 70% of SNP-allele pairs in common.
  • Each group is taken as a “meta patient” in DNF.
  • Groups are opened up after DNF, and metrics are evaluated at the individual level.
[Figure: iterative bi-partitioning tree; branches stop when they meet the termination condition, giving the final solution]
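The greedy pre-clustering steps can be sketched as below. This is one possible reading, not the authors' implementation: a SNP-allele pair is counted as "in common" for a group when every member carries the same value at that SNP, and merging stops as soon as the best available merge would fall below the 70% threshold. The function names and toy patients are hypothetical.

```python
from itertools import combinations


def common_fraction(group, patients):
    """Fraction of SNPs on which all patients in the group carry the same value."""
    n_snps = len(patients[group[0]])
    shared = sum(1 for s in range(n_snps)
                 if len({patients[p][s] for p in group}) == 1)
    return shared / n_snps


def pre_cluster(patients, min_common=0.7):
    """Greedily merge subgroups while the merged group stays >= min_common."""
    groups = [[i] for i in range(len(patients))]  # each patient is a subgroup
    while len(groups) > 1:
        a, b = max(combinations(range(len(groups)), 2),
                   key=lambda ab: common_fraction(groups[ab[0]] + groups[ab[1]],
                                                  patients))
        merged = groups[a] + groups[b]
        if common_fraction(merged, patients) < min_common:
            break  # even the best merge violates the termination condition
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return groups


# Hypothetical patients over 10 SNPs
patients = [
    (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
    (1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1, 1, 1, 1, 0, 0),
]
groups = pre_cluster(patients)
```

Each resulting group would then act as one meta patient, which reduces the number of patient nodes in the flow network.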

Chain Structure for Improving Specificity
• One chain structure for each control.
• Two subchains: mismatched (MM) chain and matched (M) chain.
• One injection arc to the M subchain from each cluster: A_1, ..., A_g.
• Injection flow on arc A_i means z matches the selected marker set of cluster i (M_i(z) = 1).
• Any injection flow causes the MEA condition to force the chain flow into the M chain, from which it never switches back; hence it incurs 0 cost.
• The chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of -g_mis.
[Figure: chain structure for control z from S to T: MM chain (cost = -g_mis) and M chain (cost = 0), with injection arcs A_1, A_2, ..., A_g labeled (cap, cost) = (1, 0) from Cluster 1, Cluster 2, ..., Cluster g, and MEA sets]
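The cost behavior of the chain can be summarized in a few lines. This sketch models only the resulting cost per control, not the flow network itself; the function names, the match table, and the value of g_mis are hypothetical.

```python
def chain_cost(matches, g_mis):
    """Chain cost for one control z, where matches[i] is M_i(z).

    Any match sends the chain flow to the M chain (cost 0); with no match the
    flow stays on the MM chain and earns the gain, i.e. a cost of -g_mis.
    """
    return 0.0 if any(matches) else -g_mis


def total_chain_cost(match_table, g_mis):
    """Sum of chain costs over all controls."""
    return sum(chain_cost(matches, g_mis) for matches in match_table)


# Hypothetical M_i(z) values for four controls and two clusters
match_table = [
    [0, 0],  # mismatches both marker sets: gain
    [1, 0],  # matches cluster 1: no gain
    [0, 0],  # mismatches both marker sets: gain
    [1, 1],  # matches both clusters: still a single chain, no gain
]
total = total_chain_cost(match_table, g_mis=2.5)
```

A control that matches several clusters is not penalized several times, which mirrors the "never switch back" behavior of the M chain.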

Experimental Results
• Data sets we use:
  • Crohn’s disease: 144 cases, 243 controls and 103 SNPs
  • Autoimmune disorder: 384 cases, 652 controls and 108 SNPs
  • Tick-borne encephalitis: 21 cases, 54 controls and 41 SNPs
  • Rheumatoid arthritis: 460 cases, 460 controls and 2300 SNPs
  • Lung cancer: 322 cases, 273 controls and 141 SNPs
  • Rheumatoid arthritis (large): 868 cases, 1194 controls and 5000 SNPs
• Prediction scheme with multiple cluster marker sets
[Figure: Test 1 and Test 2 compared against marker set 1 and marker set 2 (match/mismatch), leading to "predict as sick" or "predict as healthy"]
• Metrics:
  • TP: correctly predicted as sick; FP: falsely predicted as sick
  • TN: correctly predicted as healthy; FN: falsely predicted as healthy
  • Sensitivity = TP/(FN+TP)
  • Specificity = TN/(FP+TN)
  • Accuracy = (TN+TP)/(FP+TN+FN+TP)
• Machine configuration: 3 GHz CPU, 1 GB memory, Windows machine.
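The metric definitions translate directly into code. The prediction rule used here, predicting "sick" when an individual matches at least one cluster's marker set, is an assumption about the prediction scheme suggested by the figure labels. Representing a marker set as a mapping from SNP index to required genotype value is also a simplification made for this example.

```python
def predict_sick(individual, marker_sets):
    """Assumed rule: sick if the individual matches any cluster's marker set."""
    return any(all(individual[snp] == value for snp, value in ms.items())
               for ms in marker_sets)


def evaluate(individuals, labels, marker_sets):
    """Return (sensitivity, specificity, accuracy); label 1 = sick, 0 = healthy."""
    tp = fp = tn = fn = 0
    for individual, label in zip(individuals, labels):
        predicted = predict_sick(individual, marker_sets)
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (fn + tp)
    specificity = tn / (fp + tn)
    accuracy = (tn + tp) / (fp + tn + fn + tp)
    return sensitivity, specificity, accuracy


# Hypothetical marker sets for two clusters and six test individuals
marker_sets = [{0: 1, 1: 0}, {2: 2}]
individuals = [(1, 0, 0), (0, 0, 2), (0, 1, 1), (1, 0, 1), (0, 1, 0), (1, 1, 0)]
labels = [1, 1, 1, 0, 0, 0]
sensitivity, specificity, accuracy = evaluate(individuals, labels, marker_sets)
```

In this toy run two of three cases are caught and one control is flagged by mistake, so sensitivity, specificity, and accuracy all come out to 2/3.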

Experimental Results
• Five-fold cross validation
• K=10 results for Rheum. (large, no comparisons available): sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run
• Comparisons to MDR:
[Figure: bar charts comparing DNF to MDR. Specificity chart: 78.1, 81.9, 56.7, annotated "38% relatively". Sensitivity chart: 87.6, 48.8, 88.4, annotated "79% relatively".]

Experimental Results
• Comparisons to CSP [Brinza commun. 4/09, Brinza et al., WABI’06 ppt: http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]
• Leave-one-out validation
  • For DNF, 20 runs are performed with randomly chosen left-out individuals.
  • CSP performs n runs for n individuals (cases+controls).
[Figure: bar charts of specificity, sensitivity, geometric mean of sens. and spec., and run time (ksecs, per leave-out run). Chart values: 96.6 vs. 71.1 ("36% relatively"); 85 vs. 83.1 ("2.4% relatively"); 90.6 vs. 76.8 ("18% relatively"); 3k vs. 24k ("8 times").]

Experimental Results
• Leave-one-out validation
[Figure: charts of accuracy and average number of clusters. Accuracy values: 76.6 vs. 90.8 ("19% relatively").]

Experimental Results
• Comparing to LINGO (<= 20% from optimal setting)
• The same MIP formulation is solved by LINGO, and we compare the MIP objective function value and run time with DNF.
• Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e., G = 2, 4).
[Figure: LINGO results normalized to DNF = 1, values listed in slide order. Normalized quality (the larger the better): 0.95 (bi-p), 0.96 (quad-p). Normalized run time (smaller is better): 23 (bi-p), 15 (quad-p).]

Experimental Results
• Run time vs. number of SNPs
  • Rheumatoid arthritis data set is used.
  • Randomly chosen 100, 200, 400, 800, 1600, and 2300 SNPs.
• Run time vs. number of patients
  • Crohn’s disease data set is used.
  • No patient pre-clustering. Randomly chosen 30, 60, 90, 120, and 144 patients from the data set.

Conclusions
• We proposed 0/1 non-linear MIP formulations to identify disease markers.
• We consider patient clustering to identify the most appropriate marker sets.
• The discretized network flow (DNF) method is used to efficiently solve the MIP formulations.
• A chain structure is used for improving specificity.
• Significant improvements compared to MDR and CSP.
  • Also much faster run times.
• DNF can be applied to other computationally challenging bioinformatics problems since:
  • DNF can efficiently & near-optimally solve polynomial and Boolean MIPs.
  • DNF can also efficiently & near-optimally solve other discrete optimization problems.

Appendix: Generating Injection Flow
• First, a complementary injection flow is generated on a complementary arc Āk, which is 1 if any mismatched marker for NPz is selected.
• Ak and Āk are coupled by a draining arc.
  • If there is flow on Āk: flow towards Ak is shunted to the sink.
  • If there is no flow on Āk: flow will be drained from Ak, and cause injection flow to the chain.
[Figure: cluster k network from S through the m_{i,j}^{val} nodes that mismatch NPz, with (cap, cost) arc labels (1, C’), (1, C’), (2, C’), (1, -inf), and (1, 0); arcs Ak and Āk, the draining arc, the MM and M chains, and arcs to T]
