Three feature selection problems (with solutions)

Jose M. Peña
Computational Biology, Linköping University, Sweden
jmp@ifm.liu.se
www.ifm.liu.se/~jmp
Joint work with Roland Nilsson, Johan Björkegren and Jesper Tegnér
JMP at IDAMAP 2007

Outline
• Problem I: Posterior distribution.
  • Solution: Markov boundary.
  • Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232.
• Problem II: Class label.
  • Solution: Bayes relevant features.
  • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612.
• Problem III: All relevant features.
  • Solution: RIT algorithm.
  • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.

Preliminaries
• Classifier, g: X → Y.
• Bayes classifier, g*(X) = arg max_y p(y|X).
• Risk, R(g) = p(g(X) ≠ Y).
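The definitions above can be checked by hand on a tiny discrete example. The sketch below is illustrative only: the joint distribution and the names `bayes_classifier` and `risk` are assumptions made for this example, not material from the talk.

```python
# Toy joint distribution p(x, y) over one binary feature x and a binary label y.
# The probabilities are illustrative assumptions, not data from the talk.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def bayes_classifier(x):
    """g*(x) = arg max_y p(y | x); maximizing p(x, y) over y is equivalent."""
    return max((0, 1), key=lambda y: p[(x, y)])

def risk(g):
    """R(g) = p(g(X) != Y): total mass of the (x, y) pairs that g gets wrong."""
    return sum(mass for (x, y), mass in p.items() if g(x) != y)

bayes_risk = risk(bayes_classifier)   # the Bayes classifier errs on (0, 1) and (1, 0)
constant_risk = risk(lambda x: 0)     # a classifier that always predicts 0
```

On this toy distribution the Bayes classifier predicts y = x, and its risk is lower than that of the constant classifier.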

Problem I: Posterior distribution
• The Markov boundary of Y, SM, is the minimal set of features such that p(p(Y|X) = p(Y|SM)) = 1.
• If p(X) > 0 then SM is unique.
• If p(X) > 0 then Z ∈ SM iff p(p(Y|X) ≠ p(Y|X\Z)) > 0, that is, iff Z is strongly relevant. (Slide annotation on this condition: data inefficient.)
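The strong relevance condition can be evaluated exactly when the joint distribution is known. The following is a minimal sketch on a made-up, strictly positive distribution over two binary features in which Y depends on the first feature only; `make_joint`, `posterior` and `strongly_relevant` are hypothetical helper names introduced here.

```python
from itertools import product

def make_joint():
    """Illustrative p(x0, x1, y): X0 and X1 are independent fair bits, Y depends on X0 only."""
    joint = {}
    for x0, x1 in product((0, 1), repeat=2):
        p_y1 = 0.9 if x0 == 1 else 0.2
        joint[(x0, x1, 1)] = 0.25 * p_y1
        joint[(x0, x1, 0)] = 0.25 * (1 - p_y1)
    return joint

def posterior(joint, keep, x):
    """p(Y = 1 | X_keep = x_keep), marginalizing over the features not in keep."""
    num = den = 0.0
    for (*xs, y), mass in joint.items():
        if all(xs[i] == x[i] for i in keep):
            den += mass
            if y == 1:
                num += mass
    return num / den

def strongly_relevant(joint, i, n_features=2, tol=1e-9):
    """True iff p(Y|X) differs from p(Y|X minus feature i) on some x with positive mass."""
    full = list(range(n_features))
    rest = [j for j in full if j != i]
    return any(
        abs(posterior(joint, full, x) - posterior(joint, rest, x)) > tol
        for x in product((0, 1), repeat=n_features)
    )

joint = make_joint()
markov_boundary = [i for i in range(2) if strongly_relevant(joint, i)]
```

Because the toy distribution is strictly positive, the strongly relevant features form the Markov boundary, which here contains only feature 0. Note that the check conditions on all remaining features at once, which is the step that becomes data inefficient when the distribution has to be estimated from samples.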

Algorithms for SM
• IAMB (Tsamardinos et al., 2003) is consistent under the composition property assumption (X ╨ Y | Z ∧ X ╨ W | Z → X ╨ YW | Z).
• The composition property is:
  • Satisfied by Gaussian distributions.
  • Satisfied by distributions perfect to some graph.
  • Closed under marginalization and conditioning*.
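The algorithm listing is not reproduced in this transcript. As a rough illustration, the sketch below assumes the usual two-phase form of IAMB from the literature (a growing phase followed by a shrinking phase) and replaces the statistical test with exact conditional mutual information computed from a known toy distribution. The distribution, the threshold `eps` and the function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product
from math import log

def cmi(joint, a, b, cond):
    """Conditional mutual information I(A; B | cond) from a joint probability table."""
    pabc, pac, pbc, pc = (defaultdict(float) for _ in range(4))
    for v, mass in joint.items():
        c = tuple(v[k] for k in cond)
        pabc[(v[a], v[b], c)] += mass
        pac[(v[a], c)] += mass
        pbc[(v[b], c)] += mass
        pc[c] += mass
    return sum(m * log(m * pc[c] / (pac[(x, c)] * pbc[(y, c)]))
               for (x, y, c), m in pabc.items() if m > 0)

def iamb(joint, target, features, eps=1e-9):
    """Sketch of IAMB with an exact dependence oracle (eps is an assumed threshold)."""
    mb = []
    while True:  # growing phase: add the feature most associated with the target given mb
        candidates = [f for f in features if f not in mb]
        if not candidates:
            break
        best = max(candidates, key=lambda f: cmi(joint, f, target, mb))
        if cmi(joint, best, target, mb) <= eps:
            break
        mb.append(best)
    for f in list(mb):  # shrinking phase: drop features independent of the target given the rest
        rest = [g for g in mb if g != f]
        if cmi(joint, f, target, rest) <= eps:
            mb.remove(f)
    return sorted(mb)

# Toy p(x0, x1, x2, y): Y depends on X0, X1 is a noisy copy of X0, X2 is pure noise.
joint = {}
for x0, x1, x2, y in product((0, 1), repeat=4):
    p_x1 = 0.8 if x1 == x0 else 0.2
    p_y1 = 0.9 if x0 == 1 else 0.2
    joint[(x0, x1, x2, y)] = 0.5 * p_x1 * 0.5 * (p_y1 if y == 1 else 1 - p_y1)

boundary = iamb(joint, target=3, features=[0, 1, 2])
```

With real data the oracle would be replaced by a conditional independence test on samples, and the consistency statement on the slide then concerns the large-sample behavior of that procedure.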

Algorithms for SM
• Consistent under the same conditions as IAMB.

Thrombin data
• Data provided by DuPont Pharmaceuticals for KDD Cup 2001.
• 1909 training instances + 634 testing instances.
• 139351 binary features (3-D properties of a drug compound tested for binding to thrombin, a key receptor in blood clotting).

Problem II: Class label
• Z is Bayes relevant iff p(g*(X) ≠ g*(X\Z)) > 0.
• Let S* denote the set of Bayes relevant features. Then,
  • S* is unique if g* is unique, and
  • g* is unique if p(p(Y=0|X) = p(Y=1|X)) = 0 (Devroye et al. 1996).
• If p(X) > 0 then S* is the minimal set of features such that p(g*(X) = g*(S*)) = 1.
• Slide annotation: Assumption.

S* may differ from SM
• S* ⊆ SM.
• But the converse may not be true.
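A small worked example can make the gap between the two sets concrete. In the sketch below, which uses a made-up distribution over two independent binary features, the second feature shifts the posterior p(Y|X) but never changes which class is more probable, so it is strongly relevant without being Bayes relevant. All names and numbers are assumptions for illustration.

```python
from itertools import product

# Illustrative posterior p(Y = 1 | x0, x1); X0 and X1 are independent fair bits.
POST = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.7, (1, 1): 0.9}
JOINT = {(x0, x1, y): 0.25 * (POST[(x0, x1)] if y == 1 else 1 - POST[(x0, x1)])
         for x0, x1, y in product((0, 1), repeat=3)}

def posterior(keep, x):
    """p(Y = 1 | X_keep = x_keep), marginalizing over the features not in keep."""
    num = sum(m for (*xs, y), m in JOINT.items()
              if y == 1 and all(xs[i] == x[i] for i in keep))
    den = sum(m for (*xs, y), m in JOINT.items()
              if all(xs[i] == x[i] for i in keep))
    return num / den

def bayes_label(keep, x):
    """Bayes classifier restricted to the features in keep."""
    return int(posterior(keep, x) > 0.5)

def strongly_relevant(i):
    rest = [j for j in (0, 1) if j != i]
    return any(abs(posterior([0, 1], x) - posterior(rest, x)) > 1e-9
               for x in product((0, 1), repeat=2))

def bayes_relevant(i):
    rest = [j for j in (0, 1) if j != i]
    return any(bayes_label([0, 1], x) != bayes_label(rest, x)
               for x in product((0, 1), repeat=2))

markov_boundary = [i for i in (0, 1) if strongly_relevant(i)]
bayes_relevant_set = [i for i in (0, 1) if bayes_relevant(i)]
```

Here the Markov boundary contains both features while the Bayes relevant set contains only feature 0, which is an instance of the inclusion being strict.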

Algorithm for S*
• Polynomial in the number of features if ĉ is (e.g., when ĉ is the empirical risk of the k-NN classifier on some testing data).

UCI data sets

Problem III: All relevant features
• Z is weakly relevant iff p(p(Y|X) = p(Y|X\Z)) = 1 but p(p(Y|S) ≠ p(Y|S,Z)) > 0 for some S ⊆ X\Z.
• The set of relevant features, SA, is the set of strongly and weakly relevant features.
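The three cases (strongly relevant, weakly relevant, irrelevant) can be told apart by brute force when the joint distribution is small and known. The sketch below does this for a made-up distribution in which Y depends on X0, X1 is a noisy copy of X0, and X2 is independent noise; the helper names are hypothetical, and the search over all subsets S is exponential, so this is only meant for toy cases.

```python
from itertools import combinations, product

def make_joint():
    """Illustrative p(x0, x1, x2, y): Y depends on X0, X1 is a noisy copy of X0, X2 is noise."""
    joint = {}
    for x0, x1, x2, y in product((0, 1), repeat=4):
        p_x1 = 0.8 if x1 == x0 else 0.2
        p_y1 = 0.9 if x0 == 1 else 0.2
        joint[(x0, x1, x2, y)] = 0.5 * p_x1 * 0.5 * (p_y1 if y == 1 else 1 - p_y1)
    return joint

def posterior(joint, keep, x):
    """p(Y = 1 | X_keep = x_keep), marginalizing over the features not in keep."""
    num = den = 0.0
    for (*xs, y), mass in joint.items():
        if all(xs[i] == x[i] for i in keep):
            den += mass
            num += mass if y == 1 else 0.0
    return num / den

def differs(joint, a, b, n=3, tol=1e-9):
    """True iff p(Y | X_a) and p(Y | X_b) disagree on some assignment x."""
    return any(abs(posterior(joint, a, x) - posterior(joint, b, x)) > tol
               for x in product((0, 1), repeat=n))

def relevance(joint, i, n=3):
    rest = tuple(j for j in range(n) if j != i)
    if differs(joint, tuple(range(n)), rest):
        return "strong"
    for size in range(len(rest) + 1):          # exhaustive search over subsets S of X\Z
        for s in combinations(rest, size):
            if differs(joint, s, s + (i,)):
                return "weak"
    return "irrelevant"

joint = make_joint()
labels = [relevance(joint, i) for i in range(3)]
```

In this example X1 carries no information about Y once X0 is known, yet it is informative on its own (S empty), so it is weakly relevant and belongs to SA together with X0.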

Why is this important?

Algorithm for SA
• There exist distributions f(X,Y) > 0 for which finding SA requires an exhaustive search.
• RIT is consistent under the following assumptions:
  • strict positivity (f(X) > 0),
  • composition (X ╨ Y | Z ∧ X ╨ W | Z → X ╨ YW | Z), and
  • weak transitivity (X ╨ Y | Z ∧ X ╨ Y | ZV → X ╨ V | Z ∨ V ╨ Y | Z).
• Satisfied by
  • Gaussian distributions.
  • Distributions perfect to some graph.
  • Closed under marginalization and conditioning*.
• RIT performs at most |SA|·|X| tests (|SA| < |X|).

Algorithm for SA
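The algorithm listing itself did not survive in this transcript. The sketch below is one plausible reading of a recursive independence test: it assumes the procedure first collects every feature that is marginally dependent on the target, and then recurses with each collected feature as the new target over the features not yet collected. The exact mutual information oracle, the threshold `eps`, the toy distribution and all function names are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import product
from math import log

def mutual_information(joint, a, b):
    """Marginal mutual information I(A; B) from a joint probability table."""
    pab, pa, pb = defaultdict(float), defaultdict(float), defaultdict(float)
    for v, m in joint.items():
        pab[(v[a], v[b])] += m
        pa[v[a]] += m
        pb[v[b]] += m
    return sum(m * log(m / (pa[x] * pb[y])) for (x, y), m in pab.items() if m > 0)

def dependent(joint, a, b, counter, eps=1e-9):
    """Stand-in for a marginal independence test; also counts how many tests are run."""
    counter["tests"] += 1
    return mutual_information(joint, a, b) > eps

def rit(joint, target, features, counter):
    """Recursive sketch: marginal tests against the target, then recurse on each hit."""
    found = [f for f in features if dependent(joint, f, target, counter)]
    for f in list(found):
        remaining = [g for g in features if g not in found]
        for h in rit(joint, f, remaining, counter):
            if h not in found:
                found.append(h)
    return found

# Toy p(x0, x1, x2, y) with Y -> X0 <- X1 and X2 independent of everything:
# X1 is marginally independent of Y but becomes informative once X0 is known.
joint = {}
for x0, x1, x2, y in product((0, 1), repeat=4):
    p_x0 = 0.1 + 0.4 * y + 0.4 * x1
    joint[(x0, x1, x2, y)] = 0.125 * (p_x0 if x0 == 1 else 1 - p_x0)

counter = {"tests": 0}
relevant = sorted(rit(joint, target=3, features=[0, 1, 2], counter=counter))
```

Only marginal (unconditional) tests are used, which is what keeps the procedure cheap in data; the value in `counter["tests"]` can be compared against the |SA|·|X| bound stated on the previous slide.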

Algorithm for SA with FDR control

Diabetes data
• Data from Gunton et al. (2005), Cell, 122.
• 7 normal vs. 15 type 2 diabetic patients, and 5000 genes kept after filtering out those with low variance.
• 3 genes are univariately differentially expressed: Arnt, Cdc14a and Ddx3Y (370 if no control for multiplicity).
• Dopey1 was recently shown to be active in the vesicle traffic system, the mechanism that delivers insulin receptors to the cell surface.
• 4 genes encode TFs, which is intriguing since a large fraction of previously discovered diabetes-related genes are TFs.
• So does Ddx3Y (only 6 genes annotated with this function).

Summary
• Problem I: Posterior distribution.
  • Solution: Markov boundary.
  • Peña, J. M., Nilsson, R., Björkegren, J. and Tegnér, J. (2007). Towards Scalable and Data Efficient Learning of Markov Boundaries. International Journal of Approximate Reasoning, 45(2), 211-232.
• Problem II: Class label.
  • Solution: Bayes relevant features.
  • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research, 8, 589-612.
• Problem III: All relevant features.
  • Solution: RIT algorithm.
  • Nilsson, R., Peña, J. M., Björkegren, J. and Tegnér, J. (2007). Detecting Multivariate Differentially Expressed Genes. BMC Bioinformatics, 8:150.
