Explore how to maximize feature informativeness in data analysis: select the most relevant features to predict an outcome accurately. Learn why this problem is combinatorial, and discover the role of submodularity and diminishing returns, illustrated with feature selection and set cover.
Example: Feature selection

[Figure: naïve Bayes model with class Y "Sick" and features X1 "Fever", X2 "Rash", X3 "Male"; information gain is the drop from the uncertainty before knowing X_A to the uncertainty after knowing X_A.]

• Given random variables Y, X1, …, Xn
• Want to predict Y from a subset X_A = (X_{i1}, …, X_{ik})
• Want the k most informative features:

  A* = argmax IG(X_A; Y) s.t. |A| ≤ k,

  where IG(X_A; Y) = H(Y) − H(Y | X_A)

• The problem is inherently combinatorial!
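A minimal Python sketch of this objective, assuming discrete variables stored as columns of a NumPy array; the helper names, data layout, and the exhaustive search are illustrative assumptions, not from the slides:

import numpy as np
from itertools import combinations

def entropy(y):
    """Empirical Shannon entropy H(Y) of a discrete sample, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_entropy(y, X_A):
    """Empirical conditional entropy H(Y | X_A); X_A holds the chosen columns."""
    keys = [tuple(row) for row in X_A]
    h = 0.0
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        h += mask.mean() * entropy(y[mask])   # P(X_A = key) * H(Y | X_A = key)
    return h

def info_gain(y, X, A):
    """IG(X_A; Y) = H(Y) - H(Y | X_A)."""
    if not A:
        return 0.0
    return entropy(y) - cond_entropy(y, X[:, sorted(A)])

def best_subset(y, X, k):
    """Exhaustive search over all size-k subsets: exponential in n,
    which is exactly why the problem is inherently combinatorial."""
    n = X.shape[1]
    return max(combinations(range(n), k),
               key=lambda A: info_gain(y, X, set(A)))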
Key property: Diminishing returns

[Figure: with selection A = {} (nothing yet known about Y "Sick"), adding the new feature X1 "Fever" will help a lot; with selection B = {X2, X3} ("Rash" and "Male" already known), adding X1 doesn't help much.]

Submodularity: for A ⊆ B,

  z(A ∪ {s}) − z(A) ≥ z(B ∪ {s}) − z(B)

Adding s to the small set A yields a large improvement; adding s to the large set B yields only a small improvement.
Submodular set functions

• A set function z on a ground set V is called submodular if, for all A, B ⊆ V:

  z(A) + z(B) ≥ z(A ∪ B) + z(A ∩ B)

• Equivalent diminishing-returns characterization: for A ⊆ B and s ∉ B,

  z(A ∪ {s}) − z(A) ≥ z(B ∪ {s}) − z(B)

  i.e., adding s to the smaller set A gives at least as large an improvement as adding s to the superset B.
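Both characterizations can be sanity-checked by brute force on a tiny ground set. A hedged sketch, assuming z is supplied as a Python callable on frozensets (enumeration is exponential, so this is for toy examples only):

from itertools import chain, combinations

def subsets(V):
    """All subsets of ground set V, as frozensets."""
    return [frozenset(c) for c in
            chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))]

def is_submodular(z, V, tol=1e-9):
    """First definition: z(A) + z(B) >= z(A ∪ B) + z(A ∩ B) for all A, B ⊆ V."""
    subs = subsets(V)
    return all(z(A) + z(B) + tol >= z(A | B) + z(A & B)
               for A in subs for B in subs)

def diminishing_returns(z, V, tol=1e-9):
    """Equivalent definition: z(A ∪ {s}) - z(A) >= z(B ∪ {s}) - z(B)
    for all A ⊆ B and s ∉ B."""
    subs = subsets(V)
    return all(z(A | {s}) - z(A) + tol >= z(B | {s}) - z(B)
               for B in subs for A in subs if A <= B
               for s in V if s not in B)

On any toy z the two checks agree, reflecting the stated equivalence.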
Example: Set cover

Place sensors in a building to cover the floorplan with discs, choosing from a set of possible locations V; a sensor at a node predicts the values of all positions within some radius.

• For A ⊆ V: z(A) = "area covered by sensors placed at A"
• Formally: let W be a finite set and S_1, …, S_n ⊆ W a collection of n subsets. For A ⊆ V = {1, …, n}, define

  z(A) = |⋃_{i∈A} S_i|
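The coverage function itself is only a few lines; a sketch with invented placeholder regions, where grid cells stand in for covered area:

# S[i] is the region covered by a sensor at location i,
# discretized as a set of grid cells; V = {0, ..., n-1}.
S = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {6, 7}]

def z(A):
    """z(A) = |union of S_i for i in A|: number of cells covered by A."""
    covered = set()
    for i in A:
        covered |= S[i]
    return len(covered)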
Set cover is submodular

[Figure: the same new disc S' is added to A = {S1, S2} and to B = {S1, S2, S3, S4}; the freshly covered area, i.e. the marginal gain, is visibly larger for A than for B.]

For A = {S1, S2} ⊆ B = {S1, S2, S3, S4}:

  z(A ∪ {S'}) − z(A) ≥ z(B ∪ {S'}) − z(B)
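The inequality can be confirmed numerically on the placeholder instance above; S' and its coverage are again invented for illustration:

# A new disc S' with invented coverage; index s points at it.
S.append({0, 2, 5})
s = len(S) - 1
A, B = {0, 1}, {0, 1, 2, 3}          # A ⊆ B, as in the picture

gain_A = z(A | {s}) - z(A)           # S' adds cells {0, 5} to A's coverage: gain 2
gain_B = z(B | {s}) - z(B)           # S' adds only cell {0} to B's coverage: gain 1
assert gain_A >= gain_B              # diminishing returns holds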