Maximizing Submodular Functions and Applications in Machine Learning
Andreas Krause, Carlos Guestrin
Carnegie Mellon University
Tutorial Outline
• Introduction
• Submodularity: properties and examples
• Maximizing submodular functions
• Applications in Machine Learning
Feature selection
• Given random variables Y, X1, …, Xn
• Want to predict Y from a subset XA = (Xi1, …, Xik)
• Want the k most informative features: A* = argmax IG(XA; Y) s.t. |A| ≤ k, where IG(XA; Y) = H(Y) − H(Y | XA), i.e., the uncertainty about Y before knowing XA minus the uncertainty after knowing XA
• The problem is inherently combinatorial!
[Figure: Naïve Bayes model with class Y "Sick" and features X1 "Fever", X2 "Rash", X3 "Male"]
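To make the objective concrete, here is a minimal Python sketch that computes IG(XA; Y) = H(Y) − H(Y | XA) from an explicit joint distribution. The tiny distribution over binary (Y, X1, X2) and all function names are our own illustration, not part of the tutorial.

```python
from math import log2

# Hypothetical joint distribution P(y, x1, x2); the numbers are made up.
P = {
    (1, 1, 1): 0.20, (1, 1, 0): 0.15, (1, 0, 1): 0.10, (1, 0, 0): 0.05,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10, (0, 0, 1): 0.15, (0, 0, 0): 0.20,
}

def marginal(P, idx):
    """Marginal distribution over the outcome coordinates in idx."""
    out = {}
    for outcome, p in P.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def info_gain(P, feature_idx):
    """IG(X_A; Y) = H(Y) - H(Y | X_A), with Y at coordinate 0
    and H(Y | X_A) computed as H(Y, X_A) - H(X_A)."""
    H_Y = entropy(marginal(P, (0,)))
    H_joint = entropy(marginal(P, (0,) + tuple(feature_idx)))
    H_XA = entropy(marginal(P, tuple(feature_idx)))
    return H_Y - (H_joint - H_XA)

print(info_gain(P, (1,)), info_gain(P, (1, 2)))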
Factoring distributions
• Given random variables X1, …, Xn
• Partition the variable set V into sets A and V\A that are as independent as possible
• Formally: want A* = argmin_A I(XA; X_{V\A}) s.t. 0 < |A| < n, where I(XA; XB) = H(XB) − H(XB | XA)
• A fundamental building block in structure learning [Narasimhan & Bilmes, UAI '04]
• The problem is inherently combinatorial!
[Figure: variables X1, …, X7 partitioned into blocks A and V\A]
Combinatorial problems in ML
Given a (finite) set V and a function F: 2^V → ℝ, want A* = argmin F(A) s.t. some constraints on A.
Solving combinatorial problems:
• Mixed integer programming? Often difficult to scale to large problems
• Relaxations (e.g., L1 regularization, etc.)? Not clear when they work
• This talk: fully combinatorial algorithms (spanning tree, matching, …)
Example: Greedy algorithm for feature selection
Given: finite set V of features, utility function F(A) = IG(XA; Y)
Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k. NP-hard!
How well can this simple heuristic do?
Greedy algorithm:
• Start with A = ∅
• For i = 1 to k:
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
[Figure: Naïve Bayes model with Y "Sick" and X1 "Fever", X2 "Rash", X3 "Male"]
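The greedy heuristic above is only a few lines of code. A minimal sketch, assuming F is a black-box function taking a Python set, as in the slides (the name and signature are ours):

```python
def greedy_maximize(V, F, k):
    """Greedy maximization: repeatedly add the element with the largest
    marginal gain F(A | {s}) - F(A) until k elements are chosen."""
    A = set()
    for _ in range(min(k, len(V))):
        best = max((s for s in V if s not in A),
                   key=lambda s: F(A | {s}) - F(A))
        A.add(best)
    return A
```

For instance, paired with the hypothetical info_gain sketch above, greedy_maximize({1, 2}, lambda A: info_gain(P, tuple(sorted(A))), k=1) would pick the single most informative feature.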
Key property: Diminishing returns
Compare selection A = {} with selection B = {X2, X3}: adding the new feature X1 to A will help a lot, but adding X1 to B doesn't help much.
Submodularity: for A ⊆ B, F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
(adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)
Theorem [Krause, Guestrin UAI '05]: The information gain F(A) in Naïve Bayes models is submodular!
Why is submodularity useful?
Theorem [Nemhauser et al '78]: The greedy maximization algorithm returns A_greedy with
F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., at least ~63% of optimal.
• The greedy algorithm gives a near-optimal solution!
• More details and exact statement later
• For info-gain: this guarantee is the best possible unless P = NP! [Krause, Guestrin UAI '05]
Submodularity: Properties and Examples
Set functions
• Finite set V = {1, 2, …, n}
• Function F: 2^V → ℝ
• Will always assume F(∅) = 0 (w.l.o.g.)
• Assume a black box that can evaluate F for any input A
• Approximate (noisy) evaluation of F is ok (e.g., [37])
• Example: F(A) = IG(XA; Y) = H(Y) − H(Y | XA) = Σ_{y,xA} P(y, xA) [log P(y | xA) − log P(y)]
[Figure: with Y "Sick", X1 "Fever", X2 "Rash", X3 "Male", e.g. F({X1, X2}) = 0.9 but F({X2, X3}) = 0.5]
Submodular set functions
• A set function F on V is called submodular if for all A, B ⊆ V:
  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
• Equivalent diminishing-returns characterization: for A ⊆ B and s ∉ B,
  F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
  (adding s to the small set A yields a large improvement; adding s to the superset B yields a small one)
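On small ground sets, the diminishing-returns condition can be verified exhaustively. A brute-force checker, purely our own illustration (exponential in |V|, so toy examples only):

```python
from itertools import combinations

def is_submodular(V, F, tol=1e-12):
    """Check F(A ∪ {s}) - F(A) >= F(B ∪ {s}) - F(B)
    for every A ⊆ B ⊆ V and every s ∉ B."""
    V = list(V)
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for s in V:
                if s in B:
                    continue
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - tol:
                    return False
    return True

# Example: a coverage-style function F(A) = |union of the sets chosen by A|
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}}
F = lambda A: len(set().union(*(sets[i] for i in A)))
print(is_submodular(sets.keys(), F))   # True: coverage is submodular
```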
Submodularity and supermodularity
• A set function F on V is called submodular if (equivalently):
  1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
  2) for all A ⊆ B and s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
• F is called supermodular if −F is submodular
• F is called modular if F is both sub- and supermodular; for modular ("additive") F, F(A) = Σ_{i∈A} w(i)
Example: Set cover
Place sensors in a building to cover the floorplan with discs: each node predicts the values of positions within some radius.
Possible locations V; for A ⊆ V: F(A) = "area covered by sensors placed at A"
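A discretized stand-in for the "area covered" objective: count grid cells within range of any chosen sensor. All the geometry here (grid, radius, coordinates) is invented for illustration; the resulting function is a coverage function, hence submodular.

```python
def coverage_area(locations, A, radius=1.0, grid=30, extent=5.0):
    """F(A) = number of grid cells within `radius` of some sensor in A,
    a discrete proxy for covered floor area."""
    step = extent / grid
    covered = 0
    for i in range(grid):
        for j in range(grid):
            x, y = (i + 0.5) * step, (j + 0.5) * step
            if any((x - px) ** 2 + (y - py) ** 2 <= radius ** 2
                   for px, py in (locations[s] for s in A)):
                covered += 1
    return covered

locations = {1: (1.0, 1.0), 2: (2.0, 2.5), 3: (4.0, 4.0)}  # hypothetical
print(coverage_area(locations, {1, 2}))
```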
Set cover is submodular
With A = {S1, S2} and B = {S1, S2, S3, S4}, a new sensor S' covers more new area on top of the small placement A than on top of the larger placement B:
F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B)
Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]
Who should get free cell phones?
V = {Dr. Zheng, Dr. Han, Khuong, Huy, Thanh, Nam}
F(A) = expected number of people influenced when targeting A
[Figure: social network over V with probabilities of influencing (0.2–0.5) on the edges]
Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]
Key idea: flip the coins c in advance, which fixes a set of "live" edges.
F_c(A) = people influenced under outcome c (a set cover function!)
F(A) = Σ_c P(c) F_c(A) is a nonnegative combination of submodular functions, hence submodular as well!
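The live-edge view also suggests how to estimate F(A): sample coin flips c and average the resulting reachable-set sizes. A Monte Carlo sketch under the independent cascade model; the graph and probabilities below are made up, loosely following the slide's figure.

```python
import random

def simulate_influence(edges, A, trials=1000, seed=0):
    """Estimate F(A) = E_c[F_c(A)]: flip each edge 'live' with its
    probability, then count nodes reachable from the seed set A."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:        # edge is live under outcome c
                adj.setdefault(u, []).append(v)
        reached, frontier = set(A), list(A)
        while frontier:                 # BFS/DFS over live edges
            u = frontier.pop()
            for v in adj.get(u, []):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / trials

edges = {("Zheng", "Khuong"): 0.5, ("Khuong", "Huy"): 0.4,
         ("Han", "Thanh"): 0.2, ("Thanh", "Nam"): 0.5,
         ("Huy", "Nam"): 0.3, ("Han", "Khuong"): 0.2}
print(simulate_influence(edges, {"Zheng", "Han"}))
```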
Submodularity and concavity
Suppose g: ℕ → ℝ and F(A) = g(|A|).
Then F(A) is submodular if and only if g is concave!
E.g., g could say "buying in bulk is cheaper".
[Figure: concave curve g(|A|) plotted against |A|]
Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular?
max(F1, F2) is not submodular in general!
[Figure: two concave curves F1(A) and F2(A) over |A| whose pointwise maximum is not concave]
Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?
Counterexample on V = {a, b}: let F1(A) = 1 iff a ∈ A and F2(A) = 1 iff b ∈ A (both modular, hence submodular). For F = min(F1, F2):
F({b}) − F(∅) = 0 < F({a, b}) − F({a}) = 1,
violating diminishing returns, so min(F1, F2) is not submodular in general!
Maximizing submodular functions
The analogy with convexity:
• Minimizing convex functions: polynomial-time solvable!
• Minimizing submodular functions: polynomial-time solvable!
• Maximizing convex functions: NP-hard!
• Maximizing submodular functions: NP-hard! But we can get approximation guarantees.
Maximizing non-monotonic functions
• Suppose we want A* = argmax F(A) s.t. A ⊆ V for a non-monotonic F (the maximum can lie at an interior |A|, not at the largest allowed set).
• Example: F'(A) = F(A) − C(A), where F(A) is a submodular utility and C(A) is a linear cost function. E.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]
• In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1−ε}) approximation).
Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS '07]
Theorem: There is an efficient randomized local search procedure that, given a positive submodular function F with F(∅) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A).
• Even picking a random set gives a 1/4 approximation.
• We cannot get better than a 3/4 approximation unless P = NP.
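For intuition, here is a plain deterministic add/delete local search. This is only in the spirit of the theorem: the guaranteed (2/5) procedure of Feige, Mirrokni and Vondrák is randomized and uses carefully chosen improvement thresholds, which this sketch simplifies.

```python
def local_search(V, F, eps=1e-4):
    """Add or delete single elements while the value improves by a
    (1 + eps) factor; finally return the better of A and its complement
    (mirroring the complement step in the FMV analysis)."""
    V, A = set(V), set()
    improved = True
    while improved:
        improved = False
        for s in V - A:                       # try adding an element
            if F(A | {s}) > (1 + eps) * F(A):
                A.add(s); improved = True; break
        if improved:
            continue
        for s in set(A):                      # try deleting an element
            if F(A - {s}) > (1 + eps) * F(A):
                A.remove(s); improved = True; break
    return A if F(A) >= F(V - A) else V - A
```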
Scalarization vs. constrained maximization
Given a monotonic utility F(A) and a cost C(A), two ways to optimize:
• Option 1 ("scalarization"): max_A F(A) − C(A) s.t. A ⊆ V
  Can get a 2/5 approximation… if F(A) − C(A) ≥ 0 for all A ⊆ V. Positiveness is a strong requirement!
• Option 2 ("constrained maximization"): max_A F(A) s.t. C(A) ≤ B
  Coming up…
Constrained maximization: Outline
max_A F(A) s.t. C(A) ≤ B, where A is the selected set, F is monotonic submodular, C(A) is the selection cost, and B is the budget.
• Subset selection: C(A) = |A|
• Complex constraints
• Robust optimization
Monotonicity
• A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B)
• Examples:
  • Influence in social networks [Kempe et al KDD '03]
  • For discrete RVs, the entropy F(A) = H(XA) is monotonic: suppose B = A ∪ C; then F(B) = H(XA, XC) = H(XA) + H(XC | XA) ≥ H(XA) = F(A)
  • Information gain: F(A) = H(Y) − H(Y | XA)
  • Set cover
  • Matroid rank functions (dimension of vector spaces, …)
  • …
Subset selection
Given: finite set V, monotonic submodular function F with F(∅) = 0
Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k. NP-hard!
Exact maximization of monotonic submodular functions
1) Mixed integer programming [Nemhauser et al '81]:
   max η
   s.t. η ≤ F(B) + Σ_{s ∈ V\B} α_s δ_s(B) for all B ⊆ V
        Σ_s α_s ≤ k, α_s ∈ {0, 1}
   where δ_s(B) = F(B ∪ {s}) − F(B); solved using constraint generation.
2) Branch-and-bound: the "data-correcting algorithm" [Goldengorin et al '99]
Both algorithms are worst-case exponential!
Approximate maximization
Given: finite set V, monotonic submodular function F(A)
Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k. NP-hard!
Greedy algorithm:
• Start with A0 = ∅
• For i = 1 to k:
  s_i := argmax_s F(A_{i−1} ∪ {s}) − F(A_{i−1})
  A_i := A_{i−1} ∪ {s_i}
Performance of the greedy algorithm
Theorem [Nemhauser et al '78]: Given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns A_greedy with
F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A), i.e., at least ~63% of optimal.
Example: Submodularity of info-gain
Y1, …, Ym, X1, …, Xn are discrete RVs; F(A) = IG(Y; XA) = H(Y) − H(Y | XA)
• F(A) is always monotonic
• However, it is NOT always submodular
Theorem [Krause & Guestrin UAI '05]: If the Xi are all conditionally independent given Y, then F(A) is submodular!
Hence, the greedy algorithm works. In fact, NO algorithm can do better than the (1 − 1/e) approximation!
[Figure: graphical model with Y1, Y2, Y3 as parents of X1, …, X4]
Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST '07]
• People sit a lot
• Activity recognition in assistive technologies
• Seating pressure as a user interface
A chair equipped with 1 sensor per cm² achieves 82% accuracy on 10 postures (lean forward, lean left, slouch, …) [Tan et al], but it costs $16,000!
Can we get similar accuracy with fewer, cheaper sensors?
How to place sensors on a chair?
• Treat sensor readings at locations V as random variables
• Predict posture Y using a probabilistic model P(Y, V)
• Pick sensor locations A* ⊆ V to minimize the conditional entropy H(Y | XA)
Placed the sensors and ran a user study: similar accuracy at <1% of the cost!
Monitoring water networks [Krause et al, J Wat Res Mgt 2008]
• Contamination of drinking water could affect millions of people
• Place sensors (e.g., the Hach Sensor, ~$14K each) to detect contaminations, using a contamination simulator from the EPA
• "Battle of the Water Sensor Networks" competition: where should we place sensors to quickly detect contamination?
Model-based sensing
• Utility of placing sensors is based on a model of the world; for water networks, the water flow simulator from the EPA
• F(A) = expected impact reduction from placing sensors at A: over the set V of all network junctions, the model predicts high-, medium-, and low-impact locations, and a sensor reduces impact through early detection (e.g., a poor placement achieves a low impact reduction F(A) = 0.01, a good one F(A) = 0.9)
Theorem [Krause et al., J Wat Res Mgt '08]: The impact reduction F(A) in water networks is submodular!
Battle of the Water Sensor Networks Competition
• Real metropolitan-area network (12,527 nodes)
• Water flow simulator provided by the EPA
• 3.6 million contamination events
• Multiple objectives: detection time, affected population, …
• Place sensors that detect well "on average"
Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
[Plot: population protected F(A) (higher is better) vs. number of sensors placed, on water networks data; the greedy solution sits well below the offline (Nemhauser) bound]
The (1 − 1/e) bound is quite loose… can we get better bounds?
Data-dependent bounds [Minoux '78]
• Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1, …, sk} be an optimal solution.
• Then F(A*) ≤ F(A ∪ A*) (monotonicity)
  = F(A) + Σ_i [F(A ∪ {s1, …, si}) − F(A ∪ {s1, …, si−1})] (telescoping)
  ≤ F(A) + Σ_i [F(A ∪ {si}) − F(A)] (submodularity)
  = F(A) + Σ_i δ_{si}
• For each s ∈ V\A, let δ_s = F(A ∪ {s}) − F(A), and order so that δ_1 ≥ δ_2 ≥ … ≥ δ_n.
• Then: F(A*) ≤ F(A) + Σ_{i=1}^{k} δ_i
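The resulting bound is cheap to compute alongside any candidate solution. A sketch (function name is ours; A is a Python set, F a black-box set function):

```python
def minoux_bound(V, F, A, k):
    """Data-dependent upper bound [Minoux '78] on max_{|A*|<=k} F(A*):
    F(A) plus the sum of the k largest marginal gains over V \ A."""
    FA = F(A)
    deltas = sorted((F(A | {s}) - FA for s in V if s not in A),
                    reverse=True)
    return FA + sum(deltas[:k])
```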
Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
[Plot: sensing quality F(A) (higher is better) vs. number of sensors placed, on water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound and lies just above the greedy solution]
Submodularity gives data-dependent bounds on the performance of any algorithm.
BWSN Competition results [Ostfeld et al., J Wat Res Mgt 2008]
• 13 participants; performance measured on 30 different criteria
• Methods: G = genetic algorithm, D = domain knowledge, E = "exact" method (MIP), H = other heuristic
[Chart: total score (higher is better) per team: Krause et al., Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, Preis & Ostfeld]
• 24% better performance than the runner-up!
"Lazy" greedy algorithm [Minoux '78]
Lazy greedy algorithm:
• First iteration as usual
• Keep an ordered list of marginal benefits δ_i from the previous iteration
• Re-evaluate δ_i only for the top element
• If δ_i stays on top, use it; otherwise re-sort
This is valid because, by submodularity, marginal benefits can only shrink as A grows, so the cached δ_i are upper bounds.
Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. '07]
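A sketch of lazy greedy with a max-heap of stale gains (our implementation, assuming the same black-box set-function interface as before):

```python
import heapq

def lazy_greedy(V, F, k):
    """Lazy greedy [Minoux '78]: cached marginal gains are upper bounds
    under submodularity, so only the top candidate needs re-evaluation."""
    A = set()
    FA = F(A)
    # Heap entries are (negated stale gain, tiebreaker, element).
    heap = [(-(F({s}) - FA), i, s) for i, s in enumerate(V)]
    heapq.heapify(heap)
    for _ in range(min(k, len(heap))):
        while True:
            _, i, s = heapq.heappop(heap)
            gain = F(A | {s}) - FA          # re-evaluate top element only
            if not heap or -heap[0][0] <= gain:
                A.add(s)                    # still on top: take it
                FA += gain
                break
            heapq.heappush(heap, (-gain, i, s))  # else re-insert, "re-sort"
    return A
```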
Submodularity to the rescue: result of lazy evaluation
• Very accurate computation of F(A): 3.6M contaminations simulated (2 weeks on 40 processors), 152 GB of data on disk, 16 GB in main memory (compressed)
• But very slow evaluation of F(A): naive greedy takes 30 hours/20 sensors, i.e., 6 weeks for all 30 settings
• Using "lazy evaluations": 1 hour/20 sensors; done after 2 days!
[Plot: running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets), naive greedy, and fast (lazy) greedy]
What about worst-case? [Krause et al., NIPS '07]
A placement that detects well on "average-case" (accidental) contamination can fail badly in the worst case: knowing the sensor locations, an adversary contaminates where detection is worst. Two placements can have very different average-case impact but the same worst-case impact.
Where should we place sensors to quickly detect in the worst case?
Constrained maximization: Outline
max_A F(A) s.t. C(A) ≤ B, where A is the selected set, F is the utility function, C(A) is the selection cost, and B is the budget.
• Subset selection (done)
• Complex constraints
• Robust optimization (next)
Optimizing for the worst case
• Separate utility function F_i for each contamination i: F_i(A) = impact reduction by sensors A for contamination i
• Want to solve A* = argmax_{|A| ≤ k} min_i F_i(A)
• Each of the F_i is submodular; unfortunately, min_i F_i is not submodular!
• E.g., sensors A may score high on contamination at node s (F_s(A) high) but low at node r (F_r(A) low), while sensors B score high on both.
How can we solve this robust optimization problem?
How does the greedy algorithm do?
With a budget of k = 2, greedy picks its first element myopically and can then complete the set only badly: the optimal solution scores 1, while the greedy score stays arbitrarily small. So greedy does arbitrarily badly. Is there something better?
Theorem [NIPS '07]: The problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P = NP.
Hence we can't find any approximation algorithm. Or can we?
Alternative formulation
If somebody told us the optimal value c, could we recover the optimal solution A*?
We would need to find A* with min_i F_i(A*) ≥ c and |A*| ≤ k.
Is this any easier? Yes, if we relax the constraint |A| ≤ k.
Solving the alternative problem
Trick: for each F_i and c, define the truncation F'_{i,c}(A) = min(F_i(A), c). It remains submodular!
• Problem 1 (last slide): find A with min_i F_i(A) ≥ c. Non-submodular; we don't know how to solve it directly.
• Problem 2: find A with average truncated value (1/m) Σ_i F'_{i,c}(A) = c. Submodular! But the target value appears as a constraint?
Same optimal solutions! Solving one solves the other.
Maximization vs. coverage
• Previously: wanted A* = argmax F(A) s.t. |A| ≤ k
• Now need to solve: A* = argmin |A| s.t. F(A) ≥ Q
Greedy algorithm:
• Start with A := ∅
• While F(A) < Q and |A| < n:
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
Theorem [Wolsey et al]: Greedy will return A_greedy with |A_greedy| ≤ (1 + log max_s F({s})) |A_opt|
(For the bound, assume F is integral; if not, just round it.)
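The coverage variant differs from the earlier greedy only in its stopping rule: it runs until the quota Q is met rather than for a fixed k steps. A sketch, same assumed interface as before:

```python
def greedy_coverage(V, F, Q):
    """Greedy for min |A| s.t. F(A) >= Q: keep adding the element with
    the largest value until the quota Q is met (or V is exhausted)."""
    A = set()
    while F(A) < Q and len(A) < len(V):
        best = max((s for s in V if s not in A),
                   key=lambda s: F(A | {s}))
        A.add(best)
    return A
```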
Solving the alternative problem (continued)
Trick: for each F_i and c, use the truncation F'_{i,c}(A) = min(F_i(A), c).
• Problem 1 (two slides back): non-submodular; we don't know how to solve it directly.
• Problem 2 (truncated average): submodular, and via the coverage formulation we can now use the greedy algorithm!
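Putting the pieces together (truncation, the coverage greedy, and a bisection on the target value c) gives a sketch in the spirit of the SATURATE algorithm of [Krause et al., NIPS '07]. Note the actual algorithm additionally allows a constant-factor slack on |A|, which its guarantee needs; this simplified version omits it, and all names are ours.

```python
def saturate(V, Fs, k, c_max, iters=30, tol=1e-6):
    """Bisect on the target c. Fs is a list of submodular objectives F_i;
    we seek A with |A| <= k making min_i F_i(A) large. The truncated
    average Fbar_c(A) equals c exactly when min_i F_i(A) >= c."""
    def Fbar(A, c):
        # Average of truncations min(F_i(A), c): submodular for each c.
        return sum(min(F(A), c) for F in Fs) / len(Fs)
    lo, hi, best = 0.0, c_max, set()
    for _ in range(iters):
        c = (lo + hi) / 2
        A = set()
        # Greedily cover Fbar_c up to the quota c.
        while Fbar(A, c) < c - tol and len(A) < len(V):
            A.add(max((s for s in set(V) - A),
                      key=lambda s: Fbar(A | {s}, c)))
        if Fbar(A, c) >= c - tol and len(A) <= k:
            best, lo = A, c    # target c achievable within budget: raise it
        else:
            hi = c             # give up on c: lower the target
    return best
```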