
Maximizing Submodular Functions and Applications in Machine Learning


Presentation Transcript


  1. Maximizing Submodular Functions and Applications in Machine Learning
  Andreas Krause, Carlos Guestrin, Carnegie Mellon University

  2. Tutorial Outline
  • Introduction
  • Submodularity: properties and examples
  • Maximizing submodular functions
  • Applications in Machine Learning

  3. Feature selection
  • Given random variables Y, X1, …, Xn
  • Want to predict Y from a subset XA = (Xi1, …, Xik)
  • Want the k most informative features: A* = argmax IG(XA; Y) s.t. |A| ≤ k, where IG(XA; Y) = H(Y) - H(Y | XA)
  • Problem is inherently combinatorial!
  [Figure: Naïve Bayes model with class Y "Sick" and features X1 "Fever", X2 "Rash", X3 "Male"; uncertainty about Y before vs. after knowing XA]

  4. Factoring distributions
  • Given random variables X1, …, Xn
  • Partition the variable set V into sets A and V∖A that are as independent as possible
  • Formally: want A* = argminA I(XA; XV∖A) s.t. 0 < |A| < n, where I(XA; XB) = H(XB) - H(XB | XA)
  • Fundamental building block in structure learning [Narasimhan & Bilmes, UAI '04]
  • Problem is inherently combinatorial!
  [Figure: variables X1, …, X7 split into two groups A and V∖A]

  5. Combinatorial problems in ML
  Given a (finite) set V and a function F: 2^V → R, want A* = argmin F(A) s.t. some constraints on A
  Solving combinatorial problems:
  • Mixed integer programming? Often difficult to scale to large problems
  • Relaxations? (e.g., L1 regularization, etc.) Not clear when they work
  • This talk: fully combinatorial algorithms (spanning tree, matching, …)

  6. Example: Greedy algorithm for feature selection
  Given: finite set V of features, utility function F(A) = IG(XA; Y)
  Want: A* ⊆ V maximizing F(A) s.t. |A| ≤ k. NP-hard!
  Greedy algorithm:
    Start with A = ∅
    For i = 1 to k:
      s* := argmaxs F(A ∪ {s})
      A := A ∪ {s*}
  How well can this simple heuristic do?
  [Figure: Naïve Bayes model, Y "Sick" with features X1 "Fever", X2 "Rash", X3 "Male"]
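For concreteness, here is a minimal Python sketch of the greedy heuristic above, assuming only a black-box set function F; the coverage-style example utility and all names below are illustrative, not part of the tutorial.

```python
# Minimal sketch of greedy cardinality-constrained maximization.
# F is any black-box set function mapping a frozenset to a number.

def greedy_maximize(F, V, k):
    """Pick up to k elements of V, each time adding the element whose
    inclusion gives the largest value of F."""
    A = frozenset()
    for _ in range(min(k, len(V))):
        best_s = max(V - A, key=lambda s: F(A | {s}))
        A = A | {best_s}
    return A

# Toy example with a coverage-style utility (monotone submodular):
# each "feature" covers some items, F(A) = number of items covered.
covers = {"fever": {1, 2, 3}, "rash": {3, 4}, "male": {5}}

def F(A):
    covered = set()
    for s in A:
        covered |= covers[s]
    return len(covered)

print(greedy_maximize(F, frozenset(covers), k=2))  # e.g. frozenset({'fever', 'rash'})
```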

  7. Key property: Diminishing returns
  Selection A = {} vs. selection B = {X2, X3}: adding the new feature X1 to A helps a lot; adding X1 to B doesn't help much.
  Submodularity: for A ⊆ B and s ∉ B,
    F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
  (adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)
  Theorem [Krause, Guestrin UAI '05]: Information gain F(A) in Naïve Bayes models is submodular!

  8. Why is submodularity useful?
  Theorem [Nemhauser et al. '78]: The greedy maximization algorithm returns Agreedy with
    F(Agreedy) ≥ (1 - 1/e) max|A|≤k F(A)    (≈ 63% of optimal)
  • Greedy algorithm gives a near-optimal solution!
  • More details and exact statement later
  • For info-gain: guarantees the best possible approximation unless P = NP! [Krause, Guestrin UAI '05]

  9. Submodularity: Properties and Examples

  10. Set functions
  • Finite set V = {1, 2, …, n}
  • Function F: 2^V → R
  • Will always assume F(∅) = 0 (w.l.o.g.)
  • Assume a black box that can evaluate F for any input A
  • Approximate (noisy) evaluation of F is ok (e.g., [37])
  • Example: F(A) = IG(XA; Y) = H(Y) - H(Y | XA) = Σy,xA P(xA) P(y | xA) [log P(y | xA) - log P(y)]
  • Example values: F({X1, X2}) = 0.9, F({X2, X3}) = 0.5
  [Figure: Naïve Bayes model with Y "Sick" and features X1 "Fever", X2 "Rash", X3 "Male"]

  11. Submodular set functions
  • A set function F on V is called submodular if for all A, B ⊆ V:
    F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
  • Equivalent diminishing-returns characterization: for A ⊆ B and s ∉ B,
    F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
    (adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)
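As a sanity check on the definition, the following sketch (illustrative, not from the tutorial) brute-forces the condition F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) over all pairs of subsets; this is only feasible for tiny ground sets.

```python
# Brute-force submodularity check over all pairs of subsets of a small V.
from itertools import chain, combinations

def all_subsets(V):
    V = list(V)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(V, r) for r in range(len(V) + 1))]

def is_submodular(F, V, tol=1e-9):
    subsets = all_subsets(V)
    return all(F(A) + F(B) + tol >= F(A | B) + F(A & B)
               for A in subsets for B in subsets)

# Coverage functions are submodular; cardinality squared is not.
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
coverage = lambda A: len(set().union(*(covers[s] for s in A))) if A else 0
print(is_submodular(coverage, covers))                   # True
print(is_submodular(lambda A: len(A) ** 2, {1, 2, 3}))   # False
```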

  12. Submodularity and supermodularity
  • A set function F on V is called submodular if
    1) for all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B), or equivalently
    2) for all A ⊆ B and s ∉ B: F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
  • F is called supermodular if -F is submodular
  • F is called modular if F is both sub- and supermodular; for modular ("additive") F, F(A) = Σi∈A w(i)

  13. Example: Set cover
  Place sensors in a building to cover the floorplan with discs.
  • Possible locations V; each sensor node predicts values of positions within some radius
  • For A ⊆ V: F(A) = "area covered by sensors placed at A"
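A rough sketch of this coverage utility on a discretized floorplan; the grid, sensor locations, and radius below are made-up illustrations, not the tutorial's data.

```python
# F(A) = number of grid cells within radius r of at least one chosen sensor.

def coverage(A, grid, locations, r=2.0):
    """A: set of location ids; grid: list of (x, y) cells;
    locations: dict id -> (x, y) coordinates."""
    covered = 0
    for (cx, cy) in grid:
        if any((cx - locations[s][0]) ** 2 + (cy - locations[s][1]) ** 2 <= r ** 2
               for s in A):
            covered += 1
    return covered

grid = [(x, y) for x in range(10) for y in range(10)]
locations = {1: (2, 2), 2: (4, 3), 3: (2, 3)}
F = lambda A: coverage(A, grid, locations)

# Diminishing returns: sensor 3 adds less area once the overlapping sensor 2 is placed.
print(F({1, 3}) - F({1}), F({1, 2, 3}) - F({1, 2}))
```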

  14. Set cover is submodular
  Let A = {S1, S2}, B = {S1, S2, S3, S4}, and consider adding a new sensor S'.
  The area newly covered by S' on top of A is at least the area newly covered by S' on top of B:
    F(A ∪ {S'}) - F(A) ≥ F(B ∪ {S'}) - F(B)

  15. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]
  Who should get free cell phones?
  • V = {Dr. Zheng, Dr. Han, Khuong, Huy, Thanh, Nam}
  • F(A) = expected number of people influenced when targeting A
  [Figure: social network over V with probabilities of influencing on the edges (0.2–0.5)]

  16. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]
  Key idea: flip the coins c for all edges in advance, yielding a set of "live" edges.
  • Fc(A) = number of people influenced under coin outcome c (a set cover function!)
  • F(A) = Σc P(c) Fc(A) is submodular as well!
  [Figure: the same social network with edge probabilities 0.2–0.5]
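A small sketch of this "live-edge" view for the independent cascade model (the graph, probabilities, and function names below are illustrative assumptions): sample coin flips for every edge, count the nodes reachable from the seed set A over the live edges, and average over samples to estimate F(A).

```python
import random
from collections import deque

def live_edge_spread(A, edges, rng):
    """edges: dict (u, v) -> activation probability; returns |reached(A)|."""
    live = {(u, v) for (u, v), p in edges.items() if rng.random() < p}
    reached, queue = set(A), deque(A)
    while queue:
        u = queue.popleft()
        for (a, b) in live:
            if a == u and b not in reached:
                reached.add(b)
                queue.append(b)
    return len(reached)

def influence(A, edges, samples=1000, seed=0):
    """Monte Carlo estimate of the expected number of influenced people."""
    rng = random.Random(seed)
    return sum(live_edge_spread(A, edges, rng) for _ in range(samples)) / samples

edges = {("Zheng", "Khuong"): 0.5, ("Khuong", "Huy"): 0.3, ("Han", "Thanh"): 0.5}
print(influence({"Zheng"}, edges))  # estimated spread when seeding Dr. Zheng
```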

  17. Submodularity and concavity
  Suppose g: N → R and F(A) = g(|A|).
  Then F(A) is submodular if and only if g is concave!
  E.g., g could say "buying in bulk is cheaper".
  [Figure: concave curve g(|A|) as a function of |A|]

  18. Maximum of submodular functions
  Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular?
  No: max(F1, F2) is not submodular in general!
  [Figure: two submodular functions of |A| whose pointwise maximum violates diminishing returns]

  19. Minimum of submodular functions
  Well, maybe F(A) = min(F1(A), F2(A)) instead?
  Counterexample (e.g., take F1(A) = |A ∩ {a}| and F2(A) = |A ∩ {b}|, with F = min(F1, F2)):
    F({b}) - F(∅) = 0 < F({a, b}) - F({a}) = 1
  This violates diminishing returns, so min(F1, F2) is not submodular in general!

  20. Maximizing submodular functions

  21. Maximizing submodular functions
  • Minimizing convex functions: polynomial-time solvable!
  • Minimizing submodular functions: polynomial-time solvable!
  • Maximizing convex functions: NP-hard!
  • Maximizing submodular functions: NP-hard! But we can get approximation guarantees.

  22. Maximizing non-monotonic functions
  • Suppose we want, for a non-monotonic F: A* = argmax F(A) s.t. A ⊆ V
  • Example: F'(A) = F(A) - C(A), where F(A) is a submodular utility and C(A) is a linear cost function. E.g., trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]
  • In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^(1-ε)) approximation)

  23. Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS '07]
  Theorem: There is an efficient randomized local-search procedure that, given a positive submodular function F with F(∅) = 0, returns a set ALS such that
    F(ALS) ≥ (2/5) maxA F(A)
  • Picking a random set already gives a 1/4 approximation
  • We cannot get better than a 3/4 approximation unless P = NP
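The simplest of these baselines is easy to sketch: include each element independently with probability 1/2. For a nonnegative submodular function, such as the toy graph-cut function below (an illustrative assumption, not the tutorial's example), this achieves at least 1/4 of the optimum in expectation.

```python
import random

def random_subset(V, rng=random):
    """Include each element of V independently with probability 1/2."""
    return {s for s in V if rng.random() < 0.5}

# Example: a (non-monotone, nonnegative, submodular) cut function on a small graph.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
def cut_value(A):
    return sum(1 for (u, v) in edges if (u in A) != (v in A))

# Repeating the random draw and keeping the best works well in practice.
best = max(cut_value(random_subset({1, 2, 3, 4})) for _ in range(100))
print(best)
```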

  24. Scalarization vs. constrained maximization
  Given a monotonic utility F(A) and a cost C(A), optimize either:
  • Option 1 ("scalarization"): maxA F(A) - C(A) s.t. A ⊆ V
  • Option 2 ("constrained maximization"): maxA F(A) s.t. C(A) ≤ B
  Option 1 can get the 2/5 approximation, but only if F(A) - C(A) ≥ 0 for all A ⊆ V; positiveness is a strong requirement. Option 2: coming up…

  25. Constrained maximization: Outline
  Maximize a monotonic submodular utility F(A) over the selected set A, subject to a budget B on the selection cost C(A).
  • Subset selection: C(A) = |A|
  • Complex constraints
  • Robust optimization

  26. Monotonicity
  • A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B)
  • Examples:
    • Influence in social networks [Kempe et al. KDD '03]
    • For discrete RVs, entropy F(A) = H(XA) is monotonic: suppose B = A ∪ C; then F(B) = H(XA, XC) = H(XA) + H(XC | XA) ≥ H(XA) = F(A)
    • Information gain: F(A) = H(Y) - H(Y | XA)
    • Set cover
    • Matroid rank functions (dimension of vector spaces, …)
    • …

  27. Subset selection
  Given: finite set V, monotonic submodular function F with F(∅) = 0
  Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k. NP-hard!

  28. Exact maximization of monotonic submodular functions
  1) Mixed integer programming [Nemhauser et al. '81]:
     max η
     s.t. η ≤ F(B) + Σs∈V∖B αs δs(B) for all B ⊆ V
          Σs αs ≤ k
          αs ∈ {0, 1}
     where δs(B) = F(B ∪ {s}) - F(B); solved using constraint generation
  2) Branch-and-bound: "data-correcting algorithm" [Goldengorin et al. '99]
  Both algorithms are worst-case exponential!

  29. Approximate maximization
  Given: finite set V, monotonic submodular function F(A)
  Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k. NP-hard!
  Greedy algorithm:
    Start with A0 = ∅
    For i = 1 to k:
      si := argmaxs F(Ai-1 ∪ {s}) - F(Ai-1)
      Ai := Ai-1 ∪ {si}
  [Figure: Naïve Bayes model, Y "Sick" with features X1 "Fever", X2 "Rash", X3 "Male"]

  30. Performance of greedy algorithm
  Theorem [Nemhauser et al. '78]: Given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns Agreedy with
    F(Agreedy) ≥ (1 - 1/e) max|A|≤k F(A)    (≈ 63% of optimal)

  31. Example: Submodularity of info-gain
  Y1, …, Ym, X1, …, Xn discrete RVs; F(A) = IG(Y; XA) = H(Y) - H(Y | XA)
  • F(A) is always monotonic
  • However, it is NOT always submodular
  Theorem [Krause & Guestrin UAI '05]: If the Xi are all conditionally independent given Y, then F(A) is submodular!
  Hence the greedy algorithm works! In fact, no polynomial-time algorithm can do better than the (1 - 1/e) approximation (unless P = NP)!
  [Figure: graphical model with hidden variables Y1, Y2, Y3 and observations X1, …, X4]

  32. Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST '07]
  • People sit a lot
  • Activity recognition in assistive technologies
  • Seating pressure as a user interface
  Existing system [Tan et al.]: equipped with 1 sensor per cm², 82% accuracy on 10 postures, costs $16,000!
  Can we get similar accuracy with fewer, cheaper sensors?
  [Figure: postures such as lean forward, lean left, slouch]

  33. How to place sensors on a chair?
  • Model sensor readings at locations V as random variables
  • Predict posture Y using a probabilistic model P(Y, V)
  • Pick sensor locations A* ⊆ V to minimize the entropy of the posture given the selected readings, H(Y | XA)
  Placed sensors and did a user study: similar accuracy at <1% of the cost!

  34. Monitoring water networks [Krause et al., J Wat Res Mgt 2008]
  • Contamination of drinking water could affect millions of people
  • Place sensors to detect contaminations; a Hach sensor costs ~$14K
  • Water flow simulator from the EPA; "Battle of the Water Sensor Networks" competition
  Where should we place sensors to quickly detect contamination?

  35. Model-based sensing
  • Utility of placing sensors is based on a model of the world
  • For water networks: water flow simulator from the EPA
  • F(A) = expected impact reduction from placing sensors at A (e.g., low impact reduction F(A) = 0.01 vs. high impact reduction F(A) = 0.9; a sensor reduces impact through early detection)
  Theorem [Krause et al., J Wat Res Mgt '08]: Impact reduction F(A) in water networks is submodular!
  [Figure: set V of all network junctions, with high-, medium-, and low-impact contamination locations and sensors S1–S4]

  36. Battle of the Water Sensor Networks competition
  • Real metropolitan area network (12,527 nodes)
  • Water flow simulator provided by the EPA
  • 3.6 million contamination events
  • Multiple objectives: detection time, affected population, …
  • Place sensors that detect well "on average"

  37. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
  [Figure: population protected F(A) (higher is better) vs. number of sensors placed, on the water networks data; the greedy solution lies well below the offline (Nemhauser) bound]
  The (1 - 1/e) bound is quite loose… can we get better bounds?

  38. Data-dependent bounds [Minoux '78]
  • Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1, …, sk} be an optimal solution
  • Then
    F(A*) ≤ F(A ∪ A*)
          = F(A) + Σi [F(A ∪ {s1, …, si}) - F(A ∪ {s1, …, si-1})]
          ≤ F(A) + Σi [F(A ∪ {si}) - F(A)]
          = F(A) + Σi δsi
  • For each s ∈ V∖A, let δs = F(A ∪ {s}) - F(A), and order the elements so that δ1 ≥ δ2 ≥ … ≥ δn
  • Then: F(A*) ≤ F(A) + Σi=1..k δi
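A short sketch of this bound as code (illustrative names, not the tutorial's implementation): given any candidate solution A, the optimum over sets of size at most k is at most F(A) plus the k largest single-element marginal gains over A.

```python
# Minoux-style data-dependent upper bound on max_{|A*| <= k} F(A*).

def data_dependent_bound(F, V, A, k):
    A = frozenset(A)
    deltas = sorted((F(A | {s}) - F(A) for s in V - A), reverse=True)
    return F(A) + sum(deltas[:k])

# Usage idea: compare a greedy solution against this bound to certify how
# close to optimal it is on the given data (often much tighter than 1 - 1/e).
# bound = data_dependent_bound(F, frozenset(V), A_greedy, k)
# print(F(A_greedy) / bound)
```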

  39. Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]
  [Figure: sensing quality F(A) (higher is better) vs. number of sensors placed, on the water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound above the greedy solution]
  Submodularity gives data-dependent bounds on the performance of any algorithm.

  40. BWSN competition results [Ostfeld et al., J Wat Res Mgt 2008]
  • 13 participants; performance measured in 30 different criteria
  • Method codes: G: genetic algorithm, D: domain knowledge, E: "exact" method (MIP), H: other heuristic
  • Entrants: Krause et al., Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, Preis & Ostfeld
  • 24% better performance than the runner-up!
  [Figure: total score (higher is better) for each participant, annotated with method codes]

  41. "Lazy" greedy algorithm [Minoux '78]
  Lazy greedy algorithm:
  • First iteration as usual
  • Keep an ordered list of marginal benefits δi from the previous iteration
  • Re-evaluate δi only for the top element
  • If δi stays on top, use it; otherwise re-sort
  Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. '07]
  [Figure: elements a–e re-ordered by their cached marginal benefits δs(A) across iterations]
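A possible implementation sketch of lazy greedy using a max-heap of cached marginal gains (illustrative names, not the tutorial's code). It relies on submodularity: cached gains can only shrink as A grows, so a freshly re-evaluated element that still tops the heap is the true argmax.

```python
import heapq

def lazy_greedy(F, V, k):
    A, fA = frozenset(), F(frozenset())
    # Heap of (-gain, element, iteration at which the gain was last computed).
    heap = [(-(F(frozenset({s})) - fA), s, 0) for s in V]
    heapq.heapify(heap)
    for it in range(min(k, len(V))):
        while True:
            neg_gain, s, stamp = heapq.heappop(heap)
            if stamp == it:            # gain is up to date for the current A: take it
                A, fA = A | {s}, fA - neg_gain
                break
            gain = F(A | {s}) - fA     # stale entry: re-evaluate against current A
            heapq.heappush(heap, (-gain, s, it))
    return A
```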

  42. Result of lazy evaluation
  • 3.6M contaminations, all simulated in 2 weeks on 40 processors; 152 GB of data on disk, 16 GB (compressed) in main memory
  • Very accurate computation of F(A), but very slow evaluation of F(A): naive greedy needs 30 hours for 20 sensors, 6 weeks for all 30 settings
  • Using "lazy evaluations": 1 hour for 20 sensors; done after 2 days!
  [Figure: running time in minutes (lower is better) vs. number of sensors selected, comparing exhaustive search (all subsets), naive greedy, and fast (lazy) greedy; "submodularity to the rescue"]

  43. What about the worst case? [Krause et al., NIPS '07]
  The placement detects well on "average-case" (accidental) contaminations, but knowing the sensor locations, an adversary contaminates where detection is worst!
  Placements can have very different average-case impact yet the same worst-case impact.
  Where should we place sensors to quickly detect contamination in the worst case?
  [Figure: sensor placements S1–S4 and an adversarially chosen contamination location]

  44. Constrained maximization: Outline
  Maximize the utility function over the selected set, subject to a budget on the selection cost.
  • Subset selection
  • Complex constraints
  • Robust optimization

  45. Optimizing for the worst case
  • Separate utility function Fi for each contamination i
  • Fi(A) = impact reduction by sensors A for contamination i
  • Want to solve A* = argmax|A|≤k mini Fi(A)
  • Each of the Fi is submodular; unfortunately, mini Fi is not submodular!
  • How can we solve this robust optimization problem?
  Example: for contamination at node s, Fs(A) and Fs(B) are both high; for contamination at node r, Fr(A) is low but Fr(B) is high.

  46. How does the greedy algorithm do?
  Example with budget k = 2: greedy picks the element with the best immediate worst-case score first, and can then only complete a set whose worst-case score is ε, whereas the optimal solution achieves score 1. Greedy does arbitrarily badly. Is there something better?
  Theorem [NIPS '07]: The problem max|A|≤k mini Fi(A) does not admit any approximation unless P = NP.
  Hence we can't find any approximation algorithm. Or can we?

  47. Alternative formulation
  If somebody told us the optimal value c, can we recover the optimal solution A*?
  Need to find a set A* with |A*| ≤ k such that mini Fi(A*) ≥ c.
  Is this any easier? Yes, if we relax the constraint |A| ≤ k.

  48. Solving the alternative problem
  Trick: for each Fi and c, define the truncation F'i,c(A) = min{Fi(A), c}. It remains submodular!
  • Problem 1 (last slide): maximize mini Fi(A). Non-submodular; don't know how to solve.
  • Problem 2: maximize the sum of truncations Σi F'i,c(A). Submodular! But the target value appears as a constraint?
  Both problems have the same optimal solutions, so solving one solves the other.
  [Figure: Fi(A) and its truncation F'i,c(A) at level c, as functions of |A|]
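A tiny sketch of the truncation construction (illustrative names and toy utilities, not the tutorial's code): average the truncated utilities min{Fi(A), c}; the result stays monotone submodular, and it reaches the value c exactly when every scenario satisfies Fi(A) ≥ c.

```python
# Averaged truncation of a list of scenario utilities at level c.

def truncated_average(F_list, c):
    m = len(F_list)
    return lambda A: sum(min(F(A), c) for F in F_list) / m

# Example with two toy scenario utilities (modular, hence submodular):
F1 = lambda A: len(A & {1, 2})
F2 = lambda A: len(A & {3, 4})
Fbar = truncated_average([F1, F2], c=1)
print(Fbar({1, 3}), Fbar({1, 2}))  # 1.0 (both scenarios reach c) vs 0.5
```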

  49. Maximization vs. coverage
  Previously: wanted A* = argmax F(A) s.t. |A| ≤ k
  Now need to solve: A* = argmin |A| s.t. F(A) ≥ Q
  Greedy algorithm:
    Start with A := ∅
    While F(A) < Q and |A| < n:
      s* := argmaxs F(A ∪ {s})
      A := A ∪ {s*}
  Theorem [Wolsey et al.]: Greedy returns Agreedy with |Agreedy| ≤ (1 + log maxs F({s})) |Aopt|
  For the bound, assume F is integral; if not, just round it.
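A minimal sketch of this coverage-style greedy loop (black-box F, illustrative names): keep adding the element with the largest gain until the quota Q is met or everything has been selected.

```python
def greedy_cover(F, V, Q):
    """Greedily grow A until F(A) >= Q or A = V."""
    A = frozenset()
    while F(A) < Q and A != V:
        s_star = max(V - A, key=lambda s: F(A | {s}))
        A = A | {s_star}
    return A

# E.g., with the truncated_average objective from the previous sketch,
# greedy_cover(Fbar, frozenset({1, 2, 3, 4}), Q=1) returns a small set A
# for which every scenario utility reaches the target level c = 1.
```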

  50. Solving the alternative problem
  Trick: for each Fi and c, define the truncation F'i,c(A) = min{Fi(A), c}.
  • Problem 1 (last slide): non-submodular; don't know how to solve.
  • Problem 2 (sum of truncations): submodular! Can use the greedy algorithm!
  [Figure: Fi(A) and its truncation F'i,c(A) at level c, as functions of |A|]
