
Decision Trees and Influences


Presentation Transcript


  1. Decision Trees and Influences Ryan O’Donnell (Microsoft), Mike Saks (Rutgers), Oded Schramm (Microsoft), Rocco Servedio (Columbia)

  2. Part I: Decision trees have large influences

  3. Printer troubleshooter [Figure: a decision-tree troubleshooter whose internal nodes ask questions such as “Does anything print?”, “Right size paper?”, “Can print from Notepad?”, “Network printer?”, “Printer mis-setup?”, “File too complicated?”, and “Driver OK?”, and whose leaves are the outcomes “Solved” or “Call tech support”.]

  4. Decision tree complexity f : {Attr_1} × {Attr_2} × ⋯ × {Attr_n} → {−1,1}. What’s the “best” DT for f, and how to find it? Depth = worst-case # of questions. Expected depth = avg. # of questions.
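
To make the two measures concrete, here is a minimal Python sketch (my own illustration, not from the talk): a tree is encoded as either a leaf value (±1) or a tuple (queried variable, subtree for −1, subtree for +1), and the two notions of depth are computed recursively.

```python
# Minimal sketch (not from the slides): a decision tree over {-1,1}^n is either
# a leaf value (+1 or -1) or a tuple (j, t_minus, t_plus): query x_j, follow
# t_minus if x_j = -1 and t_plus if x_j = +1.

def depth(tree):
    """Worst-case number of queries (the 'Depth' measure on the slide)."""
    if tree in (-1, 1):                      # leaf
        return 0
    _, t_minus, t_plus = tree
    return 1 + max(depth(t_minus), depth(t_plus))

def expected_depth(tree):
    """Average number of queries over a uniformly random input
    (the 'Expected depth' measure); each branch is taken with prob 1/2."""
    if tree in (-1, 1):
        return 0.0
    _, t_minus, t_plus = tree
    return 1.0 + 0.5 * expected_depth(t_minus) + 0.5 * expected_depth(t_plus)

# Example: a depth-3 tree for Maj3 (majority of x1, x2, x3), 0-indexed variables.
MAJ3_TREE = (0,
             (1, -1, (2, -1, 1)),            # branch for x1 = -1
             (1, (2, -1, 1), 1))             # branch for x1 = +1
print(depth(MAJ3_TREE), expected_depth(MAJ3_TREE))   # -> 3  2.5
```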

  5. Building decision trees • Identify the most ‘influential’/‘decisive’/‘relevant’ variable. • Put it at the root. • Recursively build DTs for its children. Almost all real-world learning algs based on this – CART, C4.5, … Almost no theoretical (PAC-style) learning algs based on this – [Blum92, KM93, BBVKV97, PTF-folklore, OS04] – no; [EH89, SJ03] – sorta. Conj’d to be good for some problems (e.g., percolation [SS04]) but unprovable…
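
A toy sketch of this greedy recipe (my own code, not CART/C4.5 or any algorithm from the cited papers), assuming the function is given as a truth-table oracle; “most influential” is measured as in the influence definition a few slides below, restricted to the current subcube.

```python
from itertools import product

# Toy illustration of the greedy recipe above: repeatedly split on the most
# influential variable of the current restriction, until f becomes constant.

def influence_restricted(f, n, j, fixed):
    """Pr[f(x) != f(x with bit j flipped)] over the coordinates not in 'fixed'."""
    free = [i for i in range(n) if i not in fixed and i != j]
    count = 0
    for bits in product((-1, 1), repeat=len(free)):
        x = {**fixed, **dict(zip(free, bits))}
        x[j] = -1; lo = f([x[i] for i in range(n)])
        x[j] = +1; hi = f([x[i] for i in range(n)])
        count += (lo != hi)
    return count / 2 ** len(free)

def build_tree(f, n, fixed=None):
    """Return a tree: a leaf (+/-1) or (j, subtree for x_j=-1, subtree for x_j=+1)."""
    fixed = dict(fixed or {})
    free = [i for i in range(n) if i not in fixed]
    values = {f([{**fixed, **dict(zip(free, bits))}[i] for i in range(n)])
              for bits in product((-1, 1), repeat=len(free))}
    if len(values) == 1:                       # f is constant on this subcube
        return values.pop()
    j = max(free, key=lambda i: influence_restricted(f, n, i, fixed))
    return (j,
            build_tree(f, n, {**fixed, j: -1}),
            build_tree(f, n, {**fixed, j: +1}))

maj3 = lambda x: 1 if sum(x) > 0 else -1
print(build_tree(maj3, 3))   # -> (0, (1, -1, (2, -1, 1)), (1, (2, -1, 1), 1))
```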

  6. Boolean DTs f : {−1,1}^n → {−1,1}. D(f) = min depth of a DT for f. 0 ≤ D(f) ≤ n. [Figure: a depth-3 decision tree for Maj3, querying x1, then x2, then x3, with leaves labelled −1 and 1.]

  7. Boolean DTs • {−1,1}^n viewed as a probability space, with uniform probability distribution. • A uniformly random path down a DT, plus a uniformly random setting of the unqueried variables, defines a uniformly random input. • Expected depth: δ(f).

  8. Influences The influence of coordinate j on f = the probability that x_j is relevant for f: I_j(f) = Pr[ f(x) ≠ f(x^{⊕j}) ], where x^{⊕j} is x with the j-th coordinate flipped. 0 ≤ I_j(f) ≤ 1.
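
For small n this definition can be checked by brute force; a short illustrative sketch (function names are mine, not from the talk):

```python
from itertools import product

def influence(f, n, j):
    """I_j(f) = Pr[f(x) != f(x with coordinate j flipped)], x uniform on {-1,1}^n."""
    changed = 0
    for x in product((-1, 1), repeat=n):
        y = list(x)
        y[j] = -y[j]                      # flip the j-th coordinate
        changed += (f(list(x)) != f(y))
    return changed / 2 ** n

maj3 = lambda x: 1 if sum(x) > 0 else -1
print([influence(maj3, 3, j) for j in range(3)])   # -> [0.5, 0.5, 0.5]
```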

  9. Main question: If a function f has a “shallow” decision tree, does it have a variable with “significant” influence?

  10. Main question: No. But for a silly reason: Suppose f is highly biased; say Pr[f = 1] = p ≪ 1. Then for any j, I_j(f) = Pr[f(x) = 1, f(x^{⊕j}) = −1] + Pr[f(x) = −1, f(x^{⊕j}) = 1] ≤ Pr[f(x) = 1] + Pr[f(x^{⊕j}) = 1] ≤ p + p = 2p.

  11. Variance ⇒ Influences are always at most 2 min{p, q}, where q = 1 − p = Pr[f = −1]. Analytically nicer expression: Var[f]. • Var[f] = E[f^2] − E[f]^2 = 1 − (p − q)^2 = 1 − (2p − 1)^2 = 4p(1 − p) = 4pq. • 2 min{p,q} ≤ 4pq ≤ 4 min{p,q}. • It’s 1 for balanced functions. So I_j(f) ≤ Var[f], and it is fair to say I_j(f) is “significant” if it’s a significant fraction of Var[f].

  12. Main question: If a function f has a “shallow” decision tree, does it have a variable with influence at least a “significant” fraction of Var[f]?

  13. Notation τ(d) = min_{f : D(f) ≤ d} max_j { I_j(f) / Var[f] }.

  14. Known lower bounds Suppose f : {−1,1}^n → {−1,1}. • An elementary old inequality states Var[f] ≤ Σ_{j=1}^{n} I_j(f). Thus f has a variable with influence at least Var[f]/n. • A deep inequality of [KKL88] shows there is always a coordinate j such that I_j(f) ≥ Var[f] · Ω(log n / n). If D(f) = d then f really has at most 2^d variables. Hence we get τ(d) ≥ 1/2^d from the first, and τ(d) ≥ Ω(d/2^d) from KKL.

  15. Our result τ(d) ≥ 1/d. This is tight, as shown by “SEL”: the depth-2 selection tree that queries x1 and then reads off x2 or x3 according to x1’s value. [Figure: the SEL tree, with root x1 and leaves −1, 1, 1, −1.] Then Var[SEL] = 1, d = 2, and all three variables have influence ½. (The recursive version, SEL(SEL, SEL, SEL) etc., of height h gives a variance-1 function with d = 2^h and all influences 2^{−h}, for any h.)
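
The tightness claim is easy to verify directly. A small sketch (my own check), assuming the convention that x1 = −1 selects x2 and x1 = +1 selects x3; either convention gives the same numbers.

```python
from itertools import product

def sel(x):
    """SEL: query x1, then output x2 or x3 accordingly (0-indexed here)."""
    return x[1] if x[0] == -1 else x[2]

def influence(f, n, j):
    changed = 0
    for x in product((-1, 1), repeat=n):
        y = list(x); y[j] = -y[j]
        changed += (f(list(x)) != f(y))
    return changed / 2 ** n

def variance(f, n):
    vals = [f(list(x)) for x in product((-1, 1), repeat=n)]
    mean = sum(vals) / len(vals)
    return sum(v * v for v in vals) / len(vals) - mean ** 2

print(variance(sel, 3), [influence(sel, 3, j) for j in range(3)])
# -> 1.0 [0.5, 0.5, 0.5] : Var = 1, depth d = 2, all influences 1/2, matching the slide.
```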

  16. Our actual main theorem Given a decision tree for f, let δ_j(f) = Pr[tree queries x_j]. Then Var[f] ≤ Σ_{j=1}^{n} δ_j(f) I_j(f). Cor: Fix the tree with smallest expected depth. Then Σ_{j=1}^{n} δ_j(f) = E[depth of a path] =: δ(f) ≤ D(f). ⇒ Var[f] ≤ (max_j I_j) · Σ_j δ_j = (max_j I_j) · δ(f) ⇒ max_j I_j ≥ Var[f] / δ(f) ≥ Var[f] / D(f).
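
As a numeric sanity check of the inequality (my own sketch, not part of the paper), both sides can be computed exactly for the depth-3 Maj3 tree used earlier; δ_j is simply the probability that a uniformly random input reaches a node querying x_j.

```python
from itertools import product

# Trees as before: a leaf (+/-1) or (j, subtree for x_j = -1, subtree for x_j = +1).
MAJ3_TREE = (0, (1, -1, (2, -1, 1)), (1, (2, -1, 1), 1))

def evaluate(tree, x):
    while tree not in (-1, 1):
        j, t_minus, t_plus = tree
        tree = t_minus if x[j] == -1 else t_plus
    return tree

def query_probs(tree, n, prob=1.0, delta=None):
    """delta_j = Pr[tree queries x_j] under a uniformly random input."""
    delta = [0.0] * n if delta is None else delta
    if tree not in (-1, 1):
        j, t_minus, t_plus = tree
        delta[j] += prob
        query_probs(t_minus, n, prob / 2, delta)
        query_probs(t_plus, n, prob / 2, delta)
    return delta

def influence(f, n, j):
    changed = 0
    for x in product((-1, 1), repeat=n):
        y = list(x); y[j] = -y[j]
        changed += (f(list(x)) != f(y))
    return changed / 2 ** n

n = 3
f = lambda x: evaluate(MAJ3_TREE, x)
vals = [f(list(x)) for x in product((-1, 1), repeat=n)]
var = 1 - (sum(vals) / len(vals)) ** 2          # E[f^2] = 1 for +/-1-valued f
delta = query_probs(MAJ3_TREE, n)
rhs = sum(d * influence(f, n, j) for j, d in enumerate(delta))
print(var, "<=", rhs)    # 1.0 <= 1.25 : Var[f] <= sum_j delta_j(f) * I_j(f)
```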

  17. Proof Pick a random path in the tree. This gives some set of variables, P = (x_{J_1}, …, x_{J_T}), along with an assignment to them, β_P. Call the remaining set of variables P̄ and pick a random assignment β_{P̄} for them too. Let X be the uniformly random string given by combining these two assignments, (β_P, β_{P̄}). Also, define J_{T+1} = ⋯ = J_n = ⊥.

  18. Proof Let β′_P be an independent random assignment to the variables in P. Let Z = (β′_P, β_{P̄}). Note: Z is also uniformly random. [Figure: the queried path x_{J_1} = −1, x_{J_2} = 1, x_{J_3} = −1, …, x_{J_T} = 1, with X = (β_P, β_{P̄}) and Z = (β′_P, β_{P̄}) written out; the two strings agree on the unqueried coordinates P̄.]

  19. Proof Finally, for t = 0…T, let Y_t be the same string as X, except that Z’s assignments (from β′_P) to the variables x_{J_1}, …, x_{J_t} are swapped in. Note: Y_0 = X, Y_T = Z. Also define Y_{T+1} = ⋯ = Y_n = Z. [Figure: the hybrid strings Y_0 = X, Y_1, Y_2, …, Y_T = Z, each obtained from the previous one by swapping in Z’s value at the next queried coordinate.]

  20. Var[f] = E[f^2] − E[f]^2
      = E[ f(X)f(X) ] − E[ f(X)f(Z) ]
      = E[ f(X)f(Y_0) − f(X)f(Y_n) ]
      = Σ_{t=1..n} E[ f(X)·(f(Y_{t−1}) − f(Y_t)) ]
      ≤ Σ_{t=1..n} E[ |f(Y_{t−1}) − f(Y_t)| ]
      = Σ_{t=1..n} 2 Pr[ f(Y_{t−1}) ≠ f(Y_t) ]
      = Σ_{t=1..n} Σ_{j=1..n} Pr[J_t = j] · 2 Pr[ f(Y_{t−1}) ≠ f(Y_t) | J_t = j ]
      = Σ_{j=1..n} Σ_{t=1..n} Pr[J_t = j] · 2 Pr[ f(Y_{t−1}) ≠ f(Y_t) | J_t = j ]

  21. Proof … = Σ_{j=1..n} Σ_{t=1..n} Pr[J_t = j] · 2 Pr[ f(Y_{t−1}) ≠ f(Y_t) | J_t = j ]. Utterly Crucial Observation: Conditioned on J_t = j, (Y_{t−1}, Y_t) are jointly distributed exactly as (W, W′), where W is uniformly random, and W′ is W with the j-th bit rerandomized.

  22. [Figure: the path picture from slides 18–19 again, showing the queried coordinates J_1, …, J_T, the strings X and Z, and the hybrid strings Y_0 = X, Y_1, Y_2, …, Y_T = Z.]

  23. Proof … = Σ_{j=1..n} Σ_{t=1..n} Pr[J_t = j] · 2 Pr[ f(Y_{t−1}) ≠ f(Y_t) | J_t = j ]
      = Σ_{j=1..n} Σ_{t=1..n} Pr[J_t = j] · 2 Pr[ f(W) ≠ f(W′) ]
      = Σ_{j=1..n} Σ_{t=1..n} Pr[J_t = j] · I_j(f)
      = Σ_{j=1..n} I_j · Σ_{t=1..n} Pr[J_t = j]
      = Σ_{j=1..n} I_j δ_j.  ∎
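
The step replacing 2 Pr[f(W) ≠ f(W′)] by I_j(f) uses the fact that a rerandomized bit differs from the original with probability exactly ½. A quick brute-force check of that identity (my own sketch, not from the talk):

```python
from itertools import product

def influence(f, n, j):
    changed = 0
    for x in product((-1, 1), repeat=n):
        y = list(x); y[j] = -y[j]
        changed += (f(list(x)) != f(y))
    return changed / 2 ** n

def rerandomize_disagreement(f, n, j):
    """2 * Pr[f(W) != f(W')], where W' is W with the j-th bit rerandomized;
    the fresh bit equals the old one with prob 1/2, so this equals I_j(f)."""
    changed = 0
    for x in product((-1, 1), repeat=n):
        for b in (-1, 1):                 # the rerandomized value of bit j
            y = list(x); y[j] = b
            changed += (f(list(x)) != f(y))
    return 2 * changed / 2 ** (n + 1)

maj3 = lambda x: 1 if sum(x) > 0 else -1
print(influence(maj3, 3, 0), rerandomize_disagreement(maj3, 3, 0))   # both 0.5
```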

  24. Part II: Lower bounds for monotone graph properties

  25. Monotone graph properties Consider graphs on v vertices; let n = (v choose 2). “Nontrivial monotone graph property”: • “nontrivial property”: a (nonempty, nonfull) subset of all v-vertex graphs • “graph property”: closed under permutations of the vertices (⇒ no edge is ‘distinguished’) • monotone: adding edges can only put you into the property, not take you out e.g.: Contains-A-Triangle, Connected, Has-Hamiltonian-Path, Non-Planar, Has-at-least-n/2-edges, …
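
To make the encoding concrete: a v-vertex graph is a point of {−1,1}^n with one coordinate per potential edge, and monotonicity says that flipping an edge from absent to present never moves you out of the property. A small illustrative check (my own sketch; Contains-A-Triangle on v = 4 vertices):

```python
from itertools import combinations, product

v = 4
edges = list(combinations(range(v), 2))        # n = C(v,2) potential edges
n = len(edges)

def contains_triangle(x):
    """x is a {-1,+1}-vector of edge indicators (+1 = edge present)."""
    present = {e for e, b in zip(edges, x) if b == 1}
    return 1 if any({(a, b), (a, c), (b, c)} <= present
                    for a, b, c in combinations(range(v), 3)) else -1

def is_monotone(f):
    """Monotone: turning any single absent edge into a present one never decreases f."""
    for x in product((-1, 1), repeat=n):
        for j in range(n):
            if x[j] == -1:
                y = list(x); y[j] = 1
                if f(list(x)) > f(y):
                    return False
    return True

print(is_monotone(contains_triangle))   # -> True
```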

  26. Aanderaa-Karp-Rosenberg conj. Every nontrivial monotone graph property has D(f) = n. [Rivest-Vuillemin-75]: ≥ v^2/16. [Kleitman-Kwiatkowski-80]: ≥ v^2/9. [Kahn-Saks-Sturtevant-84]: ≥ n/2, and = n if v is a prime power. [Topology + group theory!] [Yao-88]: = n in the bipartite case.

  27. Randomized DTs • Have ‘coin flip’ nodes in the trees that cost nothing. • Or, probability distribution over deterministic DTs. Note: We want both zero error and worst-case input. R(f) = min, over randomized DTs that compute f with 0-error, of max over inputs x, of expected # of queries. The expectation is only over the DT’s internal coins.

  28. Maj3: D(Maj3) = 3. Pick two of the inputs at random, check if they’re the same. If not, check the 3rd. ⇒ R(Maj3) ≤ 8/3. Let f = recursive-Maj3 [Maj3(Maj3, Maj3, Maj3), etc…] For the depth-h version (n = 3^h), D(f) = 3^h. R(f) ≤ (8/3)^h. (Not best possible…!)
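
The 8/3 comes from a one-line calculation (worked out here for completeness, not shown on the slide): on a worst-case input exactly two of the three bits agree, and the algorithm finishes after two queries precisely when it happens to pick that agreeing pair, which is 1 of the 3 possible pairs.

```latex
\[
  \mathbb{E}[\#\text{queries}]
  \;=\; 2\cdot\Pr[\text{queried pair agrees}] + 3\cdot\Pr[\text{queried pair disagrees}]
  \;=\; 2\cdot\tfrac{1}{3} + 3\cdot\tfrac{2}{3}
  \;=\; \tfrac{8}{3}.
\]
```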

  29. Randomized AKR / Yao conj. Yao conjectured in ’77 that every nontrivial monotone graph property f has R(f) ≥ Ω(v^2).
      Lower bound Ω(·), and who proved it:
      v  [Yao-77]
      v log^{1/12} v  [Yao-87]
      v^{5/4}  [King-88]
      v^{4/3}  [Hajnal-91]
      v^{4/3} log^{1/3} v  [Chakrabarti-Khot-01]
      min{ v/p, v^2/log v }  [Fried.-Kahn-Wigd.-02]
      v^{4/3} / p^{1/3}  [us]

  30. Outline • Extend main inequality to the p-biased case. (Then LHS is 1.) • Use Yao’s minmax principle: Show that under p-biased {−1,1}^n, δ = Σ δ_j = avg # queries is large for any tree. • Main inequality: max influence is small ⇒ δ is large. • Graph property ⇒ all vbls have the same influence. • Hence: sum of influences is small ⇒ δ is large. • [OS04]: f monotone ⇒ sum of influences ≤ √δ. • Hence: sum of influences is large ⇒ δ is large. • So either way, δ is large.

  31. Generalizing the inequality Var[f] ≤ Σ_{j=1}^{n} δ_j(f) I_j(f). Generalizations (which basically require no proof change): • holds for randomized DTs • holds for randomized “subcube partitions” • holds for functions on any product probability space f : Ω_1 × ⋯ × Ω_n → {−1,1} (with the notion of “influence” suitably generalized) • holds for real-valued functions, with a (necessary) loss of a factor of at most √δ

  32. Closing thought It’s funny that our bound gets stuck roughly at the same level as Hajnal / Chakrabarti-Khot, n^{2/3} = v^{4/3}. Note that n^{2/3} [I believe] cannot be improved by more than a log factor merely for monotone transitive functions, due to [BSW04]. Thus to get better than v^{4/3} for monotone graph properties, you must use the fact that it’s a graph property. Chakrabarti-Khot does definitely use the fact that it’s a graph property (all sorts of graph packing lemmas). Or do they? Since they get stuck at essentially v^{4/3}, I wonder if there’s any chance their result doesn’t truly need the fact that it’s a graph property…
