
# Decision Trees and Influences





## Presentation Transcript

Ryan O’Donnell - Microsoft

Mike Saks - Rutgers

Oded Schramm - Microsoft

Rocco Servedio - Columbia

Part I: Decision trees have large influences

[Slide figure: a printer-troubleshooter decision tree. Internal questions include "Does anything print?", "Right size paper?", "Network printer?", "Printer mis-setup?", "File too complicated?", and "Driver OK?"; the leaves are "Solved" or "Call tech support".]

f : {Attr1} × {Attr2} × ∙∙∙ × {Attrn} → {−1,1}.

What’s the “best” DT for f, and how to find it?

Depth = worst case # of questions.

Expected depth = avg. # of questions.

• Identify the most ‘influential’/‘decisive’/‘relevant’ variable.

• Put it at the root.

• Recursively build DTs for its children.
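A minimal Python sketch of this greedy heuristic, using influence on the current subcube as the splitting score. All names (`influences_on_subcube`, `greedy_tree`) are mine, not from the talk, and real learners like CART use impurity scores rather than exact influence; this is just the idea in executable form.

```python
from itertools import product

def influences_on_subcube(f, free, fixed):
    """Influence of each free variable of f, restricted to the subcube
    where the already-queried variables are fixed."""
    counts = {j: 0 for j in free}
    assignments = list(product((-1, 1), repeat=len(free)))
    for a in assignments:
        x = {**fixed, **dict(zip(free, a))}
        for j in free:
            y = dict(x)
            y[j] = -y[j]                  # flip the j-th bit
            if f(x) != f(y):
                counts[j] += 1
    return {j: counts[j] / len(assignments) for j in free}

def greedy_tree(f, free, fixed=None):
    """Top-down heuristic: put the most influential variable at the root,
    then recursively build trees for the two restrictions."""
    fixed = dict(fixed or {})
    values = {f({**fixed, **dict(zip(free, a))})
              for a in product((-1, 1), repeat=len(free))}
    if len(values) == 1:                  # constant on this subcube: leaf
        return values.pop()
    infl = influences_on_subcube(f, free, fixed)
    j = max(infl, key=infl.get)           # the most influential variable
    rest = [k for k in free if k != j]
    return (j,                            # node: (variable, child for -1, child for +1)
            greedy_tree(f, rest, {**fixed, j: -1}),
            greedy_tree(f, rest, {**fixed, j: 1}))
```

On Maj3, for instance, this should recover the usual strategy: query two variables, and query the third only when they disagree.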

Almost all real-world learning algorithms are based on this (CART, C4.5, …).

Almost no theoretical (PAC-style) learning algorithms are based on this:

[Blum92, KM93, BBVKV97, PTF-folklore, OS04] – no;

[EH89, SJ03] – sorta.

Conjectured to be good for some problems (e.g., percolation [SS04]), but this remains unproven…


Boolean DTs

f : {−1,1}n → {−1,1}.

D(f) = min depth of a DT for f.

0 ≤ D(f) ≤ n.

[Slide figure: a depth-3 decision tree computing Maj3: query x1 at the root, then x2, then (on paths where they disagree) x3, with ±1 leaves.]

• {−1,1}^n is viewed as a probability space, with the uniform probability distribution.

• A uniformly random path down a DT, plus a uniformly random setting of the unqueried variables, defines a uniformly random input.

• Expected depth: δ(f).

The influence of coordinate j on f is the probability that xj is relevant for f:

Ij(f) = Pr[ f(x) ≠ f(x^⊕j) ],

where x^⊕j denotes x with its jth bit flipped. Always 0 ≤ Ij(f) ≤ 1.
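Concretely, Ij(f) can be computed by brute force over all 2^n inputs (a Python sketch; the function name is mine). For Maj3 every coordinate has influence ½, since flipping xj changes the majority exactly when the other two bits disagree:

```python
from itertools import product

def influence(f, n, j):
    """I_j(f) = Pr over uniform x in {-1,1}^n that flipping bit j changes f(x)."""
    flips = 0
    for x in product((-1, 1), repeat=n):
        y = x[:j] + (-x[j],) + x[j+1:]   # x with the j-th bit flipped
        if f(x) != f(y):
            flips += 1
    return flips / 2**n

maj3 = lambda x: 1 if sum(x) > 0 else -1
print([influence(maj3, 3, j) for j in range(3)])   # [0.5, 0.5, 0.5]
```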

If a function f has a “shallow” decision tree, does it have a variable with “significant” influence?

No.

But for a silly reason:

Suppose f is highly biased; say Pr[f = 1] = p ≪ 1, and write q = 1 − p.

Then for any j,

Ij(f) = Pr[f(x) = 1, f(x^⊕j) = −1] + Pr[f(x) = −1, f(x^⊕j) = 1]

≤ Pr[f(x) = 1] + Pr[f(x^⊕j) = 1]

= p + p

= 2p.

⇒ Influences are always at most 2·min{p, q}.

Analytically nicer expression: Var[f].

• Var[f] = E[f^2] − E[f]^2 = 1 − (p − q)^2 = 1 − (2p − 1)^2 = 4p(1 − p) = 4pq.

• 2 min{p,q} ≤ 4pq ≤ 4 min{p,q}.

• It’s 1 for balanced functions.

So Ij(f) ≤ Var[f], and it is fair to say Ij(f) is “significant” if it’s a significant fraction of Var[f].

If a function f has a "shallow" decision tree, does it have a variable with influence at least a "significant" fraction of Var[f]?

τ(d) = min_{f : D(f) ≤ d}  max_j  { Ij(f) / Var[f] }.

Known lower bounds

Suppose f : {−1,1}n → {−1,1}.

• An elementary old inequality states

Var[f] ≤ Σ_{j=1..n} Ij(f).

Thus f has a variable with influence at least Var[f]/n.

• A deep inequality of [KKL88] shows there is always a coord. j such that Ij(f) ≥ Var[f] · Ω(log n / n).

If D(f) = d then f really has at most 2^d relevant variables.

Hence the first inequality gives τ(d) ≥ 1/2^d, and KKL gives τ(d) ≥ Ω(d/2^d).

[Slide figure: a depth-2 selector tree: query x1 at the root; one child queries x2, the other queries x3, with ±1 leaves.]

Our result

τ(d) ≥ 1/d.

This is tight: let SEL(x1, x2, x3) output x2 if x1 = −1 and x3 if x1 = 1. Then Var[SEL] = 1, d = 2, and all three variables have influence ½.

(The recursive version, SEL(SEL, SEL, SEL) etc., gives for any h a variance-1 function with d = 2^h and all influences 2^−h.)
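The tightness claim for the recursive construction is easy to check by enumeration (a Python sketch; `rec_sel` is my own name for the h-level composition of selectors):

```python
from itertools import product

def sel(a, b, c):
    """Selector: answer b if a == -1, else c (computable by a depth-2 tree)."""
    return b if a == -1 else c

def rec_sel(x):
    """h-level recursive SEL on 3^h inputs: apply sel in blocks of 3, repeatedly."""
    xs = list(x)
    while len(xs) > 1:
        xs = [sel(xs[i], xs[i + 1], xs[i + 2]) for i in range(0, len(xs), 3)]
    return xs[0]

h, n = 2, 9                       # two levels: depth d = 2^h = 4, n = 3^h = 9
inputs = list(product((-1, 1), repeat=n))
infl = []
for j in range(n):
    flips = sum(rec_sel(x) != rec_sel(x[:j] + (-x[j],) + x[j+1:]) for x in inputs)
    infl.append(flips / len(inputs))
print(infl)                       # every influence equals 2^-h = 0.25
```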

Our actual main theorem

Given a decision tree for f, let δj(f) = Pr[the tree queries xj]. Then

Var[f] ≤ Σ_{j=1..n} δj(f) Ij(f).

Cor: Fix the tree for f with smallest expected depth. Then

Σ_{j=1..n} δj(f) = E[depth of a path] =: δ(f) ≤ D(f).

⇒ Var[f] ≤ (max_j Ij) · Σ_j δj = (max_j Ij) · δ(f)

⇒ max_j Ij ≥ Var[f] / δ(f) ≥ Var[f] / D(f).
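As a sanity check, here is the main inequality evaluated exhaustively for the depth-2 SEL tree from the tightness example (a Python sketch; the δj values are read off the tree: x1 is always queried, x2 and x3 each with probability ½):

```python
from itertools import product

def sel(x):                       # the depth-2 selector tree
    return x[1] if x[0] == -1 else x[2]

inputs = list(product((-1, 1), repeat=3))
delta = [1.0, 0.5, 0.5]           # delta_j = Pr[the tree queries x_j]

mean = sum(sel(x) for x in inputs) / 8
var = sum(sel(x)**2 for x in inputs) / 8 - mean**2

infl = []
for j in range(3):
    flips = sum(sel(x) != sel(x[:j] + (-x[j],) + x[j+1:]) for x in inputs)
    infl.append(flips / 8)

print(var, sum(d * i for d, i in zip(delta, infl)))
# Var[SEL] = 1.0 and sum_j delta_j * I_j = 1.0: the inequality is tight here.
```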

Pick a random path in the tree. This gives some set of variables, P = (xJ1, … , xJT), along with an assignment to them, βP.

Call the remaining set of variables P̄ and pick a random assignment βP̄ for them too.

Let X be the (uniformly random) string given by combining these two assignments: X = (βP, βP̄).

Also, define JT+1 = ··· = Jn = ⊥.

Proof

Let β′P be an independent random assignment to the variables in P.

Let Z = (β′P, βP̄).

Note: Z is also uniformly random; moreover f(X), which is determined by the path (i.e., by βP), is independent of Z.
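Both facts — Z is uniform, and E[f(X)f(Z)] = E[f]^2, which is what the derivation below uses — can be checked exhaustively for the SEL tree (a Python sketch of the coupling; `path` returns the coordinates P the tree queries):

```python
from itertools import product
from collections import Counter

def sel(x):                       # the depth-2 SEL tree
    return x[1] if x[0] == -1 else x[2]

def path(x):                      # coordinates P queried by the tree on input x
    return (0, 1) if x[0] == -1 else (0, 2)

z_counts, corr, total = Counter(), 0, 0
for x in product((-1, 1), repeat=3):          # beta_P and beta_Pbar together
    for fresh in product((-1, 1), repeat=2):  # beta'_P: rerandomize the path bits
        z = list(x)
        for coord, val in zip(path(x), fresh):
            z[coord] = val
        z_counts[tuple(z)] += 1
        corr += sel(x) * sel(tuple(z))
        total += 1

print(set(z_counts.values()))     # {4}: Z hits each of the 8 strings equally often
print(corr / total)               # 0.0 = E[f]^2, since SEL is balanced
```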

[Slide figure: a root-to-leaf path querying xJ1 = −1, xJ2 = 1, xJ3 = −1, …, xJT = 1, with JT+1 = ··· = Jn = ⊥; e.g. X = (−1, 1, −1, …, 1; 1, −1, 1, −1), writing the path bits first, and Z = (1, −1, −1, …, −1; 1, −1, 1, −1), which agrees with X off the path.]

Finally, for t = 0…T, let Yt be the same string as X, except that Z’s assignments (β’P) for variables xJ1, … , xJt are swapped in.

Note: Y0 = X, YT = Z.

Y0 = X = (-1, 1, -1, …, 1, 1, -1, 1, -1 )

Y1= ( 1, 1, -1, …, 1, 1, -1, 1, -1 )

Y2 = ( 1,-1, -1, …, 1, 1, -1, 1, -1 )

⋮

YT = Z = ( 1,-1, -1, …,-1, 1, -1, 1, -1 )

Also define YT+1 = · · · = Yn = Z.

Var[f] = E[f^2] − E[f]^2

= E[ f(X)f(X) ] − E[ f(X)f(Z) ]      (f(X)^2 = 1, and f(X) is independent of f(Z))

= E[ f(X)f(Y0) − f(X)f(Yn) ]      (Y0 = X, Yn = Z)

= Σ_{t=1..n} E[ f(X)·(f(Yt−1) − f(Yt)) ]      (telescoping)

≤ Σ_{t=1..n} E[ |f(Yt−1) − f(Yt)| ]      (|f(X)| = 1)

= Σ_{t=1..n} 2·Pr[ f(Yt−1) ≠ f(Yt) ]

= Σ_{t=1..n} Σ_{j=1..n} Pr[Jt = j] · 2·Pr[ f(Yt−1) ≠ f(Yt) | Jt = j ]

= Σ_{j=1..n} Σ_{t=1..n} Pr[Jt = j] · 2·Pr[ f(Yt−1) ≠ f(Yt) | Jt = j ]

Proof (continued)

… = Σ_{j=1..n} Σ_{t=1..n} Pr[Jt = j] · 2·Pr[ f(Yt−1) ≠ f(Yt) | Jt = j ]

Utterly Crucial Observation:

Conditioned on Jt = j,

(Yt−1, Yt) are jointly distributed exactly as

(W, W’), where W is uniformly random, and W’ is W with jth bit rerandomized.


Proof (continued)

… = Σ_{j=1..n} Σ_{t=1..n} Pr[Jt = j] · 2·Pr[ f(Yt−1) ≠ f(Yt) | Jt = j ]

= Σ_{j=1..n} Σ_{t=1..n} Pr[Jt = j] · 2·Pr[ f(W) ≠ f(W′) ]

= Σ_{j=1..n} Σ_{t=1..n} Pr[Jt = j] · Ij(f)      (rerandomizing bit j flips it with probability ½, so 2·Pr[f(W) ≠ f(W′)] = Ij(f))

= Σ_{j=1..n} Ij(f) · Σ_{t=1..n} Pr[Jt = j]

= Σ_{j=1..n} Ij(f) δj(f)      (each xj is queried at most once on a path, so Σ_t Pr[Jt = j] = Pr[tree queries xj] = δj(f)). ∎

Consider graphs on v vertices; let n = (v choose 2) be the number of possible edges.

“Nontrivial monotone graph property”:

• “nontrivial property”: a (nonempty, nonfull) subset of all v-vertex graphs

• "graph property": closed under permutations of the vertices (⇒ no edge is 'distinguished')

• monotone: adding edges can only put you into the property, not take you out

e.g.: Contains-A-Triangle, Connected, Has-Hamiltonian-Path, Non-Planar, Has-at-least-n/2-edges, …

Conjecture: every nontrivial monotone graph property has D(f) = n (i.e., is "evasive").

[Rivest-Vuillemin-75]: D(f) ≥ v^2/16.

[Kleitman-Kwiatkowski-80]: D(f) ≥ v^2/9.

[Kahn-Saks-Sturtevant-84]: D(f) ≥ n/2, and D(f) = n if v is a prime power.

[Topology + group theory!]

[Yao-88]: D(f) = n in the bipartite case.

• Have ‘coin flip’ nodes in the trees that cost nothing.

• Or, probability distribution over deterministic DTs.

Note: We want both 0-sided error and worst-case input.

R(f) = min, over randomized DTs that compute f with 0-error, of max over inputs x, of expected # of queries.

The expectation is only over the DT’s internal coins.

Maj3:

D(Maj3) = 3.

Query two of the three bits at random and check if they're the same (if so, that value is the majority). If not, query the 3rd.

⇒ R(Maj3) ≤ 8/3.

Let f = recursive-Maj3 [Maj3(Maj3, Maj3, Maj3), etc.]

For the depth-h version (n = 3^h),

D(f) = 3^h.

R(f) ≤ (8/3)^h.

(Not best possible…!)
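The 8/3 bound can be verified exactly by averaging over the algorithm's coin flips for every input (a Python sketch; the function name is mine):

```python
from itertools import permutations, product

def expected_queries(x):
    """Expected #queries of the randomized algorithm on input x: query two
    coordinates in a uniformly random order; if they agree, their common value
    is already the majority; otherwise also query the third."""
    orders = list(permutations(range(3)))      # uniformly random query order
    return sum(2 if x[i] == x[j] else 3 for (i, j, k) in orders) / len(orders)

worst = max(expected_queries(x) for x in product((-1, 1), repeat=3))
print(worst)   # 8/3, attained on inputs like (1, 1, -1)
```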

Yao conjectured in '77 that every nontrivial monotone graph property f has R(f) ≥ Ω(v^2).

| Lower bound Ω(·) | Who |
| --- | --- |
| v | [Yao-77] |
| v log^{1/12} v | [Yao-87] |
| v^{5/4} | [King-88] |
| v^{4/3} | [Hajnal-91] |
| v^{4/3} log^{1/3} v | [Chakrabarti-Khot-01] |
| min{ v/p, v^2/log v } | [Fried.-Kahn-Wigd.-02] |
| v^{4/3} / p^{1/3} | [us] |

• Extend the main inequality to the p-biased case. (Then the LHS is 1.)

• Use Yao's minimax principle: show that under the p-biased distribution on {−1,1}^n, δ = Σ_j δj = the average # of queries is large for any tree.

• Main inequality: max influence is small ⇒ δ is large.

• Graph property ⇒ all variables have the same influence.

• Hence: sum of influences is small ⇒ δ is large.

• [OS04]: f monotone ⇒ sum of influences ≤ √δ.

• Hence: sum of influences is large ⇒ δ is large.

• So either way, δ is large.
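For intuition, the [OS04]-style inequality in the argument above can be checked on a small monotone example like Maj3 (a Python sketch; δ here is the expected depth of the standard "query two, then maybe the third" tree, an assumption on my part rather than anything computed in the talk):

```python
from itertools import product

maj3 = lambda x: 1 if sum(x) > 0 else -1
inputs = list(product((-1, 1), repeat=3))

# Total influence of Maj3: each coordinate has influence 1/2.
total_influence = 0.0
for j in range(3):
    flips = sum(maj3(x) != maj3(x[:j] + (-x[j],) + x[j+1:]) for x in inputs)
    total_influence += flips / 8

# Expected depth: 2 queries w.p. 1/2 (first two bits agree), else 3.
delta = 2 * 0.5 + 3 * 0.5
print(total_influence, delta ** 0.5)   # 1.5 <= sqrt(2.5)
```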

Var[f] ≤ Σ_{j=1..n} δj(f) Ij(f).

Generalizations (which basically require no proof change):

• holds for randomized DTs

• holds for randomized "subcube partitions"

• holds for functions on any product probability space f : Ω1 × ∙∙∙ × Ωn → {−1,1} (with the notion of "influence" suitably generalized)

• holds for real-valued functions, with a (necessary) loss of a factor of at most √δ

It's funny that our bound gets stuck at roughly the same level as Hajnal / Chakrabarti-Khot: n^{2/3} = v^{4/3}.

Note that n^{2/3} [I believe] cannot be improved by more than a log factor merely for monotone transitive functions, due to [BSW04].

Thus to get better than v^{4/3} for monotone graph properties, you must use the fact that it's a graph property.

Chakrabarti-Khot does definitely use the fact that it’s a graph property (all sorts of graph packing lemmas).

Or do they? Since they get stuck at essentially v^{4/3}, I wonder if there's any chance their result doesn't truly need the fact that it's a graph property…