Decision Trees and more!

Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time
• polynomial in k and log n
• ε and δ constants
• ELIM makes “slow” progress:
• might disqualify only one literal per round
• might remain with O(n) candidate literals
ELIM: Algorithm for learning OR
• Keep a list of all candidate literals
• For every example whose classification is 0:
• erase all the literals that evaluate to 1 on it
• Correctness:
• our hypothesis h: an OR of our set of literals
• our set of literals includes the target OR's literals
• every time h predicts zero, we are correct
• Sample size:
• m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
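A minimal runnable sketch of ELIM in Python (not from the slides; the representation is an assumption: an example x is a tuple of n bits, and a literal is a pair (i, b) satisfied exactly when x[i] == b):

```python
def elim(examples):
    """ELIM: learn an OR of literals from labeled examples.

    examples: list of (x, y) pairs, x a tuple of 0/1 values, y in {0, 1}.
    Returns the surviving candidate literals; the hypothesis is their OR.
    """
    n = len(examples[0][0])
    # Start with all 2n candidate literals (i, b).
    literals = {(i, b) for i in range(n) for b in (0, 1)}
    for x, y in examples:
        if y == 0:
            # A literal that is 1 on a negative example cannot be in
            # the target OR: erase it.
            literals -= {(i, b) for (i, b) in literals if x[i] == b}
    return literals

def elim_predict(literals, x):
    # h(x) = OR of the surviving literals; since the surviving set
    # contains all target literals, a 0 prediction is always correct.
    return int(any(x[i] == b for (i, b) in literals))
```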
Set Cover - Definition
• Input: S1, …, St with Si ⊆ U
• Output: Si1, …, Sik with ∪j Sij = U
• Question: are there k sets that cover U?
• NP-complete
Set Cover: Greedy algorithm
• j = 0; Uj = U; C = ∅
• While Uj ≠ ∅:
• let Si = arg maxi |Si ∩ Uj|
• C = C ∪ {Si}; Uj+1 = Uj − Si
• j = j + 1
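A small Python sketch of this greedy rule (hypothetical names; the input sets are Python sets over a finite universe):

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: repeatedly pick the set covering the most
    still-uncovered elements (arg max |Si ∩ Uj|)."""
    uncovered = set(universe)          # U0 = U
    cover = []                         # C = ∅
    while uncovered:                   # while Uj ≠ ∅
        i = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("the sets do not cover the universe")
        cover.append(i)                # C = C ∪ {Si}
        uncovered -= sets[i]           # Uj+1 = Uj − Si
    return cover
```

For example, greedy_set_cover({1, 2, 3, 4}, [{1, 2}, {3}, {2, 3, 4}]) first picks {2, 3, 4} and then {1, 2}.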
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C* of size k.
• C* is a cover of every Uj.
• So some S in C* covers at least |Uj|/k of the elements of Uj.
• Analysis: |Uj+1| ≤ |Uj| − |Uj|/k = |Uj|(1 − 1/k)
• Solving the recursion (worked below):
• Number of sets: j ≤ k ln|U| + 1
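Spelling out the recursion (a worked step, consistent with the bound above):

```latex
|U_{j+1}| \le |U_j|\Bigl(1 - \tfrac{1}{k}\Bigr)
\;\Longrightarrow\;
|U_j| \le |U|\Bigl(1 - \tfrac{1}{k}\Bigr)^{j} \le |U|\, e^{-j/k} < 1
\quad \text{once } j > k \ln |U|.
```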
Building an Occam algorithm
• Given a sample T of size m
• Run ELIM on T
• Let LIT be the set of remaining literals
• Assume there exist k literals in LIT that classify all of the sample T correctly
• Negative examples T−:
• any subset of LIT classifies T− correctly (ELIM erased every literal that is 1 on a negative example)
Building an Occam algorithm
• Positive examples T+:
• search for a small subset of LIT that classifies T+ correctly
• For a literal z build Sz = {x ∈ T+ | z satisfies x}
• Our assumption: there are k such sets that cover T+
• Greedy finds k ln m sets that cover T+
• Output h = OR of the k ln m chosen literals
• size(h) < k ln(m) · log(2n) bits
• Sample size: m = O(k log n log(k log n))
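As a sketch, the whole Occam algorithm can reuse the two routines above (elim and greedy_set_cover from the earlier sketches; the same (i, b) literal encoding is assumed):

```python
def occam_or(examples):
    """Occam algorithm for OR: run ELIM, then greedily cover the
    positive examples with few surviving literals."""
    literals = sorted(elim(examples))                 # LIT
    positives = [x for x, y in examples if y == 1]    # T+
    # Sz = indices of the positive examples that literal z satisfies.
    sets = [{j for j, x in enumerate(positives) if x[i] == b}
            for (i, b) in literals]
    chosen = greedy_set_cover(range(len(positives)), sets)
    return [literals[c] for c in chosen]   # h = OR of ~k ln m literals
```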
k-DNF
• Definition:
• a disjunction of terms, each a conjunction of at most k literals
• Term example: T = x3 ∧ x1 ∧ x5
• DNF example: T1 ∨ T2 ∨ T3 ∨ T4
Learning k-DNF
• Extended input:
• for each AND of at most k literals define a “new” input T
• example: T = x3 ∧ x1 ∧ x5
• Number of new inputs: at most (2n)^k
• Can compute the new inputs easily, in time k(2n)^k
• The k-DNF is an OR over the new inputs.
• Run the ELIM algorithm over the new inputs.
• Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
• Running time: same.
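A sketch of the input extension (expand_k is a hypothetical helper; it enumerates every conjunction of at most k literals, so the output has at most about (2n)^k coordinates):

```python
from itertools import combinations, product

def expand_k(x, k):
    """Extended input: one new 0/1 coordinate per AND of at most k
    literals over x (a tuple of 0/1 values)."""
    n = len(x)
    new = []
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):
                # Value of the conjunction of literals (i, b) on x.
                new.append(int(all(x[i] == b for i, b in zip(idxs, signs))))
    return tuple(new)
```

Running elim over the expanded sample, [(expand_k(x, k), y) for x, y in sample], then learns the k-DNF as an OR over the new inputs.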
Learning Decision Lists
• Definition: a sequence of nodes, each testing a literal; if the literal is satisfied the list outputs that node's label, otherwise control passes to the next node, ending in a default label.

[Figure: a decision list testing x4, x7, x1 in order; each node branches on 1/0 either to a leaf labeled +1 or −1 or to the next node.]
Learning Decision Lists
• Similar to ELIM.
• Input: a sample S of size m.
• While S is not empty:
• for each literal z build Tz = {x ∈ S | z satisfies x}
• find a z for which Tz is nonempty and all of Tz has the same classification
• add z, with that classification, to the decision list
• update S = S − Tz
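A runnable sketch of this loop (assumptions: labels are ±1, x is a tuple of 0/1, a literal (i, b) is satisfied when x[i] == b, and a None literal stands for the always-true default rule at the end of the list):

```python
def learn_dl(sample):
    """Greedy decision-list learner, following the loop above."""
    n = len(sample[0][0])
    literals = [(i, b) for i in range(n) for b in (0, 1)]
    dl, S = [], list(sample)
    while S:
        labels = {y for _, y in S}
        if len(labels) == 1:                  # remaining examples agree:
            dl.append((None, labels.pop()))   # always-true default rule
            break
        for (i, b) in literals:
            Tz = [y for x, y in S if x[i] == b]    # labels z satisfies
            if Tz and len(set(Tz)) == 1:           # same classification
                dl.append(((i, b), Tz[0]))
                S = [(x, y) for x, y in S if x[i] != b]  # S = S − Tz
                break
        else:
            raise ValueError("sample has no consistent decision list")
    return dl

def dl_predict(dl, x):
    for z, label in dl:
        if z is None or x[z[0]] == z[1]:
            return label
```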
DL algorithm: correctness
• The output decision list is consistent.
• Number of decision lists:
• length ≤ n + 1
• node: one of 2n literals
• leaf: 2 values
• total bound: (2·2n)^(n+1)
• Sample size:
• m = O(n log n/ε + (1/ε) ln(1/δ))

x4 x2

1

1

+1

1

x3 x1

x5 x7

0

0

0

-1

+1

-1

k-DL
• Each node tests a conjunction of at most k literals
• Includes k-DNF (and k-CNF)
Learning k-DL
• Extended input:
• for each AND of at most k literals define a “new” input
• example: T = x3 ∧ x1 ∧ x5
• Number of new inputs: at most (2n)^k
• Can compute the new inputs easily, in time k(2n)^k
• The k-DL is a DL over the new inputs.
• Run the DL algorithm over the new inputs (combined sketch below).
• Sample size: as for DL with n replaced by (2n)^k, i.e. m = O(k(2n)^k log(2n)/ε + (1/ε) ln(1/δ))
• Running time: polynomial in m and (2n)^k
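Combining the two earlier sketches (expand_k and learn_dl, under the same assumptions):

```python
def learn_k_dl(sample, k):
    """k-DL = DL over the extended inputs: expand each example to one
    coordinate per conjunction of at most k literals, then run DL."""
    return learn_dl([(expand_k(x, k), y) for x, y in sample])
```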
Open Problems
• Attribute-efficient learning:
• decision lists: very limited results
• parity functions: negative results?
• k-DNF and k-DL

Decision Trees

[Figure: a decision tree with root x1 and an internal node x6, each branching on 1/0, with leaves labeled +1 and −1.]

Learning Decision Trees Using DL
• Consider a decision tree T of size r.
• Theorem: there exists a log(r+1)-DL L that computes T.
• Claim: there exists a leaf in T of depth at most log(r+1) (otherwise T would have more than r leaves).
• Learn a decision tree using a decision list.
• Running time: n^(log s)
• n = number of attributes
• s = tree size

Decision Trees

[Figure: a decision tree over real-valued attributes, with threshold tests x1 > 5 and x6 > 2 at internal nodes and leaves labeled +1 and −1.]

Decision Trees: Basic Setup
• Basic class of hypotheses H.
• Input: a sample of examples
• Output: a decision tree
• Each internal node: a predicate from H
• Each leaf: a classification value
• Goal (Occam's Razor):
• a small decision tree
• that classifies all (or most) examples correctly
Decision Tree: Why?
• Efficient algorithms:
• Construction.
• Classification
• Performance: Comparable to other methods
• Software packages:
• CART
• C4.5 and C5
Decision Trees: This Lecture
• Algorithms for constructing DT
• A theoretical justification
• Using boosting
• Future lecture:
• DT pruning.
Decision Trees Algorithm: Outline
• A natural recursive procedure:
• decide a predicate h at the root
• split the data using h
• build the right subtree (for h(x)=1)
• build the left subtree (for h(x)=0)
• Running time:
• T(s) = O(s) + T(s+) + T(s−), which is O(s log s) for balanced splits
• s = sample size
DT: Selecting a Predicate
• Basic setting: at node v, a predicate h splits the data: h=0 with probability u, h=1 with probability 1−u
• q = Pr[f=1] at v; p = Pr[f=1 | h=0] at child v1; r = Pr[f=1 | h=1] at child v2
• Clearly: q = u·p + (1−u)·r

[Figure: node v with Pr[f=1] = q and predicate h; the h=0 branch (probability u) leads to v1 with Pr[f=1 | h=0] = p, the h=1 branch (probability 1−u) leads to v2 with Pr[f=1 | h=1] = r.]

Potential function: setting
• Compare predicates using potential function.
• Inputs: q, u, p, r
• Output: value
• Node dependent:
• For each node and predicate assign a value.
• Given a split: u val(v1) + (1-u) val(v2)
• For a tree: weighted sum over the leaves.
PF: classification error
• Let val(q) = min{q, 1−q}
• i.e., the classification error
• The average potential only drops
• Termination:
• when the average is zero:
• perfect classification

[Figure: example split at node v: q = Pr[f=1] = 0.8; the h=0 branch has u = Pr[h=0] = 0.5 and leads to v1 with p = Pr[f=1 | h=0] = 0.6; the h=1 branch has 1−u = Pr[h=1] = 0.5 and leads to v2 with r = Pr[f=1 | h=1] = 1.]

PF: classification error
• Is this a good split?
• Initial error: min{0.8, 0.2} = 0.2
• After the split: (1/2)·min{0.6, 0.4} + (1/2)·min{1, 0} = (1/2)·0.4 + (1/2)·0 = 0.2
• No drop at all, even though the split is clearly informative: classification error is too weak a potential function.
Potential Function: requirements
• Zero potential implies perfect classification.
• Strictly concave (so informative splits strictly reduce the average potential).
Potential Function: requirements
• Every change is an improvement: since q = u·p + (1−u)·r and val is concave, u·val(p) + (1−u)·val(r) ≤ val(q)

[Figure: a concave val curve; the chord from (p, val(p)) to (r, val(r)) passes below the curve, so its height u·val(p) + (1−u)·val(r) at q = u·p + (1−u)·r lies below val(q).]

Potential Functions: Candidates
• Potential functions:
• val(q) = Gini(q) = 2q(1−q) (CART)
• val(q) = entropy(q) = −q log q − (1−q) log(1−q) (C4.5)
• val(q) = sqrt{2q(1−q)}
• Assumptions:
• symmetric: val(q) = val(1−q)
• concave
• val(0) = val(1) = 0 and val(1/2) = 1 (up to normalization)
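The three candidates, exactly as written on the slide, in Python (unnormalized, so val(1/2) evaluates to 1/2, 1, and sqrt(1/2) respectively):

```python
import math

def gini(q):                 # CART: 2q(1-q)
    return 2 * q * (1 - q)

def entropy(q):              # C4.5: -q log q - (1-q) log (1-q)
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def sqrt_gini(q):            # sqrt{2 q (1-q)}
    return math.sqrt(2 * q * (1 - q))

def split_val(u, p, r, val):
    """Value of a split: u*val(p) + (1-u)*val(r)."""
    return u * val(p) + (1 - u) * val(r)
```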
DT: Construction Algorithm

Procedure DT(S): S = sample

• If all the examples in S have the same classification b:
• create a leaf of value b and return.
• For each h compute val(h,S):
• val(h,S) = uh·val(ph) + (1−uh)·val(rh)
• Let h' = arg minh val(h,S)
• Split S using h' into S0 and S1
• Recursively invoke DT(S0) and DT(S1)
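A minimal runnable version of DT(S) (assumptions: labels y are 0/1 so a mean gives Pr[f=1]; predicates are functions x -> 0/1, e.g. [lambda x, i=i: x[i] for i in range(n)]; val is one of the candidates above):

```python
def build_dt(S, predicates, val):
    """Recursive DT construction following the DT(S) procedure."""
    labels = {y for _, y in S}
    if len(labels) == 1:                        # all of S has label b:
        return labels.pop()                     # leaf of value b

    def split_value(h):
        S0 = [(x, y) for x, y in S if h(x) == 0]
        S1 = [(x, y) for x, y in S if h(x) == 1]
        if not S0 or not S1:
            return float("inf")                 # h does not split S
        u = len(S0) / len(S)                    # u_h = Pr[h=0]
        p = sum(y for _, y in S0) / len(S0)     # p_h = Pr[f=1 | h=0]
        r = sum(y for _, y in S1) / len(S1)     # r_h = Pr[f=1 | h=1]
        return u * val(p) + (1 - u) * val(r)

    h = min(predicates, key=split_value)        # h' = arg min val(h, S)
    if split_value(h) == float("inf"):          # no predicate splits S:
        return round(sum(y for _, y in S) / len(S))  # majority leaf
    S0 = [(x, y) for x, y in S if h(x) == 0]
    S1 = [(x, y) for x, y in S if h(x) == 1]
    return (h, build_dt(S0, predicates, val), build_dt(S1, predicates, val))
```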
DT: Analysis
• Potential function:
• val(T) = Σ_{v leaf of T} Pr[v]·val(qv)
• For simplicity: use the true probabilities
• Bounding the classification error:
• error(T) ≤ val(T)
• so study how fast val(T) drops
• Given a tree T, define T(l,h): the tree T with leaf l replaced by an internal node that tests predicate h
Top-Down algorithm
• Input: s = size; H = predicates; val()
• T0 = single-leaf tree
• For t from 1 to s do:
• let (l,h) = arg max(l,h) {val(Tt) − val(Tt(l,h))}
• Tt+1 = Tt(l,h)
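A sketch of this loop (for brevity it tracks only the partition of the sample into leaves, not the predicates chosen; labels 0/1 and function-valued predicates as before):

```python
def top_down(sample, predicates, val, s):
    """Grow for s steps, each time expanding the (leaf, predicate)
    pair with the largest drop in val(T) = sum_l Pr[l] * val(q_l)."""
    m = len(sample)
    leaves = [list(sample)]                     # T0 = single-leaf tree

    def weight(S):                              # Pr[l] * val(q_l)
        q = sum(y for _, y in S) / len(S)
        return (len(S) / m) * val(q)

    for _ in range(s):
        best = None
        for li, S in enumerate(leaves):
            for h in predicates:
                S0 = [(x, y) for x, y in S if h(x) == 0]
                S1 = [(x, y) for x, y in S if h(x) == 1]
                if not S0 or not S1:
                    continue                    # h does not split this leaf
                drop = weight(S) - weight(S0) - weight(S1)
                if best is None or drop > best[0]:
                    best = (drop, li, S0, S1)
        if best is None:                        # nothing left to split
            break
        _, li, S0, S1 = best
        leaves[li:li + 1] = [S0, S1]            # Tt+1 = Tt(l, h)
    return leaves
```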
Theoretical Analysis
• Assume H satisfies the weak learning hypothesis:
• for every distribution D there is an h s.t. error(h) < 1/2 − γ
• Show that every step gives a significant drop in val(T)
• But the algorithm was never intended to do that!
• Use weak learning to show a large drop in val(T) at each step:
• modify the initial distribution to be unbiased
Theoretical Analysis
• Let val(q) = 2q(1−q)
• Local drop at a node: at least 16γ²[q(1−q)]²
• Claim: at every step t there is a leaf l s.t.:
• Pr[l] ≥ εt/2t
• error(l) = min{ql, 1−ql} ≥ εt/2
• where εt is the error at stage t
• Proof!
Theoretical Analysis
• Drop at time t: at least
• Pr[l]·16γ²·[ql(1−ql)]² ≥ γ²·εt³/t (up to constant factors)
• For the Gini index:
• val(q) = 2q(1−q)
• q ≥ q(1−q) = val(q)/2
• Drop at least Ω(γ²·[val(qt)]³/t)
Theoretical Analysis
• Need to solve when val(Tk) < ε.
• Bound k.
• Time: exp{O((1/γ²)·(1/ε²))}