Machine Learning: Foundations
Yishay Mansour, Tel Aviv University

Typical Applications
Classification/Clustering problems:
Data mining, information retrieval, web search,
self-customization (news, mail).
Control:
Robots, dialog systems, complicated software.
IF OtherDelinquentAccounts > 2 and
   NumberDelinquentBillingCycles > 1
THEN DENY CREDIT
IF OtherDelinquentAccounts = 0 and
Income > $30k
THEN GRANT CREDIT
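Rules like these are straightforward to encode directly; a minimal sketch in Python, with field names taken from the two rules above (the decision values are illustrative):

```python
def credit_decision(other_delinquent_accounts, delinquent_billing_cycles, income):
    """Apply the two example credit rules; return None when no rule fires."""
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "DENY"
    if other_delinquent_accounts == 0 and income > 30_000:
        return "GRANT"
    return None  # no rule applies to this applicant

print(credit_decision(3, 2, 25_000))  # first rule fires
print(credit_decision(0, 0, 40_000))  # second rule fires
```

The point of learning is to extract such rules automatically from data rather than writing them by hand.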
Distribution D over domain X
Unknown target function f(x)
Goal: find h(x) such that h(x) ≈ f(x)
Given H, find h ∈ H that minimizes Pr_D[h(x) ≠ f(x)]
Example: (x, f(x))
S = sample of m examples drawn i.i.d. using D
True error: e(h) = Pr_D[h(x) ≠ f(x)]
Observed error: e'(h) = (1/m) |{ x ∈ S : h(x) ≠ f(x) }|
Basic question: how close is e(h) to e'(h)?
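The gap between the two errors can be checked empirically; a small sketch (the domain, distribution, target, and hypothesis are all illustrative):

```python
import random

def observed_error(h, sample):
    """e'(h): fraction of sample points the hypothesis misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

# Illustrative setup: X = {0,...,99}, D uniform, target f(x) = 1 iff x >= 50.
random.seed(0)
f = lambda x: int(x >= 50)
h = lambda x: int(x >= 55)  # a slightly wrong threshold hypothesis

# True error is exact here because D is uniform over a finite domain.
true_error = sum(h(x) != f(x) for x in range(100)) / 100

# Draw m = 1000 labeled examples i.i.d. from D.
S = [(x, f(x)) for x in (random.randrange(100) for _ in range(1000))]
print(true_error, observed_error(h, S))  # the two should be close for large m
```

For a single fixed h, Chernoff bounds say the observed error concentrates around the true error at rate roughly 1/sqrt(m).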
Prior distribution over H
Given a sample S, compute a posterior distribution:
Maximum Likelihood (ML): Pr[S|h]
Maximum A Posteriori (MAP): Pr[h|S]
Bayesian Predictor: Σ_h h(x) Pr[h|S].
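A sketch of the posterior computation and the Bayesian predictor for a small finite H, assuming an illustrative label-noise model where each label is flipped with probability eta:

```python
def posterior(hypotheses, prior, sample, eta=0.1):
    """Pr[h|S] proportional to Pr[S|h] * Pr[h], under the flip-noise assumption."""
    weights = []
    for h, p in zip(hypotheses, prior):
        mistakes = sum(h(x) != y for x, y in sample)
        likelihood = (1 - eta) ** (len(sample) - mistakes) * eta ** mistakes
        weights.append(likelihood * p)
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]

def bayes_predict(hypotheses, post, x):
    """Bayesian predictor: weighted vote sum_h h(x) Pr[h|S], thresholded at 1/2."""
    return int(sum(h(x) * p for h, p in zip(hypotheses, post)) >= 0.5)

# Two threshold hypotheses, uniform prior; the data is generated by the first.
H = [lambda x: int(x >= 3), lambda x: int(x >= 7)]
S = [(x, int(x >= 3)) for x in range(10)]
post = posterior(H, [0.5, 0.5], S)
print(post, bayes_predict(H, post, 5))
```

The ML hypothesis maximizes the likelihood alone; MAP also weighs in the prior; the Bayesian predictor averages over all of H.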
Classify using nearby examples.
Assume a “structured space” and a “metric”
[Figure: labeled '+' and '-' examples in the plane; a query point '?' is classified by its nearest neighbors]
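A minimal 1-nearest-neighbor sketch, using squared Euclidean distance as the metric (the training points are illustrative):

```python
def nearest_neighbor(query, labeled_points):
    """1-NN: return the label of the stored example closest to the query."""
    closest = min(labeled_points,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    return closest[1]

# Toy training set of (point, label) pairs in the plane.
train = [((0.0, 0.0), '-'), ((0.0, 1.0), '+'),
         ((1.0, 0.0), '+'), ((1.0, 1.0), '-')]
print(nearest_neighbor((0.1, 0.9), train))  # closest to (0, 1) -> '+'
```

The method is only as good as the assumed metric: with a poor metric, "near" examples need not have similar labels.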
How to find an h ∈ H with low observed error?
In most cases the computational task is provably hard.
Heuristic algorithms for specific classes.
[Figure: perceptron with inputs x1 ... xn, weights w1 ... wn, feeding a summation node Σ]
Separating Hyperplane
Perceptron: sign( Σ xi wi )
Find w1, ..., wn
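The classical perceptron update finds such weights when the data is linearly separable; a minimal sketch with labels in {-1, +1} (the toy data, with a bias coordinate x[0] = 1, is illustrative):

```python
def perceptron(examples, dim, epochs=100):
    """Perceptron rule: on each mistake, add y*x to the weight vector w."""
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged: all examples classified correctly
            break
    return w

# Linearly separable toy data (an OR-like function; x[0] = 1 is the bias).
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = perceptron(data, 3)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1 for x, _ in data])
```

The mistake-bound theorem guarantees convergence after finitely many updates whenever a separating hyperplane with positive margin exists.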
Limited representation
Efficient Algorithms.
Aim: Find a small decision tree with
low observed error.
PHASE I:
Construct the tree greedily,
using a local index function.
Gini index: G(x) = x(1-x), entropy H(x), ...
PHASE II:
Prune the decision tree
while maintaining low observed error.
Good experimental results
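The local index functions can be sketched directly; the greedy phase picks the split with the lowest weighted index (binary labels; the example groups are illustrative):

```python
from math import log2

def gini(p):
    """Gini index G(p) = p(1 - p) for a node with a fraction p of positives."""
    return p * (1 - p)

def entropy(p):
    """Binary entropy H(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def split_score(groups, index=gini):
    """Weighted index over the child nodes of a candidate split."""
    m = sum(len(g) for g in groups)
    return sum(len(g) / m * index(sum(g) / len(g)) for g in groups if g)

# A pure split scores 0; a useless 50/50 split keeps the parent's impurity.
print(split_score([[1, 1, 1], [0, 0, 0]]), split_score([[1, 0], [1, 0]]))
```

Both indices are concave and maximized at p = 1/2, so either rewards splits that make the child nodes purer.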
Hypothesis complexity versus observed error:
more complex hypotheses have
lower observed error,
but might have higher true error.
Minimum Description Length:
e'(h) + code length of h
Structural Risk Minimization:
e'(h) + sqrt( log|H| / m )
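The trade-off can be made concrete; a sketch scoring two hypothetical hypotheses with the SRM penalty (the error values, class sizes, and sample size are illustrative):

```python
from math import log, sqrt

def srm_score(observed_error, class_size, m):
    """Structural Risk Minimization score: e'(h) + sqrt(log|H| / m)."""
    return observed_error + sqrt(log(class_size) / m)

# A complex class fits the sample better but pays a larger complexity penalty.
simple = srm_score(0.10, class_size=2**10, m=1000)     # small H, some error
complex_ = srm_score(0.02, class_size=2**200, m=1000)  # huge H, tiny error
print(simple, complex_, simple < complex_)
```

With this sample size the simpler hypothesis wins despite its higher observed error; as m grows, the penalty shrinks and richer classes become affordable.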
A search method.
Local mutation operations
Crossover operations
Keeps the “best” candidates.
Example: decision trees
Change a node in a tree
Replace a subtree by another tree
Keep trees with low observed error
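A toy genetic search over bit strings, using the three ingredients above: local mutation, crossover, and keeping the best candidates (the fitness function and all parameters are illustrative):

```python
import random

def genetic_search(fitness, length, pop_size=20, generations=200, seed=1):
    """Toy genetic search: crossover, mutate, keep the fittest candidates."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(pop, 2)
            cut = rng.randrange(length)
            child = a[:cut] + b[cut:]   # crossover operation
            child[rng.randrange(length)] ^= 1  # local mutation: flip one bit
            children.append(child)
        # Keep the "best" candidates among parents and children.
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]

# Illustrative fitness: agreement with a fixed target string.
target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
best = genetic_search(lambda c: sum(ci == ti for ci, ti in zip(c, target)),
                      len(target))
print(sum(ci == ti for ci, ti in zip(best, target)))
```

For decision trees the same skeleton applies, with mutation changing a node and crossover swapping subtrees, and observed error as the fitness.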
Minimize the observed error.
Search for a small-size classifier
Hand-tailored search methods for specific classes.
Small class of predicates H
Weak Learning:
Assume that for any distribution D, there is some predicate h ∈ H
that predicts better than 1/2 + ε.
Strong Learning from Weak Learning:
Functions: Weighted majority of the predicates.
Methodology:
Change the distribution to target “hard” examples.
Weight of an example is exponential in the number of
incorrect classifications.
Extremely good experimental results and efficient algorithms.
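One well-known instance of this exponential reweighting is the AdaBoost update; a sketch of a single boosting step (the sample and the weak predicate's predictions are illustrative):

```python
from math import exp, log

def reweight(weights, predictions, labels):
    """One boosting step: multiply each example's weight by exp(+/- alpha),
    so weight grows exponentially in the number of incorrect classifications."""
    err = sum(w for w, p, y in zip(weights, predictions, labels)
              if p != y) / sum(weights)
    alpha = 0.5 * log((1 - err) / err)  # AdaBoost coefficient of the predicate
    new = [w * exp(alpha if p != y else -alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)  # renormalize to a distribution
    return [w / z for w in new], alpha

labels = [1, 1, -1, -1]
preds = [1, -1, -1, -1]  # a weak predicate, wrong on one example
w, alpha = reweight([0.25] * 4, preds, labels)
print(w)  # the misclassified example's weight grows to 1/2
```

The final strong hypothesis is the weighted majority of the weak predicates, each weighted by its alpha.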
Project data to a high dimensional space.
Use a hyperplane in the LARGE space.
Choose a hyperplane with a large MARGIN.
[Figure: '+' and '-' points separated by a hyperplane with a large margin]
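The projection idea can be seen already in one dimension; a sketch with the illustrative map phi(x) = (x, x^2), under which points that no threshold on the line can separate become linearly separable in the plane:

```python
def feature_map(x):
    """Project 1-D input to 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)

# The '+' points sit between the '-' points, so no single threshold works.
points = [(-2.0, '-'), (-0.5, '+'), (0.5, '+'), (2.0, '-')]

# In the (x, x^2) plane, the hyperplane x2 = 1 separates them with a margin.
for x, label in points:
    x1, x2 = feature_map(x)
    print(label, x2 < 1)  # True exactly for the '+' points
```

Kernel methods apply the same idea implicitly, and among all separating hyperplanes the SVM picks the one maximizing the margin.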
f(x) = Σ_z a_z χ_z(x)
χ_z(x) = (-1)^⟨x,z⟩
Many simple classes are well approximated using
only the large coefficients.
Efficient algorithms for finding large coefficients.
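A sketch of the characters and of computing a coefficient by exhaustive averaging over {0,1}^n, which is feasible only for tiny n (the target function is illustrative; the efficient algorithms avoid this brute force):

```python
from itertools import product

def chi(z, x):
    """Parity character chi_z(x) = (-1)^<x,z> over {0,1}^n."""
    return -1 if sum(zi & xi for zi, xi in zip(z, x)) % 2 else 1

def fourier_coeff(f, z, n):
    """a_z = E_x[f(x) chi_z(x)] under the uniform distribution on {0,1}^n."""
    xs = list(product([0, 1], repeat=n))
    return sum(f(x) * chi(z, x) for x in xs) / len(xs)

# The parity of the first two bits has a single coefficient, at z = (1,1,0).
f = lambda x: chi((1, 1, 0), x)
print(fourier_coeff(f, (1, 1, 0), 3), fourier_coeff(f, (1, 0, 0), 3))
```

Orthonormality of the characters is what makes the coefficients well defined: the coefficient is 1 at the matching z and 0 elsewhere.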
Control Problems.
Changing the parameters changes the behavior.
Search for optimal policies.
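A minimal sketch of searching for an optimal policy by value iteration on a toy chain environment (the states, rewards, and dynamics are all illustrative):

```python
def value_iteration(n_states, reward, step, gamma=0.9, iters=200):
    """Value iteration on a deterministic chain MDP; returns the greedy
    (optimal) action, -1 or +1, for each state."""
    V = [0.0] * n_states
    actions = (-1, 1)
    for _ in range(iters):
        V = [max(reward(s) + gamma * V[step(s, a)] for a in actions)
             for s in range(n_states)]
    # The greedy policy with respect to the converged values is optimal.
    return [max(actions, key=lambda a: V[step(s, a)]) for s in range(n_states)]

n = 5
step = lambda s, a: min(max(s + a, 0), n - 1)  # walls at both ends
reward = lambda s: 1.0 if s == n - 1 else 0.0  # goal at the right end
print(value_iteration(n, reward, step))  # every state moves right
```

Here the "parameters" being changed are the value estimates; each sweep changes the induced greedy behavior until it stabilizes on the optimal policy.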