Machine learning foundations
1 / 41

Machine Learning : Foundations - PowerPoint PPT Presentation

  • Uploaded on

Machine Learning : Foundations. Yishay Mansour Tel-Aviv University. Typical Applications. Classification/Clustering problems: Data mining Information Retrieval Web search Self-customization news mail. Typical Applications. Control Robots Dialog system Complicated software

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
Download Presentation

PowerPoint Slideshow about ' Machine Learning : Foundations' - osanna

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Machine learning foundations

Machine Learning: Foundations

Yishay Mansour

Tel-Aviv University

Typical applications
Typical Applications

  • Classification/Clustering problems:

    • Data mining

    • Information Retrieval

    • Web search

  • Self-customization

    • news

    • mail

Typical applications1
Typical Applications

  • Control

    • Robots

    • Dialog system

  • Complicated software

    • driving a car

    • playing a game (backgammon)

Why now
Why Now?

  • Technology ready:

    • Algorithms and theory.

  • Information abundant:

    • Flood of data (online)

  • Computational power

    • Sophisticated techniques

  • Industry and consumer needs.

Example 1 credit risk analysis
Example 1: Credit Risk Analysis

  • Typical customer: bank.

  • Database:

    • Current clients data, including:

    • basic profile (income, house ownership, delinquent account, etc.)

    • Basic classification.

  • Goal: predict/decide whether to grant credit.

Example 1 credit risk analysis1
Example 1: Credit Risk Analysis

  • Rules learned from data:

    IF Other-Delinquent-Accounts > 2 and

    Number-Delinquent-Billing-Cycles >1


    IF Other-Delinquent-Accounts = 0 and

    Income > $30k


Example 2 clustering news
Example 2: Clustering news

  • Data: Reuters news / Web data

  • Goal: Basic category classification:

    • Business, sports, politics, etc.

    • classify to subcategories (unspecified)

  • Methodology:

    • consider “typical words” for each category.

    • Classify using a “distance “ measure.

Example 3 robot control
Example 3: Robot control

  • Goal: Control a robot in an unknown environment.

  • Needs both

    • to explore (new places and action)

    • to use acquired knowledge to gain benefits.

  • Learning task “control” what is observes!

A glimpse in to the future
A Glimpse in to the future

  • Today status:

    • First-generation algorithms:

    • Neural nets, decision trees, etc.

  • Well-formed data-bases

  • Future:

    • many more problems:

    • networking, control, software.

    • Main advantage is flexibility!

Relevant disciplines
Relevant Disciplines

  • Artificial intelligence

  • Statistics

  • Computational learning theory

  • Control theory

  • Information Theory

  • Philosophy

  • Psychology and neurobiology.

Type of models
Type of models

  • Supervised learning

    • Given access to classified data

  • Unsupervised learning

    • Given access to data, but no classification

  • Control learning

    • Selects actions and observes consequences.

    • Maximizes long-term cumulative return.

Learning complete information
Learning: Complete Information

  • Probability D1 over and probability D2 for

  • Equally likely.

  • Computing the probability of “smiley” given a point (x,y).

  • Use Bayes formula.

  • Let p be the probability.

Predictions and loss model
Predictions and Loss Model

  • Boolean Error

    • Predict a Boolean value.

    • each error we lose 1 (no error no loss.)

    • Compare the probability p to 1/2.

    • Predict deterministically with the higher value.

    • Optimal prediction (for this loss)

  • Can not recover probabilities!

Predictions and loss model1
Predictions and Loss Model

  • quadratic loss

    • Predict a “real number” q for outcome 1.

    • Loss (q-p)2 for outcome 1

    • Loss ([1-q]-[1-p])2 for outcome 0

    • Expected loss: (p-q)2

    • Minimized for p=q (Optimal prediction)

  • recovers the probabilities

  • Needs to know p to compute loss!

Predictions and loss model2
Predictions and Loss Model

  • Logarithmic loss

    • Predict a “real number” q for outcome 1.

    • Loss log 1/q for outcome 1

    • Loss log 1/(1-q) for outcome 0

    • Expected loss: -p log q -(1-p) log (1-q)

    • Minimized for p=q (Optimal prediction)

  • recovers the probabilities

  • Loss does not depend on p!

The basic pac model
The basic PAC Model

Distribution D over domain X

Unknown target function f(x)

Goal: find h(x) such that h(x) approx. f(x)

Given H find heH that minimizes PrD[h(x) f(x)]

Basic pac notions
Basic PAC Notions

Example (x,f(x))

S - sample of m examples drawn i.i.d using D

True errore(h)= PrD[h(x)=f(x)]

Observed errore’(h)= 1/m |{ x eS | h(x)  f(x) }|

Basic question: How close is e(h)toe’(h)

Bayesian theory
Bayesian Theory

Prior distribution over H

Given a sample S compute a posterior distribution:

Maximum Likelihood (ML)Pr[S|h]

Maximum A Posteriori (MAP)Pr[h|S]

Bayesian PredictorS h(x) Pr[h|S].

Nearest neighbor methods
Nearest Neighbor Methods

Classify using near examples.

Assume a “structured space” and a “metric”










Computational methods
Computational Methods

How to find an h e H with low observed error.

Most cases computational tasks are provably hard.

Heuristic algorithm for specific classes.

Separating hyperplane







Separating Hyperplane

Perceptron:sign(  xiwi )

Find w1 .... wn

Limited representation

Neural networks



Neural Networks

Sigmoidal gates:

a=  xiwi and

output = 1/(1+ e-a)

Back Propagation

Decision trees




Decision Trees

x1 > 5

x6 > 2

Decision trees1
Decision Trees

Limited Representation

Efficient Algorithms.

Aim: Find a small decision tree with

low observed error.

Decision trees2
Decision Trees


Construct the tree greedy,

using a local index function.

Ginni Index : G(x) = x(1-x), Entropy H(x) ...


Prune the decision Tree

while maintaining low observed error.

Good experimental results

Complexity versus generalization
Complexity versus Generalization

hypothesis complexity versus observed error.

More complex hypothesis have

lower observed error,

but might have higher true error.

Basic criteria for model selection
Basic criteria for Model Selection

Minimum Description Length

e’(h) + |code length of h|

Structural Risk Minimization:

e’(h) + sqrt{ log |H| / m }

Genetic programming
Genetic Programming

A search Method.

Local mutation operations

Cross-over operations

Keeps the “best” candidates.

Example: decision trees

Change a node in a tree

Replace a subtree by another tree

Keep trees with low observed error

General pac methodology
General PAC Methodology

Minimize the observed error.

Search for a small size classifier

Hand-tailored search method for specific classes.

Weak learning
Weak Learning

Small class of predicates H

Weak Learning:

Assume that for any distribution D, there is some predicate heH

that predicts better than 1/2+e.

Strong Learning

Weak Learning

Boosting algorithms
Boosting Algorithms

Functions: Weighted majority of the predicates.


Change the distribution to target “hard” examples.

Weight of an example is exponential in the number of

incorrect classifications.

Extremely good experimental results and efficient algorithms.

Support vector machine
Support Vector Machine

n dimensions

m dimensions

Support vector machine1
Support Vector Machine

Project data to a high dimensional space.

Use a hyperplane in the LARGE space.

Choose a hyperplane with a large MARGIN.








Other models
Other Models

Membership Queries



Fourier transform
Fourier Transform

f(x) = S az cz(x)

cz(x) = (-1)<x,z>

Many Simple classes are well approximated using

large coefficients.

Efficient algorithms for finding large coefficients.

Reinforcement learning
Reinforcement Learning

Control Problems.

Changing the parameters changes the behavior.

Search for optimal policies.

Basic concepts in probability
Basic Concepts in Probability

  • For a single hypothesis h:

    • Given an observed error

    • Bound the true error

  • Markov Inequality

  • Chebyshev Inequality

  • Chernoff Inequality

Basic concepts in probability1
Basic Concepts in Probability

  • Switching from h1 to h2:

    • Given the observed errors

    • Predict if h2 is better.

  • Total error rate

  • Cases where h1(x)  h2(x)

    • More refine