
# 情報知識ネットワーク特論 (Advanced Topics in Information and Knowledge Networks): Prediction and Learning 2: Perceptron and Kernels


### 情報知識ネットワーク特論 Prediction and Learning 2: Perceptron and Kernels

How to learn strings and graphs

[Figure: a chemical compound drawn as a graph, with vertices labeled by atoms C, H, N, X]

• Learning problem
• unknown function f: Graphs → {+1, -1}

[Figure: classification examples: DNA strings (TCGCGAGGT, TCGCGAGGCTAGCT, GCAGAGTAT, TCGCGAGGCTAT, ...) and molecular graphs, each labeled +1 or -1]

Learning Strings and Graphs
• Linear learning machines (this week)
• Classification by a hyperplane in N dimensional space RN
• Efficient learning methods minimizing the regularized risk
• String and graph kernel methods (next week)
• Substring and subgraph features
• Efficient computation by dynamic programming (DP)
Prediction and Learning
• Training Data
• A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
• Prediction
• Predict the output y given a new input x
• Learning
• Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
An On-line Learning Framework
• Data
• A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
• Learning
• A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
• Goal
• Find a good hypothesis h∈H by minimizing the number of mistakes in prediction.

[Littlestone 1987]

Linear Learning Machines
• N-dimensional Euclidean space
• The set of points x = (x1, ..., xN) ∈ RN
• hyperplane
• w = (w1, ..., wN) ∈ RN: a weight vector
• b∈R : a bias
• the hyperplane determined by (w, b)

S = { x ∈RN : 〈w, x〉 + b = 0 }

• Notation
• 〈w, x〉 = w1x1 + ... + wNxN = ∑i wi xi
• ||w||2= 〈w, w〉
Linear Learning Machines
• Linear threshold function f : RN→ {+1, -1}

f(x) = sgn(w1x1 + ... + wNxN + b) = sgn(〈w, x〉 + b)

• function f(x) is determined by the pair (w, b)
• weight vector w = (w1, ..., wN) ∈RN:
• bias b∈R

≡ Linear classifier

[Figure: a separating hyperplane 〈w, x〉 + b = 0 with weight vector w, bias b < 0, and points x labeled +1 / -1]
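As a minimal sketch of the linear classifier above (the weight vector, bias, and test points are invented for illustration):

```python
# Linear threshold classifier f(x) = sgn(<w, x> + b).
# The weight vector w, bias b, and test points are illustrative values.

def classify(w, b, x):
    """Return +1 if <w, x> + b >= 0, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [2.0, -1.0]   # weight vector (normal to the hyperplane)
b = -1.0          # bias

print(classify(w, b, [2.0, 1.0]))   # 2*2 - 1*1 - 1 = 2  -> +1
print(classify(w, b, [0.0, 1.0]))   # 0 - 1 - 1 = -2     -> -1
```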

Margin
• Sample
• S = {(x1, y1), ..., (xm, ym) }
• Margin γ of a hyperplane (w, b) w.r.t. sample S: γ = min(x,y)∈S y(〈w, x〉 + b)/||w||
• Scale invariance
• (w,b) and (cw,cb) define the same hyperplane (c>0)

[Figure: the margin γ of a separating hyperplane: the distance from the closest points x (labeled +1 / -1) to the hyperplane, with weight vector w and bias b < 0]

Perceptron Learning Algorithm
• Perceptron
• Linear classifiers as a model of a single neuron.
• Learning algorithm [Rosenblatt 1956]
• The first iterative algorithm for learning linear classification
• Online and mistake-driven
• Guaranteed to converge in the linearly separable case. The convergence speed is given by a quantity called the margin [Novikoff 1962].
Perceptron Learning Algorithm

Initialization

• When a mistake occurs on (x, y)
• Positive mistake (if y=+1)
• the weight vector w is too weak
• update by w := w + x/||x|| (add normalized input)
• Negative mistake (if y=-1)
• the weight vector w is too strong
• update by w := w - x/||x|| (subtract the normalized input)
• Update rule
• If mistake occurs then update w by w := w + y·x/||x||
Perceptron Learning Algorithm
• Algorithm:
• Given m examples (x1, y1),..., (xm, ym);
• Initialize: w= 0 (= 0N);
• Repeat the following:
• Receive the next input x.
• Predict: f(x)=sgn(〈w, x〉) ∈{+1, -1}.
• Receive the correct output y ∈{+1, -1}.
• If mistake occurs (y f(x)< 0) then update: w := w + y·x /||x||.
• variation w := w +η·y·x/||x||.
• η > 0: a learning parameter
• Assumption: b = 0
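The steps above can be sketched in Python (a minimal sketch: the toy dataset, η, and the epoch limit are invented for illustration; b = 0 as assumed on the slide, and the loop cycles over the sample until it is consistent, as in the exercise below):

```python
# Perceptron with normalized updates: w := w + eta * y * x / ||x|| on a mistake.
import math

def sgn(z):
    return 1 if z >= 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(examples, eta=1.0, epochs=10):
    """Online perceptron (bias b = 0, as assumed on the slide).
    Cycles over the sample until no mistakes occur."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            if y * dot(w, x) <= 0:          # mistake: prediction disagrees with y
                norm = math.sqrt(dot(x, x))
                w = [wi + eta * y * xi / norm for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                   # consistent hypothesis found
            break
    return w

# Toy linearly separable data in R^2 (illustrative).
S = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -1.0], -1), ([-2.0, 0.5], -1)]
w = perceptron(S)
assert all(sgn(dot(w, x)) == y for x, y in S)
```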
Perceptron Learning Algorithm
• Assumption (linearly separable case):
• The unknown linear-threshold function f*(x) = sgn(〈w*, x〉 + b) has margin γ w.r.t. the sample S.
• Theorem (Novikoff 1962):
• The Perceptron learning algorithm makes at most M = (2R/γ)2 mistakes, where R = max(x,y)∈S ||x|| is the size of the largest input vector.
• The mistake bound M of the algorithm is independent of the dimension N
Proof of Theorem (Novikoff)
• A mistake occurs: yf(x) < 0.
• Update: w' = w + y·x /||x||.
• Sketch
• Upper bound of ||w||
• Lower bound of 〈w, w*〉
• Inequality (Cauchy-Schwarz): 〈w, w*〉 ≤ ||w||·||w*||
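The sketch can be expanded as follows (assuming b = 0 as in the algorithm and ||w*|| = 1; the constant in the final bound depends on these normalization conventions):

```latex
% After a mistake on (x, y): w' = w + y x / \|x\|.
\begin{align*}
\langle w', w^* \rangle
  &= \langle w, w^* \rangle + \frac{y \langle x, w^* \rangle}{\|x\|}
   \;\ge\; \langle w, w^* \rangle + \frac{\gamma}{R}
   && \text{(margin: } y\langle x, w^*\rangle \ge \gamma,\ \|x\| \le R\text{)}\\
\|w'\|^2
  &= \|w\|^2 + \frac{2 y \langle w, x \rangle}{\|x\|} + 1
   \;\le\; \|w\|^2 + 1
   && \text{(mistake: } y\langle w, x\rangle \le 0\text{)}
\end{align*}
% After M mistakes, starting from w = 0:
M \frac{\gamma}{R} \;\le\; \langle w, w^* \rangle
  \;\le\; \|w\|\,\|w^*\| \;\le\; \sqrt{M}
\quad\Longrightarrow\quad M \le \left(\frac{R}{\gamma}\right)^2 .
```

With the bias term and other normalizations one obtains the (2R/γ)2 form used in the exercise below.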
Finding a separating hyperplane
• Consistent Hypothesis Finder
• Find any hypothesis within C that separates positive examples from negative examples.
• If the class C is not too complex, then any consistent hypothesis finder learns class C.
• Exercise:
• Show the following: let S be a sample of size m. The perceptron can be modified to find a hypothesis consistent with S in O(mnM) time, where M = (2R/γ)2 is the mistake bound of the perceptron.


Littlestone, Learning quickly when irrelevant attributes abound: A new linear threshold algorithm, Machine Learning, 2(4): 285-318, 1988.

Kivinen and Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132(1):1-63, 1997.

• Perceptron
• Weighted majority & Winnow
• Update: Multiplication
• Different merits...
• Presently, additive-update algorithms are more popular (due to kernel techniques).
Extensions of Perceptron

Kivinen, Smola, Williamson, "Online learning with Kernels", IEEE Trans. Signal Processing.

Extensions of Perceptron Algorithm
• What does the Perceptron algorithm do?
• Risk function + gradient descent
• Perceptron's update rule
• If a mistake occurs then update w := w + y·x/||x||
• Otherwise, do nothing: w := w
• a mistake occurs iff y· f(x) < 0
• Risk function
• Risk = Expected Error + Penalty for Complexity
Risk minimization
• Loss function lo(f(x), y) = lo(y·f(x))
• Expected risk: R[f] = E[ lo(y·f(x)) ]
• Empirical risk: Remp[f] = (1/n) ∑i lo(yi·f(xi))

[Figure: the 0-1 loss lo(z) as a function of z = y·f(x): lo(z) = 1 for z < 0 (error) and lo(z) = 0 for z ≥ 0 (correct)]

Online Risk Minimization for Perceptron
• Batch learning
• Minimizing the empirical risk by optimization methods
• Online learning (Derivation of Perceptron)
• Sample S = { (xt, yt) }. (The last example only)
• Minimization by classical gradient descent
• Same as perceptron's update rule

*1) minimization of the instantaneous risk on a single example

*2) η > 0: learning parameter
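The derivation can be made explicit (a sketch, using the standard perceptron loss as the instantaneous risk on the single example (xt, yt)):

```latex
% Instantaneous risk: l(w) = \max(0,\, -y_t \langle w, x_t \rangle),
% which is 0 exactly when the prediction is correct.
\begin{align*}
\nabla_w\, l(w) &=
  \begin{cases}
    -\,y_t\, x_t & \text{if } y_t \langle w, x_t \rangle < 0 \quad (\text{mistake})\\
    0            & \text{otherwise,}
  \end{cases}\\[4pt]
w \;:=\; w - \eta\,\nabla_w\, l(w) &=
  \begin{cases}
    w + \eta\, y_t\, x_t & \text{on a mistake}\\
    w                    & \text{otherwise,}
  \end{cases}
\end{align*}
```

which is exactly the perceptron update rule (up to the normalization xt/||xt||).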

Regularized risk minimization
• Soft margin loss function
• Problem of error and noises
• margin parameter ρ
• Regularized empirical risk
• Problem of overfitting
• Control the complexity ofweight vector w

[Figure: the soft-margin (hinge) loss lo(z) = max(0, ρ - z) as a function of z = y·f(x), with margin parameter ρ]
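A minimal Python sketch of stochastic gradient descent on the regularized soft-margin risk, in the spirit of the online setting of Kivinen, Smola, and Williamson cited earlier (the toy dataset and the values of ρ, λ, and η are illustrative assumptions):

```python
# SGD on the regularized soft-margin risk:
#   loss(w; x, y) = max(0, rho - y*<w, x>) + (lam/2) * ||w||^2
# Per-example update: shrink w by (1 - eta*lam); if the example is inside
# the margin (y*<w, x> < rho), also add eta * y * x.
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_soft_margin(examples, rho=1.0, lam=0.01, eta=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):
            decay = 1.0 - eta * lam          # shrinkage from the ||w||^2 penalty
            if y * dot(w, x) < rho:          # hinge loss is active
                w = [decay * wi + eta * y * xi for wi, xi in zip(w, x)]
            else:
                w = [decay * wi for wi in w]
    return w

# Toy separable data (illustrative).
S = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -1.0], -1), ([-2.0, 0.5], -1)]
w = sgd_soft_margin(S)
assert all((1 if dot(w, x) >= 0 else -1) == y for x, y in S)
```

Unlike the plain perceptron, the regularization term keeps shrinking w, so the weight vector settles where the functional margins sit near ρ rather than growing without bound.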

Introducing Kernels into Perceptron
• How the Perceptron algorithm works...
• mistake-driven
• update rule of the weight vector.
Online algorithm with Kernels
• Weight vector built by the Perceptron algorithm
• Weighted sum of input vectors: w = ∑i αi·yi·xi/||xi||
• Coefficient αi
• αi = 1 if a mistake occurs at xi.
• αi = 0 otherwise.
• Prediction
• done by the inner-product representation f(x) = sgn(〈w, x〉) = sgn(∑i αi·yi·〈xi, x〉/||xi||) (or kernel computation)
• Kernel function: K(x, x') = 〈φ(x), φ(x')〉 for a feature map φ, so 〈xi, x〉 is replaced by K(xi, x)
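A sketch of the kernelized perceptron described above (the Gaussian kernel choice and the XOR-style toy data are illustrative; input normalization is dropped for simplicity, and αi counts the mistakes at xi):

```python
# Kernel perceptron: f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) ).
# On a mistake at example i, update alpha_i := alpha_i + 1 (the dual update).
import math

def rbf(u, v, sigma=1.0):
    """Gaussian (RBF) kernel K(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2 * sigma ** 2))

def kernel_perceptron(examples, K=rbf, epochs=10):
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(examples):
            s = sum(a * yj * K(xj, xi)
                    for a, (xj, yj) in zip(alpha, examples) if a)
            if yi * s <= 0:              # mistake (s == 0 counts as one)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, examples, x, K=rbf):
    s = sum(a * yj * K(xj, x) for a, (xj, yj) in zip(alpha, examples) if a)
    return 1 if s >= 0 else -1

# XOR-style labels: not linearly separable in R^2, but separable via the kernel.
S = [([0.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], -1)]
alpha = kernel_perceptron(S)
assert all(predict(alpha, S, x) == y for x, y in S)
```

Only the coefficients αi and the training examples are stored; the weight vector w lives implicitly in the feature space of the kernel, which is what makes string and graph kernels (next week) usable here.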
Summary
• What does the Perceptron algorithm do?
• Risk function + gradient descent
• Instantaneous risk minimization (last step)
• Extensions
• Soft margin classification
• Regularized risk minimization
• Kernel trick
• Linear Learning Machine Family
• Perceptron, Winnow, Weighted majority
• SVM, Approximate maximal margin learner, ...