Advanced Topics in Information Knowledge Networks
Prediction and Learning 2: Perceptron and Kernels

Hiroki Arimura, Takuya Kida

Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
email: {arim,kida}@ist.hokudai.ac.jp
http://www-ikn.ist.hokudai.ac.jp/~arim

How to learn strings and graphs

[Figure: chemical compounds drawn as labeled graphs whose vertices are atoms (C, N, H, X)]

  • Learning problem
    • unknown function f: Graphs → {+1, -1}

[Figure: examples to classify as +1 or -1, including DNA strings (TCGCGAGGT, TCGCGAGGCTAGCT, GCAGAGTAT, TCGCGAGGCTAT) and small molecular graphs]

Learning Strings and Graphs
  • Linear learning machines (this week)
    • Classification by a hyperplane in the N-dimensional space RN
    • Efficient learning methods minimizing the regularized risk
  • String and graph kernel methods (next week)
    • Substring and subgraph features
    • Efficient computation by dynamic programming (DP)
Prediction and Learning
  • Training Data
    • A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
  • Prediction
    • Predict the output y given a new input x
  • Learning
    • Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
An On-line Learning Framework
  • Data
    • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
  • Learning
    • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
  • Goal
    • Find a good hypothesis h∈H by minimizing the number of mistakes in prediction.

[Littlestone 1987]
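In code, the protocol above is just a loop that predicts, compares with the revealed label, and updates only on mistakes. The sketch below is a minimal illustration in Python; `predict`, `update`, and `examples` are placeholder names for whatever concrete algorithm (for example, the Perceptron described later) and data stream are plugged in.

```python
# Minimal sketch of the mistake-driven online learning protocol.
# `examples` is an iterable of (x, y) pairs; `hypothesis` is whatever state
# the learner keeps (for the Perceptron, a weight vector).

def online_learning(examples, hypothesis, predict, update):
    mistakes = 0
    for x, y in examples:
        y_hat = predict(hypothesis, x)             # predict before seeing y
        if y_hat != y:                             # a mistake is incurred
            mistakes += 1
            hypothesis = update(hypothesis, x, y)  # mistake-driven update
    return hypothesis, mistakes
```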

Linear Learning Machines
  • N-dimensional Euclidean space
    • The set of points x = (x1, ..., xN) ∈ RN
  • hyperplane
    • w = (w1, ..., wN) ∈ RN : a weight vector
    • b∈R : a bias
    • the hyperplane determined by (w, b)

S = { x ∈RN : 〈w, x〉 + b = 0 }

  • Notation
    • 〈w, x〉 = w1x1 + ... + wNxN = ∑i wi xi
    • ||w||2= 〈w, w〉
Linear Learning Machines
  • Linear threshold function f : RN→ {+1, -1}

f(x) = sgn(w1x1 + ... + wNxN + b) = sgn(〈w, x〉 + b)

    • The function f(x) is determined by the pair (w, b)
      • weight vector w = (w1, ..., wN) ∈RN:
      • bias b∈R

≡ Linear classifier

[Figure: a hyperplane 〈w, x〉 + b = 0 separating +1 and -1 points, with a point x, the weight vector w, and the bias (b < 0) indicated]
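As a concrete illustration of the linear threshold function above, the following minimal Python sketch evaluates f(x) = sgn(〈w, x〉 + b); the convention sgn(0) = +1 is an assumption made here for definiteness.

```python
# Linear classifier f(x) = sgn(<w, x> + b) over R^N, using plain lists.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def linear_classify(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Example: a hyperplane in R^2 with weight vector (1, 1) and bias -1.
print(linear_classify([1.0, 1.0], -1.0, [2.0, 0.5]))   # +1
print(linear_classify([1.0, 1.0], -1.0, [0.0, 0.0]))   # -1
```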

Margin
  • Sample
    • S = {(x1, y1), ..., (xm, ym) }
  • Margin γ of a hyperplane (w, b) w.r.t. sample S
    • γ = min(xi, yi)∈S yi·(〈w, xi〉 + b)/||w||
  • Scale invariance
    • (w,b) and (cw,cb) define the same hyperplane (c>0)

[Figure: a separating hyperplane with margin γ to the nearest +1 and -1 examples, with the weight vector w and the bias (b < 0) indicated]
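The margin of a fixed hyperplane over a sample can be computed directly from its definition; the sketch below (function and variable names are my own) also illustrates the scale invariance noted above.

```python
import math

# Geometric margin of a hyperplane (w, b) w.r.t. a sample S.
# S is a list of (x, y) pairs with y in {+1, -1}.

def margin(w, b, S):
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
               for x, y in S)

S = [([2.0, 0.0], +1), ([0.0, 2.0], +1), ([-1.0, -1.0], -1)]
print(margin([1.0, 1.0], 0.0, S))   # smallest signed distance to the hyperplane
print(margin([2.0, 2.0], 0.0, S))   # same value: (w, b) and (cw, cb) agree
```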

An Online Learning Framework
  • Data
    • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
  • Learning
    • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
  • Goal
    • Find a good hypothesis h∈H by minimizing the number of mistakes in prediction.
Perceptron Learning Algorithm
  • Perceptron
    • Linear classifiers as a model of a single neuron.
  • Learning algorithm [Rosenblatt 1958]
    • The first iterative algorithm for learning linear classification
    • Online and mistake-driven
    • Guaranteed to converge in the linearly separable case. The speed of convergence is governed by a quantity called the margin [Novikoff 1962].
Perceptron Learning Algorithm

Initialization

    • Start with zero vector w :=0
  • When a mistake occurs on (x, y)
    • Positive mistake (if y=+1)
      • the weight vector w is too weak
      • update by w := w + x/||x|| (add normalized input)
    • Negative mistake (if y=-1)
      • the weight vector w is too strong
      • update by w := w - x/||x|| (subtract normalized input)
  • Update rule
    • If mistake occurs then update w by w := w + y·x/||x||
Perceptron Learning Algorithm
  • Algorithm:
    • Given m examples (x1, y1),..., (xm, ym);
    • Initialize: w= 0 (= 0N);
    • Repeat the following:
      • Receive the next input x.
      • Predict: f(x)=sgn(〈w, x〉) ∈{+1, -1}.
      • Receive the correct output y ∈{+1, -1}.
      • If a mistake occurs (y·f(x) < 0) then update: w := w + y·x/||x||.
  • Variation: w := w + η·y·x/||x||.
    • η > 0: a learning parameter
  • Assumption: b = 0
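Putting the algorithm slide into runnable form, here is a minimal sketch of one online pass of the Perceptron with the normalized update and b = 0 (function and variable names are my own, not from the slides).

```python
import math

def sgn(z):
    return 1 if z >= 0 else -1           # convention: sgn(0) = +1

def perceptron(examples, N, eta=1.0):
    """One online pass over `examples` (list of (x, y) with x a length-N
    list and y in {+1, -1}). Returns the learned weight vector and the
    number of mistakes made."""
    w = [0.0] * N                         # initialize w := 0
    mistakes = 0
    for x, y in examples:
        f = sgn(sum(wi * xi for wi, xi in zip(w, x)))      # predict
        if y * f < 0:                     # mistake occurred
            mistakes += 1
            norm_x = math.sqrt(sum(xi * xi for xi in x))   # ||x||
            w = [wi + eta * y * xi / norm_x for wi, xi in zip(w, x)]
    return w, mistakes
```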
Perceptron Learning Algorithm
  • Assumption (linear-separable case):
    • The unknown linear threshold function f*(x) = sgn(〈w*, x〉 + b) has margin γ w.r.t. the sample S.
  • Theorem (Novikoff 1962):
    • The Perceptron learning algorithm makes at most M = (2R/γ)2 mistakes, where R = max(x,y)∈S ||x|| is the norm of the largest input vector.
  • The mistake bound M of the algorithm is independent of the dimension N.
Proof of Theorem (Novikoff)
  • When update is made
    • A mistake occurs: yf(x) < 0.
    • Update: w' = w + y·x /||x||.
  • Sketch
    • Upper bound of ||w||
    • Lower bound of 〈w, w*〉
    • Cauchy-Schwarz inequality: 〈w, w*〉 ≤ ||w||·||w*||.
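Written out, the three steps of the sketch give the bound as follows. This is a standard write-up (not taken from the slides), assuming b = 0, the normalized update, and a unit-norm target w* attaining geometric margin γ on S; it yields k ≤ (R/γ)2, which is within the (2R/γ)2 bound stated in the theorem.

```latex
% Proof sketch in standard form (assumptions: b = 0, normalized update
% w_k = w_{k-1} + y x/||x||, a unit-norm target w* with y<w*, x> >= gamma
% for every (x, y) in S, and ||x|| <= R).

% Lower bound on <w_k, w*> after k mistakes:
\[
  \langle w_k, w^* \rangle
    \;\ge\; \langle w_{k-1}, w^* \rangle + \frac{\gamma}{\lVert x \rVert}
    \;\ge\; \langle w_{k-1}, w^* \rangle + \frac{\gamma}{R}
    \;\ge\; \frac{k\gamma}{R}
\]
% Upper bound on ||w_k||, using y<w_{k-1}, x> <= 0 on a mistake:
\[
  \lVert w_k \rVert^2
    = \lVert w_{k-1} \rVert^2 + \frac{2\,y\,\langle w_{k-1}, x \rangle}{\lVert x \rVert} + 1
    \;\le\; \lVert w_{k-1} \rVert^2 + 1
    \;\le\; k
\]
% Combining via the Cauchy-Schwarz inequality:
\[
  \frac{k\gamma}{R} \;\le\; \langle w_k, w^* \rangle
    \;\le\; \lVert w_k \rVert\,\lVert w^* \rVert \;\le\; \sqrt{k}
  \qquad\Longrightarrow\qquad
  k \;\le\; \Bigl(\frac{R}{\gamma}\Bigr)^{2} \;\le\; \Bigl(\frac{2R}{\gamma}\Bigr)^{2}
\]
```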
Finding a separating hyperplane
  • Consistent Hypothesis Finder
    • Find any hypothesis within C that separates positive examples from negative examples.
    • If the class C is not too complex, then any consistent hypothesis finder learns C.
  • Exercise:
    • Show the following: let S be a sample of size m. We can modify the Perceptron to find a hypothesis consistent with S in O(mnM) time, where M = (2R/γ)2 is the mistake bound of the Perceptron.
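One natural reading of the exercise, sketched below under the assumption that S is linearly separable with margin γ: cycle the Perceptron over S until a full pass makes no mistake. Each non-final pass makes at least one of the at most M possible mistakes, so at most M + 1 passes over the m examples are needed.

```python
import math

def consistent_hypothesis(S, N):
    """Cycle the Perceptron over the sample S (a list of (x, y) pairs) until
    one full pass makes no mistake, i.e. the hypothesis is consistent with S.
    Assumes S is linearly separable; otherwise this loops forever."""
    w = [0.0] * N
    while True:
        mistakes = 0
        for x, y in S:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * (1 if score >= 0 else -1) < 0:        # mistake on (x, y)
                mistakes += 1
                norm_x = math.sqrt(sum(xi * xi for xi in x))
                w = [wi + y * xi / norm_x for wi, xi in zip(w, x)]
        if mistakes == 0:                                # consistent with S
            return w
```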

[Figure: +1 and -1 examples in the plane, to be separated by a hyperplane]

Addition vs. Multiplication

Littlestone, Learning quickly when irrelevant attributes abound: A new linear threshold algorithm, Machine Learning, 2(4): 285-318, 1988.

Kivinen and Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132(1):1-63, 1997.

Addition vs. Multiplication
  • Perceptron
    • Update: Addition
  • Weighted majority & Winnow
    • Update: Multiplication
  • Different merits...
    • Presently, additive update algorithms are more popular (due to Kernel techniques).
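To make the contrast concrete, here is an illustrative sketch of the two update styles on a mistaken example (x, y): the Perceptron adds a multiple of x to w, while Winnow/exponentiated-gradient algorithms multiply each weight by an exponential factor and renormalize. The exact multiplicative rule varies between the papers cited below, so this form is an assumption for illustration only.

```python
import math

def additive_update(w, x, y, eta=1.0):
    """Perceptron-style update: w_i := w_i + eta * y * x_i."""
    return [wi + eta * y * xi for wi, xi in zip(w, x)]

def multiplicative_update(w, x, y, eta=1.0):
    """Winnow/EG-style update (illustrative form only):
    w_i := w_i * exp(eta * y * x_i), followed by renormalization."""
    w = [wi * math.exp(eta * y * xi) for wi, xi in zip(w, x)]
    z = sum(w)
    return [wi / z for wi in w]
```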
Extensions of Perceptron

Kivinen, Smola, Williamson, "Online learning with kernels", IEEE Transactions on Signal Processing, 2004.

Extensions of Perceptron Algorithm
  • What does the Perceptron algorithm do?
    • Risk function + gradient descent
  • Perceptron's update rule
      • If a mistake occurs then update w := w + y·x/||x||
      • Otherwise, do nothing: w := w
      • a mistake occurs iff y· f(x) < 0
  • Risk function
    • Risk = Expected Error + Penalty for Complexity
Risk minimization
  • Loss function lo(f(x), y) = lo(y·f(x))
  • Expected risk: R[f] = E(x,y)[ lo(y·f(x)) ]
  • Empirical risk: Remp[f] = (1/m)·∑i lo(yi·f(xi))

[Figure: the loss lo(z) plotted against z = y·f(x), with the error region (z < 0) and the correct region (z > 0); a second panel shows +1/-1 labeled examples]

Online Risk Minimization for Perceptron
  • Batch learning
    • Minimizing the empirical risk by optimization methods
  • Online learning (Derivation of Perceptron)
    • Sample S = { (xt, yt) }. (The last example only)
    • Minimization by classical gradient descent
    • Same as perceptron's update rule

*1) minimization of the instantaneous risk on a single example

*2) η > 0: learning parameter
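Written out (a sketch; η is the learning parameter and the normalization by ||xt|| matches the slide's update), the derivation takes the perceptron loss on the single example (xt, yt) as the instantaneous risk and performs one gradient descent step:

```latex
% Instantaneous (perceptron) risk on the single example (x_t, y_t), with b = 0:
\[
  R_t(w) \;=\; \max\bigl(0,\; -\,y_t\,\langle w, x_t \rangle / \lVert x_t \rVert \bigr)
\]
% One classical gradient-descent step w := w - eta * grad_w R_t(w).
% The (sub)gradient is -y_t x_t/||x_t|| when a mistake occurs
% (y_t <w, x_t> < 0) and 0 otherwise, so the step reduces to the
% Perceptron update, applied only on mistakes:
\[
  w \;:=\; w + \eta\, y_t\, x_t / \lVert x_t \rVert
\]
```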

Regularized risk minimization
  • Soft margin loss function
    • Problem of errors and noise
    • margin parameter ρ
  • Regularized empirical risk
    • Problem of overfitting
    • Control the complexity of the weight vector w

[Figure: the soft margin loss lo(z) plotted against z = y·f(x): zero for z ≥ ρ and growing linearly below the margin parameter ρ]
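In formulas (standard notation, not copied from the slides; the trade-off constant λ > 0 is my own symbol): the soft margin loss penalizes examples whose margin z = y·f(x) falls below ρ, and the regularized empirical risk adds a norm penalty on w to control complexity.

```latex
% Soft margin loss with margin parameter rho, where z = y * f(x):
\[
  \mathrm{lo}_{\rho}(z) \;=\; \max(0,\; \rho - z)
\]
% Regularized empirical risk (lambda > 0 is an assumed trade-off constant):
\[
  R_{\mathrm{reg}}[f] \;=\; \frac{1}{m} \sum_{i=1}^{m}
      \mathrm{lo}_{\rho}\bigl(y_i\, f(x_i)\bigr)
      \;+\; \frac{\lambda}{2}\, \lVert w \rVert^{2}
\]
```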

Introducing Kernels into Perceptron
  • How the Perceptron algorithm works...
    • mistake-driven
    • update rule of the weight vector.
    • additive update
Perceptron Learning Algorithm

Initialization

    • Start with zero vector w :=0
  • When a mistake occurs on (x, y)
    • Positive mistake (if y=+1)
      • the weight vector w is too weak
      • update by w := w + x/||x|| (add normalized input)
    • Negative mistake (if y=-1)
      • the weight vector w is too strong
      • update by w := w - x/||x|| (subtract normalized input)
  • Update rule
    • If mistake occurs then update w by w := w + y·x/||x||
Online algorithm with Kernels
  • Weight vector built by the Perceptron algorithm
    • Weighted sum of input vectors: w = ∑i αi·yi·xi/||xi||
    • Coefficient αi
      • αi = 1 if a mistake occurs at xi.
      • αi = 0 otherwise.
  • Prediction
    • done by inner products only (or kernel computations): f(x) = sgn(∑i αi·yi·K(xi, x)/||xi||)
  • Kernel function: K(x, x') = 〈x, x'〉, or more generally an inner product in a feature space
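A minimal kernel Perceptron sketch along these lines: instead of storing w explicitly, it stores the coefficients αi together with the mistaken examples and predicts through the kernel function alone, so K can later be swapped for a string or graph kernel. The linear kernel and all names below are illustrative.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = <x, z>; replace with a string or graph kernel as needed."""
    return sum(xi * zi for xi, zi in zip(x, z))

def kernel_perceptron(examples, K=linear_kernel):
    """One online pass; returns the stored (alpha_i, x_i, y_i) triples
    for the examples on which a mistake occurred (alpha_i = 1)."""
    support = []
    for x, y in examples:
        # f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) / ||x_i|| )
        score = sum(a * yi * K(xi, x) / math.sqrt(K(xi, xi))
                    for a, xi, yi in support)
        if y * (1 if score >= 0 else -1) < 0:      # mistake: store alpha = 1
            support.append((1.0, x, y))
    return support

def predict(support, x, K=linear_kernel):
    score = sum(a * yi * K(xi, x) / math.sqrt(K(xi, xi))
                for a, xi, yi in support)
    return 1 if score >= 0 else -1
```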
Summary
  • What does the Perceptron algorithm do?
    • Risk function + gradient descent
    • Instantaneous risk minimization (last step)
  • Extensions
    • Soft margin classification
    • Regularized risk minimization
    • Kernel trick
  • Linear Learning Machine Family
    • Perceptron, Winnow, Weighted majority
    • SVM, Approximate maximal margin learner, ...