
情報知識ネットワーク特論 Prediction and Learning 2: Perceptron and Kernels

Presentation Transcript


  1. 情報知識ネットワーク特論 Prediction and Learning 2: Perceptron and Kernels • 有村 博紀, 喜田拓也 • Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University • email: {arim,kida}@ist.hokudai.ac.jp • http://www-ikn.ist.hokudai.ac.jp/~arim

  2. How to learn strings and graphs • Learning problem • unknown function f: Graphs → {+1, -1} • [Figure: example inputs, chemical-structure graphs and DNA strings such as TCGCGAGGT and GCAGAGTAT, each classified as +1 or -1 by the unknown function]

  3. Learning Strings and Graphs • Linear learning machines (this week) • Classification by a hyperplane in N-dimensional space RN • Efficient learning methods minimizing the regularized risk • String and graph kernel methods (next week) • Substring and subgraph features • Efficient computation by dynamic programming (DP)

  4. Prediction and Learning • Training Data • A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule • Prediction • Predict the output y given a new input x • Learning • Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.

  5. An On-line Learning Framework • Data • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule. • Learning • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process. • Goal • Find a good hypothesis h ∈ H by minimizing the number of mistakes in prediction. [Littlestone 1987]

  6. Linear Learning Machines

  7. Linear Learning Machines • N-dimensional Euclidean space • The set of points x = (x1, ..., xN) ∈ RN • Hyperplane • w = (w1, ..., wN) ∈ RN: a weight vector • b ∈ R: a bias • the hyperplane determined by (w, b): S = { x ∈ RN : 〈w, x〉 + b = 0 } • Notation • 〈w, x〉 = w1x1 + ... + wNxN = ∑i wi xi • ||w||² = 〈w, w〉

  8. Linear Learning Machines • Linear threshold function f: RN → {+1, -1}, f(x) = sgn(w1x1 + ... + wNxN + b) = sgn(〈w, x〉 + b) • The function f(x) is determined by the pair (w, b) • weight vector w = (w1, ..., wN) ∈ RN • bias b ∈ R • ≡ Linear classifier • [Figure: a hyperplane 〈w, x〉 + b = 0 separating +1 points from -1 points, with weight vector w and bias b < 0]
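As a concrete illustration of the linear threshold function above, here is a minimal Python sketch (the function name predict and the toy numbers are illustrative assumptions, not from the slides):

```python
import numpy as np

def predict(w, b, x):
    """Linear threshold function f(x) = sgn(<w, x> + b), with values in {+1, -1}."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Toy example: a hyperplane in R^2 with weight vector (1, 1) and bias -1.
w = np.array([1.0, 1.0])
b = -1.0
print(predict(w, b, np.array([2.0, 2.0])))   # +1: the point lies on the positive side
print(predict(w, b, np.array([0.0, 0.0])))   # -1: the point lies on the negative side
```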

  9. Margin • Sample • S = {(x1, y1), ..., (xm, ym)} • Margin γ of a hyperplane (w, b) w.r.t. sample S: γ = min_i yi (〈w, xi〉 + b) / ||w|| • Scale invariance • (w, b) and (cw, cb) define the same hyperplane (c > 0) • [Figure: a separating hyperplane with the margin γ shown as the distance to the closest +1 and -1 points, weight vector w, bias b < 0]
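A small Python sketch of the margin computation and the scale invariance noted above (the function name margin and the toy sample are illustrative, not from the slides):

```python
import numpy as np

def margin(w, b, S):
    """Margin of hyperplane (w, b) w.r.t. a labeled sample S = [(x, y), ...]:
    the minimum of y * (<w, x> + b) / ||w|| over S; it is positive iff
    (w, b) classifies every example in S correctly."""
    norm = np.linalg.norm(w)
    return min(y * (np.dot(w, x) + b) / norm for x, y in S)

S = [(np.array([2.0, 2.0]), +1), (np.array([0.0, 0.0]), -1)]
w, b = np.array([1.0, 1.0]), -1.0

# Scale invariance: (w, b) and (c*w, c*b) describe the same hyperplane (c > 0),
# and the normalization by ||w|| makes the margin identical for both.
print(margin(w, b, S), margin(3 * w, 3 * b, S))
```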

  10. An Online Learning Framework • Data • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule. • Learning • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process. • Goal • Find a good hypothesis h ∈ H by minimizing the number of mistakes in prediction.

  11. Perceptron Learning Algorithm • Perceptron • Linear classifiers as a model of a single neuron. • Learning algorithm [Rosenblatt 1956] • The first iterative algorithm for learning linear classifiers • Online and mistake-driven • Guaranteed to converge in the linearly separable case. The speed is given by the quantity called the margin [Novikoff 1962].

  12. Perceptron Learning Algorithm • Initialization • Start with the zero vector w := 0 • When a mistake occurs on (x, y) • Positive mistake (if y = +1) • the weight vector w is too weak • update by w := w + x/||x|| (add the normalized input) • Negative mistake (if y = -1) • the weight vector w is too strong • update by w := w - x/||x|| (subtract the normalized input) • Update rule • If a mistake occurs then update w by w := w + y·x/||x||

  13. Perceptron Learning Algorithm • Algorithm: • Given m examples (x1, y1), ..., (xm, ym); • Initialize: w = 0 (= 0N); • Repeat the following: • Receive the next input x. • Predict: f(x) = sgn(〈w, x〉) ∈ {+1, -1}. • Receive the correct output y ∈ {+1, -1}. • If a mistake occurs (y·f(x) < 0) then update: w := w + y·x/||x||. • Variation: w := w + η·y·x/||x||. • η > 0: a learning parameter • Assumption: b = 0
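The following Python sketch implements the algorithm on this slide under its b = 0 assumption (the function name perceptron, the multi-pass loop, and the parameter defaults are my own choices, not from the slides):

```python
import numpy as np

def perceptron(examples, passes=1, eta=1.0):
    """Mistake-driven perceptron with normalized additive updates (bias b = 0).

    examples: list of (x, y) pairs, x an np.array, y in {+1, -1}.
    Returns the learned weight vector and the total number of mistakes.
    """
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    for _ in range(passes):
        for x, y in examples:
            f = 1 if np.dot(w, x) >= 0 else -1        # predict sgn(<w, x>)
            if y * f < 0:                             # mistake occurred
                w += eta * y * x / np.linalg.norm(x)  # additive, normalized update
                mistakes += 1
    return w, mistakes
```

Setting passes greater than 1 simply replays the same sample through the online protocol; slide 16 turns exactly this idea into a consistent-hypothesis finder.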

  14. Perceptron Learning Algorithm • Assumption (linearly separable case): • The unknown linear threshold function f*(x) = sgn(〈w*, x〉 + b) has margin γ w.r.t. a sample S. • Theorem (Novikoff 1962): • The Perceptron learning algorithm makes at most (2R/γ)² mistakes, where R = max(x,y)∈S ||x|| is the size of the largest input vector. • The mistake bound M of the algorithm is independent of the dimension N
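As a quick numerical illustration of the theorem (not part of the lecture), the snippet below builds a linearly separable toy sample whose target hyperplane is w* = (1, 0, ..., 0) with b = 0, runs the mistake-driven perceptron until it is consistent, and compares the mistake count with the stated bound (2R/γ)²; all data and names here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable data in R^5: the label is the sign of the first coordinate,
# and points too close to the boundary are dropped so the margin is positive.
X = rng.uniform(-1.0, 1.0, size=(300, 5))
X = X[np.abs(X[:, 0]) > 0.2]
y = np.where(X[:, 0] > 0, 1, -1)

w_star = np.array([1.0, 0.0, 0.0, 0.0, 0.0])               # the (known) target hyperplane
R = np.linalg.norm(X, axis=1).max()                         # radius of the sample
gamma = (y * (X @ w_star)).min() / np.linalg.norm(w_star)   # its margin w.r.t. w*

# Cycle the perceptron over the sample until a full pass makes no mistake.
w, mistakes, changed = np.zeros(5), 0, True
while changed:
    changed = False
    for xi, yi in zip(X, y):
        if yi * (1 if np.dot(w, xi) >= 0 else -1) < 0:
            w += yi * xi / np.linalg.norm(xi)
            mistakes += 1
            changed = True

print(mistakes, "<=", (2 * R / gamma) ** 2)                 # the theorem's mistake bound
```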

  15. Proof of Theorem (Novikoff) • When an update is made • A mistake occurs: y·f(x) < 0. • Update: w' = w + y·x/||x||. • Sketch • Upper bound on ||w||: it grows at most like the square root of the number of mistakes • Lower bound on 〈w, w*〉: it grows at least linearly in the number of mistakes, with a factor proportional to the margin γ • Inequality: 〈w, w*〉 ≤ ||w||·||w*|| • Combining the two bounds through this Cauchy-Schwarz inequality gives the mistake bound.

  16. Finding a separating hyperplane • Consistent Hypothesis Finder • Find any hypothesis within the class C that separates the positive examples from the negative examples. • If a class C is not too complex, then any consistent hypothesis finder learns class C. • Exercise: • Show the following: Let S be a sample of size m. We can modify the perceptron to find a hypothesis consistent with S in O(mnM) time, where M = (2R/γ)² is the mistake bound of the perceptron. (One possible sketch follows below.) • [Figure: positive and negative points separated by a hyperplane]
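One possible way to approach the exercise, sketched in Python under the same b = 0 assumption as slide 13 (this is only one of several valid solutions):

```python
import numpy as np

def consistent_perceptron(S):
    """Run the perceptron over sample S repeatedly until one full pass makes
    no mistake, i.e. until the hypothesis is consistent with S.

    If S is linearly separable with margin gamma, Novikoff's theorem limits
    the number of updates to M = (2R/gamma)^2, so at most M + 1 passes are
    needed; each pass over m examples in R^n costs O(mn), hence O(mnM) total.
    """
    w = np.zeros(len(S[0][0]))
    while True:
        mistake_made = False
        for x, y in S:
            if y * (1 if np.dot(w, x) >= 0 else -1) < 0:
                w += y * x / np.linalg.norm(x)
                mistake_made = True
        if not mistake_made:
            return w          # consistent with every example in S
```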

  17. Addition vs. Multiplication • Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, Machine Learning, 2(4): 285-318, 1988. • Kivinen and Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132(1): 1-63, 1997.

  18. Addition vs. Multiplication • Perceptron • Update: addition • Weighted majority & Winnow • Update: multiplication • Different merits... • Presently, additive-update algorithms are more popular (due to kernel techniques).

  19. Extensions of Perceptron • Kivinen, Smola, Williamson, "Online learning with kernels", IEEE Trans. Signal Processing.

  20. Extensions of the Perceptron Algorithm • What does the Perceptron algorithm do? • Risk function + gradient descent • Perceptron's update rule • If a mistake occurs then update w := w + y·x/||x|| • Otherwise, do nothing: w := w • A mistake occurs iff y·f(x) < 0 • Risk function • Risk = Expected Error + Penalty for Complexity

  21. Risk minimization • Loss function lo(f(x), y) = lo(y·f(x)) • Expected risk: R[f] = E(x,y)[ lo(y·f(x)) ] (expectation over the unknown distribution of (x, y)) • Empirical risk: Remp[f] = (1/m) ∑i=1..m lo(yi·f(xi)) • [Figure: the loss lo(z) plotted against z = y·f(x); z < 0 is the error region, z > 0 the correct region]

  22. Online Risk Minimization for Perceptron • Batch learning • Minimizing the empirical risk by optimization methods • Online learning (derivation of Perceptron) • Sample S = { (xt, yt) } (the last example only) • Minimization by classical gradient descent • Same as the perceptron's update rule • *1) minimization of the instantaneous risk on a single example • *2) η > 0: learning parameter
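The derivation on this slide can be made concrete in a few lines of Python: one stochastic (sub)gradient step on the instantaneous loss lo(y·〈w, x〉) = max(0, -y·〈w, x〉) reproduces the perceptron update (the normalization by ||x|| is dropped here for brevity; function names are illustrative):

```python
import numpy as np

def instantaneous_loss(w, x, y):
    """Perceptron loss on a single example: lo(y * <w, x>) = max(0, -y * <w, x>)."""
    return max(0.0, -y * np.dot(w, x))

def sgd_step(w, x, y, eta=1.0):
    """One gradient-descent step on the instantaneous loss.

    The (sub)gradient w.r.t. w is -y * x when the example is misclassified
    (y * <w, x> < 0) and 0 otherwise, so the step w := w - eta * grad is
    exactly the perceptron rule: update on mistakes, do nothing otherwise.
    """
    grad = -y * x if y * np.dot(w, x) < 0 else np.zeros_like(x)
    return w - eta * grad
```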

  23. Regularized risk minimization • Soft-margin loss function lo(z) = max(0, ρ - z), z = y·f(x) • Problem of errors and noise • margin parameter ρ • Regularized empirical risk: Remp[f] plus a complexity penalty on w, typically Rreg[f] = Remp[f] + (λ/2)·||w||² (λ > 0) • Problem of overfitting • Control the complexity of the weight vector w • [Figure: the soft-margin loss lo(z) against z = y·f(x), decreasing linearly until it reaches 0 at z = ρ]
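A sketch of one online (sub)gradient step on such a regularized risk, in the spirit of the Kivinen-Smola-Williamson reference on slide 19 (the exact update form, the λ/2 coefficient, and all parameter defaults are assumptions of a standard formulation, not read off the slides):

```python
import numpy as np

def soft_margin_loss(z, rho=1.0):
    """Soft-margin loss lo(z) = max(0, rho - z), where z = y * f(x)."""
    return max(0.0, rho - z)

def regularized_step(w, x, y, eta=0.1, lam=0.01, rho=1.0):
    """One online step on the instantaneous regularized risk
    max(0, rho - y * <w, x>) + (lam / 2) * ||w||^2.

    The regularizer shrinks w on every round (controlling its complexity);
    the loss term pushes w towards y * x only when the example violates the
    margin, i.e. when y * <w, x> < rho.
    """
    violated = y * np.dot(w, x) < rho
    w = (1.0 - eta * lam) * w        # gradient of the (lam/2) * ||w||^2 term
    if violated:
        w = w + eta * y * x          # subgradient of the soft-margin loss term
    return w
```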

  24. Introducing Kernels into Perceptron • How the Perceptron algorithm works... • mistake-driven • update rule of the weight vector. • additive update

  25. Perceptron Learning Algorithm • Initialization • Start with the zero vector w := 0 • When a mistake occurs on (x, y) • Positive mistake (if y = +1) • the weight vector w is too weak • update by w := w + x/||x|| (add the normalized input) • Negative mistake (if y = -1) • the weight vector w is too strong • update by w := w - x/||x|| (subtract the normalized input) • Update rule • If a mistake occurs then update w by w := w + y·x/||x||

  26. Online algorithm with Kernels • Weight vector built by the Perceptron algorithm • Weighted sum of input vectors: w = ∑i αi·yi·xi/||xi|| • Coefficient αi • αi = 1 if a mistake occurs at xi • αi = 0 otherwise • Prediction • done in the inner-product representation (or by kernel computation): f(x) = sgn( ∑i αi·yi·〈xi, x〉/||xi|| ) • Kernel function: K(x, x') = 〈φ(x), φ(x')〉 for a feature map φ
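A minimal Python sketch of the kernelized version described above; it keeps only the coefficients αi and never forms w explicitly. The multi-pass loop generalizes the slide's single-pass αi ∈ {0, 1} to mistake counts, the ||xi|| normalization of the primal update is dropped for brevity, and the RBF kernel at the end is just one example of a kernel function:

```python
import numpy as np

def kernel_perceptron(examples, kernel, passes=5):
    """Perceptron in dual form: w is the implicit weighted sum of the inputs,
    so prediction needs only kernel evaluations
        f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) ).
    """
    alpha = np.zeros(len(examples))
    for _ in range(passes):
        for j, (xj, yj) in enumerate(examples):
            s = sum(a * yi * kernel(xi, xj)
                    for a, (xi, yi) in zip(alpha, examples) if a != 0)
            if yj * (1 if s >= 0 else -1) < 0:   # mistake on example j
                alpha[j] += 1                    # strengthen its coefficient
    return alpha

# Example kernel: Gaussian (RBF); any positive semidefinite kernel can be used.
rbf = lambda x, z: np.exp(-np.dot(x - z, x - z))
```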

  27. Summary • What does the Perceptron algorithm do? • Risk function + gradient descent • Instantaneous risk minimization (last step) • Extensions • Soft-margin classification • Regularized risk minimization • Kernel trick • Linear Learning Machine Family • Perceptron, Winnow, Weighted majority • SVM, approximate maximal margin learners, ...
