情報知識ネットワーク特論 (Advanced Topics in Information Knowledge Networks) Prediction and Learning 2: Perceptron and Kernels


**情報知識ネットワーク特論 Prediction and Learning 2: Perceptron and Kernels**

Hiroki Arimura and Takuya Kida
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
email: {arim,kida}@ist.hokudai.ac.jp
http://www-ikn.ist.hokudai.ac.jp/~arim

**How to Learn Strings and Graphs**

[Figure: example inputs: chemical-compound graphs and DNA strings such as TCGCGAGGT, GCAGAGTAT, and TCGCGAGGCTAT, each labeled +1 or -1]

- Learning problem
  - Classify inputs using an unknown function f: Graphs → {+1, -1}

**Learning Strings and Graphs**

- Linear learning machines (this week)
  - Classification by a hyperplane in the N-dimensional space R^N
  - Efficient learning methods minimizing the regularized risk
- String and graph kernel methods (next week)
  - Substring and subgraph features
  - Efficient computation by dynamic programming (DP)

**Prediction and Learning**

- Training data
  - A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
- Prediction
  - Predict the output y given a new input x
- Learning
  - Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}

**An On-line Learning Framework**

- Data
  - A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule
- Learning
  - A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
- Goal
  - Find a good hypothesis h ∈ H by minimizing the number of mistakes in prediction. [Littlestone 1987]

**Linear Learning Machines**

- N-dimensional Euclidean space
  - The set of points x = (x1, ..., xN) ∈ R^N
- Hyperplane
  - w = (w1, ..., wN) ∈ R^N: a weight vector
  - b ∈ R: a bias
  - The hyperplane determined by (w, b): S = { x ∈ R^N : 〈w, x〉 + b = 0 }
- Notation
  - 〈w, x〉 = w1x1 + ···
+ wNxN = ∑i wi xi
  - ||w||² = 〈w, w〉

**Linear Learning Machines**

- Linear threshold function f: R^N → {+1, -1}
  - f(x) = sgn(w1x1 + ··· + wNxN + b) = sgn(〈w, x〉 + b)
- The function f(x) is determined by the pair (w, b):
  - weight vector w = (w1, ..., wN) ∈ R^N
  - bias b ∈ R
- Equivalent to a linear classifier

[Figure: a hyperplane 〈w, x〉 + b = 0 separating the +1 points from the -1 points, with weight vector w and bias b < 0]

**Margin**

- Sample
  - S = {(x1, y1), ..., (xm, ym)}
- Margin γ of a hyperplane (w, b) w.r.t. the sample S
  - γ = min_i yi(〈w, xi〉 + b) / ||w||
- Scale invariance
  - (w, b) and (cw, cb) define the same hyperplane for any c > 0

[Figure: a separating hyperplane with margin γ between the hyperplane and the nearest points]

**An Online Learning Framework (recap)**

- Same framework as above: the algorithm receives the next input xn, predicts h(xn), receives the output yn, and updates the hypothesis h whenever yn ≠ h(xn).

**Perceptron Learning Algorithm**

- Perceptron
  - A linear classifier as a model of a single neuron
- Learning algorithm [Rosenblatt 1956]
  - The first iterative algorithm for learning linear classifiers
  - Online and mistake-driven
  - Guaranteed to converge in the linearly separable case; the speed is governed by a quantity called the margin [Novikoff 1962]

**Perceptron Learning Algorithm**

- Initialization
  - Start with the zero vector w := 0
- When a mistake occurs on (x, y)
  - Positive mistake (y = +1): the weight vector w is too weak; update by w := w + x/||x|| (add the normalized input)
  - Negative mistake (y = -1): the weight vector w is too strong; update by w := w - x/||x|| (subtract the normalized input)
- Update rule
  - If a mistake occurs, update w by w := w + y·x/||x||

**Perceptron Learning Algorithm**

- Algorithm
  - Given m examples (x1, y1), ..., (xm, ym);
  - Initialize: w = 0 (the zero vector 0^N);
  - Repeat the following:
    - Receive the next input x.
    - Predict: f(x) = sgn(〈w, x〉) ∈ {+1, -1}.
    - Receive the correct output y ∈ {+1, -1}.
    - If a mistake occurs (y·f(x) < 0), update: w := w + y·x/||x||.
  - Variation: w := w + η·y·x/||x||, where η > 0 is a learning parameter
  - Assumption: b = 0

**Perceptron Learning Algorithm**

- Assumption (linearly separable case)
  - The unknown linear threshold function f* = sgn(〈w*, x〉 + b) has margin γ w.r.t. the sample S.
- Theorem (Novikoff 1962)
  - The Perceptron learning algorithm makes at most M ≤ (2R/γ)² mistakes, where R = max_{(x,y)∈S} ||x|| is the size of the largest input vector.
  - The mistake bound M of the algorithm is independent of the dimension N.

**Proof of Theorem (Novikoff)**

- When an update is made
  - A mistake occurs: y·f(x) < 0.
  - Update: w' = w + y·x/||x||.
- Sketch
  - Upper-bound ||w|| after each update.
  - Lower-bound 〈w, w*〉 after each update.
  - Combine with the inequality 〈w, w*〉 ≤ ||w||·||w*||.

**Finding a Separating Hyperplane**

- Consistent hypothesis finder
  - Finds any hypothesis within a class C that separates the positive examples from the negative examples.
  - If the class C is not too complex, then any consistent hypothesis finder learns C.
- Exercise
  - Show the following: let S be a sample of size m. We can modify the Perceptron to find a hypothesis consistent with S in O(mnM) time, where M = (2R/γ)² is the mistake bound of the Perceptron.

**Addition vs. Multiplication**

- Littlestone, Learning quickly when irrelevant attributes abound: a new linear threshold algorithm, Machine Learning, 2(4):285-318, 1988.
- Kivinen and Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132(1):1-63, 1997.

**Addition vs. Multiplication**

- Perceptron
  - Update: addition
- Weighted Majority & Winnow
  - Update: multiplication
- Each has different merits; presently, additive-update algorithms are more popular (due to kernel techniques).

**Extensions of Perceptron**

- Kivinen, Smola, and Williamson, "Online learning with kernels", IEEE Transactions on Signal Processing.

**Extensions of Perceptron Algorithm**

- What does the Perceptron algorithm do?
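As a concrete reference point for this question, here is a minimal sketch of the mistake-driven Perceptron loop stated above, with the normalized update w := w + y·x/||x|| and zero bias. This is an illustrative sketch, not code from the lecture: it assumes NumPy, and it treats 〈w, x〉 = 0 as a mistake (a convention of this sketch, so the initial zero vector gets updated; the slides only require y·f(x) < 0).

```python
import numpy as np

def perceptron(examples, max_epochs=100):
    """Mistake-driven Perceptron with normalized additive updates
    (w := w + y*x/||x||) and zero bias, cycling over the sample."""
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    for _ in range(max_epochs):
        converged = True
        for x, y in examples:
            x = np.asarray(x, dtype=float)
            # Mistake test; 0 counts as a mistake in this sketch.
            if y * np.dot(w, x) <= 0:
                w += y * x / np.linalg.norm(x)  # normalized additive update
                mistakes += 1
                converged = False
        if converged:  # a full mistake-free pass: stop
            break
    return w, mistakes

# A linearly separable toy sample; after training, every example
# satisfies y * <w, x> > 0.
data = [((2, 1), 1), ((1, 2), 1), ((-1, -2), -1), ((-2, -1), -1)]
w, mistakes = perceptron(data)
```

On this toy sample a single update already yields a consistent hypothesis, in line with the Novikoff bound M ≤ (2R/γ)².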
  - Risk function + gradient descent
- Perceptron's update rule
  - If a mistake occurs, update w := w + y·x/||x||
  - Otherwise, do nothing: w := w
  - A mistake occurs iff y·f(x) < 0
- Risk function
  - Risk = Expected Error + Penalty for Complexity

**Risk Minimization**

- Loss function: lo(f(x), y) = lo(y·f(x))
- Expected risk: the expectation of the loss over the unknown distribution of examples
- Empirical risk: the average loss over the training sample

[Figure: loss lo(z) as a function of z = y·f(x): error for z < 0, correct for z > 0]

**Online Risk Minimization for Perceptron**

- Batch learning
  - Minimize the empirical risk by optimization methods
- Online learning (derivation of the Perceptron)
  - Sample S = {(xt, yt)} (the last example only)
  - Minimize by classical gradient descent (*1, *2)
  - Same as the Perceptron's update rule
- (*1) Minimization of the instantaneous risk on a single example
- (*2) η > 0: a learning parameter

**Regularized Risk Minimization**

- Soft-margin loss function
  - Addresses errors and noise
  - Margin parameter ρ
- Regularized empirical risk
  - Addresses the problem of overfitting
  - Controls the complexity of the weight vector w

[Figure: soft-margin loss against z = y·f(x) with margin parameter ρ: error for z < ρ, correct otherwise]

**Introducing Kernels into Perceptron**

- How the Perceptron algorithm works:
  - Mistake-driven
  - Update rule acts on the weight vector
  - Additive update

**Perceptron Learning Algorithm (recap)**

- Start with w := 0; whenever a mistake occurs on (x, y), update w := w + y·x/||x|| (as on the earlier slide).

**Online Algorithm with Kernels**

- Weight vector built by the Perceptron algorithm
  - A weighted sum of the input vectors
- Coefficient αi
  - αi = 1 if a mistake occurs at xi
  - αi = 0 otherwise
- Prediction
  - Done via the inner-product representation (or kernel computation)
- Kernel function: K(x, x') = 〈φ(x), φ(x')〉 for a feature map φ

**Summary**

- What does the Perceptron algorithm do?
  - Risk function + gradient descent
  - Instantaneous risk minimization (on the last step)
- Extensions
  - Soft-margin classification
  - Regularized risk minimization
  - Kernel trick
- Linear learning machine family
  - Perceptron, Winnow, Weighted Majority
  - SVM, approximate maximal-margin learners, ...
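The kernel trick recapped above can be made concrete with a dual ("kernelized") Perceptron: instead of storing w, store a mistake count αi per example and predict with sgn(∑i αi·yi·K(xi, x)), as on the "Online Algorithm with Kernels" slide (there αi ∈ {0, 1}; letting αi count repeated mistakes is the usual generalization). This sketch assumes NumPy; the polynomial kernel and the XOR-style sample are illustrative choices, not from the lecture.

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    # Illustrative kernel: K(a, b) = (<a, b> + 1)^degree.
    return (np.dot(a, b) + 1.0) ** degree

def kernel_perceptron(examples, kernel, max_epochs=10):
    """Dual Perceptron: alpha[i] counts the mistakes made on example i;
    prediction is f(x) = sgn(sum_i alpha[i] * y_i * K(x_i, x))."""
    xs = [np.asarray(x, dtype=float) for x, _ in examples]
    ys = [y for _, y in examples]
    alpha = [0] * len(examples)

    def predict(x):
        s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, ys, xs) if a)
        return 1 if s > 0 else -1

    for _ in range(max_epochs):
        mistake_free = True
        for i in range(len(examples)):
            if predict(xs[i]) != ys[i]:  # mistake-driven, as in the primal
                alpha[i] += 1            # additive update, now in the dual
                mistake_free = False
        if mistake_free:
            break
    return alpha, predict

# XOR-style labels are not linearly separable in R^2, but they are
# separable in the feature space of the degree-2 polynomial kernel.
xor = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
alpha, predict = kernel_perceptron(xor, poly_kernel)
```

Only the kernel evaluations change between problem domains; this is what lets the same algorithm run on string and graph kernels in the next lecture.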