情報知識ネットワーク特論 (Advanced Topics in Information Knowledge Networks)
Prediction and Learning 1: Majority Vote Algorithm

Hiroki Arimura, Takuya Kida

Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
email: {arim,kida}

Prediction and Learning
  • Training Data
    • A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
  • Prediction
    • Predict the output y given a new input x
  • Learning
    • Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
An On-line Learning Framework
  • Data
    • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
  • Learning
    • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. This process repeats.
  • Goal
    • Find a good hypothesis h∈H by minimizing the number of mistakes in prediction.
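The protocol above can be sketched as a simple loop. This is a minimal illustration; the names `online_learn`, `update`, and `stream` are assumptions, not from the slides.

```python
# Sketch of the on-line learning protocol: predict, receive feedback,
# update the hypothesis after each mistake.

def online_learn(h, update, stream):
    """Run the prediction/feedback loop, counting mistakes."""
    mistakes = 0
    for x, y in stream:          # examples arrive one at a time
        y_hat = h(x)             # predict with the current hypothesis
        if y_hat != y:           # the feedback reveals the true output
            mistakes += 1
            h = update(h, x, y)  # revise the hypothesis after a mistake
    return h, mistakes
```

Any concrete algorithm in this lecture (naive, halving, majority vote) is an instance of this loop with a particular `update` rule.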
Learning an unknown function


    • Select a hypothesis h∈H for making the prediction y = h(x) from a given class of functions H = {h0, h1, h2, ..., hi, ...}.
  • Question
    • How can we select the best hypothesis h∈H that minimizes the number of mistakes during prediction?
    • We ignore the computation time.
Naive Algorithm (Sequential)
  • Algorithm:
    • Given: the hypothesis class H= {h1, ..., hN}.
    • Initialize: k = 1;
    • Repeat the following:
      • Receive the next input x.
      • Predict by h(x) = hk(x). Receive the correct output y.
      • If a mistake occurs then k = k + 1.

Exhaustive search!

  • Observation:
    • The naive algorithm makes at most N mistakes.
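The naive sequential strategy can be sketched as follows; `naive_sequential` and the variable names are illustrative, assuming the consistent case (some hypothesis in the list is perfect).

```python
# Naive sequential algorithm: try hypotheses one by one, advancing to the
# next hypothesis whenever the current one makes a mistake.

def naive_sequential(hypotheses, stream):
    """Return the number of mistakes made over the stream."""
    k = 0                          # index of the current hypothesis
    mistakes = 0
    for x, y in stream:
        if hypotheses[k](x) != y:  # mistake: discard h_k, move to h_{k+1}
            mistakes += 1
            k += 1
    return mistakes                # at most N in the consistent case
```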
Halving Algorithm
  • Naive Algorithm
    • makes N mistakes in the worst case.
    • Moreover, N = |H| is usually exponentially large in the size |h| of a hypothesis h∈H.
  • Basic Idea
    • We want to achieve an exponential speed-up!
    • Eliminate at least half of the hypotheses whenever a mistake happens.
    • A key is to carefully choose the prediction value h(x) by majority voting so that one mistake implies at least half of the hypotheses fail.
Halving Algorithm
  • Algorithm:
    • Initialize the hypothesis class H = {h1, ..., hN}.
    • Repeat the following:
      • Receive the next input x.
      • Split H into A+1 = { h∈H : h(x) = +1 } and A-1 = { h∈H : h(x) = -1 }.
      • If |A+1| ≥ |A-1| then predict ŷ = +1; otherwise predict ŷ = -1. Receive the correct output y.
      • If the prediction is wrong (ŷ ≠ y) then remove all hypotheses that made the mistake: H = H − Aŷ.

[Barzdins and Freivalds 1972]

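The halving step above admits a short sketch, assuming labels in {+1, -1} and a consistent hypothesis somewhere in H; function and variable names are illustrative.

```python
# Halving algorithm: predict by majority vote over the surviving hypotheses,
# then discard the erring majority on every mistake.

def halving(H, stream):
    """Return the number of mistakes; |H| at least halves per mistake."""
    H = list(H)
    mistakes = 0
    for x, y in stream:
        A_pos = [h for h in H if h(x) == +1]
        A_neg = [h for h in H if h(x) == -1]
        y_hat = +1 if len(A_pos) >= len(A_neg) else -1  # majority vote
        if y_hat != y:
            mistakes += 1
            # the majority that just erred is removed, so at least
            # half of the remaining hypotheses are eliminated
            H = A_neg if y_hat == +1 else A_pos
    return mistakes
```

Because each mistake removes at least half of H, the mistake count cannot exceed lg N, matching the theorem below.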

Halving Algorithm: Result
  • Assumption (Consistent case):
    • The unknown function f belongs to the class H.
  • Theorem (Barzdins '72; Littlestone '87):
    • The Halving algorithm makes at most lg N mistakes, where N is the number of hypotheses in H.
  • This gives a general strategy to design efficient online learning algorithms.
  • The Halving algorithm is not optimal [Littlestone '90].
  • When receiving an input vector x ∈ {+1,-1}^n (whose i-th component xi is expert i's prediction),
    • x splits the set A of active experts into A+1 and A-1, where Aα = { i ∈ A : xi = α } for each α ∈ {+1,-1}.
    • Since the prediction is made according to the larger set, if a mistake occurs then the larger half is removed from A.
    • Therefore, the number of active experts in A decreases by at least half with every mistake.
    • It follows that |A| ≤ n⋅(1/2)^M after M mistakes. Note that any subset Aα containing a perfect expert always makes the correct prediction, so every perfect expert survives every update of A.
    • Since |A| ≥ 1 by the consistency assumption, we have n⋅(1/2)^M ≥ 1, and we conclude that the halving algorithm makes at most M = lg n mistakes.
Majority Vote Algorithm

Naive & Halving Algorithms

    • Work only in the consistent case.
    • Often miss the correct answer in an inconsistent case.
  • Inconsistent case
    • A target function does not always belong to the hypothesis class H.
    • None of the hypotheses can completely trace the target function.
  • Tentative Goal
    • Predict as well as the best expert.
Majority Vote Algorithm
  • Majority Vote algorithm:
    • Initialize: w = (1, ..., 1) ∈ R^N;
    • For t = 1, ..., m do:
      • Receive the next input x.
      • Predict by f(x) = sign( Σi wi hi(x) )  (weighted majority vote).
      • Receive the correct answer y ∈ {+1,-1}.
      • If a mistake occurs (y ≠ f(x)) then for all hi ∈ H such that hi(x) = f(x) do wi = wi / 2.  // halve the weights of the majority hypotheses that contributed to the last (wrong) prediction

[Littlestone & Warmuth 1989]
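The update above can be sketched as follows; the function name and the explicit β = 1/2 parameter are illustrative, matching the slides' halving of weights.

```python
# Weighted majority (majority vote) algorithm: predict by the sign of the
# weighted vote, and halve the weights of the experts that voted with the
# wrong prediction.

def weighted_majority(H, stream, beta=0.5):
    """Return the number of mistakes made over the stream."""
    w = [1.0] * len(H)                   # one weight per expert
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in H]
        s = sum(wi * v for wi, v in zip(w, votes))
        y_hat = +1 if s >= 0 else -1     # sign of the weighted vote
        if y_hat != y:
            mistakes += 1
            # penalize exactly the experts that contributed to the mistake
            w = [wi * beta if v == y_hat else wi
                 for wi, v in zip(w, votes)]
    return mistakes
```

Unlike halving, no expert is ever eliminated, so the algorithm degrades gracefully in the inconsistent case.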

Majority Vote Algorithm: Result
  • Assumption (Inconsistent case):
    • The unknown function f may not belong to H.
    • The best expert makes m mistakes with respect to the target function f.
  • Theorem (Littlestone & Warmuth)
    • The majority vote algorithm makes at most 2.41(m + lg N) mistakes, where N is the number of hypotheses in H and m is the number of mistakes of the best expert.
  • The majority vote algorithm behaves nearly as well as the unknown best expert.
  • First,
    • we focus on the change of the total weight W = Σi wi during learning.
    • Suppose that by some round t ≥ 1, the best expert has made m mistakes and the majority vote algorithm has made M mistakes so far. Initially, the total weight is W = n by construction.
  • Suppose that
    • the majority vote algorithm makes a mistake on an input vector x with weight vector w.
    • Let I be the set of experts who contributed to the prediction, and let WI = ∑i ∈I wi be the sum of the corresponding weights.

By assumption,

    • we have WI ≥ W⋅(1/2) (*1), since the weighted majority determined the (wrong) prediction.
    • Since the weights of the wrong experts in I are halved, the total weight W' after the update satisfies W' = W − WI⋅(1/2) ≤ W − W⋅(1/2)⋅(1/2) = W − W⋅(1/4) = W⋅(3/4) by (*1). Thus,
      • Wt ≤ Wt-1⋅(3/4).
    • Since the total weight shrinks to at most 3/4 of its previous value whenever a mistake occurs, the current total is upper-bounded by
      • W ≤ n⋅(3/4)^M  (*2).
  • On the other hand,
    • we observe the change of the weight of the best expert, say k. By assumption, the best expert k made m mistakes.
    • Since the initial weight is wk = 1 and its weight is halved at most m times, the current weight satisfies wk ≥ (1/2)^m > 0 (*3).

If k is one of the experts in E = {1, ..., n}, then its weight is part of W. Therefore, at any round, we have the inequality wk ≤ W (*4).

  • Combining
    • the inequalities (*2), (*3), and (*4), we obtain
      • (1/2)^m ≤ n⋅(3/4)^M.
    • Solving this inequality: (1/2)^m ≤ n⋅(3/4)^M ⇒ (4/3)^M ≤ n⋅2^m ⇒ M lg(4/3) ≤ m + lg n ⇒ M ≤ (1/lg(4/3))(m + lg n),
    • we have M ≤ 2.41⋅(m + lg n) since 1/lg(4/3) = 2.4094.... ■
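As a quick numeric check of the constant in the final bound:

```python
# Verify the constant 1/lg(4/3) appearing in the bound M ≤ (1/lg(4/3))(m + lg n).
import math

c = 1.0 / math.log2(4 / 3)
print(round(c, 4))  # ≈ 2.4094, hence the stated bound M ≤ 2.41 (m + lg n)
```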
  • Learning functions from examples
    • Given a class H of exponentially many hypotheses
    • The simplest strategy: select the best hypothesis from H.
  • Sequential algorithm
    • Consistent case: O(|H|) mistakes
  • Halving algorithms
    • Consistent case: O(log |H|) mistakes
  • Majority Voting algorithm
    • Inconsistent case: O(m + log |H|) mistakes, where m is the mistake bound of the best hypothesis.
  • Next
    • Linear learning machines (Perceptron and Winnow)