
# 情報知識ネットワーク特論 Prediction and Learning 1: Majority Vote Algorithm


### 情報知識ネットワーク特論 Prediction and Learning 1: Majority Vote Algorithm

Prediction and Learning
• Training Data
• A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
• Prediction
• Predict the output y given a new input x
• Learning
• Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
An On-line Learning Framework
• Data
• A stream of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
• Learning
• A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
• Goal
• Find a good hypothesis h ∈ H by minimizing the number of mistakes in prediction.
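The protocol above can be sketched as a generic driver loop. The `Learner` interface with `predict`/`update` methods is a hypothetical shape chosen for illustration (the slides fix no interface); each algorithm in this lecture fits it:

```python
# A minimal sketch of the on-line learning protocol described above.
# The learner's predict/update methods are hypothetical names, not
# part of the original slides.

def run_online(learner, stream):
    """Feed (x, y) pairs to the learner one at a time; count mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)   # predict before seeing the label
        if y_hat != y:               # a mistake is incurred
            mistakes += 1
            learner.update(x, y)     # update the hypothesis on a mistake
        # on a correct prediction the hypothesis is kept as-is
    return mistakes
```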
Learning an unknown function

Strategy

• Select a hypothesis h ∈ H for making the prediction y = h(x) from a given class of functions H = {h0, h1, h2, ..., hi, ...}.
• Question
• How can we select the best hypothesis h ∈ H, i.e., one that minimizes the number of mistakes during prediction?
• We ignore the computation time.
Naive Algorithm (Sequential)
• Algorithm:
• Given: the hypothesis class H = {h1, ..., hN}.
• Initialize: k = 1;
• Repeat the following:
• Receive the next input x.
• Predict by h(x) = hk(x). Receive the correct output y.
• If a mistake occurs then k = k + 1.

Exhaustive search!

• Observation:
• The naive algorithm makes at most N mistakes: each mistake discards one hypothesis, and a correct hypothesis never errs.
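The naive sequential algorithm can be sketched as follows; a minimal version assuming binary labels in {+1, -1} and, as in the consistent case above, that some hypothesis in the list is always correct:

```python
# A minimal sketch of the naive sequential algorithm: try hypotheses
# one by one, moving to the next only when the current one errs.
# Assumes the consistent case, so the index never runs past the list.

def naive_sequential(hypotheses, stream):
    """hypotheses: list of functions h(x) -> label; stream: (x, y) pairs.
    Returns the number of mistakes made."""
    k = 0           # index of the current hypothesis
    mistakes = 0
    for x, y in stream:
        if hypotheses[k](x) != y:   # mistake: discard h_k, advance to h_{k+1}
            mistakes += 1
            k += 1
    return mistakes
```

Each mistake eliminates exactly one hypothesis, so in the consistent case the mistake count is bounded by the number of hypotheses N.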
Halving Algorithm
• Naive Algorithm
• causes N mistakes in the worst case.
• N is usually exponentially large in the size |h| of a hypothesis h ∈ H.
• Basic Idea
• We want to achieve an exponential speed-up!
• Eliminate at least half of the hypotheses whenever a mistake happens.
• The key is to choose the prediction value h(x) carefully, by majority voting, so that one mistake implies that at least half of the hypotheses fail.
Halving Algorithm
• Algorithm:
• Initialize the hypothesis class H = {h1, ..., hN}.
• Repeat the following:
• Receive the next input x.
• Split H into A+1 = { h ∈ H : h(x) = +1 } and A-1 = { h ∈ H : h(x) = -1 }.
• If |A+1| ≥ |A-1| then predict ŷ = +1; otherwise predict ŷ = -1. Receive the output y.
• If the prediction is wrong (ŷ ≠ y) then remove all hypotheses that made the mistake: H = H - Aŷ.

[Barzdin and Feivalds 1972]

Majority voting

Eliminate at least half
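The Halving algorithm can be sketched as follows; a minimal version assuming labels in {+1, -1}, with ties broken toward +1:

```python
# A minimal sketch of the Halving algorithm (consistent case): predict
# by majority vote over the surviving hypotheses, and on each mistake
# discard every hypothesis that voted for the wrong label.

def halving(hypotheses, stream):
    """hypotheses: list of functions h(x) -> {+1,-1}; stream: (x, y) pairs.
    Returns (number of mistakes, surviving hypotheses)."""
    H = list(hypotheses)            # active hypotheses
    mistakes = 0
    for x, y in stream:
        plus = [h for h in H if h(x) == +1]
        minus = [h for h in H if h(x) == -1]
        y_hat = +1 if len(plus) >= len(minus) else -1  # majority vote
        if y_hat != y:              # the majority was wrong, so
            mistakes += 1           # at least half of H is removed
            H = minus if y_hat == +1 else plus
    return mistakes, H
```

Because a mistake always removes the larger of the two halves, each mistake at least halves |H|, which is exactly the source of the log N mistake bound stated next.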

Halving Algorithm: Result
• Assumption (Consistent case):
• The unknown function f belongs to the class H.
• Theorem (Barzdins '72; Littlestone '87):
• The Halving algorithm makes at most log N mistakes, where N is the number of hypotheses in H.
• This gives a general strategy to design efficient online learning algorithms.
• The Halving algorithm is not optimal [Littlestone 90].
[Proof]
• When receiving an input x,
• x splits the active hypotheses in A into A+1 and A-1, where Av = { h ∈ A : h(x) = v } for each v ∈ {+1, -1}.
• Since the prediction is made according to the larger set, if a mistake occurs then the larger half is removed from A.
• Therefore, the number of active hypotheses in A decreases by at least half on each mistake.
• It follows that |A| ≤ N⋅(1/2)^M after M mistakes. Note that a perfect hypothesis always makes the correct prediction and hence never belongs to the removed subset.
• This ensures that every perfect hypothesis survives any update of A.
• Since |A| ≥ 1 by the consistency assumption, we have 1 ≤ N⋅(1/2)^M, and we conclude that the Halving algorithm makes at most M ≤ lg N mistakes. ■
Majority Vote Algorithm

Naive & Halving Algorithms

• Work only in the consistent case
• Often miss the correct answer in an inconsistent case
• Inconsistent case
• The target function does not necessarily belong to the hypothesis class H.
• None of the hypotheses can completely trace the target function.
• Tentative Goal
• Predict as well as the best expert
Majority Vote Algorithm
• Majority Vote algorithm:
• Initialize: w = (1, ..., 1) ∈ RN;
• Repeat the following:
• Receive the next input x.
• Predict by f(x) = sign( Σi wi hi(x) ) (weighted majority vote)
• If a mistake occurs (y ≠ f(x)) then

For all hi ∈ H such that hi(x) = f(x) do wi = wi / 2

// the majority hypotheses that contributed to the last (wrong) prediction
[Littlestone & Warmuth 1989]
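The update above can be sketched as follows; a minimal version of the deterministic weighted-majority scheme, assuming labels in {+1, -1} and ties broken toward +1:

```python
# A minimal sketch of the Majority Vote (Weighted Majority) algorithm:
# keep one weight per expert, predict by the weighted vote, and halve
# the weights of every expert that joined a wrong majority.

def weighted_majority(experts, stream):
    """experts: list of functions h(x) -> {+1,-1}; stream: (x, y) pairs.
    Returns (number of mistakes, final weight vector)."""
    w = [1.0] * len(experts)        # initialize w = (1, ..., 1)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in experts]
        score = sum(wi * v for wi, v in zip(w, votes))
        y_hat = +1 if score >= 0 else -1   # weighted majority vote
        if y_hat != y:
            mistakes += 1
            # halve the weight of every expert that voted with the
            # wrong majority (i.e. predicted y_hat)
            w = [wi / 2 if v == y_hat else wi
                 for wi, v in zip(w, votes)]
    return mistakes, w
```

Unlike Halving, no expert is ever fully eliminated: a wrong expert only loses weight, so the algorithm keeps working even when no expert is perfectly consistent.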

Majority Vote Algorithm: Result
• Assumption (Inconsistent case):
• The unknown function f may not belong to H.
• The best expert makes m mistakes with respect to the target function f.
• Theorem (Littlestone & Warmuth)
• The majority vote algorithm makes at most 2.41(m + log N) mistakes, where N is the number of hypotheses in H.
• The majority vote algorithm behaves almost as well as the unknown best expert.
[Proof]
• First,
• we focus on the change of the sum of the weights W = ∑i wi during learning.
• Suppose that, up to the current round, the best expert has made m mistakes and the majority vote algorithm has made M mistakes. Initially, the sum of the weights is W = N by construction.
• Suppose that
• the majority vote algorithm makes a mistake on an input x with weight vector w.
• Let I be the set of experts who contributed to the (wrong) prediction, and let WI = ∑i∈I wi be the sum of the corresponding weights.

Since the prediction follows the weighted majority,

• we have WI ≥ W⋅(1/2) (*1).
• Since the weights of the wrong experts in I are halved, the sum W' of the weights after the update satisfies W' = W - WI⋅(1/2) ≤ W - W⋅(1/2)⋅(1/2) = W - W⋅(1/4) = W⋅(3/4) by (*1). Thus,
• Wt ≤ Wt-1⋅(3/4).
• Since the sum W shrinks to at most 3/4 of its previous value whenever a mistake occurs, the current sum is upper-bounded by
• W ≤ N⋅(3/4)^M (*2).
• On the other hand,
• we observe the change of the weight of a best expert, say k. By assumption, the best expert k has made m mistakes.
• Since the initial weight is wk = 1 and its weight is halved exactly m times, the current weight is wk = (1/2)^m (*3).

Since k is one of the experts in E = {1, ..., N}, its weight is a part of W. Therefore, at any round, we have the inequality wk ≤ W (*4).

• Combining
• (*2), (*3), and (*4), we obtain the inequality
• (1/2)^m ≤ N⋅(3/4)^M.
• Solving this inequality: (1/2)^m ≤ N⋅(3/4)^M ⇒ (4/3)^M ≤ N⋅2^m ⇒ M lg(4/3) ≤ m + lg N ⇒ M ≤ (1/lg(4/3))(m + lg N),
• we have M ≤ 2.41⋅(m + lg N) since 1/lg(4/3) = 2.4094.... ■
Conclusion
• Learning functions from examples
• Given a class H of exponentially many hypotheses
• The simplest strategy: select the best one from H
• Sequential algorithm
• Consistent case: O(|H|) mistakes
• Halving algorithm
• Consistent case: O(log |H|) mistakes
• Majority Vote algorithm
• Inconsistent case: O(m + log |H|) mistakes, where m is the mistake bound of the best hypothesis
• Next
• Linear learning machines (Perceptron and Winnow)