
Learning Scenario


Presentation Transcript


  1. Learning Scenario
  • The learner considers a set of candidate hypotheses H and attempts to find the most probable one h ∈ H, given the observed data D.
  • Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes' theorem can be used to compute it:
    h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h)/P(D) = argmax_{h ∈ H} P(D|h)P(h)
  CS446-Fall '06
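Over a finite hypothesis set the MAP computation can be sketched in a few lines; the function and variable names below (map_hypothesis, prior, likelihood) are illustrative, not from the slides.

```python
# A minimal sketch of finding h_MAP over a finite hypothesis set,
# assuming P(D|h) and P(h) can be evaluated for each h.
def map_hypothesis(hypotheses, prior, likelihood, data):
    """Return argmax_h P(D|h) * P(h); P(D) is constant in h and can be dropped."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior[h])

# Tiny usage example with hand-picked probabilities.
prior = {"h1": 0.75, "h2": 0.25}
likelihood = lambda data, h: {"h1": 0.5, "h2": 0.6}[h]
print(map_hypothesis(["h1", "h2"], prior, likelihood, "H"))  # h1: 0.375 > 0.15
```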

  2. Learning Scenario (2)
  • If we ignore prior probabilities over H, or assume all hypotheses are a priori equally probable, we get the maximum likelihood (ML) hypothesis:
    h_ML = argmax_{h ∈ H} P(D|h)
  • This is the hypothesis that best explains the data, without taking into account any prior preference over H.

  3. Examples
  • A given coin is either fair or has a 60% bias in favor of Heads. Decide which coin produced the data.
  • Two hypotheses: h1: P(H) = 0.5; h2: P(H) = 0.6
  • Prior: P(h1) = 0.75, P(h2) = 0.25
  • Now we need data. 1st experiment: the coin toss is heads, so D = {H}.
  • P(D|h): P(D|h1) = 0.5; P(D|h2) = 0.6
  • P(D) = P(D|h1)P(h1) + P(D|h2)P(h2) = 0.5 × 0.75 + 0.6 × 0.25 = 0.525
  • P(h|D): P(h1|D) = P(D|h1)P(h1)/P(D) = 0.5 × 0.75 / 0.525 ≈ 0.714
    P(h2|D) = P(D|h2)P(h2)/P(D) = 0.6 × 0.25 / 0.525 ≈ 0.286
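The arithmetic on this slide can be checked directly; the snippet below just transcribes the slide's numbers into Bayes' theorem.

```python
# Reproducing the slide's one-toss example with Bayes' theorem.
prior = {"h1": 0.75, "h2": 0.25}   # fair coin vs. 60%-biased coin
lik   = {"h1": 0.5,  "h2": 0.6}    # P(D = {H} | h)

p_d = sum(lik[h] * prior[h] for h in prior)          # P(D) = 0.525
post = {h: lik[h] * prior[h] / p_d for h in prior}   # P(h|D)
print(round(post["h1"], 3), round(post["h2"], 3))    # 0.714 0.286
```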

  4. Examples (2)
  • A given coin is either fair or has a 60% bias in favor of Heads. Decide which coin produced the data.
  • Two hypotheses: h1: P(H) = 0.5; h2: P(H) = 0.6
  • Prior: P(h1) = 0.75, P(h2) = 0.25
  • After the 1st coin toss comes up H, we still think the coin is more likely to be fair.
  • If we were to use the maximum likelihood approach (i.e., assume equal priors), we would think otherwise: the data support the biased coin better.
  • Try: 100 coin tosses with 70 heads.
  • Now we will believe that the coin is biased:

  5. Examples (3)
  • h1: P(H) = 0.5; h2: P(H) = 0.6
  • Prior: P(h1) = 0.75, P(h2) = 0.25
  • Experiment 2: 100 coin tosses; 70 heads.
  • Posterior: P(h1|D) ≈ 0.0057, P(h2|D) ≈ 0.9943
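The posterior for the 100-toss experiment is easiest to compute in log space. A sketch using the slide's hypotheses and priors (the binomial coefficient is common to both hypotheses and cancels; the exact decimals depend on rounding conventions, so small differences from the slide's figures are expected):

```python
import math

hyps = {"h1": (0.75, 0.5), "h2": (0.25, 0.6)}   # h -> (P(h), P(Heads|h))
k, m = 70, 100                                   # 70 heads in 100 tosses

# log P(D|h) + log P(h); the binomial coefficient cancels in the posterior.
log_joint = {h: math.log(p_h) + k * math.log(q) + (m - k) * math.log(1 - q)
             for h, (p_h, q) in hyps.items()}
shift = max(log_joint.values())                  # subtract max for stability
unnorm = {h: math.exp(v - shift) for h, v in log_joint.items()}
z = sum(unnorm.values())
post = {h: v / z for h, v in unnorm.items()}     # h2 gets nearly all the mass
```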

  6. h_ML is often easier to find than h_MAP
  • Recall: h_MAP = argmax_{h ∈ H} P(D|h)P(h), while h_ML = argmax_{h ∈ H} P(D|h)
  • H is often parameterized or otherwise highly structured, so h_ML can be found directly or iteratively from D.
  • With prior information over H, a hypothesis with a smaller likelihood than h_ML but a larger prior P(h) may have a larger product P(D|h)P(h), and so be selected over h_ML as h_MAP.
  • In principle, all hypotheses in H must be examined to find h_MAP.

  7. Parameterized Models
  • We attempt to model the underlying distribution D(x, y) or D(y | x).
  • To do that, we assume a model P(x, y | θ) or P(y | x, θ), where θ is the set of parameters of the model.
  • Example: probabilistic context-free grammars. We assume a model of independent rules, so P(x, y | θ) is the product of rule probabilities; the parameters are the rule probabilities.
  • Example: log-linear models. We assume a model of the form
    P(y | x, θ) = exp{Φ(x, y) · θ} / Σ_{y'} exp{Φ(x, y') · θ}
    where θ is a vector of parameters and Φ(x, y) is a vector of features.

  8. Log-Likelihood
  • We often look at the log-likelihood. Why?
  • It makes mathematically explicit that independence of the samples turns products into sums: the log-likelihood is linear in the per-sample terms.
  • Given training samples (x_i, y_i), maximize the log-likelihood:
    L(θ) = Σ_i log P(x_i, y_i | θ)  or  L(θ) = Σ_i log P(y_i | x_i, θ)
  • By monotonicity of log(·), the θ that maximizes the log-likelihood is the same θ that maximizes the likelihood.
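The last bullet can be checked numerically: a grid search over a coin's bias p (an illustrative setup, not from the slides) finds the same maximizer for the likelihood and the log-likelihood.

```python
import math

# 7 heads in 10 tosses; search p on a fine grid, excluding the endpoints.
k, m = 7, 10
grid = [i / 1000 for i in range(1, 1000)]

best_lik = max(grid, key=lambda p: p**k * (1 - p)**(m - k))
best_log = max(grid, key=lambda p: k * math.log(p) + (m - k) * math.log(1 - p))
print(best_lik, best_log)  # both 0.7, i.e., k/m
```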

  9. Parameterized Maximum Likelihood Estimate
  • Model: the coin has a weighting p which determines its behavior.
  • Tossing the (p, 1−p) coin m times yields k Heads and m−k Tails.
  • H is the universe of p, say [0, 1]. What is h_ML?
  • If p is the probability of Heads, the probability of the observed data is: P(D|p) = p^k (1−p)^(m−k)
  • The log-likelihood: L(p) = log P(D|p) = k log(p) + (m−k) log(1−p)
  • To maximize, set the derivative with respect to p to zero: dL(p)/dp = k/p − (m−k)/(1−p) = 0
  • Solving this for p gives p = k/m, the sample average.
  • Notice that we disregard any interesting prior over H (especially striking at low m).
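The closed-form estimate and its stationarity condition can be verified in a few lines (a sketch; the helper name is illustrative):

```python
def p_ml(k, m):
    """Closed-form ML estimate for the coin bias: p = k/m."""
    return k / m

k, m = 70, 100
p = p_ml(k, m)                     # 0.7
grad = k / p - (m - k) / (1 - p)   # dL/dp, essentially 0 at the maximizer
print(p, grad)
```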

  10. Justification
  • Assumption: our choice of model is good; that is, there is some parameter setting θ* such that P(x, y | θ*) approximates D(x, y).
  • Define the maximum-likelihood estimate: θ_ML = argmax_θ L(θ)
  • Property of maximum-likelihood estimates: as the training sample size goes to ∞, P(x, y | θ_ML) converges to the best approximation of D(x, y) within the model space, given the data.
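This consistency property can be illustrated by simulation (an illustrative sketch, not from the lecture): for a coin with true bias 0.6, the ML estimate k/m drifts toward the truth as the sample size m grows.

```python
import random

random.seed(0)                      # fixed seed for reproducibility
true_p = 0.6
for m in (10, 1_000, 100_000):
    k = sum(random.random() < true_p for _ in range(m))
    print(m, k / m)                 # estimates approach 0.6 as m grows
```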
