
CS b351 Learning Probabilistic Models



Presentation Transcript


  1. CS b351: Learning Probabilistic Models

  2. Motivation
  • Past lectures have studied how to infer characteristics of a distribution, given a fully specified Bayes net
  • Next few lectures: where does the Bayes net come from?

  3. [Diagram: a small Bayes net relating Strength, Opponent Strength, and Win?]

  4. [Diagram: a larger Bayes net over Win?, Offense Strength, Defense Strength, Opp. Off. Strength, Opp. Def. Strength, Rush Yds, Pass Yds, Rush Yds Allowed, and Score Allowed]

  5-6. [Diagram, shown on two consecutive slides: the same network extended with Injuries?, Opp Injuries?, At Home?, and Strength of Schedule]

  7. Agenda
  • Learning probability distributions from example data
  • Influence of structure on performance
  • Maximum likelihood estimation (MLE)
  • Bayesian estimation

  8. Probabilistic Estimation Problem
  • Our setting:
    • Given a set of examples drawn from the target distribution
    • Each example is complete (fully observable)
  • Goal:
    • Produce a representation of a belief state so that we can perform inference and make predictions

  9. Density Estimation
  • Given a dataset D = {d[1], …, d[M]} drawn from an underlying distribution P*
  • Find a distribution that matches P* as "closely" as possible
  • High-level issues:
    • There is usually not enough data to get an accurate picture of P*, which forces us to approximate
    • Even if we did have P*, how would we define "closeness" (both theoretically and in practice)?
    • How do we maximize "closeness"?

  10. What Class of Probability Models?
  • For small discrete distributions, just use a tabular representation
    • Very efficient learning techniques
  • For large discrete distributions or continuous ones, the choice of probability model is crucial
  • Increasing complexity =>
    • Can represent complex distributions more accurately
    • Need more data to learn well (risk of overfitting)
    • More expensive to learn and to perform inference with

  11. Two Learning Problems
  • Parameter learning
    • What entries should be put into the model's probability tables?
  • Structure learning
    • Which variables should be represented / transformed for inclusion in the model?
    • What direct / indirect relationships between variables should be modeled?
    • A more "high-level" problem
    • Once a structure is chosen, a set of (unestimated) parameters emerges; these must then be estimated via parameter learning

  12. Learning Coin Flips
  • Cherry and lime candies are in an opaque bag
  • Observe that c out of N draws are cherries (data)

  13. Learning Coin Flips
  • Observe that c out of N draws are cherries (data)
  • Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)
  • "Intuitive" parameter estimate: the empirical distribution P(cherry) ≈ c/N (this will be justified more thoroughly later)
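
A minimal sketch of this empirical estimate in Python (the list of draws is hypothetical, not from the lecture):

```python
# Empirical estimate of the cherry fraction: count cherries, divide by N.
draws = ["cherry", "lime", "cherry", "cherry", "lime"]  # hypothetical data

c = sum(1 for d in draws if d == "cherry")
N = len(draws)
print(f"P(cherry) ~ {c}/{N} = {c / N:.2f}")
```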

  14. Structure Learning Example: Histogram Bucket Sizes
  • Histograms are used to estimate distributions of continuous variables (or discrete ones with many values)… but how fine should the buckets be?
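
To make the structural choice concrete, here is a small sketch (assuming NumPy is available; the data are hypothetical) that summarizes the same sample at three bucket resolutions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical continuous sample

# The bin count is a structural choice made before any parameters are fit:
# too coarse blurs the shape, too fine leaves mostly-empty buckets.
for bins in (5, 20, 100):
    counts, _ = np.histogram(data, bins=bins)
    print(f"{bins:3d} bins: largest bucket holds {counts.max()} of 200 points")
```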

  15. Structure Learning: Independence Relationships
  • Compare the full table P(A,B,C,D) vs. the factored form P(A)P(B)P(C)P(D)
  • Case 1: 15 free parameters (16 entries minus the sum-to-1 constraint)
    • P(A,B,C,D) = p1
    • P(A,B,C,¬D) = p2
    • …
    • P(¬A,¬B,¬C,D) = p15
    • P(¬A,¬B,¬C,¬D) = 1 - p1 - … - p15
  • Case 2: 4 free parameters
    • P(A) = p1, P(¬A) = 1 - p1
    • …
    • P(D) = p4, P(¬D) = 1 - p4
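
The gap between the two cases grows exponentially with the number of variables; a small illustrative sketch:

```python
# Free parameters for n binary variables.
def full_joint_params(n):
    return 2 ** n - 1  # table entries minus the sum-to-1 constraint

def factored_params(n):
    return n           # one Bernoulli parameter per variable

for n in (4, 10, 20):
    print(n, full_joint_params(n), factored_params(n))
# n=4 reproduces the slide: 15 vs. 4 free parameters
```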

  16. Structure Learning: Independence Relationships
  • Compare the table P(A,B,C,D) vs. P(A)P(B)P(C)P(D)
  • P(A,B,C,D)
    • Can fit ALL relationships in the data
  • P(A)P(B)P(C)P(D)
    • Inherently lacks the capacity to model correlations, such as a tendency for A and B to agree
    • Leads to biased estimates: it overestimates or underestimates the true probabilities

  17. [Figure: a joint distribution over X and Y, shown as the original P(X,Y) alongside the version learned under the independence assumption P(X)P(Y)]
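
A numeric sketch of the same effect (the counts are hypothetical, chosen so that X and Y tend to agree):

```python
from collections import Counter

# 20 hypothetical paired outcomes in which X and Y usually match.
data = [("H", "H")] * 8 + [("H", "T")] * 2 + [("T", "H")] * 2 + [("T", "T")] * 8
N = len(data)

joint = {xy: n / N for xy, n in Counter(data).items()}
p_x = sum(v for (x, _), v in joint.items() if x == "H")  # marginal P(X=H)
p_y = sum(v for (_, y), v in joint.items() if y == "H")  # marginal P(Y=H)

print("empirical joint P(H,H):", joint[("H", "H")])  # 0.4
print("independent model P(H)P(H):", p_x * p_y)      # 0.25: the correlation is lost
```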

  18. Structure Learning: Expressive Power
  • Making more independence assumptions always makes a probabilistic model less expressive
  • If the independence relationships assumed by structure A are a superset of those assumed by structure B, then B can express any probability distribution that A can
  [Diagram: three small networks over X, Y, Z with progressively fewer edges]

  19. [Diagram: two candidate structures over a class variable C and features F1, F2, …, Fk: arcs from C to each Fi, or arcs from each Fi to C? Which one?]

  20. Arcs do not necessarily encode causality!
  [Diagram: two Bayes nets over A, B, C with arcs reversed relative to each other; both can encode the same joint probability distribution]

  21. Reading Off Independence Relationships
  • Given B, does the value of A affect the probability of C?
    • Formally: is P(C|B,A) = P(C|B)?
    • No, it does not (the equality holds): C's parent (B) is given, so C is independent of its non-descendants (here, A)
  • Independence is symmetric: C ⊥ A | B  =>  A ⊥ C | B
  [Diagram: chain A → B → C]
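
This can be checked numerically. Below is a sketch using a chain A → B → C with made-up CPTs; it verifies that conditioning on A does not change P(C | B):

```python
import itertools

# Hypothetical CPTs for the chain A -> B -> C (all variables binary).
pA = {0: 0.3, 1: 0.7}
pB_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # key: (b, a)
pC_given_B = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.25, (1, 1): 0.75} # key: (c, b)

joint = {(a, b, c): pA[a] * pB_given_A[(b, a)] * pC_given_B[(c, b)]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def p_c1(b, a=None):
    """P(C=1 | B=b), optionally also conditioning on A=a."""
    rows = [(k, v) for k, v in joint.items()
            if k[1] == b and (a is None or k[0] == a)]
    return sum(v for k, v in rows if k[2] == 1) / sum(v for _, v in rows)

print(p_c1(b=1), p_c1(b=1, a=0), p_c1(b=1, a=1))  # all 0.75: A is irrelevant given B
```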

  22. Learning in the Face of Noisy Data
  • Example: flip two independent coins
  • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
  [Diagram: Model 1 has X and Y unconnected; Model 2 adds an arc X → Y]

  23. Learning in the Face of Noisy Data
  • Example: flip two independent coins
  • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT
  • Parameters estimated via the empirical distribution ("intuitive fit"):
    • Model 1 (X, Y unconnected): P(X=H) = 9/20, P(Y=H) = 8/20
    • Model 2 (arc X → Y): P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

  24. Learning in the Face of Noisy Data
  • Same data and estimates as above:
    • Model 1: P(X=H) = 9/20, P(Y=H) = 8/20
    • Model 2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
  • Model 2's conditional estimates are each based on only part of the data (9 and 11 flips rather than all 20), so their errors are likely to be larger!
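
A sketch reproducing these estimates directly from the counts (nothing here beyond the slide's numbers):

```python
# The 20-flip dataset from the slide, as (X, Y) counts.
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
N = sum(counts.values())

x_h = counts[("H", "H")] + counts[("H", "T")]  # 9 flips with X = H
y_h = counts[("H", "H")] + counts[("T", "H")]  # 8 flips with Y = H

print("Model 1: P(X=H) =", x_h / N, " P(Y=H) =", y_h / N)   # 9/20, 8/20
print("Model 2: P(Y=H|X=H) =", counts[("H", "H")] / x_h,    # 3/9
      " P(Y=H|X=T) =", counts[("T", "H")] / (N - x_h))      # 5/11
```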

  25-26. Structure Learning: Fit vs. Complexity
  • Must trade off fit to the data vs. complexity of the model
  • Complex models:
    • More parameters to learn
    • More expressive
    • More data fragmentation = greater sensitivity to noise
  • Typical approaches explore multiple structures, optimizing the trade-off between fit and complexity
  • This requires ways of measuring "complexity" (e.g., number of edges, number of parameters) and "fit"

  27. Further Reading on Structure Learning
  • Structure learning with statistical independence testing
  • Score-based methods (e.g., the Bayesian Information Criterion), as sketched below
  • Bayesian methods with structure priors
  • Cross-validated model selection (more on this later)
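
As one concrete example of a score-based method, a BIC-style score combines a fit term with a complexity penalty. A minimal sketch (the function name and the log-likelihood values are illustrative, not from the lecture):

```python
import math

def bic_score(log_likelihood, num_params, num_examples):
    # Higher is better: reward fit, penalize complexity; the penalty grows
    # with the number of free parameters and (slowly) with the data size.
    return log_likelihood - 0.5 * num_params * math.log(num_examples)

# Hypothetical comparison: a richer model fits slightly better but pays
# a larger complexity penalty, so the simpler structure wins here.
print(bic_score(log_likelihood=-105.0, num_params=4, num_examples=200))
print(bic_score(log_likelihood=-100.0, num_params=15, num_examples=200))
```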

  28. Statistical Parameter Learning

  29. Learning Coin Flips
  • Observe that c out of N draws are cherries (data)
  • Let the unknown fraction of cherries be θ (hypothesis)
  • The probability of drawing a cherry is θ
  • Assumption: draws are independent and identically distributed (i.i.d.)

  30. Learning Coin Flips
  • The probability of drawing a cherry is θ
  • Assumption: draws are i.i.d.
  • Probability of drawing 2 cherries: θ·θ = θ^2
  • Probability of drawing 2 limes: (1-θ)^2
  • Probability of drawing 1 cherry and then 1 lime: θ(1-θ)

  31. Likelihood Function
  • Likelihood of the data d = {d1, …, dN} given θ:
    P(d|θ) = ∏j P(dj|θ)          (i.i.d. assumption)
           = θ^c (1-θ)^(N-c)     (gather the c cherry terms together, then the N-c lime terms)

  32-38. Maximum Likelihood
  • Likelihood of the data d = {d1, …, dN} given θ: P(d|θ) = θ^c (1-θ)^(N-c)
  [Figures, shown across these slides: plots of the likelihood function P(d|θ) as a function of θ for several datasets]
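
In place of those plots, here is a sketch that evaluates the likelihood on a grid and locates its peak, which lands on c/N (the counts are hypothetical):

```python
def likelihood(theta, c, N):
    # P(d | theta) = theta^c * (1 - theta)^(N - c)
    return theta ** c * (1 - theta) ** (N - c)

c, N = 3, 10  # hypothetical: 3 cherries in 10 draws
grid = [i / 1000 for i in range(1, 1000)]
peak = max(grid, key=lambda t: likelihood(t, c, N))
print(peak, "vs. c/N =", c / N)  # peak sits at 0.3
```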

  39. Maximum Likelihood
  • The peaks of the likelihood functions seem to hover around the fraction of cherries…
  • Their sharpness indicates some notion of certainty…

  40. Maximum Likelihood
  • P(d|θ) is the likelihood function
  • The quantity argmax_θ P(d|θ) is known as the maximum likelihood estimate (MLE)

  41-43. Maximum Likelihood
  • l(θ) = log P(d|θ)
         = log [ θ^c (1-θ)^(N-c) ]
         = log [ θ^c ] + log [ (1-θ)^(N-c) ]
         = c log θ + (N-c) log (1-θ)

  44. Maximum Likelihood
  • l(θ) = log P(d|θ) = c log θ + (N-c) log (1-θ)
  • Setting dl/dθ = 0 gives the maximum likelihood estimate

  45. Maximum Likelihood
  • dl/dθ = c/θ - (N-c)/(1-θ)
  • At the MLE, c/θ - (N-c)/(1-θ) = 0
    => θ = c/N
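
The same derivation can be checked symbolically; a sketch assuming SymPy is available (the slides do this calculus by hand):

```python
import sympy as sp

theta, c, N = sp.symbols("theta c N", positive=True)
loglik = c * sp.log(theta) + (N - c) * sp.log(1 - theta)

# Solve dl/dtheta = 0 for theta; the unique stationary point is c/N.
print(sp.solve(sp.diff(loglik, theta), theta))  # [c/N]
```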

  46. Other MLE Results
  • Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (a histogram)
  • Continuous Gaussian distributions:
    • Mean = the average of the data
    • Standard deviation = the standard deviation of the data
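
A sketch of both Gaussian estimates (the data are hypothetical; note the ML standard deviation divides by N, which is what `ddof=0` selects in NumPy):

```python
import numpy as np

data = np.array([2.1, 1.9, 2.4, 2.0, 1.6])  # hypothetical sample

mu_mle = data.mean()          # ML mean = sample average
sigma_mle = data.std(ddof=0)  # ML std divides by N (not the N-1 of the unbiased estimator)
print(mu_mle, sigma_mle)
```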

  47. An Alternative Approach: Bayesian Estimation
  • P(θ|d) = 1/Z P(d|θ) P(θ) is the posterior
    • The distribution over hypotheses given the data
  • P(d|θ) is the likelihood
  • P(θ) is the hypothesis prior
  [Diagram: θ as the parent of the observations d[1], d[2], …, d[M]]

  48-49. Assumption: Uniform Prior, Bernoulli Distribution
  • Assume P(θ) is uniform
  • Then P(θ|d) = 1/Z P(d|θ) = 1/Z θ^c (1-θ)^(N-c)
  • What is P(Y|D), the probability that the next draw Y is a cherry?
  [Diagram: θ as the parent of the observations d[1], …, d[M] and of the next draw Y]

  50. Assumption: Uniform Prior, Bernoulli Distribution
  • => Z = ∫0^1 θ^c (1-θ)^(N-c) dθ = c! (N-c)! / (N+1)!
  • => P(Y|d) = 1/Z · (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
  • Can think of this as a "correction" to the empirical estimate using "virtual counts"
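
A sketch comparing the two estimates across hypothetical datasets; the uniform-prior Bayesian estimate (c+1)/(N+2) pulls toward 1/2 on small samples and approaches the MLE c/N as data accumulates:

```python
for c, N in [(0, 2), (3, 10), (30, 100), (300, 1000)]:
    mle = c / N
    bayes = (c + 1) / (N + 2)  # "virtual counts": one extra cherry, one extra lime
    print(f"c={c:4d}  N={N:5d}  MLE={mle:.3f}  Bayes={bayes:.3f}")
```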
