CS b553: Algorithms for Optimization and Learning


  1. CS b553: Algorithms for Optimization and Learning Parameter Learning (From data to distributions)

  2. Agenda • Learning probability distributions from example data • Generative vs. discriminative models • Maximum likelihood estimation (MLE) • Bayesian estimation

  3. Motivation • Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net • Next few lectures: where does the Bayes net come from? • Setting for this lecture: • Given a set of examples drawn from a distribution • Each example is complete (fully observable) • BN structure is known, but the CPTs are unknown

  4. Density Estimation • Given dataset D={d[1],…,d[M]} drawn from underlying distribution P* • Find a distribution that matches P* as closely as possible • High-level issues: • Usually there is not enough data to get an accurate picture of P*, which forces us to approximate • Even if we did have P*, how do we measure closeness? • How do we maximize closeness? • Two approaches: cast learning as • an optimization problem, or • a Bayesian inference problem

  5. Kullback-Leibler Divergence • Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is given by: D(P || Q) = Σx P(x) log [ P(x) / Q(x) ] • Properties: • D(P || Q) ≥ 0, with equality iff P=Q “almost everywhere” • Not a true “metric” – non-symmetric
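The definition above can be checked numerically. A minimal Python sketch of discrete KL divergence (illustrative code, not part of the lecture):

```python
import math

def kl_divergence(p, q):
    # D(P || Q) = sum_x P(x) * log(P(x) / Q(x)); terms with P(x)=0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, p))  # 0.0: divergence of a distribution from itself
print(kl_divergence(p, q))  # positive, since P != Q
```

Note the asymmetry: D(P || Q) and D(Q || P) generally differ, which is why KL divergence is not a true metric.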

  6. Applying KL Divergence to Learning • Approach: given underlying distribution P*, find P (within a class of distributions) so that the KL divergence D(P* || P) is minimized • If we approximate the expectation under P* with the draws in D, we get D(P* || P) ≈ const − (1/M) Σm log P(d[m]) • Minimizing KL divergence to the empirical distribution is the same as maximizing the empirical log-likelihood

  7. Another Approach: Discriminative Learning • Do we really want to model P*? We may be more concerned with predicting the values of some subset of variables • E.g., for a Bayes net CPT, we want P(Y | PaY) but may not care about the distribution of PaY • Generative model: estimate P(X,Y) • Discriminative model: estimate P(Y|X), ignore P(X)

  8. Training Discriminative Models • Define a loss function l(y,x,P) that is given the ground truth (y,x) • Measures the difference between the prediction P(Y|x) and the ground truth • Examples: • Classification error I[y ≠ argmaxy’ P(y’|x)] • Conditional log-likelihood −log P(y|x) • Strategy: minimize empirical loss
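The two example losses can be sketched as follows; the model, inputs, and labels here are hypothetical placeholders, not from the slides:

```python
import math

# Hypothetical conditional model P(Y | x): each input maps to a
# distribution over labels (all names and numbers are illustrative).
model = {
    "x1": {"spam": 0.8, "ham": 0.2},
    "x2": {"spam": 0.3, "ham": 0.7},
}
data = [("x1", "spam"), ("x2", "spam")]  # (x, ground-truth y) pairs

def zero_one_loss(y, x):
    # Classification error I[y != argmax_y' P(y' | x)]
    prediction = max(model[x], key=model[x].get)
    return 0 if prediction == y else 1

def neg_log_likelihood(y, x):
    # Conditional log-likelihood loss: -log P(y | x)
    return -math.log(model[x][y])

empirical_01 = sum(zero_one_loss(y, x) for x, y in data) / len(data)
empirical_nll = sum(neg_log_likelihood(y, x) for x, y in data) / len(data)
print(empirical_01)   # 0.5: one of the two examples is misclassified
print(empirical_nll)  # average of -log 0.8 and -log 0.3
```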

  9. Discriminative Vs Generative • Discriminative models: • Don’t model the input distribution, so may have more expressive power for the same level of complexity • May learn more accurate predictive models for same sized training dataset • Directly transcribe top-down evaluation of CPTs • Generative models: • More flexible, because they don’t require a priori selection of the dependent variable Y • Bottom-up inference is easier • Both useful in different situations

  10. What class of Probability Models? • For small discrete distributions, just use a tabular representation • Very efficient learning techniques • For large discrete distributions or continuous ones, the choice of probability model is crucial • Increasing complexity => • Can represent complex distributions more accurately • Need more data to learn well (risk of overfitting) • More expensive to learn and to perform inference

  11. Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Probability of drawing a cherry is q • Suppose draws are independent and identically distributed (i.i.d.) • Observe that c out of N draws are cherries (data)

  12. Learning Coin Flips • Let the unknown fraction of cherries be q (hypothesis) • Intuition: c/N might be a good hypothesis • (or it might not, depending on the draw!)

  13. Maximum Likelihood • Likelihood of data d={d1,…,dN} given q • P(d|q) = ∏j P(dj|q) = q^c (1−q)^(N−c) • By the i.i.d. assumption: gather the c cherry terms together, then the N−c lime terms

  21. Maximum Likelihood • Peaks of the likelihood function hover around the observed fraction of cherries… • Sharpness of the peak indicates some notion of certainty…

  22. Maximum Likelihood • Let P(d|q) be the likelihood function • The quantity argmaxq P(d|q) is known as the maximum likelihood estimate (MLE)

  23. Maximum Likelihood • l(q) = log P(d|q) = log [ q^c (1−q)^(N−c) ]

  24. Maximum Likelihood • l(q) = log P(d|q) = log [ q^c (1−q)^(N−c) ] = log [ q^c ] + log [ (1−q)^(N−c) ]

  25. Maximum Likelihood • l(q) = log P(d|q) = log [ q^c (1−q)^(N−c) ] = log [ q^c ] + log [ (1−q)^(N−c) ] = c log q + (N−c) log (1−q)

  26. Maximum Likelihood • l(q) = log P(d|q) = c log q + (N−c) log (1−q) • Setting dl/dq = 0 gives the maximum likelihood estimate

  27. Maximum Likelihood • dl/dq = c/q − (N−c)/(1−q) • At the MLE, c/q − (N−c)/(1−q) = 0 ⇒ q = c/N • c and N are known as sufficient statistics for the parameter q – no other quantities computed from the data give additional information about q
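The closed-form result q = c/N can be sanity-checked numerically, with illustrative counts (a sketch, not lecture code):

```python
def likelihood(q, c, N):
    # Likelihood of c cherries in N i.i.d. draws: q^c (1-q)^(N-c)
    return q**c * (1 - q) ** (N - c)

c, N = 7, 10
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of q in (0, 1)
q_star = max(grid, key=lambda q: likelihood(q, c, N))
print(q_star)  # 0.7, i.e. c/N
```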

  28. Other MLE results • Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram) • Continuous Gaussian distributions: • Mean = average of the data • Standard deviation = standard deviation of the data
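For the Gaussian case, a sketch of the MLE formulas on made-up data (note the MLE variance divides by N, not N−1):

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # illustrative observations
n = len(data)
mean = sum(data) / n
# MLE of the variance uses 1/N (the biased estimator), not 1/(N-1)
var = sum((x - mean) ** 2 for x in data) / n
std = math.sqrt(var)
print(mean, std)
```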

  29. Maximum Likelihood for BN • For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matching parent values • Example (figure: network with Earthquake and Burglar as parents of Alarm), N=1000 samples: E: 500 ⇒ P(E) = 0.5; B: 200 ⇒ P(B) = 0.2; Alarm counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|¬E,¬B: 1/380
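A sketch of CPT fitting by counting, on a tiny made-up dataset (not the alarm-network counts from the slide):

```python
# Toy complete dataset over (E, B, A); each row is one fully observed sample.
data = [
    (True, True, True), (True, False, True), (True, False, False),
    (False, True, True), (False, False, False), (False, False, False),
]

def mle_cpt_entry(data, e, b):
    # ML estimate of P(A=true | E=e, B=b): fraction of matching rows with A=true
    rows = [a for (ei, bi, a) in data if ei == e and bi == b]
    return sum(rows) / len(rows) if rows else None  # None if no matching data

print(mle_cpt_entry(data, True, True))    # 1 of 1 matching rows has A=true
print(mle_cpt_entry(data, False, False))  # 0 of 2 matching rows has A=true
```

The `None` branch is exactly the data-fragmentation problem from a later slide: some parent assignments may have no matching samples at all.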

  30. Proof • Let the BN have structure G over variables X1,…,Xn and parameters q • Given dataset D • L(q; D) = ∏m PG(d[m]; q)

  31. Proof • Let the BN have structure G over variables X1,…,Xn and parameters q • Given dataset D • L(q; D) = ∏m PG(d[m]; q) = ∏m ∏i PG(xi[m] | paXi[m]; q)

  32. Fitting CPTs • Each ML entry P(xi | paXi) is given by examining the counts of (xi, paXi) in D and normalizing across the rows of the CPT • Note that for large k = |PaXi| (the number of parents), very few datapoints will share any given parent assignment paXi! • With binary parents, the expected count per assignment is O(|D|/2^k), and some assignments may be even rarer • Large domains |Val(Xi)| can also be a problem • This is known as data fragmentation

  33. Proof • Let the BN have structure G over variables X1,…,Xn and parameters q • Given dataset D • L(q; D) = ∏m PG(d[m]; q) = ∏m ∏i PG(xi[m] | paXi[m]; q) = ∏i [ ∏m PG(xi[m] | paXi[m]; q) ] • ∏m PG(xi[m] | paXi[m]; q) is the likelihood of the local CPT of Xi: L(qXi; D) • Each CPT depends on a disjoint set of parameters qXi • ⇒ maximizing L(q; D) over all parameters q is equivalent to maximizing L(qXi; D) over each individual qXi

  34. An Alternative Approach: Bayesian Estimation • P(q|d) = 1/Z P(d|q) P(q) is the posterior • Distribution of hypotheses given the data • P(d|q) is the likelihood • P(q) is the hypothesis prior • (Figure: plate model with parameter q and observations d[1], d[2], …, d[M])

  35. Assumption: Uniform Prior, Bernoulli Distribution • Assume P(q) is uniform • P(q|d) = 1/Z P(d|q) = 1/Z q^c (1−q)^(N−c) • What’s P(Y|D), where Y is the next draw? • (Figure: plate model with parameter q, observations d[1], …, d[M], and query variable Y)

  37. Assumption: Uniform Prior, Bernoulli Distribution • ⇒ Z = c! (N−c)! / (N+1)! • ⇒ P(Y) = 1/Z · (c+1)! (N−c)! / (N+2)! = (c+1) / (N+2) • Can think of this as a “correction” using “virtual counts”
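The (c+1)/(N+2) result (Laplace's rule of succession) can be checked against the factorial form of Z, with illustrative counts:

```python
from math import factorial

def laplace_prediction(c, N):
    # P(next draw is a cherry) under a uniform prior: (c+1)/(N+2)
    return (c + 1) / (N + 2)

def prediction_via_integrals(c, N):
    # Same quantity via the factorial identities: Z = c!(N-c)!/(N+1)!
    Z = factorial(c) * factorial(N - c) / factorial(N + 1)
    return (factorial(c + 1) * factorial(N - c) / factorial(N + 2)) / Z

print(laplace_prediction(3, 10))        # 4/12
print(prediction_via_integrals(3, 10))  # same value, 1/3
```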

  38. Nonuniform Priors • P(q|d) ∝ P(d|q) P(q) = q^c (1−q)^(N−c) P(q) • The prior P(q) defines, for each q, the degree of belief in q • (Figure: a prior density P(q) over q ∈ [0, 1])

  39. Beta Distribution • Beta_a,b(q) = γ q^(a−1) (1−q)^(b−1) • a, b are hyperparameters > 0 • γ is a normalization constant • a = b = 1 is the uniform distribution

  40. Posterior with Beta Prior • Posterior ∝ q^c (1−q)^(N−c) P(q) = γ q^(c+a−1) (1−q)^(N−c+b−1) = Beta_(a+c),(b+N−c)(q) • Prediction = mean: E[q] = (c+a)/(N+a+b)
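A sketch of the Beta-posterior prediction (c+a)/(N+a+b), showing how the prior's effect fades with more data (all counts are illustrative):

```python
def beta_posterior_mean(a, b, c, N):
    # Mean of Beta(a+c, b+N-c): the Bayesian prediction (c+a)/(N+a+b)
    return (c + a) / (N + a + b)

# Uniform prior (a=b=1) reproduces the Laplace correction (c+1)/(N+2)
print(beta_posterior_mean(1, 1, 3, 10))        # 4/12
# A strong prior pulls the estimate toward a/(a+b) = 0.5
print(beta_posterior_mean(50, 50, 3, 10))      # 53/110
# With much more data, the estimate approaches the empirical fraction 0.3
print(beta_posterior_mean(50, 50, 300, 1000))  # 350/1100
```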

  41. Posterior with Beta Prior • What does this mean? • The prior specifies a “virtual count” of a−1 heads and b−1 tails • See heads ⇒ increment a • See tails ⇒ increment b • The effect of the prior diminishes with more data

  42. Choosing a Prior • Part of the design process; must be chosen according to your intuition • Uninformed belief ⇒ a = b = 1; strong belief ⇒ a, b high

  43. Extensions of Beta Priors • Parameters of categorical distributions: Dirichlet prior • The mathematical expression is more complex, but in practice it still takes the form of “virtual counts” • Mean and standard deviation of Gaussian distributions: Gamma prior • Conjugate priors preserve the representation of prior and posterior distributions, but do not necessarily exist for general distributions

  44. Dirichlet Prior • Categorical variable with |Val(X)| = k and P(X=i) = qi • Parameter space q1,…,qk with qi ≥ 0, Σi qi = 1 • Maximum likelihood estimate given counts c1,…,ck in the data D: qi^ML = ci/N • Dirichlet prior Dirichlet(a1,…,ak)(q) ∝ ∏i qi^(ai−1) • Mean is (a1/aT,…,ak/aT) with aT = Σi ai • Posterior P(q|D) is Dirichlet(a1+c1,…,ak+ck)
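A sketch of the Dirichlet update: the posterior mean adds the virtual counts a_i to the observed counts c_i (numbers are illustrative):

```python
def dirichlet_posterior_mean(alphas, counts):
    # Posterior is Dirichlet(a_1+c_1, ..., a_k+c_k);
    # its mean is (a_i + c_i) / sum_j (a_j + c_j)
    post = [a + c for a, c in zip(alphas, counts)]
    total = sum(post)
    return [p / total for p in post]

counts = [5, 3, 2]               # observed counts c_1..c_k, N = 10
print([c / 10 for c in counts])  # ML estimates c_i / N
print(dirichlet_posterior_mean([1, 1, 1], counts))  # smoothed: (c_i+1)/(N+k)
```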

  45. Recap • Learning => optimization problem (ML) • Learning => inference problem (Bayesian estimation) • Learning parameters of Bayesian networks • Conjugate priors
