
LING / C SC 439/539 Statistical Natural Language Processing





  1. LING / C SC 439/539 Statistical Natural Language Processing Lecture 11 part 2 2/18/2013

  2. Recommended Reading • Manning & Schutze Chapter 2, Mathematical Foundations • Bayesian networks • http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html • http://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf • http://www.autonlab.org/tutorials/bayesnet.html • http://en.wikipedia.org/wiki/Bayesian_network

  3. Outline • Probability theory • Some probability problems • Minimal encoding of probability distributions

  4. Probability topics • Random variables and sample spaces • Probability distribution • Frequentist probability estimation • Expected value • Joint probability • Conditional probability • Marginal probability • Independence • Conditional independence • Product rule • Chain rule • Bayes rule • Subjective probability

  5. 1. Discrete random variables • A discrete random variable takes on a range of values, or events • The set of possible events is the sample space, Ω • Example: rolling a die Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots} • The occurrence of a random variable taking on a particular value from the sample space is a trial

  6. 2. Probability distribution • A set of data can be described as a probability distribution over a set of events • Definition of a probability distribution: • We have a set of events x drawn from a finite sample space Ω • Probability of each event is between 0 and 1 • Sum of probabilities of all events is 1

  7. Example: Probability distribution • Suppose you have a die that is equally weighted on all sides. • Let X be the random variable for the outcome of a single roll. p(X=1 dot) = 1 / 6 p(X=2 dots) = 1 / 6 p(X=3 dots) = 1 / 6 p(X=4 dots) = 1 / 6 p(X=5 dots) = 1 / 6 p(X=6 dots) = 1 / 6
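The two conditions in the definition above are easy to check mechanically; a minimal Python sketch (the function name and tolerance are illustrative, not from the lecture):

    def is_distribution(p, tol=1e-9):
        """Check that each probability is in [0, 1] and that they sum to 1."""
        return (all(0.0 <= v <= 1.0 for v in p.values())
                and abs(sum(p.values()) - 1.0) < tol)

    fair_die = {x: 1/6 for x in range(1, 7)}   # the equally weighted die above
    print(is_distribution(fair_die))           # True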

  8. 3. Frequentist probability estimation • Suppose you have a die and you don’t know how it is weighted. • Let X be the random variable for the outcome of a roll. • Want to produce values for p̂(X), which is an estimate of the probability distribution of X. • Read as “p-hat” • Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.

  9. Example: roll a die; random variable X • Data: roll a die 60 times, record the frequency of each event • 1 dot 9 rolls • 2 dots 10 rolls • 3 dots 9 rolls • 4 dots 12 rolls • 5 dots 9 rolls • 6 dots 11 rolls

  10. Example: roll a die; random variable X • Maximum Likelihood Estimate: p̂(X=x) = count(x) / total_count_of_all_events • p̂( X = 1 dot) = 9 / 60 = 0.150 p̂( X = 2 dots) = 10 / 60 = 0.167 p̂( X = 3 dots) = 9 / 60 = 0.150 p̂( X = 4 dots) = 12 / 60 = 0.200 p̂( X = 5 dots) = 9 / 60 = 0.150 p̂( X = 6 dots) = 11 / 60 = 0.183 Sum = 60 / 60 = 1.0
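The MLE computation above is a one-liner over the observed counts; a minimal Python sketch using the slide's die-roll data:

    # Maximum Likelihood Estimation: relative frequency of each event.
    counts = {"1 dot": 9, "2 dots": 10, "3 dots": 9,
              "4 dots": 12, "5 dots": 9, "6 dots": 11}

    total = sum(counts.values())                       # 60 trials
    p_hat = {x: c / total for x, c in counts.items()}  # p̂(X=x) = count(x) / total

    print(p_hat["4 dots"])      # 0.2
    print(sum(p_hat.values()))  # 1.0 (up to float rounding)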

  11. Convergence of p̂(X) • Suppose we know that the die is equally weighted. • We observe that our values for p̂(X) are close to p(X), but not all exactly equal. • We would expect that as the number of trials increases, p̂(X) will get closer to p(X). • For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.
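The convergence claim is easy to see by simulation; a sketch with Python's random module (a simulated fair die, so the true value is p(X=x) = 1/6 ≈ 0.1667; the seed and sample sizes are arbitrary):

    import random
    from collections import Counter

    random.seed(0)
    for n in (60, 6_000, 600_000):
        rolls = Counter(random.randint(1, 6) for _ in range(n))
        print(n, rolls[6] / n)   # p̂(X=6 dots) approaches 1/6 as n grows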

  12. Simplify notation • People are often not precise, and write “p(X)” when they mean “p̂(X)” • We will do this also • Can also leave out the name of the random variable when it is understood • Example: p(X=4 dots) p(4 dots)

  13. 4. Expected value • Roll the die, get these results: p( X = roll 1) = 3 / 20 p( X = roll 2) = 2 / 20 p( X = roll 3) = 4 / 20 p( X = roll 4) = 2 / 20 p( X = roll 5) = 1 / 20 p( X = roll 6) = 8 / 20 • On average, if I roll the die, how many dots will there be? • Answer is not ( 1 + 2 + 3 + 4 + 5 + 6 ) / 6 = 3.5 • Need to consider the probability of each event

  14. Expected value of a random variable • The expected value of a random variable X is a weighted sum of the values of X. • i.e., for each event x in the sample space for the random variable X, multiply the probability of each event by the value of the event, and sum these • The expected value is not necessarily equal to one of the events in the sample space.

  15. Expected value: example • The expected value of a random variable X is a weighted sum of the values of X. • Example: the average number of dots that I rolled Suppose: p( X = roll 1) = 3 / 20 p( X = roll 2) = 2 / 20 p( X = roll 3) = 4 / 20 p( X = roll 4) = 2 / 20 p( X = roll 5) = 1 / 20 p( X = roll 6) = 8 / 20 • E[X] = (3/20)*1 + (2/20)*2 + (4/20)*3 + (2/20)*4 + (1/20)*5 + (8/20)*6 = 80/20 = 4.0
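The same weighted sum in Python, using the slide's distribution (note that E[X] = 4.0 is not the most probable outcome):

    # Expected value: probability-weighted sum of the outcomes.
    p = {1: 3/20, 2: 2/20, 3: 4/20, 4: 2/20, 5: 1/20, 6: 8/20}
    expected = sum(prob * x for x, prob in p.items())
    print(expected)   # 4.0 (up to float rounding)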

  16. 5. Joint prob.: multiple random variables • Complex data can be described as a combination of values of multiple random variables • Example: 2 random variables • COLOR ∈ { blue, red } • SHAPE ∈ { square, circle } • Frequency of events: • count(COLOR=blue, SHAPE=square) = 1 • count(COLOR=red, SHAPE=square) = 2 • count(COLOR=red, SHAPE=circle) = 3 • count(COLOR=blue, SHAPE=circle) = 2

  17. Probability dist. over events that are combinations of random variables p(COLOR=blue, SHAPE=square) = 1 / 8 p(COLOR=red, SHAPE=square) = 2 / 8 p(COLOR=red, SHAPE=circle) = 3 / 8 p(COLOR=blue, SHAPE=circle) = 2 / 8 Sum = 8 / 8 = 1.0 Joint probability distribution

  18. May omit name of random variableif it’s understood • Joint probability distribution p: • p( blue, square ) = 1 / 8 = .125 • p( red, square ) = 2 / 8 = .250 • p( red, circle ) = 3 / 8 = .375 • p( blue, circle ) = 2 / 8 = .250 • Sum = 8 / 8 = 1.0

  19. 6. Conditional probability • Example: • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • Read as “probability of pink given puppy” • Restrict attention to the 6 puppies: 4 of them are pink, so p(pink | puppy) = 4/6 = 2/3 • In conditional probability: • the probability calculation is restricted to a subset of events in the joint distribution • that subset is determined by the values of the random variables being conditioned on

  20. Conditional probability • Sample space for probability calculation is restricted to particular events in the joint distribution • p( SHAPE = square | COLOR = red ) = 2 / 5 • p( SHAPE = circle | COLOR = red ) = 3 / 5 • p( COLOR = blue | SHAPE = square ) = 1 / 3 • p( COLOR = red | SHAPE = square ) = 2 / 3 • p( COLOR = blue | SHAPE = circle ) = 2 / 5
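One way to compute these conditional probabilities from the raw joint counts, sketched in Python (the helper name p_cond is illustrative):

    # Joint counts from the COLOR/SHAPE example.
    counts = {("blue", "square"): 1, ("red", "square"): 2,
              ("red", "circle"): 3, ("blue", "circle"): 2}

    def p_cond(shape, color):
        """p(SHAPE=shape | COLOR=color): restrict the counts to that color."""
        color_total = sum(c for (col, _), c in counts.items() if col == color)
        return counts.get((color, shape), 0) / color_total

    print(p_cond("square", "red"))   # 2/5 = 0.4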

  21. Compare to unconditional probability • Unconditional probability: sample space for probability calculation is unrestricted • p( SHAPE = square ) = 3 / 8 • = p( SHAPE = square | COLOR=blue or COLOR=red) = 3 / 8 • p( SHAPE = circle ) = 5 / 8 • p( COLOR = blue ) = 3 / 8 • p( COLOR = red ) = 5 / 8

  22. 7. Marginal (unconditional) probability • Probability for a subset of the random variable(s), ignoring other random variable(s) • If you know only the joint distribution, you can calculate the marginal probability of a random variable • Sum over values of all other random variables: p(X = x) = Σy p(X = x, Y = y)

  23. Marginal probability: example • p(COLOR=blue) = ? • Calculate by counting blue objects: 3/8 • Calculate through marginal probability: p(COLOR=blue) = p(COLOR=blue,SHAPE=circle) + p(COLOR=blue,SHAPE=square) = 2/8 + 1/8 = 3/8
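The same marginalization in Python, summing the joint distribution over SHAPE (a minimal sketch on the slide's numbers):

    # Marginal probability: sum the joint over the other random variable.
    joint = {("blue", "square"): 1/8, ("red", "square"): 2/8,
             ("red", "circle"): 3/8, ("blue", "circle"): 2/8}

    p_color = {}
    for (color, shape), prob in joint.items():
        p_color[color] = p_color.get(color, 0.0) + prob

    print(p_color["blue"])   # 1/8 + 2/8 = 0.375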

  24. Why it’s called “marginal probability”: margins of the joint prob. table • Sum probs. in each row and column to get marginal probs:

                  square   circle   p(COLOR)
    blue          1/8      2/8      3/8
    red           2/8      3/8      5/8
    p(SHAPE)      3/8      5/8      8/8 = 1 (total probability of p(COLOR, SHAPE))

  25. Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For numerator, only consider events that occur in both A and B • [Venn diagram: sets A and B, overlapping in A&B]

  26. 8. Independence • Two random variables A and B are independent if p(A, B) = p(A) * p(B) • i.e., if the joint probability equals the product of the marginal probabilities • “Independent”: a random variable has no effect on the distribution of another random variable

  27. Independence: example • Flip a fair coin: p(heads) = .5, p(tails) = .5 • Flip the coin twice. • Let X be the random variable for the 1st flip. • Let Y be the random variable for the 2nd flip. • The two flips don’t influence each other, so you would expect that p(X, Y) = p(X) * p(Y) • p(X=heads, Y=tails) = p(X=heads) * p(Y=tails) = .5*.5 = .25
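A quick simulation sketch of the two-flip example (the sample size and seed are arbitrary): for independent flips, the estimated joint probability matches the product of the estimated marginals, up to sampling noise.

    import random
    random.seed(1)

    n = 100_000
    flips = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]

    p_x_heads = sum(x for x, _ in flips) / n       # p̂(X=heads)
    p_y_tails = sum(not y for _, y in flips) / n   # p̂(Y=tails)
    p_joint   = sum(x and not y for x, y in flips) / n

    print(p_joint, p_x_heads * p_y_tails)   # both ≈ 0.25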

  28. Non-independence: example • Suppose a class has a midterm and a final, and the final is cumulative. No one drops out of the class. • Midterm: 200 pass, 130 fail • Final: 180 pass, 150 fail • Contingency table shows marginal total counts:

                  pass    fail    total
    MIDTERM       200     130     330
    FINAL         180     150     330

• Rate of failure increases over time

  29. p(MIDTERM, FINAL) • This table shows values for joint probability • Divide each cell’s count by total count of 330 • Margins show marginal probabilities • Example: p(MIDTERM=fail) = 130 / 330 = 0.394

  30. p(MIDTERM) * p(FINAL) • Suppose MIDTERM and FINAL are independent. • Then p(MIDTERM, FINAL) = p(MIDTERM) * p(FINAL) • Expected probabilities assuming independence: For each cell, p(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) Example: p(MIDTERM=fail, FINAL=pass) = p(MIDTERM=fail) * p(FINAL=pass) = .394 * .545 = .215

  31. MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • [Tables: observed joint probability vs. joint probability under independence]

  32. Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(A|B) = p(B, A) / p(B) • Probability of events of A, restricted to events of B • For numerator, only consider events that occur in both A and B • [Venn diagram: sets A and B, overlapping in A&B]

  33. Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For numerator, only consider events that occur in both A and B • [Venn diagram: sets A and B, overlapping in A&B]

  34. 9. Conditional independence • A and B are conditionally independent given C if p(A, B | C) = p(A|C) * p(B|C) • In the subset of the data specified by C, A and B are independent • Does not necessarily mean that A and B are independent

  35. Conditional independence: example • 3 random variables: • COLOR ∈ {red, blue} • SHAPE ∈ {circle, square} • KITTY ∈ {True, False} • COLOR and SHAPE are not independent. • For example, p(blue, circle) = 2/8 • but p(blue)*p(circle) = 4/8 * 5/8 = 20/64 = 2.5/8

  36. Conditional independence: example • COLOR and SHAPE are conditionally ind. given KITTY=TRUE: • p(COLOR, SHAPE|K=TRUE) = p(COLOR|K=T)*p(SHAPE|K=T) • p(red|K=T) = 2/4 = 1/2, p(blue|K=T) = 2/4 = 1/2 • p(circle|K=T) = 2/4 = 1/2, p(square|K=T) = 2/4 = 1/2 • p(red, circle|K=T) = 1/4 p(red|K=T)*p(circle|K=T) = 1/2 * 1/2 = 1/4 • p(red, square|K=T) = 1/4 p(red|K=T)*p(square|K=T) = 1/2 * 1/2 = 1/4 • p(blue, circle|K=T) = 1/4 p(blue|K=T)*p(circle|K=T) = 1/2 * 1/2 = 1/4 • p(blue, square|K=T) = 1/4 p(blue|K=T)*p(square|K=T) = 1/2 * 1/2 = 1/4
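The factorization check on the slide's numbers can be done mechanically; a small Python sketch:

    # Conditional distributions given KITTY=True, from the slide.
    p_color = {"red": 1/2, "blue": 1/2}        # p(COLOR | K=T)
    p_shape = {"circle": 1/2, "square": 1/2}   # p(SHAPE | K=T)
    p_joint = {(c, s): 1/4 for c in p_color for s in p_shape}  # p(COLOR, SHAPE | K=T)

    # Conditional independence: the joint factorizes into the two conditionals.
    ok = all(abs(p_joint[c, s] - p_color[c] * p_shape[s]) < 1e-9
             for (c, s) in p_joint)
    print(ok)   # True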

  37. 10. Product rule • Conditional probability: P(B | A) = P(A, B) / P(A) • Product rule: P(A) * P(B | A) = P(A, B) • Generates joint probability from an unconditional probability and a conditional probability • [Venn diagram: sets A and B, overlapping in A&B]

  38. Product rule, conditional probability, and independence • Product rule: P(A) * P(B | A) = P(A, B) • Suppose A and B are independent: P(A) * P(B) = P(A, B) • Then p(B | A) = p(B) • Explanation: B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events also in B does not change from the unrestricted sample space.

  39. Conditional probability and independence • B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events in B does not change. • Example: • p(COLOR=blue) = 3/9 = 1/3 • P(COLOR=blue|SHAPE=square) = 1/3 • P(COLOR=blue|SHAPE=circle) = 1/3 • p(COLOR=red) = 6/9 = 2/3 • P(COLOR=red|SHAPE=square) = 2/3 • P(COLOR=red|SHAPE=circle) = 2/3 • Therefore p(COLOR) = p(COLOR|SHAPE)

  40. 11. Chain rule • Product rule: P(A) * P(B | A) = P(A, B) • Chain rule: generalization of the product rule to N random variables • p(X1, …, Xn) = p(X1, ..., Xn-1) * p(Xn | X1, ..., Xn-1) • Applying this recursively: p(X1, …, Xn) = p(X1) * p(X2 | X1) * … * p(Xn | X1, ..., Xn-1) • Example: N = 3 • p(A, B, C) = p(A, B) * p(C | A, B) = p(A) * p(B | A) * p(C | A, B)
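Because the chain rule is an identity, it can be verified numerically on any joint distribution; a sketch over a made-up distribution on three binary variables (the weights 1..8 are arbitrary):

    import itertools

    # An arbitrary joint distribution over (A, B, C).
    outcomes = list(itertools.product([0, 1], repeat=3))
    total = sum(range(1, 9))   # 36
    joint = {k: (i + 1) / total for i, k in enumerate(outcomes)}

    def prob(pred):
        return sum(v for k, v in joint.items() if pred(k))

    a, b, c = 1, 0, 1
    p_a          = prob(lambda k: k[0] == a)
    p_b_given_a  = prob(lambda k: k[:2] == (a, b)) / p_a
    p_c_given_ab = joint[a, b, c] / prob(lambda k: k[:2] == (a, b))

    # p(a, b, c) = p(a) * p(b|a) * p(c|a,b)
    print(joint[a, b, c], p_a * p_b_given_a * p_c_given_ab)   # both 1/6 ≈ 0.1667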

  41. 12. Bayes rule • Thomas Bayes 1702 - 1761

  42. Inconsistent terminology • Bayes’ theorem • Bayes theorem • Bayes’s theorem • Bayes’ rule • Bayes rule ← preferable? • Bayes’s rule • Baye’s theorem • Baye’s rule • Bayesian theorem • Bayesian rule

  43. Bayes Rule • One conditional probability can be obtained from the other • Product rule: • p(B)*p(A|B) = p(A, B) • p(A)*p(B|A) = p(A, B) • Calculate p(A|B) from p(B|A), p(A), and p(B): • p(A|B) = p(A)*p(B|A) / p(B)

  44. Product rule: • p(B)*p(A|B) = p(A, B) • p(A)*p(B|A) = p(A, B) • Calculate p(A|B) from p(B|A), p(A), and p(B): p(B)*p(A|B) = p(A, B) p(A|B) = p(A, B) / p(B) p(A|B) = p(A)*p(B|A) / p(B)
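The rearrangement is a one-line function; checking it against the COLOR/SHAPE numbers from the earlier slides (p(square|red) = 2/5, p(red) = 5/8, p(square) = 3/8):

    def bayes(p_b_given_a, p_a, p_b):
        """Bayes rule: p(A|B) = p(A) * p(B|A) / p(B)."""
        return p_a * p_b_given_a / p_b

    # p(COLOR=red | SHAPE=square) from p(SHAPE=square | COLOR=red):
    print(bayes(p_b_given_a=2/5, p_a=5/8, p_b=3/8))   # 2/3 ≈ 0.667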

  45. 13. Subjective probability • Two schools of thought in the interpretation of probability • 1. Frequentist interpretation • Probability is the chance of occurrence of an event • Probability is estimated from measurements • 2. Bayesian, or subjective interpretation • Probability is one’s degree of belief about an event • Probability estimation involves both measurements, and numerical estimates of your beliefs about data

  46. Bayesian interpretation of conditional probability as additional evidence • Unconditional probability: p(A) • Belief about an event, without any additional information • Conditional probability: p(A|B) • Belief about the event, modified by additional knowledge of the value of B

  47. Example: belief in COLOR changes when you know SHAPE • Unconditional belief of COLOR (no knowledge of value of SHAPE) • P(COLOR=blue) = P(COLOR=blue | SHAPE=circle or SHAPE=square) = .375 • P(COLOR=red) = P(COLOR=red | SHAPE=circle or SHAPE=square) = .625 • Knowledge of SHAPE changes belief in COLOR • P(COLOR=blue | SHAPE=square ) = .333 (decreases from unconditional prob.) • P(COLOR=red | SHAPE=square ) = .667 (increases from unconditional prob.)

  48. Prior, posterior, and likelihood • Bayes rule: p(A|B) = p(B|A) * p(A) / p(B) • Prior probability: p( A ) • Belief about A, without any additional evidence • Example: p( rain ) = .2 • Posterior probability: p( A | B ) • Probabilities of events change with new evidence • Example: p ( rain | hurricane ) = .999 • Likelihood: p( B | A ) • How likely is B in the first place, given A ? • Example: p( hurricane | rain ) = .000001
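Bayes rule ties the three quantities together through the evidence term; a sketch that solves for it using the slide's illustrative rain/hurricane numbers:

    # The slide's illustrative values.
    prior      = 0.2        # p(rain)
    posterior  = 0.999      # p(rain | hurricane)
    likelihood = 0.000001   # p(hurricane | rain)

    # Bayes rule: posterior = likelihood * prior / p(hurricane),
    # so p(hurricane) = likelihood * prior / posterior.
    p_hurricane = likelihood * prior / posterior
    print(p_hurricane)   # ≈ 2.0e-07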

  49. Outline • Probability theory • Some probability problems • Minimal encoding of probability distributions

  50. #1. Sample space, joint and conditional probability • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • I have two children. What is the probability that both are girls? • I have two children. At least one of them is a girl. What is the probability that both are girls?
