807 TEXT ANALYTICS

Presentation Transcript


  1. 807 TEXT ANALYTICS Massimo Poesio. Lab: Probability & Statistics Reminder

  2. STATISTICAL METHODS IN NLP • In the next series of lectures we will often discuss methods to learn how to solve problems from data (MACHINE LEARNING) • We will look at two main forms of learning: • Learning RULES of INTERPRETATION • Learning KNOWLEDGE • Most methods of this type presuppose some knowledge of probability and statistics

  3. WHY PROBABILISTIC APPROACHES • Suppose you are trying to model the process by which, in “John can drive you home”, “can” is interpreted as an auxiliary, whereas in “John kicked the can” it is interpreted as a noun • The idea is that in each of these cases, one interpretation is more likely than the others • PROBABILITY THEORY was developed to formalize the notion of LIKELIHOOD • STATISTICS was developed to explain how previously observed data can be used to draw conclusions about the probability of an event

  4. OUTLINE OF TUTORIAL • Intro to probability • Intro to statistics • For many problems of interest we do not know the distribution • Also need a few key notions from INFORMATION THEORY • We’ll discuss some of the key ones when talking about decision trees (next week)

  5. TRIALS (or EXPERIMENTS) • A trial is anything that may have a certain OUTCOME (on which you can make a bet, say) • Classic examples: • Throwing a die (outcomes: 1, 2, 3, 4, 5, 6) • A horse race (outcomes?) • In NLE: • Looking at the next word in a text • Having your NL system perform a certain task

  6. (ELEMENTARY) OUTCOMES • The results of an experiment: • In a coin toss, HEADS or TAILS • In a race, the names of the horses involved • Or, if we are only interested in whether a particular horse wins: WIN and LOSE • In NLE: • When looking at the next word: the possible words • In the case of a system: RIGHT or WRONG

  7. EVENTS • Often, we want to talk about the likelihood of getting one of several outcomes: • E.g., with dice, the likelihood of getting an even number, or a number greater than 3 • An EVENT is a set of possible OUTCOMES (possibly just a single elementary outcome): • E1 = {4} • E2 = {2,4,6} • E3 = {3,4,5,6}

  8. SAMPLE SPACES • The SAMPLE SPACE is the set of all possible outcomes: • For the case of a die, sample space S = {1,2,3,4,5,6} • For the case of a coin toss, sample space S = {H,T} • For the texting case: • Texting a word is a TRIAL, • The word texted is an OUTCOME, • EVENTS which result from this trial are: texting the word “minute”, texting a word that begins with “minu”, etc. • The set of all possible words is the SAMPLE SPACE • (NB: the sample space may be very large, or even infinite)

  9. PROBABILITY FUNCTIONS • The likelihood of an event is indicated using a PROBABILITY FUNCTION P • The probability of an event E is specified by a function P(E), with values between 0 and 1 • P(E) = 1: the event is CERTAIN to occur • P(E) = 0: the event is certain NOT to occur • Example: in the case of rolling a die, • P(E’ = ‘getting as a result a number between 1 and 6’) = P({1,2,3,4,5,6}) = 1 • P(E’’ = ‘getting as a result 7’) = 0 • The sum of the probabilities of all elementary outcomes = 1

  10. EXERCISES: ANALYTIC PROBABILITIES • When we know the entire sample space, and we can assume that all outcomes are equally likely, we can compute the probability of events such as • P(1) • P(EVEN) • P(>3)
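A minimal Python sketch (not part of the original slides) of how such analytic probabilities can be computed by counting outcomes in the sample space of a fair die; the prob helper is our own:

```python
from fractions import Fraction

# Analytic probabilities for a fair six-sided die: every outcome in the
# sample space {1,...,6} is equally likely, so P(E) = |E| / |S|.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform distribution."""
    return Fraction(len(event & sample_space), len(sample_space))

print(prob({1}))                                      # P(1)    = 1/6
print(prob({x for x in sample_space if x % 2 == 0}))  # P(EVEN) = 1/2
print(prob({x for x in sample_space if x > 3}))       # P(>3)   = 1/2
```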

  11. PROBABILITIES AND RELATIVE FREQUENCIES • In the case of a die, we know all of the possible outcomes ahead of time, and we also know a priori what the likelihood of a certain outcome is. But in many other situations in which we would like to estimate the likelihood of an event, this is not the case. • For example, suppose that we would like to bet on horses rather than on dice. Harry is a race horse: we do not know ahead of time how likely it is for Harry to win. The best we can do is to ESTIMATE P(WIN) using the RELATIVE FREQUENCY of the outcome ‘Harry wins’ • Suppose Harry raced 100 times, and won 20 races overall. Then • P(WIN) = NUMBER OF WINS / TOTAL NUMBER OF RACES = 20/100 = .2 • P(LOSE) = .8 • The use of probabilities we are interested in (estimating the probability of certain sequences of words) is of this type
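A small sketch of the relative-frequency estimate, using only the counts given on the slide (100 races, 20 wins); the variable names are ours:

```python
# Relative-frequency estimate of P(WIN) and P(LOSE) for Harry,
# using the counts given on the slide: 100 races, 20 wins.
races = 100
wins = 20

p_win = wins / races             # 20/100 = 0.2
p_lose = (races - wins) / races  # 80/100 = 0.8
print(p_win, p_lose)
```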

  12. LOADED DICE • The assumption that all outcomes have equal probability is very strong • In most real situations (and with most real dice) probabilities of the outcomes are slightly different • P(1) = 1/4, P(2) = .15, P(3) = .15, P(4) = .15, P(5) = .15, P(6) = .15

  13. JOINT PROBABILITIES • We are often interested in the probability of TWO events happening: • When throwing a die TWICE, the probability of getting a 6 both times • The probability of finding a sequence of two words: ‘the’ and ‘car’ • We use the notation A&B to indicate the conjunction of two events, and P(A&B) to indicate the probability of such a conjunction • Because events are SETS, this probability is often also written as P(A ∩ B) • We use the same notation with WORDS: P(‘the’ & ‘car’)

  14. JOINT PROBABILITIES: TOSSING A DIE TWICE • Sample space = { <1,1>, <1,2>, <1,3>, <1,4>, <1,5>, <1,6>, <2,1>, …, <6,1>, <6,2>, <6,3>, …, <6,6> } (36 outcomes in total)

  15. EXERCISES: PROBABILITY OF TWO EVENTS • P(first toss=1 & second toss=3) • P(first toss=even & second toss=even)
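A sketch (ours, not from the slides) that enumerates the 36 equally likely outcomes of two tosses and counts those belonging to each event:

```python
from fractions import Fraction
from itertools import product

# The sample space of tossing a fair die twice: 36 equally likely pairs.
space = list(product(range(1, 7), repeat=2))

def prob(pred):
    """Probability of the event defined by a predicate over (first, second)."""
    return Fraction(sum(1 for o in space if pred(o)), len(space))

print(prob(lambda o: o == (1, 3)))                      # P(first=1 & second=3) = 1/36
print(prob(lambda o: o[0] % 2 == 0 and o[1] % 2 == 0))  # P(both tosses even)   = 1/4
```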

  16. OTHER COMBINATIONS OF EVENTS • A ∪ B: either event A or event B happens • P(A ∪ B) = P(A) + P(B) − P(A ∩ B) • NB: if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B) • ¬A: event A does not happen • P(¬A) = 1 − P(A)

  17. EXERCISES: ADDITION RULE • P(A ∪ B) = P(A) + P(B) − P(A ∩ B) • P(first toss = 1 ∪ second toss = 1) • P(sum of two tosses = 6 ∪ sum of two tosses = 3)
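A sketch (ours) checking the addition rule on the exercises above by direct counting over the two-toss sample space:

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # two tosses of a fair die
prob = lambda pred: Fraction(sum(1 for o in space if pred(o)), len(space))

# P(first toss = 1  OR  second toss = 1): directly, and via the addition rule.
a = lambda o: o[0] == 1
b = lambda o: o[1] == 1
print(prob(lambda o: a(o) or b(o)))                       # 11/36
print(prob(a) + prob(b) - prob(lambda o: a(o) and b(o)))  # 1/6 + 1/6 - 1/36 = 11/36

# The two "sum of tosses" events are disjoint, so the intersection term is 0.
print(prob(lambda o: sum(o) == 6) + prob(lambda o: sum(o) == 3))  # 5/36 + 2/36 = 7/36
```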

  18. PRIOR PROBABILITY VS CONDITIONAL PROBABILITY • The prior probability P(WIN) is the likelihood of an event occurring irrespective of anything else we know about the world • Often, however, we DO have additional information that can help us make a more informed guess about the likelihood of a certain event • E.g., take again the case of Harry the horse. Suppose we know that it was raining during 30 of the races that Harry raced, and that Harry won 15 of these races. Intuitively, the probability of Harry winning when it’s raining is .5, HIGHER than the probability of Harry winning overall • We can make a more informed guess • We indicate the probability of an event A happening given that we know that event B happened as well – the CONDITIONAL PROBABILITY of A given B – as P(A|B)

  19. Conditional probability [Venn diagram: the set of ALL RACES, with overlapping subsets RACES WHEN IT RAINS and RACES WON BY HARRY]

  20. Conditional probability • Conditional probability is DEFINED as follows: P(A|B) = P(A & B) / P(B) • Intuitively, you RESTRICT the range of trials under consideration to those in which event B took place as well (most easily seen when thinking in terms of relative frequency)

  21. EXAMPLE • Consider the case of Harry the horse again: P(WIN|RAIN) = P(WIN & RAIN) / P(RAIN) • Where: • P(WIN & RAIN) = 15/100 = .15 • P(RAIN) = 30/100 = .30 • This gives: P(WIN|RAIN) = .15 / .30 = .5 • (in agreement with our intuitions)
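The same computation as a small Python sketch (ours), using only the counts given on the slides:

```python
# Conditional probability from the counts on the slides:
# 100 races overall, 30 in the rain, 15 of those won by Harry.
races = 100
rainy_races = 30
rainy_wins = 15

p_win_and_rain = rainy_wins / races        # .15
p_rain = rainy_races / races               # .30
p_win_given_rain = p_win_and_rain / p_rain
print(p_win_given_rain)                    # 0.5
```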

  22. EXERCISES • P(sum of two dice = 3) • P(sum of two dice = 3 | first die = 1)
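A sketch of the exercises above: conditional probability computed by literally restricting the sample space to the outcomes where the conditioning event holds (the prob helper is ours):

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # two dice

def prob(pred, given=lambda o: True):
    """P(pred | given): restrict the trials considered to those where `given` holds."""
    restricted = [o for o in space if given(o)]
    return Fraction(sum(1 for o in restricted if pred(o)), len(restricted))

print(prob(lambda o: sum(o) == 3))                             # P(sum = 3)             = 1/18
print(prob(lambda o: sum(o) == 3, given=lambda o: o[0] == 1))  # P(sum = 3 | first = 1) = 1/6
```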

  23. THE MULTIPLICATION RULE • The definition of conditional probability can be rewritten as: • P(A&B) = P(A|B) P(B) • P(A&B) = P(B|A) P(A)

  24. INDEPENDENCE • Additional information does not always help. For example, knowing the color of a die usually doesn’t help us predict the result of a throw; knowing the name of the jockey’s girlfriend doesn’t help predict how well the horse he rides will do in a race; etc. When this is the case, we say that two events are INDEPENDENT • The notion of independence is defined in probability theory using the definition of conditional probability • Consider again the multiplication rule (the basic form of the chain rule): • P(A&B) = P(A|B) P(B) • We say that two events are INDEPENDENT if: • P(A&B) = P(A) P(B) • or, equivalently, P(A|B) = P(A)

  25. EXERCISES • P(H & H) • P(sum of two tosses greater than 6 & first toss = 1)
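A sketch (ours) of the exercises above. The first line uses the independence of two coin tosses; the dice check compares P(A & B) with P(A) P(B) and shows these two events are not independent:

```python
from fractions import Fraction
from itertools import product

# Two fair coin tosses are independent, so P(H & H) = P(H) * P(H) = 1/4.
print(Fraction(1, 2) * Fraction(1, 2))

# The dice events below are NOT independent: P(A & B) differs from P(A) * P(B).
space = list(product(range(1, 7), repeat=2))
prob = lambda pred: Fraction(sum(1 for o in space if pred(o)), len(space))
a = lambda o: sum(o) > 6   # sum of the two tosses greater than 6
b = lambda o: o[0] == 1    # first toss = 1
print(prob(lambda o: a(o) and b(o)))  # 1/36
print(prob(a) * prob(b))              # 7/12 * 1/6 = 7/72
```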

  26. THE CHAIN RULE • The multiplication rule generalizes to the so-called CHAIN RULE: • P(w1, w2, w3, …, wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1 … wn−1) • The chain rule plays an important role in statistical NLE: • P(the big dog) = P(the) P(big|the) P(dog|the big)
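A minimal sketch of the chain rule with relative-frequency (count-based) estimates; the tiny corpus below is invented purely for illustration and is not from the slides:

```python
from collections import Counter

# Chain-rule estimate of P(the big dog) from unigram, bigram, and trigram counts.
# The toy corpus is made up for illustration only.
corpus = "the big dog saw the big cat and the small dog".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

# P(the big dog) = P(the) * P(big | the) * P(dog | the big)
p = (unigrams["the"] / len(corpus)) \
    * (bigrams[("the", "big")] / unigrams["the"]) \
    * (trigrams[("the", "big", "dog")] / bigrams[("the", "big")])
print(p)   # 3/11 * 2/3 * 1/2, roughly 0.09
```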

  27. Bayes’ theorem • Suppose you’ve developed an IR system for searching a big database (say, the Web) • Given any search, about 1 in 100,000 documents is relevant (REL) • Suppose your system is pretty good: • P(YES|REL) = .95 • P(YES|¬REL) = .005 • What is the probability that the document is relevant, when the system says YES? • P(REL|YES)?

  28. Bayes’ Theorem • Bayes’ Theorem is a pretty trivial consequence of the definition of conditional probability, but it is very useful in that it allows us to use one conditional probability to compute another • We already saw that the definition of conditional probability can be rewritten equivalently as: • P(A&B) = P(A|B) P(B) • P(A&B) = P(B|A) P(A) • Since the two left-hand sides are equal, equating the right-hand sides gives Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B)

  29. Application of Bayes’ theorem
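The slide's worked example is an image in the transcript; the sketch below (ours) applies Bayes' theorem to the figures given on slide 27 (P(REL) = 1/100,000, P(YES|REL) = .95, P(YES|¬REL) = .005):

```python
# Bayes' theorem applied to the IR example of slide 27.
p_rel = 1 / 100_000          # prior probability that a document is relevant
p_yes_given_rel = 0.95       # system says YES for a relevant document
p_yes_given_notrel = 0.005   # system says YES for a non-relevant document

# Total probability that the system says YES.
p_yes = p_yes_given_rel * p_rel + p_yes_given_notrel * (1 - p_rel)

# P(REL | YES) = P(YES | REL) * P(REL) / P(YES)
p_rel_given_yes = p_yes_given_rel * p_rel / p_yes
print(p_rel_given_yes)   # roughly 0.002: even a good system is usually wrong when it says YES
```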

  30. Applications of Bayes’ Rule in Diagnosis • DIAGNOSIS (in medicine, industry): What is the probability P(I|S) that individual X has illness I given that they have symptoms S? • Key point: hospitals typically have LOTS of data about the conditional probability P(S|I) (‘got patient Y, they had illness I, and the following symptoms’), but NOT about P(I|S) • Using Bayes’ theorem: • P(I|S) = P(S|I) P(I) / P(S)

  31. RANDOM VARIABLES • A more general term to refer to the outcome of an experiment is RANDOM VARIABLE: • A FUNCTION from outcomes to values • VARIABLE because it changes from time to time • RANDOM because its value is not known beforehand • Examples: • Word frequency: a random variable with values 1..N, where N is the size of the corpus • Height of the individuals in a sample

  32. PROBABILITY DISTRIBUTIONS • Each random variable is associated with a PROBABILITY DISTRIBUTION that describes the likelihood of the different values that a RV can assume • Real life work of statisticians / computational linguists / AI researchers: try to find the probability distribution for a random variable of interest

  33. From data to probability distributions • Identification of probability distributions begins by exploring data • Computing STATISTICS • VISUALIZING data (e.g., to get an idea of the possible distributions) • A number of tools to do this exist, such as R

  34. WHAT IS STATISTICS • A branch of mathematics that provides techniques to analyze whether or not your data are significant (meaningful) • Statistical applications are based on probability statements • Nothing is “proved” with statistics • Statistics are reported • Statistics report the probability that similar results would occur if you repeated the experiment

  35. POPULATIONS AND SAMPLES • A POPULATION includes all members of a group • Example: all 9th grade students in America • Number of 9th grade students at SR • No absolute number • A SAMPLE is used to make inferences about large populations • Samples are a selection of the population • Example: 6th period Accelerated Biology • Why the need for statistics? • Statistics computed on samples are used as estimators of the corresponding population values • Many times, finding complete information about a population is costly and time consuming. We can use samples to represent a population.

  36. DATA Distribution Chart of Heights of 100 Control Plants

  37. Histogram / Frequency Distribution Charts • This is called a “normal” curve or a bell curve • It is an “idealized”, theoretical curve, based on an infinite number of observations rather than on a finite sample

  38. SUMMARIZING THE DATA: MEAN • If you are using a sample population • Arithmetic Mean (average): the sum of all the scores divided by the total number of scores • For roughly symmetric data, about ½ the members of the population fall on either side of the mean http://en.wikipedia.org/wiki/Table_of_mathematical_symbols

  39. SUMMARIZING DATA: MEDIAN AND MODE • Mode: the most frequently seen value (if no value repeats, there is no mode) • Median: the middle number • If you have an odd number of data points, the median is the value in the middle of the sorted set • If you have an even number of data points, the median is the average of the two middle values in the sorted set.
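A small sketch using Python's standard statistics module; the data values are made up for illustration:

```python
import statistics

# Mean, median, and mode with the standard library; the data are hypothetical.
data = [3, 5, 5, 6, 7, 8, 9]

print(statistics.mean(data))    # 6.142857...
print(statistics.median(data))  # 6 (middle value of the sorted, odd-sized list)
print(statistics.mode(data))    # 5 (most frequently seen value)
```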

  40. VARIANCE (s²) • Mathematically expresses the degree of variation of scores (data) from the mean • A large variance means that the individual scores (data) of the sample deviate a lot from the mean • A small variance indicates the scores (data) deviate little from the mean

  41. Calculating the variance for a whole population • σ² = Σ(X − µ)² / N, where Σ = sum of; X = score (value); µ = population mean; N = total number of scores (values) • OR use the VARP function in Excel http://www.mnstate.edu/wasson/ed602calcvardevs.htm

  42. Calculating the variance for a (biased) SAMPLE of the population • s² = Σ(Xᵢ − x̄)² / (n − 1), where Σ = sum of; Xᵢ = score (value); n − 1 = total number of scores minus 1; x̄ (often read as “x bar”) is the sample mean (the average value of the Xᵢ) • Note the sample variance is larger… why? (Dividing by n − 1 rather than n corrects for the fact that the mean itself is estimated from the same sample.) http://www.mnstate.edu/wasson/ed602calcvardevs.htm

  43. Heights in Centimeters of Five Randomly Selected Pea Plants Grown at 8-10 °C • [table shown in the original slide] • Xi = score or value; x̄ = mean; Σ = sum of

  44. Finish Calculating the Variance • There were five plants: n = 5, therefore n − 1 = 4 • The sum of squared deviations from the mean was 10, so s² = 10/4 = 2.5 • Variance helps to characterize the data concerning a sample by indicating the degree to which individual members within the sample vary from the mean

  45. STANDARD DEVIATION • An important statistic that is also used to measure variation in biased samples • s is the symbol for standard deviation • Calculated by taking the square root of the variance • So, from the previous example of pea plants: s = √2.5 ≈ 1.6 • Which means the measurements vary by about +/- 1.6 cm around the mean
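A sketch (ours) reproducing the pea-plant arithmetic. The actual five heights are not in the transcript, so the values below are hypothetical ones chosen to give the slide's sum of squared deviations (10), variance (2.5), and s (about 1.6):

```python
import statistics

# Hypothetical heights (cm) chosen so the arithmetic matches the slides:
# mean 8, sum of squared deviations 10, sample variance 2.5, s roughly 1.6.
heights = [6, 7, 8, 9, 10]

n = len(heights)
mean = sum(heights) / n
ss = sum((x - mean) ** 2 for x in heights)  # sum of squared deviations = 10.0

sample_variance = ss / (n - 1)              # 10 / 4 = 2.5
s = sample_variance ** 0.5                  # about 1.58, reported as 1.6 on the slide
print(sample_variance, s)

# The standard library gives the same n-1 ("sample") versions.
print(statistics.variance(heights), statistics.stdev(heights))
```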

  46. What does “s” mean? • We can predict the probability of finding a pea plant of a given height… the probability of finding a pea plant above 12.8 cm or below 3.2 cm (i.e., more than 3 standard deviations from the mean) is less than 1% • s is a valuable tool because it reveals the predicted limits of finding a particular value

  47. Pea Plant Normal Distribution Curve with Std Dev

  48. The classic case of probability distribution: the normal distribution • Many random variables describing physical phenomena (e.g., the height of people) have a nice symmetric probability distribution called the NORMAL DISTRIBUTION

  49. The Normal Curve and Standard Deviation • A normal curve: each vertical line is a unit of standard deviation • 68% of values fall within +1 or −1 standard deviation of the mean • 95% of values fall within +2 and −2 units • Nearly all members (>99%) fall within 3 std dev units http://classes.kumc.edu/sah/resources/sensory_processing/images/bell_curve.gif
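A quick check of the 68/95/99.7 figures (ours), using only the standard library: for a normal distribution, P(|X − µ| ≤ k·σ) = erf(k/√2):

```python
import math

# For a normal distribution, P(|X - mu| <= k * sigma) = erf(k / sqrt(2)).
for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"within +/- {k} std dev: {p:.4f}")
# within +/- 1 std dev: 0.6827
# within +/- 2 std dev: 0.9545
# within +/- 3 std dev: 0.9973
```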

  50. What you can do if you know that your data are normally distributed • You can compute the probability that an individual has a given height • You can test how likely it is that a particular sample you have is a representative sample of that phenomenon • T-TEST • etc
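A sketch of both uses, assuming SciPy is available; the Normal(8, 1.6) model reuses parameters consistent with the earlier pea-plant slides, and the new sample below is hypothetical:

```python
from scipy import stats

# Probability of a plant taller than 11 cm under a hypothetical Normal(8, 1.6) model.
print(1 - stats.norm.cdf(11, loc=8, scale=1.6))   # about 0.03

# One-sample t-test: is this (hypothetical) sample consistent with a population mean of 8?
sample = [9.1, 8.4, 10.2, 9.6, 8.8]
result = stats.ttest_1samp(sample, popmean=8)
print(result.statistic, result.pvalue)
```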
