Simple Statistics for Corpus Linguistics

Sean Wallis, Survey of English Usage, University College London, s.wallis@ucl.ac.uk



  1. Simple Statistics for Corpus Linguistics Sean Wallis Survey of English Usage University College London s.wallis@ucl.ac.uk

  2. Outline • Numbers… • A simple research question • do women speak or write more than men in ICE-GB? • p = proportion = probability • Another research question • what happens to speakers’ use of modal shall vs. will over time? • the idea of inferential statistics • plotting confidence intervals • Concluding remarks

  3. Numbers... • We are used to concepts like these being expressed as numbers: • length (distance, height) • area • volume • temperature • wealth (income, assets)

  4. Numbers... • We are used to concepts like these being expressed as numbers: • length (distance, height) • area • volume • temperature • wealth (income, assets) • We are going to discuss another concept: • probability • proportion, percentage • a simple idea, at the heart of statistics

  5. Probability • Based on another, even simpler, idea: • probability p = x / n

  6. Probability • Based on another, even simpler, idea: • probability p = x / n • e.g. the probability that the speaker says will instead of shall

  7. Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) • the number of times something actually happens • the number of hits in a search • e.g. the probability that the speaker says will instead of shall

  8. Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) • the number of times something actually happens • the number of hits in a search • e.g. the probability that the speaker says will instead of shall • x: cases of will

  9. Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) • the number of times something actually happens • the number of hits in a search • baseline n is • the number of times something could happen • the number of hits • in a more general search • in several alternative patterns (‘alternate forms’) • e.g. the probability that the speaker says will instead of shall • x: cases of will

  10. Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) • the number of times something actually happens • the number of hits in a search • baseline n is • the number of times something could happen • the number of hits • in a more general search • in several alternative patterns (‘alternate forms’) • e.g. the probability that the speaker says will instead of shall • x: cases of will • n: total, will + shall

  11. Probability • Based on another, even simpler, idea: • probability p = x / n • where • frequency x (often, f) • the number of times something actually happens • the number of hits in a search • baseline n is • the number of times something could happen • the number of hits • in a more general search • in several alternative patterns (‘alternate forms’) • Probability can range from 0 to 1 • e.g. the probability that the speaker says will instead of shall • x: cases of will • n: total, will + shall
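
The definition p = x / n can be sketched in a few lines of Python. This is only an illustration: the function name `proportion` and the counts for will and shall are invented here and do not come from any corpus.

```python
def proportion(x: int, n: int) -> float:
    """Return p = x / n, where x is the frequency and n the baseline."""
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n


# Hypothetical counts, invented for illustration only.
will_hits = 30    # x: cases of will
shall_hits = 10   # the alternative form

# Baseline n is the total number of cases where either form could occur.
p_will = proportion(will_hits, will_hits + shall_hits)
print(p_will)
```

Because x can never exceed n, the result always falls between 0 and 1, which is the range stated on the slide.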

  12. What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation

  13. What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation • How do these differ in what they might tell us?

  14. What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation • How do these differ in what they might tell us? • A corpus is a sample of language

  15. What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation • How do these differ in what they might tell us? • A corpus is a sample of language, varying by: • source (e.g. speech vs. writing, age...) • levels of annotation (e.g. parsing) • size (number of words) • sampling method (random sample?)

  16. What can a corpus tell us? • A corpus is a source of knowledge about language: • corpus • introspection/observation/elicitation • controlled laboratory experiment • computer simulation • How do these differ in what they might tell us? • A corpus is a sample of language, varying by: • source (e.g. speech vs. writing, age...) • levels of annotation (e.g. parsing) • size (number of words) • sampling method (random sample?) • How does this affect the types of knowledge we might obtain?

  17. What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus:

  18. What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus: • Frequency evidence of a particular known rule, structure or linguistic event (How often?)

  19. What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus: • Frequency evidence of a particular known rule, structure or linguistic event (How often?) • Factual evidence of new rules, etc. (How novel?)

  20. What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus: • Frequency evidence of a particular known rule, structure or linguistic event (How often?) • Factual evidence of new rules, etc. (How novel?) • Interaction evidence of relationships between rules, structures and events (Does X affect Y?)

  21. What can a parsed corpus tell us? • Three kinds of evidence may be found in a parsed corpus: • Frequency evidence of a particular known rule, structure or linguistic event (How often?) • Factual evidence of new rules, etc. (How novel?) • Interaction evidence of relationships between rules, structures and events (Does X affect Y?) • Lexical searches may also be made more precise using the grammatical analysis

  22. A simple research question • Let us consider the following question: • Do women speak or write more words than men in the ICE-GB corpus? • What do you think? • How might we find out?

  23. Let's get some data • Open ICE-GB with ICECUP • Text Fragment query for words: • “*+<{~PUNC,~PAUSE}>” • counts every word, excluding pauses and punctuation

  24. Let's get some data • Open ICE-GB with ICECUP • Text Fragment query for words: • “*+<{~PUNC,~PAUSE}>” • counts every word, excluding pauses and punctuation • Variable query: • TEXT CATEGORY = spoken, written

  25. Let's get some data • Open ICE-GB with ICECUP • Text Fragment query for words: • “*+<{~PUNC,~PAUSE}>” • counts every word, excluding pauses and punctuation • Variable query: • TEXT CATEGORY = spoken, written • Variable query: • SPEAKER GENDER = f, m, <unknown> • Combine these 3 queries

  27. ICE-GB: gender / written-spoken • Proportion of words in each category spoken/written by women and men • The authors of some texts are unspecified • Some written material may be jointly authored • female/male ratio varies slightly • [Bar chart: proportion p (0 to 1) of words produced by female and male speakers or authors, for written, spoken and TOTAL]

  28. ICE-GB: gender / written-spoken • Proportion of words in each category spoken/written by women and men • The authors of some texts are unspecified • Some written material may be jointly authored • female/male ratio varies slightly • p(female) = words spoken by women / total words (excluding <unknown>) • [Bar chart: proportion p (0 to 1) of words produced by female and male speakers or authors, for written, spoken and TOTAL]

  29. p = Probability = Proportion • We asked ourselves the following question: • Do women speak or write more words than men in the ICE-GB corpus? • To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)

  30. p = Probability = Proportion • We asked ourselves the following question: • Do women speak or write more words than men in the ICE-GB corpus? • To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known) • The proportion of words produced by women can also be thought of as a probability: • What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known) it would be uttered by a woman?
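
The calculation described above, a proportion out of all words where the gender is known, might be sketched as follows. The word counts are invented placeholders, not ICE-GB figures, and the names `counts` and `p_female` are assumptions made for this example.

```python
# Hypothetical word counts, invented for illustration; not ICE-GB figures.
counts = {
    "spoken": {"f": 200, "m": 300, "unknown": 50},
    "written": {"f": 150, "m": 250, "unknown": 100},
}


def p_female(cell: dict) -> float:
    """Proportion of words by women, out of words where gender is known."""
    known = cell["f"] + cell["m"]   # <unknown> is left out of the baseline
    return cell["f"] / known


# Column totals across both text categories.
total = {key: sum(cell[key] for cell in counts.values())
         for key in ("f", "m", "unknown")}

result = {name: p_female(cell) for name, cell in counts.items()}
result["TOTAL"] = p_female(total)

for name, p in result.items():
    print(f"{name}: p(female) = {p:.3f}")
```

Leaving `<unknown>` out of the baseline matters: including it would lower p(female) without telling us anything about the women/men split.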

  31. Another research question • Let us consider the following question: • What happens to modal shall vs. will over time in British English? • Does shall increase or decrease? • What do you think? • How might we find out?

  32. Let's get some data • Open DCPSE with ICECUP • FTF query for first person declarative shall • repeat for will

  33. Let's get some data • Open DCPSE with ICECUP • FTF query for first person declarative shall • repeat for will • Corpus Map: • DATE • Do the first set of queries and then drop into Corpus Map

  34. Modal shall vs. will over time • Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE) • [Chart: p(shall | {shall, will}) on the vertical axis, from 0.0 (shall = 0%) to 1.0 (shall = 100%), against year, 1955 to 1995] (Aarts et al. 2013)

  36. Modal shall vs. will over time • Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE) • [Chart: p(shall | {shall, will}) on the vertical axis, from 0.0 (shall = 0%) to 1.0 (shall = 100%), against year, 1955 to 1995] (Aarts et al. 2013) • Is shall going up or down?

  37. Is shall going up or down? • Whenever we look at change, we must ask ourselves two things:

  38. Is shall going up or down? • Whenever we look at change, we must ask ourselves two things: • What is the change relative to? • Is our observation higher or lower than we might expect? • In this case we ask • Does shall decrease relative to shall + will?

  39. Is shall going up or down? • Whenever we look at change, we must ask ourselves two things: • What is the change relative to? • Is our observation higher or lower than we might expect? • In this case we ask • Does shall decrease relative to shall + will? • How confident are we in our results? • Is the change big enough to be reproducible?

  40. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning

  41. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning • Really? Not 77.28, or 77.26?

  42. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning

  43. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning • Sounds defensible. But how confident can we be in this number?

  44. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning

  45. The idea of a confidence interval • All observations are imprecise • Randomness is a fact of life • Our abilities are finite: • to measure accurately or • reliably classify into types • We need to express caution in citing numbers • Example (from Levin 2013): • 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning • Finally we have a credible range of values, but it needs a footnote* to explain how it was calculated.
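
The slide does not say how the 66-86% range was calculated. One widely used method for a proportion is the Wilson score interval, and the sketch below assumes that method at a 95% level (z ≈ 1.96). The counts x = 51 and n = 66 are hypothetical, chosen only because 51/66 ≈ 77.27%; the slides do not give the underlying counts, and `wilson_interval` is a name made up for this example.

```python
import math


def wilson_interval(x: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for the proportion p = x / n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    p = x / n
    z2 = z * z
    denom = 1 + z2 / n
    # The interval is centred on an adjusted proportion, not on p itself.
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts (assumption): 51 literal uses out of 66 cases.
low, high = wilson_interval(51, 66)
print(f"p = {51 / 66:.4f}, 95% interval = ({low:.4f}, {high:.4f})")
```

With these assumed counts the interval runs from roughly 66% to roughly 86%. A smaller n with the same proportion would give a wider interval, which is why the range depends on the sample size and not only on the percentage.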

  46. The ‘sample’ and the ‘population’ • We said that the corpus was a sample

  47. The ‘sample’ and the ‘population’ • We said that the corpus was a sample • Previously, we asked about the proportions of male/female words in the corpus (ICE-GB) • We asked questions about the sample • The answers were statements of fact

  48. The ‘sample’ and the ‘population’ • We said that the corpus was a sample • Previously, we asked about the proportions of male/female words in the corpus (ICE-GB) • We asked questions about the sample • The answers were statements of fact • Now we are asking about “British English”?

  49. The ‘sample’ and the ‘population’ • We said that the corpus was a sample • Previously, we asked about the proportions of male/female words in the corpus (ICE-GB) • We asked questions about the sample • The answers were statements of fact • Now we are asking about “British English” • We want to draw an inference • from the sample (in this case, DCPSE) • to the population (similarly-sampled BrE utterances) • This inference is a best guess • This process is called inferential statistics

  50. Basic inferential statistics • Suppose we carry out an experiment • We toss a coin 10 times and get 5 heads • How confident are we in the results? • Suppose we repeat the experiment • Will we get the same result again?
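
The question of whether a repeat would give the same result can be explored with the binomial distribution. The sketch below assumes a fair coin (p = 0.5) and independent tosses; the function name `binomial_pmf` is made up for this example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


tosses = 10
# Probability of each possible number of heads, from 0 to 10.
dist = [binomial_pmf(k, tosses) for k in range(tosses + 1)]

for heads, prob in enumerate(dist):
    print(f"{heads:2d} heads: {prob:.4f}")
```

Under these assumptions, exactly 5 heads has probability 252/1024, or about 0.246. So even with a perfectly fair coin, repeating the experiment would reproduce the same count only about a quarter of the time.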
