Corpora and Statistical Methods

Albert Gatt

In this lecture
  • We have considered distributions of words and lexical variation in corpora.
  • Today we consider collocations:
    • definition and characteristics
    • measures of collocational strength
    • experiments on corpora
    • hypothesis testing

Part 1

Collocations: Definition and characteristics

a motivating example
A motivating example
  • Consider phrases such as:
    • strong tea vs. ?powerful tea
    • strong support vs. ?powerful support
    • powerful drug vs. ?strong drug
  • Traditional semantic theories have difficulty accounting for these patterns.
    • strong and powerful seem near-synonyms
    • do we claim they have different senses?
    • what is the crucial difference?

The empiricist view of meaning
  • Firth’s view (1957):
    • “You shall know a word by the company it keeps”
    • This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953).
    • In the Firthian tradition, attention is paid to patterns that crop up with regularity in language.
      • Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc.
    • Statistical work on collocations tends to follow this tradition.

Defining collocations
  • “Collocations … are statements of the habitual or customary places of [a] word.” (Firth 1957)
  • Characteristics/Expectations:
    • regular/frequently attested;
    • occur within a narrow window (a span of a few words);
    • not fully compositional;
    • non-substitutable;
    • non-modifiable;
    • display category restrictions.

Frequency and regularity
  • We know that language is regular (non-random) and rule-based.
    • this aspect is emphasised by rationalist approaches to grammar
  • We also need to acknowledge that frequency of usage is an important factor in language development.
    • why do big and large collocate differently with different nouns?

Regularity/frequency
  • f(strong tea) > f(powerful tea)
  • f(credit card) > f(credit bankruptcy)
  • f(white wine) > f(yellow wine)
    • (even though white wine is actually yellowish)

Narrow window (textual proximity)
  • Usually, we specify an n-gram window within which to analyse collocations:
    • bigram: credit card, credit crunch
    • trigram: credit card fraud, credit card expiry
  • The idea is to look at co-occurrence of words within a specific n-gram window
  • We can also count n-grams with intervening words:
    • federal (.*) subsidy
    • matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy…
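The gapped pattern above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name gapped_matches, the max_gap limit and the toy text are assumptions, and the slide's unrestricted (.*) is tightened so that only a bounded number of words may intervene.

```python
import re

def gapped_matches(text, w1, w2, max_gap=2):
    """Return every span 'w1 ... w2' with at most max_gap intervening words."""
    # Builds e.g. \bfederal\s+((?:\w+\s+){0,2})subsidy\b
    pattern = re.compile(
        rf"\b{re.escape(w1)}\s+((?:\w+\s+){{0,{max_gap}}}){re.escape(w2)}\b",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical toy text, invented for illustration.
text = (
    "The federal subsidy was cut. A federal farm subsidy remains, "
    "and the federal manufacturing subsidy was renewed."
)
matches = gapped_matches(text, "federal", "subsidy")
```

Bounding the gap keeps the match inside a narrow window, which is the point of the slide; an unbounded (.*) would happily span whole sentences.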

Textual proximity (continued)
  • Usually collocates of a word occur close to that word.
    • may still occur across a span
  • Examples:
    • bigram: white wine, strong tea
    • longer than a bigram: knock on the door; knock on X’s door

Non-compositionality
  • white wine
    • not really “white”: the meaning is not fully predictable from the component words + syntax
  • signal interpretation
    • a term used in Intelligent Signal Processing: connotations go beyond compositional meaning
  • Similarly:
    • regression coefficient
    • good practice guidelines
  • Extreme cases:
    • idioms such as kick the bucket
    • meaning is completely frozen

Non-substitutability
  • If a phrase is a collocation, we can’t substitute a near-synonym for one of its words and still have the same overall meaning.
  • E.g.:
    • white wine vs. yellow wine
    • powerful tea vs. strong tea

Non-modifiability
  • Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms.
  • Example:
    • kick the bucket vs. ?kick the large bucket
  • NB:
    • this is a matter of degree!
    • non-idiomatic collocations are more flexible

Category restrictions
  • Frequency alone doesn’t indicate collocational strength:
    • by the is a very frequent phrase in English
    • not a collocation
  • Collocations tend to be formed from content words:
    • A+N: strong tea
    • N+N: regression coefficient, mass demonstration
    • N+PREP+N: degrees of freedom

Collocations “in a broad sense”
  • In many statistical NLP applications, the term collocation is quite broadly understood:
    • any phrase which is frequent/regular enough…
      • proper names (New York)
      • compound nouns (elevator operator)
      • set phrases (part of speech)
      • idioms (kick the bucket)

Why are collocations interesting?
  • Several applications need to “know” about collocations:
    • terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices)
    • document classification: specialist phrases are good indicators of the topic of a text
    • named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don’t

Example application: Parsing
  • She spotted the man with a pair of binoculars
    • (1) [VP spotted [NP the man [PP with a pair of binoculars]]]
    • (2) [VP spotted [NP the man] [PP with a pair of binoculars]]
  • A parser might prefer (2) if spot and binoculars co-occur frequently within a window of a certain width.

Example application: Generation
  • NLG systems often need to map a semantic representation to a lexical/syntactic one.
    • Shouldn’t use the wrong adjective-noun combinations: clean face vs. ?immaculate face
  • Lapata et al. (1999):
    • experiment asking people to rate different adjective-noun combinations
    • the frequency of the combination is a strong predictor of people’s preferences
    • they argue that NLG systems need to be able to make contextually informed decisions in lexical choice

Frequency-based approach
  • Motivation:
    • if two (or three, or…) words occur together a lot within some window, they’re a collocation
  • Problems:
    • frequent “collocations” under this definition include with the, onto a, etc.
    • not very interesting…
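The problem can be seen with a plain bigram count. The sketch below is illustrative only: the helper bigram_counts and the toy sentence are assumptions, not material from the lecture.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical toy corpus, invented for illustration.
tokens = (
    "she paid with the credit card and he paid with the debit card "
    "while the credit card fraud went on with the bank"
).split()

counts = bigram_counts(tokens)
most_frequent = counts.most_common(1)[0]
```

Even in this tiny sample the function-word pair (with, the) outranks (credit, card), which is the weakness of raw frequency that the slide describes.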

Improving the frequency-based approach
  • Justeson & Katz (1995):
    • part of speech filter
    • only look at word combinations of the “right” category:
      • N + N: regression coefficient
      • N + PREP + N: jack in (the) box
    • dramatically improves the results
    • content-word combinations more likely to be phrases
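A part-of-speech filter of this kind might look as follows. This is a sketch under stated assumptions: only the tag patterns named in these slides (A+N, N+N, N+PREP+N) are used, and the tag labels, the function filtered_ngrams and the hand-tagged toy corpus are hypothetical. A real experiment would take its tags from a tagger.

```python
from collections import Counter

# Tag patterns mentioned in the slides; the tag names are illustrative.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N"), ("N", "PREP", "N")}

def filtered_ngrams(tagged, patterns=ALLOWED_PATTERNS):
    """Count n-grams whose tag sequence matches one of the allowed patterns."""
    counts = Counter()
    for n in sorted({len(p) for p in patterns}):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in patterns:
                counts[tuple(word for word, _ in window)] += 1
    return counts

# Hypothetical hand-tagged tokens, invented for illustration.
tagged = [
    ("the", "DET"), ("regression", "N"), ("coefficient", "N"),
    ("has", "V"), ("two", "NUM"), ("degrees", "N"), ("of", "PREP"),
    ("freedom", "N"), ("with", "PREP"), ("the", "DET"),
    ("strong", "A"), ("tea", "N"),
]
counts = filtered_ngrams(tagged)
```

Pairs such as (with, the) never reach the counter because their tag sequence is not in the allowed set, so only content-word combinations are ranked.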

Case study: strong vs. powerful
  • See: Manning & Schütze ’99, Sec 5.2
  • Motivation:
    • try to distinguish the meanings of two quasi-synonyms
    • data from New York Times corpus
  • Basic strategy:
    • find all bigrams <w1, w2> where w1 = strong or powerful
    • apply POS filter to remove strong on [crime], powerful in [industry] etc.
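The basic strategy can be sketched as below. The function collocates_of, the tag labels and the toy data are assumptions made for illustration; this is not the procedure or data actually used by Manning & Schütze.

```python
from collections import Counter, defaultdict

def collocates_of(tagged, targets=("strong", "powerful"), keep_tag="N"):
    """For each target w1, count the words w2 that directly follow it and carry keep_tag."""
    result = defaultdict(Counter)
    for (w1, _), (w2, tag2) in zip(tagged, tagged[1:]):
        if w1 in targets and tag2 == keep_tag:
            result[w1][w2] += 1
    return result

# Hypothetical hand-tagged tokens, invented for illustration.
tagged = [
    ("strong", "A"), ("support", "N"), ("and", "CONJ"),
    ("strong", "A"), ("on", "PREP"), ("crime", "N"),
    ("powerful", "A"), ("computers", "N"),
    ("powerful", "A"), ("in", "PREP"), ("industry", "N"),
    ("strong", "A"), ("support", "N"),
]
by_adjective = collocates_of(tagged)
```

Requiring a noun in second position drops strong on and powerful in, leaving one ranked list of nouns per adjective that can then be compared.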

Case study (cont’d)
  • Sample results from Manning & Schütze ’99:
    • f(strong support) = 50
    • f(strong supporter) = 10
    • f(powerful force) = 13
    • f(powerful computers) = 10
  • Teaser:
    • would you also expect powerful supporter?
    • what’s the difference between strong supporter and powerful supporter?

Limitations of frequency-based search
  • Only works for fixed phrases
    • But collocations can be “looser”, allowing interpolation of other words.
    • knock on [the,X’s,a] door
    • pull [a] punch
  • Simple frequency won’t do for these: different interpolated words dilute the frequency.

Using mean and variance
  • General idea: include bigrams even at a distance:

w1 X w2

pull a punch

  • Strategy:
    • find co-occurrences of the two words in windows of varying length
    • compute mean offset between w1 and w2
    • compute variance of offset between w1 and w2
    • if offsets are randomly distributed, then we have high variance and conclude that <w1,w2> is not a collocation
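The strategy above can be sketched as follows. The assumptions are labelled here and in the code: the offset is taken as the signed distance pos(word) - pos(ref) within a sentence, the sample standard deviation (n - 1 in the denominator) is used, and the function offsets, the max_dist limit and the toy sentences are invented for illustration.

```python
from statistics import mean, stdev

def offsets(sentences, word, ref, max_dist=5):
    """Signed distances pos(word) - pos(ref) for co-occurrences within one sentence."""
    found = []
    for tokens in sentences:
        for i, t in enumerate(tokens):
            if t != word:
                continue
            for j, u in enumerate(tokens):
                if u == ref and 0 < abs(i - j) <= max_dist:
                    found.append(i - j)
    return found

# Hypothetical toy sentences, invented for illustration.
sentences = [
    "she knocked on his door".split(),
    "they knocked at the door".split(),
    "a man knocked on the metal front door".split(),
    "he knocked on the big wooden door".split(),
]
d = offsets(sentences, "door", "knocked")
d_mean = mean(d)   # average position of door relative to knocked
d_sd = stdev(d)    # sample standard deviation of the offsets
```

A small standard deviation means the two words sit at a fairly stable distance from each other; a large one suggests the offsets are scattered, which argues against a collocation. No threshold is given in the slides, so none is assumed here.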

Example outcomes (M&S ’99)
  • position of strong wrt opposition
    • mean = -1.15, standard dev = 0.67
    • i.e. most occurrences are strong […] opposition
  • position of strong wrt for
    • mean = -1.12, standard dev = 2.15
    • i.e. for occurs anywhere around strong: the SD is larger than the absolute value of the mean.
    • can get strong support for, for the strong support, etc.

More limitations of frequency
  • If we use simple frequency or mean & variance, we have a good way of ranking likely collocations.
  • But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance?
  • We need to think in terms of hypothesis-testing.
    • Given <w1,w2>, we want to compare:
      • The hypothesis that they are non-independent.
      • The hypothesis that they are independent.
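To make "predicted by chance" concrete: if w1 and w2 were independent, the expected count of the bigram would be N * P(w1) * P(w2). The sketch below only compares observed and expected counts; it is not a significance test, and all counts in it are hypothetical values chosen for illustration.

```python
def expected_under_independence(c1, c2, n):
    """Expected bigram count if w1 and w2 were independent: N * P(w1) * P(w2)."""
    return c1 * c2 / n

# Hypothetical counts, invented for illustration only.
n_bigrams = 1_000_000      # total bigrams in the corpus
c_strong = 2_000           # occurrences of w1
c_tea = 500                # occurrences of w2
c_strong_tea = 30          # observed occurrences of the bigram <w1, w2>

expected = expected_under_independence(c_strong, c_tea, n_bigrams)
ratio = c_strong_tea / expected
```

A ratio well above 1 points towards non-independence, but whether the gap is larger than chance variation would allow is exactly what a hypothesis test has to decide.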
