## PowerPoint Slideshow about 'Association Measures' - kareem


Presentation Transcript

General Remarks

- we will only use data from contingency tables
- we will consider each pair type on its own, independently of all other pair types (no distributional information)
- we won't distinguish between relational and positional cooccurrences

Association Measures (AMs)

- goal: assign association score to each pair type = strength of association between components
- high score = strong association
- association in a statistical sense, but there is no precise definition
- positive vs. negative association ("colourless green ideas")

Using Association Scores

- absolute values (cut-off threshold)
- input for higher-order statistics (AMs are first-order statistics): scores should be meaningful
- ranking of collocation candidates: only relative scores matter
- ranking the collocates of a given base: one marginal frequency is fixed, so only two free parameters remain

First Steps: Proportions

- Workshop on Mechanized Documentation (Washington, 1964)

First Steps: Proportions

- proportions between 0 and 1
- high proportion = strong (directional) association
- need to combine two proportions into a single association score
- average (P1 + P2) / 2 is not useful
- f=1, f1=1, f2=1000 → avg. = 0.5005
- f=50, f1=100, f2=100 → avg. = 0.5

more "conservative" weighting

First Steps: Proportions

- harmonic mean
- geometric mean
- minimum
- Jaccard
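
The formulas for these coefficients are not preserved in the transcript. The following is a minimal sketch, assuming the two proportions are P1 = f/f1 and P2 = f/f2 and using the usual textbook definitions of the four combinations; the function name `proportion_coefficients` is hypothetical.

```python
import math

def proportion_coefficients(f, f1, f2):
    """Combine P1 = f/f1 and P2 = f/f2 into single scores (assumes f > 0)."""
    p1, p2 = f / f1, f / f2
    return {
        "average": (p1 + p2) / 2,
        "harmonic": 2 * p1 * p2 / (p1 + p2),  # same as 2f / (f1 + f2)
        "geometric": math.sqrt(p1 * p2),      # same as f / sqrt(f1 * f2)
        "minimum": min(p1, p2),               # same as f / max(f1, f2)
        "jaccard": f / (f1 + f2 - f),
    }

# The two cases from the slide on averaging proportions
for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    print((f, f1, f2), proportion_coefficients(f, f1, f2))
```

For f=1, f1=1, f2=1000 the plain average stays near 0.5, while the four more conservative combinations all fall below 0.05.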

First Steps: Proportions

- coefficients range from 0 to 1
- 1 = total (positive) association
- interpretation of lower scores is less clear
- positive vs. negative association?
- which score for no association?
- what is "no association"? → random combinations

Expected Frequencies

- assume that types u and v cooccur only by chance
- f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens
- each instance of u has a chance of f2(v)/N to cooccur with a v

expected # of cooccurrences: E = f1(u) · f2(v) / N

Expected Frequencies

- expected frequencies for all cells of the contingency table
- assuming random combinations (statistical independence)

Expected Frequencies

- comparison of expected against observed frequencies
- note that row and column sums are the same for both tables
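
As a rough illustration, the expected table can be derived from the row and column sums of the observed table. This sketch assumes the usual 2×2 layout (O11, O12 in the first row; O21, O22 in the second); the function name and the example counts are hypothetical.

```python
def expected_frequencies(o11, o12, o21, o22):
    """Expected cell counts under independence: E_ij = R_i * C_j / N."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row sums
    c1, c2 = o11 + o21, o12 + o22   # column sums
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

# Hypothetical pair type with f=50, f1=100, f2=100, N=1000
observed = (50, 50, 50, 850)
expected = expected_frequencies(*observed)
print(expected)  # row and column sums match those of the observed table
```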

Mutual Information

- compares O11 with E11
- ratio O11/E11 ranges from 0 to ∞
- 1 = no association (O11=E11)
- usually logarithmic values
- range: −∞ to +∞
- 0 = no assoc., < 0 neg., > 0 pos.
- used in English lexicography

Low-Frequency Pairs & Random Variation

- large amount of low-frequency data (consequence of Zipf's law)
- a simple (invented) example
- A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
- B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=.001, MI = log 1000
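
The two invented cases can be checked with a few lines of code. This is only a sketch: the slide does not state the base of the logarithm, so the base-2 default below is an assumption, and the function name is hypothetical.

```python
import math

def mutual_information(f, f1, f2, n, base=2):
    """MI score log(O11 / E11) with E11 = f1 * f2 / N (assumes f > 0)."""
    e11 = f1 * f2 / n
    return math.log(f / e11, base)

mi_a = mutual_information(50, 100, 100, 1000)  # log 5
mi_b = mutual_information(1, 1, 1, 1000)       # log 1000
print(mi_a, mi_b)
```

Whatever the base, the single cooccurrence in case B receives a much higher score than the fifty cooccurrences in case A.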

Low-Frequency Pairs & Random Variation

- three problems with case B
- how meaningful is a single example? (not very much, actually)
- could well be a spelling mistake or noise from automatic processing
- we want to make generalisations (from particular corpus to "language")

this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)

The Statistical Model: Random Sample

- assumption: corpus data is a random sample from the language

base data is a random sample from all coocs. in the language

The Statistical Model: Random Sample

- random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token
- notation: U and V as "prototypes"
- for a given pair type (u,v), contingency table can be computed from Ui and Vi

→ random variables X11, X12, X21, X22

The Statistical Model: Random Sample

- population parameters τ11, τ12, τ21, τ22 for pair type (u,v)
- observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample
- theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters τ11, τ12, τ21, τ22

Two Footnotes

- vector notation for cont. tables
- population ≠ general language
- restricted to domain(s), genre(s), ... covered by source corpus
- e.g. black box in computer science vs. newspapers vs. cooking

The Sampling Distribution

- multinomial sampling distribution
- each individual cell count Xij has a binomial distribution (but these are not independent)
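
The formula of the sampling distribution is not preserved in the transcript. As a hedged sketch, the standard multinomial probability of a 2×2 table can be computed on a log scale as follows; the argument name `tau` for the four cell probabilities and the function name are assumptions.

```python
import math

def multinomial_log_likelihood(observed, tau):
    """Log-probability of the cell counts `observed` under cell probabilities `tau`.

    Both arguments are sequences of four values (cells 11, 12, 21, 22).
    Cells with a zero count contribute nothing; a positive count in a cell
    with probability 0 raises ValueError (the table is impossible).
    """
    n = sum(observed)
    log_p = math.lgamma(n + 1)           # log N!
    for o, t in zip(observed, tau):
        log_p -= math.lgamma(o + 1)      # minus log O_ij!
        if o > 0:
            log_p += o * math.log(t)     # plus O_ij * log tau_ij
    return log_p

# Tiny example: four tokens, one in each cell, all cells equally likely
print(math.exp(multinomial_log_likelihood((1, 1, 1, 1), (0.25, 0.25, 0.25, 0.25))))
```

Working with logarithms avoids underflow, since the probability of any realistic corpus-sized table is extremely small.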

The Sampling Distribution

- given assumptions about the population parameters, we can compute the likelihood of the observed contingency table
- relatively high likelihood = consistent with assumptions
- relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)

Adequacy of the Statistical Model

- particular sequence of pair tokens is irrelevant, only the overall frequencies matter (sufficiency)
- randomness assumption (random sample from fixed population)
- independence of pair tokens
- constancy of population parameters
- violations problematic only when they affect sampling distribution

Adequacy of the Statistical Model

- three causes of non-randomness
- local dependencies (e.g. syntax) usually not problematic
- inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
- repetition / clustering of bigrams can be a serious problem (does not affect segment-based data if clustered within segments)

Making Assumptions about the Population Parameters

- population parameters (π, π1, π2) are unknown
- best guess from observation: MLE = maximum-likelihood estimate

Making Assumptions about the Population Parameters

- conditional probabilities with MLE
- Dice coefficient etc. are MLE for population characteristics
- MI is MLE for log(π / (π1 π2))

unreliable for small frequencies
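
A minimal sketch of these plug-in estimates, assuming the parameters are the probability π of the pair and the marginal probabilities π1 and π2 of its components, estimated by relative frequencies. The function names and the example counts are hypothetical.

```python
import math

def mle_parameters(o11, o12, o21, o22):
    """Maximum-likelihood estimates (pi, pi1, pi2) as relative frequencies."""
    n = o11 + o12 + o21 + o22
    pi = o11 / n             # estimate for P(U=u and V=v)
    pi1 = (o11 + o12) / n    # estimate for P(U=u)
    pi2 = (o11 + o21) / n    # estimate for P(V=v)
    return pi, pi1, pi2

def mi_from_mle(pi, pi1, pi2):
    """Plug the estimates into log(pi / (pi1 * pi2)); natural log, assumes pi > 0."""
    return math.log(pi / (pi1 * pi2))

def dice_from_mle(pi, pi1, pi2):
    """Dice coefficient written in terms of the estimated probabilities."""
    return 2 * pi / (pi1 + pi2)

params = mle_parameters(50, 50, 50, 850)
print(params, mi_from_mle(*params), dice_from_mle(*params))
```

With O11 = 1 in a large sample the same code still returns estimates, but they rest on a single observation.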

The Null Hypothesis

- null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)
- not all parameters determined
- MLE maximise probability of observed data under H0

Likelihood Measures

- probability of observed data under H0 (with MLE)
- probability of single cell: X11 should be most "informative"

Likelihood Measures

- small likelihood values = strong association
- computed probabilities are often extremely small
- use negative base-10 logarithm → more convenient scale, high scores indicate strong association

Problems of Likelihood Measures

- three reasons for low likelihood
- observed data is inconsistent with the null hypothesis because of strong association
- association may also be negative (fewer coocs. than expected)
- observed data is consistent, but probability mass is spread across many similar contingency tables

Problems of Likelihood Measures

- high frequency = low likelihood
- e.g. binomial likelihood
- O11=1, E11=1 → L = 0.3679
- O11=1000, E11=1000 → L = 0.0126
- O11=4, E11=1 → L ≈ 0.0126
- need to "normalise" likelihood
- NB: likelihood association measures often have good empirical results nonetheless
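
The effect can be reproduced with a small sketch of the binomial likelihood of the single cell X11, together with the negative base-10 logarithm as a score. The slide does not give the sample size, so N = 1,000,000 below is an assumption, and the function name is hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on a log scale (0 < p < 1)."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_p)

N = 1_000_000  # assumed sample size, not stated in the slides

for o11, e11 in [(1, 1), (1000, 1000), (4, 1)]:
    likelihood = binomial_pmf(o11, N, e11 / N)   # H0 success probability E11 / N
    score = -math.log10(likelihood)              # higher score = lower likelihood
    print(o11, e11, likelihood, score)
```

Under this assumption the first two cases come out at roughly 0.3679 and 0.0126. The third case lands in the same range as the second, although O11 = E11 in one and O11 = 4 · E11 in the other.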

Likelihood Ratios

- simplest normalisation technique
- divide maximum probability of data under H0 by unconstrained maximum probability
- suggested by Dunning (1993)
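
The formula is not preserved in the transcript. A common way to write the resulting statistic is G² = −2 log λ = 2 Σ Oij log(Oij / Eij), where λ is the ratio described above. The sketch below uses that form; the function name is hypothetical and cells with zero counts are skipped, which is the usual convention.

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    """G^2 = 2 * sum(O_ij * log(O_ij / E_ij)) over the four cells."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # A cell with O_ij = 0 contributes 0 to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

print(log_likelihood_ratio(50, 50, 50, 850))  # strong association, large value
print(log_likelihood_ratio(10, 90, 90, 810))  # observed = expected, value 0
```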

Statistical Hypothesis Tests

- compute probability of group of outcomes instead of single one
- observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0
- total probability is known as the p-value or significance
- problem: ranking of cont. tables

Asymptotic Tests

- asymptotic tests define ranking of contingency tables explicitly
- compute test statistic from data
- higher values = more evidence against H0
- can use test statistic as an AM
- theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)

Asymptotic Tests

- standard test for independence is Pearson's chi-squared test
- limiting distribution = χ² distribution with df=1
- number of degrees of freedom was subject of a long debate
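
As an illustration, Pearson's statistic for a 2×2 table and its approximate p-value can be computed with the standard library alone. The sketch assumes all row and column sums are positive and uses the identity that the upper tail of the χ² distribution with df=1 equals erfc(√(x/2)); the function names are hypothetical.

```python
import math

def pearson_chi_squared(o11, o12, o21, o22):
    """X^2 = sum((O_ij - E_ij)^2 / E_ij) over the four cells."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_p_value_df1(x):
    """Upper-tail probability of the chi-squared distribution with df=1."""
    return math.erfc(math.sqrt(x / 2))

x2 = pearson_chi_squared(50, 50, 50, 850)
print(x2, chi2_p_value_df1(x2))
```

For ranking purposes the statistic itself can serve as the association score, so the p-value is only needed when an absolute significance level matters.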

Two-Sided Tests

- chi-squared test is two-sided, i.e. no difference between positive and negative association
- ignore small number of pairs with (non-total) negative association
- or convert to one-sided test:reject H0 only when O11 > E11
- p-value is usually divided by 2

Yates' Continuity Correction

- Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution ("normal theory")
- estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors

Yates' Continuity Correction

- generic form of Yates' continuity correction for contingency tables
- usefulness is still controversial (criticised as too conservative)
- applicability for chi-squared test is generally accepted
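
The corrected formula is not preserved in the transcript. The sketch below uses the textbook form, in which each absolute difference |Oij − Eij| is reduced by 0.5 before squaring. It assumes |Oij − Eij| ≥ 0.5 in every cell and does not handle the case where the difference is smaller; the function name is hypothetical.

```python
def chi_squared_yates(o11, o12, o21, o22):
    """Chi-squared statistic with Yates' correction: sum((|O - E| - 0.5)^2 / E)."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

print(chi_squared_yates(50, 50, 50, 850))
```

The corrected value is always somewhat lower than the uncorrected statistic for the same table, which is why the correction is described as conservative.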

Asymptotic Tests

- different form of chi-squared test (comparison of two binomials) is equivalent to independence test
- special eq. with Yates' correction

Asymptotic Tests

- can also use log-likelihood ratio as a test statistic (two-sided)
- limiting distribution is found to be χ² distribution with df=1
- more conservative than Pearson's chi-squared test
- Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

Something I'd Rather Not Mention

- Church & Hanks: O11 and E11 are both random variables
- H0: expected values are equal
- assume normal distribution with unknown variance
- compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data

Something I'd Rather Not Mention

- one-sided test
- statistical model is questionable
- limiting distribution: t-distribution with df ≈ N
- even more conservative than log-likelihood (low-frequency data)
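
The slide's formula is not preserved in the transcript. The form of the t-score most often used in the collocation literature is (O11 − E11) / √O11, where the variance is estimated by the observed frequency. The sketch below assumes that form; the function name is hypothetical.

```python
import math

def t_score(o11, e11):
    """t-score (O11 - E11) / sqrt(O11); assumes O11 > 0."""
    return (o11 - e11) / math.sqrt(o11)

print(t_score(50, 10))     # frequent pair: about 5.66
print(t_score(1, 0.001))   # single cooccurrence: just under 1
```

Because the numerator cannot exceed O11, the score is bounded by √O11, which is one way to see why low-frequency pairs are ranked so conservatively.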

Exact Tests

- problem: how to establish ranking of contingency tables
- solution: reduce set of alternatives
- if we consider only the cell X11, the difference X11 − E11 gives a sensible ranking → binomial test

Exact Tests

- another solution: marginal frequencies do not provide evidence for or against H0 ("ancillary" statistics)
- condition on fixed row and column sums R1, R2, C1, C2
- conditional hypergeometric distribution does not depend on parameters π1 and π2

Exact Tests

- X11 is the only free parameter
- we can use X11 – E11 for ranking
- Fisher's exact test (Pedersen 1996)
- computationally expensive
- numerical difficulties
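
A minimal sketch of the one-sided version of the test, assuming the p-value is the hypergeometric probability of observing X11 ≥ O11 with all row and column sums held fixed. The function name is hypothetical; exact integer arithmetic is used to sidestep underflow, at the cost of speed for large samples.

```python
from math import comb

def fisher_exact_one_sided(o11, o12, o21, o22):
    """P(X11 >= O11) under the hypergeometric distribution with fixed margins."""
    n = o11 + o12 + o21 + o22
    r1 = o11 + o12            # first row sum
    c1 = o11 + o21            # first column sum
    k_max = min(r1, c1)       # largest possible value of X11
    # P(X11 = k) = C(R1, k) * C(N - R1, C1 - k) / C(N, C1)
    numerator = sum(comb(r1, k) * comb(n - r1, c1 - k) for k in range(o11, k_max + 1))
    return numerator / comb(n, c1)

print(fisher_exact_one_sided(2, 0, 0, 2))      # 1/6
print(fisher_exact_one_sided(50, 50, 50, 850)) # very small p-value
```

The binomial coefficients grow very quickly with N, which illustrates the computational cost noted above.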

Comparing Hypothesis Tests

- Fisher's test is now widely accepted as most appropriate
- tends to be conservative
- log-likelihood gives good approximation of "correct" p-values (slightly less conservative)
- chi-squared over-estimates
- t-score far too conservative

Other Approaches to Measuring Association

- information-theoretic (MI, entropy) → equivalent to log-likelihood
- combined measures ("boosting")
- conservative estimates instead of MLE (confidence intervals)
- hypothesis tests with different null hypothesis: π = C · π1 π2
- mixture of conservative estimates and hypothesis tests?

Implementation

- one-sided vs. two-sided tests
- need special software to obtain p-values for asymptotic tests
- numerical accuracy
- beware of zero frequencies!

Errr.... Help!? Software?

- Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]
- UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]

More Association Measures

- lots of association measures
- will be updated
- references
- slides from this course
- under construction

Comparing Association Measures

- mathematical discussion
- very complex
- results only for special cases
- numerical simulation
- computationally expensive
- Dunning (1993, 1998)
- lazy man's approach
- construct mock data set where frequencies vary systematically
