
# Association Measures



## PowerPoint Slideshow about 'Association Measures' - kareem


### Association Measures

• we will only use data from contingency tables

• we will consider each pair type on its own, independently from all other pair types (→ no distributional information)

• we won't distinguish between relational and positional cooccurrences

• goal: assign association score to each pair type = strength of association between components

• high score = strong association

• association in a statistical sense, but there is no precise definition

• positive vs. negative association ("colourless green ideas")

• absolute values (cut-off threshold)

• input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful

• ranking of collocation candidates → only relative scores matter

• rank collocates of given base → one marginal frequency fixed → only two free parameters

• Workshop on Mechanized Documentation (Washington, 1964)

• proportions between 0 and 1

• high proportion = strong (directional) association

• need to combine two proportions into a single association score

• average (P1 + P2) / 2 is not useful

• f=1, f1=1, f2=1000 → avg.=0.5005

• f=50, f1=100, f2=100 → avg.=0.5

→ more "conservative" weighting

• harmonic mean

• geometric mean

• minimum

• Jaccard

• coefficients range from 0 to 1

• 1 = total (positive) association

• interpretation of lower scores is less clear

• positive vs. negative association?

• which score for no association?
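The effect of the different weightings can be checked on the two invented examples from the list above. The sketch below assumes that the two proportions are P1 = f/f1 and P2 = f/f2 (this reproduces both averages given on the slide) and that "Jaccard" denotes the usual coefficient f / (f1 + f2 − f); the function names are illustrative only.

```python
import math

def proportions(f, f1, f2):
    """Directional proportions; assumed to be P1 = f/f1 and P2 = f/f2."""
    return f / f1, f / f2

def coefficients(f, f1, f2):
    """Combine the two proportions into single scores (requires f > 0)."""
    p1, p2 = proportions(f, f1, f2)
    return {
        "average": (p1 + p2) / 2,
        "harmonic": 2 * p1 * p2 / (p1 + p2),  # equals 2f / (f1 + f2)
        "geometric": math.sqrt(p1 * p2),      # equals f / sqrt(f1 * f2)
        "minimum": min(p1, p2),
        "jaccard": f / (f1 + f2 - f),
    }

rare = coefficients(1, 1, 1000)       # f=1, f1=1, f2=1000
frequent = coefficients(50, 100, 100)  # f=50, f1=100, f2=100

for name in rare:
    print(f"{name:10s} {rare[name]:.4f} {frequent[name]:.4f}")
```

Only the plain average scores the first pair as high as the second; the four more conservative combinations all rank it far lower.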

• what is "no association"? → random combinations

• assume that types u and v cooccur only by chance

• f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens

• each instance of u has a chance of f2(v)/N to cooccur with a v

→ expected # of cooccurrences: E = f1(u) · f2(v) / N

• expected frequencies for all cells of the contingency table

• assuming random combinations (→ statistical independence)

• comparison of expected against observed frequencies

• note that row and column sums are the same for both tables
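A minimal sketch of this step, assuming the usual 2×2 layout in which O11 is the cooccurrence frequency f, the first row sum is f1 and the first column sum is f2. Each expected cell is (row sum × column sum) / N, so the marginals of both tables agree by construction. The helper names are hypothetical.

```python
def observed_table(f, f1, f2, n):
    """Build the 2x2 table of observed frequencies from f, f1, f2 and N."""
    return [[f, f1 - f],
            [f2 - f, n - f1 - f2 + f]]

def expected_table(obs):
    """Expected frequencies under independence: row sum * column sum / N."""
    n = sum(sum(row) for row in obs)
    rows = [sum(row) for row in obs]
    cols = [sum(col) for col in zip(*obs)]
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

obs = observed_table(50, 100, 100, 1000)
exp = expected_table(obs)
print(obs)  # observed cells O11, O12 / O21, O22
print(exp)  # expected cells E11, E12 / E21, E22
```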

• compares O11 with E11

• ratio O11/E11 ranges from 0 to ∞

• 1 = no association (O11=E11)

• usually logarithmic values

• range: −∞ to +∞

• 0 = no assoc., < 0 neg., > 0 pos.

• used in English lexicography

Low-Frequency Pairs & Random Variation

• large amount of low-frequency data (consequence of Zipf's law)

• a simple (invented) example

• A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5

• B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=.001, MI = log 1000
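The two cases can be recomputed directly. This is a small sketch in which the base of the logarithm is a free parameter, since the slide does not fix one; base 2 is used as the default here purely as an assumption.

```python
import math

def oe_ratio(f, f1, f2, n):
    """Ratio O11 / E11 with E11 = f1 * f2 / N."""
    return f * n / (f1 * f2)

def mutual_information(f, f1, f2, n, base=2):
    """MI score = log(O11 / E11); undefined for f = 0."""
    return math.log(oe_ratio(f, f1, f2, n), base)

ratio_a = oe_ratio(50, 100, 100, 1000)  # case A
ratio_b = oe_ratio(1, 1, 1, 1000)       # case B
print(ratio_a, mutual_information(50, 100, 100, 1000))
print(ratio_b, mutual_information(1, 1, 1, 1000))
```

The single cooccurrence in case B receives a much higher score than the fifty cooccurrences in case A.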


• three problems with case B

• how meaningful is a single example? (not very much, actually)

• could well be a spelling mistake or noise from automatic processing

• we want to make generalisations (from particular corpus to "language")

→ this is the domain of statistics: draw inferences about population (=language) from a sample (=corpus)

The Statistical Model: Random Sample

• assumption: corpus data is a random sample from the language

→ base data is a random sample from all coocs. in the language


• random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token

• notation: U and V as "prototypes"

• for a given pair type (u,v), contingency table can be computed from Ui and Vi

→ random variables X11, X12, X21, X22


• population parameters11, 12, 21, 22 for pair type (u,v)

• observed frequenciesO11, O12, O21, O22 represent one particular realisation of the sample

• theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters 11, 12, 21, 22

• vector notation for cont. tables

• population  general language

• restricted to domain(s), genre(s), ...covered by source corpus

• e.g. black box in computer science vs. newspapers vs. cooking

• multinomial sampling distribution

• each individual cell count Xij has a binomial distribution (but these are not independent)

• given assumptions about the population parameters, we can compute the likelihood of the observed contingency table

• relatively high likelihood = consistent with assumptions

• relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)

• particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)

• randomness assumption (random sample from fixed population)

• independence of pair tokens

• constancy of population parameters

• violations problematic only when they affect sampling distribution

• three causes of non-randomness

• local dependencies (e.g. syntax) → usually not problematic

• inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population

• repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)

• population parameters (, 1, 2) are unknown

• best guess from observation: MLE = maximum-likelihood estimate

• conditional probabilities with MLE

• Dice coefficient etc. are MLE for population characteristics

• MI is MLE for log(π / (π1 · π2))

→ unreliable for small frequencies

• null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)

• not all parameters determined

• MLE maximise probability of observed data under H0

• probability of observed data under H0 (with MLE)

• probability of single cell: X11 should be most "informative"

• small likelihood values = strong association

• computed probabilities are often extremely small

• use negative base-10 logarithm → more convenient scale → high scores indicate strong association

Problems of Likelihood Measures

• three reasons for low likelihood

• observed data is inconsistent with the null hypothesis because of strong association

• association may also be negative (fewer coocs. than expected)

• observed data is consistent, but probability mass is spread across many similar contingency tables


• high frequency = low likelihood

• e.g. binomial likelihood

• O11=1, E11=1 L = 0.3679

• O11=1000, E11=1000 L = 0.0126

• O11=4, E11=1 L  0.0126

• need to "normalise" likelihood

• NB: likelihood association measures often have good empirical results nonetheless
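The frequency effect can be reproduced with a short sketch of the binomial likelihood of the single cell X11 under H0, with success probability E11/N. The sample size is not stated on the slide, so `N = 1_000_000` below is a purely hypothetical choice; for large N the first two slide values are matched closely. The computation is done in log space because the individual factors overflow or underflow for large counts.

```python
import math

def binomial_likelihood(o11, e11, n):
    """P(X11 = o11) for X11 ~ Binomial(n, e11 / n), computed via logarithms."""
    p = e11 / n
    log_l = (math.lgamma(n + 1) - math.lgamma(o11 + 1) - math.lgamma(n - o11 + 1)
             + o11 * math.log(p) + (n - o11) * math.log1p(-p))
    return math.exp(log_l)

def likelihood_score(o11, e11, n):
    """Negative base-10 logarithm, so that high scores mean low likelihood."""
    return -math.log10(binomial_likelihood(o11, e11, n))

N = 1_000_000  # hypothetical sample size, not taken from the slides

for o11, e11 in [(1, 1), (1000, 1000), (4, 1)]:
    print(o11, e11, binomial_likelihood(o11, e11, N), likelihood_score(o11, e11, N))
```

A pair observed exactly as often as expected (O11 = E11 = 1000) receives a higher score than O11 = E11 = 1, which illustrates why the raw likelihood needs to be normalised.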

• simplest normalisation technique

• divide maximum probability of data under H0 by unconstrained maximum probability

• suggested by Dunning (1993)
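For the multinomial model, the ratio λ of the two maximised probabilities simplifies to the product of (Eij / Oij)^Oij over the four cells, and the score commonly reported is −2 log λ = 2 Σ Oij log(Oij / Eij). The sketch below uses this standard form; it is not taken from the slides, and the function name is illustrative. Cells with Oij = 0 contribute nothing to the sum and are skipped to avoid log(0).

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    """-2 log(lambda) = 2 * sum(O * log(O / E)) over the four cells."""
    obs = [[o11, o12], [o21, o22]]
    n = o11 + o12 + o21 + o22
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:  # zero cells contribute 0 in the limit
                e = rows[i] * cols[j] / n
                total += o * math.log(o / e)
    return 2 * total

print(log_likelihood_ratio(50, 50, 50, 850))  # f=50, f1=100, f2=100, N=1000
print(log_likelihood_ratio(1, 0, 0, 999))     # f=1, f1=1, f2=1, N=1000
print(log_likelihood_ratio(10, 90, 90, 810))  # observed = expected
```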

• compute probability of group of outcomes instead of single one

• observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0

• total probability is known as the p-value or significance

• problem: ranking of cont. tables

• asymptotic tests define the ranking of contingency tables explicitly

• compute test statistic from data

• higher values = more evidence against H0

• can use test statistic as an AM

• theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)

• standard test for independence is Pearson's chi-squared test

• limiting distribution = 2 distribution with df=1

• number of degrees of freedom was subject of a long debate

• chi-squared test is two-sided, i.e. no difference between positive and negative association

• ignore small number of pairs with (non-total) negative association

• or convert to one-sided test: reject H0 only when O11 > E11

• p-value is usually divided by 2

• Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution (→ "normal theory")

• estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors

• generic form of Yates' continuity correction for contingency tables

• usefulness is still controversial (criticised as too conservative)

• applicability for chi-squared test is generally accepted

• different form of chi-squared test (comparison of two binomials) is equivalent to independence test

• special eq. with Yates' correction
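The equations for these slides did not survive in the transcript, so the sketch below falls back on the textbook 2×2 form X² = N (O11·O22 − O12·O21)² / (R1·R2·C1·C2), with Yates' correction applied by subtracting N/2 from the absolute difference. The p-value for df=1 uses the identity P(χ² > x) = erfc(√(x/2)). All function names are illustrative, and the code assumes that no row or column sum is zero.

```python
import math

def chi_squared(o11, o12, o21, o22, yates=False):
    """Pearson's X^2 for a 2x2 table, optionally with Yates' correction."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    diff = abs(o11 * o22 - o12 * o21)
    if yates:
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / (r1 * r2 * c1 * c2)

def chi2_p_value(x2):
    """Two-sided p-value from the chi-squared distribution with df=1."""
    return math.erfc(math.sqrt(x2 / 2))

def one_sided_p_value(o11, o12, o21, o22, yates=False):
    """Halve the p-value when O11 > E11; otherwise use the complement."""
    n = o11 + o12 + o21 + o22
    e11 = (o11 + o12) * (o11 + o21) / n
    p = chi2_p_value(chi_squared(o11, o12, o21, o22, yates))
    return p / 2 if o11 > e11 else 1 - p / 2

table = (50, 50, 50, 850)  # f=50, f1=100, f2=100, N=1000
print(chi_squared(*table), chi_squared(*table, yates=True))
print(one_sided_p_value(*table))
```

With this convention, a pair with negative association never receives a small one-sided p-value, which matches the rule of rejecting H0 only when O11 > E11.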

• can also use log-likelihood ratio as a test statistic (two-sided)

• limiting distribution is found to be χ² distribution with df=1

• more conservative than Pearson's chi-squared test

• Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

• Church & Hanks: O11 and E11 are both random variables

• H0: expected values are equal

• assume normal distribution with unknown variance

• compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data

• one-sided test

• statistical model is questionable

• limiting distribution: t-distribution with df ≈ N

• even more conservative than log-likelihood (low-frequency data)
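The slides do not spell out the resulting statistic. A commonly used form of the t-score in collocation work is t = (O11 − E11) / √O11, where the unknown variance is estimated by O11; the sketch below adopts that form as an assumption and applies it to the invented examples A and B used for MI.

```python
import math

def t_score(f, f1, f2, n):
    """t = (O11 - E11) / sqrt(O11), assuming variance is estimated by O11."""
    e11 = f1 * f2 / n
    return (f - e11) / math.sqrt(f)  # undefined for f = 0

t_a = t_score(50, 100, 100, 1000)  # case A: O11=50, E11=10
t_b = t_score(1, 1, 1, 1000)       # case B: O11=1,  E11=0.001
print(t_a, t_b)
```

Under this score the low-frequency pair B ranks well below A, the opposite of the MI ranking.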

• problem: how to establish ranking of contingency tables

• solution: reduce set of alternatives

• if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking → binomial test

• another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)

• condition on fixed row and column sums R1, R2, C1, C2

• conditional hypergeometric distribution does not depend on parameters π1 and π2

• X11 is the only free parameter

• we can use X11 – E11 for ranking

• Fisher's exact test (Pedersen 1996)

• computationally expensive

• numerical difficulties

• Fisher's test is now widely accepted as most appropriate

• tends to be conservative

• log-likelihood gives good approximation of "correct" p-values (slightly less conservative)

• chi-squared over-estimates

• t-score far too conservative
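A one-sided Fisher p-value can be sketched directly from the hypergeometric distribution: with all row and column sums fixed, it is the total probability of every table whose X11 is at least the observed O11. The version below uses exact integer arithmetic from Python's `math.comb`, which sidesteps the numerical difficulties at the cost of speed; it is an illustrative sketch, not the implementation used in NSP or the UCS Toolkit.

```python
from math import comb

def fisher_p_value(o11, o12, o21, o22):
    """One-sided p-value P(X11 >= o11) with fixed row and column sums."""
    n = o11 + o12 + o21 + o22
    r1 = o11 + o12  # first row sum
    c1 = o11 + o21  # first column sum
    k_max = min(r1, c1)
    # Number of tables with X11 = k is C(r1, k) * C(n - r1, c1 - k).
    tail = sum(comb(r1, k) * comb(n - r1, c1 - k) for k in range(o11, k_max + 1))
    return tail / comb(n, c1)

p_a = fisher_p_value(50, 50, 50, 850)  # f=50, f1=100, f2=100, N=1000
p_b = fisher_p_value(1, 0, 0, 999)     # f=1, f1=1, f2=1, N=1000
print(p_a, p_b)
```

For the single-occurrence pair the p-value is exactly 1/N, so the evidence against H0 is far weaker than its MI score suggests.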

Other Approaches to Measuring Association

• information-theoretic (MI, entropy) equivalent to log-likelihood

• combined measures ("boosting")

• conservative estimates instead of MLE (confidence intervals)

• hypothesis tests with different null hypothesis: π = C · π1 · π2

• mixture of conservative estimates and hypothesis tests?

• one-sided vs. two-sided tests

• need special software to obtain p-values for asymptotic tests

• numerical accuracy

• beware of zero frequencies!

• Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]

• UCS Toolkit will be available soon from www.collocations.de [Perl/Linux, some prerequisites, for the more ambitious :o) ]

• lots of association measures

• will be updated

• references

• slides from this course

• under construction

• mathematical discussion

• very complex

• results only for special cases

• numerical simulation

• computationally expensive

• Dunning (1993, 1998)

• lazy man's approach

• construct mock data set where frequencies vary systematically