Association Measures

PowerPoint Slideshow about 'Association Measures' - kareem

Presentation Transcript
General Remarks
  • we will only use data from contingency tables
  • we will consider each pair type on its own, independently from all other pair types (→ no distributional information)
  • we won't distinguish between relational and positional cooccurrences
Association Measures (AMs)
  • goal: assign association score to each pair type = strength of association between components
  • high score = strong association
  • association in a statistical sense, but there is no precise definition
  • positive vs. negative association ("colourless green ideas")
Using Association Scores
  • absolute values (cut-off threshold)
  • input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful
  • ranking of collocation candidates → only relative scores matter
  • rank collocates of given base → one marginal frequency fixed → only two free parameters
First Steps: Proportions
  • Workshop on Mechanized Documentation (Washington, 1964)
First Steps: Proportions
  • proportions between 0 and 1
  • high proportion = strong (directional) association
  • need to combine two proportions into a single association score
  • average (P1 + P2) / 2 is not useful
    • f=1, f1=1, f2=1000 → avg. = 0.5005
    • f=50, f1=100, f2=100 → avg. = 0.5

→ more "conservative" weighting

First Steps: Proportions
  • harmonic mean
  • geometric mean
  • minimum
  • Jaccard
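
The formulas for these coefficients did not survive in the transcript. The following is a minimal sketch of their usual definitions, assuming that the two proportions are P1 = f/f1 and P2 = f/f2 (which matches the averages quoted on the previous slide). Function names are illustrative only.

```python
import math

def proportions(f, f1, f2):
    """Directional proportions P1 = f/f1 and P2 = f/f2 (assumed definitions)."""
    return f / f1, f / f2

def arithmetic_mean(f, f1, f2):
    p1, p2 = proportions(f, f1, f2)
    return (p1 + p2) / 2

def harmonic_mean(f, f1, f2):
    # 2*P1*P2 / (P1 + P2) simplifies to 2f / (f1 + f2), the Dice coefficient
    return 2 * f / (f1 + f2)

def geometric_mean(f, f1, f2):
    # sqrt(P1 * P2) simplifies to f / sqrt(f1 * f2)
    return f / math.sqrt(f1 * f2)

def minimum(f, f1, f2):
    return min(proportions(f, f1, f2))

def jaccard(f, f1, f2):
    # cooccurrences divided by the number of pairs containing u or v
    return f / (f1 + f2 - f)

# The two examples from the slides: a hapax pair and a frequent pair.
for f, f1, f2 in [(1, 1, 1000), (50, 100, 100)]:
    scores = [m(f, f1, f2) for m in
              (arithmetic_mean, harmonic_mean, geometric_mean, minimum, jaccard)]
    print((f, f1, f2), [round(s, 4) for s in scores])
```

Unlike the arithmetic mean, the other four coefficients give the pair f=1, f1=1, f2=1000 a score close to 0, which is the more "conservative" weighting asked for.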
First Steps: Proportions
  • coefficients range from 0 to 1
  • 1 = total (positive) association
  • interpretation of lower scores is less clear
  • positive vs. negative association?
  • which score for no association?
  • what is "no association"? → random combinations
Expected Frequencies
  • assume that types u and v cooccur only by chance
  • f1(u) occurrences of u and f2(v) occurrences of v spread randomly over N tokens
  • each instance of u has a chance of f2(v)/N to cooccur with a v

→ expected number of cooccurrences: E = f1(u) · f2(v) / N

Expected Frequencies
  • expected frequencies for all cells of the contingency table
  • assuming random combinations (→ statistical independence)
Expected Frequencies
  • comparison of expected against observed frequencies
  • note that row and column sums are the same for both tables
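
The tables themselves are missing from the transcript. As a rough sketch, the expected frequencies under independence can be computed from the row sums (R1, R2) and column sums (C1, C2) of the observed table as Eij = Ri · Cj / N. The function name and the list-of-lists layout are assumptions made for illustration.

```python
def expected_table(o11, o12, o21, o22):
    """Expected cell frequencies under independence: E_ij = R_i * C_j / N."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row sums
    c1, c2 = o11 + o21, o12 + o22   # column sums
    return [[r1 * c1 / n, r1 * c2 / n],
            [r2 * c1 / n, r2 * c2 / n]]

# Invented example: f=50, f1=100, f2=100, N=1000
observed = [[50, 50], [50, 850]]
expected = expected_table(50, 50, 50, 850)

# Row and column sums agree between the observed and the expected table.
for i in range(2):
    assert abs(sum(observed[i]) - sum(expected[i])) < 1e-9
    assert abs(observed[0][i] + observed[1][i]
               - expected[0][i] - expected[1][i]) < 1e-9
print(expected)
```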
Mutual Information
  • compares O11 with E11
  • ratio O11/E11 ranges from 0 to ∞
  • 1 = no association (O11=E11)
  • usually logarithmic values
  • range: −∞ to +∞
  • 0 = no assoc., < 0 neg., > 0 pos.
  • used in English lexicography
Low-Frequency Pairs & Random Variation
  • large amount of low-frequency data (consequence of Zipf's law)
  • a simple (invented) example
    • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5
    • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=.001, MI = log 1000
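
The two invented pairs can be checked with a few lines of code. This is a minimal sketch: the slides do not state the base of the logarithm, so base 10 is an assumption here, and the function name is illustrative.

```python
import math

def mutual_information(o11, f1, f2, n, base=10):
    """MI = log(O11 / E11), with E11 = f1 * f2 / N (log base is an assumption)."""
    e11 = f1 * f2 / n
    return math.log(o11 / e11) / math.log(base)

mi_a = mutual_information(50, 100, 100, 1000)  # log 5
mi_b = mutual_information(1, 1, 1, 1000)       # log 1000
print(round(mi_a, 3), round(mi_b, 3))
```

The single cooccurrence in case B receives a much higher score than the fifty cooccurrences in case A.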
Low-Frequency Pairs & Random Variation
  • three problems with case B
    • how meaningful is a single example? (not very much, actually)
    • could well be a spelling mistake or noise from automatic processing
    • we want to make generalisations (from particular corpus to "language")

→ this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)

The Statistical Model: Random Sample
  • assumption: corpus data is a random sample from the language

→ base data is a random sample from all coocs. in the language

The Statistical Model: Random Sample
  • random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token
  • notation: U and V as "prototypes"
  • for a given pair type (u,v), contingency table can be computed from Ui and Vi

→ random variables X11, X12, X21, X22
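
To make the step from pair tokens to a contingency table concrete, here is a small sketch. The sample of pair tokens is invented, and the function name is illustrative; each token is classified by whether its first component equals u and whether its second component equals v.

```python
def contingency_table(pair_tokens, u, v):
    """Count the four cells of the contingency table for pair type (u, v)."""
    x11 = x12 = x21 = x22 = 0
    for first, second in pair_tokens:
        if first == u and second == v:
            x11 += 1          # U = u, V = v
        elif first == u:
            x12 += 1          # U = u, V != v
        elif second == v:
            x21 += 1          # U != u, V = v
        else:
            x22 += 1          # U != u, V != v
    return x11, x12, x21, x22

# Invented toy sample of N = 5 pair tokens
sample = [("black", "box"), ("black", "cat"), ("red", "box"),
          ("red", "cat"), ("black", "box")]
print(contingency_table(sample, "black", "box"))
```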

The Statistical Model: Random Sample
  • population parameters τ11, τ12, τ21, τ22 for pair type (u,v)
  • observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample
  • theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters τ11, τ12, τ21, τ22
Two Footnotes
  • vector notation for cont. tables
  • population ≠ general language
    • restricted to domain(s), genre(s), ... covered by source corpus
    • e.g. black box in computer science vs. newspapers vs. cooking
The Sampling Distribution
  • multinomial sampling distribution
  • each individual cell count Xij has a binomial distribution (but these are not independent)
The Sampling Distribution
  • given assumptions about the population parameters, we can compute the likelihood of the observed contingency table
  • relatively high likelihood = consistent with assumptions
  • relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)
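
As an illustration, the likelihood of an observed table under the multinomial sampling distribution can be sketched as follows. The formula used is the standard multinomial probability N! / (O11! O12! O21! O22!) times the product of the cell probabilities raised to the observed counts; the function name and argument layout are assumptions. Working with logarithms avoids underflow for realistic sample sizes.

```python
import math

def multinomial_log_likelihood(observed, probs):
    """Natural log of the multinomial probability of the observed cell counts.

    observed: cell counts (O11, O12, O21, O22)
    probs:    assumed cell probabilities, summing to 1
    """
    n = sum(observed)
    log_l = math.lgamma(n + 1)            # log N!
    for o, p in zip(observed, probs):
        log_l -= math.lgamma(o + 1)       # minus log O_ij!
        if o > 0:
            log_l += o * math.log(p)      # plus O_ij * log p_ij
    return log_l

# Toy check: four tokens, one in each cell, uniform cell probabilities
print(math.exp(multinomial_log_likelihood((1, 1, 1, 1), (0.25,) * 4)))
```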
Adequacy of the Statistical Model
  • particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency)
  • randomness assumption (random sample from fixed population)
    • independence of pair tokens
    • constancy of population parameters
  • violations problematic only when they affect sampling distribution
Adequacy of the Statistical Model
  • three causes of non-randomness
    • local dependencies (e.g. syntax) → usually not problematic
    • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population
    • repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)
Making Assumptions about the Population Parameters
  • population parameters (π, π1, π2) are unknown
  • best guess from observation: MLE = maximum-likelihood estimate
Making Assumptions about the Population Parameters
  • conditional probabilities with MLE
  • Dice coefficient etc. are MLE for population characteristics
  • MI is MLE for log(π / (π1 · π2))

→ unreliable for small frequencies
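
The link between MI and maximum-likelihood estimation can be written out as a short derivation. It assumes that π is the probability of the pair (u,v), that π1 and π2 are the marginal probabilities of u and v, and that R1 and C1 denote the first row and column sums of the observed table; this reading of the notation is an assumption, since the symbols are damaged in the transcript.

```latex
\hat{\pi} = \frac{O_{11}}{N}, \qquad
\hat{\pi}_1 = \frac{R_1}{N}, \qquad
\hat{\pi}_2 = \frac{C_1}{N}

\log \frac{\hat{\pi}}{\hat{\pi}_1 \, \hat{\pi}_2}
  = \log \frac{O_{11} \, N}{R_1 \, C_1}
  = \log \frac{O_{11}}{E_{11}},
\qquad E_{11} = \frac{R_1 \, C_1}{N}
```

With O11 = 1 the estimate of π rests on a single observation, which is why the plug-in value is unreliable for small frequencies.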

The Null Hypothesis
  • null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v)
  • not all parameters determined
  • MLE maximise probability of observed data under H0
Likelihood Measures
  • probability of observed data under H0 (with MLE)
  • probability of single cell: X11 should be most "informative"
Likelihood Measures
  • small likelihood values = strong association
  • computed probabilities are often extremely small
  • use negative base-10 logarithm → more convenient scale → high scores indicate strong association
Problems of Likelihood Measures
  • three reasons for low likelihood
    • observed data is inconsistent with the null hypothesis because of strong association
    • association may also be negative (fewer coocs. than expected)
    • observed data is consistent, but probability mass is spread across many similar contingency tables
Problems of Likelihood Measures
  • high frequency = low likelihood
  • e.g. binomial likelihood
    • O11=1, E11=1 → L = 0.3679
    • O11=1000, E11=1000 → L = 0.0126
    • O11=4, E11=1 L  0.0126
  • need to "normalise" likelihood
  • NB: likelihood association measures often have good empirical results nonetheless
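
The first two binomial likelihood values can be reproduced with a short sketch. The slide does not give the sample size, so N = 1,000,000 is an assumption, as is the reading that X11 is binomial with success probability E11/N under H0. The function name is illustrative.

```python
import math

def binomial_likelihood(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_l = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_l)

N = 1_000_000  # assumed sample size, not stated on the slide

for o11, e11 in [(1, 1), (1000, 1000)]:
    likelihood = binomial_likelihood(o11, N, e11 / N)
    score = -math.log10(likelihood)   # negative base-10 logarithm as the score
    print(o11, e11, round(likelihood, 4), round(score, 2))
```

In both cases the observed frequency equals the expected one, yet the high-frequency pair gets a much lower likelihood and hence a higher score, which is the reason some normalisation is needed.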
Likelihood Ratios
  • simplest normalisation technique
  • divide maximum probability of data under H0 by unconstrained maximum probability
  • suggested by Dunning (1993)
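
The formula is not preserved in the transcript. A common way to compute the resulting score, sketched here under the multinomial model, is the log-likelihood statistic G² = −2 log λ = 2 Σ Oij log(Oij / Eij), where λ is the likelihood ratio described above and Eij are the expected frequencies under independence. The function name is an assumption.

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    """G^2 = 2 * sum of O_ij * log(O_ij / E_ij) over the four cells."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:                      # a zero cell contributes nothing
                e = rows[i] * cols[j] / n  # expected frequency under H0
                g2 += o * math.log(o / e)
    return 2 * g2

print(log_likelihood_ratio(50, 50, 50, 850))   # associated pair: large value
print(log_likelihood_ratio(10, 90, 90, 810))   # exactly independent table
```

Skipping cells with a zero count follows the usual convention that 0 · log 0 = 0.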
Statistical Hypothesis Tests
  • compute probability of group of outcomes instead of single one
  • observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0
  • total probability is known as the p-value or significance
  • problem: ranking of cont. tables
Asymptotic Tests
  • asymptotic tests define ranking of contingency tables explicitly
  • compute test statistic from data
  • higher values = more evidence against H0
  • can use test statistic as an AM
  • theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)
Asymptotic Tests
  • standard test for independence is Pearson's chi-squared test
  • limiting distribution = χ² distribution with df=1
  • number of degrees of freedom was subject of a long debate
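
As a sketch of how the test statistic is obtained, the code below uses the standard closed form for a 2×2 table, X² = N (O11 O22 − O12 O21)² / (R1 R2 C1 C2), which equals the sum of (Oij − Eij)² / Eij over all cells. The p-value uses the fact that a χ² variable with df=1 is the square of a standard normal variable. Function names are illustrative.

```python
import math

def chi_squared(o11, o12, o21, o22):
    """Pearson's X^2 for a 2x2 contingency table (no continuity correction)."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    return n * (o11 * o22 - o12 * o21) ** 2 / (r1 * r2 * c1 * c2)

def chi_squared_p_value(x2):
    """Two-sided p-value from the chi-squared distribution with df=1."""
    return math.erfc(math.sqrt(x2 / 2))

x2 = chi_squared(50, 50, 50, 850)   # invented example: f=50, f1=f2=100, N=1000
print(round(x2, 2), chi_squared_p_value(x2))
```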
Two-Sided Tests
  • chi-squared test is two-sided, i.e. no difference between positive and negative association
  • ignore small number of pairs with (non-total) negative association
  • or convert to one-sided test:reject H0 only when O11 > E11
  • p-value is usually divided by 2
Yates' Continuity Correction
  • Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution (→ "normal theory")
  • estimating probabilities P(Xij ≥ k) from normal distribution introduces systematic errors
Yates' Continuity Correction
  • generic form of Yates' continuity correction for contingency tables
  • usefulness is still controversial (criticised as too conservative)
  • applicability for chi-squared test is generally accepted
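
The slide's formula is missing from the transcript. A widely used form of the corrected statistic for 2×2 tables is sketched below: every difference |Oij − Eij| is reduced by 0.5 before squaring, which is equivalent to X² = N (|O11 O22 − O12 O21| − N/2)² / (R1 R2 C1 C2). Clamping the corrected difference at zero is a common convention and an assumption here.

```python
def chi_squared_yates(o11, o12, o21, o22):
    """Pearson's X^2 for a 2x2 table with Yates' continuity correction."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22
    c1, c2 = o11 + o21, o12 + o22
    # Corrected difference; clamped at zero so the correction never overshoots.
    diff = max(0.0, abs(o11 * o22 - o12 * o21) - n / 2)
    return n * diff ** 2 / (r1 * r2 * c1 * c2)

# Invented example: the corrected value is slightly below the uncorrected 197.53
print(round(chi_squared_yates(50, 50, 50, 850), 2))
```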
Asymptotic Tests
  • different form of chi-squared test (comparison of two binomials) is equivalent to independence test
  • special equation with Yates' correction
Asymptotic Tests
  • can also use log-likelihood ratio as a test statistic (two-sided)
  • limiting distribution is found to be χ² distribution with df=1
  • more conservative than Pearson's chi-squared test
  • Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)
Something I'd Rather Not Mention
  • Church & Hanks: O11 and E11 are both random variables
  • H0: expected values are equal
  • assume normal distribution with unknown variance
  • compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data
Something I'd Rather Not Mention
  • one-sided test
  • statistical model is questionable
  • limiting distribution: t-distribution with df ≈ N
  • even more conservative than log-likelihood (low-frequency data)
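
The slides do not show the formula. The form of the t-score most often quoted in the collocation literature is (O11 − E11) / √O11, where the variance is approximated by the observed frequency; treating this as the intended measure is an assumption. A minimal sketch:

```python
import math

def t_score(o11, f1, f2, n):
    """t-score in its commonly quoted form: (O11 - E11) / sqrt(O11)."""
    e11 = f1 * f2 / n
    return (o11 - e11) / math.sqrt(o11)

# The two invented pairs from the low-frequency example
print(round(t_score(50, 100, 100, 1000), 3))  # pair A
print(round(t_score(1, 1, 1, 1000), 3))       # pair B
```

In this form the score can never exceed √O11, so a pair that occurs only once always stays below 1, which illustrates how conservative the measure is for low-frequency data.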
Exact Tests
  • problem: how to establish ranking of contingency tables
  • solution: reduce set of alternatives
  • if we consider only the cell X11, the difference X11 – E11 gives a sensible ranking: binomial test
Exact Tests
  • another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics)
  • condition on fixed row and column sums R1, R2, C1, C2
  • conditional hypergeometric distribution does not depend on parameters π1 and π2
Exact Tests
  • X11 is the only free parameter
  • we can use X11 – E11 for ranking
  • Fisher's exact test (Pedersen 1996)
  • computationally expensive
  • numerical difficulties
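
A minimal sketch of the one-sided version of Fisher's test follows. With all marginals fixed, X11 has a hypergeometric distribution, and the p-value is the total probability of all tables with X11 at least as large as the observed O11. The function name is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def fisher_exact_one_sided(o11, o12, o21, o22):
    """P(X11 >= O11) under the hypergeometric distribution with fixed marginals."""
    n = o11 + o12 + o21 + o22
    r1 = o11 + o12                 # first row sum
    c1, c2 = o11 + o21, o12 + o22  # column sums
    # Sum exact integer counts of tables, divide once at the end.
    numerator = sum(comb(c1, k) * comb(c2, r1 - k)
                    for k in range(o11, min(r1, c1) + 1))
    return numerator / comb(n, r1)

print(fisher_exact_one_sided(3, 1, 1, 3))     # small toy table
print(fisher_exact_one_sided(50, 50, 50, 850))
```

Exact integer arithmetic sidesteps rounding problems, but the binomial coefficients become very large for realistic corpus sizes, and very small p-values can still underflow to 0.0 in floating point.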
Comparing Hypothesis Tests
  • Fisher's test is now widely accepted as most appropriate
  • tends to be conservative
  • log-likelihood gives good approximation of "correct" p-values (slightly less conservative)
  • chi-squared over-estimates
  • t-score far too conservative
Other Approaches to Measuring Association
  • information-theoretic (MI, entropy) equivalent to log-likelihood
  • combined measures ("boosting")
  • conservative estimates instead of MLE (confidence intervals)
  • hypothesis tests with different null hypothesis: π = C · π1 · π2
  • mixture of conservative estimates and hypothesis tests?
  • one-sided vs. two-sided tests
  • need special software to obtain p-values for asymptotic tests
  • numerical accuracy
  • beware of zero frequencies!
Errr.... Help!? Software?
  • Ted Pedersen's N-gram Statistics Package (NSP) [Perl, portable, easy to use]
  • UCS Toolkit will be available soon from [Perl/Linux, some prerequisites, for the more ambitious :o) ]
More Association Measures
  • lots of association measures
  • will be updated
  • references
  • slides from this course
  • under construction
Comparing Association Measures
  • mathematical discussion
    • very complex
    • results only for special cases
  • numerical simulation
    • computationally expensive
    • Dunning (1993, 1998)
  • lazy man's approach
    • construct mock data set where frequencies vary systematically