
Statistical NLP: Lecture 7


Presentation Transcript


  1. Statistical NLP: Lecture 7 Collocations (Ch 5)

  2. Introduction • Collocations are characterized by limited compositionality. • There is a large overlap between the concepts of collocations and terms, technical term and terminological phrase. • Collocations sometimes reflect interesting attitudes (in English) towards different types of substances: strong cigarettes, tea, coffee versus powerful drug (e.g., heroin). • Interest was limited on the American side but strong on the British side (Firth): the contextual view. • Korean examples: 신발 (shoes) takes the verb 신다, while 양복 (suit) takes 입다 (different verbs for "to wear").

  3. Definition (w.r.t. Computational and Statistical Literature) • [A collocation is defined as] a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]

  4. Other Definitions/Notions (w.r.t. Linguistic Literature) • Collocations are not necessarily adjacent. • Typical criteria for collocations: non-compositionality, non-substitutability, non-modifiability. • Collocations cannot be translated word for word into other languages. • Generalization to weaker cases (strong association of words, but not necessarily fixed occurrence).

  5. Linguistic Subclasses of Collocations • Light verbs: verbs with little semantic content • Verb particle constructions or Phrasal Verbs • Proper Nouns/Names • Terminological Expressions

  6. Overview of the Collocation Detection Techniques Surveyed • Selection of Collocations by Frequency • Selection of Collocations based on the Mean and Variance of the distance between the focal word and the collocating word • Hypothesis Testing • Mutual Information

  7. Frequency (Justeson & Katz, 1995) 1. Selecting the most frequently occurring bigrams: Table 5.1 2. Passing the results through a part-of-speech filter: tag patterns likely to be phrases - A N, N N, A A N, N A N, N N N, N P N - Table 5.3 - Table 5.4: strong vs. powerful → A simple method that works very well (a code sketch follows below).
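A minimal sketch of the frequency-plus-POS-filter idea in Python, assuming the corpus has already been tokenized and tagged with simplified tags (A = adjective, N = noun); the tag set, the example sentence, and the function name are illustrative, not taken from the lecture.

from collections import Counter

# Hypothetical pre-tagged corpus: (word, simplified POS tag) pairs.
tagged = [("New", "A"), ("York", "N"), ("is", "V"), ("a", "D"),
          ("large", "A"), ("city", "N"), ("in", "P"), ("New", "A"),
          ("York", "N"), ("State", "N")]

# Tag patterns likely to be phrases (a subset of the Justeson & Katz patterns).
PATTERNS = {("A", "N"), ("N", "N")}

def candidate_collocations(tagged_tokens, patterns, min_count=2):
    """Count bigrams and keep frequent ones whose tag pattern matches."""
    counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
        if (t1, t2) in patterns
    )
    return [(bigram, c) for bigram, c in counts.most_common() if c >= min_count]

print(candidate_collocations(tagged, PATTERNS))
# [(('New', 'York'), 2)]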

  8. Mean and Variance (I) (Smadja et al., 1993) • Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible relationships. • {knock, hit, beat, rap} {the door, on his door, at the door, on the metal front door} • The method computes the mean and variance of the offset (signed distance) between the two words in the corpus. • Fig. 5.4 • a man knocked on Donaldson’s door (5) • door before knocked (-2), door that she knocked (-3) • If the offsets are randomly distributed (i.e., no collocation), then the variance/sample deviation will be high.

  9. Mean and Variance (II) • n = number of times the two words co-occur • μ = sample mean of the offsets • di = the offset of the i-th co-occurrence • Sample deviation: s = sqrt( Σi (di − μ)² / (n − 1) )
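A small sketch of these offset statistics, assuming we can enumerate the signed offsets at which the two words co-occur; the window size and the sample offsets below are illustrative assumptions, not the lecture's data.

import math

def offsets(tokens, w1, w2, max_dist=4):
    """Collect signed offsets (position of w2 minus position of w1)
    for all co-occurrences within a window of max_dist words."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return [j - i for i in pos1 for j in pos2 if 0 < abs(j - i) <= max_dist]

def mean_and_deviation(d):
    """Sample mean and sample deviation of a list of offsets."""
    n = len(d)
    mu = sum(d) / n
    s = math.sqrt(sum((x - mu) ** 2 for x in d) / (n - 1))
    return mu, s

# Illustrative offsets, e.g. knocked ... door appearing at varying distances.
d = [3, 3, 5, 2, 3]
print(mean_and_deviation(d))  # (3.2, ~1.10): small deviation, likely a collocation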

  10. • Position of “strong” with respect to “opposition” • Mean (d̄): -1.15, standard deviation (s): 0.67 • Fig. 5.2 • Only co-occurrences within a distance of 9 words are considered. • “strong” with respect to “support” • strong leftist support • “strong” with respect to “for” • Table 5.5 • If the mean is larger than 1.0 and the deviation is small: very informative. • If the deviation is large: not very informative. • If the mean is close to 0, the offsets show a roughly normal distribution around the focal word. • Smadja additionally used histograms to filter such cases out. • Terminology was extracted with about 80% accuracy. • knock … door is not terminology, • but knock … door is still useful for natural language generation (choosing one word among several candidates).

  11. Hypothesis Testing: Overview • High frequency and low variance can be accidental. We want to determine whether the co-occurrence is random or whether it occurs more often than chance. • new company • This is a classical problem in statistics called hypothesis testing. • We formulate a null hypothesis H0 (no association, only chance), calculate the probability that the collocation would occur if H0 were true, and then reject H0 if that probability is too low. Otherwise, we retain H0 as possible. • Simplest formulation: p(w1 w2) = p(w1) p(w2)

  12. Hypothesis Testing: The t-test • The t-test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ. • The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely it is to get a sample with that mean and variance assuming that the sample is drawn from a normal distribution with mean μ. • To apply the t-test to collocations, we think of the text corpus as a long sequence of N bigrams.

  13. Hypothesis Testing: Formula • t = (x̄ − μ) / sqrt(s² / N) • x̄ = sample mean, s² = sample variance, N = sample size, μ = mean of the distribution under the null hypothesis

  14. • Example: suppose the population mean height is 158 cm. • A sample of 200 people has mean 169 and variance (s²) 2600. • t = (169 − 158) / sqrt(2600 / 200) ≈ 3.05 • At a significance level of 0.005 (commonly accepted in experimental science), the critical value is 2.576; since 3.05 is larger, • the null hypothesis can be rejected with 99.5% confidence.

  15. In the corpus • new companies • p(new) = 15828/14307668 • p(companies) = 4675/14307668 • H0: p(new companies) = p(new) × p(companies) ≈ 3.615 × 10^-7 • Bernoulli trial: p = 3.615 × 10^-7; the variance is p(1 − p), which is approximately p since p is very small. • new companies occurs 8 times: 8/14307668 = 5.591 × 10^-7 • t = (5.591 − 3.615) × 10^-7 / sqrt(5.591 × 10^-7 / 14307668) ≈ 0.999932 • From the t table, the critical value at 0.005 is 2.576; since our t is smaller, the null hypothesis cannot be rejected. • Explained with Table 5.6. • A code sketch of this computation follows below.
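A minimal sketch of the bigram t-test above, plugging in the counts from this slide; the function name is illustrative.

import math

def bigram_t_score(c1, c2, c12, N):
    """t-score for a bigram: compare the observed bigram probability with
    the probability expected under independence, p(w1) * p(w2)."""
    expected = (c1 / N) * (c2 / N)   # mean under H0
    observed = c12 / N               # sample mean of the bigram indicator
    # For a Bernoulli variable with very small p, the variance p(1-p) is ~ p.
    return (observed - expected) / math.sqrt(observed / N)

# new companies: c(new)=15828, c(companies)=4675, c(new companies)=8
print(bigram_t_score(15828, 4675, 8, 14307668))  # ~0.9999, below 2.576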

  16. Hypothesis testing of differences(Church & Hanks, 1989) • We may also want to find words whose cooccurrence patterns best distinguish between two words. This application can be useful for lexicography. • The t-test is extended to the comparison of the means of two normal populations. • Here, the null hypothesis is that the average difference is 0.

  17. Hypothesis testing of differences (II)

  18. t-test for statistical significance of the difference between two systems

  19. t-test for differences (continued) • Pooled s² = (1081.6 + 1186.9) / (10 + 10) = 113.4 • To conclude at significance level α = 0.05 that System 1 is better than System 2, t must exceed the critical value 1.725 (from a statistics table). • We cannot conclude the superiority of System 1 because of the large variance in the scores.
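A sketch of how such a system comparison could be run in Python with scipy; the per-run scores are invented for illustration, since the slide reports only the pooled variance, and scipy is assumed to be available.

from scipy import stats

# Invented evaluation scores for two systems over 10 test runs each.
system1 = [71, 61, 55, 60, 68, 49, 42, 72, 76, 55]
system2 = [42, 55, 75, 45, 54, 51, 55, 36, 58, 55]

# Two-sample t-test with pooled variance (equal_var=True).
t, p_two_sided = stats.ttest_ind(system1, system2, equal_var=True)
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2

print(f"t = {t:.3f}, one-sided p = {p_one_sided:.3f}")
# Compare t against the one-sided critical value (the slide uses 1.725 at alpha=0.05);
# if t does not exceed it, we cannot conclude that System 1 is better.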

  20. Chi-Square test (I): Method • Use of the t-test has been criticized because it assumes that probabilities are approximately normally distributed (not true, generally). • The Chi-Square test does not make this assumption. • The essence of the test is to compare observed frequencies with frequencies expected for independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

  21. Chi-Square test (II): Formula • X² = Σi,j (Oij − Eij)² / Eij • Oij = observed frequency and Eij = expected frequency (under independence) in cell (i, j) of the contingency table

  22. Computation • A worked example of the computation is shown. • The computation is explained. • For new companies, the null hypothesis still cannot be rejected. • The top 20 bigrams ranked by t-score and by X² are the same. • A code sketch follows below.
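A sketch of the 2x2 chi-square computation for a bigram, using the standard shortcut formula for 2x2 tables and the counts from the new companies example; the function name is illustrative.

def chi_square_bigram(c1, c2, c12, N):
    """Chi-square statistic for the 2x2 contingency table built from
    bigram counts: c1 = c(w1), c2 = c(w2), c12 = c(w1 w2), N = #bigrams."""
    o11 = c12                # w1 followed by w2
    o12 = c1 - c12           # w1 followed by something other than w2
    o21 = c2 - c12           # w2 preceded by something other than w1
    o22 = N - c1 - c2 + c12  # neither
    num = N * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Below the 3.841 critical value (alpha=0.05, 1 df), so independence
# cannot be rejected for new companies.
print(chi_square_bigram(15828, 4675, 8, 14307668))  # ~1.55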

  23. Chi-Square test (III): Applications • One of the early uses of the Chi-Square test in statistical NLP was the identification of translation pairs in aligned corpora (Church & Gale, 1991). • A more recent application is to use Chi-Square as a metric for corpus similarity (Kilgarriff and Rose, 1998). • Nevertheless, the Chi-Square test should not be used when sample sizes are small (small corpora).

  24. • Table 5.10 • While explaining it, measures such as the cosine measure are also mentioned.

  25. Likelihood Ratios I: Within a single corpus (Dunning, 1993) • Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic. • In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram w1 w2: • Hypothesis 1: the occurrence of w2 is independent of the previous occurrence of w1 • Hypothesis 2: the occurrence of w2 is dependent on the previous occurrence of w1

  26. Explanation of the formulas • A binomial distribution is assumed.

  27. Log likelihood • log λ = log L(c12, c1, p) + log L(c2 − c12, N − c1, p) − log L(c12, c1, p1) − log L(c2 − c12, N − c1, p2) • where L(k, n, x) = x^k (1 − x)^(n−k), p = c2/N, p1 = c12/c1, p2 = (c2 − c12)/(N − c1) • c1 = c(w1), c2 = c(w2), c12 = c(w1 w2)

  28. Practical application • −2 log λ is asymptotically χ²-distributed. • The full parameter space is (p1, p2); the subspace (a subset of cases) is p1 = p2. • A code sketch follows below.
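A sketch of Dunning's log-likelihood ratio for a bigram, following the formula reconstructed on slide 27; the counts are again the new companies example and the function names are illustrative.

import math

def log_L(k, n, x):
    """Log-likelihood of k successes in n Bernoulli trials with parameter x."""
    x = min(max(x, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def log_likelihood_ratio(c1, c2, c12, N):
    """-2 log lambda for the bigram w1 w2 (Dunning, 1993)."""
    p = c2 / N                   # H1: P(w2|w1) = P(w2|~w1) = p
    p1 = c12 / c1                # H2: P(w2|w1) = p1
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2|~w1) = p2
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
    return -2 * log_lambda

# A large value, compared against chi-square critical values, suggests a collocation.
print(log_likelihood_ratio(15828, 4675, 8, 14307668))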

  29. Likelihood Ratios II: Between two or more corpora (Damerau, 1993) • Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora. • The relative frequency ratio r is easy to compute: • r1 = c1(w)/N1, r2 = c2(w)/N2, r = r1/r2 (Table 5.13) • This approach is most useful for the discovery of subject-specific collocations.
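A minimal sketch of the relative frequency ratio; the counts and corpus sizes below are invented for illustration.

def relative_frequency_ratio(c1, N1, c2, N2):
    """Ratio of the relative frequencies of a phrase in two corpora."""
    return (c1 / N1) / (c2 / N2)

# Hypothetical: a phrase occurring 44 times in a 10M-word corpus
# versus 3 times in a 20M-word corpus.
print(relative_frequency_ratio(44, 10_000_000, 3, 20_000_000))  # ~29.3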

  30. Mutual Information (I) • An information-theoretic measure for discovering collocations is pointwise mutual information (Church et al., 1989, 1991). • Pointwise mutual information is roughly a measure of how much one word tells us about the other. • Pointwise mutual information does not work well with sparse data.

  31. Mutual Information (II) • I(w1, w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]
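A small sketch of pointwise mutual information computed from bigram counts, again using the new companies counts; the function name is illustrative.

import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information (in bits) between w1 and w2."""
    p1, p2, p12 = c1 / N, c2 / N, c12 / N
    return math.log2(p12 / (p1 * p2))

# new companies: c(new)=15828, c(companies)=4675, c(new companies)=8
print(pmi(15828, 4675, 8, 14307668))  # ~0.63 bits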

  32. Discussion • Why pointwise mutual information does not work well is explained using the material on page 179. • The question of whether bits are a meaningful unit here. • The examples on pages 180 and 181 are used to show why applying it here is problematic.

  33. Summary of Collocations • Explanation on page 184 • Extensions on page 185 • Explanation on page 186
