Learning Within-Sentence Semantic Coherence

Presentation Transcript


  1. Learning Within-Sentence Semantic Coherence. Elena Eneva, Rose Hoberman, Lucian Lita. Carnegie Mellon University

  2. Semantic (in)Coherence • Trigram: content words unrelated • Effect on speech recognition: • Actual Utterance: “THE BIRDFLU HAS AFFECTED CHICKENS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMANS SICK” • Top Hypothesis: “THE BIRDFLU HAS AFFECTED SECONDS FOR YEARS BUT ONLY RECENTLY BEGAN MAKING HUMAN SAID” • Our goal: model semantic coherence

  3. A Whole Sentence Exponential Model [Rosenfeld 1997] • P(s) = (1/Z) · P0(s) · exp( Σi λi fi(s) ) • P0(s) is an arbitrary initial model (typically an N-gram) • fi(s) are arbitrary computable properties of s (aka features) • λi are the corresponding feature weights • Z is a universal normalizing constant
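  A minimal sketch of how a sentence would be scored under such a model, working in log space; the function and argument names below are illustrative, not from the paper:

      def log_score(sentence, log_p0, features, lambdas, log_z=0.0):
          """Whole-sentence exponential model:
              log P(s) = log P0(s) + sum_i lambda_i * f_i(s) - log Z
          log_p0:   callable giving the baseline (e.g. trigram) log-probability of s
          features: list of callables f_i(s), the sentence-level features
          lambdas:  matching list of feature weights
          log_z:    log of the universal normalizing constant Z
          """
          return (log_p0(sentence)
                  + sum(lam * f(sentence) for lam, f in zip(lambdas, features))
                  - log_z)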

  4. A Methodology for Feature Induction • Given a corpus T of training sentences: • Train the best-possible baseline model, P0(s) • Use P0(s) to generate a corpus T0 of “pseudo-sentences” (see the sampler sketch below) • Pose a challenge: find (computable) differences that allow discrimination between T and T0 • Encode the differences as features fi(s) • Train a new whole-sentence exponential model incorporating these features
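  A sketch of step 2, generating the pseudo-corpus T0, assuming the baseline P0 is a count-based trigram model stored as a dict mapping (w1, w2) to a dict of next-word counts; this is illustrative only, not the paper's sampler:

      import random

      def sample_sentence(trigram_counts, max_len=40):
          """Draw one pseudo-sentence by repeatedly sampling the next word
          given the two previous words."""
          history = ("<s>", "<s>")
          words = []
          for _ in range(max_len):
              nexts = trigram_counts.get(history)
              if not nexts:
                  break
              tokens, counts = zip(*nexts.items())
              word = random.choices(tokens, weights=counts, k=1)[0]
              if word == "</s>":
                  break
              words.append(word)
              history = (history[1], word)
          return " ".join(words)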

  5. Discrimination Task: Are these content words generated from a trigram or a natural sentence? • - - - feel - - sacrifice - - sense - - - - - - - - -meant - - - - - - - - trust - - - - truth • - - kind - free trade agreements - - - living - - ziplock bag - - - - - - university japan's daiwa bank stocks step –

  6. Building on Prior Work • Define “content words” (all but the 50 most frequent words) • Goal: model the distribution of content words in a sentence • Simplify: model pairwise co-occurrences (“content word pairs”) • Collect contingency tables and calculate a measure of association for each
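  One way the per-pair contingency tables could be collected is sketched below; the counting scheme (per-sentence co-occurrence) and helper names are assumptions, not the paper's code:

      from collections import Counter
      from itertools import combinations

      def cooccurrence_tables(sentences, stopwords):
          """Build a 2x2 contingency table for every content-word pair:
          (sentences containing both words, only the first, only the second, neither).
          Content words are everything outside the stopword list (the top-50 words)."""
          n_sentences = 0
          word_in = Counter()   # sentences containing each content word
          pair_in = Counter()   # sentences containing both words of a pair
          for sent in sentences:
              n_sentences += 1
              content = set(w for w in sent if w not in stopwords)
              word_in.update(content)
              pair_in.update(combinations(sorted(content), 2))
          tables = {}
          for (w1, w2), both in pair_in.items():
              only1 = word_in[w1] - both
              only2 = word_in[w2] - both
              neither = n_sentences - both - only1 - only2
              tables[(w1, w2)] = (both, only1, only2, neither)
          return tables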

  7. Q Correlation Measure Derived from Co-occurrence Contingency Table • Q values range from –1 to +1
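  The transcript does not reproduce the formula; assuming Q is the standard Yule's Q association statistic for a 2x2 contingency table (which has exactly this -1 to +1 range), a sketch:

      def yule_q(both, only1, only2, neither):
          """Yule's Q for a 2x2 contingency table: (AD - BC) / (AD + BC).
          +1 = perfect positive association, -1 = perfect negative, 0 = independence."""
          ad = both * neither
          bc = only1 * only2
          return 0.0 if ad + bc == 0 else (ad - bc) / (ad + bc)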

  8. Density Estimates • We hypothesized: • Trigram sentences: word-pair correlation completely determined by distance • Natural sentences: word-pair correlation independent of distance • Used kernel density estimation to estimate the distribution of Q values in each corpus at varying distances
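  A sketch of the density-estimation step with a Gaussian kernel (the paper does not specify the kernel or tooling, so SciPy's gaussian_kde is an assumption):

      import numpy as np
      from scipy.stats import gaussian_kde

      def q_density_by_distance(q_samples):
          """Fit one kernel density estimate per word-pair distance.
          q_samples: dict mapping distance -> list of observed Q values
          Returns a dict mapping distance -> callable density estimate."""
          return {d: gaussian_kde(np.asarray(qs)) for d, qs in q_samples.items()}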

  9. Q Distributions • [plots: density over Q values at Distance = 1 and Distance = 3, Trigram Generated (dashed) vs. Broadcast News]

  10. Likelihood Ratio Feature • Example: “she is a country singer searching for fame and fortune in nashville” • Q(country, nashville) = 0.76, Distance = 8 • Pr(Q=0.76 | d=8, BNews) = 0.32, Pr(Q=0.76 | d=8, Trigram) = 0.11 • Likelihood ratio = 0.32 / 0.11 = 2.9
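  Continuing the hypothetical sketch above, the likelihood-ratio feature for a single word pair could be computed from the two sets of fitted densities:

      def likelihood_ratio(q, distance, bnews_kde, trigram_kde):
          """Ratio of the estimated density of a Q value under the natural
          (Broadcast News) model vs. the trigram-generated model, at the
          observed word-pair distance."""
          return float(bnews_kde[distance](q)[0] / trigram_kde[distance](q)[0])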

  11. Simpler Features • Q Value based • Mean, median, min, max of Q values for content word pairs in the sentence (Cai et al 2000) • Percentage of Q values above a threshold • High/low correlations across large/small distances • Other • Word and phrase repetition • Percentage of stop words • Longest sequence of consecutive stop/content words
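  A sketch of how such sentence-level features might be computed; the feature names, the threshold value, and the stopword set are placeholders rather than the paper's exact definitions:

      import statistics

      STOPWORDS = set()  # placeholder: the 50 most frequent words

      def simple_features(sentence_qs, words, q_threshold=0.5):
          """Q-value summary statistics plus stopword-based features for one sentence.
          sentence_qs: Q values of all content-word pairs in the sentence
          words:       the tokenized sentence"""
          feats = {
              "q_mean": statistics.mean(sentence_qs),
              "q_median": statistics.median(sentence_qs),
              "q_min": min(sentence_qs),
              "q_max": max(sentence_qs),
              "q_frac_above": sum(q > q_threshold for q in sentence_qs) / len(sentence_qs),
              "stopword_frac": sum(w in STOPWORDS for w in words) / len(words),
          }
          # longest run of consecutive stopwords
          run = best = 0
          for w in words:
              run = run + 1 if w in STOPWORDS else 0
              best = max(best, run)
          feats["longest_stopword_run"] = best
          return feats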

  12. Datasets • LM and contingency tables (Q values) derived from 103 million words of Broadcast News (BN) • From the remainder of the BN corpus and from sentences sampled from the trigram LM: • Q value distributions estimated from ~100,000 sentences • Decision tree trained and tested on ~60,000 sentences • Disregarded sentences with < 7 words, e.g.: • “Mike Stevens says it’s not real” • “We’ve been hearing about it”

  13. Experiments • Learners: • C5.0 decision tree • Boosted decision stumps with AdaBoost.MH • Methodology: • 5-fold cross-validation on ~60,000 sentences • Boosting for 300 rounds (see the scikit-learn sketch below)
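  The original experiments used C5.0 and AdaBoost.MH; as a rough present-day stand-in, here is a scikit-learn sketch with boosted decision stumps and 5-fold cross-validation (X and y are assumed to hold the per-sentence features and the real-vs-trigram labels):

      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.model_selection import cross_val_score

      def cv_accuracy(X, y):
          """5-fold cross-validated accuracy of decision stumps boosted for 300 rounds."""
          stump = DecisionTreeClassifier(max_depth=1)
          booster = AdaBoostClassifier(estimator=stump, n_estimators=300)
          return cross_val_score(booster, X, y, cv=5).mean()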

  14. Results

  15. Shannon-Style Experiment • 50 sentences • ½ “real” and ½ trigram-generated • Stopwords replaced by dashes • 30 participants • Average accuracy of 73.77% ± 6 • Best individual accuracy 84% • Our classifier: • Accuracy of 78.9% ± 0.42

  16. Summary • Introduced a set of statistical features which capture aspects of semantic coherence • Trained a decision tree to classify with accuracy of 80% • Next step: incorporate features into exponential LM

  17. Future Work • Combat data sparsity • Confidence intervals • Different correlation statistic • Stemming or clustering vocabulary • Evaluate derived features • Incorporate into an exponential language model • Evaluate the model on a practical application

  18. Agreement among Participants

  19. Expected Perplexity Reduction • Semantic coherence feature is active for: • 78% of broadcast news sentences • 18% of trigram-generated sentences • Kullback-Leibler divergence: .814 • Average perplexity reduction per word = .0419 (2^.814/21); per sentence? • Features modify the probability of the entire sentence • The effect of the feature on per-word probability is small

  20. Distribution of Likelihood Ratio • [plot: density over likelihood values, Trigram Generated (dashed) vs. Broadcast News]

  21. Discrimination Task • Natural Sentence: • but it doesn't feel like a sacrifice in a sense that you're really saying this is you know i'm meant to do things the right way and you trust it and tell the truth • Trigram-Generated: • they just kind of free trade agreements which have been living in a ziplock bag that you say that i see university japan's daiwa bank stocks step though

  22. Q Values at Distance 1 • [plot: density over Q values, Trigram Generated (dashed) vs. Broadcast News]

  23. Q Values at Distance 3 • [plot: density over Q values, Trigram Generated (dashed) vs. Broadcast News]

  24. Outline • The problem of semantic (in)coherence • Incorporating this into the whole-sentence exponential LM • Finding better features for this model using machine learning • Semantic coherence features • Experiments and results
