
Segmentation in NLP: A quick survey of techniques and applications



  1. Segmentation in NLP: A quick survey of techniques and applications Tony Davis (3M Health Information Systems) DC NLP meetup 13 August 2014

  2. What is segmentation? • Dividing a document, broadly defined as a sequential information-bearing structure, into parts that have some internal coherence • Well-known examples: book chapters, episodes of TV miniseries (segments are contiguous, strictly ordered, disjoint, and exhaustive) • More examples: sections and subsections of a textbook, topically-based segments defined by an index, plays in a sports broadcast (these depart from the prototype; they’re hierarchical, overlapping, “fuzzy”) • Notice that these examples are more or less “preset” at the time the document is created, but ad hoc segments can be useful, too • Labeling the segments with good terms is an additional aspect of segmentation. Examples: chapter titles, section headers in a clinical report, descriptions of plays in a football game

  3. Why is segmentation useful? • Segmentation guides and improves the user experience • Segmentation affects meaning

  4. What are we trying to do with automatic segmentation? • Create “good”/“useful” segments, label them appropriately, and relate them to one another • What makes a good segment? • Highly dependent on type of content (baseball game, cooking show, feature-length movie, talk show) • Thus, the mix of techniques for automatic segmentation will vary • Relating segments to one another? • Topical similarity: the same people speaking about the same things, similar players or similar kinds of plays, common themes, … • Continuity of narrative (subplots in a TV comedy series, continuation of a project on a home repair show)

  5. Topic-based segmentation • Topic segmentation: dividing a document into topically coherent segments • Possibly a partition of the document (exhaustive, non-overlapping segments) • But ad hoc segments may be more useful in some circumstances • Labeling the segments with good terms is often treated separately • Several approaches used: • Discourse features • Some signal a topic shift; others a continuation • Highly domain-specific • Similarity measures between adjacent blocks of text • Use document similarity measures, as in TextTiling (Hearst, 1994) or Choi’s algorithm (Choi, 2000) • Posit boundaries at points where similarity is low • Lexical chains: repeated occurrences of a term (or of closely related terms) • Again, posit boundaries where cohesion is low (few lexical chains cross the boundary; e.g., Galley et al., 2003)
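To make the similarity-based approach concrete, here is a minimal sketch of the block-comparison idea behind TextTiling (not Hearst's full algorithm, which adds smoothing and depth scoring): compare bag-of-words vectors of adjacent blocks of sentences and posit boundaries at sufficiently deep local minima. The block size and depth threshold are illustrative choices, not values from the paper.

```python
# Sketch of TextTiling-style block comparison: cosine similarity between
# bag-of-words vectors of adjacent fixed-size sentence blocks; boundaries
# are posited at deep local minima of the similarity curve.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(cnt * b[t] for t, cnt in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_similarities(sentences, block_size=3):
    """Similarity at each inter-sentence gap, comparing the blocks before and after it."""
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block_size):gap], Counter())
        right = sum(bags[gap:gap + block_size], Counter())
        sims.append(cosine(left, right))
    return sims  # sims[g - 1] is the score for the gap before sentence g

def boundaries(sims, depth=0.1):
    """Gaps that are local minima and sufficiently deep relative to their neighbors."""
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]
            and min(sims[i - 1], sims[i + 1]) - sims[i] >= depth]
```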

  6. Topic segmentation – text similarity • Choi’s algorithm (Choi, 2000) • Construct a similarity matrix of all sentence pairs • “Sharpen” the distinctions by replacing similarity values with their ranks in the local region • Coherent regions appear as brighter square areas along the diagonal; segment boundaries lie at their corners Freddy Y.Y. Choi. 2000. Advances in domain independent linear text segmentation. Proceedings of NAACL 2000, 109-117.
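A hedged sketch of the rank-sharpening step, assuming a sentence-pair similarity matrix has already been computed (e.g., cosine similarity over word-count vectors). Each cell is replaced by the fraction of its local neighbors with strictly lower similarity, which is what makes the coherent square regions along the diagonal stand out; the region radius is an illustrative parameter.

```python
# Sketch of the local rank transform used to "sharpen" a sentence-pair
# similarity matrix: each value becomes the proportion of neighboring cells
# (in a (2r+1) x (2r+1) region) with strictly lower similarity.
import numpy as np

def rank_transform(sim: np.ndarray, r: int = 1) -> np.ndarray:
    n = sim.shape[0]
    ranked = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            region = sim[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            # Fraction of neighbors (excluding the cell itself) that are lower.
            ranked[i, j] = (region < sim[i, j]).sum() / (region.size - 1 or 1)
    return ranked
```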

  7. Topic segmentation – lexical chains • LCSeg (Galley, et al., 2003) • Based on lexical chains: repeated occurrences of a term in a document • Each chain receives a score based on the number of terms within it, and its length (shorter chains get a higher score) • Cohesion at each potential boundary is measured by comparing scores of chains spanning that boundary and chains on either side near that boundary • Segments are postulated where cohesion is low Michel Galley, Kathleen McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse Segmentation of Multi-Party Conversation. Proceedings of the 41st Annual Meeting of the ACL, 562-569
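A minimal sketch in the spirit of LCSeg, not Galley et al.'s exact formulation: a "chain" here is just the span of a repeated term, scored so that more occurrences over a shorter span score higher (as the slide describes), and cohesion at a gap is the cosine similarity between chain-score vectors for the windows on either side. The window size and scoring function are our assumptions.

```python
# Sketch of lexical-chain cohesion: score chains of repeated terms, then
# measure cohesion at each candidate boundary by comparing the chains active
# just left and just right of it. Low cohesion suggests a segment boundary.
import math
from collections import defaultdict

def chains(sentences):
    """term -> (first sentence index, last sentence index, occurrence count)."""
    occ = defaultdict(list)
    for i, s in enumerate(sentences):
        for t in set(s.lower().split()):
            occ[t].append(i)
    return {t: (ix[0], ix[-1], len(ix)) for t, ix in occ.items() if len(ix) > 1}

def chain_score(first, last, freq, n_sents):
    # More occurrences and a shorter span give a higher score (per the slide).
    return freq * math.log(n_sents / (last - first + 1))

def cohesion(sentences, gap, window=3):
    n = len(sentences)
    left, right = defaultdict(float), defaultdict(float)
    for t, (f, l, k) in chains(sentences).items():
        s = chain_score(f, l, k, n)
        if f < gap and l >= gap - window:    # chain active just left of the gap
            left[t] = s
        if l >= gap and f < gap + window:    # chain active just right of the gap
            right[t] = s
    dot = sum(left[t] * right[t] for t in left if t in right)
    na = math.sqrt(sum(v * v for v in left.values()))
    nb = math.sqrt(sum(v * v for v in right.values()))
    return dot / (na * nb) if na and nb else 0.0  # low cohesion => boundary
```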

  8. Topic segmentation – modeling term influence and relatedness • Model both the influence of a term beyond the sentence it occurs in and semantic relatedness among terms • The range of a term’s influence extends beyond the sentence it occurs in, but how far? (relevance intervals) • Semantic relatedness among terms (contextually mediated graphs) • Apply this model to topic-based segmentation • And now, a short diversion about streaming media… Geetu Ambwani and Anthony R. Davis. 2010. Contextually-Mediated Semantic Similarity Graphs for Topic Segmentation. Proceedings of the Workshop on Graph-based Methods in NLP, 48th Annual Meeting of the ACL.

  9. Automatic segmentation of streaming media • Streaming media presents particular difficulties • Its timed nature makes navigation cumbersome, so a useful system should extract relevant intervals of documents rather than present a list of documents to the user • Speech recognition is error-prone, so the system must compensate for noisy input • To address these issues, use a mix of statistical and symbolic NLP • Create relevant intervals for each useful term, using term co-occurrence information (the “COW model”) • Term co-occurrence measured using pointwise mutual information (PMI) within a corpus: PMI(X,Y) = log( P(X,Y) / (P(X) P(Y)) )
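A small sketch of building a PMI table from co-occurrence counts; the granularity (co-occurrence within a sentence) and the helper names are illustrative assumptions, not details from the slides.

```python
# Sketch of computing PMI between word pairs from sentence-level
# co-occurrence counts in a corpus: PMI(x, y) = log(P(x, y) / (P(x) P(y))).
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    word_ct, pair_ct, n = Counter(), Counter(), 0
    for s in sentences:
        words = set(s.lower().split())
        n += 1
        word_ct.update(words)
        pair_ct.update(frozenset(p) for p in combinations(sorted(words), 2))
    def pmi(x, y):
        p_xy = pair_ct[frozenset((x, y))] / n
        p_x, p_y = word_ct[x] / n, word_ct[y] / n
        return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")
    return pmi

# Usage: pmi = pmi_table(corpus_sentences); pmi("squatter", "land")
```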

  10. “Original” StreamSage system architecture [architecture diagram; components include: user interfaces for applications, large text corpus, co-occurring word (COW) model, media archive, speech recognition and SR output, WSD, virtual document collection, query expansion, anaphor resolution, relevance interval calculation, compound interval calculation, topic segmentation]

  11. Relevance Intervals (ad hoc segments) • Each RI is a contiguous segment of audio/video deemed relevant to a term • RIs are calculated for all content words (after lemmatization) and multi-word expressions • RI basis: the sentence containing the term. Each RI is expanded forward and backward to capture relevant material, using techniques including: • Topic boundary detection by changes in COW values across sentences • Topic boundary detection via discourse markers • Synonym-based query expansion • Anaphor resolution • Nearby RIs for the same term are merged (sketched below) • Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI, and the COW values of other words in the RI with the term
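The merging step is easy to sketch; this minimal version coalesces RIs for one term whenever the gap between them is small, with the maximum gap as an illustrative parameter.

```python
# Sketch of the interval-merging step: RIs for the same term that lie within
# `max_gap` sentences of each other are coalesced into one interval.
def merge_nearby(intervals, max_gap=2):
    """intervals: list of (start, end) sentence indices for one term."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# merge_nearby([(2, 4), (5, 7), (12, 13)]) -> [(2, 7), (12, 13)]
```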

  12. Relevance Intervals: an Example • Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” • We build an RI for squatter around each of these sentences…

  13. Relevance Intervals: an Example • Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] • We build an RI for squatter around each of these sentences…

  14. Relevance Intervals: an Example • Index term: squatter Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary] In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals] NPR’s Kenneth Walker has a report. [merge nearby intervals] Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] • Two occurrences of squatter produce a complete merged interval.

  15. Merging RIs for multiple terms [diagram: RIs for the terms Russia and Iran, with occurrences of the original terms marked] Note that this can only be done at query time, so it needs to be fairly quick and simple.

  16. Merging RIs [diagram: RIs for Russia and Iran, with occurrences of the original terms marked, expanded by activation spreading]

  17. Merging RIs [diagram: the RIs for Russia and Iran combined into intervals for the query “Russia and Iran”]
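The diagrams on slides 15-17 aren't preserved, but the query-time combination they illustrate can be sketched. This hedged version implements only the simplest case, intersecting the interval lists of two terms for an AND-style query like "Russia and Iran"; the activation-spreading step in the slides is richer than this, but the operation has to stay this quick and simple at query time.

```python
# Sketch of query-time combination of RIs for two terms: keep the overlaps
# of the two sorted interval lists (a standard two-pointer sweep).
def intersect_ris(a, b):
    """a, b: sorted (start, end) interval lists; returns their overlaps."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:   # advance whichever interval ends first
            i += 1
        else:
            j += 1
    return out
```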

  18. Evaluating RI Quality • Evaluation of retrieval effectiveness in timed media raises further issues: • Building a gold standard is painstaking, and potentially quite subjective • It’s necessary to measure how closely the system’s RIs match the gold standard’s • What’s a reasonable baseline? • We created a gold standard of about 2,300 RIs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI on a scale of 1 (highly relevant) to 3 (marginally relevant). • Testing performed on speech recognizer output

  19. Evaluating RI Quality • Measure amounts of extraneous and missed content [diagram: an ideal RI aligned against the system RI, marking content missed by the system and extraneous content included in the system RI]
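A minimal sketch of the measurement the diagram illustrates, treating the ideal and system RIs as single time intervals; normalizing both error amounts by the ideal RI's length is our assumption, not something the slide states.

```python
# Sketch of the missed/extraneous comparison: given an ideal (gold) RI and
# the system RI as (start, end) times, measure content the system missed and
# extraneous content it included, as fractions of the ideal RI's length.
def missed_and_extraneous(ideal, system):
    overlap = max(0.0, min(ideal[1], system[1]) - max(ideal[0], system[0]))
    ideal_len = ideal[1] - ideal[0]
    system_len = system[1] - system[0]
    missed = (ideal_len - overlap) / ideal_len
    extraneous = (system_len - overlap) / ideal_len
    return missed, extraneous

# missed_and_extraneous((10.0, 40.0), (15.0, 50.0)) -> (0.167, 0.333)
```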

  20. Evaluating RI Quality [chart: median percentages of extraneous and missed content over all queries, comparing the system using COWs with a baseline using only sentences in which the query terms are present]

  21. How to spot shifts in topic • Yesterday, I took my dog to the park. • While there, I took him off the leash to get some exercise. • After 2 minutes, Spot began chasing a squirrel. • ______________(Topic Shift)________________ • I realized I needed to go grocery shopping. • So I went off to Trader Joe’s. • Unfortunately, they were out of cashews.

  22. Calculating connection strengths for edges • Construct a graph in which each node pairs a term with a sentence; a node exists iff the sentence is contained in an RI for that term

  23. Calculating connection strengths for edges • Label each sentence with the terms that have RIs including that sentence; think of these as nodes in a graph

  24. Calculating connection strengths for edges Now we’ll connect these nodes to one another, with edges that have weights

  25. Calculating connection strengths for edges

  26. Calculating connection strengths for edges

  27. Calculating connection strengths for edges

  28. Connection strength formula • In general, for terms a and b in sentences i and i + 1 respectively: [the formula appears as an equation image in the original slides and is not preserved in this transcript]

  29. Build a graph of the terms in the document, sentence by sentence
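A hedged sketch of the construction described on slides 22-29: one node per (term, sentence) pair wherever the sentence falls in an RI for that term, with weighted edges between terms in adjacent sentences. The connection-strength formula itself appears only as an image in the slides, so the `weight` function below is a hypothetical stand-in (e.g., a COW/PMI-based score).

```python
# Sketch of the term-sentence graph: per-sentence "layers" record which terms
# have an RI covering that sentence; edges connect terms in adjacent
# sentences, weighted by a caller-supplied connection-strength function.
from itertools import product

def build_graph(ris, n_sentences, weight):
    """ris: term -> list of (start, end) RIs; weight: (term_a, term_b) -> float."""
    layers = [set() for _ in range(n_sentences)]
    for term, intervals in ris.items():
        for start, end in intervals:
            for i in range(start, end + 1):
                layers[i].add(term)
    edges = {}
    for i in range(n_sentences - 1):
        for a, b in product(layers[i], layers[i + 1]):
            edges[(i, a, b)] = weight(a, b)  # edge between sentences i and i+1
    return layers, edges
```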

  30. Filtering edges in the graph

  31. A real example (from CNN) • S_190 We've got to get this addressed and hold down health care costs. • S_191 Senator Ron Wyden the optimist from Oregon, we appreciate your time tonight. • S_192 Thank you. • ______________(Topic Shift)______________ • S_193 Coming up, the final day of free health clinic in Kansas City, Missouri.

  32. Graph representation of documents After pruning low-weight edges, regions near a boundary tend to be sparse

  33. Data and parameter settings • Two data sets • Concatenated New York Times articles • 186 pseudodocuments, each containing 20 articles • Closed-captions for 13 TV shows • News, talk shows, documentaries, lifestyle shows • “Noisier” than news articles • Parameter settings • Edge-pruning threshold: remove all edges with a connection strength below a set threshold (we tried a couple and usually used 0.5) • “Normalized novelty”: on the two sides of a potential boundary, the number of nodes labeled with the same terms, normalized by the total number of terms • Threshold (given normalized novelty): posit a boundary between sentences where normalized novelty falls below a set threshold (we typically used 0.6)
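A minimal sketch of edge pruning plus the "normalized novelty" test as the slide describes it, reusing the per-sentence term layers from the graph sketch above; the window size is our assumption, while the 0.5 and 0.6 thresholds come from the slide.

```python
# Sketch of boundary detection: prune low-weight edges, then compare the term
# sets on the two sides of each candidate gap, normalized by the total number
# of terms, and posit a boundary where that score falls below a threshold.
def prune_edges(edges, threshold=0.5):
    return {k: w for k, w in edges.items() if w >= threshold}

def normalized_novelty(layers, gap, window=3):
    """layers: per-sentence sets of terms with surviving nodes (see above)."""
    left = set().union(*layers[max(0, gap - window):gap])
    right = set().union(*layers[gap:gap + window])
    total = len(left | right)
    return len(left & right) / total if total else 0.0

def find_boundaries(layers, threshold=0.6, window=3):
    return [g for g in range(1, len(layers))
            if normalized_novelty(layers, g, window) < threshold]
```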

  34. Systems compared * morphadorner.northwestern.edu/morphadorner/textsegmenter

  35. Evaluation metrics • How well does the hypothesized set of boundaries match the true (reference) set? • Pk (Beeferman et al., 1997) and WindowDiff (Pevzner & Hearst, 2002) • Both compare hypothesis to reference segmentation within a sliding window • Pk is the proportion of windows in which hypothesis and reference disagree on whether a boundary occurs • WindowDiff tallies the difference in the number of boundaries in each window • Both commonly used instead of precision and recall, because they take approximate matching into account • They have drawbacks of their own, however Doug Beeferman, Adam Berger, and John Lafferty. 1997. Text Segmentation Using Exponential Models. Proceedings of EMNLP 2. Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:1.
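Simplified implementations of both metrics, with segmentations given as sorted lists of boundary positions (a boundary after sentence i is recorded as i) and the window size k defaulting to half the average reference segment length, as the slides describe.

```python
# Sketch implementations of Pk and WindowDiff over a document of n sentences.
def _count(boundaries, lo, hi):
    """Number of boundaries b with lo <= b < hi (i.e., inside the window)."""
    return sum(1 for b in boundaries if lo <= b < hi)

def pk(ref, hyp, n, k=None):
    # Error when ref and hyp disagree on whether the window contains a boundary.
    k = k or max(1, round(n / (2 * (len(ref) + 1))))
    errors = sum((_count(ref, i, i + k) > 0) != (_count(hyp, i, i + k) > 0)
                 for i in range(n - k))
    return errors / (n - k)

def window_diff(ref, hyp, n, k=None):
    # Error when the *number* of boundaries in the window differs.
    k = k or max(1, round(n / (2 * (len(ref) + 1))))
    errors = sum(_count(ref, i, i + k) != _count(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)
```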

  36. Evaluation metrics • Pk and WindowDiff in action: comparing two segmentations

  37. Evaluation metrics • Pk and WindowDiff: sliding window is half the average reference segment size

  38. Evaluation metrics • Pk and WindowDiff in action: window slides one sentence forward

  39. Evaluation metrics • Pk and WindowDiff in action: still no disagreement between the two segmentations within the window

  40. Evaluation metrics • Pk and WindowDiff in action: one black mark against the hypothesis segmentation, which posits a boundary where the reference doesn’t

  41. Evaluation metrics • Pk and WindowDiff in action: the same mistake counts again against the hypothesis (note that mistakes closer to reference boundaries appear in fewer windows, and are therefore penalized less)

  42. Results on pseudodocuments 186 documents, containing 20 NYT articles each Number of boundaries not specified to the systems

  43. Results on pseudodocuments 86 documents out of the 186, for which C99 and SS+C both find at least 20 boundaries; top 20 boundaries selected

  44. TV show closed-captions: inter-annotator agreement on segmentation • Annotators note major and minor boundaries, using a 1-5 rating scale • As expected, IAA is low, so we create a reference annotation

  45. TV show closed-captions: inter-annotator agreement on segmentation • Pk values between pairs of annotators: all boundaries and major boundaries

  46. TV show closed-captions: segmentation • Accuracy is low, which is unsurprising given the low IAA

  47. Using segment-level metadata for sports broadcast segmentation • Play-by-play data for sports • Available for several sports in near real-time • Particularly useful for sports that are inherently segmented into discrete plays • Advantages of consistency and quality • Many other types of segment level metadata, depending on type of content • DVD chapters • Advertising breaks in TV shows • Screenplays

  48. Segment-level metadata for sports • Closed captions can be problematic • Sometimes describe clearly what’s going on… >> Kenny: THEY START FROM THEIR OWN 35. AND THIS IS GRANT WHO FUMBLED MOMENTS AGO. HAD A TERRIFIC HALF OF THE SEASON • Here is data for this play from Stats, Inc. (near real-time): <play quarter="1" play-start-time="14:33" play-stop-time="14:27" down="1" distance="10" away-score-before="7" away-score-after="7" home-score-before="0" home-score-after="0" yardline="GB35" end-yardline="GB43" possession="GB" end-possession="GB" details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/>

  49. Segment-level metadata for sports • Alignment of this type of metadata with the video is fairly easily achieved, because the game clock is displayed on the video • Stats, Inc. data for the play, with added video times: <play quarter="1" play-start-time="14:33" play-stop-time="14:27" video-start-time="0:12:41" video-stop-time="0:12:45" down="1" distance="10" away-score-before="7" away-score-after="7" home-score-before="0" home-score-after="0" yardline="GB35" end-yardline="GB43" possession="GB" end-possession="GB" details="Ryan Grant rush to the left for 8 yards to the GB43. Tackled by Darryl Tapp."/> [video frame: on-screen game clock showing SEA 7, GB 0, 14:33 Q1, DN 1, YD 10]
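A small sketch of the alignment idea: parse the play record and map its game-clock times to video times through sync points read off the on-screen clock. The sync table here is hand-built (a real system would read the score banner, e.g. via OCR, and interpolate between samples); the XML attributes follow the slide's example, while the helper names are ours.

```python
# Sketch of aligning play-by-play metadata with video via the game clock.
import xml.etree.ElementTree as ET

play_xml = '''<play quarter="1" play-start-time="14:33" play-stop-time="14:27"
  details="Ryan Grant rush to the left for 8 yards to the GB43."/>'''

# (quarter, game clock) -> video time in seconds, from sampled video frames.
# 761 s = 0:12:41 and 765 s = 0:12:45, matching the slide's added times.
sync = {("1", "14:33"): 761.0, ("1", "14:27"): 765.0}

def to_video_time(quarter, clock):
    return sync[(quarter, clock)]  # a real system would interpolate

play = ET.fromstring(play_xml)
q = play.get("quarter")
start = to_video_time(q, play.get("play-start-time"))
stop = to_video_time(q, play.get("play-stop-time"))
print(f"{play.get('details')} -> video {start:.0f}s to {stop:.0f}s")
```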

  50. Finding interesting plays in a football game
