
Indexing and Searching Timed Media: the Role of Mutual Information Models


Presentation Transcript


  1. Indexing and Searching Timed Media: the Role of Mutual Information Models Tony Davis (StreamSage/Comcast) IPAM, UCLA 1 Oct. 2007

  2. A bit about what we do… • StreamSage (now a part of Comcast) focuses on indexing and retrieval of “timed” media (video and audio, aka “multimedia” or “broadcast media”) • A variety of applications, but now centered on cable TV • This is joint work with many members of the research team: Shachi Dave, Abby Elbow, David Houghton, I-Ju Lai, Hemali Majithia, Phil Rennert, Kevin Reschke, Robert Rubinoff, Pinaki Sinha, Goldee Udani

  3. Overview • Theme: use of term association models to address challenges of timed media • Problems addressed: • Retrieving all and only the relevant portions of timed media for a query • Lexical semantics (WSD, term expansion, compositionality of multi-word units) • Ontology enhancement • Topic detection, clustering, and similarity among documents

  4. Automatic indexing and retrieval of streaming media • Streaming media presents particular difficulties • Its timed nature makes navigation cumbersome, so the system must extract relevant intervals of documents rather than present a list of documents to the user • Speech recognition is error-prone, so the system must compensate for noisy input • We use a mix of statistical and symbolic NLP • Various modules factor into calculating relevant intervals for each term: • Word sense disambiguation • Query expansion • Anaphor resolution • Name recognition • Topic segmentation and identification

  5. Statistical and NLP Foundations: the COW and YAKS Models

  6. Two types of models • COW (Co-Occurring Words) • Based on proximity of terms, using a sliding window • YAKS (Yet Another Knowledge Source) • Based on grammatical relationships between terms

  7. The COW Model • Large mutual information model of word co-occurrence • MI(X,Y) = P(X,Y) / (P(X)·P(Y)) • Thus, COW values greater than 1 indicate correlation (tendency to co-occur); less than 1, anticorrelation • Values are adjusted for centrality (salience) • Two main COW models: • New York Times, based on 325 million words (about 6 years) of text • Wikipedia, more recent, roughly the same amount of text • We also have specialized COW models (for medicine, business, and others), as well as models for other languages
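The slides do not show how the model is built, so here is a minimal sketch of a window-based co-occurrence MI model, assuming a tokenized, lemmatized corpus; the centrality adjustment mentioned above is omitted, and the window size is an illustrative parameter rather than the system's actual setting.

```python
from collections import Counter
from itertools import combinations

def cow_values(tokens, window=10):
    """Sketch of a COW-style model: MI(X,Y) = P(X,Y) / (P(X)·P(Y)),
    with probabilities estimated as the fraction of sliding windows
    containing the word(s). Values > 1 indicate correlation."""
    word_windows = Counter()   # windows containing each word
    pair_windows = Counter()   # windows containing each word pair
    n = max(1, len(tokens) - window + 1)
    for i in range(n):
        seen = sorted(set(tokens[i:i + window]))
        word_windows.update(seen)
        for x, y in combinations(seen, 2):
            pair_windows[(x, y)] += 1
    return {(x, y): (joint / n) / ((word_windows[x] / n) * (word_windows[y] / n))
            for (x, y), joint in pair_windows.items()}

# e.g. cow_values("the commuter railroad added weekday rail service".split())
```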

  8. The COW (Co-Occurring Words) Model • The COW model is at the core of what we do • Relevance interval construction • Document segmentation and topic identification • Word sense disambiguation (large-scale and unsupervised, based on clustering COWs of an ambiguous word) • Ontology construction • Determining semantic relatedness of terms • Determining specificity of a term

  9. The COW (Co-Occurring Words) Model • An example: top 10 COWs of railroad//N
  post//J 165.554
  shipper//N 123.568
  freight//N 121.375
  locomotive//N 119.602
  rail//N 73.7922
  railway//N 64.6594
  commuter//N 63.4978
  pickups//N 48.3637
  train//N 44.863
  burlington//N 41.4952

  10. Multi-Word Units (MWUs) • Why do we care about MWUs? Because they act like single words in many cases, but also: • MWUs are often powerful disambiguators of words within them (see, e.g., Yarowsky (1995), Pedersen (2002) for WSD methods that exploit this): • ‘fuel tank’, ‘fish tank’, ‘tank tread’ • ‘indoor pool’, ‘labor pool’, ‘pool table’ • Useful in query expansion • ‘Dept. of Agriculture’ ↔ ‘Agriculture Dept.’ • ‘hookworm in dogs’ ↔ ‘canine hookworm’ • Provide many terms that can be added to ontologies • ‘commuter railroad’, ‘peanut butter’

  11. Multi-Word Units (MWUs) in our models • MWUs in our system • We extract nominal MWUs, using a simple procedure based on POS-tagging: • Example: • ({N,J}) ({N,J}) N • N Prep (‘the’) ({N,J}) N • where Prep is ‘in’, ‘on’, ‘of’, ‘to’, ‘by’, ‘with’, ‘without’, ‘for’, or ‘against’ • For the most common 100,000 or so MWUs in our corpus, we calculate COW values, as we do for words
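A rough sketch of this extraction step, assuming sentences are already POS-tagged with tags simplified to 'N' (noun) and 'J' (adjective), and reading the parenthesized elements of the patterns as optional; the function name and tag scheme are illustrative, not the system's own.

```python
PREPS = {'in', 'on', 'of', 'to', 'by', 'with', 'without', 'for', 'against'}

def extract_mwus(tagged):
    """Extract nominal MWUs from one POS-tagged sentence.
    `tagged` is a list of (word, tag) pairs; tags other than 'N'/'J'
    simply never match the patterns."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    n = len(tagged)
    mwus = set()

    # Pattern 1: ({N,J}) ({N,J}) N -- one or two N/J modifiers plus a head noun.
    for i in range(n):
        for length in (2, 3):
            j = i + length
            if j <= n and tags[j - 1] == 'N' and all(t in ('N', 'J') for t in tags[i:j - 1]):
                mwus.add(' '.join(words[i:j]))

    # Pattern 2: N Prep ('the') ({N,J}) N, e.g. 'Dept. of Agriculture'.
    for i in range(n - 2):
        if tags[i] != 'N' or words[i + 1].lower() not in PREPS:
            continue
        j = i + 2
        if j < n and words[j].lower() == 'the':
            j += 1
        if j + 1 < n and tags[j] in ('N', 'J') and tags[j + 1] == 'N':
            mwus.add(' '.join(words[i:j + 2]))   # with optional modifier
        elif j < n and tags[j] == 'N':
            mwus.add(' '.join(words[i:j + 1]))   # bare N Prep N
    return mwus

# extract_mwus([('commuter', 'N'), ('railroad', 'N')]) -> {'commuter railroad'}
```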

  12. COWs of MWUs • An example: top ten COWs of ‘commuter railroad’
  post//J 1234.47
  pickups//N 315.005
  rail//N 200.839
  sanitation//N 186.99
  weekday//N 135.609
  transit//N 134.329
  commuter//N 119.435
  subway//N 86.6837
  transportation//N 86.487
  railway//N 86.2851

  13. COWs of MWUs • Another example: top ten COWs of ‘underground railroad’
  abolitionist//N 845.075
  slave//N 401.732
  gourd//N 266.538
  runaway//J 226.163
  douglass//N 170.459
  slavery//N 157.654
  harriet//N 131.803
  quilt//N 109.241
  quaker//N 94.6592
  historic//N 86.0395

  14. The YAKS model • Motivations • COW values reflect simple co-occurrence or association, but no particular relationship beyond that • For some purposes, it’s useful to measure the association between two terms in a particular syntactic relationship • Construction • Parse a lot of text (the same 325 million words of New York Times used to build our NYT COW model); however, long sentences (>25 words) were discarded, as parsing them was slow and error-prone • The parser’s output provides information about grammatical relations between words in a clause; to measure the association of a verb (say ‘drink’) and a noun as its object (say ‘beer’), we consider the set of all verb-object pairs, and calculate mutual information over that set • Calculate MI for broader semantic classes of terms, e.g.: food, substance. Semantic classes were taken from Cambridge International Dictionary of English (CIDE); there are about 2000 of them, arranged in a shallow hierarchy
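A minimal sketch of the MI computation over one grammatical relation, assuming verb-object pairs have already been harvested from the parser output; backing off to semantic classes would just mean adding pairs like ('eat', ':Food') for each object noun's CIDE class. The function name is illustrative.

```python
from collections import Counter

def relation_mi(pairs):
    """MI over (head, argument) pairs in one grammatical relation,
    e.g. verb-object pairs such as ('drink', 'beer'). As with COW,
    values > 1 mean the pair occurs together more often than chance."""
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    arg_counts = Counter(a for _, a in pairs)
    total = len(pairs)
    return {(h, a): (joint / total) /
                    ((head_counts[h] / total) * (arg_counts[a] / total))
            for (h, a), joint in pair_counts.items()}

# relation_mi([('eat', 'hamburger'), ('eat', 'pretzel'), ('hear', 'sound')])
```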

  15. YAKS Examples • Some objects of ‘eat’ (OBJ head=eat):
  arg1=hamburger 139.249
  arg1=pretzel 90.359
  arg1=:Food 18.156
  arg1=:Substance 7.89
  arg1=:Sound 0.324
  arg1=:Place 0.448

  16. Relevance Intervals

  17. Relevance Intervals (RIs) • Each RI is a contiguous segment of audio/video deemed relevant to a term • RIs are calculated for all content words (after lemmatization) and multi-word expressions • RI basis: sentence containing the term • Each RI is expanded forward and backward to capture relevant material, using techniques including: • Topic boundary detection by changes in COW values across sentences • Topic boundary detection via discourse markers • Synonym-based query expansion • Anaphor resolution • Nearby RIs for the same term are merged • Each RI is assigned a magnitude, reflecting its likely importance to a user searching on that term, based on the number of occurrences of the term in the RI and the COW values of other words in the RI with the term
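The merge step lends itself to a simple sketch: given the expanded intervals for one term (as start/end times in seconds), merge any that are separated by a small gap. The gap threshold below is an assumed parameter, not the system's actual setting.

```python
def merge_nearby(intervals, max_gap=15.0):
    """Merge relevance intervals (start, end), in seconds, for one term
    whenever the gap between consecutive intervals is at most max_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# merge_nearby([(10.0, 25.0), (30.0, 50.0), (120.0, 140.0)])
# -> [(10.0, 50.0), (120.0, 140.0)]
```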

  18. Relevance Intervals: an Example • Index term: squatter • Among the sentences containing this term are these two, near each other: Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” • We build an RI for squatter around each of these sentences…

  19. Relevance Intervals: an Example • Index term: squatter • Among the sentences containing this term are these two, near each other: Paul Bew is professor of Irish politics at Queens University in Belfast. In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. NPR’s Kenneth Walker has a report. Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] • We build an RI for squatter around each of these sentences…

  20. Relevance Intervals: an Example • Index term: squatter • Among the sentences containing this term are these two, near each other: Paul Bew is professor of Irish politics at Queens University in Belfast. [topic segment boundary] In South Africa the government is struggling to contain a growing demand for land from its black citizens. [cow-expand] Authorities have vowed to crack down and arrest squatters illegally occupying land near Johannesburg. In a most serious incident today more than 10,000 black South Africans have seized government and privately-owned property. [cow-expand] Hundreds were arrested earlier this week and the government hopes to move the rest out in the next two days. [merge nearby intervals] NPR’s Kenneth Walker has a report. [merge nearby intervals] Thousands of squatters in a suburb outside Johannesburg cheer loudly as their leaders deliver angry speeches against whites and landlessness in South Africa. “Must give us a place…” [topic segment boundary] • Two occurrences of squatter produce a complete merged interval.

  21. Relevance Intervals and Virtual Documents • The set of all the RIs for a term in a document constitutes the virtual document for that term • In effect, the VD for a term is intended to approximate a document that would have been produced had the authors focused solely on that term • A VD is assigned a magnitude equal to the highest magnitude of the RIs in it, with a bonus if more than one RI has a similarly high magnitude

  22. Merging RIs for multiple terms • [diagram: the RIs for ‘Russia’ and ‘Iran’ in a document, with occurrences of the original terms marked] • Note that this can only be done at query time, so it needs to be fairly quick and simple.

  23. Merging RIs • [diagram: activation spreading outward from the occurrences of the original terms in the ‘Russia’ and ‘Iran’ RIs]

  24. Merging RIs • [diagram: the merged RIs for the combined query ‘Russia and Iran’]

  25. Evaluating RIs and VDs • Evaluation of retrieval effectiveness in timed media raises further issues: • Building a gold standard is painstaking, and potentially more subjective • It’s necessary to measure how closely the system’s RIs match the gold standard’s • What’s a reasonable baseline? • We created a gold standard of about 2300 VDs with about 200 queries on about 50 documents (NPR, CNN, ABC, and business webcasts), and rated each RI in a VD on a scale of 1 (highly relevant) to 3 (marginally relevant). • Testing of the system was performed on speech recognizer output

  26. Evaluating RIs and VDs • Measure amounts of extraneous and missed content • [diagram: a system RI overlaid on the ideal RI, marking the missed and extraneous portions]
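A sketch of the comparison pictured above, assuming each RI is a (start, end) pair in seconds; it returns the extraneous and missed durations for one system RI against one gold-standard RI. The function name is illustrative.

```python
def extraneous_and_missed(system_ri, ideal_ri):
    """Seconds of extraneous content (in the system RI but not the ideal RI)
    and missed content (in the ideal RI but not the system RI)."""
    s0, s1 = system_ri
    i0, i1 = ideal_ri
    overlap = max(0.0, min(s1, i1) - max(s0, i0))
    extraneous = (s1 - s0) - overlap
    missed = (i1 - i0) - overlap
    return extraneous, missed

# extraneous_and_missed((10.0, 60.0), (20.0, 70.0)) -> (10.0, 10.0)
```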

  27. Evaluating RIs and VDs • [chart: median percentages of extraneous and missed content over all queries, comparing the system using COWs with a baseline using only the sentences in which the query terms are present]

  28. MWUs and compositionality

  29. MWUs, idioms, and compositionality • Several partially independent factors are in play here (Calzolari, et al. 2002):
  1. reduced syntactic and semantic transparency;
  2. reduced or lack of compositionality;
  3. more or less frozen or fixed status;
  4. possible violation of some otherwise general syntactic patterns or rules;
  5. a high degree of lexicalization (depending on pragmatic factors);
  6. a high degree of conventionality.

  30. MWUs, idioms, and compositionality • In addition, there are two kinds of “mixed” cases • Ambiguous MWUs, with one meaning compositional and the other not: ‘end of the tunnel’, ‘underground railroad’ • “Normal” use of some component words, but not others: ‘flea market’ (a kind of market) ‘peanut butter’ (a spread made from peanuts)

  31. Automatic detection of non-compositionality • Previous work • Lin (1999): “based on the hypothesis that when a phrase is non-compositional, its mutual information differs significantly from the mutual informations [sic] of phrases obtained by substituting one of the words in the phrase with a similar word.” • For instance, the distribution of ‘peanut’ and ‘butter’ should differ from that of ‘peanut’ and ‘margarine’ • Results are not very good yet, because semantically-related words often have quite different distributions, and many compositional collocations are “institutionalized”, so that substituting words within them will change distributional statistics.

  32. Automatic detection of non-compositionality • Previous work • Baldwin, et al. (2002): “use latent semantic analysis to determine the similarity between a multiword expression and its constituent words”; “higher similarities indicate greater decomposability” • “Our expectation is that for constituent word-MWE pairs with higher LSA similarities, there is a greater likelihood of the MWE being a hyponym of the constituent word.” (for head words of MWEs) • “correlate[s] moderately with WordNet-based hyponymy values.”

  33. Automatic detection of non-compositionality • We use the COW model for a related approach to the problem • COWs (and COW values) of an MWU and its component words will be more alike if the MWU is compositional. • We use a measure of occurrences of a component word near an MWU as another criterion of compositionality • The more often words in the MWU appear near it, but not as a part of it, the more likely it is that the MWU is compositional.

  34. COW pair sum measure • Get the top n COWs of an MWU, and of one of its component words. • For each pair of COWs (one from each of these lists), find their COW value.
  railroad: post//J, shipper//N, freight//N, …
  commuter railroad: post//J, pickups//N, rail//N, …

  35. COW pair sum measure • Get the top n COWs of an MWU, and of one of its component words. • For each pair of COWs (one from each of these lists), find their COW value. • Then sum up these values. The sum measures how similar the contexts of the MWU and its component word are (higher for more compositional MWUs).
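A sketch of the pair-sum measure, assuming each COW list is a dict mapping co-occurring terms to values and that `cow_value(a, b)` looks a pair up in the model (returning a neutral value when the pair is absent); the names and the default n are illustrative.

```python
def cow_pair_sum(mwu_cows, word_cows, cow_value, n=10):
    """Sum of COW values over all cross pairs drawn from the top-n COWs
    of the MWU and the top-n COWs of one component word. Higher sums mean
    the MWU and the word occur in more similar contexts."""
    top_mwu = sorted(mwu_cows, key=mwu_cows.get, reverse=True)[:n]
    top_word = sorted(word_cows, key=word_cows.get, reverse=True)[:n]
    return sum(cow_value(a, b) for a in top_mwu for b in top_word)
```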

  36. Feature overlap measure • Get the top n COWs (and values) of an MWU, and of one of its component words. • For each COW with a value greater than some threshold, treat that COW as a feature of the term. • Then compute the overlap (Jaccard) coefficient; for two sets of features A and B: |A ∩ B| / |A ∪ B|
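A sketch of the feature-overlap measure under the same assumptions as above; the threshold value is an assumed parameter.

```python
def feature_overlap(mwu_cows, word_cows, threshold=10.0):
    """Treat each COW above the threshold as a binary feature and return
    |A ∩ B| / |A ∪ B| for the two resulting feature sets."""
    a = {term for term, value in mwu_cows.items() if value > threshold}
    b = {term for term, value in word_cows.items() if value > threshold}
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```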

  37. Occurrence-based measure • For each occurrence of an MWU, determine if a given component word occurs in a window around that occurrence, but not as part of that MWU. • Calculate the proportion of all occurrences of the MWU for which this is the case.
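A sketch of the occurrence-based measure, assuming the document is a flat token list; the window size is an assumed parameter.

```python
def occurrence_measure(tokens, mwu_tokens, component, window=20):
    """Proportion of MWU occurrences that also have the component word
    nearby but outside the MWU itself; higher values suggest the MWU
    is compositional."""
    k = len(mwu_tokens)
    near = total = 0
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] != mwu_tokens:
            continue
        total += 1
        context = tokens[max(0, i - window):i] + tokens[i + k:i + k + window]
        if component in context:
            near += 1
    return near / total if total else 0.0

# occurrence_measure('the flea market sold flea collars'.split(),
#                    ['flea', 'market'], 'flea')  -> 1.0
```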

  38. Testing the measures • We extracted all MWUs tagged as idiomatic in the Cambridge International Dictionary of English (about 1000 expressions). • There are about 112 of these that conform to our MWU patterns and occur with sufficient frequency in our corpus that we have calculated COWs for them. • Examples: ‘fashion victim’, ‘flea market’, ‘flip side’

  39. Testing the measures • We then searched the 100,000 MWUs for which we have COW values, choosing compositional MWUs containing the same terms. • In some cases, this is difficult or impossible, as no appropriate MWUs are present. About 144 MWUs are on the compositional list.
  fashion victim: fashion designer, crime victim
  flea market: [flea collar], market share
  flip side: [coin flip], side of the building

  40. Results: basic statistics • The idiomatic and compositional sets are quite different in aggregate, though there is a large variance: [table of summary statistics for the two sets]

  41. Results: discriminating the two sets • How well does each measure discriminate between idioms and non-idioms? • [chart: COW pair sum]

  42. Results: discriminating the two sets • How well does each measure discriminate between idioms and non-idioms? • [chart: feature overlap]

  43. Results: discriminating the two sets • How well does each measure discriminate between idioms and non-idioms? • [chart: occurrence-based measure]

  44. Results: discriminating the two sets • Can we do better by combining the measures? • We used the decision-tree software C5.0 to check • Rule: if COW pair sum <= -216.739, or if COW pair sum <= 199.215 and occ. measure < 27.74%, then idiomatic; otherwise non-idiomatic
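The reported rule, written out as a function for clarity. Reading the conjunction as binding tighter than the disjunction is an assumption about the intended grouping, and the occurrence measure is expected as a percentage; the function name is illustrative.

```python
def classify_mwu(cow_pair_sum, occurrence_pct):
    """C5.0-derived rule from the slide: idiomatic if the COW pair sum is
    very low, or is moderately low while the component word rarely appears
    near (but outside) the MWU."""
    if cow_pair_sum <= -216.739 or (cow_pair_sum <= 199.215 and occurrence_pct < 27.74):
        return 'idiomatic'
    return 'non-idiomatic'
```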

  45. Results: discriminating the two sets • Some cases are “split”, classified as idiomatic with respect to one component word but not the other: ‘bear hug’ is idiomatic w.r.t. ‘bear’ but not ‘hug’; ‘flea market’ is idiomatic w.r.t. ‘flea’ but not ‘market’ • Other methods to improve performance on this task • MWUs often come in semantic “clusters”: ‘almond tree’, ‘peach tree’, ‘blackberry bush’, ‘pepper plant’, etc. • Corresponding components in these MWUs can be localized in a small area of WordNet (Barrett, Davis, and Dorr (2001)) or UMLS (Rosario, Hearst, and Fillmore (2002)). • “Outliers” that don’t fit the pattern are potentially idiomatic or non-compositional (e.g., ‘plane tree’, unlike ‘rubber tree’, which is compositional).

  46. Clustering and topic detection

  47. Clustering by similarities among segments • Content of a segment is represented by its topically salient terms. • The COW model is used to calculate a similarity measure for each pair of segments. • Clustering on the resulting matrix of similarities (using the CLUTO package) yields topically distinct clusters of results.
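The slides use the CLUTO package for this step; as a rough stand-in, the same shape of computation can be sketched with SciPy's hierarchical clustering over a precomputed similarity matrix. The similarity-to-distance conversion and the cut threshold are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_segments(similarity, threshold=0.5):
    """Cluster segments given an (n, n) symmetric similarity matrix with
    values in [0, 1] (e.g. COW-based similarities between segment term sets).
    Returns a cluster label for each segment."""
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method='average')
    return fcluster(tree, t=threshold, criterion='distance')

# Ten 'crane' segments with a block-structured similarity matrix would
# come back as two clusters (birds vs. machinery).
```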

  48. Clustering Results • Example: crane • 10 segments form 2 well-defined clusters, one relating to birds, the other to machinery (and cleanup of the WTC debris in particular)

  49. Clustering Results • Example: crane • [cluster visualization for the ‘crane’ segments]

  50. Cluster Labeling • Topic labels for clusters improve usability. • Candidate cluster labels can be obtained from: • Topic terms of segments in a cluster • Multi-word units containing the query term(s) • Outside sources (taxonomies, Wikipedia, …)
