LM Approaches to Filtering


Richard Schwartz, BBN

LM/IR ARDA 2002

September 11-12, 2002

UMASS

- LM approach
- What is it?
- Why is it preferred?

- Controlling the filtering decision

- We distinguish all ‘statistical’ approaches from ‘probabilistic’ approaches.
- The tf-idf metric computes various statistics of words and documents.
- By ‘probabilistic’ approaches, we (I) mean methods where we compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes
P(Doc is Relevant | Query, Document, Collection, etc.)

- If we use Bayes’ rule, we end up with the prior for each document, P(Doc is Relevant | Everything except Query), and the likelihood of the query, P(Query | Doc is Relevant).
- The LM approach is a solution to the second part of this.
- The prior probability component is also important.
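
Written out, the decomposition above looks roughly as follows. The symbols are shorthand assumed here, not taken from the slides: $R_D$ stands for "Doc is Relevant" and $Q$ for the query, with the collection and other context left implicit in every term.

```latex
P(R_D \mid Q) \;=\; \frac{P(Q \mid R_D)\, P(R_D)}{P(Q)}
\;\propto\;
\underbrace{P(Q \mid R_D)}_{\text{query likelihood (LM)}}
\;\times\;
\underbrace{P(R_D)}_{\text{document prior}}
```

For a fixed query, $P(Q)$ is the same for every document, so it can be dropped when comparing documents.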

- If we compute a LM for the query and a document and ask the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
- The LMs would not be expected to be the same even with long queries.

- We (ideally) have three sets of documents:
- Positive documents
- Negative documents
- Large corpus of unknown (mostly negative) documents

- We can estimate a model for both positive and negative documents
- We can find more positive documents in the large corpus
- We use the large corpus to smooth the models from positive and negative documents

- We compute the probability of each new document given each of the models
- The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative.
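
In symbols, the score can be written as below. Here $d$, $\theta_{+}$ and $\theta_{-}$ are notation assumed for the new document and the positive and negative models; they are not symbols from the slides.

```latex
S(d) \;=\; \log \frac{P(d \mid \theta_{+})}{P(d \mid \theta_{-})}
     \;=\; \log P(d \mid \theta_{+}) \;-\; \log P(d \mid \theta_{-})
```

A score above zero favors the positive model, and a score below zero favors the negative model.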

- We can model the probability of the document given the topic in many ways.
- A simple unigram mixture works surprisingly well.
- Weighted mixture of distributions from the topic training and the full corpus
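
A minimal sketch of such a unigram mixture, assuming whitespace-tokenized text and a fixed mixture weight. The function names, the weight `lam=0.5`, the `floor` value for words unseen in the corpus, and the toy data are all illustrative assumptions, not details from the slides.

```python
import math
from collections import Counter

def unigram(tokens):
    """Maximum-likelihood unigram distribution from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mixture_logprob(doc_tokens, topic_lm, corpus_lm, lam=0.5, floor=1e-9):
    """log P(doc | topic) under lam * P_topic(w) + (1 - lam) * P_corpus(w).

    `floor` is an assumed tiny probability for words unseen in the corpus,
    so the logarithm is always defined (requires lam < 1).
    """
    return sum(
        math.log(lam * topic_lm.get(w, 0.0) + (1 - lam) * corpus_lm.get(w, floor))
        for w in doc_tokens
    )

def llr_score(doc_tokens, pos_lm, neg_lm, corpus_lm, lam=0.5):
    """Log-likelihood ratio of the smoothed positive and negative models."""
    return (mixture_logprob(doc_tokens, pos_lm, corpus_lm, lam)
            - mixture_logprob(doc_tokens, neg_lm, corpus_lm, lam))

# Toy data (hypothetical): one positive set, one negative set, one corpus.
positive = "plane crash tower fire rescue".split()
negative = "stock market shares price fall".split()
corpus = positive + negative + "weather rain city news report".split()

pos_lm, neg_lm, corpus_lm = unigram(positive), unigram(negative), unigram(corpus)

on_topic = llr_score("plane crash rescue".split(), pos_lm, neg_lm, corpus_lm)
off_topic = llr_score("stock price".split(), pos_lm, neg_lm, corpus_lm)
```

On this toy data the first document scores above zero and the second below zero, because each contains only words from one of the two training sets.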

- We improve over the ‘naïve Bayes’ model significantly by using the Estimate-Maximize (EM, also known as Expectation-Maximization) technique
- We can extend the model in many ways:
- Ngram model of words
- Phrases: proper names, collocations
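
The slides do not say which parameters EM is used to estimate. As one illustration only, the sketch below uses EM to fit the weight of a two-component unigram mixture (topic model plus corpus model) on a list of tokens. The function name, the toy distributions, and the `floor` value are assumptions made for this example.

```python
def em_mixture_weight(tokens, topic_lm, corpus_lm, lam=0.5, iters=50, floor=1e-9):
    """Estimate lam in lam * P_topic(w) + (1 - lam) * P_corpus(w) by EM."""
    for _ in range(iters):
        # E-step: posterior probability that each token came from the topic component.
        resp = []
        for w in tokens:
            p_topic = lam * topic_lm.get(w, 0.0)
            p_corpus = (1 - lam) * corpus_lm.get(w, floor)
            denom = p_topic + p_corpus
            resp.append(p_topic / denom if denom > 0 else 0.0)
        # M-step: the new weight is the average responsibility of the topic component.
        lam = sum(resp) / len(resp)
    return lam

# Hypothetical toy distributions.
topic_lm = {"plane": 0.5, "crash": 0.5}
corpus_lm = {"plane": 0.1, "crash": 0.1, "stock": 0.4, "price": 0.4}

lam_on = em_mixture_weight(["plane", "crash", "plane", "stock"], topic_lm, corpus_lm)
lam_off = em_mixture_weight(["stock", "price"], topic_lm, corpus_lm)
```

In practice the weight would usually be fit on text held out from the data used to build the topic model, so that the topic component is not favored simply because it has already seen the same words.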

- Because we use a formal generative model, we know how to incorporate any effect we want.
- E.g., the probability of features of the top-5 documents given that some document is relevant

- For filtering, we are required to make a hard decision of whether to accept the document, rather than just rank the documents.
- Problems:
- The score for a particular document depends on many factors that are not important for the decision
- Length of document
- Percentage of low-likelihood words

- The range of scores depends on the particular topic.

- Because of these factors, we would like to map the score for any document and topic into a real posterior probability

- By using the relative score for two models, we remove some of the variance due to the particular document.
- We can normalize for the peculiarities of the topic by computing the distribution of scores for Off-Topic documents.
- Advantages of using Off-Topic documents:
- We have a very large number of documents
- We can fix the probability of false alarms
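
One way to sketch normalization against Off-Topic documents is shown below. It assumes a z-score normalization using the mean and standard deviation of the off-topic scores, and an empirical threshold chosen so that at most a target fraction of off-topic documents is accepted. Both choices, the function names, and the toy scores are illustrative assumptions rather than the method described in the slides.

```python
import statistics

def normalize(score, off_topic_scores):
    """Z-score of `score` relative to this topic's off-topic score distribution."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (score - mu) / sigma

def threshold_for_false_alarm(off_topic_scores, p_fa):
    """Threshold such that at most a fraction p_fa of off-topic scores exceed it."""
    ranked = sorted(off_topic_scores, reverse=True)
    k = min(int(p_fa * len(ranked)), len(ranked) - 1)
    return ranked[k]

def accept(score, threshold):
    """Hard filtering decision: accept only scores strictly above the threshold."""
    return score > threshold

# Hypothetical off-topic scores for one topic.
off_topic = [float(s) for s in range(100)]
thr = threshold_for_false_alarm(off_topic, p_fa=0.05)
false_alarm_rate = sum(accept(s, thr) for s in off_topic) / len(off_topic)
```

Because off-topic documents are plentiful, the tail of their score distribution can be estimated separately for each topic, which is what makes a fixed false-alarm rate practical.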

- For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-language, cross-language, speech recognition output, etc.
- Large improvement will come after multiple sites start using similar techniques.

- Tested in TDT
- Operating with small amounts of training data for each category
- 1 to 4 documents per event

- Robustness to changes over time
- adaptation

- Multi-lingual domains
- How to set threshold for filtering
- Using model of ‘eventness’

- Operating with small amounts of training data for each category
- Large hierarchical category sets
- How to use the structure

- Effective use of prior knowledge
- Predicting performance and characterizing classes
- Need a task where both the discriminative and the LM approach will be tested.

- If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about:
- Airplane crashes
- Terrorism
- Building fires
- Injuries and Death
- Some combination of the above

- In general, we need a way to clarify which combination of topics the user wants
- In TDT, we predefine the task to mean we want more about this specific event (and not about some other terrorist airplane crash into a building).