LM Approaches to Filtering

Richard Schwartz, BBN

LM/IR ARDA 2002

September 11-12, 2002

UMASS

Topics
  • LM approach
    • What is it?
    • Why is it preferred?
  • Controlling Filtering decision
What is LM Approach?
  • We distinguish ‘statistical’ approaches in general from ‘probabilistic’ approaches.
  • The tf-idf metric computes various statistics of words and documents.
  • By ‘probabilistic’ approaches, we (I) mean methods where we compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes

P(Doc is Relevant | Query, Document, Collection, etc.)

  • If we use Bayes’ rule, we end up with the prior for each document, p(Doc is Relevant | Everything except Query), and the likelihood of the query, p(Q | Doc is Relevant), as written out below.
  • The LM approach is a solution to the second part of this.
  • The prior probability component is also important.
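
Written out explicitly (the notation is mine; the slide states the decomposition only in words), the Bayes’-rule factoring above is:

    \[
      P(\mathrm{Rel} \mid Q, D)
        = \frac{P(Q \mid D, \mathrm{Rel})\, P(\mathrm{Rel} \mid D)}{P(Q \mid D)}
        \;\propto\;
        \underbrace{P(Q \mid D, \mathrm{Rel})}_{\text{query likelihood (the LM part)}}
        \times
        \underbrace{P(\mathrm{Rel} \mid D)}_{\text{document prior}}
    \]
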
What it is not
  • If we compute an LM for the query and a document and ask the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
  • The LMs would not be expected to be the same even with long queries.
Issues in LM Approaches for Filtering
  • We (ideally) have three sets of documents:
    • Positive documents
    • Negative documents
    • Large corpus of unknown (mostly negative) documents
  • We can estimate a model for both positive and negative documents
    • We can find more positive documents in large corpus
    • We use large corpus to smooth models from positive and negative documents
  • We compute the probability of each new document given each of the models
  • The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative.
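
A minimal sketch of this two-model scoring, assuming whitespace tokenization and already-smoothed word-probability functions for the positive and negative models (the names and interfaces are mine, not from the talk):

    import math

    def llr_score(doc_words, p_positive, p_negative):
        """Log ratio of the document's likelihood under the positive (on-topic)
        unigram model versus the negative (off-topic) model; larger scores favor
        the positive model. Both models must assign nonzero probability."""
        return sum(math.log(p_positive(w)) - math.log(p_negative(w)) for w in doc_words)

    # e.g. score = llr_score("oil prices rose sharply".split(), p_pos, p_neg)
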
Language Modeling Choices
  • We can model the probability of the document given the topic in many ways.
  • A simple unigram mixture works surprisingly well.
    • Weighted mixture of distributions from the topic training and the full corpus
  • We improve significantly over the ‘naïve Bayes’ model by using the Estimate-Maximize (EM) technique (a sketch follows this list)
  • We can extend the model in many ways:
    • Ngram model of words
    • Phrases: proper names, collocations
  • Because we use a formal generative model, we know how to incorporate any effect we want.
    • E.g., probability of features of top-5 documents given some document is relevant
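
A sketch of the weighted unigram mixture and one possible Estimate-Maximize update for its weight; the interpolation form, the update rule, and all names here are assumptions of mine rather than details given on the slides:

    from collections import Counter

    def unigram(docs):
        """Maximum-likelihood unigram distribution over tokenized documents."""
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        return lambda w: counts[w] / total if total else 0.0

    def em_mixture_weight(words, p_topic, p_corpus, iters=20, lam=0.5):
        """Re-estimate lam so that p(w) = lam*p_topic(w) + (1-lam)*p_corpus(w)
        best fits the tokens in 'words'. Assumes p_corpus(w) > 0 for every w."""
        for _ in range(iters):
            # E-step: posterior that each token was generated by the topic component
            resp = [lam * p_topic(w) / (lam * p_topic(w) + (1 - lam) * p_corpus(w))
                    for w in words]
            # M-step: the new mixture weight is the average responsibility
            lam = sum(resp) / len(resp)
        return lam
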
How to Set the Threshold
  • For filtering, we are required to make a hard decision of whether to accept the document, rather than just rank the documents.
  • Problems:
    • The score for a particular document depends on many factors that are not important for the decision
      • Length of document
      • Percentage of low-likelihood words
    • The range of scores depends on the particular topic.
  • Would like to map the score for any document and topic into a real posterior probability
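
One way to realize such a mapping (an illustration of the idea, not necessarily the method used here): if the score really is a log likelihood ratio, adding the log prior odds of relevance and applying a sigmoid yields a posterior probability.

    import math

    def posterior_from_llr(llr, prior=0.001):
        """Turn a log-likelihood-ratio score into P(relevant | document), assuming
        'prior' is P(relevant) before seeing the document (the value is illustrative)."""
        log_odds = llr + math.log(prior / (1.0 - prior))
        # numerically stable logistic
        if log_odds >= 0:
            return 1.0 / (1.0 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1.0 + e)
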
Score Normalization Techniques
  • By using the relative score for two models, we remove some of the variance due to the particular document.
  • We can normalize for the peculiarities of the topic by computing the distribution of scores for Off-Topic documents.
  • Advantages of using Off-Topic documents:
    • We have a very large number of documents
    • We can fix the probability of false alarms
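
A sketch of this normalization under a Gaussian assumption for the off-topic score distribution; both the Gaussian fit and the particular thresholding rule are my assumptions:

    import statistics
    from math import erf, sqrt

    def z_normalize(score, off_topic_scores):
        """Express a topic score in standard deviations above the mean score that
        the same topic model assigns to known off-topic documents."""
        mu = statistics.mean(off_topic_scores)
        sigma = statistics.stdev(off_topic_scores)
        return (score - mu) / sigma

    def accept(z, false_alarm_rate=0.01):
        """Accept a document when the chance that an off-topic document scores this
        high (under the fitted Gaussian) is below the target false-alarm rate."""
        p_false_alarm = 0.5 * (1.0 - erf(z / sqrt(2.0)))  # P(Z > z), standard normal
        return p_false_alarm < false_alarm_rate
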
The Bottom Line
  • For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-lingual text, cross-language conditions, or speech recognition output.
  • A large improvement will come once multiple sites start using similar techniques.
Grand Challenges
  • Tested in TDT
    • Operating with small amounts of training data for each category
      • 1 to 4 documents per event
    • Robustness to changes over time
      • adaptation
    • Multi-lingual domains
    • How to set threshold for filtering
    • Using model of ‘eventness’
  • Large hierarchical category sets
    • How to use the structure
  • Effective use of prior knowledge
  • Predicting performance and characterizing classes
  • Need a task where both the discriminative and the LM approach will be tested.
What do you really want?
  • If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about:
    • Airplane crashes
    • Terrorism
    • Building fires
    • Injuries and Death
    • Some combination of the above
  • In general, we need a way to clarify which combination of topics the user wants
  • In TDT, we predefine the task to mean we want more about this specific event (and not about some other terrorist airplane crash into a building).