
UMass Amherst at TDT 2003

UMass Amherst at TDT 2003. James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan



  1. UMass Amherst at TDT 2003. James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan. Center for Intelligent Information Retrieval, Department of Computer Science, University of Massachusetts Amherst

  2. What we did • Tasks • Story Link Detection • Topic Tracking • New Event Detection • Cluster Detection

  3. Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models

  4. ROI motivation • Analyzed vector space similarity measures • Failed to distinguish between similar topics • e.g. two “health care” stories from different topics • different locations and individuals • similarity dominated by “health care” terms • drugs, cost, coverage, plan, prescription • Possible solution: first categorize stories • different category ⇒ different topics (mostly true) • use within-category statistics • “health care” may be less confusing • Rules of Interpretation provide natural categories

  5. ROI intuition • [Figure: ROI-tagged corpus. Story pair in different ROIs: sim_new(s1,s2) < sim_old(s1,s2). Story pair in the same ROI: sim_new(s1,s2) = sim_old(s1,s2).] • Each document in the corpus is classified into one of the ROI categories • Stories in different ROIs are less likely to be in the same topic • If two stories belong to different ROIs, we should trust their similarities less

  6. ROI classifiers • Naïve Bayes • BoosTexter [Schapire and Singer, 2000] • Decision tree classifier • Generates and combines simple rules • Features are terms with tf as weights • Used most likely single class • Explored distribution of all classes • Unable to do so successfully

  7. Training Data for Classification • Experiments: train on TDT-2, test on TDT-3 • Submissions: train on TDT-2 plus TDT-3 • Training data prepared the same way • Stories in each topic tagged with topic’s ROI • Remove duplicate stories (in topics with the same ROI) • Remove all stories with more than one ROI • Worst case: a single story relevant to… • Chinese Labor Activists, with ROI Legal/Criminal Cases • Blair Visits China in October, with ROI Political/Diplomatic Mtgs. • China will not allow Opposition Parties, with ROI Miscellaneous • Experiments with removing named entities for training

  8. Naïve Bayes vs. BoosTexter • Similar classification accuracy • Overall accuracy is the same • Errors are substantially different • Our training results (TDT-3) • BoosTexter beat Naïve Bayes for SLD and NED • BoosTexter used in most tasks for submission • Evaluation results: • In Link Detection, Naïve Bayes was more useful

  9. ROI classes in link detection • Given story pair and their estimated ROIs • If estimated ROIs are the same, leave score alone • If they are different, reduce score • Reduced to 1/3 of original value based on training runs • Used four different ROI classifiers • ROI-BT, ne: BoosTexter with named entities • ROI-BT, no-ne: BoosTexter without named entities • ROI-NB, ne: Naïve Bayes with named entities • ROI-NB, no-ne: Naïve Bayes without named entities
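The score adjustment on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `adjust_link_score` and its parameter names are hypothetical, the ROI labels are borrowed from the training-data slide, and the 1/3 factor is the value the slide reports from training runs.

```python
def adjust_link_score(score, roi_a, roi_b, mismatch_factor=1.0 / 3.0):
    """Return the link score, reduced when the two estimated ROIs differ.

    score           -- similarity of the story pair (e.g. a cosine value)
    roi_a, roi_b    -- ROI labels estimated by a classifier for each story
    mismatch_factor -- multiplier applied on mismatch (1/3 per the slide)
    """
    if roi_a == roi_b:
        return score              # same ROI: leave the score alone
    return score * mismatch_factor  # different ROI: trust the score less


same = adjust_link_score(0.6, "Miscellaneous", "Miscellaneous")
different = adjust_link_score(0.6, "Miscellaneous", "Legal/Criminal Cases")
```

Keeping the factor as a parameter makes it easy to re-tune on a different training corpus.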

  10. Training effectiveness (TDT-3) • Story Link Detection • Minimum normalized cost

  11. Evaluation results • Story link detection

  12. ROI for tracking • Compare story to centroid of topic • Built from training stories • If ROI does not match, reduce score according to how severe the mismatch is • Used ROI-BT, ne classifier only

  13. Training for tracking • Topic tracking on TDT-3 • Minimum normalized cost • ROI BoosTexter with named entities only

  14. Evaluation results • Topic tracking on TDT-3 • Minimum normalized cost • ROI BoosTexter with named entities only

  15. ROI-based vocabulary pruning • New Event Detection only • Create “stop list” for each ROI • 300 most frequent terms in stories within ROI • Obtained from TDT-2 corpus • When story is classified into an ROI… • Remove those terms from the story’s vector • ROI determined from BoosTexter classifier
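A minimal sketch of the per-ROI stop list, assuming that "most frequent" means raw term counts summed over all training stories in an ROI (the slide does not say whether term or document frequency was used). The function names and the toy data are hypothetical.

```python
from collections import Counter


def build_roi_stoplists(tagged_stories, top_n=300):
    """Build one stop list per ROI from (roi_label, tokens) training pairs."""
    counts = {}
    for roi, tokens in tagged_stories:
        counts.setdefault(roi, Counter()).update(tokens)
    # Keep the top_n most frequent terms of each ROI as that ROI's stop list.
    return {roi: {term for term, _ in c.most_common(top_n)}
            for roi, c in counts.items()}


def prune_vector(vector, roi, stoplists):
    """Drop the ROI's stop-list terms from a story's term-weight vector."""
    stop = stoplists.get(roi, set())
    return {term: w for term, w in vector.items() if term not in stop}


# Toy usage with a tiny stop list (top_n=2 instead of 300).
training = [
    ("Miscellaneous", ["care", "care", "care", "health", "health", "drug"]),
    ("Legal/Criminal Cases", ["court", "court", "trial"]),
]
stoplists = build_roi_stoplists(training, top_n=2)
pruned = prune_vector({"care": 3.0, "drug": 1.0, "boston": 2.0},
                      "Miscellaneous", stoplists)
```

After pruning, the terms that remain are the ones that distinguish stories within the category, such as locations and names.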

  16. New Event Detection approach • Cosine Similarity measure • ROI-based vocabulary pruning • Score normalization • Incremental IDF • Remove short documents • Preprocessing • Train BoosTexter on TDT-2 & TDT-3 • Include named entities while training
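The core of this pipeline (cosine similarity with incrementally updated IDF, plus a short-document filter) might look roughly like the sketch below. Everything specific here is an assumption: the IDF formula `log((N + 1) / df)`, the decision threshold, the minimum length, and the function names. Score normalization and ROI-based pruning are left out.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def tfidf(tf, df, n_docs):
    """Weight raw term counts with the IDF values known so far."""
    return {t: c * math.log((n_docs + 1) / df[t]) for t, c in tf.items()}


def detect_new_events(stories, threshold=0.2, min_len=2):
    """Return True (new), False (old) or None (skipped) per story."""
    df, n_docs, past, decisions = Counter(), 0, [], []
    for tokens in stories:
        if len(tokens) < min_len:      # remove short documents
            decisions.append(None)
            continue
        n_docs += 1
        df.update(set(tokens))         # incremental IDF statistics
        tf = Counter(tokens)
        vec = tfidf(tf, df, n_docs)
        # Compare against every earlier story, re-weighted with current IDF.
        best = max((cosine(vec, tfidf(p, df, n_docs)) for p in past),
                   default=0.0)
        decisions.append(best < threshold)
        past.append(tf)
    return decisions
```

A story is flagged as a new event when even its closest predecessor falls below the threshold.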

  17. NED Results TDT 3 TDT 4

  18. ROI Conclusions • Both uses of ROI helped in training • Score reduction for ROI mismatch • Tracking and link detection • Vocabulary pruning for new event detection • Score reduction failed in evaluation • Named entities important in ROI classifier • TDT-4 has different set of entities (time gap) • Possible overfitting to TDT-3? • Preliminary work applying to detection • Unsuccessful to date

  19. Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models

  20. Comparing multilingual stories • Baseline • All stories converted to English • Using provided machine translations • New approaches • Dictionary translation of Arabic stories • Native language comparisons • Adaptation in tracking

  21. Dictionary Translation of Arabic • Probabilistic translation model • Each Arabic word has multiple English translations • Obtain P(e|a) from UN Arabic-English parallel corpus • Forms a pseudo-story in English representing Arabic Story • Can get large due to multiple translations per word • Keep English words whose summed probabilities are the greatest
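The pseudo-story construction can be sketched as follows. The translation table, its probabilities, the transliterated tokens, and the `max_terms` cutoff are toy values made up for illustration; in the slides P(e|a) comes from the UN Arabic-English parallel corpus and the actual cutoff is not stated.

```python
from collections import defaultdict


def translate_story(arabic_tokens, trans_probs, max_terms=100):
    """Build an English pseudo-story from P(e|a) translation probabilities.

    trans_probs maps each Arabic word to {english_word: P(e|a)}.
    Probabilities are summed over all tokens of the story, and only the
    max_terms English words with the largest sums are kept.
    """
    weights = defaultdict(float)
    for a in arabic_tokens:
        for e, p in trans_probs.get(a, {}).items():
            weights[e] += p
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:max_terms])


# Hypothetical table using transliterated Arabic tokens.
toy_probs = {
    "kitab": {"book": 0.8, "writing": 0.2},
    "maktaba": {"library": 0.7, "bookstore": 0.3},
}
pseudo_story = translate_story(["kitab", "kitab", "maktaba"], toy_probs,
                               max_terms=2)
```

Truncating by summed probability keeps the pseudo-story from growing with every alternative translation of every word.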

  22. Language specific comparisons • Language representations: • Arabic CP1256 encoding and light stemming • English stopped and stemmed with kstem • Chinese segmented if necessary and overlapping bigrams • Linking Task: • If stories in same language, use that language • All other comparisons done using all stories translated into English
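Two pieces of this slide lend themselves to a short sketch: the overlapping-bigram representation and the rule for choosing the comparison language. The code assumes character bigrams (the slide does not say whether bigrams are over characters or segmented words), and the function names and language labels are hypothetical.

```python
def overlapping_bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def comparison_language(lang_a, lang_b, fallback="english"):
    """Compare in the native language only when both stories share it."""
    return lang_a if lang_a == lang_b else fallback


bigrams = overlapping_bigrams("北京大学")
same_lang = comparison_language("arabic", "arabic")
mixed_lang = comparison_language("arabic", "mandarin")
```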

  23. Adaptation in tracking • Adaptation • Stories added to topic when similarity score is high • Establish a topic representation in each language as soon as an added story in that language appears • An Arabic story is compared to the Arabic topic representation, etc.
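One way to picture per-language adaptation is the sketch below. It assumes each story carries both a native-language vector and an English vector, that centroids are simple sums of story vectors, and that the topic starts from an English seed; the class name, the story layout, and the toy terms are hypothetical. The limits of 100 terms and 100 added stories follow the ADcosIDF condition described with the tracking results.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


class AdaptiveTopic:
    def __init__(self, english_seed, threshold, max_terms=100, max_added=100):
        self.centroids = {"english": Counter(english_seed)}
        self.threshold = threshold
        self.max_terms = max_terms
        self.max_added = max_added
        self.added = 0

    def score(self, story):
        # Use the native-language centroid once one exists, else English.
        lang = story["lang"] if story["lang"] in self.centroids else "english"
        return cosine(story["vectors"][lang], self.centroids[lang])

    def observe(self, story):
        """Score a story and adapt the topic if it is similar enough."""
        s = self.score(story)
        if s > self.threshold and self.added < self.max_added:
            for lang, vec in story["vectors"].items():
                merged = Counter(self.centroids.get(lang, {}))
                merged.update(vec)
                # Keep only the highest-weighted terms of the centroid.
                top = merged.most_common(self.max_terms)
                self.centroids[lang] = Counter(dict(top))
            self.added += 1
        return s
```

With this layout, the first Arabic story that is added through its English translation also creates the Arabic centroid, so later Arabic stories are scored without going through translation.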

  24. Cross-Lingual Link Detection Results Translation Conditions: • 1DcosIDF: baseline, all stories in English using provided translations. • UDcosIDF: all stories in English but using dictionary translation of Arabic. • 4DcosIDF: comparing a pair of stories in native language if both stories within the same language, otherwise comparing them in English using the dictionary translation of Arabic

  25. Cross-Lingual Topic Tracking Results (required condition: Nt=1,bnman) Translation Conditions: • 1DcosIDF: baseline. • UDcosIDF: dictionary translation of Arabic. • 4DcosIDF: comparing a pair of stories in native language. • ADcosIDF: baseline plus adaptation; a story is added to the centroid vector if its similarity score exceeds the adapting threshold, the vector is limited to its top 100 terms, and at most 100 stories can be added to the centroid.

  26. Cross-Lingual Topic Tracking Results (alternate condition: Nt=4,bnasr) • Translation Conditions: • 1DcosIDF: baseline. • UDcosIDF: dictionary translation of Arabic. • 4DcosIDF: comparing a pair of stories in native language. • ADcosIDF: baseline plus adaptation.

  27. Outline • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models

  28. Relevance Models for SLD • Relevance Model (RM): “model of stories relevant to a query” • Algorithm: • Given stories A,B • compute “queries” QA and QB • estimate relevance models P(w|QA) and P(w|QB) • compute divergence between relevance models
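The algorithm on this slide can be illustrated with a small sketch. It assumes the common formulation P(w|Q) = Σ_M P(w|M) P(M|Q) with P(M|Q) proportional to the query likelihood under each document model, additive smoothing, and a symmetric KL divergence; the slide does not state which smoothing, which divergence, or how the queries QA and QB are derived from the stories, so all of those choices and the function names are assumptions.

```python
import math
from collections import Counter


def unigram(tokens, vocab, mu=0.5):
    """Additively smoothed unigram model P(w|M) over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def relevance_model(query, doc_models, vocab):
    """Estimate P(w|Q) as a posterior-weighted mixture of document models."""
    log_like = [sum(math.log(m[q]) for q in query if q in vocab)
                for m in doc_models]
    top = max(log_like)                          # for numerical stability
    weights = [math.exp(l - top) for l in log_like]
    z = sum(weights)
    posterior = [w / z for w in weights]         # P(M|Q), uniform prior
    return {w: sum(p * m[w] for p, m in zip(posterior, doc_models))
            for w in vocab}


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two distributions on the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) + q[w] * math.log(q[w] / p[w])
               for w in p)


corpus = [["drug", "cost", "plan"], ["drug", "prescription", "cost"],
          ["election", "vote", "party"], ["vote", "party", "leader"]]
vocab = sorted({w for doc in corpus for w in doc})
models = [unigram(doc, vocab) for doc in corpus]
rm_a = relevance_model(["drug", "cost"], models, vocab)
rm_b = relevance_model(["prescription", "plan"], models, vocab)
rm_c = relevance_model(["vote", "party"], models, vocab)
```

In this toy corpus the two health-care queries share no terms, yet their relevance models are close because both draw on the same documents; that is the effect the link detector relies on.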

  29. Results: Story Link Detection

  30. Relevance Models for Tracking • Initialize: • set P(M|Q) = 1/Nt if M is a training doc • compute relevance model as before • For each incoming story D: • score = divergence between P(w|D) and RM • if (score > threshold) add D to the training, recompute RM • allow no more than k adaptations
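The tracking loop can be sketched as below. Two assumptions are worth flagging: the score is taken as the negative KL divergence KL(P(w|D) || RM), so that a higher score means a closer match and the test `score > threshold` reads naturally, and the relevance model is the plain average of the topic's document models, which is what P(M|Q) = 1/Nt implies. Smoothing, names, and the toy data are hypothetical.

```python
import math
from collections import Counter


def unigram(tokens, vocab, mu=0.5):
    """Additively smoothed unigram model over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl(p, q):
    """KL(p||q) for two distributions on the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def average(models, vocab):
    """Uniform mixture of models, i.e. P(M|Q) = 1/Nt for each topic doc."""
    return {w: sum(m[w] for m in models) / len(models) for w in vocab}


def track(training_docs, stream, vocab, threshold, k):
    """Return (score, adapted) for each incoming story."""
    topic_models = [unigram(d, vocab) for d in training_docs]
    rm = average(topic_models, vocab)
    adaptations, results = 0, []
    for tokens in stream:
        story_model = unigram(tokens, vocab)
        score = -kl(story_model, rm)          # higher means closer to topic
        adapted = score > threshold and adaptations < k
        if adapted:                           # add D, recompute RM
            topic_models.append(story_model)
            rm = average(topic_models, vocab)
            adaptations += 1
        results.append((score, adapted))
    return results
```

Capping the number of adaptations at k limits how far a few false alarms can drag the model away from the original training stories.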

  31. Results: Topic Tracking

  32. Conclusions • Rule of Interpretation (ROI) classification • ROI-based vocabulary reduction • Cross-language techniques • Dictionary translation of Arabic stories • Native language comparisons • Adaptive tracking • Relevance models
