
Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features

Mianwei Zhou, Kevin Chen-Chuan Chang. University of Illinois at Urbana-Champaign.

Presentation Transcript


  1. Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features. Mianwei Zhou, Kevin Chen-Chuan Chang, University of Illinois at Urbana-Champaign

  2. Much of the Information Sought on the Web nowadays is about Entities. The Web: A Huge Entity Database. Fans: “We love George!!” “OMG! iPad Air is coming out~~” Business: “How to improve our products’ quality?” Editor (TREC-KBA Task): “How to help Wikipedia editors enrich Wikipedia?”

  3. Proposal: Entity-Centric Document Filtering System

  4. Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities. INPUT: Interested Entities and Billions of news articles, blogs, forums, tweets... OUTPUT: Relevant Documents and Irrelevant Documents.

  5. INPUT: The Entity Name Alone is Usually Insufficient. Example: Michael Jordan

  6. INPUT: Use Identification Page to Characterize the Target Entity. Entity Identification Pages: • Resolve the ambiguity problem. • Provide more information about the entity.

  7. OUTPUT: Relevant/Irrelevant Documents for Target Entities. Bill Gates • Relevant: “Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...” • Irrelevant: “Steve Jobs’ story is completely different from Bill Gates ...” Michael Jordan (NBA Player) • Relevant: “Michael Jordan is considered by many the best basketball player in NBA history” • Irrelevant: “Michael Jordan is a leading researcher in machine learning and AI.”

  8. Problem: Entity-Centric Learning to Filter

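  9. Problem: Entity-Centric Learning to Filter. Training Phase: entities with Wiki Pages and documents labeled Relevant/Irrelevant are used to learn the Entity-centric Document Filter. Testing Phase: for a new entity's Wiki Page, the filter must predict the unknown (?) labels of its documents.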

  10. How to Predict Document Relevance for an Entity Characterized by an Identification Page? • Traditional IR models such as BM25 and language models do not work: • They are designed for short queries. • Entity pages contain many noisy keywords.

  11. Our Idea: Check whether the document mentions the most basic information about the entity. Example keywords: Seattle, Microsoft, Philanthropist, Windows

  12. Challenge: Learning Across Entities.

  13. For an Entity with Labeled Documents, Learning its Important Keywords is Simple. Relevant Document: “Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ...” Irrelevant Document: “Steve Jobs’ story is completely different from Bill Gates ...” Relevance of document d for entity e: a weighted sum over keywords, with how many times each keyword appears in d as the features and the importance of each keyword as the feature weighting. • High importance: Microsoft, founder, software, ... • Low importance: Apple, from, ipad, ...
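
The per-entity scoring on this slide can be sketched as a weighted sum of keyword counts. This is a minimal illustration, not the authors' implementation: the tokenizer, the function names, and the numeric weights are all assumptions made up for the example; only the keywords and the two sample documents come from the slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def relevance(document, keyword_importance):
    """Weighted sum: importance of each keyword times its count in the document."""
    counts = Counter(tokenize(document))
    return sum(weight * counts[word] for word, weight in keyword_importance.items())

# Hypothetical importance values for the entity "Bill Gates" (not from the slides).
bill_gates_importance = {
    "microsoft": 2.0, "founder": 1.5, "software": 1.0,   # high importance
    "apple": 0.1, "from": 0.1, "ipad": 0.1,              # low importance
}

relevant_doc = "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday"
irrelevant_doc = "Steve Jobs' story is completely different from Bill Gates"

print(relevance(relevant_doc, bill_gates_importance))
print(relevance(irrelevant_doc, bill_gates_importance))
```

With weights like these the first document scores well above the second, but the weights are tied to Bill Gates' own keywords, which is the limitation the next slide raises.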

  14. However, Such Keyword Importance is Not Adaptable to Other Entities. Keyword Importance Transfer: from Training Entities (with Labeled Documents; keywords such as Seattle, Microsoft, Philanthropist, Windows) to New Entities (without Labeled Documents; keywords such as UNC, NBA, Chicago Bulls, MVP).

  15. Insight: Meta-feature Based Keyword Mapping

  16. Two Keywords for Two Entities: Similar Properties, Similar Importance. Keyword: Chicago Bulls. Keyword: Microsoft. Both of them... • are mentioned a lot in their Wiki Pages. • are organizations. • appear in the info-box. • ...

  17. Meta-Feature -- “Features of Features”: Properties that are related to keyword importance. • General Meta-Features: IDF, IsNoun, InEntity, ... • ID-Page-Related Meta-Features: • Wiki Page: InInfobox, InOpenPara, ... • Amazon Page: InSpec, InReview, ...
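
To make "features of features" concrete, the sketch below computes a few meta-features for one keyword. The names IDF, InInfobox, InEntity, and IDPageTF appear in the slides, but how each is computed here is an assumption (for instance, InEntity is read as "the keyword occurs in the entity name", and the IDF formula is a common smoothed variant). The function name, arguments, and toy data are hypothetical.

```python
import math
import re

def tokenize(text):
    """Lowercase and keep alphabetic runs; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def meta_features(keyword, id_page_text, infobox_text, entity_name, doc_freq, num_docs):
    """Describe a keyword by properties that do not depend on which entity it belongs to."""
    return {
        # Smoothed inverse document frequency over a background corpus (assumed formula).
        "IDF": math.log(num_docs / (1 + doc_freq.get(keyword, 0))),
        # How often the keyword occurs in the identification page.
        "IDPageTF": tokenize(id_page_text).count(keyword),
        # Whether the keyword occurs in the page's info-box.
        "InInfobox": keyword in tokenize(infobox_text),
        # Assumed meaning: the keyword is part of the entity name.
        "InEntity": keyword in tokenize(entity_name),
    }

# Toy inputs, invented for illustration only.
id_page = "Bill Gates co-founded Microsoft in 1975. Microsoft is based near Seattle."
infobox = "Known for: Microsoft; Occupation: philanthropist"
doc_freq = {"microsoft": 9, "the": 999}

mf = meta_features("microsoft", id_page, infobox, "Bill Gates", doc_freq, num_docs=1000)
print(mf)
```

Because none of these values mention the entity itself, a keyword of a new entity can be compared with keywords of training entities in the same meta-feature space.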

  18. Solution: Boosting Mapping Model

  19. Clustering-based Keyword Mapping. Training Phase: keywords of the training entities (e.g. here, Hollywood, the, Microsoft, is, as, NKU, Harvard, this, a, CFR, Cascade, ...) are grouped into clusters, each with its own Keyword Weighting. Testing Phase: keywords from a new entity's Wiki page (e.g. NBA, the, there, UNC, must, Bobcats, ...) are mapped to the same clusters.

  20. Document Relevance based on Keyword Clusters. [Figure labels: Keyword Importance, Keyword Clusters]
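
One way to read the two slides above: each keyword is mapped to a cluster through its meta-features, one weight is kept per cluster, and a document is scored by the weighted counts of cluster hits. The sketch below follows that reading. The cluster rule, the cluster names, the weights, and the meta-feature values are hypothetical placeholders, not values from the presentation.

```python
from collections import Counter

def assign_cluster(meta):
    """Hypothetical hand-written mapping from meta-features to a cluster name."""
    if meta["IDF"] < 1.45:
        return "common"
    if meta["InInfobox"]:
        return "infobox"
    return "other"

def cluster_features(doc_tokens, keyword_meta):
    """Count, per cluster, how many document tokens are entity keywords in that cluster."""
    feats = Counter()
    for tok in doc_tokens:
        if tok in keyword_meta:
            feats[assign_cluster(keyword_meta[tok])] += 1
    return feats

def relevance(doc_tokens, keyword_meta, cluster_weight):
    """Weighted sum over clusters instead of over individual keywords."""
    feats = cluster_features(doc_tokens, keyword_meta)
    return sum(cluster_weight.get(c, 0.0) * n for c, n in feats.items())

# Cluster weights as they might be learned from training entities (made-up numbers).
cluster_weight = {"infobox": 2.0, "other": 0.5, "common": 0.0}

# Keywords of a new entity, described only by meta-features (made-up numbers).
jordan_meta = {
    "nba": {"IDF": 5.0, "InInfobox": True},
    "unc": {"IDF": 6.0, "InInfobox": False},
    "the": {"IDF": 0.1, "InInfobox": False},
}

doc = ["michael", "jordan", "is", "the", "best", "nba", "player"]
print(relevance(doc, jordan_meta, cluster_weight))
```

The new entity needs no labeled documents of its own: its keywords inherit the weight of whichever cluster their meta-features place them in.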

  21. Traditional Clustering Algorithms Might Fail. (1) Irrelevant Meta-Features might Lead to Useless Clusters (example keywords: Occupation, Oscar, the, October, Hollywood, actor, WA, programmer, is, screenwriter, for, MS, consistently, ...). (2) Different Possible Ways of Clustering: Which one is better?

  22. BoostMapping: Boosting Effective Clusters. Objective of Clustering: Boosting the Prediction Accuracy of Relevance, using the Document Labels. Only Useful Clusters are Generated. (Example keywords: Hollywood, here, the, Microsoft, NKU, is, Harvard, this, as, Cascade, CFR, a, ...)

  23. BoostMapping: 1. Initialization: Uniform Document Importance

  24. BoostMapping: 2. Enumerate Conditions to Generate the Most Predictive Cluster, i.e. the One that Achieves the Highest Prediction Accuracy. Cluster: IDF < 1.45, Is_Organization, IDPageTF >= 3, ...

  25. BoostMapping: 3. Update the Document Distribution. Cluster: IDF < 1.45, Is_Organization, IDPageTF >= 3, ...

  26. BoostMapping: 4. Generate the Next Cluster Under the Current Document Distribution. Clusters so far: (IDF <= 1.45, Is_Infobox = False, ...) and (IDF < 1.45, Is_Organization, IDPageTF >= 3, ...)

  27. BoostMapping: 5. Repeat the Process Until the Prediction Accuracy Converges, Updating the Document Distribution Again After Each New Cluster. Clusters so far: (IDF <= 1.45, Is_Infobox = False, ...) and (IDF < 1.45, Is_Organization, IDPageTF >= 3, ...)
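
The five steps above resemble a boosting loop, and the sketch below illustrates them under several assumptions that the slides do not confirm: an AdaBoost-style weight update is used, each cluster is defined by a single condition rather than a conjunction, a cluster "votes relevant" when the document contains at least one entity keyword inside it, and the loop stops after a fixed number of rounds or when no cluster beats chance. All names and toy data are hypothetical.

```python
import math

def fires(condition, doc_tokens, keyword_meta):
    """+1 if the document contains an entity keyword that satisfies the condition, else -1."""
    hit = any(condition(keyword_meta[t]) for t in doc_tokens if t in keyword_meta)
    return 1 if hit else -1

def boost_mapping(examples, conditions, rounds=5):
    """examples: list of (doc_tokens, keyword_meta_of_entity, label in {+1, -1})."""
    n = len(examples)
    dist = [1.0 / n] * n                          # step 1: uniform document importance
    model = []
    for _ in range(rounds):
        best = None                               # step 2: enumerate candidate conditions
        for name, cond in conditions.items():
            preds = [fires(cond, d, m) for d, m, _ in examples]
            err = sum(w for w, p, (_, _, y) in zip(dist, preds, examples) if p != y)
            if best is None or err < best[0]:
                best = (err, name, preds)
        err, name, preds = best
        if err >= 0.5:                            # no cluster is better than chance
            break
        err = max(err, 1e-9)                      # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # AdaBoost-style cluster weight (assumed)
        model.append((name, alpha))
        if err <= 1e-9:                           # perfect cluster: nothing left to reweight
            break
        # step 3: emphasize the documents the new cluster gets wrong
        dist = [w * math.exp(-alpha * y * p) for w, p, (_, _, y) in zip(dist, preds, examples)]
        total = sum(dist)
        dist = [w / total for w in dist]          # steps 4-5 happen on the next iteration
    return model

def predict(model, conditions, doc_tokens, keyword_meta):
    score = sum(a * fires(conditions[name], doc_tokens, keyword_meta) for name, a in model)
    return 1 if score > 0 else -1

conditions = {
    "common": lambda m: m["IDF"] < 1.45,
    "infobox": lambda m: m["InInfobox"],
    "rare": lambda m: m["IDF"] >= 1.45,
}
gates_meta = {"microsoft": {"IDF": 5.0, "InInfobox": True},
              "the": {"IDF": 0.1, "InInfobox": False},
              "different": {"IDF": 2.0, "InInfobox": False}}
jordan_meta = {"nba": {"IDF": 5.0, "InInfobox": True},
               "the": {"IDF": 0.1, "InInfobox": False}}
examples = [
    ({"bill", "gates", "cofounder", "of", "microsoft"}, gates_meta, 1),
    ({"steve", "jobs", "the", "story", "different"}, gates_meta, -1),
    ({"jordan", "the", "best", "nba", "player"}, jordan_meta, 1),
    ({"jordan", "the", "machine", "learning", "researcher"}, jordan_meta, -1),
]
model = boost_mapping(examples, conditions)

# A new entity without labeled documents, described only through meta-features.
kodak_meta = {"camera": {"IDF": 4.0, "InInfobox": True},
              "the": {"IDF": 0.1, "InInfobox": False}}
print(model[0][0], predict(model, conditions, {"kodak", "camera", "the"}, kodak_meta))
```

On this toy data the info-box condition separates the training documents on its own, so the loop stops after one cluster; with noisier data the reweighting in step 3 is what steers later rounds toward documents that earlier clusters misclassify.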

  28. Experiment

  29. Three Datasets • TREC-KBA: 29 person entities, 52,238 documents; Wikipedia pages as ID pages. • Product: 39 product entities, 2,398 documents; Amazon pages as ID pages. • MilQuery (from the Million Query Track): 143 general entities, 8,208 documents; Wikipedia pages as ID pages. Example entities: Dinosaur, Hostage Rescue, Kodak

  30. Performance Comparison with Baselines • QBD-TFIDF: Use TFIDF to Select Important Keywords as Queries. • QueryByName: Use Entity Names as Queries. • VectorSim: Measure Relevance Based on Query-Document Similarity. • LinearMapping: Keyword Mapping Based on a Linear Function.
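
As an illustration of the QBD-TFIDF idea (selecting important keywords from the identification page by TF-IDF and using them as the query), here is a minimal sketch. The slide gives only the one-line description, so the scoring formula, the cutoff `k`, the function name, and the toy document frequencies are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def top_tfidf_keywords(id_page, doc_freq, num_docs, k=3):
    """Return the k terms of the ID page with the highest TF * smoothed IDF."""
    tf = Counter(tokenize(id_page))
    scores = {w: c * math.log(num_docs / (1 + doc_freq.get(w, 0))) for w, c in tf.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Toy ID page and background document frequencies, invented for illustration.
id_page = "Kodak is a company. Kodak makes the camera and the film."
doc_freq = {"is": 900, "a": 950, "the": 999, "and": 900, "company": 200,
            "makes": 300, "kodak": 4, "camera": 49, "film": 99}

query = top_tfidf_keywords(id_page, doc_freq, num_docs=1000)
print(query)
```

The selected keywords would then be issued as an ordinary query to a retrieval model; unlike the mapping approaches, the term weights here are not learned from labeled documents.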

  31. Thanks! Q&A
