ECIR – a Lightweight Approach for Entity-centric Information Retrieval

ECIR – a Lightweight Approach for Entity-centric Information Retrieval. TREC 2010 HPI Potsdam, SAP Research Center Dresden Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann. Who we are. SAP Research. HPI ECIR seminar.

Presentation Transcript


  1. ECIR – a Lightweight Approach for Entity-centric Information Retrieval. TREC 2010. HPI Potsdam, SAP Research Center Dresden. Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann

  2. Who we are (SAP Research, HPI ECIR seminar)
     • Benjamin Emde
     • Michael Leben
     • Christoph Thiele
     • Alexander Hold
     • Wojciech Barczynski
     • Falk Brauer
     • Prof. Dr. Felix Naumann, Information Systems chair at Hasso-Plattner-Institut

  3. Motivation: Entity-centric Search in Enterprises
     • Companies cannot index the whole web:
       • they have to leverage resources on the web (search engines)
       • they have to extract and rank entities at runtime (limited hardware resources)
     • But there is business value in discovering related entities, e.g., to determine competing companies:
       • Source: SAP
       • Target: organization
       • Narrative: Find competitors developing ERP software.
     • Our system evaluates the capabilities of three different search engines to answer queries for the Related Entity Finding (REF) task.
     TREC2010-ECIR HPI/SAP | HPI Information Systems | SAP Research | Nov 19, 2010

  4. Processing Pipeline
     • External resources: POS Tagger, Freebase, Search Engine
     • Pipeline components: Topics, Query Rewriting, Document Retrieval, Rule Generator, Target Entity Extraction, Deduplication & Ranking, Homepage Retrieval, Clueweb Mapping, Results

  6. Query Rewriting - Keyword query generation
     <entity_name>Costco</entity_name>
     <entity_URL>cw09-en0006-60-20817</entity_URL>
     <target_entity>organization</target_entity>
     <narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
     ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )

  7. Query Rewriting - Synonym Retrieval
     <entity_name>Costco</entity_name>
     <entity_URL>cw09-en0006-60-20817</entity_URL>
     <target_entity>organization</target_entity>
     <narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
     ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
     • get alternative names from Freebase and take the most popular ones (ranked by search engine hit count)

  8. Query Rewriting - Filtering using POS tags
     <entity_name>Costco</entity_name>
     <entity_URL>cw09-en0006-60-20817</entity_URL>
     <target_entity>organization</target_entity>
     <narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
     ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
     Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP televisions/NNS sold/VBN by/IN Costco/NNP
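
The three query rewriting slides can be read as one small procedure: pick the most popular alternative names, keep the noun and participle tokens of the POS-tagged narrative, and join both into a keyword query. The Python sketch below illustrates that reading and is not the authors' implementation. The function names, the choice of kept tags (`NN*`, `VBN`), the cutoff `k` and the hit counts are all assumptions made for illustration. It also keeps tokens in narrative order and does not singularize them, so its output differs slightly from the query shown on the slides.

```python
def top_names(alternative_names, hit_counts, k=2):
    """Keep the k alternative names with the highest hit counts."""
    ranked = sorted(alternative_names,
                    key=lambda name: hit_counts.get(name, 0),
                    reverse=True)
    return ranked[:k]


def build_query(tagged_narrative, source_names, keep_tags=("NN", "VBN")):
    """Turn a POS-tagged narrative into '~keyword ... ("name" OR ...)'."""
    # Words of the source entity are covered by the OR group, so skip them.
    source_words = {w.lower() for name in source_names for w in name.split()}
    keywords = []
    for token in tagged_narrative.split():
        word, tag = token.rsplit("/", 1)
        if not tag.startswith(keep_tags):
            continue  # drop prepositions, the imperative "Find", etc.
        if word.lower() in source_words or word in keywords:
            continue
        keywords.append(word)
    names = " OR ".join(f'"{name}"' for name in source_names)
    return " ".join(f"~{w}" for w in keywords) + f" ({names})"


tagged = ("Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP "
          "televisions/NNS sold/VBN by/IN Costco/NNP")
# Hypothetical hit counts, for illustration only.
hits = {"Costco": 900, "Costco Wholesale": 500, "Costco Travel": 40}
names = top_names(list(hits), hits, k=2)
query = build_query(tagged, names)
print(query)
```

Passing the hit counts in as a plain dictionary keeps the sketch independent of any particular search engine API.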

  10. Extraction Rule Construction for SAP Business Objects ThingFinder
      • Source entity and its alternative names:
        #group SourceName (scope="Sentence"): (<Costco>|<Costco><travel>|<Costco><Wholesale><Corp.>|..)
      • Mapping a topic’s target type to predefined ThingFinder types:
        #group TargetType (scope="Sentence"): [TE ORGANIZATION]<>+[/TE]
      • Stemmed noun and verb tokens from a topic’s narrative:
        #group Context (scope="Sentence"): (<STEM:sold>|<STEM:LCD>|<STEM:manufacturer>|..)
      • Combined rule to extract candidates:
        #group Candidate (scope="Paragraph"): [UL]%(TargetType),%(SourceName),(%(Context))[/UL]
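
The rule construction on this slide amounts to filling four templates from a topic. The sketch below assembles the rule strings in Python, copying the rule shapes from the slide. It does not call ThingFinder. The function name `build_rules`, the `TYPE_MAP` dictionary and the assumption that context tokens arrive already stemmed are all hypothetical.

```python
# Hypothetical mapping from TREC target types to ThingFinder types;
# only the ORGANIZATION entry is taken from the slide.
TYPE_MAP = {"organization": "ORGANIZATION"}


def build_rules(source_names, target_type, context_stems):
    """Assemble the four extraction rules for one topic as strings."""
    # Each alternative name becomes a sequence of <token> patterns.
    source_alts = "|".join(
        "".join(f"<{word}>" for word in name.split()) for name in source_names
    )
    context_alts = "|".join(f"<STEM:{stem}>" for stem in context_stems)
    return [
        f'#group SourceName (scope="Sentence"): ({source_alts})',
        f'#group TargetType (scope="Sentence"): '
        f"[TE {TYPE_MAP[target_type]}]<>+[/TE]",
        f'#group Context (scope="Sentence"): ({context_alts})',
        '#group Candidate (scope="Paragraph"): '
        "[UL]%(TargetType),%(SourceName),(%(Context))[/UL]",
    ]


rules = build_rules(
    ["Costco", "Costco Wholesale Corp."],
    "organization",
    ["sold", "LCD", "manufacturer"],
)
for rule in rules:
    print(rule)
```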

  11. Example Candidate Rule Firing
      EXTRACTED PARAGRAPH:
      • SENTENCE#1: Update, October 2006: Prices have dropped everywhere in the flat panel TV market, even among top-tier manufacturers like Panasonic, Sharp and Sony.
      • SENTENCE#2: While Costco offers a 50" Vizio HDTV LCD for around $2,000, top-notch sets from Panasonic are selling for as low as $2,400 at authorized internet dealers—that factors out to a difference of only $40 a year if you consider the 10-year lifespan of the Panasonic Plasma TV.
      • SENTENCE#3: From what we've seen of the build quality in the "bargain-basement" models by Vizio, Maxent, and Envision, you'll be lucky if your discount 50" television lasts half that long.
      Legend: Source Entity, Context Token, Potential Target Entity

  13. Entity Deduplication and Ranking per document
      • Deduplication combines Jaro-Winkler and Jaccard similarity.
      • local ranking (duplicate group per document):
        • distance in sentences between Source and Target Entity
        • normalized such that distance = 0 yields score = 1
        • per duplicate group, consider only the maximum score per document
        • rank 1: Vizio, Panasonic (sentence distance 0)
        • rank 2: Sharp, Sony, Maxent, Envision (sentence distance 1)
      Example: …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
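
The slide names the two similarity measures used for deduplication but not how they are combined. The sketch below implements standard Jaro-Winkler (character level) and Jaccard (token level) similarity. It then treats two names as duplicates when either measure passes a threshold. That "either" rule and both threshold values are assumptions for illustration, not the combination used in the system.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1):
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def jaccard(s, t):
    """Jaccard similarity of the lower-cased token sets."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def is_duplicate(s, t, jw_min=0.9, jaccard_min=0.6):
    # Assumed combination rule: either measure may declare a duplicate.
    return (jaro_winkler(s.lower(), t.lower()) >= jw_min
            or jaccard(s, t) >= jaccard_min)


print(is_duplicate("Costco Wholesale", "Costco Wholesale Corp."))
print(is_duplicate("Sony", "Vizio"))
```

Jaro-Winkler tolerates small spelling variations inside a name, while token-level Jaccard catches names that differ by an added or dropped word, which is one plausible reason to use both.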

  14. Entity Deduplication and Ranking across documents
      • global ranking (for duplicate group in document corpus):
        • aggregated local scores for duplicate groups
        • considers rank position of documents
        • Target Entities extracted from higher-ranked documents are preferred
      document a; search engine rank 5: …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
      document b; search engine rank 1: … stores Costco Wholesale and Sam's Clubs. In addition, the quality of Vizio LCDs looks very similar to the Sonys and Samsungs on …
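
The local and global ranking steps can be sketched on the two example documents. The slides only state that a sentence distance of 0 gives a score of 1 and that higher-ranked documents are preferred. The formulas `1 / (1 + distance)` and `score / rank` are therefore assumptions chosen to satisfy those two constraints. The function names and sentence indices are likewise illustrative, and entity names are assumed to be deduplicated already.

```python
from collections import defaultdict


def local_scores(source_sentences, mentions):
    """Score each entity by its closest sentence distance to the source.

    mentions is a list of (entity, sentence_index) pairs; only the
    maximum score per entity and document is kept.
    """
    scores = {}
    for entity, index in mentions:
        distance = min(abs(index - s) for s in source_sentences)
        score = 1.0 / (1 + distance)  # assumed normalization: distance 0 -> 1
        scores[entity] = max(scores.get(entity, 0.0), score)
    return scores


def global_ranking(documents):
    """Aggregate local scores over (search_rank, scores) pairs."""
    totals = defaultdict(float)
    for rank, scores in documents:
        for entity, score in scores.items():
            totals[entity] += score / rank  # assumed rank weighting
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Document a: Costco is mentioned in sentence 1 (0-based).
doc_a = local_scores([1], [("Panasonic", 0), ("Sharp", 0), ("Sony", 0),
                           ("Vizio", 1), ("Panasonic", 1),
                           ("Vizio", 2), ("Maxent", 2), ("Envision", 2)])
# Document b: Costco in sentence 0, the candidates in sentence 1.
doc_b = local_scores([0], [("Vizio", 1), ("Sony", 1), ("Samsung", 1)])

ranking = global_ranking([(5, doc_a), (1, doc_b)])
for entity, score in ranking:
    print(f"{entity}: {score:.2f}")
```

Under these assumptions, document a reproduces the local ranks from the per-document slide: Vizio and Panasonic score 1, the others 0.5.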

  16. Homepage Candidate Retrieval
      Topic:
      <entity_name>Costco</entity_name>
      <entity_URL>cw09-en0006-60-20817</entity_URL>
      <target_entity>organization</target_entity>
      <narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
      expanded query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
      Target Entities: Panasonic, Sharp, Sony, Vizio, Maxent, Envision
      queries to the selected search engine

  17. Homepage Candidate Retrieval
      queries to the selected search engine:
      http://www.google.com/search?q=vizio
      http://www.google.com/search?q=feature:homepage+vizio
      http://www.google.com/search?q=allintitle:vizio
      http://www.google.com/search?q=allinanchor:Find+homepages+of+manufacturers...
      • homepage candidates set:
        • search engine results
        • Wikipedia outgoing links
        • shortened URLs
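
The four query shapes on this slide can be generated from a target entity and the topic narrative. The sketch below uses only the standard library and copies the base URL and operators from the slide. It does not check whether a given search engine honors these operators. The function name `homepage_queries` is hypothetical.

```python
from urllib.parse import quote_plus

SEARCH_URL = "http://www.google.com/search?q="


def homepage_queries(entity, narrative):
    """Build the plain, feature:homepage, allintitle and allinanchor queries."""
    raw_queries = [
        entity,
        f"feature:homepage {entity}",
        f"allintitle:{entity}",
        f"allinanchor:{narrative}",
    ]
    # Keep ':' unescaped so the operators stay readable, as on the slide.
    return [SEARCH_URL + quote_plus(q, safe=":") for q in raw_queries]


urls = homepage_queries("vizio", "Find homepages of manufacturers")
for url in urls:
    print(url)
```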

  18. Homepage Ranking | source-specific
      • aggregate scores of candidates from multiple sources
      • extract vector of 17 features for each candidate:
        • 5 source-specific flags (f1 to f5)
        • 12 text-based page features (f6 to f17)
      • candidate sources shown in the diagram: search engine results, Wikipedia outgoing links, shortened URLs, feature:homepage operator, allinanchor operator, allintitle operator, related to Wikipedia page
      • example URLs in the homepage candidates set:
        http://www.vizio.com/
        http://www.vizio.com/discover/
        http://www.vizio.com/discover/via/
        http://www.vizio.com/products
        http://www.vizio.com/warranties-installation/
        http://wiki.answers.com/Q
        http://wiki.answers.com/Q/Who_manufactures_Vizio_TV
        http://www.youreviewelectronics.com/vizio-reviews
        http://www.youreviewelectronics.com/vizio-reviews/

  19. Homepage Ranking | configurable
      • feature values are multiplied with configurable weights
      • finding best weight configuration:
        • depends on search engine
        • genetic algorithm for training features
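
The configurable ranking can be read as a weighted sum over the 17-element feature vector of each homepage candidate. The sketch below shows that reading only. The weights, the feature values and the function names are hypothetical, and the genetic algorithm that tunes the weights per search engine is not shown.

```python
NUM_FEATURES = 17  # 5 source-specific flags + 12 text-based page features


def score(features, weights):
    """Weighted sum of one candidate's feature vector."""
    if len(features) != NUM_FEATURES or len(weights) != NUM_FEATURES:
        raise ValueError("expected 17 features and 17 weights")
    return sum(f * w for f, w in zip(features, weights))


def rank_homepages(candidates, weights):
    """Sort candidate URLs by descending weighted score."""
    return sorted(candidates,
                  key=lambda url: score(candidates[url], weights),
                  reverse=True)


# Hypothetical weights; in the talk they are tuned per search engine.
weights = [1.0] * 5 + [0.5] * 12
# Hypothetical feature vectors: flags f1..f5 followed by f6..f17.
candidates = {
    "http://wiki.answers.com/Q": [1, 0, 0, 0, 0] + [0] * 12,
    "http://www.vizio.com/": [1, 1, 1, 0, 1] + [0] * 12,
}
ranked = rank_homepages(candidates, weights)
print(ranked)
```

Keeping the weights in a plain list makes it straightforward to hold one configuration per search engine and swap it in at query time.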

  20. Evaluation | TREC2010 nDCG values
      • B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction
      TREC2010-ECIR HPI/SAP | HPI information systems | SAP research | Nov 09, 2010

  21. Evaluation | TREC2010 nDCG values
      • B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction
      • nDCG aggregated over all topics

  22. Evaluation 2009 | Target Entity Types
      • scores averaged over all runs for the TREC 2009 topics
      • performance for Organizations and Persons is better than for Products

  23. Results from TREC 2010
      • avg nDCG around 0.8

  24. Results from TREC 2010 | bugfixed
      • avg nDCG around 1.15
