ECIR – a Lightweight Approach for Entity-centric Information Retrieval
TREC 2010
HPI Potsdam, SAP Research Center Dresden
Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann
Who we are
• Affiliations: SAP Research, HPI ECIR seminar
• Benjamin Emde, Michael Leben, Christoph Thiele, Alexander Hold, Wojciech Barczynski, Falk Brauer
• Prof. Dr. Felix Naumann, Information Systems chair at Hasso-Plattner-Institut
Motivation: Entity-centric Search in Enterprises
• Companies cannot index the whole web:
  • they have to leverage resources on the web (search engines)
  • they have to extract and rank entities at runtime (limited hardware resources)
• But there is business value in discovering related entities, e.g., to determine competing companies:
  • Source: SAP
  • Target: organization
  • Narrative: Find competitors developing ERP software.
• Our system evaluates the capabilities of three different search engines to answer queries for the Related Entity Finding (REF) task.
TREC2010-ECIR HPI/SAP | HPI Information Systems | SAP Research | Nov 19, 2010
Processing Pipeline
[Figure: pipeline diagram. Components shown: Topics, Query Rewriting, Document Retrieval, Rule Generator, Target Entity Extraction, Deduplication & Ranking, Homepage Retrieval, Clueweb Mapping, Results. External resources shown: POS Tagger, Freebase, Search Engine.]
Query Rewriting - Keyword query generation
Topic:
<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
Keyword query:
~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
Query Rewriting - Synonym Retrieval
• Get alternative names of the source entity from Freebase and take the most popular ones (ranked by search engine hit count).
• Example for the source entity Costco: (“Costco” OR “Costco Wholesale” OR … )
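The alias selection can be sketched as follows. This is a minimal illustration, not the system's code: `top_aliases` is a hypothetical helper, the cut-off `k` is an assumption, and the hit counts are made-up numbers standing in for real search engine responses.

```python
from typing import Callable, Iterable, List


def top_aliases(names: Iterable[str], hit_count: Callable[[str], int], k: int = 3) -> List[str]:
    """Return the k alternative names with the highest hit count."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep first-seen order
    return sorted(unique, key=hit_count, reverse=True)[:k]


# Made-up hit counts; a real run would ask the selected search engine instead.
fake_counts = {
    "Costco": 900,
    "Costco Wholesale": 400,
    "Costco Wholesale Corp.": 50,
    "Costco travel": 20,
}
selected = top_aliases(fake_counts, fake_counts.get, k=2)
print(selected)
```

The selected names are the ones that would be OR-combined in the keyword query.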
Query Rewriting - Filtering using POS tags
• POS-tagged narrative: Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP televisions/NNS sold/VBN by/IN Costco/NNP
• Resulting keyword query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
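A minimal sketch of the POS-based filtering and query assembly. The filter rule (keep `NN*` and `VBN` tokens that are not part of a source entity name) is an assumption inferred from the single example on the slide; the function names are hypothetical. The sketch keeps narrative order and surface forms, so it does not reproduce the slide's term order or the singular "television".

```python
from typing import List


def keywords_from_tagged(tagged: str, source_names: List[str]) -> List[str]:
    """Keep nouns (NN*) and past participles (VBN) that are not source-name tokens."""
    source_tokens = {tok.lower() for name in source_names for tok in name.split()}
    kept = []
    for item in tagged.split():
        word, _, tag = item.rpartition("/")
        if (tag.startswith("NN") or tag == "VBN") and word.lower() not in source_tokens:
            kept.append(word)
    return kept


def build_query(keywords: List[str], source_names: List[str]) -> str:
    """Prefix keywords with '~' and OR-combine the quoted source names."""
    terms = " ".join(f"~{w}" for w in keywords)
    names = " OR ".join(f'"{n}"' for n in source_names)
    return f"{terms} ({names})"


tagged = ("Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP "
          "televisions/NNS sold/VBN by/IN Costco/NNP")
names = ["Costco", "Costco Wholesale"]
query = build_query(keywords_from_tagged(tagged, names), names)
print(query)
# ~homepages ~manufacturers ~LCD ~televisions ~sold ("Costco" OR "Costco Wholesale")
```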
Extraction Rule Construction for SAP Business Objects ThingFinder
• Source entity and its alternative names:
  #group SourceName (scope="Sentence"): (<Costco>|<Costco><travel>|<Costco><Wholesale><Corp.>|..)
• Mapping a topic’s target type to predefined ThingFinder types:
  #group TargetType (scope="Sentence"): [TE ORGANIZATION]<>+[/TE]
• Stemmed noun and verb tokens from a topic’s narrative:
  #group Context (scope="Sentence"): (<STEM:sold>|<STEM:LCD>|<STEM:manufacturer>|..)
• Combined rule to extract candidates:
  #group Candidate (scope="Paragraph"): [UL]%(TargetType),%(SourceName),(%(Context))[/UL]
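A rule generator of this kind can be sketched as plain string assembly. The sketch below only reproduces the shape of the four rules shown on the slide; it is not validated against ThingFinder. `build_rules`, `tokens` and `TYPE_MAP` are hypothetical names, and writing `#group` with a following space is an assumption about the rule syntax.

```python
from typing import Dict, List

# Hypothetical mapping from topic target types to ThingFinder type names.
TYPE_MAP: Dict[str, str] = {"organization": "ORGANIZATION"}


def tokens(phrase: str) -> str:
    """Wrap each whitespace-separated token in angle brackets."""
    return "".join(f"<{t}>" for t in phrase.split())


def build_rules(source_names: List[str], target_type: str, stems: List[str]) -> List[str]:
    """Assemble the four rule strings for one topic."""
    names = "|".join(tokens(n) for n in source_names)
    context = "|".join(f"<STEM:{s}>" for s in stems)
    return [
        f'#group SourceName (scope="Sentence"): ({names})',
        f'#group TargetType (scope="Sentence"): [TE {TYPE_MAP[target_type]}]<>+[/TE]',
        f'#group Context (scope="Sentence"): ({context})',
        '#group Candidate (scope="Paragraph"): '
        '[UL]%(TargetType),%(SourceName),(%(Context))[/UL]',
    ]


rules = build_rules(
    ["Costco", "Costco travel", "Costco Wholesale Corp."],
    "organization",
    ["sold", "LCD", "manufacturer"],
)
print("\n".join(rules))
```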
Example Candidate Rule Firing
Extracted paragraph:
• Sentence #1: Update, October 2006: Prices have dropped everywhere in the flat panel TV market, even among top-tier manufacturers like Panasonic, Sharp and Sony.
• Sentence #2: While Costco offers a 50" Vizio HDTV LCD for around $2,000, top-notch sets from Panasonic are selling for as low as $2,400 at authorized internet dealers—that factors out to a difference of only $40 a year if you consider the 10-year lifespan of the Panasonic Plasma TV.
• Sentence #3: From what we've seen of the build quality in the "bargain-basement" models by Vizio, Maxent, and Envision, you'll be lucky if your discount 50" televisions lasts half that long.
Highlighting legend: Source Entity, Context Token, Potential Target Entity
Entity Deduplication and Ranking per document
• Deduplication combines Jaro-Winkler and Jaccard similarity.
• Local ranking (per duplicate group and document):
  • based on the distance in sentences between source and target entity
  • normalized such that a distance of 0 yields a score of 1
  • per duplicate group, only the maximum score per document is considered
• Example:
  • rank 1: Vizio, Panasonic (sentence distance 0)
  • rank 2: Sharp, Sony, Maxent, Envision (sentence distance 1)
Example text: …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
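The local ranking can be sketched as below. The slide only states that distance 0 maps to score 1; the concrete form `1 / (1 + distance)` is an assumption, and `local_scores` is a hypothetical helper. Entities are assumed to be deduplicated already, so each name stands for one duplicate group.

```python
from typing import Dict, List


def local_scores(sentences: List[List[str]], source_sentences: List[int]) -> Dict[str, float]:
    """Score each entity by its closest sentence distance to the source entity."""
    scores: Dict[str, float] = {}
    for idx, entities in enumerate(sentences):
        distance = min(abs(idx - s) for s in source_sentences)
        score = 1.0 / (1.0 + distance)  # assumed normalization: distance 0 -> 1.0
        for entity in entities:
            # keep only the maximum score per duplicate group and document
            scores[entity] = max(scores.get(entity, 0.0), score)
    return scores


# Target entities per sentence of the example paragraph (0-based sentence index).
paragraph = [
    ["Panasonic", "Sharp", "Sony"],   # sentence 1
    ["Vizio", "Panasonic"],           # sentence 2, mentions the source entity Costco
    ["Vizio", "Maxent", "Envision"],  # sentence 3
]
scores = local_scores(paragraph, source_sentences=[1])
print(scores)
```

With this input, Vizio and Panasonic receive the top score and the other four entities share the second rank, matching the example on the slide.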
Entity Deduplication and Ranking across documents
• Global ranking (per duplicate group in the document corpus):
  • aggregates the local scores of each duplicate group
  • considers the rank position of the documents
  • target entities extracted from higher-ranked documents are preferred
Document a; search engine rank 5: …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
Document b; search engine rank 1: … stores Costco Wholesale and Sam's Clubs. In addition, the quality of Vizio LCDs looks very similar to the Sonys and Samsungs on …
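One possible aggregation is sketched below. The slide does not give the formula, so the `1 / rank` document weight and the summation are assumptions, `global_scores` is a hypothetical helper, and the local scores for both documents are illustrative values.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def global_scores(docs: List[Tuple[int, Dict[str, float]]]) -> List[Tuple[str, float]]:
    """Sum local scores per entity, weighted by the document's search engine rank."""
    totals: Dict[str, float] = defaultdict(float)
    for rank, local in docs:
        weight = 1.0 / rank  # assumed weighting: rank 1 counts most
        for entity, score in local.items():
            totals[entity] += weight * score
    # highest total first; ties broken alphabetically for a stable order
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


doc_a = (5, {"Vizio": 1.0, "Panasonic": 1.0, "Sharp": 0.5,
             "Sony": 0.5, "Maxent": 0.5, "Envision": 0.5})
doc_b = (1, {"Vizio": 0.5, "Sony": 0.5, "Samsung": 0.5})
ranking = global_scores([doc_a, doc_b])
for entity, total in ranking:
    print(f"{entity}: {total:.2f}")
```

Under these assumptions, entities found in the rank-1 document move ahead of entities that only appear in the rank-5 document, even when their local scores are lower.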
Homepage Candidate Retrieval
Topic:
<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
Expanded query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
Target entities: Panasonic, Sharp, Sony, Vizio, Maxent, Envision
• For the target entities: queries to the selected search engine
Homepage Candidate Retrieval
• Queries to the selected search engine:
  http://www.google.com/search?q=vizio
  http://www.google.com/search?q=feature:homepage+vizio
  http://www.google.com/search?q=allintitle:vizio
  http://www.google.com/search?q=allinanchor:Find+homepages+of+manufacturers...
• Homepage candidate set:
  • search engine results
  • Wikipedia outgoing links
  • shortened URLs
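The query URLs above can be generated with a few lines of string handling. In this sketch the operators are copied from the slide as-is, `homepage_queries` is a hypothetical helper, and the anchor text is a parameter because the slide truncates the allinanchor query.

```python
from typing import List
from urllib.parse import quote_plus

BASE = "http://www.google.com/search?q="


def homepage_queries(entity: str, anchor_text: str) -> List[str]:
    """Build the four query URLs for one target entity."""
    raw = [
        entity,
        f"feature:homepage {entity}",
        f"allintitle:{entity}",
        f"allinanchor:{anchor_text}",
    ]
    # keep ':' readable; spaces become '+'
    return [BASE + quote_plus(q, safe=":") for q in raw]


# "Find homepages" is a shortened stand-in for the anchor text used on the slide.
urls = homepage_queries("vizio", "Find homepages")
for url in urls:
    print(url)
```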
Homepage Ranking | source-specific
• aggregate scores of candidates from multiple sources
• extract a vector of 17 features for each candidate:
  • 5 source-specific flags (f1 to f5)
  • 12 text-based page features (f6 to f17)
[Figure: homepage candidate sets per source, annotated with flags f1 to f5. Source labels: search engine results, Wikipedia outgoing links, shortened URLs, feature:homepage operator, allinanchor operator, allintitle operator, related to Wikipedia page. Example URLs: http://www.vizio.com/, http://www.vizio.com/discover/, http://www.vizio.com/discover/via/, http://www.vizio.com/products, http://www.vizio.com/warranties-installation/, http://wiki.answers.com/Q, http://wiki.answers.com/Q/Who_manufactures_Vizio_TV, http://www.youreviewelectronics.com/vizio-reviews/]
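The source-specific flags can be sketched as one 0/1 entry per source for each candidate URL. Which source corresponds to which of f1 to f5 is not recoverable from the slide, so the source names and their order in `SOURCES` are assumptions, as is the assignment of the example URLs to sources; `source_flags` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Assumed source names and order; the slide's mapping to f1..f5 is not recoverable.
SOURCES = ["search_results", "wikipedia_links", "feature_homepage",
           "allinanchor", "allintitle"]


def source_flags(candidates_by_source: Dict[str, Set[str]]) -> Dict[str, List[int]]:
    """Return, per candidate URL, one flag per source that returned it."""
    urls: Set[str] = set().union(*candidates_by_source.values())
    return {
        url: [int(url in candidates_by_source.get(s, set())) for s in SOURCES]
        for url in sorted(urls)
    }


# Illustrative assignment of URLs to sources.
candidates = {
    "search_results": {"http://www.vizio.com/", "http://www.vizio.com/discover/"},
    "wikipedia_links": {"http://www.vizio.com/"},
    "allintitle": {"http://www.vizio.com/products"},
}
flags = source_flags(candidates)
for url, vector in flags.items():
    print(url, vector)
```

A candidate returned by several sources ends up with several flags set, which is what lets the later weighting favour it.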
Homepage Ranking | configurable
• feature values are multiplied with configurable weights
• finding the best weight configuration:
  • depends on the search engine
  • a genetic algorithm is used to train the feature weights
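A toy sketch of weighted scoring with a genetic algorithm for the weights. Everything beyond "weights times feature values" is an assumption: the score is taken as a weighted sum, the fitness counts topics whose known homepage is ranked first, and the operators (elitist selection, uniform crossover, Gaussian mutation) are generic choices. Three features and made-up training data are used instead of the 17 real features.

```python
import random
from typing import Callable, List, Sequence


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature values."""
    return sum(f * w for f, w in zip(features, weights))


def evolve(fitness: Callable[[List[float]], float], n_weights: int,
           generations: int = 30, pop_size: int = 20, seed: int = 0) -> List[float]:
    """Search for weights in [0, 1] that maximize the fitness function."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitist selection: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(n_weights)
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Made-up training data: (feature vectors of the candidates, index of the true homepage).
TRAINING = [
    ([[1, 0, 0.2], [0, 1, 0.9]], 0),
    ([[1, 1, 0.1], [0, 1, 0.8]], 0),
]


def fitness(weights: List[float]) -> int:
    """Number of topics whose true homepage gets the highest score."""
    hits = 0
    for candidates, true_index in TRAINING:
        scores = [score(f, weights) for f in candidates]
        hits += int(scores.index(max(scores)) == true_index)
    return hits


best = evolve(fitness, n_weights=3)
print(best, fitness(best))
```

Since the slide notes that the best configuration depends on the search engine, such a search would be run once per engine with that engine's training data.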
Evaluation | TREC2010 nDCG values
• B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction
• nDCG aggregated over all topics
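For reference, a minimal sketch of how nDCG can be computed per topic and averaged over topics. It uses the common `gain / log2(position + 1)` discount and builds the ideal ordering from the retrieved list only; both are assumptions and need not match the official TREC evaluation settings. The gain values are made up.

```python
import math
from typing import List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain of a ranked list of gain values."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_topic: List[Sequence[float]]) -> float:
    """Average nDCG over all topics."""
    return sum(ndcg(g) for g in per_topic) / len(per_topic)


# Made-up gain lists for three topics, in ranked order.
topics = [[3, 2, 1], [1, 3], [0, 0]]
print(mean_ndcg(topics))
```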
Evaluation 2009 | Target Entity Types
• scores averaged over all runs for the TREC 2009 topics
• performance for organizations and persons is better than for products
Results from TREC 2010
• avg nDCG around 0.8
Results from TREC 2010 | bugfixed
• avg nDCG around 1.15