Explore entity-centric search for enterprises, leveraging search engines to extract and rank related entities. Evaluate search engines for Related Entity Finding task. Discover competitors in ERP software development. Processing pipeline includes POS tagging, Freebase search engine, query rewriting, and document retrieval. Utilize synonym retrieval and POS-tag filtering for entity search. Extraction rules constructed for SAP Business Objects ThingFinder. Example candidate rule firing extracts relevant information about LCD television manufacturers sold by Costco.
ECIR – a Lightweight Approach for Entity-centric Information Retrieval TREC 2010 HPI Potsdam, SAP Research Center Dresden Barczynski, Brauer, Emde, Hold, Leben, Thiele, Naumann
Who we are • SAP Research and HPI ECIR seminar: Benjamin Emde, Michael Leben, Christoph Thiele, Alexander Hold, Wojciech Barczynski, Falk Brauer • Prof. Dr. Felix Naumann, Information Systems chair at Hasso-Plattner-Institut
Motivation: Entity-centric Search in Enterprises
• Companies cannot index the whole web:
  • they have to leverage resources on the web (search engines)
  • they have to extract and rank entities at runtime (limited hardware resources)
• But there is business value in discovering related entities, e.g., to determine competing companies:
  • Source: SAP
  • Target: organization
  • Narrative: Find competitors developing ERP software.
• Our system evaluates the capabilities of three different search engines to answer queries for the Related Entity Finding (REF) task.
TREC2010-ECIR HPI/SAP | HPI Information Systems | SAP Research | Nov 19, 2010
Processing Pipeline (diagram labels): POS Tagger, Freebase, Search Engine, Topics, Query Rewriting, Document Retrieval, Rule Generator, Target Entity Extraction, Deduplication & Ranking, Results, Clueweb Mapping, Homepage Retrieval
Query Rewriting - Keyword query generation
Topic:
<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
Generated query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
Query Rewriting - Synonym Retrieval
• Query for the Costco topic: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
• Alternative names of the source entity are retrieved from Freebase; the most popular ones are kept (ranked by search engine hit count).
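The synonym step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `keep` cutoff, and the hit counts are hypothetical, and the Freebase lookup and the search engine call are replaced by a plain dictionary.

```python
from typing import Callable, Iterable, List


def rank_alternative_names(
    names: Iterable[str],
    hit_count: Callable[[str], int],
    keep: int = 2,
) -> List[str]:
    """Return the `keep` names with the highest hit counts, most popular first."""
    unique = list(dict.fromkeys(names))  # drop repeats, keep first-seen order
    ranked = sorted(unique, key=hit_count, reverse=True)
    return ranked[:keep]


def or_clause(names: Iterable[str]) -> str:
    """Join quoted names into a parenthesised OR clause for the keyword query."""
    return "(" + " OR ".join(f'"{name}"' for name in names) + ")"


# Made-up hit counts standing in for real search engine responses (assumption).
FAKE_HITS = {
    "Costco": 9000,
    "Costco Wholesale": 4000,
    "Costco Wholesale Corp.": 700,
    "Costco travel": 300,
}

top_names = rank_alternative_names(FAKE_HITS, FAKE_HITS.get, keep=2)
clause = or_clause(top_names)
print(clause)
```

How many alternative names the system actually keeps is not stated on the slide, so `keep` is left as a parameter.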
Query Rewriting - Filtering using POS tags
• POS-tagged narrative: Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP televisions/NNS sold/VBN by/IN Costco/NNP
• Resulting query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
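A small sketch of the POS-tag filter, under stated assumptions: it keeps noun and verb tokens, drops the source entity name and a stop word list containing the instruction verb "find", and prefixes each kept token with `~`. The exact tag set, the stop words, and the helper names are hypothetical. The sketch does not stem or reorder terms, so its output differs slightly from the query on the slide (`~televisions` instead of `~television`, `~sold` last instead of first).

```python
from typing import List, Set, Tuple

KEEP_PREFIXES = ("NN", "VB")  # nouns and verbs (assumed filter)


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def keyword_terms(
    tagged: List[Tuple[str, str]],
    source_names: Set[str],
    stop_words: Set[str],
) -> List[str]:
    """Keep noun/verb tokens that are neither the source entity nor stop words."""
    terms: List[str] = []
    for word, tag in tagged:
        if not tag.startswith(KEEP_PREFIXES):
            continue
        if word in source_names or word.lower() in stop_words:
            continue
        term = "~" + word
        if term not in terms:  # avoid repeating a term
            terms.append(term)
    return terms


narrative = (
    "Find/VB homepages/NNS of/IN manufacturers/NNS of/IN LCD/NNP "
    "televisions/NNS sold/VBN by/IN Costco/NNP"
)
terms = keyword_terms(parse_tagged(narrative), {"Costco"}, {"find"})
print(" ".join(terms))
```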
Extraction Rule Construction for SAP Business Objects ThingFinder
• Source entity and its alternative names:
  #group SourceName (scope="Sentence"): (<Costco>|<Costco><travel>|<Costco><Wholesale><Corp.>|..)
• Mapping a topic’s target type to predefined ThingFinder types:
  #group TargetType (scope="Sentence"): [TE ORGANIZATION]<>+[/TE]
• Stemmed noun and verb tokens from a topic’s narrative:
  #group Context (scope="Sentence"): (<STEM:sold>|<STEM:LCD>|<STEM:manufacturer>|..)
• Combined rule to extract candidates:
  #group Candidate (scope="Paragraph"): [UL]%(TargetType),%(SourceName),(%(Context))[/UL]
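The rule generator can be pictured as simple string templating. The sketch below only reproduces the rule text shown on the slide from a topic's parts; it has not been checked against ThingFinder itself. The `#group Name` spacing, the `TYPE_MAP` table (only the organization entry comes from the slide), and the function names are assumptions.

```python
from typing import Dict, List

# Only the "organization" mapping appears on the slide; the table is hypothetical.
TYPE_MAP: Dict[str, str] = {"organization": "ORGANIZATION"}


def name_pattern(name: str) -> str:
    """Turn 'Costco Wholesale Corp.' into '<Costco><Wholesale><Corp.>'."""
    return "".join(f"<{token}>" for token in name.split())


def build_rules(
    source_names: List[str],
    target_type: str,
    context_stems: List[str],
) -> List[str]:
    """Assemble the four rule lines in the textual form shown on the slide."""
    names = "|".join(name_pattern(name) for name in source_names)
    context = "|".join(f"<STEM:{stem}>" for stem in context_stems)
    entity_type = TYPE_MAP[target_type]
    return [
        f'#group SourceName (scope="Sentence"): ({names})',
        f'#group TargetType (scope="Sentence"): [TE {entity_type}]<>+[/TE]',
        f'#group Context (scope="Sentence"): ({context})',
        '#group Candidate (scope="Paragraph"): '
        "[UL]%(TargetType),%(SourceName),(%(Context))[/UL]",
    ]


rules = build_rules(
    ["Costco", "Costco travel", "Costco Wholesale Corp."],
    "organization",
    ["sold", "LCD", "manufacturer"],
)
for rule in rules:
    print(rule)
```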
Example Candidate Rule Firing
EXTRACTED PARAGRAPH:
• SENTENCE#1: Update, October 2006: Prices have dropped everywhere in the flat panel TV market, even among top-tier manufacturers like Panasonic, Sharp and Sony.
• SENTENCE#2: While Costco offers a 50" Vizio HDTV LCD for around $2,000, top-notch sets from Panasonic are selling for as low as $2,400 at authorized internet dealers - that factors out to a difference of only $40 a year if you consider the 10-year lifespan of the Panasonic Plasma TV.
• SENTENCE#3: From what we've seen of the build quality in the "bargain-basement" models by Vizio, Maxent, and Envision, you'll be lucky if your discount 50" televisions lasts half that long.
Legend (highlighting): Source Entity, Context Token, Potential Target Entity
Entity Deduplication and Ranking per document
• Deduplication combines Jaro-Winkler and Jaccard similarity.
• Local ranking (per duplicate group and document):
  • distance in sentences between source and target entity
  • normalized such that a distance of 0 yields a score of 1
  • per duplicate group, only the maximum score per document is considered
• Example:
  • rank 1: Vizio, Panasonic (sentence distance 0)
  • rank 2: Sharp, Sony, Maxent, Envision (sentence distance 1)
Example paragraph: …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
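The local ranking can be illustrated with a short sketch. The slide only says that the score is normalized so that distance 0 gives 1; the concrete formula `1 / (1 + distance)` used here is an assumption, as are the function and variable names. Deduplication is taken as already done, so each key stands for one duplicate group.

```python
from typing import Dict, List


def local_scores(
    source_sentences: List[int],
    mentions: Dict[str, List[int]],
) -> Dict[str, float]:
    """Score each duplicate group by its closest mention to the source entity.

    The score is 1 / (1 + sentence distance), an assumed normalisation that
    gives 1.0 at distance 0. Only the maximum per group is kept.
    """
    scores: Dict[str, float] = {}
    for group, sentence_ids in mentions.items():
        best = 0.0
        for sentence_id in sentence_ids:
            distance = min(abs(sentence_id - src) for src in source_sentences)
            best = max(best, 1.0 / (1.0 + distance))
        scores[group] = best
    return scores


# Sentence numbers from the example paragraph: Costco occurs in sentence 2.
paragraph_mentions = {
    "Panasonic": [1, 2],
    "Sharp": [1],
    "Sony": [1],
    "Vizio": [2, 3],
    "Maxent": [3],
    "Envision": [3],
}
scores = local_scores([2], paragraph_mentions)
for group, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(group, score)
```

With this normalisation, Vizio and Panasonic score 1.0 and the remaining four groups score 0.5, which matches the two ranks listed on the slide.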
Entity Deduplication and Ranking across documents
• Global ranking (per duplicate group across the retrieved documents):
  • aggregates the local scores of each duplicate group
  • considers the rank position of the documents
  • target entities extracted from higher-ranked documents are preferred
Document a (search engine rank 5): …even among top-tier manufacturers like Panasonic, Sharp and Sony. While Costco offers a 50" Vizio HDTV LCD … from Panasonic are selling … consider the 10-year lifespan of the Panasonic Plasma TV. From what we've seen … by Vizio, Maxent, and Envision, you'll be lucky …
Document b (search engine rank 1): … stores Costco Wholesale and Sam's Clubs. In addition, the quality of Vizio LCDs looks very similar to the Sonys and Samsungs on …
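One possible way to aggregate local scores across documents is sketched below. The slide does not give the aggregation formula; weighting each document by `1 / rank` and summing is an assumption, and the local scores for documents a and b are illustrative values, not numbers from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def global_scores(
    documents: List[Tuple[int, Dict[str, float]]],
) -> Dict[str, float]:
    """Sum local scores per duplicate group, weighted by 1 / document rank."""
    totals: Dict[str, float] = defaultdict(float)
    for rank, local in documents:
        weight = 1.0 / rank  # assumed weighting: rank 1 counts most
        for group, score in local.items():
            totals[group] += weight * score
    return dict(totals)


# Illustrative local scores for the two example documents (assumed values).
doc_a = (5, {"Vizio": 1.0, "Panasonic": 1.0, "Sharp": 0.5,
             "Sony": 0.5, "Maxent": 0.5, "Envision": 0.5})
doc_b = (1, {"Vizio": 0.5, "Sony": 0.5, "Samsung": 0.5})

totals = global_scores([doc_a, doc_b])
ranking = sorted(totals, key=lambda group: -totals[group])
print(ranking)
```

Under these assumptions Samsung, found only in the top-ranked document b, ends up ahead of Panasonic, which has the best local score but only in the rank-5 document a.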
Homepage Candidate Retrieval
Topic:
<entity_name>Costco</entity_name>
<entity_URL>cw09-en0006-60-20817</entity_URL>
<target_entity>organization</target_entity>
<narrative>Find homepages of manufacturers of LCD televisions sold by Costco.</narrative>
Expanded query: ~sold ~homepages ~manufacturers ~LCD ~television (“Costco” OR “Costco Wholesale” OR … )
Target entities: Panasonic, Sharp, Sony, Vizio, Maxent, Envision
• queries to the selected search engine
Homepage Candidate Retrieval
• Queries to the selected search engine:
  http://www.google.com/search?q=vizio
  http://www.google.com/search?q=feature:homepage+vizio
  http://www.google.com/search?q=allintitle:vizio
  http://www.google.com/search?q=allinanchor:Find+homepages+of+manufacturers...
• Homepage candidate set:
  • search engine results
  • Wikipedia outgoing links
  • shortened URLs
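The entity-based query URLs listed above could be generated along these lines. This is a sketch: the function name is hypothetical, lower-casing the entity name is an assumption based on the `vizio` examples, and the `allinanchor:` query is left out because its text is truncated in the source.

```python
from typing import List
from urllib.parse import quote_plus

BASE_URL = "http://www.google.com/search?q="


def homepage_queries(entity: str) -> List[str]:
    """Build the plain, feature:homepage and allintitle queries for an entity."""
    term = quote_plus(entity.lower())  # spaces become '+', as in the slide URLs
    return [
        BASE_URL + term,
        BASE_URL + "feature:homepage+" + term,
        BASE_URL + "allintitle:" + term,
    ]


queries = homepage_queries("Vizio")
for url in queries:
    print(url)
```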
Homepage Ranking | source-specific
• aggregate scores of candidates from multiple sources
• extract a vector of 17 features for each candidate:
  • 5 source-specific flags (f1 to f5)
  • 12 text-based page features (f6 to f17)
[Diagram: the homepage candidate set is built from search engine results, Wikipedia outgoing links, and shortened URLs. Labels attached to the flags f1 to f5: allinanchor operator, allintitle operator, feature:homepage operator, related to Wikipedia page. Example candidate URLs: http://www.vizio.com/, http://www.vizio.com/discover/, http://www.vizio.com/discover/via/, http://www.vizio.com/warranties-installation/, http://www.vizio.com/products, http://wiki.answers.com/Q, http://wiki.answers.com/Q/Who_manufactures_Vizio_TV, http://www.youreviewelectronics.com/vizio-reviews/]
Homepage Ranking | configurable
• feature values are multiplied by configurable weights
• finding the best weight configuration:
  • depends on the search engine
  • a genetic algorithm is used to train the feature weights
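A weighted feature score of this kind can be sketched as a dot product of the 17-element feature vector and a weight vector. The feature values and weights below are made up for illustration, and summing the weighted values is an assumption about how they are combined; the genetic algorithm that tunes the weights is not shown.

```python
from typing import Dict, List

NUM_FEATURES = 17  # f1..f5 source-specific flags, f6..f17 text-based features


def homepage_score(features: List[float], weights: List[float]) -> float:
    """Weighted sum of feature values (assumed combination rule)."""
    if len(features) != NUM_FEATURES or len(weights) != NUM_FEATURES:
        raise ValueError("expected 17 features and 17 weights")
    return sum(f * w for f, w in zip(features, weights))


def rank_candidates(
    candidates: Dict[str, List[float]],
    weights: List[float],
) -> List[str]:
    """Order candidate URLs by descending weighted score."""
    return sorted(
        candidates,
        key=lambda url: homepage_score(candidates[url], weights),
        reverse=True,
    )


# Hypothetical weights and feature vectors; text features are zeroed for brevity.
weights = [1.0] * 5 + [0.5] * 12
candidates = {
    "http://www.youreviewelectronics.com/vizio-reviews": [0, 1, 0, 0, 0] + [0] * 12,
    "http://www.vizio.com/": [1, 1, 1, 1, 1] + [0] * 12,
}
ranked = rank_candidates(candidates, weights)
print(ranked[0])
```

Because the best weights depend on the search engine, a separate `weights` list per engine would be kept in a setup like this.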
Evaluation | TREC2010 nDCG values • B = Bing, G = Google, Y = Yahoo; 64/16: number of documents retrieved for entity extraction • nDCG aggregated over all topics
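For readers unfamiliar with the metric, a common formulation of nDCG and its average over topics is sketched below. Which exact variant and cutoff the evaluation used is not stated on the slide, so the log2 discount, the cutoff at the length of the returned list, and the function names are assumptions.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))


def ndcg(ranked_gains: List[float], judged_gains: List[float]) -> float:
    """DCG of the returned ranking divided by the DCG of an ideal ranking.

    `judged_gains` holds the gains of all judged relevant items for the topic;
    the ideal ranking is cut off at the length of the returned list.
    """
    ideal = sorted(judged_gains, reverse=True)[: len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(per_topic: List[float]) -> float:
    """Average nDCG over topics; 0.0 for an empty list."""
    return sum(per_topic) / len(per_topic) if per_topic else 0.0


perfect = ndcg([2, 1, 0], [2, 1, 0])
late_hit = ndcg([0, 1], [1])
print(perfect, late_hit, mean_ndcg([perfect, late_hit]))
```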
Evaluation 2009 | Target Entity Types • scores averaged over all runs for the TREC 2009 topics • performance is better for organizations and persons than for products
Results from TREC 2010 • avg nDCG around 0.8
Results from TREC 2010 | bugfixed • avg nDCG around 1.15