
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ. of Michigan) and Phil Bernstein (Microsoft Research). Presented by Vaibhav Mehta.



Presentation Transcript


  1. HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching. Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research). Presented by Vaibhav Mehta.

  2. Scenario Arnab Nandi & Phil Bernstein

  3. Scenario • Search over structured data • Commerce • Entertainment • Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

  4. Scenario (diagram; labels: “Amazon.com”, four 3rd Party Feeds, Users, query, Search engine + data warehouse, results) • High precision (irrespective of recall) • Minimal human involvement

  5. Example Feed: 3rd Party Movie Site (Foreign) and Warehouse: Movies (Host), in that order:
     <Movie>
       <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
       <Release Key="Yes">2008</Release>
       <Description>Ever…</Description>
       <RunTime>127</RunTime>
       <Categories>
         <Category>Action</Category>
         <Category>Comedy</Category>
       </Categories>
       <MPAA>PG-13</MPAA>
       <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
       <Persons>
         <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
       </Persons>
     </Movie>
     <MOVIE>
       <MOVIE_ID>57590</MOVIE_ID>
       <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
       <RUNTIME>02:00</RUNTIME>
       <GENRE1>Action/Adventure</GENRE1>
       <GENRE2/>
       <MPAA>NR</MPAA>
       <ADVISORY/>
       <URL>http://www.indianajones.com/</URL>
       <ACTOR1>Harrison Ford</ACTOR1>
       <ACTOR2>Karen Allen</ACTOR2>
     </MOVIE>

  6. Schema Matching: 3rd Party Movie Site (Foreign) and Warehouse: Movies (Host), in that order:
     <Movie>
       <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
       <Release Key="Yes">2008</Release>
       <Description>Ever…</Description>
       <RunTime>127</RunTime>
       <Categories>
         <Category>Action</Category>
         <Category>Comedy</Category>
       </Categories>
       <MPAA>PG-13</MPAA>
       <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
       <Persons>
         <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
       </Persons>
     </Movie>
     <MOVIE>
       <MOVIE_ID>57590</MOVIE_ID>
       <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
       <RUNTIME>02:00</RUNTIME>
       <GENRE1>Action/Adventure</GENRE1>
       <GENRE2/>
       <RATING>NR</RATING>
       <ADVISORY/>
       <URL>http://www.indianajones.com/</URL>
       <ACTOR1>Harrison Ford</ACTOR1>
       <ACTOR2>Karen Allen</ACTOR2>
     </MOVIE>

  7. Taxonomy Matching: 3rd Party Movie Site (Foreign) and Warehouse: Movies (Host), in that order:
     <Movie>
       <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
       <Release Key="Yes">2008</Release>
       <Description>Ever…</Description>
       <RunTime>127</RunTime>
       <Categories>
         <Category>Action</Category>
         <Category>Comedy</Category>
       </Categories>
       <MPAA>PG-13</MPAA>
       <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
       <Persons>
         <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
       </Persons>
     </Movie>
     <MOVIE>
       <MOVIE_ID>57590</MOVIE_ID>
       <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
       <RUNTIME>02:00</RUNTIME>
       <GENRE1>Action/Adventure</GENRE1>
       <GENRE2/>
       <RATING>NR</RATING>
       <ADVISORY/>
       <URL>http://www.indianajones.com/</URL>
       <ACTOR1>Harrison Ford</ACTOR1>
       <ACTOR2>Karen Allen</ACTOR2>
     </MOVIE>

  8. Various Problems • Badly normalized… • Unit conversion… • In-band signaling… • Arbitrary labels • Zero documentation • Not enough instances • Formatting choices… • Non-standard vocabulary / language

  9. Unlike conventional matching… • We have web search click data • For both the warehouse and the 3rd-party website • The databases we are integrating (usually) have a presence on the web • Why not use click data as a feature for schema & taxonomy matching?

  10. Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results

  11. Core idea • “If two (sets of) products are searched for by similar queries, then they are similar” (diagram: web search for “small laptop”)

  12. Core idea (diagram; labels: Asus.com, Warehouse, Clicklog, hardware, Small Laptops, Pro. Laptops, eee; products X, Y, Z reached via the query “small laptop”; correspondence shown: eee ::: small laptops)

  13. Query Distributions (chart; axis label: click count)

  14. Mapping to Taxonomy • Map URL to product, which belongs to a taxonomy • http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks • 3rd party DB (provided to us)

  15. Aggregating Query Distributions (diagram; labels: Asus.com, Warehouse, hardware, Small Laptops, Pro. Laptops, eee; correspondence shown: eee ::: small laptops)

  16. Generating Correspondences
     • Goal: to match two schema elements (or categories), determine whether they have the same distribution of queries searching for them.
     • Process
       • For each page (URL): identify its query distribution, and identify the category / schema element of that page
       • For each category / schema element C: aggregate over the pages in C to get the query distribution of C
       • For each foreign category / schema element: find the host category / schema element with the most similar query distribution
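The process on slide 16 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(query, url, clicks)` record layout, the URLs, and the `shared_mass` stand-in similarity are all assumptions made for the example. The click counts are borrowed from the taxonomy-matching example later in the deck.

```python
from collections import Counter, defaultdict

def aggregate_by_category(clicks, url_to_category):
    """Turn (query, url, click_count) records into {category: {query: frequency}}.

    Hypothetical helper: the record layout and the url_to_category lookup are
    assumptions, standing in for the clicklog and the URL-to-taxonomy mapping.
    """
    counts = defaultdict(Counter)
    for query, url, n in clicks:
        category = url_to_category.get(url)
        if category is None:
            continue  # page is not mapped to any category; skip it
        counts[category][query] += n  # aggregate over all pages in the category
    dists = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        dists[category] = {q: c / total for q, c in counter.items()}
    return dists

def best_host_matches(foreign, host, similarity):
    """For each foreign category, pick the host category with the most similar distribution."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }

def shared_mass(host_dist, foreign_dist):
    """Placeholder similarity (an assumption): frequency mass shared by identical queries."""
    return sum(min(p, foreign_dist[q]) for q, p in host_dist.items() if q in foreign_dist)

# Made-up URLs; click counts follow the deck's laptop example.
host_clicks = [("laptop", "host/p1", 70), ("netbook", "host/p1", 5),
               ("laptop", "host/p2", 25), ("netbook", "host/p2", 20)]
host_categories = {"host/p1": "Professional Laptops", "host/p2": "Small Laptops"}
foreign_clicks = [("laptop", "foreign/p1", 5), ("netbook", "foreign/p1", 15),
                  ("cheap laptop", "foreign/p1", 5)]
foreign_categories = {"foreign/p1": "eee"}

host_dists = aggregate_by_category(host_clicks, host_categories)
foreign_dists = aggregate_by_category(foreign_clicks, foreign_categories)
matches = best_host_matches(foreign_dists, host_dists, shared_mass)
print(matches)
```

Passing the similarity function as a parameter keeps the aggregation step independent of the metric, so a different distribution similarity can be swapped in without touching the rest.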

  17. Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results

  18. Example: Taxonomy Matching • Warehouse: Professional Laptops • Warehouse: Small Laptops • eee

  19. Example: Taxonomy Matching • Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75 • Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45 • eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

  20. Distribution Similarity Metric $$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

  21. Example: Taxonomy Matching • “Small Laptops” vs “eee” (laptop vs laptop, netbook vs netbook, laptop vs cheap laptop): 1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74 • Warehouse: Professional Laptops (“laptop”: 70/75, “netbook”: 5/75): 0.31 • Warehouse: Small Laptops (“laptop”: 25/45, “netbook”: 20/45): 0.74 • eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
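The arithmetic for “Small Laptops” vs “eee” can be checked with a short sketch. It rests on two assumptions inferred from the worked numbers rather than stated on the slides: Jaccard is computed over the sets of whitespace-separated words in each query (which gives 0.5 for “laptop” vs “cheap laptop”), and MinFreq is the smaller of the two relative frequencies. The function names are hypothetical.

```python
def jaccard(query_a, query_b):
    """Word-set Jaccard similarity between two queries (tokenisation is an assumption)."""
    words_a, words_b = set(query_a.lower().split()), set(query_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def distribution_similarity(host_dist, foreign_dist):
    """Sum of Jaccard x MinFreq over all (host query, foreign query) combinations."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host_dist.items()
        for q_foreign, f_foreign in foreign_dist.items()
    )

# Query distributions from the example slides.
professional_laptops = {"laptop": 70 / 75, "netbook": 5 / 75}
small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score_small = distribution_similarity(small_laptops, eee)
score_pro = distribution_similarity(professional_laptops, eee)

print(f"Small Laptops vs eee: {score_small:.2f}")  # 0.74, as in the slide's arithmetic
print("best host category:",
      "Small Laptops" if score_small > score_pro else "Professional Laptops")
```

Query pairs with no words in common (for example “laptop” vs “netbook”) have a Jaccard of 0 and drop out of the sum, which is why only three terms appear in the slide's calculation.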

  22. Advantages of Clicklogs • Resilient to language • Resilient to new domains, data, and features • As long as people query & click, we have data to learn from • Generates mappings previous methods can’t • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

  23. System Design

  24. Outline • Scenario • Using Clicklogs • Core idea • Using Query Distributions • Example • System Architecture • Results

  25. Experimenting with Click Logs • Commercial warehouse mapping, 258 products • from a 70,000 term Amazon.com taxonomy (613 in gold) • to a 6,000 term warehouse taxonomy (40 in gold) • Live.com (now Bing.com) search query log • Amazon to warehouse mapping task, consecutively halving the clicklog size used • 1.8 million clicks to Amazon.com product pages • On average, each product's query distribution had 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

  26. Summary of Results • 90% precision / recall possible • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric

  27. Precision / Recall • Commercial warehouse mapping, 258 products • from a 70K term Amazon.com taxonomy • to a 6,000 term warehouse taxonomy (613 categories used)

  28. Summary of Results • 90% precision / recall possible • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric

  29. Varying Clicklog Size • Successively decreased clicklog size by half • Recall decreases as clicklog size is decreased

  30. Summary of Results • 90% precision / recall possible • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric

  31. Comparing Query Distributions $$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$ • Replace Jaccard with various phrase similarity metrics • Minimal difference due to size of most queries

  32. Summary of Results • 90% precision / recall possible • Query distribution is a good similarity metric • Bigger clicklogs imply better recall • Technique isn't very sensitive to similarity metric

  33. Conclusion • Unsupervised mapping is possible • Very high recall / precision when enough queries are present • Click logs are promising • They find results that other methods cannot find • As clicklog size increases, they will produce more mappings • Combinable with existing methods

  34. Questions?
