Minimally Supervised Learning of Semantic Knowledge from Query Logs

Mamoru Komachi (†) and Hisami Suzuki (‡). (†) Nara Institute of Science and Technology, Japan; (‡) Microsoft Research, USA. IJCNLP-08, Hyderabad, India.

Presentation Transcript


  1. Minimally Supervised Learning of Semantic Knowledge from Query Logs • Mamoru Komachi (†) and Hisami Suzuki (‡) • (†) Nara Institute of Science and Technology, Japan • (‡) Microsoft Research, USA • IJCNLP-08, Hyderabad, India

  2. Task • Learn semantic categories from web search query logs by bootstrapping with minimal supervision • Semantic category: a set of interrelated words • Named entities, technical terms, paraphrases, … • Can be useful for search ads, etc. [Figure: example category of teas joined by "similar" links: Darjeeling, Kombucha (Japanese tea), Chai (Indian tea)]

  3. Our Contribution • First to use Japanese query logs for the task of learning named entities • Propose an efficient method suited to query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti, 2006)

  4. Table of Contents • Related work: bootstrapping techniques for relation extraction; scoring metrics • The Tchai algorithm: problems of Espresso; extension to Espresso • Experiment: system performance and comparison to other algorithms; samples of extracted instances and patterns

  5. Bootstrapping • Iteratively conduct pattern induction and instance extraction, starting from seed instances • Can grow a small set of seed instances into a larger one [Figure: in the query log (corpus), the seed instance "vaio" matches the query "Compare vaio laptop", which induces the contextual pattern "Compare # laptop" (#: slot); that pattern then extracts the instances "Toshiba satellite" and "HP xb3000" from the queries "Compare toshiba satellite laptop" and "Compare HP xb3000 laptop"]
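
The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names (`induce_patterns`, `extract_instances`, `bootstrap`) are hypothetical, matching is plain case-sensitive substring matching, and no scoring or filtering is applied, so nothing here guards against semantic drift.

```python
def induce_patterns(queries, instances, slot="#"):
    """Replace a known instance inside a query with the slot marker."""
    patterns = set()
    for query in queries:
        for inst in instances:
            # Skip queries that consist of the instance alone (no context).
            if inst in query and query != inst:
                patterns.add(query.replace(inst, slot, 1))
    return patterns


def extract_instances(queries, patterns, slot="#"):
    """Return the strings that fill the slot of any pattern in any query."""
    found = set()
    for query in queries:
        for pattern in patterns:
            prefix, _, suffix = pattern.partition(slot)
            has_filler = len(query) > len(prefix) + len(suffix)
            if query.startswith(prefix) and query.endswith(suffix) and has_filler:
                found.add(query[len(prefix):len(query) - len(suffix)])
    return found


def bootstrap(queries, seeds, iterations=2):
    """Alternate pattern induction and instance extraction (unscored)."""
    instances = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(queries, instances)
        instances |= extract_instances(queries, patterns)
    return instances


# Queries taken from the slide's figure; "vaio" is the only seed.
queries = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
learned = bootstrap(queries, {"vaio"})
```

With the three queries from the figure, the single seed induces the pattern "Compare # laptop", which in turn picks up the other two product names.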

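  6. Instance lookup and pattern induction • Use every string in a query except the instance as the pattern, so no word segmentation is required • Example: the instance "ANA" in the query "ANA 予約" ("ANA reservation") yields the extracted pattern "# 予約" ("# reservation") • Such generic patterns give broad coverage but are noisy (restaurant reservation? flight reservation?) • Resulting issues: semantic drift and computational efficiency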

  7. Instance/Pattern Scoring Metrics • Sekine & Suzuki (2007) • Starts from a large named entity dictionary • Assigns low scores to generic patterns and ignores them • Basilisk (Thelen and Riloff, 2002) • Balances the recall and precision of generic patterns • Espresso (Pantel and Pennacchiotti, 2006) • The reliability of an instance and the reliability of a pattern are mutually defined • PMI is normalized by its maximum over all P and I • Notation: P = patterns in the corpus, I = instances in the corpus, PMI = pointwise mutual information, r = reliability score
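
For reference, the mutually defined Espresso reliability scores can be written as below. This follows the formulation published by Pantel and Pennacchiotti (2006) rather than anything recoverable from the transcript itself; the symbols r_π (pattern reliability), r_ι (instance reliability) and max_pmi (the maximum PMI over all instances and patterns) are that paper's notation, and |i,p| denotes the co-occurrence count of instance i with pattern p.

```latex
\begin{align*}
\mathrm{pmi}(i,p) &= \log \frac{|i,p|}{|i,*|\,|*,p|} \\
r_\pi(p) &= \frac{1}{|I|} \sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}}\; r_\iota(i) \\
r_\iota(i) &= \frac{1}{|P|} \sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}}\; r_\pi(p)
\end{align*}
```

Seed instances start with reliability 1, and the two scores are then updated in alternation.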

  8. The Tchai Algorithm • Filter generic patterns/instances • Do not select generic patterns and instances • Replace the scaling factor in the reliability scores • Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns • This modification has a large impact on the effectiveness of our algorithm • Induce patterns only at the beginning • Tchai runs 400 times faster than Espresso
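
The change of scaling factor can be illustrated with a small sketch of one pattern-reliability update. Several things here are assumptions rather than details given on the slide: the function name `pattern_reliability` is hypothetical, the PMI values are made-up positive numbers, and "the maximum PMI for a given instance/pattern" is read as the maximum over the pairs that share the pattern being scored.

```python
def pattern_reliability(pmi, inst_rel, local_max=False):
    """One update of pattern reliability from instance reliability.

    pmi:       dict mapping (instance, pattern) -> PMI (assumed positive)
    inst_rel:  dict mapping instance -> reliability score
    local_max: False -> scale by the maximum PMI over all pairs
                        (Espresso-style global scaling)
               True  -> scale by the maximum PMI among pairs that share
                        the pattern (assumed reading of the Tchai slide)
    """
    global_max = max(pmi.values())
    n_instances = len(inst_rel)
    reliability = {}
    for pattern in {p for (_, p) in pmi}:
        # PMI of every instance that co-occurs with this pattern.
        row = {i: v for (i, p), v in pmi.items() if p == pattern}
        scale = max(row.values()) if local_max else global_max
        total = sum(v / scale * inst_rel.get(i, 0.0) for i, v in row.items())
        reliability[pattern] = total / n_instances
    return reliability


# Hypothetical PMI values, chosen only to make the arithmetic easy to follow.
pmi = {
    ("vaio", "compare # laptop"): 2.0,
    ("thinkpad", "compare # laptop"): 1.0,
    ("vaio", "# price"): 8.0,
}
inst_rel = {"vaio": 1.0, "thinkpad": 1.0}

global_scores = pattern_reliability(pmi, inst_rel, local_max=False)
local_scores = pattern_reliability(pmi, inst_rel, local_max=True)
```

In this toy input, one large PMI value (8.0) shrinks the globally scaled score of "compare # laptop" to 0.1875, whereas scaling by that pattern's own maximum gives 0.75. The example only demonstrates the arithmetic and makes no claim about which scores the real system produced.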

  9. Experiments • Japanese query logs from 2007/01-02 • One million unique queries (166 million tokens) • Target categories • The 10,000 most frequent search words (in the log of 2006/12) were manually classified, hereafter referred to as the 10K list • Travel: the largest category (712 words) • Finance: the smallest category (240 words)

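  10. Results [Tables: results for the Travel and Finance categories; values not preserved in this transcript] Callouts: • High precision (92.1%) • Learned 251 novel words • Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land) • Include common nouns related to Travel (e.g. rental car)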

  11. Sample of Instances (Travel category) • Able to learn several sub-categories for which no seed words were given

  12. System Performance • Travel: high precision and recall • Finance: high precision but low relative recall due to strict filtering • Recall is reported as relative recall (Pantel et al., 2004)
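
Relative recall compares two systems without needing the true number of correct instances. Following the definition in Pantel et al. (2004), the recall of system A relative to system B reduces to the ratio of their correct-instance counts, (P_A × |A|) / (P_B × |B|), where P is precision and |·| is the number of extracted instances. A minimal sketch, with a hypothetical function name and made-up numbers that are not results from this work:

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Recall of system A relative to system B.

    The (unknown) total number of correct instances cancels out, leaving
    the ratio of the estimated correct-instance counts of the two systems.
    """
    correct_a = precision_a * size_a  # estimated correct instances from A
    correct_b = precision_b * size_b  # estimated correct instances from B
    return correct_a / correct_b


# Hypothetical numbers: A extracts 100 instances at 50% precision,
# B extracts 100 instances at 25% precision.
ratio = relative_recall(0.5, 100, 0.25, 100)
```

A value above 1 means A recovered more correct instances than B; a value below 1 means fewer.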

  13. Cumulative precision: Travel • Tchai achieved the best precision [Plot not preserved in this transcript]

  14. Sample Extracted Patterns • Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain • Tchai found context patterns that are characteristic of the domain

  15. Conclusion and future work • Conclusion: use of query logs for semantic category learning; improved the Espresso algorithm in both precision and performance • Future work: generalize the bootstrapping method by graph-based matrix calculation

  16. Tchai Thank you for listening!
