Minimally Supervised Learning of Semantic Knowledge from Query Logs

Mamoru Komachi (†) and Hisami Suzuki (‡)
(†) Nara Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India

Task


  • Learn semantic categories from web search query logs by bootstrapping with minimal supervision
  • Semantic category: a set of words which are interrelated
    • Named entities, technical terms, paraphrases, …
  • Can be useful for search ads, etc.





[Figure: example words of a semantic category, glossed as "Japanese tea" and "Indian tea"]

Our Contribution
  • First to use Japanese query logs for the task of learning named entities
  • Propose an efficient method suited for query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti, 2006)
Table of Contents
  • Related work
    • Bootstrapping techniques for relation extraction
    • Scoring metrics
  • The Tchai algorithm
    • Problems of Espresso
    • Extension to Espresso
  • Experiment
    • System performance and comparison to other algorithms
    • Samples of extracted instances and patterns
Bootstrapping
  • Iteratively conduct pattern induction and instance extraction, starting from seed instances
  • Can expand a small set of seed instances

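Example (query log):
  • Seed instance "vaio": the query "Compare vaio laptop" yields the pattern "Compare # laptop"
  • The query "Compare toshiba satellite laptop" matches the pattern, yielding the instance "Toshiba satellite"
  • The query "Compare HP xb3000 laptop" matches the pattern, yielding the instance "HP xb3000"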
Instance lookup and pattern induction
  • Semantic drift
  • Computational efficiency


  • Query log: "ANA 予約" (ANA reservation)
  • Extracted pattern: "# 予約" (# reservation)
    • Generic pattern: flight reservation? Restaurant reservation?
    • Broad coverage, but noisy patterns
  • Use all strings but instances as patterns = requires no segmentation


Instance/Pattern Scoring Metrics
  • Sekine & Suzuki (2007)
    • Starts from a large named entity dictionary
    • Assigns low scores to generic patterns and ignores them
  • Basilisk (Thelen and Riloff, 2002)
    • Balances the recall and precision of generic patterns
  • Espresso (Pantel and Pennacchiotti, 2006)

    • Reliability of an instance and a pattern is mutually defined
    • PMI is normalized by the maximum over all P and I
    • P: patterns in corpus
    • I: instances in corpus
    • PMI: pointwise mutual information
    • r: reliability score

The Tchai Algorithm
  • Filter generic patterns/instances
    • Do not select generic patterns and instances
  • Replace scaling factor in reliability scores
    • Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns
    • This modification has a large impact on the effectiveness of our algorithm
  • Only induce patterns at the beginning
    • Tchai runs 400 times faster than Espresso
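The change of scaling factor can be illustrated with a small sketch. It assumes an Espresso-style pattern reliability (an average of normalized PMI times instance reliability) and reads "maximum PMI for a given pattern" as the largest PMI among that pattern's own instances. The function name, the dictionary layout, and the toy PMI values are hypothetical, not taken from the slides.

```python
def pattern_reliability(pmi, inst_rel, local_max=True):
    """Score patterns from instance reliabilities.

    pmi:       {(instance, pattern): PMI value}, assumed positive here
    inst_rel:  {instance: reliability}
    local_max: True  -> scale by the max PMI of the pattern being scored
               False -> scale by the max PMI over all pairs (global)
    """
    global_max = max(pmi.values())
    patterns = {p for _, p in pmi}
    scores = {}
    for p in patterns:
        # PMI values of the instances that co-occur with this pattern.
        row = {i: v for (i, q), v in pmi.items() if q == p}
        scale = max(row.values()) if local_max else global_max
        total = sum(v / scale * inst_rel.get(i, 0.0) for i, v in row.items())
        scores[p] = total / len(inst_rel)  # |I| assumed to be len(inst_rel)
    return scores

# Toy values chosen only to show the effect of the scaling factor.
toy_pmi = {
    ("ana", "# reservation"): 1.0,
    ("jal", "# reservation"): 0.5,
    ("ana", "# flight"): 4.0,
}
toy_rel = {"ana": 1.0, "jal": 1.0}

print(pattern_reliability(toy_pmi, toy_rel, local_max=False))
print(pattern_reliability(toy_pmi, toy_rel, local_max=True))
```

With global scaling, the single large PMI value shrinks the score of every other pattern. With per-pattern scaling, each pattern's PMI values are normalized against its own maximum, so "# reservation" is no longer penalized by an unrelated pair.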
Experiment
  • Japanese query logs from 2007/01-02
    • One million unique queries (166 million tokens)
  • Target categories
    • Manually classified the 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list
    • Travel: the largest category (712 words)
    • Finance: the smallest category (240 words)

  • High precision (92.1%)
  • Learned 251 novel words
  • Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land)
  • Include common nouns related to Travel (e.g. Rental car)

Sample of Instances (Travel category)

Able to learn several sub-categories for which no seed words were given

System Performance



  • High precision and recall
  • High precision but low relative recall due to strict filtering
  • Relative Recall (Pantel et al., 2004)
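Relative recall compares two systems without knowing the total number of correct instances. A minimal sketch, assuming the definition in Pantel et al. (2004), where the relative recall of system A given system B is the ratio of their correct-instance counts, C_A / C_B = (P_A × |A|) / (P_B × |B|). The function and parameter names are illustrative.

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Relative recall of system A given system B.

    The number of correct instances of each system is estimated as
    precision times the number of extracted instances.
    """
    correct_a = precision_a * size_a
    correct_b = precision_b * size_b
    return correct_a / correct_b

# Hypothetical numbers: A extracts 100 instances at 90% precision,
# B extracts 300 instances at 60% precision.
print(relative_recall(0.9, 100, 0.6, 300))
```

In this made-up case A finds about 90 correct instances against B's 180, so its recall relative to B is one half even though its precision is higher. This is the pattern the slide describes for a system with strict filtering.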

Cumulative precision: Travel

Tchai achieved the best precision
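The slide does not define cumulative precision. One common reading, assumed here, is the precision of the top-k instances in a ranked output list, reported for every k. The function name and the sample judgments below are hypothetical.

```python
from itertools import accumulate

def cumulative_precision(is_correct):
    """Precision of the top-k items for k = 1..n.

    is_correct: correctness judgments (True/False) in rank order.
    """
    hits = accumulate(int(flag) for flag in is_correct)
    return [h / k for k, h in enumerate(hits, start=1)]

# Hypothetical judgments for the four highest-ranked instances.
print(cumulative_precision([True, True, False, True]))
```

Plotting these values against k gives a curve that shows how quickly precision degrades as lower-ranked instances are accepted.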

Sample Extracted Patterns

Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain

Tchai found context patterns that are characteristic of the domain

Conclusion and Future Work
  • Conclusion
    • Use of query logs for semantic category learning
    • Improved the Espresso algorithm in both precision and performance
  • Future work
    • Generalize the bootstrapping method using graph-based matrix calculation