Minimally Supervised Learning of Semantic Knowledge from Query Logs

Mamoru Komachi (†) and Hisami Suzuki (‡). (†) Nara Institute of Science and Technology, Japan; (‡) Microsoft Research, USA. IJCNLP-08, Hyderabad, India.

Mamoru Komachi(†) and Hisami Suzuki(‡)

(†) Nara Institute of Science and Technology, Japan

(‡) Microsoft Research, USA

IJCNLP-08, Hyderabad, India

Task
  • Learn semantic categories from web search query logs by bootstrapping with minimal supervision
  • Semantic category: a set of words which are interrelated
    • Named entities, technical terms, paraphrases, …
  • Can be useful for search ads, etc.

[Figure: the terms Darjeeling, Kombucha (Japanese tea), and Chai (Indian tea), connected by "similar" links]

Our Contribution
  • First to use Japanese query logs for the task of learning named entities
  • Propose an efficient method suited for query logs, based on the general-purpose Espresso (Pantel and Pennacchiotti 2006) algorithm
Table of Contents
  • Related work
    • Bootstrapping techniques for relation extraction
    • Scoring metrics
  • The Tchai algorithm
    • Problems of Espresso
    • Extension to Espresso
  • Experiment
    • System performance and comparison to other algorithms
    • Samples of extracted instances and patterns
Bootstrapping
  • Iteratively conduct pattern induction and instance extraction starting from seed instances
  • Can grow a small set of seed instances into a much larger set

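[Diagram: query log (corpus), instances, and contextual patterns; # marks the slot]
  • The instance "vaio" occurs in the query "Compare vaio laptop", which yields the contextual pattern "Compare # laptop"
  • The pattern matches the queries "Compare toshiba satellite laptop" and "Compare HP xb3000 laptop", which yield the instances "Toshiba satellite" and "HP xb3000"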
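One bootstrapping iteration over the laptop example can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `induce_patterns` and `extract_instances` are hypothetical, patterns are formed by plain substring replacement, and no scoring or filtering is applied.

```python
SLOT = "#"

def induce_patterns(queries, instances):
    """Pattern induction: replace a known instance inside a query with the slot."""
    patterns = set()
    for query in queries:
        for inst in instances:
            # Skip queries that are the bare instance: they carry no context.
            if inst in query and query != inst:
                patterns.add(query.replace(inst, SLOT, 1))
    return patterns

def extract_instances(queries, patterns):
    """Instance extraction: collect whatever string fills the slot of a pattern."""
    found = set()
    for query in queries:
        for pattern in patterns:
            prefix, _, suffix = pattern.partition(SLOT)
            has_filler = len(query) > len(prefix) + len(suffix)
            if query.startswith(prefix) and query.endswith(suffix) and has_filler:
                found.add(query[len(prefix):len(query) - len(suffix)])
    return found

# Toy query log taken from the slide; the seed set is a single instance.
queries = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
seeds = {"vaio"}

patterns = induce_patterns(queries, seeds)
instances = extract_instances(queries, patterns)
print(patterns)
print(sorted(instances))
```

Because the pattern is simply the query string with the instance removed, the same code works on unsegmented text, which is the property the next slide relies on.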

Instance lookup and pattern induction
  • Semantic drift
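  • Computational efficiency

[Diagram: instance → query log → extracted pattern]
  • Instance lookup: the instance "ANA" matches the query "ANA 予約" ("ANA reservation") in the query log
  • Pattern induction: the extracted pattern is "# 予約" ("# reservation"); it uses all of the query string except the instance, so it requires no segmentation
  • Such patterns give broad coverage but are noisy: "# 予約" is a generic pattern that could mean a restaurant reservation or a flight reservation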
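One simple way to spot a generic pattern such as "# reservation" is to count how many distinct instances fill its slot. The sketch below is only an illustration: the criterion, the threshold `max_instances`, the function name `split_generic`, and the toy pairs are all assumptions, not the filter used in the paper.

```python
from collections import defaultdict

def split_generic(pairs, max_instances):
    """Partition patterns by the number of distinct instances that fill the slot.

    pairs: iterable of (instance, pattern) co-occurrences from the query log.
    Patterns with more than max_instances distinct fillers are called generic.
    """
    fillers = defaultdict(set)
    for instance, pattern in pairs:
        fillers[pattern].add(instance)
    generic = {p for p, seen in fillers.items() if len(seen) > max_instances}
    specific = set(fillers) - generic
    return specific, generic

# Hypothetical co-occurrences, written in English for readability.
pairs = [
    ("ANA", "# reservation"),
    ("restaurant", "# reservation"),
    ("hotel", "# reservation"),
    ("ANA", "# flight schedule"),
    ("JAL", "# flight schedule"),
]

specific, generic = split_generic(pairs, max_instances=2)
print(specific, generic)
```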

Instance/Pattern Scoring Metrics
  • Sekine & Suzuki (2007)
    • Starts from a large named entity dictionary
    • Assigns low scores to generic patterns and ignores them
  • Basilisk (Thelen and Riloff, 2002)
    • Balances the recall and precision of generic patterns
  • Espresso (Pantel and Pennacchiotti, 2006)
    • The reliability of an instance and the reliability of a pattern are mutually defined
    • PMI is normalized by the maximum over all P and I
    • P: patterns in corpus; I: instances in corpus; PMI: pointwise mutual information; r: reliability score
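The equations themselves did not survive in this transcript. For reference, the Espresso reliability scores are usually written as below; the subscripts \pi (pattern) and \iota (instance) follow the notation of the Espresso paper and are an assumption about what the slide showed.

```latex
r_\pi(p) = \frac{\displaystyle\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}}\, r_\iota(i)}{|I|}
\qquad
r_\iota(i) = \frac{\displaystyle\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\max_{\mathrm{pmi}}}\, r_\pi(p)}{|P|}
```

Here \max_{\mathrm{pmi}} is the maximum PMI over all instances and patterns, so each score is an average of scaled PMI values weighted by the reliability of the other side.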

The Tchai Algorithm
  • Filter generic patterns/instances
    • Do not select generic patterns and instances
  • Replace scaling factor in reliability scores
    • Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns
    • This modification has a large impact on the effectiveness of our algorithm
  • Only induce patterns at the beginning
    • Tchai runs 400 times faster than Espresso
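The effect of the scaling factor can be seen in a small sketch. It assumes the Espresso form of pattern reliability (an average over instances of scaled PMI times instance reliability) and reads "the maximum PMI for a given pattern" as the maximum over the pairs that involve that pattern. The function name `pattern_reliability` and the PMI values are hypothetical.

```python
def pattern_reliability(pmi, inst_rel, pattern, local_max):
    """Reliability of one pattern from PMI values and instance reliabilities.

    pmi:       dict mapping (instance, pattern) -> PMI value
    inst_rel:  dict mapping instance -> reliability score
    local_max: False scales by the global maximum PMI (Espresso-style);
               True scales by the maximum PMI of this pattern only
               (one reading of the Tchai modification).
    """
    if local_max:
        scale = max((v for (_, p), v in pmi.items() if p == pattern), default=1.0)
    else:
        scale = max(pmi.values())
    total = sum(
        pmi.get((inst, pattern), 0.0) / scale * rel
        for inst, rel in inst_rel.items()
    )
    return total / len(inst_rel)

# Hypothetical PMI values: one pair has a much larger PMI than the others.
pmi = {
    ("ANA", "# reservation"): 1.0,
    ("JAL", "# reservation"): 0.5,
    ("ANA", "# timetable"): 4.0,
}
inst_rel = {"ANA": 1.0, "JAL": 1.0}

print(pattern_reliability(pmi, inst_rel, "# reservation", local_max=False))
print(pattern_reliability(pmi, inst_rel, "# reservation", local_max=True))
```

With global scaling, a single high-PMI pair elsewhere in the corpus pushes the score of "# reservation" down; with local scaling the score depends only on the pattern's own PMI values.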
Experiments
  • Japanese query logs from 2007/01-02
    • One million unique queries (166 million in token count)
  • Target categories
    • Manually classified the 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list
    • Travel: the largest category (712 words)
    • Finance: the smallest category (240 words)
Results
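
[Results tables for the Travel and Finance categories are not preserved in this transcript; the slide annotations follow.]
  • High precision (92.1%)
  • Learned 251 novel words
  • Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land)
  • Include common nouns related to Travel (e.g. Rental car)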

Sample of Instances (Travel category)

Able to learn several sub-categories for which no seed words were given

System Performance
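
[Performance table for the Travel and Finance categories is not preserved in this transcript; the slide annotations follow.]
  • High precision and recall
  • High precision but low relative recall due to strict filtering
  • Relative recall follows Pantel et al. (2004)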
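Relative recall compares two systems without knowing the true number of correct instances: since both recalls share the same denominator, the ratio reduces to the ratio of correct instances, estimated as precision times output size. A minimal sketch under that definition, with a hypothetical function name and made-up numbers:

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Recall of system A relative to system B.

    R_{A|B} = (P_A * |A|) / (P_B * |B|), where |A| and |B| are the numbers
    of instances each system extracted.
    """
    correct_b = precision_b * size_b
    if correct_b == 0:
        raise ValueError("system B has no correct instances to compare against")
    return (precision_a * size_a) / correct_b

# Made-up numbers: A is precise but small, B is noisy but large.
print(relative_recall(0.5, 100, 0.25, 400))
```

A value below 1 means system A recovered fewer correct instances than system B, which is what strict filtering tends to produce.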

Cumulative precision: Travel

Tchai achieved the best precision
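The plot for this slide is not preserved. Assuming cumulative precision at rank k means the fraction of correct instances among the top k ranked instances, the curve can be computed as in this sketch; the function name and the judgments are hypothetical.

```python
def cumulative_precision(judgments):
    """Precision after each rank, given correctness judgments in rank order.

    judgments: sequence of booleans, True when the instance at that rank
    was judged correct.
    """
    correct = 0
    curve = []
    for rank, is_correct in enumerate(judgments, start=1):
        correct += bool(is_correct)
        curve.append(correct / rank)
    return curve

# Hypothetical judgments for the top four extracted instances.
print(cumulative_precision([True, True, False, True]))
```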

Sample Extracted Patterns

Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain

Tchai found context patterns that are characteristic of the domain

Conclusion and future work
  • Conclusion
    • Use of query logs for semantic category learning
    • Improved the Espresso algorithm in both precision and performance
  • Future work
    • Generalize bootstrapping method by graph-based matrix calculation