Modeling term relevancies in information retrieval using Graph Laplacian Kernels

Shuguang Wang

Joint work with Saeed Amizadeh and Milos Hauskrecht

A Problem in Document Retrieval
  • There is a ‘gap’ between search queries and documents.

Query: car

Google.com

Bing.com

Yahoo.com

Good enough?

A Problem in Document Retrieval
  • What about the documents about automobiles, BMW, Benz, …?
  • There are various expressions for the same entity.
  • One solution is to expand the original user queries with some ‘relevant’ terms.
Traditional Query Expansion Methods
  • Human- and/or computer-generated thesauri
    • Zhou et al., SIGIR 2007 proposed expanding queries with MeSH concepts.
  • Human relevance feedback
    • Implicit feedback from humans, such as tracking eye movements (Buscher et al., SIGIR 2009).
    • User click information (Yin et al., ECIR 2009).
  • Automatic query expansion
    • Pseudo-relevance feedback, first proposed in (Xu and Croft, SIGIR 1996).
      • Use the top ‘n’ documents from the initial search as implicit feedback and select ‘relevant’ terms from these ‘n’ documents.
    • Analyze the query-flow graph (Bordino et al., SIGIR 2010).

Expensive and time consuming

Human input

A Different View
  • What we really need here is a way to estimate term-term relevance.
  • The problem of finding expansion terms for user queries becomes the problem of finding ‘relevant’ terms given a similarity metric.
  • How to derive a term-term similarity metric?
Term-Term Similarity
  • Hypothesis: the metric ‘d’ should be smooth, i.e., d(t1) ~ d(t2) if ‘t1’ and ‘t2’ are similar/relevant.
  • Why not graph Laplacian kernels?!
    • We can easily have smoothness property.
    • We can also define distance metrics with it.
Define Affinity Graph
  • Nodes are terms
  • Edges are co-occurrences
  • Weights of the edges are the numbers of documents in which the terms co-occur
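The construction above can be sketched directly: count, for every pair of terms, how many documents contain both. The documents and whitespace tokenization below are purely illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_affinity_graph(documents):
    """Return {(term_i, term_j): number of documents in which both terms occur}."""
    weights = defaultdict(int)
    for doc in documents:
        terms = sorted(set(doc.split()))       # unique terms in this document
        for t1, t2 in combinations(terms, 2):  # every co-occurring pair
            weights[(t1, t2)] += 1
    return dict(weights)

docs = ["car engine wheel", "car automobile engine", "wheel tire"]
graph = build_affinity_graph(docs)
# e.g. 'car' and 'engine' co-occur in two documents
```

In practice the pair counts would be assembled into a symmetric weight matrix W over the vocabulary, which is what the Laplacian construction in the next slides consumes.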
Graph Laplacian Kernels
  • General form: K = Σ_i g(λ_i) u_i u_iᵀ, where (λ_i, u_i) are the eigenpairs of the graph Laplacian
  • Resistance: g(λ) = 1/λ
  • Diffusion: g(λ) = exp(−σ²λ/2)
  • P-step random walk: g(λ) = (a − λ)^p, a ≥ 2

Recall: L = D − W, with eigendecomposition L = Σ_i λ_i u_i u_iᵀ

How to choose the hyperparameters? How to choose g(λ)?

Non-parametric kernel
  • Learn the transformation g(λ) directly from training data.
    • If we know some terms are similar, we want to maximize their similarities.
    • At the same time, we want to have a smoother metric.
An Optimization Problem
  • Maximize the similarity for known similar terms t_i^{n'} and t_j^{n'}.
  • Penalize larger eigenvalues more, to obtain a smoother metric.

{λ_i}: the set of eigenvalues of the original graph Laplacian

t_i^{n'} and t_j^{n'} are a pair of similar terms in the training document n'

Kernel to Distances
  • Given the kernel K, we can define the distance between any pair of nodes, d(i,j), in the graph.
  • We define: d(i,j) = K_ii + K_jj − 2K_ij, the Euclidean distance in the kernel space.
  • The distance metric derived from a graph Laplacian kernel is thus the Euclidean distance in the kernel space.

Recall: K = Σ_i g(λ_i) u_i u_iᵀ

[Figure: terms t_i and t_j embedded in the kernel space spanned by the eigenvector directions µ1, µ2, …, µn]

Using term-term similarity in IR
  • We need to deal with similarity between a set and a term.
    • In query expansion tasks, the set of query terms is ‘S’ and a candidate expansion term is ‘t’.
  • Transform the pairwise distances, ‘d’, into a set-to-term similarity.
    • Naïve methods:
      • dmax=max(d(S,t))
      • davg=avg(d(S,t))
      • dmin=min(d(S,t))
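The three naive reductions above collapse the pairwise distances between the query set ‘S’ and a candidate term ‘t’ into a single score. A small sketch; the distance function `d` here is a stand-in for the kernel-derived metric.

```python
def set_to_term_scores(d, S, t):
    """Reduce pairwise distances d(s, t) over s in S to the three naive scores."""
    dists = [d(s, t) for s in S]
    return {
        "dmax": max(dists),               # most distant query term
        "davg": sum(dists) / len(dists),  # average distance to the query set
        "dmin": min(dists),               # closest query term
    }

# Toy distance on integers, purely for illustration.
scores = set_to_term_scores(lambda s, t: abs(s - t), S=[1, 4], t=2)
```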
Set-to-term Similarity
  • Query collapsing
Query Collapsing
  • We would have to compute the eigendecomposition again for each query.
    • That is too expensive for an online task.
  • An approximation is possible.
    • We want to approximate the projection of the ‘new’ point ‘S’ in the kernel space.
    • We need to add one element to each original eigenvector.

[Figure: the query terms collapsed into a single new point ‘S’, projected into the kernel space spanned by µ1, µ2, …, µn alongside existing term points]

Nyström Approximation
  • For all nodes in the graph Laplacian, we have L u_i = λ_i u_i.
  • If the new point ‘s′’ were in the graph, it would satisfy the above as well, so its entry in each eigenvector can be solved for approximately.
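A sketch of the Nyström idea behind these slides: every eigenpair of a symmetric matrix A satisfies A u_i = λ_i u_i, so a new point with affinity row w can be assigned the approximate coordinate u_i(s) = (1/λ_i) · w·u_i without redoing the eigendecomposition. Shown here for a plain symmetric affinity matrix with illustrative numbers; applying it to a graph Laplacian additionally requires skipping the zero eigenvalue, as the guard below does.

```python
import numpy as np

A = np.array([[0., 2., 1.],
              [2., 0., 1.],
              [1., 1., 0.]])
lam, U = np.linalg.eigh(A)

def nystrom_coords(w, lam, U, eps=1e-10):
    """Approximate eigenvector coordinates of a new point with affinity row w."""
    keep = np.abs(lam) > eps          # skip (near-)zero eigenvalues
    return (w @ U[:, keep]) / lam[keep]

# New point with affinities (2, 2, 1) to the three existing nodes.
coords_new = nystrom_coords(np.array([2., 2., 1.]), lam, U)

# Sanity check: feeding an existing node's affinity row recovers its own
# eigenvector coordinates exactly, because A U = U diag(lam).
coords0 = nystrom_coords(A[0], lam, U)
```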
Evaluation
  • Two tasks:
    • Term prediction (scientific publications)
      • Given the terms in the abstract, predict the possible terms in the full body
      • Compare with TFIDF, PRF, PLSI
    • Query expansion
      • Compare with Lemur/Indri + PRF and Terrier + PRF
  • Kernels:
    • Diffusion (optimized by line search)
    • Resistance
    • Non-parametric (optimized by line search)
  • Set-to-term:
    • Average
    • Query collapse
Term prediction
  • 6000 articles about 10 cancers downloaded from PubMed.
    • 80% as training and 20% as testing
  • Given the terms in abstracts, rank all the candidate terms using the distance metrics.
    • The smaller the distance between a candidate term and the query terms, the higher the term is ranked.
  • Use AUC to evaluate (Joachims, ICML 2005)
Query Expansion
  • Four TREC datasets: Genomic 03 & 04, Adhoc TREC 7 & 8.
  • We built graphs using different sets of terms in these datasets:
    • genes/proteins on Genomic 03 data
    • the 5000 terms with the highest TFIDF scores on Genomic 04 data
    • a 25% subsample of all (~100k) unique terms from TREC 7 & 8
  • Use Mean Average Precision (MAP) to evaluate the performance.
  • Only the resistance kernel is used in these experiments.