
Why Spectral Retrieval Works

SIGIR 2005 in Salvador, Brazil, August 15 – 19

Holger Bast

Max-Planck-Institut für Informatik (MPII)

Saarbrücken, Germany

joint work with Debapriyo Majumdar


What we mean by spectral retrieval

  • Ranked retrieval in the term space: rank each document d_i by its cosine similarity to the query q,

    q^T d_i / (|q| |d_i|)

[Figure: five example documents whose "true" similarities to the query are 1.00, 1.00, 0.00, 0.50, 0.00, but whose cosine similarities in the term space are only 0.82, 0.00, 0.00, 0.38, 0.00]

  • Spectral retrieval = linear projection to an eigensubspace: map the query and every document by a projection matrix L and compare in the subspace,

    (Lq)^T (L d_i) / (|Lq| |L d_i|)

[Figure: in the subspace the cosine similarities become 0.98, 0.98, -0.25, 0.73, 0.01, much closer to the "true" similarities]
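To make the two scoring rules concrete, here is a minimal NumPy sketch on toy data (the matrix A, query q, and dimension k are placeholders, not the talk's example); taking L from the top-k left singular vectors of the term-document matrix is the standard LSI choice of eigensubspace:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 8))      # toy term-document matrix (terms x documents)
q = rng.random(20)           # toy query vector in the term space
k = 3                        # subspace dimension

# The top-k left singular vectors of A span a k-dimensional eigensubspace
# of A A^T; their transpose is the projection matrix L.
U, _, _ = np.linalg.svd(A, full_matrices=False)
L = U[:, :k].T

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

for d in A.T:                # score every document both ways
    print(cosine(q, d), cosine(L @ q, L @ d))
```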


Why and when does this work?

  • Previous work: if the term-document matrix is a slight perturbation of a rank-k matrix, then projection to a k-dimensional subspace works

    • Papadimitriou, Tamaki, Raghavan, Vempala PODS'98

    • Ding SIGIR'99

    • Ando and Lee SIGIR'01

    • Azar, Fiat, Karlin, McSherry, Saia STOC'01

  • Our explanation: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns

    • no single subspace is appropriate for all term pairs

    • we fix that problem


Spectral retrieval — alternative view

  • Ranked retrieval in the term space: with the expansion matrix L^T L, the similarity after document expansion is

    q^T (L^T L d_1) / (|q| |L^T L d_1|)

  • Spectral retrieval = linear projection to an eigensubspace with projection matrix L; the cosine similarity in the subspace is

    (Lq)^T (L d_1) / (|Lq| |L d_1|)  =  q^T (L^T L d_1) / (|Lq| |L^T L d_1|)

  • The two scores differ only in the normalization |q| versus |Lq|, which is the same for every document and therefore does not change the ranking

Spectral retrieval = document expansion (not query expansion)
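A quick numerical check of this identity, again on toy data (all names are placeholders): the subspace cosine and the document-expansion score agree up to the factor |q| / |Lq|, which does not depend on the document.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 10))                 # toy term-document matrix
U, _, _ = np.linalg.svd(A, full_matrices=False)
L = U[:, :3].T                          # projection onto a 3-dim eigensubspace
E = L.T @ L                             # expansion matrix L^T L

q, d = rng.random(6), A[:, 0]

subspace = (L @ q) @ (L @ d) / (np.linalg.norm(L @ q) * np.linalg.norm(L @ d))
expanded = q @ (E @ d) / (np.linalg.norm(q) * np.linalg.norm(E @ d))

# Identical up to the document-independent factor |q| / |Lq|:
print(subspace, expanded * np.linalg.norm(q) / np.linalg.norm(L @ q))
```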


Why document "expansion"

  • Ideal expansion matrix has

    • high scores for intuitively related terms

    • low scores for intuitively unrelated terms

[Figure: a document over the terms internet, surfing, beach, web is multiplied by a 0-1 expansion matrix that adds "internet" whenever "web" is present]

  • With a matrix L projecting to 2, to 3, or to 4 dimensions, the resulting expansion matrix L^T L looks completely different each time: the expansion matrix depends heavily on the subspace dimension!
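A toy version of the 0-1 expansion pictured above; the four-term vocabulary is from the slide, while the vector encoding is an assumption of this sketch:

```python
import numpy as np

terms = ["internet", "surfing", "beach", "web"]
E = np.eye(4, dtype=int)     # start from the identity: every term keeps itself
E[0, 3] = 1                  # row "internet", column "web": web also adds internet

d = np.array([0, 1, 0, 1])   # a document containing "surfing" and "web"
print(E @ d)                 # [1 1 0 1] -- "internet" has been added
```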


Our Key Observation

  • We studied how the entries in the expansion matrix depend on the dimension of the subspace to which the documents are projected

[Figure: expansion matrix entry versus subspace dimension (0 to 600) for the term pairs logic/logics, node/vertex, and logic/vertex]

  • No single dimension is appropriate for all term pairs

  • But the shape of the curve is a good indicator for relatedness!
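The curves behind this observation are cheap to compute: with L = U_k^T built from the top-k left singular vectors, the expansion matrix is U_k U_k^T, so one entry as a function of k is a single cumulative sum. A sketch on toy data (matrix size and term pair are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((50, 200))    # toy term-document matrix

U, _, _ = np.linalg.svd(A, full_matrices=False)
i, j = 3, 7                  # a (hypothetical) term pair

# curve[k-1] = (U_k U_k^T)[i, j], the expansion matrix entry
# for every subspace dimension k at once
curve = np.cumsum(U[i, :] * U[j, :])
```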


Curves for related terms

  • We call two terms perfectly related if they have an identical co-occurrence pattern

[Figure: the identical co-occurrence patterns of term 1 and term 2; then expansion matrix entry versus subspace dimension (0 to 600) in three panels: the proven shape for perfectly related terms, the provably small change after a slight perturbation, and the shape half way to a real matrix]

  • The up-and-then-down shape remains, but the point of fall-off is different for every term pair!


Curves for unrelated terms

  • Co-occurrence graph:

    • vertices = terms

    • edge = two terms co-occur

  • We call two terms perfectly unrelated if no path connects them in the graph

[Figure: expansion matrix entry versus subspace dimension (0 to 600) in three panels: the proven shape for perfectly unrelated terms, the provably small change after a slight perturbation, and the shape half way to a real matrix]

  • Curves for unrelated terms are random oscillations around zero
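Both definitions are easy to check mechanically. A sketch on a toy binary term-document matrix (data and helper names are illustrative; SciPy is assumed for the connectivity test):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(3)
A = (rng.random((30, 100)) > 0.9).astype(int)   # toy binary term-document matrix

def perfectly_related(i, j):
    # identical co-occurrence pattern = identical rows of the matrix
    return bool(np.array_equal(A[i], A[j]))

# co-occurrence graph: two terms are adjacent iff they share a document
labels = connected_components(csr_matrix((A @ A.T) > 0), directed=False)[1]

def perfectly_unrelated(i, j):
    # no connecting path = different connected components
    return labels[i] != labels[j]
```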


Telling the shapes apart — TN

  • Normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs

  • For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0

[Figure: three curves of expansion matrix entry versus subspace dimension (0 to 600); the two curves that stay non-negative before the fall-off point get entry 1, the curve that does not gets entry 0]

a simple 0-1 classification, no fractional entries!
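A minimal sketch of the TN test for a single term pair; it assumes the term-document matrix has already been normalized as described, that k0 is the common theoretical point of fall-off, and that U holds the left singular vectors as before:

```python
import numpy as np

def tn_entry(U, i, j, k0):
    """TN test: 1 if the expansion-matrix curve for terms i and j
    never goes negative before the fall-off point k0, else 0."""
    curve = np.cumsum(U[i, :k0] * U[j, :k0])
    return 1 if np.all(curve >= 0) else 0
```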


An alternative algorithm — TM

  • Again, normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs

  • For each term pair, compute the monotonicity of its initial curve (= 1 if perfectly monotone, → 0 as the number of turns increases)

  • If the monotonicity is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0

[Figure: three curves of expansion matrix entry versus subspace dimension (0 to 600), with monotonicity 0.82, 0.69, and 0.07; the curves with monotonicity 0.82 and 0.69 get entry 1, the curve with monotonicity 0.07 gets entry 0]

again: a simple 0-1 classification!
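A sketch of the TM test in the same setting; the concrete monotonicity score used here is an assumption of this sketch, since the talk only specifies that it is 1 for a perfectly monotone curve and tends to 0 as the number of turns grows:

```python
import numpy as np

def tm_entry(U, i, j, k0, threshold=0.5):
    """TM test: 1 if the initial curve for terms i and j is
    sufficiently monotone, else 0 (threshold is illustrative)."""
    curve = np.cumsum(U[i, :k0] * U[j, :k0])
    turns = np.count_nonzero(np.diff(np.sign(np.diff(curve))))
    monotonicity = 1.0 / (1.0 + turns)   # 1 if monotone, -> 0 with more turns
    return 1 if monotonicity >= threshold else 0
```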


Experimental results

[Table: average precision on a collection of 425 docs / 3882 terms for the following methods]

  • Baseline: cosine similarity in the term space

  • Latent Semantic Indexing (LSI), Dumais et al. 1990

  • Term-normalized LSI (LSI-RN), Ding et al. 2001

  • Correlation-based LSI (CORR), Dupret et al. 2001

  • Iterative Residual Rescaling (IRR), Ando & Lee 2001

  • our non-negativity test (TN)

  • our monotonicity test (TM)

* the numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!


Experimental results

[Table: average precision for the same methods on three collections: 425 docs / 3882 terms, 21578 docs / 5701 terms, and 233445 docs / 99117 terms]

* the numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!


Conclusions

  • Main message: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns

    • a simple 0-1 classification that considers a sequence of subspaces is at least as good as schemes that commit to a fixed subspace

  • Some useful corollaries …

    • new insights into the effect of term-weighting and other normalizations for spectral retrieval

    • straightforward integration of known word relationships

    • consequences for spectral link analysis?

Obrigado!

