Compressed Data Structures for Annotated Web Search

Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, Rohit Saraf


Searching the annotated Web

  • Search engines increasingly supplement the “ten blue links” with a Web of objects

  • From object catalogs like

    • WordNet: basic types and common entities

    • Wikipedia: millions of entities

    • Freebase: tens of millions of entities

    • Product catalogs, LinkedIn, IMDB, Zagat …

  • Several new capabilities required

    • Recognizing and disambiguating entity mentions

    • Indexing these mentions along with text

    • Query execution and entity ranking


Lemmas and entities

  • In (Web) text, noisy and ambiguous lemmas are used to mention entities

  • Lemma = word or phrase

  • Lemma-to-entity relation is many-to-many (see the sketch after the figure)

  • Goal: given mention in context, find correct entity in catalog, if any

  • Lemma also called “leaf” because we use a trie to detect mention phrases

[Figure: the many-to-many lemma → entity relation. The lemmas “Michael”, “Michael Jordan”, and “Jordan” point to entities such as the basketball player, the Berkeley professor, the country Jordan, and the Jordan river; the lemmas “Big Apple”, “city that never sleeps”, “New York City”, and “New York” point to New York City and to the state in the USA.]
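To make the many-to-many relation concrete, here is a minimal Java sketch of the lemma-to-candidates lookup. The class and method names are hypothetical, and a plain hash map stands in for the trie the real system uses to detect mention phrases.

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch, not the authors' code: the many-to-many
    // lemma -> entity relation as a map from phrase to candidate IDs.
    public class LemmaCatalog {
        private final Map<String, List<Integer>> candidates;

        public LemmaCatalog(Map<String, List<Integer>> candidates) {
            this.candidates = candidates;
        }

        // Candidate entity IDs for a detected mention phrase,
        // or an empty list if the phrase is not a known lemma.
        public List<Integer> candidatesOf(String lemma) {
            return candidates.getOrDefault(lemma, List.of());
        }
    }

Here candidatesOf("Jordan") would return the IDs of the basketball player, the professor, the country, and the river.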


Features for disambiguation

[Figure: feature vectors x, with millions of features, extracted from the context of two mentions of “Jordan”:
“After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.” → after, UNC, workshop, tutorial, nonparametric, Bayesian
“After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.” → season, UNC, league, leap, slam, dunk]


Inferring the correct entity

  • Each lemma is associated with a set of candidate entities

  • For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as feature vectors

  • When deployed to resolve an ambiguity about lemma ℓ, choose ê = argmaxₑ w(ℓ,e) · x over the candidate entities of ℓ (a linear model; the score is a dot product)
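A minimal sketch of that argmax, assuming sparse vectors are stored as featureID → value maps; all names here are illustrative, not the paper's API.

    import java.util.Map;

    public class LinearDisambiguator {
        // Dot product of two sparse vectors (feature ID -> value).
        static double dot(Map<Integer, Double> w, Map<Integer, Double> x) {
            double s = 0;
            for (Map.Entry<Integer, Double> fx : x.entrySet()) {
                Double wf = w.get(fx.getKey());
                if (wf != null) s += wf * fx.getValue();
            }
            return s;
        }

        // Choose argmax_e w(l,e) . x over the candidate entities of a lemma;
        // weights maps each candidate entity ID to its trained vector w(l,e).
        static int resolve(Map<Integer, Map<Integer, Double>> weights,
                           Map<Integer, Double> x) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<Integer, Map<Integer, Double>> we : weights.entrySet()) {
                double score = dot(we.getValue(), x);
                if (score > bestScore) { bestScore = score; best = we.getKey(); }
            }
            return best; // -1 only if the candidate set is empty
        }
    }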


The (ℓ, f, e) → w map

  • Uncompressed, each key + value takes 12 + 4 bytes = 128 bits per entry

  • ~500M entries → 8 GB just for the map

  • No primitive type to hold keys

  • With Java overheads, easily 20GB RAM

    • From ~2M to ~100M entities?

  • Total marginal entropy: 33.6 bits per entry

  • From 128 down to 33.6 and beyond?

  • Must compress keys and values

  • And exploit correlations between them


(ℓ, f ) → {e → w} organization

  • When scanning documents for disambiguation, we first encounter lemma ℓ and then features f from context around it

  • Initialize score accumulator for each candidate entity e

  • For each feature f in context

    • Probe data structure with (ℓ, f )

    • Retrieve sparse map {e → w}

    • For each entry in map

      • Update entity scores

  • Choose the top candidate entity (sketched in code after the figure)

[Figure: the “LFE map” (LFEM): for each lemma ℓ, a segment of feature records f1, f2, f3, f4, each holding its sparse {e → w} map.]
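The probe loop above as a Java sketch; the LfeMap interface is an assumed stand-in for the compressed structure, not the paper's actual API.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    interface LfeMap {
        // Sparse {short entity ID -> weight} map for one (lemma, feature)
        // key, or null if this feature carries no weights for the lemma.
        Map<Integer, Float> probe(int lemma, int feature);
    }

    class LfemScorer {
        static int disambiguate(LfeMap lfem, int lemma, List<Integer> contextFeatures) {
            // One score accumulator per candidate entity e.
            Map<Integer, Float> score = new HashMap<>();
            for (int f : contextFeatures) {
                Map<Integer, Float> ew = lfem.probe(lemma, f);
                if (ew == null) continue;
                for (Map.Entry<Integer, Float> e : ew.entrySet())
                    score.merge(e.getKey(), e.getValue(), Float::sum);
            }
            // Top candidate entity, or -1 if no feature matched.
            return score.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey).orElse(-1);
        }
    }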


Short entity IDs

  • Millions of entities globally, but few for a given lemma

  • Use variable-length integer codes (ID assignment sketched after the figure)

  • Frequent short IDs get the shortest codes

[Figure: short entity IDs w.r.t. a lemma. For the lemma “Michael Jordan”, candidate entities are sorted by decreasing occurrence frequency in a reference corpus and given short IDs 0–5: basketball player, CBS/PepsiCo/Westinghouse executive, machine learning researcher, mycologist, racing driver, goalkeeper.]
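A sketch of the short-ID assignment, assuming per-entity occurrence counts from the reference corpus are available; the names are hypothetical.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    class ShortIds {
        // candidates: global entity IDs the lemma can mention.
        // freq: global entity ID -> occurrence count in the reference corpus.
        // Returns global ID -> short ID, with 0 the most frequent candidate.
        static Map<Integer, Integer> assign(List<Integer> candidates,
                                            Map<Integer, Long> freq) {
            List<Integer> sorted = candidates.stream()
                .sorted(Comparator.comparingLong(
                        (Integer e) -> freq.getOrDefault(e, 0L)).reversed())
                .collect(Collectors.toList());
            return IntStream.range(0, sorted.size()).boxed()
                .collect(Collectors.toMap(sorted::get, i -> i));
        }
    }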


Encoding of (ℓ, f ) → {e → w}

[Figure: bit layout of compressed segments for lemmas ℓ1 (records f1, f2, …) and ℓ2; e is a short ID, and an index points to the start of the segment for each lemma ID.]

  • We used γ codes; other codes may be better

  • For adjacent short IDs, we spend only one bit

  • Records have irregular sizes

  • Must read from the beginning to decompress
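Reading the missing code name above as the Elias γ code is an assumption, but it is consistent with the slide: γ(1) is a single bit, so a gap of 1 between adjacent sorted short IDs costs one bit. A sketch, with a string of '0'/'1' characters standing in for a real bit stream:

    class Gamma {
        // Append the gamma code of n (n >= 1) to bits.
        static void encode(int n, StringBuilder bits) {
            int len = 31 - Integer.numberOfLeadingZeros(n); // floor(log2 n)
            for (int i = 0; i < len; i++) bits.append('0'); // unary length prefix
            for (int i = len; i >= 0; i--)                  // n in binary, MSB first
                bits.append((char) ('0' + ((n >>> i) & 1)));
        }

        // Decode one gamma-coded value starting at pos[0]; advances pos[0].
        static int decode(CharSequence bits, int[] pos) {
            int len = 0;
            while (bits.charAt(pos[0]) == '0') { len++; pos[0]++; }
            int n = 0;
            for (int i = 0; i <= len; i++)
                n = (n << 1) | (bits.charAt(pos[0]++) - '0');
            return n;
        }
    }

For example, encoding the sorted short IDs 0, 1, 2 as the gaps 1, 1, 1 (with the first ID offset by one) costs three bits in total.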


Random access on (ℓ, f )

  • Already support random access on ℓ

  • Number of distinct ℓ is on the order of 10 million

  • Cannot afford time to decompress from the beginning of ℓ block

  • Cannot afford (full) index array for (ℓ, f )

  • Within each ℓ block, allocate sync points (probe sketched below)

  • Old technique in IR indexing

  • New issues:

    • Outer allocation of total sync among ℓ blocks

    • Tuning syncs to measured (ℓ, f ) probe distribution — inner allocation
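A sketch of a probe through sync points, assuming every ℓ block starts with a sync and that a hypothetical Decoder can seek to a bit offset and then decode records sequentially:

    import java.util.Arrays;
    import java.util.Map;

    interface Decoder {
        void seek(long bitPos);                  // jump to an absolute bit offset
        int readFeatureId();                     // next feature ID in the block
        Map<Integer, Float> readEntityWeights(); // decode its {e -> w} payload
    }

    class SyncedLemmaBlock {
        int[] syncFeature;  // feature IDs that have a sync point, sorted
        long[] syncBitPos;  // bit offset of each sync; one is assumed at block start

        Map<Integer, Float> probe(int f, Decoder d) {
            // Find the last sync whose feature ID is <= f.
            int i = Arrays.binarySearch(syncFeature, f);
            if (i < 0) i = -i - 2;
            if (i < 0) return null; // f precedes the block's first feature
            d.seek(syncBitPos[i]);
            // Sequential decode from the sync until f is found or passed.
            for (;;) {
                int cur = d.readFeatureId(); // sentinel past the last record assumed
                if (cur > f) return null;    // f absent for this lemma
                Map<Integer, Float> ew = d.readEntityWeights();
                if (cur == f) return ew;
            }
        }
    }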


Inner sync point allocation policies

  • Say Kℓ sync points are budgeted to lemma ℓ

  • To which features can we seek?

  • For others, sequential decode

  • DynProg: optimal expected probe time via a dynamic program

  • Freq: allocate syncs at the f with the largest probe probability p(f |ℓ)

  • Equi: measure off segments with about equal numbers of bits (sketched after the figure)

  • EquiAndFreq: split budget

[Figure: candidate sync positions over the feature records f1 … f4 of a lemma's segment.]
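A sketch of the Equi policy, assuming the compressed size of each (f, {e → w}) record is known when the segment is built; the record granularity here is an assumption:

    import java.util.Arrays;

    class EquiAllocator {
        // recordBits[i] = compressed size in bits of the i-th feature record
        // in a lemma's segment. Returns the indices of the records that get
        // the lemma's k sync points, spaced roughly every total/k bits.
        static int[] allocate(long[] recordBits, int k) {
            long total = 0;
            for (long b : recordBits) total += b;
            long stride = Math.max(1, total / k);
            int[] syncs = new int[k];
            int used = 0;
            long sinceLast = stride; // forces a sync at the first record
            for (int i = 0; i < recordBits.length && used < k; i++) {
                if (sinceLast >= stride) {
                    syncs[used++] = i;
                    sinceLast = 0;
                }
                sinceLast += recordBits[i];
            }
            return Arrays.copyOf(syncs, used);
        }
    }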


Outer allocation policies

  • Given overall budget K, how many syncs Kℓ does leaf ℓ get?

    • Hit probability pℓ; bits in the leaf's segment bℓ

  • Analytical expression for effect of inner allocation can be intractable

  • Hit: Kℓ ∝ pℓ

  • HitBit: Kℓ ∝ pℓbℓ

  • SqrtHitBit: Kℓ ∝ √(pℓbℓ), assuming equispaced inner allocation (see Managing Gigabytes; derivation sketched below)
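One way to recover the SqrtHitBit rule, reconstructed here via the usual Lagrange argument rather than taken from the slide: with Kℓ equispaced syncs, a probe into leaf ℓ decodes about bℓ/(2Kℓ) bits on average, so we want

    \[
      \min_{\{K_\ell\}} \sum_\ell p_\ell \, \frac{b_\ell}{2K_\ell}
      \quad\text{subject to}\quad \sum_\ell K_\ell = K .
    \]
    % Setting the derivative of  p_l b_l/(2K_l) + \lambda K_l  to zero gives
    \[
      K_\ell \;\propto\; \sqrt{p_\ell \, b_\ell},
    \]

which is exactly the SqrtHitBit allocation.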


Experiments

  • 500 million pages, mostly English, spam-free

  • Catalog has about two million lemmas and entities

[Figure: experiment pipeline. A spotter runs over the “reference” corpus, which is split into train and test folds; train and test contexts yield (ℓ, f ) workloads. A sampler and smoother produce a smoothed (ℓ, f ) distribution; the disambiguation trainer and cross-validator learn the ℓ, f → (e, w) model; a compressor builds the L-F-E map. An annotator and an entity-and-type indexer build the annotation index and “payload”.]

Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry for the raw data.


Inner policies compared

  • Equi close to optimal DynProg but fast to compute

  • Freq surprisingly bad: long tail

  • Blending Equi and Freq worse than Equi alone

  • Relative order stable as sample size increased: long tail again

[Chart: lookup cost for each inner policy; lower is better.]


Diagnosis: Freq vs. Equi

[Charts: cumulative seek cost for Freq vs. Equi; note the different scales.]

  • Plots show cumulative seek cost starting at sync

    • Collapse back to zero at next sync

  • The most frequent features are not evenly spaced

  • Tail features in between lead to steep seek costs

  • Equi never lets seek cost get out of hand

  • (How about permuting features? See paper)


Outer policies compared

[Chart: probe cost (y-axis) vs. sync budget (x-axis) for the outer policies.]

  • Inner policy set to best (DynProg)

  • SqrtHitBit better than Bit better than HitBit

  • Not surprising, given DynProg behaves closer to Equi than to Freq



Comparison with other systems

  • Downloaded software or network services

  • Regression removes per-page, per-token overhead

  • LFEM wins, largely because of syncs

  • LFEM RAM << downloaded software


Conclusion

  • Compressed in-memory multilevel maps for disambiguation

  • Random access via tuned sync allocation

  • >20 GB down to 1.15 GB

  • Faster than public disambiguation systems

  • Annotate 500M pages with 2M Wikipedia entities + index on 408 cores in ~18 hours

  • Also in the paper: design of compressed annotation index posting list

