
Compressed Data Structures for Annotated Web Search


Presentation Transcript


  1. Compressed Data Structures for Annotated Web Search Soumen Chakrabarti, Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, Rohit Saraf

  2. Searching the annotated Web • Search engines increasingly supplement “ten blue links” using a Web of objects • From object catalogs like • WordNet: basic types and common entities • Wikipedia: millions of entities • Freebase: tens of millions of entities • Product catalogs, LinkedIn, IMDB, Zagat … • Several new capabilities required • Recognizing and disambiguating entity mentions • Indexing these mentions along with text • Query execution and entity ranking

  3. Lemmas and entities • In (Web) text, noisy and ambiguous lemmas are used to mention entities • Lemma = word or phrase • The lemma-to-entity relation is many-to-many • Goal: given a mention in context, find the correct entity in the catalog, if any • A lemma is also called a “leaf” because we use a trie to detect mention phrases (a minimal sketch of the lemma-to-entity map follows) [Figure: lemmas such as “Michael”, “Jordan”, “Big Apple”, “New York”, and “city that never sleeps” map many-to-many to entities such as Michael Jordan the basketball player, Michael Jordan the Berkeley professor, the country Jordan, the river Jordan, New York City, and New York the US state]
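
To make the many-to-many relation concrete, here is a minimal Java sketch of a lemma-to-candidate-entity dictionary; the lemma strings and entity IDs are invented for illustration and are not the authors' catalog.

    import java.util.*;

    // Minimal sketch of the many-to-many lemma -> candidate-entity relation.
    // Entity IDs and lemma strings are made up for illustration.
    public class LemmaCatalog {
        private final Map<String, List<Integer>> candidates = new HashMap<>();

        public void add(String lemma, int entityId) {
            candidates.computeIfAbsent(lemma, k -> new ArrayList<>()).add(entityId);
        }

        public List<Integer> candidatesOf(String lemma) {
            return candidates.getOrDefault(lemma, Collections.emptyList());
        }

        public static void main(String[] args) {
            LemmaCatalog cat = new LemmaCatalog();
            cat.add("Jordan", 11);     // Michael Jordan, basketball player
            cat.add("Jordan", 12);     // Michael I. Jordan, Berkeley professor
            cat.add("Jordan", 13);     // the country Jordan
            cat.add("Big Apple", 21);  // New York City
            System.out.println(cat.candidatesOf("Jordan"));  // [11, 12, 13]
        }
    }

(An actual spotter keys a trie on multi-word phrases rather than a hash map, so mention phrases can be detected in a single scan of the text.)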

  4. Features for disambiguation • Example context 1: “After the UNC workshop, Jordan gave a tutorial on nonparametric Bayesian methods.” • Example context 2: “After a three-season career at UNC, Jordan emerged as a league star with his leaping ability and slam dunks.” • Each context yields a sparse feature vector x over millions of features (for the first: nonparametric, workshop, Bayesian, tutorial, after, UNC; for the second: leap, slam dunk, league, season) • A sketch of such feature extraction follows
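
A minimal sketch of turning a mention's surrounding words into a sparse bag-of-words feature vector. The tokenizer, window size, and hashed feature IDs are assumptions made for illustration, not the system described in the talk.

    import java.util.*;

    // Sketch: bag-of-words features from a +/-W token window around a mention.
    // Tokenization, window size, and hashed feature IDs are illustrative choices.
    public class ContextFeatures {
        static final int WINDOW = 5;

        public static Map<Integer, Float> extract(String[] tokens, int mentionPos) {
            Map<Integer, Float> x = new HashMap<>();
            int lo = Math.max(0, mentionPos - WINDOW);
            int hi = Math.min(tokens.length - 1, mentionPos + WINDOW);
            for (int i = lo; i <= hi; i++) {
                if (i == mentionPos) continue;                            // skip the mention itself
                int f = tokens[i].toLowerCase().hashCode() & 0x7fffffff;  // hashed feature ID
                x.merge(f, 1.0f, Float::sum);                             // term-frequency weight
            }
            return x;
        }

        public static void main(String[] args) {
            String[] toks = "after the UNC workshop Jordan gave a tutorial on nonparametric Bayesian methods".split(" ");
            System.out.println(extract(toks, 4).size() + " distinct features");
        }
    }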

  5. Inferring the correct entity • Each lemma is associated with a set of candidate entities • For each lemma ℓ and each candidate entity e, learn a weight vector w(ℓ,e) in the same space as the feature vectors • When deployed to resolve an ambiguity about lemma ℓ with context feature vector x, choose ê = argmax over candidates e of w(ℓ,e) · x • Linear model; dot product (see the sketch below)
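
A minimal sketch of the linear rule above, assuming both w(ℓ,e) and x are sparse maps from feature ID to weight; the class and method names are illustrative.

    import java.util.*;

    // Sketch of choosing argmax_e w(l,e) . x over the candidates of one lemma.
    // Sparse-vector representation and names are illustrative.
    public class LinearDisambiguator {
        static float dot(Map<Integer, Float> w, Map<Integer, Float> x) {
            float s = 0f;
            for (Map.Entry<Integer, Float> kv : x.entrySet())
                s += kv.getValue() * w.getOrDefault(kv.getKey(), 0f);
            return s;
        }

        // weights: candidate entity e -> weight vector w(l,e)
        static int bestEntity(Map<Integer, Map<Integer, Float>> weights, Map<Integer, Float> x) {
            int best = -1;
            float bestScore = Float.NEGATIVE_INFINITY;
            for (Map.Entry<Integer, Map<Integer, Float>> e : weights.entrySet()) {
                float s = dot(e.getValue(), x);
                if (s > bestScore) { bestScore = s; best = e.getKey(); }
            }
            return best;   // -1 only if there were no candidates; purely a sketch
        }
    }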

  6. The (ℓ, f, e) → w map • An uncompressed key plus value takes 12 + 4 bytes = 128 bits per entry • ~500M entries → 8 GB just for the map • No primitive type can hold the 12-byte keys • With Java object overheads, easily 20 GB of RAM • What happens when we go from ~2M to ~100M entities? • Total marginal entropy: 33.6 bits per entry • Can we go from 128 bits down to 33.6 and beyond? • Must compress keys and values • And exploit correlations between them (the space arithmetic is spelled out below)
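
The space arithmetic behind these bullets, written out under the assumption of 4-byte ℓ, f, and e identifiers and a 4-byte weight:

    \[ (4 + 4 + 4) + 4 \ \text{bytes} = 16 \ \text{bytes} = 128 \ \text{bits per entry} \]
    \[ 5\times10^{8} \times 128 \ \text{bits} \approx 8 \ \text{GB (raw)}, \qquad
       5\times10^{8} \times 33.6 \ \text{bits} \approx 2.1 \ \text{GB (marginal entropy)}, \qquad
       5\times10^{8} \times 18 \ \text{bits} \approx 1.1 \ \text{GB} \]

The last figure matches the 18 bits/entry reported on slide 13 and the ~1.15 GB footprint in the conclusion.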

  7. (ℓ, f ) → {e → w} organization • When scanning documents for disambiguation, we first encounter the lemma ℓ and then the features f from the context around it • Initialize a score accumulator for each candidate entity e • For each feature f in the context: probe the data structure with (ℓ, f ), retrieve the sparse map {e → w}, and update the entity scores with each entry • Choose the top candidate entity • The structure that maps each lemma ℓ to its per-feature maps f1 → {e→w}, f2 → {e→w}, f3 → {e→w}, f4 → {e→w} is the “LFE map” or LFEM (see the sketch below)
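
A minimal sketch of the probe-and-accumulate loop described above. LfeMap.probe is a hypothetical interface standing in for the compressed LFEM, and the types are illustrative.

    import java.util.*;

    // Sketch of the disambiguation-time loop: probe the LFE map with (lemma, feature)
    // pairs and accumulate per-entity scores. LfeMap is a hypothetical interface.
    public class ScoreAccumulator {
        interface LfeMap {
            // Returns the sparse {entity -> weight} map for (lemma, feature), or null if absent.
            Map<Integer, Float> probe(int lemmaId, int featureId);
        }

        static int disambiguate(LfeMap lfem, int lemmaId, Map<Integer, Float> contextFeatures) {
            Map<Integer, Float> score = new HashMap<>();        // per-candidate accumulator
            for (Map.Entry<Integer, Float> f : contextFeatures.entrySet()) {
                Map<Integer, Float> ew = lfem.probe(lemmaId, f.getKey());
                if (ew == null) continue;                        // feature never seen with this lemma
                for (Map.Entry<Integer, Float> e : ew.entrySet())
                    score.merge(e.getKey(), f.getValue() * e.getValue(), Float::sum);
            }
            return score.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey).orElse(-1);          // -1: no candidate scored
        }
    }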

  8. Millions of entities globally, but few for a given lemma • Sort the candidate entities of each lemma by decreasing occurrence frequency in a reference corpus and assign short entity IDs 0, 1, 2, … with respect to that lemma • Use variable-length integer codes: the frequent short IDs get the shortest codes (see the sketch below) [Figure: for the lemma “Michael Jordan”, short IDs 0–5 are the basketball player; the CBS, PepsiCo, and Westinghouse executive; the machine learning researcher; the mycologist; the racing driver; and the goalkeeper]
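
A minimal sketch of assigning per-lemma short IDs by corpus frequency; the frequency counts and data layout are illustrative assumptions.

    import java.util.*;

    // Sketch: per-lemma short entity IDs, assigned in decreasing order of
    // occurrence frequency in a reference corpus. Counts here are made up.
    public class ShortIds {
        // candidateFreq: global entity ID -> corpus frequency, for one lemma's candidates
        static int[] assign(Map<Integer, Long> candidateFreq) {
            return candidateFreq.entrySet().stream()
                    .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
                    .mapToInt(Map.Entry::getKey)
                    .toArray();        // index in this array = short ID wrt the lemma
        }

        public static void main(String[] args) {
            Map<Integer, Long> freq = new HashMap<>();
            freq.put(9001, 500_000L);  // basketball player
            freq.put(9002, 40_000L);   // executive
            freq.put(9003, 25_000L);   // ML researcher
            int[] shortToGlobal = assign(freq);
            System.out.println(Arrays.toString(shortToGlobal)); // [9001, 9002, 9003]
        }
    }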

  9. Encoding of (ℓ, f ) → {e → w} • Within the segment for lemma ℓ1, entries are laid out per feature f1, f2, …, each followed by its {e → w} map, with e stored as a short ID • We used one particular variable-length code; others may be better • For adjacent short IDs we spend only one bit • Records have irregular sizes • Must read from the beginning of the segment to decompress • An index points to the start of the segment for each lemma ID (a sketch of one such code follows)
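
The slide does not name the code used. One common choice consistent with “one bit for adjacent short IDs” is to gap-encode sorted short IDs and write the gaps with an Elias gamma code, where a gap of 1 costs a single bit; the sketch below is that illustration, not necessarily the code from the paper.

    import java.util.*;

    // Sketch: Elias-gamma coding of gaps between sorted short entity IDs.
    // A gap of 1 (adjacent IDs) is encoded as the single bit "1".
    // This is an illustration; the paper's actual code may differ.
    public class GammaGapCoder {
        static void gamma(int n, StringBuilder bits) {         // n >= 1
            int len = 32 - Integer.numberOfLeadingZeros(n);    // bits needed for n
            for (int i = 0; i < len - 1; i++) bits.append('0'); // unary length prefix
            for (int i = len - 1; i >= 0; i--)                  // n itself, MSB first
                bits.append((n >>> i) & 1);
        }

        static String encode(int[] sortedShortIds) {
            StringBuilder bits = new StringBuilder();
            int prev = -1;
            for (int id : sortedShortIds) {
                gamma(id - prev, bits);                         // gap >= 1
                prev = id;
            }
            return bits.toString();
        }

        public static void main(String[] args) {
            // IDs 0, 1, 2 are adjacent: each of those gaps costs one bit.
            System.out.println(encode(new int[]{0, 1, 2, 7}));  // "1", "1", "1", then gamma(5)
        }
    }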

  10. Random access on (ℓ, f ) • We already support random access on ℓ • The number of distinct ℓ is on the order of 10 million • Cannot afford the time to decompress from the beginning of an ℓ block • Cannot afford a (full) index array over (ℓ, f ) • Within each ℓ block, allocate sync points • An old technique in IR indexing • New issues: outer allocation of the total sync budget among ℓ blocks, and inner allocation, i.e., tuning syncs to the measured (ℓ, f ) probe distribution (see the probe sketch below)
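
A minimal sketch of probing within one lemma block: binary-search the sync table for the last sync at or before the requested feature, then decode sequentially from that bit offset. The SegmentDecoder interface is a hypothetical stand-in for the bit-level decoder.

    import java.util.*;

    // Sketch of random access inside one lemma block via sync points.
    // syncFeature[i] is the feature ID at the i-th sync; syncBitOffset[i] is where
    // its record starts in the compressed segment. SegmentDecoder is hypothetical.
    public class SyncLookup {
        interface SegmentDecoder {
            // Position at a bit offset and read records sequentially until the
            // requested feature is found or passed; null if the feature is absent.
            Map<Integer, Float> decodeFrom(long bitOffset, int wantedFeature);
        }

        static Map<Integer, Float> probe(int[] syncFeature, long[] syncBitOffset,
                                         SegmentDecoder dec, int feature) {
            int i = Arrays.binarySearch(syncFeature, feature);
            if (i < 0) i = -i - 2;   // last sync whose feature <= the requested one
            if (i < 0) i = 0;        // before the first sync: start at the block beginning
            return dec.decodeFrom(syncBitOffset[i], feature);
        }
    }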

  11. Inner sync point allocation policies • Say Kℓ sync points are budgeted to lemma ℓ • To which features can we seek? For the others, sequential decode • DynProg: optimal expected probe time with a dynamic program • Freq: allocate syncs at the features f with the largest probe probability p(f | ℓ) • Equi: mark off segments with about an equal number of bits • EquiAndFreq: split the budget (the Equi policy is sketched below)
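
A minimal sketch of the Equi policy under the assumption that each feature record's compressed size in bits is known: walk the records in storage order and drop a sync roughly every totalBits/Kℓ bits. Names and tie-breaking are illustrative.

    import java.util.*;

    // Sketch of Equi inner allocation: place k syncs so consecutive syncs delimit
    // roughly equal numbers of compressed bits. recordBits[i] is the compressed
    // size of the i-th feature record in the lemma's segment.
    public class EquiAllocator {
        static int[] allocate(long[] recordBits, int k) {
            long total = Arrays.stream(recordBits).sum();
            long target = Math.max(1, total / k);            // bits per sync interval
            List<Integer> syncs = new ArrayList<>();
            long sinceLast = target;                          // force a sync at record 0
            for (int i = 0; i < recordBits.length && syncs.size() < k; i++) {
                if (sinceLast >= target) {
                    syncs.add(i);                             // sync at start of record i
                    sinceLast = 0;
                }
                sinceLast += recordBits[i];
            }
            return syncs.stream().mapToInt(Integer::intValue).toArray();
        }

        public static void main(String[] args) {
            long[] bits = {40, 12, 90, 7, 33, 61, 15, 22};
            System.out.println(Arrays.toString(allocate(bits, 3)));  // [0, 3, 6]
        }
    }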

  12. Outer allocation policies • Given an overall budget K, how many syncs Kℓ does leaf ℓ get? • Inputs: hit probability pℓ and number of bits bℓ in the leaf segment • An analytical expression for the effect of inner allocation can be intractable • Hit: Kℓ ∝ pℓ • HitBit: Kℓ ∝ pℓ·bℓ • SqrtHitBit: Kℓ ∝ √(pℓ·bℓ), assuming an equispaced inner allocation (see Managing Gigabytes); a sketch follows
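
A minimal sketch of the SqrtHitBit rule as interpreted above (Kℓ proportional to √(pℓ·bℓ), normalized to the budget K). The proportionality is read off the policy's name and the equispaced-inner-allocation assumption, so treat it as an interpretation rather than the paper's exact formula.

    import java.util.*;

    // Sketch: outer allocation of a total sync budget across lemma blocks,
    // with K_l proportional to sqrt(p_l * b_l). The normalization and rounding
    // scheme here is one simple choice among many.
    public class SqrtHitBitAllocator {
        static int[] allocate(double[] hitProb, long[] segmentBits, int totalBudget) {
            int n = hitProb.length;
            double[] score = new double[n];
            double sum = 0;
            for (int l = 0; l < n; l++) {
                score[l] = Math.sqrt(hitProb[l] * segmentBits[l]);
                sum += score[l];
            }
            int[] k = new int[n];
            for (int l = 0; l < n; l++)
                k[l] = (int) Math.floor(totalBudget * score[l] / sum);  // leftovers ignored here
            return k;
        }

        public static void main(String[] args) {
            double[] p = {0.5, 0.3, 0.2};
            long[] b = {1_000, 10_000, 100};
            System.out.println(Arrays.toString(allocate(p, b, 1000)));
        }
    }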

  13. Experiments • 500 million pages, mostly English, spam-free • The catalog has about two million lemmas and entities • Our best policy compresses the LFEM down to only 18 bits/entry, compared to 33.6 bits/entry marginal entropy and 128 bits/entry raw data [Figure: system pipeline in which a Spotter runs over the reference corpus and train/test folds to produce train and test contexts and (ℓ, f ) workloads; a Sampler and Smoother produce a smoothed (ℓ, f ) distribution; a disambiguation trainer and cross-validator produce the (ℓ, f ) → (e, w) model; a Compressor builds the L-F-E map; an Annotator and entity/type Indexer build the annotation index, the “payload”]

  14. Inner policies compared • Equi is close to the optimal DynProg but fast to compute • Freq is surprisingly bad: the probe distribution has a long tail • Blending Equi and Freq is worse than Equi alone • The relative order stays stable as the sample size increases: the long tail again [Plot: lookup cost per policy; lower is better]

  15. Diagnosis: Freq vs. Equi • The plots show cumulative seek cost starting at a sync, collapsing back to zero at the next sync (note the different scales) • Under Freq, the features with the largest frequency are not evenly placed; tail features in between lead to steep seek costs • Equi never lets the seek cost get out of hand • (How about permuting features? See the paper)

  16. Outer policies compared • Inner policy set to the best (DynProg) • SqrtHitBit is better than Hit, which is better than HitBit • Not surprising, given that DynProg behaves closer to Equi than to Freq [Plot: probe cost versus sync budget for each outer policy]

  17. Comparison with other systems • Compared against downloaded software and network services • A regression removes per-page and per-token overheads • LFEM wins, largely because of the syncs • LFEM needs far less RAM than the downloaded software

  18. Conclusion • Compressed in-memory multilevel maps for disambiguation • Random access via tuned sync allocation • >20 GB down to 1.15 GB • Faster than public disambiguation systems • Annotate 500M pages with 2M Wikipedia entities + index on 408 cores in ~18 hours • Also in the paper: design of compressed annotation index posting list
