This paper presents a novel approach to two-sided error-tolerant search, addressing uncertainties in text queries and documents. It highlights the importance of accommodating user mistakes in query input and OCR errors in document data. By developing a clustering method for vocabulary, the approach enables efficient indexing and retrieval. Focused on minimizing cover index size while maximizing recall and precision, this method demonstrates significant improvements over traditional baseline methods, offering a robust solution for efficient search in uncertain contexts.
Fast Two-Sided Error-Tolerant Search
Hannah Bast, Marjan Celikik
University of Freiburg, Germany
KEYS 2010
Motivation
• Handling uncertainty in text search is important
• Query side: users make mistakes when typing the query
  • Either due to mistyping
  • Or because they do not know the correct spelling or have incomplete knowledge about the underlying data
• Document side: the documents themselves contain mistakes
  • Those who type the documents also make mistakes
  • OCR errors
Efficient Two-Sided Error-Tolerant Search
State Of The Art
• Not much work on fast error-tolerant search
• There is prior work on document-side error tolerance
• Overall only a few relevant papers in the literature
• BASELINE: replace each query word by a disjunction of similar words
• A lot of work has been done on approximate string matching / searching
BASELINE is anything but efficient
• Example: the query
  fast AND list AND intersction
  is rewritten as
  fast AND list AND (intersection OR interrsection OR intersession OR intersacitionn OR intrasection OR …)
• There can be hundreds of similar words!
• Large list merging and disk I/O overhead
• But the current state of the art is not much faster than BASELINE …
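The BASELINE rewriting can be sketched as follows. This is a minimal illustration, assuming Levenshtein edit distance as the similarity measure and a linear scan over the vocabulary; the function names, the toy vocabulary, and the threshold are hypothetical and not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar_words(q, vocabulary, threshold):
    """All vocabulary words within the edit-distance threshold of q (assumed measure)."""
    return sorted(w for w in vocabulary if edit_distance(q, w) <= threshold)


def baseline_rewrite(query_words, vocabulary, threshold=1):
    """Replace every query word by the disjunction of its similar words."""
    parts = []
    for q in query_words:
        similar = similar_words(q, vocabulary, threshold) or [q]
        if len(similar) == 1:
            parts.append(similar[0])
        else:
            parts.append("(" + " OR ".join(similar) + ")")
    return " AND ".join(parts)
```

With the toy vocabulary `{"fast", "list", "intersection", "intersaction", "merge"}` and threshold 1, the query words `fast`, `list`, `intersction` become `fast AND list AND (intersaction OR intersection)`. On a real vocabulary each disjunction can grow to hundreds of terms, which is the list-merging overhead the slide points to.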
Our Approach - Clustering
• Based on clustering of the vocabulary
  • A vocabulary V is the set of all words in a corpus
  • The clusters may overlap, i.e. a word can belong to a few clusters
• Definition (cover)
  • Let q be a keyword, K a clustering of V, and S_q the set of all words in V within a distance threshold T of q. An exact cover of S_q is a set of clusters from K whose union contains S_q. An approximate cover of S_q does not necessarily contain all of S_q.
  • The number of sets n in the cover is called the cover index
  • Precision of a cover with union C is defined as |S_q ∩ C| / |C|
  • Recall of a cover with union C is defined as |S_q ∩ C| / |S_q|
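The quantities in the definition can be computed directly from sets. The sketch below assumes the usual set-based reading of precision and recall (fraction of the cover's union that is similar to q, and fraction of the similar words that the cover contains); the function name, the cluster ids, and the cluster contents are made up for illustration.

```python
def cover_stats(similar, clustering, cover_ids):
    """Cover index, recall, and precision of a set of clusters for one keyword.

    similar    -- set of words similar to the keyword
    clustering -- dict mapping cluster id -> set of words (clusters may overlap)
    cover_ids  -- ids of the clusters that form the cover
    """
    covered = set().union(*(clustering[c] for c in cover_ids))
    hit = similar & covered
    return {
        "cover_index": len(cover_ids),
        "recall": len(hit) / len(similar) if similar else 1.0,
        "precision": len(hit) / len(covered) if covered else 1.0,
        "covers_all": similar <= covered,  # True for an exact cover
    }


# Hypothetical toy data.
similar = {"algorithm", "algoithm", "algoritm"}
clustering = {
    59: {"algorithm", "algoritm", "logarithm"},
    1017: {"algoithm"},
    61: {"house"},
}
exact = cover_stats(similar, clustering, [59, 1017])
approximate = cover_stats(similar, clustering, [59])
```

Here the two-cluster cover contains every similar word (recall 1.0) but also picks up `logarithm` (precision 0.75), while the one-cluster cover trades recall for a smaller cover index.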
Our Approach - Clustering
• Compute a clustering so that for each q we can compute a good cover:
  • (C1) with cover index as small as possible
  • (C2) with recall as large as possible
  • (C3) with precision as large as possible
  • (C4) with frequency-weighted overlap as small as possible
Using the Clustering – Indexing
• For each occurrence of a word, determine its clusters
• Add corresponding artificial postings to the index by prepending the cluster ids, e.g. for the word house in Doc. 7012, which is in clusters 165 and 9823:
  house → Doc. 7012
  C:165:house → Doc. 7012
  C:9823:house → Doc. 7012
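A minimal sketch of this indexing step, assuming a plain in-memory inverted index that maps each term to a list of document ids. The function name and data layout are hypothetical; only the `C:<cluster id>:<word>` term format comes from the slide.

```python
from collections import defaultdict


def build_index(docs, word_clusters):
    """Inverted index with one artificial posting per (occurrence, cluster) pair.

    docs          -- dict mapping doc id -> list of words in the document
    word_clusters -- dict mapping word -> set of cluster ids it belongs to
    """
    index = defaultdict(list)
    for doc_id, words in docs.items():
        for word in words:
            index[word].append(doc_id)  # the ordinary posting
            for cid in sorted(word_clusters.get(word, ())):
                index[f"C:{cid}:{word}"].append(doc_id)  # artificial posting
    return dict(index)


docs = {7012: ["house", "rules"]}
word_clusters = {"house": {165, 9823}}
index = build_index(docs, word_clusters)
```

Each occurrence of a word that lies in k clusters adds k extra postings, so overlap between clusters of frequent words directly increases index size. This is one way to see why criterion (C4) matters.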
Using the Clustering – Query Time
• For each q, compute the set of similar words S_q and all affected cluster ids
• Compute a minimal cover index: for the given cover recall (and precision), there is no cover with a smaller cover index (similar to the set cover problem)
• Transform q into a disjunction of prefix queries, e.g. algoritm becomes C:59:* OR C:1017:*
• Use efficient prefix search to process the transformed query (we use the HYB index)
[Figure: words similar to algoritm (algorithm, alggorithm, algoithm, algoirthm, aglorithm, logarithm, algorithmica, algorithmic, …), each annotated with its cluster ids; clusters 59 and 1017 are highlighted]
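The query-time transformation can be sketched as below. The slides only state that the problem resembles set cover; the greedy heuristic used here is the textbook approximation for set cover, swapped in for illustration, and it does not guarantee a minimal cover index. The cluster contents are invented (the word-to-cluster pairing in the figure is not recoverable), and the function names are hypothetical.

```python
def greedy_cover(similar, clustering, min_recall=1.0):
    """Pick clusters greedily until the requested recall over `similar` is reached."""
    remaining = set(similar)
    target = min_recall * len(similar)
    chosen = []
    while len(similar) - len(remaining) < target:
        # Cluster that covers the most still-uncovered similar words;
        # ties go to the smallest cluster id.
        best = max(sorted(clustering), key=lambda c: len(clustering[c] & remaining))
        gain = clustering[best] & remaining
        if not gain:
            break  # no cluster can improve recall any further
        chosen.append(best)
        remaining -= gain
    return chosen


def to_prefix_query(cluster_ids):
    """Disjunction of prefix queries over the artificial C:<id>:<word> terms."""
    return " OR ".join(f"C:{cid}:*" for cid in cluster_ids)


# Hypothetical clustering for the query word "algoritm".
similar = {"algorithm", "alggorithm", "algoithm", "algoirthm", "aglorithm"}
clustering = {
    59: {"algorithm", "alggorithm", "aglorithm"},
    221: {"alggorithm"},
    1017: {"algorithm", "algoithm", "algoirthm", "logarithm"},
}
cover = greedy_cover(similar, clustering)
query = to_prefix_query(cover)
```

On this toy data the cover is clusters 59 and 1017, giving the same transformed query as on the slide, `C:59:* OR C:1017:*`. Lowering `min_recall` lets the loop stop earlier and return a smaller cover index at the cost of missing some similar words.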
Computing a Clustering
• How to compute a clustering with favorable properties (C1) – (C4)?
• It’s easy to optimize for (C1) alone, but then (C2) will suffer
• It’s easy to optimize for (C1) – (C3) alone, but then (C4) will suffer, etc.
[Figure: overlapping clusters v, x, y, z of variants of algorithm (algoirtm, algoithm, a1gor1thm, aglorithmm, algortm, algoritluq, algoritw2, …), with the resulting postings C:v:algorithm, C:x:algorithm, C:y:algorithm, C:z:algorithm]
Experimental results
[Tables: average query times; average number of clusters and similar words]
Experimental results
[Tables: average cover precision and recall; index sizes]