
Efficient Diverse Search




  1. Efficient Diverse Search HY-562: Advanced Topics in Databases Orfanoudakis Nikolaos 2059 Spring 2014

  2. Introduction • Efficient Computation of Diverse Query Results • Diversity Definition and Impossibility Results • One-pass Algorithms • Probing Algorithms • Experiments • Efficient Diversity-Aware Search • Diversity Aware Search • The DIVGEN Approach • The DIVGEN Algorithm • References

  3. Introduction 1/2 • On the Web, users issue search queries using combinations of form inputs and keywords • Only the most relevant results are shown • An important concern in such applications is the ability to return a diverse set of results that best reflects the inventory of available listings

  4. Introduction 2/2 There are several existing solutions to this problem: • Obtain all the query results and pick a diverse subset from them • This does not scale to large data sets, where a query may return a large number of results • Issue multiple queries to obtain diverse results; this guarantees diversity, but is inefficient for two reasons: • Issuing multiple queries hurts performance • Many of these queries may return empty results • A final method sometimes used is to retrieve only a sample of the query results and pick a diverse subset from the sample • This often misses rare but important listings absent from the sample

  5. Diversity Definition and Impossibility Results • Devise evaluation algorithms that implement diversity inside the database/IR engine • The algorithms use an inverted-list index whose item ids are encoded as Dewey identifiers • We first develop a one-pass algorithm that produces k diverse answers with a single scan over the inverted lists

  6. Dewey Ids What are Dewey Ids? The path vector of numbers from the root to an element uniquely identifies that element and can be used as its ID. The Dewey encoding captures the notion of distinct values, from which we need a representative subset in the final result.
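
The idea above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation; the make → model → color hierarchy is assumed from the car example used throughout the deck:

```python
# Sketch: Dewey IDs as root-to-item path vectors over an assumed
# make -> model -> color -> item hierarchy.

def dewey(*path):
    """Encode a root-to-item path as a Dewey ID (a tuple of child indices)."""
    return tuple(path)

def shared_prefix_len(a, b):
    """Number of leading components two Dewey IDs agree on; a longer
    shared prefix means the items agree on more high-priority attributes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

honda_civic_red = dewey(0, 0, 0, 0)   # Honda / Civic / Red / item 0
honda_civic_blue = dewey(0, 0, 1, 0)  # Honda / Civic / Blue / item 0
toyota_prius_red = dewey(1, 0, 0, 0)  # Toyota / Prius / Red / item 0

# Two Honda Civics share a longer prefix than a Honda and a Toyota.
assert shared_prefix_len(honda_civic_red, honda_civic_blue) == 2
assert shared_prefix_len(honda_civic_red, toyota_prius_red) == 0
```

Because Dewey IDs sort lexicographically by attribute priority, items with the same distinct value are contiguous, which is what lets the algorithms below skip over them.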

  7. Algorithm • The key idea of the algorithm is to explore a bounded number of answers within the same distinct value and use B+-trees to skip over similar answers • This algorithm is optimal when we are allowed only a single pass over the data • It can be improved when we are allowed a small number of probes into the data • The probing algorithm uses just a small number of probes — at most 2k • Both algorithms are provably correct, and they support both unscored and scored versions of diversity

  8. Diversity Definition If the user issues a query for all cars and we can only display three results, then it is clearly better to show one Honda and two Toyotas. It is important to vary the values of higher-priority attributes before varying the values of lower-priority attributes.

  9. Diversity Definition (2) • Definition 1: Diversity Ordering. A diversity ordering of a relation R with attributes A, denoted ≺R, is a total ordering of the attributes in A • In our example, Make ≺ Model ≺ Color ≺ Year ≺ Description ≺ Id • We also need a similarity measure between pairs of items, denoted SIM(x, y), with the goal of finding a result set S whose items are least similar to each other, i.e., most diverse • Similarity function: SIM(x, y) = 1 if x and y agree on the highest-priority attribute, and 0 otherwise
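
The slide's similarity function is simple enough to state directly in code. A minimal sketch, with the diversity ordering from the example hard-coded as a list (attribute names are illustrative):

```python
# The diversity ordering from the example: highest-priority attribute first.
DIVERSITY_ORDER = ["Make", "Model", "Color", "Year", "Description", "Id"]

def SIM(x, y):
    """1 if x and y agree on the highest-priority attribute, 0 otherwise."""
    top = DIVERSITY_ORDER[0]
    return 1 if x[top] == y[top] else 0

civic = {"Make": "Honda", "Model": "Civic"}
accord = {"Make": "Honda", "Model": "Accord"}
prius = {"Make": "Toyota", "Model": "Prius"}

assert SIM(civic, accord) == 1  # both Hondas: similar
assert SIM(civic, prius) == 0   # different makes: diverse
```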

  10. Impossibility Results • The IR score of an item depends on the item itself and possibly on statistics from the entire corpus, but diversity depends on the other items in the query result set • The class of IR systems based on inverted lists: • each unique attribute value/keyword has a list of the items that contain that attribute value/keyword • Given a query Q, we find the lists corresponding to the attribute values/keywords in Q and aggregate them to find a set of k top-scored results

  11. One-pass Algorithms • The algorithms need to iterate over a set of inverted lists • These lists are merged into one list, mergedList • The algorithm makes calls to mergedList.next(id), which returns the smallest deweyID in mergedList that is greater than or equal to id • mergedList.next(id, RIGHT) returns the largest deweyID that is less than or equal to id
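
A minimal sketch of this interface, assuming Dewey IDs are represented as tuples (which sort lexicographically) and using binary search for both directions:

```python
import bisect

class MergedList:
    """Sketch of the merged-inverted-list interface from the slide.
    Dewey IDs are kept as sorted tuples; lookups use binary search."""
    def __init__(self, lists):
        # Merge several inverted lists into one sorted list of Dewey IDs.
        self.ids = sorted(set().union(*map(set, lists)))

    def next(self, id):
        """Smallest Dewey ID >= id, or None if past the end."""
        i = bisect.bisect_left(self.ids, id)
        return self.ids[i] if i < len(self.ids) else None

    def next_right(self, id):
        """Largest Dewey ID <= id, or None (the slide's next(id, RIGHT))."""
        i = bisect.bisect_right(self.ids, id)
        return self.ids[i - 1] if i > 0 else None

ml = MergedList([[(0, 0, 1), (1, 0, 0)], [(0, 1, 0), (1, 0, 0)]])
assert ml.next((0, 0, 0)) == (0, 0, 1)
assert ml.next_right((0, 2, 0)) == (0, 1, 0)
```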

  12. One-pass Unscored Algorithm • The core idea of this algorithm is to explore buckets of distinct values sequentially, ensuring each time that k answers are found.

Algorithm 1 Unscored one-pass Algorithm
Driver Routine:
1:  id = mergedList.next(0)
2:  root = new Node(id, 0)
3:  id = mergedList.next(id)
4:  while (root.numItems() < k && id != NULL)
5:      root.add(id)
6:      id = mergedList.next(id+1)
7:  while (id != NULL)
8:      root.add(id)
9:      root.remove()
10:     skipId = root.getSkipId()
11:     id = mergedList.next(skipId)
12: return root.returnResults()

  13. To produce that output, the algorithm uses an inverted list to retrieve all cars matching the query and scans them in the order in which they appear (left to right). Given a query Q looking for descriptions containing 'Low', we are left with the corresponding tree. If k = 3, a balanced result would return (at least) one car of each make in the database; this corresponds to a Honda and a Toyota.
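
The balancing intuition in this example can be sketched as a round-robin over the highest-priority attribute. This is a deliberate simplification: the real algorithm achieves the same effect implicitly, in one pass, via the Dewey-ID tree and skips, rather than by materializing groups:

```python
from collections import defaultdict

def diverse_top_k(cars, k):
    """Simplified sketch of the balancing in the example: group the
    matching cars by make, then pick round-robin across makes until
    k cars have been chosen."""
    by_make = defaultdict(list)
    for car in cars:
        by_make[car["Make"]].append(car)
    groups = list(by_make.values())
    picked, i = [], 0
    while len(picked) < k and any(groups):
        g = groups[i % len(groups)]
        if g:
            picked.append(g.pop(0))
        i += 1
    return picked

matches = [{"Make": "Honda", "Model": "Civic"},
           {"Make": "Toyota", "Model": "Prius"},
           {"Make": "Toyota", "Model": "Camry"}]
result = diverse_top_k(matches, 3)
assert {c["Make"] for c in result} == {"Honda", "Toyota"}  # both makes covered
assert len(result) == 3
```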

  14. One-pass Scored Algorithm • The difference with the unscored algorithm lies in which parts of the tree we can skip over • In the former algorithm it was easy to determine the smallest ID that would not be removed on the next call to remove() • In the scored case, we must add any item whose score is strictly greater than the current minScore • But we can find the smallest ID that would immediately be removed, given that its score is no greater than minScore • So we replace line 11 of the previous algorithm with: id = mergedList.next(id+1, skipId, root.minScore) This call returns the smallest id greater than or equal to id+1 whose score exceeds root.minScore, or skipId if no such id exists before it

  15. Unscored Probing Algorithm • The main idea of the probing algorithm is to traverse the available levels multiple times, picking one item at a time until k answers are found • Consider a query Q looking for cars with 'Low' in the description (shown in Figure 3). Assuming k = 3, the algorithm first picks the first Honda Civic, then the last Toyota. It then looks for a car make "between" Honda and Toyota. Since there is none, it continues by picking the next car that guarantees diversity, in this case the first Toyota Prius.

  16. Scored Probing Algorithm • The first stage of the algorithm calls WAND (or any scoring algorithm) to obtain an initial top-k list • Let θ be the score of the lowest-scoring item in the top-k list returned • Diversity is then guaranteed only among items whose score equals θ

  17. Experiments • MultQ rewrites the input query into multiple queries and merges their results to produce a diverse set • Naive evaluates the query and returns all its results • Basic returns the first k answers it finds, without guaranteeing diversity • Prefixes: U denotes the unscored case, S the scored case

  18. Response time of UNaive, UBasic, UOnePass and UProbe: (i) all our algorithms outperform the naive case, which evaluates the full query, and (ii) diversity incurs negligible overhead even for large values of k.

  19. The response time of the scored algorithms as the number of results requested is varied

  20. The naive approaches, MultQ, UNaive, SNaive are orders of magnitude slower than the other approaches.

  21. Conclusion • The experiments showed that the algorithms are scalable and efficient • Diversity can be implemented with little additional overhead when compared to traditional approaches

  22. Efficient Diversity-Aware Search Introduction 1/2 • The information retrieval scenario consists of identifying a fairly small number of documents in response to a query expressing an information need • Diversification becomes especially important in a new class of emerging data exploration problems, arising in large-scale information-aggregation scenarios • An important challenge in this process is to identify a small number of informative and representative documents to present

  23. Example 1 • Grapevine aggregates information from millions of documents (blogs, tweets, news sources, etc.) on a daily basis and identifies high-impact and emerging events • To enable effective exploration of a news story, Grapevine needs to present all of its aspects • Using standard information retrieval techniques here would yield somewhat redundant news articles, covering only the main sub-thread of discussion around the event • To show the other important aspects of the news story, diversification techniques need to be employed • Grapevine also enables users to explore each sub-thread of the story

  24. Introduction 2/2 • An intuitive diversification approach that subsumes existing, tried-and-proven semantics for diversification • It reduces the redundancy (pleonasm) of search results in a principled way, by favoring results that are dissimilar to higher-ranking results • It has few, intuitive tunable parameters, which can easily be related back to user expectations and/or learned from search-engine logs

  25. DIVGEN • To the best of our knowledge, the first highly efficient algorithm for such general diversification semantics • A threshold-style algorithm that accounts for diversity by intelligently using new data-access primitives • The first low-overhead data-access prioritization scheme with theoretical quality guarantees and good performance in practice

  26. DIVERSITY AWARE SEARCH • DAS, formulated in terms of content-based diversification, can also explicitly handle intent-based diversification • It provides a view of the problem better suited to analysis and algorithm development

  27. DAS: Data Model • Documents are represented using a vector space model and viewed as weighted sets of features • In the general case of textual documents, features can be keywords; features can also be paths in the corpus graph, or the set of users who recommend a document • The answer to a query is a ranked list of k documents • The goal is to return the answer whose documents are of most use to the user

  28. DAS: User Behavior Model • Having posed a query, a user is presented a list of results and examines them in their order of presentation • The user's goal is to locate one or more "useful" documents • The usefulness of a document d is the product of its relevance and its novelty • The query-focus parameter controls the amount of diversification and has a clear probabilistic interpretation that can be related back to user expectations
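
A minimal sketch of the usefulness model above. The exact novelty formula and the `focus` parameter name are assumptions for illustration; only the product structure (usefulness = relevance × novelty) and the role of the query-focus parameter come from the slide:

```python
def usefulness(relevance, doc_features, seen_features, focus=0.5):
    """Sketch of the user model: usefulness = relevance * novelty.
    Novelty drops with overlap against already-examined results; the
    `focus` parameter (name assumed) controls how strongly: focus=1
    ignores diversity entirely, focus=0 maximally penalizes redundancy."""
    if not doc_features:
        return 0.0
    overlap = len(doc_features & seen_features) / len(doc_features)
    novelty = focus + (1 - focus) * (1 - overlap)
    return relevance * novelty

seen = {"election", "results"}
fresh = usefulness(0.9, {"protest", "turnout"}, seen)
stale = usefulness(0.9, {"election", "results"}, seen)
assert fresh > stale  # at equal relevance, the novel document is more useful
```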

  29. DAS: Answer Quality • To quantify the overall quality of an answer: • measures such as the probability that at least one result is useful can be used • measures that take into account the order in which results are presented are better suited to our user model

  30. DAS Properties • In general, given our model of user behavior, answer quality can be quantified by any ordering of answers that satisfies a natural consistency property

  31. DIVGEN APPROACH • The first step is to compute the relevance of each document to the query • Output the highest-scoring document d • Update the usefulness of all other documents, based on their similarity to d, and repeat this procedure k times • This approach needs to access the entire corpus, making it too inefficient to even consider at a moderately large scale
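
The steps above amount to greedy diversification, sketched below. The multiplicative discount rule is illustrative (the paper's usefulness update depends on its specific user model), but the structure — emit the best document, penalize similar ones, repeat k times — matches the slide:

```python
def greedy_diversify(relevance, similarity, k):
    """Sketch of the greedy procedure above: start from per-document
    relevance scores, repeatedly emit the most useful document, and
    discount the remaining documents by their similarity to it."""
    usefulness = dict(relevance)
    answer = []
    for _ in range(min(k, len(usefulness))):
        best = max(usefulness, key=usefulness.get)
        answer.append(best)
        del usefulness[best]
        for d in usefulness:
            # Illustrative discount: near-duplicates lose most usefulness.
            usefulness[d] *= (1 - similarity(best, d))
    return answer

rel = {"a": 1.0, "b": 0.9, "c": 0.5}
# "b" is nearly a duplicate of "a"; "c" is unrelated to both.
sim = lambda x, y: 0.95 if {x, y} == {"a", "b"} else 0.0
assert greedy_diversify(rel, sim, 2) == ["a", "c"]
```

Note why this is too slow at scale: every iteration touches every remaining document, so the whole corpus must be scored and rescored k times.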

  32. DAS Threshold Algorithm 1/2 • To overcome the need to process the entire corpus, observe that a threshold-style algorithm can be used to incrementally compute documents in descending order of relevance • A Sequential Access (SA) on a feature i retrieves the id of the document with the next-highest weight for feature i, and can provide: • the exact weight of the feature in that document, or • an upper bound on said weight • A Random Access (RA) on a feature i and document d retrieves the exact weight of i in d
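
The two access primitives can be sketched over per-feature posting lists sorted by weight. The class and method names below are illustrative, not the paper's API:

```python
class FeatureIndex:
    """Sketch of the SA/RA primitives described above (interface assumed).
    Each feature's posting list is kept sorted by weight, highest first."""
    def __init__(self, postings):
        # postings: {feature: [(doc_id, weight), ...]}
        self.by_weight = {f: sorted(p, key=lambda x: -x[1])
                          for f, p in postings.items()}
        self.pos = {f: 0 for f in postings}
        self.exact = {f: dict(p) for f, p in postings.items()}

    def sequential_access(self, feature):
        """Next (doc, weight) in descending weight order, or None."""
        i = self.pos[feature]
        if i >= len(self.by_weight[feature]):
            return None
        self.pos[feature] += 1
        return self.by_weight[feature][i]

    def random_access(self, feature, doc):
        """Exact weight of `feature` in `doc` (0 if absent)."""
        return self.exact[feature].get(doc, 0)

idx = FeatureIndex({"election": [("d1", 0.9), ("d2", 0.4)],
                    "turnout":  [("d2", 0.8)]})
assert idx.sequential_access("election") == ("d1", 0.9)
assert idx.sequential_access("election") == ("d2", 0.4)
assert idx.random_access("turnout", "d1") == 0
```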

  33. DAS Threshold Algorithm 2/2 • The stream of documents produced by GENERATE needs to be reranked taking diversity into account in order to compute the answer; for this we use FILTER • FILTER reads documents as they are produced by GENERATE, in descending order of relevance • After retrieving the actual contents of each document, FILTER computes its pleonasm (redundancy)

  34. DAS GENFILT • This algorithm, called GENFILT, is a two-step pipeline: a first GENERATE step incrementally produces documents in descending order of relevance, and a second FILTER step incrementally reranks them, taking diversity into account • FILTER computes the usefulness of each document with respect to the documents already emitted, and places it in a max-heap keyed on usefulness • When the document at the head of the heap has usefulness greater than the relevance of the last document read, it is the next document in the answer • The entire procedure is repeated until k documents have been emitted
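
The heap-based emission condition can be sketched as follows. This is a simplification: usefulness is computed once when a document is pushed (the real FILTER re-tightens scores as the answer grows), but it shows why the condition is safe — since GENERATE emits in descending relevance, no unread document can beat the head of the heap:

```python
import heapq

def genfilt(generated, usefulness_of, k):
    """Sketch of the FILTER step: `generated` yields (doc, relevance)
    in descending relevance; a document is emitted once its usefulness
    is at least the relevance of the last document read, since relevance
    upper-bounds the usefulness of anything not yet generated.
    `usefulness_of(doc, answer)` scores doc against what was emitted."""
    heap, answer = [], []
    for doc, relevance in generated:
        heapq.heappush(heap, (-usefulness_of(doc, answer), doc))
        # Emit while the head's usefulness dominates all unread documents.
        while heap and len(answer) < k and -heap[0][0] >= relevance:
            _, best = heapq.heappop(heap)
            answer.append(best)
        if len(answer) == k:
            break
    return answer

# "b" is highly relevant but redundant, so "c" overtakes it.
u = {"a": 1.0, "b": 0.3, "c": 0.5}
stream = [("a", 1.0), ("b", 0.9), ("c", 0.5)]
assert genfilt(stream, lambda d, ans: u[d], 2) == ["a", "c"]
```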

  35. The DIVGEN Algorithm • DIVGEN is the result of "pushing" the notion of diversity into the core of the GENERATE algorithm • Novel data-access primitives: DIVGEN avoids computing the full usefulness scores of candidate documents, as well as retrieving their entire contents • A Bound Access (BA) on a document d retrieves the features with the highest weights in d, as well as an upper bound w on the weight of any other feature of d

  36. The DIVGEN Algorithm 2 • A Batch Sequential Access (BSA) on a feature i retrieves the documents with the highest weights of i, as well as an upper bound w on the weight of i in any other document • A Document Random Access (DocRA) on a document d retrieves all the features with nonzero weight in d, along with their exact weights

  37. The DIVGEN Algorithm 3 • Index maintenance: the data-access primitives utilize known indexing structures in novel ways • The necessary index structures efficiently support insertions, deletions and updates, and have a reasonable space overhead • Observe that forward and inverted indices are typically maintained in all IR systems anyway

  38. The DIVGEN Algorithm 4 DIVGEN: • maintains a set of candidate documents • keeps track of the current most promising document • outputs documents in order of decreasing usefulness • performs a variety of data accesses on documents, obtaining increasingly tighter bounds on their usefulness

  39. The DIVGEN Scheduling 1/2 • The scheduling of accesses needs to account for many factors: • the number of highly similar documents that are relevant to the query • the scoring functions used • the state of query processing • The goal, clearly, is to perform the data accesses that will lead fastest to query-processing completion

  40. The DIVGEN Scheduling 2/2 • The scheduling procedure needs to be very lightweight: if it has a high overhead, that overhead will overshadow any potential performance benefits • The algorithm performs the accesses with the highest benefit/cost ratio until the cost budget for the current round has been used up
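
The benefit/cost rule above can be sketched as a greedy loop per round. The access names and the (benefit, cost) triples are purely illustrative; in DIVGEN these would be estimated from the state of query processing:

```python
def schedule_round(candidates, budget):
    """Sketch of the scheduling rule above: greedily perform the accesses
    with the highest benefit/cost ratio until the round's cost budget is
    exhausted. `candidates` is a list of (name, benefit, cost) triples."""
    chosen, spent = [], 0
    for name, benefit, cost in sorted(candidates,
                                      key=lambda a: -a[1] / a[2]):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

# Hypothetical pending accesses: ratios are 5.0, 3.0 and 1.0 respectively.
accesses = [("BA d1", 5.0, 1), ("BSA f3", 9.0, 3), ("DocRA d7", 2.0, 2)]
assert schedule_round(accesses, 4) == ["BA d1", "BSA f3"]
```

Sorting by ratio keeps the per-round overhead at O(n log n) in the number of pending accesses, consistent with the lightweight-scheduling requirement.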

  41. Evaluation • In experiments with real-world data, DIVGEN is shown to be an efficient and promising solution to DAS • It is almost two orders of magnitude faster than the GENFILT baseline, and has reasonable runtimes • Overall, the performance of DIVGEN is very promising

  42. Conclusion • Studied diversity-aware search in a setting that captures and extends established approaches, focusing on content-based result diversification • Presented DIVGEN, an efficient threshold algorithm for diversity-aware search • The choice of data accesses to perform is crucial to performance

  43. Personal Opinion 1st • The algorithms cover, to a large extent, the diversification of query results • The paper is not written in the clearest way, and it is difficult to follow all the details • A possible direction for future work is to produce weighted results depending on the different values of an attribute

  44. Personal Opinion 2nd • There is still room for improvement, since no existing approach fully solves the problem and the data-access scheme can be refined further • It would be quite interesting to explore DAS from a user's perspective and answer questions such as how to best tune the query-focus parameters for each query, in collaboration with the user, to maximize user satisfaction

  45. References • Erik Vee, Utkarsh Srivastava, Jayavel Shanmugasundaram, Prashant Bhat, Sihem Amer-Yahia: Efficient Computation of Diverse Query Results. ICDE 2008 • Albert Angel, Nick Koudas: Efficient Diversity-aware Search. SIGMOD 2011
