Automatic Categorization of Query Results

A paper by Kaushik Chakrabarti, Surajit Chaudhuri, and Seung-won Hwang. Presented by Arjun Saraswat.

Flow of the Presentation
1. Introduction
2. Motivation
3. Basics of Model
4. Cost Estimation
5. Algorithms
6. Experimental Evaluation
7. Conclusion

INTRODUCTION

Introduction
• This paper addresses the “too many answers” problem, which is often referred to as information overload.
• Information overload occurs when the user is not certain what she is looking for. In such situations she typically issues a broad query to avoid excluding potentially interesting results.
• There are two techniques for handling information overload: categorization and ranking. This paper covers the categorization technique.

MOTIVATION

Motivation
Example: A user issues a query on the MSN House & Home database with the following specifications:
Area: Seattle/Bellevue area of Washington, USA
Price range: $200,000 to $300,000
• The query returns 6,045 results. It is hard for the user to separate the interesting results from the uninteresting ones, which wastes a lot of her time and effort.
• The categorization technique introduced in this paper addresses this problem: such queries are answered with a hierarchical category structure that is based on the contents of the answer set.
• The main goal is to reduce information overload.

Motivation
Fig. 1. Hierarchical categorization of the results of the example query.

Basics of Model

Basics of Model
• R = a set of tuples. It can be a base relation, a materialized view, or the result of a query Q.
• Q = an SPJ (select-project-join) query.
• A hierarchical categorization of R is a recursive partitioning of the tuples in R based on the data attributes and their values, as shown in Fig. 1.
• Base case: The root (level 0) contains all the tuples in R. This tuple set is partitioned into mutually disjoint categories using a single attribute.
• Inductive step: Given a node C at level (l-1), the set of tuples tset(C) contained in C is partitioned into ordered, mutually disjoint subcategories (level l nodes) using an attribute that is the same for all nodes at level (l-1).

Basics of Model
• A node C is partitioned only if it contains more than a certain number of tuples. The attribute used for the partitioning is called the categorizing attribute of level l and the subcategorizing attribute of level (l-1).
• An attribute used at one level is not used again at later levels.
• Category label: the predicate label(C) describing node C. Examples: ‘Neighborhood: Redmond, Bellevue’ and ‘Price: 200K - 225K’.
• Tuple-set tset(C): the set of tuples contained in C, occurring either directly under C or indirectly under its subcategories. Example: tset for the category with label ‘Neighborhood: Seattle’ is the set of all homes in R that are located in Seattle.
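The recursive partitioning described above can be sketched in a few lines of Python. This is only an illustration: the function name `partition_by_attribute`, the dictionary representation of tuples, and the sample homes are assumptions made for the example and do not come from the paper.

```python
from collections import defaultdict

def partition_by_attribute(tuples, attribute):
    """Split a tuple-set into mutually disjoint categories, one per distinct value."""
    categories = defaultdict(list)
    for row in tuples:
        categories[row[attribute]].append(row)
    return dict(categories)

# Hypothetical answer set R (illustrative values only).
homes = [
    {"neighborhood": "Seattle", "bedrooms": 3, "price": 210_000},
    {"neighborhood": "Bellevue", "bedrooms": 4, "price": 250_000},
    {"neighborhood": "Seattle", "bedrooms": 4, "price": 290_000},
]

# Level 1: categorize the root by neighborhood.
level1 = partition_by_attribute(homes, "neighborhood")
# Level 2: categorize one level-1 node by a different attribute.
seattle_by_bedrooms = partition_by_attribute(level1["Seattle"], "bedrooms")
```

Here `level1["Seattle"]` plays the role of tset(C) for the category labeled ‘Neighborhood: Seattle’, and the second call shows that a later level uses an attribute that has not been used before.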

Basics of Model
Important points for each level:
• Determine the categorizing attribute for that level.
• Partition the values of that attribute in such a way as to minimize the information overload on the user.
Exploration model: two models capture the two common scenarios.
1. ‘All’ scenario: the user explores until she has found all the relevant tuples.
2. ‘One’ scenario: the user explores until she finds the first relevant tuple.

Basics of Model
1. ‘All’ scenario: the model of exploration of the subtree rooted at an arbitrary node C, where C1, ..., Cn are the subcategories of C:
EXPLORE C
  if C is a non-leaf node
    CHOOSE one of the following:
      (1) Examine all tuples in tset(C)   // option SHOWTUPLES
      (2) for (i = 1; i ≤ n; i++)         // option SHOWCAT
            Examine the label of the ith subcategory Ci
            CHOOSE one of the following:
              (2.1) EXPLORE Ci
              (2.2) Ignore Ci
  else   // C is a leaf node
    Examine all tuples in tset(C)         // SHOWTUPLES is the only option

Basics of Model
2. ‘One’ scenario:
EXPLORE C
  if C is a non-leaf node
    CHOOSE one of the following:
      (1) Examine the tuples in tset(C) from the beginning until the first relevant tuple is found   // option SHOWTUPLES
      (2) for (i = 1; i ≤ n; i++)         // option SHOWCAT
            Examine the label of the ith subcategory Ci
            CHOOSE one of the following:
              (2.1) EXPLORE Ci
              (2.2) Ignore Ci
            if (choice = EXPLORE) break   // examine until the first relevant tuple is found
  else   // C is a leaf node
    Examine the tuples in tset(C) from the beginning until the first relevant tuple is found   // SHOWTUPLES is the only option

Cost Estimation

Cost Estimation
Cost model for the ‘All’ scenario
• CostAll(X,T) = the information overload cost (or simply cost) of a given user exploration X on the tree T.
• We want to generate the tree that minimizes the number of items this particular user needs to examine.
• We use aggregate knowledge of previous user behavior to estimate the information overload cost CostAll(T) that a user will face, on average, during an exploration using a given category tree T.

Cost Estimation
• Exploration probability: the probability P(C) that the user exploring T explores category C, using either SHOWTUPLES or SHOWCAT, upon examining its label.
• SHOWTUPLES probability: the probability Pw(C) that the user chooses the SHOWTUPLES option for category C, given that she explores C. The SHOWCAT probability of C is therefore (1 - Pw(C)).
Cost model for the ‘All’ scenario: consider a non-leaf node C of T. CostAll(TC) is the cost of exploring the subtree TC rooted at C. We write CostAll(C) for CostAll(TC), since the cost is always calculated in the context of a given tree.

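Cost Estimation
• If SHOWTUPLES is chosen for C (probability Pw(C)), the user examines all |tset(C)| tuples, which contributes Pw(C)*|tset(C)| to the expected cost.
• If SHOWCAT is chosen for C (probability 1 - Pw(C)), the cost has two components:
  • First component = K*n, the cost of examining the labels of all n subcategories, where K is the cost of examining a category label relative to the cost of examining a data tuple.
  • Second component = for each subcategory Ci, CostAll(Ci) if she chooses to explore Ci and 0 if she chooses to ignore it.
Combining the two options:
$$\mathrm{CostAll}(C) = P_w(C)\,|\mathrm{tset}(C)| + (1 - P_w(C))\Big(K n + \sum_{i=1}^{n} P(C_i)\,\mathrm{CostAll}(C_i)\Big) \qquad (1)$$
If C is a leaf node, then CostAll(C) = |tset(C)|.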
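Equation (1) is a straightforward recursion over the category tree. The following Python sketch evaluates it on a tiny tree. The `Category` class, its field names, and all the numbers are assumptions chosen for illustration, not values from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Category:
    n_tuples: int               # |tset(C)|
    p_explore: float = 1.0      # P(C), exploration probability
    p_showtuples: float = 1.0   # Pw(C), SHOWTUPLES probability
    children: List["Category"] = field(default_factory=list)

def cost_all(c: Category, k: float = 1.0) -> float:
    """Expected 'All' scenario cost of exploring the subtree rooted at c (equation 1)."""
    if not c.children:          # leaf: every tuple is examined
        return float(c.n_tuples)
    # SHOWCAT: read all n labels, then pay for each child with probability P(Ci).
    showcat = k * len(c.children) + sum(
        child.p_explore * cost_all(child, k) for child in c.children
    )
    return c.p_showtuples * c.n_tuples + (1 - c.p_showtuples) * showcat

# Hypothetical tree: a root with 100 tuples split into two leaf categories.
root = Category(
    n_tuples=100,
    p_showtuples=0.2,
    children=[Category(n_tuples=60, p_explore=0.5),
              Category(n_tuples=40, p_explore=0.25)],
)
estimated = cost_all(root, k=1.0)
```

With these assumed numbers the SHOWCAT branch costs 2 + 0.5*60 + 0.25*40 = 42 items, so the estimate is 0.2*100 + 0.8*42, about 53.6 items, compared with 100 items if the results were shown as a flat list.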

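Cost Estimation
Cost model for the ‘One’ scenario
CostOne(T) = the information overload cost of tree T in the ‘One’ scenario. Consider a non-leaf node C with subcategories C1, ..., Cn:
• SHOWTUPLES option: the user examines frac(C)*|tset(C)| tuples, where frac(C) is the fraction of the tuples in tset(C) that she examines before she finds the first relevant tuple. Weighted by its probability, this contributes Pw(C)*frac(C)*|tset(C)|.
• SHOWCAT option: if Ci is the first category explored, she has examined i labels (cost K*i) and then pays CostOne(Ci). Weighted by its probability, this contributes (1-Pw(C)) * Σ (probability that Ci is the first category explored) * (K*i + CostOne(Ci)), summed over i = 1, ..., n.
Total cost for the ‘One’ scenario:
CostOne(C) = Pw(C)*frac(C)*|tset(C)| + (1-Pw(C)) * Σ (probability that Ci is the first category explored) * (K*i + CostOne(Ci))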

Cost Estimation
The probability that Ci is the first category explored (i.e., the probability that the user explores Ci but none of C1 to C(i-1)) is
$$P(C_i)\prod_{j=1}^{i-1}\big(1 - P(C_j)\big)$$
Substituting this into the sum gives the final CostOne expression:
$$\mathrm{CostOne}(C) = P_w(C)\,\mathrm{frac}(C)\,|\mathrm{tset}(C)| + (1 - P_w(C))\sum_{i=1}^{n}\Big[\prod_{j=1}^{i-1}\big(1 - P(C_j)\big)\Big]\,P(C_i)\,\big(K i + \mathrm{CostOne}(C_i)\big) \qquad (2)$$
If C is a leaf node, then CostOne(C) = frac(C)*|tset(C)|.
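Equation (2) can be evaluated with a single pass over the ordered subcategories while keeping a running product of the (1 - P(Cj)) terms. The sketch below is illustrative only: the `Node` class, its field names, and the numbers are assumptions, and frac(C) is simply supplied as an input.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    n_tuples: int               # |tset(C)|
    frac: float = 1.0           # frac(C)
    p_explore: float = 1.0      # P(C)
    p_showtuples: float = 1.0   # Pw(C)
    children: List["Node"] = field(default_factory=list)

def cost_one(c: Node, k: float = 1.0) -> float:
    """Expected 'One' scenario cost of exploring the subtree rooted at c (equation 2)."""
    if not c.children:
        return c.frac * c.n_tuples
    showcat = 0.0
    none_before = 1.0           # running product of (1 - P(Cj)) for j < i
    for i, child in enumerate(c.children, start=1):
        showcat += none_before * child.p_explore * (k * i + cost_one(child, k))
        none_before *= 1 - child.p_explore
    return c.p_showtuples * c.frac * c.n_tuples + (1 - c.p_showtuples) * showcat

# Hypothetical leaves; the same two categories are presented in two different orders.
a = Node(n_tuples=60, frac=0.5, p_explore=0.5)
b = Node(n_tuples=40, frac=0.5, p_explore=0.25)
cost_a_first = cost_one(Node(100, frac=0.5, p_showtuples=0.2, children=[a, b]))
cost_b_first = cost_one(Node(100, frac=0.5, p_showtuples=0.2, children=[b, a]))
```

Under these assumed numbers the two orderings give about 24.6 and 23.8 items respectively, which illustrates that, unlike CostAll, CostOne depends on the order in which the subcategories are presented.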

Cost Estimation
Using the workload to estimate probabilities
• P(C) and Pw(C) are needed to compute CostOne(T) and CostAll(T).
• We use aggregate knowledge of previous user behavior (the query workload) to estimate these probabilities automatically.
Computing the SHOWTUPLES probability:
• When a user explores a non-leaf node C, she has two choices: SHOWCAT or SHOWTUPLES.
• SA(C) = subcategorizing attribute of C.

Cost Estimation
SHOWCAT probability
• Wi = a workload query; Ui = the user who issued it.
• If Ui specified a selection condition on SA(C) in Wi, she is interested in only a few values of SA(C), so she would choose SHOWCAT. If Wi contains no condition on SA(C), she is interested in all values of SA(C), so she would choose SHOWTUPLES.

Cost Estimation
• NAttr(A) = the number of queries in the workload that contain a selection condition on attribute A; N = the total number of queries in the workload.
• NAttr(SA(C))/N = the fraction of users who are interested in only a few values of SA(C).
• SHOWCAT probability of C = NAttr(SA(C))/N
• SHOWTUPLES probability of C: Pw(C) = 1 - NAttr(SA(C))/N
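These counts are simple to compute once the workload has been parsed. The sketch below assumes, purely for illustration, that each workload query is a dictionary mapping an attribute name to its selection condition; the helper names and the four sample queries are hypothetical.

```python
def n_attr(workload, attribute):
    """NAttr(A): number of workload queries with a selection condition on A."""
    return sum(1 for query in workload if attribute in query)

def showcat_probability(workload, sub_attr):
    """NAttr(SA(C)) / N."""
    return n_attr(workload, sub_attr) / len(workload)

def showtuples_probability(workload, sub_attr):
    """Pw(C) = 1 - NAttr(SA(C)) / N."""
    return 1.0 - showcat_probability(workload, sub_attr)

# Hypothetical workload of N = 4 parsed queries.
workload = [
    {"neighborhood": {"Seattle"}, "price": (200_000, 300_000)},
    {"price": (250_000, 400_000)},
    {"price": (100_000, 200_000), "bedrooms": {3, 4}},
    {"neighborhood": {"Bellevue"}},
]
p_showcat = showcat_probability(workload, "price")
p_showtuples = showtuples_probability(workload, "price")
```

In this sample, three of the four queries constrain price, so a node subcategorized by price gets a SHOWCAT probability of 0.75 and a SHOWTUPLES probability of 0.25.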

Cost Estimation
Computing the exploration probability P(C):
• P(C) = probability that the user explores category C upon examining its label
  = P(user explores C | user examines label of C)
  = P(user explores C) / P(user examines label of C), since a user can explore C only after examining its label.
• The user examines the label of C if she explores the parent C’ of C and chooses SHOWCAT for C’. Therefore
$$P(C) = \frac{P(\text{user explores } C)}{P(\text{user explores } C') \cdot P(\text{user chooses SHOWCAT for } C' \mid \text{user explores } C')}$$
• P(user chooses SHOWCAT for C’ | user explores C’) is the SHOWCAT probability of C’ = NAttr(SA(C’))/N.

Cost Estimation
• A user explores C if, upon examining the label of C, she thinks that one or more tuples in tset(C) may be of interest to her.
• P(user explores C) / P(user explores C’) is simply the probability that the user is interested in the predicate label(C), where C’ is the parent of C. So
$$P(C) = \frac{P(\text{user interested in predicate } \mathrm{label}(C))}{\mathrm{NAttr}(\mathrm{SA}(C'))/N}$$

Cost Estimation
• CA(C) = categorizing attribute of C. It is the subcategorizing attribute of the parent C’ of C, i.e., CA(C) = SA(C’).
• If the selection condition on CA(C) in Wi overlaps with the predicate label(C), Ui is interested in the predicate label(C).
• NOverlap(C) = the number of queries in the workload whose selection condition on CA(C) overlaps with label(C).
• P(user interested in predicate label(C)) = NOverlap(C)/N.
• Dividing by NAttr(CA(C))/N, we get
$$P(C) = \frac{\mathrm{NOverlap}(C)}{\mathrm{NAttr}(\mathrm{CA}(C))}$$
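For a numeric categorizing attribute, the final formula reduces to counting range overlaps in the workload. The sketch below is a minimal illustration: the half-open range convention, the function names, the fallback value when no query constrains the attribute, and the sample workload are all assumptions.

```python
def ranges_overlap(a, b):
    """True if the half-open ranges [a0, a1) and [b0, b1) share at least one point."""
    return a[0] < b[1] and b[0] < a[1]

def exploration_probability(workload, attribute, label_range):
    """P(C) = NOverlap(C) / NAttr(CA(C)) for a category labeled with a numeric range."""
    conditions = [q[attribute] for q in workload if attribute in q]  # NAttr(CA(C)) queries
    if not conditions:
        return 0.0  # assumption: no workload evidence for this attribute
    n_overlap = sum(1 for cond in conditions if ranges_overlap(cond, label_range))
    return n_overlap / len(conditions)

# Hypothetical workload; prices are in thousands of dollars.
workload = [
    {"price": (200, 300)},
    {"price": (250, 400)},
    {"price": (500, 600)},
    {"neighborhood": {"Seattle"}},
]
p = exploration_probability(workload, "price", (200, 225))
```

Only the first query overlaps the label ‘Price: 200K - 225K’, and three queries constrain price, so the estimate for this category is 1/3.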

Algorithms

Algorithms
Now that we can calculate the information overload cost of a given tree, we could enumerate all possible category trees on R and choose the one with the minimum cost. This yields the cost-optimal tree, but it is prohibitively expensive because the number of possible categorization trees is very large. To make the search practical, we need to:
• Eliminate a subset of relatively unattractive attributes without considering any of their partitionings.
• For every attribute that remains, obtain a good partitioning efficiently instead of enumerating all possible partitionings.

Algorithms
Reducing the choices of categorizing attribute:
1. Eliminate the uninteresting attributes using the following simple heuristic: if an attribute A occurs in less than a fraction x of the queries in the workload, i.e., NAttr(A)/N < x, we eliminate A. The threshold x needs to be specified by the system designer or domain expert.
2. For attribute elimination, we preprocess the workload and maintain, for each potential categorizing attribute A, the number NAttr(A) of queries in the workload that contain a selection condition on A.
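The elimination heuristic amounts to one counting pass over the workload followed by a threshold test. In the sketch below, the dictionary representation of queries, the function name `candidate_attributes`, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def candidate_attributes(workload, x):
    """Keep attribute A only if NAttr(A) / N >= x."""
    n = len(workload)
    # Each query is a dict keyed by attribute, so an attribute counts once per query.
    n_attr = Counter(attr for query in workload for attr in query)
    return sorted(attr for attr, count in n_attr.items() if count / n >= x)

# Hypothetical workload of N = 4 parsed queries.
workload = [
    {"neighborhood": {"Seattle"}, "price": (200, 300)},
    {"price": (250, 400)},
    {"price": (100, 200), "year_built": (1990, 2000)},
    {"neighborhood": {"Bellevue"}},
]
kept = candidate_attributes(workload, x=0.4)
```

With x = 0.4, price (3/4) and neighborhood (2/4) are retained, while year_built (1/4) is eliminated before any of its partitionings is considered.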

Algorithms
Partitioning for categorical attributes
• In this paper, only single-value partitionings of R are considered, i.e., each category corresponds to a single value of the attribute.
• Consider the case where the user query Q contains a selection condition of the form “A IN {v1, …, vk}” on A. Example: for Neighborhood IN {Redmond, Bellevue, …}, the categories are “Neighborhood: Redmond”, “Neighborhood: Bellevue”, etc.
• Among the single-value partitionings, we want to choose the one with the minimum cost.
• Since the set of categories is identical in all possible single-value partitionings, the only factor that affects the cost of a single-value partitioning is the order in which the categories are presented to the user.

Algorithms
• CostAll(T) is not affected by the ordering, so only CostOne(T) is considered. CostOne(T) is minimum when the categories are presented in increasing order of 1/P(Ci) + CostOne(Ci).
• Heuristic: present the categories in decreasing order of P(Ci).
• P(Ci) = NOverlap(Ci)/NAttr(A). Since Ci corresponds to a single value vi, NOverlap(Ci) is the number of queries in the workload whose selection condition on A contains vi in the IN clause. This count is denoted occ(vi).
• To obtain the partitioning, we simply sort the values in the IN clause in decreasing order of occ(vi).
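The ordering heuristic can be sketched as a count followed by a sort. The representation of IN clauses as Python sets, the function name `order_values`, and the sample workload are assumptions made for this example.

```python
from collections import Counter

def order_values(query_values, workload, attribute):
    """Sort the values of the IN clause in decreasing order of occ(v)."""
    occ = Counter()
    for query in workload:
        for value in query.get(attribute, ()):
            occ[value] += 1   # occ(v): workload queries whose IN clause contains v
    # sorted() is stable, so values with equal occ keep their original order.
    return sorted(query_values, key=lambda v: -occ[v])

# Hypothetical workload: each query lists the neighborhoods in its IN clause.
workload = [
    {"neighborhood": {"Redmond", "Bellevue"}},
    {"neighborhood": {"Bellevue"}},
    {"neighborhood": {"Bellevue", "Seattle"}},
    {"neighborhood": {"Seattle"}},
]
ordered = order_values(["Redmond", "Bellevue", "Seattle"], workload, "neighborhood")
```

Here Bellevue occurs in three workload queries, Seattle in two, and Redmond in one, so the categories would be presented in the order Bellevue, Seattle, Redmond.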

Algorithms
Partitioning for numeric attributes
• Let Vmin and Vmax be the minimum and maximum values that the tuples in R can take in attribute A.
• Consider a point v (Vmin < v < Vmax). If a significant number of query ranges in the workload begin or end at v, then v is a good point to split, because the workload suggests that most users would be interested in just one of the resulting buckets.
• If no query range begins or ends at v, then v is not a good point to split. To partition the range into m buckets, (m-1) splitpoints should be selected from the points where query ranges begin or end.
• The splitpoints are not the only factor determining the cost; the other factor is the number of tuples in each bucket. A heuristic based on splitpoints alone therefore does not guarantee the best partitioning in terms of cost.

Algorithms
Consider the point v again (Vmin < v < Vmax). Let startV and endV denote the number of query ranges in the workload that start and end at v, respectively. We use SUM(startV, endV) as the “goodness score” of the point v.
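The goodness score can be computed by tallying range endpoints. The sketch below picks the (m-1) highest-scoring points as splitpoints. It is a deliberately simplified illustration: it ignores the number of tuples per bucket, which the slides note also affects the cost, and the function name, tie-breaking rule, and sample ranges are assumptions.

```python
from collections import Counter

def best_splitpoints(query_ranges, vmin, vmax, m):
    """Pick the (m - 1) points with the highest goodness score startV + endV."""
    score = Counter()
    for start, end in query_ranges:
        for v in (start, end):
            if vmin < v < vmax:      # only interior points can be splitpoints
                score[v] += 1        # each start or end at v adds 1 to its score
    # Highest score first; ties broken by the smaller value (an assumption).
    top = sorted(score, key=lambda v: (-score[v], v))[: m - 1]
    return sorted(top)

# Hypothetical price ranges (in thousands of dollars) from workload queries.
ranges = [(200, 300), (200, 250), (250, 300), (250, 400), (300, 500)]
splits = best_splitpoints(ranges, vmin=200, vmax=500, m=3)
```

In this sample, 250 and 300 each have a goodness score of 3 and 400 has a score of 1, so partitioning into m = 3 buckets yields the splitpoints 250 and 300.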

Algorithms
Multilevel categorization algorithm:
1. For each level l, we need to determine the categorizing attribute A and, for each category C in level (l-1), partition the domain of values of A in tset(C) such that the information overload is minimized.
2. The algorithm creates the categories level by level: all categories at level (l-1) are created and added to the tree T before any category at level l. Let S denote the set of categories at level (l-1) with more than M tuples.
3. For each candidate attribute A, we partition each category C in S using the partitioning methods for categorical and numeric attributes described earlier.
4. We compute the cost of the attribute-partitioning combination for each candidate attribute A and select the attribute α with the minimum cost. For each category C in S, we add the partitions of C based on α to T.
5. This completes the node creation at level l.
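The level-by-level loop can be sketched as follows. This is a heavily simplified illustration and not the paper's algorithm: it handles categorical attributes only, it scores a candidate attribute with a one-level CostAll in which the new subcategories are treated as leaves, and all function names and sample data are assumptions.

```python
from collections import defaultdict

def partition(rows, attr):
    """Single-value partitioning of a tuple-set on a categorical attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row)
    return groups

def one_level_cost(rows, attr, workload, k=1.0):
    """CostAll of a node split on attr, treating the new subcategories as leaves."""
    n_attr = sum(1 for q in workload if attr in q)
    p_showtuples = 1 - n_attr / len(workload)
    groups = partition(rows, attr)
    showcat = k * len(groups)
    for value, members in groups.items():
        n_overlap = sum(1 for q in workload if value in q.get(attr, ()))
        p_explore = n_overlap / n_attr if n_attr else 0.0
        showcat += p_explore * len(members)
    return p_showtuples * len(rows) + (1 - p_showtuples) * showcat

def choose_attributes(rows, candidates, workload, m, k=1.0):
    """Return the categorizing attribute selected at each level."""
    level, remaining, chosen = [rows], list(candidates), []
    while remaining:
        s = [tset for tset in level if len(tset) > m]   # categories with more than M tuples
        if not s:
            break
        best = min(remaining,
                   key=lambda a: sum(one_level_cost(t, a, workload, k) for t in s))
        chosen.append(best)
        remaining.remove(best)                          # an attribute is used only once
        level = [part for tset in s for part in partition(tset, best).values()]
    return chosen

# Hypothetical answer set and workload.
rows = [{"neighborhood": n, "bedrooms": b} for n, b in [
    ("Seattle", 3), ("Seattle", 3), ("Seattle", 4),
    ("Bellevue", 3), ("Bellevue", 4), ("Redmond", 4)]]
workload = [
    {"neighborhood": {"Seattle"}},
    {"neighborhood": {"Seattle", "Bellevue"}},
    {"neighborhood": {"Redmond"}, "bedrooms": {3}},
    {"bedrooms": {4}},
]
levels = choose_attributes(rows, ["neighborhood", "bedrooms"], workload, m=2)
```

With this sample data, splitting the root on bedrooms has a one-level cost of 5.5 against 6.0 for neighborhood, so bedrooms is chosen for level 1 and neighborhood for level 2.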

Experimental Evaluation
The evaluation has two goals:
• Evaluate the accuracy of the cost models in modeling information overload.
• Evaluate the cost-based categorization algorithm and compare it with categorization techniques that do not consider such cost models.
Setup:
• Database: MSN House & Home, with M = 20.
• All experiments were conducted on a Compaq Evo W8000 (1.7 GHz CPU, 768 MB RAM) running Windows XP.
• Dataset (both experiments): a single table called ListProperty containing 1.7 million rows.
• Workload: 176,262 query strings representing searches conducted by home buyers on the MSN House & Home website.
In both studies, the paper's cost-based technique is compared with two other techniques, No-Cost and Attr-Cost.
• No-Cost: uses the same level-by-level categorization but selects the categorizing attribute at each level arbitrarily (without replacement).

Experimental Evaluation
• Attr-Cost: selects the attribute with the lowest cost as the categorizing attribute at each level but considers only the partitionings considered by the No-Cost technique.
Simulated user study
Because a large-scale real-life user study is difficult to conduct, we develop a novel way to simulate one.
• We pick a subset of 100 queries from the workload and treat them as user explorations. Such a workload query W is referred to as a synthetic exploration.
• Estimated (average) cost = CostAll(T).
• Actual cost = CostAll(W,T) of the exploration.
• 8 mutually disjoint subsets of 100 synthetic explorations each are considered.
Figure: correlation between actual cost and estimated cost.

Experimental Evaluation • Figure on the left is Cost of various techniques for 8 Subsets. • Figure on the right is Pearson's correlation between the estimated cost and actual cost.

Experimental Evaluation: Real-Life User Study
Tasks:
1. Any neighborhood in Seattle/Bellevue, price < 1 million.
2. Any neighborhood in Bay Area – Penin/SanJose, price between 300K and 500K.
3. 15 selected neighborhoods in NYC – Manhattan, Bronx, price < 1 million.
4. Any neighborhood in Seattle/Bellevue, price between 200K and 400K, bedroom count between 3 and 4.

Experimental Evaluation: Real-Life User Study
• Figure on the left: average cost (number of items examined until the user finds all the relevant tuples) of the various techniques.
• Figure on the right: average number of relevant tuples found by users for the various techniques.

Experimental Evaluation: Real-Life User Study
• Figure on the left: average normalized cost (items examined by the user per relevant tuple found) of the various techniques.
• Figure on the right: average cost (number of items examined until the user finds the first relevant tuple) of the various techniques.

Experimental Evaluation: Real-Life User Study
• Figure on the left: results of the post-study survey.
• Figure on the right: average execution time of the cost-based categorization algorithm.

Conclusion
This paper addresses the problem of information overload by proposing the automatic categorization of query results. The solution is to dynamically generate a labeled, hierarchical category structure. The user can determine whether a category is relevant simply by examining its label and can explore only the relevant categories, thereby reducing information overload.

Thank You
