
Using Trees to Depict a Forest




Presentation Transcript


  1. Using Trees to Depict a Forest. Bin Liu, H. V. Jagadish, EECS, University of Michigan, Ann Arbor. 35th International Conference on Very Large Databases, 2009. Reading Assignment. Presentation Courtesy of

  2. Motivation – Too Many Results • In interactive database querying, we often get more results than we can comprehend immediately • Try searching for a popular keyword • How often do you actually click through 2-3 pages of results? • 85% of users never go to the second page [1,2]

  3. Why IR Solutions Do NOT Apply • Sorting and ranking are standard IR techniques • Search engines show the most relevant hits on the first page • However, for a database query, all tuples in the query result set are equally relevant • For example, SELECT * FROM Cars WHERE price < 13000 • All matching results should be available to the user • What to do when there are millions of results?

  4. Make the First Page Count • If no user preference information is available, how do we best arrange results? • Sort by an attribute? • Random selection? • Something else? • Show the most “representative” results • Best help users learn what is in the result set • Users can decide further actions based on the representatives

  5. Our Proposal – MusiqLens Experience

  6. Suppose a user wants a 2005 Civic but there are too many of them…

  7. MusiqLens on the Car Data

  8. MusiqLens on the Car Data

  9. Zooming in: 2005 Honda Civics ~ ID 132

  10. Now Suppose User Filters by “Price < 9,500”

  11. After Filtering by “Price < 9,500”

  12. Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?

  13. Finding a Suitable Metric • Users should be the ultimate judge • Which metric generates the representatives users can learn the most from? • User study • Use a set of candidate metrics • Users observe the representatives • Users estimate additional data points in the data set • The metric whose representatives lead to the best estimates wins

  14. Metric Candidates • Sort by attributes • Uniform random sampling • Density-biased sampling [3] • Sort by typicality [4] • K-medoids • Average • Maximum

  15. Density-biased Sampling • Proposed by C. R. Palmer and C. Faloutsos [3] • Sample more from sparse regions, less from dense regions • To counter the weakness of uniform sampling where small clusters are missed
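For illustration, here is a minimal grid-based sketch of this idea (not the exact algorithm of [3]); the grid resolution and the exponent e are assumptions chosen for the example. Each point in a cell of n points is kept with probability proportional to n^(e-1), so sparse cells are over-sampled:

```python
import random
from collections import defaultdict

def density_biased_sample(points, sample_size, grid=10, e=0.5):
    """Grid-based density-biased sampling sketch (in the spirit of [3]).

    e = 1 reduces to uniform sampling; smaller e over-samples sparse
    cells so that small clusters are not missed.
    """
    # Bucket 2-D points (normalized to [0, 1]) into grid cells.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x * grid), int(y * grid))].append((x, y))

    # Per-point weight n**(e-1), normalized so the expected number
    # of sampled points equals sample_size.
    weights = {c: len(pts) ** (e - 1) for c, pts in cells.items()}
    total = sum(w * len(cells[c]) for c, w in weights.items())

    sample = []
    for c, pts in cells.items():
        keep_p = min(1.0, weights[c] * sample_size / total)
        sample.extend(pt for pt in pts if random.random() < keep_p)
    return sample
```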

  16. Sort by Typicality Proposed by Ming Hua, Jian Pei, et al. [4]. Figure source: slides from Ming Hua

  17. Metric Candidates - K-medoids • A medoid of a cluster is the object whose average or maximum dissimilarity to the others is smallest • Average medoid and max medoid • K-medoids are k objects, each the medoid of its own cluster • Why not k-means? • K-means cluster centers do not exist in the database • We must present real objects to users
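A minimal sketch of this definition (the distance function and sample data are placeholders; Euclidean distance assumed):

```python
import math
from statistics import mean

def medoid(cluster, dist, agg=max):
    """The object whose aggregated (average or maximum) dissimilarity
    to the other objects is smallest. Unlike a k-means center, a
    medoid is always a real object from the data."""
    if len(cluster) == 1:
        return cluster[0]
    return min(cluster,
               key=lambda o: agg(dist(o, p) for p in cluster if p is not o))

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(medoid(points, math.dist, agg=mean))  # avg-medoid
print(medoid(points, math.dist, agg=max))   # max-medoid
```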

  18. Plotting the Candidates Data: Yahoo! Autos, 3922 data points. Normalized price and mileage to 0-1.

  19. Plotting the Candidates - Typicality

  20. Plotting the Candidates – k-medoids

  21. User Study Procedure • Users are given • 7 sets of data, generated using the 7 candidate methods • Each set consists of 8 representative points • Users predict 4 more data points • That are most likely in the data set • Should not pick those already given • Measure the prediction error

  22. Prediction Quality Measurement For a data point S0 and predicted points P1 and P2, let D1 = dist(S0, P1) and D2 = dist(S0, P2), with D1 ≤ D2. Then for S0: MinDist = D1, MaxDist = D2, AvgDist = (D1 + D2) / 2
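A small sketch of these three measures, following the figure on this slide (Euclidean distance assumed; the study aggregates them over the whole data set):

```python
import math

def prediction_quality(s0, predictions):
    """MinDist / MaxDist / AvgDist from one data point s0 to the
    user's predicted points."""
    ds = sorted(math.dist(s0, p) for p in predictions)
    return {"MinDist": ds[0],
            "MaxDist": ds[-1],
            "AvgDist": sum(ds) / len(ds)}

# With D1 = 1 and D2 = 3 as in the slide's figure:
print(prediction_quality((0, 0), [(1, 0), (0, 3)]))
# {'MinDist': 1.0, 'MaxDist': 3.0, 'AvgDist': 2.0}
```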

  23. Performance – AvgDist and MaxDist For AvgDist: Avg-Medoid is the winner. For MaxDist: Max-Medoid is the winner.

  24. Performance – MinDist Avg-Medoid seems to be the winner

  25. Verdict • Statistical significance of results: • Although the result is not statistically significant for MinDist, overall Avg-Medoid is better than Density • Based on AvgDist and MinDist: Avg-Medoid • Based on MaxDist: Max-Medoid • In this paper, we choose average k-medoids • Our algorithm can extend to max-medoids with small changes

  26. Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?

  27. Cover Tree Based Algorithm • Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 [5] • Briefly discuss Cover Tree properties • Cover Tree based algorithms for computing k-medoids

  28. Cover Tree Properties (1) Assume all pairwise distances <= 1. Nesting: for all i, C_i ⊆ C_{i+1} — every node is repeated in each lower level after it first appears. Figure (points in the data, one dimension) modified from slides of Cover Tree authors

  29. Cover Tree Properties (2) Covering: a node in C_i is within distance 2^(-i) of its children in C_{i+1}. The distance from a node to any descendant is less than 2^(-i+1); this value is called the “span” of the node. This ensures that nodes are close enough to their children.

  30. Cover Tree Properties (3) Separation: nodes in C_i are separated by at least 2^(-i). Nodes at higher levels are more separated. Figure modified from slides of Cover Tree authors
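To make the three properties concrete, here is a minimal invariant checker. It assumes the levels are materialized explicitly as sets of points and that level i uses scale 2^(-i), as on these slides; real cover-tree implementations store the structure implicitly:

```python
import math

def check_cover_tree_levels(levels, base=2.0):
    """Check nesting, covering, and separation on explicit levels
    C_0, C_1, ..., each a set of point tuples with all pairwise
    distances <= 1. Level i uses scale base**(-i)."""
    for i in range(len(levels) - 1):
        scale = base ** (-i)
        ci, cnext = levels[i], levels[i + 1]
        # Nesting: C_i is a subset of C_{i+1}.
        assert ci <= cnext, f"nesting violated at level {i}"
        # Covering: every node in C_{i+1} is within `scale` of
        # some potential parent in C_i.
        for q in cnext:
            assert any(math.dist(p, q) <= scale for p in ci), \
                f"covering violated below level {i}"
        # Separation: nodes in C_i are at least `scale` apart.
        pts = list(ci)
        for a in range(len(pts)):
            for b in range(a + 1, len(pts)):
                assert math.dist(pts[a], pts[b]) >= scale, \
                    f"separation violated at level {i}"
    return True
```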

  31. Additional Stats for Cover Tree (2D Example) • Density (DS): number of points in the subtree • Centroid (CT): geometric center of points in the subtree

  32. k-medoid Algorithm Outline • Descend the cover tree to a level with more than k nodes • Choose an initial k points as the first set of medoids (seeds) • Bad seeds can lead to local minima with a high distance cost • Assign nodes and repeatedly update until the medoids converge

  33. Cover Tree Based Seeding • Descend the cover tree to a level with more than k nodes (call it level m) • Use the parent level (m-1) as the starting point for seeds • Each node has a weight, calculated as the product of its span and density (the contribution of the subtree to the distance cost) • Expand nodes using a priority queue • Fetch the first k nodes from the queue as seeds

  34. A Simple Example: k = 4 Node spans by level: 2, 1, 1/2, 1/4. Priority queue on node weight (density × span): initially S3 (5), S8 (3), S5 (2); after expansion: S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2). The first k nodes in the queue form the final set of seeds.
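A sketch of this seeding procedure under an assumed node interface (the point, span, density, and children attributes are hypothetical names for the stats on slide 31; leaf handling is simplified):

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical cover-tree node carrying the slide-31 stats."""
    point: tuple
    span: float                      # bound on distance to any descendant
    density: int                     # number of points in the subtree
    children: list = field(default_factory=list)

def pick_seeds(start_level, k):
    """Expand the heaviest nodes (weight = density * span, a bound on
    the subtree's contribution to the distance cost) until the queue
    holds at least k entries, then return the k heaviest as seeds."""
    heap = [(-n.density * n.span, id(n), n) for n in start_level]
    heapq.heapify(heap)
    while len(heap) < k:
        weight, _, node = heapq.heappop(heap)
        if not node.children:        # heaviest node is a leaf:
            heapq.heappush(heap, (weight, id(node), node))
            break                    # nothing more to expand (simplified)
        # In a cover tree a node reappears among its own children
        # at the next level, so its point stays available.
        for c in node.children:
            heapq.heappush(heap, (-c.density * c.span, id(c), c))
    return [heapq.heappop(heap)[2].point for _ in range(min(k, len(heap)))]
```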

  35. Update Process • Initially, assign all nodes to the closest seed to form k clusters • For each cluster, calculate the geometric center • Use the centroid and density information to approximate each subtree • Find the node that is closest to the geometric center and designate it as a new medoid • Repeat from step 1 until the medoids converge
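A sketch of this refinement loop, representing each node by its (CT, DS) pair from slide 31; the pair representation and the iteration cap are assumptions for the example:

```python
import math

def weighted_center(cluster):
    """Geometric center of a cluster of (centroid, density) pairs:
    each subtree is approximated by its stored CT weighted by DS."""
    total = sum(ds for _, ds in cluster)
    dims = len(cluster[0][0])
    return tuple(sum(ct[i] * ds for ct, ds in cluster) / total
                 for i in range(dims))

def refine_medoids(nodes, medoids, dist=math.dist, max_iters=100):
    """Assign each node to its closest medoid, move each medoid to
    the node nearest its cluster's weighted center, and repeat
    until the medoids converge. Medoids are point tuples."""
    for _ in range(max_iters):
        clusters = {m: [] for m in medoids}
        for ct, ds in nodes:
            nearest = min(medoids, key=lambda m: dist(ct, m))
            clusters[nearest].append((ct, ds))
        new = []
        for m, cl in clusters.items():
            if not cl:                   # empty cluster: keep medoid
                new.append(m)
                continue
            center = weighted_center(cl)
            new.append(min((ct for ct, _ in cl),
                           key=lambda ct: dist(ct, center)))
        if set(new) == set(medoids):     # converged
            break
        medoids = new
    return medoids
```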

  36. Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?

  37. Query Adaptation • Handle user actions • Zooming • Selection (filtering) • Zooming • Expand all nodes assigned to the selected medoid • Run the k-medoid algorithm on the new set of nodes

  38. Selection • Effect of selection on a node • Completely invalid • Fully valid • Partially valid • Estimate the validity percentage (VG) of each node • Multiply the VG into the weight of each node
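One simple way to estimate VG for a one-dimensional range predicate, assuming a node stores per-attribute bounds and points are spread roughly uniformly across them (an illustrative assumption, not necessarily the paper's estimator):

```python
def validity_fraction(lo, hi, sel_lo, sel_hi):
    """Fraction of a node's attribute range [lo, hi] that survives
    the selection [sel_lo, sel_hi], assuming roughly uniform spread."""
    if hi <= lo:                      # degenerate range: all-or-nothing
        return 1.0 if sel_lo <= lo <= sel_hi else 0.0
    overlap = max(0.0, min(hi, sel_hi) - max(lo, sel_lo))
    return overlap / (hi - lo)

# A node covering prices 8,000-12,000 under the filter "price < 9,500":
vg = validity_fraction(8_000, 12_000, float("-inf"), 9_500)
print(vg)                             # 0.375
# The node's weight (density * span) is then multiplied by vg.
```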

  39. What about Projection? • What if the user removes one attribute? • Pairwise distances will change. Should the cover tree be recomputed?

  40. Experiments – Initial Medoid Quality • Compare with the R-tree based method [6] • Data sets • Synthetic data set: 2D points with a Zipf distribution • Real data set: LA data set from the R-tree Portal, 130k points • Measurements • Time to compute the medoids • Average distance from a data point to its medoid

  41. Results on Synthetic Data (distance and time) For various sizes of data, the cover-tree based method outperforms the R-tree based method

  42. Results on Real Data For various values of k, the cover-tree based method outperforms the R-tree based method on real data

  43. Query Adaptation Compared with rebuilding the cover tree and running the k-medoid algorithm from scratch, on both synthetic and real data, the time cost of rebuilding is orders of magnitude higher than incremental computation.

  44. Related Work • Classic/textbook k-medoid methods • Partitioning Around Medoids (PAM) and Clustering LARge Applications (CLARA), L. Kaufman and P. Rousseeuw, 1990 • CLARANS, R. T. Ng and J. Han, TKDE 2002 • Tree-based methods • Focusing on Representatives (FOR), M. Ester, H. Kriegel, and X. Xu, KDD 1996 • Tree-based Partitioning Querying (TPAQ), K. Mouratidis, D. Papadias, and S. Papadimitriou, VLDBJ 2008

  45. Related Work (2) • Clustering methods • For example, BIRCH, T. Zhang, R. Ramakrishnan, and M. Livny, SIGMOD 1996 • Result presentation methods • Automatic result categorization, K. Chakrabarti, S. Chaudhuri, and S.-w. Hwang, SIGMOD 2004 • DataScope, T. Wu, et al., VLDB 2007 • Other recent work • Finding a representative set from massive data, ICDM 2005 • Generalized group-by, C. Li, et al., SIGMOD 2007 • Query result diversification, E. Vee, et al., ICDE 2008

  46. Conclusion • We proposed the MusiqLens framework for solving the many-answer problem • We conducted a user study to select a metric for choosing representatives • We proposed efficient methods for computing and maintaining the representatives under user actions • Part of the database usability project at the University of Michigan • Led by Prof. H. V. Jagadish • http://www.eecs.umich.edu/db/usable/

  47. Example of Questions • How does the tree get constructed in the given example? • Would removing many attributes (using projection) affect the shape of the cover tree? • What would be examples of data sets where k is greater than 100? • What is the impact of increasing dimensionality on building a cover tree?
