
Parsimonious Explanations of Change in Hierarchical Data



Presentation Transcript


  1. Parsimonious Explanations of Change in Hierarchical Data Dhiman Barman, Flip Korn, Divesh Srivastava, Dimitrios Gunopulos, Neal Young and Deepak Agarwal

  2. Background
  • Dimensions in data warehouses are hierarchical
  • A variety of applications aggregate along the hierarchy
    • e.g., population summarized by geographic location (state/county/city/zip)
  • Existing OLAP tools for static data
    • summarize and navigate via drill-down and roll-up operators
    • e.g., population of each city for the year 2005
  • Want to summarize and explain changes
    • e.g., population in 2004 compared to population in 2005 across different locations

  3. Motivation
  [Figure: Census 2004 vs. Census 2005 population hierarchies]
  • Output changes, e.g., the whole California population doubled except LA County, which tripled

  4. Hierarchical Representation
  • Two hierarchical summaries T1 and T2 correspond to two different snapshots in time
  • Naïve: take point-to-point ratios → R
    • Verbose and non-hierarchical
  • Want holistic as well as hierarchical explanations
  [Figure: T1 and T2 with leaf and aggregate counts, and the resulting ratio tree R (root aggregates 120 and 200, ratio 1.67)]

  5. Problem Context
  • Explanations can be verbose or parsimonious
  • Census data of US population
    • Hierarchically organized as (state/county/city/zip)
    • 50 states, 81,000 leaves, 130,000 nodes
  [Figure: example hierarchy from state (California) through counties (LA County, San Bernardino) and cities (LA, Pasadena, Fontana, Victorville) down to zip codes (90001, 90002, 91101, 91729, 92334)]

  6. Ratio Tree
  • Given trees T1 and T2, the Ratio Tree is a tree with the same structure as T1 (or T2) whose value in leaf l equals value(l, T2) / value(l, T1)
  • Assume the two trees are isomorphic and all leaf counts are > 0
  [Figure: T1, T2, and the resulting ratio tree on the running example]
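A minimal sketch of the Ratio Tree construction, assuming a nested-dict tree representation (internal nodes are dicts, leaves hold positive counts); the node names and counts below are illustrative, loosely modeled on the running example rather than taken from the slide's exact figure.

```python
# Ratio Tree sketch: trees are nested dicts; leaves hold positive counts.
# Assumes T1 and T2 are isomorphic (same keys everywhere) and all leaf counts are > 0.

def ratio_tree(t1, t2):
    """Return a tree with the same shape whose leaf l holds value(l, T2) / value(l, T1)."""
    if isinstance(t1, dict):
        return {child: ratio_tree(t1[child], t2[child]) for child in t1}
    return t2 / t1

# Illustrative counts (roughly mirroring the running example's aggregates 120 and 200).
T1 = {"CA": {"LA": {"z1": 10, "z2": 20}, "SB": {"z3": 50, "z4": 40}}}
T2 = {"CA": {"LA": {"z1": 30, "z2": 60}, "SB": {"z3": 70, "z4": 40}}}

print(ratio_tree(T1, T2))
# {'CA': {'LA': {'z1': 3.0, 'z2': 3.0}, 'SB': {'z3': 1.4, 'z4': 1.0}}}
```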

  7. Naïve Solution: Bottom-up
  • Leaf weight is the ratio of the corresponding leaf counts
  • Non-leaf weights are 1 and are not part of the explanation
  • Not hierarchical
  • Verbose if a significant number of leaves have different counts
  • 3 explanations found on the running example
  [Figure: bottom-up ratio tree with leaf weights 1, 2, 3, 3, 1]
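For comparison, a sketch of the bottom-up baseline under the same assumed nested-dict representation: its explanation count is simply the number of leaves whose ratio differs from 1.

```python
# Bottom-up baseline sketch: leaf weights are the leaf ratios, non-leaf weights stay at 1,
# so the number of explanations equals the number of leaf ratios different from 1.

def bottom_up_explanations(t1, t2):
    if isinstance(t1, dict):
        return sum(bottom_up_explanations(t1[c], t2[c]) for c in t1)
    return 0 if t1 == t2 else 1   # positive counts: equal counts <=> ratio of 1

# Illustrative counts, as in the previous sketch.
T1 = {"LA": {"z1": 10, "z2": 20}, "SB": {"z3": 50, "z4": 40}}
T2 = {"LA": {"z1": 30, "z2": 60}, "SB": {"z3": 70, "z4": 40}}

print(bottom_up_explanations(T1, T2))   # 3 -- three leaf ratios differ from 1
```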

  8. Naïve Solution: Top-down
  • Hierarchical but not holistic
  • Compute the aggregate of the subtree leaves for each node
  • The root-to-node product of weights equals the node's aggregate ratio, for each node
  • 7 explanations found on the running example
  [Figure: top-down ratio tree with root weight 5/3 and non-1 node weights 11/15, 9/5, 63/55, 3/11, 10/7, 5/7]
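A sketch of the top-down baseline under the same assumed representation: each node's weight is its subtree aggregate ratio divided by its parent's aggregate ratio, so the root-to-node product of weights reproduces every node's aggregate ratio. The node paths and counts are illustrative.

```python
# Top-down baseline sketch: weight(n) = aggregate_ratio(n) / aggregate_ratio(parent(n)),
# so multiplying weights from the root down to any node yields that node's aggregate ratio.

def subtree_sum(t):
    return sum(subtree_sum(c) for c in t.values()) if isinstance(t, dict) else t

def top_down_weights(t1, t2, parent_ratio=1.0, path="root"):
    ratio = subtree_sum(t2) / subtree_sum(t1)        # aggregate ratio of this subtree
    weights = {path: ratio / parent_ratio}
    if isinstance(t1, dict):
        for child in t1:
            weights.update(top_down_weights(t1[child], t2[child], ratio, f"{path}/{child}"))
    return weights

# Illustrative counts; every weight different from 1 counts as one explanation.
T1 = {"LA": {"z1": 10, "z2": 20}, "SB": {"z3": 50, "z4": 40}}
T2 = {"LA": {"z1": 30, "z2": 60}, "SB": {"z3": 70, "z4": 40}}
for node, w in sorted(top_down_weights(T1, T2).items()):
    print(node, round(w, 4))   # root gets 200/120 = 5/3; LA gets 3 / (5/3) = 9/5; etc.
```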

  9. DIFF Solution [Sarawagi'99]
  • The DIFF operator is not parsimonious
  • It tries to adjust the whole tree while finding explanations
  • 7 explanation weights with k=1; 4 with k=1.5
  [Figure: DIFF output on the running example for k=1 and k=1.5]

  10. Finding Parsimonious Solutions
  • Root-to-leaf explanations: weights lie along the ancestor path P(n) from the root to node n
  • count(leaf, T2) = ( Π n ∈ P(leaf) weight(n) ) × count(leaf, T1)
  • This is an underconstrained system of equations
  • Parsimonious explanations = a weight assignment with the minimum number of non-1 weights
  • Tolerance parameter k: weights ∉ [1/k, k], with k ≥ 1, are reported as explanations, to accommodate noise
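A sketch of how a candidate root-to-leaf weight assignment can be checked against the product constraint above, and how its reported explanations are counted for a given tolerance k. The tree, the weight dictionary W, and the helper names (leaf_paths, is_valid, explanations) are illustrative; this is a verifier, not the paper's algorithm for finding the minimum-size assignment.

```python
from math import isclose, prod

# Weights are keyed by node path (a tuple of names); missing nodes default to weight 1.

def leaf_paths(tree, path=("root",)):
    if isinstance(tree, dict):
        for child, sub in tree.items():
            yield from leaf_paths(sub, path + (child,))
    else:
        yield path, tree

def is_valid(t1, t2, weights):
    """Check count(leaf, T2) == product of path weights * count(leaf, T1) for every leaf."""
    leaves1, leaves2 = dict(leaf_paths(t1)), dict(leaf_paths(t2))
    return all(
        isclose(leaves2[p], prod(weights.get(p[:i + 1], 1.0) for i in range(len(p))) * c1)
        for p, c1 in leaves1.items()
    )

def explanations(weights, k=1.0):
    """Weights outside [1/k, k], k >= 1, are the ones reported as explanations."""
    return {n: w for n, w in weights.items() if not (1.0 / k <= w <= k)}

# Illustrative: one weight at an internal node explains both leaves below it.
T1 = {"LA": {"z1": 10, "z2": 20}, "SB": {"z3": 50, "z4": 40}}
T2 = {"LA": {"z1": 30, "z2": 60}, "SB": {"z3": 70, "z4": 40}}
W = {("root", "LA"): 3.0, ("root", "SB", "z3"): 1.4}   # all other weights default to 1

print(is_valid(T1, T2, W))        # True
print(explanations(W, k=1.0))     # two reported weights
print(explanations(W, k=3.0))     # none -- both fall inside [1/3, 3]
```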

  11. Weight Assignment
  • Each node carries a weight; every leaf ratio equals the product of the weights along its root-to-leaf path, e.g., r2 = w0 × w1 × w12 × w121
  [Figure: tree with node weights w0, w1, w2, w11, w12, w21, w22, w111, w121, w211, w212, w221 and leaf ratios r1, ..., r5]
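A tiny numeric instance of the path-product relation in the figure; the node labels follow the figure and the weight values are hypothetical.

```python
# Hypothetical weights on the path w0 -> w1 -> w12 -> w121 leading to the leaf with ratio r2.
w0, w1, w12, w121 = 1.0, 3.0, 1.0, 1.0
r2 = w0 * w1 * w12 * w121
print(r2)   # 3.0 -- the single non-1 weight w1 also multiplies every other leaf under node 1
```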

  12. Parsimonious: Example
  • The parsimonious model explains changes optimally
  • The k > 1 case captures similar changes among leaves having near-equal aggregates
  • 2 explanation weights with k=1; none with k=1.5
  [Figure: parsimonious weight assignments on the running example for k=1 and k=1.5]

  13. Scale-up
  • #explanations vs. #leaves on 8 samples from the same ratio tree (Census data, year 2003 vs. year 2004)
  • k=1.05 yields the smallest #explanations: population counts do not change by more than 5% in 4-5 years
  • Bottom-up is close to the parsimonious solution with k=1, because leaf counts differ in many leaves

  14. Effect of Threshold Parameter
  • Population counts for years 2003 and 2004 are compared
  • #explanations decreases dramatically as k increases
  • k=1 is a special case
  • k > 1 provides extra tolerance for grouping similar ratios

  15. Stability
  • S_h,k = set of nodes at level h where "explanations" occur at tolerance k
  • S_c = stability of the output at level h as the tolerance changes from k−Δk to k
  • Average stability is taken across all levels h
  • Stability is > 0.6
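The slide does not spell out the stability formula; the sketch below assumes a Jaccard-style overlap between the explanation sets at two nearby tolerance values, which is one plausible instantiation rather than the paper's definition.

```python
# Assumed stability sketch: S_h,k is the set of nodes at level h carrying explanations at
# tolerance k; stability is measured here as the Jaccard overlap of consecutive sets.

def level_stability(s_prev, s_curr):
    """Overlap of explanation sets at one level as the tolerance moves from k - dk to k."""
    if not s_prev and not s_curr:
        return 1.0
    return len(s_prev & s_curr) / len(s_prev | s_curr)

# Hypothetical explanation sets at one level for two nearby tolerance values.
S_prev = {"CA/LA", "CA/SB", "NY/Kings"}
S_curr = {"CA/LA", "CA/SB"}
print(level_stability(S_prev, S_curr))   # 2/3 ≈ 0.67
```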

  16. Future Work
  • Global budget on error tolerance: bound on the sum of node tolerances, where individual node weights can be unequally distributed
  • Using a prediction model: a statistical model provides predictions and confidence intervals on counts, which are compared to observed counts
  • Multiple dimension hierarchies: e.g., geography × Dewey Decimal Number
