
Sunita Sarawagi, IIT Bombay (http://www.it.iitb.ernet.in/~sunita)



  1. I3: Intelligent, Interactive Investigation of multidimensional data. Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ernet.in/~sunita

  2. Multidimensional OLAP databases. Fast, interactive answers to large aggregate queries.
  • Multidimensional model: dimensions with hierarchies
  • Dim 1: Bank location: branch --> city --> state
  • Dim 2: Customer: sub-profession --> profession
  • Dim 3: Time: month --> quarter --> year
  • Measures: loan amount, #transactions, balance
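To make the multidimensional model concrete, here is a minimal sketch of a roll-up along two of the hierarchies above (city --> state, month --> quarter). All member names and measure values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical parent maps for two dimension hierarchies.
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Delhi": "Delhi"}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Apr": "Q2"}

# Detailed facts: (city, month) -> loan amount.
facts = {
    ("Mumbai", "Jan"): 100, ("Pune", "Feb"): 50,
    ("Delhi", "Jan"): 70, ("Mumbai", "Apr"): 30,
}

def roll_up(facts, city_map, month_map):
    """Aggregate the measure one level up in both hierarchies."""
    agg = defaultdict(int)
    for (city, month), amount in facts.items():
        agg[(city_map[city], month_map[month])] += amount
    return dict(agg)

print(roll_up(facts, CITY_TO_STATE, MONTH_TO_QUARTER))
# {('Maharashtra', 'Q1'): 150, ('Delhi', 'Q1'): 70, ('Maharashtra', 'Q2'): 30}
```

A real OLAP engine materializes or indexes such aggregates so the same query answers interactively at scale.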

  3. OLAP
  • Navigational operators: pivot, drill-down, roll-up, select.
  • Hypothesis-driven search, e.g. factors affecting defaulters:
  • view the defaulting rate by age, aggregated over the other dimensions
  • for a particular age segment, detail along profession
  • Needs interactive response to aggregate queries.
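The hypothesis-driven search above (roll-up on age, then drill-down along profession) can be sketched with pandas; the defaulter numbers and column names are invented for illustration.

```python
import pandas as pd

# Toy defaulter data (invented).
df = pd.DataFrame({
    "age_group":  ["20-30", "20-30", "30-40", "30-40"],
    "profession": ["doctor", "artist", "doctor", "artist"],
    "defaulted":  [1, 4, 0, 2],
    "total":      [10, 10, 10, 10],
})

# Roll-up: defaulting rate by age group, aggregated over profession.
by_age = df.groupby("age_group")[["defaulted", "total"]].sum()
by_age["rate"] = by_age["defaulted"] / by_age["total"]

# Drill-down: for the suspicious age segment, detail along profession.
detail = df[df["age_group"] == "20-30"].set_index("profession")
```

Each step is one navigational operator; the tedium the next slide complains about comes from chaining many such steps by hand.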

  4. Motivation
  • OLAP products provide a minimal set of tools for analysis:
  • simple aggregates
  • selects/drill-downs/roll-ups on the multidimensional structure
  • Heavy reliance on manual operations for analysis:
  • tedious on large data with multiple dimensions and levels of hierarchy
  • GOAL: automate through complex, mining-like operations integrated with OLAP.

  5. State of the art in mining/OLAP integration
  • Decision trees [Information Discovery, Cognos]: find factors influencing high profits
  • Clustering [Pilot Software]: segment customers to define a hierarchy on that dimension
  • Time-series analysis [Seagate's Holos]: query for various shapes along time: spikes, outliers, etc.
  • Multi-level associations [Han et al.]: find associations between members of dimensions

  6. The Diff operator

  7. Unravel aggregate data. Total sales dropped 30% in N. America. Why? What is the most compact answer that the user can quickly assimilate?

  8. Solution
  • A new DIFF operator added to OLAP systems that provides the answer
  • in a single step,
  • is easy to assimilate,
  • and compact: configurable by the user.
  • Obviates the lengthy, manual search for reasons in large multidimensional data.

  9. Example query

  10. Compact answer

  11. Example: explaining increases

  12. Compact answer

  13. Model for summarization. The two aggregated values correspond to two subcubes, cube-A and cube-B, in the detailed data.

  14. Detailed answers: explain only 15% of the total difference, as against 90% with the compact answer.

  15. Summarizing similar changes

  16. MDL model for summarization
  • Given N, find the best N rows of the answer such that:
  • if the user knows cube-A and the answer,
  • the number of bits needed to send cube-B is minimized.

  17. Transmission cost: MDL-based
  • Each answer entry has a ratio: the ratio of the sums of measure values in cube-B and cube-A over the cells not covered by a more detailed entry in the answer.
  • For each cell of cube-B not in the answer:
  • r: ratio of the closest parent in the answer
  • a, b: measure values of the cell in cube-A and cube-B
  • expected value of b = a·r
  • #bits = -log(prob(b, a·r)), where prob(x, u) is the probability of value x under a distribution with mean u.
  • We use a Poisson distribution when the x are counts, a normal distribution otherwise.
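A sketch of the per-cell bit cost under the Poisson choice above. The helper names are mine, and this ignores the coding overheads a full MDL scheme would account for.

```python
import math

def poisson_log_prob(b, mean):
    """log P(b) under a Poisson distribution with the given mean (b a count)."""
    return b * math.log(mean) - mean - math.lgamma(b + 1)

def bits_to_send(a, b, r):
    """#bits to transmit cube-B cell value b, given the cube-A value a and
    the ratio r of the closest parent answer entry: expected value is a*r."""
    return -poisson_log_prob(b, a * r) / math.log(2)

# A cell matching the answer's ratio is cheap; a surprising cell is costly,
# so it pays to list it explicitly in the answer instead.
cheap = bits_to_send(a=100, b=70, r=0.7)   # actual equals the expected 70
costly = bits_to_send(a=100, b=20, r=0.7)  # far from the expected 70
```

The DIFF answer is then the N rows whose ratios minimize the total of these per-cell costs.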

  18. Algorithm
  • Challenges:
  • circular dependence on the parent's ratio
  • bounded size of the answer
  • greedy methods do not work
  • Bottom-up dynamic programming algorithm

  19. (Figure: the bottom-up dynamic program. Tuples in the detailed data are grouped by common parent; each level-0 group keeps its best answers for N = 0, 1, 2, and a new level-1 group is formed by combining a group's solutions with tuple i via a min step.)

  20. Integration
  • Single pass on the data; all indexing/sorting in the DBMS: interactive.
  • Low memory usage, independent of the number of tuples: O(NL).
  • Easy to package as a stored procedure on the data-server side.
  • When the detailed subcube is too large: work off aggregated data.

  21. Performance
  • Setup: 333 MHz Pentium, 128 MB memory; data on DB2 UDB, NT 4.0; OLAP benchmark: 1.36 million tuples, 4 dimensions.
  • 80% of the time is spent in data access.
  • A quarter million records processed in 10 seconds.

  22. The Relax operator

  23. Example query: generalizing drops

  24. Ratio generalization

  25. Problem formulation
  Inputs:
  • A specific tuple Ts
  • An upper bound N on the answer size
  • Error functions:
  • R(Ts, T): the error of including a tuple T in a generalization around Ts
  • S(Ts, T): the error of excluding T from the generalization
  Goal: find all possible consistent and maximal generalizations around Ts.
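The consistency test implied by this formulation (used as the candidate filter in slide 27) can be sketched as follows, with made-up ratio-based error functions:

```python
def is_consistent(group, Ts, R, S):
    """Keep a generalization when the total error of excluding its tuples
    exceeds the total error of including them."""
    include_err = sum(R(Ts, T) for T in group)
    exclude_err = sum(S(Ts, T) for T in group)
    return exclude_err > include_err

# Hypothetical errors for a "sales dropped to ~0.7x" pattern around Ts:
# R penalizes deviation from Ts's drop ratio if T is folded in,
# S penalizes calling a genuinely changed T "unchanged" if it is left out.
Ts = 0.7
R = lambda ts, t: abs(t - ts)
S = lambda ts, t: abs(t - 1.0)

assert is_consistent([0.68, 0.72, 0.65], Ts, R, S)   # similar drops: generalize
assert not is_consistent([1.00, 0.98], Ts, R, S)     # flat cells: do not
```

Maximality then means no strictly larger group (e.g. one level up a hierarchy) still passes this test.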

  26. Algorithm
  Considerations:
  • Need to exploit the capabilities of the OLAP data source
  • Need to reduce the amount of data fetched to the application
  2-stage approach:
  • Finding generalizations
  • Getting exceptions

  27. Finding generalizations
  n = number of dimensions; Li = levels of hierarchy of dimension Di; Dij = jth level in the ith dimension hierarchy
  candidate_set ← {D11, D21, …, Dn1}   // all single-dimension candidate generalizations
  k ← 1
  while (candidate_set ≠ ∅)
      for each g ∈ candidate_set
          if (Σ_{T∈g} S(Ts, T) > Σ_{T∈g} R(Ts, T))
              Gk ← Gk ∪ {g}
      // generate candidates for pass (k+1) from the generalizations of pass k
      candidate_set ← generateCandidates(Gk)   // Apriori style
      // if a generalization is possible at level j of dimension Di, add its parent level to the candidate set
      candidate_set ← candidate_set ∪ {Di(j+1) | Dij ∈ Gk and j < Li}
      k ← k + 1
  return ∪i Gi

  28. Finding summarized exceptions
  Goal: find the exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error.
  Challenges:
  • No absolute criterion, valid for all possible R functions, for whether a tuple is an exception
  • The worth of including a child tuple depends circularly on its parent tuple
  • Bounded size of the answer
  Solution: bottom-up dynamic programming algorithm.

  29. Single dimension with multiple levels of hierarchy
  Optimal solution for finite-domain R functions:
  • soln(l, n, v): the best solution for subtree l, for each n between 0 and N and each possible value v of the default representative
  • soln(l, n, v, c): the intermediate value of soln(l, n, v) after the 1st through cth children of l have been scanned
  Err(soln(l, n, v, c+1)) = min_{0 ≤ k ≤ n} ( Err(soln(l, k, v, c)) + Err(soln(c+1, n−k, v)) )
  Err(soln(l, n, v)) = min( Err(soln(l, n, v, *)), min_{v' ≠ v} ( Err(soln(l, n−1, v', *)) + rep(v') ) )

  30. (Figure: worked example of the dynamic program with N = 3. Each subtree 1.1 … 1.4 carries its best solutions soln(subtree, n, *) and their errors for n = 0 … 3; the root combines them into soln(1, 1, *), with exceptions such as 1.1.4 : −, 1.1.8 : − and 1.2.1 : + surviving in the candidate answers.)

  31. The Inform operator

  32. User-cognizant data exploration: overview
  • Monitor to find the regions of data the user has visited
  • Model the user's expectation of unseen values
  • Report the most informative unseen values
  • How to:
  • model expected values?
  • define information content?

  33. Views seen by user Database hidden from user Modeling expected values

  34. The Maximum Entropy Principle
  • Choose the most uniform distribution while adhering to all the constraints.
  • E. T. Jaynes [1990]: "it agrees with everything that is known but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom…"
  • Characterizing uniformity: entropy is maximum when all the pi are equal.
  • Solve the constrained optimization problem: maximize H(p) subject to the k constraints.
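A tiny worked example of the principle, with invented numbers: if the user has seen only that two of four cells sum to 60 (of a known total of 100), maximum entropy spreads each known sum uniformly over its cells and assumes nothing more.

```python
# Four unseen cells; the grand total of 100 is known, and one visited
# view says cells 0 and 1 together sum to 60. Nothing else distinguishes
# the cells within each group, so the max-entropy expectation is uniform
# within each group.
total = 100.0
seen_sum = 60.0

expected = [seen_sum / 2, seen_sum / 2,
            (total - seen_sum) / 2, (total - seen_sum) / 2]
# expected == [30.0, 30.0, 20.0, 20.0]
```

With overlapping constraints no closed form like this exists, which is what the iterative scaling algorithm of slide 38 is for.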

  35. Modeling expected values Visited views Database

  36. Change in entropy

  37. Finding expected values
  • Solve the constrained optimization problem: maximize H(p) subject to the k constraints.
  • Each constraint has the form: a sum over an arbitrary set of values.
  • The expected values can be expressed as a product of k coefficients, one from each of the k constraints.

  38. Iterative scaling algorithm
  Initially all p values are the same.
  While convergence is not reached:
      for each constraint Ci in turn:
          scale the p values included in Ci so that their sum matches the constraint's target.
  Converges to the optimal solution when all the constraints are consistent.
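A minimal sketch of the scaling loop: each pass rescales the cells of a constraint so their sum matches its target (iterative proportional fitting). The function name and the toy constraints are mine.

```python
def iterative_scaling(n_cells, constraints, total, iters=200):
    """constraints: list of (cell_indices, target_sum) pairs.
    Returns max-entropy cell values consistent with the constraints."""
    p = [total / n_cells] * n_cells            # start uniform
    for _ in range(iters):
        for cells, target in constraints:
            current = sum(p[i] for i in cells)
            scale = target / current           # match this constraint exactly
            for i in cells:
                p[i] *= scale
    return p

# Two overlapping sum constraints over 4 cells; cell 3 is unconstrained
# and keeps its uniform value of 25.
p = iterative_scaling(4, [((0, 1), 60.0), ((1, 2), 50.0)], total=100.0)
```

Because the constraints overlap on cell 1, each pass disturbs the other constraint slightly; the loop converges geometrically when the constraints are consistent.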

  39. Information content of an unvisited cell
  • Defined as how much adding the cell as a constraint reduces the distance between the actual and expected values.
  • Distance between actual and expected values:
  • Information content of the (k+1)th constraint Ck+1:
  • Can be approximated as:
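One concrete way to instantiate these quantities is the generalized KL divergence between actual and expected values; the per-cell reduction below is my simplified stand-in for the paper's approximation, since the formulas themselves did not survive in this transcript.

```python
import math

def divergence(actual, expected):
    """Generalized KL distance between actual and expected cell values;
    each term a*log(a/e) - a + e is non-negative and zero iff a == e."""
    return sum(a * math.log(a / e) - a + e
               for a, e in zip(actual, expected))

actual = [40.0, 20.0, 25.0, 15.0]      # hidden values (invented)
expected = [30.0, 30.0, 20.0, 20.0]    # max-entropy expectation from seen views

def info_content(i):
    """Drop in divergence if the user is shown cell i, so that its
    expectation becomes exact (ignoring the re-fit of other cells)."""
    corrected = list(expected)
    corrected[i] = actual[i]
    return divergence(actual, expected) - divergence(actual, corrected)

ranked = sorted(range(4), key=info_content, reverse=True)
```

Reporting the top-N cells of this ranking is exactly the "most informative unseen values" step of slide 32.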

  40. Information content of unseen data

  41. Adapting for OLAP data. Optimization 1: expand the expected cube on demand
  • A single entry for all cells with the same expected value.
  • Initially everything is aggregated, but this touches a lot of data.
  • Later constraints touch a limited amount of data.

  42. Optimization 2: reduce overlap
  • The number of iterations depends on the overlap between constraints.
  • Remove subsumed constraints from their parents to reduce overlap.

  43. Finding the N most informative cells
  • In general, the most informative cells can be any values from any level of aggregation.
  • A single-pass algorithm finds the best differences between actual and expected values [VLDB-99].

  44. Information gain with focused exploration

  45. Illustration from student enrollment data. 35% of the information in the data is captured in 12 out of 4560 cells: 0.25% of the data.

  46. Top few surprising values. 80% of the information in the data is captured in 50 out of 4560 cells: 1% of the data.

  47. Summary
  • Our goal: enhance OLAP with a suite of operations that are
  • richer than simple OLAP and SQL queries
  • more interactive than conventional mining
  …and thus reduce the need for manual analysis.
  • Proposed three new operators: Diff, Generalize, Surprise
  • Formulations with a theoretical basis
  • Efficient algorithms for online answering
  • Integrate smoothly with existing systems
  • Future work: more operators.
