
Hierarchies in Data Mining



  1. Hierarchies in Data Mining Raghu Ramakrishnan ramakris@yahoo-inc.com Chief Scientist for Audience and Cloud Computing Yahoo!

  2. About this Talk • Common theme—multidimensional view of data: • Reveals patterns that emerge at coarser granularity • Widely recognized, e.g., generalized association rules • Helps handle imprecision • Analyzing imprecise and aggregated data • Helps handle data sparsity • Even with massive datasets, sparsity is a challenge! • Defines candidate space of subsets for exploratory mining • Forecasting query results over “future data” • Using predictive models as summaries • Potentially, space of “mining experiments”?

  3. Background: The Multidimensional Data Model; Cube Space

  4. Star Schema • “FACT” TABLE: SERVICE (pid, timeid, locid, repair) • DIMENSION TABLES: TIME (timeid, date, week, year), PRODUCT (pid, pname, category, model), LOCATION (locid, country, region, state) [Figure: star schema with the fact table at the center, joined to the three dimension tables]

  5. Dimension Hierarchies • For each dimension, the set of values can be organized in a hierarchy: • PRODUCT: automobile → category → model • TIME: year → quarter → month → date (with week → date as a parallel path) • LOCATION: country → region → state

  6. Multidimensional Data Model • One fact table D = (X, M) • X = X1, X2, … Dimension attributes • M = M1, M2, … Measure attributes • Domain hierarchy for each dimension attribute: • Collection of domains Hier(Xi) = (DXi(1), …, DXi(t)) • The extended domain: EXi = ∪1≤k≤t DXi(k) • Value mapping function: γD1→D2(x) • e.g., γmonth→year(12/2005) = 2005 • Forms the value hierarchy graph • Stored as dimension table attribute (e.g., week for a time value) or conversion functions (e.g., month, quarter)
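To make the value-mapping function γ concrete, here is a minimal Python sketch (not from the talk; the 'MM/YYYY' encoding and the rollup rules are hard-coded illustrations):

```python
# Minimal sketch of value-mapping functions for a TIME dimension.
# The 'MM/YYYY' encoding and the rollup rules are illustrative assumptions.

def gamma_month_year(month_value):
    """Map a 'MM/YYYY' month value up to its year, e.g. '12/2005' -> 2005."""
    _, year = month_value.split("/")
    return int(year)

def gamma_month_quarter(month_value):
    """Map a 'MM/YYYY' month value up to its quarter, e.g. '12/2005' -> 'Q4/2005'."""
    month, year = month_value.split("/")
    return f"Q{(int(month) - 1) // 3 + 1}/{year}"

print(gamma_month_year("12/2005"))     # 2005
print(gamma_month_quarter("12/2005"))  # Q4/2005
```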

  7. Multidimensional Data [Figure: example facts p1–p4 placed in a grid over two dimension attributes. AUTOMOBILE has levels ALL (3) → Category {Sedan, Truck} (2) → Model {Civic, Camry, F150, Sierra} (1); LOCATION has levels ALL (3) → Region {East, West} (2) → State {NY, MA, TX, CA} (1)]

  8. Cube Space • Cube space: C = EX1 × EX2 × … × EXd • Region: hyper-rectangle in cube space • c = (v1, v2, …, vd), vi ∈ EXi • E.g., c1 = (NY, Camry); c2 = (West, Sedan) • Region granularity: • gran(c) = (d1, d2, …, dd), di = Domain(c.vi) • E.g., gran(c1) = (State, Model); gran(c2) = (Region, Category) • Region coverage: • coverage(c) = all facts in c • Region set: all regions with the same granularity
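A small sketch of regions, granularity, and coverage (the level and rollup tables below are illustrative assumptions, not part of the talk):

```python
# Sketch: regions as value tuples; granularity and coverage follow from
# assumed value -> level and child -> parent lookup tables.

LEVEL = {"NY": "State", "MA": "State", "East": "Region", "West": "Region",
         "Camry": "Model", "F150": "Model", "Sedan": "Category", "Truck": "Category"}

ROLLUP = {"NY": "East", "MA": "East", "East": "ALL",
          "Camry": "Sedan", "F150": "Truck", "Sedan": "ALL", "Truck": "ALL"}

def gran(region):
    return tuple(LEVEL[v] for v in region)

def contains(value, region_value):
    """True if a leaf value rolls up to the region's value."""
    while value is not None:
        if value == region_value:
            return True
        value = ROLLUP.get(value)
    return False

facts = [("NY", "Camry"), ("MA", "F150")]  # facts at (State, Model) granularity
c2 = ("West", "Sedan")
print(gran(c2))                            # ('Region', 'Category')
print([f for f in facts
       if all(contains(v, r) for v, r in zip(f, c2))])  # coverage(c2): [] here
```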

  9. OLAP Over Imprecise Data, with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan. In VLDB 05, 06; joint work with IBM Almaden

  10. Imprecise Data [Figure: the same grid as before, with an additional imprecise fact p5 recorded at the coarser cell (MA, Truck) instead of at a leaf (State, Model) cell]

  11. Querying Imprecise Facts • Query: Auto = F150, Loc = MA • SUM(Repair) = ??? • How do we treat p5? [Figure: p5 sits in the (MA, Truck) cell, overlapping both the (MA, F150) and (MA, Sierra) leaf cells]

  12. Allocation (1) [Figure: the imprecise fact p5 spans the two leaf cells (MA, F150) and (MA, Sierra)]

  13. Allocation (2) [Figure: p5 is replaced by two weighted copies, one in (MA, F150) and one in (MA, Sierra)] (Huh? Why 0.5 / 0.5? Hold on to that thought)

  14. Allocation (3) • Query the Extended Data Model! • Auto = F150, Loc = MA • SUM(Repair) = 150 [Figure: the query now sums the precise facts in (MA, F150) plus p5’s weighted copy there]

  15. Allocation Policies • The procedure for assigning allocation weights is referred to as an allocation policy • Each allocation policy uses different information to assign the allocation weights • Key contributions: • Appropriate characterization of the large space of allocation policies (VLDB 05) • Designing efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
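As a hedged illustration of the simplest policy (the uniform split behind the “0.5 / 0.5” on the earlier slide; the other policies described in the papers would assign different, data-dependent weights):

```python
# Sketch of a uniform allocation policy: an imprecise fact's unit weight is
# split equally among the leaf cells it could occupy.

def allocate_uniform(candidate_cells):
    """Return (cell, weight) pairs summing to 1 for one imprecise fact."""
    w = 1.0 / len(candidate_cells)
    return [(cell, w) for cell in candidate_cells]

# p5 is recorded at (MA, Truck); its possible leaf completions are:
print(allocate_uniform([("MA", "F150"), ("MA", "Sierra")]))
# [(('MA', 'F150'), 0.5), (('MA', 'Sierra'), 0.5)]
```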

  16. Motivating Example • Query: COUNT • We propose desiderata that enable an appropriate definition of query semantics for imprecise data [Figure: the running example grid with facts p1–p5]

  17. Desideratum I: Consistency • Consistency specifies the relationship between answers to related queries on a fixed data set [Figure: the running example grid with facts p1–p5]

  18. Desideratum II: Faithfulness • Faithfulness specifies the relationship between answers to a fixed query on related data sets [Figure: three related data sets (Data Set 1, 2, 3), each placing facts p1–p5 differently in the F150/Sierra × MA/NY grid]

  19. Imprecise facts lead to many possible worlds [Kripke63, …] [Figure: possible worlds w1–w4, each a precise completion that assigns the imprecise fact(s) to specific cells of the F150/Sierra × MA/NY grid]

  20. Query Semantics • Given all possible worlds together with their probabilities, queries are easily answered using expected values • But number of possible worlds is exponential! • Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data • Size increase is linear in number of (completions of) imprecise facts • Queries operate over this extended version
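A sketch of answering an aggregate over the extended data model (the repair values below are made-up assumptions, chosen so the example reproduces the slide’s answer of 150):

```python
# Sketch: SUM over the extended data model. Precise facts carry weight 1.0;
# an imprecise fact contributes each completion with its allocation weight.

extended = [
    # (state, model, repair, weight) -- illustrative values
    ("MA", "F150",   100.0, 1.0),   # precise fact p3
    ("MA", "F150",   100.0, 0.5),   # half of imprecise fact p5
    ("MA", "Sierra", 100.0, 0.5),   # other half of p5
]

def sum_repair(rows, state, model):
    return sum(repair * w for s, m, repair, w in rows
               if s == state and m == model)

print(sum_repair(extended, "MA", "F150"))  # 150.0, the expected value
```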

  21. Dealing with Data Sparsity Deepak Agarwal, Andrei Broder, Deepayan Chakrabarti, Dejan Diklic, Vanja Josifovski, Mayssam Sayyadian Estimating Rates of Rare Events at Multiple Resolutions, KDD 2007

  22. Motivating Application: Content Match Problem • Problem: which ads are good on what pages • Pages: no control; ads: can control • First simplification: a (page, ad) pair is completely characterized by a set of high-dimensional features • Naïve approach: experiment with all possible pairs several times and estimate CTR • Of course, this doesn’t work: most (ad, page) pairs have very few impressions, if any, and even fewer clicks • Severe data sparsity

  23. Estimation in the “Tail” • Use an existing, well-understood hierarchy • Categorize ads and webpages to leaves of the hierarchy • CTR estimates of siblings are correlated • The hierarchy allows us to aggregate data • Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions (see the sketch below) • Similar “coarsening”, different motivation: Mining Generalized Association Rules, Ramakrishnan Srikant, Rakesh Agrawal, VLDB 1995
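One simple way to see how coarse levels influence fine ones is to shrink a sparse leaf’s CTR toward its parent’s aggregate; the sketch below is a generic smoother of that kind, not the KDD 2007 model:

```python
# Sketch: shrink a leaf CTR toward its parent's CTR with a pseudo-count k.
# Sparse leaves are dominated by the parent; data-rich leaves by their own counts.

def smoothed_ctr(leaf_clicks, leaf_impressions, parent_ctr, k=100.0):
    return (leaf_clicks + k * parent_ctr) / (leaf_impressions + k)

print(smoothed_ctr(1, 20, parent_ctr=0.002))        # ~0.010, pulled toward parent
print(smoothed_ctr(500, 10_000, parent_ctr=0.002))  # ~0.050, dominated by leaf data
```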

  24. Sampling of Webpages • Naïve strategy: sample at random from the set of URLs • Sampling errors in impression volume AND click volume • Instead, we propose: crawl all URLs with at least one click, plus a sample of the remaining URLs • Variability is then only in impression volume

  25. Imputation of Impression Volume • Region node = (page node, ad node) • Build a region hierarchy: the cross-product of the page hierarchy and the ad hierarchy, with levels Z(0), Z(1), …, down to the leaf regions [Figure: page hierarchy × ad hierarchy forming the region hierarchy]

  26. Exploiting Taxonomy Structure • Consider the bottom two levels of the taxonomy • Each cell corresponds to a (page, ad)-class pair • Key point: Children under a parent node are alike and expected to have similar CTRs (i.e., form a cohesive block)

  27. Imputation of Impression Volume • For any level Z(i), impressions in region (i, j) split into a clicked pool, a sampled non-clicked pool, and excess impressions (to be imputed): #impressions = nij + mij + xij • Each row (page class) sums to ∑j nij + K·∑j mij [row constraint] • Each column sums to the known #impressions on ads of that ad class [column constraint] • The grand total sums to the total impressions (known)

  28. Imputation of Impression Volume • Each block of child regions in Z(i+1) sums to its parent region in Z(i) [block constraint]

  29. Imputing xij • Iterative Proportional Fitting [Darroch+/1972] • Initialize xij = nij + mij • Top-down: • Scale all xij in every block in Z(i+1) to sum to its parent in Z(i) • Scale all xij in Z(i+1) to sum to the row totals • Scale all xij in Z(i+1) to sum to the column totals • Repeat for every level Z(i) • Bottom-up: similar
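A compact sketch of IPF on a single level, with only the row and column constraint families (the block/parent constraints and the level-by-level sweep are omitted for brevity; the totals are made-up numbers):

```python
import numpy as np

# Sketch: iterative proportional fitting of x_ij to given row and column
# totals, starting from x_ij = n_ij + m_ij.

def ipf(x, row_totals, col_totals, iters=50):
    x = x.astype(float).copy()
    for _ in range(iters):
        x *= (row_totals / x.sum(axis=1))[:, None]   # enforce row constraints
        x *= (col_totals / x.sum(axis=0))[None, :]   # enforce column constraints
    return x

x0 = np.array([[1.0, 2.0], [3.0, 4.0]])              # x_ij = n_ij + m_ij
fitted = ipf(x0, row_totals=np.array([10.0, 20.0]),
             col_totals=np.array([12.0, 18.0]))
print(fitted.sum(axis=1), fitted.sum(axis=0))        # ~[10. 20.] ~[12. 18.]
```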

  30. Imputation: Summary • Given • nij (impressions in clicked pool) • mij (impressions in sampled non-clicked pool) • #impressions on ads of each ad class in the ad hierarchy • We get • Estimated impression volume Ñij = nij + mij + xij in each region ij of every level Z(·)

  31. Dealing with Data Sparsity Deepak Agarwal, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, Joe Zachariah Real-time Content Optimization through Active User Feedback, NIPS 2008

  32. Yahoo! Home Page Featured Box • It is the top-center part of the Y! Front Page • It has four tabs: Featured, Entertainment, Sports, and Video

  33. Novel Aspects • Classical: arms assumed fixed over time • We gain and lose arms over time • Some theoretical work by Whittle in the 80’s (operations research) • Classical: serving rule updated after each pull • We compute the optimal design in batch mode • Classical: CTR generally assumed stationary • We have highly dynamic, non-stationary CTRs
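For flavor only, a minimal ε-greedy sketch with a changing arm set (this is not Yahoo!’s serving algorithm; it updates after each pull and ignores non-stationarity, precisely the “classical” simplifications the slide contrasts against):

```python
import random

# Sketch: epsilon-greedy over a pool of arms (stories) that can be added
# and retired over time; stats are [clicks, views] per arm.

class Pool:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.stats = {}                      # arm -> [clicks, views]

    def add(self, arm):
        self.stats.setdefault(arm, [0, 1e-6])

    def retire(self, arm):
        self.stats.pop(arm, None)

    def choose(self):
        if random.random() < self.epsilon:   # explore
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda a: self.stats[a][0] / self.stats[a][1])

    def update(self, arm, clicked):
        self.stats[arm][0] += clicked
        self.stats[arm][1] += 1

pool = Pool()
pool.add("story1"); pool.add("story2")       # arms arrive ...
arm = pool.choose(); pool.update(arm, clicked=1)
pool.retire("story1")                        # ... and depart over time
```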

  34. Bellwether Analysis: Global Aggregates from Local Regions, with Beechung Chen, Jude Shavlik, and Pradeep Tamma. In VLDB 06

  35. Motivating Example • A company wants to predict the first year worldwide profit of a new item (e.g., a new movie) • By looking at features and profits of previous (similar) movies, we predict expected total profit (1-year US sales) for new movie • Wait a year and write a query! If you can’t wait, stay awake … • The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods). • Example “region-based” features: 1st week sales in Peoria, week-to-week sales growth in Wisconsin, etc. • Gathering this data has a cost (e.g., marketing expenses, waiting time) • Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”

  36. Key Ideas • Large datasets are rarely labeled with the targets that we wish to learn to predict • But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining • We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result • The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case

  37. Motivating Example • A company wants to predict the first year’s worldwide profit for a new item, by using its historical database • Database schema: [Figure: schema tables; the combination of the underlined attributes forms a key]

  38. A Straightforward Approach • By joining and aggregating tables in the historical database, we can create a training set: item-table features plus a target • Build a regression model to predict item profit • An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense • There is much room for accuracy improvement!

  39. Using Regional Features • Example region: [1st week, HK] • Regional features: • Regional profit: the 1st week profit in HK • Regional ad expense: the 1st week ad expense in HK • A possibly more accurate model: Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, HK] + β5·AdExpense[1wk, HK] • Problem: which region should we use? • The smallest region that improves the accuracy the most • We give each candidate region a cost • The most “cost-effective” region is the bellwether region
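A sketch of fitting such a model (scikit-learn, with random stand-in data; the five columns mirror the illustrative features above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch: regress 1-year profit on item features plus regional features
# such as Profit[1wk, HK]. All data below is synthetic stand-in data.

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),    # Laptop indicator
    rng.integers(0, 2, n),    # Desktop indicator
    rng.normal(10, 2, n),     # RdExpense
    rng.normal(5, 1, n),      # Profit[1wk, HK]    (regional feature)
    rng.normal(3, 1, n),      # AdExpense[1wk, HK] (regional feature)
])
y = X @ np.array([1.0, -0.5, 2.0, 4.0, 0.5]) + rng.normal(0, 1, n)

model = LinearRegression().fit(X, y)
print(model.coef_)            # recovers estimates of beta_1 .. beta_5
```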

  40. Basic Bellwether Problem • Features: aggregates over the data records of a region r, e.g., r = [1-2, USA]; target: total profit in [1-52, All] • For each region r, build a predictive model hr(x); then choose the bellwether region such that: • Coverage(r), the fraction of all items in the region, ≥ a minimum coverage support • Cost(r, DB) ≤ a cost threshold • Error(hr) is minimized (see the sketch below)
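Read as a constrained search, the selection step looks like the following schematic (every helper here is a placeholder for the real cube machinery, not an API from the paper):

```python
# Sketch of basic bellwether search: among regions meeting the coverage and
# cost constraints, pick the region whose model h_r has minimum error.

def find_bellwether(regions, db, budget, min_coverage,
                    coverage, cost, model_error):
    best, best_err = None, float("inf")
    for r in regions:
        if coverage(r, db) < min_coverage or cost(r, db) > budget:
            continue                     # infeasible region
        err = model_error(r, db)         # e.g., cross-validated Error(h_r)
        if err < best_err:
            best, best_err = r, err
    return best, best_err
```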

  41. Experiment on a Mail Order Dataset • Error-vs-budget plot • Bel Err: the error of the bellwether region found using a given budget • Avg Err: the average error of all the cube regions with costs under a given budget • Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget • (RMSE: root mean square error) [Figure: error-vs-budget plot; the bellwether region found is [1-8 month, MD]]

  42. Experiment on a Mail Order Dataset • Uniqueness plot • Y-axis: fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region • We have 99% confidence that [1-8 month, MD] is quite an unusual bellwether region

  43. Basic Bellwether Computation • OLAP-style bellwether analysis • Candidate regions: regions in a data cube • Queries: OLAP-style aggregate queries • E.g., Sum(Profit) over a region • Efficient computation: • Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01) • Infeasible regions: regions with cost > B or coverage < C • Share computation by generating the features and target values for all feasible regions together • Exploit distributive and algebraic aggregate functions • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
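The coverage constraint is anti-monotone (refining a region can only shrink its coverage), which is what makes iceberg-style pruning sound; a schematic sketch with placeholder helpers:

```python
# Sketch: top-down enumeration of feasible regions. Once a region's coverage
# falls below the threshold, its entire subtree of finer regions is pruned.

def enumerate_feasible(region, db, min_coverage, coverage, children):
    if coverage(region, db) < min_coverage:
        return                           # prune the whole subtree
    yield region
    for child in children(region):
        yield from enumerate_feasible(child, db, min_coverage,
                                      coverage, children)
```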

  44. Subset-Based Bellwether Prediction • Motivation: different subsets of items may have different bellwether regions • E.g., the bellwether region for laptops may be different from the bellwether region for clothes • Two approaches: bellwether trees and bellwether cubes [Figure: a bellwether tree splitting on item attributes such as R&D Expenses and Category, and a bellwether cube over those dimensions]

  45. Characteristics of Bellwether Trees & Cubes • Result: • Bellwether trees & cubes have better accuracy than basic bellwether search • Increased noise → increased error • Increased complexity → increased error • Dataset generation: • Use a random tree to generate different bellwether regions for different subsets of items • Parameters: • Noise • Concept complexity: # of tree nodes [Figure: accuracy plots at 15 nodes, noise level 0.5]

  46. Efficiency Comparison [Figure: running time of naïve computation methods vs. our computation techniques]

  47. Scalability

  48. Exploratory Mining: Prediction Cubes, with Beechung Chen, Lei Chen, and Yi Lin. In VLDB 05

  49. The Idea • Build OLAP data cubes in which cell values represent decision/prediction behavior • In effect, build a tree for each cell/region in the cube—observe that this is not the same as a collection of trees used in an ensemble method! • The idea is simple, but it leads to promising data mining tools • Ultimate objective: Exploratory analysis of the entire space of “data mining choices” • Choice of algorithms, data conditioning parameters …
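A sketch of the cell-model idea in pandas/scikit-learn (the column names and the accuracy summary are illustrative choices, not the paper’s exact cell measure):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Sketch: a "prediction cube" stores, per cube cell, a summary of the model
# trained on that cell's data subset -- here, its training accuracy.

def prediction_cube(df, dims, features, target):
    cells = {}
    for key, subset in df.groupby(dims):
        if subset[target].nunique() < 2:
            continue                     # cannot train on a single class
        model = DecisionTreeClassifier(max_depth=3)
        model.fit(subset[features], subset[target])
        cells[key] = model.score(subset[features], subset[target])
    return cells

# Usage (hypothetical loan-application data):
# prediction_cube(loans, dims=["state", "year"],
#                 features=["income", "age"], target="approved")
```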

  50. Example (1/7): Regular OLAP • Z: dimensions (Location, Time) • Y: measure • Goal: look for patterns of unusually high numbers of applications [Table: application counts by Location and Time]
