1 / 16

Speeding Up Warehouse Physical Design Using A Randomized Algorithm

Speeding Up Warehouse Physical Design Using A Randomized Algorithm. Minsoo Lee Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida. View Selection Problem. What’s a Data Warehouse?

lise
Download Presentation

Speeding Up Warehouse Physical Design Using A Randomized Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Speeding Up Warehouse Physical DesignUsing A Randomized Algorithm Minsoo Lee Joachim Hammer Dept. of Computer & Information Science & Engineering University of Florida

  2. View Selection Problem • What’s a Data Warehouse? • stores info. collected from multiple, heterogeneous info. sources to support complex querying and analysis • Materialized Views in a DW • pre-computed portions of frequently asked queries • maintenance : incremental, periodic refresh • View Selection Problem • decide which views to materialize in DW • considers query response time, maintenance cost(?), and storage cost

  3. Overview of Our Problem • Maintenance-cost View Selection Problem [GM99] • decide which views to materialize in DW • minimize query response time, given an upper bound on maintenance cost (storage space is not considered) • DW Configuration based on OR view graphs • Any view can be computed from any of its related views

  4. Problems with existing Solutions • Existing Solutions to the View Selection Problem • Heuristics-based Search • Greedy Heuristics [Gup97,GM99] • A* [Rou82, LQA97, GM99] • Exhaustive Search [TS97] • Problems • Does not scale up well for more than 20 views • Time complexity is polynomial • DW evolution requires efficient re-computation of a configuration

  5. Outline of Our Approach • Use Randomized Algorithms • Randomized algorithms provide good solution within a small amount of time (time/quality tradeoff) • Specifically, use Genetic Algorithms (GA) • Advantages of Our Approach • Near linear scaleup with a solution within 90% of optimal • Support DW evolution with fast reconfiguration

  6. Genetic Algorithms Loop until termination condition = true t=t+1 Select P(t) from P(t-1) Recombine P(t) Evaluate P(t)

  7. GA : Representation of Solution • Genome • Candidate Solution of the problem to be solved • Represented as a String • Ordering problems : Alphanumeric String Selection problems : Binary String • Binary String Representation • ex) v1 v2 v3 v4 v5 0 1 0 0 1  views v2 and v5 are selected

  8. GA : Initialization of Population • Initial Population in our experiments • Pool of randomly generated bit strings • population size is 300 • Future experiments • generate more favorable initial population • use external knowledge of problem

  9. GA : Selection, Crossover, Mutation, Termination • Selection • Select superior genomes among previous population • Roulette Wheel Method [Mic94] • Crossover • applied to two genomes by exchanging information • Mutation • applied to a single genome : ex) flip a bit in the genome • Termination • termination condition : 400 generations

  10. GA : Evaluation Process • Fitness Function • measures how good a genome is as a solution • high : close to optimal, low : further from optimal • Use Penalty Function in Fitness Function • similar solution to 0/1 knapsack solution[Mic94] • Evaluate query benefit. If maintenance limit is exceeded, apply penalty.

  11. GA : Evaluation Process • Penalty Functions • Logarithmic (LG) • Linear (LN) • Exponential (EX) • Penalty Application Methods • Subtract (S) • Divide (D) • Subtract&Divide (SD)

  12. Evaluation of the Algorithm • Environment • Pentium II 450 MHz PC, Windows NT 4.0 • OR-view graphs • number of base tables : 10 tables • number of views : 5-20 views • edge density of graph : 15%, 30%, 50%, 75% • parameters for node (view) & parameters for edge RC : 100 - 10000 for base tables QC : 10 - 80% of RC of QF : 0.1- 0.9 source view UF : 0.1 - 0.9 MC : 10 - 150% of QC

  13. Results : Quality of Solutions

  14. Results : Execution Time

  15. Prototype Development • Used version 2.4.3 of Galib from MIT • Microsoft Visual C++ • Encoded our own Fitness Function • strategy for penalty is controlled by a control variable • Encoded OR-view graph cost evaluation functions • total query cost, total maintenance cost • OR-view graph costs • Node: Read Cost, Query Frequency, Update Frequency • Edge: Query Cost, Maintenance Cost

  16. Conclusion • Use of Genetic Algorithm for Maintenance-cost View Selection Problem • yields a solution within 10% of optimal solution • linear scale up for execution time w.r.t number of views • EX-D and EX-SD strategy produce best results • Suitable for use in DW evolution • Future work • experiments with better initial population • various crossover and mutation operators, termination condition • AND-OR views, indexes • parallel version of GA

More Related