
A Fully Distributed, Fault-Tolerant Data Warehousing System


Presentation Transcript


  1. A Fully Distributed, Fault-Tolerant Data Warehousing System Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris Computing Systems Laboratory, National Technical University of Athens

  2. Motivation • Large volumes of data • Everyday life (Web 2.0) • Science (LHC, NASA) • Business domain (automation, digitization, globalization) • New regulations – log/digitize/store everything • Sensors • Immense production rates • Distributed by nature

  3. Motivation (contd.) • Demand for always-on analytics • Store huge datasets • Both structured and semi-structured bulk data • Detection of real-time changes in trends • Fast retrieval – Point, range, aggregate queries • Intrusion or DoS detection, effects of a product's promotion • Online, near real-time updates • From various locations, at high rates

  4. (Up till) now • Traditional Data Warehouses • Vast amounts of historical data – data cubes • Centralized, off-line approaches • Querying vs. Updating • Distributed warehousing systems • Functionality remains centralized • Cloud Infrastructures • Resource as a service • Elasticity, commodity hardware • Pay-as-you-go pricing model

  5. Our Goal • Distributed Data Warehousing-like system • Store, query, update • Multi-d, hierarchical • Scalable, always-on • Shared-nothing architecture • Commodity nodes • No proprietary tool needed • Java libraries, socket APIs

  6. Brown Dwarf in a nutshell • Complete system for data cubes • Distributed storage • Online updates • Efficient query resolution • Point, aggregate queries • Various levels of granularity • Elastic resources according to • Workload skew • Node churn

  7. Dwarf • Centralized structure with d levels • Root contains all distinct values of the first dimension • Each cell points to a node of the next level • Dwarf computes, stores, indexes and updates materialized cubes • Eliminates prefix and suffix redundancies (see the sketch below)
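
To make the structure concrete, here is a minimal, hypothetical Java sketch of a single Dwarf node; the class name, fields and the way aggregates are stored are illustrative assumptions, not the authors' code. Each cell maps an attribute value of the current dimension to the node of the next level, leaf cells hold the aggregate measure, and the special ALL cell aggregates over the whole dimension.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of one Dwarf node. Non-leaf cells point to a node of
 * the next level; at the last level (level d) the cells hold aggregate values.
 * Prefix redundancy is avoided because tuples sharing a prefix reuse the same
 * path of nodes; suffix coalescing is omitted here for brevity.
 */
class DwarfNode {
    final int level;                                              // 1 .. d
    final Map<String, DwarfNode> cells = new LinkedHashMap<>();   // attr value -> next-level node
    final Map<String, Double> leafCells = new LinkedHashMap<>();  // attr value -> aggregate (leaf level)
    DwarfNode allCell;   // the ALL cell: aggregates over this dimension
    Double allValue;     // aggregate of the ALL cell at the leaf level

    DwarfNode(int level) { this.level = level; }
}
```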

  8. Why distribute it? • Store larger amounts of data • Dwarf may reduce but may also blow up the data • High-dimensional, sparse data can grow more than 1,000 times • Update and query the system online • Accelerate creation, query and update speed • Parallelization • What about… • Failures, load balancing, communication costs? • Performance

  9. Brown Dwarf (BD) Overview • Dwarf nodes mapped to overlay nodes • UID for each node • Hint tables of the form (currAttr, child) • Resolve/update along the network path • Mirrors on a per-node basis (hint-table sketch below)
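
A rough sketch of what a hint table might look like on a Brown Dwarf peer, assuming hypothetical names (HintTable, register, resolve): each stored dwarf node gets a UID, and its hint table maps the current attribute value to the UID of the child dwarf node, so queries and updates can be forwarded along the overlay.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Hypothetical hint table kept by the overlay node that stores one dwarf node:
 * entries of the form (currAttr -> child UID), as described on the slide.
 */
class HintTable {
    final String nodeUid;                                 // UID of this dwarf node
    final Map<String, String> entries = new HashMap<>();  // currAttr value -> child dwarf-node UID

    HintTable(String nodeUid) { this.nodeUid = nodeUid; }

    /** Called during insertion when a child dwarf node is created. */
    void register(String currAttr, String childUid) { entries.put(currAttr, childUid); }

    /** Next hop on the resolution path, or null if the value is unknown. */
    String resolve(String currAttr) { return entries.get(currAttr); }

    static String freshUid() { return UUID.randomUUID().toString(); }
}
```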

  10. BD Operations – Insert + Query • One pass over the fact table • Gradual construction of hint tables • Creation of a cell → insertion of currAttr • Creation of a dwarf node → registration of the child • Query resolution: follow the path (d hops) along the structure (resolver sketch below)
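
Under the same assumptions, a point query could be resolved as sketched below; `overlay` stands in for the overlay/DHT lookup of a UID, and the last hop is simplified to a local value store instead of a leaf dwarf node. The query visits one hint table per dimension, i.e. d hops.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical resolver that follows hint tables from the root, one hop per dimension. */
class QueryResolver {
    final Map<String, HintTable> overlay;   // stand-in for the overlay/DHT: UID -> hint table
    final Map<String, Double> leafValues;   // simplification: UID of a leaf cell -> aggregate value
    final String rootUid;

    QueryResolver(Map<String, HintTable> overlay, Map<String, Double> leafValues, String rootUid) {
        this.overlay = overlay;
        this.leafValues = leafValues;
        this.rootUid = rootUid;
    }

    /** Point query: one attribute value per dimension ("ALL" selects the aggregate cell). */
    Double query(List<String> attrs) {
        String uid = rootUid;
        for (String attr : attrs) {             // d hops along the structure
            HintTable ht = overlay.get(uid);    // in the real system: a network lookup
            if (ht == null) return null;
            uid = ht.resolve(attr);
            if (uid == null) return null;       // value not present in the cube
        }
        return leafValues.get(uid);             // aggregate reached after the last hop
    }
}
```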

  11. BD Operations – Update • Find the longest common prefix with the existing structure • Underlying nodes recursively updated • Nodes expanded with new cells • New nodes created • ALL (aggregate) cells affected as well (update sketch below)
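
For intuition, a simplified, centralized version of this update walk over the DwarfNode sketch above (hypothetical code, ignoring suffix coalescing and the network layer): descend along the longest common prefix, create new cells or nodes where the prefix ends, and fold the new measure into the aggregates, including the ALL cells, at every level of the path.

```java
/**
 * Hypothetical, simplified update on the DwarfNode sketch: follows the common
 * prefix of the new tuple, expands nodes with missing cells, and recursively
 * updates the underlying nodes and the ALL (aggregate) cells.
 */
class DwarfUpdater {
    void update(DwarfNode node, String[] tuple, double measure, int dim, int d) {
        String attr = tuple[dim];
        if (dim == d - 1) {                                    // leaf level: adjust aggregates
            node.leafCells.merge(attr, measure, Double::sum);
            node.allValue = (node.allValue == null) ? measure : node.allValue + measure;
            return;
        }
        DwarfNode child = node.cells.get(attr);
        if (child == null) {                                   // common prefix ends: new cell + node
            child = new DwarfNode(dim + 2);
            node.cells.put(attr, child);
        }
        update(child, tuple, measure, dim + 1, d);             // recurse into the underlying node
        if (node.allCell == null) node.allCell = new DwarfNode(dim + 2);
        update(node.allCell, tuple, measure, dim + 1, d);      // ALL cells are affected as well
    }
}
```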

  12. Elasticity of Brown Dwarf • Static and adaptive replication vs: • Load (min/max load) • Churn (require ≥ k replicas) • Local-only interactions • Ping/exchange hint tables for consistency • Query forwarding to balance load (policy sketch below)
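
A sketch of the adaptive side of this, with hypothetical thresholds and method names: keep at least k mirrors of each dwarf node to tolerate churn, create an extra replica when the per-replica load exceeds an upper bound, and release one when it drops below a lower bound; between decisions, mirrors ping each other and exchange hint tables to stay consistent.

```java
/**
 * Hypothetical elasticity rule for one dwarf node, following the slide:
 * expand the replica set under load or churn, shrink it when underused.
 */
class ReplicationPolicy {
    final int k;                     // minimum number of replicas required
    final double minLoad, maxLoad;   // per-replica load thresholds (e.g. queries/sec)

    ReplicationPolicy(int k, double minLoad, double maxLoad) {
        this.k = k;
        this.minLoad = minLoad;
        this.maxLoad = maxLoad;
    }

    enum Action { REPLICATE, RELEASE_REPLICA, NONE }

    Action decide(int liveReplicas, double loadPerReplica) {
        if (liveReplicas < k || loadPerReplica > maxLoad) return Action.REPLICATE;
        if (liveReplicas > k && loadPerReplica < minLoad) return Action.RELEASE_REPLICA;
        return Action.NONE;          // keep pinging/exchanging hint tables with mirrors
    }
}
```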

  13. Experimental Evaluation • 16 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory) • Synthetic and real datasets • 5d–25d, various levels of skew (Zipf θ=0.95) • APB-1 Benchmark generator • Forest and Weather datasets • Simulation results with 1000s of nodes

  14. Cube Construction • Acceleration of cube creation by up to 3.5 times compared to centralized Dwarf • Better use of resources through parallelization • Effect more noticeable for high-dimensional, skewed datasets • Storage overhead • Mainly attributed to the mapping between dwarf-node and network IDs • Shared among network nodes

  15. Updates • 1% updates • Up to 2.3 times faster for skewed datasets • Dimensionality increases the cost

  16. Queries • 1K query sets, 50% aggregate queries • Acceleration of up to 60 times • Message cost bounded by d+1

  17. Elasticity • 10-d, 100k datasets, 5k query sets • λ = 10 queries/sec → 100 queries/sec • BD adapts according to demand → elasticity • k=3, Nfail nodes failing every Tfail sec • 5k queries, 10-d uniform dataset • No loss for Nfail < k+1 • Query time increases due to redirections

  18. What have we achieved so far? • BD optimizations – work in progress • Replication units (chunks, …) • Hierarchies – faster updates (MDAC 2010), … • Brown Dwarf focuses on • + Efficient answering of aggregate queries • + Cloud-friendly • - Preprocessing required • - Costly updates • HiPPIS project • + Explicit support for hierarchical data • + No preprocessing • + Ease of insertion and updates • - Processing needed for aggregate queries

  19. Questions
