Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris

{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory

National Technical University of Athens

Motivation
  • Large volumes of data
    • Everyday life, science and business domain
  • Time-series data
    • Temporally ordered, organized in hierarchies (Day<Month<Year)
      • E.g., date of a credit card purchase, time of a phone call
    • Important for monitoring a process of interest
  • On-line processing
    • Fast retrieval – Point, range, aggregate queries
    • Detection of real time changes in trends
      • Intrusion or DoS detection, effects of a product promotion
    • Online, cost-efficient updates
Up till now
  • Data Warehouses
    • Centralized, off-line approaches
    • Distributed warehousing systems
      • Functionality remains centralized
  • Distributed Warehouse-like initiative: Brown Dwarf
    • Distribution of centralized Dwarf
    • Deployed on shared-nothing, commodity hardware
      • Scalability, fault tolerance, performance
    • No special consideration for time-series data
    • Update procedure costly → unfit for frequent updates
Our Goals
  • Cloud-based, data-warehouse-like system
    • Targeted to time-series data
      • Arriving at high rate
    • Store, update, query data at various granularity levels
      • Multidimensional, hierarchical
    • Shared nothing architecture
      • Commodity nodes
    • Without use of any proprietary tool
      • Java libraries, socket APIs
Our Contribution
  • Complete system for multidimensional time-series data
    • Store with one pass
    • Update online
    • Query efficiently
      • Point, aggregate
      • Various levels of granularity
  • Adaptive materialization
    • According to data recency
    • Accelerate cube creation/update
    • Minimize storage consumption

Dwarf computes, stores, indexes and updates materialized cubes

Eliminates prefix and suffix redundancies

Any query (point or aggregate) is answered by traversing the structure
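
The traversal idea can be illustrated with a minimal sketch. It is not the Dwarf implementation: it builds a plain tree in which common prefixes are stored once and every level carries an ALL cell, and it omits suffix coalescing entirely. The fact table, the dimension order (store, customer, product) and the function names are assumptions made for illustration.

```python
ALL = "ALL"

def build_cube(rows, level, ndims):
    """Build a nested-dict cube from (dims, measure) rows.

    Rows that share a prefix share one path. Each level gets an extra
    ALL cell that aggregates over that dimension. Suffix coalescing,
    which the real Dwarf performs, is NOT done here.
    """
    if level == ndims:
        return sum(measure for _, measure in rows)
    groups = {}
    for dims, measure in rows:
        groups.setdefault(dims[level], []).append((dims, measure))
    node = {key: build_cube(sub, level + 1, ndims) for key, sub in groups.items()}
    node[ALL] = build_cube(rows, level + 1, ndims)
    return node

def query(cube, cells):
    """Answer a point or aggregate query by following one cell per dimension."""
    node = cube
    for cell in cells:
        node = node[cell]
    return node

# Hypothetical fact table: (store, customer, product) -> sales
facts = [
    (("S1", "C2", "P2"), 70),
    (("S1", "C3", "P1"), 40),
    (("S2", "C1", "P1"), 90),
    (("S2", "C1", "P2"), 50),
]
cube = build_cube(facts, 0, 3)
point = query(cube, ("S1", "C2", "P2"))      # point query
per_product = query(cube, (ALL, ALL, "P1"))  # aggregate query
```

Both query kinds follow exactly one cell per dimension, which is why a point query and an aggregate query cost the same number of steps.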

Brown Dwarf
  • Dwarf nodes mapped to overlay nodes
    • UID for each node
    • Hint tables of the form (currAttr, child)
  • Insertion
    • One-pass over the fact table
    • Hint tables are built up gradually
  • Queries
    • Overlay path of d hops
  • Incremental Updates
  • Elasticity through adaptive mirroring
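
A minimal sketch of the hint-table idea, under simplifying assumptions: the overlay is simulated by an in-memory dictionary from UID to hint table, ALL cells and mirroring are left out, and each visited overlay node is counted as one hop. The class and method names are hypothetical.

```python
import itertools

class OverlaySketch:
    """Simulates overlay nodes, each holding a hint table {currAttr: child}."""

    def __init__(self):
        self._uids = itertools.count()
        self.hint_tables = {}          # UID -> {currAttr: child UID or measure}
        self.root = self._new_node()

    def _new_node(self):
        uid = next(self._uids)
        self.hint_tables[uid] = {}
        return uid

    def insert(self, dims, measure):
        """Insert one fact-table tuple, creating hint-table entries on demand."""
        uid = self.root
        for attr in dims[:-1]:
            table = self.hint_tables[uid]
            if attr not in table:
                table[attr] = self._new_node()
            uid = table[attr]
        leaf = self.hint_tables[uid]
        leaf[dims[-1]] = leaf.get(dims[-1], 0) + measure

    def query(self, dims):
        """Follow the hint tables; returns (value, hops), one hop per node visited."""
        uid, hops = self.root, 0
        for attr in dims[:-1]:
            uid = self.hint_tables[uid][attr]
            hops += 1
        hops += 1                      # the last node holds the measure itself
        return self.hint_tables[uid][dims[-1]], hops

overlay = OverlaySketch()
for dims, measure in [(("S1", "C2", "P2"), 70), (("S1", "C3", "P1"), 40)]:
    overlay.insert(dims, measure)      # a single pass over the fact table
value, hops = overlay.query(("S1", "C3", "P1"))
```

With d = 3 dimensions the query visits three nodes, matching the overlay path of d hops under the hop-counting assumption stated above.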
Advantages and Drawbacks
  • Store even larger amounts of data!
    • Dwarf reduces data size but may also blow it up
      • High-dimensional, sparse data: >1,000 times
  • Handle many more requests
  • Query the system online
  • Accelerate creation (up to 5 times) and querying (up to 60 times)
    • Parallelization
  • Update remains costly
Time Series Dwarf (TSD)
  • A concept hierarchy characterizes time
    • and any other dimension
  • Updates are applied in temporal order
  • Temporal granularity of queries relative to the time of querying
    • More detailed queries for recent events
    • More coarse grained queries for past events
TSD Operations - Insertion
  • Time comes first in the dimension order
  • No ALL cell in the Time dimension
  • An aggregate is created once its level is complete
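
The insertion rules above can be sketched as follows. This is an illustrative simplification, not the TSD code: it assumes a Day<Month<Year hierarchy, a single non-time dimension (store), and flat dictionaries instead of the Dwarf structure. All names are hypothetical.

```python
class TsdInsertSketch:
    """Time-ordered insertion with aggregates created when a level completes."""

    def __init__(self):
        self.days = {}     # (year, month, day, store) -> measure
        self.months = {}   # (year, month, store) -> measure
        self.years = {}    # (year, store) -> measure; no ALL cell across years
        self.last = None   # timestamp of the most recent tuple

    def insert(self, year, month, day, store, measure):
        stamp = (year, month, day)
        if self.last is not None:
            if stamp < self.last:
                raise ValueError("tuples must arrive in temporal order")
            if (year, month) != self.last[:2]:
                # Previous month is complete: create its aggregate.
                self._close_month(self.last[0], self.last[1])
            if year != self.last[0]:
                # Previous year is complete: create its aggregate.
                self._close_year(self.last[0])
        key = (year, month, day, store)
        self.days[key] = self.days.get(key, 0) + measure
        self.last = stamp

    def _close_month(self, year, month):
        for (y, m, _day, store), value in self.days.items():
            if (y, m) == (year, month):
                key = (y, m, store)
                self.months[key] = self.months.get(key, 0) + value

    def _close_year(self, year):
        for (y, _month, store), value in self.months.items():
            if y == year:
                key = (y, store)
                self.years[key] = self.years.get(key, 0) + value

tsd = TsdInsertSketch()
tsd.insert(2010, 1, 5, "S1", 10)
tsd.insert(2010, 1, 20, "S1", 5)
tsd.insert(2010, 2, 1, "S1", 7)   # January is now complete and aggregated
```

Because tuples arrive in temporal order, a closed month or year is never touched again, so its aggregate is written exactly once.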
TSD Operations - Querying
  • Follow path along the structure
  • Roll-up query for aggregate already created
    • Within d hops (e.g., <Y1, ALL, P1>)
  • Roll-up query for recent records
    • Initial query substituted by multiple lower level queries (e.g., <Y2, S1, P1>)
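
A hedged sketch of the two roll-up cases: if the requested aggregate already exists it is returned with one lookup, otherwise the query is replaced by lookups at lower levels. The flat dictionaries and the function name are assumptions for illustration; the real system traverses the distributed structure instead.

```python
def rollup_year(year, store, years, months, days):
    """Return (value, lookups) for a year-level roll-up query.

    years:  (year, store) -> measure, aggregates of completed years
    months: (year, month, store) -> measure, aggregates of completed months
    days:   (year, month, day, store) -> measure, detailed records
    """
    if (year, store) in years:
        # Aggregate already created: answered directly.
        return years[(year, store)], 1

    # Recent year: substitute the query with lower-level queries.
    total, lookups, closed_months = 0, 0, set()
    for (y, m, s), value in months.items():
        if y == year and s == store:
            total += value
            lookups += 1
            closed_months.add(m)
    for (y, m, _day, s), value in days.items():
        # Only days of months that have no aggregate yet are needed.
        if y == year and s == store and m not in closed_months:
            total += value
            lookups += 1
    return total, lookups

# Hypothetical state: 2010 is complete, 2011 is still being filled.
years = {(2010, "S1"): 22}
months = {(2011, 1, "S1"): 4}
days = {(2011, 1, 3, "S1"): 4, (2011, 2, 1, "S1"): 6, (2011, 2, 2, "S1"): 1}

past = rollup_year(2010, "S1", years, months, days)
recent = rollup_year(2011, "S1", years, months, days)
```

The recent-year query costs more lookups than the past-year one, which is the price of not materializing aggregates for levels that are still incomplete.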

TSD Operations - Updating

  • Insertion of a new tuple
    • Longest common prefix with existing structure
    • Underlying nodes recursively updated
  • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected
    • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
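
To see why dropping the ALL cell in Time reduces the work per update, the sketch below counts the cells touched by one new tuple in a plain, uncoalesced tree. It does not reproduce the 3 vs. 12 node figures of the example above, which come from the actual structures; the function names and the `all_in_time` flag are assumptions.

```python
from itertools import product

ALL = "ALL"

def common_prefix_len(tree, dims):
    """Length of the longest prefix of dims that already exists in the tree."""
    node, depth = tree, 0
    for attr in dims:
        if not isinstance(node, dict) or attr not in node:
            break
        node, depth = node[attr], depth + 1
    return depth

def affected_paths(dims, all_in_time):
    """Every path a new tuple contributes to: per dimension, its value or ALL."""
    choices = [(value, ALL) for value in dims]
    if not all_in_time:
        choices[0] = (dims[0],)   # Time (first dimension) has no ALL cell
    return list(product(*choices))

def update(tree, dims, measure, all_in_time):
    """Apply the tuple along all affected paths; returns the number of cells touched."""
    paths = affected_paths(dims, all_in_time)
    for path in paths:
        node = tree
        for attr in path[:-1]:
            node = node.setdefault(attr, {})
        node[path[-1]] = node.get(path[-1], 0) + measure
    return len(paths)

tsd_tree, dwarf_like_tree = {}, {}
touched_tsd = update(tsd_tree, ("Y1", "S1", "P1"), 10, all_in_time=False)
touched_dwarf = update(dwarf_like_tree, ("Y1", "S1", "P1"), 10, all_in_time=True)
shared = common_prefix_len(tsd_tree, ("Y1", "S1", "P2"))
```

In this simplified model the tuple touches half as many cells when Time has no ALL cell, and a later tuple in the same period reuses the existing prefix instead of creating new nodes.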

Adaptive Materialization
  • A daemon process asynchronously
    • creates roll-up views
    • deletes corresponding drill-down ones
  • The period of this process depends on application
  • Tradeoff: cube size vs. response accuracy
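
One pass of such a daemon might look like the sketch below. It assumes flat dictionaries for the daily and monthly views and a `cutoff` month chosen by the application; scheduling the pass periodically (for example with a timer thread) is left out. All names are hypothetical.

```python
def adaptive_pass(days, months, cutoff):
    """Roll up daily views older than cutoff into monthly views.

    days:   (year, month, day, store) -> measure
    months: (year, month, store) -> measure
    cutoff: (year, month); every month strictly before it is rolled up.
    Returns the number of drill-down (daily) views deleted.
    """
    old_keys = [key for key in days if key[:2] < cutoff]
    for key in old_keys:
        year, month, _day, store = key
        rolled = (year, month, store)
        # Create or extend the roll-up view, then delete the drill-down one.
        months[rolled] = months.get(rolled, 0) + days.pop(key)
    return len(old_keys)

days = {
    (2010, 1, 5, "S1"): 10,
    (2010, 1, 20, "S1"): 5,
    (2010, 2, 1, "S1"): 7,
}
months = {}
deleted = adaptive_pass(days, months, cutoff=(2010, 2))
```

After the pass, January can only be answered at month granularity: storage shrinks, but a day-level query on old data can no longer be answered exactly, which is the size versus accuracy tradeoff noted above.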
Experimental Evaluation
  • 25 LAN-connected commodity nodes (dual core, 2.0 GHz, 4 GB main memory)
  • Synthetic and real datasets
    • APB-1 Benchmark generator
      • 4-d, 3 levels for Time, various densities
    • DARPA Intrusion Detection audit data
      • 1M tuples, 7-d, 3 levels for Time
  • TSD: static mode
  • TSDad: adaptive mode
Cube Construction
  • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)
    • Lack of the ALL cell in the first dimension
  • Acceleration of cube creation up to 89% compared to Dwarf
    • Better use of resources through parallelization (BD)
    • Further reduction due to lack of ALL and selective materialization
  • 10k updates
  • TSD up to 3 times faster than Dwarf and 30% faster than BD
    • Ordered updates – do not affect already created views
    • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)
  • TSDad does not include roll-up view creation (asynchronous) → further acceleration ~20%
  • DARPA 10k datasets – 3 kinds of querysets, 50% aggregates
    • Q1: Ideal
    • Q2: Recent records are queried in more detail (Zipfian)
    • Q3: Random
  • As the queryset approaches a uniform distribution
    • Message cost increases
    • Accuracy decreases