Efficient Updates for a Shared Nothing Analytics Platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris

{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory

National Technical University of Athens



Motivation

  • Large volumes of data

    • Everyday life, science and business domains

  • Time-series data

    • Temporally ordered, organized in hierarchies (Day<Month<Year)

      • E.g., date of a credit card purchase, time of a phone call

    • Important for monitoring a process of interest

  • On-line processing

    • Fast retrieval – Point, range, aggregate queries

    • Detection of real time changes in trends

      • Intrusion or DoS detection, effects of a product promotion

    • Online, cost-efficient updates

Up till now

  • Data Warehouses

    • Centralized, off-line approaches

    • Distributed warehousing systems

      • Functionality remains centralized

  • Distributed Warehouse-like initiative: Brown Dwarf

    • Distribution of centralized Dwarf

    • Deployed on shared-nothing, commodity hardware

      • Scalability, fault tolerance, performance

    • No special consideration for time-series data

    • Update procedure costly → unfit for frequent updates

Our Goals

  • Cloud-based, data-warehouse-like system

    • Targeted to time-series data

      • Arriving at high rate

    • Store, update, query data at various granularity levels

      • Multidimensional, hierarchical

    • Shared nothing architecture

      • Commodity nodes

    • Without use of any proprietary tool

      • Java libraries, socket APIs

Our Contribution

  • Complete system for multidimensional time-series data

    • Store with one pass

    • Update online

    • Query efficiently

      • Point, aggregate

      • Various levels of granularity

  • Adaptive materialization

    • According to data recency

    • Accelerate cube creation/update

    • Minimize storage consumption



Dwarf

  • Dwarf computes, stores, indexes and updates materialized cubes

  • Eliminates prefix and suffix redundancies

  • Any query (point or aggregate) is answered by traversing the structure
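
The traversal idea can be illustrated with a toy cube. This is only a sketch under simplifying assumptions, not the Dwarf algorithm itself: it uses a nested dictionary with one cell per attribute value plus an ALL cell at every level, shares paths between tuples with a common prefix, and omits suffix coalescing entirely. The names `build_cube` and `query` are hypothetical.

```python
ALL = "ALL"  # wildcard cell that aggregates over a whole dimension

def build_cube(facts):
    """Build a nested-dict cube from (dim_1, ..., dim_d, measure) tuples.

    Tuples that share a prefix share the same path (prefix reduction).
    Suffix coalescing, which Dwarf also performs, is omitted in this sketch.
    """
    root = {}
    for *dims, measure in facts:
        _insert(root, dims, measure)
    return root

def _insert(node, dims, measure):
    head, rest = dims[0], dims[1:]
    # Every tuple updates both its own cell and the ALL cell of the level.
    for key in (head, ALL):
        if rest:
            _insert(node.setdefault(key, {}), rest, measure)
        else:
            node[key] = node.get(key, 0) + measure

def query(cube, cells):
    """Answer a point or aggregate query by following one cell per level."""
    node = cube
    for cell in cells:
        node = node[cell]
    return node

facts = [("S1", "C1", "P1", 10), ("S1", "C2", "P1", 20), ("S2", "C1", "P2", 5)]
cube = build_cube(facts)
point = query(cube, ("S1", "C1", "P1"))      # point query
aggregate = query(cube, ("S1", ALL, "P1"))   # aggregate over the 2nd dimension
```

A point query and an aggregate query follow the same kind of path; the aggregate simply steps through an ALL cell instead of a concrete value.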

Brown Dwarf

  • Dwarf nodes mapped to overlay nodes

    • UID for each node

    • Hint tables of the form (currAttr, child)

  • Insertion

    • One pass over the fact table

    • Gradual construction of hint tables

  • Queries

    • Overlay path of d hops

  • Incremental Updates

  • Elasticity through adaptive mirroring
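
A minimal sketch of the hint-table idea follows, assuming a single in-process dictionary stands in for the overlay network. The class `Overlay` and its methods are hypothetical names for illustration; ALL cells, mirroring and real network communication are left out.

```python
import itertools

class Overlay:
    """Toy stand-in for the overlay: maps node UIDs to hint tables."""

    def __init__(self):
        self._uids = itertools.count(1)
        self.hint_tables = {}  # uid -> {currAttr: child uid, or measure at a leaf}

    def new_node(self):
        uid = next(self._uids)
        self.hint_tables[uid] = {}
        return uid

    def insert(self, root, dims, measure):
        """Walk or extend the path for one fact-table tuple (one pass)."""
        uid = root
        for attr in dims[:-1]:
            table = self.hint_tables[uid]
            if attr not in table:          # hint table grows gradually
                table[attr] = self.new_node()
            uid = table[attr]
        leaf = self.hint_tables[uid]
        leaf[dims[-1]] = leaf.get(dims[-1], 0) + measure

    def query(self, root, dims):
        """Follow hint tables; return (value, number of nodes contacted)."""
        uid, hops = root, 1                # contacting the root is the first hop
        for attr in dims[:-1]:
            uid = self.hint_tables[uid][attr]
            hops += 1
        return self.hint_tables[uid][dims[-1]], hops

overlay = Overlay()
root = overlay.new_node()
overlay.insert(root, ("S1", "C1", "P1"), 10)
overlay.insert(root, ("S1", "C2", "P1"), 20)
value, hops = overlay.query(root, ("S1", "C1", "P1"))
```

In this sketch a query over d dimensions contacts exactly d nodes, which mirrors the "overlay path of d hops" property above.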

Advantages and Drawbacks

  • Store even larger amounts of data!

    • Dwarf reduces but may also blow up data

      • High-dimensional, sparse data: blow-up of more than 1,000 times

  • Handle many more requests

  • Query the system online

  • Accelerate creation (up to 5 times) and querying (up to 60 times)

    • Parallelization

  • Update remains costly

Time Series Dwarf (TSD)

  • A concept hierarchy characterizes time

    • and any other dimension

  • Updates are applied in temporal order

  • Temporal granularity of queries relative to the time of querying

    • More detailed queries for recent events

    • More coarse grained queries for past events

TSD Operations - Insertion

  • Time is the first dimension in the ordering

  • No ALL cell in the Time dimension

  • Aggregate created after completion of a level
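
The three insertion rules above can be sketched as follows. This is an illustration under stated assumptions, not the TSD implementation: the time hierarchy is reduced to (year, month), the cube is a flat dictionary, and the class `TimeSeriesCube` is a hypothetical name.

```python
ALL = "ALL"

class TimeSeriesCube:
    """Toy TSD-like store; keys are (year, month or ALL, product)."""

    def __init__(self):
        self.cells = {}
        self.last_time = None  # (year, month) of the latest tuple seen

    def insert(self, year, month, product, measure):
        """Insert one tuple; tuples must arrive in temporal order."""
        if self.last_time is not None:
            if (year, month) < self.last_time:
                raise ValueError("tuples must arrive in temporal order")
            if year > self.last_time[0]:
                # The previous year is complete: create its aggregate now.
                self._roll_up(self.last_time[0])
        self.last_time = (year, month)
        key = (year, month, product)
        self.cells[key] = self.cells.get(key, 0) + measure

    def _roll_up(self, year):
        """Create <year, ALL, product> cells from the monthly cells."""
        totals = {}
        for (y, m, p), value in self.cells.items():
            if y == year and m != ALL:
                totals[p] = totals.get(p, 0) + value
        for p, total in totals.items():
            self.cells[(year, ALL, p)] = total

cube = TimeSeriesCube()
cube.insert(2010, 1, "P1", 10)
cube.insert(2010, 2, "P1", 5)
pending = (2010, ALL, "P1") in cube.cells   # year 2010 not complete yet
cube.insert(2011, 1, "P1", 7)               # first 2011 tuple closes 2010
```

Note that no key ever has ALL in the first (year) position, and that the aggregate for a year appears only once a tuple of the following year arrives.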

TSD Operations - Querying

  • Follow path along the structure

  • Roll-up query for aggregate already created

    • Within d hops (e.g., <Y1, ALL, P1>)

  • Roll-up query for recent records

    • Initial query substituted by multiple lower level queries

      (e.g., <Y2, S1, P1>)
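
The two roll-up cases can be sketched as below, assuming a flat dictionary of cells keyed by (year, month or ALL, product). The function `roll_up_query` and the key layout are hypothetical; the lookup count it returns is only a rough stand-in for message cost.

```python
ALL = "ALL"

def roll_up_query(cells, year, product):
    """Answer <year, ALL, product>; return (value, number of lookups)."""
    key = (year, ALL, product)
    if key in cells:
        # Aggregate already created: a single path answers the query.
        return cells[key], 1
    # Aggregate not created yet (recent data): substitute the query
    # by one lower-level query per existing monthly cell.
    parts = [value for (y, m, p), value in cells.items()
             if y == year and p == product and m != ALL]
    return sum(parts), len(parts)

cells = {
    (2010, ALL, "P1"): 15,                          # completed year
    (2010, 1, "P1"): 10, (2010, 2, "P1"): 5,
    (2011, 1, "P1"): 7, (2011, 2, "P1"): 3,         # year still open
}
old = roll_up_query(cells, 2010, "P1")
recent = roll_up_query(cells, 2011, "P1")
```

The answer is the same kind of aggregate in both cases; only the number of lower-level queries differs.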

TSD Operations - Updating

  • Insertion of a new tuple

    • Longest common prefix with existing structure

    • Underlying nodes recursively updated

  • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected

    • Example: 3 TSD nodes vs. 12 Dwarf nodes affected
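
The longest-common-prefix step can be sketched as follows, assuming tuples are plain sequences of dimension values. The function name is hypothetical, and the sketch does not reproduce the node counts of the example above; it only shows where an update starts to diverge from the existing structure.

```python
def longest_common_prefix(existing, new_tuple):
    """Length of the longest prefix new_tuple shares with any stored tuple."""
    best = 0
    for stored in existing:
        shared = 0
        for a, b in zip(stored, new_tuple):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best

existing = [("Y1", "S1", "P1"), ("Y1", "S2", "P1")]
depth = longest_common_prefix(existing, ("Y1", "S2", "P2"))
fresh = longest_common_prefix(existing, ("Y2", "S1", "P1"))
```

Levels up to the returned depth already exist and are reused; the levels below it are the ones that have to be created or updated.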

Adaptive Materialization

  • A daemon process asynchronously

    • creates roll-up views

    • deletes corresponding drill-down ones

  • The period of this process depends on the application

  • Tradeoff: cube size vs. response accuracy
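
One pass of such a daemon could look like the sketch below. It assumes the same flat (year, month or ALL, product) key layout as a simplification, and the function `materialization_pass` and the `cutoff_year` parameter are hypothetical; scheduling the pass asynchronously is left out.

```python
ALL = "ALL"

def materialization_pass(cells, cutoff_year):
    """Roll up every year older than cutoff_year and drop its monthly cells.

    Returns the number of drill-down cells that were deleted.
    """
    old_keys = [k for k in cells if k[0] < cutoff_year and k[1] != ALL]
    totals = {}
    for year, month, product in old_keys:
        value = cells.pop((year, month, product))   # delete drill-down view
        agg_key = (year, ALL, product)
        totals[agg_key] = totals.get(agg_key, 0) + value
    cells.update(totals)                            # create roll-up views
    return len(old_keys)

cells = {(2009, 1, "P1"): 4, (2009, 2, "P1"): 6, (2010, 1, "P1"): 1}
removed = materialization_pass(cells, cutoff_year=2010)
```

After the pass the cube is smaller, but a month-level query for 2009 can no longer be answered exactly, which is the size versus accuracy tradeoff named above.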

Experimental Evaluation

  • 25 LAN commodity nodes (dual core, 2.0 GHz, 4 GB main memory)

  • Synthetic and real datasets

    • APB-1 Benchmark generator

      • 4-d, 3 levels for Time, various densities

    • DARPA Intrusion Detection audit data

      • 1M tuples, 7-d, 3 levels for Time

  • TSD: static mode

  • TSDad: adaptive mode

Cube Construction

  • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)

    • Lack of the ALL cell in the first dimension

  • Acceleration of cube creation up to 89% compared to Dwarf

    • Better use of resources through parallelization, as in Brown Dwarf (BD)

    • Further reduction due to lack of ALL and selective materialization



Updates

  • 10k updates

  • TSD up to 3 times faster than Dwarf and 30% faster than BD

    • Ordered updates – do not affect already created views

    • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)

  • TSDad does not include roll-up view creation (done asynchronously) → further acceleration of about 20%



Queries

  • DARPA 10k datasets – 3 kinds of querysets, 50% aggregates

    • Q1: Ideal

    • Q2: Recent records are queried in more detail (Zipfian)

    • Q3: Random

  • As the queryset approaches a uniform distribution

    • Message cost increases

    • Accuracy decreases


