Efficient updates for a shared nothing analytics platform

Katerina Doka, Dimitrios Tsoumakos, Nectarios Koziris

{katerina, dtsouma, nkoziris}@cslab.ece.ntua.gr

Computing Systems Laboratory

National Technical University of Athens


Motivation

  • Large volumes of data

    • Everyday life, science and business domain

  • Time-series data

    • Temporally ordered, organized in hierarchies (Day<Month<Year)

      • E.g., date of a credit card purchase, time of a phone call

    • Important for monitoring a process of interest

  • On-line processing

    • Fast retrieval – Point, range, aggregate queries

    • Detection of real time changes in trends

      • Intrusion or DoS detection, effects of a product promotion

    • Online, cost-efficient updates


Up till now

  • Data Warehouses

    • Centralized, off-line approaches

    • Distributed warehousing systems

      • Functionality remains centralized

  • Distributed Warehouse-like initiative: Brown Dwarf

    • Distribution of centralized Dwarf

    • Deployed on shared-nothing, commodity hardware

      • Scalability, fault tolerance, performance

    • No special consideration for time-series data

    • Update procedure costly → unfit for frequent updates


Our Goals

  • Cloud-based, data-warehouse-like system

    • Targeted to time-series data

      • Arriving at high rate

    • Store, update, query data at various granularity levels

      • Multidimensional, hierarchical

    • Shared nothing architecture

      • Commodity nodes

    • Without use of any proprietary tool

      • Java libraries, socket APIs


Our Contribution

  • Complete system for multidimensional time-series data

    • Store with one pass

    • Update online

    • Query efficiently

      • Point, aggregate

      • Various levels of granularity

  • Adaptive materialization

    • According to data recency

    • Accelerate cube creation/update

    • Minimize storage consumption


Dwarf

  • Dwarf computes, stores, indexes and updates materialized cubes

  • Eliminates prefix and suffix redundancies

  • Any query (point or aggregate) is answered by traversing the structure
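
To make "answered through traversal" concrete, here is a minimal sketch of a cube stored as a tree with one level per dimension and an ALL cell at every node. It is a simplification assumed for illustration, not the Dwarf structure itself: shared prefixes are stored once, but suffix coalescing is not implemented. The names `MiniCube`, `insert` and `query` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified cube: one tree level per dimension plus an ALL cell per node.
// Shared prefixes are stored once; Dwarf's suffix coalescing is NOT implemented here.
class MiniCube {
    static final String ALL = "ALL";

    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        long measure; // aggregated measure, meaningful only at the last level
    }

    private final int dims;
    private final Node root = new Node();

    MiniCube(int dims) {
        this.dims = dims;
    }

    // Adds the measure along every path that takes either the tuple's value or ALL
    // in each dimension. The tuple must have exactly 'dims' values.
    void insert(String[] tuple, long measure) {
        insert(root, tuple, 0, measure);
    }

    private void insert(Node node, String[] tuple, int level, long measure) {
        if (level == dims) {
            node.measure += measure;
            return;
        }
        Node valueChild = node.children.computeIfAbsent(tuple[level], k -> new Node());
        insert(valueChild, tuple, level + 1, measure);
        Node allChild = node.children.computeIfAbsent(ALL, k -> new Node());
        insert(allChild, tuple, level + 1, measure);
    }

    // Point and aggregate queries both follow a single root-to-leaf path.
    // Returns 0 when the requested path is not stored.
    long query(String... key) {
        Node node = root;
        for (String k : key) {
            node = node.children.get(k);
            if (node == null) {
                return 0;
            }
        }
        return node.measure;
    }
}
```

In this sketch a point query such as `query("S1", "C1", "P1")` and an aggregate query such as `query("S1", "ALL", "P1")` cost the same number of steps, one per dimension.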


Brown Dwarf

  • Dwarf nodes mapped to overlay nodes

    • UID for each node

    • Hint tables of the form (currAttr, child)

  • Insertion

    • One-pass over the fact table

    • Gradual construction of hint tables

  • Queries

    • Overlay path of d hops

  • Incremental Updates

  • Elasticity through adaptive mirroring
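
The hint-table idea can be sketched in a few lines. The example below is an in-memory stand-in, assumed for illustration only: a map from UID to node replaces the overlay, whereas in the system described here each node would live on a different peer and every hop would be a network message. ALL cells, mirroring and incremental updates are left out, and the names `OverlaySketch`, `route` and `measureAt` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the overlay. Each Dwarf node has a UID and a
// hint table of (currAttr -> child UID). Real nodes would sit on different peers.
class OverlaySketch {
    private static final class DwarfNode {
        final Map<String, Integer> hintTable = new HashMap<>(); // currAttr -> child UID
        long measure; // used only by terminal nodes
    }

    private final Map<Integer, DwarfNode> nodesByUid = new HashMap<>();
    private int nextUid = 0;
    private final int rootUid = createNode();

    private int createNode() {
        int uid = nextUid++;
        nodesByUid.put(uid, new DwarfNode());
        return uid;
    }

    // One pass over a fact-table row: hint tables grow gradually along the row's path.
    void insert(List<String> row, long measure) {
        int uid = rootUid;
        for (String attr : row) {
            uid = nodesByUid.get(uid).hintTable.computeIfAbsent(attr, a -> createNode());
        }
        nodesByUid.get(uid).measure += measure;
    }

    // Returns the UIDs visited after the root. A stored key of d attributes takes d hops;
    // the list is shorter if a hint is missing along the way.
    List<Integer> route(List<String> key) {
        List<Integer> hops = new ArrayList<>();
        int uid = rootUid;
        for (String attr : key) {
            Integer child = nodesByUid.get(uid).hintTable.get(attr);
            if (child == null) {
                return hops;
            }
            hops.add(child);
            uid = child;
        }
        return hops;
    }

    long measureAt(int uid) {
        return nodesByUid.get(uid).measure;
    }
}
```

Two rows that share a prefix share the first UIDs on their routes, which is the prefix elimination carried over from Dwarf.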


Advantages and Drawbacks

  • Store even larger amounts of data!

    • Dwarf reduces data size but may also blow it up

      • High-dimensional, sparse data: > 1,000 times

  • Handle many more requests

  • Query the system online

  • Accelerate creation (up to 5 times) and querying (up to 60 times)

    • Parallelization

  • Update remains costly


Time Series Dwarf (TSD)

  • A concept hierarchy characterizes time

    • and any other dimension

  • Updates are applied in temporal order

  • Temporal granularity of queries relative to the time of querying

    • More detailed queries for recent events

    • More coarse grained queries for past events


TSD Operations - Insertion

  • Time comes first in the dimension ordering

  • No ALL cell in the Time dimension

  • An aggregate is created once its level is complete
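
The three insertion rules above can be sketched as follows, assuming a two-level Time hierarchy (Day < Month) and dates written as `yyyy-MM-dd`. Both are assumptions made for the example, as are the names `TimeOrderedInserter`, `hasMonthAggregate` and `monthAggregate`. The sketch keeps only the Time dimension and a measure, so it shows the timing of aggregate creation rather than the full structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: Time is the leading dimension, tuples arrive in temporal order,
// and a month's roll-up is created only after that month is complete.
// No ALL cell is kept over Time, so no grand total is maintained.
class TimeOrderedInserter {
    private final Map<String, Long> dayCells = new LinkedHashMap<>();   // yyyy-MM-dd -> measure
    private final Map<String, Long> monthCells = new LinkedHashMap<>(); // yyyy-MM -> measure
    private String openMonth = null;
    private long openMonthSum = 0;

    // 'day' uses the assumed format yyyy-MM-dd; tuples must arrive in temporal order.
    void insert(String day, long measure) {
        String month = day.substring(0, 7);
        if (openMonth != null && month.compareTo(openMonth) < 0) {
            throw new IllegalArgumentException("tuples must arrive in temporal order");
        }
        if (openMonth != null && !month.equals(openMonth)) {
            // The previous level instance is complete: create its aggregate now.
            monthCells.put(openMonth, openMonthSum);
            openMonthSum = 0;
        }
        openMonth = month;
        openMonthSum += measure;
        dayCells.merge(day, measure, Long::sum);
    }

    boolean hasMonthAggregate(String month) {
        return monthCells.containsKey(month);
    }

    long monthAggregate(String month) {
        return monthCells.getOrDefault(month, 0L);
    }

    long dayValue(String day) {
        return dayCells.getOrDefault(day, 0L);
    }
}
```

Because earlier months are never revisited, an insertion touches only the cells of the month that is still open.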


TSD Operations - Querying

  • Follow path along the structure

  • Roll-up query for aggregate already created

    • Within d hops (e.g., <Y1, ALL, P1>)

  • Roll-up query for recent records

    • Initial query substituted by multiple lower level queries

      (e.g., <Y2, S1, P1>)
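
The two query cases can be illustrated with a small sketch, assuming a Month < Year hierarchy and keys written as `yyyy` and `yyyy-MM`. Those conventions, and the names `RollUpQuerySketch`, `putYear`, `putMonth` and `queryYear`, are hypothetical. When the year aggregate exists, one lookup is enough; otherwise the query is replaced by one lookup per month.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of roll-up query substitution over a Month < Year hierarchy.
class RollUpQuerySketch {
    private final Map<String, Long> yearCells = new HashMap<>();  // yyyy -> aggregate, completed years only
    private final Map<String, Long> monthCells = new HashMap<>(); // yyyy-MM -> aggregate

    void putYear(String year, long value) {
        yearCells.put(year, value);
    }

    void putMonth(String month, long value) {
        monthCells.put(month, value);
    }

    // Returns a two-element array: {result, number of lookups issued}.
    long[] queryYear(String year) {
        Long ready = yearCells.get(year);
        if (ready != null) {
            return new long[] {ready, 1}; // aggregate already created: a single lookup
        }
        // Recent data: substitute the year query with twelve month-level queries.
        long sum = 0;
        long lookups = 0;
        for (int m = 1; m <= 12; m++) {
            String month = String.format("%s-%02d", year, m);
            sum += monthCells.getOrDefault(month, 0L);
            lookups++;
        }
        return new long[] {sum, lookups};
    }
}
```

The lookup count stands in for message cost: roll-up queries on recent data cost more because they fan out into lower-level queries.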


TSD Operations - Updating

  • Insertion of a new tuple

  • Longest common prefix with existing structure

  • Underlying nodes recursively updated

  • Lack of ALL cell for Time + temporal ordering = fewer existing cells affected

  • Example: 3 TSD nodes vs. 12 Dwarf nodes affected


Adaptive Materialization

  • A daemon process asynchronously

    • creates roll-up views

    • deletes corresponding drill-down ones

  • The period of this process depends on the application

  • Tradeoff: cube size vs. response accuracy
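
A minimal sketch of such a daemon, assuming a Day < Month hierarchy and `yyyy-MM-dd` keys. The class `AdaptiveMaterializer`, its methods and the cutoff supplier are hypothetical names introduced for the example. The roll-up step is an ordinary method, so it can also be called directly.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: a background task rolls old day cells up into month views
// and deletes the day cells, trading drill-down detail for space.
class AdaptiveMaterializer {
    private final TreeMap<String, Long> dayCells = new TreeMap<>();   // yyyy-MM-dd -> measure
    private final TreeMap<String, Long> monthViews = new TreeMap<>(); // yyyy-MM -> measure

    synchronized void insert(String day, long measure) {
        dayCells.merge(day, measure, Long::sum);
    }

    // Rolls every day strictly before firstDetailedDay up into its month view and
    // removes the day cell. The cutoff is assumed to fall on a month boundary.
    synchronized void rollUpOlderThan(String firstDetailedDay) {
        Iterator<Map.Entry<String, Long>> it =
                dayCells.headMap(firstDetailedDay).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            monthViews.merge(e.getKey().substring(0, 7), e.getValue(), Long::sum);
            it.remove(); // delete the corresponding drill-down cell
        }
    }

    // Starts the asynchronous daemon; the period is chosen per application.
    ScheduledExecutorService start(Supplier<String> cutoff, long period, TimeUnit unit) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "materializer");
            t.setDaemon(true);
            return t;
        });
        exec.scheduleAtFixedRate(() -> rollUpOlderThan(cutoff.get()), period, period, unit);
        return exec;
    }

    synchronized boolean hasDay(String day) {
        return dayCells.containsKey(day);
    }

    synchronized long monthView(String month) {
        return monthViews.getOrDefault(month, 0L);
    }
}
```

After a roll-up, a day-level query on old data can no longer be answered exactly from this structure, which is the accuracy side of the tradeoff.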


Experimental Evaluation

  • 25 commodity nodes on a LAN (dual-core, 2.0 GHz, 4 GB main memory)

  • Synthetic and real datasets

    • APB-1 Benchmark generator

      • 4-d, 3 levels for Time, various densities

    • DARPA Intrusion Detection audit data

      • 1M tuples, 7-d, 3 levels for Time

  • TSD: static mode

  • TSDad: adaptive mode


Cube Construction

  • Noticeable reduction of cube size for TSD, impressive for TSDad (up to 85% for the APB dataset)

    • Lack of the ALL cell in the first dimension

  • Acceleration of cube creation up to 89% compared to Dwarf

    • Better use of resources through parallelization (as in Brown Dwarf, BD)

    • Further reduction due to lack of ALL and selective materialization


Updates

  • 10k updates

  • TSD up to 3 times faster than Dwarf and 30% faster than BD

    • Ordered updates – do not affect already created views

    • No recursive updates for ALL cell of first dimension → smaller communication overhead (3-fold reduction)

  • TSDad does not include roll-up view creation (asynchronous) → further acceleration of about 20%


Queries

  • DARPA 10k datasets – 3 kinds of querysets, 50% aggregates

    • Q1: Ideal

    • Q2: Recent records are queried in more detail (Zipfian)

    • Q3: Random

  • As the queryset approaches a uniform distribution

    • Message cost increases

    • Accuracy decreases
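
One way a recency-skewed queryset like Q2 could be generated is sketched below. The slides do not say how the querysets were produced, so this is an assumption throughout: ranks are drawn from a Zipf distribution, rank 1 stands for the most recent record, and the exponent and seed are free parameters. The name `ZipfRecencySampler` is hypothetical.

```java
import java.util.Random;

// Hypothetical sketch: draws recency ranks from a Zipf distribution, so that
// recent records (low ranks) are queried far more often than old ones.
class ZipfRecencySampler {
    private final double[] cdf;
    private final Random random;

    ZipfRecencySampler(int n, double exponent, long seed) {
        cdf = new double[n];
        double norm = 0;
        for (int k = 1; k <= n; k++) {
            norm += 1.0 / Math.pow(k, exponent);
        }
        double acc = 0;
        for (int k = 1; k <= n; k++) {
            acc += (1.0 / Math.pow(k, exponent)) / norm;
            cdf[k - 1] = acc;
        }
        cdf[n - 1] = 1.0; // guard against floating-point rounding
        random = new Random(seed);
    }

    // Probability of drawing the given rank (1-based).
    double probability(int rank) {
        return rank == 1 ? cdf[0] : cdf[rank - 1] - cdf[rank - 2];
    }

    // Returns a rank in [1, n]; rank 1 is the most recent record.
    int nextRank() {
        double u = random.nextDouble();
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) { // binary search for the first index with cdf >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo + 1;
    }
}
```

With an exponent of 0 every rank is equally likely, which corresponds to the uniform case in which message cost rises and accuracy drops.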


Questions

