
Scalable Data-intensive Analytics

Meichun Hsu

Intelligent Information Management Lab

HP Labs

August 24, 2008

Joint work with Qiming Chen, Bin Zhang, Ren Wu

Outline

Introduction

Illustrated Computation Patterns

Groupby with User-defined Aggregate

Table Function over Graph-structured data

Summary and other on-going work

Challenges in BI Analytics

[Architecture diagram: OLTP systems, sensors, external feeds, and web content land in files / tables; data transformation analytics feed a massively parallel data warehouse; business operational analytics deliver BI services to a broad base of users / applications]
  • Scaling of data-intensive analytics components has not kept pace
  • Plus new challenges:
    • Bigger and bigger data sets
    • More and more complex transformation and analysis
    • Demand for near real-time responses to enable Operational BI (OpBI)
  • Data-intensive transformation and operational analytics are increasingly recognized as the bottleneck

“In fact, it’s THE bottleneck in most VLDW/VLDB and very large data integration systems.”-

Challenges in BI Analytics

(regarding a media mix problem)…“The result of a non-linear model of promotional lift as a function of media spend by channel, some coupon-related variables for each store is outstanding in terms of fit.  The bad news is that generating the coefficients using our application server and SPSS takes about two weeks of CPU time…. is this the type of problem we can throw at a parallel DB on…? “

- Director, Research and Analysis, BonTon, April 2008

“With the vast amounts of data growing, we have realized the fact that we often have to move data across networks for analysis. It's actually going to be better if we can stay inside the database and move some of our computations down to the individual nodes on a (parallel data warehouse) box.”

- Jim Goodnight, founder and CEO of SAS, October 2007

Available Parallelism Grows Exponentially

[Chart: number of hardware threads per processor over time, roughly 1990 to 2005, with the 80486, Power4, and UltraSPARC T2 as milestones]
  • How will trends in multicores ease or exacerbate bottlenecks in current transformation/analytics components?
  • Will 100’s of cores in a server, and 10,000s of cores in a scale-out parallel data warehouse, present an opportunity?

Courtesy: Anastasia Ailamaki, 2008

Implications

Opportunity to design Massively Data-Parallel Analytics Layer to dramatically improve end-to-end BI performance with

  • enhanced software parallelism, to take better advantage of the explosion of hardware threads
  • enhanced data locality, to better utilize limited memory and data bandwidth
Parallel Query Engine vs Google’s Map Reduce
  • Both are elegant and successful parallel processing models
  • Parallel Query Engine
    • Rich patterns of execution (pipelining, composition, multiple source sets, integration with schema mgmt, to name a few)
    • Focused on built-in query operators; UDF as an exception
  • Google’s Map Reduce
    • Limited patterns of execution
    • Focused more on supporting user-supplied programs
Approach to Scalable Analytics for BI
  • Integrates high-performance parallel computation with parallel query processing
    • Leverage SQL’s schema management and declarative query language
    • Fuse declarative data access with computation functions in a scale-out shared-nothing parallel processing infrastructure
    • Create a highly parallel data flow-oriented infrastructure for data-intensive analytics


Research Issues
  • Richer dependency semantics for UDF and flexibility for UDF optimization, e.g.
    • GroupBy with User-defined Aggregate
    • Structuring the computation - taking into account derivation and side effects
  • High performance implementation of parallel processing primitives
    • Efficient management of memory hierarchy in new architectures – e.g. multicore for high performance analytics
    • UDF execution environment – process boundaries, data flow considerations in new hybrid cluster environments
  • Enhance composability of User Defined Functions (UDFs)
    • Express a “process flow” using UDFs for ETL, information extraction, and information derivation

[Diagram: parallel query plan and execution. The plan/spec is a tree with a root, Esp exchange operators providing independent parallelism, and UDF, group-by, and file-scan operators running with partitioned parallelism. At execution time an ODBC client drives multiple ESPs, each invoking UDF instances over DP2 processes (DP21, DP22).]

Outline

Introduction

Illustrated Computation Patterns

GroupBy with User-defined Aggregate

Table Function over Graph-structured data

Summary and other on-going work

UDFs with GroupBy Example – K-Means Clustering

Given data points x and cluster centers C_1 … C_K, find the positions of the centers that minimize

Σ_k Σ_{x in cluster k} ‖x − C_k‖²

K-means algorithm:

Init: start from an initial position of the centers.

Assign_Center: assign each data point to the closest center.

Recalculate_Centers: recompute each center as the geometric center of its cluster.

Iterate until no change happens.
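For concreteness, here is a minimal single-node sketch of this loop (illustrative only; the names assign_center and recalculate_centers mirror the step names above but the code is not from the presentation):

import numpy as np

def assign_center(points, centers):
    # For each data point, the index of the closest center (squared Euclidean distance).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def recalculate_centers(points, labels, old_centers):
    # Geometric center (mean) of each cluster; an empty cluster keeps its old center.
    return np.array([points[labels == k].mean(axis=0) if (labels == k).any()
                     else old_centers[k] for k in range(len(old_centers))])

def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        labels = assign_center(points, centers)
        new_centers = recalculate_centers(points, labels, centers)
        if np.allclose(new_centers, centers):   # iterate until no change happens
            return new_centers, labels
        centers = new_centers
    return centers, labels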

Map-Reduce K-means

[Flow diagram: init centroids → assign cluster → calculate new centroids → done?, repeated until convergence, with the question "push down to a parallel database?"]

Execution:

(1) Map: assign a center to each data point, c(x) = argmin_k ‖x − C_k‖²; the intermediate key-value pair is [cluster_id, x]

(2) Hash-distribute by cluster_id

(3) Reduce: aggregate per cluster_id; on each reduce partition p, compute the sufficient statistics S_p[k] = Σ x[k], Q_p[k] = Σ x[k]², N_p[k] = Σ 1

(4) Recalculate centroids: C[k] = Σ_p S_p[k] / Σ_p N_p[k], giving the answer
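A pure-Python rendering of one such iteration may help; it follows the formulas above, but the function names (map_fn, reduce_fn) and the in-memory shuffle are illustrative assumptions, not any particular MapReduce framework's API:

from collections import defaultdict

def map_fn(x, centers):
    # Emit the intermediate key-value pair [cluster_id, x]: c(x) = argmin_k ||x - C_k||^2.
    cid = min(range(len(centers)),
              key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[k])))
    return cid, x

def reduce_fn(cid, points):
    # Per-cluster sufficient statistics: S = sum(x), Q = sum(x^2), N = count.
    dim = len(points[0])
    S = [sum(p[d] for p in points) for d in range(dim)]
    Q = [sum(p[d] ** 2 for p in points) for d in range(dim)]
    return cid, (S, Q, len(points))

def kmeans_iteration(data, centers):
    groups = defaultdict(list)                     # (2) hash-distribute by cluster_id
    for x in data:                                 # (1) map
        cid, point = map_fn(x, centers)
        groups[cid].append(point)
    stats = dict(reduce_fn(cid, pts) for cid, pts in groups.items())   # (3) reduce
    # (4) recalculate centroids: C[k] = S[k] / N[k]; an empty cluster keeps its old centroid.
    return [[s / stats[k][2] for s in stats[k][0]] if k in stats else list(centers[k])
            for k in range(len(centers))]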

K-Means by UDFs

[Flow: (1) UDF assign_center assigns a center to each data point; (2) hash-distribute by cluster_id; (3) UDFs ssl, ssq, count aggregate per cluster_id; (4) recalculate centroids, giving the answer]

  • UDF assign_cluster (datapoint, k_centroids)
    • For each datapoint, compute its distances to all centroids in k_centroids and assign it to the cluster with the closest centroid
  • UDF ssl (datapoint), ssq (datapoint)
    • Aggregate each data point to produce the sufficient statistics ssl and ssq

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id

Parallelism in Aggregate UDFs

ssl( ): aggregate function {init(); iterate(); final(); merge();}

SELECT oid, ssl(x)
FROM r

[Diagram: the data is partitioned; on each partition ssl.iterate() is applied tuple-wise as a local aggregate, and the partial results are assembled with ssl.merge()]
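A sketch of what the ssl aggregate's contract could look like, assuming a simple Python-style interface (the presentation does not give the engine's actual UDF API); ssl is taken here to be a per-dimension running sum, matching its later use as a sufficient statistic:

class Ssl:
    # Partition-parallel aggregate following the init/iterate/merge/final contract.
    def init(self):
        self.sums = None                    # per-dimension running sums

    def iterate(self, x):
        # Fold one tuple into the local aggregate.
        self.sums = list(x) if self.sums is None else [s + xi for s, xi in zip(self.sums, x)]

    def merge(self, other):
        # Combine a partial aggregate computed on another partition.
        if other.sums is not None:
            self.sums = list(other.sums) if self.sums is None else \
                        [a + b for a, b in zip(self.sums, other.sums)]

    def final(self):
        return self.sums

def parallel_ssl(partitions):
    # Local aggregates per data partition, then a global merge.
    local_aggs = []
    for part in partitions:                 # in the engine these run in parallel
        agg = Ssl()
        agg.init()
        for x in part:
            agg.iterate(x)                  # tuple-wise apply
        local_aggs.append(agg)
    global_agg = Ssl()
    global_agg.init()
    for agg in local_aggs:
        global_agg.merge(agg)               # assemble results
    return global_agg.final()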

Parallel Aggregate UDF Plan in K-means

[Diagram: data points are hash-distributed by cluster_id; the receiving nodes apply ssl.iterate() tuple-wise and assemble the results with ssl.merge(), then recalculate the centroids to get the answer]

But: this plan is very high in communication overhead

Efficient Parallel Computation Plan?

[Diagram contrasting two plans. Left, as above: (1) Map assigns a center to each data point; (2) hash-distribute by cluster_id; (3) Reduce aggregates per cluster_id with ssl.iterate() then ssl.merge(); (4) recalculate centroids to get the answer. Right: each partition performs {assign, sums} locally, so the "iterate" stage of Reduce runs where the data lives and only the "merge" stage of Reduce combines results across partitions.]

Pushing UDF with GroupBy down to the Partition Level

SELECT cluster_id, s = ssl(x), q = ssq(x), count(*)
FROM (SELECT INTO temp x, cluster_id = assign_cluster(x, kc) FROM r)
GROUP BY cluster_id

[Flow: init centroids; for each partition, assign clusters and aggregate locally per partition per cluster; combine globally per cluster; recalculate centroids; repeat until done]

  • Each local aggregate returns a table with one row per group-by value; these sufficient statistics are much smaller than the data set itself
  • The local rows are combined at the global level for each group-by value
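To make the pushdown concrete, here is an illustrative sketch (the names local_aggregate and global_combine are mine, not the slides'): each partition emits only one small row of sufficient statistics per cluster, and only those rows are combined globally.

from collections import defaultdict

def local_aggregate(partition, centers):
    # Aggregate-local per partition per cluster: one row (S, Q, N) per cluster_id.
    dim = len(centers[0])
    rows = defaultdict(lambda: ([0.0] * dim, [0.0] * dim, 0))
    for x in partition:
        cid = min(range(len(centers)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centers[k])))
        S, Q, N = rows[cid]
        rows[cid] = ([s + xi for s, xi in zip(S, x)],
                     [q + xi * xi for q, xi in zip(Q, x)],
                     N + 1)
    return dict(rows)

def global_combine(local_rows, centers):
    # Combine-global per cluster: only these small per-cluster rows cross the network.
    dim = len(centers[0])
    total = defaultdict(lambda: ([0.0] * dim, [0.0] * dim, 0))
    for rows in local_rows:
        for cid, (S, Q, N) in rows.items():
            TS, TQ, TN = total[cid]
            total[cid] = ([a + b for a, b in zip(TS, S)],
                          [a + b for a, b in zip(TQ, Q)],
                          TN + N)
    # Recalculate centroids; a cluster with no points keeps its old centroid.
    return [[s / total[k][2] for s in total[k][0]] if total[k][2] > 0 else list(centers[k])
            for k in range(len(centers))]

Compared with shipping every (cluster_id, x) pair across the network, only a handful of rows per partition move between nodes.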

Outline

Introduction

Illustrated Computation Patterns

Groupby with User-defined Aggregate

Table Function over Graph-structured data

Summary and other on-going work

Analytics over a Structured Data Set: An Example - A Prediction System for Water Resources

[Figure: a river network, the model of one river segment with its upstream segments and downstream segment, and the hydraulic dynamics governing each segment]

The condition at a segment at time t depends on its own properties at time t and on the conditions at its upstream segments at time t-1, calculated based on hydraulic dynamics.
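One hedged way to write this dependency in symbols (notation mine, not from the slides): with s_i(t) the predicted state of segment i at time step t and p_i(t) its own properties at t,

s_i(t) = F( p_i(t), { s_j(t-1) : j is an upstream segment of i } )

where F encodes the hydraulic dynamics. Each segment therefore needs the previous-step results of all of its upstream segments before it can be computed, which is what motivates the traversal-ordered UDF evaluation on the following slides.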
Computation involves multi-dimensional dependencies (spatial, temporal)

[Figure: a small river-segment tree (segments a through f), with downstream toward the root and upstream toward the leaves, unrolled over time steps t1 and t2 to show the spatial and temporal dependencies]

  • Output – predicted properties of all river segments
    • Time series, for all segments
      • Water level
      • Volume
      • Flow velocity
      • Flow and sand discharge
  • Input – a table of all river segments
      • Geometric & environmental parameters
      • Topology tree
      • Rainfall, weather sensor data
  • Precipitation
  • Evaporation
  • Runoff yield
  • Soil erosion

River segment tree: millions of river segments; tens of thousands of time intervals

UDFs generally cannot be applied to tuples structured as a graph
  • Each tuple is processed independently of the other tuples in the set
  • For a tree-structured data set, UDFs need to be applied in a specific order

[Diagram: a river-segment tree with segments P0, P1, P2, C11, C12, C13, C21, C22; an aggregate UDF runs as Bar.iterate() on each partition followed by Bar.merge(), while the table function hydro() is applied per segment]

Parallelization of hydro()

hydro( ): table function

SELECT *
FROM river CROSS APPLY hydro(*)

Graph traversal in SQL

[Diagram: the river-segment tree P0, P1, P2, C11, C12, C13, C21, C22, shown once per traversal]

Pre-order traversal:

SELECT *
FROM river
CONNECT BY PRIOR sid = parent_sid
START WITH sid = 'P0'

Post-order traversal:

SELECT name, sid, parent_sid
FROM river
CONNECT BY sid = PRIOR parent_sid
START WITH sid = 'C12'

Extend UDF with Traversal Control Forms for graph-structured computation
  • Apply a UDF f() to tree-structured data objects in post order

SELECT *
FROM river CROSS APPLY hydro(*)
CONNECT BY sid = PRIOR ALL parent_sid
START WITH is_leaf = "yes"

hydro( ): table function { }

[Diagram: the river-segment tree P0, P1, P2, C11, C12, C13, C21, C22]

Semantics: apply hydro() starting with the leaf river segments; then apply it to a non-leaf segment only after all of its upstream segments have been processed.
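A minimal sketch of these semantics outside the engine, assuming an in-memory table keyed by sid with a parent_sid link to the downstream segment and a hypothetical hydro() callback (both are assumptions for illustration):

from collections import defaultdict, deque

def apply_upstream_first(segments, hydro):
    # segments: dict sid -> row, where row['parent_sid'] is the downstream segment (None at the root).
    children = defaultdict(list)                   # downstream sid -> upstream sids
    for sid, row in segments.items():
        if row['parent_sid'] is not None:
            children[row['parent_sid']].append(sid)
    remaining = {sid: len(children[sid]) for sid in segments}     # unfired upstream count
    ready = deque(sid for sid, n in remaining.items() if n == 0)  # START WITH is_leaf = "yes"
    results = {}
    while ready:
        sid = ready.popleft()
        row = segments[sid]
        upstream_results = [results[c] for c in children[sid]]
        results[sid] = hydro(row, upstream_results)               # apply the table function
        parent = row['parent_sid']
        if parent is not None:
            remaining[parent] -= 1
            if remaining[parent] == 0:             # all upstream segments have been applied
                ready.append(parent)
    return results

The extended CONNECT BY ... PRIOR ALL clause above expresses the same ordering declaratively.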

Parallel Processing Strategy for Graph-structured Computation

[Diagram: a large river-segment tree whose nodes are assigned to P-Levels 0 through 8 and then cut into connected-subgraph partitions, e.g. partition '0' at level 3, partition '000' at level 2, partition '001' at level 0, partition '00000' at level 1, partition '000000' at level 0; the partitions are distributed across servers]

1. Leveling: assign each node a processing level (P-Level)

2. Partition the graph into connected subgraphs, keeping track of the metadata of each partition

3. Distribution: distribute partitions to servers with load balancing based on the size and levels of the partitions

4. Compute in parallel: each server sorts its data appropriately and processes tuples in sort order, recording metadata for parent firing and for transmitting computed tuples to other servers
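A sketch of steps 1 and 3 under stated assumptions: P-Level is taken here as the height above the leaves (leaves at level 0, each node one above its highest upstream segment), which matches the firing order but is not spelled out on the slide, and distribution is reduced to a simple greedy assignment by partition size.

from collections import defaultdict
import heapq

def p_levels(segments):
    # segments: dict sid -> parent_sid (downstream link); returns dict sid -> P-Level.
    children = defaultdict(list)
    for sid, parent in segments.items():
        if parent is not None:
            children[parent].append(sid)
    level = {}
    def walk(sid):                                   # recursive for brevity in this sketch
        if sid not in level:
            kids = children[sid]
            level[sid] = 0 if not kids else 1 + max(walk(c) for c in kids)
        return level[sid]
    for sid in segments:
        walk(sid)
    return level

def distribute(partitions, n_servers):
    # partitions: dict name -> size; greedy load balancing, largest partitions first.
    heap = [(0, s) for s in range(n_servers)]        # (current load, server id)
    heapq.heapify(heap)
    placement = {}
    for name, size in sorted(partitions.items(), key=lambda kv: -kv[1]):
        load, server = heapq.heappop(heap)
        placement[name] = server
        heapq.heappush(heap, (load + size, server))
    return placement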

Summary and on-going work
  • Illustrate how parallel query processing and map-reduce paradigms can be enriched for advanced scalable analytics
  • Explore primitives that allow explicit declaration of semantics and dependencies in analytic computation; these have the potential to:
    • Discover important patterns
    • Devise efficient parallelization support in infrastructure
  • Additional on-going investigations
    • Importance of shared-nothing principle in shared-memory (multicore) architecture
    • Hybrid clusters and paradigms for data flow among heterogeneous clusters
  • Goal: Combine general data flow-driven computing paradigm with data management infrastructure to achieve data-intensive analytics

Q&A

Thank You!