
Scalable Performance of System S for Extract-Transform-Load Processing

Toyotaro Suzumura, Toshihiro Yasue and Tamiya Onodera, IBM Research - Tokyo


Outline

  • Background and Motivation

  • System S and its suitability for ETL

  • Performance Evaluation of System S as a Distributed ETL Platform

  • Performance Optimization

  • Related Work and Conclusions


What is ETL?

ETL = Extraction + Transformation + Loading

  • Extraction: extract data from different, distributed data sources

  • Transformation: cleanse and customize the data according to business needs and rules while transforming it to match the data warehouse schema

  • Loading: load the data into the data warehouse

[Figure: ETL dataflow. Data is extracted from the data sources, transformed, and loaded into the data warehouse.]
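As a concrete illustration of the three stages, the following Python sketch runs a toy extract-transform-load pipeline; the file name, schema, and SQLite warehouse are hypothetical and are not part of the System S implementation discussed here.

import csv
import sqlite3

def extract(path):
    # Extraction: pull raw records from a data source (here, a CSV file)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def transform(rows):
    # Transformation: cleanse and reshape records to match the warehouse schema
    for row in rows:
        if not row.get("item"):          # drop malformed records
            continue
        yield (row["item"].strip().upper(), int(row["onhand"]))

def load(records, db="warehouse.db"):
    # Loading: append the transformed records into the warehouse table
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS inventory (item TEXT, onhand INTEGER)")
    con.executemany("INSERT INTO inventory VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("warehouse_20090901.csv")))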


Data Explosion in ETL

  • Data Explosion

    • The amount of data stored in a typical contemporary data warehouse may double every 12 to 18 months

  • Data Source Examples:

    • Logs for regulatory compliance (e.g., SOX)

    • POS (point-of-sale) transactions of retail stores (e.g., Wal-Mart)

    • Web data (e.g., internet auction sites such as eBay)

    • CDRs (Call Detail Records) used by telecom companies to analyze customer behavior

    • Trading data


Near-Real Time ETL

  • Given the data explosion, there is a strong need for ETL processing to be as fast as possible so that business analysts can quickly grasp trends in customer activity


Our Motivation:

  • Assess the applicability of System S, a data stream processing system, to ETL processing, considering both qualitative and quantitative ETL constraints

  • Thoroughly evaluate the performance of System S as a scalable and distributed ETL platform, to achieve "near-real-time ETL" and address the data explosion in the ETL domain


Outline

  • Background and Motivation

  • System S and its suitability for ETL

  • Performance Evaluation of System S as a Distributed ETL Platform

  • Performance Optimization

  • Related Work and Conclusions


Stream Computing and System S

  • System S: Stream Computing Middleware developed by IBM Research

  • System S is now productized as "InfoSphere Streams".

Traditional computing: fact finding with data-at-rest. Stream computing: insights from data in motion.


InfoSphere Streams Programming Model

[Figure: InfoSphere Streams programming model. Applications are written in SPADE, drawing on source adapters, an operator repository, and sink adapters, and go through platform-optimized compilation.]


SPADE: Advantages of Stream Processing as a Parallelization Model

  • A stream-centric programming language dedicated to data stream processing

  • Streams as first-class entities

  • Explicit task and data parallelism

  • An intuitive way to exploit multi-core and multi-node systems

  • Operator and data source profiling for better resource management

  • Reuse of operators across stored and live data

  • Support for user-customized operators (UDOPs)


A simple SPADE example

[Figure: operator graph of the example: Source -> Aggregate -> Functor -> Sink]

[Application]
SourceSink trace

[Nodepool]
nodepool np := ("host1", "host2", "host3")

[Program]
// virtual schema declaration
vstream Sensor (id : id_t, location : Double, light : Float, temperature : Float, timestamp : timestamp_t)

// a source stream is generated by a Source operator – in this case tuples come from an input file
stream SenSource( schemaof(Sensor) )
  := Source( ) [ "file:///SenSource.dat" ] {}
  -> node(np, 0)

// this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input
stream SenAggregator ( schemaof(Sensor) )
  := Aggregate( SenSource <count(100),count(1)> ) [ id . location ]
     { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) }
  -> node(np, 1)

// this intermediate stream is produced by a Functor operator
stream SenFunctor( id: Integer, location: Double, message: String )
  := Functor( SenAggregator ) [ log(temperature,2.0) > 6.0 ]
     { id, location, "Node "+toString(id)+" at location "+toString(location) }
  -> node(np, 2)

// result management is done by a Sink operator – in this case produced tuples are sent to a socket
Null := Sink( SenFunctor ) [ "udp://192.168.0.144:5500/" ] {}
  -> node(np, 0)
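For readers unfamiliar with SPADE, the following Python sketch mirrors the same dataflow (Source -> Aggregate -> Functor -> Sink). The field names come from the virtual schema above; the file format and the use of a tumbling rather than sliding 100-tuple window are simplifying assumptions.

from collections import defaultdict

def source(path="SenSource.dat"):
    # Source operator: read sensor tuples from a comma-separated file
    with open(path) as f:
        for line in f:
            i, loc, light, temp, ts = line.strip().split(",")
            yield {"id": int(i), "location": float(loc), "light": float(light),
                   "temperature": float(temp), "timestamp": float(ts)}

def aggregate(stream, window=100):
    # Aggregate operator: every <window> tuples, emit per-(id, location) summaries
    # (simplified to a tumbling window; the SPADE code above uses a sliding window)
    buf = []
    for t in stream:
        buf.append(t)
        if len(buf) == window:
            groups = defaultdict(list)
            for x in buf:
                groups[(x["id"], x["location"])].append(x)
            for (i, loc), xs in groups.items():
                yield {"id": i, "location": loc,
                       "light": max(x["light"] for x in xs),
                       "temperature": min(x["temperature"] for x in xs)}
            buf = []

def functor(stream):
    # Functor operator: keep only hot readings (log2(temperature) > 6, i.e. > 64)
    for t in stream:
        if t["temperature"] > 64.0:
            yield (t["id"], t["location"], f"Node {t['id']} at location {t['location']}")

for msg in functor(aggregate(source())):   # Sink: here, simply print the tuples
    print(msg)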


InfoSphere Streams Runtime

[Figure: the Streams Data Fabric transport spans heterogeneous hosts (x86 blades and boxes, FPGA blades, Cell blades), each running a Processing Element Container. An optimizing scheduler assigns operators to processing nodes and continually manages resource allocation.]



System S as a Distributed ETL Platform?

Can we use System S as a distributed ETL processing platform?


Outline

  • Background and Motivation

  • System S and its suitability for ETL

  • Performance Evaluation of System S as a Distributed ETL Platform

  • Performance Optimization

  • Related Work and Conclusions


Target Application for Evaluation

Inventory processing for multiple warehouses that includes most of the representative ETL primitives (Sort, Join, and Aggregate)
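To make those primitives concrete, here is a small Python sketch of sort, join, and aggregate over toy inventory records; the records and catalog are invented for illustration and do not reflect the actual benchmark data.

warehouse = [                                       # toy warehouse records (hypothetical)
    {"id": 2, "item": "BOLT", "onhand": 5},
    {"id": 1, "item": "NUT",  "onhand": 3},
    {"id": 3, "item": "NUT",  "onhand": 7},
]
catalog = {"NUT": "hex nut", "BOLT": "steel bolt"}  # toy item catalog

# Sort: order the records by item key
by_item = sorted(warehouse, key=lambda r: r["item"])

# Join: left-outer join against the item catalog on the item key
joined = [dict(r, description=catalog.get(r["item"])) for r in by_item]

# Aggregate: total on-hand quantity per item
totals = {}
for r in joined:
    totals[r["item"]] = totals.get(r["item"], 0) + r["onhand"]

print(joined)
print(totals)   # {'BOLT': 5, 'NUT': 10}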


SPADE Program for Distributed Processing

[Figure: operator graph of the distributed SPADE program. A data distribution host runs Source operators that read three warehouse files (Warehouse_20090901_1.txt, _2.txt, _3.txt; the figure labels the input as 6 million records), bundles them, and a Split operator partitions the tuples by item key (labeled "around 60" in the figure) across the compute hosts; a separate Source feeds the Item Catalog through a Functor. Each compute host (1..N) runs a chain of Sort, Join, Sort, Join, Sort, UDOP(SplitDuplicatedTuples), Aggregate, and Functor operators, writing its results through Sink and ODBCAppend operators.]


SPADE Program (1/2)

[Nodepools]
nodepool np[] := ("s72x336-00", "s72x336-02", "s72x336-03", "s72x336-04")

[Program]
vstream Warehouse1Schema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String)
vstream Warehouse2OutputSchema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String, description: StringList)
vstream ItemSchema(item: String, description: StringList)

##===================================================
## warehouse 1
##===================================================
bundle warehouse1Bundle := ()
for_begin @i 1 to 3
  stream [email protected](schemaFor(Warehouse1Schema))
    := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
    -> node(np, 0), partition["Sources"]
  warehouse1Bundle += [email protected]
for_end

## stream for computing subindex
stream StreamWithSubindex(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(warehouse1Bundle[:])[] {
       subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM))-2 }
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
  stream [email protected](schemaFor(Warehouse1Schema), subIndex:Integer)
for_end
  := Split(StreamWithSubindex) [ subIndex ]{}
  -> node(np, 0), partition["Sources"]

for_begin @i 1 to COMPUTE_NODE_NUM
  stream [email protected](schemaFor(Warehouse1Schema))
    := Sort([email protected] <count([email protected])>)[item, asc]{}
    -> node(np, @i-1), partition["[email protected]"]
  stream [email protected](schemaFor(Warehouse1Schema))
    := Functor([email protected])[ Onhand="0001.000000" ] {}
    -> node(np, @i-1), partition["[email protected]"]
  Nil := Sink([email protected])["file:[email protected]", csvFormat, noDelays]{}
    -> node(np, @i-1), partition["[email protected]"]
for_end
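The Functor/Split pair above implements a simple range partitioning on the item key. The sketch below restates that rule in Python for clarity; COMPUTE_NODE_NUM, the digit positions, and the assumption of zero-based substring indexing are taken from the listing, the rest is hypothetical.

COMPUTE_NODE_NUM = 4   # hypothetical value; one partition per compute host

def sub_index(item: str) -> int:
    # Mirror of the SPADE Functor above: two digits of the item key pick a
    # bucket in 0..59, and integer division maps the bucket to a compute host
    # (assuming strSubstring(item, 6, 2) is a zero-based, length-2 substring).
    bucket = int(item[6:8])
    return bucket // (60 // COMPUTE_NODE_NUM) - 2

# The Split operator then routes each tuple to the chain on host sub_index(item).
print(sub_index("0100-0300-00"))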


SPADE Program (2/2)

stream [email protected](schemaFor(Warehouse2OutputSchema), count: Integer)

:= Join([email protected] <count([email protected])>;

[email protected] <count([email protected])>)

[ LeftOuterJoin, {id, item} = {id, item} ] {}

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse2OutputSchema), count: Integer)

:= Sort([email protected] <count([email protected])>)[id(asc).fileNameColumn(asc)]{}

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse2OutputSchema), count: Integer)

:= Udop([email protected])["FilterDuplicatedItems"]{}

-> node(np, @i-1), partition["[email protected]"]

Nil := Sink([email protected])["file:[email protected]", csvFormat, noDelays]{}

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](item: String, recorded_indicator: Integer)

:= Functor([email protected])[] { item, 1 }

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](LoadNum: Integer, Item_Load_Count: Integer)

:= Aggregate([email protected] <count([email protected])>)

[ recorded_indicator ]

{ Any(recorded_indicator), Cnt() }

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](LoadNum: Integer, Item_Load_Count: Integer, LoadTimeStamp: Long)

:= Functor([email protected])[] { LoadNum, Item_Load_Count, timeStampMicroseconds() }

-> node(np, @i-1), partition["[email protected]"]

Nil := Sink([email protected])["file:///final_result.out", csvFormat, noDelays]{}

-> node(np, @i-1), partition["[email protected]"]

for_end

##====================================================

## warehouse 2

##====================================================

stream ItemsSource(schemaFor(ItemSchema))

:= Source()["file:///ITEMS_FILE", nodelays, csvformat]{}

-> node(np, 1), partition["ITEMCATALOG"]

stream SortedItems(schemaFor(ItemSchema))

:= Sort(ItemsSource <count(ITEM_COUNT)>)[item, asc]{}

-> node(np, 1), partition["ITEMCATALOG"]

for_begin @i 1 to COMPUTE_NODE_NUM

stream [email protected](schemaFor(Warehouse2OutputSchema))

:= Join([email protected] <count([email protected])>;

SortedItems <count(ITEM_COUNT)>)

[ LeftOuterJoin, {item} = {item} ]{}

-> node(np, @i-1), partition["[email protected]"]

##=================================================

## warehouse 3

##=================================================

for_begin @i 1 to COMPUTE_NODE_NUM

stream [email protected](schemaFor(Warehouse2OutputSchema))

:= Sort([email protected] <count([email protected])>)[id, asc]{}

-> node(np, @i-1), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse2OutputSchema), count: Integer)

:= Aggregate([email protected] <count([email protected])>)

[item . id]

{ Any(id), Any(item), Any(Onhand), Any(allocated),

Any(hardAllocated), Any(fileNameColumn), Any(description), Cnt() }

-> node(np, @i-1), partition["[email protected]"]


Qualitative Evaluation of SPADE

  • Implementation

    • Lines of SPADE: 76 lines

    • # of Operators: 19 (1 UDOP Operator)

  • Evaluation

    • With the built-in operators of SPADE, we could develop the given ETL scenario in a highly productive manner

    • The functionality of System S for running a SPADE program on distributed nodes was a great help


Performance Evaluation

  • Total nodes: 14 nodes and 56 CPU cores

    • Spec for each node: Intel Xeon X5365 3.0 GHz (4 physical cores with HT), 16 GB memory, RHEL 5.3 64-bit (Linux kernel 2.6.18-164.el5)

    • Network: InfiniBand (DDR, 20 Gbps) or 1 Gbps network

  • Software: InfoSphere Streams, beta version

  • Data: 9 million records (each record is around 100 bytes)



Node Assignment

[Figure: node assignment (A). 14 nodes in total (4 cores each); the data distribution and item sorting operators run on the first nodes (one node is not used), and 10 nodes (40 cores) serve as compute hosts, with compute operators 1-32 assigned to the cores in order.]


Throughput for Processing 9 Million Records

Maximum throughput: around 180,000 records per second (144 Mbps; 180,000 records/s x ~100 bytes/record x 8 bits ≈ 144 Mbps)

[Charts: throughput and speed-up vs. number of compute cores]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A


Analysis (I-a): Breakdown of the Total Time

Data distribution is dominant.

[Chart: breakdown of the total elapsed time into data distribution and computation]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A


Analysis (I-b): Speed-up Ratio Against 4 Cores When Focusing Only on the Computation Part

Over linear scaling

[Chart: speed-up of the computation part vs. number of compute cores]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A


CPU Utilization at Compute Hosts

[Chart: CPU utilization at the compute hosts over time, showing computation phases separated by idle periods]


Outline

  • Background and Motivation

  • System S and its suitability for ETL

  • Performance Evaluation of System S as a Distributed ETL Platform

  • Performance Optimization

  • Related Work and Conclusions


Performance Optimization

  • The previous experiment shows that most of the time is spent in the data distribution or I/O processing

  • For performance optimization, we re-implemented the SPADE program so that all the nodes participate in the data distribution, while each Source operator is responsible only for its own chunk of the data records, i.e., the input divided by the number of Source operators (see the sketch below)
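A minimal sketch of the chunking rule, assuming an even split of the input: each of the N Source operators reads only the record range assigned to it (the function and variable names here are hypothetical).

def chunk_bounds(total_records: int, num_sources: int):
    # Divide the input evenly: Source operator k reads records [start, end)
    base, extra = divmod(total_records, num_sources)
    bounds, start = [], 0
    for k in range(num_sources):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

# e.g. 9 million records spread over 3 source operators
print(chunk_bounds(9_000_000, 3))   # [(0, 3000000), (3000000, 6000000), (6000000, 9000000)]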


Performance Optimization

  • We modified the SPADE data-flow program in such a way that multiple Source operators participate in the data distribution

  • Each data distribution node can read a chunk of the whole data

[Figure: original vs. optimized SPADE program. In the original program, a single data distribution host reads the warehouse files and a Split operator partitions the tuples by item key across the compute hosts. In the optimized program, each compute host also runs its own Source and Split operators over its chunk of the input, feeding the same Sort / Join / Sort / Join / Sort / UDOP(SplitDuplicatedTuples) / Aggregate / Functor / Sink chains and ODBCAppend outputs.]


Node Assignment

  • All the 14 nodes participate in the data distribution

  • Each Source operator reads its share of the records, i.e., the total data (9M records) divided by the number of Source operators

  • The node assignment for the compute nodes is the same as in Experiment I

[Figure: node assignment. All 14 nodes (4 cores each) serve as both data distribution and compute hosts, each Source operator reading its chunk of the input from local disk.]


Elapsed Time with Varying Numbers of Compute Nodes and Source Operators

[Chart: elapsed time vs. number of source operators, for varying numbers of compute nodes]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:C


Throughput: Over 800,000 Records per Second

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:C


Scalability: Super-Linear Speed-up Achieved with the Data Distribution Optimization

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:C


Outline

  • Background and Motivation

  • System S and its suitability for ETL

  • Performance Evaluation of System S as a Distributed ETL Platform

  • Performance Optimization

  • Related Work and Conclusions


Related Work

  • Near Real-Time ETL

    • Panos et al. reviewed the state of the art of both conventional and near-real-time ETL [2008, Springer]

  • ETL Benchmarking

    • Wyatt et al. identify common characteristics of ETL workflows in an effort to propose a unified evaluation method for ETL [2009, Springer Lecture Notes]

    • TPC-ETL: formed in 2008 and still under development by the TPC subcommittee


Conclusions and Future Work

  • Conclusions

    • Demonstrated the software productivity and scalable performance of System S in the ETL domain

    • After the data distribution optimization, we achieved super-linear scalability, processing around 800,000 records per second on 14 nodes

  • Future Work

    • Comparison with existing ETL tools/systems and various application scenarios (e.g., TPC-ETL)

    • Automatic Data Distribution Optimization


Future Direction: Automatic Data Distribution Optimization

[Figure: model of the optimization problem. Source operators S1..Sn, each reading a data chunk d1..dn, feed compute operators C1..Cm over a node pool of P nodes; n(Si, Cj) denotes the traffic from source Si to compute operator Cj.]

  • We were able to identify the appropriate number of source operators through a series of long-running experiments.

  • However, it is not practical for a distributed system such as System S to force users/developers to find the appropriate number of source nodes experimentally.

  • We will need an automatic optimization mechanism that maximizes the throughput by finding the best number of source nodes transparently to the user (a sketch of the idea follows).
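One naive way to automate the choice, sketched below under the assumption that the system can redeploy the job and measure its throughput (run_benchmark is a hypothetical callback, not a System S API), is to search over candidate source-operator counts and keep the best.

def find_best_source_count(candidates, run_benchmark):
    # run_benchmark(n) is assumed to deploy the SPADE job with n source
    # operators and return the measured throughput (records per second).
    best_n, best_throughput = None, 0.0
    for n in candidates:
        throughput = run_benchmark(n)
        if throughput > best_throughput:
            best_n, best_throughput = n, throughput
    return best_n, best_throughput

# e.g. the counts explored manually in the experiments:
# best = find_best_source_count([3, 9, 15, 24, 45], run_benchmark)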



Questions

Thank You


Backup


Towards Adaptive Optimization

[Figure: original vs. optimized program. In the original SPADE program a single Source operator S reads the data D and feeds compute operators C1..Cm; a data distribution optimizer rewrites it into the optimized program, in which Source operators S1..Sn each read a chunk d1..dn and feed the compute operators.]

  • The current SPADE compiler has a compile-time optimizer that obtains statistical data such as tuple/byte rates and the CPU ratio for each operator.

  • We would like to let users/developers write a SPADE program in the left-hand form, without having to consider data partitioning and data distribution.

  • By extending the current optimizer, the system could automatically convert the left-hand program into the right-hand program, which achieves the maximum data distribution (a rough sketch of the sizing idea follows).
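As a rough illustration of how such profile data could drive the rewrite, the Python sketch below estimates how many source operators are needed for the aggregate feed rate to match the compute side; the rates and the sizing rule are hypothetical and are not the SPADE optimizer's actual algorithm.

import math

def estimate_source_count(per_source_rate, per_compute_rate, num_compute_ops, max_sources):
    # The feed side must at least match the aggregate consumption rate of the
    # compute operators, otherwise the compute cores starve (hypothetical rule).
    required = per_compute_rate * num_compute_ops
    return min(max_sources, math.ceil(required / per_source_rate))

# e.g. profiled rates in records/second (illustrative numbers only)
print(estimate_source_count(per_source_rate=60_000,
                            per_compute_rate=25_000,
                            num_compute_ops=32,
                            max_sources=45))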


Executive Summary

  • Motivation:

    • Evaluate System S as an ETL platform in a large experimental environment, the Watson cluster

    • Understand the performance characteristics on such a large testbed, e.g., scalability and performance bottlenecks

  • Findings:

    • Our series of experiments shows that the data distribution cost is dominant in ETL processing

    • The optimized version (right-hand chart) shows that, when the number of data feed (source) operators is changed, throughput increases dramatically and higher speed-ups are obtained than with the other configurations

    • Using the InfiniBand network is critical for such an ETL workload, which includes a barrier before all the data is aggregated for the sorting operation; we achieved almost double the performance compared with the 1 Gbps network

[Charts: elapsed time for the baseline, the optimized version vs. the others, and a throughput comparison between the 1 Gbps and InfiniBand networks]


Node Assignment (B) for Experiment II

The experimental environment comprises 3 source nodes for data distribution, 1 node for item sorting, and 10 nodes for computation. Each compute node has 4 cores, and we manually allocate each operator with the following scheduling policy. The diagram shows the case in which 32 operators are used for the computation; operators are allocated to adjacent nodes in order.

[Figure: node assignment (B). 14 nodes in total (4 cores each): data distribution on 3 nodes, item sorting on 1 node, and compute operators 1-32 allocated in order across the 10 compute hosts (40 cores).]


SPADE Program with Data Distribution Optimization

[Figure: operator graph of the optimized program. Three Source / Split / Functor chains, one per data distribution host (c0101b01, c0101b02, c0101b03), each read a chunk of the warehouse file (Warehouse_20090901_2.txt) and feed the parallel Sort / Join / Sort / ODBCAppend / Join / Sort / UDOP(SplitDuplicatedTuples) / Aggregate / Functor / Sink chains spread across the compute hosts (c0101b05, c0101b06, c0101b07, ..., s72x336-14).]

Since 3 nodes participate in the data distribution, the number of communication channels is at most 120 (3 x 40).


for_begin @j 1 to COMPUTE_NODE_NUM

bundle [email protected] := ()

for_end

#define SOURCE_NODE_NUM 3

for_begin @i 0 to SOURCE_NODE_NUM-1

stream [email protected](schemaFor(Warehouse1Schema))

:= Source()["file:///SOURCEFILE", nodelays, csvformat]{}

-> node(SourcePool, @i), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse1Schema), subIndex: Integer)

:= Functor(Warehouse1Stream1)[] {

subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM)) }

-> node(SourcePool, @i), partition["[email protected]"]

for_begin @j 1 to COMPUTE_NODE_NUM

stream [email protected]@j(schemaFor(Warehouse1Schema), subIndex:Integer)

for_end

:= Split([email protected]) [ subIndex ]{}

-> node(SourcePool, @i), partition["[email protected]"]

for_begin @j 1 to COMPUTE_NODE_NUM

[email protected] += [email protected]@j

for_end

for_end

for_begin @j 1 to COMPUTE_NODE_NUM

stream [email protected](schemaFor(Warehouse1Schema))

:= Functor([email protected][:])[]{}

-> node(np, @j-1), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse1Schema))

:= Sort([email protected] <count([email protected])>)[item, asc]{}

-> node(np, @j-1), partition["[email protected]"]

stream [email protected](schemaFor(Warehouse1Schema))

:= Functor([email protected])[ Onhand="0001.000000" ] {}

-> node(np, @j-1), partition["[email protected]"]

for_end

bundle warehouse1Bundle := ()

for_begin @i 1 to 3

stream [email protected](schemaFor(Warehouse1Schema))

:= Source()["file:///SOURCEFILE", nodelays, csvformat]{}

-> node(np, 0), partition["Sources"]

warehouse1Bundle += [email protected]

for_end

(The first listing above is the new SPADE program used after the optimization; the shorter listing at the end is the corresponding Source section from Experiment I. Warehouses 2, 3, and 4 are omitted in this chart, but we executed them in the experiment.)


Node Assignment (C) for Experiment III

[Figure: node assignment (C). All 14 nodes (4 cores each) host Source operators, each reading its chunk of the input from local disk; the compute operators are assigned as in the earlier experiments.]

  • All the 14 nodes participate in the data distribution, and each Source operator is assigned in the manner shown in the diagram above: for instance, 24 Source operators are allocated to the nodes in order, and once 14 Source operators have been allocated to the 14 nodes, the next Source operator is allocated to the first node again (see the sketch below)

  • Each Source operator reads its share of the records, i.e., the total data (9M records) divided by the number of Source operators; this data division is done beforehand using the Linux tool "split"

  • The node assignment for the compute nodes is the same as in Experiment I
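The round-robin rule from the first bullet can be written down directly; a small Python sketch, assuming nodes are numbered 1-14 and Source operators 1-24 (the names are hypothetical):

NODES = 14                      # total nodes in the cluster

def assign_sources(num_sources):
    # Round-robin: Source operator k goes to node ((k - 1) mod NODES) + 1,
    # so operator 15 wraps around to node 1, operator 16 to node 2, and so on.
    return {k: (k - 1) % NODES + 1 for k in range(1, num_sources + 1)}

print(assign_sources(24)[15])   # -> 1: the 15th operator lands back on node 1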


Performance Result for Experiment II and Comparison with Experiment I

When 3 nodes participate in the data distribution, the throughput increases to almost double compared with the result from Experiment I.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:B


Analysis (II-a): Optimization by Changing the Number of Source Operators

  • Motivation for this experiment

    • On the previous page, the throughput saturates at around 16 cores because the data feeding rate cannot keep up with the computation

  • Experimental environment

    • We changed the number of source operators while keeping the total data volume constant (9M records), and measured the throughput

    • We only tested 9MDATA-32 (32 operators for computation)

  • Experimental results

    • This experiment shows that 9 source nodes obtain the best throughput

[Figure: node assignment for 9 data distribution nodes, the configuration that gave the best throughput. 14 nodes in total (4 cores each); 9 nodes host Source operators reading from local disk, and the compute operators are assigned as in node assignment B.]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B


Analysis (II-b) : Increased Throughput by Data Distribution Optimization

  • The following graph shows the overall results of taking the same optimization approach as in the previous experiment, i.e., increasing the number of source operators.

  • 3 source operators are used for 4, 8, 12, and 16 cores, and 9 source operators are used for 20, 24, 28, and 32 cores.

  • We achieved a 5.84x speedup over 4 cores at 32 cores

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B


Analysis (II-c) : Increased Throughput by Data Distribution Optimization

The yellow line shows the best performance, since 9 nodes participate in the data distribution for 20, 24, 28, and 32 cores.

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:B


Experiment (III): Increasing More Source Operators

  • Motivation

    • In this experiment, we examine the performance characteristics when increasing the number of source operators beyond the previous experiment (Experiment II).

    • We also compare the performance of the InfiniBand network and the commodity 1 Gbps network

  • Experimental Setting

    • We increase the number of source operators from 3 up to 45, and test this configuration with relatively large numbers of compute cores: 20, 24, 28, and 32.

    • The node assignment for data distribution and computation is the same as in the previous experiment (Experiment II)


Analysis (II-a): Throughput and Elapsed Time

The maximum total throughput, around 640 Mbps, is below the network bandwidth of both InfiniBand and the 1 Gbps LAN: 800,000 tuples/sec (1 tuple = 100 bytes) = 640 Mbps.

[Charts: throughput and elapsed time vs. number of source operators]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C


Analysis (III-c) : Performance Without Infiniband

[Charts: throughput and elapsed time vs. number of source operators, without InfiniBand]

  • In this experiment, we measured the throughput without InfiniBand for varying numbers of source operators.

  • Unlike the performance obtained with InfiniBand, the throughput saturates at around 12-15 source operators.

  • This result shows that the throughput is around 400,000 records per second at maximum, which corresponds to around 360 Mbps.

  • Although the network used in this experiment is 1 Gbps, this appears to be the practical upper limit on usable bandwidth once the System S overhead is taken into account.

  • A drastic performance degradation from 15 to 18 source operators can be observed; we assume this is because, once 14 source operators have been allocated to the 14 nodes, two or more operators (processes) access the same 1 Gbps network card simultaneously and resource contention occurs.


Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment:C



Analysis (III-d) : Comparison between W/O Infiniband and W/ Infiniband

This chart shows the performance comparison with the InfiniBand network enabled and disabled. The absolute throughput with InfiniBand enabled is roughly double that without InfiniBand. This result indicates that using InfiniBand is essential to obtaining high throughput in ETL-type workloads.

[Charts: throughput vs. number of source operators, without InfiniBand and with InfiniBand]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C


Analysis (I-c): Elapsed Time for Distributing the 9M Records to Multiple Cores

The following graph demonstrates that the elapsed time for distributing all the data is nearly constant across varying numbers of compute cores.

[Chart: data distribution time vs. number of compute cores]

Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores,

16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A

