
OmniStorage: Performance Improvement by Data Management Layer on a Grid RPC System


Presentation Transcript


  1. OmniStorage: Performance Improvement by Data Management Layer on a Grid RPC System. Mitsuhisa Sato, Yoshihiro Nakajima, Yoshiaki Aida, Osamu Tatebe, Taisuke Boku. University of Tsukuba, Japan

  2. Outline • Research background and motivation: Grid RPC and lessons learned from applications • OmniStorage: a data management layer for grid RPC applications • Implementation using different data layers • Synthetic grid RPC workload for evaluation • Performance evaluation according to communication pattern • Conclusion

  3. Background • Grid RPC (Remote Procedure Call) is recognized as one of the effective programming models for grid applications. • Grid RPC can be applied to the master-worker programming model. • For example, parameter search applications are well suited to this model.

  4. Grid RPC • [Figure: the master invokes OmniRPC agents over the Internet/network, which launch remote executables (rex) on worker hosts] • Grid RPC: an RPC system extended to exploit computing resources on the Grid • One of the effective programming models for Grid applications • Makes it easy to implement Grid-enabled applications • Grid RPC can be applied to the master-worker programming model • We have developed OmniRPC [msato03] as a prototype Grid RPC system • Provides a seamless programming environment from a local cluster to multiple clusters on the Grid • Its main target is master/worker type parallel programs

  5. Parameter Search Application • Parameter search applications often need a large amount of common data. • Example: a master-worker parallel eigenvalue solver • Solves large-scale eigenvalue problems with the RPC model • Common data = a large-scale sparse matrix • If the matrix is very large, it takes a long time to send it to every worker • [Figure: the master sends parameters plus the same large initial data to each worker]

  6. Problems in Grid RPC • In the RPC model, the master communicates with workers one by one. • Only direct communication between the master and a worker is supported. • This does not correspond to the actual network topology. • [Figure: a single master connected to many workers in a star topology] • The network bandwidth between the master and the workers becomes a performance bottleneck.

  7. Problem of Applications Requiring a Large Amount of Initial Data • The master sends the data on demand when workers are invoked • Workers cannot begin processing before the data transfer is finished • [Figure: the OmniRPC client at site A sends the common data over the WAN to each worker at site B; most workers sit idle waiting while the transfers are serialized]

  8. Performance Issues in the RPC Model • The RPC mechanism performs point-to-point communication between the master and a worker • Transmission is NOT network-topology-aware • There is no functionality for direct communication between workers • Lessons learned from real grid RPC applications: • Case 1: parametric search type application • A large amount of initial data is transferred to all workers as RPC parameters; the data transfer from the master becomes a bottleneck (n transfers from the master are required for n workers) • Case 2: task farming type application • A set of RPCs is processed in a pipeline manner that requires data transfer between workers; two extra RPCs are required, as sketched below • an RPC to send data from a worker to the master • an RPC to send data from the master to another worker • Our approach: decouple a data management layer from RPC to solve these issues
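The extra data movement in Case 2 can be seen in the following minimal sketch of a two-stage pipeline, written in the call style used on the later example slides; the synchronous OmniRpcCall variant, the procedure names "produce"/"consume", and the buffer sizes are illustrative assumptions, not part of the original application.

    /* Sketch only: OmniRPC headers and IDL are assumed; "produce" and
     * "consume" are hypothetical procedure names. */
    void pipeline_via_master(void)
    {
        static double intermediate[1000*1000];  /* must pass through the master */
        double result[1000];
        int param = 0;

        /* Extra RPC #1: worker A returns the intermediate data to the
         * master through an OUT parameter. */
        OmniRpcCall("produce", param, intermediate);

        /* Extra RPC #2: the master re-sends the same data to worker B
         * through an IN parameter. */
        OmniRpcCall("consume", intermediate, result);
    }

With a separate data management layer, worker A can instead register the intermediate data once and worker B can fetch it directly, as shown on the later OmniStorage slides.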

  9. Our Proposal • We propose a programming model that decouples the data transfer layer from the RPC layer • This enables data transfer among the master and workers to be optimized using several data transfer methods • We have developed OmniStorage as a prototype to investigate this model • We provide a set of benchmark programs organized by communication pattern for performance evaluation • These can be used as a common benchmark for similar middleware

  10. OmniStorage Overview • OmniStorage is a communication layer for OmniRPC's data transfer • Independent of the RPC communication • Enables topology-aware data transfer • Optimizes communication independently, transferring data through separate processes • Users can use the OmniStorage system through simple APIs • Users do not have to consider the system's configuration

  11. OmniStorage Overview • [Figure: the OmniRPC layer sends invocations and arguments from the master to workers one by one, while large data is registered with OmstPutData() and retrieved with OmstGetData() through the separate data transfer layer "OmniStorage"]

  12. Programming example using OmniRPC only

Master program (the initial data is sent as an RPC parameter):

    int main(){
        double initialdata[1000*1000], output[100][1000];
        ...
        for(i = 0; i < 100; i++){
            req[i] = OmniRpcCallAsync("MyProcedure", i, initialdata, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }

Worker program (worker's IDL):

    Define MyProcedure(int IN i, double IN initialdata[1000*1000], double OUT output[1000]){
        ... /* Worker's program in C language */
    }

  13. An example of an OmniRPC application with OmniStorage

Master program (the user registers the data explicitly instead of sending it as an RPC parameter; the last argument of OmstPutData is hint information about the communication pattern):

    int main(){
        double initialdata[1000*1000], output[100][1000];
        ...
        OmstPutData("MyInitialData", initialdata, 8*1000*1000, OMSTBROADCAST);
        for(i = 0; i < 100; i++){
            req[i] = OmniRpcCallAsync("MyProcedure", i, output[i]);
        }
        OmniRpcWaitAll(100, req);
        ...
    }

Worker program (worker's IDL):

    Define MyProcedure(int IN i, double OUT output[1000]){
        OmstGetData("MyInitialData", initialdata, 8*1000*1000);
        ... /* Worker's program in C language */
    }

  14. Omst/Tree: First prototype implementation of OmniStorage • Supports only broadcast communication from the master to workers, constructing a tree-topology network • A relay node relays the communication between the master and workers • The user specifies the network topology in a configuration file • The relay node works as a data cache server • Reduces data transmission over links where the network bandwidth is lower • Reduces the number of access requests to the master • We set up one relay node per cluster • We start OmniStorage servers on the client host, relay hosts, and worker hosts • Relay nodes and workers can cache received data, so the same content is transferred only once
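The caching behaviour listed above amounts to a fetch-once policy at the relay node: serve a local copy if the data id has been seen, otherwise pull it from the upstream node exactly once and remember it. The sketch below is a conceptual illustration of that policy only; the helper name fetch_from_upstream and the fixed-size cache are assumptions, not the actual Omst/Tree code.

    /* Conceptual fetch-once cache at a relay node (hypothetical names,
     * not the real Omst/Tree implementation). */
    #include <stddef.h>
    #include <string.h>

    struct cache_entry { char id[64]; void *buf; size_t size; };
    static struct cache_entry cache[128];
    static int n_cached = 0;

    /* Assumed helper: pulls data from the master or an upstream relay. */
    extern void *fetch_from_upstream(const char *id, size_t *size);

    /* Serve a worker request: a cache hit avoids any upstream (WAN) traffic. */
    void *relay_get(const char *id, size_t *size)
    {
        for (int i = 0; i < n_cached; i++) {
            if (strcmp(cache[i].id, id) == 0) {
                *size = cache[i].size;
                return cache[i].buf;
            }
        }
        void *buf = fetch_from_upstream(id, size);  /* single upstream transfer */
        if (n_cached < 128) {
            strncpy(cache[n_cached].id, id, sizeof cache[n_cached].id - 1);
            cache[n_cached].buf = buf;
            cache[n_cached].size = *size;
            n_cached++;
        }
        return buf;
    }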

  15. Prototype Implementation of Omst/Tree • [Figure: the OmniRPC client host at site A calls OmstPutData() and sends the data over the WAN only once to the relay node (the master node of the cluster at site B); workers call OmstGetData(), their requests and the data transfers stay inside the cluster, and each worker begins processing from its local copy]

  16. Scalability Test: Experimental Settings • We used the master-worker parallel eigenvalue solver (Sakurai et al.) • It issues 80 jobs, and each job takes about 30 seconds • Each worker requires 50 MB of common data • We compared the scalability of "OmniRPC + OmniStorage" and "OmniRPC only" • We measured execution time while varying the number of workers from 1 to 64 (cluster: Kaede, client: cTsukuba) • Clusters • "Dennis", 16 nodes @ HPCC.JP: 10 nodes with dual Xeon 2.4 GHz, 1 GB memory, 1 Gb Ethernet, and 6 nodes with dual Xeon 3.06 GHz, 1 GB memory, 1 Gb Ethernet • "Kaede", 64 nodes @ Univ. of Tsukuba: dual Xeon 3.2 GHz, 2 GB memory, 1 Gb Ethernet • Client host • "cTsukuba" @ Univ. of Tsukuba
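A quick back-of-the-envelope count, using only the 50 MB figure and the worker counts above, shows why the common data dominates: with OmniRPC alone the master must send the 50 MB to every worker, i.e. 64 x 50 MB = 3200 MB (about 3.2 GB) over the master's outbound link at 64 workers, whereas with Omst/Tree roughly one 50 MB copy crosses the wide-area link per cluster and the relay node serves the remaining workers locally.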

  17. Scalability Results (1) • [Figure: execution time and performance ratio relative to 1 node]

  18. Scalability Results (2) • When we used two clusters, the performance improved 1.8 times.

  19. Outline • Research background and motivation: Grid RPC and lessons learned from applications • OmniStorage: a data management layer for grid RPC applications • Implementation using different data layers • Synthetic grid RPC workload for evaluation • Performance evaluation according to communication pattern • Conclusion • Next: more complex communication patterns …

  20. What more OmniStorage can solve • Communication between workers: not achievable with RPC alone, or achievable only by routing the data through the master (RPC + data), which is not efficient • Broadcasting: achievable with RPC alone, but not efficient, because the master repeats the RPC + data transfer for every worker • [Figure: diagrams contrasting RPC-only with RPC plus a separate data path, for worker-to-worker communication and for broadcast]

  21. Direct communication among workers using OmniStorage • [Figure: the OmniRPC layer carries only the control sequence from the master to the workers; workers register data items such as ("dataA", "abcdefg", …) and ("dataB", 3.1242, …) with the data registration API OmstPutData(id, data, hint) and retrieve them with the data retrieval API OmstGetData(id, data, hint) through the OmniStorage data management layer]
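A minimal sketch of this worker-to-worker exchange, following the put/get usage from slide 13, is shown below; the id string, the buffer sizes, and the hint constant OMSTWORKER2WORKER are assumptions for illustration (only OMSTBROADCAST appears verbatim in these slides).

    /* Sketch only: OmniStorage headers are assumed; the id, sizes, and
     * the OMSTWORKER2WORKER hint name are illustrative assumptions. */

    /* In worker A's RPC procedure: publish an intermediate result. */
    void worker_a_body(void)
    {
        double partA[1000];
        /* ... compute partA ... */
        OmstPutData("dataA", partA, sizeof(partA), OMSTWORKER2WORKER);
    }

    /* In worker B's RPC procedure: fetch it directly, without an extra
     * round trip through the master. */
    void worker_b_body(void)
    {
        double inputB[1000];
        OmstGetData("dataA", inputB, sizeof(inputB));
        /* ... use inputB ... */
    }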

  22. Data broadcast from the master to workers using OmniStorage • [Figure: the OmniRPC layer carries only the control sequence; the master registers ("dataA", "abcdefg", …) with the data registration API OmstPutData(id, data, hint), and every worker retrieves the same item with the data retrieval API OmstGetData(id, data) through the OmniStorage data management layer]

  23. Implementations of OmniStorage • We have implemented three data transfer methods for different data transmission patterns • Omst/Tree: uses our tree-network-topology-aware data transmission implementation • Omst/BT: uses BitTorrent, which is designed for large-scale file distribution over widely distributed peers • Omst/GF: uses Gfarm, a Grid-enabled distributed file system developed by AIST

  24. Omst/BT • Omst/BT uses BitTorrent as a data transfer method for OmniStorage • BitTorrent: a P2P file sharing protocol • Specialized for sharing large amounts of data across many nodes • Automatically optimizes data transfer among the peers • As the number of peers increases, file distribution becomes more effective • Omst/BT automates the steps needed to use the BitTorrent protocol (registering the "torrent" file, etc.)

  25. Omst/GF • Omst/GF uses the Gfarm file system [Tatebe02] as a data transfer method for OmniStorage • Gfarm is a grid-enabled, large-scale distributed file system for data-intensive applications • Data are stored in and accessed through the Gfarm file system • Omst/GF exploits Gfarm's data replication to improve scalability and performance • Gfarm may also optimize the data transmission itself

  26. Synthetic benchmark programs for performance comparison of the data layers • W-To-W: models a program in which the output of one RPC becomes the input of the next RPC (file staging); transfers one file from one worker to another worker • BROADCAST: models a program that broadcasts common initial data from the master to the workers; broadcasts one file from the master to all workers • ALL-EXCHANGE: models a program in which every worker exchanges its own data files with all the others for subsequent processing; each worker broadcasts its one file to every other worker • A sketch of the ALL-EXCHANGE pattern in terms of the OmniStorage API follows below
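As referenced in the list above, the ALL-EXCHANGE pattern can be expressed with the same put/get calls shown on slide 13. The sketch below is illustrative only: the rank-based id scheme, the DATA_SIZE constant, and the reuse of the OMSTBROADCAST hint for a per-worker broadcast are assumptions, not the benchmark's actual source.

    /* Sketch of the ALL-EXCHANGE body as run inside each worker.
     * Assumptions: my_rank/n_workers come from the benchmark driver;
     * DATA_SIZE and the hint usage are illustrative. */
    #include <stdio.h>

    #define DATA_SIZE (1000*1000)

    void all_exchange(int my_rank, int n_workers, double *my_data, double *recv_buf)
    {
        char id[32];

        /* Publish this worker's one file under a rank-specific id. */
        snprintf(id, sizeof id, "data_%d", my_rank);
        OmstPutData(id, my_data, sizeof(double) * DATA_SIZE, OMSTBROADCAST);

        /* Retrieve every other worker's file. */
        for (int r = 0; r < n_workers; r++) {
            if (r == my_rank)
                continue;
            snprintf(id, sizeof id, "data_%d", r);
            OmstGetData(id, recv_buf, sizeof(double) * DATA_SIZE);
            /* ... subsequent processing of recv_buf ... */
        }
    }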

  27. Testbed Configuration • Clusters • "Dennis", 8 nodes @ hpcc.jp: dual Xeon 2.4 GHz, 1 GB memory, 1 GbE • "Alice", 8 nodes @ hpcc.jp: dual Xeon 2.4 GHz, 1 GB memory, 1 GbE • "Gfm", 8 nodes @ apgrid.org: dual Xeon 3.2 GHz, 1 GB memory, 1 GbE • The OmniRPC master program is executed on "cTsukuba" @ Univ. of Tsukuba • Two testbed configurations for the performance evaluation • Two clusters connected by a high-bandwidth network: Dennis (8 nodes) + Alice (8 nodes) • Two clusters connected by a lower-bandwidth network: Dennis (8 nodes) + Gfm (8 nodes)

  28. W-To-W: transfer of one file from one worker to another • Omst/BT could not outperform OmniRPC alone • Omst/GF is about 3x faster than OmniRPC alone • [Figure: execution times for the W-To-W benchmark]

  29. BROADCAST: broadcast of one file from the master to all 16 workers • Omst/Tree broadcasts efficiently (6.7x faster) • Omst/BT also broadcasts efficiently (5.7x faster) and performs better with 1 GB data, but has a large overhead • With OmniRPC alone, many separate master-to-worker transfers occur • [Figure: execution times for the BROADCAST benchmark]

  30. ALL-EXCHANGE: each of 16 workers broadcasts its one file to every other worker • Omst/BT is 7x faster than OmniRPC alone • Omst/GF is 21x faster than OmniRPC alone • [Figure: execution times for the ALL-EXCHANGE benchmark]

  31. Discussion of performance by basic communication pattern • W-To-W • Omst/GF is preferred • Omst/BT could not exploit the merits of the BitTorrent protocol because the execution platform had too few workers • BROADCAST • Omst/Tree achieves better performance when the network topology is known • Omst/BT may be preferred when the network topology is unknown • With more than 1000 workers, Omst/BT is the suitable choice • ALL-EXCHANGE • Omst/GF provides the better solution • Omst/BT could improve its performance by tuning BitTorrent's parameters

  32. Conclusion • We have proposed a new programming model that decouples data transfer from the RPC mechanism • We have designed and implemented OmniStorage as a prototype of the data management layer • OmniStorage enables effective, topology-aware data transfer • We characterized OmniStorage's performance according to the data communication patterns

  33. Future work • Automatic selection of the optimal data transfer method using the communication-pattern hint • Parameter optimization of BitTorrent in Omst/BT • More experiments and benchmarking on a large-scale distributed computing platform (such as Intrigger in Japan)

  34. Thank you for your attention! • Our websites: • HPCS Laboratory, University of Tsukuba • http://www.hpcs.cs.tsukuba.ac.jp/ • http://www.omni.hpcc.jp/

  35. OmniStorage • A data management layer for grid RPC data transfer • Independent of the RPC communication • Enables topology-aware data transfer and optimizes data communication • Transfers data through separate processes • Users can use the OmniStorage system with simple APIs (putdata(), getdata()) • Provides multiple data transfer methods for different communication patterns • The user can choose a suitable data transfer method • Exploits hint information about the required communication pattern (BROADCAST, WORKER2WORKER, …); the sketch below summarizes the API shape implied by the examples
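For reference, the prototypes below are inferred purely from the call sites on slides 13 and 21; they are an assumption about the API shape, not the published OmniStorage header (return types, the buffer/size types, and hint names other than OMSTBROADCAST may differ).

    /* Inferred sketch of the OmniStorage API, based only on its usage in
     * these slides; not the actual header. */
    #include <stddef.h>

    /* Communication-pattern hints passed to OmstPutData(). Only
     * OMSTBROADCAST appears verbatim in the slides; OMSTWORKER2WORKER is
     * an assumed name for the worker-to-worker hint. */
    enum OmstHint { OMSTBROADCAST, OMSTWORKER2WORKER };

    /* Register a named data item together with a hint describing how it
     * will be consumed. */
    int OmstPutData(const char *id, void *data, size_t size, enum OmstHint hint);

    /* Retrieve a previously registered data item into a caller-supplied
     * buffer of the given size. */
    int OmstGetData(const char *id, void *data, size_t size);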

  36. Conclusion • We have proposed a new programming model that decouples data transfer from RPC • We have designed and implemented OmniStorage as a prototype of the data layer • OmniStorage enables effective, topology-aware data transfer • In the experiments, we found that OmniStorage can improve the performance of OmniRPC

  37. Objective • Propose a programming model that decouples the data transfer layer from the RPC layer • Enables data transfer among the master and workers to be optimized using several data transfer methods • Conduct a performance evaluation using a set of benchmark programs organized by communication pattern • These can be used as a common benchmark for similar middleware

  38. Future Work • Implement a function for collecting result data from the workers back to the master • Currently OmniStorage only has a facility to send data from the master to the workers • Examine other systems for the data management layer, since the APIs do not depend on the implementation • Distributed hash tables • Gfarm (distributed file system) • BitTorrent (P2P file sharing)

  39. Any questions? • E-mail: ynaka@hpcs.cs.tsukuba.ac.jp or omrpc@omni.hpcc.jp • Our websites: • HPCS Laboratory, University of Tsukuba: http://www.hpcs.cs.tsukuba.ac.jp/ • OmniStorage (to be released soon): http://www.omni.hpcc.jp/

  40. Case study: performance improvement by OmniStorage on an OmniRPC application • Benchmark program: a master/worker type parallel eigenvalue solver written with OmniRPC (developed by Prof. Sakurai @ Univ. of Tsukuba) • 80 RPCs are issued • Each RPC takes about 30 seconds • Initial data size: about 50 MB for each RPC • Evaluation details • Since the transmission pattern of the initial data is a broadcast, we choose Omst/Tree as the data transfer method • We examine application scalability with and without Omst/Tree • We measure the execution time while varying the number of nodes from 1 to 64

  41. Performance evaluation with a real application (parallel eigenvalue solver) • [Figure: execution time and speedup]

  42. Overview of Omst/Tree • Supports only broadcast communication from the master to workers, constructing a tree-topology network • A relay node relays the communication between the master and workers • The relay node works as a data cache server • Reduces data transmission over links where the network bandwidth is lower • Reduces the number of access requests to the master • [Figure: the master feeds a relay node, which serves workers 1 through N]

  43. Overview of Omst/BT • Omst/BT uses BitTorrent as a data transfer method for OmniStorage • BitTorrent: a P2P file sharing protocol • Specialized for sharing large amounts of data across many nodes • Automatically optimizes data transfer among the peers • The more peers there are, the more effective the file distribution becomes • Omst/BT automates the registration of the "torrent" file, which is normally done manually

  44. Overview of Omst/GF • Omst/GF uses the Gfarm file system [Tatebe02] as a data transfer method for OmniStorage • Gfarm is a grid-enabled, large-scale distributed file system for data-intensive applications • Omst/GF accesses the data through the Gfarm file system • Gfarm hooks the standard file system calls (open, close, write, read), so data can be exchanged through ordinary file I/O (a sketch follows below) • Gfarm may optimize the data transfer • [Figure: the application uses the Gfarm I/O library, obtains file information from the metadata server (gfmd), and performs remote file access against the gfsd daemons on the file system nodes]
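Because Gfarm intercepts the standard file calls, a data layer built on it can move data between workers simply by writing and reading files under a shared Gfarm path, letting Gfarm's replication place copies near the readers. The sketch below only illustrates that idea; the /gfarm mount point, the naming scheme, and the helper functions are assumptions, not the actual Omst/GF code.

    /* Conceptual sketch of exchanging a data item through a Gfarm-backed
     * shared file system; "/gfarm/omst/<id>" is an assumed path scheme. */
    #include <stdio.h>

    /* Producer: a worker writes its data item as a file. */
    static int put_via_shared_fs(const char *id, const void *buf, size_t size)
    {
        char path[256];
        snprintf(path, sizeof path, "/gfarm/omst/%s", id);
        FILE *fp = fopen(path, "wb");   /* intercepted by the Gfarm I/O hooks */
        if (fp == NULL) return -1;
        size_t n = fwrite(buf, 1, size, fp);
        fclose(fp);
        return n == size ? 0 : -1;
    }

    /* Consumer: another worker reads the item, possibly from a nearby replica. */
    static int get_via_shared_fs(const char *id, void *buf, size_t size)
    {
        char path[256];
        snprintf(path, sizeof path, "/gfarm/omst/%s", id);
        FILE *fp = fopen(path, "rb");
        if (fp == NULL) return -1;
        size_t n = fread(buf, 1, size, fp);
        fclose(fp);
        return n == size ? 0 : -1;
    }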

  45. Point-to-point communication (one worker sends data to one worker in another cluster) • [Figure: results for (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network] • Omst/GF enables direct communication between workers, so its performance is good • Omst/GF is 2.5 times faster than OmniRPC and 6 times faster than Omst/BT

  46. Broadcast from the master to all workers (the master sends data to 16 workers) • [Figure: results for (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network] • With OmniRPC alone, many separate master-to-worker transfers occurred, and the execution times vary widely • Omst/BT has a large overhead, but the larger the data, the better its performance • Omst/Tree broadcast effectively (2.5 times faster)

  47. All-to-all communication between 16 workers (each worker sends its own data to the other 15 workers) • [Figure: results for (1) clusters connected by a high-bandwidth network and (2) clusters connected by a lower-bandwidth network] • Omst/GF is about 2 times faster than Omst/BT
