Twister: A Runtime for Iterative MapReduce

Twister: A Runtime for Iterative MapReduce HPDC – 2010 MAPREDUCE’10 Workshop, Chicago, 06/22/2010 Jaliya Ekanayake Community Grids Laboratory, Digital Science Center Pervasive Technology Institute Indiana University

Acknowledgements to: • Co authors: Hui Li, Binging Shang, Thilina Gunarathne Seung-Hee Bae, Judy Qiu, Geoffrey Fox School of Informatics and Computing Indiana University Bloomington • Team at IU

Motivation Data Deluge MapReduce Classic Parallel Runtimes (MPI) Input map Data Centered, QoS Efficient and Proven techniques iterations Experiencing in many domains Input Input map map Output Pij reduce reduce Expand the Applicability of MapReduce to more classes of Applications Iterative MapReduce More Extensions Map-Only MapReduce

Features of Existing Architectures(1) • Google, Apache Hadoop, Sector/Sphere, • Dryad/DryadLINQ (DAG based) • Programming Model • MapReduce (Optionally “map-only”) • Focus on Single Step MapReduce computations (DryadLINQ supports more than one stage) • Input and Output Handling • Distributed data access (HDFS in Hadoop, Sector in Sphere, and shared directories in Dryad) • Outputs normally goes to the distributed file systems • Intermediate data • Transferred via file systems (Local disk-> HTTP -> local disk in Hadoop) • Easy to support fault tolerance • Considerably high latencies

Features of Existing Architectures(2) • Scheduling • A master schedules tasks to slaves depending on the availability • Dynamic Schedulingin Hadoop, static scheduling in Dryad/DryadLINQ • Naturally load balancing • Fault Tolerance • Data flows through disks->channels->disks • A master keeps track of the data products • Re-execution of failed or slow tasks • Overheads are justifiable for large single step MapReduce computations • Iterative MapReduce

A Programming Model for Iterative MapReduce Static data Configure() • Distributed data access • In-memory MapReduce • Distinction on static data and variable data (data flow vs. δ flow) • Cacheable map/reduce tasks (long running tasks) • Combine operation • Support fast intermediate data transfers Iterate User Program δ flow Map(Key, Value) Reduce (Key, List<Value>) Close() Combine (Map<Key,Value>) TwisterConstraints for Side Effect Free map/reduce tasks Computation Complexity >> Complexity of Size of the Mutant Data (State)

runMapReduce(..) Iterations Twister Programming Model Worker Nodes configureMaps(..) Local Disk configureReduce(..) Cacheable map/reduce tasks while(condition){ May send <Key,Value> pairs directly Map() Reduce() Combine() operation Communications/data transfers via the pub-sub broker network updateCondition() Two configuration options : • Using local disks (only for maps) • Using pub-sub bus } //end while close() User program’s process space

Twister API configureMaps(PartitionFilepartitionFile) configureMaps(Value[] values) configureReduce(Value[] values) runMapReduce() runMapReduce(KeyValue[] keyValues) runMapReduceBCast(Value value) map(MapOutputCollector collector, Key key, Value val) reduce(ReduceOutputCollector collector, Key key,List<Value> values) combine(Map<Key, Value> keyValues)

Twister Architecture Master Node Pub/sub Broker Network B B B B Twister Driver Main Program One broker serves several Twister daemons Twister Daemon Twister Daemon map reduce Cacheable tasks Worker Pool Worker Pool Local Disk Local Disk Scripts perform: Data distribution, data collection, and partition file creation Worker Node Worker Node

Input/Output Handling Data Manipulation Tool Node 0 Node 1 Node n A common directory in local disks of individual nodes e.g. /tmp/twister_data Partition File • Data Manipulation Tool: • Provides basic functionality to manipulate data across the local disks of the compute nodes • Data partitions are assumed to be files (Contrast to fixed sized blocks in Hadoop) • Supported commands: • mkdir,rmdir, put,putall,get,ls, • Copy resources • Create Partition File

Partition File Partition file allows duplicates One data partition may reside in multiple nodes In an event of failure, the duplicates are used to re-schedule the tasks

The use of pub/sub messaging map task queues E.g. Map workers 100 map tasks, 10 workers in 10 nodes Broker network • ~ 10 tasks are producing outputs at once Reduce() • Intermediate data transferred via the broker network • Network of brokers used for load balancing • Different broker topologies • Interspersed computation and data transfer minimizes large message load at the brokers • Currently supports • NaradaBrokering • ActiveMQ

Scheduling • Twister supports long running tasks • Avoids unnecessary initializations in each iteration • Tasks are scheduled statically • Supports task reuse • May lead to inefficient resources utilization • Expect user to randomize data distributions to minimize the processing skews due to any skewness in data

Fault Tolerance • Recover at iteration boundaries • Does not handle individual task failures • Assumptions: • Broker network is reliable • Main program & Twister Driver has no failures • Any failures (hardware/daemons) result the following fault handling sequence • Terminate currently running tasks (remove from memory) • Poll for currently available worker nodes (& daemons) • Configure map/reduce using static data (re-assign data partitions to tasks depending on the data locality) • Re-execute the failed iteration

Performance Evaluation Hardware Configurations We use the academic release of DryadLINQ, Apache Hadoop version 0.20.2, and Twister for our performance comparisons. Both Twister and Hadoop use JDK (64 bit) version 1.6.0_18, while DryadLINQ and MPI uses Microsoft .NET version 3.5.

Pair wise Sequence Comparison using Smith Waterman Gotoh Typical MapReduce computation Comparable efficiencies Twister performs the best

Pagerank – An Iterative MapReduce Algorithm Partial Adjacency Matrix Current Page ranks (Compressed) M Partial Updates R Partially merged Updates C Iterations [1] Pagerank Algorithm, http://en.wikipedia.org/wiki/PageRank [2] ClueWeb09 Data Set, http://boston.lti.cs.cmu.edu/Data/clueweb09/ Well-known pagerank algorithm [1] Used ClueWeb09 [2] (1TB in size) from CMU Reuse of map tasks and faster communication pays off

Multi-dimensional Scaling While(condition) { <X> = [A] [B] <C> C = CalcStress(<X>) } While(condition) { <T> = MapReduce1([B],<C>) <X> = MapReduce2([A],<T>) C = MapReduce3(<X>) } [1] J. de Leeuw, "Applications of convex analysis to multidimensional scaling," Recent Developments in Statistics, pp. 133-145, 1977. • Maps high dimensional data to lower dimensions (typically 2D or 3D) • SMACOF (Scaling by Majorizing of COmplicated Function)[1]

Conclusions & Future Work • Twister extends the MapReduce to iterative algorithms • Several iterative algorithms we have implemented • K-Means Clustering • Pagerank • Matrix Multiplication • Multi dimensional scaling (MDS) • Breadth First Search • Integrating a distributed file system • Programming with side effects yet support fault tolerance

Related Work • General MapReduce References: • Google MapReduce • Apache Hadoop • Microsoft DryadLINQ • Pregel : Large-scale graph computing at Google • Sector/Sphere • All-Pairs • SAGA: MapReduce • Disco

Questions? Thank you!

Extra Slides

Hadoop (Google) Architecture Job tracker assigns some map outputs to a reducer 4 Job Tracker 3 Task Tracker Task Tracker Task tracker notifies job tracker Reducer downloads map outputs using HTTP Map output goes to local disk first M 2 R Reduce output goes to HDFS 6 Map task reads Input data from HDFS 5 Local Local 1 HDFS HDFS stores blocks, manages replications, handle failures Map/reduce are Java processes, not long running Failed maps are re-executed, failed reducers collect data from maps again

Twister Architecture Twister API configureMaps(PartitionFilepartitionFile) configureMaps(Value[] values) configureReduce(Value[] values) String key=addToMemCache(Value value) removeFromMemCache(String key) runMapReduce() runMapReduce(KeyValue[] keyValues) runMapReduceBCast(Value value) Main program Broker Network B TwisterDriver B B B 4 Broker Connection Receive static data (1) ORVariable data (key,value) via the brokers (2) Twister Daemon Twister Daemon 2 1 R 3 Reduce output goes to local disk OR to Combiner M Map output goes directly to reducer Read static data from local disk 4 1 Local Local Scripts for file manipulations Twister daemon is a process, but Map/Reduce tasks are Java Threads (Hybrid approach)

Twister • In-memory MapReduce • Distinction on static data and variable data (data flow vs. δ flow) • Cacheable map/reduce tasks (long running tasks) • Combine operation • Support fast intermediate data transfers Static data Configure() Iterate User Program δ flow Map(Key, Value) Reduce (Key, List<Value>) Close() Combine (Key, List<Value>) Different synchronization and intercommunication mechanisms used by the parallel runtimes

Publications • Jaliya Ekanayake, (Advisor: Geoffrey Fox) Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing, Accepted for the Doctoral Showcase, SuperComputing2009. • Xiaohong Qiu, Jaliya Ekanayake, Scott Beason, Thilina Gunarathne, Geoffrey Fox, Roger Barga, Dennis Gannon, Cloud Technologies for Bioinformatics Applications, Accepted for publication in 2nd ACM Workshop on Many-Task Computing on Grids and Supercomputers, SuperComputing2009. • Jaliya Ekanayake, Atilla Soner Balkir, Thilina Gunarathne, Geoffrey Fox, Christophe Poulain, Nelson Araujo, Roger Barga, DryadLINQ for Scientific Analyses, Accepted for publication in Fifth IEEE International Conference on e-Science (eScience2009), Oxford, UK. • Jaliya Ekanayake and Geoffrey Fox, High Performance Parallel Computing with Clouds and Cloud Technologies, First International Conference on Cloud Computing (CloudComp2009), Munich, Germany. – An extended version of this paper goes to a book chapter. • Geoffrey Fox, Seung-Hee Bae, Jaliya Ekanayake, Xiaohong Qiu, and Huapeng Yuan, Parallel Data Mining from Multicore to Cloudy Grids, High Performance Computing and Grids workshop, 2008. – An extended version of this paper goes to a book chapter. • Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox, MapReduce for Data Intensive Scientific Analyses, Fourth IEEE International Conference on eScience, 2008, pp.277-284. • Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox, A collaborative framework for scientific data analysis and visualization, Collaborative Technologies and Systems(CTS08), 2008, pp. 339-346. • Shrideep Pallickara, Jaliya Ekanayake and Geoffrey Fox, A Scalable Approach for the Secure and Authorized Tracking of the Availability of Entities in Distributed Systems, 21st IEEE International Parallel & Distributed Processing Symposium (IPDPS 2007).

Twister: A Runtime for Iterative MapReduce