
Data Intensive Scientific Compute Model for Multicore Clusters


Presentation Transcript


  1. UMBC: An Honors University in Maryland. Department of Computer Science and Electrical Engineering. Data Intensive Scientific Compute Model for Multicore Clusters. Ph.D. dissertation defense by Phuong Nguyen. Advisors: Prof. Milton Halem and Prof. Yelena Yesha. Dissertation committee: Prof. Milton Halem (Chair), Prof. Yelena Yesha (Co-Chair), Prof. Tim Finin, Prof. Yaacov Yesha, Prof. Tarek El-Ghazawi (George Washington University). December 21, 2012

  2. Acknowledgement • I would like to thank my advisors, Prof. Milton Halem and Prof. Yelena Yesha • I would like to thank the committee members: Prof. Milton Halem, Prof. Yelena Yesha, Prof. Tim Finin, Prof. Yaacov Yesha, and Prof. Tarek El-Ghazawi of George Washington University • I would like to thank my research sponsors: IBM, the NASA ACCESS grant, the joint CHMPR-CHREC/GWU grant, NASA HQ, and NSA/LTS • I would like to thank my teammates (David Chapman and Tyler Simon) for joint research and discussions

  3. Outline 1. Motivation, thesis statement, and contributions 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating a Fundamental Decadal Data Record (FDDR) from the Atmospheric Infrared Sounder (AIRS) and studying global/regional climate change 4. Conclusion and future work

  4. Motivation • Big data: the 3 V's (Stonebraker, MIT) • Big Volume • simple or complex analytics (NASA, NOAA) • Big Velocity • data grows at an exponential rate, e.g., MODIS 1.7 PB/year, Sloan Digital Sky Survey 6 PB/year • Big Variety • diverse data sources to integrate • Many other science domains: bioinformatics, astronomy, weather, prediction models

  5. Motivation (Image: DICI, http://dicomputing.pnnl.gov) • Analyzing NOAA and NASA petabytes of satellite Infrared Radiance (IR) data is highly data intensive • Complex science products need automated, transparent workflows to manage and execute computations • Need parallel, scalable systems to improve key performance metrics

  6. Modest Challenge at Scale • A typical example: accessing 1,000 TB to complete science experiments within days on moderately sized clusters • requires distributing the data onto many disks • using thousands of cores • and fast networks • The challenge at scale is managing • computation on thousands of cores • scalable distributed file systems • failures • synchronization and load balancing

  7. Problem Challenges • Current limitations: Hadoop is a scalable system built to run at large scale (e.g., it runs on 8,000 cores) • but key performance metrics still need improvement • and it has limited support for scientific applications • Develop a scientific workflow system that deals with scale • scalability, reliability, scheduling, data management, provenance • low overhead • Create Fundamental Decadal Data Records (FDDRs) directly from all-sky satellite observations of infrared radiances to study global climate change • well calibrated, quality controlled, of sufficient length, and consistent • needed to determine climate variability and change

  8. Thesis Statement The purpose of this study is to • Develop a scalable workflow system for scientific data-intensive problems • that supports scientific applications • Establish the applicability of the workflow system for producing AIRS FCDRs • Develop a MapReduce scheduling algorithm to improve latency and throughput performance metrics • validated with standard Hadoop benchmarks • Create IR FDDRs from AIRS and study climate change over 2002-2011

  9. Contributions IGARSS: International Geoscience and Remote Sensing Symposium; AGU: American Geophysical Union; TGARS: IEEE Transactions on Geoscience and Remote Sensing

  10. Related work: Workflow systems • Grid and SOA workflow systems (Taverna, Oinn et al. 2004; Kepler, Ludascher et al. 2006; Pegasus, Deelman et al. 2004) seek to minimize makespan by manipulating workflow-level parameters such as grouping and mapping workflow components

  11. Related Work • Fair Scheduler (EuroSys 2010), "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," Matei Zaharia et al., UC Berkeley • fairness between users (pools) • map data locality via delay scheduling • reduced latency relative to FIFO • preemption by killing tasks • available as the Hadoop Fair Scheduler • does not consider the dynamic progress of jobs • does not know task runtimes • Y. Chen et al., "The Case for Evaluating MapReduce Performance Using Workload Suites," MASCOTS 2011, UC Berkeley • shows that the workload mix affects job latency under a given Hadoop scheduler • Capacity Scheduler; Natjam scheduler, "Satisfying Strong Application Requirements in Data-Intensive Cloud Computing Environments," Brian Cho, Ph.D. thesis, UIUC 2012 • prioritizes production jobs over research jobs • efficiently shares with research jobs (suspend/resume, eviction policies)

  12. Outline 1. Motivation, thesis statement, contributions, and related work 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating AIRS FCDRs and studying global/regional climate change 4. Conclusion and future work

  13. Scientific data-intensive applications • Data-intensive applications refer to problems that produce, manipulate, analyze, or integrate complex patterns in huge volumes of data • Focus on data-intensive problems • Definition: the ratio of computation to data communication is low • Characteristics of data-intensive scientific applications • repeated experiments on different data sets • computations on high-dimensional arrays: spatial, temporal, spectral • a variety of data formats and a need for math libraries • complex components, e.g., prediction models

  14. Why scientific workflow systems? • Component model, reuse • Data source discovery (search and manage data) • Provenance • Require the use of parallel and distributed computation for very large amounts of data • HPC systems target compute-intensive problems; existing workflow systems lack support for • scalability, reliability, scheduling, data management, provenance • low overhead

  15. MapReduce Programming Model • Map, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs • the library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function • The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key and merges those values • Implementation: commodity hardware • hides parallelism, scheduling of computation and communication, and failures • handles lists of values that are too large to fit in memory • A minimal WordCount sketch in Java follows.
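A minimal sketch of the model using the standard Hadoop Mapper/Reducer API and the classic WordCount example (one of the benchmarks used later in the experiments): the Mapper emits (word, 1) pairs, and the framework groups them so the Reducer can sum the counts per word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit an intermediate (word, 1) pair.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // intermediate key/value pair
      }
    }
  }
}

// Reduce: the framework groups all values for the same word; sum them.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));  // final (word, total) pair
  }
}
```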

  16. MapReduce programming model and its implementation (cont.) • Easy to use: hides data distribution, computation management, failures, and recovery; scalable; proven successful in industry applications • Uses a fixed, brute-force communication/computation pattern (Map/Shuffle/Sort/Reduce) • Does not yet have math libraries or complex tools, and it is still difficult to integrate existing components

  17. Background: Hadoop MapReduce data flow. Source: http://developer.yahoo.com/hadoop/tutorial/module4.html

  18. A MR workflow system • Represents data flows in a workflow instance as a DAG (Directed Acyclic Graph) • nodes are jobs (the current implementation handles MapReduce jobs, each consisting of many MR tasks) • edges are data dependencies • Scheduling algorithm • controls the level of concurrency and monitors job status (FAILED, SUCCESS, WAIT, RUNNING) • shares MapReduce cluster resources using fine-grained MR task scheduling • predicts job performance using a Kernel Canonical Correlation Analysis (KCCA) model, e.g., to estimate task runtime and job size • MR job constructs: input, output, configurations, weight, execution path, status (FAILED, SUCCESS, WAIT, RUNNING), jar file • Java APIs: DagJob, DagBuilder, Graphs … (a hypothetical usage sketch follows) 2. MR workflow system
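The slide names the Java APIs (DagJob, DagBuilder) but not their signatures; the sketch below is only a hypothetical illustration, with assumed constructors and method names (addJob, addDependency, run), of how a two-stage gridding-then-averaging workflow might be declared as a DAG of MapReduce jobs.

```java
// Hypothetical use of the workflow APIs named on the slide (DagJob, DagBuilder).
// Constructors and method names are assumptions for illustration only.
public class GriddingWorkflow {
  public static void main(String[] args) throws Exception {
    DagBuilder dag = new DagBuilder("airs-daily-gridding");

    // Each node is a MapReduce job: input path, output path, jar, and a weight
    // (e.g., an estimated job size, which the scheduler can obtain from the KCCA model).
    DagJob grid = new DagJob("grid", "/data/airs/day", "/out/gridded",
                             "airs-grid.jar", /*weight=*/240);
    DagJob mean = new DagJob("mean", "/out/gridded", "/out/daily-mean",
                             "airs-average.jar", /*weight=*/24);

    dag.addJob(grid);
    dag.addJob(mean);
    dag.addDependency(grid, mean);   // edge: 'mean' consumes the output of 'grid'

    // The engine tracks each job's status (WAIT, RUNNING, SUCCESS, FAILED)
    // and submits a job only when all of its parent jobs have succeeded.
    dag.run();
  }
}
```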

  19. A MR workflow system

  20. Real industry MR workloads (Hadoop) • Summary of Hadoop workloads analyzed over 5 months (CC: Cloudera customers, FB: Facebook) • Job types range from small jobs (seconds) mixed with large jobs (hours) • The vast majority of jobs are small • Data access is skewed: some datasets are accessed heavily • More than 78% of data re-accesses occur within a few hours • Source: Yanpei Chen et al., VLDB 2012

  21. Background: Hadoop MapReduce scheduling model • Scheduling by map/reduce slots • Shared resources: jobs share the resources (#map/#reduce slots); K jobs run on the cluster at the same time • Workers pull tasks from the master • Data is stored at the workers (nodes) of the cluster • Issue #1: data locality 3. MR scheduling algorithm

  22. Map/Reduce dependency and reduce scheduling issues • [Timeline figure: map and reduce phases of jobs 1-4] • Issue #2: job 3 waits for its last map tasks to finish • Issue #3: job 4 cannot start its reduce phase because the reduce slots are taken by job 3 3. MR scheduling algorithm

  23. An ideal approach to fix the above issues • [Timeline figure: reordered map and reduce phases of jobs 1-4, showing the wait-time saving]

  24. MapReduce task scheduling assumptions • Benefits of sharing among multiple users • higher utilization due to statistical multiplexing • shared data management (cross-data queries, over-replication) • Scheduling assumptions • single global queue; workers periodically pull tasks • non-preemptive (tasks cannot be stopped once running; no killing of tasks) • a limit W on the number of concurrent jobs • at scheduling time there is a limited amount of resources (cluster capacity) • multiple users share the same cluster • past history/learned job properties are available via the KCCA model • data locality is key; commodity hardware and a 1 Gb/s network 3. MR scheduling algorithm

  25. Dynamic priority and proportional share • Approach: builds on the priority work of Ward et al., who used dynamic priority for a relaxed backfill scheduling algorithm in an HPC environment • HybS, however, uses both dynamic priority and proportional-share assignment • α=1 and β=γ=0 gives FIFO • α=0, β=-1, and γ=0 gives shortest job first • α=0, β=0, and γ=-1 gives smallest job size first, while γ=1 produces largest job size first • a queue/pool factor boosts user priority (see the reconstructed formula below) 3. MR scheduling algorithm
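The slide lists only the special cases of the α, β, γ policy parameters; a dynamic-priority formula consistent with those cases (a reconstruction, not quoted from the slides) combines a job's wait time w_i, estimated remaining runtime t_i, and job size s_i multiplicatively:

```latex
% Reconstructed HybS dynamic priority (an assumption consistent with the listed cases):
%   \alpha = 1,\ \beta = \gamma = 0   -> priority = wait time        (FIFO)
%   \beta  = -1                        -> 1 / estimated runtime       (shortest job first)
%   \gamma = -1 \ \text{or}\ +1        -> smallest / largest job size first
p_i \;=\; w_i^{\alpha}\, t_i^{\beta}\, s_i^{\gamma}
```

Under this reading, t_i comes from the KCCA prediction model, and a per-queue/pool factor can multiply p_i to boost a user's priority.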

  26. HybS scheduling algorithm • The choice of scheduling policy parameters α, β, γ affects job latency • Defines a service-level value for quality of service, e.g., scheduling only XX% of a job's tasks per round • prevents a job from consuming all available resources • Needs to know how long a task will run in order to calculate the priority • uses past history/learned job properties via the KCCA model • current popular Hadoop schedulers do not know how long a task will run 3. MR scheduling algorithm

  27. How we calculate the weight • Existing approaches • dynamic prediction based on job progress • static estimates from historical samples • profiling, traces • Our approach: a prediction model based on historical data • learn performance metrics using Kernel Canonical Correlation Analysis (KCCA), following Ganapathi et al., UC Berkeley • estimate the resources jobs need to meet the predicted performance • accuracy of 0.84 for map time on Hive queries (courtesy of Ganapathi et al.) 2. MR workflow

  28. How we calculated the job performance

  29. HybS scheduling algorithm (cont.): fractional knapsack • p(v_i) is the priority function of job i • f(w_i) = NumRemainTask_i * ServiceLevelValue_i • For each heartbeat, there are W slots available: • 1. UPDATE priorities (for all jobs) and SORT jobs by priority • 2. CALCULATE f(w_i) for all jobs • 3. ASSIGN tasks to slots by fractional knapsack • 3.1 assign node-local and rack-local tasks (walking the list f(w_i) in priority order) until the W available slots are assigned • 3.2 assign non-local tasks if slots remain • A Java sketch of this loop follows. 3. MR scheduling algorithm
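A compact Java sketch of the per-heartbeat loop described above, under stated assumptions: the SchedulableJob interface is an invented stand-in exposing the priority p(v_i) and the per-round allowance f(w_i) = NumRemainTask_i * ServiceLevelValue_i, whereas the real plug-in works against Hadoop's TaskScheduler and JobInProgress classes, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Assumed minimal job view for illustration only.
interface SchedulableJob {
  double priority();                    // p(v_i), recomputed every heartbeat
  int remainingTasks();
  double serviceLevelValue();           // fraction of a job's tasks allowed per round
  boolean hasLocalTask(String node);    // node-local or rack-local task available?
  Runnable nextTask(boolean localOnly); // returns a task to launch, or null
}

class HybSHeartbeat {
  // Assign up to W free slots on 'node' using the fractional-knapsack rule.
  static List<Runnable> assign(List<SchedulableJob> jobs, String node, int W) {
    // 1. Update priorities and sort jobs by priority (highest first).
    jobs.sort(Comparator.comparingDouble(SchedulableJob::priority).reversed());

    List<Runnable> launched = new ArrayList<>();
    // 3.1 First pass: node-local / rack-local tasks, bounded by f(w_i).
    for (SchedulableJob j : jobs) {
      int allowance = (int) Math.ceil(j.remainingTasks() * j.serviceLevelValue());
      while (allowance-- > 0 && launched.size() < W && j.hasLocalTask(node)) {
        launched.add(j.nextTask(true));
      }
    }
    // 3.2 Second pass: fill any remaining slots with non-local tasks.
    for (SchedulableJob j : jobs) {
      while (launched.size() < W && j.remainingTasks() > 0) {
        Runnable t = j.nextTask(false);
        if (t == null) break;
        launched.add(t);
      }
    }
    return launched;
  }
}
```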

  30. Experiment setup • Implemented on Hadoop 1.0.1; available as a Hadoop plug-in scheduler • Micro-benchmarks on 72 cores / 9 nodes (2.0 GHz quad-core, 1 TB local disk), configured with 4 map slots per node, all on the same rack • Experiment 1: 21 jobs (TeraSort, WordCount, and AIRS Grid, a NASA AIRS data projection that is both I/O- and compute-intensive) • One day of AIRS data is 14 GB and consists of 240 granules (files) 3. MR scheduling algorithm

  31. Experiment #1: mixed workloads • Total job runtime, HybS vs. FairS vs. FIFO scheduler, for 21 jobs with the different workloads described in the table above • HybS with α=0, β=-1, γ=-1 and with α=1, β=-1, γ=-1 ("balanced, closer-to-finish first") in experiment 1 3. MR scheduling algorithm

  32. HybS vs. FairS comparison • HybS performed better for all jobs except jobs 9 and 13 • For the small jobs (18, 19, 20, 21), HybS is 4x to 6.7x faster, while FairS is 1.04x to 1.09x faster for the big jobs (9, 13) • HybS provides 2.4x faster response time on average than FairS 3. MR scheduling algorithm

  33. Experiment #2: short-task-runtime workload • A short-task-runtime workload of 95 small jobs with 4-5 second tasks • 50 to 148 maps per job; data set sizes from MBs to GBs • HybS provides 2.1x faster response time on average than FairS • Total job runtime, HybS (α=1, β=-1, γ=-1) vs. FairS 3. MR scheduling algorithm

  34. Experiment #3: using an open-source cloud • The Eucalyptus cloud vs. an identically configured physical cluster • TeraSort on 1.2 GB using 2 jobs, each with 60 maps • The virtualization overhead is significant • in the shuffle (7.5x slower) and reduce (8.2x slower) phases • disk I/O and network I/O rates for virtual images are an order of magnitude slower than those of the physical system 3. MR scheduling algorithm

  35. A MR workflow model and its implementation (cont.) • Supports scientific data formats (HDF4) and float arrays • Multi-dimensional data arrays are stored in a BigTable-style store: random, real-time read/write access and spatial indexing of multi-dimensional arrays • Implemented as a workflow system on top of Apache Hadoop • Uses the Hadoop ecosystem (Hadoop and HBase) • Users can apply query tools (Hive, Pig) to the multi-dimensional data arrays in HBase • Available as Java APIs 2. MR workflow

  36. Spatial data locality • A bounding box is implemented • Output is stored in HBase tables for queries • Image source: David Chapman; thanks to David Chapman for many discussions about spatial data locality 2. MR workflow

  37. HBase design for multiple satellites: gridded data • (Table, RowKey, Family, Column, Timestamp) → Value • HBase indexes on the row key value • Row key design for multiple satellite instruments: <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex> • Column families, e.g., resolution statistics: column_100km, column_1km; the spatial index is a lat/lon bounding box • Indexed by instrument, date/time, spectral channel, and spatial index • Selected rows (and columns) are scanned into the MapReduce computation (a sketch follows)
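A hedged Java sketch of the row-key layout above and of a prefix scan over one instrument/day/channel. The key components and their order come from the slide; the zero-padding widths, the table/column-family names, and the use of the HBase 0.9x-era client calls setStartRow/setStopRow are illustrative assumptions.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GriddedRowKeys {
  // Compose a row key as <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>.
  // Padding widths are assumed so that lexicographic order matches scan order.
  static byte[] rowKey(String instrumentId, String dateTime,
                       int spectralChannel, int spatialIndex) {
    String key = String.format("%s_%s_%04d_%06d",
        instrumentId, dateTime, spectralChannel, spatialIndex);
    return Bytes.toBytes(key);
  }

  // Scan all spatial cells of one instrument/day/channel into a MapReduce job:
  // every row key sharing the prefix falls inside [startRow, stopRow).
  static Scan channelDayScan(String instrumentId, String date, int channel) {
    String prefix = String.format("%s_%s_%04d_", instrumentId, date, channel);
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(prefix));
    scan.setStopRow(Bytes.toBytes(prefix + "~"));  // '~' sorts after the key digits
    scan.addFamily(Bytes.toBytes("stats"));        // assumed family, e.g. 100 km statistics
    return scan;
  }
}
```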

  38. MR workflow system vs. Oozie • A mixed workflow of gridding, averaging, and WordCount on AIRS datasets • Same cluster capacity, different input dataset sizes (days), and the same Hadoop task scheduler • Oozie overhead vs. the MR workflow system: 16% and 20% 2. MR workflow

  39. Improvement of AIRS gridding • Estimates based on daily gridding • Used bounding boxes for spatial data locality • Gridding: compared with the regular embarrassingly parallel method, a 35% improvement in total processing time

  40. Outline 1. Motivation, thesis statement, and contributions 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating AIRS FCDRs and studying global/regional climate change 4. Conclusion and future work

  41. Motivation • Goody et al. (1996) showed that IR radiance observations could be used directly to detect the climatic response to greenhouse-gas forcing • J. Houghton (IPCC 1990, 1995): "a strong link exists between increases in greenhouse gases and surface temperature" • Keith, D.W. and Anderson, J.G. (2001) assert that "the direct use of radiances in climate analysis provides a more mathematically direct comparison between theory and observation" • Harries et al. (2001) compared the OLR spectrum from IRIS (1970) and IMG (1997) and measured significant increases in GHGs "consistent with radiative forcings" • AIRS and MODIS, on the same satellite, observing the same fields of view with completely different calibration techniques and different spectrometers, provide a precise relative calibration over a near-decadal time frame • AIRS and MODIS form a unique fundamental 10-year record of inter-calibrated, continuous, stable data from one satellite • No AIRS or MODIS Level 1B gridded data products are available from the instrument science teams; no longwave IR channels on VIRS

  42. Methodologies • Forward gridding method: map the center of each footprint into lat x lon grid cells • At 0.5° x 1.0° lat-lon (~100 km), from 2002-2011 • Produce 9-year AIRS FCDR anomalies (removing the yearly cycle) for window channels 4.16µ and 12.18µ • 55 TB of AIRS data, and a number of months of MODIS data • A gridding sketch follows.
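A small Java sketch of the forward gridding step under stated assumptions: each footprint's center latitude/longitude is mapped to a 0.5° (lat) x 1.0° (lon) cell, and brightness temperatures falling in the same cell are averaged; the field names and the simple sum/count accumulation are illustrative, not the thesis implementation.

```java
// Forward gridding sketch: map each footprint center to a 0.5 deg (lat) x 1.0 deg (lon)
// grid cell and accumulate a mean brightness temperature per cell.
public class ForwardGridder {
  static final int NLAT = 360;   // 180 deg / 0.5 deg
  static final int NLON = 360;   // 360 deg / 1.0 deg

  final double[][] sum   = new double[NLAT][NLON];
  final long[][]   count = new long[NLAT][NLON];

  // Convert a footprint center to grid indices (clamped at the poles / date line).
  static int latIndex(double lat) { return Math.min(NLAT - 1, (int) ((lat + 90.0) / 0.5)); }
  static int lonIndex(double lon) { return Math.min(NLON - 1, (int) ((lon + 180.0) / 1.0)); }

  void add(double lat, double lon, double brightnessTemp) {
    int i = latIndex(lat), j = lonIndex(lon);
    sum[i][j] += brightnessTemp;
    count[i][j]++;
  }

  double cellMean(int i, int j) {
    return count[i][j] == 0 ? Double.NaN : sum[i][j] / count[i][j];
  }
}
```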

  43. Methodologies • Calibration validation by inter-comparison between AIRS and MODIS • convolution method • Correlation validation between all-sky surface brightness temperature and surface temperatures for trend and change analysis • GISS surface temperature • Data processing model: David Chapman's Gridderama • Requires stability and accuracy of 0.01 K for global annual changes

  44. Convolving AIRS with MODIS SRFs to compare the stability of AIRS and MODIS • On the same satellite • Integrated the convolved AIRS channels over the MODIS spectral range • Adjusted the MODIS scan angle to match AIRS • Compared the 4µ and 12µ channels: 12.02µ is extremely stable over the decade, with a trend of 0.001 K and a negligible AIRS warm bias • AIRS is suitably calibrated for determining decadal trends as small as 0.1 K for the 12µ window channel • Source: David Chapman et al., AIRS science team meeting 2012 • A convolution sketch is shown below.
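A minimal Java sketch of the convolution step, assuming AIRS channel radiances and a MODIS spectral response function (SRF) sampled on the same wavenumber grid; the convolved value is the SRF-weighted average of the AIRS radiances, which can then be compared with the co-located MODIS measurement. This is an illustration of the general technique, not the thesis code.

```java
// Convolve AIRS hyperspectral radiances with a MODIS spectral response function (SRF)
// sampled on the same wavenumber grid, producing a MODIS-like broadband radiance.
public final class SrfConvolution {
  static double convolve(double[] airsRadiance, double[] modisSrf) {
    if (airsRadiance.length != modisSrf.length) {
      throw new IllegalArgumentException("radiance and SRF grids must match");
    }
    double weighted = 0.0, weight = 0.0;
    for (int k = 0; k < airsRadiance.length; k++) {
      weighted += modisSrf[k] * airsRadiance[k];
      weight   += modisSrf[k];
    }
    return weighted / weight;   // SRF-weighted mean radiance
  }
}
```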

  45. Sensitivity of AIRS global surface BT trends • Fig. a: AIRS surface BT and GISS ST global annual-mean anomaly trends are flat for ch. 4.16µ, with an annual correlation of 0.96 and a slight difference in 2007 • Fig. b: AIRS ch. 12.18µ exhibits a poorer annual correlation with GISS of 0.7 and a slight decadal cooling trend of 0.018 K, possibly from cloud effects • Both are consistently colder in 2004 and 2008, by ~0.07 K and 0.12 K (Fig. a); similar colder years and magnitudes for AIRS ch. 12.18µ • However, for AIRS 12.18µ, the years 2007, 2010, and 2011 show larger differences from GISS ST of -0.075, -0.08, and 0.06 K (Fig. b)

  46. AIRS annual BT trends in the Arctic • A significant warming trend in the Arctic of 0.06 K per year for channel 4.16µ, with yearly changes varying by as much as 1.6 K and a decadal gain of 0.6 K (Fig. 4a) • A slightly larger trend of 0.087 K/yr is observed for AIRS channel 12.18µ (Fig. 4c) • Both AIRS 4.16µ and 12.18µ show correspondingly large year-to-year Arctic oscillations of ~ -0.8 K to 0.6 K, adding credibility to the observation of decadal warming in the Arctic

  47. AIRS BT shows high correlations (>0.9) with GISS ST in all seasons except Sep-Oct-Nov, with a correlation of 0.84

  48. AIRS annual BT trends in the Arctic • A significant warming trend in the Arctic of 0.06 K per year for channel 4.16µ, with yearly changes varying by as much as 0.7 K and a decadal gain of 0.6 K (Fig. 4a) • A slightly larger trend of 0.087 K/yr is observed for AIRS channel 12.18µ (Fig. 4c) • Both AIRS 4.16µ and 12.18µ show correspondingly large year-to-year Arctic oscillations of ~ -0.8 K to 0.6 K, adding credibility to the observation of decadal warming in the Arctic

  49. AIRS annual BT trends in the Antarctic • Fig. 4b and Fig. 4d show that trends are flat for 60S-90S in AIRS channel 4.16µ, with a less significant cooling of 0.014 K for AIRS channel 12.18µ • Both the AIRS 4.16µ and 12.18µ channels show small cooling trends of 0.004 K and 0.008 K, respectively, in the middle latitudes (60S-60N) for the reported period (2003-2011)

  50. Climate process: Madden-Julian Oscillation (MJO) • Implemented a parallel block-Jacobi SVD algorithm for the MJO analysis • The first EOF explains ~14.4% of the variance • A serial illustration of the EOF computation is sketched below.
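The thesis implemented a parallel block-Jacobi SVD; as a simpler serial stand-in for the same EOF computation, the sketch below uses Apache Commons Math 3 to take the SVD of an anomaly matrix (rows = time, columns = grid cells) and reports the fraction of variance explained by the first EOF (the quantity quoted as ~14.4% above). The loader and the tiny sample matrix are placeholders; in the thesis the anomalies come from the gridded FCDR.

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// Serial EOF analysis via SVD of an anomaly matrix X (rows = time, cols = grid cells).
// The thesis used a parallel block-Jacobi SVD; this is only an illustrative stand-in.
public class EofAnalysis {
  public static void main(String[] args) {
    double[][] anomalies = loadAnomalies();          // placeholder loader
    RealMatrix X = new Array2DRowRealMatrix(anomalies);

    SingularValueDecomposition svd = new SingularValueDecomposition(X);
    double[] s = svd.getSingularValues();

    // Variance explained by mode k is s_k^2 / sum_j s_j^2.
    double total = 0.0;
    for (double v : s) total += v * v;
    double firstEofFraction = (s[0] * s[0]) / total;
    System.out.printf("First EOF explains %.1f%% of the variance%n",
        100.0 * firstEofFraction);

    // The spatial pattern of EOF 1 is the first column of V (svd.getV()).
  }

  // Hypothetical loader with a tiny sample matrix for illustration.
  static double[][] loadAnomalies() {
    return new double[][] { {0.1, -0.2, 0.05}, {-0.1, 0.3, -0.15}, {0.0, -0.1, 0.1} };
  }
}
```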
