
Data Intensive Scientific Compute Model for Multicore Clusters


Presentation Transcript


  1. UMBC: An Honors University in Maryland. Department of Computer Science and Electrical Engineering. Data Intensive Scientific Compute Model for Multicore Clusters. Ph.D. dissertation defense by Phuong Nguyen. Advisors: Prof. Milton Halem and Prof. Yelena Yesha. Dissertation committee: Prof. Milton Halem (Chair), Prof. Yelena Yesha (Co-Chair), Prof. Tim Finin, Prof. Yaacov Yesha, Prof. Tarek El-Ghazawi (George Washington University). December 21, 2012

  2. Acknowledgement • I would like to thank my advisors, Prof. Milton Halem and Prof. Yelena Yesha • I would like to thank the committee members: Prof. Milton Halem, Prof. Yelena Yesha, Prof. Tim Finin, Prof. Yaacov Yesha, and Prof. Tarek El-Ghazawi of George Washington University • I would like to thank my research sponsors: IBM, the NASA ACCESS grant, the joint CHMPR-CHREC/GWU grant, NASA HQ, and NSA/LTS • I would like to thank my teammates (David Chapman and Tyler Simon) for joint research and discussions

  3. Outline 1. Motivation, thesis statement, and contributions 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating a Fundamental Decadal Data Record (FDDR) from the Atmospheric Infrared Sounder (AIRS) and studying global/regional climate change 4. Conclusion and future work

  4. Motivation • Big data: the 3 V's (Stonebraker, MIT) • Big Volume • simple or complex analytics (NASA, NOAA) • Big Velocity • data grows at an exponential rate, e.g., MODIS 1.7 PB/year, Sloan Digital Sky Survey 6 PB/year • Big Variety • diverse data sources to integrate • Many other science domains: bioinformatics, astronomy, weather, prediction models

  5. Motivation (Image: DICI, http://dicomputing.pnnl.gov) • Analyzing NOAA and NASA petabytes of satellite Infrared Radiance (IR) data is highly data intensive • Complex science products need automated, transparent workflows to manage and execute computations • Need parallel, scalable systems to improve key performance metrics

  6. Modest Challenge at Scale • A typical example: accessing 1,000 TB to complete science experiments within days on moderately sized clusters • requires distributing the data onto many disks • using thousands of cores • and fast networks • The challenge at scale is managing • computation on thousands of cores • scalable distributed file systems • failures • synchronization and load balancing

  7. Problem Challenges • Current limitations: Hadoop is a scalable system built to run at large scale (e.g., it runs on 8,000 cores) • but key performance metrics still need improvement • and it has limited support for scientific applications • Develop a scientific workflow system that deals with scale • scalability, reliability, scheduling, data management, provenance • low overhead • Create Fundamental Decadal Data Records (FDDRs) directly from all-sky satellite observations of infrared radiances to study global climate change • well calibrated, quality controlled, of sufficient length, and consistent • needed to determine climate variability and change

  8. Thesis Statement The purpose of this study is to • Develop a scalable workflow system for scientific data-intensive problems • that supports scientific applications • Establish the applicability of the workflow system for producing AIRS FCDRs • Develop a MapReduce scheduling algorithm to improve latency and throughput performance metrics • validated with standard Hadoop benchmarks • Create IR FDDRs from AIRS and study climate change over 2002-2011

  9. Contributions IGARSS: International Geoscience and Remote Sensing Symposium; AGU: American Geophysical Union; TGARS: IEEE Transactions on Geoscience and Remote Sensing

  10. Related work: Workflow systems • Grid and SOA workflow systems (Taverna, Oinn et al. 2004; Kepler, Ludascher et al. 2006; Pegasus, Deelman et al. 2004) seek to minimize makespan by manipulating workflow-level parameters such as grouping and mapping workflow components

  11. Related Work • Fair Scheduler (EuroSys 2010), "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," Matei Zaharia et al., UC Berkeley • fairness between users (pools) • map data locality via delay scheduling • reduced latency relative to FIFO • preemption by killing tasks • available as the Hadoop Fair Scheduler • does not consider the dynamic progress of jobs • does not know task runtimes • Y. Chen et al., "The Case for Evaluating MapReduce Performance Using Workload Suites," MASCOTS 2011, UC Berkeley • shows that the workload mix affects job latency under a given Hadoop scheduler • Capacity Scheduler; Natjam scheduler, "Satisfying Strong Application Requirements in Data-Intensive Cloud Computing Environments," Brian Cho, Ph.D. thesis, UIUC 2012 • prioritizes production jobs over research jobs • efficiently shares with research jobs (suspend/resume, eviction policies)

  12. Outline 1. Motivation, thesis statement, contributions, and related work 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating AIRS FCDRs and studying global/regional climate change 4. Conclusion and future work

  13. Scientific data-intensive applications • Data-intensive applications refer to problems that produce, manipulate, analyze, or integrate complex patterns in huge volumes of data • Focus on data-intensive problems • Definition: the ratio of computation to data communication is low • Characteristics of data-intensive scientific applications • repeated experiments on different data sets • computations on high-dimensional arrays: spatial, temporal, spectral • a variety of data formats and a need for math libraries • complex components, e.g., prediction models

  14. Why scientific workflow systems? • Component model, reuse • Data source discovery (search and manage data) • Provenance • Require the use of parallel and distributed computation for very large amounts of data • HPC systems target compute-intensive problems; existing workflow systems lack support for • scalability, reliability, scheduling, data management, provenance • low overhead

  15. MapReduce Programming Model • Map, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs • the library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function • The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key and merges those values • Implementation: commodity hardware • hides parallelism, scheduling of computation and communication, and failures • handles lists of values that are too large to fit in memory • A minimal WordCount sketch in Java follows.
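A minimal sketch of the model using the standard Hadoop Mapper/Reducer API and the classic WordCount example (one of the benchmarks used later in the experiments): the Mapper emits (word, 1) pairs, and the framework groups them so the Reducer can sum the counts per word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line, emit an intermediate (word, 1) pair.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // intermediate key/value pair
      }
    }
  }
}

// Reduce: the framework groups all values for the same word; sum them.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));  // final (word, total) pair
  }
}
```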

  16. MapReduce programming model and its implementation (cont.) • Easy to use: hides data distribution, computation management, failures, and recovery; scalable; proven successful in industry applications • Uses a fixed, brute-force communication/computation pattern (Map/Shuffle/Sort/Reduce) • Does not yet have math libraries or complex tools, and it is still difficult to integrate existing components

  17. Background: Hadoop MapReduce data flow. Source: http://developer.yahoo.com/hadoop/tutorial/module4.html

  18. A MR workflow system • Represents data flows in a workflow instance as a DAG (Directed Acyclic Graph) • nodes are jobs (the current implementation handles MapReduce jobs, each consisting of many MR tasks) • edges are data dependencies • Scheduling algorithm • controls the level of concurrency and monitors job status (FAILED, SUCCESS, WAIT, RUNNING) • shares MapReduce cluster resources using fine-grained MR task scheduling • predicts job performance using a Kernel Canonical Correlation Analysis (KCCA) model, e.g., to estimate task runtime and job size • MR job constructs: input, output, configurations, weight, execution path, status (FAILED, SUCCESS, WAIT, RUNNING), jar file • Java APIs: DagJob, DagBuilder, Graphs … (a hypothetical usage sketch follows) 2. MR workflow system
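The slide names the Java APIs (DagJob, DagBuilder) but not their signatures; the sketch below is only a hypothetical illustration, with assumed constructors and method names (addJob, addDependency, run), of how a two-stage gridding-then-averaging workflow might be declared as a DAG of MapReduce jobs.

```java
// Hypothetical use of the workflow APIs named on the slide (DagJob, DagBuilder).
// Constructors and method names are assumptions for illustration only.
public class GriddingWorkflow {
  public static void main(String[] args) throws Exception {
    DagBuilder dag = new DagBuilder("airs-daily-gridding");

    // Each node is a MapReduce job: input path, output path, jar, and a weight
    // (e.g., an estimated job size, which the scheduler can obtain from the KCCA model).
    DagJob grid = new DagJob("grid", "/data/airs/day", "/out/gridded",
                             "airs-grid.jar", /*weight=*/240);
    DagJob mean = new DagJob("mean", "/out/gridded", "/out/daily-mean",
                             "airs-average.jar", /*weight=*/24);

    dag.addJob(grid);
    dag.addJob(mean);
    dag.addDependency(grid, mean);   // edge: 'mean' consumes the output of 'grid'

    // The engine tracks each job's status (WAIT, RUNNING, SUCCESS, FAILED)
    // and submits a job only when all of its parent jobs have succeeded.
    dag.run();
  }
}
```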

  19. A MR workflow system

  20. Real industry MR workloads (Hadoop) • Summary of Hadoop workloads analyzed over 5 months (CC: Cloudera customers, FB: Facebook) • Job types range from small jobs (seconds) mixed with large jobs (hours) • The vast majority of jobs are small • Data access is skewed: some datasets are accessed heavily • More than 78% of data re-accesses occur within a few hours • Source: Yanpei Chen et al., VLDB 2012

  21. Background: Hadoop MapReduce scheduling model • Scheduling by map/reduce slots • Shared resources: jobs share the resources (#map/#reduce slots); K jobs run on the cluster at the same time • Workers pull tasks from the master • Data is stored at the workers (nodes) of the cluster • Issue #1: data locality 3. MR scheduling algorithm

  22. Map/Reduce dependency and reduce scheduling issues • [Timeline figure: map and reduce phases of jobs 1-4] • Issue #2: job 3 waits for its last map tasks to finish • Issue #3: job 4 cannot start its reduce phase because the reduce slots are taken by job 3 3. MR scheduling algorithm

  23. An ideal approach to fix the above issues • [Timeline figure: reordered map and reduce phases of jobs 1-4, showing the wait-time saving]

  24. MapReduce task scheduling assumptions • Benefits of sharing among multiple users • higher utilization due to statistical multiplexing • shared data management (cross-data queries, over-replication) • Scheduling assumptions • single global queue; workers periodically pull tasks • non-preemptive (tasks cannot be stopped once running; no killing of tasks) • a limit W on the number of concurrent jobs • at scheduling time there is a limited amount of resources (cluster capacity) • multiple users share the same cluster • past history/learned job properties are available via the KCCA model • data locality is key; commodity hardware and a 1 Gb/s network 3. MR scheduling algorithm

  25. Dynamic priority and proportional share • Approach: builds on the priority work of Ward et al., who used dynamic priority for a relaxed backfill scheduling algorithm in an HPC environment • HybS, however, uses both dynamic priority and proportional-share assignment • α=1 and β=γ=0 gives FIFO • α=0, β=-1, and γ=0 gives shortest job first • α=0, β=0, and γ=-1 gives smallest job size first, while γ=1 produces largest job size first • a queue/pool factor boosts user priority (see the reconstructed formula below) 3. MR scheduling algorithm
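The slide lists only the special cases of the α, β, γ policy parameters; a dynamic-priority formula consistent with those cases (a reconstruction, not quoted from the slides) combines a job's wait time w_i, estimated remaining runtime t_i, and job size s_i multiplicatively:

```latex
% Reconstructed HybS dynamic priority (an assumption consistent with the listed cases):
%   \alpha = 1,\ \beta = \gamma = 0   -> priority = wait time        (FIFO)
%   \beta  = -1                        -> 1 / estimated runtime       (shortest job first)
%   \gamma = -1 \ \text{or}\ +1        -> smallest / largest job size first
p_i \;=\; w_i^{\alpha}\, t_i^{\beta}\, s_i^{\gamma}
```

Under this reading, t_i comes from the KCCA prediction model, and a per-queue/pool factor can multiply p_i to boost a user's priority.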

  26. HybS scheduling algorithm • The choice of scheduling policy parameters α, β, γ affects job latency • Defines a service-level value for quality of service, e.g., scheduling only XX% of a job's tasks per round • prevents a job from consuming all available resources • Needs to know how long a task will run in order to calculate the priority • uses past history/learned job properties via the KCCA model • current popular Hadoop schedulers do not know how long a task will run 3. MR scheduling algorithm

  27. How we calculate the weight • Existing approaches • dynamic prediction based on job progress • static estimates from historical samples • profiling, traces • Our approach: a prediction model based on historical data • learn performance metrics using Kernel Canonical Correlation Analysis (KCCA), following Ganapathi et al., UC Berkeley • estimate the resources jobs need to meet the predicted performance • accuracy of 0.84 for map time on Hive queries (courtesy of Ganapathi et al.) 2. MR workflow

  28. How we calculated the job performance

  29. HybS scheduling algorithm (cont.): fractional knapsack • p(v_i) is the priority function of job i • f(w_i) = NumRemainTask_i * ServiceLevelValue_i • For each heartbeat, there are W slots available: • 1. UPDATE priorities (for all jobs) and SORT jobs by priority • 2. CALCULATE f(w_i) for all jobs • 3. ASSIGN tasks to slots by fractional knapsack • 3.1 assign node-local and rack-local tasks (walking the list f(w_i) in priority order) until the W available slots are assigned • 3.2 assign non-local tasks if slots remain • A Java sketch of this loop follows. 3. MR scheduling algorithm
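A compact Java sketch of the per-heartbeat loop described above, under stated assumptions: the SchedulableJob interface is an invented stand-in exposing the priority p(v_i) and the per-round allowance f(w_i) = NumRemainTask_i * ServiceLevelValue_i, whereas the real plug-in works against Hadoop's TaskScheduler and JobInProgress classes, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Assumed minimal job view for illustration only.
interface SchedulableJob {
  double priority();                    // p(v_i), recomputed every heartbeat
  int remainingTasks();
  double serviceLevelValue();           // fraction of a job's tasks allowed per round
  boolean hasLocalTask(String node);    // node-local or rack-local task available?
  Runnable nextTask(boolean localOnly); // returns a task to launch, or null
}

class HybSHeartbeat {
  // Assign up to W free slots on 'node' using the fractional-knapsack rule.
  static List<Runnable> assign(List<SchedulableJob> jobs, String node, int W) {
    // 1. Update priorities and sort jobs by priority (highest first).
    jobs.sort(Comparator.comparingDouble(SchedulableJob::priority).reversed());

    List<Runnable> launched = new ArrayList<>();
    // 3.1 First pass: node-local / rack-local tasks, bounded by f(w_i).
    for (SchedulableJob j : jobs) {
      int allowance = (int) Math.ceil(j.remainingTasks() * j.serviceLevelValue());
      while (allowance-- > 0 && launched.size() < W && j.hasLocalTask(node)) {
        launched.add(j.nextTask(true));
      }
    }
    // 3.2 Second pass: fill any remaining slots with non-local tasks.
    for (SchedulableJob j : jobs) {
      while (launched.size() < W && j.remainingTasks() > 0) {
        Runnable t = j.nextTask(false);
        if (t == null) break;
        launched.add(t);
      }
    }
    return launched;
  }
}
```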

  30. Experiment setup • Implemented on Hadoop 1.0.1; available as a Hadoop plug-in scheduler • Micro-benchmarks on 72 cores / 9 nodes (2.0 GHz quad-core, 1 TB local disk), configured with 4 map slots per node, all on the same rack • Experiment 1: 21 jobs (TeraSort, WordCount, and AIRS Grid, a NASA AIRS data projection that is both I/O- and compute-intensive) • One day of AIRS data is 14 GB and consists of 240 granules (files) 3. MR scheduling algorithm

  31. Experiment #1: mixed workloads • Total job runtime, HybS vs. FairS vs. FIFO scheduler, for 21 jobs with the different workloads described in the table above • HybS with α=0, β=-1, γ=-1 and with α=1, β=-1, γ=-1 ("balanced, closer-to-finish first") in experiment 1 3. MR scheduling algorithm

  32. HybS vs. FairS comparison • HybS performed better for all jobs except jobs 9 and 13 • For the small jobs (18, 19, 20, 21), HybS is 4x to 6.7x faster, while FairS is 1.04x to 1.09x faster for the big jobs (9, 13) • HybS provides 2.4x faster response time on average than FairS 3. MR scheduling algorithm

  33. Experiment #2: short-task-runtime workload • A short-task-runtime workload of 95 small jobs with 4-5 second tasks • 50 to 148 maps per job; data set sizes from MBs to GBs • HybS provides 2.1x faster response time on average than FairS • Total job runtime, HybS (α=1, β=-1, γ=-1) vs. FairS 3. MR scheduling algorithm

  34. Experiment #3: using an open-source cloud • The Eucalyptus cloud vs. an identically configured physical cluster • TeraSort on 1.2 GB using 2 jobs, each with 60 maps • The virtualization overhead is significant • in the shuffle (7.5x slower) and reduce (8.2x slower) phases • disk I/O and network I/O rates for virtual images are an order of magnitude slower than those of the physical system 3. MR scheduling algorithm

  35. A MR workflow model and its implementation (cont.) • Supports scientific data formats (HDF4) and float arrays • Multi-dimensional data arrays are stored in a BigTable-style store: random, real-time read/write access and spatial indexing of multi-dimensional arrays • Implemented as a workflow system on top of Apache Hadoop • Uses the Hadoop ecosystem (Hadoop and HBase) • Users can apply query tools (Hive, Pig) to the multi-dimensional data arrays in HBase • Available as Java APIs 2. MR workflow

  36. Spatial data locality • A bounding box is implemented • Output is stored in HBase tables for queries • Image source: David Chapman; thanks to David Chapman for many discussions about spatial data locality 2. MR workflow

  37. HBase design for multiple satellites: gridded data • (Table, RowKey, Family, Column, Timestamp) → Value • HBase indexes on the row key value • Row key design for multiple satellite instruments: <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex> • Column families, e.g., resolution statistics: column_100km, column_1km; the spatial index is a lat/lon bounding box • Indexed by instrument, date/time, spectral channel, and spatial index • Selected rows (and columns) are scanned into the MapReduce computation (a sketch follows)
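A hedged Java sketch of the row-key layout above and of a prefix scan over one instrument/day/channel. The key components and their order come from the slide; the zero-padding widths, the table/column-family names, and the use of the HBase 0.9x-era client calls setStartRow/setStopRow are illustrative assumptions.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GriddedRowKeys {
  // Compose a row key as <InstrumentID>_<DateTime>_<SpectralChannel>_<SpatialIndex>.
  // Padding widths are assumed so that lexicographic order matches scan order.
  static byte[] rowKey(String instrumentId, String dateTime,
                       int spectralChannel, int spatialIndex) {
    String key = String.format("%s_%s_%04d_%06d",
        instrumentId, dateTime, spectralChannel, spatialIndex);
    return Bytes.toBytes(key);
  }

  // Scan all spatial cells of one instrument/day/channel into a MapReduce job:
  // every row key sharing the prefix falls inside [startRow, stopRow).
  static Scan channelDayScan(String instrumentId, String date, int channel) {
    String prefix = String.format("%s_%s_%04d_", instrumentId, date, channel);
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(prefix));
    scan.setStopRow(Bytes.toBytes(prefix + "~"));  // '~' sorts after the key digits
    scan.addFamily(Bytes.toBytes("stats"));        // assumed family, e.g. 100 km statistics
    return scan;
  }
}
```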

  38. MR workflow system vs. Oozie • A mixed workflow of gridding, averaging, and WordCount on AIRS datasets • Same cluster capacity, different input dataset sizes (days), and the same Hadoop task scheduler • Oozie overhead vs. the MR workflow system: 16% and 20% 2. MR workflow

  39. Improvement of AIRS gridding • Estimates based on daily gridding • Used bounding boxes for spatial data locality • Gridding: compared with the regular embarrassingly parallel method, a 35% improvement in total processing time

  40. Outline 1. Motivation, thesis statement, and contributions 2. A MapReduce workflow system for scientific data-intensive applications 3. Creating AIRS FCDRs and studying global/regional climate change 4. Conclusion and future work

  41. Motivation • Goody et al. (1996) showed that IR radiance observations could be used directly to detect the climatic response to greenhouse-gas forcing • J. Houghton (IPCC 1990, 1995): "a strong link exists between increases in greenhouse gases and surface temperature" • Keith, D.W. and Anderson, J.G. (2001) assert that "the direct use of radiances in climate analysis provides a more mathematically direct comparison between theory and observation" • Harries et al. (2001) compared the OLR spectrum from IRIS (1970) and IMG (1997) and measured significant increases in GHGs "consistent with radiative forcings" • AIRS and MODIS, on the same satellite, observing the same fields of view with completely different calibration techniques and different spectrometers, provide a precise relative calibration over a near-decadal time frame • AIRS and MODIS form a unique fundamental 10-year record of inter-calibrated, continuous, stable data from one satellite • No AIRS or MODIS Level 1B gridded data products are available from the instrument science teams; no longwave IR channels on VIRS

  42. Methodologies • Forward gridding method: map the center of each footprint into lat x lon grid cells • At 0.5° x 1.0° lat-lon (~100 km), from 2002-2011 • Produce 9-year AIRS FCDR anomalies (removing the yearly cycle) for window channels 4.16µ and 12.18µ • 55 TB of AIRS data, and a number of months of MODIS data • A gridding sketch follows.
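A small Java sketch of the forward gridding step under stated assumptions: each footprint's center latitude/longitude is mapped to a 0.5° (lat) x 1.0° (lon) cell, and brightness temperatures falling in the same cell are averaged; the field names and the simple sum/count accumulation are illustrative, not the thesis implementation.

```java
// Forward gridding sketch: map each footprint center to a 0.5 deg (lat) x 1.0 deg (lon)
// grid cell and accumulate a mean brightness temperature per cell.
public class ForwardGridder {
  static final int NLAT = 360;   // 180 deg / 0.5 deg
  static final int NLON = 360;   // 360 deg / 1.0 deg

  final double[][] sum   = new double[NLAT][NLON];
  final long[][]   count = new long[NLAT][NLON];

  // Convert a footprint center to grid indices (clamped at the poles / date line).
  static int latIndex(double lat) { return Math.min(NLAT - 1, (int) ((lat + 90.0) / 0.5)); }
  static int lonIndex(double lon) { return Math.min(NLON - 1, (int) ((lon + 180.0) / 1.0)); }

  void add(double lat, double lon, double brightnessTemp) {
    int i = latIndex(lat), j = lonIndex(lon);
    sum[i][j] += brightnessTemp;
    count[i][j]++;
  }

  double cellMean(int i, int j) {
    return count[i][j] == 0 ? Double.NaN : sum[i][j] / count[i][j];
  }
}
```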

  43. Methodologies • Calibration validation by inter-comparison between AIRS and MODIS • convolution method • Correlation validation between all-sky surface brightness temperature and surface temperatures for trend and change analysis • GISS surface temperature • Data processing model: David Chapman's Gridderama • Requires stability and accuracy of 0.01 K for global annual changes

  44. Convolving AIRS with MODIS SRFs to compare the stability of AIRS and MODIS • On the same satellite • Integrated the convolved AIRS channels over the MODIS spectral range • Adjusted the MODIS scan angle to match AIRS • Compared the 4µ and 12µ channels: 12.02µ is extremely stable over the decade, with a trend of 0.001 K and a negligible AIRS warm bias • AIRS is suitably calibrated for determining decadal trends as small as 0.1 K for the 12µ window channel • Source: David Chapman et al., AIRS science team meeting 2012 • A convolution sketch is shown below.
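A minimal Java sketch of the convolution step, assuming AIRS channel radiances and a MODIS spectral response function (SRF) sampled on the same wavenumber grid; the convolved value is the SRF-weighted average of the AIRS radiances, which can then be compared with the co-located MODIS measurement. This is an illustration of the general technique, not the thesis code.

```java
// Convolve AIRS hyperspectral radiances with a MODIS spectral response function (SRF)
// sampled on the same wavenumber grid, producing a MODIS-like broadband radiance.
public final class SrfConvolution {
  static double convolve(double[] airsRadiance, double[] modisSrf) {
    if (airsRadiance.length != modisSrf.length) {
      throw new IllegalArgumentException("radiance and SRF grids must match");
    }
    double weighted = 0.0, weight = 0.0;
    for (int k = 0; k < airsRadiance.length; k++) {
      weighted += modisSrf[k] * airsRadiance[k];
      weight   += modisSrf[k];
    }
    return weighted / weight;   // SRF-weighted mean radiance
  }
}
```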

  45. Sensitivity of AIRS global surface BT trends • Fig. a: AIRS surface BT and GISS ST global annual-mean anomaly trends are flat for ch. 4.16µ, with an annual correlation of 0.96 and a slight difference in 2007 • Fig. b: AIRS ch. 12.18µ exhibits a poorer annual correlation with GISS of 0.7 and a slight decadal cooling trend of 0.018 K, possibly from cloud effects • Both are consistently colder in 2004 and 2008, by ~0.07 K and 0.12 K (Fig. a); similar colder years and magnitudes for AIRS ch. 12.18µ • However, for AIRS 12.18µ, the years 2007, 2010, and 2011 show larger differences from GISS ST of -0.075, -0.08, and 0.06 K (Fig. b)

  46. AIRS annual BT trends in the Arctic • A significant warming trend in the Arctic of 0.06 K per year for channel 4.16µ, with yearly changes varying by as much as 1.6 K and a decadal gain of 0.6 K (Fig. 4a) • A slightly larger trend of 0.087 K/yr is observed for AIRS channel 12.18µ (Fig. 4c) • Both AIRS 4.16µ and 12.18µ show correspondingly large year-to-year Arctic oscillations of ~ -0.8 K to 0.6 K, adding credibility to the observation of decadal warming in the Arctic

  47. AIRS BT shows high correlations (>0.9) with GISS ST in all seasons except Sep-Oct-Nov, with a correlation of 0.84

  48. AIRS annual BT trends in the Arctic • A significant warming trend in the Arctic of 0.06 K per year for channel 4.16µ, with yearly changes varying by as much as 0.7 K and a decadal gain of 0.6 K (Fig. 4a) • A slightly larger trend of 0.087 K/yr is observed for AIRS channel 12.18µ (Fig. 4c) • Both AIRS 4.16µ and 12.18µ show correspondingly large year-to-year Arctic oscillations of ~ -0.8 K to 0.6 K, adding credibility to the observation of decadal warming in the Arctic

  49. AIRS annual BT trends in the Antarctic • Fig. 4b and Fig. 4d show that trends are flat for 60S-90S in AIRS channel 4.16µ, with a less significant cooling of 0.014 K for AIRS channel 12.18µ • Both the AIRS 4.16µ and 12.18µ channels show small cooling trends of 0.004 K and 0.008 K, respectively, in the middle latitudes (60S-60N) for the reported period (2003-2011)

  50. Climate process: Madden-Julian Oscillation (MJO) • Implemented a parallel block-Jacobi SVD algorithm for the MJO analysis • The first EOF explains ~14.4% of the variance • A serial illustration of the EOF computation is sketched below.
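The thesis implemented a parallel block-Jacobi SVD; as a simpler serial stand-in for the same EOF computation, the sketch below uses Apache Commons Math 3 to take the SVD of an anomaly matrix (rows = time, columns = grid cells) and reports the fraction of variance explained by the first EOF (the quantity quoted as ~14.4% above). The loader and the tiny sample matrix are placeholders; in the thesis the anomalies come from the gridded FCDR.

```java
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

// Serial EOF analysis via SVD of an anomaly matrix X (rows = time, cols = grid cells).
// The thesis used a parallel block-Jacobi SVD; this is only an illustrative stand-in.
public class EofAnalysis {
  public static void main(String[] args) {
    double[][] anomalies = loadAnomalies();          // placeholder loader
    RealMatrix X = new Array2DRowRealMatrix(anomalies);

    SingularValueDecomposition svd = new SingularValueDecomposition(X);
    double[] s = svd.getSingularValues();

    // Variance explained by mode k is s_k^2 / sum_j s_j^2.
    double total = 0.0;
    for (double v : s) total += v * v;
    double firstEofFraction = (s[0] * s[0]) / total;
    System.out.printf("First EOF explains %.1f%% of the variance%n",
        100.0 * firstEofFraction);

    // The spatial pattern of EOF 1 is the first column of V (svd.getV()).
  }

  // Hypothetical loader with a tiny sample matrix for illustration.
  static double[][] loadAnomalies() {
    return new double[][] { {0.1, -0.2, 0.05}, {-0.1, 0.3, -0.15}, {0.0, -0.1, 0.1} };
  }
}
```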
