Big Data System Environments

Big Data System Environments Wellington, New Zealand Geoffrey Fox, February 13, 2019 gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/ Digital Science Center

Overall Global AI and Modeling Supercomputer GAIMSC http://www.iterativemapreduce.org/

From Microsoft: https://www.microsoft.com/en-us/research/event/faculty-summit-2018/

Overall Global AI and Modeling Supercomputer GAIMSC Architecture • There is only a cloud at the logical center but it’s physically distributed and owned by a few major players • There is a very distributed set of devices surrounded by local Fog computing; this forms the logically and physically distributed edge • The edge is structured and largely data • These are two differences from the Grid of the past • e.g. a self-driving car will have its own fog and will not share fog with the truck that it is about to collide with • The cloud and edge will both be very heterogeneous with varying accelerators, memory size and disk structure. • What is the software model for GAIMSC?

Collaborating on the Global AI and Modeling Supercomputer GAIMSC • Microsoft says: • We can only “play together” and link functionalities from Google, Amazon, Facebook, Microsoft, Academia if we have open APIs and open code to customize • We must collaborate • Open source Apache software • Academia needs to use and define its own Apache projects • We want to use the AI and modeling supercomputer for AI-Driven engineering and science studying the early universe and the Higgs boson and not just producing annoying advertisements (goal of most elite CS researchers)

Systems Challenges for GAIMSC • Architecture of the Global AI and Modeling Supercomputer GAIMSC must reflect • Global captures the need to mashup services from many different sources; • AI captures the incredible progress in machine learning (ML); • Modeling captures both traditional large-scale simulations and the models and digital twins needed for data interpretation; • Supercomputer captures that everything is huge and needs to be done quickly and often in real time for streaming applications. • The GAIMSC includes an intelligent HPC cloud linked via an intelligent HPC Fog to an intelligent HPC edge. We consider this distributed environment as a set of computational and data-intensive nuggets swimming in an intelligent aether. • We will use a dataflow graph to define a structure in the aether • GAIMSC requires parallel computing to achieve high performance on large ML and simulation nuggets and distributed system technology to build the aether and support the distributed but connected nuggets. • In the latter respect, the intelligent aether mimics a grid but it is a data grid where there are computations but typically those associated with data (often from edge devices). • So unlike the distributed simulation supercomputer that was often studied in previous grids, GAIMSC is a supercomputer aimed at very different data intensive AI-enriched problems.

Integration of Data and Model functions with ML wrappers in GAIMSC • There is increasing interest in the integration of ML and simulations. • ML can analyze results, guide the execution and set up initial configurations (auto-tuning). This is equally true for AI itself -- the GAIMSC will use itself to optimize its execution for both analytics and simulations. • See “The Case for Learned Index Structures” from MIT and Google • In principle every transfer of control (job or function invocation, a link from device to the fog/cloud) should pass through an AI wrapper that learns from each call and can decide both whether the call needs to be executed (maybe we have learned the answer already and need not compute it) and how to optimize the call if it really needs to be executed. • The digital continuum (proposed by BDEC2) is an intelligent aether learning from and informing the interconnected computational actions that are embedded in the aether. • Implementing the intelligent aether embracing and extending the edge, fog, and cloud is a major research challenge where bold new ideas are needed! • We need to understand how to make it easy to automatically wrap every nugget with ML.
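The AI-wrapper idea on this slide can be illustrated with a minimal Python sketch. Everything in it is hypothetical (the class name `LearnedWrapper`, the `tolerance` parameter, the stand-in simulation) and none of it is part of Twister2 or any GAIMSC software. The wrapper records each call and, when a new input is close enough to one already seen, returns the remembered answer instead of executing the nugget. A real wrapper would presumably use a trained ML model rather than this simple lookup.

```python
import math


class LearnedWrapper:
    """Hypothetical wrapper that learns from each call to a wrapped 'nugget'."""

    def __init__(self, nugget, tolerance=1e-3):
        self.nugget = nugget          # the expensive function being wrapped
        self.tolerance = tolerance    # how close an input must be to reuse an answer
        self.seen = []                # (input, output) pairs learned so far
        self.executed = 0             # number of real executions
        self.skipped = 0              # number of calls answered from memory

    def __call__(self, x):
        # Decide whether the call needs to be executed at all.
        for known_x, known_y in self.seen:
            if abs(known_x - x) <= self.tolerance:
                self.skipped += 1
                return known_y
        # No learned answer: execute the nugget and learn from the result.
        y = self.nugget(x)
        self.seen.append((x, y))
        self.executed += 1
        return y


def expensive_simulation(x):
    """Stand-in for a costly simulation or analytics call (assumption)."""
    return math.sin(x) ** 2


wrapped = LearnedWrapper(expensive_simulation, tolerance=1e-3)
results = [wrapped(x) for x in (0.5, 0.5004, 1.0, 0.5, 1.0002)]
print(wrapped.executed, wrapped.skipped)
```

In this run only two of the five calls reach the wrapped function; the other three are answered from what the wrapper has already learned.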

Implementing the GAIMSC • My recent research aims to make good use of high-performance technologies and yet preserve the key features of the Apache Big Data Software. • It was originally aimed at using HPC to run Machine Learning, but this is now fairly well understood and the new focus is the integration of ML, clouds and edge • We will describe Twister2, which seems well suited to build the prototype intelligent high-performance aether. • Note this will mix many relatively small nuggets with AI wrappers, generating parallelism from the number of nuggets and not internally to the nugget and its wrapper. • However, there will also be large global jobs requiring internal parallelism for individual large-scale machine learning or simulation tasks. • Thus parallel computing and distributed systems (grids) must be linked in a clean fashion, and the key parallel computing ideas needed for ML are closely related to those already developed for simulations.

Application Requirements

Distinctive Features of Applications • Ratio of data to model sizes: vertical axis on next slide • Importance of Synchronization – ratio of inter-node communication to node computing: horizontal axis on next slide • Sparsity of Data or Model; impacts value of GPU’s or vector computing • Irregularity of Data or Model • Geographic distribution of Data as in edge computing; use of streaming (dynamic data) versus batch paradigms • Dynamic model structure as in some iterative algorithms

Big Data and Simulation Difficulty in Parallelism [Figure: problem classes placed by size of synchronization constraints (loosely coupled to tightly coupled) and size of disk I/O. Problem classes: Pleasingly Parallel (often independent events); MapReduce as in scalable databases; Graph Analytics, e.g. subgraph mining; Global Machine Learning, e.g. parallel clustering; Deep Learning; LDA; Parameter sweep simulations; Structured Adaptive Sparse; Unstructured Adaptive Sparse; Largest scale simulations. Annotations: Current major Big Data category; Linear Algebra at core (often not sparse); Memory access also critical. Platforms: Commodity Clouds; HPC Clouds with accelerators and high performance interconnect; HPC Clouds/Supercomputers; Exascale Supercomputers.] • These are just two problem characteristics • There is also data/compute distribution seen in grid/edge computing

Comparing Spark, Flink and MPI

Machine Learning with MPI, Spark and Flink • Three algorithms implemented in three runtimes • Multidimensional Scaling (MDS) • Terasort • K-Means (drop as no time and looked at later) • Implementation in Java • MDS is the most complex algorithm - three nested parallel loops • K-Means - one parallel loop • Terasort - no iterations (see later) • With care, Java performance ~ C performance • Without care, Java performance << C performance (details omitted)

Multidimensional Scaling: 3 Nested Parallel Sections [Figure: Flink MDS execution time with 32000 points on varying number of nodes; each node runs 20 parallel tasks.] [Figure: Spark MDS execution time on 16 nodes with 20 processes in each node with varying number of points.] • Spark and Flink show no speedup • MPI is a factor of 20-200 faster than Spark/Flink • Flink especially loses touch with the relationship of computing and data location • In OpenMP pragmas, Twister2 uses Parallel First Touch and Owner Computes • Current Big Data systems use forgotten touch, owner forgets and Tragedy of the Commons Computes

Linking Machine Learning and HPC

MLforHPC and HPCforML • We try to distinguish between different interfaces for ML/DL and HPC. • HPCforML: Using HPC to execute and enhance ML performance, or using HPC simulations to train ML algorithms (theory guided machine learning), which are then used to understand experimental data or simulations. • MLforHPC: Using ML to enhance HPC applications and systems • A special case of Dean at NIPS 2017: "Machine Learning for Systems and Systems for Machine Learning"

HPCforML in detail • HPCforML can be further subdivided • HPCrunsML: Using HPC to execute ML with high performance • SimulationTrainedML: Using HPC simulations to train ML algorithms, which are then used to understand experimental data or simulations. • Twister2 supports HPCrunsML by using high performance technology such as MPI

MLforHPC in detail • MLforHPC can be further subdivided into several categories: • MLautotuning: Using ML to configure (autotune) ML or HPC simulations. • MLafterHPC: ML analyzing results of HPC as in trajectory analysis and structure identification in biomolecular simulations • MLaroundHPC: Using ML to learn from simulations and produce learned surrogates for the simulations. The same ML wrapper can also learn configurations as well as results • MLControl: Using simulations (with HPC) in control of experiments and in objective driven computational campaigns. Here the simulation surrogates are very valuable to allow real-time predictions. • Twister2 supports MLforHPC by allowing nodes of dataflow representation to be wrapped with ML

MLAutotuned HPC. Machine Learning for Parameter Auto-tuning in Molecular Dynamics Simulations: Efficient Dynamics of Ions near Polarizable Nanoparticles. JCS Kadupitiya, Geoffrey Fox, Vikram Jadhao. Nature 444, 697 (2006)

Integration of machine learning (ML) methods for parameter prediction in MD simulations by demonstrating how they were realized in MD simulations of ions near polarizable NPs. • Electrostatics under conventional approach: Initial Charge Configuration; Reduction to Surface problem; Solve Poisson Equation; Compute Forces on charges; Move Charges using the Forces. • Electrostatics under Car-Parrinello inspired variational framework: Initial Charge Configuration and optimized induced charge distribution; Compute Force on charges and Force on additional degrees; Move charges and fake degrees. The electrostatic problem is solved on the fly with energy conservation built into the Lagrangian formalism.

Results for MLAutotuning [Figure: comparison of results for peak densities of counterions between adaptive (ML) and original non-adaptive cases (they look identical).] [Figure: ionic densities from MLAutotuned system; inset compares ML system results with those of slower original system.] [Figure: key characteristics of simulated system showing greater stability for ML enabled adaptive approach.] [Figure: quality of simulation measured by time simulated per step with increasing use of ML enhancements (larger is better); inset is timestep used.] • An ANN based regression model was integrated with MD simulation and predicted an excellent simulation environment 94.3% of the time; human operation is more like 20% (student) to 50% (faculty) and runs the simulation slower to be safe. • Auto-tuning of parameters generated accurate dynamics of ions for 10 million steps while improving the stability. • The integration of the ML-enhanced framework with hybrid OpenMP/MPI parallelization techniques reduced the computational time of simulating systems with 1000s of ions and induced charges from 1000s of hours to 10s of hours, yielding a maximum speedup of 3 from ML-only and a maximum speedup of 600 from the combination of ML and parallel computing methods. • The approach can be generalized to select optimal parameters in other MD applications and energy minimization problems.

MLaroundHPC: Machine learning for performance enhancement with Surrogates of molecular dynamics simulations Integration of machine learning (ML) with the high-performance computing enabled simulation frameworks to enhance their performance and improve their usability for both research and education.

[Figure: correlation between Molecular Dynamics simulations and learnt Machine Learning predictions for contact density.] [Figure: dependence of contact densities on ion diameter and confinement length compared between ML and MD.] [Figure: contact, peak and center point densities versus salt concentration compared between MD and ML inference.] • We find that an artificial neural network based regression model successfully learns desired features associated with the output ionic density profiles (the contact, mid-point and peak densities), generating predictions for these quantities that are in excellent agreement with the results from explicit molecular dynamics simulations. • The integration of an ML layer enables real-time and anytime engagement with the simulation framework, thus enhancing the applicability for both research and educational use.

Speedup of MLaroundHPC • Tseq is sequential time • Ttrain is time for a (parallel) simulation used in training ML • Tlearn is time per point to run machine learning • Tlookup is time to run inference per instance • Ntrain is number of training samples • Nlookup is number of results looked up • Speedup becomes Tseq/Ttrain if ML is not used • Speedup becomes Tseq/Tlookup (10^5 faster in our case) if inference dominates (will overcome end of Moore’s law and win the race to zettascale) • This application is deployed on nanoHUB for high performance education use • Ntrain is 7K to 16K in our work
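The speedup expression itself did not survive in this transcript. One form that is consistent with the definitions on this slide and with both limiting cases it lists is S = Tseq (Nlookup + Ntrain) / (Tlookup Nlookup + (Ttrain + Tlearn) Ntrain); treat this as a reconstruction under that assumption, not a quotation. The short Python sketch below evaluates it. The function name and the timing values are illustrative assumptions, not measurements from the talk.

```python
def effective_speedup(t_seq, t_train, t_learn, t_lookup, n_train, n_lookup):
    """Assumed effective speedup of simulations plus ML surrogate lookups.

    Numerator: time to produce every result sequentially without ML.
    Denominator: time actually spent on parallel training runs, learning and inference.
    """
    total_results = n_lookup + n_train
    time_spent = t_lookup * n_lookup + (t_train + t_learn) * n_train
    return t_seq * total_results / time_spent


# Illustrative timings in seconds (assumptions, not values from the talk).
T_SEQ, T_TRAIN, T_LEARN, T_LOOKUP = 1000.0, 10.0, 0.1, 0.01

# ML not used: nothing is looked up and there is no learning cost.
no_ml = effective_speedup(T_SEQ, T_TRAIN, 0.0, T_LOOKUP, n_train=7000, n_lookup=0)

# Inference dominates: Nlookup is much larger than Ntrain.
many_lookups = effective_speedup(T_SEQ, T_TRAIN, T_LEARN, T_LOOKUP,
                                 n_train=7000, n_lookup=10**12)

print(no_ml, T_SEQ / T_TRAIN)
print(many_lookups, T_SEQ / T_LOOKUP)
```

With no lookups the value equals Tseq/Ttrain, and with a very large number of lookups it approaches Tseq/Tlookup, matching the two limits stated on the slide.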

MLaroundHPC Architecture: ML and MD intertwined [Figure: ML-based simulation prediction; ANN model with training and inference stages.]

MLAutotuning: Integration Architecture • ML is before and after MD • Integration of machine learning (ML) methods for parameter prediction in MD simulations by demonstrating how they were realized in MD simulations of ions near polarizable NPs. [Figure: ML-based simulation configuration with Training, Testing, Inference I and Inference II stages.]

Programming Environment for Global AI and Modeling Supercomputer GAIMSC HPCforML and MLforHPC

Ways of adding High Performance to Global AI (and Modeling) Supercomputer • Fix performance issues in Spark, Heron, Hadoop, Flink etc. • Messy as some features of these big data systems are intrinsically slow in some (not all) cases • All these systems are “monolithic” and it is difficult to deal with individual components • Execute HPBDC from classic big data system with custom communication environment – approach of Harp for the relatively simple Hadoop environment • Provide a native Mesos/Yarn/Kubernetes/HDFS high performance execution environment with all capabilities of Spark, Hadoop and Heron – goal of Twister2 • Execute with MPI in classic (Slurm, Lustre) HPC environment • Add modules to existing frameworks like Scikit-Learn or Tensorflow either as new capability or as a higher performance version of existing module.

Integrating HPC and Apache Programming Environments • Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem. • Harp-DAAL supports all 5 classes of data-intensive AI first computation, from pleasingly parallel to machine learning and simulations. • Twister2 is a toolkit of components that can be packaged in different ways • Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance. • Separate bulk synchronous and data flow communication; • Task management as in Mesos, Yarn and Kubernetes • Dataflow graph execution models • Launching of the Harp-DAAL library with native Mesos/Kubernetes/HDFS environment • Streaming and repository data access interfaces, • In-memory databases and fault tolerance at dataflow nodes. (use RDD (Tsets) to do classic checkpoint-restart)

Twister2 Highlights I • “Big Data Programming Environment” such as Hadoop, Spark, Flink, Storm, Heron but with significant differences (improvements) • Uses HPC wherever appropriate • Links to “Apache Software” (Kafka, Hbase, Beam) wherever appropriate • Runs preferably under Kubernetes and Mesos but Slurm supported • Highlight is high performance dataflow supporting iteration, fine-grain, coarse grain, dynamic, synchronized, asynchronous, batch and streaming • Two distinct communication environments • DFW Dataflow with distinct source and target tasks; data not message level • BSP for parallel programming; MPI is default • Rich state model for objects supporting in-place, distributed, cached, RDD style persistence

Twister2 Highlights II • Can be a pure batch engine • Not built on top of a streaming engine • Can be a pure streaming engine supporting Storm/Heron API • Not built on top of a batch engine • Fault tolerance as in Spark or MPI today; dataflow nodes define natural synchronization points • Many APIs: Data (at many levels), Communication, Task • High level (as in Spark) and low level (as in MPI) • Component based architecture -- it is a toolkit • Defines the important layers of a distributed processing engine • Implements these layers cleanly aiming at data analytics and with high performance

Twister2 Highlights III • Key features of Twister2 are associated with its dataflow model • Fast and functional inter-node linkage; distributed from edge to cloud or in-place between identical source and target tasks • Streaming or Batch nodes (Storm persistent or Spark ephemeral model) • Supports both Orchestration (as in Pegasus, Kepler, NiFi) and high performance streaming flow (as in Naiad) models • TSet: Twister2 datasets, like RDD, define a full object state model supported across links of dataflow

Some Choices in Dataflow Systems • Computations (maps) happen at nodes • Generalized Communication happens on links • Direct, Keyed, Collective (broadcast, reduce), Join • In coarse-grain workflow, communication can be by disk • In fine-grain dataflow (as in K-means), communication needs to be fast • Caching and/or use in-place tasks • In-place not natural for streaming as persistent nodes/tasks [Figure: NiFi, classic coarse-grain workflow.] [Figure: K-means in Spark, Flink, Twister2.]
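A minimal sketch of three of the link operations named on this slide (keyed, reduce, broadcast), written as plain Python functions over in-memory lists. The function names are hypothetical and do not correspond to the Twister2 API; a real system would move the data between processes or machines rather than between lists.

```python
from functools import reduce as fold


def keyed(records, num_targets):
    """Keyed link: route each (key, value) record to a target task by its key."""
    targets = [[] for _ in range(num_targets)]
    for key, value in records:
        targets[hash(key) % num_targets].append((key, value))
    return targets


def reduce_link(partials, op):
    """Collective link: combine one partial result per source task into one value."""
    return fold(op, partials)


def broadcast(value, num_targets):
    """Collective link: every target task receives the same value."""
    return [value for _ in range(num_targets)]


# Integer keys keep the routing deterministic in this toy example.
records = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)]

parts = keyed(records, num_targets=2)                        # communication on a link
partial_sums = [sum(v for _, v in part) for part in parts]   # computation at the nodes
total = reduce_link(partial_sums, lambda a, b: a + b)        # collective reduce
copies = broadcast(total, num_targets=2)                     # collective broadcast

print(partial_sums, total, copies)
```

The computation (summing) happens at the nodes while `keyed`, `reduce_link` and `broadcast` stand in for communication on the links, mirroring the split described on the slide.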

Twister2 Logistics • Open Source - Apache License Version 2.0 • Github - https://github.com/DSC-SPIDAL/twister2 • Documentation - https://twister2.gitbook.io/twister2 with tutorial • Developer Group - twister2@googlegroups.com – India (1), Sri Lanka (9) and Turkey (2) • Started in the 4th Quarter of 2017, reversing the previous philosophy, which was to modify Hadoop, Spark, Heron • Bootstrapped using Heron code but that code has now changed • About 80,000 Lines of Code (plus 50,000 for SPIDAL+Harp HPCforML) • Languages - Primarily Java with some Python

Twister2 Team

Big Data APIs • Started with Map-Reduce • Different Data APIs in community • Data transformation APIs • Apache Crunch PCollections • Apache Spark RDD • Apache Flink DataSet • Apache Beam PCollections • Apache Storm Streamlets • Apache Storm Task Graph • SQL based APIs • Task Graph with computations on data in nodes • High-level Data API hides communication and decomposition from the user • Lower-level messaging and Task APIs offer harder to use, more powerful capabilities

GAIMSC Programming Environment Components I

GAIMSC Programming Environment Components II

Execution as a Graph for Data Analytics • The graph created by the user API can be executed using an event model • The events flow through the edges of the graph as messages • The compute units are executed upon arrival of events • Supports Function as a Service • Execution state can be checkpointed automatically with natural synchronization at node boundaries • Fault tolerance [Figure: events flow through edges between tasks; Task Graph, Schedule, Execution Graph (Plan).]
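The event model on this slide can be sketched in a few lines of Python. This is a toy, single-process illustration with hypothetical names (`TaskGraph`, `add_task`, `connect`, `run`), not the Twister2 executor: messages are queued as events on edges, and a task's compute function runs when an event arrives.

```python
from collections import defaultdict, deque


class TaskGraph:
    """Toy dataflow graph: tasks are nodes, edges carry messages (events)."""

    def __init__(self):
        self.tasks = {}                    # task name -> compute function
        self.edges = defaultdict(list)     # task name -> downstream task names
        self.outputs = defaultdict(list)   # results of tasks with no downstream edge

    def add_task(self, name, compute):
        self.tasks[name] = compute

    def connect(self, source, target):
        self.edges[source].append(target)

    def run(self, source, messages):
        # Queue of (task, message) events waiting to be delivered.
        events = deque((source, m) for m in messages)
        while events:
            task, message = events.popleft()
            result = self.tasks[task](message)   # compute runs on event arrival
            if result is None:
                continue                         # the task filtered the message out
            downstream = self.edges[task]
            if not downstream:
                self.outputs[task].append(result)
            for target in downstream:
                events.append((target, result))  # the event flows along the edge
        return self.outputs


graph = TaskGraph()
graph.add_task("source", lambda x: x)
graph.add_task("square", lambda x: x * x)
graph.add_task("keep_even", lambda x: x if x % 2 == 0 else None)
graph.connect("source", "square")
graph.connect("square", "keep_even")

outputs = graph.run("source", [1, 2, 3, 4])
print(outputs["keep_even"])
```

Because each task only reacts to arriving events, the boundaries between tasks are natural places to checkpoint state, which is the point the slide makes about synchronization at node boundaries.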

HPC APIs • Dominated by Message Passing Interface (MPI) • Provides the most fundamental requirements in the most efficient ways possible • Communication between parallel workers • Managing of parallel processes • HPC has task systems and Data APIs • They are all built on top of parallel communication libraries • Legion from Stanford on top of CUDA and active messages (GASNet) • Actually HPC usually defines “model parameter” APIs and Big Data “Data” APIs • One needs both data and model parameters treated similarly in many cases [Figure: Simple MPI Program.]
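The MPI program shown on this slide did not survive in the transcript. As a stand-in, the sketch below imitates one bulk synchronous step (an allreduce-style sum) using only the Python standard library. It is not MPI: threads play the role of parallel workers, a shared list plays the role of messages, and the function name `bsp_allreduce_sum` is hypothetical.

```python
import threading


def bsp_allreduce_sum(local_values):
    """Run one worker thread per value; every worker ends with the global sum."""
    n = len(local_values)
    barrier = threading.Barrier(n)   # marks the end of the communication phase
    shared = [0.0] * n               # one slot per worker, standing in for messages
    results = [None] * n

    def worker(rank):
        shared[rank] = local_values[rank]   # "send" the local contribution
        barrier.wait()                      # wait until all contributions are visible
        results[rank] = sum(shared)         # every worker computes the same total

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


totals = bsp_allreduce_sum([1.0, 2.0, 3.0, 4.0])
print(totals)
```

The barrier is what makes the step bulk synchronous: no worker reads the shared contributions until every worker has written its own.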

Data and Model in Big Data and Simulations I • Need to discuss Data and Model as problems have both intermingled, but we can get insight by separating them, which allows better understanding of Big Data - Big Simulation “convergence” (or differences!) • The Model is a user construction and it has a “concept”, parameters and gives results determined by the computation. We use the term “model” in a general fashion to cover all of these. • Big Data problems can be broken up into Data and Model • For clustering, the model parameters are cluster centers while the data is the set of points to be clustered • For queries, the model is the structure of the database and the results of this query while the data is the whole database queried and the SQL query • For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights. The data is the set of images used for training or classification
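The clustering example on this slide can be made concrete with a small K-means sketch in plain Python. The data (the points) stay fixed across iterations while the model (the cluster centers) is updated every iteration. The function name, the one-dimensional points and the initial centers are illustrative assumptions.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm in 1-D: 'points' is the data, 'centers' is the model."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: one loop over the data (parallelizable over points).
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            sums[nearest] += p
            counts[nearest] += 1
        # Update step: the model changes, the data does not.
        centers = [sums[k] / counts[k] if counts[k] else centers[k]
                   for k in range(len(centers))]
    return centers


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]      # static data
model = kmeans(data, centers=[0.0, 6.0])   # model parameters: cluster centers
print(model)
```

In a parallel version the points would be partitioned across workers and only the small per-cluster sums and counts would need to be combined each iteration, which is why the data/model split matters for communication cost.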

Data and Model in Big Data and Simulations II • Simulations can also be considered as Data plus Model • Model can be a formulation with particle dynamics or partial differential equations defined by parameters such as particle positions and discretized velocity, pressure, density values • Data could be small when just boundary conditions • Data is large with data assimilation (weather forecasting) or when data visualizations are produced by simulation • Big Data implies Data is large but Model varies in size • e.g. LDA (Latent Dirichlet Allocation) with many topics or deep learning has a large model • Clustering or Dimension reduction can be quite small in model size • Data is often static between iterations (unless streaming); Model parameters vary between iterations • Data and Model Parameters are often confused in papers as the term data is used to describe the parameters of models. • Models in Big Data and Simulations have many similarities and allow convergence • Both data and model have non-trivial parallel computing issues

Twister2 Features by Level of Effort

Twister2 Implementation by Language • Software Engineering will double amount of code with unit tests etc.

Runtime Components [Figure: layered runtime components. User APIs: Java APIs, Scala APIs, Python API, SQL API. Resource API: Mesos, Kubernetes, Standalone, Slurm. Data Access APIs: Local, HDFS, NoSQL, Message Brokers. Other labels: Atomic Job Submission; Connected or External DataFlow; Orchestration API; Streaming, Batch and ML Applications; State; TSet; Runtime; Task Graph System; Internal (fine grain) DataFlow and State Definition Operations; BSP Operations.] • Future Features: Python API critical

Twister2 APIs in Detail [Figure: API stack running on Workers. Labels: Java API, Python API, TSet, SQL, Task Graph, DataFlow Operations, BSP Operations, Worker.] • Higher Level APIs based on Task Graph: APIs built on top of Task Graph • Operator Level APIs: low level APIs with the most flexibility, harder to program • Future APIs are built combining different components of the System

Twister2 API Levels • TSet API: easy to program functional API with type support; suitable for simple applications, e.g. pleasingly parallel • Task API: intermediate API; abstracts the threads and messages • Operator API: user in full control, harder to program; suitable for complex applications, e.g. graph analytics • Ease of use is highest at the TSet API; performance/flexibility is highest at the Operator API

Runtime Process View • Driver • User submits and controls the Job • Cluster Resources • Resources managed by a resource scheduler such as Mesos or Kubernetes • Resource Unit • A resource allocated by the scheduler: Core, Kubernetes Pod, Mesos Task, Compute Node • Worker Process • A Twister2 process that executes the user tasks • Task • Execution unit programmed by user • Every job runs in isolation (Dashboard is shared)
