OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING

Seung-Hee Bae, Computer Science Dept., Indiana University

Outline
• Multicore
• Parallel Computing & MPI
• Data Mining

• Multicore
  • Toward Concurrency
  • What is Multicore?
  • Shared-cache architecture
  • Recognition, Mining, and Synthesis (RMS)
• Parallel Computing & MPI
• Data Mining

TOWARD CONCURRENCY IN SOFTWARE
• Exponential growth (Moore’s Law) can’t continue forever
• Previous sources of CPU performance gains
  • Clock speed: getting more cycles
    • Higher clock speeds have become harder to exploit because of physical issues such as heat, power consumption, and current leakage (2 GHz in 2001, 3.4 GHz in 2004, now?)
  • Execution optimization: more work per cycle
    • Pipelining, branch prediction, executing multiple instructions in the same clock cycle, and reordering the instruction stream
    • Aggressive reordering risks changing the meaning of programs.
  • Cache
    • Increasing the size of on-chip cache, because main memory is much slower than the CPU

Toward Concurrency in Software 2
• Current sources of CPU performance gains
  • Is Moore’s Law over? Not yet (the number of transistors is still increasing)
• Hyperthreading
  • Runs two or more threads in parallel inside a single CPU
  • Runs some instructions in parallel
  • The CPU still has only one each of most basic features (extra registers are the exception)
  • Typical gain of 5% to 15%, up to 40% under ideal conditions
  • Does not help single-threaded applications
• Multicore
  • Runs two or more actual CPUs on one chip
  • Less than double the speed even in the ideal case: 2 * 3 GHz < 6 GHz
  • Boosts reasonably well-written multithreaded applications, but not single-threaded applications
  • Coordination overhead between the cores to ensure cache coherency
• Cache
  • Only this will broadly benefit most existing applications.
  • A cache miss that goes to main memory costs 10 to 50 times as much as a cache hit.

What is Multicore?
• Single chip
• Multiple distinct processing engines
• E.g., shared-cache dual-core architecture
[Figure: Core 0 and Core 1, each with its own CPU and L1 cache, sharing a single L2 cache]

SHARED-CACHE ARCHITECTURE
• Options for the last-level cache
  • Private to each core
  • Shared among different cores
• Benefits of the shared-cache architecture
  • Efficient use of the last-level cache
    • Reduces resource underutilization
  • Reduced cache-coherence complexity
    • Less false sharing because the cache is shared
  • Reduced data-storage redundancy
    • The same data only needs to be stored once
  • Reduced front-side bus traffic
    • Data requests can be resolved at the shared-cache level instead of going to system memory

Software Techniques for Shared-Cache Multicore Systems
• Cache blocking (data tiling)
  • Allows data to stay in the cache while it is being processed by data loops
  • Reduces unnecessary cache traffic (better cache hit ratio)
• Hold approach (late update)
  • Each thread maintains its own private copy of the data
  • The shared copy is updated only when necessary
  • Reduces the frequency of access to the shared data
• Avoid false sharing
  • What is false sharing? (Unnecessary cache-line updates)
  • How to avoid false sharing?
    • Allocate non-shared data to different cache lines (padding)
    • Copy the global variable to a local function variable, then copy the data back before the function exits
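The cache-blocking idea can be sketched as a tiled matrix multiplication. This is only an illustration of the loop structure: the function name `matmul_blocked` and the `block` parameter are hypothetical, and pure Python will not show the cache benefit that the same loop order gives in a compiled language.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p) tile by tile.

    Each (ii, kk, jj) tile touches only a block x block piece of a, b and c,
    so in a compiled language the working set can stay in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):          # tile over rows of a / c
        for kk in range(0, m, block):      # tile over the shared dimension
            for jj in range(0, p, block):  # tile over columns of b / c
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The result does not depend on `block`; only the order in which memory is visited changes.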

RECOGNITION, MINING, AND SYNTHESIS (RMS)
• The Era of Tera is coming quickly
  • Teraflops (computing power), terabits (communication), terabytes (storage)
  • World data is doubling every three years and is now measured in exabytes (a billion billion bytes)
  • A computing model is needed to deal with this enormous sea of information
• Working with models
  • Recognition (What is ?)
    • Identifying that a set of data constitutes a model and then constructing that model
  • Mining (Is it ?)
    • Searching for instances of the model
  • Synthesis (What if ?)
    • Creating a potential instance of that model in an imaginary world

RMS 2
(from P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)
• Examples
  • Medicine (a tumor)
  • Business (hiring)
  • Investment

• Multicore
• Parallel Computing & MPI
  • Parallel architectures (shared-memory vs. distributed-memory)
  • Decomposing programs (data parallelism vs. task parallelism)
  • MPI and OpenMP
• Data Mining

Parallel Computing: Introduction
• Parallel computing
  • More than just a strategy for achieving good performance
  • A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power
• Parallel computing software systems
  • Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance
• Difficulty
  • Writing explicitly parallel programs is difficult, e.g., partitioning the computation, synchronization, and data movement (needed for both a correct answer and high performance)
  • Programs must be machine-independent (portability)
  • Complexity of the problems being attacked
• Parallel computing challenges
  • Concurrency and communication
  • Need for high performance
  • Diversity of architectures

PARALLEL ARCHITECTURE 1
• Shared-memory machines
  • Have a single shared address space that can be accessed by any processor
  • Examples
    • Multicore
    • Symmetric multiprocessor (SMP)
  • Uniform Memory Access (UMA)
    • Access time is independent of the memory location
    • Uses a bus or a completely connected network
    • Not scalable
  • Shared-memory programming model
    • Needs synchronization to preserve data integrity
    • E.g., Open Specifications for MultiProcessing (OpenMP)
• Distributed-memory machines
  • The system memory is packaged with individual nodes of one or more processors (cf. separate computers connected by a network)
  • E.g., a cluster
  • Communication is required to provide data from one processor to another
  • Support the message-passing programming model
    • Send-receive communication steps
    • E.g., Message Passing Interface (MPI)

Parallel Architecture 2
[Figure not preserved in the source]

PARALLEL ARCHITECTURE 3
• Hybrid systems
  • Distributed shared-memory (DSM)
    • A distributed-memory machine that allows a processor to directly access a datum in a remote memory
    • Latency varies with the distance to the remote memory
    • Emphasizes the Non-Uniform Memory Access (NUMA) characteristics
  • SMP clusters
    • A distributed-memory system with an SMP as the unit

PARALLEL PROGRAM: Decomposition 1
• Decomposing programs
  • Decomposition: identifying the portions of the program that can run in parallel
• Decomposition strategies
  • Task (functional) parallelism
    • Different processors carry out different functions
  • Data parallelism
    • Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region
    • More commonly used in scientific problems
    • A natural form of scalability
• Programming models
  • Shared-memory programming model
    • Needs synchronization to preserve data integrity
  • Message-passing model
    • Communication is required to access a remote data location

Parallel Program: Decomposition 2
• Data parallelism
  • Exploits the parallelism inherent in many large data structures
  • Same task on different data (SPMD)
  • Can be expressed by all parallel programming models (e.g., MPI, HPF-like, OpenMP-like)
  • Features
    • Scalable
    • Hard to express when the geometry is irregular or dynamic
• Functional parallelism
  • Coarse-grain parallelism
  • Parallelism between the parts of many systems
  • Different tasks on the same or different data
  • Features
    • Parallelism is limited in size (tens, not millions)
    • Synchronization is probably good, as the parallelism and decomposition are natural
    • E.g., workflow
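A minimal sketch of the data-parallel (SPMD) pattern: the data domain is split into contiguous regions, every worker runs the same task on its own region, and the partial results are combined at the end. The helper names `split_into_regions` and `parallel_sum_of_squares` are hypothetical, and a thread pool is used here only to keep the example short; in CPython, threads illustrate the structure but do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_regions(data, n_workers):
    """Split data into n_workers contiguous regions of nearly equal size."""
    size, extra = divmod(len(data), n_workers)
    regions, start = [], 0
    for w in range(n_workers):
        stop = start + size + (1 if w < extra else 0)
        regions.append(data[start:stop])
        start = stop
    return regions


def sum_of_squares(region):
    """The single task that every worker applies to its own region."""
    return sum(x * x for x in region)


def parallel_sum_of_squares(data, n_workers=4):
    """Same task on different data, followed by a global reduction."""
    regions = split_into_regions(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, regions))
    return sum(partials)
```

Keeping the regions nearly equal in size is what gives good load balance when every element costs about the same to process.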

Parallel Program: Decomposition 3
• Load balance and scalability
  • Scalable: running time is inversely proportional to the number of processors used
    • Speedup(n) = T(1)/T(n), where T(n) is the running time on n processors
    • Scalable if Speedup(n) ≈ n
  • Second definition of scalability: scaled speedup
    • Scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n
• Why is scalability not achieved?
  • A region that must be run sequentially
    • Total speedup ≤ T(1)/Ts, where Ts is the time spent in the sequential region (Amdahl’s Law)
  • A requirement for a high degree of communication or coordination
  • Poor load balance (good load balance is a major goal of parallel programming)
    • If one of the processors takes half of the parallel work, speedup will be limited to a factor of two
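The speedup definition and the Amdahl bound can be checked numerically. In this sketch the function names and the `serial_fraction` parameter (Ts/T(1), the share of the single-processor time that must run sequentially) are illustrative assumptions, and communication cost is ignored.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def amdahl_speedup(serial_fraction, n):
    """Speedup on n processors when a fixed fraction of T(1) is sequential.

    T(n) = Ts + (T(1) - Ts) / n, so the speedup can never exceed
    T(1) / Ts = 1 / serial_fraction, however large n becomes.
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def load_imbalance_bound(largest_share):
    """Upper bound on speedup when one processor holds this share of the work."""
    return 1.0 / largest_share
```

For example, with 10% sequential work, `amdahl_speedup(0.1, 10)` is about 5.3 and no number of processors can push the speedup past 10; a processor holding half of the work caps the speedup at `load_imbalance_bound(0.5)`, which is 2.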

Parallel Program: Memory-Hierarchy Management
• Blocking
  • Ensures that data remains in cache between subsequent accesses to the same memory location
• Elimination of false sharing
  • False sharing: two different processors access distinct data items that reside on the same cache block
  • Ensure that data used by different processors reside on different cache blocks (by padding: inserting empty bytes in a data structure)
• Communication minimization and placement
  • Move send and receive commands far enough apart that the time spent on communication can be overlapped with computation
• Stride-one access
  • Programs in which the loops access contiguous data items are much more efficient than those that do not

Message Passing Interface (MPI) 1
• Message Passing Interface (MPI)
  • A specification for a set of functions for managing the movement of data among sets of communicating processes
  • The dominant scalable parallel computing paradigm for scientific problems
  • Explicit message send and receive using a rendezvous model
    • Point-to-point communication
    • Collective communication
  • Commonly implemented in terms of an SPMD model
    • All processes execute essentially the same logic
• Pros:
  • Scalable and portable
  • Race conditions are avoided (implicit synchronization with the copy)
• Cons:
  • The programmer must implement the details of communication

MPI
• 6 key functions
  • MPI_INIT
  • MPI_COMM_RANK
  • MPI_COMM_SIZE
  • MPI_SEND
  • MPI_RECV
  • MPI_FINALIZE
• Collective communications
  • Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
  • General reduction operations (sum, minimum, scan)
• Blocking, nonblocking, buffered, and synchronous messaging

Open Specifications for Multiprocessing (OpenMP) 1
• Appropriate to shared-memory systems
• A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes that help compilers produce parallel code
• Provides parallel loops and collective operations such as summation over loop indices
• Provides lock variables to allow fine-grain synchronization between threads

OpenMP 2
• Directives: instruct the compiler to
  • Create threads
  • Perform synchronization operations
  • Manage shared memory
• Examples
  • PARALLEL DO ~ END PARALLEL DO: explicit parallel loop
  • SCHEDULE (STATIC): assigns contiguous blocks at compile time
  • SCHEDULE (DYNAMIC): assigns contiguous blocks at run time
  • REDUCTION(+: x): the final value of variable x is determined by a global sum
  • PARALLEL SECTIONS: task parallelism
• OpenMP synchronization primitives
  • Critical sections
  • Atomic updates
  • Barriers
  • Master selection

OpenMP 3
• Summary
  • Work decomposition
  • Ideal target system: uniform-access, shared-memory
  • Specifies where multiple threads should be applied and how to assign work to those threads
• Pros:
  • Excellent programming interface for uniform-access, shared-memory machines
• Cons:
  • No way to specify locality in machines with non-uniform shared memory or distributed memory

• Multicore
• Parallel Computing & MPI
• Data Mining
  • Expectation Maximization (EM)
  • Deterministic Annealing (DA)
  • Hidden Markov Model (HMM)
  • Support Vector Machine (SVM)

Expectation Maximization (EM)
• Expectation Maximization (EM)
  • A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables
  • An efficient iterative procedure
  • Goal: estimate unknown parameters, given measurements
  • Hill-climbing approach: guaranteed to reach a local maximum
• Two steps
  • E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters
  • M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known (the estimated missing data from the E-step are used in lieu of the actual missing data)
  • The two steps are repeated until the likelihood converges
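As an illustration of the E-step/M-step loop, here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians. The mixture model, the function names, the starting values, and the variance floor are all assumptions chosen for the example; the slides do not prescribe a particular model. The "missing data" are the unknown component labels, estimated as responsibilities.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(xs, mu=(0.0, 1.0), var=(1.0, 1.0), w=0.5,
                     tol=1e-8, max_iter=500):
    """Fit weights, means and variances of a two-component 1-D mixture."""
    mu0, mu1 = mu
    v0, v1 = var
    prev_ll = -math.inf
    ll = prev_ll
    for _ in range(max_iter):
        # E-step: responsibility of component 0 for each point,
        # given the current parameter estimates.
        resp, ll = [], 0.0
        for x in xs:
            p0 = w * normal_pdf(x, mu0, v0)
            p1 = (1.0 - w) * normal_pdf(x, mu1, v1)
            ll += math.log(p0 + p1)
            resp.append(p0 / (p0 + p1))
        if ll - prev_ll < tol:  # likelihood has converged
            break
        prev_ll = ll
        # M-step: maximize the likelihood treating responsibilities as known.
        n0 = sum(resp)
        n1 = len(xs) - n0
        mu0 = sum(r * x for r, x in zip(resp, xs)) / n0
        mu1 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1
        # Floor the variances (an assumption) to avoid a degenerate component.
        v0 = max(sum(r * (x - mu0) ** 2 for r, x in zip(resp, xs)) / n0, 1e-6)
        v1 = max(sum((1.0 - r) * (x - mu1) ** 2 for r, x in zip(resp, xs)) / n1, 1e-6)
        w = n0 / len(xs)
    return (mu0, mu1), (v0, v1), w, ll
```

Because EM is a hill-climbing method, the result depends on the starting values; a different initialization can end in a different local maximum.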

Deterministic Annealing (DA)
• Purpose: avoid local minima (optimization)
  • Example: clustering, a form of unsupervised learning
• Simulated Annealing (SA)
  • A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the cost of the current state (Monte Carlo method)
• Deterministic Annealing (DA)
  • Deterministic: does not wander randomly (minimizes the free energy directly)
  • Annealing: still aims to avoid local minima by keeping a certain level of uncertainty
  • Maintains the free energy at its minimum
    • Equation: $F = D - TH$ ($T$: temperature, $H$: Shannon entropy, $D$: cost)
    • At large $T$ the entropy $H$ dominates, while at small $T$ the cost $D$ dominates
  • Annealing lowers the temperature so that the solution tracks the minimum continuously
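The trade-off in $F = D - TH$ can be made concrete for clustering with fixed cluster centers. The sketch below assumes the usual formulation in which each point is softly assigned to centers with Gibbs probabilities proportional to exp(-cost/T) and the cost is the squared distance; these choices, and the function names, are assumptions for illustration and are not spelled out in the slides.

```python
import math


def soft_assignments(points, centers, temperature):
    """Association probabilities p(center | point) proportional to exp(-cost / T)."""
    probs = []
    for x in points:
        costs = [(x - y) ** 2 for y in centers]
        low = min(costs)  # subtract the minimum cost for numerical stability
        weights = [math.exp(-(c - low) / temperature) for c in costs]
        total = sum(weights)
        probs.append([wgt / total for wgt in weights])
    return probs


def free_energy(points, centers, temperature):
    """Return (F, D, H) with F = D - T * H for the soft assignments above."""
    probs = soft_assignments(points, centers, temperature)
    # D: expected cost, summed over all points.
    d = sum(p * (x - y) ** 2
            for x, row in zip(points, probs)
            for p, y in zip(row, centers))
    # H: Shannon entropy of the assignments, summed over all points.
    h = -sum(p * math.log(p) for row in probs for p in row if p > 0.0)
    return d - temperature * h, d, h
```

At a very large temperature every point is shared almost equally among the centers (entropy dominates); as the temperature approaches zero, each point is assigned to its nearest center (cost dominates).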

DA for Clustering
• Start with a single cluster, whose solution is its centroid Y1
• For some annealing schedule for T, iterate the above algorithm, testing the covariance matrix in Xi about each cluster center to see if the cluster is “elongated”
• Split the cluster if the elongation is “long enough”
• There is no need to assume the number of clusters; instead, a final resolution T (or equivalent) is chosen
• At T = 0, the uninteresting solution is N clusters, one at each point xi

Hidden Markov Model (HMM) 1
• Markov model
  • A system that may be described at any time as being in one of a set of $N$ distinct states, $S_1, S_2, \ldots, S_N$
• State transition probability
  • The special case of a discrete, first-order Markov chain:
    • $P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$ (1)
  • Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities $a_{ij}$ of the form
    • $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
• Initial state probability
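A small sketch of the two constraints on the transition probabilities and of how a time-independent transition matrix advances a state distribution by one step. The function names and the example matrix are illustrative assumptions.

```python
def is_transition_matrix(a, tol=1e-9):
    """Check that a is square, a[i][j] >= 0, and every row sums to 1."""
    n = len(a)
    return all(
        len(row) == n
        and all(p >= 0.0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in a
    )


def step_distribution(dist, a):
    """P[q_t = S_j] = sum_i P[q_{t-1} = S_i] * a[i][j]."""
    n = len(a)
    return [sum(dist[i] * a[i][j] for i in range(n)) for j in range(n)]
```

For instance, with `a = [[0.9, 0.1], [0.5, 0.5]]` and a chain that starts in the first state, `step_distribution([1.0, 0.0], a)` gives the first row of `a`, and applying it again gives the distribution two steps ahead.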

HIDDEN MARKOV MODEL (HMM) 2
• Hidden Markov Model
  • The observation is a probabilistic function of the state
  • The state is hidden
• Elements of an HMM
  • $N$, the number of states in the model (although the states are hidden)
  • $M$, the number of distinct observation symbols per state, i.e., the discrete alphabet size
  • The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$
  • The observation symbol probability distribution (emission probability) in state $j$, $B = \{b_j(k)\}$, where $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$, $1 \le j \le N$, $1 \le k \le M$
  • The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$
  • Compact notation: $\lambda = (A, B, \pi)$

HIDDEN MARKOV MODEL (HMM) 3
• Three basic problems for HMMs
  • Prob(observation sequence | model): given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model?
  • Finding the optimal state sequence: given the observation sequence $O = O_1 O_2 \ldots O_T$ and a model $\lambda = (A, B, \pi)$, how do we choose a corresponding state sequence $Q = q_1 q_2 \ldots q_T$ that is optimal in some meaningful sense (i.e., best “explains” the observations)?
  • Finding the optimal model parameters: how do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

HIDDEN MARKOV MODEL (HMM) 4
• Solutions to the three basic problems for HMMs
• Solution to Problem 1 (Forward-Backward procedure)
  • Enumeration (the straightforward way) is computationally infeasible
  • Forward procedure
    • Consider the forward variable $\alpha_t(i) = P(O_1 O_2 \ldots O_t, q_t = S_i \mid \lambda)$, i.e., the probability of the partial observation sequence $O_1 O_2 \ldots O_t$ (until time $t$) and state $S_i$ at time $t$, given the model $\lambda$
• Solution to Problem 2 (Viterbi algorithm)
  • Optimality criterion: find the single best state sequence (path), i.e., maximize $P(Q \mid O, \lambda)$, which is equivalent to maximizing $P(Q, O \mid \lambda)$
  • A formal technique for finding this single best state sequence exists; it is based on dynamic programming and is called the Viterbi algorithm
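A minimal sketch of the forward procedure and the Viterbi algorithm for a discrete HMM. It assumes states and observation symbols are encoded as 0-based integers, with `a[i][j]` the transition probabilities, `b[j][k]` the emission probabilities, and `pi[i]` the initial distribution; the function names are illustrative. No scaling is applied, so long sequences would underflow in practice.

```python
def forward(obs, a, b, pi):
    """Return P(O | lambda) by propagating the forward variable alpha_t(i)."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    # Termination: sum over the final states
    return sum(alpha)


def viterbi(obs, a, b, pi):
    """Return (best state path, P(Q, O | lambda) of that path)."""
    n = len(pi)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        best = [max(range(n), key=lambda i, j=j: prev[i] * a[i][j])
                for j in range(n)]
        delta = [prev[best[j]] * a[best[j]][j] * b[j][o] for j in range(n)]
        back.append(best)
    # Backtrack from the most probable final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for best in reversed(back):
        state = best[state]
        path.append(state)
    path.reverse()
    return path, max(delta)
```

Both routines cost on the order of N² · T operations, compared with the N^T state sequences that direct enumeration would have to visit.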

HIDDEN MARKOV MODEL (HMM) 5
• Solution to Problem 3 (Baum-Welch algorithm)
  • The third problem of HMMs is to determine a method to adjust the model parameters $(A, B, \pi)$ to maximize the probability of the observation sequence given the model
  • Choose $\lambda = (A, B, \pi)$ such that $P(O \mid \lambda)$ is locally maximized, using an iterative procedure such as the Baum-Welch method (or, equivalently, the EM (expectation-maximization) method) or using gradient techniques
  • Reestimation (iterative update and improvement): define $\xi_t(i, j)$, the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given the model and the observation sequence, i.e., $\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$
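For reference, the standard way to evaluate $\xi_t(i, j)$ and turn it into updated parameters is sketched below. It relies on a backward variable $\beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid q_t = S_i, \lambda)$ and on the state occupancy $\gamma_t(i)$, neither of which is defined on the slides, so both should be read as assumptions taken from the usual textbook formulation.

```latex
\begin{align}
\xi_t(i, j) &= \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
  \qquad 1 \le t \le T-1 \\
\gamma_t(i) &= \sum_{j=1}^{N} \xi_t(i, j) \\
\bar{\pi}_i &= \gamma_1(i) \\
\bar{a}_{ij} &= \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
\end{align}
```

Each pass recomputes $\alpha$, $\beta$, and $\xi$ with the current $\lambda$ (the E-step) and then replaces the parameters with the reestimated values (the M-step), repeating until $P(O \mid \lambda)$ stops improving.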
