OVERVIEW OF MULTICORE, PARALLEL COMPUTING, AND DATA MINING

Seung-Hee Bae, Indiana University Computer Science Dept.

Presentation Transcript


  1. Overview of Multicore, Parallel Computing, and Data Mining
     Seung-Hee Bae
     Indiana University Computer Science Dept.

  2. Outline
     • Multicore
     • Parallel Computing & MPI
     • Data Mining

  3. Multicore
        • Toward Concurrency
        • What is Multicore?
        • Shared cache architecture
        • Recognition, Mining, and Synthesis (RMS)
     Parallel Computing & MPI
     Data Mining

  4. Toward Concurrency in Software
     • Exponential growth (Moore’s Law) can’t continue.
     • Previous CPU performance gains
        • Clock speed: getting more cycles
           • Higher clock speeds have become harder to exploit because of physical issues such as heat, power consumption, and current leakage. (2 GHz: 2001, 3.4 GHz: 2004, now?)
        • Execution optimization: more work per cycle
           • Pipelining, branch prediction, executing multiple instructions in the same clock cycle, and reordering the instruction stream (which can change the meaning of programs).
        • Cache
           • Increasing the size of the on-chip cache, because main memory is much slower than the CPU.

  5. Toward Concurrency in Software 2
     • Current CPU performance gains
        • Is Moore’s law over? Not yet (the number of transistors is still increasing).
     • Hyperthreading
        • Runs two or more threads in parallel inside a single CPU.
        • Runs some instructions in parallel.
        • Has only one each of most basic CPU features (apart from extra registers).
        • Typical gain: 5% to 15%, up to 40% under ideal conditions.
        • It doesn’t help single-threaded applications.
     • Multicore
        • Runs two or more actual CPUs on one chip.
        • Less than double the speed even in the ideal case: 2 × 3 GHz < 6 GHz.
        • It will boost reasonably well-written multithreaded applications, but not single-threaded applications.
        • Coordination between the cores to ensure cache coherency adds overhead.
     • Cache
        • Only this will broadly benefit most existing applications.
        • A cache miss costs 10 to 50 times as much as a cache hit.

  6. What is Multicore?
     • Single chip
     • Multiple distinct processing engines
     • E.g., shared-cache dual-core architecture
     [Figure: Core 0 and Core 1, each with its own CPU and L1 cache, sharing one L2 cache.]

  7. Shared-Cache Architecture
     • Options for the last-level cache
        • Private to each core
        • Shared among different cores
     • Benefits of the shared-cache architecture
        • Efficient use of the last-level cache
           • Reduces resource underutilization.
        • Reduced cache-coherence complexity
           • Less false sharing because the cache is shared.
        • Reduced data-storage redundancy
           • The same data only needs to be stored once.
        • Reduced front-side bus traffic
           • Data requests can be resolved at the shared-cache level instead of going to system memory.

  8. Software Techniques for Shared-Cache Multicore Systems
     • Cache blocking (data tiling)
        • Allows data to stay in the cache while it is being processed by data loops.
        • Reduces unnecessary cache traffic (better cache hit ratio).
     • Hold approach (late update)
        • Each thread maintains its own private copy of the data.
        • The shared copy is updated only when necessary.
        • Reduces the frequency of access to the shared data.
     • Avoid false sharing
        • What is false sharing? An unnecessary cache-line update.
        • How to avoid false sharing?
           • Allocate non-shared data to different cache lines (padding).
           • Copy the global variable to a local function variable, then copy the data back before the function exits.
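
The cache-blocking idea on this slide can be sketched as a tiled loop nest. The example below is only an illustration of the loop structure: the function name `transpose_blocked` and the `block` parameter are hypothetical, and in an interpreted language the actual cache benefit is much smaller than in compiled code.

```python
def transpose_blocked(a, block=2):
    """Transpose a square list-of-lists matrix one tile at a time.

    Each (ii, jj) tile is finished before moving on, so the rows of `a`
    and `out` touched by the tile are reused while they are still hot.
    """
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile origin, row direction
        for jj in range(0, n, block):      # tile origin, column direction
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = a[i][j]
    return out


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transposed = transpose_blocked(matrix, block=2)
```

The tile size would normally be chosen so that one tile of the input and one tile of the output fit in the cache level being targeted.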

  9. Recognition, Mining, and Synthesis (RMS)
     • The Era of Tera is coming quickly
        • Teraflops (computing power), terabits (communication), terabytes (storage)
        • World data is doubling every three years and is now measured in exabytes (a billion billion bytes).
        • A computing model is needed to deal with this enormous sea of information.
     • Working with models
        • Recognition (What is ?)
           • Identifying that a set of data constitutes a model and then constructing that model.
        • Mining (Is it ?)
           • Searching for instances of the model.
        • Synthesis (What if ?)
           • Creating a potential instance of that model in an imaginary world.

  10. RMS 2
      (from P. Dubey, “Recognition, Mining and Synthesis Moves Computers to the Era of Tera,” Technology@Intel Magazine, Feb. 2005.)
      • Examples
         • Medicine (a tumor)
         • Business (hiring)
         • Investment

  11. Multicore
      Parallel Computing & MPI
         • Parallel architectures (Shared-Memory vs. Distributed-Memory)
         • Decomposing Program (Data Parallelism vs. Task Parallelism)
         • MPI and OpenMP
      Data Mining

  12. Parallel Computing: Introduction
      • Parallel computing
         • More than just a strategy for achieving good performance.
         • A vision for how computation can seamlessly scale from a single processor to virtually limitless computing power.
      • Parallel computing software systems
         • Goal: to make parallel programming easier and the resulting applications more portable and scalable while achieving good performance.
         • Difficulty
            • Explicit parallel programming is difficult: computation, partitioning, synchronization, and data movement must all be handled (correct answer and high performance).
            • Programs must be machine-independent (portability).
            • Complexity of the problems being attacked.
      • Parallel computing challenges
         • Concurrency and communication
         • Need for high performance
         • Diversity of architectures

  13. Parallel Architecture 1
      • Shared-memory machines
         • Have a single shared address space that can be accessed by any processor.
         • Examples
            • Multicore
            • Symmetric multiprocessor (SMP)
         • Uniform Memory Access (UMA)
            • Access time is independent of the location.
            • Use a bus or a completely connected network.
            • Not scalable
         • Shared-memory programming model
            • Needs synchronization to preserve integrity.
            • E.g., Open Specifications for MultiProcessing (OpenMP)
      • Distributed-memory machines
         • The system memory is packaged with individual nodes of one or more processors (cf. separate computers connected by a network).
         • E.g., cluster
         • Communication is required to provide data from one processor to a different processor.
         • Support the message-passing programming model
            • Send-receive communication steps.
            • E.g., Message Passing Interface (MPI)

  14. Parallel Architecture 2

  15. Parallel Architecture 3
      • Hybrid systems
         • Distributed shared-memory (DSM)
            • A distributed-memory machine that allows a processor to directly access a datum in a remote memory.
            • Latency varies with the distance to the remote memory.
            • Emphasizes the Non-Uniform Memory Access (NUMA) characteristics.
         • SMP clusters
            • A distributed-memory system with an SMP as the unit.

  16. Parallel Program: Decomposition 1
      • Decomposing programs
         • Decomposition: identifying the portions of the program that can run in parallel.
         • Decomposition strategies
            • Task (functional) parallelism
               • Different processors carry out different functions.
            • Data parallelism
               • Subdivides the data domain of a problem into multiple regions and assigns different processors to compute the results for each region.
               • More commonly used in scientific problems.
               • A natural form of scalability.
      • Programming models
         • Shared-memory programming model
            • Needs synchronization to preserve integrity.
         • Message-passing model
            • Communication is required to access a remote data location.

  17. Parallel Program: Decomposition 2
      • Data parallelism
         • Exploits the parallelism inherent in many large data structures.
         • Same task on different data (SPMD).
         • Can be expressed by all parallel programming models (i.e., MPI, HPF-like, OpenMP-like).
         • Features
            • Scalable
            • Hard to express when the geometry is irregular or dynamic
      • Functional parallelism
         • Coarse-grain parallelism
         • Parallelism between the parts of many systems.
         • Different tasks on the same or different data.
         • Features
            • Parallelism limited in size
               • Tens, not millions
            • Synchronization probably good as parallelism
            • Decomposition natural
            • E.g., workflow

  18. Parallel Program: Decomposition 3
      • Load balance and scalability
         • Scalable: running time is inversely proportional to the number of processors used.
            • Speedup(n) = T(1)/T(n)
            • Scalable if Speedup(n) ≈ n
         • Second definition of scalability: scaled speedup
            • Scalable if the running time remains the same when the number of processors and the problem size are both increased by a factor of n.
      • Why is scalability not achieved?
         • A region that must be run sequentially.
            • Total speedup ≤ T(1)/Ts, where Ts is the time spent in the sequential region (Amdahl’s Law).
         • A need for a high degree of communication or coordination.
         • Poor load balance (avoiding it is a major goal of parallel programming)
            • If one of the processors takes half of the parallel work, speedup will be limited to a factor of two.
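
The speedup definitions on this slide can be checked with a few lines of arithmetic. The sketch below assumes the usual form of Amdahl's Law, in which a fraction s = Ts/T(1) of the single-processor time is sequential and the rest divides perfectly over n processors; the function names are hypothetical.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def amdahl_speedup(serial_fraction, n):
    """Speedup on n processors when `serial_fraction` of T(1) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def load_limited_speedup(loads):
    """Speedup when processor k receives loads[k] units of work.

    The run finishes only when the most heavily loaded processor finishes.
    """
    return sum(loads) / max(loads)


# 10% sequential code: the speedup can never exceed 1 / 0.1 = 10.
on_ten = amdahl_speedup(0.1, 10)
on_many = amdahl_speedup(0.1, 10**9)

# One of five processors takes half of the work: speedup is limited to 2.
unbalanced = load_limited_speedup([4, 1, 1, 1, 1])
```

With a 10% sequential region, ten processors give a speedup of about 5.26 rather than 10, and no number of processors pushes it past 10.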

  19. Parallel Program: Memory-Hierarchy Management
      • Blocking
         • Ensuring that data remains in cache between subsequent accesses to the same memory location.
      • Elimination of false sharing
         • False sharing: two different processors access distinct data items that reside on the same cache block.
         • Ensure that data used by different processors reside on different cache blocks (by padding: inserting empty bytes in a data structure).
      • Communication minimization and placement
         • Move send and receive commands far enough apart so that the time spent on communication can be overlapped with computation.
      • Stride-one access
         • Programs in which the loops access contiguous data items are much more efficient than those that do not.

  20. Message Passing Interface (MPI) 1
      • Message Passing Interface (MPI)
         • A specification for a set of functions for managing the movement of data among sets of communicating processes.
         • The dominant scalable parallel computing paradigm for scientific problems.
         • Explicit message send and receive using a rendezvous model.
         • Point-to-point communication
         • Collective communication
         • Commonly implemented in terms of an SPMD model
            • All processes execute essentially the same logic.
      • Pros:
         • Scalable and portable
         • Race conditions avoided (implicit synchronization with the copy)
      • Cons:
         • The programmer must implement the details of communication.

  21. MPI
      • 6 key functions
         • MPI_INIT
         • MPI_COMM_RANK
         • MPI_COMM_SIZE
         • MPI_SEND
         • MPI_RECV
         • MPI_FINALIZE
      • Collective communications
         • Barrier, Broadcast, Gather, Scatter, All-to-all, Exchange
         • General reduction operations (sum, minimum, scan)
      • Blocking, nonblocking, buffered, synchronous messaging
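
The SPMD pattern behind the six key functions can be imitated without an MPI installation. The sketch below is an emulation, not MPI: it uses Python threads as "ranks" and one standard-library `queue.Queue` per rank as a mailbox, so `send` is buffered rather than a rendezvous. The names `run_spmd`, `send`, `recv`, and `program` are hypothetical; `rank` and `size` play the roles of the values returned by MPI_COMM_RANK and MPI_COMM_SIZE.

```python
import queue
import threading


def run_spmd(size, data):
    """Run the same program on `size` emulated ranks and sum `data` on rank 0."""
    inbox = [queue.Queue() for _ in range(size)]   # one mailbox per rank
    result = {}

    def send(dest, message):
        inbox[dest].put(message)                   # buffered send

    def recv(rank):
        return inbox[rank].get()                   # blocks until a message arrives

    def program(rank):
        # Every rank executes this same function; only `rank` differs (SPMD).
        partial = sum(data[rank::size])            # this rank's share of the data
        if rank == 0:
            total = partial
            for _ in range(size - 1):              # one message from each other rank
                total += recv(0)
            result["total"] = total
        else:
            send(0, partial)

    threads = [threading.Thread(target=program, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


total = run_spmd(4, list(range(100)))
```

The receive loop on rank 0 is a hand-written sum reduction; in MPI the same result would normally come from a collective reduction rather than explicit point-to-point messages.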

  22. Open Specifications for Multiprocessing (OpenMP) 1
      • Appropriate for shared-memory machines.
      • A sophisticated set of annotations (compiler directives) for traditional C, C++, or Fortran codes that help compilers produce parallel code.
      • It provides parallel loops and collective operations such as summation over loop indices.
      • It provides lock variables to allow fine-grain synchronization between threads.

  23. OpenMP 2
      • Directives: instruct the compiler to
         • Create threads.
         • Perform synchronization operations.
         • Manage shared memory.
      • Examples
         • PARALLEL DO ~ END PARALLEL DO: explicit parallel loop.
         • SCHEDULE (STATIC): assign contiguous blocks of iterations in advance, before the loop runs.
         • SCHEDULE (DYNAMIC): assign contiguous blocks of iterations at run time.
         • REDUCTION(+: x): the final value of variable x is determined by a global sum.
         • PARALLEL SECTIONS: task parallelism.
      • OpenMP synchronization primitives
         • Critical sections
         • Atomic updates
         • Barriers
         • Master selection
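
What a parallel loop with a static schedule and a sum reduction does can be written out by hand. The sketch below is an analogy in Python using `concurrent.futures.ThreadPoolExecutor`, not OpenMP itself; `static_blocks` and `parallel_sum_of_squares` are hypothetical names, and the equal-sized contiguous split is an assumption about how a static schedule divides the iterations.

```python
from concurrent.futures import ThreadPoolExecutor


def static_blocks(n, workers):
    """Split range(n) into `workers` contiguous blocks of nearly equal size."""
    base, extra = divmod(n, workers)
    blocks, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        blocks.append(range(start, stop))
        start = stop
    return blocks


def parallel_sum_of_squares(n, workers=4):
    """Each worker sums its own block privately; the partial sums are then combined."""
    def partial(block):
        return sum(i * i for i in block)       # private accumulator per worker

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, static_blocks(n, workers)))
    return sum(partials)                       # the "+" reduction


blocks = [list(b) for b in static_blocks(10, 4)]
answer = parallel_sum_of_squares(10)
```

Keeping a private partial sum per worker and combining only at the end is the same idea as the hold approach (late update) for shared data.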

  24. OpenMP 3
      • Summary
         • Work decomposition
         • Ideal target system: uniform-access, shared-memory.
         • Specify where multiple threads should be applied, and how to assign work to those threads.
      • Pros:
         • Excellent programming interface for uniform-access, shared-memory machines.
      • Cons:
         • No way to specify locality on machines with non-uniform shared memory or distributed memory.

  25. Multicore
      Parallel Computing & MPI
      Data Mining
         • Expectation Maximization (EM)
         • Deterministic Annealing (DA)
         • Hidden Markov Model (HMM)
         • Support Vector Machine (SVM)

  26. Expectation Maximization (EM)
      • Expectation Maximization (EM)
         • A general algorithm for maximum-likelihood (ML) estimation where the data are “incomplete” or the likelihood function involves latent variables.
         • An efficient iterative procedure.
         • Goal: estimate unknown parameters, given measurements.
         • Hill-climbing approach: converges to a local maximum (not necessarily the global one).
      • Two steps
         • E-step (Expectation): the missing data are estimated given the observed data and the current estimate of the model parameters.
         • M-step (Maximization): the likelihood function is maximized under the assumption that the missing data are known. (The estimated missing data from the E-step are used in lieu of the actual missing data.)
         • The two steps are repeated until the likelihood converges.
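
The E-step and M-step can be made concrete on a small problem. The sketch below assumes a mixture of two one-dimensional Gaussians with unit variance, where the "missing data" are the component labels; the model choice, the function names, and the sample values are illustrative assumptions, not taken from the slides.

```python
import math


def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian with mean `mu`."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def log_likelihood(xs, w, mu0, mu1):
    return sum(math.log(w * normal_pdf(x, mu0) + (1.0 - w) * normal_pdf(x, mu1))
               for x in xs)


def em_two_gaussians(xs, mu0, mu1, w=0.5, iters=20):
    """Estimate the mixing weight and the two means; also return the likelihood trace."""
    history = []
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        # (the expected value of the hidden label under the current parameters).
        r = []
        for x in xs:
            p0 = w * normal_pdf(x, mu0)
            p1 = (1.0 - w) * normal_pdf(x, mu1)
            r.append(p0 / (p0 + p1))
        # M-step: re-estimate the parameters as if the labels were known.
        n0 = sum(r)
        n1 = len(xs) - n0
        mu0 = sum(ri * x for ri, x in zip(r, xs)) / n0
        mu1 = sum((1.0 - ri) * x for ri, x in zip(r, xs)) / n1
        w = n0 / len(xs)
        history.append(log_likelihood(xs, w, mu0, mu1))
    return w, mu0, mu1, history


samples = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
w, mu0, mu1, history = em_two_gaussians(samples, mu0=-1.0, mu1=1.0)
```

The `history` list records the log-likelihood after each iteration, which is the quantity whose convergence ends the loop in practice; here a fixed iteration count is used to keep the sketch short.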

  27. Deterministic Annealing (DA)
      • Purpose: avoid local minima (optimization)
         • Clustering is an example from unsupervised learning.
      • Simulated Annealing (SA)
         • A sequence of random moves is generated, and the random decision to accept a move depends on the cost of the resulting configuration relative to the cost of the current state (Monte Carlo method).
      • Deterministic Annealing (DA)
         • Deterministic: does not wander randomly (minimizes the free energy directly).
         • Annealing: still wants to avoid local minima by keeping a certain level of uncertainty.
         • Maintains the free energy at its minimum.
            • F = D - TH (T: temperature, H: Shannon entropy, D: cost)
            • At large T, the entropy (H) dominates, while at small T the cost dominates.
            • Annealing lowers the temperature so that the solution tracks continuously.
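
A minimal sketch of deterministic annealing for clustering follows. It assumes one-dimensional data, two centroids, squared distance as the cost D, and soft assignments proportional to exp(-d/T), which is the distribution that minimizes F = D - TH for fixed centroids. The function name, the cooling schedule, and the small perturbation used to let coincident centroids separate are all assumptions made for illustration.

```python
import math


def da_cluster_1d(xs, t_start=50.0, t_end=0.01, cooling=0.8, inner=20, eps=1e-3):
    """Anneal two centroids from the global mean down to a low temperature."""
    mean = sum(xs) / len(xs)
    y = [mean, mean]                      # at high T there is effectively one cluster
    t = t_start
    while t > t_end:
        # Nudge coincident centroids apart so they can split once T is low enough.
        if abs(y[1] - y[0]) < eps:
            y = [y[0] - eps, y[0] + eps]
        for _ in range(inner):
            weight = [0.0, 0.0]
            total = [0.0, 0.0]
            for x in xs:
                d = [(x - yk) ** 2 for yk in y]
                d_min = min(d)            # subtract the minimum for numerical stability
                e = [math.exp(-(dk - d_min) / t) for dk in d]
                z = e[0] + e[1]
                for k in range(2):
                    p = e[k] / z          # soft assignment of x to centroid k
                    weight[k] += p
                    total[k] += p * x
            # Each centroid moves to the weighted mean of the points assigned to it.
            y = [total[k] / weight[k] for k in range(2)]
        t *= cooling                      # lower the temperature
    return sorted(y)


points = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
centroids = da_cluster_1d(points)
```

No random moves are made at any point: at high temperature both centroids sit at the global mean, and they separate only after the temperature has dropped far enough for the cost term to outweigh the entropy term.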

  28. DA for Clustering
      • Start with a single cluster, whose solution is the centroid Y1.
      • For some annealing schedule for T, iterate the above algorithm, testing the covariance matrix in Xi about each cluster center to see if it is “elongated”.
      • Split the cluster if the elongation is “long enough”.
      • You do not need to assume the number of clusters, but rather a final resolution T or equivalent.
      • At T = 0, the uninteresting solution is N clusters, one at each point xi.

  29. Hidden Markov Model (HMM) 1
      • Markov model
         • A system which may be described at any time as being in one of a set of $N$ distinct states, $S_1, S_2, \ldots, S_N$.
      • State transition probability
         • The special case of a discrete, first-order Markov chain:
            • $P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots] = P[q_t = S_j \mid q_{t-1} = S_i]$ (1)
         • Furthermore, consider those processes in which the right-hand side of (1) is independent of time, thereby leading to the set of state transition probabilities $a_{ij}$ of the form
            • $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$
            • $a_{ij} \ge 0$
            • $\sum_{j=1}^{N} a_{ij} = 1$
      • Initial state probability

  30. Hidden Markov Model (HMM) 2
      • Hidden Markov Model
         • The observation is a probabilistic function of the state.
         • The state is hidden.
      • Elements of an HMM
         • $N$, the number of states in the model (although the states are hidden).
         • $M$, the number of distinct observation symbols per state, i.e., the discrete alphabet size.
         • The state transition probability distribution $A = \{a_{ij}\}$, where $a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$.
         • The observation symbol probability distribution (emission probability) in state $j$, $B = \{b_j(k)\}$, where $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$, $1 \le j \le N$, $1 \le k \le M$.
         • The initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$.
         • Compact notation: $\lambda = (A, B, \pi)$

  31. Hidden Markov Model (HMM) 3
      • Three basic problems for HMMs
         • Prob(observation seq | model): Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given the model?
         • Finding the optimal state sequence: Given the observation sequence $O = O_1 O_2 \cdots O_T$ and a model $\lambda = (A, B, \pi)$, how do we choose a corresponding state sequence $Q = q_1 q_2 \cdots q_T$ which is optimal in some meaningful sense (i.e., best “explains” the observations)?
         • Finding the optimal model parameters: How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

  32. Hidden Markov Model (HMM) 4
      • Solutions to the three basic problems for HMMs
         • Solution to problem 1 (Forward-Backward procedure)
            • Enumeration (the straightforward way): computationally infeasible.
            • Forward procedure
               • Consider the forward variable $\alpha_t(i) = P(O_1 O_2 \cdots O_t, q_t = S_i \mid \lambda)$, i.e., the probability of the partial observation sequence $O_1 O_2 \cdots O_t$ (until time $t$) and state $S_i$ at time $t$, given the model $\lambda$.
         • Solution to problem 2 (Viterbi algorithm)
            • Optimality criterion: to find the single best state sequence (path), i.e., to maximize $P(Q \mid O, \lambda)$, which is equivalent to maximizing $P(Q, O \mid \lambda)$.
            • A formal technique for finding this single best state sequence exists, based on dynamic programming methods, and is called the Viterbi algorithm.
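
The forward procedure and the Viterbi algorithm can be sketched in a few lines and compared against direct enumeration. The code below assumes states and observation symbols are encoded as integers starting at 0, with `A[i][j]`, `B[j][k]`, and `pi[i]` holding the model of the previous slides; the function names and the two-state example model are made up for illustration.

```python
from itertools import product


def forward(obs, A, B, pi):
    """P(O | lambda) via the forward variable alpha_t(i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]            # initialization
    for o in obs[1:]:                                           # induction over time
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                           # termination


def viterbi(obs, A, B, pi):
    """Single best state sequence and its joint probability P(Q, O | lambda)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []                                                   # best predecessors per step
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    best = delta[state]
    path = [state]
    for prev in reversed(back):                                 # backtrack
        state = prev[state]
        path.append(state)
    return path[::-1], best


def enumerate_paths(obs, A, B, pi):
    """Brute force over all N**T state sequences; feasible only for tiny problems."""
    total, best_path, best_p = 0.0, None, -1.0
    for q in product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
        if p > best_p:
            best_path, best_p = list(q), p
    return total, best_path, best_p


A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
obs = [0, 1, 0]
likelihood = forward(obs, A, B, pi)
path, path_prob = viterbi(obs, A, B, pi)
```

The forward recursion costs on the order of N²T operations, whereas the enumeration visits N^T sequences, which is why the latter is only usable as a check on very short observation sequences.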

  33. Hidden Markov Model (HMM) 5
      • Solution to problem 3 (Baum-Welch algorithm)
         • The third problem of HMMs is to determine a method to adjust the model parameters $(A, B, \pi)$ to maximize the probability of the observation sequence given the model.
         • Choose $\lambda = (A, B, \pi)$ such that $P(O \mid \lambda)$ is locally maximized, using an iterative procedure such as the Baum-Welch method (or equivalently the EM (expectation-maximization) method) or using gradient techniques.
         • Reestimation (iterative update and improvement): define $\xi_t(i, j)$, the probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given the model and the observation sequence, i.e., $\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$.
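
For reference, the reestimation quantities can be written in terms of the forward variable. The formulas below are the standard Baum-Welch expressions rather than something stated on the slide; the backward variable $\beta_t(i)$ and the state-occupancy probability $\gamma_t(i)$ are additional notation assumed here.

```latex
% Backward variable (assumed notation, not defined on the slides)
\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda)

% Joint and single-state posteriors
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},
\qquad
\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}

% Reestimated parameters
\bar{\pi}_i = \gamma_1(i), \qquad
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\bar{b}_j(k) = \frac{\sum_{t=1,\; O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}
```

For $t < T$, $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$, so the update for $a_{ij}$ is the expected number of transitions from $S_i$ to $S_j$ divided by the expected number of transitions out of $S_i$.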
