
Architectures and Systems for Parallel and Distributed Computing

Explore various parallel architectures and systems used in parallel and distributed computing, such as ARM11, Intel Pentium 4, Intel Skylake, Tianhe-2, and Google Cloud. Understand the limitations of sequential systems and the need for parallel processing. Learn about the challenges and advancements in reducing power consumption and increasing processing speed. Study concepts like pipeline processing, instruction-level parallelism, superscalar architectures, and Amdahl's law.



  1. Big Data Technologies Lecture 2: Architectures and systems for parallel and distributed computing Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro

  2. What is a parallel architecture? “A collection of processing elements that cooperate to solve large problems fast” – Almasi and Gottlieb, 1989 • ARM11 (2002-2005) • 90% of embedded systems are based on ARM processors • Raspberry Pi • 8-stage pipeline • 8 in-flight instructions (out-of-order) • Intel Pentium 4 (2000-2008) • 124 in-flight instructions • 31-stage pipeline • Superscalar: instruction-level parallelism inside the processor • Intel Skylake (August 2015) • Quad core • 2 threads per core • GPU • Tianhe-2 (2013) • 32,000 Intel Xeon CPUs with 12 cores • 48,000 Intel Xeon Phi coprocessors with 57 cores • 34 petaflops • Sunway TaihuLight • First place in the world in November 2017 • 40,960 SW26010 CPUs with 256 cores • 125 petaflops • Google Cloud • Cluster farms • 10 Tbps bandwidth (US-Japan)

  3. Motivation • The performance of sequential systems is limited • Computation and data transfer go through logic gates and memory devices • Latency ≠ 0 • Even assuming an ideal environment, we are still limited by the speed of light • Many applications require performance • Nuclear reaction modelling • Climate modelling • Big Data • Google: 40,000 searches/sec • Facebook: 1 billion users every day • Technological breakthrough • What do we do with so many transistors?

  4. Examples • Titan supercomputer • 299,000 AMD x86 cores • 18,688 NVIDIA GPUs

  5. Transistors vs. speed • More transistors per chip equals more speed • Transistors are electronic switches used to build logic gates • The speed of each operation is given by the time a transistor needs to switch off without causing errors • A smaller transistor switches on/off faster • Example: • 3 GHz = 3 billion ops/sec • Increase density to increase speed • Intel P4: 170 million transistors • Intel 15-core Xeon Ivy Bridge: 4.3 billion • On average, processing power has increased by 60% per year

  6. Transistors vs. speed • Moore's law • Ideally: the number of transistors doubles each year (2x) • In reality: 5x every 5 years (≈1.35x per year; arithmetic check below)
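A quick arithmetic check of the two growth rates quoted above (not from the slides):

```latex
1.35^{5} \approx 4.4 \approx 5
```

so a sustained ≈1.35x per year compounds to roughly 5x every 5 years.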

  7. Transistors vs. speed • Why can't we increase speed forever? • Size of the chip = constant • But the density of transistors increases • 2018: 10 nm Intel Cannon Lake • Dennard scaling • The power density (and voltage) needed to operate the transistors stays constant even as the number of transistors per chip increases • But this is no longer true (as of 2006)! • Transistors are becoming so small that their integrity breaks down and they leak current • The faster we switch a transistor on and off, the more heat it generates • Reaching 8.5-9 GHz requires liquid nitrogen cooling!

  8. How do we reduce the power consumption? • Frequency ∝ voltage • Gate energy ∝ voltage² • Executing the same number of cycles at a lower voltage and speed → power savings • Example • Task with a deadline of 100 ms • Method #1: 50 ms at full speed, then 50 ms idle • Method #2: 100 ms at frequency/2 and voltage/2 • Energy requirement: gate energy scales with voltage², so method #2 is 4x cheaper (see the derivation below)
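A compact way to write the reasoning above, using the standard dynamic-power approximation (the symbols α and C, for activity factor and switched capacitance, are assumptions and do not appear on the slides):

```latex
P_{\text{dyn}} \approx \alpha C V^{2} f,
\qquad
E_{\text{task}} \propto V^{2} \cdot (\text{number of cycles}),
\qquad
\frac{E_{\#2}}{E_{\#1}} = \frac{(V/2)^{2}}{V^{2}} = \frac{1}{4}
```

Both methods execute the same number of cycles, so halving the voltage (and frequency) cuts the energy per task to roughly a quarter, which is the 4x saving claimed on the slide.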

  9. Sequential processing • 6 hours to wash 4 loads of laundry

  10. Pipelined processing • Pipelining = start processing the next task IMMEDIATELY • Improves system throughput • Multiple tasks operate in parallel • Reduces time to 3.5 hours (see the sketch below)
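A minimal sketch of the laundry arithmetic, assuming the classic 30/40/20-minute wash/dry/fold stage times (an assumption, but it reproduces the 6 h and 3.5 h figures on the slides):

```python
# Pipelining illustration for the laundry analogy.
STAGES = {"wash": 30, "dry": 40, "fold": 20}   # minutes per stage (assumed)
LOADS = 4

# Sequential: each load goes through all stages before the next one starts.
sequential = LOADS * sum(STAGES.values())       # 360 min = 6.0 h

# Pipelined: once the pipe is full, a new load finishes every "slowest stage"
# interval (the dryer is the bottleneck here).
slowest = max(STAGES.values())
pipelined = sum(STAGES.values()) + (LOADS - 1) * slowest   # 210 min = 3.5 h

print(f"sequential: {sequential / 60:.1f} h, pipelined: {pipelined / 60:.1f} h")
```

Note that pipelining improves throughput (loads finished per hour) without making any single load finish faster.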

  11. Instruction level parallelism (ILP)

  12. CPU pipeline

  13. Conditional pipeline (branching) • What happens when we have dependencies between instructions? • Especially for if-else branching • The processor must flush instructions already fetched into the pipeline when a branch resolves differently than assumed, i.e., it must fetch the instructions corresponding to the taken path (a cost approximation follows below) • AMD Ryzen (2017) uses neural networks to predict the execution path • [Pipeline diagram: fetch, decode, fetch operands, generate address, execute, store; i+2 is a branch instruction, so execution continues with either j or i+3] • https://courses.cs.washington.edu/courses/csep548/00sp/handouts/lilja.pdf
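One common textbook-style way to approximate the cost of branches in a pipeline (not taken from the slides; the example numbers below are purely illustrative):

```latex
\text{CPI}_{\text{eff}} \approx \text{CPI}_{\text{base}}
 + f_{\text{branch}} \cdot r_{\text{mispredict}} \cdot \text{penalty}_{\text{flush}}
```

For instance, with 20% branches, a 5% misprediction rate and a 15-cycle flush penalty, mispredictions add about 0.2 × 0.05 × 15 = 0.15 cycles to every instruction, which is why aggressive predictors such as Ryzen's are worthwhile.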

  14. Superscalar architectures • Execute more than one instruction per CPU cycle • Send multiple instructions to redundant functional units • Hyper-Threading (Intel) • Example • Intel P4 • One instruction stream (thread) processes integers (integer ALU) while another processes floating-point numbers (floating-point ALU) • The OS thinks it deals with 2 processors • Accomplished by combining a series of shared, replicated or partitioned resources: • Registers • Arithmetic units • Cache memory

  15. Amdahl's law • How much can we improve by parallelizing? • Example • Floating-point instructions • Theoretical speedup of the improved part: 2x • Percentage of the total #instructions: 10% • Overall speedup ≈ 1.053 (general formula below)
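The general statement behind the 1.053 figure (standard form of Amdahl's law, with f the fraction of work that benefits and s its speedup):

```latex
S_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
\qquad\Rightarrow\qquad
S_{\text{overall}} = \frac{1}{(1 - 0.10) + \dfrac{0.10}{2}} = \frac{1}{0.95} \approx 1.053
```

The unimproved 90% of the instructions dominates, so even an infinite speedup of the floating-point part could never push the overall speedup beyond 1/0.9 ≈ 1.11.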

  16. End of the single core? • Frequency scaling has stopped • Dennard scaling no longer holds • Memory wall • Data and instructions must be fetched into the registers (through the caches) • Memory becomes the critical point • The ILP wall • Dependencies between instructions limit the efficiency of ILP

  17. Multi-core solution • More cores on the same CPU • Better than hyper-threading • Real parallelism • Example (checked in the sketch below) • Power ∝ frequency³ (or worse) • Reducing speed (frequency) by 30% cuts power to roughly 35% of the original • But performance is also reduced by 30% • Having 2 cores per chip at 70% speed → 140% of the original performance at about 70% of the power • A 40% increase in performance for a 30% saving in power
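A quick back-of-the-envelope check of those numbers, using the cubic power model stated on the slide (a sketch, not a measurement):

```python
# Multi-core trade-off: power modelled as frequency**3, performance as frequency.
freq_scale = 0.70                      # each core runs at 70% of the original frequency
power_per_core = freq_scale ** 3       # ~0.34x of the original single-core power
perf_per_core = freq_scale             # ~0.70x of the original single-core performance

cores = 2
total_power = cores * power_per_core   # ~0.69x -> about 70% of the original power
total_perf = cores * perf_per_core     # 1.40x  -> 140% of the original performance

print(f"power: {total_power:.2f}x  performance: {total_perf:.2f}x")
```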

  18. IBM POWER4 • Introduced in 2001

  19. Multicore • Intel i7 • 6 cores • 12 threads • Parallelism through hyperthreading

  20. Different architectures • Intel Nehalem (Atom, i3, i5, i7) • Cores are linked through the QuickPath Interconnect (QPI) • Mesh architecture (all-to-all) • 25.6 GB/s • Older Intel designs (Penryn) • Cores are linked through the FSB (Front Side Bus) • Shared-bus architecture • 3.2 GB/s (P4) – 12.8 GB/s (Core 2 Extreme)

  21. AMD Infinity Fabric • HyperTransport protocol • 30-50 GB/s • 512 GB/s for the Vega GPU • Mesh network • Network on a chip, clustering • Link between GPUs and the SoC • CCIX standard: accelerators, FPGAs

  22. Many-core • Systems with tens or hundreds of cores • Developed for parallel computing • High throughput and low energy consumption (at the expense of latency) • Sidesteps problems such as cache coherence that affect multi-core systems (few cores) • They use: message passing, DMA, PGAS (Partitioned Global Address Space) • Not efficient for applications using just one thread • Example • Xeon Phi with 59-72 cores • GPUs: Tesla K80 with 4992 CUDA cores

  23. CPU vs. GPU architecture • GPU has more transistors for computations

  24. Intel Xeon Phi architecture

  25. Parallel processing models – Flynn's classification • SISD (Single Instruction, Single Data) • Uniprocessor • MISD (Multiple Instruction, Single Data) • Multiple processors operate on a single data stream • Systolic processor • Stream processor • SIMD (Single Instruction, Multiple Data) • The same instruction is executed on multiple processors • Each processor has its own memory (different data) • Shared memory for control and instructions • Good for data-level parallelism (see the NumPy sketch below) • Vector processor • GPUs (partially) • MIMD (Multiple Instruction, Multiple Data) • Each processor has its own data and instructions • Multiprocessor • Multithreaded processor
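A tiny sketch of the SIMD idea using NumPy (assuming NumPy is installed; illustrative only): a single logical operation is applied to many data elements at once, and NumPy's vectorised loops can map onto the CPU's vector units.

```python
import numpy as np

# SIMD flavour: one (vectorised) operation over a million elements.
a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)
c = a + b                                   # data-level parallelism

# SISD flavour for contrast: one element per operation, in an explicit loop.
c_scalar = [x + y for x, y in zip(a[:10], b[:10])]
print(c[:10], c_scalar)
```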

  26. Centralized shared memory multiprocessors • Symmetric MultiProcessor (SMP) • Processors are interconnected through a UMA (Uniform Memory Access) backbone • Does not scale well

  27. Distributed shared memory multiprocessors • SMP clusters • NUMA (Non-Uniform Memory Access) • Physical memory is local to each processor (but the address space is shared) • Avoids starvation: a given memory is accessed by one processor at a time • AMD processors implement the model through HyperTransport (2003), and Intel through QPI (2007)

  28. Distributed shared memory multiprocessors • Enhanced scalability • Low latency for local accesses • Scalable memory bandwidth at low cost • But • Higher inter-processor communication times • The network technology is vital! • More complex software model

  29. Message passing multiprocessors • Multicomputers • Communication is based on message passing between processors, not on shared memory access • Can call remote methods: Remote Procedure Call (RPC) • Libraries: MPI (Message Passing Interface) (see the sketch below) • Synchronous communication • Causes process synchronization • The address space is private to each processor • Example • Massively Parallel Processing (MPP) • IBM Blue Gene • Clusters • Created by linking computers in a LAN
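A minimal message-passing sketch using mpi4py (an assumption: mpi4py and an MPI runtime are installed; MPI itself is covered in the next lecture, this only previews the call pattern): rank 0 sends a Python object and rank 1 blocks until it receives it, which is the "natural synchronization" mentioned on the next slide.

```python
# Run with, e.g.: mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=0)       # explicit, blocking send
elif rank == 1:
    data = comm.recv(source=0, tag=0)    # blocking receive: implicit synchronization
    print("rank 1 received:", data)
```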

  30. Shared memory vs. message passing (contrasted in the sketch below) • Shared memory • Easy programming • Hides the network but does not hide its latency • Hardware-controlled communication • To reduce communication overhead • Message passing • Explicit communication • Can be optimized • Natural synchronization • When sending messages • Programming is harder, as it must consider aspects hidden by shared memory systems • Transport cost
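For contrast with the message-passing sketch above, the same exchange in a shared-memory style using Python threads (illustrative only): the data is simply written to memory both threads can see, but the synchronization that message passing gives for free must now be added explicitly.

```python
import threading

shared = {}                    # memory visible to both threads
ready = threading.Event()      # explicit synchronization primitive

def producer():
    shared["payload"] = [1, 2, 3]   # plain write to shared memory
    ready.set()                     # signal that the data is available

def consumer():
    ready.wait()                    # block until the producer has written
    print("consumer read:", shared["payload"])

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```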

  31. Distributed systems • “A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of client-server organization.” – FOLDOC • “A Distributed System comprises several single components on different computers, which normally do not operate using shared memory and as a consequence communicate via the exchange of messages. The various components involved cooperate to achieve a common objective such as the performing of a business process.” – Schill & Springer

  32. Parallel vs. distributed systems • Parallelism • Executing multiple tasks at the same time • True parallelism requires having as many cores as parallel tasks • Concurrent execution • Thread-based computing • Can use hardware parallelism but usually derives from software requirements • Example: the effects of multiple system calls • Becomes parallelism when true (hardware) parallelism is available • Distributed computing • Refers to where the computation is performed • Computers are linked in a network • Memory is distributed • Distribution is usually part of the objective • If resources are distributed then we have a distributed system • Raises many problems from a programming point of view • No global clock, synchronization, unpredictable errors, variable latency, security, interoperability.

  33. Distributed system models • Minicomputer • Workstation • Workstation-server • Processor pool • Cluster • Grid • Cloud

  34. Minicomputer • Extension of time-sharing systems • The user logs on to a machine • Authenticates remotely through telnet • Shared resources • Databases • HPC • [Diagram: minicomputers interconnected through the ARPAnet]

  35. Workstation • Process migration • The user authenticates on a machine • If networked resources are available elsewhere, the process migrates there • Issues • How do you identify available resources? • How do you migrate a process? • What happens if another user logs on to the machine hosting the migrated process? • [Diagram: workstations interconnected through a 100 Gbps LAN]

  36. Workstation – server • Client stations • Diskless (no local storage) • Interactive/graphical processes are executed locally • All files and heavy computations are sent to the server • Servers • Each machine is dedicated to a certain type of job • Communication model (see the RPC sketch below) • RPC (Remote Procedure Call) • C • RMI (Remote Method Invocation) • Java • A client process invokes a server process • There is no process migration between machines • [Diagram: workstations on a 100 Gbps LAN with minicomputers acting as file server, HTTP server, and cycle server]
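To make the RPC bullet concrete, a minimal sketch with Python's built-in xmlrpc module (the slide pairs RPC with C and RMI with Java; this Python version only shows the call pattern, and the host/port are placeholder values): the client calls what looks like a local function, and the call executes on the server.

```python
# server.py -- exposes one procedure over RPC
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(add, "add")
server.serve_forever()
```

```python
# client.py -- invokes the remote procedure as if it were local
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))   # executed on the server, result returned to the client
```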

  37. Processor pool • Client • The user authenticates on a remote machine • All services are sent to the servers • Server • Allocates the required number of processors to each client • Better utilization but less interactivity • [Diagram: terminals connected to servers 1..N through a 100 Gbps LAN]

  38. Cluster • Client • Client-server model • Server • Consists of many machines interconnected through a high-speed network • The aim is performance • Parallel processing • [Diagram: workstations reaching HTTP servers 1..N over a 100 Gbps LAN; behind them a master node and slave nodes 1..N linked by a 1 Gbps SAN]

  39. Grid • Aim • Collect processing power from many clusters or parallel systems and make it available to users • Similar in concept to a power grid • You buy only what you use • HPC distributed computing • Large problems requiring many resources • On demand • Remote resources are integrated with local ones • Big Data • Data is distributed • Shared computing • Communication between resources • [Diagram: supercomputers, clusters, minicomputers and workstations connected through a high-speed information highway]

  40. Cloud • Distributed systems where access to resources is virtualized and on demand, while the topology stays hidden • Pay per use (per second, GB, query, etc.) • Access levels • Infrastructure (IaaS) • Platform (PaaS) • Services (SaaS) • Data (DaaS) • Amazon EC2 • Google Compute Engine • Microsoft Azure • [Diagram: workstations accessing VMs, databases, and specific services over the Internet]

  41. Summary • Shared memory • Homogeneous resources • Nanosecond access • Message passing • Homogeneous/heterogeneous resources • Microsecond access • Distributed systems • Heterogeneous resources • Millisecond access

  42. Sources • http://slideplayer.com/slide/5704113/ • https://www.comsol.com/blogs/havent-cpu-clock-speeds-increased-last-years/ • http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture01-intro.pdf • https://www.slideshare.net/DilumBandara/11-performance-enhancements • http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture15.pdf • http://www.ni.com/white-paper/11266/en/ • https://www.cs.virginia.edu/~skadron/cs433_s09_processors/arm11.pdf • http://www.csie.nuk.edu.tw/~wuch/course/eef011/4p/eef011-6.pdf

  43. Next lecture • Parallelizing algorithms • APIs and platforms for parallel and distributed computing • OpenMP • MPI • Unified Parallel C • CUDA • Hadoop
