
The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer


Presentation Transcript


  1. The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer
Uzi Vishkin, www.umiacs.umd.edu/users/vishkin/XMT
Students: just remember to take ENEE459P: Parallel Algorithms, Fall '10.
- What is a parallel algorithm?
- Why should I care?

  2. Taste of a Parallel Algorithm
Example: Exchange Problem. 2 bins, A and B. Exchange the contents of A and B. Ex.: A=2, B=5 → A=5, B=2. Algorithm (serial or parallel): X:=A; A:=B; B:=X. 3 ops, 3 steps, space 1.
Array Exchange Problem: 2n bins, A[1..n] and B[1..n]. Exchange A(i) and B(i) for i=1..n.
Serial alg: For i=1 to n do /* serial exchange through the eye of a needle */ X:=A(i); A(i):=B(i); B(i):=X. 3n ops, 3n steps, space 1.
Parallel alg: For i=1 to n pardo /* 2-bin exchange in parallel */ X(i):=A(i); A(i):=B(i); B(i):=X(i). 3n ops, 3 steps, space n.
Discussion: Parallelism tends to require some extra space, and the parallel alg is clearly faster than the serial alg. Which is "simpler" and "more natural": serial or parallel? In a small sample of people: serial, but only among those who majored in CS. "Eye of a needle" is a metaphor for the von Neumann mental and operational bottleneck; it reflects an extreme scarcity of hardware that is less acute now.
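To make the contrast concrete, here is a minimal sketch of both versions in XMTC, the C extension introduced on slide 17 (hedged: the syntax follows the XMTC tutorial, where spawn(low, high) launches one virtual thread per index and $ is the thread's ID; A, B, and n are assumed declared elsewhere).

    /* Serial array exchange: 3n ops, 3n steps, space 1. */
    for (int i = 0; i < n; i++) {
        int x = A[i];        /* eye of the needle: one temporary, reused n times */
        A[i] = B[i];
        B[i] = x;
    }

    /* Parallel array exchange: 3n ops, 3 steps, space n
       (one thread-local temporary per spawned thread). */
    spawn(0, n - 1) {
        int x = A[$];        /* $ is this thread's index i */
        A[$] = B[$];
        B[$] = x;
    }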

  3. Commodity computer systems
Intel Platform 2015 (March '05): Chapter 1, 1946→2003: serial, 5KHz→4GHz. Chapter 2, 2004--: parallel, #"cores" ~ d^(y-2003). Apple 2004: 1 core; 2013: >100 cores. Windows 7 scales to 256 cores... how to use the other 255? Did I mention ENEE459P?
BIG NEWS: Clock frequency growth is flat. If you want your program to run significantly faster, you're going to have to parallelize it. Parallelism: the only game in town. #Transistors/chip, 1980→2011: 29K→30B! Programmer's IQ? Flat...
40 years of parallel computing, and the world is yet to see a successful general-purpose parallel computer: easy to program & good speedups.

  4. Historic SPECint 2000 Performance
[Chart: SPECint 2000 performance by year; is performance at a plateau? Source: published SPECint data.]
Students: make yourselves ready for the job market. Serial computing: <1% of computing power. Will serial computing be taught for ... history majors?

  5. Welcome to the 2010 Impasse
All vendors are committed to multi-cores. Yet their architecture, and how to program them for single-program completion time, is not clear. The software spiral (HW improvements → SW improvements → HW improvements), the growth engine for IT (A. Grove, Intel), is, alas, now broken: SW vendors avoid investment in long-term SW development since they may bet on the wrong horse. The impasse is bad for business.
Parallel programming education: does a CS&E degree mean being trained for a 50-year career dominated by parallelism by programming yesterday's serial computers? ENEE459P teaches: (i) the common denominator, and (ii) the main approaches.

  6. Serial Abstraction & A Parallel Counterpart
[Figure: serial execution based on the serial abstraction, one op per time step, so Time = Work, vs. parallel execution based on the parallel abstraction, asking "what could I do in parallel at each step, assuming unlimited hardware?", so Time << Work; Work = total #ops.]
• Rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately. It abstracts away the different execution times of different operations (e.g., the memory hierarchy), is used by programmers to conceptualize serial computing, and is supported by hardware and compilers. The program provides, inductively, the instruction to be executed next.
• Rudimentary abstraction for making parallel computing simple: indefinitely many instructions that are available for concurrent execution execute immediately, dubbed Immediate Concurrent Execution (ICE). Step-by-step (inductive) explication of the instructions available next for concurrent execution. The number of processors is not even mentioned. Falls back on the serial abstraction when there is 1 instruction per step.
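To make Time << Work concrete, a hedged XMTC-style sketch (illustrative names, not from the slides): summing n numbers by a balanced binary tree costs n-1 additions (work) but only log2(n) rounds (time). It assumes n is a power of two and an array tree of size 2n whose entries tree[n..2n-1] hold the inputs.

    /* Balanced-tree summation: Work = n-1 additions, Time = log2(n) rounds.
       Internal node i receives the sum of its children 2i and 2i+1. */
    for (int width = n / 2; width >= 1; width /= 2) {
        spawn(width, 2 * width - 1) {   /* all additions of one tree level, concurrently */
            tree[$] = tree[2 * $] + tree[2 * $ + 1];
        }
    }
    /* tree[1] now holds the total of tree[n .. 2n-1]. */

Each round reads one tree level and writes the level above it, so the concurrent operations never collide.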

  7. Explicit Multi-Threading (XMT)
1979-: THEORY. Figure out how to think algorithmically in parallel. Outcome in a nutshell: the above abstraction.
1997-: XMT@UMD. Derive specs for the architecture; design and build.
UV, Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism, http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf, to appear in CACM.

  8. Not just talking
PRAM-On-Chip HW prototypes:
• 64-core, 75MHz FPGA of the XMT (Explicit Multi-Threaded) architecture [SPAA'98..CF'08].
• 128-core interconnection network, IBM 90nm: 9mm×5mm, 400 MHz [HotI'07].
• FPGA design → ASIC, IBM 90nm: 10mm×10mm, 150 MHz.
Algorithms: the PRAM parallel algorithmic theory. "Natural selection." A latent, though not widespread, knowledge base. "Work-depth", SV82, conjectured: the rest (the full PRAM algorithm) is just a matter of skill. Lots of evidence that "work-depth" works; it is used as the framework in the main PRAM algorithms texts: JaJa92, KKT01.
Programming & workflow: rudimentary yet stable compiler. The architecture scales to 1000+ cores on-chip.

  9. Participants
Grad students: Aydin Balkan (PhD), George Caragea, James Edwards, David Ellison, Mike Horak (MS), Fuat Keceli, Beliz Saybasili, Alex Tzannes, Xingzhi Wen (PhD).
Industry design experts (pro bono).
• Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and power. Co-advisor.
• Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, Purdue U., K-12 education. Co-advisor. 2008 NSF seed funding. K-12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School; 2009 Summer Camp, Montgomery County Public Schools.
• Marc Olano, UMBC, computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, power. Co-advisor.
• Marty Peckerar, microelectronics.
• Igor Smolyaninov, electro-optics.
Funding: NSF, NSA (2008 deployed XMT computer), NIH. Industry partner: Intel.
Started from core CS. Built the HW + compiler foundation. Ready for ~10 timely CS PhD theses, ~2 in Education, and ~10 in ECE.

  10. More on ENEE459P, fall 2010
• Parallel algorithmic thinking (PAT) based on first principles. More challenging to self-study.
• Mainstream computing → parallelism: chaotic. Hence pluralism is valuable.
• ENEE459P: jointly taught by 2 instructors via video conferencing with U. Illinois. CS@Illinois: top 5. Parallel@Illinois: #1.
• A joint course on a timely topic: an extremely rare opportunity. More than "2 for the price of one": 2 courses, each with 1 instructor, would lack the interaction.
• Advanced by Google, Intel and Microsoft, the introduction of parallelism into the curriculum dominated the recent flagship Computer Science Education Conference. Several speakers, including a keynote by the Director of Education at Intel, reported that: (1) in job interviews, employers now expect an intelligent discussion of parallelism; and (2) international competition recognizes that 85% of the people trained in parallel programming are outside the U.S.

  11. Membership in the Intel Academic Community
Implementing parallel computing in the CS curriculum. 85% outside the USA. Source: M. Wrinn, Intel.

  12. The Pain of Parallel Programming
• Parallel programming is currently too difficult. To many users, programming existing parallel computers is "as intimidating and time consuming as programming in assembly language" [NSF Blue-Ribbon Panel on Cyberinfrastructure]. AMD/Intel: "Need a PhD in CS to program today's multicores."
• The real problem: parallel architectures were built using the following "methodology": build first, figure out how to program later. [J. Hennessy: "Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use."]

  13. 2nd Example of a PRAM-like Algorithm
Input: (i) all world airports; (ii) for each, all its non-stop flights. Find: the smallest number of flights from DCA to every other airport.
Basic (actually parallel) algorithm. Step i: for all airports requiring i-1 flights, for all their outgoing flights, mark (concurrently!) all "yet unvisited" airports as requiring i flights. (Note the nesting.)
Serial: forces an eye-of-a-needle queue; one needs to prove it still computes the same result as the parallel version. O(T) time; T = total # of flights.
Parallel: parallel data structures. Inherent serialization: S. Gain relative to serial: (first cut) ~T/S! Decisive also relative to coarse-grained parallelism.
Notes: (i) "concurrently", as in natural BFS, is the only change to the serial algorithm; (ii) no "decomposition"/"partitioning": speed-up wrt a same-silicon-area GPU on a highly parallel input: 5.4X! (iii) But a SMALL CONFIGURATION on a 20-way parallel input: 109X wrt the same GPU.
Mental effort of PRAM-like programming: (1) sometimes easier than serial; (2) considerably easier than for any parallel computer currently sold. Understanding it falls within the common denominator of other approaches.
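One step of this algorithm can be sketched in XMTC as follows (hedged: the names frontier, offset, edges, level, gatekeeper, next, frontierSize and the step counter i are illustrative, not from the slides; the graph is assumed stored as adjacency lists, level[w] == -1 means "yet unvisited", gatekeeper[] is assumed zero-initialized, and psm, XMTC's prefix-sum to memory, serves as the gatekeeper so that exactly one concurrent discoverer of an airport wins).

    /* Step i: expand every airport reachable with i-1 flights
       (frontier[0 .. frontierSize-1]) to build the next frontier. */
    psBaseReg newSize;
    newSize = 0;                             /* slots handed out in next[] */
    spawn(0, frontierSize - 1) {             /* one thread per frontier airport */
        int v = frontier[$];
        for (int e = offset[v]; e < offset[v + 1]; e++) {   /* its outgoing flights */
            int w = edges[e];
            if (level[w] == -1) {            /* "yet unvisited" */
                int g = 1;
                psm(g, gatekeeper[w]);       /* atomic: only the first visitor sees g == 0 */
                if (g == 0) {
                    level[w] = i;            /* w requires i flights */
                    int slot = 1;
                    ps(slot, newSize);       /* unique slot in the next frontier */
                    next[slot] = w;
                }
            }
        }
    }
    /* After the implicit join: swap frontier and next, set frontierSize from newSize, i++. */

Note that "concurrently" is indeed the only change from the serial algorithm; no decomposition or partitioning is specified.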

  14. Back to the education crisis
The CTO of NVidia and the official leader of multi-cores at Intel: teach parallelism as early as you can. Reason: we don't only underteach, we misteach, since students acquire bad habits. The current situation is unacceptable; a sort of malpractice. Some possibilities:
• Teach it as a major elective.
• Teach all CS&E undergrads.
• Teach CS&E freshmen and invite all Engineering, Math, and Science students; this sends the message "CS&E is where the action is".

  15. Need
A general-purpose parallel computer framework ["a successor to the Pentium for the multi-core era"] that:
(i) is easy to program;
(ii) gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code;
(iii) supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and
(iv) fits current chip technology and scales with it (in particular: strong speed-ups for single-task completion time).
Main point of the talk: PRAM-On-Chip@UMD is addressing (i)-(iv).

  16. The PRAM Rollercoaster Ride
Late 1970s: theory work began. UP: won the battle of ideas on parallel algorithmic thinking. No silver or bronze! The model of choice in all theory/algorithms communities. 1988-90: big chapters in standard algorithms textbooks.
DOWN: FCRC'93: "PRAM is not feasible." ['93+: despair; no good alternative! Where did vendors expect good-enough alternatives to come from in 2008?]
A device changed it all (the # of on-chip transistors): UP. Highlights: the eXplicit multi-threaded (XMT) FPGA-prototype computer (not a simulator), SPAA'07, CF'08; 90nm ASIC tape-outs: interconnection network, HotI'07, XMT.
How come? A crash "course" on parallel computing: how much processors-to-memories bandwidth? Enough: ideal programming model (PRAM). Limited: programming difficulties.

  17. How does it work
"Work-depth" algorithms methodology (source: SV82): state all the ops you can do in parallel; repeat. Minimize the total #operations and the #rounds. The rest is skill.
• Program: single-program multiple-data (SPMD). Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn + Join, Prefix-Sum (see the sketch below). Unique: first parallelism, then decomposition.
• Programming methodology: algorithms → effective programs. Extend the SV82 work-depth framework from PRAM to XMTC. Or: established APIs (VHDL/Verilog, OpenGL, MATLAB), a "win-win proposition."
• Compiler: minimize the length of the sequence of round-trips to memory; take advantage of architecture enhancements (e.g., prefetch). [Ideally, given an XMTC program, the compiler provides the decomposition: "teach the compiler."]
• Architecture: dynamically load-balance concurrent threads over processors. "The OS of the language." (Prefix-sum to registers and to memory.)
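As promised above, a minimal sketch of the Prefix-Sum command and IOS at work: the canonical XMTC array-compaction example (hedged: per the XMTC tutorial, ps(inc, base) atomically returns the old value of base into inc and increments base by inc's old value; A, B, and n are assumed declared elsewhere).

    /* Compact the nonzero elements of A[0..n-1] into B. Threads may interleave
       in any order (IOS); ps() hands each writer a unique slot in B, so the
       result is correct regardless of the schedule. */
    psBaseReg base;
    base = 0;
    spawn(0, n - 1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);       /* inc <- old base; base <- base + 1 */
            B[inc] = A[$];
        }
    }
    /* After the implicit join, base == number of nonzeros copied. */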

  18. Performance Programming & Its Productivity
[Workflow diagram with routes numbered 1-4: basic algorithm (sometimes informal); add data structures (for the serial algorithm) → serial program (C) → standard computer; add parallel data structures (for the PRAM-like algorithm) → parallel program (XMT-C), low overheads! → XMT computer (or simulator); serial program → decomposition → assignment → orchestration → mapping → parallel computer (parallel programming, Culler-Singh).]
• 4 easier than 2. • Problems with 3. • 4 competitive with 1: cost-effectiveness; natural.

  19. Application Programming & Its Productivity
[Workflow diagram: application programmer's interfaces (APIs: OpenGL, VHDL/Verilog, MATLAB) → compiler → serial program (C) → standard computer, or → parallel program (XMT-C) → XMT architecture (simulator). Automatic? Yes / maybe / yes. The Culler-Singh route (decomposition → assignment → orchestration → mapping → parallel computer) is shown for comparison.]

  20. Naming Contest for the New Computer
• "Paraleap," chosen out of ~6000 submissions.
A single (hard-working) person (X. Wen) completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, with no prior design experience. This attests to the basic simplicity of the XMT architecture → faster time to market, lower implementation cost.

  21. Experience with High School Students, Fall '07
Gave a 1-day parallel algorithms tutorial to 12 HS students. Some (two 10th graders) managed 8 programming assignments, including 5 of the 6 in the grad course. Only help: 1 office hour/week with an undergrad TA. No school credit; part of a computer club after an 8-period day.
May-June '08: 23 HS students, led by a self-taught HS teacher, Alexandria, VA.
Spring '08: course for non-major freshmen (UMD Honors): how will programmers have to think by the time you graduate?
Spring '08: course for seniors.

  22. NEW: Software release
Allows you to use your own computer for programming in an XMT environment and experimenting with it, including:
• A cycle-accurate simulator of the XMT machine.
• A compiler from XMTC to that machine.
Also provided: extensive material for teaching or self-studying parallelism, including:
• Tutorial + manual for XMTC (150 pages).
• Class notes on parallel algorithms (100 pages).
• Video recording of the 9/15/07 HS tutorial (300 minutes).
Next major objective: an industry-grade chip and a production-quality compiler. Requires 10X in funding.
