
Structured Codesign for Manycore Systems

Structured Codesign for Manycore Systems. Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich. SOFSEM, Nový Smokovec, January 2011.


Presentation Transcript


  1. Structured Codesign for Manycore Systems. Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich. SOFSEM, Nový Smokovec, January 2011

  2. About Me
     • 1968 System programming at Swissair
     • 1977 PhD in Mathematics
     • 1981 Joined Niklaus Wirth's Lilith/Modula team
     • 1985 Sabbatical stay at Xerox PARC
     • 1986 Project Oberon together with Wirth
     • 2000 Academic languages researcher at MSR

  3. Outline of Talk
     • Context & Vision
     • A Structured Approach
     • Use Cases
     • Programming Language & Compiler
     • Power Management Codesign
     • Hardware Library

  4. Context & Vision: some context of the project and a vision

  5. Microsoft Innovation Cluster
     • Launched in 2008 by Microsoft (Research)
     • Volume: 5 years, $5 million
     • Theme: embedded systems software
     • Participants
       • ETH Zürich (3 projects)
       • EPFL Lausanne (4 projects)
     • Goals
       • Research in embedded systems
       • Technology transfer
       • Education
     "Supercomputer in the Pocket" is one of these projects.

  6. Supercomputer in the Pocket
     • Manycore architecture for embedded systems on the basis of programmable hardware (FPGA)
     • High-performance computing in the small
     • Generic technology for a wide range of apps
       • Sensor-driven medical IT
       • Data streaming in financial apps
       • Running robot with limb control
       • Real-time audio processing
     • Hardware/software design from the ground up (the focus of this talk)

  7. People Involved
     • Microsoft Research
       • Chuck Thacker (consultant)
     • ETH Zürich
       • Niklaus Wirth (processor design)
       • Jürg Gutknecht (project leader)
       • Lisa (Ling) Liu (hardware design)
       • Felix Friedrich (compiler)
     • University Hospital Basel
       • Alexej Morozow (medical IT app)

  8. The Vision
     • Custom hardware design for embedded systems
     • Programmers need no hardware knowledge
     • System design process at a high level of abstraction
     • Fully automated mapping process to FPGA
     • FPGA resources are used efficiently

  9. Semantic Gap
     • Program constructs: object, thread, data structure, statement, communication, I/O, ...
     • FPGA resources: lookup tables (LUT), block RAMs (BRAM), DSP slices, …
     • The program constructs have to be mapped onto the FPGA resources

  10. A Structured Approach: big picture of our structured codesign approach

  11. Options for How to Achieve It
      • Hardware compilation: custom mapping of a specific algorithm (or its hot spots) to hardware circuits.
      • Uniprocessor: a single universal processor plus on-chip cache memory, transparently connected to external memory.
      • SMP: several universal processors, each with on-chip cache memory and each transparently connected to external memory. A cache coherence mechanism is needed.
      • Preconfigured: several universal processors, each with private on-chip memory, interconnected via an on-chip network. One processor is connected to external memory.

  12. A Better Approach
      • Hardware/software codesign based on a suitable high-level computing model and programming language
      • Fully automated mapping/synthesizing to FPGA hardware based on a suitable library of highly configurable hardware components

  13. Our Computing Model
      • Active Cell (Actor)
        • Object with private state space
        • Behavior control thread
        • Communicating with other actors via channels
      • Actor Graph
        • Collection of interoperating actors running in parallel
        • Some actors connected to I/O via serial port
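
The computing model above can be imitated in ordinary software. The following is a minimal sketch, assuming Python threads stand in for active cells and `queue.Queue` objects stand in for channels; the `doubler` actor, the `None` end-of-stream marker, and `run_graph` are hypothetical names chosen for illustration and are not part of the ActiveCells system.

```python
import queue
import threading

def doubler(inbox, outbox):
    """Actor body: private state plus a single control thread."""
    count = 0                      # private state, never shared with other actors
    while True:
        item = inbox.get()         # blocking receive on the input channel
        if item is None:           # hypothetical end-of-stream marker
            outbox.put(None)
            return
        count += 1
        outbox.put(2 * item)       # send on the output channel

def run_graph(values):
    """Wire source -> doubler -> doubler -> sink and collect the results."""
    c1, c2, c3 = queue.Queue(), queue.Queue(), queue.Queue()
    actors = [
        threading.Thread(target=doubler, args=(c1, c2)),
        threading.Thread(target=doubler, args=(c2, c3)),
    ]
    for a in actors:
        a.start()
    for v in values:
        c1.put(v)
    c1.put(None)
    results = []
    while (item := c3.get()) is not None:
        results.append(item)
    for a in actors:
        a.join()
    return results
```

Calling `run_graph([1, 2, 3])` pushes three values through a two-actor chain. The actors share nothing except the channels that connect them.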

  14. Our Hardware Library
      • TRM processor (Tiny Register Machine)
        • Extremely simple
        • Two-level pipelined instruction execution
        • Several variants: VTRM (vectors via DSP), DTRM (DMA)
      • Communication FIFO
        • Ring buffer
        • Sizes 32, 64, 128, 1024
      • I/O controllers
        • DDR2, CF, LCD, UART
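
A ring-buffer FIFO of the kind listed above can be sketched in software as follows. This is only an illustration of the data structure, assuming a non-blocking `put`/`get` interface; the class name `RingFifo` and its methods are hypothetical and do not describe the actual hardware component.

```python
class RingFifo:
    """Fixed-capacity FIFO backed by a circular array (software sketch)."""

    def __init__(self, capacity=32):
        self.buf = [0] * capacity
        self.capacity = capacity
        self.head = 0      # index of the oldest element
        self.count = 0     # number of stored elements

    def is_empty(self):
        return self.count == 0

    def is_full(self):
        return self.count == self.capacity

    def put(self, value):
        """Store value; return False instead of blocking when full."""
        if self.is_full():
            return False
        self.buf[(self.head + self.count) % self.capacity] = value
        self.count += 1
        return True

    def get(self):
        """Remove and return the oldest value, or None when empty."""
        if self.is_empty():
            return None
        value = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return value
```

A producer that sees `put` return `False`, or a consumer that sees `is_empty()`, simply retries later. This polling style needs no interrupts.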

  15. Mapping Actor Graph → FPGA
      • Actor → TRM processor ("core"), instruction memory, data memory
      • Communication channel → FIFO buffer
      • I/O → I/O controllers connected to cores

  16. TRM/FIFO Cooperation
      • Fully orchestrated by TRM
      • No interrupts!
      [Diagram: channel FIFO → recv → TRM (with block labelled M) → send → channel FIFO]

  17. Use Cases: two data-driven applications of our system

  18. Realtime Multichannel ECG Monitor
      • Analyze the activity of the heart, the morphology of the corresponding waves, and the heart rate variability (HRV), with the aim of detecting and classifying potential anomalies
      • The signal to be analyzed decomposes into 8 physical channels, each of them sampled at 500 Hz

  19. Decomposition into Actor Graph
      [Diagram: ECG bitstream → Signal input → Wave proc_1 … Wave proc_8 (in parallel) → QRS detect → HRV analysis → Disease classifier → out stream]

  20. Actions
      • Receive ECG signal from UART, compose individual samples, and distribute them to channel processors.
      • (Per channel) Precondition wave by suppressing noise via linear filtering; detect the heart beats and contractions.
      • Detect QRS patterns and make a final decision about heart rate on the basis of standard multichannel logic.
      • Analyze the current heart rhythm and the heart rate variability (HRV).
      • Use decision tree logic to detect and classify arrhythmia events such as premature ventricular contractions (PVC), ventricular tachycardia, etc. Feed results back to configure wave processing.
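
Two of the actions above lend themselves to a small illustration: noise suppression by linear filtering, and deriving a heart rate from detected beats. The sketch below assumes a plain moving-average filter and a mean-interval heart rate. Both are stand-ins chosen for brevity, not the filters or the multichannel logic used in the project. The function names are hypothetical, and the default sample rate of 500 Hz is taken from the ECG monitor slide.

```python
def moving_average(samples, width=5):
    """Causal FIR low-pass: mean of the last `width` samples (fewer at the start)."""
    out = []
    acc = 0.0
    for n, s in enumerate(samples):
        acc += s
        if n >= width:
            acc -= samples[n - width]   # drop the sample that left the window
        out.append(acc / min(n + 1, width))
    return out

def heart_rate_bpm(beat_indices, sample_rate_hz=500):
    """Mean heart rate from the sample indices of detected beats."""
    if len(beat_indices) < 2:
        return None                      # at least two beats are needed
    intervals = [b - a for a, b in zip(beat_indices, beat_indices[1:])]
    mean_rr_seconds = sum(intervals) / len(intervals) / sample_rate_hz
    return 60.0 / mean_rr_seconds
```

Beats detected 500 samples apart at 500 Hz are one second apart, which corresponds to 60 beats per minute.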

  21. Xilinx Virtex-5 FPGA Development board

  22. Resulting FPGA configuration
      [Diagram; visible labels: ECG input, RS232, UART Ctrl, CF Ctrl, CF, LCD Ctrl, LCD; cores TRM1 to TRM4 and TRM9 to TRM12; FIFO1, FIFO8, FIFO9, FIFO16 to FIFO20, FIFO33, FIFO34]

  23. Use of Resources • ECG Monitor • Maximum number of TRMs in communication chain

  24. Preconfigured Version

  25. Comparative Power Usage
      • Preconfigured FPGA (TRM, IM/DM, I/O, interconnect)
      • Fully configurable
      86% saving!

  26. Graphics Based Motion Detection
      • Problem: detect moving objects in a series of image frames
      • Approach: parallelize the detection process by domain decomposition (into 4 parts)
      • Design: a reader process continuously reads frames from external memory and forwards them to 4 part-detection processes, which run in parallel and report detected movements
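
The domain decomposition described above can be sketched as follows. This is a minimal illustration, assuming simple frame differencing with a threshold as the detection criterion (the slides do not state which criterion is used) and a thread pool in place of the four part-detection cores. `detect_part`, `detect_motion`, and the `threshold` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_part(prev_rows, curr_rows, threshold):
    """Count pixels in one horizontal strip whose change exceeds threshold."""
    changed = 0
    for prev_row, curr_row in zip(prev_rows, curr_rows):
        for p, c in zip(prev_row, curr_row):
            if abs(c - p) > threshold:
                changed += 1
    return changed

def detect_motion(prev_frame, curr_frame, parts=4, threshold=10):
    """Split both frames into `parts` strips and process the strips in parallel."""
    height = len(curr_frame)
    bounds = [height * k // parts for k in range(parts + 1)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        futures = [
            pool.submit(detect_part,
                        prev_frame[lo:hi], curr_frame[lo:hi], threshold)
            for lo, hi in zip(bounds, bounds[1:])
        ]
        # One result per strip, in top-to-bottom order.
        return [f.result() for f in futures]
```

The caller plays the role of the reader process: it hands each strip to a worker and collects one movement report per part.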

  27. FPGA Configuration

  28. Performance Results
      • Data base
        • 10 frames of resolution 576 x 768 (432 KP)
      • Estimated performance
        • Transfer from external DDR2 memory: ca. 40 MP/sec
        • Computation: 4 x 31 MP/sec
      • Total time used per frame: 55 ms
      • Total throughput: 18 frames/sec
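
Two of the figures above can be cross-checked with simple arithmetic. The snippet assumes that "KP" means 1024 pixels. It only relates the frame size to the pixel count and the per-frame time to the throughput, and it does not attempt to derive the 55 ms itself.

```python
WIDTH, HEIGHT = 576, 768

# Frame size in pixels and in kilopixels (assuming 1 KP = 1024 pixels).
pixels_per_frame = WIDTH * HEIGHT
kilo_pixels = pixels_per_frame / 1024

# Throughput implied by the reported total time per frame.
frame_time_ms = 55
frames_per_second = 1000 / frame_time_ms

print(pixels_per_frame, kilo_pixels, round(frames_per_second, 1))
```

The frame size comes out at exactly 432 KP, and 55 ms per frame corresponds to roughly 18 frames per second, which matches the figures on the slide.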

  29. Programming Language & Compiler: programming language and automated mapping

  30. The ActiveCells Language
      • History & Profile
        • Evolution of Pascal, Modula, Oberon
        • Actor based
        • Compositional
      • Active cell (Actor)
        • Object with active behavior, communicating via channels
      • Assembly
        • Network of interoperating active cells
        • Reusable software component with ports interface

  31. Example of Functional Actor
        F = actor (in1, in2: instr; out: outstr);
        var i, j: integer;
        begin
          loop
            recv(in1, i); recv(in2, j);
            send(out, someOp(i, j))
          end
        end

  32. Example of User Interface Actor
        UI = actor (out1, out2: outstr; in: instr);
        var i, j, k: INTEGER;
        begin
          loop
            RS232.RecvInt(i); RS232.RecvInt(j);
            send(out1, i); send(out2, j);
            recv(in, k); RS232.SendInt(k)
          end
        end

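  33. Examples of Assemblies
      • Assembly without ports (A)
      • Assembly with ports (B)
      [Diagrams: assembly A connects actor UI (attached to RS232; ports out1, out2, in) to actor F (ports in1, in2, out). Assembly B delegates its ports in1 to in4 to two F actors, connects their outputs to actor G, and delegates G's out to its own port out.]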

  34. Assembly A Code
        assembly A; (*without ports*)
        import RS232;
        type
          F = actor (in1, in2: instr; out: outstr);
          UI = actor (out1, out2: outstr; in: instr);
        var ifc: UI; f: F;
        begin
          new(ifc); new(f);
          connect(ifc.out1, f.in1); connect(ifc.out2, f.in2);
          connect(f.out, ifc.in)
        end A.

  35. Assembly B Code
        assembly B (in1, in2, in3, in4: instr; out: outstr); (*with five ports*)
        type
          F, G = actor (in1, in2: instr; out: outstr);
        var f1, f2: F; g: G;
        begin
          new(f1); new(f2); new(g);
          connect(f1.out, g.in1); connect(f2.out, g.in2);
          delegate(in1, f1.in1); delegate(in2, f1.in2);
          delegate(in3, f2.in1); delegate(in4, f2.in2);
          delegate(out, g.out)
        end B.
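
The wiring performed by assembly B can be mimicked in software to see what `connect` and `delegate` accomplish. The following is a sketch only, assuming Python threads for actors and `queue.Queue` for channels. `binary_actor`, `assembly_b`, the `f_op`/`g_op` parameters, and the `None` shutdown marker are hypothetical. Since actor type F declares a single output port `out`, the sketch wires the `out` port of both F instances to G.

```python
import queue
import threading

def binary_actor(op, in1, in2, out):
    """Loop: recv(in1, i); recv(in2, j); send(out, op(i, j))."""
    while True:
        i = in1.get()
        j = in2.get()
        if i is None or j is None:    # hypothetical shutdown marker
            out.put(None)
            return
        out.put(op(i, j))

def assembly_b(in1, in2, in3, in4, out, f_op, g_op):
    """Instantiate f1, f2: F and g: G, then wire them as in assembly B."""
    # Internal channels: connect(f1.out, g.in1) and connect(f2.out, g.in2).
    f1_out, f2_out = queue.Queue(), queue.Queue()
    wiring = [
        (f_op, in1, in2, f1_out),     # delegate(in1, f1.in1); delegate(in2, f1.in2)
        (f_op, in3, in4, f2_out),     # delegate(in3, f2.in1); delegate(in4, f2.in2)
        (g_op, f1_out, f2_out, out),  # delegate(out, g.out)
    ]
    threads = [threading.Thread(target=binary_actor, args=w) for w in wiring]
    for t in threads:
        t.start()
    return threads
```

In this sketch, delegation amounts to handing the assembly's own port directly to an inner actor, whereas a connection creates a fresh channel that only the two inner actors see.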

  36. Built-In Vector Types and Operators
      • Runge-Kutta (x, x1, k1, k2, … 3d vectors)
        while t <= tmax do
          k1 := f(t, x);
          k2 := f(t + dt/2, x + dt/2 * k1);
          k3 := f(t + dt/2, x + dt/2 * k2);
          k4 := f(t + dt, x + dt * k3);
          x1 := x + dt/3 * (1/2 * k1 + k2 + k3 + 1/2 * k4);
          Draw(x, x1); x := x1; t := t + dt;
        end
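
For readers without access to the built-in vector types, the same Runge-Kutta step can be written with explicit vector helpers. This is a sketch in Python using lists as 3d vectors; `axpy` and `rk4_step` are hypothetical helper names, and the `Draw` call from the slide is left out.

```python
def axpy(a, x, y):
    """Return a*x + y for vectors stored as lists."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def rk4_step(f, t, x, dt):
    """One classical Runge-Kutta step, with the same weights as the slide."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, axpy(dt / 2, k1, x))
    k3 = f(t + dt / 2, axpy(dt / 2, k2, x))
    k4 = f(t + dt, axpy(dt, k3, x))
    # x1 := x + dt/3 * (1/2*k1 + k2 + k3 + 1/2*k4)
    s = [0.5 * a + b + c + 0.5 * d for a, b, c, d in zip(k1, k2, k3, k4)]
    return axpy(dt / 3, s, x)

def integrate(f, x, t, tmax, dt):
    """Repeat the step while t <= tmax, as in the loop on the slide."""
    while t <= tmax:
        x = rk4_step(f, t, x, dt)
        t += dt
    return x
```

The weighting `dt/3 * (1/2*k1 + k2 + k3 + 1/2*k4)` equals the more familiar `dt/6 * (k1 + 2*k2 + 2*k3 + k4)`.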

  37. Built-In Matrix Types and Operators
      • Graphics pipeline (matrix multiplication)
        M := Graphics.Proj(left, right, bot, top, near, far)
           * Graphics.Trans(0.0, 0.0, -d)
           * Graphics.RotX(elev)
           * Graphics.RotY(-azim)
           * Graphics.Trans(0.0, 0.0, -zm)
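
The composition of transformations above can be reproduced with explicit 4 x 4 matrices. The sketch below covers only the translation and rotation factors and omits `Graphics.Proj`, whose exact definition is not given in the slides. The conventions used here (column vectors, angles in radians, standard right-handed rotation matrices) are assumptions, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def trans(x, y, z):
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, c, -s, 0.0],
            [0.0, s, c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def view_matrix(d, elev, azim, zm):
    """Trans(0,0,-d) * RotX(elev) * RotY(-azim) * Trans(0,0,-zm), without Proj."""
    m = trans(0.0, 0.0, -d)
    for step in (rot_x(elev), rot_y(-azim), trans(0.0, 0.0, -zm)):
        m = matmul(m, step)
    return m

def apply(m, point):
    """Transform a 3d point (w = 1) and return its x, y, z coordinates."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))
```

With both angles set to zero, the two translations simply add up, which gives a quick sanity check of the multiplication order.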

  38. Hybrid Compilation

  39. Actor Code (compiled to TRM code)
        F = actor (in1, in2: instr; out: outstr);
        var i, j: integer;
        begin
          loop
            recv(in1, i); recv(in2, j);
            send(out, someOp(i, j))
          end
        end

  40. Assembly Code (compiled to Verilog)
        assembly B (in1, in2, in3, in4: instr; out: outstr);
        type
          F, G = actor (in1, in2: instr; out: outstr);
        var f1, f2: F; g: G;
        begin
          new(f1); new(f2); new(g);
          connect(f1.out, g.in1); connect(f2.out, g.in2);
          delegate(in1, f1.in1); delegate(in2, f1.in2);
          delegate(in3, f2.in1); delegate(in4, f2.in2);
          delegate(out, g.out)
        end B.

  41. Automated Mapping to FPGA
      [Diagram; visible labels: source program, runtime library, hybrid compiler, TRM code, memory images (.mem), Verilog code, scripts (make.tcl, ram.bmm), hardware library, Xilinx synthesizer, bits]

  42. Program Model Refinement
      • Each thread may spawn any number of mutually independent sub-threads
      • Advantages
        • Allows (lock-free) fine-grained parallel computing
      • Requirements
        • Needs core clustering
        • Needs runtime scheduling support
        • Needs barrier mechanism
      [Diagram: thread A spawns sub-threads A1 and A2, which meet at a barrier]
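
The spawn/barrier pattern above can be illustrated with ordinary threads. This is a sketch under the assumption that each sub-thread writes only to its own result slot, so no locks are needed; `parallel_sum_of_squares` and its parameters are hypothetical, and Python's `threading.Barrier` stands in for the barrier mechanism the refinement would require.

```python
import threading

def parallel_sum_of_squares(values, workers=4):
    """Spawn independent sub-threads, then wait for all of them at a barrier."""
    partial = [0] * workers                    # one private slot per sub-thread
    barrier = threading.Barrier(workers + 1)   # sub-threads plus the parent

    def sub_thread(k):
        # Each sub-thread handles every `workers`-th element, starting at k.
        partial[k] = sum(v * v for v in values[k::workers])
        barrier.wait()                         # signal completion

    for k in range(workers):
        threading.Thread(target=sub_thread, args=(k,)).start()
    barrier.wait()                             # parent resumes once all have arrived
    return sum(partial)
```

Because the sub-threads are mutually independent, the only synchronization point is the barrier itself.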

  43. Next Step
      • Use the ActiveCells language for developing embedded software on top of some standard IDE
        • Including design, programming, debugging, analyzing
        • Analyzer may need cycle-accurate simulator
      • Use fully automated tool to generate an FPGA image (burn down)

  44. Power Management Codesign: integrated HW/SW power management system. Collaboration with Prof. Shiao-Li Tsao, National Chiao Tung University, Taiwan

  45. Performance/Energy Space

  46. P/E Profiling

  47. Clock Gating Strategy
      [Figure panels: with clock always on; with clock gating]

  48. Power Management as Add-On
      • Clock gating
      • PM Add-On generated automatically on demand
        • actor{ PM } (...);
      • Instruction
        • clockOff()
      • Control registers
        • TRM mode, clock rate, voltage
      • Signals
        • Data on port
      • I/O ports
        • Interop with PM controller
      • Internal memory
        • Backup TRM state/registers
      [Diagram: TRM with PM Add-On Circuitry; signals data, clk, out, in]

  49. Clock Gating Off Procedure
      • Signal PM controller
      • Stop clock
      [Diagram: TRM, PM Add-On Circuitry (data, clk, out, in), Clock Manager, PM Controller]

  50. Clock Gating On Procedure
      • Data arrives
      • PM controller feeds in clock
      • TRM processor resumes
      [Diagram: TRM, PM Add-On Circuitry (data, clk, out, in), Clock Manager, PM Controller]
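
The effect of the clock gating off/on procedures can be illustrated with a toy cycle-counting model. This is a sketch, assuming that a gated core receives clock cycles only while it is processing data and that an ungated core is clocked on every cycle. The function `simulate` and its parameters are hypothetical, and the numbers it produces are unrelated to the measurements reported in the talk.

```python
def simulate(arrivals, total_cycles, work_cycles, gated):
    """Count the clock cycles delivered to one core.

    arrivals:    set of cycle numbers at which data shows up on the input port
    work_cycles: cycles the core needs to process one datum
    gated:       if True, the clock is off while the core is idle and is
                 fed in again when data arrives
    """
    clocked = 0     # cycles in which the core received a clock edge
    pending = 0     # data items waiting at the input
    busy = 0        # remaining work cycles for the current item
    for cycle in range(total_cycles):
        if cycle in arrivals:
            pending += 1            # data on port
        if busy == 0 and pending > 0:
            pending -= 1
            busy = work_cycles      # core starts (or resumes) processing
        if busy > 0:
            busy -= 1
            clocked += 1            # useful cycle
        elif not gated:
            clocked += 1            # idle cycle with the clock still running
    return clocked
```

With two data items arriving in a window of 100 cycles and 10 cycles of work each, the gated core is clocked for 20 cycles while the ungated core is clocked for all 100.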
