CGRA Quiz


Quiz
• What is the fundamental drawback of fine-grained architectures that led to the exploration of coarse-grained reconfigurable architectures? (Max of 5 words!)
• Give two examples for each coarse-grained architecture type: Mesh, Linear Array, and Crossbar
• Indicate whether each given architecture supports some form of partial reconfiguration: PipeRench, KressArray, CHESS

COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/21/2014 DAY - 2 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku

Outline
• Coarse Grained Reconfigurable Architectures
  • RAW
  • CHESS
• Basics of Network On Chip (NoC)
• Project Overview

Raw Architecture Workstation (RAW)
• Developed at MIT
• It fully exposes low-level hardware architectural details to the compiler
• It lacks hardware for register renaming and dynamic instruction issue
• A Raw architecture seeks to execute pipelined applications (like signal processing) efficiently
Motivation ???
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Change Is Around the Corner
• Processor performance not scaling as before
• Wire delay and power
Old view: chip looks small to a wire
New view: chip looks much bigger to a wire; communication is expensive even on chip!
[Figure: chip size compared with the distance a signal can travel in 1 cycle]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Raw Architecture How do we arrive at this design??? Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Problems with Monolithic Designs
• Super-wide general purpose processors are no longer practical
• Centralized control with global operand routing
• Area, power, and frequency concerns
[Figure: monolithic design with PC, wide fetch (16 inst), RF, 16 ALUs, bypass net, control, and unified load/store queue]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Spatial Architectures
[Figure: 16 ALUs with RF and bypass net]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Exploiting Locality
[Figure: 16 ALUs and RF]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Distribute the Register File
[Figure: RFs distributed among the 16 ALUs]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Distribute the Rest
[Figure: PC, RF, I$, ALU, and D$ replicated 16 times; wide fetch (16 inst), control, and unified load/store queue]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Tiled-Processor Architecture
[Figure: 16 tiles, each with PC, RF, I$, ALU, and D$]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Tiled-Processor Architecture
• Tile abstraction is quite powerful
  • e.g., power → resources used as necessary
• Easily scalable
• All signals registered at tile boundaries, no global signals
• Easier to tune the frequency
• Easier to do the physical design
• Easier to verify
• Make a tile as big as you can go in one clock cycle, and expose longer communication to the programmer
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Raw On-Chip Networks
[Figure: tile with computation resources, switch, and processor]
• 2 Static Networks
  • Provide low-latency communication between tiles.
  • Routing decisions are made at compile time.
• 2 Dynamic Networks
  • Header encodes destination.
  • Transport unpredictable operations like interrupts and cache misses.
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
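The difference between the two network types can be sketched in a few lines of Python. This is only an illustration: the slide says that a dynamic-network header encodes the destination and that static routes are fixed at compile time, but the dimension-ordered (X first, then Y) path, the one-cycle-per-hop timing, and the names `xy_route` and `static_schedule` are assumptions made for this sketch.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (row, col) position in the tile array


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Dynamic-network view: follow the destination in the header.

    Assumed policy: move along the row (X) first, then along the column (Y).
    """
    row, col = src
    path = [src]
    while col != dst[1]:
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path


def static_schedule(src: Tile, dst: Tile, start_cycle: int) -> List[Tuple[int, Tile]]:
    """Static-network view: the compiler fixes a (cycle, switch) slot per hop.

    Assumes one cycle per hop, which is a simplification for illustration.
    """
    return [(start_cycle + hop, tile) for hop, tile in enumerate(xy_route(src, dst))]


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(static_schedule((0, 0), (0, 2), start_cycle=5))
```

In the static case the whole list is known before the program runs, so the switches need no header decoding; in the dynamic case each switch would compute its next hop from the header at run time.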

Inside the Compute Processor
[Figure: compute processor pipeline with stages IF, D, RF, E, M1, M2, A, TL, TV, F, P, U, F4, WB; registers r24, r25, r26, r27; local bypass network; input FIFOs from static router; output FIFOs to static router]

Raw Compiler Example
Assign instructions to the tiles, maximizing locality. Generate the static router instructions to transfer operands and streams between tiles.

Source statements shown on the slide:
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
….

Renamed instructions (shown on the slide and assigned to Raw tiles):
seed.0=seed
pval1=seed.0*3.0; pval0=pval1+2.0; tmp0.1=pval0/2.0; tmp0=tmp0.1
v1.2=v1; pval2=seed.0*v1.2; tmp1.3=pval2+2.0; tmp1=tmp1.3
v2.4=v2; pval3=seed.0*v2.4; tmp2.5=pval3+2.0; tmp2=tmp2.5
pval5=seed.0*6.0; pval4=pval5+2.0; tmp3.6=pval4/3.0; tmp3=tmp3.6
pval7=tmp1.3+tmp2.5; v1.8=pval7*3.0; v1=v1.8
pval6=tmp1.3-tmp2.5; v2.7=pval6*5.0; v2=v2.7
v0.9=tmp0.1-v1.8; v0=v0.9
v3.10=tmp3.6-v2.7; v3=v3.10
[Slide Source: Michael B. Taylor]
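The idea of assigning instructions to tiles while maximizing locality can be illustrated with a small greedy placement. This is a minimal sketch, not the Raw compiler's actual algorithm: the greedy policy, the per-tile `capacity` limit, and the function names are assumptions, and only a handful of the renamed instructions from the slide are used.

```python
from collections import Counter

# (destination, operands) for a few renamed instructions from the slide.
# Operands with no producer in this list (seed.0, tmp1.3, tmp2.5) are treated
# as live-in values available on every tile, which is an assumption.
INSTRS = [
    ("pval5", ["seed.0"]),
    ("pval4", ["pval5"]),
    ("tmp3.6", ["pval4"]),
    ("pval1", ["seed.0"]),
    ("pval0", ["pval1"]),
    ("tmp0.1", ["pval0"]),
    ("pval7", ["tmp1.3", "tmp2.5"]),
    ("v1.8", ["pval7"]),
    ("v0.9", ["tmp0.1", "v1.8"]),
]


def assign_tiles(instrs, n_tiles, capacity):
    """Greedily place each instruction on the tile holding most of its operands.

    Requires n_tiles * capacity >= len(instrs) so a free tile always exists.
    """
    home = {}          # value name -> tile that produces it
    load = Counter()   # number of instructions placed on each tile
    for dest, srcs in instrs:
        votes = Counter(home[s] for s in srcs if s in home)
        free = [t for t in range(n_tiles) if load[t] < capacity]
        # Most local operands first, then least loaded, then lowest tile id.
        best = max(free, key=lambda t: (votes[t], -load[t], -t))
        home[dest] = best
        load[best] += 1
    return home


def count_transfers(instrs, home):
    """Count operands that must cross tiles (each needs a static-router move)."""
    return sum(
        1
        for dest, srcs in instrs
        for s in srcs
        if s in home and home[s] != home[dest]
    )


if __name__ == "__main__":
    placement = assign_tiles(INSTRS, n_tiles=2, capacity=5)
    print(placement)
    print("inter-tile transfers:", count_transfers(INSTRS, placement))
```

Each counted transfer corresponds to an operand for which the compiler would have to emit static router instructions; dependent chains such as `pval5`, `pval4`, `tmp3.6` stay on one tile and need none.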

Architectural Comparison
[Figure: RAW, superscalar, and multiprocessor architectures]
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.

Application Mapping on RAW
[Figure: tile array running several workloads at once: video data stream; frame buffer and screen; custom data path pipeline (by compiler); two-way threaded Java program; four-way parallelized scalar code; httpd; sleep mode (power saving)]
Fast inter-tile ALU forwarding: 3 cycles
Waingold, Elliot, et al. "Baring it all to software: Raw machines." Computer 30.9 (1997): 86-93.
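To see what the 3-cycle inter-tile forwarding figure implies for tiles that are farther apart, a rough latency model can help. The 3 cycles between neighboring tiles come from the slide; the cost of one extra cycle per additional hop, the zero cost for same-tile forwarding, and the name `forwarding_latency` are assumptions of this sketch.

```python
NEIGHBOR_LATENCY = 3  # cycles between adjacent tiles (from the slide)
EXTRA_HOP_COST = 1    # cycles per additional hop (assumption for illustration)


def forwarding_latency(src, dst):
    """Estimate ALU-to-ALU operand latency between tiles given as (row, col)."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    if hops == 0:
        return 0  # same tile: local bypass, modeled as free here
    return NEIGHBOR_LATENCY + (hops - 1) * EXTRA_HOP_COST


if __name__ == "__main__":
    print(forwarding_latency((0, 0), (0, 1)))  # adjacent tiles
    print(forwarding_latency((0, 0), (3, 3)))  # opposite corners of a 4x4 array
```

Under these assumptions latency grows with Manhattan distance, which is why the compiler tries to keep dependent instructions on the same or neighboring tiles.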

RAW - Performance Taylor, Michael Bedford, et al. "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.

CHESS - A Reconfigurable Arithmetic Array For Multimedia Applications
• Designed by Hewlett-Packard Laboratories in 1999
• Aims at speeding up arithmetic operations for multimedia applications and tries to improve memory density
• Principal goals of CHESS
  • Increased arithmetic computational density
  • Increased memory bandwidth
  • Increased capacity of internal memories
  • Enhanced flexibility
  • Rapid reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Architecture
• 4-bit ALUs
• 4-bit bus wiring
• Switchboxes
• Chessboard layout
• Embedded block RAMs
• Speed and hierarchical line lengths
• Small configuration memories
• No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."
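One consequence of a 4-bit datapath is that wider arithmetic has to be built from several ALUs. The sketch below models a 16-bit addition as four 4-bit additions linked by a carry. It is a behavioral illustration only: the function names are hypothetical, and the carry-chaining scheme is an assumption about how 4-bit ALUs would be combined, not a description of the CHESS hardware.

```python
def alu4_add(a, b, carry_in=0):
    """Model one 4-bit ALU performing an add; returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def chained_add(a, b, width_bits=16):
    """Add two words by chaining width_bits // 4 four-bit ALUs through the carry."""
    result, carry = 0, 0
    for i in range(width_bits // 4):
        shift = 4 * i
        nibble, carry = alu4_add((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= nibble << shift
    return result, carry


if __name__ == "__main__":
    total, carry_out = chained_add(0x1234, 0x0FFF)
    print(hex(total), carry_out)
```

The same word width determines how many ALUs and 4-bit buses an operation occupies, which is one way to read the computational density goal.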

CHESS - Components Switchbox ALU Logic Design Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Routing Structure Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

CHESS - Performance
• High computational density
• Efficient multiplies due to embedded ALU
• Issues:
  • No reported software or application results
  • No run-time reconfiguration
Marshall, Alan, et al. "A reconfigurable arithmetic array for multimedia applications."

Comparison: CHESS and MATRIX
• Both use a 2D array of ALUs
• For both, instructions can be generated within the array
• Both architectures are flexible
• CHESS is 4-bit whereas MATRIX is 8-bit
• CHESS does not support run-time reconfiguration but configures very fast because few bits are required
• CHESS has high computational density
• CHESS is aimed at arithmetic operations whereas MATRIX is more general purpose

Network-On-Chip (NoC)

Project Overview
• Implementing Coarse Grained and Hybrid Reconfigurable Architecture
• NoC interconnection between processing elements
• Supports Variable Block Size Motion Estimation
• Motion Estimation Algorithms
  • Full Search
  • Diamond Search
Verma, Ruchika, and Ali Akoglu. "A coarse grained and hybrid reconfigurable architecture with flexible NoC router for variable block size motion estimation." Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. IEEE, 2008.
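As a software reference for the Full Search algorithm named above, the following sketch tests every candidate displacement in a search window and keeps the best match. The sum of absolute differences (SAD) as the matching cost, the function names, and the list-of-lists frame layout are assumptions for illustration; the slides do not specify them. The block size is taken from the shape of the current block, so the same code handles variable block sizes.

```python
def sad(cur_block, ref_frame, top, left):
    """Sum of absolute differences between cur_block and the reference
    region whose top-left corner is (top, left)."""
    return sum(
        abs(cur_block[r][c] - ref_frame[top + r][left + c])
        for r in range(len(cur_block))
        for c in range(len(cur_block[0]))
    )


def full_search(cur_block, ref_frame, origin, search_range):
    """Exhaustively test every displacement within +/- search_range of origin.

    Returns (best_cost, (dy, dx)), or None if no candidate fits in the frame.
    """
    height, width = len(cur_block), len(cur_block[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            top, left = origin[0] + dy, origin[1] + dx
            # Skip candidates that fall partly outside the reference frame.
            if top < 0 or left < 0:
                continue
            if top + height > len(ref_frame) or left + width > len(ref_frame[0]):
                continue
            cost = sad(cur_block, ref_frame, top, left)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


if __name__ == "__main__":
    # Synthetic 6x6 reference frame; the current block is copied from (2, 3).
    ref = [[r * 6 + c for c in range(6)] for r in range(6)]
    cur = [row[3:5] for row in ref[2:4]]
    print(full_search(cur, ref, origin=(1, 1), search_range=2))
```

Diamond Search would visit only a subset of these candidates, trading some matching accuracy for far fewer SAD evaluations.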

[Figure: architecture block diagram. Main Memory connects through the Memory Interface (MI) to a 4x4 array of CPEs, CPE (1,1) through CPE (4,4), together with PE 2(1), PE 2(2), PE 2(3), PE 2(4), and PE 3. Labeled signals: c_d_(x,y) (32 bits), r_d_(x,y) (32 bits), reference_block_id (5 bits), data_load_control (16 bits). Labeled link widths: 32 bits, 12 bits, 14 bits.]

QUESTIONS??
