Performance Analysis of GRAM Benchmarks for Reconfigurable Systems

Experimental Performance Evaluation for Reconfigurable Computer Systems: The GRAM Benchmarks
E. Chitalwala, T. El-Ghazawi, K. Gaj
The George Washington University, George Mason University
MAPLD 2004, Washington DC

Abbreviations
• BRAM - Block RAM
• GRAM - Generalized Reconfigurable Architecture Model
• LM - Local Memory
• Max - Maximum
• MAP - Multi Adaptive Processor
• MPM - Microprocessor Memory
• OCM - On-Chip Memory
• PE - Processing Element
• Trans Perms - Transfer of Permissions

Outline
• Problem Statement
• GRAM Description
• Assumptions and Methodology
• Testbed Description: SRC-6E
• Results
• Conclusion and Future Direction

Problem Statement
• Develop a standardized model of reconfigurable architectures.
• Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
• Evaluate the system against the peak performance specifications given by the manufacturer.
• Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.

Generalized Reconfigurable Architecture Model (GRAM)

GRAM Benchmarks: Objective
• Measure the maximum sustainable data transfer rates and latency between the various elements of the GRAM.
• Dynamically characterize the performance of the system against its peak performance.

[Figure: Generalized Reconfigurable Architecture Model (GRAM)]

GRAM Elements
• PE - Processing Element
• OCM - On-Chip Memory
• LM - Local Memory
• Interconnect Network / Shared Memory
• Bus Interface
• Microprocessor Memory

GRAM Benchmarks
• OCM - OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
• OCM - LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.
• OCM - Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
• Shared Memory - MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.
• OCM - MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
• LM - MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.
• LM - LM: Measure max. sustainable bandwidth and latency between LM and LM in either direction.
• LM - Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.

GRAM Assumptions

Assumptions
• All devices on the board are driven by a single clock.
• There is no direct path between the Local Memories of individual elements.
• Connections for add-on cards may exist but are not shown.
• The generalized architecture is based on precedents set by past and current manufacturers of reconfigurable systems.

Methodology
• Data paths are parallelized to the maximum extent possible.
• Inputs and outputs are kept symmetrical.
• Hardware timers measure the time taken to transfer data.
• Measurements are taken for transfers of increasingly large amounts of data.
• Data is verified for correctness after each transfer.
• Multiple paths may exist between the specified elements; the aim is to measure the fastest available path.
• All experiments use the programming model and library functions of the system.
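The methodology above (timed transfers of increasing size, with the data verified afterwards) can be sketched on an ordinary host. This is only an illustration under stated assumptions: a buffer copy stands in for a GRAM transfer, `time.perf_counter_ns` stands in for the hardware timers, and the name `sweep_copy_bandwidth` is hypothetical. It is not the SRC-6E programming model or its library functions.

```python
import time

def sweep_copy_bandwidth(sizes_bytes, repeats=3):
    """Return (size_bytes, best_mb_per_s) for each transfer size.

    A host-side buffer copy stands in for a GRAM transfer; this is an
    illustrative assumption, not the SRC-6E library interface.
    """
    results = []
    for size in sizes_bytes:
        # Deterministic pattern so the copy can be checked afterwards.
        src = (bytes(range(256)) * (size // 256 + 1))[:size]
        best_ns = None
        for _ in range(repeats):
            dst = bytearray(size)
            start = time.perf_counter_ns()
            dst[:] = src  # the "transfer" being timed
            elapsed_ns = max(time.perf_counter_ns() - start, 1)
            # Verify correctness outside the timed region.
            if dst != src:
                raise RuntimeError(f"data mismatch at size {size}")
            best_ns = elapsed_ns if best_ns is None else min(best_ns, elapsed_ns)
        # Bytes per nanosecond equals GB/s; multiply by 1000 for MB/s.
        results.append((size, size * 1000 / best_ns))
    return results

if __name__ == "__main__":
    for size, mb_s in sweep_copy_bandwidth([2 ** k for k in range(10, 21, 2)]):
        print(f"{size:>8} bytes  {mb_s:10.1f} MB/s")
```

Keeping the best of several repeats approximates a maximum sustainable rate; the verification step sits outside the timed region so that it does not distort the measurement.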

Testbed Description: SRC-6E

[Figure: Hardware Architecture of the SRC-6E. Link labels: 800/1600 Mbytes/sec, 800 Mbytes/sec; path widths: 64 x 6, 64.]

[Figure: Programming Model of the SRC-6E. Application sources: .c or .f files, .mc or .mf files, .vhd or .v files. Tools: µP Compiler, MAP Compiler, Logic Synthesis, Place & Route, Linker. Intermediate files: .o, .v, .ngo, .bin. Output: Application Executable.]

[Figure: GRAM Benchmarks for the SRC-6E. Components: two µP boards, each with P3/P4 µPs (1/3 GHz), L2 caches, MIOC, SNAP, µProcessor Memory (1.5 GB), and PCI slot, connected by Ethernet; MAP III board with Control Chips, On-Board Memory (24 MB), and User Chips. Bandwidth labels: 800/1600 MBytes/s, 800, 4800 (6 x 800), 2400 (4800*), 8000. Benchmark paths marked: Shared Memory to MPM, OCM - MPM, OCM - Shared Memory, OCM - OCM.]
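The "4800 (6 x 800)" labels in the diagram are consistent with six 64-bit paths that each move one word per clock. A minimal sketch of that arithmetic follows, assuming a 100 MHz clock. That clock rate is inferred here from 800 MBytes/s per 64-bit path and is not stated in the slides; the function name is hypothetical.

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz, banks=1):
    """Peak rate in MB/s for `banks` parallel paths, one word per clock each."""
    bytes_per_word = width_bits // 8
    # bytes/word * million words/second = MB/s per path
    return bytes_per_word * clock_mhz * banks

# Assumed 100 MHz clock (inferred, not given in the source).
single_path = peak_bandwidth_mb_s(64, 100)          # one 64-bit path
six_banks = peak_bandwidth_mb_s(64, 100, banks=6)   # six banks in parallel
print(single_path, six_banks)
```

Under that assumption a single 64-bit path gives 800 MB/s and six banks give 4800 MB/s, matching the figures on the slide.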

GRAM Benchmarks for the SRC-6E (continued)

Results

Block Diagram for a Single-Bank Transfer between OCM and Shared Memory
Start_timer
Read_timer(ht0)
µProcessor Memory to Shared Memory (DMA_in)
Read_timer(ht1)
Shared Memory to OCM
Read_timer(ht2)
OCM to Shared Memory
Read_timer(ht3)
Shared Memory to µProcessor Memory (DMA_out)
Read_timer(ht4)
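Each stage in the block diagram is bracketed by two timer reads, so its duration is the difference between consecutive readings ht0 through ht4. A minimal sketch of that bookkeeping is shown below; the sample readings are hypothetical clock counts chosen for illustration, not measured SRC-6E values, and `stage_durations` is an assumed helper name.

```python
STAGES = [
    "MPM -> Shared Memory (DMA_in)",
    "Shared Memory -> OCM",
    "OCM -> Shared Memory",
    "Shared Memory -> MPM (DMA_out)",
]

def stage_durations(readings):
    """Map each stage to its duration, given timer readings [ht0, ..., ht4]."""
    if len(readings) != len(STAGES) + 1:
        raise ValueError("expected one more reading than there are stages")
    # Stage i runs between reading i and reading i + 1.
    return {stage: readings[i + 1] - readings[i] for i, stage in enumerate(STAGES)}

# Hypothetical timer values in clock ticks (not measured data).
example = stage_durations([0, 1200, 1500, 1790, 3000])
for stage, ticks in example.items():
    print(f"{stage:32s} {ticks:6d} ticks")
```

Dividing the number of bytes moved in a stage by its duration (converted to seconds) gives the bandwidth for that stage.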

[Table: Latency. Note: 1 word = 64 bits.]

Latency
• The difference between read and write times for the OCM and Shared Memory is due to the read latency of the On-Board Memory (OBM, 6 clocks) versus BRAM (1 clock).
• When transferring data from the MPM to Shared Memory, a write is issued on every clock cycle and there is no startup latency.
• When reading data from the Shared Memory to the MPM, five additional clock cycles are required to transfer data after the read has been issued.
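To see why a fixed read latency matters mainly for short transfers, consider a simple cost model. It assumes, beyond what the slides state, that reads are pipelined so that one word arrives per clock once the startup latency has elapsed. The function names are hypothetical.

```python
def transfer_cycles(n_words, read_latency_clocks):
    """Clocks to read n_words, assuming one word per clock after startup."""
    return read_latency_clocks + n_words

def pipeline_efficiency(n_words, read_latency_clocks):
    """Fraction of clocks that deliver a word under the same assumption."""
    return n_words / transfer_cycles(n_words, read_latency_clocks)

# Read latencies quoted on the slide: OBM 6 clocks, BRAM 1 clock.
for n in (1, 16, 1024):
    print(n, pipeline_efficiency(n, 6), pipeline_efficiency(n, 1))
```

In this model the gap between the two memories is a constant five clocks per transfer, which dominates a one-word read but is negligible for transfers of a thousand words or more.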

[Figure: Data Path from OCM to OCM Using Transfer of Permissions. Shared Memory banks A to F (4 MB each); two Processing Elements (FPGAs), each with OCM 1 and OCM 2; path widths labeled 64 and 192.]

[Figure: Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol. Shared Memory banks A to F (4 MB each); Processing Element (FPGA 1) and Processing Element (FPGA 2), each with OCM 1; path widths labeled 64.]

P III & IV: Bandwidth: OCM and OCM (BM#1)

P III: Bandwidth: OCM and OCM (BM#1)

P IV: Bandwidth: OCM and OCM (BM#1)

P IV: Bandwidth: OCM and OCM (BM#1) (Streaming Protocol in Bridge Port)

[Figure: Data Path from OCM to MPM and Shared Memory to MPM. Microprocessor Memory, SNAP, Control FPGA; Shared Memory banks A to F (4 MB each); Processing Element (FPGA) with OCM 1, OCM 2, and OCM 3; path widths labeled 64.]

P III: Bandwidth: OCM and Shared Memory for a single bank

P III: Bandwidth: OCM and Shared Memory

P IV: Bandwidth: OCM and Shared Memory

P III: Bandwidth: OCM and µP Memory

P IV: Bandwidth: OCM and µP Memory

P III: Bandwidth: Shared Memory and µP Memory (BM#5)

P IV: Bandwidth: Shared Memory and µP Memory

P III: Bandwidth: Shared Memory and µP Memory

P IV: Bandwidth: Shared Memory and µP Memory

[Figure: Data Path from FPGA Register to Shared Memory. Shared Memory banks A to F (4 MB each); Processing Element (FPGA 1) with Register; path widths labeled 64.]

P III: Bandwidth: Shared Memory and Register

Conclusion & Future Direction

GRAM Summation for Pentium III

GRAM Summation for Pentium IV

Conclusions
• The type of components used plays a major role in determining system performance, as seen in the difference between the Pentium III and Pentium IV versions of the SRC-6E.
• The software environment and its state of development determine how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge Ports with the Carte 1.6.2 and Carte 1.7 releases.

Conclusions (continued)
The GRAM Summation Tables help machine architects in the following ways:
• The efficiency column indicates how well a particular communication channel is being utilized within the hardware context. If the efficiency is low, architects may be able to improve performance with a firmware improvement. If the efficiency is high and the normalized bandwidth is low, they should consider a hardware upgrade.
• By looking at the normalized bandwidths obtained from the GRAM benchmarks, designers can determine whether the data transfer rates are balanced across the architectural modules. This helps identify bottlenecks.
• Designers can find out which channels have the maximum efficiency and can fine-tune their applications to exploit these channels and achieve the maximum data transfer rate.
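The decision rule described for the efficiency and normalized-bandwidth columns can be sketched as follows. The slides do not give formulas for either column, so this sketch assumes efficiency is measured bandwidth divided by the manufacturer's peak for that channel, and normalized bandwidth is measured bandwidth divided by the highest measured bandwidth in the table. The 0.5 thresholds, the channel names, and all numbers are hypothetical.

```python
def summarize(channels, eff_threshold=0.5, norm_threshold=0.5):
    """channels: name -> (measured_mb_s, peak_mb_s). Returns name -> (eff, norm, hint)."""
    top = max(measured for measured, _ in channels.values())
    table = {}
    for name, (measured, peak) in channels.items():
        efficiency = measured / peak   # assumed definition
        normalized = measured / top    # assumed definition
        if efficiency < eff_threshold:
            hint = "firmware"          # channel is underused relative to its peak
        elif normalized < norm_threshold:
            hint = "hardware"          # well used, but slow next to other channels
        else:
            hint = "balanced"
        table[name] = (efficiency, normalized, hint)
    return table

# Hypothetical figures for illustration only (not SRC-6E results).
report = summarize({
    "channel_a": (4000, 4800),
    "channel_b": (200, 800),
    "channel_c": (760, 800),
})
for name, (eff, norm, hint) in report.items():
    print(f"{name}: efficiency={eff:.2f} normalized={norm:.2f} -> {hint}")
```

With these made-up inputs, the underused channel is flagged for a firmware improvement, and the channel that is efficient but slow relative to the others is flagged for a hardware upgrade.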

Conclusions (continued)
In addition, the GRAM Summation Tables provide the following information to application developers:
• The tables tell a designer what bottlenecks to expect and where these bottlenecks lie.
• By comparing the figures for efficiency and the normalized transfer rates, designers can understand whether the bottlenecks are created by the hardware or the software.
• By observing the GRAM Summation Tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.

Future Direction
• The benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.
• The benchmarks can also include tables that characterize the performance of reconfigurable computers relative to modern parallel architectures. A performance-to-cost analysis can also be considered.
