A Programmable Single Chip Digital Signal Processing Engine

MAPLD 2005. Paul Chiang, MathStar Inc.; Pius Ng, Apache Design Solutions.

Presentation Transcript


  1. A Programmable Single Chip Digital Signal Processing Engine • MAPLD 2005 • Paul Chiang, MathStar Inc. • Pius Ng, Apache Design Solutions

  2. Presentation Outline • Spaceborne signal processing tasks • FPOA architecture highlights • Programmability and expandability • System partition on FPOA device • Spatial processing: 5x5 filter solution • Temporal processing: motion estimation • Internal bus and I/O throughput • Resource utilization and future expansion

  3. A System of Digital Signal Processing • Input Data • Data Extraction: mux/de-mux, average filter, min/max select • Frequency or Time Domain Processing: time-domain low/high/bandpass filter, frequency transformation, frequency-domain low/high/bandpass filter • Spatial or Temporal Processing: spatial edge filter, temporal difference filter • Feature Extraction: apply the equation that defines the feature, check threshold • Characterization: analyze and characterize signals

  4. Processing Requirements • High computation rate on the basic operations add/sub and mul/mac • Mixed control functions such as loop control and decision making • High I/O bandwidth to balance processing against data input/output • Large, fast temporary memory to support real-time processing • Fast, programmable, direct data transfer to enable massively parallel processing

  5. FPOA Architecture Summary • Heterogeneous Array of 16-bit Silicon Objects • MAC, ALU, Truth Tables, Register File, Internal RAM • Single Clock Cycle Execution for All Objects • Homogeneous 2-Layer Programmable Interconnect Mesh • Tightly Integrated Data and Control Flow • Integrated DDRII RLDRAM & SRAM Controllers • High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL

  6. Reconfigurable Interconnect Network • Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits • Nearest Neighbors • Range = 1 (N/E/S/W + diagonal) • Party Lines • Single cycle range = hop to 3 (skip 2) @ 1 GHz • Extra clock cycles for digital retiming • 1 extra → 25-object neighborhood • More clock cycles → entire chip

  7. FPOA Solution • Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 Gbit/s in total • Two 250 MHz DDR 32-bit external memories with a combined bandwidth of 32 Gbit/s • 400 Silicon Objects running at 1 GHz • ALU: add/sub and combinational logic • MAC: mul/mac • Register File (RF): fast distributed data storage • Internal RAM (IRAM): intermediate data storage • Party lines and muxes to support a flexible internal bus as well as dedicated connections
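The two bandwidth figures on slide 7 can be reproduced with simple arithmetic. The sketch below is only a check of those numbers; the function and parameter names are illustrative, and it assumes that "DDR" means two transfers per clock and that the quoted figures are totals across all ports or memories.

```python
def gpio_bandwidth_gbps(ports=4, bits_per_port=44, clock_mhz=100):
    """Aggregate GPIO bandwidth in Gbit/s (one transfer per clock assumed)."""
    return ports * bits_per_port * clock_mhz / 1000.0


def ddr_bandwidth_gbps(memories=2, bus_bits=32, clock_mhz=250, transfers_per_clock=2):
    """Aggregate external-memory bandwidth in Gbit/s.

    transfers_per_clock=2 models double data rate (an assumption about
    how the slide's 32 Gbit/s figure was derived).
    """
    return memories * bus_bits * clock_mhz * transfers_per_clock / 1000.0


if __name__ == "__main__":
    print("GPIO:", gpio_bandwidth_gbps(), "Gbit/s")
    print("DDR: ", ddr_bandwidth_gbps(), "Gbit/s")
```

Under these assumptions the results match the 17.6 Gbit/s and 32 Gbit/s quoted on the slide.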

  8. Example FPOA Partition

  9. 5x5 Convolution Filter • Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]
       for i = 2; i < m - 2; i++
         for j = 2; j < n - 2; j++
           temp = 0;
           for k = -2; k < 3; k++
             for l = -2; l < 3; l++
               temp = D[i+k, j+l] * W[k+2, l+2] + temp
             end_of_l
           end_of_k
           Y[i, j] = temp;
         end_of_j
       end_of_i
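As a software reference for the pseudocode on slide 9, here is a minimal pure-Python sketch. It follows the slide's indexing (the mask is applied without flipping) and visits every position where the 5x5 mask fits entirely inside the array, so the loops run up to m - 3 and n - 3 inclusive. The function name and the choice to leave border outputs at zero are assumptions made for illustration.

```python
def conv5x5(D, W):
    """Apply a 5x5 mask W to a 2D list D, following the slide's indexing.

    Border positions (two rows/columns on each side) are left at 0,
    which is an assumption; the slide only says they are skipped.
    """
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):
        for j in range(2, n - 2):
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = acc
    return Y


if __name__ == "__main__":
    data = [[r * 8 + c for c in range(8)] for r in range(8)]
    box = [[1] * 5 for _ in range(5)]  # simple all-ones mask
    for row in conv5x5(data, box):
        print(row)
```

Each interior output costs 25 multiply-accumulate steps, which is the quantity the next slide counts.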

  10. Computation Requirements • Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply-and-add (MAC) operations for each filtered sample • The whole convolution filter operation requires 25 * m * n MAC operations • For standard 720x480 image data at 30 frames per second, the convolution filter requires about 259 million MACs per second (MMAC/s)
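The 259 MMAC/s figure follows directly from the slide's 25 * m * n count. A small sketch of that arithmetic, with illustrative function and parameter names; like the slide, it counts every pixel, including the border pixels that the filter actually skips.

```python
def mac_rate(width=720, height=480, fps=30, mask_taps=25):
    """MAC operations per second for a mask_taps-tap filter on every pixel."""
    return width * height * fps * mask_taps


if __name__ == "__main__":
    rate = mac_rate()
    print(f"{rate} MAC/s = {rate / 1e6:.1f} MMAC/s")
```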

  11. Data Storage • 2D data is stored in a 1D linear memory in which four 16-bit words can be accessed concurrently • Example of an 8x8 2D matrix stored in a 1D memory
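The slide's 8x8 layout figure is not preserved in this transcript, so the following is only a sketch of one plausible mapping: row-major storage with aligned groups of four words per access. The row-major order, the alignment, and all names are assumptions, not the presenters' actual layout.

```python
WORDS_PER_ACCESS = 4  # from the slide: four 16-bit words per access


def linear_address(row, col, n_cols):
    """Row-major address of element (row, col); the ordering is an assumption."""
    return row * n_cols + col


def access_group(row, col, n_cols):
    """Return (group index, lane within the group) for element (row, col)."""
    addr = linear_address(row, col, n_cols)
    return addr // WORDS_PER_ACCESS, addr % WORDS_PER_ACCESS


def fetch_group(memory, group):
    """Model one memory access returning four consecutive words."""
    start = group * WORDS_PER_ACCESS
    return memory[start:start + WORDS_PER_ACCESS]


# An 8x8 matrix whose entries encode their own coordinates (10*row + col).
matrix = [[10 * r + c for c in range(8)] for r in range(8)]
memory = [value for row in matrix for value in row]

if __name__ == "__main__":
    group, lane = access_group(1, 2, 8)
    print(group, lane, fetch_group(memory, group))
```

With this mapping each row of the 8x8 matrix occupies exactly two access groups, so a row can be read in two accesses.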

  12. Data Access Analysis • Samples are stored in the external memory, which has slower access speed • Maximize data bandwidth by accessing 4 words at a time • Use Register Files to store weights and sample data so that they can be reused without going out to external memory • Perform the calculation on 4 pixels concurrently, rotating coefficients and samples so that the accumulated products form the convolution

  13. Data Processing Analysis • Note 1: with a 5x5 filter the first two rows and columns are skipped • Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52

  14. FPOA Solution • Temporary data storage: 5 RFs, 3 ALUs • Data access control: 3 ALUs • Multiplier: 4 MACs • Adder tree: 9 ALUs • Temporary results: 2 RFs, 1 IRAM, 2 ALUs

  15. 5x5 Convolution Filter Performance • FPOA resources: 17 ALUs, 7 RFs, 4 MACs, 1 IRAM (total: 28 SOs + 1 IRAM) • Data throughput: 20 results every 125 cycles
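The quoted throughput is consistent with four MAC objects kept fully busy: 20 results need 20 * 25 = 500 MACs, which takes 125 cycles on 4 MACs. The sketch below checks that arithmetic and converts it to a result rate. The assumption of perfect MAC utilization and the function names are illustrative only.

```python
def cycles_needed(results, taps=25, macs=4):
    """Cycles to produce `results` outputs if every MAC does one MAC per cycle."""
    total_macs = results * taps
    return -(-total_macs // macs)  # ceiling division


def results_per_second(clock_hz, results=20, cycles=125):
    """Filtered samples per second at the slide's 20-results-per-125-cycles rate."""
    return clock_hz * results / cycles


if __name__ == "__main__":
    print(cycles_needed(20), "cycles for 20 results")
    print(results_per_second(1e9), "results/s at 1 GHz")
```

For comparison, the 720x480, 30 frame/s stream from slide 10 contains 10,368,000 pixels per second.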

  16. Motion Estimation • Identify the movement of a similar pattern over time • The main computation is the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]
       sum = 0;
       for i = 0 to 7
         for j = 0 to 7
           temp = X[i, j] - Y[i, j]
           sum = sum + abs(temp)
         end_of_j
       end_of_i
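The SAD pseudocode on slide 16 translates almost line for line into Python. This is a plain software reference, not a model of the FPOA mapping; the function name is an illustrative choice.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks (lists of lists)."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total


if __name__ == "__main__":
    block_a = [[i + j for j in range(8)] for i in range(8)]
    block_b = [[0] * 8 for _ in range(8)]
    print(sad_8x8(block_a, block_b))
```

In motion estimation, the candidate block with the smallest SAD against the reference block is taken as the best match.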

  17. SAD Computation Dataflow • 3 cycles throughput • Generates two partial sums of positive differences
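The dataflow figure for slide 17 is missing from this transcript. One reading of "two partial sums of positive differences" is that the differences X - Y and Y - X are each accumulated only where they are positive, and the two accumulators are added at the end to give the SAD. The sketch below illustrates that interpretation, which is an assumption rather than the presenters' documented design.

```python
def sad_partial_sums(X, Y):
    """Return (sum of positive X-Y, sum of positive Y-X) over two blocks.

    Adding the two partial sums gives the SAD, because each element
    pair contributes |x - y| to exactly one of them.
    """
    pos_xy = 0  # accumulates x - y where x > y
    pos_yx = 0  # accumulates y - x where y >= x
    for row_x, row_y in zip(X, Y):
        for x, y in zip(row_x, row_y):
            d = x - y
            if d > 0:
                pos_xy += d
            else:
                pos_yx += -d
    return pos_xy, pos_yx


if __name__ == "__main__":
    A = [[(3 * i + 5 * j) % 17 for j in range(8)] for i in range(8)]
    B = [[(7 * i + 2 * j) % 13 for j in range(8)] for i in range(8)]
    p, q = sad_partial_sums(A, B)
    print(p, q, p + q)
```

Splitting the sum this way avoids an explicit absolute-value step: each accumulator only ever adds non-negative values.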

  18. SAD Performance • FPOA resources: 35 ALUs, 1 RF (total: 36 SOs) • Data throughput: 24 cycles per 8x8 block

  19. Internal System Bus • Links all processing modules and the external host to the external system memory for data access • Host-controlled round-robin access from module to module • User-defined package format to utilize the 16-bit party line and minimize access overhead

  20. System Bus Implementation

  21. System Bus Performance • FPOA resources: 20 ALUs • Cycles: XRAM read 4 cycles, XRAM write 4 cycles, module switch 10 cycles
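The cycle counts on slide 21 can be combined into a rough estimate of bus time for one round-robin pass. The sketch below assumes that costs add linearly with no overlap and that one module switch is charged per module visit; both assumptions, and the `schedule` format, are hypothetical.

```python
XRAM_READ_CYCLES = 4       # from the slide
XRAM_WRITE_CYCLES = 4      # from the slide
MODULE_SWITCH_CYCLES = 10  # from the slide


def bus_cycles(schedule):
    """Estimate bus cycles for one round-robin pass.

    schedule: list of (reads, writes) tuples, one per module visited.
    Assumes one switch per visit and no overlap between accesses.
    """
    total = 0
    for reads, writes in schedule:
        total += MODULE_SWITCH_CYCLES
        total += reads * XRAM_READ_CYCLES + writes * XRAM_WRITE_CYCLES
    return total


if __name__ == "__main__":
    # Hypothetical pass: one module reads 4 times, the next writes 4 times.
    print(bus_cycles([(4, 0), (0, 4)]))
```

Under this model the switch cost dominates when a module performs only one or two accesses per visit, which is why batching accesses per module reduces overhead.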

  22. Performance of an Example Space Satellite Application • Processing throughput: about 10 million samples per second • FPOA resources (% of a device with 400 SOs running at 400 MHz): cycle utilization 21%, SO utilization 51%, IRAM utilization 25%, XRAM bandwidth 49% (100 MHz DDR RLDRAM)