A Programmable Single Chip Digital Signal Processing Engine

MAPLD 2005
Paul Chiang, MathStar Inc.
Pius Ng, Apache Design Solutions

Presentation Outline
• Space-borne signal processing tasks
• FPOA architecture highlights
• Programmability and expandability
• System partition on FPOA device
• Spatial processing - 5x5 filter solution
• Temporal processing - motion estimation
• Internal bus and I/O throughput
• Resource utilization and future expansion

A System of Digital Signal Processing
• Input Data
• Data Extraction: mux/de-mux, average filter, min/max select
• Frequency or Time Domain Processing: time domain low/high/bandpass filter, frequency transformation, frequency domain low/high/bandpass filter
• Spatial or Temporal Processing: spatial edge filter, temporal difference filter
• Feature Extraction: apply equation that defines feature, checking threshold
• Characterization: analyze and characterize signals

Processing Requirements
• High computation requirements for the basic operations add/sub and mul/mac
• Mixed control functions such as loop control and decision making
• High I/O bandwidth to balance processing against data input/output
• Large, fast temporary memory space to facilitate real-time processing
• Fast, programmable, direct data transfer to enable massively parallel processing

FPOA Architecture Summary
• Heterogeneous Array of 16-bit Silicon Objects
• MAC, ALU, Truth Tables, Register File, Internal RAM
• Single Clock Cycle Execution for All Objects
• Homogeneous 2-Layer Programmable Interconnect Mesh
• Tightly Integrated Data and Control Flow
• Integrated DDRII RLDRAM & SRAM Controllers
• High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL

Reconfigurable Interconnect Network
• Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
• Nearest Neighbors
  • Range = 1 (N/E/S/W + diagonal)
• Party Lines
  • Single cycle range = hop to 3 (skip 2) @ 1 GHz
  • Extra clock cycles for digital retiming
    • 1 extra → 25-object neighborhood
    • More clock cycles → entire chip

FPOA Solution
• Four GPIO ports with 44-bit I/O at 100 MHz, for a total of 17.6 Gbit/s
• Two 250 MHz DDR 32-bit external memories with a total bandwidth of 32 Gbit/s
• 400 Silicon Objects running at 1 GHz
• ALU: add/sub and combinational logic
• MAC: mul/mac
• Register File (RF): fast distributed data storage
• Internal RAM (IRAM): intermediate data storage
• Party lines and muxes to support a flexible internal bus as well as dedicated connections
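The two bandwidth figures on this slide can be reproduced with a short calculation. The sketch below assumes that the quoted numbers are totals across all ports and that "DDR" means two transfers per clock; the function name `gbit_per_s` is illustrative and not from the presentation.

```python
def gbit_per_s(ports, width_bits, clock_hz, transfers_per_clock=1):
    """Aggregate raw bandwidth in Gbit/s (assumes all ports run concurrently)."""
    return ports * width_bits * clock_hz * transfers_per_clock / 1e9

# Four 44-bit GPIO ports at 100 MHz, single data rate.
gpio_bw = gbit_per_s(ports=4, width_bits=44, clock_hz=100e6)

# Two 32-bit external memories at 250 MHz, double data rate (assumed 2 transfers/clock).
xram_bw = gbit_per_s(ports=2, width_bits=32, clock_hz=250e6, transfers_per_clock=2)

print(f"GPIO: {gpio_bw:.1f} Gbit/s, external memory: {xram_bw:.1f} Gbit/s")
```

Under these assumptions the results match the 17.6 Gbit/s and 32 Gbit/s stated on the slide.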

Example FPOA Partition

5x5 Convolution Filter
• Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]
  for i = 2; i < m - 2; i++
    for j = 2; j < n - 2; j++
      temp = 0;
      for k = -2; k < 3; k++
        for l = -2; l < 3; l++
          temp = D[i+k, j+l] * W[k+2, l+2] + temp
        end_of_l
      end_of_k
      Y[i, j] = temp;
    end_of_j
  end_of_i
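As a software reference for the loop nest on this slide, here is a minimal Python sketch. It visits every position at which the full 5x5 mask fits inside the array (i from 2 to m-3 and j from 2 to n-3, inclusive) and leaves the border outputs at zero; the border handling and the name `conv5x5` are assumptions, not part of the presentation.

```python
def conv5x5(D, W):
    """Apply a 5x5 mask W to the 2D list D; border outputs are left at 0."""
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):          # rows where the mask fits
        for j in range(2, n - 2):      # columns where the mask fits
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]
            Y[i][j] = acc
    return Y

# Usage: a 6x6 array of ones filtered with an all-ones mask.
ones_data = [[1] * 6 for _ in range(6)]
ones_mask = [[1] * 5 for _ in range(5)]
filtered = conv5x5(ones_data, ones_mask)
```

Each interior output costs 25 multiply-accumulate steps, which is the figure used on the following slide.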

Computation Requirements
• Assuming an m by n 2D data array and a 5x5 mask, each filtered sample requires 25 Multiply and Add (MAC) operations
• The whole convolution filter operation requires about 25 * m * n MAC operations
• For standard 720x480 image data at 30 frames per second, the convolution filter operation requires about 259 MMAC per second
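The 259 MMAC per second figure follows directly from the frame size, frame rate, and mask size. The sketch below reproduces it; like the slide, it counts 25 MACs for every pixel and ignores the skipped border rows and columns. The function name `mac_rate` is illustrative.

```python
def mac_rate(width, height, fps, mask_size=5):
    """MAC operations per second for a mask_size x mask_size filter on every pixel."""
    return width * height * fps * mask_size * mask_size

rate = mac_rate(720, 480, 30)
print(f"{rate / 1e6:.1f} MMAC/s")  # prints "259.2 MMAC/s"
```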

Data Storage
• 2D data is stored in a 1D linear memory in which four 16-bit words can be accessed concurrently
• Example of an 8x8 2D matrix stored in a 1D memory
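The storage figure from this slide is not reproduced in the text, so the following is only a sketch of one plausible layout: row-major order, with each memory access returning an aligned group of four consecutive 16-bit words. Both the row-major order and the alignment are assumptions, and the helper names are hypothetical.

```python
WORDS_PER_ACCESS = 4  # four 16-bit words per access, as stated on the slide

def linear_address(row, col, n_cols):
    """Row-major word address of element (row, col). Row-major order is an assumption."""
    return row * n_cols + col

def access_slot(row, col, n_cols):
    """Return (access index, lane within the 4-word access) for element (row, col)."""
    addr = linear_address(row, col, n_cols)
    return addr // WORDS_PER_ACCESS, addr % WORDS_PER_ACCESS

# For an 8x8 matrix, each row occupies two 4-word accesses.
print(access_slot(1, 2, n_cols=8))  # prints (2, 2)
```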

Data Access Analysis
• Samples are stored in external memory, which has slower access speed
• Maximize data bandwidth by accessing 4 words at a time
• Use Register Files to store weights and sample data so that they can be reused repeatedly without going out to external memory
• Perform the calculation on 4 pixels concurrently, rotating coefficients and samples so that together they form the convolution operation

Data Processing Analysis
Note 1: with a 5x5 filter, the first two rows and columns are skipped
Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52
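The idea of computing four outputs concurrently can be sketched in software. The code below steps through the 25 coefficients once and, at each step, applies the same coefficient to four different samples, which is the work four MAC units would do in parallel. Reading Y22, Y32, Y42, and Y52 as Y[2, 2] through Y[5, 2] is an assumption, and the actual rotation order of samples and coefficients on the device is not recoverable from the text.

```python
def direct_output(D, W, i, j):
    """Reference: one convolution output computed on its own."""
    return sum(D[i + k][j + l] * W[k + 2][l + 2]
               for k in range(-2, 3) for l in range(-2, 3))

def four_outputs(D, W, i, j):
    """Compute Y[i..i+3][j] together, sharing each coefficient across four accumulators."""
    acc = [0, 0, 0, 0]
    for k in range(-2, 3):
        for l in range(-2, 3):
            w = W[k + 2][l + 2]            # one coefficient per step
            for p in range(4):             # four MACs, parallel in hardware
                acc[p] += D[i + p + k][j + l] * w
    return acc

# Usage on a small ramp image with an arbitrary mask.
image = [[r * 7 + c for c in range(7)] for r in range(9)]
mask = [[(a + 1) * (b + 2) for b in range(5)] for a in range(5)]
group = four_outputs(image, mask, 2, 2)    # Y[2][2], Y[3][2], Y[4][2], Y[5][2]
```

With this arrangement a group of four outputs takes 25 coefficient steps, and the eight sample rows it touches are fetched once and reused from local storage.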

FPOA Solution
• Temporary data storage: 5 RFs, 3 ALUs
• Data access control: 3 ALUs
• Multiplier: 4 MACs
• Adder tree: 9 ALUs
• Temporary results: 2 RFs, 1 IRAM, 2 ALUs

5x5 Convolution Filter Performance
• FPOA resources: 17 ALUs, 7 RFs, 4 MACs, 1 IRAM (total: 28 SOs + 1 IRAM)
• Data throughput: 20 results every 125 cycles
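The throughput figure can be related to the MAC count and to the video requirement with a short calculation. In this sketch the clock rate is a parameter: 1 GHz and 400 MHz are taken from other slides in the deck, and whether the filter block actually runs at either rate is an assumption.

```python
RESULTS, CYCLES, MAC_UNITS, MACS_PER_RESULT = 20, 125, 4, 25

def results_per_second(clock_hz):
    """Filter outputs per second at the given clock, from 20 results per 125 cycles."""
    return clock_hz * RESULTS / CYCLES

# Fraction of MAC-unit cycles doing useful work: 20 * 25 MACs over 4 units * 125 cycles.
mac_utilization = RESULTS * MACS_PER_RESULT / (MAC_UNITS * CYCLES)

required = 720 * 480 * 30                  # pixels per second for 720x480 at 30 fps
for clock in (1e9, 400e6):
    rate = results_per_second(clock)
    print(f"{clock / 1e6:.0f} MHz: {rate / 1e6:.0f} M results/s, "
          f"{rate / required:.1f}x the video pixel rate")
```

The utilization works out to 1.0, meaning that 20 results in 125 cycles is exactly what four MAC units can deliver when each performs one MAC per cycle.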

Motion Estimation
• Identify the movement of a similar pattern over time
• The main computation is the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]
  sum = 0;
  for i = 0 to 7
    for j = 0 to 7
      temp = X[i, j] - Y[i, j]
      sum = sum + abs(temp)
    end_of_j
  end_of_i
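The SAD pseudocode on this slide translates directly into Python. The sketch below is a plain software reference, not a model of the FPOA implementation; the name `sad_8x8` is illustrative.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks given as lists of lists."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total

# Usage: two constant blocks that differ by 2 at each of the 64 positions.
block_a = [[3] * 8 for _ in range(8)]
block_b = [[1] * 8 for _ in range(8)]
print(sad_8x8(block_a, block_b))  # prints 128
```

In motion estimation, this value is computed for each candidate position of the reference block, and the candidate with the smallest SAD is taken as the best match.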

SAD Computation Dataflow
• 3-cycle throughput
• Generates two partial sums of positive differences
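The dataflow figure is not reproduced in the text, so the following is only one possible reading of "two partial sums of positive differences": accumulate x - y where it is positive and y - x where that is positive, then add the two sums. This relies on the identity |x - y| = max(x - y, 0) + max(y - x, 0) and avoids an explicit absolute-value step. The function name is hypothetical.

```python
def sad_partial_sums(X, Y):
    """Return (sum of positive x - y, sum of positive y - x) over two equal-shape blocks."""
    x_minus_y = 0
    y_minus_x = 0
    for x_row, y_row in zip(X, Y):
        for x, y in zip(x_row, y_row):
            d = x - y
            if d > 0:
                x_minus_y += d      # positive difference in one direction
            else:
                y_minus_x += -d     # positive difference in the other direction
    return x_minus_y, y_minus_x

# Usage on a 2x2 example; the SAD is the sum of the two partial sums.
pos, neg = sad_partial_sums([[5, 1], [2, 2]], [[3, 4], [2, 0]])
print(pos, neg, pos + neg)  # prints 4 3 7
```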

SAD Performance
• FPOA resources: 35 ALUs, 1 RF (total: 36 SOs)
• Data throughput: 24 cycles per 8x8 block

Internal System Bus
• Links all processing modules and the external host to the external system memory for data accesses
• Host-controlled round-robin access from module to module
• User-defined packet format to make use of the 16-bit party line and minimize access overhead

System Bus Implementation

System Bus Performance
• FPOA resources: 20 ALUs
• Cycles
  • XRAM read: 4 cycles
  • XRAM write: 4 cycles
  • Module switch: 10 cycles
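These cycle counts allow a rough estimate of the bus time for one round-robin pass. The sketch below assumes that the 4-cycle figures apply per access, that accesses do not overlap, and that each module visit costs one 10-cycle switch; none of these details are stated in the presentation, and the function name is hypothetical.

```python
READ_CYCLES = 4      # XRAM read, from the slide
WRITE_CYCLES = 4     # XRAM write, from the slide
SWITCH_CYCLES = 10   # module switch, from the slide

def round_robin_cycles(schedule):
    """Total cycles for one pass; schedule is a list of (reads, writes) per module."""
    total = 0
    for reads, writes in schedule:
        total += SWITCH_CYCLES + reads * READ_CYCLES + writes * WRITE_CYCLES
    return total

# Usage: module A does 2 reads and 1 write, module B does 3 writes.
schedule = [(2, 1), (0, 3)]
cycles = round_robin_cycles(schedule)
overhead = len(schedule) * SWITCH_CYCLES / cycles
print(cycles, f"{overhead:.0%} switch overhead")
```

Under these assumptions, batching more accesses per module visit reduces the share of cycles spent on switching.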

Performance of an Example Space Satellite Application
• Processing throughput: about 10 million samples per second
• FPOA resources (% of a device with 400 SOs running at 400 MHz)
  • Cycle utilization: 21%
  • SO utilization: 51%
  • IRAM utilization: 25%
  • XRAM b/w: 49% (100 MHz DDR RLDRAM)
