
Correlator Options for 128T


Presentation Transcript


  1. Correlator Options for 128T

    MWA Cambridge Meeting, Roger Cappallo, MIT Haystack Observatory, 2011.6.6
  2. Current Status
     Correlator Hardware Inventory
     - 10 each of v.2 Correlator Boards, PFB Boards, CB/RTM’s, PFB/RTM’s
     - 2 full-size card cages + 1 small, with power supplies
     e2e simulation software
     - file input packets → module → file output packets
     PFB FPGA Firmware for 32T
     - very limited de-skew capability
     - no inter-board transfer (via mesh backplane)
     - corner-turns specific to the 32T case
     - PFB to 10 kHz channels needs no changes
  3. Current Status (cont’d)
     CB FPGA Firmware
     - 32T: operational code uses every other 50 ms interval, though 100% duty-cycle code is available
     - 512T: error-free CMAC (only) code for 115 cells, working at 180 MHz
  4. 128T Correlator Requirements
     - 30.72 MHz BW in 24 coarse channels of 1.28 MHz
     - 256 inputs; 16 Rx’s with 48 fibres
     - 82.6 Gb/s aggregate bit rate
     - ~32K correlation products
     - F stage: ~150 GCMAC/s (12-tap FIR, 40 kHz channels)
     - X stage: 1.01 TCMAC/s
     Key (compared to 32T): same / x4 / x16
     (a quick sanity check of these figures follows below)
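The headline figures on this slide can be reproduced from the input count and bandwidth alone. Below is a minimal sanity check, added here for illustration; only the input count, bandwidth, and aggregate bit rate are taken from the slide, the rest is arithmetic, and the bits-per-sample remark is an inference rather than a stated design value.

```python
# Sanity check of the 128T requirement figures quoted above.
n_inputs = 256        # from the slide
bw_hz = 30.72e6       # 24 coarse channels x 1.28 MHz

# Correlation products: all input pairs, including autocorrelations
n_products = n_inputs * (n_inputs + 1) // 2
print(n_products)                         # 32896 (~32K, as quoted)

# X stage: one complex multiply-accumulate per product per complex sample
x_rate = n_products * bw_hz
print(f"{x_rate / 1e12:.2f} TCMAC/s")     # ~1.01 TCMAC/s, as quoted

# For reference, 82.6 Gb/s spread over 256 x 30.72 Msample/s is ~10.5 bits
# per complex sample, consistent with e.g. (5+5)-bit data plus packet overhead.
print(f"{82.6e9 / (n_inputs * bw_hz):.1f} bits per complex sample")
```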
  5. Top Level Choices
     - hardware: use current hardware, developing FPGA firmware as necessary
     - software: get Rx signals into a standardized format (10 GigE) ASAP; do PFB and correlation in GPU-equipped servers
     - hybrid: use existing PFB’s for the F stage and to form 10 GigE packets to be correlated in software
  6. Hardware Solution
     - Using existing 32T firmware it should take 4 PFB boards and 16 CB’s, but the architecture doesn’t scale in a fully-parallel sense due to the cross-correlations, so it would really take 6 PFB’s and 18 CB’s, with firmware mods
     - Unchanged 32T firmware leads to a system with 20 PFB’s and 20 CB’s!
     - Using the tested CMAC design (115 cells @ 180 MHz) yields enough computation in ~6.5 CB’s; the optimal partition appears to be 8 PFB’s and 8 CB’s
     (a rough check of the ~6.5 CB figure follows below)
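As a rough cross-check of the "~6.5 CB’s" estimate: the cell count and clock are from slide 3, but the number of CMAC FPGAs per Correlator Board is an assumed value used only for illustration, not a figure from the presentation.

```python
# Rough check of the "enough computation in ~6.5 CB's" estimate.
x_rate = 1.01e12          # required X-stage rate (CMAC/s), from slide 4
cells_per_fpga = 115      # tested CMAC design (slide 3)
clock_hz = 180e6          # achieved clock (slide 3)
fpgas_per_cb = 8          # ASSUMPTION: CMAC FPGAs per Correlator Board

per_cb_rate = cells_per_fpga * clock_hz * fpgas_per_cb
print(f"per-CB capacity: {per_cb_rate / 1e9:.0f} GCMAC/s")   # ~166 GCMAC/s
print(f"CB's needed:     {x_rate / per_cb_rate:.1f}")        # ~6.1 at 100% cell utilization
```

With less-than-perfect cell utilization this lands close to the ~6.5 boards quoted above.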
  7. 18 CB System
     - split the system into thirds, each getting 8 coarse channels
     - each PFB gets 8 input fibres (need to do deskew)
     - routing logic on the CB’s changes; the CMAC’s stay the same
  8. 18 CB Hardware Assessment
     PRO
     - relatively minor FPGA design work on the PFB
     - modest amount of change to FPGA code on the CB’s
     - system interfaces all tested and working
     - use is made of all purpose-built boards
     CON
     - another build of ~10 CB’s (and CB/RTM’s) necessary (~120 k$)
  9. 8 CB System
     - each PFB gets 6 input fibres total, from 2 Rx’s
     - each PFB outputs to 8 different CB’s
     - CB uses the CMAC design from 512T at only 80% of its achieved speed
     - CB needs some cleverness in allocating cells to CMAC chips
     - the LTA could be skipped due to the low output rate (10 Hz dump rate)
  10. 8 CB Hardware Assessment
     PRO
     - no additional cost for hardware
     - relatively minor FPGA design work on the PFB
     - system interfaces all tested and working
     - use is made of all purpose-built boards
     CON
     - significant amount of modified FPGA code on the CB
  11. Software Solution
     - put Rx coarse-channel data into 10 GigE packets, e.g. by modifying the AgFo design or using OTS programmable modules (a la 2PIP)
     - F stage in host servers or GPU’s
     - do the X stage in multiple GPU’s
  12. GPU Correlation
     - Wayth et al. (2009) correlated 1 coarse channel for 32T in real time, using a single Nvidia C1060 GPU
     - How can we gain a factor of 24 x 16 = 384 in performance?
       - 4x duty cycle: Wayth’s code did 1 s of processing in 0.19 s
       - 2x memory-BW reduction: with a channel width of 40 kHz a larger block can be fit into shared memory
       - 2x by using a smaller word size (4 Re + 4 Im bits)
       - Tesla C2050 has triple the shared memory of the C1060
       - integer arithmetic uses less shared-memory space
       - multiple GPU units in parallel
     (an illustrative sketch of the X-stage arithmetic follows below)
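To make the "4 Re + 4 Im bits" word size and the cross-multiply-and-accumulate step concrete, here is a small NumPy sketch of the X-stage arithmetic. It is purely illustrative: the function names and array shapes are hypothetical, it is not Wayth’s code, and a real-time version would run as a CUDA kernel rather than in NumPy.

```python
import numpy as np

def unpack_4plus4(packed):
    """Unpack bytes holding a 4-bit real part (high nibble) and a 4-bit imaginary
    part (low nibble), each a signed two's-complement value in [-8, 7]."""
    re = (packed.astype(np.int16) >> 4) & 0xF
    im = packed.astype(np.int16) & 0xF
    re[re > 7] -= 16
    im[im > 7] -= 16
    return re.astype(np.float32) + 1j * im.astype(np.float32)

def x_stage(packed):
    """Cross multiply & accumulate for one fine channel over one integration.
    packed: (n_inputs, n_samples) uint8 array of 4+4-bit samples.
    Returns the (n_inputs, n_inputs) visibility matrix, each baseline kept
    once in the upper triangle."""
    v = unpack_4plus4(packed)
    vis = v @ v.conj().T          # the CMAC operation, summed over samples
    return np.triu(vis)

# Toy usage: 8 inputs, 1000 samples of one 40 kHz channel
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(8, 1000), dtype=np.uint8)
print(x_stage(data).shape)        # (8, 8)
```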
  13. GPU Bottlenecks
     - NIC input rate: max of 7 or 8 Gb/s to the host
     - Host → Device BW (set by the PCIe bus): PCIe gen 2 x16 spec max of 8 GB/s
     - Global memory ↔ processor BW: spec max for the C2050 is 144 GB/s
     - Multiply & accumulate rate: spec max for the C2050 is 1.01 Tflops (single precision or 32-bit int)
     (a quick comparison against the required rates follows below)
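The required rates from slide 4 can be set against these bottleneck figures directly. The sketch below is a rough estimate only: the usable per-NIC throughput and the real-ops-per-CMAC count are assumptions, and no efficiency factors are included.

```python
# How the required rates sit against the bottleneck figures quoted above.
aggregate_in_gbps = 82.6    # total input rate, from slide 4
nic_gbps = 7.5              # ASSUMED usable throughput per NIC ("7 or 8 Gb/s")
x_rate_tcmacs = 1.01        # required X-stage rate, from slide 4
c2050_tflops = 1.01         # C2050 spec max (single precision / 32-bit int)
ops_per_cmac = 8            # a complex multiply-accumulate is ~8 real ops

# Minimum server count set purely by NIC ingest
print(f"servers (NIC-limited):  >= {aggregate_in_gbps / nic_gbps:.0f}")                    # ~11

# Minimum GPU count set purely by arithmetic throughput at spec peak
print(f"GPUs (compute-limited): >= {x_rate_tcmacs * ops_per_cmac / c2050_tflops:.0f}")     # ~8
```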
  14. Software Assessment
     PRO
     - greatest flexibility, as all code is in software
     - switched topology allows a good match between the number of servers and the load
     - easily expandable
     CON
     - format conversion to 10 GigE will require some mixture of hardware acquisition and FPGA coding
     - acquisition cost of GPU-equipped servers
  15. Hybrid System
     - a modified PFB output stage in the INF chip forms 10 GigE packets
     - 4 lanes through a CX-4 connector to a unidirectional optical transceiver
     - GPU-equipped servers only do the 4+4 bit cross multiply & sum
     - 8 PFB’s used, 6 inputs each; 1 stream of 8 Gb/s per PFB output
     - more real estate
  16. Hybrid Assessment
     PRO
     - little additional cost to convert the data to 10 GigE
     - minimal FPGA design work
     - relieves the GPU of the filtering burden
     - switched topology allows a good match between the number of servers and the load
     - easily expandable
     CON
     - some risk in the unidirectional 10 GigE transceiver mods
     - acquisition cost of GPU-equipped servers
  17. Level of Effort (none / modest / significant)