
Correlator Options for 128T


Presentation Transcript


  1. Correlator Options for 128T

    MWA Cambridge Meeting, Roger Cappallo, MIT Haystack Observatory, 2011.6.6
  2. Current Status
     Correlator Hardware Inventory
     - 10 each of v.2 Correlator Boards, PFB Boards, CB/RTM’s, PFB/RTM’s
     - 2 full-size card cages + 1 small, with power supplies
     e2e simulation software
     - file input packets → module → file output packets
     PFB FPGA Firmware for 32T
     - very limited de-skew capability
     - no inter-board transfer (via mesh backplane)
     - corner-turns specific to the 32T case
     - PFB to 10 kHz channels needs no changes
  3. Current Status (cont’d)
     CB FPGA Firmware
     - 32T: operational code uses every other 50 ms interval, though 100% duty-cycle code is available
     - 512T: error-free CMAC (only) code for 115 cells, working at 180 MHz
  4. 128T Correlator Requirements
     - 30.72 MHz BW in 24 coarse channels of 1.28 MHz
     - 256 inputs; 16 Rx’s with 48 fibres
     - 82.6 Gb/s aggregate bit rate
     - ~32K correlation products
     - F stage: ~150 GCMAC/s (12-tap FIR, 40 kHz channels)
     - X stage: 1.01 TCMAC/s
     Key (compared to 32T): same / x4 / x16
     (a quick sanity check of these figures follows below)
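The headline figures on this slide can be reproduced from the input count and bandwidth alone. Below is a minimal sanity check, added here for illustration; only the input count, bandwidth, and aggregate bit rate are taken from the slide, the rest is arithmetic, and the bits-per-sample remark is an inference rather than a stated design value.

```python
# Sanity check of the 128T requirement figures quoted above.
n_inputs = 256        # from the slide
bw_hz = 30.72e6       # 24 coarse channels x 1.28 MHz

# Correlation products: all input pairs, including autocorrelations
n_products = n_inputs * (n_inputs + 1) // 2
print(n_products)                         # 32896 (~32K, as quoted)

# X stage: one complex multiply-accumulate per product per complex sample
x_rate = n_products * bw_hz
print(f"{x_rate / 1e12:.2f} TCMAC/s")     # ~1.01 TCMAC/s, as quoted

# For reference, 82.6 Gb/s spread over 256 x 30.72 Msample/s is ~10.5 bits
# per complex sample, consistent with e.g. (5+5)-bit data plus packet overhead.
print(f"{82.6e9 / (n_inputs * bw_hz):.1f} bits per complex sample")
```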
  5. Top Level Choices
     - hardware: use current hardware, developing FPGA firmware as necessary
     - software: get Rx signals into a standardized format (10 GigE) ASAP; do PFB and correlation in GPU-equipped servers
     - hybrid: use existing PFB’s for the F stage and to form 10 GigE packets to be correlated in software
  6. Hardware Solution
     - Using existing 32T firmware it should take 4 PFB boards and 16 CB’s, but the architecture doesn’t scale in a fully-parallel sense due to the cross-correlations, so it would really take 6 PFB’s and 18 CB’s, with firmware mods
     - Unchanged 32T firmware leads to a system with 20 PFB’s and 20 CB’s!
     - Using the tested CMAC design (115 cells @ 180 MHz) yields enough computation in ~6.5 CB’s; the optimal partition appears to be 8 PFB’s and 8 CB’s
     (a rough check of the ~6.5 CB figure follows below)
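As a rough cross-check of the "~6.5 CB’s" estimate: the cell count and clock are from slide 3, but the number of CMAC FPGAs per Correlator Board is an assumed value used only for illustration, not a figure from the presentation.

```python
# Rough check of the "enough computation in ~6.5 CB's" estimate.
x_rate = 1.01e12          # required X-stage rate (CMAC/s), from slide 4
cells_per_fpga = 115      # tested CMAC design (slide 3)
clock_hz = 180e6          # achieved clock (slide 3)
fpgas_per_cb = 8          # ASSUMPTION: CMAC FPGAs per Correlator Board

per_cb_rate = cells_per_fpga * clock_hz * fpgas_per_cb
print(f"per-CB capacity: {per_cb_rate / 1e9:.0f} GCMAC/s")   # ~166 GCMAC/s
print(f"CB's needed:     {x_rate / per_cb_rate:.1f}")        # ~6.1 at 100% cell utilization
```

With less-than-perfect cell utilization this lands close to the ~6.5 boards quoted above.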
  7. 18 CB System
     - split the system into thirds, each getting 8 coarse channels
     - each PFB gets 8 input fibres (need to do deskew)
     - routing logic on the CB’s changes; the CMAC’s stay the same
  8. 18 CB Hardware Assessment
     PRO
     - relatively minor FPGA design work on the PFB
     - modest amount of change to FPGA code on the CB’s
     - system interfaces all tested and working
     - use is made of all purpose-built boards
     CON
     - another build of ~10 CB’s (and CB/RTM’s) necessary (~120 k$)
  9. 8 CB System
     - each PFB gets 6 input fibres total, from 2 Rx’s
     - each PFB outputs to 8 different CB’s
     - CB uses the CMAC design from 512T at only 80% of its achieved speed
     - CB needs some cleverness in allocating cells to CMAC chips
     - the LTA could be skipped due to the low output rate (10 Hz dump rate)
  10. 8 CB Hardware Assessment
     PRO
     - no additional cost for hardware
     - relatively minor FPGA design work on the PFB
     - system interfaces all tested and working
     - use is made of all purpose-built boards
     CON
     - significant amount of modified FPGA code on the CB
  11. Software Solution
     - put Rx coarse-channel data into 10 GigE packets, e.g. by modifying the AgFo design or using OTS programmable modules (a la 2PIP)
     - F stage in host servers or GPU’s
     - do the X stage in multiple GPU’s
  12. GPU Correlation
     - Wayth et al. (2009) correlated 1 coarse channel for 32T in real time, using a single Nvidia C1060 GPU
     - How can we gain a factor of 24 x 16 = 384 in performance?
       - 4x duty cycle: Wayth’s code did 1 s of processing in 0.19 s
       - 2x memory-BW reduction: with a channel width of 40 kHz a larger block can be fit into shared memory
       - 2x by using a smaller word size (4 Re + 4 Im bits)
       - Tesla C2050 has triple the shared memory of the C1060
       - integer arithmetic uses less shared-memory space
       - multiple GPU units in parallel
     (an illustrative sketch of the X-stage arithmetic follows below)
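To make the "4 Re + 4 Im bits" word size and the cross-multiply-and-accumulate step concrete, here is a small NumPy sketch of the X-stage arithmetic. It is purely illustrative: the function names and array shapes are hypothetical, it is not Wayth’s code, and a real-time version would run as a CUDA kernel rather than in NumPy.

```python
import numpy as np

def unpack_4plus4(packed):
    """Unpack bytes holding a 4-bit real part (high nibble) and a 4-bit imaginary
    part (low nibble), each a signed two's-complement value in [-8, 7]."""
    re = (packed.astype(np.int16) >> 4) & 0xF
    im = packed.astype(np.int16) & 0xF
    re[re > 7] -= 16
    im[im > 7] -= 16
    return re.astype(np.float32) + 1j * im.astype(np.float32)

def x_stage(packed):
    """Cross multiply & accumulate for one fine channel over one integration.
    packed: (n_inputs, n_samples) uint8 array of 4+4-bit samples.
    Returns the (n_inputs, n_inputs) visibility matrix, each baseline kept
    once in the upper triangle."""
    v = unpack_4plus4(packed)
    vis = v @ v.conj().T          # the CMAC operation, summed over samples
    return np.triu(vis)

# Toy usage: 8 inputs, 1000 samples of one 40 kHz channel
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(8, 1000), dtype=np.uint8)
print(x_stage(data).shape)        # (8, 8)
```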
  13. GPU Bottlenecks
     - NIC input rate: max of 7 or 8 Gb/s to the host
     - Host → Device BW (set by the PCIe bus): PCIe gen 2 x16 spec max of 8 GB/s
     - Global memory ↔ processor BW: spec max for the C2050 is 144 GB/s
     - Multiply & accumulate rate: spec max for the C2050 is 1.01 Tflops (single precision or 32-bit int)
     (a quick comparison against the required rates follows below)
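The required rates from slide 4 can be set against these bottleneck figures directly. The sketch below is a rough estimate only: the usable per-NIC throughput and the real-ops-per-CMAC count are assumptions, and no efficiency factors are included.

```python
# How the required rates sit against the bottleneck figures quoted above.
aggregate_in_gbps = 82.6    # total input rate, from slide 4
nic_gbps = 7.5              # ASSUMED usable throughput per NIC ("7 or 8 Gb/s")
x_rate_tcmacs = 1.01        # required X-stage rate, from slide 4
c2050_tflops = 1.01         # C2050 spec max (single precision / 32-bit int)
ops_per_cmac = 8            # a complex multiply-accumulate is ~8 real ops

# Minimum server count set purely by NIC ingest
print(f"servers (NIC-limited):  >= {aggregate_in_gbps / nic_gbps:.0f}")                    # ~11

# Minimum GPU count set purely by arithmetic throughput at spec peak
print(f"GPUs (compute-limited): >= {x_rate_tcmacs * ops_per_cmac / c2050_tflops:.0f}")     # ~8
```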
  14. Software Assessment
     PRO
     - greatest flexibility, as all code is in software
     - switched topology allows a good match between the number of servers and the load
     - easily expandable
     CON
     - format conversion to 10 GigE will require some mixture of hardware acquisition and FPGA coding
     - acquisition cost of GPU-equipped servers
  15. Hybrid System
     - a modified PFB output stage in the INF chip forms 10 GigE packets
     - 4 lanes through a CX-4 connector to a unidirectional optical transceiver
     - GPU-equipped servers only do the 4+4 bit cross multiply & sum
     - 8 PFB’s used, 6 inputs each; 1 stream of 8 Gb/s per PFB output
     - more real estate
  16. Hybrid Assessment
     PRO
     - little additional cost to convert the data to 10 GigE
     - minimal FPGA design work
     - relieves the GPU of the filtering burden
     - switched topology allows a good match between the number of servers and the load
     - easily expandable
     CON
     - some risk in the unidirectional 10 GigE transceiver mods
     - acquisition cost of GPU-equipped servers
  17. Level of Effort (none / modest / significant)