
Design and Testing of GPU Server based RTC for TMT NFIRAOS






Presentation Transcript


  1. Design and Testing of GPU Server based RTC for TMT NFIRAOS • Lianqi Wang • AO4ELT3, Florence, Italy, 5/31/2013 • Thirty Meter Telescope Project

  2. Introduction • Part of the on-going RTC architecture trade study for NFIRAOS • Handles the MCAO and NGS SCAO modes • Iterative algorithms • Hardware: FPGA, GPU, etc. • Matrix vector multiply (MVM) algorithm • Made possible by using GPUs to compute the control matrix • Hardware: GPU, Xeon Phi, or just CPUs • In this talk • GPU server based RTC design • Benchmarking the 10 GbE interface and the whole real time chain • Updating the control matrix as seeing changes

  3. Most Challenging Requirements • LGS WFS reconstruction: • Processing pixels from 6 order 60x60 LGS WFS • 2896 subapertures per WFS • An average of 70 pixels per subaperture (0.4 MB per WFS, 2.3 MB total) • CCD read out: 0.5 ms • Compute DM commands from gradients • 6981 active (7673 total) DM actuators from ~35k gradients • Using iterative algorithms or matrix vector multiply (MVM) • At 800 Hz, everything has to finish in ~1.25 ms • This talk focuses on MVM in GPUs.
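A quick check of the pixel budget behind these numbers (a worked estimate; 16-bit pixels are an assumption, chosen because they reproduce the 0.4 MB per WFS figure on the slide):

```latex
% Pixel data volume per frame
2896~\text{subaps} \times 70~\tfrac{\text{pixels}}{\text{subap}} \times 2~\tfrac{\text{B}}{\text{pixel}}
  \approx 0.4~\text{MB per WFS}, \qquad 6 \times 0.4~\text{MB} \approx 2.3~\text{MB total}

% Rate needed to move one WFS frame within the 0.5 ms readout
0.4~\text{MB} \,/\, 0.5~\text{ms} \approx 800~\text{MB/s per WFS link}
```

The last line is where the 800 MB/s per-link requirement quoted on slide 6 comes from.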

  4. Number of GPUs for Applying MVM • 925 MB control matrix • Assuming 1.00 ms total time → 2 GPUs per WFS (dual GPU board) • 1: http://www.intel.com/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf
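The two-GPUs-per-WFS figure is a memory bandwidth argument: MVM streams the entire control matrix out of device memory once per frame, so bandwidth rather than arithmetic throughput sets the GPU count. A rough estimate (single-precision storage matches the 925 MB figure; the sustained per-GPU bandwidth of roughly 75-100 GB/s is an assumption about GTX 580/K10-class hardware, not a number from the slide):

```latex
% Control matrix size and the share handled per WFS subsystem
34752 \times 6981 \times 4~\text{B} \approx 925~\text{MB}, \qquad
  925~\text{MB} / 6~\text{WFS} \approx 154~\text{MB per WFS}

% Streaming 154 MB within the ~1.0 ms budget
154~\text{MB} \,/\, 1.0~\text{ms} \approx 154~\text{GB/s}
  \;\Rightarrow\; \text{about 2 GPUs per WFS at realistic sustained bandwidth}
```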

  5. A Possible GPU RTC Architecture

  6. LGS WFS to RTC Interface • 10 GbE exceeds the requirement of 800 MB/s • Widely deployed • Low cost • Alternatives: sFPDP, Camera Link, etc. • Testing with our Emulex 10 GbE PCI-E board • Plain TCP/IP • No special features
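Since the interface is plain TCP/IP with no special features, the receive side of the benchmark reduces to standard sockets. A minimal sketch, assuming an illustrative port and a fixed ~0.4 MB frame size (neither is specified in the slides); TCP_NODELAY turns off Nagle batching so each frame is delivered as soon as it is written:

```c
/* Receiver: accept one connection and read fixed-size WFS pixel frames. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define FRAME_BYTES (400 * 1024)   /* ~0.4 MB of pixels per WFS frame */
#define PORT 9000                  /* illustrative */

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0) {
        perror("bind/listen");
        return 1;
    }

    int conn = accept(listener, NULL, NULL);
    setsockopt(conn, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* no Nagle delay */

    static char frame[FRAME_BYTES];
    for (;;) {
        size_t got = 0;
        while (got < FRAME_BYTES) {     /* TCP is a byte stream: loop until */
            ssize_t n = read(conn, frame + got, FRAME_BYTES - got);
            if (n <= 0) { close(conn); close(listener); return 0; }
            got += (size_t)n;
        }
        /* frame[] now holds one WFS frame; hand it to the GPU pipeline. */
    }
}
```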

  7. A Sends 0.4 MB of Pixels to B (100,000 Frames) • Latency: median 0.465 ms → 880 MB/s • Jitter: RMS 13 µs, P/V 115 µs • CPU load: 100% of one core

  8. B Sends 0.4 MB of Pixels to A (100,000 Frames) • Latency: median 0.433 ms → 950 MB/s • Jitter: RMS 6 µs, P/V 90 µs
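The latency and jitter statistics on the two preceding slides can be collected with a simple timing harness; the sketch below is assumed, not taken from the slides. It timestamps each 0.4 MB transfer with CLOCK_MONOTONIC and reports the median, RMS, and peak-to-valley; the send/acknowledge calls are hypothetical placeholders:

```c
/* Timing harness: record per-frame latency and summarize the distribution. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NFRAMES 100000

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    static double lat[NFRAMES];

    for (int i = 0; i < NFRAMES; i++) {
        double t0 = now_ms();
        /* send_frame(conn, frame, FRAME_BYTES);   hypothetical: full write() */
        /* wait_for_ack(conn);                     loop as in the receiver    */
        lat[i] = now_ms() - t0;
    }

    double mean = 0, rms = 0;
    for (int i = 0; i < NFRAMES; i++) mean += lat[i];
    mean /= NFRAMES;
    for (int i = 0; i < NFRAMES; i++) rms += (lat[i] - mean) * (lat[i] - mean);
    rms = sqrt(rms / NFRAMES);

    qsort(lat, NFRAMES, sizeof(double), cmp_double);
    printf("median %.3f ms, RMS jitter %.1f us, P/V %.1f us\n",
           lat[NFRAMES / 2], rms * 1e3, (lat[NFRAMES - 1] - lat[0]) * 1e3);
    return 0;
}
```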

  9. Server with GPUs • [Block diagram: 1U or 2U chassis; CPU 1 and CPU 2 with memory; an x8 PCIe 10 GbE adapter receiving pixels from the WFS detector controllers (WFS DC); two K10 GPU modules in x16 PCIe slots; built-in 10 GbE link to the central server; an x16 slot for the RPG (not accessible in the 1U form)] • Each K10 GPU module has two GPUs

  10. Benchmarking the End to End Latency to Qualify this Architecture • Simulation scheme: Server A emulates one LGS WFS camera and sends pixels over TCP on 10 GbE to Server B, which emulates 1/6 of the RTC and sends DM commands to the central server

  11. Benchmarking Hardware Configuration • Hardware • Server A: dual Xeon W5590 @ 3.33 GHz • Server B: Intel Core i7 3820 @ 3.60 GHz • Two NVIDIA GTX 580 GPU boards • or one NVIDIA GTX 590 dual GPU board • Emulex OCe11102-NX PCIe 2.0 10GBase-CR • Connected back to back

  12. Benchmarking Software Configuration • Server optimizations for low latency • Run a real-time preempt Linux kernel (rt-linux-3.6) • CPU affinity set to core 0 • Lock all memory pages • Scheduler: SCHED_FIFO at maximum priority • Priority: -18 • Disable hyper-threading • Disable virtualization technology and VT-d in the BIOS (critical) • Disable all non-essential services • No other tasks running • CUDA 4.0 C runtime library • Vanilla TCP/IP
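Most of the per-process items in this list map onto a handful of standard Linux calls at start-up. A sketch under those assumptions (the priority value -18 and core 0 come from the slide; the BIOS settings, hyper-threading, and the PREEMPT_RT kernel cannot be set from code):

```c
/* Low-latency process setup: lock memory, pin to core 0, SCHED_FIFO at
 * maximum real-time priority, nice value -18. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static void setup_realtime(void) {
    /* Lock all current and future pages to avoid page-fault latency. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Pin the process to CPU core 0. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* SCHED_FIFO at the maximum real-time priority. */
    struct sched_param sp = { .sched_priority = sched_get_priority_max(SCHED_FIFO) };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* Nice value -18, as listed on the slide. */
    if (setpriority(PRIO_PROCESS, 0, -18) != 0)
        perror("setpriority");
}

int main(void) {
    setup_realtime();
    /* ... real-time receive/compute loop ... */
    return 0;
}
```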

  13. Pipelining Scheme • Benchmarked for an LGS WFS • MVM takes most of the time • Memory copying is indeed concurrent with computing • [Timing diagram, not to scale: WFS read out (0.5 ms); three streams, synchronized with events, overlap work between the frame start and the next frame 1.25 ms later]
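One way to realize this three-stream overlap with the CUDA runtime is sketched below. The shape of the pipeline follows the slide; the kernel, the use of cuBLAS for the partial MVM, the fixed 70-pixel-per-subaperture layout, and all names are illustrative assumptions, not the actual RTC code:

```cuda
/* Pipelined per-WFS processing across 3 CUDA streams (illustrative). Each
 * stream handles one slice of subapertures: async H2D pixel copy, gradient
 * kernel, then a partial MVM over the matching block of matrix columns. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NSTREAM 3
#define NSA     2896          /* subapertures per LGS WFS                   */
#define NACT    6981          /* active DM actuators                        */
#define NPIXSA  70            /* treat the average 70 pixels/subap as fixed */

__global__ void gradients(const short *pix, float *grad, int nsa) {
    /* Stub: the real kernel computes matched-filter/CoG gradients from each
     * subaperture's pixels; omitted to keep the sketch short. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nsa) { grad[2 * i] = 0.f; grad[2 * i + 1] = 0.f; (void)pix; }
}

void process_frame(cublasHandle_t blas,
                   const short *pix_host,   /* pinned host pixel buffer     */
                   short *pix_dev, float *grad_dev,
                   const float *mvm_dev,    /* NACT x 2*NSA, column-major   */
                   float *act_partial,      /* NSTREAM partial DM vectors   */
                   cudaStream_t stream[NSTREAM]) {
    const float one = 1.0f, zero = 0.0f;
    int sa_per = (NSA + NSTREAM - 1) / NSTREAM;

    for (int s = 0; s < NSTREAM; s++) {
        int sa0 = s * sa_per;
        int nsa = (sa0 + sa_per <= NSA) ? sa_per : NSA - sa0;
        size_t pix_off  = (size_t)sa0 * NPIXSA;
        size_t grad_off = (size_t)sa0 * 2;

        /* 1. Copy this slice's pixels as soon as the camera delivers them. */
        cudaMemcpyAsync(pix_dev + pix_off, pix_host + pix_off,
                        (size_t)nsa * NPIXSA * sizeof(short),
                        cudaMemcpyHostToDevice, stream[s]);

        /* 2. Gradients for this slice. */
        gradients<<<(nsa + 127) / 128, 128, 0, stream[s]>>>(
            pix_dev + pix_off, grad_dev + grad_off, nsa);

        /* 3. Partial MVM: the control-matrix columns that multiply this
         *    slice's gradients, accumulated into a per-stream vector.      */
        cublasSetStream(blas, stream[s]);
        cublasSgemv(blas, CUBLAS_OP_N, NACT, 2 * nsa, &one,
                    mvm_dev + grad_off * NACT, NACT,
                    grad_dev + grad_off, 1, &zero,
                    act_partial + (size_t)s * NACT, 1);
    }
    /* After all streams finish (event/stream sync), sum the NSTREAM partial
     * actuator vectors and send the DM commands. */
}
```

Accumulating into per-stream partial vectors and reducing them afterwards avoids a write race between streams while still letting the copies, gradient kernels, and MVM overlap.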

  14. End to End Timing for 100 AO Seconds (2x GTX 580) • Latency mean: 0.85 ms • Jitter RMS: 12 µs • Jitter P/V: 114 µs

  15. End to End Timing for 9 AO Hours (2x GTX 580) • Latency mean: 0.85 ms • Jitter RMS: 13 µs • Jitter P/V: 144 µs

  16. End to End Timing for 100 AO Seconds (GTX 590, dual GPU board) • Latency mean: 0.96 ms • Jitter RMS: 9 µs • Jitter P/V: 111 µs

  17. Summary • For one LGS WFS • With 1 NVIDIA GTX 590 board and a 10 GbE interface • <1 ms from end of exposure to DM commands ready • Pixel read out and transport (0.5 ms) • Gradient computation • MVM computation • The RTC is made up of 6 such subsystems • And one more for soft real-time or background tasks • Plus one more for an (online) spare.

  18. Compute the Control Matrix • By solving the minimum variance reconstructor in GPUs: (34752×6981) = (34752×62311) × (62311×62311)^-1 × (62311×6981) × (6981×6981)^-1 • The Cn2 profile is used for regularization • It needs to be updated frequently • The control matrix needs to be updated, using warm restart FDPCG → how many iterations? • Benchmarking results of iterative algorithms in GPUs
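Warm restart means seeding the iterative solve with the solution obtained for the previous Cn2 profile instead of starting from zero, which is why FD10 or FD20 can replace FD100. The generic preconditioned CG skeleton below illustrates just that idea; the actual FDPCG preconditioner and the MAOS tomography operators are not shown, and `apply_A`/`apply_precond` are assumed callbacks:

```c
/* Preconditioned conjugate gradient with warm restart: the caller passes
 * the previous solution in x[], which is used as the starting guess. */
#include <stdlib.h>
#include <string.h>

typedef void (*op_t)(const double *in, double *out, int n, void *ctx);

/* Run a fixed number of PCG iterations on A x = b (e.g. 10/20 warm, ~100 cold). */
void pcg_warm(op_t apply_A, op_t apply_precond, void *ctx,
              const double *b, double *x, int n, int maxit) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);

    apply_A(x, Ap, n, ctx);                       /* r = b - A x0 (warm start) */
    for (int i = 0; i < n; i++) r[i] = b[i] - Ap[i];
    apply_precond(r, z, n, ctx);
    memcpy(p, z, n * sizeof *p);

    double rz = 0;
    for (int i = 0; i < n; i++) rz += r[i] * z[i];

    for (int it = 0; it < maxit; it++) {
        apply_A(p, Ap, n, ctx);
        double pAp = 0;
        for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
        double alpha = rz / pAp;
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }

        apply_precond(r, z, n, ctx);
        double rz_new = 0;
        for (int i = 0; i < n; i++) rz_new += r[i] * z[i];
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(Ap);
}
```

A cold start is the same call with x[] zeroed first; the FD10/FD20 warm vs FD100 cold comparison on the following slides is exactly this difference in starting guess and iteration count.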

  19. Benchmarking Results of Iterative Algorithms for Tomography • CG: conjugate gradient • FD: Fourier domain preconditioned CG • OSn: oversampling n tomography layers (¼ m spacing)

  20. How Often to Update the Reconstructor? • Tony Travouillon (TMT) computed the covariance function of the Cn2 profile from 5 years of site testing data • One function per layer, independent between layers • Measurements have a minimum separation of 1 minute • → Interpolate the covariance to a finer time scale • Generate a Cn2 profile time series at 10 second separation • Pick a case every 500 seconds (60 total for an 8 hour duration) • Run MAOS simulations (2000 time steps) with • Control matrix initialized with the “true” Cn2 profile: FD100 Cold (start) • Control matrix updated using results from 10 seconds earlier: FD10 or FD20 Warm (restart)

  21. Distribution of Seeing of the 60 Cases • [Plot: r0 for each case and r0 at 10 seconds earlier, vs. case number]

  22. Performance of MVM • FD100 Cold vs CG30 (baseline iterative algorithm) • Only 1 point where MVM is worse than CG30

  23. MVM Control Matrix Update (FD10/20 Warm vs FD100 Cold) • 8 GPUs can do FD10 in 10 seconds.

  24. Conclusions • We demonstrated that the end to end hard real time processing of the RTC can be done using 10 GbE and GPUs • Pixel transfer over 10 GbE • Gradient computation and MVM in 1 GPU board per WFS • Total latency is 0.95 ms; P/V jitter is ~0.1 ms • Control matrix update • FD10 warm restart with a 10 second stale Cn2 profile is largely sufficient.

  25. Acknowledgements • The author gratefully acknowledges the support of the TMT collaborating institutions. They are • the Association of Canadian Universities for Research in Astronomy (ACURA), • the California Institute of Technology, • the University of California, • the National Astronomical Observatory of Japan, • the National Astronomical Observatories of China and their consortium partners, • and the Department of Science and Technology of India and their supported institutes. • This work was supported as well by • the Gordon and Betty Moore Foundation, • the Canada Foundation for Innovation, • the Ontario Ministry of Research and Innovation, • the National Research Council of Canada, • the Natural Sciences and Engineering Research Council of Canada, • the British Columbia Knowledge Development Fund, • the Association of Universities for Research in Astronomy (AURA) • and the U.S. National Science Foundation.

  26. Other Tasks • Copying the updated MVM matrix to the RTC • Do so after the DM actuator commands are ready • Measured 0.1 ms for 10 columns • 579 time steps to copy 5790 columns • Collect statistics to update matched filter coefficients • Do so after the DM actuator commands are ready, or in CPUs • Complete before the next frame arrives.
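A sketch of that column-chunked background copy (the 10-columns-per-frame chunk and its ~0.1 ms cost come from the slide; the double-buffering scheme and all names are assumptions):

```c
/* Background update of the MVM control matrix, 10 columns per frame, into
 * a standby device buffer that is swapped in once the copy completes. */
#include <cuda_runtime.h>

#define NACT        6981
#define NCOL        5790      /* columns held by this RTC subsystem */
#define COLS_PER_TS 10        /* ~0.1 ms of copying per time step   */

typedef struct {
    float *active;            /* matrix used by the real-time MVM   */
    float *standby;           /* matrix being filled in background  */
    int    next_col;          /* next column to copy, -1 when idle  */
} mvm_update_t;

/* Call once per frame, after the DM actuator commands have been sent. */
void mvm_update_step(mvm_update_t *u, const float *new_matrix_host,
                     cudaStream_t copy_stream) {
    if (u->next_col < 0) return;                  /* no update pending */

    int ncol = NCOL - u->next_col;
    if (ncol > COLS_PER_TS) ncol = COLS_PER_TS;
    size_t off = (size_t)u->next_col * NACT;

    cudaMemcpyAsync(u->standby + off, new_matrix_host + off,
                    (size_t)ncol * NACT * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);
    u->next_col += ncol;

    if (u->next_col >= NCOL) {                    /* copy finished: swap */
        cudaStreamSynchronize(copy_stream);
        float *tmp  = u->active;
        u->active   = u->standby;
        u->standby  = tmp;
        u->next_col = -1;
    }
}
```

At 10 columns per frame, the full set of 5790 columns arrives in 579 time steps, matching the figure on the slide.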
