
“Status of GPU trigger”

Gianluca Lamanna (INFN)

On behalf of GAP collaboration

TDAQ, Liverpool, 28.8.2013

Two problems to use GPU in the trigger
  • Computing power: Is the GPU fast enough to take trigger decisions at event rates of tens of MHz?
    • Elena -> Use with the RICH
    • Jacopo -> Use with the Straw
  • Latency: Is the GPU latency per event small enough to cope with the tight latency requirements of a low-level trigger system in HEP? Is the latency stable enough for use in synchronous trigger systems? (Felice, Roberto, me + Alessandro, Piero, Andrea, etc.)
GPU processing

  • Example: packet of 1404 B (20 events in the NA62 RICH application)

[Figure: animated diagram of the data path NIC → chipset → CPU/RAM → PCI Express → GPU/VRAM, with a timeline marking cumulative latencies of roughly 0, 10, 99, 104, 134 and 139 us as the packet travels along the chain.]
  • The latency due to the data transfer from the data source to the system is more important than the latency due to the computing on the GPU.
  • It scales almost linearly with the data size (apart from the overheads), while the computing latency can be hidden by exploiting the GPU's huge resources.
  • The communication latency fluctuations are quite large (~50%). A minimal timing sketch is given below.
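As a rough illustration of separating the transfer latency from the computing latency, here is a minimal CUDA timing sketch (an illustration under simple assumptions, not the GAP collaboration's code): it times the host-to-device copy of one 1404 B packet and a dummy kernel using CUDA events.

```cuda
// Minimal sketch: time the H2D copy and the kernel separately with CUDA
// events, to see how much of the per-packet latency is data movement.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(const char* d_buf, int n) {
    // Placeholder for the real trigger algorithm.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { volatile char c = d_buf[i]; (void)c; }
}

int main() {
    const int packetSize = 1404;           // 20 NA62 RICH events, as in the example above
    char h_buf[packetSize];
    char* d_buf;
    cudaMalloc(&d_buf, packetSize);

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start);
    cudaMemcpyAsync(d_buf, h_buf, packetSize, cudaMemcpyHostToDevice);
    cudaEventRecord(afterCopy);
    dummyKernel<<<(packetSize + 255) / 256, 256>>>(d_buf, packetSize);
    cudaEventRecord(afterKernel);
    cudaEventSynchronize(afterKernel);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("H2D copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaFree(d_buf);
    return 0;
}
```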

Two approaches: PF_RING driver

[Figure: the same NIC → chipset → CPU/RAM → PCI Express → GPU/VRAM data path, with PF_RING delivering packets from the NIC directly into user-space memory.]

Fast packet capture from a standard NIC (PF_RING from ntop).

The data are written directly into user-space memory.

This skips the redundant copy into kernel memory space.

Works for both 1 Gb/s and 10 Gb/s.

Latency fluctuations could be reduced using a RTOS (Mauro P.).

Host-to-device copies and kernel execution are parallelized with three concurrent streams (see the sketch below).
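A minimal sketch of the three-stream pipelining mentioned above, assuming pinned host buffers and a hypothetical processEvents kernel standing in for the real trigger code: the host-to-device copy of one packet buffer overlaps with the kernel running on a previously copied one.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real trigger kernel.
__global__ void processEvents(const char* d_buf, int nBytes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nBytes) { volatile char c = d_buf[i]; (void)c; }
}

// Rotate over three CUDA streams so that the copy of one packet buffer
// overlaps with the kernel working on a previously copied one.
void runPipeline(char* h_buf[3], char* d_buf[3], int nBytes, int nIterations) {
    cudaStream_t stream[3];
    for (int s = 0; s < 3; ++s) cudaStreamCreate(&stream[s]);

    for (int iter = 0; iter < nIterations; ++iter) {
        int s = iter % 3;
        cudaStreamSynchronize(stream[s]);   // buffer s is free to be refilled
        // h_buf[s] must be pinned (cudaHostAlloc) for the copy to overlap.
        cudaMemcpyAsync(d_buf[s], h_buf[s], nBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        processEvents<<<(nBytes + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], nBytes);
    }
    for (int s = 0; s < 3; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }
}
```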

Tests

[Figure: test setup. A TEL62 board sends data over a 1 Gb/s link to the NIC of the PC hosting the GPU; a Start signal and two Stop signals are sent via the parallel port (lpt) to a scope.]

  • Dual-processor PC:
    • XEON E5-2620 @ 2 GHz
    • I350-T2 Gigabit card
    • 32 GB RAM
    • GPU K20c (2496 cores), PCIe v2 x16

Events are simulated in the TEL62.

They are grouped in MTPs.

The Start signal rises with the first event in the MTP.

First stop: arrival of the packet.

The data are buffered in the PC RAM (the buffer depth (GMTP) can be changed).

Second stop: after execution on the GPU (single-ring reconstruction kernel; a simplified sketch follows below).

The precision of the method has been evaluated to be better than 1 us.
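For orientation only, a deliberately simplified sketch of what a per-event ring reconstruction kernel could look like, with a hypothetical RichEvent layout and a centroid-based estimate rather than the actual single-ring algorithm used in the tests: one block processes one event.

```cuda
// Simplified stand-in for a single-ring reconstruction kernel (hypothetical
// data layout, NOT the collaboration's actual algorithm): one block per RICH
// event; the ring centre is estimated as the centroid of the hits and the
// radius as the mean hit-to-centre distance.
// Assumes 0 < nHits <= blockDim.x and compute capability >= 2.0
// (for atomicAdd on shared floats).

struct RichEvent { float x[64]; float y[64]; int nHits; };
struct Ring      { float cx, cy, r; };

__global__ void fitRings(const RichEvent* events, Ring* rings) {
    const RichEvent& ev = events[blockIdx.x];   // one event per block
    __shared__ float sx, sy, sr;
    if (threadIdx.x == 0) { sx = 0.f; sy = 0.f; sr = 0.f; }
    __syncthreads();

    // Accumulate the hit centroid (one hit per thread).
    if (threadIdx.x < ev.nHits) {
        atomicAdd(&sx, ev.x[threadIdx.x]);
        atomicAdd(&sy, ev.y[threadIdx.x]);
    }
    __syncthreads();
    float cx = sx / ev.nHits;
    float cy = sy / ev.nHits;

    // Mean distance of the hits from the centroid as a radius estimate.
    if (threadIdx.x < ev.nHits) {
        float dx = ev.x[threadIdx.x] - cx;
        float dy = ev.y[threadIdx.x] - cy;
        atomicAdd(&sr, sqrtf(dx * dx + dy * dy));
    }
    __syncthreads();
    if (threadIdx.x == 0) rings[blockIdx.x] = Ring{cx, cy, sr / ev.nHits};
}
```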

Data transfer time

[Plot: data transfer time, ~80 B/event.]

Using PF_RING, the latency (and its fluctuations) due to packet handling in the NIC is greatly reduced.

GPU computing time

It is better to accumulate a large number of events (GMTP) in order to exploit the computing cores available on the GPU (see the sketch below).
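Continuing the hypothetical fitRings sketch above, the batching argument in code: with one block per event, the GMTP size directly sets the grid size, and only a sufficiently large GMTP provides enough blocks to keep the K20c's 13 multiprocessors (2496 cores) busy.

```cuda
// One block per event: the GMTP size sets the grid size.
// A GMTP of only ~20 events leaves most of the K20c's 13 SMX units idle,
// while a GMTP of 256 or more events provides enough blocks to fill them.
void processGMTP(const RichEvent* d_events, Ring* d_rings, int nEvents) {
    fitRings<<<nEvents, 64>>>(d_events, d_rings);
}
```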

Total latency

[Plots: total latency for 256 GMTP compared with the NA62 latency; one measured with the Start on the last GMTP event, one with the Start on the first GMTP event.]

  • Latency timeout not implemented yet.

Two approaches: NANET


  • NANET is based on the Apenet+ card (collaboration with the Apenet group of INFN Rome).
  • Additional UDP protocol offload.
  • First non-NVIDIA device with a P2P connection to a GPU.
    • Joint development with NVIDIA.
  • Preliminary version implemented on an Altera DEV4 dev board with PCIe x8 Gen2 (Gen3 under study).
  • Modular structure: the link can be replaced (1 Gb/s, 10 Gb/s, SLINK, GBT, …).
NANET data transfer performance

Test with system loopback (data produced on the same PC and sent through a standard NIC to NANET).

The 50 us plateau in the latency is mainly due to the NIC used for data transmission.

Full bandwidth at 1 Gb/s.

The 10 Gb/s version will be ready soon.

Best performance with the Apenet link

Faster than InfiniBand for GPU-to-GPU transmission.

Latency at the level of 8 us with the Apenet link.

The link is implemented on a high-speed daughter card: in principle several types of link can be implemented (GBT, PCI Express (Gianmaria), InfiniBand, …).
