
L1 Event Reconstruction in the STS


Presentation Transcript


  1. L1 Event Reconstruction in the STS
  I. Kisel, GSI / KIP
  CBM Collaboration Meeting, Dubna, October 16, 2008

  2. Many-core HPC
  [Slide diagram: many-core hardware families, with open questions marked "?": Gaming STI: Cell; GP GPU Nvidia: Tesla; GP CPU Intel: Larrabee; CPU/GPU AMD: Fusion]
  • High performance computing (HPC)
  • Highest clock rate is reached
  • Performance/power optimization
  • Heterogeneous systems of many (>8) cores
  • Similar programming languages (Ct and CUDA)
  • We need a uniform approach to all CPU/GPU families
  • On-line event selection
  • Mathematical and computational optimization
  • SIMDization of the algorithm (from scalars to vectors; see the sketch below)
  • MIMDization (multi-threads, multi-cores)
  • Optimize the STS geometry (strips, sector navigation)
  • Smooth magnetic field
  Ivan Kisel, GSI
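
  As a hedged illustration of what "SIMDization (from scalars to vectors)" means in practice, the sketch below rewrites a scalar loop with SSE intrinsics so that four single-precision values are processed per instruction. It is a generic example under assumed names and data layout, not the actual L1 code.

      #include <xmmintrin.h>  // SSE intrinsics (assumed available on the target CPU)

      // Scalar version: one multiply-add per iteration.
      void scale_add_scalar(const float* x, const float* y, float* out,
                            float a, int n) {
          for (int i = 0; i < n; ++i)
              out[i] = a * x[i] + y[i];
      }

      // SIMDized version: four multiply-adds per iteration
      // (n assumed a multiple of 4, pointers 16-byte aligned).
      void scale_add_simd(const float* x, const float* y, float* out,
                          float a, int n) {
          __m128 va = _mm_set1_ps(a);           // broadcast a into all 4 lanes
          for (int i = 0; i < n; i += 4) {
              __m128 vx = _mm_load_ps(x + i);
              __m128 vy = _mm_load_ps(y + i);
              _mm_store_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
          }
      }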

  3. NVIDIA GeForce GTX 280
  • NVIDIA GT200, GeForce GTX 280, 1024 MB.
  • 933 GFlops single precision (240 FPUs; see the note below).
  • Finally double precision support, but only ~90 GFlops (an 8-core Xeon reaches ~80 GFlops).
  • Currently under investigation: tracking, Linpack, image processing.
  CUDA (Compute Unified Device Architecture)
  Sebastian Kalcher, Ivan Kisel, GSI
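
  For orientation (this arithmetic is an addition, assuming the commonly cited GT200 figures rather than anything on the slide): the single-precision peak follows from 240 FPUs × 1.296 GHz shader clock × 3 floating-point operations per cycle (a dual-issued multiply-add plus a multiply) ≈ 933 GFlops.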

  4. Intel Larrabee: 32 Cores
  Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
  • it will use the x86 instruction set with Larrabee-specific extensions;
  • it will feature cache coherency across all its cores;
  • it will include very little specialized graphics hardware.
  The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
  • LRB's x86 cores will be based on the much simpler Pentium design;
  • each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
  • LRB includes one fixed-function graphics hardware unit;
  • LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
  • LRB includes explicit cache control instructions;
  • each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.
  L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
  Ivan Kisel, GSI

  5. Intel Ct Language: Extending C++ for Throughput-Oriented Computing
  • Ct adds new data types (parallel vectors) and operators to C++
  • Library-like interface; fully ANSI/ISO-compliant C++
  • Ct abstracts away architectural details: vector ISA width / core count / memory model / cache sizes
  • Ct forward-scales software written today
  • Ct's platform-level API, the Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.
  • Ct is fully deterministic: no data races
  • Nested data parallelism and deterministic task parallelism differentiate Ct on parallelizing irregular data and algorithms

  The basic type in Ct is a TVEC.

  Dot product using C loops:

      for (int i = 0; i < n; i++) {
          dst += src1[i] * src2[i];
      }

  Dot product using Ct (see also the standard C++ rendering below). Annotations on the slide point out that the vector operations subsume the loop: an element-wise multiply followed by a reduction (a global sum):

      TVEC<F64> Dst, Src1(src1, n), Src2(src2, n);
      Dst = addReduce(Src1 * Src2);

  Ct: Throughput Programming in C++. Tutorial. Intel.
  Ivan Kisel, GSI
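
  For comparison, the same pattern in standard C++ (an illustrative addition, not from the slide): std::inner_product expresses the multiply-then-sum that addReduce(Src1 * Src2) captures, though unlike Ct it makes no promise of vectorization or parallel execution.

      #include <numeric>   // std::inner_product
      #include <vector>

      // Dot product: element-wise multiply fused with a global sum,
      // the same shape as the Ct version, but evaluated serially.
      double dot(const std::vector<double>& src1,
                 const std::vector<double>& src2) {
          return std::inner_product(src1.begin(), src1.end(),
                                    src2.begin(), 0.0);
      }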

  6. Ct vs. CUDA
  Matthias Bach
  Ivan Kisel, GSI

  7. Multi/Many-Core Investigations
  • CA: Game of Life (see the sketch below)
  • L1/HLT CA Track Finder
  • SIMD KF Track Fitter
  • LINPACK
  • MIMDization (multi-threads, multi-cores)
  GSI, KIP, CERN, Intel
  Ivan Kisel, GSI
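
  The Game of Life serves here as a warm-up cellular automaton for the parallelization studies. A minimal serial C++ update step is sketched below; the grid layout and function name are invented for illustration, and a real many-core study would tile the grid across threads. The rule itself is standard: a live cell survives with 2 or 3 live neighbours, a dead cell becomes alive with exactly 3.

      #include <vector>

      // One synchronous Game of Life step on a W x H grid (row-major,
      // 0 = dead, 1 = alive); cells outside the grid count as dead.
      void lifeStep(const std::vector<int>& in, std::vector<int>& out,
                    int W, int H) {
          for (int y = 0; y < H; ++y) {
              for (int x = 0; x < W; ++x) {
                  int alive = 0;
                  for (int dy = -1; dy <= 1; ++dy)
                      for (int dx = -1; dx <= 1; ++dx) {
                          if (dx == 0 && dy == 0) continue;
                          int nx = x + dx, ny = y + dy;
                          if (nx >= 0 && nx < W && ny >= 0 && ny < H)
                              alive += in[ny * W + nx];
                      }
                  int cell = in[y * W + x];
                  // Survival with 2 or 3 neighbours, birth with exactly 3.
                  out[y * W + x] = (alive == 3 || (cell && alive == 2)) ? 1 : 0;
              }
          }
      }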

  8. SIMD KF Track Fit on Multicore Systems: Scalability
  [Plot: real fit time per track (ms) vs. number of threads]
  Using Intel Threading Building Blocks: linear scaling on multiple cores (see the sketch below).
  Håvard Bjerke
  Ivan Kisel, GSI
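
  A hedged sketch of the threading pattern behind such a scaling measurement, using the classic TBB parallel_for API; the Track type and the fitTrack body are placeholders, not the actual SIMD Kalman filter fitter.

      #include <tbb/parallel_for.h>
      #include <tbb/blocked_range.h>
      #include <vector>

      struct Track { /* hits, fitted parameters, covariance, ... */ };

      // Placeholder: the real work item would be the SIMD KF fit.
      void fitTrack(Track& t) { /* Kalman filter fit of one track */ }

      // TBB splits the index range into chunks and schedules them on
      // the available cores; with independent tracks this is what
      // yields near-linear scaling in the thread count.
      void fitAll(std::vector<Track>& tracks) {
          tbb::parallel_for(
              tbb::blocked_range<size_t>(0, tracks.size()),
              [&](const tbb::blocked_range<size_t>& r) {
                  for (size_t i = r.begin(); i != r.end(); ++i)
                      fitTrack(tracks[i]);
              });
      }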

  9. Parallelization of the L1 CA Track Finder
  [Slide diagram, two stages: (1) create tracklets; (2) collect tracks. A structural sketch follows.]
  GSI, KIP, CERN, Intel, ITEP, Uni-Kiev
  Ivan Kisel, GSI
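
  In outline, a cellular-automaton track finder proceeds in the two stages named on the slide: build short segments (tracklets) from hits on neighbouring stations, then chain compatible tracklets into track candidates. The sketch below is a structural illustration only; the types, the compatibility cut, and the greedy chaining are stand-ins, not the L1 implementation.

      #include <cmath>
      #include <vector>

      struct Hit { float x, y, z; int station; };
      struct Tracklet { int a, b; };   // indices of two hits

      // Stand-in cut: accept pairs with a small slope to the beam axis;
      // a real finder uses extrapolation windows and hit errors.
      static bool compatible(const Hit& h1, const Hit& h2) {
          float dz = h2.z - h1.z;
          return dz > 0.f && std::fabs((h2.x - h1.x) / dz) < 1.f;
      }

      // Stage 1: create tracklets from hit pairs on adjacent stations.
      std::vector<Tracklet> createTracklets(const std::vector<Hit>& hits) {
          std::vector<Tracklet> out;
          for (size_t i = 0; i < hits.size(); ++i)
              for (size_t j = 0; j < hits.size(); ++j)
                  if (hits[j].station == hits[i].station + 1 &&
                      compatible(hits[i], hits[j]))
                      out.push_back({(int)i, (int)j});
          return out;
      }

      // Stage 2: collect tracks by greedily chaining tracklets that
      // share a hit (assumes tracklets ordered by station number).
      std::vector<std::vector<int>> collectTracks(const std::vector<Tracklet>& ts) {
          std::vector<std::vector<int>> tracks;
          for (const Tracklet& t : ts) {
              bool extended = false;
              for (auto& trk : tracks)
                  if (trk.back() == t.a) { trk.push_back(t.b); extended = true; break; }
              if (!extended) tracks.push_back({t.a, t.b});
          }
          return tracks;
      }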

  10. L1 Standalone Package for Event Selection
  Igor Kulakov
  Ivan Kisel, GSI

  11. KFParticle: Primary Vertex Finder
  The algorithm is implemented and has passed first tests (a simplified sketch follows below).
  Ruben Moor
  Ivan Kisel, GSI
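
  To give a feel for what a primary vertex finder computes, the sketch below estimates the vertex as the covariance-weighted mean of the tracks' points of closest approach. This is a deliberately simplified stand-in with invented types; KFParticle itself fits the vertex with a Kalman filter over the full track parameters and covariances.

      #include <vector>

      // Track state at closest approach to the beam line: position and
      // a per-coordinate variance (a stand-in for the full covariance).
      struct TrackAtPCA { double pos[3]; double var[3]; };

      // Weighted-mean vertex estimate: each coordinate is averaged with
      // weights 1/variance. A Kalman-filter vertex fit refines this by
      // updating the track parameters as well.
      void estimateVertex(const std::vector<TrackAtPCA>& tracks, double vtx[3]) {
          for (int k = 0; k < 3; ++k) {
              double sumW = 0.0, sumWX = 0.0;
              for (const TrackAtPCA& t : tracks) {
                  double w = 1.0 / t.var[k];
                  sumW  += w;
                  sumWX += w * t.pos[k];
              }
              vtx[k] = (sumW > 0.0) ? sumWX / sumW : 0.0;
          }
      }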

  12. L1 Standalone Package for Event Selection
  Efficiency of D+ selection: 48.9%
  Igor Kulakov, Iouri Vassiliev
  Ivan Kisel, GSI

  13. Magnetic Field: Smooth in the Acceptance
  We need a smooth magnetic field in the acceptance:
  • approximate the field with a polynomial in the plane of each station;
  • approximate it between stations with a parabolic function through each group of 3 stations (see the sketch below).
  Ivan Kisel, GSI
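
  A hedged sketch of that scheme: per-station polynomial coefficients give the field in each station plane, and a parabola through three neighbouring station planes interpolates in z. The polynomial degree, data layout, and function names are illustrative assumptions, not the actual CBM field parametrization.

      #include <vector>

      // Field value in one station plane, from a 2nd-order polynomial in
      // (x, y): B = c0 + c1*x + c2*y + c3*x*x + c4*x*y + c5*y*y.
      struct StationField {
          double z;      // z position of the station plane
          double c[6];   // polynomial coefficients (one field component)
          double eval(double x, double y) const {
              return c[0] + c[1]*x + c[2]*y + c[3]*x*x + c[4]*x*y + c[5]*y*y;
          }
      };

      // Field between stations: Lagrange parabola through the values in
      // three consecutive station planes bracketing z (needs >= 3 stations,
      // sorted by increasing z).
      double fieldAt(const std::vector<StationField>& s,
                     double x, double y, double z) {
          size_t i = 0;
          while (i + 3 < s.size() && z > s[i + 1].z) ++i;   // pick the triple
          double z0 = s[i].z, z1 = s[i+1].z, z2 = s[i+2].z;
          double b0 = s[i].eval(x, y), b1 = s[i+1].eval(x, y), b2 = s[i+2].eval(x, y);
          return b0 * (z - z1) * (z - z2) / ((z0 - z1) * (z0 - z2))
               + b1 * (z - z0) * (z - z2) / ((z1 - z0) * (z1 - z2))
               + b2 * (z - z0) * (z - z1) / ((z2 - z0) * (z2 - z1));
      }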

  14. CA on the STS Geometry with Overlapping Sensors
  UrQMD MC, central Au+Au at 25 AGeV.
  Efficiency and the fraction of killed tracks remain acceptable up to ∆Z = Zhit - Zstation < ~0.2 cm.
  Irina Rostovtseva
  Ivan Kisel, GSI

  15. Summary and Plans
  • Learn the Ct (Intel) and CUDA (Nvidia) programming languages
  • Develop the L1 standalone package for event selection
  • Parallelize the CA track finder
  • Investigate large multi-core systems (CPU and GPU)
  • Parallel hardware -> parallel languages -> parallel algorithms
  Ivan Kisel, GSI
