
New trends and paradigms in HENP: where do we go from here to “there”?



Presentation Transcript


  1. New trends and paradigms in HENP: where do we go from here to “there”? Jérôme Lauret, RHIC/STAR Software & Computing Leader, Brookhaven National Laboratory, NY, USA

  2. Outline • HENP successes • New paradigms: a look ahead • Cloud computing • From Grid to Cloud • Usage & the future • Multi-core & many-core • Evolution • How did we get there? • Can we use it? • Examples of multi-core problems • More trends … • Concluding remarks

  3. HENP successes • Building a global infrastructure – Grids • Sustaining unprecedented data production • Building services at significant scales

  4. Grids – LCG, … 1 M jobs/day

  5. Grids – OSG, …

  6. Key to success • Grid middleware, Grid schedulers, … • Ethernet [1 and 10 Gbit; LAN and WAN; IPv4] • MSS – high-density tapes and efficient access to low-latency (high-capacity) storage • Data distribution and access layers: Castor, dCache, xrootd, … pNFS • ROOT files (physics data) • Relational databases (metadata, all kinds of SQL) • AFS, NAS, RAID • Batch systems • C++ frameworks – code re-usability, … base classes, steering classes • gcc / Linux • With lots of data … data deluge …

  7. STAR HPSS raw data – High Performance Storage System: 4.7 M files, 2.4 PBytes

  8. STAR HPSS production data – ~2.2 PBytes of “derived” data, 18 M files (15 M files indexed). STAR resources ~6500 kSI2K @ BNL + remote sites

  9. The Tianhe-1A, based at the National University of Defense Technology (NUDT) in China, can perform over 2.5 thousand trillion floating point operations per second (PetaFlops).

  10. Microsoft partners with a Chinese firm on cloud computing. IBM helps clients excel in Cloud Computing, offering reliable and secure SaaS, IaaS, and PaaS solutions.

  11. The world is changing – new trends have appeared: PetaFlops, Exascale, BlueGene, Cloud, multi-core, many-core, ARM versus Intel, Xeon vs Atom vs …, IPv6, … Changes are fast coming (some fast going too). Can we integrate the new realities into current frameworks?

  12. New paradigms, a look ahead – Cloud & parallelism

  13. Cloud computing From Grids to Clouds

  14. Cloud computing primer • No kitchen sink in the Cloud … yet • Base idea and building blocks: infrastructure in layers • Infrastructure-as-a-Service (IaaS) • Provides access & control to infrastructure • Ex: Amazon Web Services – through an interface, ask for instances of an infrastructure (VMs) and access them; raise and shut down instances • Platform-as-a-Service (PaaS) • Collection of software and tools maintained by your cloud provider • Ex: Microsoft Azure – develop apps with the Azure platform, move them to the Cloud (get rid of local services, connect from anywhere) • Software-as-a-Service (SaaS) • Both hardware and software provided on the Cloud • Ex: Apple iCloud – apps already provided and interacting with the Cloud (store and retrieve content, DB, iTunes, … services)

  15. Use in HENP – why move, and what is attractive? • Are Grids usable, and why a move? • Pro • Grids are usable – outstanding use of resources, efficiency > 97% (on second try) • Grid operations support is world-wide • Cons • Grids are heterogeneous – you CANNOT guarantee on which platform you will run; the infrastructure is controlled/maintained by the provider; you need to “discover” resources • Complex and dynamic • Troubleshooting is very inadequate (globus error # anyone?), exacerbated by heterogeneity (OS, batch, any component, …) • Cloud IaaS has virtualization at its heart • Cons • No operational support, no helpdesk • Mostly commercial (for now) • Pro • Clouds are usable – in their simplest form, efficiency > 97% • You can (e.g.) “provision” resources by packing ALL of your software, services, OS and environment dependencies (startup and setup scripts) into one VM • You can test this VM at home and make sure your software stack works AND produces the same result as on your home infrastructure / cluster (validation / QA) • You build the VM once, deploy it as many times as needed • … VMs can be preserved – obsolescence handling and experiment AOL (long-term preservation) nearly ensured

  16. Motivation for Cloud – a software-stack problem analysis • Complex experimental application codes • STAR case: developed over more than 10 years, by more than 100 scientists; comprises ~2.5 M lines of C++ and Fortran code • Require complex, customized environments • Rely on the right combination of compiler versions and available libraries • Dynamically load external libraries depending on the task to be performed (system or third-party: ROOT, mysql, libxml, …) • Little to NO opportunistic use of Grids • Cloud – pack once into a VM, use many times

  17. Amazon EC2 (native) – general recipe • Prepare a VM (4-6 GB with OS and STAR software altogether) • ~2 hours of preparation, to be done once • Contextualize (EC2 specifics) • Ship it to EC2 (slow? 20 minutes, also a one-time job) • Log in to EC2 and check that the (new) STAR VM image exists (this creates an AMI, or Amazon Machine Image) • Select the STAR VM image you want to launch • Select the type of machine & # of copies • Must select SSH keys if you want to use SSH to communicate with this VM • Must select a firewall (part of your EC2 setup) • The exciting part: press “launch” … do your physics • The not-so-painful part … pay

  18. Models – virtualization at a glance (CHEP 2010, contribution ID 267; SciDAC 2010 paper) • EC2 / STAR @ MIT – Adam Kocoloski, Jan Balewski, Matthew Walker, Kate Keahey, Jérôme Lauret, Tim Freeman, Levente Hajdu, Lidia Didenko – purely Web based + ssh login possible; WN “see” the world • Gatekeeper + WN form a virtual cluster; WN “see” the world • VOC – Sebastien Goasgen, Jérôme Lauret, Michael Fenn, Levente Hajdu – semi-standard GK used to start VMs; private IP space, need SE + start/stop mechanism for VMs • Condor/VM – Miron Livny, Greg Thain, Jan Balewski, Matthew Walker, Jérôme Lauret – “on-demand” VMs subscribe to an external RMS; VMs form an additional network layer • Kestrel – Sebastien Goasgen, Jérôme Lauret, Matthew Walker, Lance Stout – IM client controls VMs, XMPP used for dispatch

  19. Many success stories … • July 28th 2011 – Magellan Tackles the Mysterious Proton Spin • June 1st 2011 – The case of the missing proton spin • March 24th 2010 – Video of the week: RHIC’s hot quark soup • May 29th 2009 – Nimbus cloud project saves brainiacs' bacon • May 2nd 2009 – Number Crunching Made Easy • April 8th 2009 – Feature: Clouds make way for STAR to shine (also an OSG highlight) • April 2nd 2009 – Nimbus and cloud computing meet STAR production demands (see also HPCwire) • April 30th 2008 – The new Nimbus: first steps in the clouds • September 2007 – CHEP 2007 OSG SUMS Workspace Demo (also attached) • August 7th 2006 – SunGrid and the STAR Experiment (flyer attached) • Magellan Project – seamless use of Nimbus, Eucalyptus and OpenStack. What do we learn from “Cloudifying” National Laboratory resources? • ACAT 2011, contribution #58, Offloading peak processing to Virtual Farm by STAR experiment at RHIC – the first near real-time data processing on the Cloud in HENP that I know of; simple provider/consumer model of VM coordination • Overall achievements: • Cloud processing boost for the “W” measurement (10 months became 3) • On-the-fly BUR reshape in 2011, data and result preview • The Cloud computing paradigm is a HUGE success for STAR

  20. The future? • Clouds represent an evolutionary step toward the dream of “heterogeneity with confidence”, in layers • Software and environment fully packaged, fully reproducible • Infrastructure seems to scale in size and in VMs • Resource “chunks” are carved out of larger clusters • HENP is only grazing the Cloud: IaaS, PaaS (Nimbus, …) • Tomorrow? • Traditional farms / clusters will disappear – economies of scale • ExaScale and mega-computers may replace them • Will be hard for HENP … • But the technology WILL exist to create “virtual clusters” • “Infinite” power within reach from your laptop

  21. Multi-core, many-core era (“many” for > 100 cores) – onset of ExaScale computing

  22. Disclaimer: new paradigms? Or a not-so-new reality? Changes through improvements vs. changes through innovation

  23. Many ways to see evolution … Can process faster, smarter, but still on two feet, two arms, … • Computers are as “simple” as they used to be 50 years ago. Under the hood • Still based on the von Neumann architecture (shared bus, shared memory) with little improvement (cache) • Push words back and forth – machine language is very primitive … • Bottlenecks and solutions are also the same – data in/out is a killer, CPUs need to wait, and caching strategies have been around for ages (Harvard architecture)

  24. Multi-core, many cores – how did we get there? • Moore’s law: a self-fulfilling prophecy • CPU power increases by ×2 every 2 years (18 months) • 2003-2005: Intel “buzz killer” • speed highly tied to the miniaturization of components • 2005+: a new strategy begins • Keep speed around 2-3 GHz (best: 5.2 GHz for the z196) • Pack more CPUs into the same box: multi-core • Increase “other dimensions” • Raw power increase era → multi-core era

  25. A many-dimension problem (~7) – adapted from Sverre Jarp, CERN openlab • The 7 dimensions to speed increase • Traditional • Multiple computer nodes – embarrassingly parallel • Multi-core – one program uses many cores • Less traditional • Multi-socket – provides more hardware parallelism (but hard to program due to NUMA, Non-Uniform Memory Access) • Pipelining – instruction pipelining • Superscalar (MIMD, Multiple Instructions Multiple Data) • Vector widths / SIMD (Single Instruction, Multiple Data) • The 4th is a pseudo-dimension – hardware multi-threading • Perhaps even more dimensions • Precision – quadruple precision and beyond …
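
  To make the “vector widths / SIMD” dimension concrete, here is a minimal C++ sketch (not from the talk): a dependence-free SAXPY loop that the compiler can vectorize, with an OpenMP simd hint. The function name and array sizes are illustrative only.

      // Minimal sketch of the SIMD dimension: a simple, dependence-free loop
      // that a compiler can turn into vector instructions (e.g. g++ -O3 -fopenmp-simd).
      #include <cstddef>
      #include <cstdio>
      #include <vector>

      void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
          #pragma omp simd
          for (std::size_t i = 0; i < x.size(); ++i)
              y[i] = a * x[i] + y[i];        // several elements per SIMD instruction
      }

      int main() {
          std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
          saxpy(3.0f, x, y);
          std::printf("y[0] = %f\n", y[0]); // 5.0
          return 0;
      }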

  26. Are we in good shape? • Multi-core alone: the community is dominated by single-threaded applications and libraries • Algorithms: Kalman, … • Geant, ROOT, … • Embarrassingly // is short-sighted (at best) • Internal bandwidth and storage speed • Memory would need to scale – no cost saving, AND IO through the bus is a killer • Random IO to the underlying device • Some mitigation (laptop area) • More complex cache architectures – ReadyBoost, … • Solid State Drives, Flash memory, hybrid drives • IO is also external: network, database, … the standard schema is challenged • Fully parallel is not easy • Same IO issues – copying data in/out of the box itself from many sources • Memory delays – copying data in/out of memory is costly • Estimate: HENP harvests/exploits at best 10-20% of the available power

  27. What are the show stoppers & difficulties? • Many APIs and approaches • Vector classes help – this comes for “free” and should be used (SIMD) • Old fork() or POSIX threads methods • OpenMP, Threading Building Blocks (TBB), … GPU – CUDA, … OpenCL, … Intel Cilk • Not all methods are compatible with each other • thread synchronization issues • So many dimensions, so many things to try – today this typically requires deep knowledge of the architecture • Hardware examples: Intel Single-chip Cloud Computer, 48 cores in 24 tiles × 2 IA cores per tile, 24-router communication mesh; Nvidia GPUs, teraflops per card, Fermi → Kepler → Maxwell; Tilera Tile-Gx, 100 cores
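
  As a concrete illustration of the “one program uses many cores” case with one of the APIs listed above, here is a minimal OpenMP sketch (not STAR code; reconstruct() and the event array are hypothetical stand-ins):

      // Hypothetical multi-core sketch with OpenMP: one process, the event loop
      // split across cores, a reduction to combine results.
      // Build with: g++ -O2 -fopenmp example.cpp
      #include <omp.h>
      #include <cstdio>
      #include <vector>

      double reconstruct(double raw) { return raw * 0.5; }   // stand-in for real work

      int main() {
          std::vector<double> events(1000000, 1.0);          // fake "events"
          double sum = 0.0;

          #pragma omp parallel for reduction(+:sum)
          for (long i = 0; i < (long)events.size(); ++i)
              sum += reconstruct(events[i]);

          std::printf("threads=%d  sum=%f\n", omp_get_max_threads(), sum);
          return 0;
      }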

  28. The “sequentiality” • Amdahl's law (1967) • The overall speed-up is driven by the sequential part • For N processors and a portion P that is fully parallel → S = 1/((1-P)+P/N) • Examples: • 0% is // (P=0): speed-up is 1 • 50% is // (P=0.5): Smax → 2 • 100% is // (P=1): speed-up is N • With 5% sequential (easy with IO, initialization, input reading) • 16 cores, S ~ 10 – 10/16 cores used at most, or 62% of the machine power • 32 cores, S ~ 12 – 12/32 cores, so a drop to ~37% • … by 512 cores, efficiency is 19/512 ~ 4% optimal usage • The entire processing is driven by the slowest portion – corollary: focus on the part with the biggest gain (make the most common case fast), segregate the rest
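
  A small program reproducing the arithmetic above (the slide quotes rounded speed-ups; the formula itself is exact):

      // Amdahl's law, S = 1 / ((1-P) + P/N), evaluated for the numbers quoted
      // above (P = 0.95, i.e. 5% sequential).
      #include <cstdio>

      double amdahl(double P, double N) { return 1.0 / ((1.0 - P) + P / N); }

      int main() {
          const double P = 0.95;
          for (int N : {16, 32, 512}) {
              double S = amdahl(P, N);
              std::printf("N=%4d  speed-up S=%5.1f  efficiency=%4.1f%%\n",
                          N, S, 100.0 * S / N);
          }
          return 0;
      }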

  29. Some examples of work, attempts and lessons learned

  30. Problem #1: IO issues are not only local – access to a standard database service in STAR • Scale • STAR: 0.5 M database queries per second (qps) at peak; Facebook (also MySQL) is doing 13 M qps – STAR is ~500 collaborators (1000 qps / user), Facebook is ~500 million (0.026 qps / user) • Twitter users produce ~600 tweets per second (56 M users); STAR is doing ~600 FileCatalog requests per second (do the math of queries per user) • The problem: • Data is acquired online – each event is sorted and recorded in a “file” (but not necessarily in time order) • Each file → one job; each job → one or more accesses to the database; each job → one timeline in the database • Now, add to this: • The farm grows to 4000 CPUs (16 cores per box) • Some event selections become “rare” – ‘File D’ records events once every 15-20 minutes (while information is @ 1 Hz) • Result: immediate database thrashing • Database efficiency ⇔ cache re-use • “Sparsity” destroys the philosophy – the standard approach is challenged • Would still happen, but MUCH later, if the program were fully parallel – a ×16 factor

  31. Problem #1 (standard solutions) • Solutions: (a) use a larger cache [more memory] (b) use a faster cache [tune the DB to have all indexes fit in memory, use SSDs] (c) re-order processing in strict time order (d) split the database service, enhance the API and “fit all in memory” [dbsnapshot], or perform horizontal partitioning [see blog, CHEP 2010 contrib TBP] • Longer term: pre-fetching, client caching, WS, … or maximum DB or no DB
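
  A hedged illustration of the “client caching” idea among the longer-term options – this is not STAR's actual dbsnapshot; runQuery() is a hypothetical stand-in for the real database API. Repeated identical metadata look-ups within a job are served locally instead of hammering the server:

      // Sketch of client-side caching (memoization) of database look-ups.
      #include <iostream>
      #include <string>
      #include <unordered_map>

      std::string runQuery(const std::string& sql) {
          // imagine a round-trip to the MySQL server here
          return "result-of(" + sql + ")";
      }

      class CachedDb {
          std::unordered_map<std::string, std::string> cache_;
      public:
          const std::string& query(const std::string& sql) {
              auto it = cache_.find(sql);
              if (it == cache_.end())                        // miss: go to the server once
                  it = cache_.emplace(sql, runQuery(sql)).first;
              return it->second;                             // hit: served locally
          }
      };

      int main() {
          CachedDb db;
          for (int i = 0; i < 3; ++i)                        // 3 identical calibration requests
              std::cout << db.query("SELECT * FROM calib WHERE t=12345") << "\n";
          return 0;                                          // only one real query was issued
      }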

  32. Problem #2: memory copy overhead / Problem #3: usability of APIs (diversity) • LHC/ALICE used CUDA and OpenCL for speeding up a RAND function • Test on an NVIDIA GeForce GTX 480 – 2-D sample, N=10^6, 4 Rand / point • Evaluation of a combined random number generator: a Tausworthe with a Linear Congruential Algorithm for a 2^121 periodicity • Each algorithm (Taus ×3, LCA ×1) on one CPU / GPU and the internal calculation split • Overall gain ×10 (should be ×1000), mainly due to moving data in/out over the bus • Other work … • CERN/OpenLab – porting RooFit (likelihood fit) to CPU and GPU devices • OpenMP and OpenCL (hybrid) can coexist – OpenMP easier • CUDA preferred for optimal efficiency (vendor lock-in) • Performance obtained with cache-blocking techniques
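
  For reference, the combined generator described above appears to be the classic three-step Tausworthe XOR-ed with one linear congruential step (combined period ~2^121). Below is a plain CPU-side C++ sketch of that recipe, not the ALICE CUDA/OpenCL code:

      // CPU sketch of a hybrid Tausworthe + LCG generator (illustrative only).
      #include <cstdint>
      #include <cstdio>

      static uint32_t tausStep(uint32_t& z, int s1, int s2, int s3, uint32_t m) {
          uint32_t b = ((z << s1) ^ z) >> s2;
          return z = ((z & m) << s3) ^ b;
      }
      static uint32_t lcgStep(uint32_t& z) {
          return z = 1664525u * z + 1013904223u;
      }

      struct HybridTaus {
          uint32_t z1, z2, z3, z4;                 // seeds; z1..z3 must be >= 128
          explicit HybridTaus(uint32_t seed)
              : z1(seed | 0x80u), z2(seed * 3u | 0x80u),
                z3(seed * 7u | 0x80u), z4(seed) {}
          float operator()() {                     // uniform in [0,1)
              uint32_t r = tausStep(z1, 13, 19, 12, 0xFFFFFFFEu)
                         ^ tausStep(z2,  2, 25,  4, 0xFFFFFFF8u)
                         ^ tausStep(z3,  3, 11, 17, 0xFFFFFFF0u)
                         ^ lcgStep(z4);
              return 2.3283064365386963e-10f * r;  // r * 2^-32
          }
      };

      int main() {
          HybridTaus rng(12345u);
          for (int i = 0; i < 4; ++i) std::printf("%f\n", rng());
          return 0;
      }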

  33. Problem #4: a combination of sequentiality and little saving / Problem #5: language roadblock • LHC experiment “A” tries event parallelization • Findings • Saves memory (copy-on-write / memory shared until data is altered) – in fact, ~1.2-1.4 GB / event shared … • Does NOT save time compared to embarrassingly // (in fact, less efficient by a few %) • fork() ⇒ nearly no code rewrite (but no real gain apart from memory) • Recurrent theme in other methods ⇒ usually, revert from STL and OO models to plain C structs to be compatible with a wide range of APIs or to reach minimum efficiency – code re-write • Events processed in separate sub-processes (essentially fork()), separate IO • Merger at the end • Stepping away from C++ is a recurrent theme • Is C++ a show stopper to parallel programming?
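
  A minimal POSIX/C++ sketch of the fork()-per-sub-process model discussed above (illustrative only; processEvents() is a stand-in for the real reconstruction). Children inherit the parent's memory copy-on-write, so read-only data is shared for free; the parent waits and merges at the end:

      #include <sys/wait.h>
      #include <unistd.h>
      #include <cstdio>
      #include <vector>

      void processEvents(int worker, int nWorkers) {
          // hypothetical stand-in: each worker handles its share of events, own IO
          std::printf("worker %d processing its share of events\n", worker);
      }

      int main() {
          const int nWorkers = 4;
          std::vector<pid_t> children;
          for (int w = 0; w < nWorkers; ++w) {
              pid_t pid = fork();
              if (pid == 0) {                  // child: shares parent memory via COW
                  processEvents(w, nWorkers);
                  _exit(0);
              }
              children.push_back(pid);
          }
          for (pid_t pid : children) waitpid(pid, nullptr, 0);   // "merger at the end"
          std::printf("all sub-processes done, merge outputs here\n");
          return 0;
      }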

  34. Dreams … and holy grail

  35. Wouldn’t it be nice if … • A product / API appeared, answering our prayers for “a” standard • It should provide forward evolution • Many tries – none crystallized: CT, OpenMP, … • OpenCL may be “a” way • Embraced by the Khronos Group • Embraced by Mac • Embraced by ATI and AMD, …

  36. Language? … • C/C++ “locking + threads” is possible, but hard to get right, and even harder to keep right, efficient, portable • C++ OO models have lots of overhead; templates are impossible to handle efficiently – often best to step back to C • Prediction: a language or compiler/wrapper war is coming soon • Much HENP investment is in C++; the move will be slow • Needs • A new language allowing for friendly parallelism • OR a compiler that allows auto-parallelization … • Can Go save us? • Addresses non-multi-core issues from yesterday and some multi-core issues of tomorrow (type-safe data communication & sync, garbage collection, reflection, …) • gcc-go is coming – large exposure to be expected • BUT binding to C++ is hardly possible; no dynamic library builds and no dynamic library loading, no operator overloads, … • Prediction: it may or may not evolve into “the” language – but the attempt is interesting and we will learn from new concepts
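
  To illustrate why “locking + threads” in C++ is possible but hard to keep right, a small sketch (mine, not from the talk): the mutex makes the shared counter correct, but removing the lock still compiles and races silently.

      // Build with: g++ -O2 -pthread example.cpp
      #include <cstdio>
      #include <mutex>
      #include <thread>
      #include <vector>

      int main() {
          long counter = 0;
          std::mutex m;
          auto work = [&]() {
              for (int i = 0; i < 100000; ++i) {
                  std::lock_guard<std::mutex> lock(m);  // remove this line: still compiles, wrong answer
                  ++counter;
              }
          };
          std::vector<std::thread> pool;
          for (int t = 0; t < 4; ++t) pool.emplace_back(work);
          for (auto& th : pool) th.join();
          std::printf("counter = %ld (expect 400000)\n", counter);
          return 0;
      }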

  37. Even more trends …

  38. Some new trends … (Sverre Jarp, CERN openlab) • Phones • Soon there will be one for every inhabitant on Earth: 1,650,000,000 expected to be sold this year • Smart-phones: fast approaching 1 billion devices: 480,000,000 this year; compound annual growth rate (CAGR): 60% • Most (if not all) phones shipped in 2015 will be smart phones • Tablets: 50,000,000, CAGR of 200% • In comparison: • Netbooks/Notebooks (200,000,000) • Desktops (150,000,000) • Servers (10,000,000) with 55 BUSD in revenue • Why this may be important • The sheer number of smartphones is “tempting” … a little calculation there → big win • Tablets may drive “finger over mouse & keyboard” developments – Windows 8 & the next release of ROOT is iOS-based (simplified GUI) • Other buzz • 300 Chinese fabless companies are springing up across the country • … • The lesson: many changes may be led by the private sector, booming companies, new devices … Innovation → demand → paradigm shift

  39. Do we have a clear path forward? Concluding remarks

  40. Cloud computing • The path is rather clear • Forget about clusters and batch systems • Mega-machines will be apportioned into virtual clusters & resources • Elastic computing – expand on demand when you need it • Outsourcing is here to stay • New features will appear • Control from one laptop (or even a smartphone) • Pay as you go • Market place – highest bidder

  41. Many core? • A clear path? Yes, about as much as Mr. Magoo’s • I am sure we will be saved at the last minute by a suddenly appearing elevator (i.e. solutions) … but we are not there yet • For now, the community is “lost” or “searching” … • For now • A multi-dimension / multi-disciplinary problem • The problem has a logistic / strategic design dimension: “parallelize all or …”? A workflow split may be inevitable – some on many-core, some on the Cloud • IO needs to be delegated, delayed, asynchronous, buffered, … • What needs to happen • Experiments need to build teams with very broad knowledge • We cannot concentrate on just one layer, ignoring the others • I see NO path to success without close collaboration of CS and physicists • APIs and libraries must be agile and flexible • Common libraries, shared algorithms, … • We likely need “a” new language or compiler savior – C++ WILL otherwise be challenged – a language war is coming • A common strategy and architecture – where are the software architects?

  42. Overall • A computing landscape with potential for dramatic paradigm shifts (de-localization of resources, vastly parallel, tactile devices, … 3D?) • A world rich in opportunities for young scientists • Team work WILL be the way to success
