Enterprise Supercomputers


Presentation Transcript


  1. Enterprise Supercomputers Ike Nassi (Joint work with Rainer Brendle, Keith Klemba, and David Reed) TTI/Vanguard, “Real Time”, Paris, July 12, 2011

  2. A Personal Reflection
     • Let me start with a story.
     • The setting. Early 1982:
     • Digital Equipment Corporation (DEC) had introduced the VAX (a super-minicomputer) two years earlier
     • Sales were going well
     • The VAX was a popular product
     • And then one day…

  3. The Moral of the Story
     • We see that unmet needs plus an inflection point (a set of new technologies) create a large market ($$$$):
     • The unmet need: powerful personal workstations
     • The inflection point: bitmapped displays, 10 Mb/s Ethernet, powerful µprocessors, open source UNIX
     • The result: the market for workstations

  4. Is “the Cloud” an inflection point?
     • The emergence of the Cloud indicates that we have been missing something
     • The drivers are emerging needs: mobility, big data, new UI, elasticity, social, new business models
     • Current systems represent the work of the last two decades
     • The gap between those emerging needs and current systems is what the Cloud fills

  5. Cloud computing has emerged as a viable mode for the enterprise
     • Ease of access
     • Elastic
     • Mobility and sensors
     • More pervasive data storage
     • Computing as a utility, devices as appliances
     • Expandable storage
     • New business models are emerging (SaaS, PaaS, IaaS)

  6. Today I’ll talk a bit about
     • Current commodity server design
     • How this creates a computing gap for the enterprise*
     • How businesses of tomorrow will function
     • And how enterprise supercomputers will support them: a different layer of cloud
     • *A closer look: the perception is that cloud computing sits directly on current systems; my reality is that there is a gap between them

  7. The short story…
     • Current view: virtual machines 1 through N multiplexed onto a single physical machine
     • Emerging view: physical machines 1 through M presented as a single virtual machine

  8. Enterprise HW Computing Landscape, and what has emerged
     • The chip, the board, the blade server, and the cloud

  9. And this creates a perceived technology gap
     • Problems with today’s servers:
       • Cores and memory scale in lock step
       • Access latencies to large memories from modern cores
       • Addressing those memories
     • Current and emerging business needs:
       • Low latency decision support and analytics
       • “What if?” simulation
       • Market/sentiment analysis
       • Pricing and inventory analysis
       • In other words, we need “real” real time
     • This creates a technology conflict: cores and memory need to scale more independently rather than in lock step
     • How can we resolve this conflict?

  10. Flattening the layers
     • We need to innovate without disrupting customers (like your multicore laptops)
     • Seamless platform transitions are difficult, but doable (Digital, Apple, Intel, Microsoft)
     • Large memory technology is poised to take off, but it needs appropriate hardware
     • High Performance Computing (HPC)* has yielded results (Vanguard 1991) that can now be applied to enterprise software; today it’s different:
       • We have a lot of cores now and better interconnects
       • We can “flatten” the layers and simplify
       • We can build systems that improve our software now without making any major modifications
     • Nail Soup argument: but if we are willing to modify our software, we can win bigger, and we do this on our own schedule (see “Stone Soup” on Wikipedia)
     • *Scientific/technical computing

  11. The Chip
     • How do we leverage advances in:
       • Processor architecture
       • Multiprocessing (cores and threads)
       • NUMA (e.g. QPI, HyperTransport)
       • Connecting to the board (sockets)
       • Core to memory ratio
     • We have 64 bits of address, but can we reach those addresses? (see the sketch below)
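
As a concrete aside on the addressing question above: on a Linux/x86-64 machine, the number of physical and virtual address bits the processor actually implements is reported in /proc/cpuinfo. A minimal sketch, assuming Linux (the exact output varies by CPU):

      /* Print the physical/virtual address widths the CPU implements,
       * as reported by the Linux kernel in /proc/cpuinfo (x86-64). */
      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          FILE *f = fopen("/proc/cpuinfo", "r");
          char line[256];
          if (!f) { perror("/proc/cpuinfo"); return 1; }
          while (fgets(line, sizeof line, f)) {
              /* e.g. "address sizes : 46 bits physical, 48 bits virtual" */
              if (strncmp(line, "address sizes", 13) == 0) {
                  fputs(line, stdout);
                  break;
              }
          }
          fclose(f);
          return 0;
      }

Server processors commonly report somewhere between 40 and 48 physical bits, i.e. 1 TB to 256 TB of physically addressable memory, which is far less than the 64-bit virtual address space suggests.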

  12. Typical Boards
     • Highlights:
       • Sockets (2, 4, 8)
       • Speed versus access
     • How do we make these choices for the enterprise?
     • (Figure: an 8-socket board layout)

  13. Storage
     • Considerations for storage:
       • Cost per terabyte (for the data explosion)
       • Speed of access (for real time access)
       • Storage network (for global business)
       • Disk, Flash, DRAM
     • How can applications be optimized for storage?
     • Example: a 2 TB OCZ card

  14. Memory Considerations
     • Highlights:
       • Today, cores and memory scale in lock step
       • Large memory computing (to handle modern business needs)
       • Collapse the multi-tiered architecture levels
       • NUMA support is built into Linux (see the sketch below)
       • Physical limits (board design, connections, etc.)
     • (Figure: memory riser)
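
To make the “NUMA support is built into Linux” bullet concrete, here is a minimal sketch using the standard libnuma API (link with -lnuma); node 0 and the 1 GiB size are arbitrary illustration values, not configuration from the talk:

      /* Sketch: explicit NUMA placement with libnuma (compile with -lnuma). */
      #include <numa.h>
      #include <stdio.h>

      int main(void)
      {
          if (numa_available() < 0) {
              fprintf(stderr, "libnuma: NUMA not supported here\n");
              return 1;
          }
          printf("NUMA nodes: %d\n", numa_max_node() + 1);

          /* Back 1 GiB with node 0's memory and run on node 0, so loads
           * stay local instead of crossing the socket interconnect. */
          size_t len = 1UL << 30;
          char *buf = numa_alloc_onnode(len, 0);
          if (!buf) { perror("numa_alloc_onnode"); return 1; }
          numa_run_on_node(0);

          for (size_t i = 0; i < len; i += 4096)
              buf[i] = 1;              /* touch pages so they commit on node 0 */

          numa_free(buf, len);
          return 0;
      }

The same placement control is available from the command line with numactl (for example numactl --hardware to list nodes, or --cpunodebind and --membind to pin an unmodified program).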

  15. The “Magic” Cable

  16. Reincarnating the “mainframe”
     • In-memory computing provides the real time performance we’re looking for
     • In-memory computing requires more memory per core in a symmetric way
     • I remember what Dave Cutler said about the VAX: “There is nothing that helps virtual memory like real memory”
     • Ike’s corollary: “There is nothing that helps an in-memory database like real memory”
     • The speed of business should drive the need for more flexible server architectures
     • Current server architectures are limiting, and we must change the architecture and do so inexpensively
     • And how do we do this?

  17. Taking In-Memory Computing Seriously
     • Basic assumptions and observations:
       • Hard disks are for archives only; all active data must be in DRAM memory
       • Good caching and data locality are essential; otherwise CPUs are stalled by too many cache misses (see the locality sketch below)
       • There can be many levels of caches
     • Problems and opportunities for in-memory computing:
       • Addressable DRAM per box is limited by the processor’s physical address pins
       • But we need to scale memory independently from physical boxes
     • Scaling architecture:
       • Arbitrary scaling of the amount of data stored in DRAM
       • Arbitrary and independent scaling of the number of active users and the associated computing load
     • DRAM access times have been the limiting factor for remote communications
       • Adjust the architecture to DRAM latencies (~100 ns?)
       • Interprocess communication is slow and hard to program (latencies are in the area of ~0.5-1 ms)
       • We can do better.
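
The caching and data locality point can be demonstrated with a few lines of C: the same amount of arithmetic over the same array runs noticeably slower when every access lands on a new cache line (and usually a new page) than when memory is walked sequentially. A self-contained sketch; exact timings depend on the machine and compiler settings:

      /* Sketch: sequential vs. strided traversal of the same array.
       * The 4 KB stride lands every access on a new cache line, so the
       * core stalls on misses: the effect the slide warns about. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      #define N (64 * 1024 * 1024)      /* 64M ints, about 256 MB */
      #define STRIDE 1024               /* 1024 ints = 4096 bytes */

      static double seconds(void)
      {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec * 1e-9;
      }

      int main(void)
      {
          int *a = malloc((size_t)N * sizeof *a);
          long sum = 0;
          for (long i = 0; i < N; i++) a[i] = 1;

          double t0 = seconds();
          for (long i = 0; i < N; i++)             /* cache friendly */
              sum += a[i];
          double t1 = seconds();
          for (long s = 0; s < STRIDE; s++)        /* same work, poor locality */
              for (long i = s; i < N; i += STRIDE)
                  sum += a[i];
          double t2 = seconds();

          printf("sequential %.2fs  strided %.2fs  (sum=%ld)\n",
                 t1 - t0, t2 - t1, sum);
          free(a);
          return 0;
      }

Both loops touch every element exactly once more; only the access pattern differs, and the strided pass is typically several times slower. That stall time is what good caching and locality buy back.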

  18. Putting it together – The “Big Iron” Initiative

  19. What is “Big Iron” Technology Today?
     • Big Iron-1, a traditional cluster: extreme performance, low cost
       • 10 x 4U nodes (Intel Xeon X7560, 2.26 GHz)
       • 320 cores (640 hyper-threads), 32 cores per node
       • 5 TB memory (10 TB max), 20 TB solid state disk
       • System architecture: SAP
     • Big Iron-2, an enterprise supercomputer: extreme performance, scalability, and a much simpler system model
       • Developers can treat it as a single large physical server, with large shared coherent memory across nodes
       • 5 x 4U nodes (Intel Xeon X7560, 2.26 GHz)
       • 160 cores (320 hyper-threads), 32 cores per node
       • 5 TB memory total, 30 TB solid state disk
       • 160 GB InfiniBand interconnect per node
       • 1 NAS (72 TB, expandable to 180)
       • Scalable coherent shared memory (via ScaleMP)
       • System architecture: SAP

  20. The Next Generation Servers, a.k.a. “Big Iron”, are a reality
     • Big Iron-1 fully operational September 30, 2010, 2:30pm; all 10 Big Iron-1 nodes up and running
     • Big Iron-2 fully operational February 4, 2011; all 5 Big Iron-2 nodes up and running

  21. At a Modern Data Center
     • (Photos: the huge backup generators, the pod housing Big Iron, and the special cooling, which has won awards)

  22. Progress, Sapphire and others
     • Sapphire Type 1 Machine (Senior)
       • 32 nodes (5 racks, 2.5 tons)
       • 1024 cores
       • 16 TB of memory
       • 64 TB of solid state storage
       • 108 TB of network attached storage
     • Sapphire Type 1 Machine (Junior)
       • 1 node (1 box, on stage)
       • 80 cores
       • 1 TB of memory
       • 2 TB of SSD
       • No attached storage

  23. Coherent Shared Memory – A Much Simpler Alternative to Remote Communication
     • Uses high-speed, low latency networks (optical fiber/copper at 40 Gb/s or above)
       • Typical latencies are in the area of 1-5 μsec
       • Aggregated throughput matches CPU consumption
       • An L4 cache is needed to balance the longer latency of non-local access (cache-coherent non-uniform memory access across different physical machines)
     • Separate the data transport and cache layers into a separate tier below the operating system
       • (Almost) never seen by the application or the operating system!
     • Applications and database code can just reference data
       • The data is just “there”, i.e. it’s a load/store architecture, not network datagrams (see the sketch below)
       • Application level caches are possibly not necessary; the system does this for you (?)
       • Streaming of query results is simplified; the L4 cache schedules the read operations for you
       • Communication is much lighter weight: data is accessed directly and thread calls are simple and fast (higher quality through less code)
       • Application designers do not confront communications protocol design issues
       • Parallelization of analytics and combining simulation with data are far simpler, enabling powerful new business capabilities of mixed analytics and decision support at scale
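
A minimal sketch of the load/store versus datagram contrast drawn above. The names (Record, price_shared, price_remote) are invented for illustration and are not from the talk; the point is only that the shared-memory version is an ordinary pointer dereference while the message-passing version is a protocol the application must own:

      /* Illustrative contrast only; these are not APIs from the talk. */
      #include <sys/socket.h>

      typedef struct { long key; double value; } Record;

      /* (a) Coherent shared memory: the record may physically live on
       *     another node, but the program just loads it; the memory
       *     system moves the cache lines, with no protocol code here. */
      double price_shared(const Record *table, long i)
      {
          return table[i].value;        /* a plain load, possibly remote */
      }

      /* (b) Message passing: the same lookup becomes a request/response
       *     over a socket, with a round trip and error handling that the
       *     application has to own (return values ignored for brevity). */
      double price_remote(int sock, long i)
      {
          double v = 0.0;
          send(sock, &i, sizeof i, 0);  /* ship the key to the data's owner */
          recv(sock, &v, sizeof v, 0);  /* wait for the answer              */
          return v;
      }

The latency numbers on these slides make the asymmetry concrete: roughly 100 ns to a few microseconds for the load, versus 0.5-1 ms for the message round trip.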

  24. Software implications of Coherent Shared Memory
     • On application servers:
       • Access to the database can be “by reference” (see the sketch below)
       • No caches on the application server side; the application can refer to database query results, including metadata, master data, etc., within the database process
       • Caches are handled by “hardware” and are guaranteed to be coherent
       • A lean and fast application server for in-memory computing
     • On database code:
       • A physically distributed database can have consistent DRAM-like latencies
       • Database programmers can focus on database problems
       • Data replication and sharding are handled by touching the data; the L4 cache does the distribution automatically
     • In fact, do we need to separate application servers from database servers at all?
       • No lock-in to fixed machine configurations or clock rates
       • No need to make app-level design tradeoffs between communications and memory access
       • One OS instance, one security model, easier to manage
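
A hedged sketch of what “access by reference” could look like when the application server and database share one coherent address space; ResultView, Column and query_scan are invented names for illustration, not SAP APIs:

      /* With one coherent address space, the "application server" can hold
       * a view onto the database's own storage instead of a copied,
       * separately cached result set. Illustrative names only. */
      #include <stddef.h>

      typedef struct { const double *values; size_t nrows; } Column;
      typedef struct { const Column *col; const size_t *row_ids; size_t n; } ResultView;

      /* The database side hands back pointers into its storage, not a copy. */
      ResultView query_scan(const Column *col, const size_t *matching, size_t n)
      {
          ResultView v = { col, matching, n };   /* "by reference": nothing serialized */
          return v;
      }

      /* The application side reads results directly; coherence is the
       * hardware's job, so no application-level cache is kept here. */
      double sum_result(ResultView v)
      {
          double s = 0.0;
          for (size_t i = 0; i < v.n; i++)
              s += v.col->values[v.row_ids[i]];
          return s;
      }

Whether application-level caches can really be dropped is left as an open question on the slide; the sketch only shows that nothing in this programming model forces a copy.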

  25. But what about…
     • Distributed transactions? We don’t need no stinkin’ distributed transactions!
     • What about traditional relational databases? In the future, databases become data structures! (Do I really believe this? Well, not really. Just wanted to make the point.)
     • Why not build a mainframe? From a SW perspective, it is a mainframe (with superior price/performance for in-memory computing)
     • Is it virtualization? In traditional virtualization, you take multiple virtual machines and multiplex them onto the same physical hardware. We’re taking physical hardware instances and running them on a single virtual instance.

  26. A simplified programming model and landscape
     • Current view: virtual machines 1 through N multiplexed onto a single physical machine
     • Emerging view: physical machines 1 through M presented as a single virtual machine (single OS instance)

  27. The business benefits of coherent shared memory
     • Improved in-memory computing performance at dramatically lower cost
     • The ability to build high performance “mainframe-like” computing systems from commodity cluster server components
     • The ability to scale memory capacity in a more in-memory-computing-friendly fashion
     • A simplified software system landscape, using a system architecture that can be made invisible to application software
     • Minimal changes to applications:
       • Applications can scale seamlessly without changes to application code or additional programming effort
       • Traditional cluster-based parallel processing requires specialized programming skills held only by top developers; this constrains an organization’s ability to scale development for in-memory computing
       • With coherent shared memory, the bulk of developers can develop as they do today and, unlike today, let the underlying hardware and lower level software handle some of the resource allocation
       • Existing, standard NUMA capabilities in Linux remain available

  28. Big Iron Research
     • Next generation server architectures for in-memory computing
     • Processor architectures
     • Interconnect technology
     • Systems software for coherent memory
     • OS considerations
     • Storage
     • Fast data load utilities
     • Reliability and failover
     • Performance engineering for enterprise computing loads
     • Price/performance analysis
     • Private and public cloud approaches
     • And more

  29. Some experiments that we have run
     • Industry standard system benchmarks
     • Evaluation of shared memory program scalability
     • Scalability of “Big Data” analytics (just beginning)

  30. Industry Standard Benchmarks
     • STREAM (the triad kernel is sketched below):
       • Large memory vector processing
       • Big Iron would be the 12th highest recorded test
       • Nearly linear in system size
       • Measures scalability of memory access
     • SPECjbb:
       • Highly concurrent Java transactions
       • Big Iron would be the 13th highest recorded test
       • More computational than STREAM
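
For reference, the “triad” kernel at the heart of the STREAM benchmark mentioned above is just the loop below; STREAM times repeated runs of loops like this over arrays much larger than cache and reports the sustained memory bandwidth achieved, which is why near-linear scaling with system size is the interesting result here:

      /* The STREAM "triad" kernel: a bandwidth-bound vector update.
       * Each iteration moves 24 bytes (two double reads, one write),
       * so bandwidth is 24*n bytes divided by the loop time. */
      #include <stddef.h>

      void triad(double *a, const double *b, const double *c,
                 double scalar, size_t n)
      {
          for (size_t i = 0; i < n; i++)
              a[i] = b[i] + scalar * c[i];
      }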

  31. To sum up
     • Cloud computing has emerged
     • The businesses of tomorrow will function in real time. Really!
     • There is a growing computing technology gap
     • There is an unprecedented wave of advances in HW
     • The “mainframe” has been reincarnated with its attendant benefits, but it can be made from commodity parts
     • Enterprise supercomputers leveraging coherent shared memory principles and the wave of SW+HW advances will enable this new reality

  32. Questions? • Contact information: • Ike Nassi • Executive VP and Chief Scientist • SAP Palo Alto • Email: ike.nassi@sap.com
