Explore the evolution of Windows HPC, HPC trends, computational finance, parallel computing growth, and Microsoft's vision for productivity computing. Learn about scalability, productivity, and integration with existing tools.
High Performance and Productivity Computing with Windows HPC Phil Pennington Windows HPC Microsoft Corporation
Supercomputing Reached the Petaflop: IBM RoadRunner at Los Alamos National Lab
HPC at Microsoft • 2004 Windows HPC team established • 2005 Windows Server 2003 SP1 x64 • 2005 Microsoft launches HPC entry at SC’05 in Seattle with Bill Gates keynote • 2006 Windows Compute Cluster Server 2003 ships • 2007 Microsoft named one of the Top 5 companies to watch in HPC at SC’07 • 2008 Windows HPC Server 2008
• Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% • Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5% • Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5% • Spring 2006, NCSA, #130: 896 cores, 4.1 TF • Winter 2005, Microsoft: 4 procs, 9.46 GFlops • Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8% • Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% (30% efficiency improvement) • Legend: Windows HPC Server 2008, Windows Compute Cluster 2003
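The slide does not define its percentages. Assuming they are the ratio of achieved LINPACK performance to theoretical peak, the short calculation below checks that the two Microsoft 2048-core entries imply the same peak and that the relative change between them is close to the quoted 30%. The helper names are illustrative only.

```python
# Assumption: "efficiency" = achieved LINPACK TFLOPS / theoretical peak TFLOPS.
# The figures are the two Microsoft 2048-core entries from the slide.

def implied_peak(achieved_tf, efficiency_pct):
    """Theoretical peak implied by an achieved rate and an efficiency."""
    return achieved_tf / (efficiency_pct / 100.0)

def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

spring_2007 = {"tf": 9.0, "eff": 58.8}
fall_2007 = {"tf": 11.8, "eff": 77.1}

peak_before = implied_peak(spring_2007["tf"], spring_2007["eff"])
peak_after = implied_peak(fall_2007["tf"], fall_2007["eff"])
gain_tf = relative_gain(spring_2007["tf"], fall_2007["tf"])
gain_eff = relative_gain(spring_2007["eff"], fall_2007["eff"])

print(f"implied peak: {peak_before:.1f} TF vs {peak_after:.1f} TF")
print(f"throughput gain: {gain_tf:.0f}%, efficiency gain: {gain_eff:.0f}%")
```

Under this reading both entries imply a peak of roughly 15.3 TF, which is what one would expect if the same hardware was measured twice under different software.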
HPC Clusters in Every Lab X64 Server
Explosion of Data • Experiments • Simulations • Archives • Literature • Petabytes, doubling every 2 years
The Data Pipeline Courtesy Catherine van Ingen, MSR
New Breed of HPC: Computational Finance • Modern finance differentiates by the quality, breadth and rapidity of building internal models of global markets and executing on them profitably • Very large datasets (tens of TB), changing daily and moving toward real time • Tick-by-tick data, yield curves, past trades and closing prices, fundamental data, news, video • Overnight and real-time computation • Finding patterns, building trading strategies, backtesting, portfolio optimization, derivatives pricing, risk simulation for thousands of scenarios • HPC grids growing to tens of thousands of nodes • Data is moving from databases to scale-out caches • Enterprise management, security, policy and accounting requirements • Extreme developer productivity requirements • Develop, test and deploy models in production in DAYS • Scale to tens of thousands of cores • Usable by thousands of domain experts, not parallel-programming wizards
Parallelism Everywhere • Today’s Architecture: Heat becoming an unmanageable problem! (Chart: power density in W/cm2, log scale from 1 to 10,000, by year from ‘70 to ‘10, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface.) • To Grow, To Keep Up, We Must Embrace Parallel Computing (Chart: GOPS from 2004 to 2015, many-core peak parallel GOPS against single-threaded performance growing 10% per year; parallelism opportunity 80X.) • “… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations.” Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation. Intel Developer Forum, Spring 2004, February 19, 2004
Challenge: High Productivity Computing “Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users… A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems.” 2004 High-End Computing Revitalization Task Force, Office of Science and Technology Policy, Executive Office of the President
Microsoft’s Productivity Vision Windows HPC allows you to accomplish more, in less time, with reduced effort by leveraging users’ existing skills and integrating with the tools they are already using. Administrator: • Integrated Turnkey Solution • Simplified Setup and Deployment • Built-In Diagnostics • Efficient Cluster Utilization • Integrates with IT Infrastructure and Policies Application Developer: • Highly Productive Parallel Programming Frameworks • Service-Oriented HPC Applications • Support for Key HPC Development Standards • Unix Application Migration End-User: • Seamless Integration with Workstation Applications • Integrated Collaboration and Workflow Solutions • Secure Job Execution and Data Access • World-class Performance
Windows HPC Server 2008 • Complete, integrated platform for computational clustering • Built on top of the proven Windows Server 2008 platform • Integrated development environment • Available at http://www.microsoft.com/hpc
Windows HPC Server 2008 Job & Resource Scheduling: • Integrated security via Active Directory • Support for batch, interactive and service-oriented applications • High availability scheduling • Interoperability via OGF’s HPC Basic Profile Systems Management: • Rapid large scale deployment and built-in diagnostics suite • Integrated monitoring, management and reporting • Familiar UI and rich scripting interface HPC Application Models: • MS-MPI stack based on MPICH2 reference implementation • Performance improvements for RDMA networking and multi-core shared memory • MS-MPI integrated with Windows Event Tracing Storage: • Access to SQL, Windows and Unix file servers • Key parallel file server vendor support (GPFS, Lustre, Panasas) • In-memory caching options
Typical HPC Cluster Topology (Diagram labels: Corporate IT Infrastructure with Systems Management, Windows Update, Monitoring, AD, DNS, DHCP; Public Network; Admin / User Cons; Head Node with Job Scheduler, Management, WDS, NAT, MPI; Compute Nodes, each with Node Manager, MPI, Management; Private Network; MPI Network; Compute Cluster.)
Job Scheduler Architecture Compute Nodes Job Validation Resource Allocation Resource Controller Admins Scheduler Store Users
Submitting a job on 9472 cores • Start time < 2 seconds
Id : 584
JobTemplate : Default
Priority : Normal
JobType : Batch
NodeGroups :
OrderBy :
State : Finished
Name :
UserName : CCE\jeffb
Project :
RequestedNodes :
ResourceRequest : 9472-9472 cores
MinMemory :
MaxMemory :
AllocatedNodes :
SubmitTime : 4/1/2008 10:51:53 PM
StartTime : 4/1/2008 10:51:54 PM
EndTime : 4/1/2008 10:58:58 PM
PendingReason :
ChangeTime : 4/1/2008 10:58:58 PM
Wait time : 00:00:00:00
Elapsed time : 00:00:07:04
ErrorMessage :
RequeueCount : 0
TaskCount : 1
ConfiguringTaskCount : 0
QueuedTaskCount : 0
RunningTaskCount : 0
FinishedTaskCount : 1
FailedTaskCount : 0
CanceledTaskCount : 0
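The "start time < 2 seconds" claim can be checked directly from the timestamps in the listing. The sketch below recomputes the wait and elapsed times; the format string is an assumption about how the console rendered the dates (US month/day order, 12-hour clock).

```python
from datetime import datetime

# Timestamps copied from the job listing; FMT is an assumption about
# how they were rendered (month/day/year, 12-hour clock with AM/PM).
FMT = "%m/%d/%Y %I:%M:%S %p"

submit = datetime.strptime("4/1/2008 10:51:53 PM", FMT)
start = datetime.strptime("4/1/2008 10:51:54 PM", FMT)
end = datetime.strptime("4/1/2008 10:58:58 PM", FMT)

wait = start - submit      # time between submission and start
elapsed = end - start      # run time of the job

print("wait:", wait, "elapsed:", elapsed)
```

The elapsed value agrees with the listing's 00:00:07:04, and the one-second gap between submission and start is consistent with the sub-two-second start time stated on the slide.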
Placement via Job Context: Node Grouping, Job Templates, Filters • Application Aware: an ISV application (requires nodes where the application is installed) • Multi-threaded application (requires machine with many cores) • Capacity Aware: a big model (requires large-memory machines) • NUMA Aware: 4-way Structural Analysis MPI Job (Diagram: example placements across MATLAB nodes, a quad-core machine with cores C0 to C3 and a 32-core machine with processors P0 to P3, memory and IO.)
Node/Socket/Core Allocation • Windows HPC Server can help your application make the best use of multi-core systems • J1: /numsockets:3 /exclusive:false • J2: /numnodes:1 • J3: /numcores:4 /exclusive:false (Diagram: Node 1 and Node 2, each with sockets S0 to S3 of cores P0 to P3, showing where J1, J2 and J3 are placed.)
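The three example requests differ only in their allocation unit. As a rough illustration, the sketch below counts how many cores each request covers, assuming nodes with 4 sockets of 4 cores each as the diagram's S0 to S3 and P0 to P3 labels suggest. `cores_reserved` is a hypothetical helper, not part of the scheduler's API.

```python
# Hypothetical helper, not part of the HPC scheduler API.
# Assumption: each node has 4 sockets (S0-S3) of 4 cores (P0-P3),
# as the slide's diagram suggests.
SOCKETS_PER_NODE = 4
CORES_PER_SOCKET = 4

def cores_reserved(unit, count):
    """Cores covered by a request of `count` units of the given kind."""
    cores_per_unit = {
        "core": 1,
        "socket": CORES_PER_SOCKET,
        "node": SOCKETS_PER_NODE * CORES_PER_SOCKET,
    }
    return count * cores_per_unit[unit]

requests = {
    "J1": ("socket", 3),  # /numsockets:3
    "J2": ("node", 1),    # /numnodes:1
    "J3": ("core", 4),    # /numcores:4
}

for job, (unit, count) in requests.items():
    print(job, cores_reserved(unit, count), "cores")
```

The sketch ignores the /exclusive flag, which governs whether other jobs may share the remaining cores on a node rather than how many cores are requested.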
Single Management Console • Group compute nodes based on hardware, software and custom attributes; act on groupings • Pivoting enables correlating nodes and jobs together • Track long running operations and access operation history • Receive alerts for failures • List or Heat Map view: cluster at a glance
Evolving HPC Application Support • V1 (focusing on batch applications), Job Scheduler running App.exe instances: Resource allocation, Process Launching, Resource usage tracking, Integrated MPI execution, Integrated Security • V2 (focusing on interactive applications), adds WCF Service Broker running Service (DLL) instances: WS Virtual Endpoint Reference, Request load balancing, Integrated Service activation, Service life time management, Integrated WCF Tracing
HPC + WCF Services Compute Scenario 1. User submits job. 2. Session Manager starts WCF Broker job and WCF Service job for client. 3. Requests 4. Requests 5. Responses 6. Responses (Diagram: Workstation, Head Node, WCF Broker Nodes, Compute Nodes.)
Head Node Job Mgmt Cluster Mgmt Scheduling Resource Mgmt Jobs Scheduler Results Compute Node Job Execution User App MPI Service Oriented HPC + WCF Integrated Solutions UDF UDF UDF UDF UDF UDF UDF UDF
HPC + WCF Programming Model
Sequential:
for (i = 0; i < 100000000; i++) { r[i] = worker.DoWork(dataSet[i]); } reduce(r);
Parallel:
Session session = new Session(startInfo); PricingClient client = new PricingClient(binding, session.EndpointAddress); for (i = 0; i < 100000000; i++) { client.BeginDoWork(dataSet[i], new AsyncCallback(callback), i); } void callback(IAsyncResult handle) { r = client.EndDoWork(handle); /* aggregate results */ reduce(r); }
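The same sequential-versus-asynchronous shape can be tried locally without a cluster. The sketch below is a Python stand-in, not the HPC or WCF API: `do_work` is a hypothetical placeholder for the remote DoWork call, and a thread pool plays the role of the broker and service instances.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def do_work(item):
    """Hypothetical stand-in for the remote DoWork call: squares the input."""
    return item * item

data_set = list(range(1000))

# Sequential: one call at a time, then reduce.
sequential_total = sum(do_work(x) for x in data_set)

# Parallel: submit every request up front and aggregate in a callback,
# mirroring the BeginDoWork / EndDoWork pattern on the slide.
totals = {"parallel": 0}
lock = threading.Lock()

def callback(future):
    with lock:  # callbacks may run on different worker threads
        totals["parallel"] += future.result()

with ThreadPoolExecutor(max_workers=8) as pool:
    for x in data_set:
        pool.submit(do_work, x).add_done_callback(callback)
# Leaving the `with` block waits for all submitted work to finish.

parallel_total = totals["parallel"]
print(sequential_total, parallel_total)
```

The lock matters because results arrive in arbitrary order on several threads; the aggregation has to be order-independent, which is also what makes the slide's callback-based reduce safe to distribute.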
HPC MPI Programming Model • Traditional HPC • mpiexec communicates with each node’s MPI Service to start worker processes • Example: mpiexec -n 6 app.exe (Diagram: the job scheduler on the head node launches six processes across the compute nodes through each node’s MPI Service.)
MPI.NET • Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!) • Natural expression of MPI in C# • Negligible overhead (relative to C) over TCP
if (world.Rank == 0) { world.Send("Hello, World!", 1, 0); } else { string msg = world.Receive<string>(0, 0); }
string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);
double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0) / totalDartsThrown;
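The last line of the slide is the reduction step of a Monte Carlo estimate of pi: each rank counts darts that land inside a circle, and the counts are summed on rank 0. The sketch below simulates that in a single Python process, with one list entry per rank; it does not use MPI, and the rank count and dart count are arbitrary choices for illustration.

```python
import random
from functools import reduce

def throw_darts(rank, darts):
    """Count darts landing inside the unit quarter-circle for one rank."""
    rng = random.Random(rank)  # per-rank seed so the run is repeatable
    hits = 0
    for _ in range(darts):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

NUM_RANKS = 4            # arbitrary, for illustration
DARTS_PER_RANK = 50_000  # arbitrary, for illustration

# Each entry plays the role of one rank's local dartsInCircle value.
per_rank_hits = [throw_darts(r, DARTS_PER_RANK) for r in range(NUM_RANKS)]

# The same reduction the slide passes to comm.Reduce: (x, y) => x + y.
total_hits = reduce(lambda x, y: x + y, per_rank_hits)
total_darts = NUM_RANKS * DARTS_PER_RANK

pi_estimate = 4.0 * total_hits / total_darts
print(pi_estimate)
```

Because addition is associative and commutative, the reduction can be applied in any order across ranks, which is why a single lambda is enough to describe it.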
NetworkDirect: A new RDMA networking interface built for speed and stability • Verbs-based design for close fit with native, high-perf networking interfaces • Equal to hardware-optimized stacks for MPI micro-benchmarks • 2 usec latency, 2 GB/sec bandwidth on ConnectX • OpenFabrics driver for Windows includes support for Network Direct, Winsock Direct and IPoIB protocols (Diagram: in user mode, a socket-based app uses Windows Sockets (Winsock + WSD) and an MPI app uses MS-MPI; RDMA networking goes through the WinSock Direct Provider or NetworkDirect Provider and the User Mode Access Layer, bypassing the kernel; TCP/Ethernet networking goes through kernel-mode TCP, IP, NDIS and the mini-port driver to the hardware driver and networking hardware. Legend: (ISV) App, CCP Component, OS Component, IHV Component.)
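To see what the quoted 2 usec latency and 2 GB/sec bandwidth mean for different message sizes, the sketch below applies the common linear cost model, time = latency + size / bandwidth. The model and the 1 GB = 1e9 bytes convention are assumptions, not something the slide states.

```python
# Figures from the slide: 2 usec latency, 2 GB/sec bandwidth on ConnectX.
# Assumptions: 1 GB = 1e9 bytes, and transfer time follows the simple
# linear model  t = latency + size / bandwidth.
LATENCY_S = 2e-6
BANDWIDTH_BPS = 2e9

def transfer_time(size_bytes):
    """Modelled time in seconds to move one message of the given size."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS

def effective_bandwidth(size_bytes):
    """Bytes per second actually achieved for one message of this size."""
    return size_bytes / transfer_time(size_bytes)

# Message size at which half of the peak bandwidth is reached.
half_bandwidth_size = LATENCY_S * BANDWIDTH_BPS

for size in (8, 4_000, 1_000_000):
    print(size,
          f"{transfer_time(size) * 1e6:.2f} us",
          f"{effective_bandwidth(size) / 1e9:.2f} GB/s")
```

Under this model, messages much smaller than about 4 KB are dominated by latency and much larger ones by bandwidth, which is the distinction the later traffic-optimization notes rely on when they offer to optimize for one or the other.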
Devs can't tune what they can't see: MS-MPI integrated with Event Tracing for Windows • Single, time-correlated log of: OS, driver, MPI, and app events • CCS-specific additions • High-precision CPU clock correction • Log consolidation from multiple compute nodes into a single record of parallel app execution • Dual purpose: • Performance Analysis • Application Trouble-Shooting • Trace Data Display • Visual Studio & Windows ETW tools • Intel Collector/Analyzer • Vampir • Jumpshot
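Clock correction and log consolidation can be pictured as shifting each node's timestamps onto a common reference clock and then merging the per-node streams in time order. The sketch below illustrates that idea only; the log entries and clock offsets are invented for the example, and it says nothing about how ETW or MS-MPI actually implement the correction.

```python
import heapq

# Hypothetical per-node logs: (local_timestamp_seconds, event).
# Each node's list is already sorted by its own clock.
node_logs = {
    "node1": [(10.000, "MPI_Send start"), (10.004, "MPI_Send end")],
    "node2": [(9.503, "MPI_Recv start"), (9.505, "MPI_Recv end")],
}

# Hypothetical clock offsets: seconds to add to reach the reference clock.
clock_offset = {"node1": 0.0, "node2": 0.5}

def corrected(node):
    """Yield this node's events with timestamps on the reference clock."""
    for ts, event in node_logs[node]:
        yield (round(ts + clock_offset[node], 6), node, event)

# heapq.merge lazily merges already-sorted streams into one sorted stream.
merged = list(heapq.merge(*(corrected(n) for n in node_logs)))

for ts, node, event in merged:
    print(f"{ts:.3f} {node} {event}")
```

Without the offset, every node2 event would sort ahead of node1's send; with it, the receive lines up inside the send interval, which is the kind of cross-node ordering a consolidated trace is meant to show.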
Enables Optimization Strategies
Figure panels: • Count of machines and distinct communicating pairs • Statistical summary of counts • Statistical summary of sizes • Sender / receiver pairs, senders on vertical axis; bubble chart has bubble area proportional to size of chart • Histogram of counts • Histogram of sizes • Scatter plot of sizes (vertical axis) vs counts • Large scale problem before optimization (linpack, 2048 cores) • Large scale problem after optimization
Usage and notes: The overall idea is that we can do live logging of the communication traffic that occurs as part of an executing run, and then optimize the traffic based on either latency or bandwidth metrics. Real-world usage is: • Run your scenario with traffic analysis on • Optimize for latency or bandwidth, depending on the characteristics of the app • Save a machine file representing the changes • Rerun your task, passing -machinefile to mpiexec, and, hopefully, see things improve
Walkthrough of zipped up stuff: • Unzip to a folder • Start the health client. This takes an IP address and port, but you can use random ones as we are not doing live traffic work: Healthclient 10.1.1.1 6000 • Choose the view / view traffic menu option • Load one of the provided traffic files: Traffic_64.txt is a 64 node linpack run; Traffic_2048.txt is a 2048 node linpack run • Open the RHM menu over the traffic and you have a number of options: • Show counts and show size let you flip the UI between showing counts, sizes or both on the bubble chart • Histograms lets you flip the vertical axis on the histograms to logarithmic, which is useful when the data distributions are very uneven • Optimize For… lets you choose to optimize for latency, bandwidth or a combination of the two. The implementation here is obvious: just weighting the proportion of size and counts when calculating the final layout • SHM / Network ratio lets you set the relative speeds of your network compared to SHM. For GigE 100:1 or 1000:1 is good, for NWD it is more like 2 or 5:1 • Optimize performs the optimization (a greedy clustering algorithm currently) • View optimized / original lets you flip between optimized and non-optimized views • Once you have optimized, choose file / save machine file to save an optimized layout suitable for being passed to mpiexec.
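The notes describe the optimizer only as a greedy clustering that weights message counts against message sizes. One plausible reading is sketched below: pairs of ranks are visited from heaviest to lightest traffic, and their groups are merged while they still fit on one node. The function, its parameters, the normalization and the sample traffic are all assumptions for illustration, not the tool's actual algorithm, and the SHM / network ratio is left out.

```python
def greedy_layout(traffic, num_ranks, cores_per_node, latency_weight=0.5):
    """Group ranks onto nodes, heaviest-communicating pairs first.

    traffic maps (rank_a, rank_b) -> (message_count, total_bytes).
    latency_weight near 1 favours counts (latency-bound traffic);
    near 0 it favours bytes (bandwidth-bound traffic).
    Returns a sorted list of rank groups, one group per node.
    """
    max_count = max(count for count, _ in traffic.values())
    max_size = max(size for _, size in traffic.values())

    def weight(count, size):
        # Normalize both terms so neither unit dominates the blend.
        return (latency_weight * count / max_count
                + (1.0 - latency_weight) * size / max_size)

    group_of = {r: {r} for r in range(num_ranks)}  # rank -> its group
    heaviest_first = sorted(
        traffic, key=lambda pair: weight(*traffic[pair]), reverse=True)

    for a, b in heaviest_first:
        ga, gb = group_of[a], group_of[b]
        # Merge the two groups if they differ and still fit on one node.
        if ga is not gb and len(ga) + len(gb) <= cores_per_node:
            ga |= gb
            for r in gb:
                group_of[r] = ga

    # Several ranks share the same set object; keep each group once.
    unique = {id(g): g for g in group_of.values()}
    return sorted(sorted(g) for g in unique.values())

# Invented sample traffic: ranks 0<->2 and 1<->3 talk the most.
traffic = {
    (0, 1): (10, 100),
    (0, 2): (500, 8_000_000),
    (1, 3): (400, 6_000_000),
    (2, 3): (5, 50),
}
layout = greedy_layout(traffic, num_ranks=4, cores_per_node=2)
print(layout)
```

Each resulting group corresponds to the ranks that would share a node, which is the information a saved machine file would need to convey to mpiexec. A greedy pass like this is fast but not optimal, since an early merge can block a better grouping later.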
HPC Open Grid Forum Interoperability Cloud Services Other OS’s Thin Clients HPC client API Application ISVs Scheduling ISVs HPC Basic Profile Web Service Windows HPC Server 2008 Headnode
Resources • Windowshpc.net • www.microsoft.com/hpc • Channel9.msdn.com/shows/the+hpc+show • Edge.technet.com/tags/HPC • www.microsoft.com/science • research.microsoft.com/fsharp • www.osl.iu.edu/research/mpi.net • www.microsoft.com/msdn • www.microsoft.com/technet
© 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.