
High Productivity Computing: Taking HPC Mainstream






Presentation Transcript


  1. High Productivity Computing: Taking HPC Mainstream Lee Grant, Technical Solutions Professional, High Performance Computing, leegrant@microsoft.com

  2. Challenge: High Productivity Computing “Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users… A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems.” 2004 High-End Computing Revitalization Task Force Office of Science and Technology Policy, Executive Office of the President

  3. X64 Server

  4. The Data Pipeline

  5. Free Lunch Is Over For Traditional Software [Chart: operations per second for serial code rises with single-core clock speed (3 GHz, 6 GHz, 12 GHz, 24 GHz on one core), the "free lunch" for traditional software; the multi-core era instead offers 3 GHz at 2, 4 and 8 cores, where the additional operations per second are available only to code that can take advantage of concurrency.] No free lunch for traditional software: without highly concurrent software it won't get any faster.

  6. Microsoft's Vision for HPC "Provide the platform, tools and broad ecosystem to reduce the complexity of HPC by making parallelism more accessible to address future computational needs."
Reduced Complexity • Ease deployment for larger-scale clusters • Simplify management for clusters of all scales • Integrate with existing infrastructure
Mainstream HPC • Address needs of traditional supercomputing • Address emerging cross-industry computation trends • Enable non-technical users to harness the power of HPC
Developer Ecosystem • Increase the number of parallel applications and codes • Offer a choice of parallel development tools, languages and libraries • Drive a larger universe of developers and ISVs

  7. Microsoft HPC++ Solution • Application Benefits: the most productive distributed application development environment • Cluster Benefits: a complete HPC cluster platform integrated with the enterprise infrastructure • System Benefits: a cost-effective, reliable and high-performance server operating system

  8. Windows HPC Server 2008
Job Scheduling • Integrated security via Active Directory • Support for batch, interactive and service-oriented applications • High-availability scheduling • Interoperability via OGF's HPC Basic Profile
Systems Management • Rapid large-scale deployment and built-in diagnostics suite • Integrated monitoring, management and reporting • Familiar UI and rich scripting interface
MPI • MS-MPI stack based on the MPICH2 reference implementation • Performance improvements for RDMA networking and multi-core shared memory • MS-MPI integrated with Windows Event Tracing
Storage • Access to SQL, Windows and Unix file servers • Key parallel file server vendor support (GPFS, Lustre, Panasas) • In-memory caching options

  9. • Group compute nodes based on hardware, software and custom attributes, and act on groupings • Pivoting enables correlating nodes and jobs together • Track long-running operations and access operation history • Receive alerts for failures • View the cluster at a glance as a list or heat map

  10. Demo: Integrated Job Scheduling

  11. Demo: Node/Socket/Core Allocation Windows HPC Server can help your application make the best use of multi-core systems. [Diagram: two nodes, each with four sockets (S0-S3) of four cores (P0-P3), showing how jobs J1, J2 and J3 are placed across sockets and cores.] • J1: /numsockets:3 /exclusive:false • J2: /numnodes:1 • J3: /numcores:4 /exclusive:false
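The same placement requests can also be made programmatically. A minimal sketch, assuming the Microsoft.Hpc.Scheduler API shown on the next slide; the head-node name, executable path and socket-count properties are assumptions patterned on that API rather than taken from the deck:

    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    class SocketAllocationSketch
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("headnode");                 // placeholder head-node name

            // Mirror J1 above: allocate by socket, three of them, non-exclusive.
            ISchedulerJob job = scheduler.CreateJob();
            job.UnitType = JobUnitType.Socket;             // sockets, not cores or nodes
            job.MinimumNumberOfSockets = 3;
            job.MaximumNumberOfSockets = 3;
            job.IsExclusive = false;

            ISchedulerTask task = job.CreateTask();
            task.CommandLine = @"\\share\myapp.exe";       // placeholder executable
            job.AddTask(task);

            scheduler.SubmitJob(job, null, null);          // null credentials prompt the user
        }
    }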

  12. Job submission: three methods • Programmatic: support for C++ and .NET languages • Command line • Web interface: Open Grid Forum "HPC Basic Profile"

Programmatic (C#):

    using Microsoft.Hpc.Scheduler;
    class Program
    {
        static void Main()
        {
            IScheduler store = new Scheduler();
            store.Connect("localhost");
            ISchedulerJob job = store.CreateJob();
            job.AutoCalculateMax = true;
            job.AutoCalculateMin = true;
            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "ping 127.0.0.1 -n *";   // * is replaced by the sweep index
            task.IsParametric = true;                   // parametric sweep task
            task.StartValue = 1;
            task.EndValue = 10000;
            task.IncrementValue = 1;
            task.MinimumNumberOfCores = 1;
            task.MaximumNumberOfCores = 1;
            job.AddTask(task);
            store.SubmitJob(job, @"hpc\user", "p@ssw0rd");
        }
    }

Command line:

    job submit /headnode:Clus1 /numprocessors:124 /nodegroup:Matlab
    job submit /corespernode:8 /numnodes:24
    job submit /failontaskfailure:true /requestednodes:N1,N2,N3,N4
    job submit /numprocessors:256 mpiexec \\share\mpiapp.exe

Complete PowerShell system-management commands are available as well.

  13. Scheduling MPI jobs

    job submit /numprocessors:7800 mpiexec hostname

Start time: 1 second; completion time: 27 seconds.

  14. NetworkDirect: a new RDMA networking interface built for speed and stability [Diagram: user-mode and kernel-mode networking stacks. A socket-based app traverses Windows Sockets (Winsock + WSD), TCP/IP and NDIS miniport drivers to reach TCP/Ethernet networking hardware; an MPI app goes from MS-MPI through a NetworkDirect or WinSock Direct provider and a user-mode access layer straight to RDMA networking hardware, bypassing the kernel. Components are labeled as application, CCP, OS and IHV pieces.]
• Verbs-based design for close fit with native, high-performance networking interfaces
• Equal to hardware-optimized stacks for MPI micro-benchmarks
• 2 µs latency, 2 GB/s bandwidth on ConnectX
• The OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols

  15. Windows Compute Cluster Server 2003:
• Spring 2006, NCSA, #130: 896 cores, 4.1 TF
• Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8% efficiency
• Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% efficiency
Windows HPC Server 2008:
• Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% efficiency
• Spring 2008, Umeå, #40: 5376 cores, 46 TF, 85.5% efficiency
• Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5% efficiency
A 30% efficiency improvement.

  16. November 2008 Top500

  17. Customers
"It is important that our IT environment is easy to use and support. Windows HPC is improving our performance and manageability." -- Dr. J.S. Hurley, Senior Manager, Head Distributed Computing, Networked Systems Technology, The Boeing Company
"Ferrari is always looking for the most advanced technological solutions and, of course, the same applies for software and engineering. To achieve industry-leading power-to-weight ratios, reductions in gear-change times, and revolutionary aerodynamics, we can rely on Windows HPC Server 2008. It provides a fast, familiar, high performance computing platform for our users, engineers and administrators." -- Antonio Calabrese, Responsabile Sistemi Informativi (Head of Information Systems), Ferrari
"Our goal is to broaden HPC availability to a wider audience than just power users. We believe that Windows HPC will make HPC accessible to more people, including engineers, scientists, financial analysts, and others, which will help us design and test products faster and reduce costs." -- Kevin Wilson, HPC Architect, Procter & Gamble

  18. “We are very excited about utilizing the Cray CX1 to support our research activities,” said Rico Magsipoc, Chief Technology Officer for the Laboratory of Neuro Imaging.  “The work that we do in brain research is computationally intensive but will ultimately have a huge impact on our understanding of the relationship between brain structure and function, in both health and disease.  Having the power of a Cray supercomputer that is simple and compact is very attractive and necessary, considering the physical constraints we face in our data centers today.”  

  19. Porting Unix Applications • Windows Subsystem for UNIX Applications: a complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts and compilers • Visual Studio extensions for debugging POSIX applications • Support for 32- and 64-bit applications • Recent port of the WRF weather model: 350K lines of Fortran 90 and C using MPI and OpenMP; traditionally developed for Unix HPC systems; two dynamical cores and a full range of physics options • Porting experience: fewer than 750 lines of code changed, all in makefiles and scripts; level of effort similar to a port to any new version of UNIX; performance on par with Linux systems • India Interoperability Lab, MTC Bangalore: industry solutions for interop jointly with partners; HPC utility computing architecture; open-source applications on HPC Server 2008 (NAMD, DL_POLY, GROMACS)

  20. High Productivity Modeling
.NET Framework • LINQ: language-integrated query • Dynamic Language Runtime • Fx/JIT/GC improvements • Native support for web services
Languages/Runtimes • C++, C#, VB • F#, Python, Ruby, JScript • Fortran (Intel, PGI) • OpenMP, MPI
Team Development • Team portal: version control, scheduled builds, bug tracking • Test and stress generation • Code analysis, code coverage • Performance analysis
IDE • Rapid application development • Parallel debugging • Multiprocessor builds • Workflow design

  21. MSFT || Computing Technologies [Diagram: Microsoft parallel-computing technologies arranged along two axes: task concurrency vs. data parallelism, and local vs. distributed/cloud computing.]
Task concurrency, local: IFx/CCR, Maestro, TPL/PPL, WF, WCF
Task concurrency, distributed: MPI/MPI.NET, Cluster-TPL, Cluster SOA
Data parallelism, local: PLINQ, OpenMP, TPL/PPL, CDS
Data parallelism, distributed: Cluster-PLINQ
Example workloads: robotics-based manufacturing assembly lines, the Silverlight Olympics viewer, automotive control systems, Internet-based photo services (task concurrency); ultrasound imaging equipment, media encode/decode, image processing and enhancement, data visualization (data parallelism); enterprise search, OLTP, collaboration, animation/CGI rendering, weather forecasting, seismic monitoring, oil exploration (distributed/cloud)

  22. [Diagram: head nodes support SOA functionality via WCF brokers; compute nodes each perform UDF (user-defined function) tasks as called from the WCF broker.]
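As a concrete illustration of the pattern in the diagram, below is a minimal WCF sketch of the kind of UDF service a compute node could host for the broker to call; the contract name and operation are hypothetical, not from the deck:

    using System.ServiceModel;

    // Hypothetical contract for a user-defined function (UDF) service.
    // The WCF broker on the head node routes client calls to instances
    // of this service running on the compute nodes.
    [ServiceContract]
    public interface IPriceUdf
    {
        [OperationContract]
        double PriceInstrument(double spot, double strike, double volatility);
    }

    // A trivial implementation hosted on each compute node.
    public class PriceUdf : IPriceUdf
    {
        public double PriceInstrument(double spot, double strike, double volatility)
        {
            // Placeholder computation; a real UDF would do the heavy lifting here.
            return System.Math.Max(spot - strike, 0.0) * volatility;
        }
    }

The broker's job is then routing: clients hold a session with the head node while calls fan out to whichever compute nodes host the service.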

  23. SOA Broker Performance

  24. MPI.NET • Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!) • Natural expression of MPI in C# • Negligible overhead (relative to C) over TCP

    if (world.Rank == 0)
        world.Send("Hello, World!", 1, 0);
    else
    {
        string msg = world.Receive<string>(0, 0);
    }

    string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);

    double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0) / totalDartsThrown;
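The fragments above drop into a very small complete program. A minimal sketch of the send/receive pattern, assuming the MPI.NET assembly is referenced and the program is launched under mpiexec:

    using MPI;

    class MpiHello
    {
        static void Main(string[] args)
        {
            // Initializes MPI on entry and finalizes it on dispose.
            using (new MPI.Environment(ref args))
            {
                Intracommunicator world = Communicator.world;
                if (world.Rank == 0)
                {
                    // Rank 0 sends a greeting to rank 1 with message tag 0.
                    world.Send("Hello, World!", 1, 0);
                }
                else if (world.Rank == 1)
                {
                    // Rank 1 receives the greeting from rank 0.
                    string msg = world.Receive<string>(0, 0);
                    System.Console.WriteLine("Rank 1 got: " + msg);
                }
            }
        }
    }

Run with, for example: mpiexec -n 2 MpiHello.exe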

  25. Demo: Allinea DDT Visual Studio Debugger Add-in

  26. Parallel Extensions to .NET • Declarative data parallelism (PLINQ) • Imperative data and task parallelism (TPL) • Data structures and coordination constructs

    var q = from n in names.AsParallel()
            where n.Name == queryInfo.Name && n.State == queryInfo.State
                  && n.Year >= yearStart && n.Year <= yearEnd
            orderby n.Year ascending
            select n;

    Parallel.For(0, n, i => { result[i] = compute(i); });
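A self-contained sketch of both styles as they later shipped in .NET 4; the Compute delegate and the data are illustrative:

    using System;
    using System.Linq;
    using System.Threading.Tasks;

    class ParallelExtensionsDemo
    {
        static int Compute(int i) { return i * i; }   // stand-in for real work

        static void Main()
        {
            // Imperative data parallelism: the index range is partitioned across cores.
            int n = 1000;
            var result = new int[n];
            Parallel.For(0, n, i => { result[i] = Compute(i); });

            // Declarative data parallelism: the query itself runs across cores.
            var evens = from x in result.AsParallel()
                        where x % 2 == 0
                        orderby x
                        select x;
            Console.WriteLine("Even squares: {0}", evens.Count());
        }
    }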

  27. Example: Tree Walk

Sequential:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        ProcessNode(tree.Left, action);
        ProcessNode(tree.Right, action);
        action(tree.Data);
    }

Thread pool:

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Stack<Tree<T>> nodes = new Stack<Tree<T>>();
        Queue<T> data = new Queue<T>();
        nodes.Push(tree);
        while (nodes.Count > 0)
        {
            Tree<T> node = nodes.Pop();
            data.Enqueue(node.Data);
            if (node.Left != null) nodes.Push(node.Left);
            if (node.Right != null) nodes.Push(node.Right);
        }
        using (ManualResetEvent mre = new ManualResetEvent(false))
        {
            int waitCount = Environment.ProcessorCount;
            WaitCallback wc = delegate
            {
                bool gotItem;
                do
                {
                    T item = default(T);
                    lock (data)
                    {
                        if (data.Count > 0) { item = data.Dequeue(); gotItem = true; }
                        else gotItem = false;
                    }
                    if (gotItem) action(item);
                } while (gotItem);
                if (Interlocked.Decrement(ref waitCount) == 0) mre.Set();
            };
            for (int i = 0; i < Environment.ProcessorCount - 1; i++)
            {
                ThreadPool.QueueUserWorkItem(wc);
            }
            wc(null);
            mre.WaitOne();
        }
    }

  28. Example: Tree Walk

Parallel Extensions (with Task):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Task t = Task.Create(delegate { ProcessNode(tree.Left, action); });
        ProcessNode(tree.Right, action);
        action(tree.Data);
        t.Wait();
    }

Parallel Extensions (with Parallel):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        if (tree == null) return;
        Parallel.Do(
            () => ProcessNode(tree.Left, action),
            () => ProcessNode(tree.Right, action),
            () => action(tree.Data));
    }

Parallel Extensions (with PLINQ):

    static void ProcessNode<T>(Tree<T> tree, Action<T> action)
    {
        tree.AsParallel().ForAll(action);
    }

(Task.Create and Parallel.Do are from the CTP-era Parallel Extensions; they shipped in .NET 4 as Task.Factory.StartNew and Parallel.Invoke.)

  29. F# is... ...a functional, object-oriented, imperative and explorative programming language for .NET

  30. Interactive F# Shell

    C:\fsharpv2> bin\fsi
    MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
    F# Version 1.9.2.9, compiling for .NET Framework Version v2.0.50727

    NOTE: See 'fsi --help' for flags
    NOTE: Commands: #r <string>;;       reference (dynamically load) the given DLL.
    NOTE:           #I <string>;;       add the given search path for referenced DLLs.
    NOTE:           #use <string>;;     accept input from the given file.
    NOTE:           #load <string> ... <string>;; load the given file(s) as a compilation unit.
    NOTE:           #time;;             toggle timing on/off.
    NOTE:           #types;;            toggle display of types on/off.
    NOTE:           #quit;;             exit.
    NOTE: Visit the F# website at http://research.microsoft.com/fsharp.
    NOTE: Bug reports to fsbugs@microsoft.com. Enjoy!

    > let rec f x = (if x < 2 then x else f (x-1) + f (x-2));;
    val f : int -> int

    > f 6;;
    val it : int = 8

  31. Example: Taming Asynchronous I/O — processing 200 images in parallel

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;

        // ProcessImage has a simple O(N) loop, and you can vary the number
        // of times you repeat that loop to make the application more CPU-
        // bound or more IO-bound.
        public static int processImageRepeats = 20;

        // Threads must decrement NumImagesToFinish, and protect
        // their access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();

            // Now write out the image.
            // Using asynchronous I/O here appears not to be best practice.
            // It ends up swamping the threadpool, because the threadpool
            // threads are blocked on I/O requests that were just queued to
            // the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();

            // This application model uses too much memory.
            // Releasing memory as soon as possible is a good idea,
            // especially global state.
            state.pixels = null;
            fs = null;

            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images...  ");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback = new AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels, readImageCallback, state);
            }

            // Determine whether all images are done being processed.
            // If not, block until all are finished.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    " Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms", (t1 - t0));
        }
    }

  32. Example: Taming Asynchronous I/O — equivalent F# code (same performance)

    let ProcessImageAsync (i) =
        async { // Open the file synchronously.
                let inStream = File.OpenRead(sprintf "source%d.jpg" i)
                // Read from the file, asynchronously.
                let! pixels = inStream.ReadAsync(numPixels)
                let pixels' = TransformImage(pixels, i)
                let outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // Write the result asynchronously.
                do! outStream.WriteAsync(pixels')
                do Console.WriteLine "done!" }

    let ProcessImagesAsync () =
        // Generate the tasks and queue them in parallel.
        Async.Run (Async.Parallel [ for i in 1 .. numImages -> ProcessImageAsync(i) ])

  33. The Coming of Accelerators

  34. Current Offerings

    Vendor     High-level     Libraries                  Native          Hardware
    Microsoft  Accelerator    D3DX, DaVinci, FFT, Scan   Compute Shader  Any processor
    AMD        Brook+         ACML-GPU                   CAL             AMD CPU or GPU
    nVidia     RapidMind      cuFFT, cuBLAS, cuPP        CUDA            nVidia GPU
    Intel      Ct             MKL++                      LRB Native      Intel CPU, Larrabee
    Apple      Grand Central  CoreImage, CoreAnim        OpenCL          Any processor

  35. DirectX11 Compute Shader • A new processing model for GPUs • Integrated with Direct3D • Supports more general constructs • Enables more general data structures • Enables more general algorithms • Image/post processing: image reduction, histogram, convolution, FFT • Video transcode, super-resolution, etc. • Effect physics: particles, smoke, water, cloth, etc. • Ray-tracing, radiosity, etc. • Gameplay physics, AI

  36. FFT Performance Example • Complex 1024x1024 2-D FFT:

    Software          42 ms    6 GFlops
    Direct3D9         15 ms    17 GFlops   3x
    CUFFT             8 ms     32 GFlops   5x
    Prototype DX11    6 ms     42 GFlops   6x
    Latest chips      3 ms     100 GFlops

• Shared register space and random-access writes enable ~2x speedups

  37. IMSL .NET Numerical Library • Variances, Covariances and Correlations • Multivariate Analysis • Analysis of Variance • Time Series and Forecasting • Distribution Functions • Random Number Generation • Nonlinear Equations • Optimization • Basic Statistics • Nonparametric Tests • Goodness of Fit • Regression • Linear Algebra • Eigensystems • Interpolation and Approximation • Quadrature • Differential Equations • Transforms

  38. Integrate, Analyze, Report, Research
• Integrate: data acquisition from source systems and integration; data transformation and synthesis
• Analyze: data enrichment, with business logic and hierarchical views; data discovery via data mining
• Report: data presentation and distribution; data access for the masses

  39. Data Browsing with Excel [Charts: monthly mean, weekly mean and annual mean. Courtesy Catherine van Ingen, MSR]

  40. Data Mining with Excel • Integrated algorithms: • Text mining • Neural nets • Naïve Bayes • Time series • Sequence clustering • Decision trees • Association rules
