
Software Support for Advanced Computing Platforms

Software Support for Advanced Computing Platforms. Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University. ayg@cs.purdue.edu http://www.cs.purdue.edu/pdsl. Building Applications for Next Generation Computing Platforms.


Presentation Transcript


  1. Software Support for Advanced Computing Platforms Ananth Grama Professor, Computer Sciences and Coordinated Systems Lab., Purdue University. ayg@cs.purdue.edu http://www.cs.purdue.edu/pdsl

  2. Building Applications for Next Generation Computing Platforms • Emerging trends point to two disruptive technologies: • Architecture innovations from the desktop to scalable systems • Embedded intelligence and ubiquitous processing • How do we program these platforms efficiently? Very little of what we have learned over three decades of parallel programming directly applies here.

  3. Evolution of Microprocessor Architectures • Chip-Multiprocessor Architectures • Scalable Multicore Platforms • Heterogeneous Multicore Processors • Transactional Memory

  4. Multicore Architectures -- An Overview • The Myth: • Multicore processors are designed for speed. • The Reality: Multicore processors are motivated by power considerations: • Power is proportional to clock speed • Power is quadratic in Vdd • Vdd can be reduced as clock speed is reduced • Computation speed is generally sublinear in clock speed
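The power argument above can be made concrete with the usual dynamic-power model, P ≈ C · f · Vdd². A minimal sketch (all constants and voltage values are illustrative, not measured figures for any real chip):

```python
# Illustrative dynamic-power model: P = C * f * Vdd^2 (C held constant).
# Lowering clock frequency permits a lower Vdd, and because power is
# quadratic in Vdd, total power falls much faster than computation speed.
def dynamic_power(f_ghz, vdd, c=1.0):
    return c * f_ghz * vdd ** 2

single = dynamic_power(3.0, 1.2)      # one core at 3 GHz, 1.2 V
dual = 2 * dynamic_power(1.5, 0.9)    # two cores at 1.5 GHz, 0.9 V (assumed scaling)
print(single, dual)                   # the dual-core configuration draws less power
```

Even though the two slow cores offer comparable aggregate throughput, their combined power draw is well below the single fast core's, which is the motivation the slide states.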

  5. Multicore Architectures -- An Overview • Collocate multiple processor cores on a single chip (a special class of chip-multiprocessors) • Programming model is typically thread-based • Many microprocessors are hardware compatible with existing motherboards (memory performance?) • Memory systems vary widely across various vendors (AMD vs. Intel vs. IBM PowerPC/Cell)

  6. Multicore Architectures -- Trends • Current generation typically at dual- or quad-core • Desktop and mobile dual-core variants are available • Scalable multicore: AMD and Intel both plan up to 16 cores in the next two years and up to 64 cores in the medium term. • Heterogeneous multicore: some of the most commonly used processors today are heterogeneous multicore (network routers, ARM/TI DSPs in cell phones).

  7. Memory System Architecture • Trading off latency and bandwidth (the Cell solution) • Programmable caches • Transactional Memory

  8. Transactional Memory Overview • Addresses both the correctness and the performance of parallel programs. • Requires hardware support. • Mitigates many of the problems associated with locks – lack of composability, choice of granularity, and the coupling of correctness with performance.

  9. Transactional Memory Overview

Thread 1:
begin_transaction
  x = x + 1
  y = y + x
  if (x < 10) z = x; else z = y;
end_transaction

Thread 2:
begin_transaction
  x = x - 1
  y = y - x
  if (x > 10) z = x; else z = y;
end_transaction

Each thread sees either all or none of the other thread's updates. Basic mechanisms: isolation (conflict detection), versioning (maintaining versions), and atomicity (commit or rollback).
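The three mechanisms named on the slide can be emulated in a minimal Python sketch. This is not a real TM runtime: the `transaction` helper is illustrative, isolation is faked with a single global lock, and versioning is a whole-state snapshot.

```python
import threading
from contextlib import contextmanager

_txn_lock = threading.Lock()   # coarse isolation: one transaction at a time

@contextmanager
def transaction(state):
    """Sketch of TM semantics: isolation via a global lock, versioning via a
    snapshot of the shared state, atomicity via commit-or-rollback."""
    with _txn_lock:
        snapshot = dict(state)      # versioning: keep the pre-transaction copy
        try:
            yield state             # run the transaction body
        except Exception:
            state.clear()           # rollback: restore the old version
            state.update(snapshot)
            raise

state = {"x": 5, "y": 0, "z": 0}

def thread1_body():                 # mirrors Thread 1 on the slide
    with transaction(state) as s:
        s["x"] += 1
        s["y"] += s["x"]
        s["z"] = s["x"] if s["x"] < 10 else s["y"]

t = threading.Thread(target=thread1_body)
t.start(); t.join()
print(state)
```

A real implementation detects conflicts at finer granularity instead of serializing all transactions; the sketch only shows why a concurrent thread observes all of this transaction's updates or none of them.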

  10. Implications for Application Development and Performance • Fundamental changes in the entire application stack • Programming paradigms (models of concurrency) • Software support (compilers, OS) • Library support (application kernels) • Runtime systems and performance monitoring (performance bottlenecks and alleviation) • Analysis techniques (scaling to the extreme)

  11. Ongoing Work at Purdue / Collaborators – A Bird's-eye View (Collaborators: Intel -- compilers, libraries; UMN -- analysis techniques; EPFL -- programming paradigms) Programming Models: What are appropriate concurrency abstractions? • When is communication good? • How do we deal with the spectrum of coherence models seamlessly? • How do we use transactions in real programs (I/O and networks are not transactional)?

  12. Programming Models: The Mediera Environment • Define domains of identical coherence models. • Build slack into concurrency. • View other cores as intelligent caches. • Use an LRU-type strategy to swap out threads across cores. • Support for algorithmic asynchrony. A number of important issues need to be resolved relating to mixed models -- messaging overhead associated with swapped out threads, resource bounds, livelock, priority inversion.

  13. Library Support • Building optimized multicore libraries for important computational kernels (sparse algebra, quantum-scale MD methods) / Intel MKL. • Novel algorithms for memory-constrained platforms (spend excess FLOPS instead of excess memory accesses). • Demonstrated application performance (model reduction, nano-scale modeling). • Comprehensive benchmarking of platforms (DARPA/HPCS pilot study) with a view to identifying performance bottlenecks and desirable application characteristics.
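The "excess FLOPS instead of excess memory accesses" idea can be sketched in a few lines. The function `f` and sizes here are illustrative: the point is only that recomputing a value costs arithmetic, while a precomputed table costs memory traffic, and on bandwidth-starved multicore platforms the arithmetic can be the cheaper currency.

```python
# Two ways to apply f(x) = x*x + 1 over a range of indices:
#   - memory-bound: precompute a table, then stream it back from memory
#   - FLOP-bound:   recompute each value on the fly, no table traffic
# Both produce identical results; only the FLOP/bandwidth mix differs.
N = 1024
f = lambda x: x * x + 1

table = [f(i) for i in range(N)]        # N stores now, N loads later
via_table = [table[i] for i in range(N)]
via_flops = [f(i) for i in range(N)]    # extra arithmetic, no table reads
```

In a real kernel the same trade-off appears as recomputing intermediate quantities inside a blocked loop rather than spilling them to memory.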

  14. Analysis Techniques How do we analyze programs over a large number of cores? • Isoefficiency metric • Scaling problem size with the number of cores to maintain performance • Memory-constrained scaling • Quantifying the drop in performance as the number of cores increases while operating at peak memory • Impact of limited bandwidth • An increasing number of cores implies lower bandwidth at each core
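The isoefficiency metric can be illustrated with a toy overhead model. Assuming a kernel whose total parallel overhead is T_o(W, p) = p · log2(p) (an assumption for this sketch, not a claim about any specific kernel), efficiency is E = W / (W + T_o), so growing the problem size as W = K · p · log2(p) holds E = K / (K + 1) constant:

```python
import math

# Efficiency E = T_serial / (p * T_parallel) = W / (W + T_o(W, p)),
# where W is problem size (serial work) and T_o is total parallel overhead.
def efficiency(w, p, overhead):
    return w / (w + overhead(w, p))

overhead = lambda w, p: p * math.log2(p)   # assumed overhead model

for p in (2, 4, 8, 16, 64):
    w = 4 * p * math.log2(p)               # isoefficiency scaling with K = 4
    print(p, efficiency(w, p, overhead))   # stays at 4/5 for every p
```

A kernel whose isoefficiency function grows slowly (here, p · log p) scales well; one that needs W to grow, say, quadratically in p quickly exhausts per-core memory, which connects this metric to the memory-constrained scaling bullet above.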

  15. Technical Objective To develop the next generation software environment for scalable chip-multiprocessor systems, along with library support and validating applications.

  16. Software Environments for Embedded Systems

  17. Programming Scalable Systems • The traditional approach to distributed programming involves writing “network-enabled” programs for each node • The program encodes distributed system behavior using complex messaging between nodes • This paradigm raises several issues and limitations: • Program development is time consuming • Programs are error prone and difficult to debug • Lack of a distributed behavior specification, which precludes verification • Limitations with respect to scalability, heterogeneity and performance

  18. Programming Scalable Systems • Macroprogramming entails direct specification of the distributed system behavior in contrast to programming individual nodes • Provides: • Seamless support for heterogeneity • Uniform programming platform • Node capability-aware abstractions • Performance scaling • Separating the application from system-level details • Scalability and adaptability with network & load dynamics • Validation of behavioral specification

  19. Technical Objective To develop a second generation operating system suite that facilitates rapid macroprogramming of efficient self-organized distributed applications for scalable embedded systems

  20. Ongoing Work: The CosmOS System Suite for Embedded Environments • CosmOS Components: • Programming model, compilation techniques • Device independent node operating system interfaces and implementations • Network operating system

  21. CosmOS Programming Model • Macroprogram consists of: • Distributed system behavioral specification • Constraints associated with mapping the behavioral specification to the physical system • Behavioral Specification • Functional Components (FCs) • Each represents a specific data processing function • Typed input and output interfaces • Interaction Assignment (IA) • Directed graph that specifies data flow through FCs • Data sources and sinks are (logical) device ports

  22. CosmOS Program Validation • Statically type-checked interaction assignment • The output of a component can be connected to the input of another only if their types match • Functional components represent deterministic data processing functions • The output sequence depends only on the inputs to the FC • Correctness • Given the input at each source in the IA, the outputs at the sinks are deterministically known

  23. CosmOS Functional Components • Elementary unit of execution • Isolated from the state of the system and other FCs • Uses only stack variables and statically assigned state memory • Asynchronous execution: data flow and control flow handled by CosmOS • Static memory • Prevents non-deterministic behavior due to malloc failures • Leads to a lean memory management system in the OS • Reusable components • The only interaction is via typed interfaces • Dynamically loadable components • Runtime updates possible [Figure: an Average FC with a raw_t input and avg_t outputs]
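The FC and IA ideas from the last three slides can be sketched compactly. The class and method names below are illustrative, not the real CosmOS API: components carry typed ports, the interaction assignment wires an output to an input only when the types match, and the runtime drives data flow through the graph.

```python
# Sketch of typed dataflow components (illustrative names, not CosmOS's API).
class FC:
    def __init__(self, name, in_type, out_type, fn):
        self.name, self.in_type, self.out_type = name, in_type, out_type
        self.fn, self.sinks = fn, []

    def connect(self, other):
        if self.out_type != other.in_type:     # static type check of the IA
            raise TypeError(f"{self.name} -> {other.name}: type mismatch")
        self.sinks.append(other)

    def push(self, value):                     # data flow driven by the runtime
        out = self.fn(value)
        for sink in self.sinks:
            sink.push(out)

def make_avg(window):
    buf = []                                   # bounded, statically sized state
    def fn(v):
        buf.append(v)
        del buf[:-window]                      # keep only the last `window` values
        return sum(buf) / len(buf)
    return fn

results = []
thresh = FC("thresh", "raw_t", "raw_t", lambda v: min(v, 500))
avg = FC("avg", "raw_t", "avg_t", make_avg(10))
fs = FC("fs", "avg_t", "avg_t", results.append)

thresh.connect(avg)                            # types match: raw_t -> raw_t
avg.connect(fs)                                # types match: avg_t -> avg_t
for reading in (600, 400):                     # clamp to 500, then average
    thresh.push(reading)
print(results)
```

Attempting `thresh.connect(fs)` would raise a `TypeError` (raw_t output into an avg_t input), which is the transcript's statically type-checked interaction assignment in miniature.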

  24. CosmOS Program Specification • Sections: • Enumerations • Declarations • Mapping constraints • IA Description

  25. CosmOS Program: An Example

%photo : device = PHOTO_SENSOR, out [ raw_t ];
%fs : device = FILE_DUMP, in [ * ];
%avg : { fcid = FCID_AVG, in [ raw_t, avg_t ], out [ avg_t ] };
%thresh : { fcid = FCID_THRESH, in [ raw_t ], out [ raw_t ] };
@ snode = CAP_PHOTO_SENSOR : photo, thresh;
@ fast_m = CAP_FAST_CPU : avg;
@ server = CAP_FS | CAP_UNIQUE_SERVER : avg, fs;
start_ia
timer(100) → photo(1);
photo(1) → thresh(2,0,500);
thresh(2,0) → avg(3,0,10), avg(4,0,100);
avg(3,0) → fs(5) | → avg(3,1);
avg(4,0) → fs(6) | → avg(4,1);
end_ia

[Figure: dataflow graph — the timer T(t) drives the photo source P(), whose raw_t output feeds Threshold(500); Threshold feeds Average(10) and Average(100); each average's avg_t output goes to a file sink (FS) and loops back into itself.]

  26. CosmOS: Runtime System [Figure: the example program's dataflow graph as mapped onto the runtime — P() and Threshold(500) feeding Average(10) and Average(100), each draining into a file sink (FS) over typed raw_t/avg_t channels.]

  27. CosmOS: Runtime System • Provides a low-footprint execution environment for CosmOS programs • Key components • Data flow and control flow • Locking and concurrency • Load conditioning • Routing primitives

  28. CosmOS Node Operating System [Figure: layered architecture — an updateable user space of services and application FCs, above a static OS kernel consisting of a platform-independent kernel and a hardware abstraction layer with HW drivers.]

  29. CosmOS: Current Status • Fully functional implementations for Mica2 and POSIX (on Linux) • Mica2: • Non-preemptive function pointer scheduler • Dynamic memory management • POSIX: • Multi-threading using POSIX threads and underlying scheduler • The OS exists as library calls and a single management thread

  30. CosmOS: Current Status • Comprehensively evaluated and validated • Alpha releases can be freely downloaded from: http://www.cs.purdue.edu/~awan/cosmos/

  31. CosmOS Validation • Pilot deployment at BOWEN Labs • MICA2 motes with ADXL202 accelerometers on a 433 MHz FM radio network • Laser sensors attached via serial port to Stargate computers • 802.11b peer-to-peer links and the ECN network connect the deployment to the Internet • Laser readings can currently be viewed from anywhere over the Internet (subject to firewall settings)

  32. CosmOS: Ongoing Work • Semantics of the CosmOS Programming Model • GUI for Interaction Assignment • Library of modules • Large-scale deployment and scalability studies • Application-specific optimizations.

  33. Thank you! For papers and talks on these topics, please visit: http://www.cs.purdue.edu/pdsl
