Polar Opposites: Next Generation Languages & Architectures


Presentation Transcript


  1. Polar Opposites: Next Generation Languages & Architectures Kathryn S McKinley, The University of Texas at Austin

  2. Collaborators • Faculty • Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss • Graduate Students • Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU) • Research Staff • Jim Burrill, Sam Guyer, Bill Yoder

  3. Computing in the Twenty-First Century New and changing architectures • Hitting the microprocessor wall • TRIPS - an architecture for future technology Object-oriented languages • Java and C# becoming mainstream Key challenges and approaches • Memory gap, parallelism • Language & runtime implementation efficiency • Orchestrating a new software/hardware dance • Break down artificial system boundaries

  4. Technology Scaling Hitting the Wall • Analytically or qualitatively, the conclusion is the same [Figure: reachable chip area at 130 nm, 100 nm, 70 nm, and 35 nm feature sizes on a 20 mm chip edge] • Either way, partitioning for on-chip communication is key

  5. End of the Road for Out-of-Order SuperScalars • Clock ride is over • Wire and pipeline limits • Quadratic out-of-order issue logic • Power, a first order constraint • Major vendors ending processor lines • Problems for any architectural solution • ILP - instruction level parallelism • Memory latency

  6. Where are Programming Languages? • High Productivity Languages • Java, C#, Matlab, S, Python, Perl • High Performance Languages • C/C++, Fortran • Why not both in one? • Interpretation/JIT vs compilation • Language representation • Pointers, arrays, frequent method calls, etc. • Automatic memory management costs • Obscure ILP and memory behavior

  7. Outline • TRIPS • Next generation tiled EDGE architecture • ILP compilation model • Memory system performance • Garbage collection influence • The GC advantage • Locality, locality, locality • Online adaptive copying • Cooperative software/hardware caching

  8. TRIPS • Project Goals • Fast clock & high ILP in future technologies • Architecture sustains 1 TRIPS in 35 nm technology • Cost-performance scalability • Find the right hardware/software balance • New balance reduces hardware complexity & power • New compiler responsibilities & challenges • Hardware/Software Prototype • Proof-of-concept of scalability and configurability • Technology transfer

  9. TRIPS Prototype Architecture

  10. Execution Substrate • Interconnect topology & latency exposed to the compiler's scheduler [Figure: substrate floorplan with global control, branch predictor, register banks, a 4 x 4 array of execution nodes, and four banked I-cache and D-cache/LSQ pairs]

  11. Large Instruction Window • Instruction buffers add depth to the execution array • 2D array of ALUs; 3D volume of instructions • Entire 3D volume exposed to the compiler • Out-of-order instruction buffers form a logical "z-dimension" at each node: 4 logical frames of 4 x 4 instructions [Figure: execution node detail with instruction buffers (opcode, src1, src2), ALU, and router]

  12. Execution Model • SPDI - static placement, dynamic issue • Dataflow within a block • Sequential between blocks • TRIPS compiler challenges • Create large blocks of instructions • Single entry, multiple exit, predication • Schedule blocks of instructions on a tile • Resource limitations • Registers, Memory operations

  13. Block Execution Model • Program execution • Fetch and map block to TRIPS grid • Execute block, produce result(s) • Commit results • Repeat • Block dataflow execution • Each cycle, execute a ready instruction at every node • Single read of registers and memory locations • Single write of registers and memory locations • Update the PC to successor block • TRIPS core may speculatively execute multiple blocks (as well as instructions) • TRIPS uses branch prediction and register renaming between blocks, but not within a block [Figure: control-flow graph of blocks A through E from start to end]
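
To make the within-block dataflow concrete, here is a toy Java model of block execution: an instruction fires as soon as all of its operands have arrived, in whatever order that happens, and the block's effects become visible together at a single commit point. Everything here (the Insn type, the three-instruction block) is invented for illustration; this is not the TRIPS ISA or toolchain.

```java
import java.util.*;

// Toy model of block dataflow execution: within a block, an instruction
// fires once all of its operands have arrived (dynamic issue), and all
// results become visible together at the block's single commit point.
// Illustrative only; not the real TRIPS ISA or simulator.
public final class BlockDataflowDemo {
    static final class Insn {
        final String name;
        final int operandsNeeded;                 // fire when this many arrive
        final List<Insn> consumers = new ArrayList<>();
        int operandsArrived = 0;
        Insn(String name, int operandsNeeded) {
            this.name = name; this.operandsNeeded = operandsNeeded;
        }
    }

    public static void main(String[] args) {
        // A three-instruction block: load feeds addi and mul; addi feeds mul.
        Insn load = new Insn("load", 0);
        Insn addi = new Insn("addi", 1);
        Insn mul  = new Insn("mul", 2);
        load.consumers.addAll(List.of(addi, mul));
        addi.consumers.add(mul);

        Deque<Insn> ready = new ArrayDeque<>(List.of(load));
        List<String> fired = new ArrayList<>();
        while (!ready.isEmpty()) {
            Insn i = ready.pop();                 // any ready instruction may fire
            fired.add(i.name);
            for (Insn c : i.consumers)            // deliver the result to consumers
                if (++c.operandsArrived == c.operandsNeeded) ready.add(c);
        }
        System.out.println("block committed; firing order: " + fired);
    }
}
```

Note that no program counter appears inside the loop: within a block, order is determined only by operand arrival, which is exactly what lets the hardware hide unpredictable latencies.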

  14. Just Right Division of Labor • TRIPS architecture • Eliminates short-term temporaries • Out-of-order execution at every node in grid • Exploits ILP, hides unpredictable latencies • without superscalar quadratic hardware • without VLIW guarantees of completion time • Scale compiler - generate ILP • Large hyperblocks - predicate, unroll, inline, etc. • Schedule hyperblocks • Map independent instructions to different nodes • Map communicating instructions to same or close nodes • Let hardware deal with unpredictable latencies (loads) Exploits Hardware and Compiler Strengths

  15. High Productivity Programming Languages • Interpretation/JIT vs compilation • Language representation • Pointers, arrays, frequent method calls, etc. • Automatic memory management costs MMTk in IBM Jikes RVM • ICSE’04, SIGMETRICS’04 • Memory Management Toolkit for Java • High Performance, Extensible, Portable • Mark-Sweep, Copying SemiSpace, Reference Counting • Generational collection, Beltway, etc.
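
As an illustration of the "toolkit" idea, the sketch below shows how an allocation policy and a collection policy might compose into a plan, so that mark-sweep, semispace copying, and generational collectors can share parts. This is hypothetical Java in the spirit of MMTk's composable design, not MMTk's actual API; the Allocator, Collector, and Plan names are invented.

```java
// Hypothetical sketch of composable memory management "plans": an allocation
// policy pairs with a collection policy. NOT MMTk's real API.
interface Allocator { long alloc(int bytes); }   // e.g., bump pointer or free list
interface Collector { void collect(); }          // e.g., mark-sweep or copying

final class Plan {
    private final Allocator allocator;
    private final Collector collector;
    private final long heapLimit;
    private long bytesInUse;

    Plan(Allocator allocator, Collector collector, long heapLimit) {
        this.allocator = allocator; this.collector = collector; this.heapLimit = heapLimit;
    }

    long alloc(int bytes) {
        if (bytesInUse + bytes > heapLimit) {
            collector.collect();                 // reclaim before failing
            bytesInUse = 0;                      // simplification: assume all freed
        }
        bytesInUse += bytes;
        return allocator.alloc(bytes);
    }
}
```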

  16. Allocation Choices • Bump-Pointer: fast (increment & bounds check), but can't incrementally free & reuse: must free en masse • Free-List: relatively slow (consult list for fit), but can incrementally free & reuse cells
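
The contrast is easy to see in code. Below is a minimal Java sketch of the two allocators; the heap is imaginary and addresses are plain longs, so this shows only the fast paths and the reuse difference, not a real runtime's allocation sequence.

```java
// Minimal sketch of the two allocation strategies (imaginary heap; addresses
// are plain longs). The bump pointer's fast path is an add and a bounds
// check; the free list walks cells for a first fit but can reuse each free.
final class AllocatorSketch {
    // Bump pointer: fast, but space is reclaimed only en masse (e.g., when a
    // copying collector evacuates the region and resets the cursor).
    static final class BumpPointer {
        private long cursor, limit;
        BumpPointer(long start, long end) { cursor = start; limit = end; }
        long alloc(int bytes) {
            if (cursor + bytes > limit) return -1;   // out of space: needs GC
            long addr = cursor;
            cursor += bytes;
            return addr;
        }
    }

    // Free list: slower (searches for a fitting cell), but individual cells
    // can be freed and reused (e.g., by a mark-sweep collector).
    // Simplified: whole cells are handed out, no splitting or coalescing.
    static final class FreeList {
        private static final class Cell { long addr; int size; Cell next; }
        private Cell head;
        long alloc(int bytes) {
            for (Cell prev = null, c = head; c != null; prev = c, c = c.next) {
                if (c.size >= bytes) {               // first fit
                    if (prev == null) head = c.next; else prev.next = c.next;
                    return c.addr;
                }
            }
            return -1;                               // out of space: needs GC
        }
        void free(long addr, int size) {
            Cell c = new Cell();
            c.addr = addr; c.size = size; c.next = head;
            head = c;
        }
    }
}
```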

  17. Allocation Choices • Bump pointer: ~70 bytes of IA32 instructions, 726 MB/s • Free list: ~140 bytes of IA32 instructions, 654 MB/s • Bump pointer is 11% faster in a tight loop, < 1% in a practical setting • No significant difference (?) • Second order effects? • Locality?? • Collection mechanism??

  18. Implications for Locality • Compare SemiSpace (SS) & MarkSweep (MS) mutator performance • Mutator time • Mutator memory performance: L1, L2 & TLB

  19. javac

  20. pseudojbb

  21. db

  22. Locality & Architecture

  23. MS/SS Crossover 1.6GHz PPC

  24. MS/SS Crossover 1.9GHz AMD

  25. MS/SS Crossover 2.6GHz P4

  26. MS/SS Crossover 3.2GHz P4

  27. MS/SS Crossover: the locality space [Figure: crossover points for the 1.6GHz PPC, 1.9GHz AMD, 2.6GHz P4, and 3.2GHz P4]

  28. Locality in Memory Management • Explicit memory management is on its way out • The key GC vs. explicit MM insights are 20 years old, but technology has changed and is still changing • Generational and Beltway collectors • Significant collection-time benefits over full-heap collectors • Collect young objects • Infrequently collect the old space • A copying nursery attains locality effects similar to full-heap copying

  29. Where are the Misses? Generational Copying Collector

  30. Copy Order • Static copy orders • Breadth first - Cheney scan • Depth first, hierarchical • Problem: one size does not fit all • Static profiling per class • Inconsistent with JIT • Object sampling • Too expensive in our experience • OOR - Online Object Reordering • OOPSLA’04
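
The two static orders are easy to model. The sketch below simulates only the order in which a copying collector places objects in to-space (position in the output list stands in for the to-space address, so adjacency in the list is adjacency in memory); forwarding pointers, the actual copy, and Cheney's scan pointer are elided, and the object graph is invented for illustration.

```java
import java.util.*;

// Toy model of the copy orders contrasted above: breadth-first (the FIFO
// order a Cheney scan produces) vs. depth-first (LIFO). The to-space
// "address" of an object is its position in the output list.
public final class CopyOrderDemo {
    static final class Obj {
        final String name; final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    static List<String> copy(Obj root, boolean breadthFirst) {
        List<String> toSpace = new ArrayList<>();
        Deque<Obj> work = new ArrayDeque<>();
        Set<Obj> seen = new HashSet<>();
        work.add(root); seen.add(root);
        while (!work.isEmpty()) {
            Obj o = breadthFirst ? work.pollFirst() : work.pollLast();
            toSpace.add(o.name);                   // next free to-space slot
            for (Obj child : o.fields)
                if (seen.add(child)) work.addLast(child);
        }
        return toSpace;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), d = new Obj("d");
        a.fields.addAll(List.of(b, c));
        b.fields.add(d);
        System.out.println("breadth-first: " + copy(a, true));   // [a, b, c, d]
        System.out.println("depth-first:   " + copy(a, false));  // [a, c, b, d]
    }
}
```

Running it shows why one size does not fit all: breadth-first groups siblings, depth-first groups parent-child chains, and which grouping wins depends on how the program actually traverses its objects.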

  31. OOR Overview • Records object accesses in each method (excludes cold basic blocks) • Finds hot methods by dynamic sampling • Reorders objects with hot fields in higher generation during GC • Copies hot objects into separate region
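
A sketch of how those steps might fit together in code: the compiler records field accesses per method, a sampler counts method executions, and once a method crosses a hotness threshold its recorded fields become advice for the collector. The class, the threshold, and the string-keyed tables below are all invented for illustration; they do not reflect the actual Jikes RVM implementation.

```java
import java.util.*;

// Illustrative sketch of OOR's advice pipeline: compiler-recorded field
// accesses + dynamic method sampling => set of hot fields for the GC.
final class HotFieldAdvice {
    private final Map<String, List<String>> accessesByMethod = new HashMap<>();
    private final Map<String, Integer> samples = new HashMap<>();
    private final Set<String> hotFields = new HashSet<>();
    private static final int HOT_THRESHOLD = 100;   // invented threshold

    // Called by the compiler for accesses in a method's hot basic blocks.
    void recordAccess(String method, String field) {
        accessesByMethod.computeIfAbsent(method, m -> new ArrayList<>()).add(field);
    }

    // Called by the timer-driven sampler when `method` is observed running.
    void sample(String method) {
        int n = samples.merge(method, 1, Integer::sum);
        if (n == HOT_THRESHOLD)                     // method just became hot
            hotFields.addAll(accessesByMethod.getOrDefault(method, List.of()));
    }

    // Consulted by the copying GC: objects reached via hot fields are
    // copied into a separate hot region.
    boolean isHot(String field) { return hotFields.contains(field); }
}
```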

  32. Static Analysis Example • method Foo { Class A a; try { …=a.b; … } catch(Exception e){ …a.c } } • The compiler collects access info from the hot basic blocks (the try body), producing the access list: 1. A.b, 2. … • Cold basic blocks (the catch handler) are ignored

  33. Adaptive Sampling • method Foo { Class A a; try { …=a.b; … } catch(Exception e){ …a.c } } • Adaptive sampling determines that Foo is hot • Foo’s access list (1. A.b, 2. …) then marks field A.b as hot [Figure: class A with hot field b and cold field c]

  34. Advice Directed Reordering • Example: assume (1,4), (4,7), and (2,6) are hot field accesses • Copy order: 1, 4, 7, 2, 6, then the cold objects 3, 5 [Figure: object graph over objects 1-7]
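
The sketch below reproduces the slide's example: a copy loop that follows hot field edges eagerly and defers cold children, yielding exactly the order 1, 4, 7, 2, 6 followed by 3, 5. The graph shape and the edge-pair encoding of the advice are invented to match the figure.

```java
import java.util.*;

// Advice-directed copy order: follow hot field edges first so objects
// linked by hot accesses land adjacent in to-space; cold children wait.
public final class AdviceOrderDemo {
    public static void main(String[] args) {
        // fields.get(i) = objects referenced by object i's fields.
        Map<Integer, List<Integer>> fields = Map.of(
            1, List.of(2, 3, 4), 2, List.of(5, 6), 4, List.of(7),
            3, List.of(), 5, List.of(), 6, List.of(), 7, List.of());
        // Hot field accesses from sampling, as (source, target) pairs.
        Set<List<Integer>> hot = Set.of(List.of(1, 4), List.of(4, 7), List.of(2, 6));

        List<Integer> order = new ArrayList<>();
        Set<Integer> copied = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>(List.of(1));  // hot work list (LIFO)
        Deque<Integer> cold = new ArrayDeque<>();            // deferred cold children
        while (!work.isEmpty() || !cold.isEmpty()) {
            int o = !work.isEmpty() ? work.pop() : cold.poll();
            if (!copied.add(o)) continue;                    // already copied
            order.add(o);                                    // next to-space slot
            for (int child : fields.get(o)) {
                if (hot.contains(List.of(o, child))) work.push(child);
                else cold.add(child);
            }
        }
        System.out.println(order);   // [1, 4, 7, 2, 6, 3, 5]
    }
}
```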

  35. OOR System Overview [Figure: OOR additions to Jikes RVM: the baseline and optimizing compilers add entries to an access-info database; adaptive sampling looks up access info for hot methods and registers hot field accesses as advice; the copying GC consults the advice when copying objects, which affects the locality of the executing code]

  36. Cost of OOR

  37. Performance db

  38. Performance jython

  39. Performance javac

  40. Software is not enough, hardware is not enough • Problem: inefficient use of the cache • Hardware limitations: set associativity, cannot predict the future • Cooperative Software/Hardware Caching • Combines high-level compiler analysis with dynamic miss behavior • Lightweight ISA support conveys the compiler’s global view to hardware • Compiler-guided cache replacement (evict-me) • Compiler-guided region prefetching • ISCA’03, PACT’02
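
As an illustration of the evict-me idea, here is a toy set-associative cache model in which a compiler hint marks lines it predicts have no further reuse, and a miss victimizes a hinted line before falling back to LRU. The geometry, the boolean-hint interface, and all names are invented for this sketch; the real design (ISCA'03) carries the hint through the ISA and cache tags.

```java
import java.util.*;

// Toy model of compiler-guided replacement ("evict-me"): on a miss in a
// full set, prefer to evict a line the compiler tagged as dead, otherwise
// evict the LRU line. Illustrative geometry and interface only.
public final class EvictMeCache {
    private static final class Line {
        long tag; boolean evictMe;
        Line(long tag, boolean evictMe) { this.tag = tag; this.evictMe = evictMe; }
    }
    private final int sets, ways;
    private final List<LinkedList<Line>> cache;   // each set kept MRU-first

    EvictMeCache(int sets, int ways) {
        this.sets = sets; this.ways = ways;
        cache = new ArrayList<>();
        for (int i = 0; i < sets; i++) cache.add(new LinkedList<>());
    }

    /** Returns true on a hit; evictMe is the compiler's "no further reuse" hint. */
    boolean access(long addr, boolean evictMe) {
        long block = addr / 64;                       // 64-byte lines
        LinkedList<Line> set = cache.get((int) (block % sets));
        for (int i = 0; i < set.size(); i++) {
            Line l = set.get(i);
            if (l.tag == block) {                     // hit: move to MRU, refresh hint
                set.remove(i);
                set.addFirst(l);
                l.evictMe = evictMe;
                return true;
            }
        }
        if (set.size() == ways) {                     // miss in a full set:
            Line victim = set.stream()                // prefer an evict-me line,
                .filter(l -> l.evictMe)               // otherwise fall back to LRU
                .findFirst().orElse(set.getLast());
            set.remove(victim);
        }
        set.addFirst(new Line(block, evictMe));
        return false;
    }
}
```

The point of the design is that the hint only reorders victim choice; if the compiler is wrong, behavior degrades gracefully to ordinary LRU.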

  41. Exciting Times • Dramatic architectural changes • Execution tiles • Cache & memory tiles • Next generation system solutions • Moving hardware/software boundaries • Online optimizations • Key compiler challenges (same old…): ILP and the cache/memory hierarchy
