
Compiler-directed Data Partitioning for Multicluster Processors

Michael Chu and Scott Mahlke, Advanced Computer Architecture Lab, University of Michigan. March 28, 2006.


Presentation Transcript


  1. Compiler-directed Data Partitioning for Multicluster Processors Michael Chu and Scott Mahlke Advanced Computer Architecture Lab University of Michigan March 28, 2006

  2. Multicluster Architectures
  • Addresses the register file bottleneck
  • Decentralizes the architecture
  • Compilation focuses on partitioning operations
  • Most previous work assumes a unified memory
  [Figure: two clusters connected by an intercluster communication network, each with its own register file, I/F/M units, and data memory (Data Mem 1, Data Mem 2), contrasted with a processor with a unified register file and data memory]

  3. Problem: Partitioning of Data
  • Determine object placement into data memories
  • Limited by:
    • Memory sizes/capacities
    • Computation operations related to the data
  • Partitioning is relevant to both caches and scratchpad memories
  [Figure: data objects int x[100], struct foo, and int y[100] to be placed into Data Mem 1 or Data Mem 2 of a two-cluster machine]

  4. Architectural Model
  • This work focuses on scratchpad-like static local memories
  • Each cluster has one local memory
  • Each object is placed in one specific memory
  • A data object is available in that memory throughout the lifetime of the program
  [Figure: objects int x[100], foo, and int y[100] assigned across the local memories of Cluster 1 and Cluster 2]

  5. Data Unaware Partitioning
  • Ignoring data placement loses 30% performance on average

  6. Our Objective
  • Goal: produce efficient code
  • Strategy:
    • Partition both data objects and computation operations
    • Balance memory size across clusters
    • Improve memory bandwidth
    • Maximize parallelism

  7. First Try: Greedy Approach
  • Computation-centric partition of data
  • Place data where the computation references it most often
  • Greedy approach:
    • Pass 1: region-view computation partition, then greedy data cluster assignment
    • Pass 2: region-view computation repartition with full knowledge of data location
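The greedy data assignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DataObject` record and `greedy_assign` helper are hypothetical names, assuming a per-cluster reference count is available from the Pass 1 computation partition.

```c
#include <stddef.h>

#define NUM_CLUSTERS 2

/* Hypothetical per-object profile: refs[c] counts how often the
 * computation partitioned onto cluster c references this object. */
typedef struct {
    const char *name;
    long refs[NUM_CLUSTERS];
} DataObject;

/* Greedy data cluster assignment: place each object in the data
 * memory of the cluster whose operations reference it most often
 * (ties go to the lower-numbered cluster). */
int greedy_assign(const DataObject *obj) {
    int best = 0;
    for (int c = 1; c < NUM_CLUSTERS; c++)
        if (obj->refs[c] > obj->refs[best])
            best = c;
    return best;
}
```

Pass 2 would then repartition the computation with these placements fixed, which is why the greedy scheme needs two passes.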

  8. Greedy Approach Results
  • 2 clusters: one integer, float, memory, and branch unit per cluster
  • Results relative to a unified, dual-ported memory
  • Improvement over Data Unaware, but still room for improvement

  9. Second Try: Global Data Partition
  • Data-centric partition of computation
  • Hierarchical technique:
    • Pass 1: global view for data
      • Considers memory relationships throughout the program
      • Locks memory operations to clusters
    • Pass 2: region view for computation
      • Partitions computation based on data location

  10. Pass 1: Global Data Partitioning
  • Determine memory relationships via interprocedural pointer analysis and memory profiling
  • Build a program-level graph representation of all operations
  • Merge memory operations on data objects, respecting the correctness constraints of the program
  [Figure: four-step flow: (1) interprocedural pointer analysis & memory profile, (2) build program data graph, (3) merge memory operations, (4) METIS graph partitioner]
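The merging step can be modeled with a union-find over memory-operation ids: two operations whose points-to sets overlap must collapse into one graph node so the partitioner cannot separate them. The `uf_*` names and the fixed `MAX_OPS` bound below are illustrative assumptions, a sketch of the constraint rather than the paper's data structure.

```c
#define MAX_OPS 16

/* Union-find over memory-operation ids: operations that may
 * reference the same data object end up in one merged node. */
static int parent[MAX_OPS];

void uf_init(int n) {
    for (int i = 0; i < n; i++)
        parent[i] = i;
}

int uf_find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]; /* path halving */
        x = parent[x];
    }
    return x;
}

/* Called for each pair of operations whose points-to sets overlap. */
void uf_union(int a, int b) {
    parent[uf_find(a)] = uf_find(b);
}
```

After all overlapping pairs are unioned, each union-find root corresponds to one node handed to the graph partitioner.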

  11. Global Data Graph Representation
  • Nodes: operations, either memory or non-memory
    • Memory operations: loads, stores, malloc callsites
  • Edges: data flow between operations
  • Node weight: data object size, the sum of data sizes for referenced objects
  • Object size determined by:
    • Globals/locals: pointer analysis
    • Malloc callsites: memory profile
  [Figure: example nodes int x[100], malloc site 1, and struct foo weighted by object sizes (200 bytes, 400 bytes, 1 Kbyte)]
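The node-weight rule above can be written down directly. `MemOpNode` is a hypothetical record, assuming object sizes have already been resolved by pointer analysis (globals/locals) or the memory profile (malloc callsites):

```c
#include <stddef.h>

#define MAX_REFS 4

/* Hypothetical node for one memory operation: the sizes of every
 * data object it may reference, per the points-to analysis. */
typedef struct {
    size_t obj_size[MAX_REFS];
    int    n_objs;
} MemOpNode;

/* Node weight = sum of data sizes for all referenced objects. */
size_t node_weight(const MemOpNode *op) {
    size_t w = 0;
    for (int i = 0; i < op->n_objs; i++)
        w += op->obj_size[i];
    return w;
}
```

The partitioner balances these weights across clusters, which is how the memory-capacity constraint from slide 3 enters the formulation.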

  12. Global Data Partitioning Example
  [Figure: program data graph of memory and non-memory ops; memory ops reference int x[100], struct foo, struct bar, malloc site 1, and malloc site 2, and are assigned across Cluster 0 and Cluster 1. BB1 references 2 objects (80 Kb); BB2 references 2 objects (200 Kb) and 1 object (100 Kb)]

  13. Pass 2: Computation Partitioning
  • Observation: the global-level data partition is only half the answer
    • It doesn't account for operation resource usage
    • It doesn't consider code scheduling regions
  • A second pass of partitioning runs on each scheduling region
  • Memory operations from the first phase are locked in place

  14. Experimental Methodology
  • 2 clusters: one integer, float, memory, and branch unit per cluster
  • All results relative to a unified, dual-ported memory

  15. Performance: 1-cycle Remote Access
  [Chart: performance relative to the unified memory baseline]

  16. Performance: 10-cycle Remote Access
  [Chart: performance relative to the unified memory baseline]

  17. Case Study: rawcaudio
  [Chart: Global Data Partition versus the greedy profile-based approach]

  18. Summary
  • Global Data Partitioning
    • Data placement is a first-order design principle
    • Global data-centric partition of computation
  • Phase-ordered approach:
    • Global view for decisions on data
    • Region view for decisions on computation
  • Achieves 96% of unified-memory performance on partitioned memories
  • Future work: apply to cache memories

  19. Data Partitioning for Multicores
  • Adapt global data partitioning to the cache memory domain
  • Similar goals:
    • Increase data bandwidth
    • Maximize parallel computation
  • Different goals:
    • Reduce coherence traffic
    • Keep the working set ≤ cache size

  20. Questions? http://cccp.eecs.umich.edu

  21. Backup

  22. Future Work: Cache Memories
  • Adapt global data partitioning to the cache memory domain
  • Similar goals:
    • Increase data bandwidth
    • Maximize parallel computation
  • Different goals:
    • Reduce coherence traffic
    • Balance the working set

  23. Memory Operation Merging
  • Interprocedural pointer analysis determines memory relationships

      int *x;
      int foo[100];
      int bar[100];

      void main() {
        int *a = malloc();    /* malloc callsite */
        int *b;
        int c;
        if (cond) {
          c = foo[1];         /* load: "foo" */
          b = a;
        } else {
          c = bar[1];         /* load: "bar" */
          b = &bar[1];
        }
        *b = 100;             /* store: "malloc" or "bar" */
        foo[0] = c;           /* store: "foo" */
      }

  24. Multicluster Compilation
  • Previous techniques focused on operation partitioning [cite some papers]
    • They ignore the issue of data object placement in memory
    • They assume a shared memory accessible from each cluster

  25. Phase 2: Computation Partitioning
  • Observation: the global-level data partition is only half the solution
    • It doesn't properly account for resource usage details
    • It doesn't consider code scheduling regions
  • A second pass of partitioning is done locally on each basic block of the program
  • Memory operations are locked into specific clusters
  • Uses the Region-based Hierarchical Operation Partitioner (RHOP)

  26. Computation Partitioning Example
  • Memory operations from the first phase are locked in place
  • RHOP performs a detailed resource-cognizant computation partition
  • Modified multi-level Kernighan-Lin algorithm using schedule estimates
  [Figure: dataflow graph of BB1 (loads, stores, arithmetic, and address ops) partitioned across the two clusters]
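The Kernighan-Lin refinement mentioned above hinges on a move-gain computation. Below is a minimal two-cluster sketch; the `kl_gain` name, adjacency-matrix representation, and edge weights are assumptions for illustration, and the real RHOP partitioner uses schedule estimates rather than raw edge cuts.

```c
#define N_OPS 4

/* adj[u][v]: data-flow edge weight between operations u and v;
 * part[u]: current cluster (0 or 1) of operation u. */
long kl_gain(long adj[N_OPS][N_OPS], const int part[N_OPS], int v) {
    long external = 0, internal = 0;
    for (int u = 0; u < N_OPS; u++) {
        if (u == v) continue;
        if (part[u] == part[v])
            internal += adj[v][u];  /* edges kept inside the cluster */
        else
            external += adj[v][u];  /* edges crossing the cut */
    }
    /* Positive gain: moving v to the other cluster reduces the cut. */
    return external - internal;
}
```

A KL-style pass would repeatedly move the highest-gain unlocked operation, while the locked memory operations from Pass 1 simply never become move candidates.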
