
Compiler-directed Data Partitioning for Multicluster Processors

Michael Chu and Scott Mahlke, Advanced Computer Architecture Lab, University of Michigan. March 28, 2006.


Presentation Transcript


  1. Compiler-directed Data Partitioning for Multicluster Processors Michael Chu and Scott Mahlke Advanced Computer Architecture Lab University of Michigan March 28, 2006

  2. Multicluster Architectures
  • Addresses the register file bottleneck
  • Decentralizes the architecture
  • Compilation focuses on partitioning operations
  • Most previous work assumes a unified memory
  [Figure: two clusters connected by an intercluster communication network, each with its own register file, I/F/M units, and data memory (Data Mem 1, Data Mem 2), contrasted with a processor with a unified register file and data memory]

  3. Problem: Partitioning of Data
  • Determine object placement into data memories
  • Limited by:
    • Memory sizes/capacities
    • Computation operations related to the data
  • Partitioning is relevant to both caches and scratchpad memories
  [Figure: data objects int x[100], struct foo, and int y[100] to be placed into Data Mem 1 or Data Mem 2 of a two-cluster machine]

  4. Architectural Model
  • This work focuses on scratchpad-like static local memories
  • Each cluster has one local memory
  • Each object is placed in one specific memory
  • A data object is available in that memory throughout the lifetime of the program
  [Figure: objects int x[100], foo, and int y[100] assigned across the local memories of Cluster 1 and Cluster 2]

  5. Data Unaware Partitioning
  • Ignoring data placement loses 30% performance on average

  6. Our Objective
  • Goal: produce efficient code
  • Strategy:
    • Partition both data objects and computation operations
    • Balance memory size across clusters
    • Improve memory bandwidth
    • Maximize parallelism

  7. First Try: Greedy Approach
  • Computation-centric partition of data
  • Place data where the computation references it most often
  • Greedy approach:
    • Pass 1: region-view computation partition, then greedy data cluster assignment
    • Pass 2: region-view computation repartition with full knowledge of data location
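The greedy data assignment described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DataObject` record and `greedy_assign` helper are hypothetical names, assuming a per-cluster reference count is available from the Pass 1 computation partition.

```c
#include <stddef.h>

#define NUM_CLUSTERS 2

/* Hypothetical per-object profile: refs[c] counts how often the
 * computation partitioned onto cluster c references this object. */
typedef struct {
    const char *name;
    long refs[NUM_CLUSTERS];
} DataObject;

/* Greedy data cluster assignment: place each object in the data
 * memory of the cluster whose operations reference it most often
 * (ties go to the lower-numbered cluster). */
int greedy_assign(const DataObject *obj) {
    int best = 0;
    for (int c = 1; c < NUM_CLUSTERS; c++)
        if (obj->refs[c] > obj->refs[best])
            best = c;
    return best;
}
```

Pass 2 would then repartition the computation with these placements fixed, which is why the greedy scheme needs two passes.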

  8. Greedy Approach Results
  • 2 clusters: one integer, float, memory, and branch unit per cluster
  • Results relative to a unified, dual-ported memory
  • Improvement over Data Unaware, but still room for improvement

  9. Second Try: Global Data Partition
  • Data-centric partition of computation
  • Hierarchical technique:
    • Pass 1: global view for data
      • Considers memory relationships throughout the program
      • Locks memory operations to clusters
    • Pass 2: region view for computation
      • Partitions computation based on data location

  10. Pass 1: Global Data Partitioning
  • Determine memory relationships via interprocedural pointer analysis and memory profiling
  • Build a program-level graph representation of all operations
  • Merge memory operations on data objects, respecting the correctness constraints of the program
  [Figure: four-step flow: (1) interprocedural pointer analysis & memory profile, (2) build program data graph, (3) merge memory operations, (4) METIS graph partitioner]
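The merging step can be modeled with a union-find over memory-operation ids: two operations whose points-to sets overlap must collapse into one graph node so the partitioner cannot separate them. The `uf_*` names and the fixed `MAX_OPS` bound below are illustrative assumptions, a sketch of the constraint rather than the paper's data structure.

```c
#define MAX_OPS 16

/* Union-find over memory-operation ids: operations that may
 * reference the same data object end up in one merged node. */
static int parent[MAX_OPS];

void uf_init(int n) {
    for (int i = 0; i < n; i++)
        parent[i] = i;
}

int uf_find(int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]; /* path halving */
        x = parent[x];
    }
    return x;
}

/* Called for each pair of operations whose points-to sets overlap. */
void uf_union(int a, int b) {
    parent[uf_find(a)] = uf_find(b);
}
```

After all overlapping pairs are unioned, each union-find root corresponds to one node handed to the graph partitioner.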

  11. Global Data Graph Representation
  • Nodes: operations, either memory or non-memory
    • Memory operations: loads, stores, malloc callsites
  • Edges: data flow between operations
  • Node weight: data object size, the sum of data sizes for referenced objects
  • Object size determined by:
    • Globals/locals: pointer analysis
    • Malloc callsites: memory profile
  [Figure: example nodes int x[100], malloc site 1, and struct foo weighted by object sizes (200 bytes, 400 bytes, 1 Kbyte)]
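The node-weight rule above can be written down directly. `MemOpNode` is a hypothetical record, assuming object sizes have already been resolved by pointer analysis (globals/locals) or the memory profile (malloc callsites):

```c
#include <stddef.h>

#define MAX_REFS 4

/* Hypothetical node for one memory operation: the sizes of every
 * data object it may reference, per the points-to analysis. */
typedef struct {
    size_t obj_size[MAX_REFS];
    int    n_objs;
} MemOpNode;

/* Node weight = sum of data sizes for all referenced objects. */
size_t node_weight(const MemOpNode *op) {
    size_t w = 0;
    for (int i = 0; i < op->n_objs; i++)
        w += op->obj_size[i];
    return w;
}
```

The partitioner balances these weights across clusters, which is how the memory-capacity constraint from slide 3 enters the formulation.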

  12. Global Data Partitioning Example
  [Figure: program data graph of memory and non-memory ops; memory ops reference int x[100], struct foo, struct bar, malloc site 1, and malloc site 2, and are assigned across Cluster 0 and Cluster 1. BB1 references 2 objects (80 Kb); BB2 references 2 objects (200 Kb) and 1 object (100 Kb)]

  13. Pass 2: Computation Partitioning
  • Observation: the global-level data partition is only half the answer
    • It doesn't account for operation resource usage
    • It doesn't consider code scheduling regions
  • A second pass of partitioning runs on each scheduling region
  • Memory operations from the first phase are locked in place

  14. Experimental Methodology
  • 2 clusters: one integer, float, memory, and branch unit per cluster
  • All results relative to a unified, dual-ported memory

  15. Performance: 1-cycle Remote Access
  [Chart: performance relative to the unified memory baseline]

  16. Performance: 10-cycle Remote Access
  [Chart: performance relative to the unified memory baseline]

  17. Case Study: rawcaudio
  [Chart: Global Data Partition versus the greedy profile-based approach]

  18. Summary
  • Global Data Partitioning
    • Data placement is a first-order design principle
    • Global data-centric partition of computation
  • Phase-ordered approach:
    • Global view for decisions on data
    • Region view for decisions on computation
  • Achieves 96% of unified-memory performance on partitioned memories
  • Future work: apply to cache memories

  19. Data Partitioning for Multicores
  • Adapt global data partitioning to the cache memory domain
  • Similar goals:
    • Increase data bandwidth
    • Maximize parallel computation
  • Different goals:
    • Reduce coherence traffic
    • Keep the working set ≤ cache size

  20. Questions? http://cccp.eecs.umich.edu

  21. Backup

  22. Future Work: Cache Memories
  • Adapt global data partitioning to the cache memory domain
  • Similar goals:
    • Increase data bandwidth
    • Maximize parallel computation
  • Different goals:
    • Reduce coherence traffic
    • Balance the working set

  23. Memory Operation Merging
  • Interprocedural pointer analysis determines memory relationships

      int *x;
      int foo[100];
      int bar[100];

      void main() {
        int *a = malloc();    /* malloc callsite */
        int *b;
        int c;
        if (cond) {
          c = foo[1];         /* load: "foo" */
          b = a;
        } else {
          c = bar[1];         /* load: "bar" */
          b = &bar[1];
        }
        *b = 100;             /* store: "malloc" or "bar" */
        foo[0] = c;           /* store: "foo" */
      }

  24. Multicluster Compilation
  • Previous techniques focused on operation partitioning [cite some papers]
    • They ignore the issue of data object placement in memory
    • They assume a shared memory accessible from each cluster

  25. Phase 2: Computation Partitioning
  • Observation: the global-level data partition is only half the solution
    • It doesn't properly account for resource usage details
    • It doesn't consider code scheduling regions
  • A second pass of partitioning is done locally on each basic block of the program
  • Memory operations are locked into specific clusters
  • Uses the Region-based Hierarchical Operation Partitioner (RHOP)

  26. Computation Partitioning Example
  • Memory operations from the first phase are locked in place
  • RHOP performs a detailed resource-cognizant computation partition
  • Modified multi-level Kernighan-Lin algorithm using schedule estimates
  [Figure: dataflow graph of BB1 (loads, stores, arithmetic, and address ops) partitioned across the two clusters]
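The Kernighan-Lin refinement mentioned above hinges on a move-gain computation. Below is a minimal two-cluster sketch; the `kl_gain` name, adjacency-matrix representation, and edge weights are assumptions for illustration, and the real RHOP partitioner uses schedule estimates rather than raw edge cuts.

```c
#define N_OPS 4

/* adj[u][v]: data-flow edge weight between operations u and v;
 * part[u]: current cluster (0 or 1) of operation u. */
long kl_gain(long adj[N_OPS][N_OPS], const int part[N_OPS], int v) {
    long external = 0, internal = 0;
    for (int u = 0; u < N_OPS; u++) {
        if (u == v) continue;
        if (part[u] == part[v])
            internal += adj[v][u];  /* edges kept inside the cluster */
        else
            external += adj[v][u];  /* edges crossing the cut */
    }
    /* Positive gain: moving v to the other cluster reduces the cut. */
    return external - internal;
}
```

A KL-style pass would repeatedly move the highest-gain unlocked operation, while the locked memory operations from Pass 1 simply never become move candidates.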
