
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
Ramazan Bitirgen, Engin Ipek and Jose F. Martinez, MICRO'08
Presented by Pak, Eunji



Presentation Transcript


  1. Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
  Ramazan Bitirgen, Engin Ipek and Jose F. Martinez, MICRO'08
  Presented by Pak, Eunji

  2. Introduction
  • Resource sharing problem in CMPs
    • Increasing levels of pressure on shared system resources
    • Efficient sharing is necessary for high utilization and performance
  • Multiple interacting resources
    • Cache space, DRAM bandwidth, and power budget
    • Allocating one resource to an application affects its demand for the other resources
  • Proposed: a resource allocation framework that, at runtime, monitors the execution of each application, learns a predictive model of performance as a function of resource allocation decisions, and periodically allocates resources to each core using that model

  3. Resource Allocation Framework
  • Per-application HW performance model
    • Uses Artificial Neural Networks (ANNs)
    • Predicts each application's performance as a function of the resources allocated to it
  • Global resource manager
    • At every interval, searches the space of possible resource allocations by querying the application performance models

  4. How to Predict Performance? (Artificial Neural Networks)
  • Use ANNs: input units, hidden units, and an output unit connected by a set of weighted edges
  • Each hidden (output) unit computes a weighted sum of its inputs (hidden values) based on the edge weights
  • Edge weights are trained on a set of training examples
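The forward pass described on this slide can be sketched in a few lines of Python. The network shape and weights below are purely illustrative; they are not the paper's actual configuration.

```python
import math

def sigmoid(x):
    # Squash a weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def ann_predict(inputs, hidden_weights, output_weights):
    # Each hidden unit computes the sigmoid of a weighted sum of the inputs.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # The output unit computes the sigmoid of a weighted sum of hidden values.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical 3-input, 2-hidden-unit network.
hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
output_w = [1.0, -1.0]
print(ann_predict([0.2, 0.7, 0.1], hidden_w, output_w))
```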

  5. How to Predict Performance? (Adaptation to a per-Application Performance Model)
  • Input units
    • L2 cache space, off-chip bandwidth, and power budget allocated to the application
    • Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
    • Fraction of cache ways that are dirty (indicates the amount of writeback traffic)
  • Activation function
    • Sigmoid (maps a weighted sum to a value in [0, 1])
  • Models performance as a function of the application's allocated resources and recent behavior
  • Trained during the first 1.2 billion cycles with randomly allocated resources
  • Always keeps a training set of 300 points
  • Retrained every 2,500,000 cycles
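The bounded training set can be modeled as a fixed-capacity buffer that evicts the oldest samples. Only the 300-point capacity comes from the slide; the feature names and values below are assumptions for illustration.

```python
from collections import deque

# Bounded training set: new observations push out the oldest ones.
training_set = deque(maxlen=300)

def record_sample(cache_ways, bandwidth, power, counters, ipc):
    # Features: allocated resources plus recent cache/writeback statistics.
    # Target: the measured performance (IPC) over the interval.
    features = [cache_ways, bandwidth, power] + list(counters)
    training_set.append((features, ipc))

# Simulate 400 observation intervals; only the last 300 are retained.
for i in range(400):
    record_sample(4, 800, 5.0, [10, 2, 8, 1, 0.3], 1.0 + 0.001 * i)
print(len(training_set))
```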

  6. How to Predict Performance? (Adaptation to a per-Application Performance Model)
  • Optimization: prevent memorizing outliers in the sample data
  • Cross validation
    • The data set is divided into N equal-sized folds (N-1 folds for training, 1 fold for testing)
    • The ensemble consists of N ANN models
    • Performance is predicted by averaging the predictions of all ANNs in the ensemble
    • Prediction error is estimated as a function of the CoV of the predictions of the ANNs in the ensemble (used later for resource allocation)
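A minimal sketch of the ensemble's prediction averaging and CoV-based error estimate. The stand-in "models" are constant predictors used in place of trained ANNs.

```python
import statistics

def ensemble_predict(models, features):
    # Average the predictions of all models in the ensemble, and
    # estimate error via the coefficient of variation (stdev / mean).
    preds = [m(features) for m in models]
    mean = statistics.mean(preds)
    cov = statistics.stdev(preds) / mean if mean else float("inf")
    return mean, cov

# Placeholder models with slightly different outputs.
models = [lambda f, b=b: 1.0 + b for b in (-0.05, 0.0, 0.05)]
mean, cov = ensemble_predict(models, [])
print(mean, cov)
```

A high CoV means the folds disagree, so the prediction is treated as unreliable and can be excluded from the allocation search.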

  7. Resource Allocation
  • Make a resource allocation decision every 500,000 cycles using the trained per-application performance models
  • Discard queries involving an application with a high error estimate
    • Fairly distribute resources among the running applications
    • Predict performance and compute the prediction error
    • If an application's performance estimates are inaccurate (error > 9%), it is excluded from global resource allocation
  • Search the space with stochastic hill climbing
    • Start with a random solution, then iteratively make small changes, each time improving it a little
    • When no further improvement can be found, the algorithm terminates
    • 2,000 trials produce the best tradeoff between search quality and overhead
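The hill-climbing loop can be sketched as below. The granularity (12 spare cache ways among 4 applications) matches the setup slide, but the toy objective function and the even starting point are simplifying assumptions; the framework starts from a random solution and scores candidates by querying the ANN ensemble.

```python
import random

def hill_climb(predict, n_apps=4, spare_units=12, trials=2000):
    # Start from an even split (the paper starts from a random solution).
    alloc = [spare_units // n_apps] * n_apps
    best = predict(alloc)
    for _ in range(trials):
        # Small change: move one unit between two randomly chosen apps.
        src, dst = random.sample(range(n_apps), 2)
        if alloc[src] == 0:
            continue
        cand = alloc[:]
        cand[src] -= 1
        cand[dst] += 1
        score = predict(cand)
        if score > best:  # keep the change only if the model predicts a gain
            alloc, best = cand, score
    return alloc, best

# Hypothetical objective: predicted performance grows with app 2's share.
alloc, best = hill_climb(lambda a: a[2])
print(alloc)
```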

  8. Implementation & Overhead
  • HW implementation
    • A single HW ANN whose edge weights are multiplexed on the fly to realize 16 "virtual" ANNs
    • 12 * 4 + 4 multipliers, as many as there are weighted edges
    • 50-entry table-based quantized sigmoid function
    • Computed in a pipelined manner: a prediction (search query) takes 16 cycles across the 16 virtual ANNs
  • Area, power, and delay
    • 3% of the chip's area
    • 3W power consumption
    • Possible to make 2,000 queries within 5% of an interval
  • OS interface
    • The training set and the ANN weights are embedded in the process state
    • The OS communicates the desired objective function through a control register (CR)
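The 50-entry table-based sigmoid can be modeled in software as below. The input range [-6, 6] and the bucket-center sampling are assumptions; only the 50-entry size comes from the slide.

```python
import math

LO, HI, ENTRIES = -6.0, 6.0, 50  # assumed input range, 50 table entries

# Precompute the sigmoid at the center of each quantization bucket.
TABLE = [1.0 / (1.0 + math.exp(-(LO + (HI - LO) * (i + 0.5) / ENTRIES)))
         for i in range(ENTRIES)]

def quantized_sigmoid(x):
    # Saturate outside the table's range, where sigmoid is ~0 or ~1.
    if x <= LO:
        return TABLE[0]
    if x >= HI:
        return TABLE[-1]
    i = int((x - LO) / (HI - LO) * ENTRIES)
    return TABLE[min(i, ENTRIES - 1)]

print(quantized_sigmoid(0.0))  # close to 0.5
```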

  9. Experimental Setup
  • Tools & architecture
    • Heavily modified version of SESC, with Wattch (power) and HotSpot (temperature)
    • Baseline: Intel Core 2 Quad, DDR2-800
    • 4-core CMP, frequency 0.9GHz-4.0GHz in 0.1GHz steps
    • 4MB, 16-way shared L2 cache
  • Distribute a 60W power budget among the 4 applications via per-core DVFS
    • Ours is limited to 57W
    • 5W statically allocated to each application
  • Partition L2 cache space at the granularity of cache ways
    • One way statically allocated to each application; the remaining 12 ways are distributed
  • Each application is statically allocated 800MB/s of off-chip DRAM bandwidth; the remaining 3.2GB/s is distributed

  10. Experimental Setup
  • Metrics
    • Weighted speedup
    • Sum of IPCs
    • Harmonic mean of normalized IPCs
    • Weighted sum of IPCs
  • Workload
    • 9 quad-core multi-programmed workloads from the SPEC2000 and NAS suites
    • Classified into 3 categories: CPU-bound, memory-bound, and cache-sensitive
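Two of the metrics above can be sketched under their usual definitions (per-application IPC under sharing, normalized to the IPC measured when the application runs alone); the sample IPC values are made up.

```python
def weighted_speedup(shared_ipcs, alone_ipcs):
    # Sum over apps of IPC under sharing divided by IPC when running alone.
    return sum(s / a for s, a in zip(shared_ipcs, alone_ipcs))

def hmean_normalized_ipc(shared_ipcs, alone_ipcs):
    # Harmonic mean of normalized IPCs: balances throughput and fairness.
    n = len(shared_ipcs)
    return n / sum(a / s for s, a in zip(shared_ipcs, alone_ipcs))

# Made-up per-application IPCs for a 4-app workload.
shared = [0.8, 1.2, 0.5, 0.9]
alone = [1.0, 1.5, 1.0, 1.0]
print(weighted_speedup(shared, alone))    # 0.8 + 0.8 + 0.5 + 0.9 = 3.0
print(hmean_normalized_ipc(shared, alone))
```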

  11. Experimental Setup
  • Configurations
    • Unmanaged
    • Isolated cache management (Cache): utility-based cache partitioning [MICRO'06]; distributes L2 cache ways to minimize miss rate
    • Isolated power management (Power): efficient multi-core global power management policies maximizing performance for a given power budget [MICRO'06]
    • Isolated bandwidth management (BW): fair queuing memory system [MICRO'06]
    • Uncoordinated combinations: Cache + Power, Cache + BW, Power + BW, Cache + Power + BW
    • Continuous stochastic hill climbing (Coordinated-HC): learning-based SMT processor resource distribution (issue queue, ROB, and register file) [ISCA'06]
    • Fair-share
    • Proposed scheme (Coordinated-ANN): ANN-based models of the applications' IPC response to resource allocation guide a stochastic hill-climbing search

  12. Evaluation Results
  • Performance
    • Results are normalized to Fair-Share
    • 14% average speedup over Fair-Share
    • Similar results for the other metrics
  [Chart omitted; workload mixes labeled P,C,P,M · M,C,P,M · C,C,C,C · P,C,M,C · C,M,C,C · C,P,C,M · C,M,M,C · P,C,P,M · P,C,P,P]

  13. Evaluation Results
  • Sensitivity to the confidence threshold
    • Results are normalized to Fair-Share
  [Chart omitted]

  14. Evaluation Results
  • Confidence estimation mechanism
    • Fraction of the total execution time during which the ANN could predict the resource allocation optimization, for each application
  [Chart omitted]

  15. Conclusions
  • Proposed a resource allocation framework that manages multiple shared CMP resources in a coordinated fashion through ANNs and a periodic resource allocation scheme
  • A coordinated approach to multiple-resource management is key to delivering high performance on multi-programmed workloads

  16. Extras

  17. Extras

  18. Extras

  19. Extras
