
Simulation and Evaluation Framework for Manycore Architectures

European Union. Republic of Cyprus. Andreas Savva, UCY. Final Project Report.


Presentation Transcript


  1. European Union, Republic of Cyprus. Simulation and Evaluation Framework for Manycore Architectures. Andreas Savva, UCY. Final Project Report.

  2. OUTLINE • Introduction to Many-core architectures. • Main technical objectives of the project. • Project Breakdown. • Work Packages. • Using the developed framework – Case Studies. • Simulation and Results. • Project Outcomes / Deliverables.

  3. Manycore Architectures • Emerging dominant trend in general-purpose CPUs • Expected to be interconnected using on-chip networks • Tens to hundreds of cores • Simple cores, large parallelism • Several design parameters • I/O system • Processor Architecture • Interconnection Network Architecture • This project aims to: • Develop a simulation and evaluation framework that lets researchers explore the design parameters listed above

  4. Main Technical Objectives – Achieved • Developed a simulation and evaluation framework for many-core architectures in Java. • Developed benchmarks to evaluate many-core architectures. • Developed an on-chip network simulator that supports different architectures, routing algorithms, and traffic patterns. • Developed a cross-compiler in C/C++ that translates programs into instructions executable by the architectures under evaluation. • Developed new architectures in order to evaluate the framework.

  5. Project Breakdown • Work Packages: • Progress and Result Dissemination (WP1, WP2). • Develop a simulator for interconnecting cores (WP3). • Develop models for the execution units and the cores (WP4). • Develop the cross-compiler (WP5). • Create benchmarks to measure performance (WP6). • Develop new architectures to evaluate the framework (WP7).

  6. [Figure: work package timeline. WP1 + WP2 (progress and results dissemination) overlap with WP3 (develop many-core simulator), WP4 (develop execution units), WP5 (cross-compiler), WP6 (benchmarks), and WP7 (evaluate framework).]

  7. Project Management (WP1) • Kick-Off Meeting December 2008 • Targeted Application Models Developed • Application Design Trade-Offs • Roles • Six-Month Progress Reports • 18-Month (Interim) Progress Report • Financial Issues • Final Progress Report • Final Financial Issues

  8. Dissemination of Results (WP2) • Project Website • http://www.ece.ucy.ac.cy/labs/easoc/Research/SEFMA/home.html • Publications • Publications in selected Journals and Conferences.

  9. WP3: Simulator for Interconnecting Cores • Determine specifications for the many-core network simulator. • Evaluate existing simulation frameworks • POPNET simulator – C++ programming language. • GPNOC simulator – Java programming language. • Adapt the simulation framework to simulate our many-core systems. • Develop traffic models based on many-core applications for future evaluation • Random Traffic Pattern. • Tornado Traffic Pattern. • Transpose Traffic Pattern. • Neighbor Traffic Pattern. COMPLETED!
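
The four traffic patterns named above can be sketched as destination functions. This is a minimal illustration assuming the common textbook definitions for a k×k network with wrap-around addressing; the slides do not define the patterns, so the exact formulas used by the simulator may differ, and all function names here are hypothetical.

```python
import math
import random


def transpose(x, y, k):
    """Transpose: node (x, y) sends to node (y, x)."""
    return (y, x)


def tornado(x, y, k):
    """Tornado: shift by ceil(k/2) - 1 in each dimension (assumed definition)."""
    shift = math.ceil(k / 2) - 1
    return ((x + shift) % k, (y + shift) % k)


def neighbor(x, y, k):
    """Neighbor: send to the next node in each dimension (assumed definition)."""
    return ((x + 1) % k, (y + 1) % k)


def uniform_random(x, y, k, rng=random):
    """Random: pick any destination other than the source, uniformly."""
    while True:
        dest = (rng.randrange(k), rng.randrange(k))
        if dest != (x, y):
            return dest
```

For example, `tornado(0, 0, 8)` gives `(3, 3)` and `neighbor(7, 0, 8)` wraps around to `(0, 1)`.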

  10. WP4: Core and Execution Unit Models • Develop communication protocol between units and network • Design and develop unit models • Cores. • Memory. • Input/output data models. • Framework to develop models based on the specifications. COMPLETED!

  11. WP5: Cross-Compiler • Create instruction set architecture. • Study existing compilers for RISC processors. • Adapt existing compiler to translate programs into machine instructions. • Integrate the compiler into the framework. COMPLETED!

  12. WP6: Benchmarks • Define and evaluate all possible functions of the system based on: • Performance • Power consumption • Reliability • Develop algorithms to measure performance, power consumption, reliability. • Develop benchmarks for many-core processors in Assembly language. COMPLETED!

  13. WP7: Framework Evaluation • WP Goals: • Develop and evaluate novel many-core architectures. • Develop and evaluate algorithms for work distribution in many-core processors. • Cross-evaluation of the developed framework based on the new many-core architectures. COMPLETED!

  14. USING/EVALUATING THE FRAMEWORK Case Studies

  15. Reducing power consumption • Power consumption: a major limitation in NoCs. • Links and NoC routers: the most power-hungry components. • Intel’s Teraflop NoC prototype suggests that links could account for as much as 17% of power consumption, with the rest consumed by the routers. • Reduce both static and dynamic power consumption. • Previously proposed approaches focus on simple static threshold mechanisms. A new intelligent dynamic power management policy for NoCs is needed.

  16. Reducing power consumption Threshold-based algorithm for turning links off/on: • Run simulation and check link utilization. • Choose threshold. • Run simulation. • If the new link utilization is smaller than the threshold → turn the link off for a period of time. • After x cycles turn the link back on. NEXT: A new Intelligent Dynamic on/off Link Management for NoCs based on ANNs.
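
The threshold-based steps above can be sketched as a small state tracker. This is only an illustration under stated assumptions: the class name `LinkState`, the dictionary of per-link utilizations, and the convention that only currently active links are reported are all hypothetical and not taken from the framework.

```python
class LinkState:
    """Tracks which links are off under a static utilization threshold."""

    def __init__(self, off_period):
        self.off_period = off_period  # the "x cycles" a link stays off
        self.off_until = {}           # link id -> cycle at which it turns back on

    def update(self, cycle, utilization, threshold):
        """Apply one decision step.

        `utilization` maps each currently active link to its measured
        utilization over the last interval (assumed to be in [0, 1]).
        """
        # Turn links back on once their off period has elapsed.
        for link in [l for l, t in self.off_until.items() if cycle >= t]:
            del self.off_until[link]
        # Turn off active links whose utilization fell below the threshold.
        for link, value in utilization.items():
            if link not in self.off_until and value < threshold:
                self.off_until[link] = cycle + self.off_period

    def is_on(self, link):
        return link not in self.off_until
```

A link reported with utilization 0.05 against a threshold of 0.1 is switched off at that step and becomes available again `off_period` cycles later.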

  17. Reducing power consumption Artificial Neural Networks • Information processing paradigm inspired by the way biological neurons process information. • Composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. • Used as prediction and forecasting mechanisms in several application areas • Able to determine hidden and strongly non-linear dependencies.

  18. Reducing power consumption Intelligent ANN algorithm: • Pre-training. • Choose links with minimum link utilization • Size of network more manageable • Prediction scheme based on ANN • Divide network into smaller nets • Pass chosen links as inputs to the ANNs • Output → links to turn off • ANNs can be used for prediction since they can discover hidden dependencies. [Figure: Power savings for 8x8 mesh and torus networks.]

  19. Reducing power consumption [Figure: ANN predictor with NoCs; an 8×8 network partitioned into four 4×4 networks, each with its own ANN.]
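
As a rough illustration of the partitioning shown on this slide, the sketch below maps a router coordinate in an 8×8 network to one of four 4×4 regions. The row-major region numbering and the function name `region_of` are assumptions made for the example.

```python
def region_of(x, y, region_size=4, mesh_size=8):
    """Return the index of the region (and thus the ANN) covering router (x, y).

    Regions are numbered row by row starting at 0 (an assumed convention).
    """
    regions_per_row = mesh_size // region_size
    return (y // region_size) * regions_per_row + (x // region_size)
```

With the default sizes, each of the four regions covers exactly 16 routers.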

  20. Reducing power consumption • Experiments with several NoC region sizes. • Compare hardware overheads and the corresponding power savings. • A 4×4 NoC region offers satisfactory power savings and lower ANN overheads than a 5×5 NoC region. • A 3×3 NoC region does not provide enough information for the ANN to make accurate predictions. • We designed the ANN-based system to monitor 4×4 NoC regions. [Figure: Power savings and hardware overheads for 3×3, 4×4, and 5×5 NoC regions.]

  21. Reducing power consumption Prediction scheme based on ANN • The ANN mechanism receives the average link utilizations from all the links of the 4×4 NoC partition. • The ANN uses the utilization values to find the optimal threshold. • The threshold determines whether each link is turned off or on for the next n-cycle interval.
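
The prediction scheme can be illustrated with a tiny feed-forward network. This is a hedged sketch, not the project's implementation: the single hidden layer, the sigmoid activation, the single threshold output, and all function and parameter names are assumptions, and the weights would come from the pre-training step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_threshold(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: utilization inputs -> hidden layer -> one threshold in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def link_decisions(link_utilization, threshold):
    """True means the link stays on for the next interval, False means it is turned off."""
    return {link: value >= threshold for link, value in link_utilization.items()}
```

With the predicted threshold in hand, `link_decisions({"a": 0.2, "b": 0.7}, 0.5)` would keep link `b` on and turn link `a` off for the next interval.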

  22. Reducing power consumption ANN hardware optimization • An ANN for a 4×4 region monitors 16 routers => at least 8 input neurons. • Eight neurons at the input layer of the ANN => the hidden layer should have five neurons. • This follows the rule of thumb that a satisfactory number of hidden-layer neurons equals half the number of input neurons plus one. • Try to minimize the size of the hidden layer…
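
The rule of thumb on this slide, and the reason a smaller hidden layer is attractive in hardware, can be checked with a few lines of arithmetic. The multiply-accumulate count below assumes a fully connected network with a single output neuron, which the slides do not state.

```python
def rule_of_thumb_hidden(n_inputs):
    """Half the number of input neurons plus one."""
    return n_inputs // 2 + 1


def mac_count(n_inputs, n_hidden, n_outputs=1):
    """Multiply-accumulate operations per forward pass (fully connected, assumed)."""
    return n_inputs * n_hidden + n_hidden * n_outputs
```

For 8 inputs the rule gives 5 hidden neurons (45 multiply-accumulates under these assumptions); dropping to 4 hidden neurons reduces that to 36.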

  23. Reducing power consumption • Choose an appropriate size for the hidden layer of the ANN. • Three different ANNs were developed with five, four and three neurons in the hidden layer. • Using four neurons (instead of five) in the hidden layer exhibits the best power savings for all the traffic patterns. [Figure: Power savings for different numbers of neurons in the hidden layer.]

  24. Reducing power consumption • How does the bit representation of the training weights affect the threshold computation? • 24-, 16-, 8-, 6- and 4-bit representations were used. • 24, 16, 8 and 6 bits show similar power savings, but these savings are significantly reduced when 4 bits are used, due to reduced training accuracy. • => 6 bits were chosen, which makes the multiply-accumulate hardware very small. [Figure: Power savings for different training weight bit representations.]
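
To give a feel for why very narrow weights lose accuracy, the sketch below rounds a weight to a signed fixed-point value with a given number of bits. The weight range of [-1, 1) and the rounding scheme are assumptions for illustration; the slides do not describe the representation actually used.

```python
def quantize(weight, bits, w_max=1.0):
    """Round `weight` to the nearest signed fixed-point value with `bits` bits."""
    levels = 2 ** (bits - 1)          # one bit is used for the sign
    step = w_max / levels             # spacing between representable values
    q = round(weight / step)
    q = max(-levels, min(levels - 1, q))  # clamp to the representable range
    return q * step
```

Under these assumptions a weight of 0.3 becomes 0.25 with 4 bits but 0.3125 with 6 bits, so the 6-bit rounding error is a quarter of the 4-bit one.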

  25. Simulation and Results... • The power savings of the ANN-based mechanism are better than the savings in the other cases. • The ANN-based mechanism can identify a significant amount of future behavior in the observed traffic patterns. • It can intelligently select the threshold necessary for the next timing interval. [Figure: Power savings for 8x8 mesh and torus networks.]

  26. Simulation and Results... • Measure throughput for each mechanism. • Having no on/off mechanism yields a higher throughput, but the ANN-based technique shows better throughput than the statically determined threshold techniques. [Figure: Throughput for 8x8 mesh and torus networks.]

  27. Simulation and Results... • Measure energy for each mechanism. • The energy consumed using the ANN mechanism is lower than in the other cases. • The ANN exhibits a reduction in the overall energy, because of a balanced performance-to-power savings ratio, when compared to not having on/off links or to static threshold computation. [Figure: Normalized energy for 8x8 torus networks.]

  28. Simulation and Results... • Measure packet latency for each mechanism. • The ANN-based mechanism incurs more delay, but we believe that the delay penalty is acceptable when compared to the associated power savings. [Figure: Average packet latency.]

  29. Reducing power consumption New Intelligent ANN algorithm: • Pre-training. • Choose router ports with minimum port utilization • Size of network more manageable • Prediction scheme based on ANN • Divide network into smaller nets • Pass chosen ports as inputs to the ANNs • Output → ports to turn off

  30. Reducing power consumption • When router ports become unavailable, temporarily or permanently, X-Y routing cannot guarantee a deadlock-free system. • Since router ports are turned off in our work, a new routing algorithm must be developed to ensure that there are no deadlocks. • Fully adaptive routing algorithms perform better in the presence of faults, but they are very difficult to implement due to their higher overhead in silicon area and energy consumption. • For this reason, a partially adaptive routing algorithm was chosen to achieve a certain degree of fault tolerance in our system.

  31. Reducing power consumption • The Fault Tolerant Negative First algorithm is based on the turn model. • It forbids certain turns so that deadlock can be avoided. • A packet is first routed in the negative direction in each dimension and then in the positive direction: the message first moves west or south until the negative offset is zero, and after that it moves north or east. [Figure: Negative First routing algorithm in an 8x8 mesh network.]
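
The basic negative-first rule described above can be sketched as follows. This covers only the plain, fault-free rule in a 2D mesh, not the fault-tolerant variant used in the project; the coordinate convention (west and south decrease x and y) and the function names are assumptions.

```python
MOVES = {"W": (-1, 0), "E": (1, 0), "S": (0, -1), "N": (0, 1)}


def negative_first_hops(cur, dst):
    """Return the directions allowed at `cur` for a packet heading to `dst`.

    All negative moves (west, south) must be completed before any
    positive move (east, north) is taken.
    """
    (cx, cy), (dx, dy) = cur, dst
    negative = []
    if dx < cx:
        negative.append("W")
    if dy < cy:
        negative.append("S")
    if negative:
        return negative
    positive = []
    if dx > cx:
        positive.append("E")
    if dy > cy:
        positive.append("N")
    return positive


def route(src, dst):
    """Follow the first allowed direction at every hop and return the path."""
    path = [src]
    cur = src
    while cur != dst:
        step = MOVES[negative_first_hops(cur, dst)[0]]
        cur = (cur[0] + step[0], cur[1] + step[1])
        path.append(cur)
    return path
```

A packet from (3, 1) to (1, 4) first travels west to (1, 1) and only then north to (1, 4). When more than one direction is allowed, an adaptive router could pick among them, for instance to avoid a port that is turned off.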

  32. Simulation Results • For all the traffic models, the power savings of the ANN-based mechanism are better than those of the statically determined case and of the case without any on/off ports. [Figure: Power savings for 8x8 mesh and torus networks.]

  33. Simulation Results... • Having no on/off mechanism yields a higher throughput; however, the ANN-based technique yields better throughput when compared to the statically determined threshold. [Figure: Normalized throughput for 8x8 mesh and torus networks.]

  34. Results from the framework use • The framework can be used by researchers to evaluate many-core architectures. • It helps to compare how the number of cores affects the total power consumption of the network. • Intel showed that the number of cores may be affected by power consumption because of the increased number of routers, interconnects, and data travelling through the network. • Researchers can do parameter exploration related to many-core architectures. • This new Network-on-Chip framework helps researchers solve different NoC tasks through simulations.

  35. Project Outcomes • Smooth flow of work • Some simulator problems have been overcome • Help from Dr. Soteriou and Drs. Michael and Hadjicostis • Results dissemination on target with project goals. • Publications in conferences/journals • Participation in ISVLSI Conference July 2011, Chennai, India. • Publication in Journal of Electrical and Computer Engineering, Hindawi Publishing Corporation, 2012. • Submission to ISVLSI 2012: paper on turning router ports on/off. (Under Review)

  36. Publications ARTICLES: • A. Savva, T. Theocharides, V. Soteriou, “Intelligent On/Off Link Management for On-Chip Networks”, In Proc. IEEE Annual Symposium on VLSI, pp. 343 – 344, July 2011. • Under Review: A. Savva, T. Theocharides, V. Soteriou, “Intelligent On/Off Router Ports Management for Networks on Chip”, ISVLSI Conference 2012 JOURNALS: • Andreas G. Savva, T. Theocharides, V. Soteriou, "Intelligent On/Off Dynamic Link Management for On-Chip Networks," Journal of Electrical and Computer Engineering, vol. 2012, Article ID 107821, 2012 POSTER: • Poster at HiPEAC Ph.D. Student Poster Presentation - Paphos, Cyprus, January 2009. WORKSHOP: • Results of this work were presented in a workshop at KIOS Research Centre – 30 Nov. 2011

  37. Project Deliverables: • D1: Six-Month, Interim, and Final Reports, Financial Reports • D2: Project Website, Publications • D3: Network communication simulator in Java, four traffic models for simulation and evaluation of the network (source code available) • D4: RISC processor models, memory models, core models, input/output models (VHDL/C++ code) • D5: Cross-compiler • D6: Benchmarks, algorithms for power consumption and performance measurements. • D7: Many-core architectures, evaluation of the developed framework.

  38. Acknowledgements to: • Dr. Maria K. Michael – for the verification and automation algorithms feedback. • Dr. Christoforos Hadjicostis – for the reliability aspects and the discrete event algorithms employed in building the simulator. • Dr. Vassos Soteriou - for the feedback on the Interconnect. • Dr. Theocharis Theocharides - for the coordination of this project and all the help.

  39. This work falls under the Cyprus Research Promotion Foundation’s Framework Programme for Research, Technological Development and Innovation 2008 (DESMI 2008), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant PENEK/ENISX/0308. European Union, Republic of Cyprus.

  40. THANK YOU! Project Host Organization: University of Cyprus (Andreas Savva, Theocharis Theocharides, Maria K. Michael, Christoforos Hadjicostis). Collaborating Partners: Cyprus University of Technology (Vassos Soteriou).
