1. Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia Ingrid Verbauwhede, Christian Piguet, Bart Kienhuis/Ed Deprettere, Patrick Schaumont
2. Outline Intro
Low Power Observations
SoC Architectures: Ingrid
Low Power Components: Christian
Design Methods: Ed
Design Methods: Patrick
Conclusion
3. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
This slide is a reformulation of the title. These are the three components of this presentation:
It is about applications (Embedded DSP & Multimedia),
about mapping them onto architectures (Design Methods),
and about architectures for these applications (Low Power).
4. Low Power observation 1: architecture tuned to application The first observation we can make is that real-time embedded applications are NOT mapped onto general-purpose platforms. They are a mixture of dedicated units and small embedded processors, for example the PalmPilot i705.
5. Observation 2: Energy-flexibility trade-off
6. Example: DSP processors Specialized instructions: MAC
Dedicated co-processors: Viterbi acceleration
7. FIR on TI C55x: Dual MAC
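The dual-MAC idea can be sketched in plain C: one coefficient fetch per tap feeds two MAC units, so each inner-loop iteration produces two FIR outputs. This is an illustrative sketch, not C55x intrinsics or assembly.

```c
#include <assert.h>

/* Sketch of a dual-MAC FIR inner loop: the single coefficient fetch
 * is shared between both MAC units, producing y[i] and y[i+1] in the
 * same pass.  Illustrative only (not C55x code). */
#define NTAPS 4

void fir_dual_mac(const int *x, const int *h, int *y, int n)
{
    for (int i = 0; i + 1 < n; i += 2) {      /* two outputs per pass */
        int acc0 = 0, acc1 = 0;
        for (int k = 0; k < NTAPS; k++) {
            int c = h[k];                     /* one coefficient fetch... */
            acc0 += c * x[i + k];             /* ...feeds MAC unit 0 */
            acc1 += c * x[i + 1 + k];         /* ...and MAC unit 1   */
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
}
```

On a dual-MAC datapath this pairing roughly halves the cycles per output sample, since both multiply-accumulates can issue in the same cycle.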
8. Energy comparison
9. Viterbi on TI C54x CMPS instruction
Result: 4 cycles per butterfly instead of 20 or more cycles per butterfly
= Energy efficient
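The step that CMPS accelerates is the compare-select half of an add-compare-select (ACS) update; a full butterfly performs two such updates. Spelled out as a C sketch (not C54x code):

```c
#include <assert.h>

/* One add-compare-select (ACS) step of a Viterbi butterfly.  On the
 * C54x, the compare-select part is what the CMPS instruction performs
 * in hardware, recording the decision bit as it goes.  Sketch only. */
typedef struct { int metric; int decision; } acs_t;

acs_t acs(int m0, int m1, int branch0, int branch1)
{
    int p0 = m0 + branch0;        /* candidate path via state 0 */
    int p1 = m1 + branch1;        /* candidate path via state 1 */
    acs_t r;
    r.decision = (p1 < p0);       /* compare-select: keep the smaller metric */
    r.metric = r.decision ? p1 : p0;
    return r;
}
```

Doing the compare, the select, and the decision-bit bookkeeping in one instruction is what brings the butterfly down to a few cycles.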
10. Observation: general-purpose architectures also become heterogeneous General-purpose architectures are becoming heterogeneous as well.
This applies to general-purpose microprocessors; think of the multimedia instruction-set extensions.
It applies as well to general-purpose FPGA architectures. As shown on this slide, new-generation FPGAs contain more CLBs, but also rows of MACs, special block RAMs, PowerPC cores, and so on.
11. Question Energy and flexibility are opposing demands!
How to navigate in this jungle?
3D design space:
Next question: how to map (or compile) an application onto such an architecture?
Clearly, specialized architectures, instruction sets, components, and so on are beneficial for reducing energy. The question is: how do we choose which features of the architecture to fix and which to leave flexible? Therefore, we define a 3D design space.
…
The next question is then: how to map (or compile) an application onto such an architecture? This is a question for the design methods presentation.
12. Flexibility (1) - Abstraction level Instruction set level = “programmable”
CLB level = “reconfigurable”
SKIP this slide and add the message to the next slide???
Reconfiguration or re-programming can happen at different abstraction levels. Compiling (i.e. mapping) a new application onto a processor is a form of reconfiguration at the instruction level; we call these “programmable processors”. Mapping (i.e. compiling) a new application onto an FPGA is called reconfigurable computing. Both are forms of general-purpose computing.
13. Flexibility (2) - Reconfigurable feature Basic components:
14. Flexibility (3) - Binding rate Compare processing to binding
Configurable (“compile-time”)
Re-configurable
Dynamic reconfigurable (“adaptive”)
15. SoC architecture: RINGS Work with domains!
The domain architecture is tuned towards the application domain, and so is its programmability. This means some domains require high flexibility while others can do with less. A central (low clock frequency) CPU controls the SoC, not the details within each domain. The domains are connected by a reconfigurable interconnect. I will give more details in the next slides.
16. Instruction set extension
Register mapped
Tightly coupled
Experiment: DFT
Other examples of tightly coupled extensions: Tensilica.
17. Co-processor Memory mapped
Loosely coupled
Experiment: AES
Other examples of loosely coupled co-processors: the Turbo accelerators of the TI C6x. Memory overhead is a big issue, as is the interconnect architecture of the existing processor (e.g. the AMBA bus); it limits the energy improvement.
18. Independent IP Loosely coupled
Network on chip connected
Flexible interconnect
Experiment: TCP/IP checksum
Other examples of IP?
19. Communication: Energy-flexibility Also energy - flexibility conflict!
General purpose NOC: tiles
FPGA: general purpose
Therefore: domain specific NOC
20. Conclusion Low Power by going domain-specific
Energy-flexibility conflict
How to “program” this RINGS?
Next: Ultra-low power components: Christian
Design exploration: Ed
Co-design environment: Patrick
21. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
22. Efficient Embedded DSP: Ultra-Low-Power Components Christian Piguet, CSEM
23. Ultra Low Power DSP Processors The design of DSP processors is very challenging, as it has to take contradictory goals into account:
increased throughput requirements
at a reduced energy budget
New issues due to very deep submicron technologies such as interconnect delays and leakage
History of hearing aids circuits:
analog filters 15 years ago
digital ASIC-like circuits 5 years ago
powerful DSP processors today, below 1 Volt and 1 mW
24. DSP Architectures for Low-Power single MAC DSP core of 5-10 years ago
parallel architectures with several MAC working in parallel
VLIW or multitask DSP architectures
Benchmark:
number of simple operations executed per clock cycle, up to 50 or more
Drawbacks of VLIW:
very large instruction words up to 256 bits
Some instructions in the set are still missing
transistor count is not favorable to reduce leakage
25. VLIW TMS320C6x (VelociTI)
26. 3 Ways to be more Energy Efficient To design specific, very small DSP engines for each task, such that each DSP task is executed in the most energy-efficient way on the smallest piece of hardware (N co-processors)
to design reconfigurable architectures such as the DART cluster,
in which configuration bits allow the user to modify the hardware so that it fits the executed algorithms much better.
27. Co-processors
28. DART: O. Sentieys, ENSSAT
29. Reconfigurable DSP Architectures Not FPGA: much more efficient than FPGA. The key point is to reconfigure only a limited number of units:
Reconfigurable datapath
Reconfigurable interconnections
Reconfigurable Addressing Units (AGU)
FPGA comparison:
the MACGIC DSP consumes 1 mW/MHz in 0.18 µm
the same MACGIC in an Altera Stratix consumes 10 mW/MHz plus 900 mW of static power, so 1’000 mW at 10 MHz
30. Reconfigurable Datapaths
31. Reconfigurable Addressing Modes Operand fetch is generally a severe bottleneck in parallel machines, for which 8-16 operands are required each clock cycle.
Sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed.
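To illustrate what a reconfigurable addressing unit does, here is a C sketch in which one address register supports linear, modulo (circular-buffer), and bit-reversed (FFT) post-increment, selected by a configuration field. The interface and names are hypothetical, not the MACGIC AGU.

```c
#include <assert.h>

/* Sketch of a reconfigurable address-generation unit (AGU): the mode
 * field plays the role of the configuration bits.  Hypothetical
 * interface, for illustration only. */
typedef enum { AGU_LINEAR, AGU_MODULO, AGU_BITREV } agu_mode_t;

typedef struct {
    unsigned addr;    /* current address                        */
    unsigned step;    /* post-increment (linear/modulo modes)   */
    unsigned len;     /* buffer length; power of two for BITREV */
    agu_mode_t mode;
} agu_t;

unsigned agu_next(agu_t *a)       /* returns addr, then post-increments */
{
    unsigned cur = a->addr;
    switch (a->mode) {
    case AGU_LINEAR:
        a->addr += a->step;
        break;
    case AGU_MODULO:              /* wrap inside a circular buffer */
        a->addr = (a->addr + a->step) % a->len;
        break;
    case AGU_BITREV: {            /* reverse-carry add of len/2 (FFT) */
        unsigned bit = a->len >> 1;
        while (a->addr & bit) { a->addr ^= bit; bit >>= 1; }
        a->addr |= bit;
        break;
    }
    }
    return cur;
}
```

With len = 8 in bit-reversed mode, successive calls visit 0, 4, 2, 6, 1, 5, 3, 7 — the access order an FFT needs without any software address arithmetic.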
32. MACGIC: Performance results Power consumption for a 24-bit, 10 MHz synthesis @ 0.9 V in the 0.18 µm TSMC technology (SYNOPSYS & MACHTA/PA simulations):
NOP: 25 µA/MHz
ADD 24-bit: 102 µA/MHz, 98k MOPS/Watt
MAC 24-bit/56-bit: 137 µA/MHz, 81k MOPS/Watt
4* ADD 24-bit: 167 µA/MHz, 120k MOPS/Watt
4* MAC 24-bit/56-bit: 283 µA/MHz, 86k MOPS/Watt
MACV 24-bit/56-bit: 269 µA/MHz, 90k MOPS/Watt
CBFY4 radix-4 FFT: 273 µA/MHz, 131k MOPS/Watt
Number of transistors for this 24-bit version: ~600’000
Number of transistors for a 16-bit version: ~400’000
33. Performance results: 64-point complex FFT MACGIC®: ~250 clock cycles
CARMEL: 526 clock cycles
PalmDSPCore: 450 clock cycles
SC140 StarCore: 288 clock cycles
R.E.A.L. DSP: 850 clock cycles
SP-5flex (3DSP): 500 clock cycles
TI C62x: 675 clock cycles
TI C64x: 276 clock cycles
34. Comparison
35. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
36. Design & ArchitectureExploration Ed Deprettere, Professor
Bart Kienhuis, Assistant Professor
Leiden University
LIACS, The Netherlands
37. Embedded DSP Architectures
38. System Level Design Three aspects are important in System Level Design
The Architecture
The Application
How the Application is Mapped onto the Architecture.
To optimize a system, you need to take all three aspects into consideration.
This is expressed in terms of the Y-Chart
39. Y-chart Approach The Y-chart approach measures the performance of a set of applications mapped onto a particular architecture instance.
40. Y-chart Approach Instead of making changes in the architecture, the Y-chart also shows that a better-performing system can be obtained by changing the way we map applications onto an architecture and the way we describe the set of applications. It is not limited to architectures only.
The central element in the Y-chart is performance analysis. Within performance analysis there is a fundamental trade-off between three elements:
The cost of modeling
The cost of evaluating the model
The accuracy of the evaluation
These three elements can be drawn in a diagram we call the “Abstraction Pyramid”.
41. Design Space Exploration
42. Design Space Exploration
43. Y-chart Design For GP Processors
44. Y-chart Design For DSP Applications
45. How to improve performance How can we improve the performance of the system we are interested in?
Others focus on architecture, we want to focus on the application.
For a low-power architecture parallelism is important.
Exploiting more parallelism leads to faster calculations;
using voltage and frequency scaling, we assume that power is then saved
There is already a lot of theory developed to exploit bit-level, instruction-level, and task-level parallelism.
Especially task-level parallelism is getting more and more important to effectively map DSP applications onto the newly emerging architectures.
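The parallelism-plus-scaling argument can be made concrete with the dynamic-power relation P = C·V²·f: duplicating a unit halves the required frequency, and if the lower speed also permits a lower supply voltage, the pair dissipates less than one fast unit. The voltage and frequency numbers below are illustrative, not figures from the talk.

```c
#include <assert.h>

/* Dynamic power of CMOS logic: P = C * V^2 * f
 * (switched capacitance, supply voltage, clock frequency). */
double dyn_power(double c, double v, double f)
{
    return c * v * v * f;
}
```

For example, with unit capacitance, one block at 1.8 V and 100 MHz gives a relative power of 1.8² × 100 = 324, while two parallel blocks at 1.2 V and 50 MHz give 2 × 1.2² × 50 = 144: the same throughput at well under half the power.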
46. Programming Problem
47. Kahn Process Network (KPN) Kahn Process Networks [Kahn 1974][Parks&Lee 95]
Processes run autonomously
Communicate via unbounded FIFOs
Synchronize via blocking read
Process is either
executing (execute)
communicating (send/get)
Characteristics
Deterministic
Distributed Control
No Global Scheduler is needed
Distributed Memory
No memory contention
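The blocking-read semantics can be sketched in C. This single-threaded round-robin loop is only a stand-in for truly autonomous processes, and the FIFO is bounded for simplicity (real KPNs assume unbounded FIFOs).

```c
#include <assert.h>

/* Minimal sketch of KPN blocking-read semantics: a producer and a
 * consumer share a FIFO; the consumer simply does nothing (is
 * "blocked") while the FIFO is empty.  Bounded FIFO and round-robin
 * scheduling are simplifications for the sketch. */
#define FIFO_CAP 16

typedef struct { int buf[FIFO_CAP]; int head, tail, count; } fifo_t;

int fifo_put(fifo_t *f, int v)               /* returns 0 when full  */
{
    if (f->count == FIFO_CAP) return 0;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

int fifo_get(fifo_t *f, int *v)              /* returns 0 when empty: */
{                                            /* the reader must block */
    if (f->count == 0) return 0;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

int kpn_run(int n)        /* producer emits 0..n-1, consumer sums them */
{
    fifo_t f = {{0}, 0, 0, 0};
    int produced = 0, sum = 0, v;
    while (produced < n || f.count > 0) {
        if (produced < n && fifo_put(&f, produced))
            produced++;                      /* producer process  */
        if (fifo_get(&f, &v))
            sum += v;                        /* consumer process  */
    }
    return sum;
}
```

However the two processes are interleaved, the blocking read alone fixes the order in which tokens are consumed, which is why the result is deterministic — the key KPN property listed above.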
48. Kahn Process Network (KPN)
49. Matlab to Process Networks to FPGA
50. Matlab to Matlab Transformations To make the flow from Matlab to FPGA interesting, we had to give the designers means to change the characteristics
Unrolling (Unfolding)
Increases parallelism
Retiming (Skewing)
Improved pipeline behavior
Clustering (Merging)
Reducing parallelism
All these operations can be applied to the source level of Matlab, leading to a new Matlab program
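Unrolling, the first of these transformations, is easy to show at source level; the sketch below uses C rather than Matlab. The unrolled variant computes the same dot product with two independent accumulator chains that a mapping tool can schedule on parallel units.

```c
#include <assert.h>

/* Source-level unrolling: same result, more exposed parallelism. */
int dot_rolled(const int *a, const int *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

int dot_unrolled2(const int *a, const int *b, int n)   /* n even */
{
    int s0 = 0, s1 = 0;              /* independent accumulators */
    for (int i = 0; i < n; i += 2) {
        s0 += a[i] * b[i];           /* chain for even indices   */
        s1 += a[i + 1] * b[i + 1];   /* chain for odd indices    */
    }
    return s0 + s1;
}
```

Retiming and clustering work the same way: rewrite the source, keep the semantics, and change how much parallelism the mapping step can exploit.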
51. Y-chart Design For DSP Applications
52. Case Study We use the Y-chart environment on a real case study
Adaptive QR [DAES 2002]
Using commercial IP cores
QinetiQ Ltd.
Vectorize 42 pipeline stages
Rotate 55 pipeline stages
QR is interesting as it requires deeply pipelined IP cores
Most Design tools have difficulties with such IP cores
We will explore a number of simple steps to improve the performance of the QR algorithm
It was reported to run at 12 MFlops.
53. Example: Adaptive QR (Step 1)
54. Example: Adaptive QR (Step 2)
55. Example: Adaptive QR (Step 3)
56. Example: Adaptive QR (Step 4)
57. Conclusions Optimizing a System Level Design for Low-Power requires that you look at the architecture, the mapping, and the application.
The Y-chart gives a simple framework to tune a system for optimal performance.
The Y-chart forms the basis for DSE
We showed that by playing with the way applications are written, we get an order of magnitude better performance in a number of steps: 60 MFlops -> 673 MFlops, without changing the architecture!
58. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
59. Domain-Specific Co-design Environments Patrick Schaumont, UCLA Good evening,
This is the fourth part of the embedded tutorial on energy-efficient architectures. I will tell you a few words about the co-design tools you can use for these architectures.
60. Let’s waste no time! Well yes, there is one thing left between you and the DATE party.
<animate> It’s this presentation. So let’s get on with it.
61. Design of the RINGS Multiprocessor In the previous talks by Ingrid and Christian, it was pointed out that distributed, specialized processing is the key to energy-efficient processing. This means you are designing systems that contain multiple cores, with a number of tightly-coupled or stand-alone hardware processors. All of these are embedded in a dedicated on-chip communication architecture.
<animate> This presentation will talk about the tools that you need for such an architecture. Let us concentrate on efficient simulation. In that case you need to combine one or more instruction-set simulation environments with a hardware simulation kernel.
62. ARMZILLA The design environment I will be discussing is called ARMZILLA.
ARMZILLA allows you to instantiate a number of ARM instruction-set simulators and integrate them with a custom hardware model.
The input to ARMZILLA consists of three elements:
A number of ARM executables, created for example with a cross-compiler
A configuration file, which indicates how many cores are needed and which executables they should run
A hardware description, which contains the system interconnect description and any number of dedicated hardware processors that you need in the multiprocessor
In the following few slides, I will walk you through each of these three elements.
Before that, a few words on simulation accuracy. In system-level design, there are a large number of abstraction levels at which you can design a system, and their simulation accuracy varies widely. High abstraction levels are preferred because they allow a more concise and compact system description, and also faster simulation. However, raising the modeling abstraction level also creates a design hole between spec and implementation that needs to be filled later on.
In ARMZILLA, simulations are cycle-true: hardware is modeled semantically at the register-transfer level, and the instruction-set simulators run cycle-true.
63. ARMZILLA Let us now first look at the programs running on the ARM instruction-set simulators.
64. Requirements for an ISS in a multiprocessor simulator In a multiprocessor simulator, we combine a number of instruction-set simulators with a hardware simulation kernel.
Running multiple instruction-set simulators together is not the same as running a single one, or even a single ISS in a System-on-Chip context. In fact, there are a number of requirements one can define for an instruction-set simulator in a multiprocessor environment.
<animate> First of all, we need a linkable model. This means that the instruction-set simulator can be linked as a library into a single executable for the system simulation. This is not a hard requirement, but it is sensible from the efficiency point of view. Some multiprocessor simulations work with instruction-set simulators instantiated as multiple program images that are linked together with interprocess communication. However, performance- and complexity-wise, such solutions are worse than the linked approach.
<animate> A second requirement is to have a reentrant instruction-set simulator. This way, several ISS can coexist in a single environment without interfering with each other. This means, for example, that the ISS cannot rely on global variables and global namespaces. It is a simple and practical requirement, but a lot of the ISS I have seen so far are not reentrant.
<animate> A third requirement is that the ISS needs to be accessible, in two ways. First, you need to be able to control the ISS: in ARMZILLA, we run all the ARM ISS in lockstep, meaning all models advance one clock cycle at a time, in turn. Second, you need to be able to exchange data between the hardware model and the C programs running on the ISS: in ARMZILLA, we intercept calls to the memory interface and can thus implement a memory-mapped interface.
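The lockstep control requirement can be sketched as follows. The iss_t interface is hypothetical and stands in for whatever tick/step entry point a real ISS library exports; it is not ARMZILLA's actual API.

```c
#include <assert.h>

/* Lockstep co-simulation sketch: each ISS exposes a tick() that
 * advances it exactly one clock cycle, and the system simulator
 * calls every core in turn so all models stay cycle-aligned.
 * Hypothetical interface, for illustration. */
typedef struct iss iss_t;
struct iss {
    void (*tick)(iss_t *self);   /* advance this core one clock cycle */
    long cycle;                  /* per-core cycle counter            */
};

void iss_tick(iss_t *s) { s->cycle++; }

void run_lockstep(iss_t **cores, int ncores, long ncycles)
{
    for (long c = 0; c < ncycles; c++)       /* one global clock ...  */
        for (int i = 0; i < ncores; i++)     /* ... ticks every core  */
            cores[i]->tick(cores[i]);
}
```

Because every core is advanced before the global clock moves on, no ISS can run ahead of the hardware model, which is what cycle-true multiprocessor simulation requires.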
65. ARMZILLA Let us next take a look at the configuration model and the hardware/software interfaces.
The goal of the configuration file is to indicate how many instruction-set simulators are required, and to wire them up to the hardware model. So we will also discuss the hardware/software interfaces.
66. Memory-mapped channels: easy in C ARMZILLA uses memory-mapped interfaces.
These are very simple to program in C: a memory-mapped interface in C is simply an initialized pointer, and reading and writing through this pointer results in access to the memory-mapped interface.
<animate> A configuration step is required in a multiprocessor simulator because you have multiple instruction-set simulators running: a memory address is not unique, as each core has its own address space. Conceptually, the configuration gives each core a symbolic name and associates an executable with it.
<animate> Then, when we attach memory-mapped interfaces to the hardware model, we are able to hook up several different ISS and distinguish each ISS and memory space uniquely by referring to the symbolic name.
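The "initialized pointer" idea, sketched in C. On real hardware the pointer would hold a fixed device address, and in ARMZILLA the access would be intercepted by the ISS memory interface; here a static variable stands in for that location so the sketch is runnable, and the address and names are illustrative.

```c
#include <assert.h>

/* On hardware this would be something like:
 *   volatile unsigned int *channel = (volatile unsigned int *)0x80000000;
 * (illustrative address).  A static variable stands in for the
 * intercepted location so the sketch runs anywhere. */
static volatile unsigned int fake_mapped_reg;     /* simulated register */

volatile unsigned int *channel = &fake_mapped_reg;

void send_token(unsigned int v) { *channel = v; }    /* write = send    */
unsigned int recv_token(void)   { return *channel; } /* read  = receive */
```

The volatile qualifier matters on real hardware: it keeps the compiler from caching or eliminating accesses that must actually reach the device.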
67. Hardware Simulation Kernel The third part of ARMZILLA is a hardware simulation kernel, called GEZEL.
GEZEL is used to model hardware co-processors and stand-alone processors, as well as the network-on-chip.
68. GEZEL Hardware Simulation Kernel has a Hybrid Architecture Here you see an overview of the internals of the GEZEL kernel. It is a C++ library, but with a built-in parser. GEZEL can read hardware models in a dedicated, cycle-true modeling language. The modeling language expresses networks of finite-state machines and datapaths, FSMD for short.
Once a hardware model is read into the GEZEL kernel, it can either be simulated or converted to synthesizable code.
For simulation, a number of cosimulation interfaces to cores are available, such as the SH3 DSP, the LEON2 SPARC, and ARM. There is also a SystemC cosimulation interface. In ARMZILLA, we use the cosimulation interface to ARM.
Because the hardware model can be parsed in after the simulator has been compiled, this approach of combining C++ with a dedicated scripting language saves a lot of compilation time when you are exploring hardware models.
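What an FSMD model expresses can be sketched in C: a state register selects the datapath operation, and one step() call corresponds to one clock cycle with registers updating at the end of it. GCD is a standard small FSMD example; this is plain C, not GEZEL syntax.

```c
#include <assert.h>

/* FSMD sketch: finite-state machine (state) plus datapath registers
 * (a, b).  Each fsmd_step() call is one clock cycle. */
enum { S_RUN, S_DONE };

typedef struct { int state; int a, b; } fsmd_t;   /* FSM + registers */

void fsmd_step(fsmd_t *m)                         /* one clock cycle */
{
    switch (m->state) {
    case S_RUN:
        if (m->a == m->b)     m->state = S_DONE;  /* result in a     */
        else if (m->a > m->b) m->a -= m->b;       /* datapath op 1   */
        else                  m->b -= m->a;       /* datapath op 2   */
        break;
    case S_DONE:
        break;                                    /* hold the result */
    }
}
```

A cycle-true language like GEZEL's describes exactly this pairing (which datapath operation fires in which state, every cycle), which is why the same source can drive both simulation and synthesis.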
69. The PONG Example Designed in 1967 by Baer - interactive TV feature
1977, General Instruments AY-3-8500 pong-in-a-chip
Magnavox, Coleco, Atari, Philips, URL, GHP, ... I will next discuss a small design in ARMZILLA to show how such a multiprocessor system can be useful.
We will be using a simple video game called PONG. If there is one game that can claim to be absolutely the oldest video game around, it is PONG. I think the odds that somebody in this room has never played PONG are close to zero. But, to make them definitely zero: the objective of PONG is to move a paddle around and return the ball to your opponent. The ball moves in geometric patterns and bounces off the walls.
PONG was designed in the sixties by one of the godfathers of video games, Ralph Baer. General Instruments made a famous chip of it, which has found its way into countless consoles from a large number of manufacturers.
70. Multiprocessor Model of PONG In our multiprocessor model, we will map pong to four processors, and let the system play against itself.
<animate> We will use a processor for each paddle. The goal of the paddle processor is to determine the player strategy.
<animate> We will also use a processor to simulate the ball dynamics. This processor will decide how the ball bounces off the walls and the paddles.
<animate> The communication in the system will essentially consist of bouncing messages. Such messages announce the speed and the position of the ball.
<animate> Finally there will be a processor for the playing-field, who has to render this playing field on-screen.
In the next few slides, I will discuss the operation of the system. One point you should note here is how easy this application maps to a parallel processor model. Once you map an application according to the natural actors in the system, it becomes a lot easier to implement. In this case, the natural actors are the paddles, the ball, and the rendering field.
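The bouncing messages that the actors exchange can be sketched as a small C data type. This is an illustrative model only; the type and field names are my own, not the actual ARMZILLA code:

```c
/* Illustrative message types for the PONG system; the real design
 * may use a different encoding. */
typedef enum {
    MSG_FIELD_SIZE,   /* field announces its dimensions        */
    MSG_PADDLE_POS,   /* a paddle reports its size and position */
    MSG_BALL_UPDATE,  /* ball broadcasts position and speed     */
    MSG_COLLISION     /* field or paddle reports a hit          */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int x, y;    /* position on the playing field             */
    int vx, vy;  /* speed vector (used by MSG_BALL_UPDATE)    */
} pong_msg_t;

/* The ball announces its position and speed to all other processors. */
static pong_msg_t make_ball_update(int x, int y, int vx, int vy) {
    pong_msg_t m = { MSG_BALL_UPDATE, x, y, vx, vy };
    return m;
}
```

Because every actor understands the same message layout, adding a processor to the system only requires handling the message types it cares about.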
71. Multiprocessor Operation - Initialize We represent the operation of the system as a message sequence chart, where time runs from top to bottom, and each processor has a separate column. This slide shows the message initialization sequence.
<animate> At startup, the field processor announces the dimensions of the field to the paddle processors and the ball. As a result, the paddles and ball can choose a dimension and position that is suitable for the playing field.
<animate> So the paddle processors will send a message back to the field with their size and position. The field processor can then draw the paddles on the playing field.
<animate> Then, the ball will broadcast its position and initially chosen speed vector to all processors. The reason for broadcasting is that each of the field and paddle processors will track the ball individually. Since each one knows how fast the ball is going, they can estimate the position of the ball by means of a simple calculation.
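That "simple calculation" is plain dead reckoning from the last broadcast. A minimal sketch, assuming integer field coordinates and a speed vector expressed per tick (the function name is illustrative):

```c
/* Estimate the ball position t ticks after the last broadcast,
 * given the broadcast position (x, y) and speed vector (vx, vy).
 * Valid only until the next collision: a collision triggers a
 * fresh broadcast, which resets the estimate. */
static void estimate_ball(int x, int y, int vx, int vy, int t,
                          int *ex, int *ey) {
    *ex = x + vx * t;
    *ey = y + vy * t;
}
```

Because every processor runs this locally, the ball only needs to send a message when the speed vector actually changes, which keeps the communication traffic low.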
72. Multiprocessor Operation - Play Now we look at what happens during play.
As mentioned before, each of the processors (field, paddle 1 and paddle 2) is continuously estimating the position of the ball.
The field will redraw the ball when it changes position.
The paddles will align themselves so that they can hit the ball.
<animate> When the position of a paddle changes, it will inform the field that it has taken a new position.
At that moment, the field processor can redraw the paddle in the new position.
<animate> Sooner or later, the field or one of the paddles will conclude that it has hit the ball. That processor will create a collision message addressed to the ball processor. This message indicates where the ball has hit something, and the nature of the collision: the upper wall, the lower wall, or the left or right paddle. In this case, paddle 1 detects a collision and informs the ball.
<animate> The ball processor then evaluates a new speed vector according to the ball dynamics, and broadcasts the new position and speed to all parties in the system. This message informs each of the processors that the speed vector of the ball has changed, so the field should change the direction of the track, and the paddles should change their moving strategy.
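In their simplest form, the ball dynamics on a collision amount to mirroring the speed vector. A hedged sketch (the real demo may use richer dynamics, and the names below are my own):

```c
/* What the ball hit, as carried in a collision message. */
typedef enum {
    HIT_TOP_WALL, HIT_BOTTOM_WALL,
    HIT_LEFT_PADDLE, HIT_RIGHT_PADDLE
} hit_t;

/* Reflect the speed vector depending on the collision:
 * walls flip the vertical component, paddles the horizontal one. */
static void bounce(hit_t hit, int *vx, int *vy) {
    if (hit == HIT_TOP_WALL || hit == HIT_BOTTOM_WALL)
        *vy = -*vy;
    else
        *vx = -*vx;
}
```

After `bounce`, the ball processor would broadcast the new vector so the other processors can restart their local estimates.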
73. Let’s play!
74. Multiprocessor Architecture The simulation that you just saw is one that uses a point-to-point model for the multiprocessor.
This point-to-point model is shown here. There are multiple processors in the system, and they communicate with memory-mapped interfaces. Those interfaces are demultiplexed to dedicated point-to-point busses. The communication protocol uses request-acknowledge handshaking. Such a model is easy to create but requires a lot of hardware connections.
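The request-acknowledge handshake on such a link can be modeled in software, one cycle at a time. A minimal four-phase sketch, purely illustrative and not the GEZEL description used in the demo:

```c
/* One point-to-point link with four-phase request/acknowledge
 * handshaking, evaluated one "cycle" at a time. */
typedef struct {
    int req, ack;   /* handshake wires                    */
    int data;       /* data driven by the sender          */
    int latched;    /* data captured by the receiver      */
} link_t;

/* Sender side: raise the request with data, drop it after the ack. */
static void sender_cycle(link_t *l, int data) {
    if (!l->req && !l->ack) { l->data = data; l->req = 1; }
    else if (l->req && l->ack) { l->req = 0; }
}

/* Receiver side: latch on request, complete when request drops. */
static void receiver_cycle(link_t *l) {
    if (l->req && !l->ack) { l->latched = l->data; l->ack = 1; }
    else if (!l->req && l->ack) { l->ack = 0; }
}
```

One full transfer takes four phases (request up, acknowledge up, request down, acknowledge down), after which both wires are back at zero and the link is ready for the next word.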
75. Multiprocessor Architecture A multiplexed communication structure is more conservative with area. One way to build this is with an on-chip communication network, such as the one shown on this slide.
<animate> The messages flowing in the network will then be mapped to payload packets.
<animate> We also need to indicate an addressee for each packet, because the routers need to know where to deliver the data.
<animate> We also need to indicate the sender of the message. This is required, for example, for the field processor, since it receives different messages from the paddles and from the ball.
There is an extensive set of optimization opportunities in this area that would improve communication speed and reduce protocol-stack complexity. I am not going to talk about this.
A multiplexed communication structure is more conservative with area. One way to build this is with an on-chip communcation network, such as the one shown on this slide.
<animate> The messages flowing in the network will then be mapped to payload packets.
<animate> We also need to indicated an adressee for each packet, because the routers need to know where to deliver the data.
<animate> Also, we also need to indicate the sender of the message. This is required for example for the field processor, since that one receives different messages from the paddles and from the ball.
There is an extensive set of optimization tasks in this area, that would improve communication speed and protocol stack complexity. I am not going to talk about this.
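Wrapping a message into a network packet then amounts to prepending the destination and source addresses to the payload. An illustrative sketch, with a fixed payload size chosen here for simplicity:

```c
/* Illustrative network-on-chip packet: destination and source
 * addresses wrapped around the serialized message payload. */
#define MAX_PAYLOAD 8

typedef struct {
    int dst;   /* addressee, used by the routers for delivery       */
    int src;   /* sender, so e.g. the field can tell paddles apart  */
    int len;   /* number of valid payload bytes (<= MAX_PAYLOAD)    */
    unsigned char payload[MAX_PAYLOAD];
} packet_t;

static packet_t make_packet(int dst, int src,
                            const unsigned char *msg, int len) {
    packet_t p;
    p.dst = dst;
    p.src = src;
    p.len = len;
    for (int i = 0; i < len; i++)   /* copy the serialized message */
        p.payload[i] = msg[i];
    return p;
}
```

The routers only inspect `dst`; the receiving processor uses `src` to demultiplex messages from its different peers.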
76. Links Let me rather give you some links on the software that I was mentioning. All of this software is open source, so it costs you nothing to start experimenting with energy-efficient multiprocessor systems.
The ARM instruction-set simulator that we used is the open-source SimIt-ARM ISS, developed by Wei Qin at Princeton University.
The cross-compiler was downloaded from the ARM Linux FTP site in the UK.
The GEZEL and ARMZILLA environment have their own homepage at UCLA.
<animate> All these tools are under the GNU Public License and so are free.
Keep in mind, however, that this means free as in freedom, not free as in free beer.
The tools are given as a service to the community.
77.
Applications
Mapped
onto
Architectures Conclusion
Embedded DSP & Multimedia
Design Methods
= Low Power! Conclusion slide: repeat main message:
Low Power is all about closing the gap between applications and architectures.
It does require supporting design methods.
78. Thanks for your attention! Thanks a lot.