1. Architectures and Design Techniques for Energy Efficient Embedded DSP and Multimedia Ingrid Verbauwhede, Christian Piguet, Bart Kienhuis/Ed Deprettere, Patrick Schaumont
2. Outline Intro
Low Power Observations
SoC Architectures: Ingrid
Low Power Components: Christian
Design Methods: Ed
Design Methods: Patrick
Conclusion
3. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
This slide is a reformulation of the title. These are the three components of this presentation:
It is about applications (Embedded DSP & Multimedia),
about mapping them onto architectures (Design Methods),
and about architectures for these applications (Low Power).
4. Low Power observation 1: architecture tuned to application The first observation we can make is that real-time embedded applications are NOT mapped onto general-purpose platforms. They are a mixture of dedicated units and small embedded processors, for example the PalmPilot i705.
5. Observation 2: Energy-flexibility trade-off
6. Example: DSP processors Specialized instructions: MAC
Dedicated co-processors: Viterbi acceleration
7. FIR on TI C55x: Dual MAC
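The dual-MAC idea can be sketched in plain C: one coefficient fetch per tap feeds two MAC units, so each inner-loop iteration produces two FIR outputs. This is an illustrative sketch, not C55x intrinsics or assembly.

```c
#include <assert.h>

/* Sketch of a dual-MAC FIR inner loop: the single coefficient fetch
 * is shared between both MAC units, producing y[i] and y[i+1] in the
 * same pass.  Illustrative only (not C55x code). */
#define NTAPS 4

void fir_dual_mac(const int *x, const int *h, int *y, int n)
{
    for (int i = 0; i + 1 < n; i += 2) {      /* two outputs per pass */
        int acc0 = 0, acc1 = 0;
        for (int k = 0; k < NTAPS; k++) {
            int c = h[k];                     /* one coefficient fetch... */
            acc0 += c * x[i + k];             /* ...feeds MAC unit 0 */
            acc1 += c * x[i + 1 + k];         /* ...and MAC unit 1   */
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
}
```

On a dual-MAC datapath this pairing roughly halves the cycles per output sample, since both multiply-accumulates can issue in the same cycle.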
8. Energy comparison
9. Viterbi on TI C54x CMPS instruction
Result: 4 cycles per butterfly instead of 20 or more cycles per butterfly
= Energy efficient
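The step that CMPS accelerates is the compare-select half of an add-compare-select (ACS) update; a full butterfly performs two such updates. Spelled out as a C sketch (not C54x code):

```c
#include <assert.h>

/* One add-compare-select (ACS) step of a Viterbi butterfly.  On the
 * C54x, the compare-select part is what the CMPS instruction performs
 * in hardware, recording the decision bit as it goes.  Sketch only. */
typedef struct { int metric; int decision; } acs_t;

acs_t acs(int m0, int m1, int branch0, int branch1)
{
    int p0 = m0 + branch0;        /* candidate path via state 0 */
    int p1 = m1 + branch1;        /* candidate path via state 1 */
    acs_t r;
    r.decision = (p1 < p0);       /* compare-select: keep the smaller metric */
    r.metric = r.decision ? p1 : p0;
    return r;
}
```

Doing the compare, the select, and the decision-bit bookkeeping in one instruction is what brings the butterfly down to a few cycles.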
10. Observation: general-purpose architectures also become heterogeneous General-purpose architectures are becoming heterogeneous as well.
This applies to general-purpose microprocessors; think of the multimedia instruction-set extensions.
It applies as well to general-purpose FPGA architectures. As shown on this slide, new-generation FPGAs contain more CLBs, but also rows of MACs, special block RAMs, PowerPC cores, and so on.
11. Question Energy and flexibility are opposing demands!
How to navigate in this jungle?
3D design space:
Next question: how to map (or compile) an application onto such an architecture?
Clearly, specialized architectures, instruction sets, components, and so on are beneficial for reducing energy. The question is: how do we choose which features of the architecture to fix and which to leave flexible? Therefore, we define a 3D design space.
…
The next question is then: how to map (or compile) an application onto such an architecture? This is a question for the design methods presentation.
12. Flexibility (1) - Abstraction level Instruction set level = “programmable”
CLB level = “reconfigurable”
SKIP this slide and add the message to the next slide???
Reconfiguration or re-programming can happen at different abstraction levels. Compiling (i.e. mapping) a new application onto a processor is a form of reconfiguration at the instruction level; we call these “programmable processors”. Mapping (i.e. compiling) a new application onto an FPGA is called reconfigurable computing. Both are forms of general-purpose computing.
13. Flexibility (2) - Reconfigurable feature Basic components:
14. Flexibility (3) - Binding rate Compare processing to binding
Configurable (“compile-time”)
Re-configurable
Dynamic reconfigurable (“adaptive”)
15. SoC architecture: RINGS Work with domains!
The domain architecture is tuned towards the application domain, and so is its programmability. This means some domains require high flexibility while others can do with less. A central (low clock frequency) CPU controls the SoC, not the details within each domain. The domains are connected by a reconfigurable interconnect. I will give more details in the next slides.
16. Instruction set extension
Register mapped
Tightly coupled
Experiment: DFT
Other examples of tightly coupled extensions: Tensilica.
17. Co-processor Memory mapped
Loosely coupled
Experiment: AES
Other examples of loosely coupled co-processors: the Turbo accelerators of the TI C6x. Memory overhead is a big issue, as is the interconnect architecture of the existing processor (e.g. the AMBA bus); it limits the energy improvement.
18. Independent IP Loosely coupled
Network on chip connected
Flexible interconnect
Experiment: TCP/IP checksum
Other examples of IP?
19. Communication: Energy-flexibility Also energy - flexibility conflict!
General purpose NOC: tiles
FPGA: general purpose
Therefore: domain specific NOC
20. Conclusion Low Power by going domain-specific
Energy-flexibility conflict
How to “program” this RINGS?
Next: Ultra-low power components: Christian
Design exploration: Ed
Co-design environment: Patrick
21. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
22. Efficient Embedded DSP: Ultra-Low-Power Components Christian Piguet, CSEM
23. Ultra Low Power DSP Processors The design of DSP processors is very challenging, as it has to take contradictory goals into account:
increased throughput requirements
at a reduced energy budget
New issues due to very deep submicron technologies such as interconnect delays and leakage
History of hearing aids circuits:
analog filters 15 years ago
digital ASIC-like circuits 5 years ago
powerful DSP processors today, below 1 Volt and 1 mW
24. DSP Architectures for Low-Power single MAC DSP core of 5-10 years ago
parallel architectures with several MAC working in parallel
VLIW or multitask DSP architectures
Benchmark:
number of simple operations executed per clock cycle, up to 50 or more
Drawbacks of VLIW:
very large instruction words up to 256 bits
Some instructions in the set are still missing
transistor count is not favorable to reduce leakage
25. VLIW TMS320C6x (VelociTI)
26. 3 Ways to be more Energy Efficient To design specific, very small DSP engines for each task, such that each DSP task is executed in the most energy-efficient way on the smallest piece of hardware (N co-processors)
to design reconfigurable architectures such as the DART cluster,
in which configuration bits allow the user to modify the hardware so that it fits the executed algorithms much better.
27. Co-processors
28. DART: O. Sentieys, ENSSAT
29. Reconfigurable DSP Architectures Not FPGA: much more efficient than FPGA. The key point is to reconfigure only a limited number of units:
Reconfigurable datapath
Reconfigurable interconnections
Reconfigurable Addressing Units (AGU)
FPGA comparison:
the MACGIC DSP consumes 1 mW/MHz in 0.18 µm
the same MACGIC in an Altera Stratix consumes 10 mW/MHz plus 900 mW of static power, so 1’000 mW at 10 MHz
30. Reconfigurable Datapaths
31. Reconfigurable Addressing Modes Operand fetch is generally a severe bottleneck in parallel machines, for which 8-16 operands are required each clock cycle.
Sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed.
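To illustrate what a reconfigurable addressing unit does, here is a C sketch in which one address register supports linear, modulo (circular-buffer), and bit-reversed (FFT) post-increment, selected by a configuration field. The interface and names are hypothetical, not the MACGIC AGU.

```c
#include <assert.h>

/* Sketch of a reconfigurable address-generation unit (AGU): the mode
 * field plays the role of the configuration bits.  Hypothetical
 * interface, for illustration only. */
typedef enum { AGU_LINEAR, AGU_MODULO, AGU_BITREV } agu_mode_t;

typedef struct {
    unsigned addr;    /* current address                        */
    unsigned step;    /* post-increment (linear/modulo modes)   */
    unsigned len;     /* buffer length; power of two for BITREV */
    agu_mode_t mode;
} agu_t;

unsigned agu_next(agu_t *a)       /* returns addr, then post-increments */
{
    unsigned cur = a->addr;
    switch (a->mode) {
    case AGU_LINEAR:
        a->addr += a->step;
        break;
    case AGU_MODULO:              /* wrap inside a circular buffer */
        a->addr = (a->addr + a->step) % a->len;
        break;
    case AGU_BITREV: {            /* reverse-carry add of len/2 (FFT) */
        unsigned bit = a->len >> 1;
        while (a->addr & bit) { a->addr ^= bit; bit >>= 1; }
        a->addr |= bit;
        break;
    }
    }
    return cur;
}
```

With len = 8 in bit-reversed mode, successive calls visit 0, 4, 2, 6, 1, 5, 3, 7 — the access order an FFT needs without any software address arithmetic.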
32. MACGIC: Performance results Power consumption for a 24-bit, 10 MHz synthesis @ 0.9 V in the 0.18 µm TSMC technology (SYNOPSYS & MACHTA/PA simulations):
NOP: 25 µA/MHz
ADD 24-bit: 102 µA/MHz, 98k MOPS/Watt
MAC 24-bit/56-bit: 137 µA/MHz, 81k MOPS/Watt
4* ADD 24-bit: 167 µA/MHz, 120k MOPS/Watt
4* MAC 24-bit/56-bit: 283 µA/MHz, 86k MOPS/Watt
MACV 24-bit/56-bit: 269 µA/MHz, 90k MOPS/Watt
CBFY4 radix-4 FFT: 273 µA/MHz, 131k MOPS/Watt
Number of transistors for this 24-bit version: ~600’000
Number of transistors for a 16-bit version: ~400’000
33. Performance results: 64-point complex FFT MACGIC®: ~250 clock cycles
CARMEL: 526 clock cycles
PalmDSPCore: 450 clock cycles
SC140 StarCore: 288 clock cycles
R.E.A.L. DSP: 850 clock cycles
SP-5flex (3DSP): 500 clock cycles
TI C62x: 675 clock cycles
TI C64x: 276 clock cycles
34. Comparison
35. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
36. Design & ArchitectureExploration Ed Deprettere, Professor
Bart Kienhuis, Assistant Professor
Leiden University
LIACS, The Netherlands
37. Embedded DSP Architectures
38. System Level Design Three aspects are important in System Level Design
The Architecture
The Application
How the Application is Mapped onto the Architecture.
To optimize a system, you need to take all three aspects into consideration.
This is expressed in terms of the Y-Chart
39. Y-chart Approach The Y-chart approach measures the performance of a set of applications mapped onto a particular architecture instance.
40. Y-chart Approach Instead of making changes in the architecture, the Y-chart also shows that a better-performing system can be obtained by changing the way we map applications onto an architecture and the way we describe the set of applications. It is not limited to architectures only.
The central element in the Y-chart is performance analysis. Within performance analysis there is a fundamental trade-off between three elements:
The cost of modeling
The cost of evaluating the model
The accuracy of the evaluation
These three elements can be drawn in a diagram we call the “Abstraction Pyramid”.
41. Design Space Exploration
42. Design Space Exploration
43. Y-chart Design For GP Processors
44. Y-chart Design For DSP Applications
45. How to improve performance How can we improve the performance of the system we are interested in?
Others focus on architecture, we want to focus on the application.
For a low-power architecture parallelism is important.
Exploiting more parallelism leads to faster calculations;
using voltage and frequency scaling, we assume that power is then saved
There is already a lot of theory developed to exploit bit-level, instruction-level, and task-level parallelism.
Especially task-level parallelism is getting more and more important to effectively map DSP applications onto the newly emerging architectures.
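The parallelism-plus-scaling argument can be made concrete with the dynamic-power relation P = C·V²·f: duplicating a unit halves the required frequency, and if the lower speed also permits a lower supply voltage, the pair dissipates less than one fast unit. The voltage and frequency numbers below are illustrative, not figures from the talk.

```c
#include <assert.h>

/* Dynamic power of CMOS logic: P = C * V^2 * f
 * (switched capacitance, supply voltage, clock frequency). */
double dyn_power(double c, double v, double f)
{
    return c * v * v * f;
}
```

For example, with unit capacitance, one block at 1.8 V and 100 MHz gives a relative power of 1.8² × 100 = 324, while two parallel blocks at 1.2 V and 50 MHz give 2 × 1.2² × 50 = 144: the same throughput at well under half the power.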
46. Programming Problem
47. Kahn Process Network (KPN) Kahn Process Networks [Kahn 1974][Parks&Lee 95]
Processes run autonomously
Communicate via unbounded FIFOs
Synchronize via blocking read
Process is either
executing (execute)
communicating (send/get)
Characteristics
Deterministic
Distributed Control
No Global Scheduler is needed
Distributed Memory
No memory contention
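The blocking-read semantics can be sketched in C. This single-threaded round-robin loop is only a stand-in for truly autonomous processes, and the FIFO is bounded for simplicity (real KPNs assume unbounded FIFOs).

```c
#include <assert.h>

/* Minimal sketch of KPN blocking-read semantics: a producer and a
 * consumer share a FIFO; the consumer simply does nothing (is
 * "blocked") while the FIFO is empty.  Bounded FIFO and round-robin
 * scheduling are simplifications for the sketch. */
#define FIFO_CAP 16

typedef struct { int buf[FIFO_CAP]; int head, tail, count; } fifo_t;

int fifo_put(fifo_t *f, int v)               /* returns 0 when full  */
{
    if (f->count == FIFO_CAP) return 0;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

int fifo_get(fifo_t *f, int *v)              /* returns 0 when empty: */
{                                            /* the reader must block */
    if (f->count == 0) return 0;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

int kpn_run(int n)        /* producer emits 0..n-1, consumer sums them */
{
    fifo_t f = {{0}, 0, 0, 0};
    int produced = 0, sum = 0, v;
    while (produced < n || f.count > 0) {
        if (produced < n && fifo_put(&f, produced))
            produced++;                      /* producer process  */
        if (fifo_get(&f, &v))
            sum += v;                        /* consumer process  */
    }
    return sum;
}
```

However the two processes are interleaved, the blocking read alone fixes the order in which tokens are consumed, which is why the result is deterministic — the key KPN property listed above.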
48. Kahn Process Network (KPN)
49. Matlab to Process Networks to FPGA
50. Matlab to Matlab Transformations To make the flow from Matlab to FPGA interesting, we had to give the designers means to change the characteristics
Unrolling (Unfolding)
Increases parallelism
Retiming (Skewing)
Improved pipeline behavior
Clustering (Merging)
Reducing parallelism
All these operations can be applied to the source level of Matlab, leading to a new Matlab program
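Unrolling, the first of these transformations, is easy to show at source level; the sketch below uses C rather than Matlab. The unrolled variant computes the same dot product with two independent accumulator chains that a mapping tool can schedule on parallel units.

```c
#include <assert.h>

/* Source-level unrolling: same result, more exposed parallelism. */
int dot_rolled(const int *a, const int *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

int dot_unrolled2(const int *a, const int *b, int n)   /* n even */
{
    int s0 = 0, s1 = 0;              /* independent accumulators */
    for (int i = 0; i < n; i += 2) {
        s0 += a[i] * b[i];           /* chain for even indices   */
        s1 += a[i + 1] * b[i + 1];   /* chain for odd indices    */
    }
    return s0 + s1;
}
```

Retiming and clustering work the same way: rewrite the source, keep the semantics, and change how much parallelism the mapping step can exploit.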
51. Y-chart Design For DSP Applications
52. Case Study We use the Y-chart environment on a real case study
Adaptive QR [DAES 2002]
Using commercial IP cores
QinetiQ Ltd.
Vectorize 42 pipeline stages
Rotate 55 pipeline stages
QR is interesting as it requires deeply pipelined IP cores
Most Design tools have difficulties with such IP cores
We will explore a number of simple steps to improve the performance of the QR algorithm
It was reported to run at 12 MFlops.
53. Example: Adaptive QR (Step 1)
54. Example: Adaptive QR (Step 2)
55. Example: Adaptive QR (Step 3)
56. Example: Adaptive QR (Step 4)
57. Conclusions Optimizing a System Level Design for Low-Power requires that you look at the architecture, the mapping, and the application.
The Y-chart gives a simple framework to tune a system for optimal performance.
The Y-chart forms the basis for DSE
We showed that by playing with the way applications are written, we get an order of magnitude better performance in a number of steps: 60 MFlops -> 673 MFlops, without changing the architecture!
58. Applications Mapped onto Architectures
Introduction
Embedded DSP & Multimedia
Design Methods
= Low Power!
59. Domain-Specific Co-design Environments Patrick Schaumont, UCLA Good evening,
This is the fourth part of the embedded tutorial on energy-efficient architectures. I will tell you a few words about the co-design tools you can use for these architectures.
60. Let’s waste no time! Well yes, there is one thing left between you and the DATE party.
<animate> It’s this presentation. So let’s get on with it.
61. Design of the RINGS Multiprocessor In the previous talks by Ingrid and Christian, it was pointed out that distributed, specialized processing is the key to energy-efficient processing. This means you are designing systems that contain multiple cores, with a number of tightly-coupled or stand-alone hardware processors. All of these are embedded in a dedicated on-chip communication architecture.
<animate> This presentation will talk about the tools that you need for such an architecture. Let us concentrate on efficient simulation. In that case you need to combine one or more instruction-set simulation environments with a hardware simulation kernel.
62. ARMZILLA The design environment I will be discussing is called ARMZILLA.
ARMZILLA allows you to instantiate a number of ARM instruction-set simulators and integrate them with a custom hardware model.
The input to ARMZILLA consists of three elements:
A number of ARM executables, created for example with a cross-compiler
A configuration file, which indicates how many cores are needed and which executables they should run
A hardware description, which contains the system interconnect description and any number of dedicated hardware processors that you need in the multiprocessor
In the following few slides, I will walk you through each of these three elements.
Before that, a few words on simulation accuracy. In system-level design, there are a large number of abstraction levels at which you can design a system, and their simulation accuracy varies widely. High abstraction levels are preferred because they allow a more concise and compact system description, and also faster simulation. However, raising the modeling abstraction level also creates a design hole between spec and implementation that needs to be filled later on.
In ARMZILLA, simulations are cycle-true: hardware is modeled semantically at the register-transfer level, and the instruction-set simulators run cycle-true.
63. ARMZILLA Let us now first look at the programs running on the ARM instruction-set simulators.
64. Requirements for an ISS in a multiprocessor simulator In a multiprocessor simulator, we combine a number of instruction-set simulators with a hardware simulation kernel.
Running multiple instruction-set simulators together is not the same as running a single one, or even a single ISS in a System-on-Chip context. In fact, there are a number of requirements one can define for an instruction-set simulator in a multiprocessor environment.
<animate> First of all, we need a linkable model. This means that the instruction-set simulator can be linked as a library into a single executable for the system simulation. This is not a hard requirement, but it is sensible from the efficiency point of view. Some multiprocessor simulations work with instruction-set simulators instantiated as multiple program images that are linked together with interprocess communication. However, performance- and complexity-wise, such solutions are worse than the linked approach.
<animate> A second requirement is to have a reentrant instruction-set simulator. This way, several ISS can coexist in a single environment without interfering with each other. This means, for example, that the ISS cannot rely on global variables and global namespaces. It is a simple and practical requirement, but a lot of the ISS I have seen so far are not reentrant.
<animate> A third requirement is that the ISS needs to be accessible, in two ways. First, you need to be able to control the ISS: in ARMZILLA, we run all the ARM ISS in lockstep, meaning all models advance one clock cycle at a time, in turn. Second, you need to be able to exchange data between the hardware model and the C programs running on the ISS: in ARMZILLA, we intercept calls to the memory interface and can thus implement a memory-mapped interface.
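The lockstep control requirement can be sketched as follows. The iss_t interface is hypothetical and stands in for whatever tick/step entry point a real ISS library exports; it is not ARMZILLA's actual API.

```c
#include <assert.h>

/* Lockstep co-simulation sketch: each ISS exposes a tick() that
 * advances it exactly one clock cycle, and the system simulator
 * calls every core in turn so all models stay cycle-aligned.
 * Hypothetical interface, for illustration. */
typedef struct iss iss_t;
struct iss {
    void (*tick)(iss_t *self);   /* advance this core one clock cycle */
    long cycle;                  /* per-core cycle counter            */
};

void iss_tick(iss_t *s) { s->cycle++; }

void run_lockstep(iss_t **cores, int ncores, long ncycles)
{
    for (long c = 0; c < ncycles; c++)       /* one global clock ...  */
        for (int i = 0; i < ncores; i++)     /* ... ticks every core  */
            cores[i]->tick(cores[i]);
}
```

Because every core is advanced before the global clock moves on, no ISS can run ahead of the hardware model, which is what cycle-true multiprocessor simulation requires.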
65. ARMZILLA Let us next take a look at the configuration model and the hardware/software interfaces.
The goal of the configuration file is to indicate how many instruction-set simulators are required, and to wire them up to the hardware model. So we will also discuss the hardware/software interfaces.
66. Memory-mapped channels: easy in C ARMZILLA uses memory-mapped interfaces.
These are very simple to program in C: a memory-mapped interface in C is simply an initialized pointer, and reading and writing through this pointer results in access to the memory-mapped interface.
<animate> A configuration step is required in a multiprocessor simulator because you have multiple instruction-set simulators running: a memory address is not unique, as each core has its own address space. Conceptually, the configuration gives each core a symbolic name and associates an executable with it.
<animate> Then, when we attach memory-mapped interfaces to the hardware model, we are able to hook up several different ISS and distinguish each ISS and memory space uniquely by referring to the symbolic name.
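The "initialized pointer" idea, sketched in C. On real hardware the pointer would hold a fixed device address, and in ARMZILLA the access would be intercepted by the ISS memory interface; here a static variable stands in for that location so the sketch is runnable, and the address and names are illustrative.

```c
#include <assert.h>

/* On hardware this would be something like:
 *   volatile unsigned int *channel = (volatile unsigned int *)0x80000000;
 * (illustrative address).  A static variable stands in for the
 * intercepted location so the sketch runs anywhere. */
static volatile unsigned int fake_mapped_reg;     /* simulated register */

volatile unsigned int *channel = &fake_mapped_reg;

void send_token(unsigned int v) { *channel = v; }    /* write = send    */
unsigned int recv_token(void)   { return *channel; } /* read  = receive */
```

The volatile qualifier matters on real hardware: it keeps the compiler from caching or eliminating accesses that must actually reach the device.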
67. Hardware Simulation Kernel The third part of ARMZILLA is a hardware simulation kernel, called GEZEL.
GEZEL is used to model hardware co-processors and stand-alone processors, as well as the network-on-chip.
68. GEZEL Hardware Simulation Kernel has a Hybrid Architecture Here you see an overview of the internals of the GEZEL kernel. It is a C++ library, but with a built-in parser. GEZEL can read hardware models in a dedicated, cycle-true modeling language. The modeling language expresses networks of finite-state machines and datapaths, FSMD for short.
Once a hardware model is read into the GEZEL kernel, it can either be simulated or converted to synthesizable code.
For simulation, a number of cosimulation interfaces to cores are available, such as the SH3 DSP, the LEON2 SPARC, and ARM. There is also a SystemC cosimulation interface. In ARMZILLA, we use the cosimulation interface to ARM.
Because the hardware model can be parsed in after the simulator has been compiled, this approach of combining C++ with a dedicated scripting language saves a lot of compilation time when you are exploring hardware models.
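What an FSMD model expresses can be sketched in C: a state register selects the datapath operation, and one step() call corresponds to one clock cycle with registers updating at the end of it. GCD is a standard small FSMD example; this is plain C, not GEZEL syntax.

```c
#include <assert.h>

/* FSMD sketch: finite-state machine (state) plus datapath registers
 * (a, b).  Each fsmd_step() call is one clock cycle. */
enum { S_RUN, S_DONE };

typedef struct { int state; int a, b; } fsmd_t;   /* FSM + registers */

void fsmd_step(fsmd_t *m)                         /* one clock cycle */
{
    switch (m->state) {
    case S_RUN:
        if (m->a == m->b)     m->state = S_DONE;  /* result in a     */
        else if (m->a > m->b) m->a -= m->b;       /* datapath op 1   */
        else                  m->b -= m->a;       /* datapath op 2   */
        break;
    case S_DONE:
        break;                                    /* hold the result */
    }
}
```

A cycle-true language like GEZEL's describes exactly this pairing (which datapath operation fires in which state, every cycle), which is why the same source can drive both simulation and synthesis.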
69. The PONG Example Designed in 1967 by Baer - interactive TV feature
1977, General Instruments AY-3-8500 pong-in-a-chip
Magnavox, Coleco, Atari, Philips, URL, GHP, ... I will next discuss a small design in ARMZILLA to show how such a multiprocessor system can be useful.
We will be using a simple video game called PONG. If there is one game that can claim to be absolutely the oldest video game around, it is PONG. I think the odds that somebody in this room has never played PONG are close to zero. But, to make them definitely zero: the objective of PONG is to move a paddle around and return the ball to your opponent. The ball moves in geometric patterns and bounces off the walls.
PONG was designed in the sixties by one of the godfathers of video games, Ralph Baer. General Instruments made a famous chip of it, which has found its way into countless consoles from a large number of manufacturers.
70. Multiprocessor Model of PONG In our multiprocessor model, we will map pong to four processors, and let the system play against itself.
<animate> We will use a processor for each paddle. The goal of the paddle processor is to determine the player strategy.
<animate> We will also use a processor to simulate the ball dynamics. This processor will decide how the ball bounces off the walls and the paddles.
<animate> The communication in the system will essentially consist of bouncing messages. Such messages announce the speed and the position of the ball.
<animate> Finally there will be a processor for the playing-field, who has to render this playing field on-screen.
In the next few slides, I will discuss the operation of the system. One point you should note here is how easy this application maps to a parallel processor model. Once you map an application according to the natural actors in the system, it becomes a lot easier to implement. In this case, the natural actors are the paddles, the ball, and the rendering field.
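The bouncing messages that the actors exchange can be sketched as a small C data type. This is an illustrative model only; the type and field names are my own, not the actual ARMZILLA code:

```c
/* Illustrative message types for the PONG system; the real design
 * may use a different encoding. */
typedef enum {
    MSG_FIELD_SIZE,   /* field announces its dimensions        */
    MSG_PADDLE_POS,   /* a paddle reports its size and position */
    MSG_BALL_UPDATE,  /* ball broadcasts position and speed     */
    MSG_COLLISION     /* field or paddle reports a hit          */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int x, y;    /* position on the playing field             */
    int vx, vy;  /* speed vector (used by MSG_BALL_UPDATE)    */
} pong_msg_t;

/* The ball announces its position and speed to all other processors. */
static pong_msg_t make_ball_update(int x, int y, int vx, int vy) {
    pong_msg_t m = { MSG_BALL_UPDATE, x, y, vx, vy };
    return m;
}
```

Because every actor understands the same message layout, adding a processor to the system only requires handling the message types it cares about.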
71. Multiprocessor Operation - Initialize We represent the operation of the system as a message sequence chart, where time runs from top to bottom, and each processor has a separate column. This slide shows the message initialization sequence.
<animate> At startup, the field processor announces the dimensions of the field to the paddle processors and the ball. As a result, the paddles and ball can choose a dimension and position that is suitable for the playing field.
<animate> So the paddle processors will send a message back to the field with their size and position. The field processor can then draw the paddles on the playing field.
<animate> Then, the ball will broadcast its position and initially chosen speed vector to all processors. The reason for broadcasting is that each of the field and paddle processors will track the ball individually. Since each one knows how fast the ball is going, they can estimate the position of the ball by means of a simple calculation.
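That "simple calculation" is plain dead reckoning from the last broadcast. A minimal sketch, assuming integer field coordinates and a speed vector expressed per tick (the function name is illustrative):

```c
/* Estimate the ball position t ticks after the last broadcast,
 * given the broadcast position (x, y) and speed vector (vx, vy).
 * Valid only until the next collision: a collision triggers a
 * fresh broadcast, which resets the estimate. */
static void estimate_ball(int x, int y, int vx, int vy, int t,
                          int *ex, int *ey) {
    *ex = x + vx * t;
    *ey = y + vy * t;
}
```

Because every processor runs this locally, the ball only needs to send a message when the speed vector actually changes, which keeps the communication traffic low.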
72. Multiprocessor Operation - Play Now we look at what happens during play.
As mentioned before, each of the processors (field, paddle 1 and paddle 2) is continuously estimating the position of the ball.
The field will redraw the ball when it changes position.
The paddles will align themselves so that they can hit the ball.
<animate> When the position of a paddle changes, it will inform the field that it has taken a new position.
At that moment, the field processor can redraw the paddle in the new position.
<animate> Sooner or later, the field or one of the paddles will conclude that it has hit the ball. That processor will create a collision message addressed to the ball processor. This message indicates where the ball has hit something, and the nature of the collision: the upper wall, the lower wall, or the left or right paddle. In this case, paddle 1 detects a collision and informs the ball.
<animate> The ball processor then evaluates a new speed vector according to the ball dynamics, and broadcasts the new position and speed to all parties in the system. This message informs each of the processors that the speed vector of the ball has changed, so the field should change the direction of the track, and the paddles should change their moving strategy.
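In their simplest form, the ball dynamics on a collision amount to mirroring the speed vector. A hedged sketch (the real demo may use richer dynamics, and the names below are my own):

```c
/* What the ball hit, as carried in a collision message. */
typedef enum {
    HIT_TOP_WALL, HIT_BOTTOM_WALL,
    HIT_LEFT_PADDLE, HIT_RIGHT_PADDLE
} hit_t;

/* Reflect the speed vector depending on the collision:
 * walls flip the vertical component, paddles the horizontal one. */
static void bounce(hit_t hit, int *vx, int *vy) {
    if (hit == HIT_TOP_WALL || hit == HIT_BOTTOM_WALL)
        *vy = -*vy;
    else
        *vx = -*vx;
}
```

After `bounce`, the ball processor would broadcast the new vector so the other processors can restart their local estimates.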
73. Let’s play!
74. Multiprocessor Architecture The simulation that you just saw is one that uses a point-to-point model for the multiprocessor.
This point-to-point model is shown here. There are multiple processors in the system, and they communicate with memory-mapped interfaces. Those interfaces are demultiplexed to dedicated point-to-point busses. The communication protocol uses request-acknowledge handshaking. Such a model is easy to create but requires a lot of hardware connections.
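The request-acknowledge handshake on such a link can be modeled in software, one cycle at a time. A minimal four-phase sketch, purely illustrative and not the GEZEL description used in the demo:

```c
/* One point-to-point link with four-phase request/acknowledge
 * handshaking, evaluated one "cycle" at a time. */
typedef struct {
    int req, ack;   /* handshake wires                    */
    int data;       /* data driven by the sender          */
    int latched;    /* data captured by the receiver      */
} link_t;

/* Sender side: raise the request with data, drop it after the ack. */
static void sender_cycle(link_t *l, int data) {
    if (!l->req && !l->ack) { l->data = data; l->req = 1; }
    else if (l->req && l->ack) { l->req = 0; }
}

/* Receiver side: latch on request, complete when request drops. */
static void receiver_cycle(link_t *l) {
    if (l->req && !l->ack) { l->latched = l->data; l->ack = 1; }
    else if (!l->req && l->ack) { l->ack = 0; }
}
```

One full transfer takes four phases (request up, acknowledge up, request down, acknowledge down), after which both wires are back at zero and the link is ready for the next word.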
75. Multiprocessor Architecture A multiplexed communication structure is more conservative with area. One way to build this is with an on-chip communication network, such as the one shown on this slide.
<animate> The messages flowing in the network will then be mapped to payload packets.
<animate> We also need to indicate an addressee for each packet, because the routers need to know where to deliver the data.
<animate> We also need to indicate the sender of the message. This is required, for example, for the field processor, since it receives different messages from the paddles and from the ball.
There is an extensive set of optimization opportunities in this area that would improve communication speed and reduce protocol-stack complexity. I am not going to talk about this.
A multiplexed communication structure is more conservative with area. One way to build this is with an on-chip communcation network, such as the one shown on this slide.
<animate> The messages flowing in the network will then be mapped to payload packets.
<animate> We also need to indicated an adressee for each packet, because the routers need to know where to deliver the data.
<animate> Also, we also need to indicate the sender of the message. This is required for example for the field processor, since that one receives different messages from the paddles and from the ball.
There is an extensive set of optimization tasks in this area, that would improve communication speed and protocol stack complexity. I am not going to talk about this.
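Wrapping a message into a network packet then amounts to prepending the destination and source addresses to the payload. An illustrative sketch, with a fixed payload size chosen here for simplicity:

```c
/* Illustrative network-on-chip packet: destination and source
 * addresses wrapped around the serialized message payload. */
#define MAX_PAYLOAD 8

typedef struct {
    int dst;   /* addressee, used by the routers for delivery       */
    int src;   /* sender, so e.g. the field can tell paddles apart  */
    int len;   /* number of valid payload bytes (<= MAX_PAYLOAD)    */
    unsigned char payload[MAX_PAYLOAD];
} packet_t;

static packet_t make_packet(int dst, int src,
                            const unsigned char *msg, int len) {
    packet_t p;
    p.dst = dst;
    p.src = src;
    p.len = len;
    for (int i = 0; i < len; i++)   /* copy the serialized message */
        p.payload[i] = msg[i];
    return p;
}
```

The routers only inspect `dst`; the receiving processor uses `src` to demultiplex messages from its different peers.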
76. Links Let me rather give you some links on the software that I was mentioning. All of this software is open source, so it costs you nothing to start experimenting with energy-efficient multiprocessor systems.
The ARM instruction-set simulator that we used is the open-source SimIt-ARM ISS, developed by Wei Qin at Princeton University.
The cross-compiler was downloaded from the ARM Linux FTP site in the UK.
The GEZEL and ARMZILLA environment have their own homepage at UCLA.
<animate> All these tools are under the GNU Public License and so are free.
Keep in mind, however, that this means free as in freedom, not free as in free beer.
The tools are given as a service to the community.
77.
Applications
Mapped
onto
Architectures Conclusion
Embedded DSP & Multimedia
Design Methods
= Low Power! Conclusion slide: repeat main message:
Low Power is all about closing the gap between applications and architectures.
It does require supporting design methods.
78. Thanks for your attention! Thanks a lot.