Outline

Modeling of Architectures • Embedded Computer Architecture 5KK73 • Henk Corporaal, Bart Mesman, Hamed Fatemi • 2011

Outline • We will look at models for Area, Delay and Energy • Processor structure • Register files - Register cell • Model (area, power, delay) • details for several register file configurations • Apply this to the Imagine architecture • Stream register file (SRF) • Network

Processor • Single processor: Instruction Memory (IM), Controller, Processing Element (PE) with Register File (RF) and ALU, Data Memory (DM) • SIMD: multiple PEs • VLIW: multiple ALUs • Multi-Processor: several processors, connected by a bus or network

Register File (RF) Area model • Base 1-bit cell of size w*h (tracks) • Assume: • p = number of ports • For a large RF the row decoder is small compared to the cell area • 1 wordline and 1 bitline per port needed, so 1-bit area = (w+p)*(h+p) tracks • If p is large, 1-bit area ~ p^2 • (Figure: schematic of 1 register cell)
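The area model can be tried out numerically. The sketch below assumes that every port adds one track in each direction to the base cell, which is how the "1 wordline and bitline per port" bullet is read here; the function names and the 4 x 4 track base cell are hypothetical values chosen only for illustration.

```python
def cell_area(w, h, p):
    """Area of one register cell in tracks^2.

    Assumption: each of the p ports adds one wordline track to the
    height and one bitline track to the width of the w x h base cell.
    """
    return (w + p) * (h + p)


def rf_area(w, h, p, R, b):
    """Area of a register file with R registers of b bits each.

    The row decoder is ignored, as on the slide (large RF).
    """
    return b * R * cell_area(w, h, p)


# Hypothetical 4 x 4 track base cell; watch the p^2 term take over.
for p in (1, 3, 12, 48):
    print(p, cell_area(4, 4, p), p * p)
```

For small p the base cell dominates; for large p the cell area approaches p^2, which is the "if p is large" case of the slide.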

Register file (RF) Delay model • Delay (d): • Wire propagation delay • Fan-in/out delay • Delay ~ wire length ~ connected cells • R = number of registers, each b bits wide => Nbits = bR • Assuming a square bit layout, a wire spans (bR)^(1/2) cells, so d ~ p*R^(1/2) (for large p wiring dominates) • Note: for N FUs (ALUs), p ~ 3N, R ~ N → d ~ N^(3/2)
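The N^(3/2) trend can be checked with a few lines of Python. This is a minimal sketch, not the model from the cited papers: it assumes r registers per ALU, b-bit words, and a base cell height of h tracks that grows by one track per port. The parameter values and function names are hypothetical.

```python
import math


def rf_delay(N, r=8, b=32, h=4):
    """Relative RF delay for N ALUs, in arbitrary units.

    Assumptions: p = 3N ports, R = r*N registers, square bit layout,
    delay proportional to the length of one wire across the array.
    """
    p = 3 * N                           # 2 read ports + 1 write port per ALU
    R = r * N                           # registers grow linearly with N
    cells_per_side = math.sqrt(b * R)   # square layout of b*R bit cells
    return cells_per_side * (h + p)     # wire length in tracks


def growth_exponent(f, N):
    """Estimate k in f(N) ~ N^k from one doubling of N."""
    return math.log2(f(2 * N) / f(N))


for N in (1, 8, 64, 1024):
    print(N, round(growth_exponent(rf_delay, N), 3))
```

The estimated exponent starts near 1 for small N, where the base cell height still matters, and approaches 1.5 as N grows.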

Register file (RF) Power model • Power (P): • Proportional to the capacitance that must be switched for each access • In each access every bit line and one word line switch → bit-line capacitance dominates • Each port drives (bR)^(1/2) bit lines • Each bit line has length (h+p)*(bR)^(1/2) • If p is large: power is dominated by wire capacitance • Note: for N FUs (ALUs), p ~ 3N, R ~ N → P ~ N^3
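The step from the per-port bit-line figures to the N^3 note can be written out as follows. This is a sketch of the reasoning under the slide's own assumptions (switched capacitance proportional to switched wire length, all p ports active); the constant c is a hypothetical proportionality factor.

```latex
\begin{align*}
P &\propto p \cdot \underbrace{(bR)^{1/2}}_{\text{bit lines per port}}
      \cdot \underbrace{(h+p)\,(bR)^{1/2}}_{\text{length of one bit line}}
   = p\,(h+p)\,bR \\
  &\approx p^{2}\,bR \qquad (p \gg h) \\
  &\propto (3N)^{2} \cdot b \cdot cN \;\sim\; N^{3}
   \qquad (p \approx 3N,\; R = cN)
\end{align*}
```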

Register File organization • Processor with one level of registers • Central (shared register file): one RF serves ALU 1 to ALU N • DRF (distributed register file): register files distributed over ALU 1 to ALU N

Comparing Area model of Central and Distributed RF • Central (shared) RF: • 2 read ports and 1 write port per ALU • R = rN: number of registers of b bits • r: number of registers per ALU • N: number of ALUs • DRF: • Only 2 ports per RF: one read, one write • This would give A(1 RF) ~ N • The switch has the same area cost complexity • (Figure: square layout and organization of the DRF, including a 2N*N crossbar)

Delay and Power models of central versus distributed RF • Assume N ALUs • Central RF: • #registers R = rN • #ports p = 3N • For large N: d ~ N^(3/2), P ~ N^3 • DRF: • Constant #registers per ALU • #ports p = 2 (also constant!) • DRF has a fixed delay and power (per RF) • Wire propagation determines delay and power (for large N) • For large N
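The contrast between the two organizations can be made concrete with a small comparison. The sketch below only models the register files themselves; the DRF switch, which the slide says dominates for large N, is deliberately left out. All names and parameter values (r = 8, b = 32, h = 4) are hypothetical.

```python
import math


def delay(p, R, b=32, h=4):
    """Relative delay: one wire across a square array of b*R cells."""
    return math.sqrt(b * R) * (h + p)


def power(p, R, b=32, h=4):
    """Relative power: p ports, each switching sqrt(b*R) bit lines
    of length (h + p) * sqrt(b*R)."""
    return p * (h + p) * b * R


def central(N, r=8):
    """Central RF: p = 3N ports, R = r*N registers."""
    return delay(3 * N, r * N), power(3 * N, r * N)


def drf_per_rf(N, r=8):
    """One DRF register file: p = 2 ports, r registers, independent of N.
    The switch between ALUs and register files is NOT modeled here."""
    return delay(2, r), power(2, r)


d1, p1 = central(1)
for N in (1, 2, 4, 8, 16):
    dc, pc = central(N)
    dd, pd = drf_per_rf(N)
    print(N, round(dc / d1, 1), round(pc / p1, 1), round(dd / d1, 2), round(pd / p1, 2))
```

The central columns grow quickly with N while the per-RF DRF columns stay flat, which is the point of the "also constant!" remark.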

Register File • Register (memory) storage and communication between ALUs are critical for area, energy, and performance in media processors. • Hierarchical register storage

2-level register files (Hierarchical) • (Diagram: central and DRF organizations, each with RF1 (level 1) and RF2 (level 2) for ALU 1 to ALU N) • RF1 serves the ALUs, while RF2 is used to cover the memory latency • The overall area trend is the same as with a one-level RF

Register Files • Processor with stream register files: • Replace each port into the memory staging RF with a stream buffer • All stream buffers share a single port into the memory staging RF, allowing that single physical port to act as many logical ports. • (Diagram: central organization, ALU 1 to ALU N)
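How one physical port can act as many logical ports can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the Imagine design: the round-robin arbitration, the one-word-per-cycle transfer, and the names StreamBuffer and run are all hypothetical.

```python
from collections import deque


class StreamBuffer:
    """Small FIFO between the staging RF and one logical port."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()

    def needs_refill(self):
        return len(self.fifo) < self.capacity


def run(streams, capacity=2, cycles=6):
    """streams: one list of words per logical port, held in the staging RF.

    Each cycle the single physical port serves at most one stream
    buffer, chosen round-robin among buffers that still need data.
    Returns the grant order and the final buffer contents.
    """
    buffers = [StreamBuffer(capacity) for _ in streams]
    cursors = [0] * len(streams)      # next word to read per stream
    grants = []
    start = 0                         # round-robin pointer
    for _ in range(cycles):
        for k in range(len(buffers)):
            i = (start + k) % len(buffers)
            if buffers[i].needs_refill() and cursors[i] < len(streams[i]):
                buffers[i].fifo.append(streams[i][cursors[i]])
                cursors[i] += 1
                grants.append(i)
                start = (i + 1) % len(buffers)
                break                 # the port is used once per cycle
    return grants, [list(buf.fifo) for buf in buffers]


grants, contents = run([[1, 2, 3], [10, 20, 30]])
print(grants, contents)
```

The staging RF needs only one port no matter how many stream buffers hang off it; the price is that the buffers take turns, so each logical port sees only a share of the physical port's bandwidth.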

Register Files • The payoff of the transformation into a stream architecture is that we can achieve an area proportional to N^2, since R2 (memory storage) only needs 1 port. We also have to add in the area of the stream buffers, which grows as N^2 with a very small constant. • (Diagram: DRF organization, ALU 1 to ALU N)

Results area per ALU (Normalized to 1 ALU)

Results Local delay

Results Power overhead

Imagine Architecture • Cell placement of Imagine • Die photo of Imagine

Imagine Floorplan • 22 million transistors • 500 MHz • Area, Energy, Delay models • Clusters, Micro-controller, SRF, Network Interface

Stream Register File

Network • Area of the network grows like that of the DRF switch • More details in the Khailany et al. paper [2003]

Exploration Intra-cluster scaling

Exploration Inter-cluster scaling

End • More details: • Scott Rixner, William J. Dally, Brucek Khailany, Peter Mattson, Ujval J. Kapasi, and John D. Owens. Register Organization for Media Processing. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA), pages 375–386, Toulouse, France, January 2000. IEEE Computer Society. • Brucek Khailany, William Dally, Scott Rixner, Ujval Kapasi, John Owens, and Brian Towles. Exploring the VLSI Scalability of Stream Processors. In Proceedings of the Ninth Symposium on High Performance Computer Architecture (HPCA), pages 153–164, Anaheim, California, USA, February 2003. IEEE Computer Society.
