SUN ULTRASPARC-III ARCHITECTURE

CMPE 511 PRESENTATION
Prepared by: Balkır Kayaaltı

Introduction
• SPARC stands for Scalable Processor ARChitecture.
• It is an open processor architecture (i.e., member companies of the SPARC community can freely produce the processor).
• Sun UltraSPARC (SPARC V9) is a robust RISC architecture with:
  - 64-bit integer addresses and data
  - Superscalar implementations
  - Extremely fast trap handling and context switching
This presentation looks in detail at Sun Microsystems’ UltraSPARC-III (SPARC V9) architecture.

Major architectural units
The processor’s microarchitecture has six major functional units that operate relatively independently:
• Instruction issue unit (IIU)
• Floating-point unit (FPU)
• Integer execution unit (IEU)
• Data cache unit (DCU)
• External memory unit (EMU)
• System interface unit (SIU)
The units communicate requests and results among themselves through well-defined interface protocols, as shown in the next figure.

Communication paths between architectural units

Instruction issue unit
• This unit feeds the execution pipelines with instructions.
• It independently predicts the control flow through a program and fetches the predicted path from the memory system.
• Fetched instructions are staged in a queue before being forwarded to the two execution units (integer and floating-point).
This unit includes:
• A 32-Kbyte, four-way associative instruction cache
• The instruction address translation buffer
• A 16K-entry branch predictor

UltraSPARC-III pipeline and physical data
Pipeline feature          Parameter
Instruction issue         4 integer, 2 floating-point, 2 graphics
Level-one (L1) caches     Data: 64-Kbyte, 4-way
                          Instructions: 32-Kbyte, 4-way
                          Prefetch: 2-Kbyte, 4-way
                          Write: 2-Kbyte, 4-way
Level-two (L2) cache      Unified (data and instructions)
                          4 and 8 Mbyte, 1-way
                          On-chip tags; off-chip data

Pipeline

Pipeline blocks
Stage  Function
A      Generate instruction fetch addresses; generate pre-decoded instruction bits
P      Fetch first cycle of instructions from cache; access first cycle of branch prediction
F      Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual-to-physical address
B      Calculate branch target addresses; decode first cycle of instructions
I      Decode second cycle of instructions; enqueue instructions into the queue
J      Steer instructions to execution units
R      Read integer register file operands; check operand dependencies
E      Execute integer arithmetic, logical, and shift instructions; first cycle of data cache access; read, and check dependencies of, the floating-point register file

Pipeline blocks cont.
Stage  Function
C      Access second cycle of data cache, and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
M      Load data alignment for half-word and byte loads; execute second cycle of floating-point instructions
W      Write speculative integer register file; execute third cycle of floating-point instructions
X      Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
T      Report traps
D      Write architectural register file

Pipeline
• Instruction issue unit: stages A-J
• Execution unit: stages R-D
• Data cache: stages E, C, M, and W, in parallel with the integer execution unit stages
• Floating-point unit: a side pipeline running in parallel with stages E through D of the integer pipeline

Pipeline

Instruction issue unit cont.
• To increase performance, a high level of instruction-level parallelism is desired.
• UltraSPARC is a static speculation machine.
  - Dynamic speculation machines require very high fetch bandwidth to fill an instruction window and find instruction-level parallelism.
  - In a static speculation machine, the compiler can make the speculated path sequential, which places fewer demands on the instruction fetch unit.

Instruction issue unit:
Stage A: Address lines enter the instruction cache. All fetch address generation and selection occurs in this stage.
Stages P, F: Instruction cache access, branch prediction access, and instruction address translation access.

By the time the instructions are available from the cache in the B stage, the physical address from the translator and a prediction for any fetched branch are also available. The processor uses all this information in the B stage to determine whether to follow the sequential path or the taken-branch path.

Branch prediction
• The processor also determines whether the instruction cache access was a hit or a miss. If the processor predicts a taken branch in the B stage, it sends the branch target address back to the A stage to redirect the fetch stream.
• Waiting until the B stage to redirect the fetch stream makes it possible to use a large, accurate branch predictor.
• The branch predictor uses a gshare algorithm with 16K 2-bit saturating up/down counters.
• Because the predictor is large, it is pipelined.
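The gshare scheme named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the processor's actual logic: the class name `GsharePredictor`, the 12-bit history length, the initial counter value, and the choice of PC bits are all hypothetical. Only the table size (16K entries) and the 2-bit saturating up/down counters come from the slide.

```python
class GsharePredictor:
    """Gshare sketch: PC bits XOR global branch history index a table of 2-bit counters."""

    def __init__(self, entries=16 * 1024, history_bits=12):
        # entries must be a power of two; history_bits is an assumed value.
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                  # global taken/not-taken history
        self.counters = [1] * entries     # counters range 0..3; 1 = weakly not-taken

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions), then hash with the history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating increment/decrement, then shift the outcome into the history.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


predictor = GsharePredictor()
for _ in range(20):                       # train on a branch that is always taken
    predictor.update(0x4000, True)
print(predictor.predict(0x4000))
```

Because the index depends on the history as well as the PC, the same branch can use different counters in different contexts, which is what lets a table of simple counters capture correlated branch behavior.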

Instruction buffer (queue)
• There are two instruction queues: the instruction queue and the miss queue.
• The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate.
• When a branch is taken, two cycles must pass before the queue is filled with the right instructions; in that case, instructions held in the miss queue can be used immediately.

Integer execution unit
• The execution pipelines support the concurrent launch of up to six instructions, which can consist of:
  - Two integer operations (A0/A1 pipelines)
  - Two FP operations (FP pipelines)
  - One memory operation, load/store (MS pipeline)
  - One special-purpose memory operation (prefetch cache load only)
  - One control transfer instruction, CTI (BR pipeline)
However, only four instructions per cycle (IPC) can be executed in a sustained manner.

Working and Architectural Register File (WARF)
• Physically it is one block, but logically it can be seen as two separate register files (the working register file and the architectural register file).
• SPARC architectures use register files and windowing techniques.
• The 8 global registers, g0 – g7, can be reached at any time.
• Global register g0 always reads as 0.
• At any time, an instruction can access the 8 global registers and a 24-register window into the registers. A register window comprises the 8 ‘in’ and 8 ‘local’ registers of a particular register set, together with the 8 ‘in’ registers of an adjacent register set, which are addressable from the current window as ‘out’ registers.

Register windows

WARF
• The WRF consists of 32 64-bit registers, with 3 write ports, 7 read ports, and a 1984-bit-wide write port (32 × 64 = 2048 bits, minus the 64 bits of the always-zero g0) that transports data from the architectural register file.
• The ARF has 160 entries (8 register windows in total):
  - 8 × 8 = 64 local registers, 8 per window
  - 8 × 8 = 64 registers for the 16 in/out registers, which are shared between adjacent windows
  - 32 registers for the 4 sets of 8 global registers
• The WRF holds a single window and is updated as results are computed.
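The 160-entry total and the in/out sharing between adjacent windows can be checked with a short Python sketch. The register numbering (r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins) is the standard SPARC convention; the bank names, the flat index layout, and the choice of window `cwp + 1` as the "adjacent" window are assumptions made for illustration, not the chip's physical layout.

```python
NWINDOWS = 8       # from the slide: 8 register windows
GLOBAL_SETS = 4    # from the slide: 4 sets of 8 global registers


def arf_entries(nwindows=NWINDOWS, global_sets=GLOBAL_SETS):
    """Count ARF entries: locals + shared in/out registers + globals."""
    local_regs = nwindows * 8       # 8 locals per window
    in_out_regs = nwindows * 8      # each window's outs are a neighbor's ins
    global_regs = global_sets * 8
    return local_regs + in_out_regs + global_regs


def physical_index(cwp, reg):
    """Map register r0..r31 of window cwp to a (bank, index) pair.

    Hypothetical layout: the outs of window cwp alias the ins of window cwp + 1.
    """
    if not 0 <= reg < 32:
        raise ValueError("register number must be in 0..31")
    if reg < 8:                                   # g0..g7
        return ("global", reg)
    if reg < 16:                                  # o0..o7
        return ("inout", ((cwp + 1) % NWINDOWS) * 8 + (reg - 8))
    if reg < 24:                                  # l0..l7
        return ("local", cwp * 8 + (reg - 16))
    return ("inout", cwp * 8 + (reg - 24))        # i0..i7


print(arf_entries())
# o0 of window 0 and i0 of window 1 name the same physical register.
print(physical_index(0, 8) == physical_index(1, 24))
```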

• The processor accesses the WRF in the pipeline’s R stage and supplies integer operands to the execution units.
• Most integer operations complete in one cycle, so the result can be written immediately at the C stage.
• If an exceptional event occurs, the results already written must be undone; the original integer register values are restored with a broadside copy of all integer registers from the appropriate ARF window.
• The architectural register file is written at the end of the pipeline, because all exceptions must be resolved by then.
• The ARF fills 16 WRF entries after a window change.
• On an exception, the 31 nonzero registers of the WRF must be updated.
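The interplay between speculative WRF writes and the ARF can be modeled roughly as follows. This is a simplified sketch, assuming a single window and an in-order list of pending results; the class and method names are hypothetical and the real pipeline timing is not modeled.

```python
class IntegerRegisters:
    """Toy model: speculative working file (WRF) backed by committed state (ARF)."""

    def __init__(self):
        self.arf = [0] * 32     # committed values for the current window
        self.wrf = [0] * 32     # speculative working copy
        self.pending = []       # in-flight (reg, value) results, oldest first

    def execute(self, reg, value):
        # Results are written to the WRF early, before exceptions are resolved.
        if reg != 0:            # g0 always reads as zero, so writes to it are dropped
            self.wrf[reg] = value
            self.pending.append((reg, value))

    def commit_oldest(self):
        # At the end of the pipeline the oldest result becomes architectural state.
        reg, value = self.pending.pop(0)
        self.arf[reg] = value

    def trap(self):
        # Undo all speculative results: broadside copy of the 31 nonzero registers.
        self.pending.clear()
        self.wrf[1:] = self.arf[1:]


regs = IntegerRegisters()
regs.execute(5, 42)
regs.commit_oldest()        # r5 = 42 is now architectural
regs.execute(6, 7)          # speculative only
regs.trap()                 # r6 is rolled back, r5 survives
print(regs.wrf[5], regs.wrf[6])
```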

On-chip memory system
Cache diagram used in the architecture

On-chip memory system
Level-one (L1) caches     Data: 64-Kbyte, 4-way
                          Instructions: 32-Kbyte, 4-way
                          Prefetch: 2-Kbyte, 4-way
                          Write: 2-Kbyte, 4-way
Level-two (L2) cache      Unified (data and instructions)
                          4 and 8 Mbyte, 1-way
                          On-chip tags; off-chip data
average latency = L1 hit time + L1 miss rate * L1 miss time + L2 miss rate * L2 miss time
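The average latency formula above is easy to evaluate directly. In the sketch below, the 12-cycle L1 miss time matches the L2 latency quoted later in this presentation; every other number is a made-up placeholder chosen only to show how the terms combine, not a measured value.

```python
def average_latency(l1_hit_time, l1_miss_rate, l1_miss_time,
                    l2_miss_rate, l2_miss_time):
    """Average memory latency in cycles, term by term as on the slide."""
    return (l1_hit_time
            + l1_miss_rate * l1_miss_time
            + l2_miss_rate * l2_miss_time)


# Hypothetical inputs: 2-cycle L1 hit, 5% L1 misses costing 12 cycles,
# 1% L2 misses costing 100 cycles.
latency = average_latency(2, 0.05, 12, 0.01, 100)
print(f"{latency:.2f} cycles")
```

With these placeholder inputs, the two miss terms contribute 0.6 and 1.0 cycles, so even a small L2 miss rate can outweigh the L1 miss term when main memory is slow.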

Prefetch cache
• Performance improves significantly when a prefetch cache is used in parallel with the L1 data cache.
• By issuing up to eight in-flight prefetches to main memory, the prefetch cache enables a program to utilize 100% of the available main-memory bandwidth without incurring a slowdown due to main-memory latency.

Prefetch cache
• The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, using four-way associativity with an LRU replacement policy.
• A multi-ported SRAM design achieves very high throughput.
• Data can be streamed through the prefetch cache in a manner similar to stream buffers.
• On every cycle, each of two independent read ports supplies 8 bytes of data to the pipeline, while a third (write) port fills the cache with 16 bytes.
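The geometry on this slide can be sanity-checked with a few lines of arithmetic. The entry count, line size, associativity, and port widths come from the slide; the `set_index` function is an assumption showing the conventional way a set would be chosen from an address, not a documented detail of the design.

```python
ENTRIES = 32        # from the slide
LINE_BYTES = 64     # from the slide
WAYS = 4            # from the slide

capacity_bytes = ENTRIES * LINE_BYTES     # total data storage
num_sets = ENTRIES // WAYS                # entries grouped into 4-way sets

read_bytes_per_cycle = 2 * 8              # two read ports, 8 bytes each
write_bytes_per_cycle = 1 * 16            # one write port, 16 bytes


def set_index(address):
    """Assumed mapping: the line number modulo the number of sets."""
    return (address // LINE_BYTES) % num_sets


print(capacity_bytes, num_sets)
print(read_bytes_per_cycle, write_bytes_per_cycle)
```

The read and write ports move the same number of bytes per cycle, which is consistent with streaming data through the cache at a steady rate.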

Prefetch cache
• Some earlier processors, such as the UltraSPARC-II, rely on software prefetch instructions.
• The UltraSPARC-III adds an autonomous stride prefetch engine that tracks the program counters of load instructions and detects when a load instruction is striding through memory.
• When the prefetch engine detects a striding load, it issues a hardware prefetch independent of any software prefetch.
• This allows the prefetch cache to be effective even on code that does not include prefetch instructions.
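One common way to detect a striding load is to remember, per load PC, the last address and the last stride, and to prefetch once the same stride repeats. The sketch below follows that generic scheme; the table organization, the "two equal strides" threshold, and the names are assumptions for illustration, and the actual engine may differ.

```python
class StridePrefetcher:
    """Per-PC stride detector: prefetch when a load repeats the same nonzero stride."""

    def __init__(self):
        self.table = {}     # load PC -> (last_address, last_stride)

    def observe(self, pc, address):
        """Record one executed load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (address, 0)
            return None
        last_address, last_stride = entry
        stride = address - last_address
        self.table[pc] = (address, stride)
        if stride != 0 and stride == last_stride:
            return address + stride     # stride confirmed: fetch the next element
        return None


engine = StridePrefetcher()
for addr in (1000, 1064, 1128):         # one load walking memory in 64-byte steps
    hint = engine.observe(0x100, addr)
print(hint)
```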

Write cache
• Write caching is an excellent way to reduce the bandwidth consumed by store traffic.
• The UltraSPARC-III uses a write cache to reduce the store traffic bandwidth to the off-chip L2 data cache.
• Its size is 2 Kbytes, 4-way associative.
• Advantage: as the sole source of on-chip dirty data, the write cache easily handles both multiprocessor and on-chip cache consistency.
• Error recovery also becomes easier with the write cache, since it keeps all other on-chip caches clean and simply invalidates them when an error is detected.

Write caching
• A byte-validate policy is used on the write cache. Rather than reading the data from the L2 cache for the bytes within the line that are not being overwritten, the cache keeps an individual valid bit for each byte. Not performing the read-on-allocate saves considerable L2 cache bandwidth by postponing a read-modify-write until the write cache evicts a line. Frequently, by eviction time the entire line has been written, so the write cache can eliminate the read.
• The write cache is included in the L2 data cache, and write-cache data can supersede read data from the L2 data cache. This is handled by a byte-merging multiplexer on the incoming L2 cache data bus that can choose either write-cache data or L2 cache data for each byte.
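The byte-validate policy and the byte-merging multiplexer can be illustrated with a small model of one write-cache line. The 64-byte line size and all names here are assumptions; the slide specifies only the per-byte valid bits and the per-byte choice between write-cache data and L2 data.

```python
LINE_BYTES = 64     # assumed line size, for illustration only


class WriteCacheLine:
    """One write-cache line with a valid bit per byte (no read-on-allocate)."""

    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def store(self, offset, payload):
        # A store marks only the bytes it writes as valid; nothing is read from L2.
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.valid[offset + i] = True

    def needs_l2_read_on_evict(self):
        # If every byte was written, the read-modify-write can be skipped.
        return not all(self.valid)

    def merge(self, l2_line):
        # Byte-merging mux: pick write-cache data where valid, else L2 data.
        return bytes(w if v else l
                     for w, v, l in zip(self.data, self.valid, l2_line))


line = WriteCacheLine()
line.store(4, b"\xAA\xBB")
merged = line.merge(bytes([1] * LINE_BYTES))
print(merged[3:7].hex(), line.needs_l2_read_on_evict())
```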

Floating-point unit
• This unit contains the data paths and control logic to execute floating-point and partitioned fixed-point data type instructions.
• Three data paths concurrently execute floating-point or graphics instructions, one per cycle from each of the following classes:
  - Divide/multiply (single or double precision, or partitioned)
  - Add/subtract/compare (single or double precision, or partitioned)
  - An independent division data path, which lets a non-pipelined divide proceed concurrently with the fully pipelined multiply and add paths
• To meet the cycle time, latency cycles must be added to the floating-point operations.
• With advanced circuit techniques in the floating-point add and multiply units, one additional latency cycle is enough.

External memory interface
• External memory consists of a large off-chip L2 cache and an off-chip main memory built from synchronous DRAMs.
• Size of the L2 cache: 4 or 8 Mbytes
• Latency: 12 clock cycles to supply a 32-byte line to L1
• The L2 tags are placed on chip to detect an L2 miss early (the L2 cache controller accesses the on-chip tags in parallel with the start of the off-chip SRAM access and provides a way-select signal to a late-select address pin on the off-chip SRAMs).

• L2 caches are wave-pipelined and operate at 600 MHz.
• The main-memory DRAM controller is on chip, which reduces memory latency and scales the memory bandwidth with the number of processors.
• The memory controller supports up to 4 Gbytes of SDRAM organized as four independent banks.

Trap stage in the pipeline
• In this architecture, the classical stall signal (which freezes the state of the pipeline) is eliminated for performance reasons.
• Instead, a trap stage at the end of the pipeline restores the state when an unexpected event occurs.
• Such an event is handled like a trap: the instructions in the pipeline are refetched starting from stage A.

Conclusion
• The Sun Microsystems UltraSPARC is one of the advanced RISC microprocessors. It finds many applications in desktops, network systems, and scientific computing machines.
• The internal architecture of the UltraSPARC-III was presented.
• Various parts of the processor were examined, including instruction issue, execution, and on-chip and external memory.

References
• 1) ‘UltraSPARC-III: Designing Third-Generation 64-Bit Performance’, IEEE Micro, June 1999
• 2) ‘Design Decisions Influencing UltraSPARC’s Instruction Fetch Architecture’, 29th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 178-190, Paris, 1996
• 3) UltraSPARC III v9 Manual, Sun Microsystems

THANK YOU
