Hitachi SuperH SH-4 By: Herman Sheremetyev 5/10/2002
Inspiration I was inspired to do this presentation on the Hitachi SH-4 processor because this is the processor used in the Sega Dreamcast video game system. I own a Dreamcast and after being assigned this project I became very interested in its internal workings. As a result of my research I found that there was quite a bit of software ported to this platform, starting with a NetBSD port and followed by a Linux port which can actually transform the Dreamcast into a usable X terminal. These ports were largely possible due to the fact that Hitachi released the complete specifications as well as a Programmer’s Manual for the processor. What follows are excerpts from the Hitachi Hardware Manual that briefly describe SH-4’s most interesting aspects which I loosely tailored to the Dreamcast implementation.
Sources • Most of the information in this presentation is taken from the Hitachi Hardware Manual for the SH-4 family of processors. • The manual can be found at http://www.julesdcdev.com/ and probably on the Hitachi website.
Features Summary • The SH-4 (SH7750 Series (SH7750, SH7750S)) has been developed as the top-end model in the SuperH™ RISC engine family, featuring a 128-bit graphic engine for multimedia applications and 360 MIPS performance.
Features • In addition to single- and double-precision floating-point operation capability, the on-chip FPU has a 128-bit graphic engine that enables 32-bit floating-point data to be processed 128 bits at a time. • It also supports 4 × 4 array operations and inner product operations, enabling a performance of 1.4 GFLOPS to be achieved.
Features • Operating frequency is 200 MHz. • A superscalar architecture is employed that enables simultaneous execution of two instructions (including FPU instructions). • An 8-kbyte instruction cache and 16-kbyte data cache are also provided, and the on-chip memory management unit (MMU) handles translation from the 4-Gbyte virtual address space to the physical address space.
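The headline figures on these slides can be cross-checked with a little arithmetic. The sketch below assumes (this is not stated on the slides) that the 1.4 GFLOPS peak comes from issuing one 4-element inner product per cycle, i.e. 4 multiplies plus 3 adds, and it simply divides the quoted 360 MIPS by the clock to get a MIPS-per-MHz figure.

```python
CLOCK_HZ = 200_000_000   # 200 MHz, from the slide

# Assumption: one 4-element inner product per cycle = 4 multiplies + 3 adds.
FLOPS_PER_CYCLE = 4 + 3


def peak_gflops(clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak floating-point rate in GFLOPS under the assumption above."""
    return clock_hz * flops_per_cycle / 1e9


def mips_per_mhz(mips=360, clock_hz=CLOCK_HZ):
    """Quoted MIPS figure divided by the clock frequency in MHz."""
    return mips / (clock_hz / 1e6)


print(peak_gflops())    # 1.4
print(mips_per_mhz())   # 1.8
```

Under that assumption the numbers agree with the slides: 7 operations per cycle at 200 MHz gives 1.4 GFLOPS.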
Registers • Sixteen 32-bit general registers (and eight 32-bit shadow registers) • Seven 32-bit control registers • Four 32-bit system registers • Register operands are always longwords (32 bits). When a memory operand is only a byte (8 bits) or a word (16 bits), it is sign-extended into a longword when loaded into a register.
Data Formats in Memory • Memory data formats are classified into bytes, words, and longwords. Memory can be accessed in 8-bit byte, 16-bit word, or 32-bit longword form. A memory operand less than 32 bits in length is sign-extended before being loaded into a register. • A word operand must be accessed starting from a word boundary (even address of a 2-byte unit: address 2n), and a longword operand starting from a longword boundary (even address of a 4-byte unit: address 4n). An address error will result if this rule is not observed. • A byte operand can be accessed from any address.
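The sign-extension and alignment rules above can be sketched in a few lines. The function names here are hypothetical helpers for illustration, not anything from the Hitachi manual.

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value (8 or 16) to a 32-bit register image."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # top bit set: the value is negative
        value -= 1 << bits
    return value & 0xFFFFFFFF          # keep the 32-bit two's-complement form


def check_alignment(address, size):
    """Raise if a `size`-byte access (1, 2 or 4 bytes) is misaligned."""
    if address % size != 0:
        raise ValueError(f"address error: {size}-byte access at {address:#x}")


print(hex(sign_extend(0x80, 8)))       # 0xffffff80
print(hex(sign_extend(0x7F, 8)))       # 0x7f
check_alignment(0x1002, 2)             # fine: word on a 2n address
check_alignment(0x1003, 1)             # fine: bytes can go anywhere
```

A longword access at 0x1002 would raise here, which mirrors the address error described on the slide.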
“Endianness” • Big endian or little endian byte order can be selected for the data format. Big endian is the preferred method of operation. • The byte order cannot be changed dynamically. • Bit positions are numbered left to right from most-significant to least-significant. Thus, in a 32-bit longword, the leftmost bit, bit 31, is the most significant bit and the rightmost bit, bit 0, is the least significant bit.
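To make the two byte orders concrete, here is a small sketch using Python's standard `struct` module to lay out the same 32-bit longword both ways. The example value is arbitrary.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)      # most significant byte at the lowest address
little = struct.pack("<I", value)   # least significant byte at the lowest address

print(big.hex())      # 12345678
print(little.hex())   # 78563412
```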
Operand and Instruction Caches • The operand cache consists of 512 cache lines, each composed of a 19-bit tag, validity bit (V), dirty bit (U), and 32-byte data. • The instruction cache consists of 256 cache lines, each composed of a 19-bit tag, validity bit (V), and 32-byte data (16 instructions). • (Tag - stores the upper 19 bits of the 29-bit external memory address of the data line to be cached.)
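The line counts above are consistent with the cache sizes quoted earlier, and the tag is easy to compute from the description. This is a sketch only: the constant and function names are mine, and it deliberately does not model how a line index is chosen, since the slides do not say.

```python
LINE_BYTES = 32
OPERAND_LINES = 512
INSTRUCTION_LINES = 256
TAG_BITS = 19
PHYS_ADDR_BITS = 29
INSTRUCTION_BITS = 16     # fixed-length 16-bit instructions


def cache_bytes(lines, line_bytes=LINE_BYTES):
    """Total data capacity of a cache with the given number of lines."""
    return lines * line_bytes


def tag_of(physical_address):
    """Upper 19 bits of the 29-bit external memory address."""
    addr = physical_address & ((1 << PHYS_ADDR_BITS) - 1)
    return addr >> (PHYS_ADDR_BITS - TAG_BITS)


print(cache_bytes(OPERAND_LINES))              # 16384 (16 kbytes)
print(cache_bytes(INSTRUCTION_LINES))          # 8192 (8 kbytes)
print(LINE_BYTES * 8 // INSTRUCTION_BITS)      # 16 instructions per line
print(hex(tag_of(0x0C000400)))                 # 0x30001
```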
Cache-Memory coherence • Coherency between cache and external memory should be assured by software. • Several cache operation instructions are provided, including a prefetch instruction.
Cache operations (operand cache only) • Invalidate instruction: OCBI @Rn Cache invalidation (no write-back) • Purge instruction: OCBP @Rn Cache invalidation (with write-back) • Write-back instruction: OCBWB @Rn Cache write-back • Allocate instruction: MOVCA.L R0,@Rn Cache allocation
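The difference between invalidate, purge, and write-back is easiest to see in a toy model. The class below is purely illustrative: it tracks one cached value per address with a dirty flag and is not the real SH-4 cache organisation. Only the method names are borrowed from the mnemonics on the slide.

```python
class ToyCache:
    """Toy write-back cache: one cached value per address, with a dirty flag."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value in external memory
        self.lines = {}           # dict: address -> [value, dirty]

    def write(self, addr, value):
        """CPU store that hits the cache only and marks the line dirty."""
        self.lines[addr] = [value, True]

    def ocbi(self, addr):
        """Invalidate: drop the line, losing any unwritten data."""
        self.lines.pop(addr, None)

    def ocbwb(self, addr):
        """Write-back: copy dirty data to memory, keep the line valid."""
        line = self.lines.get(addr)
        if line and line[1]:
            self.memory[addr] = line[0]
            line[1] = False

    def ocbp(self, addr):
        """Purge: write back if dirty, then invalidate."""
        self.ocbwb(addr)
        self.lines.pop(addr, None)


memory = {0x100: 1}
cache = ToyCache(memory)
cache.write(0x100, 2)
cache.ocbi(0x100)
print(memory[0x100])   # 1, the store was discarded
cache.write(0x100, 3)
cache.ocbp(0x100)
print(memory[0x100])   # 3, the store reached memory
```

This is also why software-managed coherence matters: data written by the CPU is not visible to other bus masters until it has been written back.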
Floating Point Unit (FPU) • Conforms to the IEEE 754 standard • 32 single-precision floating-point registers (can also be referenced as 16 double-precision registers) • Two rounding modes: Round to Nearest and Round to Zero • Two denormalization modes: Flush to Zero and Treat Denormalized Number • Six exception sources: FPU Error, Invalid Operation, Divide By Zero, Overflow, Underflow, and Inexact • Comprehensive instructions: Single-precision, double-precision, graphics support, system control
FPU Data Formats • A floating-point number consists of the following three fields: · Sign (s) · Exponent (e) · Fraction (f) • 32 bit Single-Precision (s=1,e=8,f=23) • 64 bit Double-Precision (s=1,e=11,f=52)
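The three fields can be pulled out of a real number with Python's `struct` module, which uses the same IEEE 754 layouts. Note that the exponent field holds a biased value (bias 127 for single, 1023 for double, per IEEE 754); the helper names are mine.

```python
import struct


def single_fields(x):
    """Return (sign, exponent, fraction) of x as a 32-bit single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def double_fields(x):
    """Return (sign, exponent, fraction) of x as a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


print(single_fields(1.0))    # (0, 127, 0)
print(single_fields(-2.5))   # (1, 128, 2097152)
print(double_fields(1.0))    # (0, 1023, 0)
```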
FPU Rounding • Round to Nearest: The value is rounded to the nearest expressible value. If the unrounded value is 2^Emax (2 – 2^(–P)) or more, the result will be infinity with the same sign as the unrounded value. • Round to Zero: The digits below the round bit of the unrounded value are discarded. If the unrounded value is larger than the maximum expressible absolute value, the value will be the maximum expressible absolute value.
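The two modes differ only in what happens to the bits that do not fit. The sketch below works on a non-negative integer significand and drops its low bits. It assumes ties are rounded to the even neighbour, which is the IEEE 754 default for round to nearest; the slide itself only says "nearest expressible value". It does not model the overflow cases.

```python
def round_significand(value, drop_bits, mode):
    """Drop the low `drop_bits` (>= 1) bits of a non-negative significand.

    mode is "nearest" (ties to even, an IEEE 754 assumption) or "zero".
    """
    kept = value >> drop_bits
    if mode == "zero":
        return kept                          # discard the low bits
    remainder = value & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1                            # round up
    return kept


print(round_significand(0b10111, 2, "zero"))      # 5
print(round_significand(0b10111, 2, "nearest"))   # 6
print(round_significand(0b10010, 2, "nearest"))   # 4 (tie, stays even)
```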
FPU Graphics Support • The SH7750 Series supports two kinds of graphics functions: • instructions for geometric operations • pair single-precision transfer instructions that enable high-speed data transfer.
FPU Geometric functions • Geometric operation instructions perform approximate-value computations. To enable high-speed computation with a minimum of hardware, the SH7750 Series ignores comparatively small values in the partial computation results of four multiplications.
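The two geometric operations mentioned earlier, the inner product and the 4 × 4 array operation, amount to the following arithmetic. The function names are borrowed from the FIPR and FTRV mnemonics in the SH-4 instruction set, which the slides do not name; the row-major matrix layout is my assumption. This sketch computes ordinary Python floating-point results, whereas the hardware produces approximate values as described above.

```python
def fipr(a, b):
    """Inner product of two 4-element vectors: 4 multiplies, 3 adds."""
    assert len(a) == len(b) == 4
    return sum(x * y for x, y in zip(a, b))


def ftrv(matrix, vector):
    """Transform a 4-element vector by a 4 x 4 matrix (list of four rows)."""
    assert len(matrix) == 4
    return [fipr(row, vector) for row in matrix]


identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(fipr([1, 2, 3, 4], [5, 6, 7, 8]))        # 70
print(ftrv(identity, [1.0, 2.0, 3.0, 1.0]))    # [1.0, 2.0, 3.0, 1.0]
```

A vertex transform in a 3D game is exactly this matrix-vector product, which is why it gets dedicated hardware.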
FPU Pair Single-Precision Data Transfer • In addition to the geometric operation instructions, the SH7750 Series also supports high-speed data transfer instructions. • These instructions enable two single-precision (2 × 32-bit) data items to be transferred; that is, the transfer performance of these instructions is doubled.
Instruction Format • The instruction set is implemented with 16-bit fixed-length instructions. • Operations are basically executed using registers. • Except for bit-manipulation operations such as logical AND that are executed directly in memory, operands in an operation that requires memory access are loaded into registers and the operation is executed between the registers.
Instruction Format (cont’d) • Delayed Branches: Except for the two branch instructions BF and BT, branch instructions and RTE are delayed branches. (In a delayed branch, the instruction following the branch is executed before the branch destination instruction.) • Constant Values: An 8-bit constant value can be specified by the instruction code and an immediate value. 16-bit and 32-bit constant values can be defined as literal constant values in memory.
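A toy interpreter shows what a delayed branch does to execution order. This is only a sketch: the program is a list of (mnemonic, target) pairs, BRA stands in for any delayed branch, BT stands in for the non-delayed ones and is assumed to be taken, and nothing else about the instructions is modelled.

```python
def run(program):
    """Return the mnemonics in the order a toy SH-style core executes them."""
    trace, pc = [], 0
    while pc < len(program):
        name, target = program[pc]
        trace.append(name)
        if name == "BRA":
            # Delayed branch: the next instruction (the delay slot) runs
            # before control reaches the branch destination.
            trace.append(program[pc + 1][0])
            pc = target
        elif name == "BT":
            pc = target           # non-delayed, assumed taken here
        else:
            pc += 1
    return trace


delayed = [("BRA", 3), ("ADD", None), ("SUB", None), ("MOV", None)]
plain = [("BT", 3), ("ADD", None), ("SUB", None), ("MOV", None)]

print(run(delayed))   # ['BRA', 'ADD', 'MOV']
print(run(plain))     # ['BT', 'MOV']
```

The ADD after BRA still executes, so compilers try to put useful work in that slot instead of a NOP.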
Addressing Modes • Register direct • Register indirect (supports post-increment and pre-decrement as well as displacement) • Indexed register indirect, i.e. the effective address is the sum of register Rn and R0 contents. • Immediate
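The effective-address calculations behind these modes can be sketched as below. The helper names are hypothetical. I am assuming the register indirect variants are post-increment and pre-decrement, and that the displacement is scaled by the operand size; neither detail is spelled out on the slide.

```python
MASK32 = 0xFFFFFFFF


def ea_post_increment(regs, n, size):
    """@Rn+ : use Rn as the address, then add the operand size to Rn."""
    addr = regs[n]
    regs[n] = (regs[n] + size) & MASK32
    return addr


def ea_pre_decrement(regs, n, size):
    """@-Rn : subtract the operand size from Rn, then use Rn as the address."""
    regs[n] = (regs[n] - size) & MASK32
    return regs[n]


def ea_displacement(regs, n, disp, size):
    """@(disp,Rn) : Rn plus displacement (assumed scaled by operand size)."""
    return (regs[n] + disp * size) & MASK32


def ea_indexed(regs, n):
    """@(R0,Rn) : sum of Rn and R0."""
    return (regs[n] + regs[0]) & MASK32


regs = [0] * 16
regs[0], regs[1] = 0x10, 0x1000
print(hex(ea_post_increment(regs, 1, 4)), hex(regs[1]))   # 0x1000 0x1004
print(hex(ea_pre_decrement(regs, 1, 4)))                  # 0x1000
print(hex(ea_displacement(regs, 1, 3, 4)))                # 0x100c
print(hex(ea_indexed(regs, 1)))                           # 0x1010
```

Post-increment and pre-decrement together give a cheap stack: push with pre-decrement, pop with post-increment.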
Instruction Set • Over 100 different instructions including FP, mostly variations on MOV, ADD, etc. to accommodate different addressing modes. • Instruction mnemonic: OP.Sz SRC, DEST • OP: Operation code • Sz: Size • SRC: Source • DEST: Source and/or destination operand
Instruction Level Parallelism • The SH7750 Series is a 2-ILP (instruction-level-parallelism) superscalar pipelining microprocessor. • Instruction execution is pipelined, and two instructions can be executed in parallel. • Parallel execution depends on the instructions – not all instructions can be executed in parallel with all others
Pipelining • The instruction pipeline has 5 stages: • Instruction fetch (I) • decode and register read (D) • execution (EX/SX/F0/F1/F2/F3) • data access (NA/MA) • write-back (S/FS)
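For a rough feel of what the 5-stage, 2-way design buys, here is an idealized cycle count. It assumes every pair of instructions can issue together and that nothing ever stalls, which, as noted on the previous slide, is not true of real code, so treat the result as a lower bound.

```python
import math

STAGES = 5        # I, D, EX, MA, S
ISSUE_WIDTH = 2   # two instructions per cycle at best


def ideal_cycles(n_instructions, stages=STAGES, width=ISSUE_WIDTH):
    """Cycles to finish n instructions with perfect pairing and no stalls."""
    if n_instructions == 0:
        return 0
    issue_groups = math.ceil(n_instructions / width)
    return stages + issue_groups - 1


print(ideal_cycles(1))             # 5
print(ideal_cycles(10))            # 9
print(ideal_cycles(10, width=1))   # 14, same pipeline without dual issue
```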
ILP Illustration http://www.hitachisemiconductor.com/sic/jsp/japan/eng/products/mpumcu/32bit/image/2_way.gif
Direct Memory Access • The SH7750 Series includes an on-chip four-channel direct memory access controller (DMAC). • The DMAC can be used in place of the CPU to perform high-speed data transfers among external devices equipped with DACK (DMA transfer end notification), external memories, memory mapped external devices, and on-chip peripheral modules (except the DMAC, BSC, and UBC). • Using the DMAC reduces the burden on the CPU and increases the operating efficiency of the chip.
Serial Communication Interface (SCI) • The SH7750 is equipped with a single-channel serial communication interface (SCI) and a single-channel serial communication interface with built-in FIFO registers (SCI with FIFO: SCIF). • The SCI can handle both asynchronous and synchronous serial communication. A function is also provided for serial communication between processors (multiprocessor communication function).
Smart Card Interface • An IC card (smart card) interface conforming to ISO/IEC 7816-3 (Identification Card) is supported as a serial communication interface (SCI) extension function. • Switching between the normal serial communication interface and the smart card interface is carried out by means of a register setting.