Codesigned Virtual Machine -Transmeta CRUSOE-

2005. 11
Chang Hyun Lee
System Design Group, Seoul National University

Index
• Introduction
• Overview (and some details) of HW & SW
  • HW part: VLIW architecture
  • SW part: Code Morphing Software
• Detailed description
  • Speculation and recovery
  • Precise traps
  • Memory-mapped I/O
  • Data speculation
  • Self-modifying code
• Conclusion

Introduction
• Who is Transmeta?
  • CTO David Ditzel, who had worked for Sun Microsystems, established Transmeta in 1995.
  • Few people paid attention to Transmeta until it hired Linus Torvalds.
  • "Our first product's name is Crusoe" and "We have rethought the microprocessor": only these two sentences were displayed on the company's homepage.
  • On January 19, 2000, Transmeta introduced Crusoe to the market.
• Design goal of Crusoe
  • The ever-changing technology market continues to drive the need for ever more compact designs.
  • The target is a processor that delivers high performance but is smaller, consumes less power for longer battery life, and runs cool enough to need no fan.
  • Lighter, Longer, Cooler

Introduction
• Characteristics of Crusoe
  • Transmeta's Crusoe microprocessor is a full, system-level implementation of the x86 architecture, comprising a native VLIW microprocessor with a software layer, the Code Morphing Software (CMS).
• 128-bit VLIW (Very Long Instruction Word) engine
  • The Crusoe processor features a 128-bit wide VLIW engine that can issue up to 4 instructions per clock cycle.
• Code Morphing Software
  • The CMS layer provides the Crusoe processor with x86 compatibility, while keeping the flexibility of a software implementation on top of the VLIW hardware.
• Integrated architecture
  • To ease system design and enhance performance, Transmeta has integrated Northbridge functionality (SDR and DDR SDRAM memory controllers, a 32-bit, 33 MHz PCI bus controller, and a serial ROM interface controller) directly into the Crusoe processor die.
• LongRun technology
  • Transmeta LongRun technology allows the Crusoe processor to conserve power by dynamically adjusting its voltage and clock frequency.

Overview of HW part
• What is VLIW (Very Long Instruction Word)?
  • A set of instructions to be executed in parallel, encoded together as a single large instruction.

Overview of HW : Architecture
• TM5800 microarchitecture
[Figure: TM5800 microarchitecture and the memory hierarchy of Crusoe. Labels in the figure: 64KB I-cache, 64KB D-cache, 8KB local program memory, 8KB local data memory, 2MB VMM, 512KB compressed VMM, 14MB Crusoe data & translation]

Overview of HW : register set
• Register set
  • The processor has 64 GPRs, some with specialized semantics:
    • %r63 (%zero) always reads 0 when used as a source operand
    • %r62 (%sink) is a discarded destination (e.g., for compares); it is never read
    • %r59 (%from) saved return address
    • %r58 (%link) return address
    • %r47 (%sp) is the current stack pointer
    • %r0 (%eax) for current x86 machine state
    • %r1 (%ecx) for current x86 machine state
    • %r2 (%edx) for current x86 machine state
    • %r3 (%ebx) for current x86 machine state
  • 48 of these GPRs are backed by shadow GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the "known good" shadow GPRs.
  • The processor also includes 32 80-bit floating point registers and 16 FP shadow registers.
  • There is also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.

Overview of HW : Instruction set
• Instruction encoding
  • Instructions are encoded in little-endian byte and word order, as shown in the following diagram:
  • All instructions (except branches) have a 9-bit opcode field. All opcodes share a common mapping into this 9-bit space.
  • The ALU0|imm32 and ALU0|ALU1 bundle types share the same format code (10), but the ALU1 slot is interpreted as an imm32 depending on the opcode:
    • 11xxxx011: 32-bit immediate in place of ALU1
    • All others: execute ALU1 as an instruction
  • If the 11xxxx011 pattern appears in an ALU1 slot, an 8-bit immediate is used instead. It is not clear why this encoding is sometimes used instead of the normal 8-bit immediate form.
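The opcode rule for the shared format code can be expressed as a bit-mask test. The following is only a minimal Python sketch, assuming that the pattern 11xxxx011 is written most significant bit first over the 9-bit opcode field; the function and constant names are made up for illustration and do not come from any Transmeta documentation.

```python
# Fixed bit positions of the pattern 11xxxx011 (bits 8,7 and bits 2,1,0).
IMM32_MASK = 0b110000111
# Required values at those positions; the four "x" bits are don't-care.
IMM32_VALUE = 0b110000011


def matches_imm32_pattern(opcode: int) -> bool:
    """Return True if a 9-bit opcode matches 11xxxx011.

    Assumption: bit 8 is the leftmost bit of the pattern as printed.
    """
    if not 0 <= opcode < (1 << 9):
        raise ValueError("opcode must fit in 9 bits")
    return (opcode & IMM32_MASK) == IMM32_VALUE
```

When the function returns True for the opcode of a format-10 bundle, the second slot would be read as a 32-bit immediate rather than decoded as an ALU1 instruction.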

Overview of HW : Instruction set
• The figure below shows the format of instructions for the two ALUs.
  • Two ALUs are provided in Crusoe. It appears that ALU1 executes a superset of the operations available on ALU0.
  • The ALU1 slot is also used for all floating point and MMX operations, as indicated by ALU1's type select bits being something other than '00'.
• The figure below shows the format of instructions for the load/store unit (LSU).
  • All LSU operations take a fully calculated address in register ra; as with most VLIW architectures, no ra+offset or ra+rb addressing modes are provided.

Overview of HW : Instruction set
• The figure below shows the format of branch instructions.
  • Branches (both conditional and unconditional) within CMS use a 23-bit absolute target address aligned to a 64-bit boundary (i.e., abstarget is shifted left by 3 bits).
  • Conditional branches use exactly the same condition code set (cc bits) as the x86 encoding of jump instructions. Unconditional branches can optionally write the return address to the %link register (%r58) if the L bit (bit 0 of the cc field) is set.
  • Indirect branches go through a general purpose register. It appears that special instructions are provided to prepare for an indirect branch when the target address is known in advance; this avoids the three-cycle branch penalty.
  • Example: the Lookup_Jump Rk instruction performs the jump to the TPC if there is a hit; otherwise it falls through.
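The branch target arithmetic can be illustrated in a few lines of Python. This is a sketch under the assumption that the 23-bit abstarget field counts 8-byte (64-bit) units from address 0; the function names are hypothetical.

```python
ABSTARGET_BITS = 23
ALIGN_SHIFT = 3  # 64-bit boundary = 8 bytes = shift left by 3


def branch_target(abstarget: int) -> int:
    """Convert the 23-bit abstarget field into a byte address."""
    if not 0 <= abstarget < (1 << ABSTARGET_BITS):
        raise ValueError("abstarget must fit in 23 bits")
    return abstarget << ALIGN_SHIFT


def encode_abstarget(byte_address: int) -> int:
    """Inverse mapping; the address must be 8-byte aligned and in range."""
    if byte_address % (1 << ALIGN_SHIFT) != 0:
        raise ValueError("branch targets must be aligned to 64 bits")
    field = byte_address >> ALIGN_SHIFT
    if not 0 <= field < (1 << ABSTARGET_BITS):
        raise ValueError("address out of range for a 23-bit field")
    return field
```

Under this assumption the field can reach 2^26 bytes (64 MB) of native code space.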

Overview of HW : pipelining
• The top row of the diagram indicates the pipeline for an ALU instruction, with the other rows representing the two other types of functional units. The pipeline is a fairly typical RISC design:
  • Fetch0: The first 64 bits of a 64-bit or 128-bit bundle are fetched
  • Fetch1: The second 64 bits are fetched (for 128-bit bundles only)
  • Regs: Read source registers and decode/disperse instructions
  • ALU: Execute single-cycle operations in ALU0 and ALU1
  • Except: Complete two-cycle ALU0/ALU1 ops and detect exceptions
  • Cache0: Initiate L1 data cache access based on register address
  • Cache1: Complete L1 data cache access, TLB access and alias checks
  • Write: Write results back to GPRs or store buffer
  • Commit: Optionally latch the lower 48 GPRs into the shadow registers

Overview of SW part : CMS
• The Code Morphing Software is fundamentally a dynamic translation system, a program that compiles instructions for one instruction set architecture (in this case, the x86 target ISA) into instructions for another ISA (the VLIW host ISA).
• Transmeta's Crusoe microprocessor is a full, system-level implementation of the x86 architecture, comprising a native VLIW microprocessor with a software layer, the Code Morphing Software (CMS), that combines an interpreter, dynamic binary translator, optimizer, and runtime system.

Overview of SW : CMS
• Requirements CMS has to satisfy
  • CMS must faithfully implement the complete x86 architecture: all instructions (including memory-mapped I/O), architectural registers, and complete exception behavior.
  • CMS can make no assumptions about the operating system running on the processor and cannot depend on information or other assistance from the system. It is a system-level implementation, not application-level, and even executes the BIOS code.
  • CMS must provide robust performance for a wide variety of systems and applications. This requires dealing with unpleasant realities like self-modifying code and precise exceptions.

Overview of SW : CMS
• Typical CMS control flow
  • CMS is structured like many other dynamic translation systems.
  • Initially, an interpreter decodes and executes x86 instructions sequentially.
  • When the number of executions of a section of x86 code reaches a certain threshold, its address is passed to the translator.
  • The translator selects a region and stores the translation, with various related information, in the translation cache.
  • From then on, until something invalidates the translation cache entry, CMS executes the translation whenever the x86 flow of control reaches the translated code region.
  • Once a branch target is identified as another translation, the branch operation is modified to go directly there, a process called chaining.
  • However, a variety of exceptional events may interrupt this typical control flow.
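The control flow described on this slide can be sketched as a small dispatcher. The Python below is only an illustration under simplifying assumptions: the threshold value, the class name, and the callback signatures are all hypothetical, a "section of x86 code" is identified by its entry address alone, and chaining is left out.

```python
HOT_THRESHOLD = 3  # hypothetical value; the real threshold is not given here


class Dispatcher:
    """Interpret cold x86 code, translate it once it becomes hot."""

    def __init__(self, interpret, translate):
        self.interpret = interpret   # interpret(pc) -> next x86 pc
        self.translate = translate   # translate(pc) -> callable() -> next x86 pc
        self.exec_counts = {}        # x86 entry address -> executions seen
        self.translation_cache = {}  # x86 entry address -> translation

    def step(self, pc):
        """Execute one region starting at pc; return (next_pc, mode)."""
        translation = self.translation_cache.get(pc)
        if translation is not None:
            return translation(), "translated"
        self.exec_counts[pc] = self.exec_counts.get(pc, 0) + 1
        if self.exec_counts[pc] >= HOT_THRESHOLD:
            # Threshold reached: pass the address to the translator.
            self.translation_cache[pc] = self.translate(pc)
        return self.interpret(pc), "interpreted"

    def invalidate(self, pc):
        """Drop a translation, e.g. after its x86 code was overwritten."""
        self.translation_cache.pop(pc, None)
        self.exec_counts.pop(pc, None)
```

With the threshold of 3 assumed here, the first three visits to an address are interpreted and later visits run the cached translation until it is invalidated.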

Overview of SW : CMS
• Translation example
• x86 instructions
    A. addl %eax,(%esp)   // load data from stack, add to %eax
    B. addl %ebx,(%esp)   // ditto, for %ebx
    C. movl %esi,(%ebp)   // load %esi from memory
    D. subl %ecx,5        // subtract 5 from %ecx register
• First pass: the front end of the translator produces a simple translation.
    ld    %r30,[%esp]       // load from stack, into temporary
    add.c %eax,%eax,%r30    // add to %eax, set condition codes
    ld    %r31,[%esp]
    add.c %ebx,%ebx,%r31
    ld    %esi,[%ebp]
    sub.c %ecx,%ecx,5
• Second pass: the optimizer applies well-known compiler optimizations such as common subexpression elimination, loop invariant removal, and dead code elimination.
    ld    %r30,[%esp]       // load from stack only once
    add   %eax,%eax,%r30
    add   %ebx,%ebx,%r30    // reuse data loaded earlier
    ld    %esi,[%ebp]
    sub.c %ecx,%ecx,5       // only this last condition code needed
• Final pass: the scheduler reorders atoms and packs them into molecules.
    1. ld %r30,[%esp]; sub.c %ecx,%ecx,5
    2. ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30
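One of the optimizations in the second pass, dropping condition-code updates that no later atom needs, can be sketched as a backward scan. This is a simplified Python illustration, not the CMS optimizer: the tuple representation of atoms and the FLAG_READERS set are assumptions, and the redundant-load removal from the example is not attempted.

```python
# Hypothetical opcodes that read the condition codes (none appear in the example).
FLAG_READERS = {"br.cc"}


def drop_dead_flag_updates(atoms):
    """atoms: list of (opcode, operands); an opcode ending in '.c' sets flags.

    Walk backwards, tracking whether the flags are still needed. The flags
    are treated as live at the end of the block because they are part of
    the x86 state.
    """
    result = []
    flags_live = True
    for opcode, operands in reversed(atoms):
        new_opcode = opcode
        if opcode.endswith(".c"):
            if not flags_live:
                new_opcode = opcode[:-2]  # nobody reads this flag update
            flags_live = False            # earlier updates are overwritten here
        if opcode in FLAG_READERS:
            flags_live = True
        result.append((new_opcode, operands))
    result.reverse()
    return result


first_pass = [
    ("ld", "%r30,[%esp]"),
    ("add.c", "%eax,%eax,%r30"),
    ("ld", "%r31,[%esp]"),
    ("add.c", "%ebx,%ebx,%r31"),
    ("ld", "%esi,[%ebp]"),
    ("sub.c", "%ecx,%ecx,5"),
]
optimized = drop_dead_flag_updates(first_pass)
```

On the first-pass code from the slide, only the final sub.c keeps its condition-code update, which matches the second-pass listing.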

Speculation & Recovery
• Speculation: making and exploiting assumptions about the code being translated that cannot be proven at translation time.
  • Example: the translator might assume that two specific load and store instructions reference non-overlapping memory.
• This type of speculation enables generation of much more efficient translations, but should one or more assumptions prove to be false, incorrect results may be produced.
• CMS uses a combination of hardware and software mechanisms to detect failing assumptions.
  • The hardware side is the commit stage and the shadow registers (discussed on the next slide).

Speculation & Recovery
• Hardware support for speculation and recovery
  • There are two copies of each register: a working copy and a shadow copy.
  • Normal atoms update only the working copy.
  • When execution reaches the end of a translation, a commit operation copies all working registers into their corresponding shadow registers.
  • If any exception condition occurs inside a translation block, the runtime system undoes the effects of all molecules executed since the last commit: a rollback.
  • Rollback copies the shadow register values (committed at the end of the previous translation) back into the working registers.
  • Following a rollback, CMS usually interprets the x86 instructions corresponding to the faulting translation, executing them in the original program order.
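The commit and rollback behavior can be modeled in a few lines. The Python below is a sketch with several assumptions: it models only registers (buffered memory stores are not modeled), the register count of 48 is taken from the register-set slide, and TranslationFault, run_translation, and the callback signatures are hypothetical names.

```python
class TranslationFault(Exception):
    """Stands in for any exception condition raised inside a translation."""


class ShadowedRegisters:
    def __init__(self, count=48):
        self.working = [0] * count
        self.shadow = [0] * count

    def write(self, index, value):
        self.working[index] = value      # normal atoms touch only the working copy

    def commit(self):
        self.shadow = list(self.working)  # latch the "known good" state

    def rollback(self):
        self.working = list(self.shadow)  # undo everything since the last commit


def run_translation(regs, molecules, interpret_in_order):
    """Run molecules speculatively; on a fault, roll back and interpret."""
    try:
        for molecule in molecules:
            molecule(regs)
    except TranslationFault:
        regs.rollback()
        interpret_in_order(regs)  # re-execute in original x86 order
    regs.commit()                 # end of translation: commit
```

The interpreter fallback always starts from the last committed state, so it never observes partial results of the failed translation.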

Challenges due to Speculation
• Challenges CMS has to meet once it speculates:
  • CMS must faithfully reproduce the precise exception behavior of the x86 target, without overly constraining the scheduling of its translations.
  • CMS must respond to interrupts at precise x86 instruction boundaries, where the system possesses a consistent target state.
  • CMS must efficiently handle memory-mapped I/O and other system-level operations, without penalizing normal (non-I/O) memory references.
  • Legacy PC software, especially games, often includes performance-critical self-modifying code. Similar problems result from pages containing both code and data, common in Windows/9X device drivers, BIOSs, and embedded systems running a real-time operating system.

Precise Exception & Interrupt
• In the x86 ISA, exceptions are precise: when one instruction causes an exception, all instructions preceding it must complete before the exception is reported, and none of the subsequent instructions may complete.
• With hardware support for commit and rollback and the interpreter-based recovery procedure in place, CMS has much more flexibility in scheduling the translated instructions.
• Commit and rollback serve a similar purpose with respect to interrupts.

Memory-mapped I/O
• One of the most important rules associated with I/O transactions is that they must be performed in the original (x86) program order, since they trigger irrevocable interactions with external devices.
• In the x86 architecture, devices can be accessed via two different mechanisms: explicit I/O instructions ("in/out") and memory-mapped accesses.
• The former are easily recognized and translated appropriately.
• Memory-mapped I/O, however, cannot be distinguished at translation time from regular memory accesses. In addition, a given x86 instruction can access both regular memory and I/O space over the course of program execution.

Memory-mapped I/O
• To solve the problem, load and store atoms on the Crusoe hardware specify whether they have been reordered with respect to the original x86 program.
• When such a speculative (reordered) memory atom accesses a memory page that is mapped to I/O space, the hardware raises an exception.
  • To identify pages that hold such volatile memory, an access protection bit can be added to the TLB entries.
• At this point, CMS performs a rollback to the previously committed state and interprets the code in order. If the faults recur too often, CMS regenerates the translation, this time without reordering the offending memory reference.
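The check and the retranslation policy can be sketched as follows. This is an illustrative Python model only: the 4 KB page size, the fault limit, and all names are assumptions, and the set of I/O pages stands in for the protection bit in the TLB.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages


class SpeculativeIOFault(Exception):
    """Raised when a reordered memory atom touches an I/O page."""


def check_memory_atom(address, reordered, io_pages):
    """Model of the hardware check on every load or store atom."""
    if reordered and (address >> PAGE_SHIFT) in io_pages:
        raise SpeculativeIOFault(hex(address))


class FaultPolicy:
    """Decide when a translation should be regenerated without reordering."""

    def __init__(self, max_faults=2):  # hypothetical limit
        self.max_faults = max_faults
        self.fault_counts = {}

    def should_retranslate(self, atom_id):
        """Call after each fault (and after rollback plus interpretation)."""
        self.fault_counts[atom_id] = self.fault_counts.get(atom_id, 0) + 1
        return self.fault_counts[atom_id] > self.max_faults
```

In-order atoms never fault in this model, which reflects the point that normal, non-reordered memory references pay no penalty.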

Data Speculation
• In particular, it is often desirable to be able to reorder load instructions ahead of store instructions.
• However, doing that is incorrect if the load happens to use data from the preceding store.
• The Crusoe host provides innovative alias hardware that addresses this problem.
• When the translator moves a load operation ahead of a store operation, it converts the load into a load-and-protect (which, in addition to loading data, also records the address and size of the data loaded) and the store into a store-under-alias-mask (which checks for protected regions).
• In the (unlikely) event that the store operation overwrites the previously loaded data, the processor raises an exception and the runtime system can take corrective action.

Data Speculation
• Example
• Code in the original x86 order
    ld   %r30,[%x]      // first load from location X
    ...
    st   %data,[%y]     // might overwrite location X
    ld   %r31,[%x]      // this accesses location X again
    use  %r31
• VLIW code
    ldp  %r30,[%x]      // load from X and protect it
    ...
    stam %data,[%y]     // this store traps if it writes X
    use  %r30           // can use data from first load
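The behavior of ldp and stam can be modeled with a list of protected address ranges. The Python below is a sketch, not a description of the actual alias hardware: the unbounded list, the byte-range overlap test, and all names are assumptions (the slides do not say how many ranges can be protected or how the mask selects them).

```python
class AliasFault(Exception):
    """Raised when a store hits a region protected by an earlier load."""


class AliasUnit:
    def __init__(self):
        self.protected = []  # list of (address, size) recorded by ldp

    def load_and_protect(self, memory, address, size):
        """ldp: load the data and record its address and size."""
        self.protected.append((address, size))
        return bytes(memory[address:address + size])

    def store_under_alias_mask(self, memory, address, data):
        """stam: trap if the store overlaps any protected region."""
        end = address + len(data)
        for p_address, p_size in self.protected:
            if address < p_address + p_size and p_address < end:
                raise AliasFault(hex(address))
        memory[address:end] = data

    def clear(self):
        """Forget all protected regions, e.g. at a commit."""
        self.protected.clear()
```

Because the check happens before the write, a faulting store leaves memory untouched, and the runtime can roll back and re-execute in the original order.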

Self Modifying Code
• At times, x86 instructions in memory get overwritten, either because the operating system is loading a new program, or because an application is using self-modifying code.
• When this happens to code that has already been translated, the Code Morphing Software needs to be notified to keep it from erroneously executing a translation for the old code.
• To this end, whenever the system translates a block of x86 code, it write-protects the page of x86 memory containing that code. It does so by setting a dedicated "translated" bit in that page's entry in the processor's memory management unit. (As with other details of the VLIW hardware, that bit is invisible to x86 software.)
• When a protected page is written to, the simplest remedy is to invalidate the affected translation(s). As the runtime system dynamically learns more about the program's behavior, it switches to more sophisticated strategies (self-revalidating and self-checking translations).
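The simplest remedy, invalidating every translation on a written page, can be sketched like this. The Python model below assumes 4 KB pages and treats "the page has an entry in the dictionary" as the translated bit; the class and method names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages


class PageProtection:
    def __init__(self):
        # page number -> entry addresses of translations built from that page;
        # a page is "translated" (write-protected) while it has an entry here.
        self.translations_by_page = {}

    def add_translation(self, entry_pc, code_start, code_length):
        first = code_start >> PAGE_SHIFT
        last = (code_start + code_length - 1) >> PAGE_SHIFT
        for page in range(first, last + 1):
            self.translations_by_page.setdefault(page, set()).add(entry_pc)

    def is_protected(self, address):
        return (address >> PAGE_SHIFT) in self.translations_by_page

    def on_write(self, address):
        """Return the translations to invalidate for a write to address."""
        victims = self.translations_by_page.pop(address >> PAGE_SHIFT, set())
        # A victim may span several pages; remove it everywhere.
        for page in list(self.translations_by_page):
            self.translations_by_page[page] -= victims
            if not self.translations_by_page[page]:
                del self.translations_by_page[page]
        return victims
```

Note that a write anywhere on a protected page invalidates all translations from that page, even if it only touched data, which is the cost the finer-grained schemes try to reduce.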

Self Modifying Code : Fine-Grain Protection
• The Crusoe processor provides hardware support for write-protecting memory at a granularity finer than full pages.
  • SW: manages the TLB
  • HW: a write-protect table that holds the fine-grained write-protection information
[Figure: SMC optimizations: fine-grain protection, self-revalidating translations, self-checking translations]

Self Modifying Code : Self Revalidating Translation
• Once a candidate translation for self-revalidation is identified, it is flagged.
• The next time it is encountered, it is re-translated in order to capture a copy of the x86 code it was translated from.
• Later, if the handler for a fine-grain protection fault determines that the translation(s) might be affected, it enables the translation's checking prologue and turns off protection to avoid faulting again.
• When the translation is next invoked, the prologue verifies that the x86 code corresponding to the translation has not changed, re-enables protection, re-verifies the x86 code, disables the prologue, and then executes the translation.

Self Modifying Code : Self Revalidating Translation
• Self-revalidation can be quite efficient if the writes are much less frequent than executions of the affected translations.
• However, this technique does not work if it is the translation itself that writes to its associated x86 region: the write occurs after the checking prologue has completed, causing a new fault and preventing forward progress.
• For such cases, the following technique for optimizing fault detection, self-checking, may work better.
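The prologue sequence for a self-revalidating translation can be sketched as a small state machine. The Python below is an illustrative model under assumptions: memory is a bytearray, the protection state is a plain flag rather than a hardware table, and all names (including StaleTranslation) are hypothetical.

```python
class StaleTranslation(Exception):
    """The x86 source bytes no longer match the captured copy."""


class SelfRevalidatingTranslation:
    def __init__(self, memory, start, length, body):
        self.memory = memory
        self.start = start
        self.length = length
        self.body = body                       # the translated code, as a callable
        self.snapshot = bytes(memory[start:start + length])
        self.protected = True                  # fine-grain write protection on
        self.prologue_enabled = False

    def _source_unchanged(self):
        current = bytes(self.memory[self.start:self.start + self.length])
        return current == self.snapshot

    def on_protection_fault(self):
        """Fault handler: the translation might be affected by a write."""
        self.prologue_enabled = True
        self.protected = False                 # avoid faulting on every write

    def invoke(self):
        if self.prologue_enabled:
            if not self._source_unchanged():   # verify
                raise StaleTranslation()
            self.protected = True              # re-enable protection
            if not self._source_unchanged():   # re-verify after re-protecting
                raise StaleTranslation()
            self.prologue_enabled = False      # disable the prologue
        return self.body()
```

The second verification covers a write that lands between the first check and the moment protection is back on; once the prologue is disabled, later invocations pay no checking cost until the next fault.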

Self Modifying Code : Self Checking Translation
• Instead of protecting the x86 page when creating a translation, it is possible to leave the memory page unprotected and have the translation itself check that the source x86 bytes have not changed, by fetching them and comparing them to their values when the translation was created.
• The checking code can be merged into the normal translation code.
• The overhead of self-checking a translation once is many times smaller than that of self-revalidating it once, although its average cost may be much higher if the translation is executed many times between protection faults.
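A self-checking translation can be sketched as a wrapper that compares the source bytes on every execution. The Python below is a minimal illustration assuming memory is a bytearray; make_self_checking and StaleTranslation are hypothetical names, and the check is kept separate from the body here, whereas the slide notes it can be merged into the translated code.

```python
class StaleTranslation(Exception):
    """The x86 source bytes differ from those seen at translation time."""


def make_self_checking(memory, start, length, body):
    """Wrap a translation body so it checks its own x86 source bytes."""
    expected = bytes(memory[start:start + length])  # captured at translation time

    def translation():
        # Runs on every invocation: no page protection is involved.
        if bytes(memory[start:start + length]) != expected:
            raise StaleTranslation()
        return body()

    return translation
```

The check runs on every execution, which is why the average cost can exceed that of self-revalidation when a translation runs many times between writes.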

Conclusion
• Crusoe broke new ground in using a codesigned VM to achieve power efficiency and design simplicity.
• Crusoe achieves
  • Low power
  • Mobility
  • Compatibility with x86
• However, its performance in real applications was not as good as advertised.

