Codesigned Virtual Machines Part <II>

Codesigned Virtual Machines Part <II> • 2006. 10. 18 • Yu, Young Jin • DCSLAB

Contents • Introduction • Case Study (1) • Transmeta Crusoe • Case Study (2) • IBM AS/400

Applying Codesigned VMs • The advantages (performance, power efficiency, flexibility) can be achieved at two levels: • At the macro level: entirely new ISAs • VLIW: Transmeta Crusoe, IBM Daisy/BOA • OO source ISA: IBM AS/400 • At the micro level: the implementation of specific performance enhancements • Instruction reordering, …

Case Study (1): Transmeta Crusoe

Introduction • In January 2000, Transmeta Corp. introduced the Crusoe processors. • Remarkably low power consumption • Perhaps unexpectedly, the new technology is fundamentally software-based. • The power savings come from replacing large numbers of transistors with software.

The Crusoe Processor • Consists of a hardware engine logically surrounded by a software layer. • H/W: the engine • Is a VLIW CPU capable of executing up to four operations in each clock cycle. • Its native instruction set bears no resemblance to the x86 instruction set. • S/W: Code Morphing Software (CMS) • Dynamically “morphs” x86 instructions into VLIW instructions

The Crusoe Processor

The Crusoe Processor • CMS technology changes the entire approach to designing microprocessors. • It demonstrates that practical microprocessors can be implemented as HW-SW hybrids. • It expands the design space. • Development teams may enlist software experts, working in parallel with hardware engineers, to bring products to market faster.

Technology Perspective • CMS decouples the x86 ISA from the underlying processor hardware. • Each new CPU design only requires a new version of the Code Morphing Software to translate x86 instructions to the new CPU’s native instruction set. • Because the CMS would typically reside in standard Flash ROMs on the motherboard, improved versions can even be downloaded into processors in the field.

x86 vs. Crusoe

Crusoe Processor Fundamentals • VLIW engine • Two integer units, a floating-point unit, a memory (load/store) unit, and a branch unit • Molecule: a long (64- or 128-bit) instruction word containing up to four RISC-like instructions, called atoms. • All atoms within a molecule are executed in parallel, and the molecule format directly determines how atoms get routed to functional units. • This greatly simplifies the decode and dispatch hardware.

Crusoe Processor Fundamentals • The integer register file • Has 64 registers, %r0 through %r63 • CMS allocates some registers to hold x86 state while others contain state internal to the system, or can be used as temporary registers.

Crusoe Processor Fundamentals • To keep the processor running at full speed, molecules are packed as fully as possible with atoms.
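The packing of atoms into molecules can be illustrated with a minimal sketch. It assumes a greedy, in-order scheduler that tracks only functional-unit limits and true (read-after-write) register dependences, with every result usable in the next molecule. The names `Atom`, `Molecule`, and `pack` are hypothetical and not part of CMS; memory ordering, register reuse, and real latencies are deliberately ignored.

```python
from dataclasses import dataclass, field
from typing import Optional

# Functional units available per molecule, as listed on the slide:
# two integer units, one floating-point unit, one memory unit, one branch unit.
UNIT_SLOTS = {"alu": 2, "fpu": 1, "mem": 1, "br": 1}
MAX_ATOMS = 4  # a molecule holds up to four atoms


@dataclass
class Atom:
    name: str
    unit: str                  # functional-unit class that executes the atom
    dest: Optional[str]        # register written, or None (stores, branches)
    srcs: tuple = ()           # registers read


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)

    def has_room(self, unit):
        used = sum(1 for a in self.atoms if a.unit == unit)
        return len(self.atoms) < MAX_ATOMS and used < UNIT_SLOTS[unit]


def pack(atoms):
    """Greedily place each atom in the earliest molecule that has a free
    slot of the right kind and comes after all of its producers.
    Assumption: only read-after-write dependences are tracked."""
    molecules = []
    ready_at = {}  # register -> index of the first molecule that may read it
    for atom in atoms:
        i = max((ready_at.get(r, 0) for r in atom.srcs), default=0)
        while True:
            if i == len(molecules):
                molecules.append(Molecule())
            if molecules[i].has_room(atom.unit):
                break
            i += 1
        molecules[i].atoms.append(atom)
        if atom.dest is not None:
            ready_at[atom.dest] = i + 1  # assumed one-molecule latency
    return molecules
```

Independent atoms fill the first molecule, while an atom that consumes their results is pushed to a later one; fuller molecules mean fewer cycles for the same x86 code.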

Conventional superscalar… • This type of processor hardware is much more complex than the Crusoe processor’s simple VLIW engine.

Code Morphing Software • CMS • Is fundamentally a dynamic translation system • In this case, x86 ISA -> VLIW ISA • “x86 ISA” is the only thing x86 code sees. • The only program written directly for the VLIW engine is the Code Morphing Software itself.

Hierarchy

Crusoe’s VLIW instr. Scheduling

Code Morphing Software

CMS Memory Layout

CMS: Drawing the HW-SW line • Choosing which functions to implement in HW and which in SW is a major engineering challenge. • It involves issues such as cost and complexity, overall performance, and power consumption. • For example, the HW-SW line might be drawn differently for a high-end server processor.

CMS: Decoding and Scheduling • Code Morphing can translate an entire group of x86 instructions at once, whereas a superscalar x86 processor translates single instructions in isolation. • The Code Morphing approach can amortize the cost of translation over many executions, allowing it to use much more sophisticated translation and scheduling algorithms.

CMS: Caching • The translation cache resides in a separate memory space that is inaccessible to x86 code. • As an application executes, Code Morphing “learns” more about the program and improves it so that it executes faster and faster. • As a result, some benchmarks do not accurately predict the performance of the Crusoe processor.

CMS: Filtering • The translation system needs to choose carefully how much effort to spend on translating and optimizing a given piece of x86 code. • A wide choice of execution modes: • Interpretation only (no translation) • Simple-minded code generation • Highly optimized code generation
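The choice among the three execution modes can be sketched as a counter-based filter. This is a minimal illustration, assuming that the execution count of a block is the only input; the class name `Filter` and both threshold values are hypothetical and are not taken from CMS.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
SIMPLE_THRESHOLD = 50       # executions before simple code generation
OPTIMIZE_THRESHOLD = 1000   # executions before full optimization


class Filter:
    """Counts executions per x86 block and picks an execution mode."""

    def __init__(self):
        self.exec_counts = Counter()

    def record_and_choose(self, block_addr):
        self.exec_counts[block_addr] += 1
        n = self.exec_counts[block_addr]
        if n >= OPTIMIZE_THRESHOLD:
            return "optimize"    # highly optimized code generation
        if n >= SIMPLE_THRESHOLD:
            return "simple"      # simple-minded code generation
        return "interpret"       # interpretation only, no translation
```

Rarely executed code never pays the translation cost, while hot code earns progressively more optimization effort.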

CMS: Prediction and Path Selection • CMS can gather feedback • Instrumentation profiling • The translator adds code to collect info. • This data can be used later to decide when and what to optimize and translate. • For example, if a given branch is highly biased,…
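Instrumentation profiling of a branch can be sketched as a pair of counters updated by code the translator inserts. This is a minimal illustration; the `BranchProfile` class and the 0.9 bias threshold are assumptions, not CMS internals.

```python
class BranchProfile:
    """Taken / not-taken counters for one branch (hypothetical structure)."""

    def __init__(self):
        self.taken = 0
        self.not_taken = 0

    def record(self, taken):
        # Called by the instrumentation code added to a translation.
        if taken:
            self.taken += 1
        else:
            self.not_taken += 1

    def bias(self):
        total = self.taken + self.not_taken
        return max(self.taken, self.not_taken) / total if total else 0.0

    def likely_path(self, threshold=0.9):
        """Return the dominant direction if the branch is highly biased,
        otherwise None. The threshold is an assumed value."""
        if self.bias() < threshold:
            return None
        return "taken" if self.taken >= self.not_taken else "not_taken"
```

A later retranslation could consult `likely_path()` when deciding which path to optimize.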

CMS: Making a Translation • Front end • Well-known optimizations • Scheduling • The molecules explicitly encode the instruction-level parallelism, hence they can be executed by a simple VLIW engine.

HW Support for Code Morphing • Exceptions • The “precise exception” problem: after reordering, an operation may trap “too soon”, before earlier x86 instructions have completed. • Solution: use shadow registers.

HW Support for Code Morphing • All registers holding x86 state are shadowed (a working copy and a shadow copy). • Normal atoms only update the working copy of the register. • “commit” operation: working -> shadow regs. • “rollback” operation: shadow -> working regs. • Undoing changes to memory • Store data is held in a “gated store buffer” • Commit releases the buffered stores to memory; rollback discards them.
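The commit/rollback mechanism can be modeled in a few lines. This is a minimal software sketch of behavior that Crusoe implements in hardware; the class and method names are hypothetical, and forwarding of buffered stores to later loads is an assumption made so that the model behaves sensibly.

```python
class CommitRollbackState:
    """Working/shadow registers plus a gated store buffer (toy model)."""

    def __init__(self, regs):
        self.working = dict(regs)   # updated by normal atoms
        self.shadow = dict(regs)    # last committed x86 state
        self.memory = {}            # committed memory contents
        self.store_buffer = []      # stores held back until commit

    def write_reg(self, name, value):
        self.working[name] = value  # the shadow copy is left untouched

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Assumption: the youngest buffered store to addr is forwarded.
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self):
        # working -> shadow, and release gated stores to memory in order.
        self.shadow = dict(self.working)
        for a, v in self.store_buffer:
            self.memory[a] = v
        self.store_buffer.clear()

    def rollback(self):
        # shadow -> working, and drop every uncommitted store.
        self.working = dict(self.shadow)
        self.store_buffer.clear()
```

After a rollback the state matches the last commit point, so CMS can re-execute the original x86 instructions in order and take the exception at the precise instruction.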

HW Support for Code Morphing • Alias Hardware • When the translator moves a load operation ahead of a store operation, it converts the load into a load-and-protect and the store into a store-under-alias-mask. • This makes it safe to reorder loads and stores: if the store touches a protected address, the hardware raises an exception and CMS can recover.

HW Support for Code Morphing • Alias Hardware
<Original Code>
    St 0(r1), r2
    …
    Ld r3, 0(r4)
    …
    St 0(r5), r6
    …
    Ld r7, 0(r8)
    Add r9, r3, r7
<Rescheduled Code> - Unsafe
    Ld r3, 0(r4)
    Ld r7, 0(r8)
    St 0(r1), r2
    …
    …
    St 0(r5), r6
    …
    Add r9, r3, r7
<Rescheduled Code> - Protected
    Ldp r3, 0(r4)    x
    Ldp r7, 0(r8)    x x
    Stam 0(r1), r2
    …
    …
    Stam 0(r5), r6
    …
    Add r9, r3, r7
* The ldp/stam pair is an excellent example of the interplay between the hardware and the software in a codesigned VM.
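The ldp/stam check can be modeled as a small table of protected addresses. This is a minimal sketch under stated assumptions: the table size, the exact-address comparison (real hardware would compare address ranges), and the names `AliasHardware` and `AliasViolation` are all hypothetical.

```python
class AliasViolation(Exception):
    """Raised when a store hits an address protected by an earlier ldp."""


class AliasHardware:
    def __init__(self, n_entries=4):
        # One protected address per alias entry; None means unused.
        self.protected = [None] * n_entries

    def ldp(self, entry, addr, memory):
        """Load-and-protect: perform the load and record its address."""
        self.protected[entry] = addr
        return memory.get(addr, 0)

    def stam(self, mask, addr, value, memory):
        """Store-under-alias-mask: check the entries selected by mask,
        then store. Assumption: addresses are compared for equality."""
        for entry, paddr in enumerate(self.protected):
            if mask & (1 << entry) and paddr == addr:
                raise AliasViolation(hex(addr))
        memory[addr] = value

    def clear(self):
        self.protected = [None] * len(self.protected)
```

On a violation the store is not performed; in the real system CMS would roll back to the last commit point and run a more conservative version of the code.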

HW Support for Code Morphing • Coping with Self-Modifying Code • x86 instructions in memory get overwritten, either • because the OS is loading a new program, or • because an application is using self-modifying code. • When this happens to code that has already been translated, the CMS needs to be notified to keep it from erroneously executing a translation of the old code.

HW Support for Code Morphing • Coping with Self-Modifying Code • Whenever the system translates a block of x86 code, it write-protects the page. • It does so by setting a dedicated “translated” bit in that page’s entry in the processor’s memory management unit. • That bit is invisible to x86 software. • When a protected page is written to, the simplest remedy is to invalidate the affected translations.
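The “translated” bit and the simplest remedy, invalidation, can be sketched as follows. This is a toy model: the 4 KB page size, the `TranslationCache` class, and its methods are assumptions for illustration and do not describe the actual CMS data structures.

```python
PAGE_SIZE = 4096  # assumed page size


class TranslationCache:
    def __init__(self):
        self.translations = {}         # x86 start address -> (length, code)
        self.translated_pages = set()  # pages whose "translated" bit is set

    def add_translation(self, x86_addr, length, code):
        self.translations[x86_addr] = (length, code)
        first = x86_addr // PAGE_SIZE
        last = (x86_addr + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.translated_pages.add(page)  # write-protect the page

    def on_write(self, addr):
        """Handle a write to addr; return the start addresses of the
        translations that were invalidated (empty if the page was not
        protected)."""
        page = addr // PAGE_SIZE
        if page not in self.translated_pages:
            return []
        stale = [
            start for start, (length, _) in self.translations.items()
            if start // PAGE_SIZE <= page <= (start + length - 1) // PAGE_SIZE
        ]
        for start in stale:
            del self.translations[start]
        # No translation covers this page any more, so clear its bit.
        # Other pages of a multi-page translation stay protected, which
        # is conservative but safe.
        self.translated_pages.discard(page)
        return stale
```

Invalidating by page is coarse: a write to data that merely shares a page with translated code also discards the translation.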

Example: A complex translation

Case Study (2): IBM AS/400

From IBM’s homepage… • The accelerating rate of change of both hardware and software technologies necessitates that the system you select has been designed with the future in mind. • “We believe that the IBM AS/400 will be the number one choice !”

Introduction • The design of the AS/400 insulates application programs from changing hardware characteristics through a layer of microcode. • The interface: TIMI (Technology Independent Machine Interface) • The microcode layer: LIC (Licensed Internal Code) • In 1995, the AS/400 changed its processor technology (CISC -> 64-bit RISC) • Existing applications required no recompiling or rewriting • Not only did they run, but they ran as full 64-bit programs.

AS/400 architecture • The TIMI layer separates the hardware and the LIC from the OS. • Instructions are translated to a specific hardware instruction set as part of the back end of the compilation process.

AS/400 architecture • TIMI is a virtual instruction set. • All user-mode programs are stored as TIMI instructions. • Conceptually somewhat similar to the VM architecture of programming environments such as Smalltalk, Java, and .NET • Stored within the final program object • Object-based ISA

Memory Architecture • The TIMI has a memory architecture composed of objects. • The objects are completely isolated from one another and can only be accessed via pointers. • Actual address values contained in pointers are not made visible to SW above TIMI. • The implementation of the object-based memory is done entirely below the TIMI.

Memory Architecture • Protecting the integrity of pointers is an essential part of any object-based system. • Object pointers are encoded in 128 bits. • Upper 64 bits: type info, authorization, … • Lower 64 bits: 64-bit PowerPC virtual address • A significant extension to the PowerPC memory architecture • Adds protection for object pointers • Load-pointer and store-pointer instructions • A 65th bit indicates whether the location contains a pointer
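The role of the tag bit can be illustrated with a toy memory model. This is a simplified sketch: it assumes one tag per 64-bit word, that a 128-bit pointer occupies two consecutive tagged words, and that any ordinary store clears the tag. The names `TaggedMemory` and `InvalidPointer` are hypothetical.

```python
MASK64 = (1 << 64) - 1


class InvalidPointer(Exception):
    """Raised when a location does not hold an intact pointer."""


class TaggedMemory:
    def __init__(self):
        self.words = {}  # word index -> (64-bit value, tag bit)

    def store_word(self, idx, value):
        # An ordinary store clears the tag, so user code cannot forge
        # or patch a pointer by writing raw data.
        self.words[idx] = (value & MASK64, False)

    def store_pointer(self, idx, upper, lower):
        # Store-pointer: both halves are written with the tag set.
        self.words[idx] = (upper & MASK64, True)
        self.words[idx + 1] = (lower & MASK64, True)

    def load_pointer(self, idx):
        # Load-pointer: succeeds only if both halves are still tagged.
        upper, tag_hi = self.words.get(idx, (0, False))
        lower, tag_lo = self.words.get(idx + 1, (0, False))
        if not (tag_hi and tag_lo):
            raise InvalidPointer(idx)
        return upper, lower
```

Overwriting either half with `store_word` makes the next `load_pointer` fail, which is the integrity property the slide describes.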

Instruction Set • TIMI instruction format [Figure: six fields of 2, 2, 3, 3, 3, and 3 bytes; five of the fields are marked optional] • Multiway conditional branch • This is the “architected representation” • It is translated to an implementation-dependent form, and it does the work of multiple RISC instructions.

Instruction Set • [Figure: ODT Direction Vector and ODT Entry String] • Instructions such as add numeric and multiply numeric are generic. • Entries in the ODT indicate the types of operands and the data flow. • The actual storage locations are assigned only after the TIMI is translated.

Input/Output • The presence of I/O processors (IOPs) simplifies the task of pushing the device-dependent aspects out of the central processor.

Input/Output • At the level of the TIMI: • There is no secondary (disk) storage; rather, it is part of the unified memory architecture. • All disk management SW, drivers, etc. exist in the implementation-dependent part of the system. • The OS interacts with SW below the TIMI level (and with I/O devices) through instructions that operate on the TIMI-level objects.

Input/Output • TIMI-Supported Objects • Access group, Context, … • Authorization List, User Profile, … • Dictionary, Index, … • Queue, Mode descriptor, … • Logical unit descriptor, … • Module, Program, …

Code Translation & Concealment • HLL -> Template (TIMI + ODT) -> Program Object • The contents of the program object cannot be directly observed above the TIMI level. • Materialization: giving the program back to the user in its original, machine-independent form • As a result, the platform switch is transparent to the user.

Code Translation & Concealment • [Figure: the compiler turns an HLL program into a template (TIMI, ODT) held in a space object; the translator at the TIMI level turns it into a program object that contains the template (TIMI, ODT) and the implementation-dependent executable code]
