
Codesigned Virtual Machines

Presentation Transcript


  1. Codesigned Virtual Machines Shin Gyu, Kim 2006. 10. 16

  2. Codesigned VM
  [Figure: three HW/SW stacks compared. (a) Conventional HW/SW interface: application binary / native ISA / hardware. (b) Conventional virtual machine interface: application binary / source ISA / VM software / target ISA / hardware. (c) HW/SW codesigned virtual machine: application binary / source ISA / VM software running on VM hardware (target ISA).]
  • SW becomes part of the HW → divide the implementation of HW & SW in an optimal way.

  3. Codesigned VM & System VM (1/2)
  • Codesigned VM & system VM: both support an entire system (OS + applications) → a codesigned VM has the form of a system VM.
  • But in codesigned VMs:
    • Not intended to virtualize HW resources.
    • Not intended to support multiple VM environments.
  • The goals include performance, power efficiency, and design simplicity.

  4. Codesigned VM & System VM (2/2)
  • We refer to the VM SW as a VM monitor (VMM).
  [Figure: visible memory holds the application and OS in the source ISA (IA-32); concealed memory holds the VMM, translator, and code cache in the target ISA (Crusoe), all running on the hardware.]

  5. Codesigned VM & Process VM
  • Codesigned VM vs. process VM:
    • Similarity: emulate the source ISA, dynamic translation, code cache.
  • But in codesigned VMs:
    • Intrinsic compatibility is at the ISA level (not the ABI level) → both the user-level and the system-level ISA must be emulated.
    • Improved performance, power efficiency, and design simplicity; compatibility is just a requirement, not a motivation.

  6. Codesigned VM & Superscalar Processor
  • Codesigned VM vs. superscalar processor:
    • Similarity: both perform translation (source ISA → target ISA).
  • But in codesigned VMs:
    • The translation is done in SW: lower cost, smaller die size, design simplicity, many more optimization opportunities, lower power consumption.
    • Inter-instruction optimization is possible.

  7. Code Translation Methods
  [Figure: source → target translation done two ways. Code translation by SW is context-sensitive: a block of source instructions (instr. 1 … instr. n) is translated as a unit into a block of target instructions (instr. A … instr. M). Code translation by HW is context-free: each source instruction is cracked individually into micro-ops (micro-op a, b, c, …).]
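As a rough illustration of context-sensitive software translation, here is a minimal C sketch (not the Daisy/Crusoe translator; the toy source and target opcodes are invented). Because the translator sees the whole source block, it can apply an inter-instruction optimization that a context-free hardware cracker cannot:

    /* Toy software block translator, invented for illustration. */
    #include <stdio.h>

    enum src_op { SRC_LOAD, SRC_ADD, SRC_STORE, SRC_BRANCH };
    enum tgt_op { TGT_LD, TGT_ADD, TGT_ST, TGT_BR, TGT_NOP };

    struct src_insn { enum src_op op; int r1, r2; };   /* r1 = reg, r2 = reg/addr */
    struct tgt_insn { enum tgt_op op; int r1, r2; };

    /* Translate one source block (ends at a branch) into target code.
     * Because the whole block is visible, a load that immediately follows a
     * store to the same address can be elided (here replaced by a NOP; a real
     * translator would emit a register move instead). */
    static int translate_block(const struct src_insn *src, int n,
                               struct tgt_insn *out)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            switch (src[i].op) {
            case SRC_LOAD:
                if (i > 0 && src[i-1].op == SRC_STORE && src[i-1].r2 == src[i].r2)
                    out[m++] = (struct tgt_insn){ TGT_NOP, 0, 0 };
                else
                    out[m++] = (struct tgt_insn){ TGT_LD, src[i].r1, src[i].r2 };
                break;
            case SRC_ADD:    out[m++] = (struct tgt_insn){ TGT_ADD, src[i].r1, src[i].r2 }; break;
            case SRC_STORE:  out[m++] = (struct tgt_insn){ TGT_ST,  src[i].r1, src[i].r2 }; break;
            case SRC_BRANCH: out[m++] = (struct tgt_insn){ TGT_BR,  src[i].r1, 0 };
                return m;                       /* the block ends at the branch */
            }
        }
        return m;
    }

    int main(void)
    {
        struct src_insn block[] = {
            { SRC_ADD, 1, 2 }, { SRC_STORE, 1, 3 }, { SRC_LOAD, 4, 3 },
            { SRC_BRANCH, 0, 0 },
        };
        struct tgt_insn tgt[8];
        printf("translated %d target instructions\n",
               translate_block(block, 4, tgt));
        return 0;
    }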

  8. Contents
  • Memory & Register State Mapping
  • Self-Modifying Code & Self-Referencing Code
  • Support for Code Caching
  • Implementing Precise Traps
  • Input/Output

  9. Register State Mapping
  • Register state mapping is the easier part.
  • The host register file can be made large enough to accommodate the guest's.
  [Figure: the PowerPC guest registers map into the larger Daisy host register file; the extra host registers hold scratch values, speculative results, constants, and pointers.]

  10. Memory State Mapping
  • Concealed memory
    • A reserved region for the VMM, the code cache, and other data used by the VMM.
    • Never visible to guest SW; this is possible because the VMM takes control at boot time.
    • Fixed size, normally diskless (to simplify the system design).
    • The VMM may be stored in ROM.

  11. Concealed Memory (1)
  • Memory system in a codesigned VM
    • The I-cache only holds target ISA instructions.
  [Figure: the processor core's I-cache is filled only from the code cache in concealed memory (which also holds the VMM code and VMM data); the D-cache is backed by conventional memory, which holds the source ISA code and source ISA data.]

  12. Concealed Memory (2)
  • Memory mapping for concealed memory, option 1: concealed logical memory shares the address space with the guest.
    • The host address space must be enlarged.
  [Figure: concealed logical addresses go through the concealed memory mapping to concealed real memory; conventional logical addresses go through the guest memory mapping to conventional real memory, all within one enlarged address space.]

  13. Concealed Memory (3)
  • Memory mapping for concealed memory, option 2: two separate logical address spaces.
    • Each load/store must select which mapping to use.
    • This can be controlled by the VMM.
  [Figure: concealed logical addresses go through the concealed memory mapping to concealed real memory; conventional logical addresses go through the guest memory mapping to conventional real memory.]

  14. Concealed Memory (4)
  • Memory mapping for concealed memory, option 3: use real addressing for concealed memory (sketched below).
    • A special case of option 2.
    • Selected either by a separate set of load/store opcodes or by a mode bit.
  [Figure: the concealed real address space accesses concealed real memory directly; conventional logical addresses still go through the guest memory mapping to conventional real memory.]
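A minimal C sketch of how a load/store might select between the two kinds of access under option 3; the structures, sizes, and the mode flag standing in for a mode bit or separate opcodes are all invented for illustration:

    #include <stdint.h>

    #define PAGE_SHIFT  12
    #define GUEST_PAGES 256

    /* Toy backing store and guest page table, invented for illustration. */
    static uint8_t  real_memory[GUEST_PAGES << PAGE_SHIFT];
    static uint32_t guest_page_table[GUEST_PAGES];  /* guest virtual page -> real page */

    enum access_mode { ACCESS_CONCEALED, ACCESS_GUEST };

    /* Resolve an address for a load/store. Concealed accesses use real
     * addressing (no mapping at all); guest accesses go through the guest's
     * page mapping. The mode argument stands in for a mode bit or a separate
     * load/store opcode. */
    static uint8_t *resolve(uint32_t addr, enum access_mode mode)
    {
        if (mode == ACCESS_CONCEALED)
            return &real_memory[addr % sizeof(real_memory)];  /* real address, used directly */

        uint32_t vpn  = (addr >> PAGE_SHIFT) % GUEST_PAGES;
        uint32_t off  = addr & ((1u << PAGE_SHIFT) - 1);
        uint32_t page = guest_page_table[vpn] % GUEST_PAGES;
        return &real_memory[(page << PAGE_SHIFT) + off];      /* guest mapping */
    }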

  15. Self-Modifying Code (1)
  • Basically, use the same technique as in a process VM (Chapter 3).
    • It is easiest to keep the guest OS's virtual-to-real page mapping intact.
    • Write-protect the guest code region → any attempt to write into that region causes a trap → the VM can then handle it.
  • But in codesigned VMs:
    • A system call cannot be used to write-protect pages, because it is the guest OS itself that manages the page tables.

  16. Self-Modifying Code (2)
  • TLB
    • The TLB is managed by the VMM.
    • An additional bit indicates "write-protect".
    • The VMM sets the write-protect bit whenever an entry for a code page is loaded into the TLB.
    • The VMM maintains a table of all guest virtual pages that contain translated code.

  17. Self-Modifying Code (3)
  • Special hardware support
    • In the Transmeta Crusoe, a special hardware structure speeds up fine-grained write-protection checking.
    • Goal: determine whether a store really writes into a translated code region.
    • Virtual address → (TLB) → real address → (filtered by the write-protect table) → write fault or not.

  18. Self-Modifying Code (4)
  [Figure: the source address's virtual page number indexes the TLB, whose WP bits can raise a page-level write-protect fault; the resulting physical page number indexes the write-protect table, and comparison logic checks the page-offset bits against the wp bit mask to raise a fine-grained source-code write fault on a hit.]
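A C sketch of the fine-grained filtering step in the spirit of the structure above; the table layout, sizes, and 128-byte sub-page regions are assumptions, not the actual Crusoe design:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT    12
    #define REGION_SHIFT  7           /* 128-byte regions -> 32 regions per page */
    #define WP_TABLE_SIZE 256

    struct wp_entry {
        uint64_t phys_page;           /* physical page number                     */
        uint32_t region_mask;         /* bit i set => region i holds translated code */
        bool     valid;
    };

    static struct wp_entry wp_table[WP_TABLE_SIZE];

    /* Returns true if the store should raise a write fault to the VMM. */
    static bool wp_check(uint64_t phys_addr)
    {
        uint64_t ppn    = phys_addr >> PAGE_SHIFT;
        unsigned region = (phys_addr >> REGION_SHIFT) &
                          ((1u << (PAGE_SHIFT - REGION_SHIFT)) - 1);
        struct wp_entry *e = &wp_table[ppn % WP_TABLE_SIZE];

        if (!e->valid || e->phys_page != ppn)
            return false;                        /* page holds no translated code */
        return (e->region_mask >> region) & 1u;  /* fault only on a code sub-region */
    }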

  19. Self-Modifying Code (5)
  • I/O writes to guest code memory must also be caught.
    • For translated code in the code cache, keep track of all the corresponding real guest pages.
    • Maintain a hardware table for I/O writes, with entries for all the real pages that hold guest code.
    • An I/O store to any of these pages causes an interrupt to the VMM, which then flushes the affected translated code.

  20. Support for Code Caching (1)
  • Code cache performance is the most important factor.
    • SPC → (hash) → TPC → (if hit) access the code cache (sketched below).
    • This involves multiple memory accesses plus an indirect jump.
  • For direct jumps and branches:
    • Superblock chaining eliminates the table lookup.
    • But what about indirect jumps?
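For reference, a minimal C sketch of the SPC → TPC map lookup and dispatch; the table size, hash, and the translate_and_insert entry point are invented. This software path (hash, probe, indirect jump) is exactly the overhead that the JTLB and D-RAS hardware described later try to remove:

    #include <stdint.h>
    #include <stddef.h>

    #define MAP_SIZE 4096                        /* power of two; size invented */

    typedef void (*translated_block)(void);      /* entry point in the code cache */

    struct map_entry {
        uint64_t         spc;                    /* source PC (guest address)   */
        translated_block tpc;                    /* target PC (translated code) */
    };

    static struct map_entry map_table[MAP_SIZE];

    static size_t hash_spc(uint64_t spc)
    {
        return (size_t)((spc >> 2) & (MAP_SIZE - 1));   /* drop alignment bits */
    }

    /* The SW map lookup: several memory accesses just to find the TPC. */
    static translated_block map_lookup(uint64_t spc)
    {
        struct map_entry *e = &map_table[hash_spc(spc)];
        return (e->spc == spc && e->tpc) ? e->tpc : NULL;
    }

    /* Hypothetical translator entry: build a translation, insert it in the map. */
    extern translated_block translate_and_insert(uint64_t spc);

    /* Dispatch on an indirect jump that was not chained or predicted. */
    static void dispatch(uint64_t spc)
    {
        translated_block f = map_lookup(spc);
        if (!f)
            f = translate_and_insert(spc);
        f();                                     /* indirect jump into the code cache */
    }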

  21. Support for Code Caching (2)
  • To reduce table lookup overhead, use SW-based jump target prediction, i.e. inline compare-and-branch code emitted into the translation:

        if (Rx == #addr_1) goto #target_1
        else if (Rx == #addr_2) goto #target_2
        else map_lookup(Rx)

  • But:
    • If the SW prediction is incorrect, the time spent on the comparisons is wasted.
    • Many indirect jumps are difficult to predict (e.g. returns).
  • Hardware support for code caching:
    • JTLB (jump translation lookaside buffer)
    • D-RAS (dual-address return address stack)

  22. JTLB (1)
  • The JTLB is "a specially designed HW cache of map table entries."
  [Figure: the SPC is hashed to index the JTLB; the stored tag is compared against the SPC's tag to signal hit or miss, and a MUX selects the corresponding TPC.]

  23. JTLB (2)
  • A JTLB_Lookup instruction followed by a Lookup_Jump instruction with prediction:

        JTLB_Lookup Ri, Rj, Rk    ; Ri <- TPC, Rj <- hit/miss, Rk holds the SPC
        Jump Ri, Rj == 0          ; jump to the TPC unless the JTLB missed
        Jump map_lookup           ; on a miss, fall back to the SW map lookup

  • Predict using the BTB (branch target buffer):
    • JTLB hit and prediction correct → OK.
    • JTLB hit but misprediction → redirect fetch to the jump-target TPC from the JTLB.
    • JTLB miss → redirect fetch to the fall-through address.
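A C sketch of what the JTLB probe computes, with an invented direct-mapped organization; the real hardware performs this lookup in the fetch pipeline rather than in software:

    #include <stdint.h>
    #include <stdbool.h>

    #define JTLB_SETS 256                  /* direct-mapped; size invented */

    struct jtlb_entry { uint64_t tag; uint64_t tpc; bool valid; };
    static struct jtlb_entry jtlb[JTLB_SETS];

    struct jtlb_result { uint64_t tpc; bool hit; };

    static struct jtlb_result jtlb_lookup(uint64_t spc)
    {
        unsigned index = (unsigned)((spc >> 2) & (JTLB_SETS - 1));  /* hash of SPC */
        uint64_t tag   = spc >> 2;                                  /* identifying bits */
        struct jtlb_entry *e = &jtlb[index];

        if (e->valid && e->tag == tag)
            return (struct jtlb_result){ e->tpc, true };  /* Ri <- TPC, Rj <- hit */
        return (struct jtlb_result){ 0, false };          /* miss -> SW map lookup */
    }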

  24. JTLB (3)
  [Figure: pipeline handling of a Lookup_Jump instruction. The register identifier reads the SPC from the register file and probes the JTLB (tag compare → jump-destination TPC), while the instruction's PC probes the BTB for a predicted jump-target TPC, which becomes the next predicted fetch TPC. If the JTLB misses, fetch is redirected to the fall-through address; if it hits and matches the BTB prediction, the prediction is correct; if it hits but mismatches, fetch is redirected to the jump-target TPC from the JTLB.]

  25. D-RAS (1)
  • The RAS (return address stack) helps solve the return-jump prediction problem.
    • Push the fall-through PC onto a stack at each call.
  • But in a codesigned VM:
    • We need the TPC (not the SPC).
    • If the procedure call is at the end of a translated superblock, the fall-through return address may not be correct.
  [Figure: translation block A ends in a call, so the return in translation block X has no obvious target TPC ("Call ???").]

  26. D-RAS (2)
  • A specialized dual-address RAS is used.
  [Figure: a Push_DRAS instruction (opcode, SPC, TPC) pushes an SPC/TPC pair onto the dual-address return address stack; a return instruction pops it, yielding both the predicted SPC and the predicted TPC.]
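A C sketch of the dual-address RAS behavior; the depth and the wrap-around handling are invented for illustration:

    #include <stdint.h>

    #define DRAS_DEPTH 16                  /* depth invented */

    struct dras_entry { uint64_t spc; uint64_t tpc; };

    static struct dras_entry dras[DRAS_DEPTH];
    static unsigned dras_top;              /* wraps around, like a HW RAS */

    /* Executed for a Push_DRAS instruction at a translated call site:
     * record both the source return address and its code-cache counterpart. */
    static void dras_push(uint64_t return_spc, uint64_t return_tpc)
    {
        dras[dras_top % DRAS_DEPTH] = (struct dras_entry){ return_spc, return_tpc };
        dras_top++;
    }

    /* Executed when a translated return is fetched: pop the predicted pair.
     * The SPC is still checked against the actual return address; the TPC
     * redirects fetch into the code cache without a map lookup. */
    static struct dras_entry dras_pop(void)
    {
        dras_top--;
        return dras[dras_top % DRAS_DEPTH];
    }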

  27. Implementing Precise Traps
  • Use techniques similar to those in Chapters 3 and 4:
    • Maintain SW checkpoints.
    • Allow code motion by extending register live ranges.
    • When a trap occurs → interpret from the checkpoint onward to establish the correct state.
  • In a codesigned VM:
    • There are enough registers → live ranges can be extended with less register pressure.
    • Restrictions on code motion are relaxed.

  28. HW Support for Checkpoints (1)
  • Use HW to set a checkpoint each time a translation block is entered.
  [Figure: a checkpoint is set at the entry of each translation block A, B, C, …, N.]

  29. HW Support for Checkpoints (2)
  • If a trap occurs:
    • HW restores the state from the checkpoint at the beginning of the block.
    • Interpretation of the source code is then used to provide the precise exception state.
  [Figure: a trap inside translation block B restores the checkpoint, and the source code is interpreted forward to the trapping instruction.]

  30. HW Support for Checkpoints (3)
  • When a new translation block is entered:
    • The state from the previous block is "committed".
    • A new checkpoint is set.
  • Setting a register checkpoint (sketched below):
    • When the checkpoint is set, the registers are copied to shadow registers.
    • When a trap occurs, the shadow registers are copied back to the working registers.
    • This copying is done very fast.
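A C sketch of the register checkpoint/commit behavior; the register count is invented, and memcpy merely models what the hardware does as a fast parallel copy:

    #include <stdint.h>
    #include <string.h>

    #define NUM_GUEST_REGS 32              /* count invented for illustration */

    /* Working guest registers and their shadow copies. Scratch and
     * speculative-result registers deliberately have no shadows (see slide 32). */
    static uint64_t guest_regs[NUM_GUEST_REGS];
    static uint64_t shadow_regs[NUM_GUEST_REGS];

    /* A new translation block is entered: the previous block's state is
     * committed and becomes the new checkpoint. */
    static void set_checkpoint(void)
    {
        memcpy(shadow_regs, guest_regs, sizeof guest_regs);
    }

    /* A trap was detected inside the block: roll the registers back to the
     * checkpoint, then interpret the source code forward to the trapping point. */
    static void restore_checkpoint(void)
    {
        memcpy(guest_regs, shadow_regs, sizeof shadow_regs);
    }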

  31. HW Support for Checkpoints (4)
  • Checkpointing memory: the gated store buffer (sketched below).
    • Store operations are buffered until the current translation block is exited (committed).
    • If an exception occurs, the buffered stores are flushed (discarded).
    • Restrictions on code motion are relaxed: the code inside a translation block can be reordered by software in any fashion.
    • The fixed size of the store buffer constrains the translation block size.
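A C sketch of a gated store buffer; the entry format and size are invented. The fixed capacity is what bounds the translation block size mentioned above:

    #include <stdint.h>
    #include <string.h>

    #define STORE_BUF_ENTRIES 32   /* fixed size -> limits translation block size */

    struct pending_store { uint8_t *addr; uint64_t data; unsigned size; };

    static struct pending_store store_buf[STORE_BUF_ENTRIES];
    static unsigned store_count;

    /* Called for every store executed inside the current translation block;
     * size is the store width in bytes (1..8). */
    static int gated_store(uint8_t *addr, uint64_t data, unsigned size)
    {
        if (store_count == STORE_BUF_ENTRIES)
            return -1;                      /* buffer full: the block must end here */
        store_buf[store_count++] = (struct pending_store){ addr, data, size };
        return 0;
    }

    /* Block exited normally: release all buffered stores to memory. */
    static void commit_stores(void)
    {
        for (unsigned i = 0; i < store_count; i++)
            memcpy(store_buf[i].addr, &store_buf[i].data, store_buf[i].size);
        store_count = 0;
    }

    /* Trap detected: throw the buffered stores away (memory never changed). */
    static void flush_stores(void)
    {
        store_count = 0;
    }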

  32. HW Support for Checkpoints (5)
  [Figure: the guest registers have shadow copies; the scratch, speculative-result, and constant registers do not. When a checkpoint is committed, the guest registers are copied into the shadow registers; when a trap is detected, the shadow registers are copied back into the guest registers.]

  33. Page Fault Compatibility (1)
  • The guest OS must observe exactly the same page faults as on a native platform.
  • If the guest OS manages conventional memory:
    • Page faults in the data region are detected naturally.
    • During interpretation, page faults in the code region are also detected.
    • But executing translated code does not fetch any code from guest memory, so instruction page faults would go unnoticed.

  34. Page Fault Compatibility (2)
  • When a translated instruction is fetched from the code cache, trigger a page fault if the corresponding guest instruction would have caused a page fault on a native platform.
  • Two approaches:
    • Active approach
    • Lazy approach

  35. Active Page Fault Detection (1)
  • Monitor potential page replacement by the guest OS.
    • Assuming an architected page table, the VMM can identify the memory region holding the page table.
    • The VMM monitors the guest OS's modifications to the architected page table.
    • By write-protecting the page table, the VMM can detect any change to a virtual page mapping.
    • The VMM keeps a table recording which virtual pages each translated source instruction comes from.

  36. Active Page Fault Detection (2)
  • If the page table is modified:
    • The VMM flushes all translations in the code cache derived from the modified page (a sketch of this flush path follows).
    • Table 1: for each source page, all the translation blocks derived from it (the must-be-flushed blocks).
    • Table 2: the link backpointers into those blocks; the links (for removed pages) are redirected to the VMM emulation manager.
    • The emulation process will then detect the instruction page fault.
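A C sketch of the two side tables and the flush path; all structure names (tblock, page_translations, vmm_emulation_manager) are invented stand-ins:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_BLOCKS_PER_PAGE 64
    #define MAX_LINKS_PER_BLOCK 16

    struct tblock {
        void  *code;                              /* entry point in the code cache */
        void **link_sites[MAX_LINKS_PER_BLOCK];   /* chained jumps targeting us ("Table 2") */
        size_t nlinks;
        int    valid;
    };

    struct page_translations {                    /* one "Table 1" entry           */
        uint64_t src_page;
        struct tblock *blocks[MAX_BLOCKS_PER_PAGE];
        size_t nblocks;
    };

    extern void *vmm_emulation_manager;           /* hypothetical dispatch stub    */

    /* The guest OS changed the mapping of a source page: flush everything
     * derived from it and retarget incoming links at the emulation manager,
     * which will rediscover the instruction page fault when it emulates. */
    static void flush_source_page(struct page_translations *pt)
    {
        for (size_t i = 0; i < pt->nblocks; i++) {
            struct tblock *b = pt->blocks[i];
            for (size_t j = 0; j < b->nlinks; j++)
                *b->link_sites[j] = vmm_emulation_manager;
            b->valid = 0;                         /* translation no longer usable  */
        }
        pt->nblocks = 0;
    }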

  37. Lazy Page Fault Detection (1)
  • Code cache flushing is postponed until the replaced code is actually used.
    • Every time translated code crosses a source page boundary, check the page table.
    • A Verify_Translation instruction is inserted at the point where the boundary is crossed (sketched below).
    • It checks the page mapping: page mapped correctly → proceed; page not mapped → page fault.
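A C sketch of the check a Verify_Translation instruction performs; guest_page_mapping and jump_to_vmm are hypothetical helpers standing in for the page-table probe and the transfer back to the VMM:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helpers: probe the guest's architected page table, and
     * hand control back to the VMM (which can deliver the instruction page
     * fault to the guest or retranslate), respectively. */
    extern bool guest_page_mapping(uint64_t src_vpage, uint64_t *real_page);
    extern void jump_to_vmm(uint64_t src_vpage);

    /* Emitted wherever translated code crosses a source page boundary: the
     * block only continues if the next source page is still mapped as it was
     * when the translation was made. */
    static void verify_translation(uint64_t src_vpage, uint64_t expected_real_page)
    {
        uint64_t real_page;

        if (guest_page_mapping(src_vpage, &real_page) &&
            real_page == expected_real_page)
            return;                  /* page correctly mapped: continue execution */

        jump_to_vmm(src_vpage);      /* not mapped (or remapped): let the VMM
                                        probe the page table and raise the fault */
    }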

  38. Lazy Page Fault Detection (2)
  [Figure: translation blocks A through L in the code cache map back to their guest pages. At each Verify_Translation instruction the page table is probed: if the page is correctly mapped, execution continues; if not, control jumps to the VMM.]

  39. Input/Output (1)
  • If the VMM does not use any I/O devices itself:
    • All the guest device drivers can be run as is.
    • Any I/O instructions or memory-mapped I/O are simply passed through.
  • Volatile memory inhibits optimization, so accesses to volatile memory must be identified.
    • Use an access-protect bit: a load/store to such a page → trap → deoptimize to the correct access sequence.
    • Or provide special volatile versions of load/store instructions.

  40. Input/Output (2)
  • Using a disk in the VMM:
    • For a disk-based code cache approach: a large, persistent code cache.
    • Requires relaxed transparency.
    • "Concealed secondary storage" accessed through a VMM-aware special disk driver.
  [Figure: the guest OS uses a special disk driver that cooperates with the VMM to reach the concealed disk region.]