Transmeta and Dynamic Code Optimization
Ashwin Bharambe, Mahim Mishra, Matthew Rosencrantz
Stuff Compilers Don’t (Can’t?) Do
• Instruction reordering
• Common case detection and optimization
• Branch prediction
• Traces (pre-fetching)
• Optimizing traces
• Why can’t compilers do these optimizations?
  • No runtime statistics
  • Legacy code (inertia to recompile)
Therefore: Dynamic Code Optimization
• Optimize on the fly (at runtime)
• Current processors do it to some extent
  • Instruction reordering
  • Branch prediction
• You can do much better…
How Do You Implement This?
• “Hardware Intensive” approach
  • Pentium Pro
    • Instruction Translator: part of the critical path of the main processor
  • I-COP
    • Instruction-block Optimizer: off the critical path
• “Non-Hardware Intensive” approach
  • Transmeta, DAISY, Java HotSpot
• Trade-offs?
I-COP (Instruction Path Coprocessors)
• What?
  • Add another processor that watches the instructions retire and can perform operations on them
• Why?
  • Performance!
• Principles
  • Keep the optimizations out of the critical path
  • Avoid slowdown due to software
Structure
• Multiple VLIW processor “slices” keep the I-COP simple while still letting it keep up with the main processor
• Each I-COP slice has 10 special instructions for pattern matching in addition to 12 ordinary RISC-type instructions
Applications of I-COP
• Trace cache fill
  • Find long strings of instructions that are executed frequently
• Pre-fetching
  • Find a load whose result is used later as the address of another load
• Instruction trace optimizations
  • Register move optimization
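The pre-fetching bullet above can be illustrated with a small software model. This is only a sketch: the `Instr` record, its field names, and the `find_dependent_loads` helper are hypothetical and are not part of the I-COP design described in the slides. It scans a retired-instruction trace for a load whose result is later used as the address of another load, the pattern a prefetcher would want to know about.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    """Hypothetical retired-instruction record (not the real I-COP format)."""
    op: str                  # "load", "add", "mov", ...
    dst: Optional[str]       # destination register, if any
    srcs: Tuple[str, ...]    # source registers; for a load, the address register(s)

def find_dependent_loads(trace):
    """Return (producer, consumer) index pairs where the value produced by
    one load is later used as the address of another load."""
    loaded_by = {}   # register -> index of the load that last wrote it
    pairs = []
    for i, ins in enumerate(trace):
        # Check sources first so that "r1 = mem[r1]" (pointer chasing) is caught.
        if ins.op == "load":
            for reg in ins.srcs:
                if reg in loaded_by:
                    pairs.append((loaded_by[reg], i))
        if ins.dst is not None:
            if ins.op == "load":
                loaded_by[ins.dst] = i
            else:
                # A non-load write means the register no longer holds a loaded value.
                loaded_by.pop(ins.dst, None)
    return pairs

trace = [
    Instr("load", "r1", ("r0",)),        # r1 = mem[r0]
    Instr("add",  "r2", ("r2", "r3")),
    Instr("load", "r4", ("r1",)),        # r4 = mem[r1], address came from a load
    Instr("mov",  "r1", ("r5",)),        # r1 overwritten by a non-load
    Instr("load", "r6", ("r1",)),        # not a dependent load
]
print(find_dependent_loads(trace))       # [(0, 2)]
```

A real coprocessor would work on encoded instruction words in its fill buffer rather than on Python objects; the sketch only shows the dependency check itself.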
The I-COP Processor
• Multiple VLIW slices allow multi-level, statically scheduled, explicitly encoded parallelism
• Predication and delay slots obviate branch prediction
• 32 integer registers, 8 predicate registers
• 22 instructions: 12 RISC-type and 10 special
  • Special instructions cover pattern matching, bit manipulation, and instrumentation
• Fill buffer collects instructions for analysis
• Task queue acts as a FIFO scheduler
Examples of Special Instructions
• SearchReplace
  • Finds a given pattern, replaces it with another given pattern, and returns the number of replacements made
• Subset
  • Tests whether the bits set in a given register are a subset of those set in a second register
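As a rough software model of the two instructions above, the following sketch may help. The slides do not give operand formats or exact semantics, so the function names, the use of Python integers as registers, and the choice of non-overlapping left-to-right matching over a list are all assumptions made for illustration.

```python
def subset(a: int, b: int) -> bool:
    """True if every bit set in register value a is also set in b."""
    return (a & ~b) == 0

def search_replace(buf, pattern, replacement):
    """Replace non-overlapping occurrences of pattern in buf (in place),
    scanning left to right, and return the number of replacements.
    Assumption: buf, pattern and replacement are lists of comparable items."""
    if not pattern:
        return 0
    count = 0
    i = 0
    n = len(pattern)
    while i + n <= len(buf):
        if buf[i:i + n] == pattern:
            buf[i:i + n] = replacement
            i += len(replacement)   # skip past what was just written
            count += 1
        else:
            i += 1
    return count

print(subset(0b0101, 0b1101))   # True: bits 0 and 2 are both set in the second value
print(subset(0b0110, 0b0101))   # False: bit 1 is missing from the second value

ops = ["mov", "mov", "add", "mov", "mov"]
print(search_replace(ops, ["mov", "mov"], ["mov"]))   # 2
print(ops)                                            # ['mov', 'add', 'mov']
```

The point of providing such operations as single instructions is presumably to keep the coprocessor's analysis loops short; in software each one is a loop or a couple of bitwise operations.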
Transmeta Crusoe
• The best example of a “non-hardware-intensive” approach
• New (and fast!) 128-bit VLIW processor
• Aimed at systems where power efficiency is important
  • Mobile systems
  • “Dense” servers
• Therefore, small gate count
• BUT it needs x86 compatibility
• AND at reasonable performance too
So How Do They Do It?
• A “Code-Morphing” (CM) software layer runs on the processor
• All x86 software (BIOS, OS, apps) runs above this layer
• CM software translates x86 code at runtime into the VLIW processor’s native instruction set
• It also optimizes the translations!
• So the processor itself can be fast and simple
Code-Morphing Software
• Translates an entire basic block at once
• Also does instruction reordering, branch prediction, and register renaming
• Translations are stored in a translation cache (part of main memory)
• Instruments the code to help with branch prediction and to detect candidates for heavier optimization
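The translate, cache, and instrument loop described above can be sketched in a few lines. This is a toy model under stated assumptions, not Transmeta's implementation: the `Translator` class, the `HOT_THRESHOLD` value, and the toy `translate`/`optimize` callbacks are all hypothetical.

```python
HOT_THRESHOLD = 3  # hypothetical: executions before a block is re-optimized

class Translator:
    """Toy translation cache keyed by the block's entry address."""

    def __init__(self, translate, optimize):
        self.translate = translate    # source block -> translated block
        self.optimize = optimize      # translated block -> optimized block
        self.cache = {}               # entry address -> translated block
        self.exec_count = {}          # instrumentation: executions per block
        self.optimized = set()        # addresses already re-optimized

    def fetch(self, addr, source_block):
        """Return the translation for addr, translating on a cache miss and
        re-optimizing once the block has proven to be hot."""
        if addr not in self.cache:
            self.cache[addr] = self.translate(source_block)
            self.exec_count[addr] = 0
        self.exec_count[addr] += 1
        if self.exec_count[addr] >= HOT_THRESHOLD and addr not in self.optimized:
            self.cache[addr] = self.optimize(self.cache[addr])
            self.optimized.add(addr)
        return self.cache[addr]

# Toy stand-ins for real translation and optimization passes.
def toy_translate(block):
    return [ins.upper() for ins in block]

def toy_optimize(block):
    return [ins for ins in block if ins != "NOP"]

cms = Translator(toy_translate, toy_optimize)
block = ["add", "nop", "sub"]
results = [cms.fetch(0x1000, block) for _ in range(3)]
print(results[0])   # ['ADD', 'NOP', 'SUB']  (quick first translation)
print(results[2])   # ['ADD', 'SUB']         (after the block became hot)
```

The design choice being illustrated is that translation cost is paid once per block, and the more expensive optimization is spent only on blocks the instrumentation shows to be executed often.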
Code-Morphing Software (cont.)
• Also has some help from the hardware
  • Shadowed and working register sets
  • Alias hardware (load-and-protect operations)
  • “Translated” bit for each page table entry
• Performance of systems with Crusoe
  • 2-3 times longer battery life
  • Performance “comparable” to Intel mobile processors
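The slide only names the shadowed and working register sets. As commonly described for Crusoe, translated code updates the working copy, a commit copies it to the shadow copy, and a fault rolls the working copy back to the last committed state. The sketch below models just that idea; the class, the `run_translation` helper, and the register names are hypothetical, and real hardware also has to handle pending stores, which is omitted here.

```python
class ShadowedRegisters:
    """Toy model of working registers backed by a committed shadow copy."""

    def __init__(self, names):
        self.working = {name: 0 for name in names}
        self.shadow = dict(self.working)

    def commit(self):
        # Make the speculative working state the new committed state.
        self.shadow = dict(self.working)

    def rollback(self):
        # Discard speculative updates and restore the last committed state.
        self.working = dict(self.shadow)

def run_translation(regs, steps):
    """Run each step (a callable that mutates the working registers).
    Commit if all steps succeed; roll back if any step raises.
    Returns True when the translation committed."""
    try:
        for step in steps:
            step(regs.working)
    except Exception:
        regs.rollback()
        return False
    regs.commit()
    return True

def set_eax(w):
    w["eax"] = 7

def set_ebx(w):
    w["ebx"] = 99

def fault(w):
    raise ZeroDivisionError("simulated fault inside a translation")

regs = ShadowedRegisters(["eax", "ebx"])
ok = run_translation(regs, [set_eax])           # commits eax = 7
bad = run_translation(regs, [set_ebx, fault])   # ebx = 99 is rolled back
print(ok, bad, regs.working)   # True False {'eax': 7, 'ebx': 0}
```

Rolling back to a known-good state is what lets the software reorder instructions aggressively: if a reordered translation faults, execution can restart from the committed state and proceed conservatively.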