Indirect Branching in the Transmeta Efficeon Processor

Presentation Transcript

  1. Indirect Branching in the Transmeta Efficeon Processor
     Naveen Kumar and Naveen Neelakantam
     Intel Corporation

  2. Introduction
     • Transmeta Efficeon processor
     • HW/SW co-designed processor marketed in 2003
     • Binary translation of x86 to underlying VLIW hardware
     • Focus on how Efficeon handles indirect branches
     • Indirect branches are particularly difficult for binary translation
     • Efficeon provided a number of unique solutions
     • Many interesting HW/SW solutions to improve efficiency
     • Our hope is that we can use and build upon these ideas

  3. Disclaimer and Acknowledgement
     • A review of past work, not original research by authors
     • Efficeon was implemented by Transmeta, but details rarely published
     • Acknowledgement and thanks to the original Transmeta team
     • We continue further advancement of these ideas*
     * Intel purchased Transmeta IP

  4. Transmeta Efficeon Processor
     • 6-issue VLIW, in-order, 10-stage pipeline
     • Provides x86 compatibility
     • Co-designed with a software system
     • The Code Morphing Software (CMS)
     [Figure: layer diagram, top to bottom: x86 application and x86 OS; x86 ISA; CMS (dynamic binary translation); RISC ISA; VLIW processor]

  5. Dynamic Binary Translation
     • Intercept executing app
     • Interpret and profile
     • Dynamically compile “hot” code to host ISA
     • Cache and execute
     • Compiled code fragments are “chained” together
     • Difficult to chain across an indirect branch
     • Branch target unknown until runtime
     [Figure: x86 code; interpret; translate; translation cache; host processor]
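The interpret, profile, translate, and cache steps on this slide can be sketched as a small dispatch loop. This is a toy model and not CMS: the guest program is a dictionary of Python callables, and `HOT_THRESHOLD`, `run`, and `loop_body` are hypothetical names chosen for illustration. Chaining is not modeled, so every block returns to the dispatch loop, which is the overhead chaining is meant to remove.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical guest "program": each block address maps to a function that
# updates the state and returns the next block address (None means exit).
State = Dict[str, int]
Block = Callable[[State], Optional[int]]

HOT_THRESHOLD = 3  # assumed value for illustration only


def run(program: Dict[int, Block], entry: int, state: State) -> Tuple[int, int]:
    """Run the guest; return (interpreted executions, translated executions)."""
    exec_counts: Dict[int, int] = {}
    translation_cache: Dict[int, Block] = {}
    interpreted = translated = 0
    pc: Optional[int] = entry
    while pc is not None:
        cached = translation_cache.get(pc)
        if cached is not None:
            # Fast path: execute the cached translation.
            translated += 1
            pc = cached(state)
            continue
        # Slow path: interpret and profile.
        exec_counts[pc] = exec_counts.get(pc, 0) + 1
        if exec_counts[pc] >= HOT_THRESHOLD:
            # "Translate" the hot block; in this toy the translation is
            # simply the same callable stored in the cache.
            translation_cache[pc] = program[pc]
        interpreted += 1
        pc = program[pc](state)
    return interpreted, translated


def loop_body(state: State) -> Optional[int]:
    """A single guest block that loops back to itself ten times."""
    state["i"] += 1
    return 0x100 if state["i"] < 10 else None


counts = run({0x100: loop_body}, 0x100, {"i": 0})
```

With the assumed threshold of 3, the first three executions of the block are interpreted and the remaining seven come from the translation cache.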

  6. Indirect Branch Translation
     • Several proposals to improve translation efficiency

  7. Indirect Branch Translation
     • System level translators
     • Branch target can change by a page-table/segment update
     • Page permission changes
     • Page table entry changes (LPN → PPN mappings)
     • Segment limit and permissions
     • Sharing translations across processes possible, but additional checks needed
     • Bottom line: Indirect branch translation is expensive in traditional BT systems

  8. Indirect Branch Prediction
     • Traditional processors often use a BTB
     • Insufficient for BT: the indirect branch is translated to a conditional direct branch
     • Conditional branches in an indirect branch translation
     • Multiple conditional branches in an indirect branch translation
     • Data-dependent on indirect branch target
     • These branches also become difficult to predict in hardware
     • Bottom line: Indirect branches lead to poor branch prediction in traditional BT systems
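To make the "multiple conditional branches" point concrete, here is a minimal sketch of what a translated indirect branch does at runtime: a chain of compare-and-branch steps against known targets, with a software lookup as the fallback. The function name, the label strings, and the `"dispatcher"` fallback are assumptions for illustration, not Efficeon or CMS interfaces.

```python
from typing import Dict, List, Tuple


def resolve_indirect_branch(
    guest_target: int,
    inlined: List[Tuple[int, str]],
    lookup_table: Dict[int, str],
) -> Tuple[str, int]:
    """Return (host label, number of conditional branches executed)."""
    branches = 0
    for known_guest, host_label in inlined:
        branches += 1  # one compare-and-branch per inlined target
        if guest_target == known_guest:
            return host_label, branches
    # Every inlined comparison failed: fall back to a software lookup,
    # and to the dispatcher if no translation exists yet.
    return lookup_table.get(guest_target, "dispatcher"), branches


inlined_targets = [(0x4000, "T_4000"), (0x4800, "T_4800")]
table = {0x5000: "T_5000"}

first = resolve_indirect_branch(0x4000, inlined_targets, table)
second = resolve_indirect_branch(0x4800, inlined_targets, table)
missed = resolve_indirect_branch(0x5000, inlined_targets, table)
unknown = resolve_indirect_branch(0x9999, inlined_targets, table)
```

The outcome of each comparison depends on the runtime target value, so the hardware sees several data-dependent conditional branches instead of one indirect branch.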

  9. Indirect Branching in Efficeon
     • Efficeon uses HW/SW co-design to address:
     • Efficient translation of indirect branches
     • Better branch prediction than in other BT systems
     • Next, we discuss how Efficeon handles:
     • x86 return emulation
     • x86 indirect branch emulation
     • Native indirect branches and returns

  10. x86 Return Example
      • Conventional hardware has near-perfect return target prediction
      • Front-end typically implements a return address stack
      x86 code:
        foo:  call bar
              …
              call bar
              …
        bar:  …
              ret
      [Figure: Return Address Stack holding return addresses such as baz, foo+2, and foo+8]
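The return address stack on this slide can be modeled in a few lines. This is a generic sketch of the idea rather than any particular processor's design; the depth of 8 and the drop-oldest overflow policy are assumptions.

```python
from typing import List, Optional


class ReturnAddressStack:
    """Toy front-end predictor: calls push, returns pop."""

    def __init__(self, depth: int = 8) -> None:
        self.depth = depth  # assumed capacity
        self.entries: List[str] = []

    def on_call(self, return_address: str) -> None:
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumed policy: discard the oldest entry
        self.entries.append(return_address)

    def on_return(self) -> Optional[str]:
        # The predicted target is the most recently pushed return address.
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
predictions = []

ras.on_call("foo+2")                  # first "call bar" in foo
predictions.append(ras.on_return())   # "ret" in bar
ras.on_call("foo+8")                  # second "call bar" in foo
predictions.append(ras.on_return())   # "ret" in bar again
```

Both returns from `bar` are predicted correctly even though they go to different addresses, which a single BTB entry for the `ret` could not do.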

  11. x86 Return Translation
      x86 code:
        foo:  call bar
              …
              call bar
              …
        bar:  …
              ret
      Translated code:
        foo’:    mov [esp], foo+2
                 sub esp, esp, 4
                 br bar’
        foo+2’:  …
                 mov [esp], foo+8
                 sub esp, esp, 4
                 br bar’
        foo+8’:  …
        bar’:    …
                 add esp, esp, 4
                 br lookup_ibtc(esp)
      • Return is emulated using an indirect branch, which is difficult to predict
      • Inlining doesn’t help
      [Figure: possible return targets foo+2 and foo+8]
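The slide only names `lookup_ibtc`; it does not describe the table behind it. Reading it as a lookup in an indirect branch translation cache (an assumption), a minimal sketch could be a small direct-mapped table from guest addresses to translated targets, with a slower full map as the miss path. The class name, table size, and addresses below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class IBTC:
    """Toy direct-mapped cache from guest address to host translation."""

    def __init__(self, translation_map: Dict[int, str], size: int = 8) -> None:
        assert size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.entries: List[Optional[Tuple[int, str]]] = [None] * size
        self.translation_map = dict(translation_map)  # slow-path lookup
        self.hits = 0
        self.misses = 0

    def lookup(self, guest_addr: int) -> Optional[str]:
        index = guest_addr & (self.size - 1)
        entry = self.entries[index]
        if entry is not None and entry[0] == guest_addr:
            self.hits += 1  # tag matches: fast path
            return entry[1]
        self.misses += 1
        host = self.translation_map.get(guest_addr)  # slow path
        if host is not None:
            self.entries[index] = (guest_addr, host)  # fill for next time
        return host


# Hypothetical guest addresses standing in for foo+2 and foo+8.
FOO_2, FOO_8 = 0x1002, 0x1008
ibtc = IBTC({FOO_2: "foo+2'", FOO_8: "foo+8'"})
targets = [ibtc.lookup(a) for a in (FOO_2, FOO_8, FOO_2, FOO_8)]
```

Even on a hit, the lookup costs an index computation, a load, and a tag compare before the final indirect branch, which is why the slide calls this emulation difficult to predict and expensive.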

  12. Hardware support: Flook Stack
      • 16-entry flook stack is explicitly managed by CMS
      • Intended for emulating call/return in a translation
      • Flook stack enables RAS-like target prediction
      • Includes “tag” validation of an entry before consumption

  13. Translation using Flook Stack
      Translated code:
        foo’:    mov rtemp, <foo+2>
                 mov flook_x86_eip, rtemp
                 st rtemp, [esp-4]
                 sub esp, esp, 4
                 precall <foo+2’>
                 br <bar’>
        foo+2’:  …
        bar’:    …
                 ld rtemp, [esp]
                 mov flook_x86_eip, rtemp
                 add esp, esp, 4
                 ret
      [Figure: flook stack entry; labels: foo+2, foo+2’, x86 EIP, foo+2]
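One way to read the sequence above is sketched below as a software model. The 16-entry depth and the tag check come from the slides; everything else is an assumption made for illustration: that `precall` records the current `flook_x86_eip` value as the tag next to the translated return target, that `ret` consumes the top entry only if its tag equals the `flook_x86_eip` value loaded from the guest stack, that overflow drops the oldest entry, and that a mismatch falls back to a software lookup.

```python
from typing import Dict, List, Optional, Tuple


class FlookStack:
    DEPTH = 16  # from the slides: 16-entry flook stack

    def __init__(self, fallback: Dict[int, str]) -> None:
        self.entries: List[Tuple[Optional[int], str]] = []
        self.flook_x86_eip: Optional[int] = None  # models the register
        self.fallback = fallback  # hypothetical software lookup table
        self.fast_returns = 0
        self.slow_returns = 0

    def precall(self, host_target: str) -> None:
        if len(self.entries) == self.DEPTH:
            self.entries.pop(0)  # assumed overflow policy
        # Tag the entry with the x86 return address currently in the register.
        self.entries.append((self.flook_x86_eip, host_target))

    def ret(self) -> Optional[str]:
        if self.entries:
            tag, host_target = self.entries.pop()
            if tag == self.flook_x86_eip:  # tag validation before consumption
                self.fast_returns += 1
                return host_target
        self.slow_returns += 1
        return self.fallback.get(self.flook_x86_eip)


FOO_2, BAZ = 0x1002, 0x2000  # hypothetical guest addresses
fs = FlookStack(fallback={FOO_2: "foo+2'", BAZ: "baz'"})

# Well-behaved call/return pair.
fs.flook_x86_eip = FOO_2
fs.precall("foo+2'")
fs.flook_x86_eip = FOO_2      # value loaded from [esp] in bar'
matched = fs.ret()

# Guest overwrote its return address on the stack before returning.
fs.flook_x86_eip = FOO_2
fs.precall("foo+2'")
fs.flook_x86_eip = BAZ        # value loaded from [esp] differs from the tag
mismatched = fs.ret()
```

The tag check is what lets software use a RAS-like structure safely: x86 code may modify its own return address, so a predicted target can only be consumed when it matches the address actually on the guest stack.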

  14. x86 Indirect Branch Emulation
      • Translation similar to the one shown before
      • Additional architectural registers significantly reduce translation size
      • Multiple “inlined” comparisons with known targets
      • Monitoring and update of predicted targets in SW
      • Compare translation “context” with runtime “context”
      • Enhance branch prediction by co-design
      • Software inserts target address in a “link” register
      • Perform “other” computation
      • Pipeline front-end fetches instructions at predicted target
      • Actual branching happens later via a “brl” instruction

  15. Native Indirect Branches
      • Translation dispatch and interpreter
      • Both are frequent users of indirect branches
      • Lousy branch prediction
      • Software can aid in branch prediction
      • Link pipe
      • Push target addresses onto a hardware structure
      • Do “other” computation
      • Front-end can fetch the branch target in the meantime
      • Branch to the top of link pipe using “brlp”
      • Native subroutines
      • Link stack
      • Analogous to a traditional call stack
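The push-early, branch-later pattern of the link pipe can be illustrated with a toy interpreter loop. This models only the ordering of events, not hardware timing. The FIFO ordering, the `LinkPipe` class, and the three-opcode bytecode are assumptions for illustration; the slide itself only says that targets are pushed onto a hardware structure and later consumed with `brlp`.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

State = Dict[str, int]
Handler = Callable[[State], None]


class LinkPipe:
    """Toy model: targets are announced early and consumed in order."""

    def __init__(self) -> None:
        self.pipe: Deque[Handler] = deque()
        self.prefetched: List[Handler] = []

    def push(self, target: Handler) -> None:
        self.pipe.append(target)
        # In hardware the front end could begin fetching here; the toy
        # model just records that the target became known at this point.
        self.prefetched.append(target)

    def brlp(self) -> Handler:
        return self.pipe.popleft()  # assumed FIFO order


def op_inc(state: State) -> None:
    state["acc"] += 1


def op_dec(state: State) -> None:
    state["acc"] -= 1


def op_halt(state: State) -> None:
    state["running"] = 0


HANDLERS: Dict[str, Handler] = {"inc": op_inc, "dec": op_dec, "halt": op_halt}
program = ["inc", "inc", "dec", "halt"]

state: State = {"acc": 0, "running": 1}
link_pipe = LinkPipe()
pc = 0
while state["running"]:
    link_pipe.push(HANDLERS[program[pc]])  # announce the target early
    pc += 1                                # "other" computation
    handler = link_pipe.brlp()             # branch via the link pipe
    handler(state)
```

The point of the split is that the target is known to the front end before the branch executes, so the dispatch branch in an interpreter no longer depends on a hardware guess.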

  16. Summary and Future Work
      • Indirect branches particularly expensive
      • Several techniques to speed up indirect branches
      • Flook stack
      • Link register and brl
      • Link pipe and brlp
      • Link stack
      • Future Work: Since Efficeon, other proposals to enhance indirect branch handling in BT systems
      • Hiser et al., Kim et al.
      • Would be interesting to combine some of these ideas

  17. References
      • Bala et al., “Transparent Dynamic Optimization: The Design and Implementation of Dynamo”, 1999.
      • Banning et al., “Link pipe system for storage and retrieval of sequences of branch addresses”, 2003.
      • Banning et al., “Fast look-up of indirect branch destination in a dynamic translation system”, 2006.
      • Hiser et al., “Evaluating indirect branch handling mechanisms in software dynamic translation systems”, 2007.
      • Kim et al., “Hardware Support for Control Transfers in Code Caches”, 2003.
      • Kevin Krewell, “Transmeta gets more Efficeon”, Microprocessor Report, 2003.