180 likes | 306 Views
Application of Instruction Analysis/Synthesis Tools to x86’s Functional Unit Allocation. Ing-Jer Huang and Ping-Huei Xie Institute of Computer & Information Engineering National Sun Yat-sen University Kaohsiung, Taiwan 80441 R. O. C. ijhuang@cie.nsysu.edu.tw.
 
                
                E N D
Application of Instruction Analysis/Synthesis Tools to x86’s Functional Unit Allocation Ing-Jer Huang and Ping-Huei Xie Institute of Computer & Information Engineering National Sun Yat-sen University Kaohsiung, Taiwan 80441 R. O. C. ijhuang@cie.nsysu.edu.tw
Decoupled superscalar architecture register renaming branch prediction Assumptions no cache miss fast instruction fetcher and decoder 100% branch prediction correct load/store unit: 2 cycles;others: 1 cycle large RS and ROB Superscalar Model under Investigation
FU Usage 4A, 2M, 1B 3A, 0M, 0B 2A, 2M, 1B 2A, 1M, 0B 1A, 1M, 1B Frequency The Problem Q: How many functional units are needed in an x86 compatible superscalar core? A:The distribution of functional unit usage in typical x86 programs
How to Obtain FU Distribution? • Simulation-based approaches [Shinatani, 1995], [Davidson, 1995], [Hara et al., 1996], etc. • Running on different CPU platforms • Slow, but can explore many configurations • Monitoring-based approaches [Adams et al., 1989], [Bhandarkar et al., 1997], [Huang, 1997], etc. • Directly running on the same CPU platform • Fast, but work for only the configuration of the underlying CPU platform
ASIA: Automatic Synthesis of Instruction Set Architedcture • GOAL: analyzes and synthesizes application-specific instruction set for pipelined uni-processors. • APPROACH: a micro-operation scheduling engine based on a simulated annealing algorithm  The superscalar core is an application-specific RISC core for x86 emulation
ASIA-II: Extensions for Superscalar Architecture • Register renaming • Temporary registers are used on the fly to resolve anti and data dependencies. • Execution window • Instructions are dispatched sequentially. • Branch prediction • Effective sizes of basic blocks are enlarged.
Register Renaming • In ASIA-II: ignore output, anti dependencies during scheduling
Realistic Patterns in the Execution Window • Balanced distribution: 0bjective function includes both time steps and H/W counts • Window effect: MOP’s are displaced with a limited distance; long distance is possible with many iterations of displacement .as long as performance is improved.
Notation: A - Integer unit M - Memory unit B - Branch unit F - Floating unit Others is the sum of that frequent less than 1.0% Functional Unit Usage
Accumulated Coverage of Functional Unit Allocation (NSC 98) (IA-64) (AMD K6) (Pentium Pro) (Base Machine)
Conclusions • Synthesis/analysis tools have been used to observe the functional unit usage and MLP in superscalar core. • Speedup over simulation is over 600 times. • FUTURE WORK: investigate various microarchitecture features • register renaming vs. branch prediction • functional unit optimization