
Profile-Based Dynamic Optimization Research for Future Computer Systems



Presentation Transcript


  1. Profile-Based Dynamic Optimization Research for Future Computer Systems Takanobu Baba Department of Information Science Utsunomiya University, Japan http://aquila.is.utsunomiya-u.ac.jp November 12, 2004 Seminar@UW-Madison

  2. Brief history of ‘my’ research • 1970’s: The MPG System A Machine-Independent Efficient Microprogram Generator • 1980’s: MUNAP A Two-Level Microprogrammed Multiprocessor Computer • 1990’s: A-NET A Language-Architecture Integrated Approach for Parallel Object-Oriented Computation

  3. A Two-Level Microprogrammed Multiprocessor Computer – MUNAP (figure) A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle

  4. A Parallel Object-Oriented Total Architecture A-NET (Actors NETwork) (figure: A-NET Multicomputer) • Massively parallel computation • Each node consists of a PE and a router. • The PE has a language-oriented, typical CISC architecture. • The programmable router is topology-independent.

  5. Current dynamic optimization projects • Computation-oriented: • YAWARA: A meta-level optimizing computer system • HAGANE: Binary-level multithreading • Communication-oriented: • Spec-All: Aggressive Read/Write Access Speculation Method for DSM Systems • Cross-Line: Adaptive Router Using Dynamic Information

  6. YAWARA: A Meta-Level Optimizing Computer System

  7. Background • Moore’s Law will be maintained by semiconductor technology • How can we utilize the huge number of transistors to speed up program execution? • Our idea is to use some of the chip area to dynamically and autonomously tune the configuration of the on-chip multiprocessor

  8. Base-level and meta-level organization (figure): the base-level processor fetches instructions and data from memory and returns results of computation; it passes a profile of control and data to the meta-level processor, which feeds back the results of optimization.

  9. Design considerations • HW vs. SW reconfiguration → SW reconfiguration • Static vs. dynamic reconfiguration → both a static and a dynamic reconfiguration capability • Homogeneous vs. heterogeneous architecture → unified homogeneous structure

  10. Basic concepts of thread-level reconfiguration (figure): the application runs as computing threads at the base level; the meta level runs a management thread that oversees profiling and optimizing threads operating over the base-level threads and memory. MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread

  11. Execution model (figure): the Management Thread (MT) activates Profiling Threads (PT), Computing Threads (CT), and Optimizing Threads (OT). In the profiling-centric mode, a PT collects the profile of the running CTs, sleeping and waking as needed, and activates an OT once the optimization-initiation condition is satisfied. In the computing-centric mode, the CT itself collects the profile and triggers optimization when the condition is satisfied.

  12. Change of configurations by meta-level optimization (figure): snapshots of the meta-level/base-level thread mix over time; as optimization proceeds, OT and PT slots are progressively reassigned so that computing threads (CT) come to occupy most of the machine.

  13. The YAWARA System • an implementation of the computation model • the SW system consists of static and dynamic optimization systems • the HW system includes uniformly structured thread engines (TEs); each TE can execute base- and meta-level threads • spirit of YAWARA: "A flexible method prevails where a rigid one fails."

  14. Software System (figure): the SOS (Static Optimization System) takes source code (C/C++, Java, Fortran, …) together with an execution profile and code-analysis information (static feedback) and produces an executable image; the DOS (Dynamic Optimization System) applies the run-time profile during execution (dynamic feedback). Execution results and profiles come from an array of TEs (Thread Engines).

  15. Hardware System (figure): a grid of Thread Engines (TEs) under feedback-directed resource control, connected to/from the network. Each TE holds threads 0..N and contains an I-cache, a D-cache, a register file, a thread-code cache, a thread-data cache, network IN/OUT ports, execution control over four integer units and one floating-point unit (INT*4 + FP*1), and a profiling buffer with a profiling controller.

  16. Example application – compress – (figure): speculative multithreading using a path prediction mechanism. The figure shows a hot loop containing two hot paths (#0 and #1) with phased behavior: execution alternates between stretches where path #0 hits and stretches where path #1 hits. Meta-level roles: hot-loop / hot-path detection (PT, OT); speculative-multithreading code generation, helper-thread generation, and path-predictor generation (OT); speculative-multithreading profiling (PT); overall management (MT). The generated speculative threads run as CTs.

  17. Conclusion -YAWARA- • we proposed an autonomous reconfiguration mechanism based on dynamic behavior • we also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently • we are now developing the software system and the simulator.

  18. YAWARA@PDCS2004 Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading

  19. Occurrence ratios of the top-two paths (#1 path / #2 path): • compress/compress: 54.5% / 22.4% • ijpeg/forward_DCT: 48.2% / 42.1% • m88ksim/killtime: 97.0% / 3.0% • li/sweep: 80.7% / 19.3% The top two paths occupy 80-100% of execution

  20. Two-level path prediction • Introduces two-level branch prediction to path prediction • A history register keeps the sequence of recent #1-path executions (1: #1 path, 0: any other path) • A counter table, indexed by the history register, counts #1-path executions • Single Path Predictor (SPP): with history 1101 the entry v13 is selected; if v13 >= X (the threshold), predict the #1 path, otherwise predict the #2 path

  21. Another path predictor – Dual Path Predictor (DPP): separate history registers and counter tables are kept for the #1 path and the #2 path. With #1-path history 1101 (selecting counter v13) and #2-path history 0010 (selecting counter v2), if v13 >= v2 predict the #1 path, otherwise predict the #2 path.

  22. Single Speculation (SS) (figure): when a speculative #1-path thread fails, the succeeding threads are aborted, a recovery process runs, the non-speculative thread is executed, and only then does speculative execution continue. Speculation failure therefore degrades performance.

  23. Double Speculation (DS) • Even when the first speculation fails, the secondary choice has a high probability of being correct, because the top two paths are dominant. Expected #2 hit ratio on a #1 miss (= #2 ratio / (100% − #1 ratio)): • compress/compress (54.5% / 22.4%): 49.2% • ijpeg/forward_DCT (48.2% / 42.1%): 81.3% • m88ksim/killtime (97.0% / 3.0%): 100% • li/sweep (80.7% / 19.3%): 100%

  24. Double Speculation (DS) (figure): when the #1-path speculation fails, a secondary speculative thread is launched on the #2 path while the recovery process runs; if the secondary speculation succeeds, the performance loss is not so large.

  25. Evaluation flow • hot-path detection (SIMCA) • thread-code generation: #1-path speculative thread, #2-path speculative thread, non-speculative thread • path-history acquisition (SIMCA) • a performance estimator, fed by the path execution history, produces the speculation hit ratio and the speed-up ratio

  26. Prediction success ratio (charts): success ratio (%) versus history length (1–16) for compress and forward_DCT.

  27. Prediction success ratio (charts): success ratio (%) versus history length (1–16) for killtime and sweep.

  28. Speed-up ratio (charts): speed-up ratio versus predictor configuration (S, 100, P1 only, and history lengths 1–16) for compress and forward_DCT.

  29. Speed-up ratio (charts): speed-up ratio versus predictor configuration (S, 100, P1 only, and history lengths 1–16) for killtime and sweep.

  30. Conclusions - Two-Path-Limited Speculative Multithreading - • We proposed a path prediction method and predictors, and speculation methods for path-based speculative multithreading • Preliminary performance estimation results were presented

  31. Current and future work • Accurate and detailed evaluation on various applications: SPEC 2000, MediaBench, … • Integration into our dynamic optimization framework YAWARA

  32. Current dynamic optimization projects • Computation-oriented: • YAWARA: A meta-level optimizing computer system • HAGANE: Binary-level multithreading • Communication-oriented: • Spec-All: Aggressive Read/Write Access Speculation Method for DSM Systems • Cross-Line: Adaptive Router Using Dynamic Information

  33. HAGANE: Binary-Level Multithreading

  34. Background • Multithreaded programming is not easy → automatic multithreading systems. However… • Source code is not always available → multithreading at the binary level

  35. Binary Translator & Optimizer System (figure): the STO (Static Translator & Optimizer) takes the source binary code, an execution profile, and analysis information, and emits statically translated multithreaded binary code; the DTO (Dynamic Translator & Optimizer) translates the process memory image at run time, guided by execution profile information. Both feed the multithread processor.

  36. Thread Pipelining Model (figure): loop iterations are mapped onto threads i, i+1, i+2, …; each thread passes through four stages in order: Continuation, TSAG (Target Store Address Generation), Computation, and Write-back.

  37. Example translation

  Source binary code:
        mtc1  $zero[0], $f4
        addu  $v1[3], $zero[0], $zero[0]
  $BB1: l.s   $f0, 0($a0[4])
        l.s   $f2, 0($a1[5])
        mul.s $f0, $f0, $f2
        addiu $v1[3], $v1[3], 1
        add.s $f4, $f4, $f0
        slti  $v0[2], $v1[3], 5000
        addiu $a1[5], $a1[5], 4
        addiu $a0[4], $a0[4], 4
        bne   $v0[2], $zero[0], $BB1
  $BB2: mov.s $f0, $f4
        jr    $ra[31]

  Translated code (annotated with the Cont. / TSAG / Comp. / W.B. stages; thread-management instructions such as bstr, lfrk, wtsagd, altsw, tsagd, sttsw, estr and overhead code are inserted for multithreading):
        mtc1  $zero[0], $f4
        addu  $v1[3], $zero[0], $zero[0]
        bstr
        slti  $v0[2], $v1[3], 5000
        beq   $v0[2], $zero[0], $ST_LL0
        addu  $t0[8], $a0[4], $zero[0]
        addu  $t1[9], $a1[5], $zero[0]
        addi  $v1[3], $v1[3], 1
        addi  $a0[4], $a0[4], 4
        addi  $a1[5], $a1[5], 4
        lfrk
        wtsagd
        addu  $t2[10], $sp[28], $zero[0]
        altsw $t2[10]
        tsagd
        l.s   $f0, 0($t0[8])
        l.s   $f2, 0($t1[9])
        l.s   $f4, 0($t2[10])
        mul.s $f0, $f0, $f2
        add.s $f4, $f4, $f0
        sttsw $t2[10], $f4
  $ST_LL0:
        estr
        mov.s $f0, $f4
        jr    $ra[31]

  38. Superthreaded Architecture (figure): a shared L1 instruction cache feeds multiple Thread Processing Units; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit, and all units share the L1 data cache.

  39. m88ksim (SPECint95) • poor speedup ratios • loop unrolling does not affect the performance • the number of iterations is quite small

  40. ijpeg (SPECint95) • the thread code size is too small to hide the thread-management overhead • loop unrolling is effective to achieve good speedup ratios • excessive loop unrolling causes performance degradation • the number of iterations is not so large

  41. swim (SPECfp95) • good speedup ratios • loop unrolling is effective to achieve linear speedup • the number of iterations is large

  42. Conclusion -HAGANE- • We have evaluated binary-level multithreading using several SPEC95 benchmark programs. • The performance evaluation results indicate: • the thread code size should be large enough to improve performance • loop unrolling is effective for small loop bodies • excessive loop unrolling degrades performance

  43. HAGANE@PDCS2004 A Methodology of Binary-Level Variable Analysis for Multithreading

  44. Background and Objective Usually, loop iterations are interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level. A binary-level variable analysis method is therefore strongly required for binary-level multithreading.

  45. Example Binary Code. The source loop:

  for (i = 1; i < N; i++) {
      z = i * 2;
      x = a[i-1];
      y = x * 3;
      a[i] = z + y;
  }

  compiles to the following (the highlighted memory references are -4($v1[3]) and 0($a0[4])):

      lw    $a1[5], 16($s8[30])
      lw    $v1[3], 16($s8[30])
      lw    $a0[4], 16($s8[30])
      sll   $v1[3], $v1[3], 0x2
      addu  $v1[3], $v1[3], $a2[6]
      lw    $v0[2], 16($s8[30])
      lw    $v1[3], -4($v1[3])
      addiu $v0[2], $v0[2], 1
      sw    $v0[2], 16($s8[30])
      lw    $v0[2], 16($s8[30])
      sll   $a1[5], $a1[5], 0x1
      sll   $a0[4], $a0[4], 0x2
      sll   $v0[2], $v1[3], 0x1
      addu  $v0[2], $v0[2], $v1[3]
      lw    $v1[3], 16($s8[30])
      addu  $a0[4], $a0[4], $a2[6]
      addu  $a1[5], $a1[5], $v0[2]
      sw    $a1[5], 0($a0[4])
      slt   $v1[3], $v1[3], $a3[7]

  46. Binary-Level Variable Analysis • (1) Register values are analyzed using dataflow trees. • (2) When register values used for memory references are judged to be the same, the memory location is regarded as a virtual register. • (3) Using the virtual registers, steps (1) and (2) are repeated.

  47. Construction of Dataflow Tree (example code; #n marks the n-th definition of each register):
      addiu $29#1, $29#0, -8
      sw    $0, 0($29#1)
      addu  $5#1, $0, $0
      lw    $2#1, 0($29#1)
      addu  $3#1, $5#1, $4#0
      addiu $5#2, $5#1, 1
      addu  $2#2, $2#1, $3#1
      sw    $2#2, 0($29#1)
      slti  $2#3, $5#2, 100
      bne   $2#3, $0, L1

  48. $2#2 + 14 * $4#0 4 Example Normalization Seminar@UW-Madison

  49. Detection of Loop Induction Variables A loop induction variable is a register that • has an inter-iteration dependency, and • increases by a fixed value between iterations. The concept of the virtual register makes it possible to detect induction variables held in memory.

  50. Application • 101.tomcatv of the SPECfp95 benchmark • Fortran-to-C translator ver. 19940927 • GCC cross compiler ver. 2.7.2.3 for SIMCA • Data set: test • The six innermost loops (#1–#6) are selected • They have induction variables on memory
