
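ISCA-2000 Overseas Survey Report

ISCA-2000 Overseas Survey Report. Kenji Kise (吉瀬謙二), Graduate School of Information Systems, University of Electro-Communications, kis@is.uec.ac.jp. Conference overview. The 27th Annual International Symposium on Computer Architecture, Vancouver, Canada, June 10-14. 1 keynote, 1 panel, 29 regular papers (acceptance rate 17%), 444 attendees (13 from Japan, 4 of them from universities). http://www.cs.rochester.edu/~ISCA2k/. Papers covered.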


Presentation Transcript


  1. ISCA-2000 Overseas Survey Report • Kenji Kise (吉瀬謙二), Graduate School of Information Systems, University of Electro-Communications • kis@is.uec.ac.jp

  2. Conference overview • The 27th Annual International Symposium on Computer Architecture, Vancouver, Canada, June 10-14 • 1 keynote • 1 panel • 29 regular papers (acceptance rate 17%) • 444 attendees (13 from Japan, 4 of them from universities) • http://www.cs.rochester.edu/~ISCA2k/

  3. Papers covered • Multiple-Banked Register File Architecture • On the Value Locality of Store Instructions • Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance • Circuits for Wide-Window Superscalar Processor • Trace Preconstruction • A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots • A Fully Associative Software-Managed Cache Design • Performance Analysis of the Alpha 21264-based Compaq ES40 System • Piranha: A Scalable Architecture Based on Single Chip Multiprocessing

  4. Session 8 – Microarchitecture Innovations Multiple-Banked Register File Architecture Jose-Lorenzo Cruz et al. Universitat Politecnica de Catalunya, Spain ISCA-2000 p.316-325

  5. Register File • Organization of a register file [Figure labels: RegNo (5 bits), decoder, 32 registers (Reg0 to Reg31), tri-state buffer, read-out value (64 bits)] • Design parameters: the number of registers • the number of ports (Read, Write)

  6. Motivation • The register file access time is one of the critical delays • The access time of the register file depends on the number of registers and the number of ports • Instruction window -> registers • Issue width -> ports

  7. Research goals • Increase the number of register file ports • Get close to a register file that can be accessed in a single cycle [Figure labels: request, value, RegFile, machine cycle]

  8. Impact of Register File Architecture (SPECint95)

  9. Observation • A processor needs many physical registers, but only a very small number are actually required at a given moment. • Registers with no value • Value used by later instructions • Last-use and overwrite • Bypass only or never read

  10. Multiple-Banked Register File Architecture [Figure: (a) one-level organization with Bank 1 and Bank 2; (b) multi-level organization (register file cache) with Bank 1 and Bank 2 placed between the uppermost level and the lowest level]

  11. Register File Cache • The lowest level is always written. • Data is moved only from lower to upper level. • Cached in upper level based on heuristics. • There is a prefetch mechanism. [Figure labels: uppermost level, 16 registers; lowest level, 128 registers; Bank 1, Bank 2]

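  12. Caching and Fetching Policy • Locality properties of registers and memory are very different. • Non-bypass caching • Store in the upper level only those results that were not read through the bypass logic • Ready caching • Store in the upper level only those values needed by instructions that have not yet issued • Fetch-on-demand • Transfer a value to the upper level at the point it becomes needed • Prefetch-first-pair -> next slide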
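
The two caching policies on this slide can be read as a simple yes/no decision made when a result is produced. The following is a minimal sketch of that decision, following only the one-line descriptions above; the `RegResult` record, its field names, and the policy strings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RegResult:
    """Hypothetical record describing a freshly produced register value."""
    phys_reg: str
    read_from_bypass: bool   # a consumer already took the value off the bypass network
    waiting_consumers: int   # renamed but not yet issued instructions that source it


def cache_in_upper_level(result: RegResult, policy: str) -> bool:
    """Decide whether a result is also written to the small upper-level bank.

    The lowest level is always written; this only decides about the upper level.
    """
    if policy == "non-bypass":
        # Keep only results that nobody has read through the bypass logic.
        return not result.read_from_bypass
    if policy == "ready":
        # Keep only values still needed by instructions that have not issued.
        return result.waiting_consumers > 0
    raise ValueError(f"unknown policy: {policy}")


r = RegResult("P1", read_from_bypass=True, waiting_consumers=2)
print(cache_in_upper_level(r, "non-bypass"))  # False
print(cache_in_upper_level(r, "ready"))       # True
```

The same value can be treated differently by the two policies, which is why the slide evaluates them separately.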

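  13. Prefetch-first-pair • Instructions (1) to (3) have passed through the rename stage of the processor. • P1 to P8 are physical register numbers assigned by the hardware. • (1) P1 = P2 + P3 • (2) P4 = P3 + P6 • (3) P7 = P1 + P8 • Instruction (3) is the first instruction to use P1, the result register of instruction (1), so its other source register, P8, is prefetched. • P8 is prefetched when instruction (1) issues.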
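
The prefetch-first-pair rule on this slide can be sketched as a scan over the renamed instructions. This is an illustrative software model only, assuming two-source instructions held in program order; the `Instr` type and the function name are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Instr(NamedTuple):
    dst: str
    src1: str
    src2: str


def prefetch_first_pair(window: List[Instr], issuing: int) -> Optional[str]:
    """Return the register to prefetch when window[issuing] issues, or None."""
    produced = window[issuing].dst
    for later in window[issuing + 1:]:
        if produced in (later.src1, later.src2):
            # First consumer found: prefetch its other source operand.
            other = later.src2 if later.src1 == produced else later.src1
            return None if other == produced else other
    return None


window = [
    Instr("P1", "P2", "P3"),  # (1)
    Instr("P4", "P3", "P6"),  # (2)
    Instr("P7", "P1", "P8"),  # (3)
]
print(prefetch_first_pair(window, 0))  # P8
```

For the slide's example, issuing instruction (1) selects P8 because instruction (3) is the first consumer of P1.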

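  14. Evaluation results (conf. C3) • One-cycle single-banked • Area 18855, cycle 5.22 ns (191 MHz) • Read 4 port, Write 3 port • Two-cycle single-banked • Area 18855, cycle 2.61 ns (383 MHz) • Read 4 port, Write 3 port • Register file cache • Area 20529, cycle 2.61 ns (382 MHz) • Upper: Read 4 port, Write 4 port • Lower: Write 4 port, Bus 2

  15. Evaluation results

  16. Novelty • Proposes the register file cache • Enables high-speed operation • Brings the number of access cycles close to 1 by reducing the miss rate of the upper level • Proposes two caching policies and two fetching policies • Evaluates performance taking area and cycle time into account

  17. Comments • Demands: a huge register file, more ports, and a shorter access time • Even the register file, traditionally a simple structure, now needs a multi-level organization like a cache • Is growing complexity unavoidable in future large-scale ILP architectures?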
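
The clock rates on this slide follow from the cycle times. A short arithmetic sketch makes the relationship explicit; the function names are illustrative.

```python
def cycle_ns_to_mhz(cycle_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a clock frequency in MHz."""
    return 1000.0 / cycle_ns


def access_latency_ns(cycles: int, cycle_ns: float) -> float:
    """Total register file access latency for a multi-cycle access."""
    return cycles * cycle_ns


for name, cycles, cycle_ns in [
    ("one-cycle single-banked", 1, 5.22),
    ("two-cycle single-banked", 2, 2.61),
]:
    mhz = cycle_ns_to_mhz(cycle_ns)
    latency = access_latency_ns(cycles, cycle_ns)
    print(f"{name}: {mhz:.1f} MHz, access latency {latency:.2f} ns")
```

Both single-banked designs have the same 5.22 ns access latency. The two-cycle design spreads it over two cycles, which halves the cycle time. The slide lists 382 MHz for the register file cache at the same 2.61 ns cycle, 1 MHz below this calculation; the source does not explain the difference.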

  18. Session 5a – Analysis of Workloads and Systems On the Value Locality of Store Instructions Kevin M. Lepak et al. University of Wisconsin ISCA-2000 p.182-191

  19. Value Locality (Value Prediction) • Value locality • a program attribute that describes the likelihood of the recurrence of previously-seen program values • The result (data value) an instruction produced last time is correlated with the value it produces this time. • History of results for P1: ..., 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, ? • P1 = P2 + P3 • P4 = P3 + P6 • P7 = P1 + P8

  20. Research goals • Much published work has focused on load instruction outcomes. • Examine the implications of store value locality • Introduce the notion of silent stores • Introduce two new definitions of multiprocessor true sharing • Reduce multiprocessor data and address traffic
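
To make the history on this slide concrete, the sketch below replays it through two textbook value predictors, last-value and stride. Neither is claimed to be the predictor used in the paper; they only illustrate how the recurrence of values can be measured.

```python
from typing import List


def last_value_hits(history: List[int]) -> int:
    """Count predictions that are correct when predicting 'same as last time'."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev == cur)


def stride_hits(history: List[int]) -> int:
    """Count correct predictions of 'last value plus the last observed stride'."""
    hits = 0
    for a, b, c in zip(history, history[1:], history[2:]):
        if b + (b - a) == c:
            hits += 1
    return hits


history = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1]
print(last_value_hits(history))  # 0
print(stride_hits(history))      # 6
```

On this sequence the last-value predictor never hits, while the stride predictor is right 6 times out of 9 attempts and misses only around the wrap from 5 back to 1.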

  21. Memory-centric and producer-centric Locality • Program structure store value locality • The locality of values written by a particular static store. • Message-passing store value locality • The locality of values written to a particular address in data memory.

  22. 20%-62% of stores are silent stores • A silent store is a store that does not change the system state.

  23. Silent Store Removal Mechanism • Realistic Method • All previous store addresses must be known. • Load the data from the memory subsystem. • If the data values are equal, the store is update-silent. • Remove from the LSQ and flag the RUU entry • If the store is silent, it retires with no memory access.

  24. Evaluation Results • Writeback Reduction • Reduction ranges from 0% to 81% • Average 33% reduction • Instruction Throughput • Speedups of 6.3% and 6.9% for realistic and perfect removal mechanisms.
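
The load-and-compare step of the removal mechanism can be sketched in software. This is a functional model only, assuming memory is a Python dict keyed by address; the LSQ and RUU handling from the slide is not modeled, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def execute_store(memory: Dict[int, int], addr: int, value: int) -> bool:
    """Perform a store; return True if it was silent and could be squashed."""
    if memory.get(addr) == value:   # load the old data and compare
        return True                 # silent: memory state would not change
    memory[addr] = value
    return False


def silent_fraction(trace: Iterable[Tuple[int, int]]) -> float:
    """Fraction of stores in an (addr, value) trace that are silent."""
    memory: Dict[int, int] = {}
    flags = [execute_store(memory, addr, value) for addr, value in trace]
    return sum(flags) / len(flags) if flags else 0.0


trace = [(0x10, 1), (0x10, 1), (0x14, 0), (0x10, 2), (0x10, 2)]
print(silent_fraction(trace))  # 0.4
```

In this model a first write to an address is never counted as silent, because the dict has no prior value for it; real hardware would compare against whatever memory already holds.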

  25. New Definition of False Sharing • Multiprocessor applications • All of the previous definitions rely on the specific addresses in the same block. • No attempt is made to determine when the invalidation of a block is unnecessary because the value stored in the line does not change. • Silent stores and stochastically silent stores

  26. Address-based Definition of Sharing [Dubois 1993] • Cold Miss • The first miss to a given block by a processor • Essential Miss • A cold miss is an essential miss • If during the lifetime of a block, the processor accesses a value defined by another processor since the last essential miss to that block, it is an essential miss. • Pure True Sharing miss(PTS) • An essential miss that is not cold. • Pure False Sharing miss (PFS) • A non-essential miss.

  27. Update-based False Sharing (UFS) • Essential Miss • A cold miss is an essential miss • If during the lifetime of a block, the processor accesses an address which has had a different data value defined by another processor since the last essential miss to that block, it is an essential miss.
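
The difference between the address-based definition and the update-based definition can be illustrated with a small classifier. This is a simplified sketch under stated assumptions: each remote write is recorded with the value it overwrote, and a write counts as changing the data only if the new value differs from the old one. The `RemoteWrite` type and the function are hypothetical and do not reproduce the paper's formal definitions.

```python
from typing import Iterable, NamedTuple, Set


class RemoteWrite(NamedTuple):
    """A write by another processor since the last essential miss to the block."""
    addr: int
    old_value: int
    new_value: int


def is_essential_miss(cold: bool, accessed: Set[int],
                      remote_writes: Iterable[RemoteWrite],
                      update_based: bool) -> bool:
    """Classify a miss; 'accessed' holds addresses touched during the block lifetime."""
    if cold:
        return True  # a cold miss is always essential
    for w in remote_writes:
        if w.addr not in accessed:
            continue  # the processor never touches this word
        if not update_based or w.new_value != w.old_value:
            return True
    return False


silent = [RemoteWrite(0x100, 7, 7)]
print(is_essential_miss(False, {0x100}, silent, update_based=False))  # True
print(is_essential_miss(False, {0x100}, silent, update_based=True))   # False
```

A remote silent store to an accessed word makes the miss essential under the address-based definition but not under the update-based one, which is the extra false sharing that UFS exposes.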

  28. Stochastic False Sharing (SFS) • It seems intuitive that if we define false sharing to compensate for the effect of silent stores, we could also define it in the presence of stochastically silent stores (values that are trivially predictable via some mechanism)

  29. Novelty • Overall characterization of store value locality • Notion of silent stores • Uniprocessor speedup by squashing silent stores • Definition of UFS and SFS • How to exploit UFS to reduce address and data bus traffic on shared memory multiprocessors

  30. Comments • Brings together a range of topics on the value locality of store instructions. • The evaluation uses a preliminary configuration and serves as motivation for future research. • Exploiting this locality on parallel machines needs more detailed study.

  31. Session 2a – Exploiting Traces Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance Ryan Rakvic et al. Carnegie Mellon University ISCA-2000 p.47-58

  32. Branch Prediction and Multiple Branch Prediction [Figure: control flow graph of basic blocks connected by taken and not-taken edges] • Branch prediction: T or N? • Multiple branch prediction: (T or N) (T or N) (T or N)?

  33. Motivation and goals • Wide Instruction Fetching • 4-way -> 8-way • Multiple branch prediction • Branch Execution Rate: about 15% • One branch prediction per cycle is not enough. • Tree-Based Multiple Branch Predictor (TMP)

  34. Example of a branch predictor in use • gshare • Two-bit bimodal branch predictor [Figure: the global history of recent branches (B-3, B-2, B-1) contributes to the index into a table of tags and two-bit counters (2BC) that predicts B0; state diagram of the two-bit counter with states 00, 01, 10, 11 and taken / not-taken transitions]
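
For readers unfamiliar with gshare, the following is a minimal sketch of the general technique: the branch address is XORed with a global history register to index a table of two-bit saturating counters. It assumes the common convention that states 10 and 11 predict taken, omits the tags shown on the slide, and uses an arbitrary table size; it is not the configuration used in the paper.

```python
class Gshare:
    """Untagged gshare predictor with two-bit saturating counters."""

    def __init__(self, index_bits: int = 10) -> None:
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # start at 01, weakly not taken
        self.history = 0                      # global history, newest outcome in bit 0

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2  # 10 and 11 predict taken

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(index_bits=4)
for _ in range(20):
    predictor.update(0x40, True)   # a branch that is always taken
print(predictor.predict(0x40))     # True
```

This makes one prediction per lookup, which is the limitation the tree-based predictor on the following slides addresses.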

  35. Tree-Based Multiple Branch Predictor (TMP) [Figure: the global history of B-3 (taken), B-2 (not taken), and B-1 (not taken), written T N N, selects an entry Tree(i) in the Tree-based Pattern History Table (Tree-PHT); the entry gives a predicted path, TTNT in the figure, through blocks B0 to B4]

  36. Tree-based Pattern History Table (Tree-PHT) [Figure: each Tree-PHT entry holds a tag, a predicted path (NTNT in the figure), and a tree whose nodes are two-bit bimodal counters with states 00, 01, 10, 11]

  37. Updating of 2-bit bimodal tree-node • Old predicted path: NTNT • Recently completed path: TNNN • New predicted path: TNTT [Figure: two-bit counter values at the tree nodes before and after the update]

  38. Tree-PHT with second level PHT (Node-PHT) for tree-node prediction [Figure labels: Tree-PHT entry with tag, predicted path, and tree; n bits of local history per tree node (01...1); Node Pattern History Table (Node-PHT) of 2-bit bimodal counters] • global (g) • per-branch (p) • shared (s)
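
The path update on this slide can be reproduced with a small model of one tree. The sketch below assumes each node holds a two-bit saturating counter (values 2 and 3 predict taken), that only nodes along the completed path are updated, and that the predicted path is re-read from the root afterwards. The individual counter values are hypothetical, chosen so that the paths match the slide, because the values in the original figure are not recoverable.

```python
from typing import Dict


def predicted_path(tree: Dict[str, int], depth: int = 4) -> str:
    """Walk from the root, following each node's two-bit counter."""
    path = ""
    for _ in range(depth):
        path += "T" if tree.get(path, 1) >= 2 else "N"
    return path


def update_tree(tree: Dict[str, int], completed: str) -> None:
    """Update only the nodes along the completed path with the actual outcomes."""
    for i, outcome in enumerate(completed):
        node = completed[:i]          # a node is named by the path leading to it
        counter = tree.get(node, 1)
        tree[node] = min(3, counter + 1) if outcome == "T" else max(0, counter - 1)


# Hypothetical counter values; "" is the root node.
tree = {"": 1, "N": 3, "NT": 0, "NTN": 2, "T": 2, "TN": 3, "TNT": 3, "TNN": 1}
print(predicted_path(tree))   # NTNT
update_tree(tree, "TNNN")
print(predicted_path(tree))   # TNTT
```

The new predicted path matches neither the old prediction nor the completed path, because each node keeps its own hysteresis.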

  39. Novelty and evaluation results • Proposes the TMP • Three-level branch predictor • Maintain a tree structure • Completion time update • TMPs-best (shared) • The number of entries in the Tree-PHT: 2K • Local history bit: 6 • 72KB Memory • 96%: 1 block • 93%: 2 consecutive blocks • 87%: 3 consecutive blocks • 82%: 4 consecutive blocks

  40. Comments • Proposes a three-level prediction mechanism to predict the targets of multiple branch instructions per cycle • Branch prediction becomes even more complex, but • it delivers a steady performance improvement

  41. Session 6 – Circuit Considerations Circuits for Wide-Window Superscalar Processor Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh and Rahul Sami ISCA-2000 p.236-247

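  42. Instruction Window • Out-of-order superscalar processors and the instruction window [Figure: instruction supply feeds instructions into the instruction window; each window entry holds Op, Src1-Tag, Valid, Src2-Tag, Valid; execution sends results (tag, value) back to the window; example instruction P1 = P2 + P3] • Wake-up • Schedule

  43. Motivation • Enlarging the instruction window increases the opportunity to exploit instruction-level parallelism • The window size of the Alpha 21264 is 35 • The window size of the MIPS R10000 is ?? • Building a large instruction window is difficult • Power4, two 4-issue processors • Intel Itanium, VLIW techniques [Figure labels: executed instruction stream, instruction window]

  44. Research goals • Realize a large (128-entry) instruction window that operates at high speed • Propose a log-depth cyclic segmented prefix (CSP) circuit • Discuss the relationship between the log-depth CSP circuit and cycle time • Discuss the performance gain from a large instruction window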

  45. Gate-delay cyclic segmented prefix (CSP) • A 4-entry wrap-around reordering buffer with an adjacent, linear gate-delay cyclic segmented prefix. [Figure: entries 0 to 3, each with input in i, segment bit s i, and output out i; Tail marked at entry 0, Head at entry 1]

  46. Commit logic using CSP [Figure: 4-entry buffer with Head and Tail markers; in 0, in 1, and in 2 are done and in 3 is not done; out 0, out 1, and out 2 are done and out 3 is not done]

  47. Wake-up logic for logical register R5 [Figure: buffer entries in slot order: entry 0 (Tail) D: R4=R5+R7, not done; entry 1 (Head) A: R5=R8+R1, done; entry 2 B: R1=R5+R1, not done; entry 3 C: R5=R3+R3, not done]

  48. Scheduler logic scheduling two FUs [Figure: entries D: R4=R5+R7 (Tail), A: R5=R8+R1 (Head), B: R1=R5+R1, C: R5=R3+R3, with request signals combined by a chain of adders] • Logarithmic gate-delay implementations • See pp. 240-241
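
A cyclic segmented prefix can be described functionally before looking at circuits. The reference model below is a sketch under stated assumptions: it uses an inclusive convention, in which output i combines the inputs from the nearest segment start at or before entry i (wrapping around the buffer) up to and including entry i. The paper's circuit may draw that boundary differently, and this loop says nothing about gate delay. The function name is hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cyclic_segmented_prefix(inputs: Sequence[T], seg: Sequence[bool],
                            op: Callable[[T, T], T]) -> List[T]:
    """Reference model of an inclusive cyclic segmented prefix (O(n^2) scan)."""
    n = len(inputs)
    if not any(seg):
        raise ValueError("at least one segment bit must be set")
    out = []
    for i in range(n):
        k = i
        while not seg[k]:            # walk back, wrapping, to the segment start
            k = (k - 1) % n
        acc = inputs[k]
        while k != i:                # combine forward up to and including entry i
            k = (k + 1) % n
            acc = op(acc, inputs[k])
        out.append(acc)
    return out


# Commit use: AND the 'done' bits starting at the oldest entry (assumed at index 1).
done = [True, True, True, False]
oldest = [False, True, False, False]
print(cyclic_segmented_prefix(done, oldest, lambda a, b: a and b))
# [False, True, True, False]
```

With the oldest instruction assumed to sit in entry 1, entries 1 and 2 may commit, while entry 0 must wait behind the unfinished entry 3 even though entry 0 itself is done.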
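
The wake-up question for one logical register is whether the nearest older writer of that register has finished. The sketch below models this for the four instructions on the slide, assuming that A in entry 1 is the oldest instruction, that the window is full, and that a value with no writer in the window is already available. It is a sequential software model, not the prefix circuit itself, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, Tuple[str, str], bool]  # (destination, sources, done)


def source_ready(window: List[Entry], oldest: int, reg: str) -> Dict[int, bool]:
    """For each entry that reads 'reg', report whether its value is available."""
    n = len(window)
    ready: Dict[int, bool] = {}
    writer_done = True   # no older writer in the window: the value is available
    for step in range(n):
        slot = (oldest + step) % n          # visit entries in program order
        dst, srcs, done = window[slot]
        if reg in srcs:
            ready[slot] = writer_done
        if dst == reg:
            writer_done = done              # this entry becomes the latest writer
    return ready


window = [
    ("R4", ("R5", "R7"), False),  # entry 0, D
    ("R5", ("R8", "R1"), True),   # entry 1, A (assumed oldest)
    ("R1", ("R5", "R1"), False),  # entry 2, B
    ("R5", ("R3", "R3"), False),  # entry 3, C
]
print(source_ready(window, oldest=1, reg="R5"))  # {2: True, 0: False}
```

B can take R5 from A, which is done. D must wait for C, the later writer of R5, even though A has finished.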
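
Scheduling onto a fixed number of functional units can be viewed as a running count of requests taken in age order. The sketch below assumes oldest-first priority starting at entry 1 and a hypothetical set of requesting entries; it is a sequential model of the counting idea, not the logarithmic-delay circuit from the paper.

```python
from typing import List


def schedule(requests: List[bool], oldest: int, num_fus: int = 2) -> List[bool]:
    """Grant functional units to the oldest requesting entries."""
    n = len(requests)
    grants = [False] * n
    older_requests = 0                       # running count of older requesters
    for step in range(n):
        slot = (oldest + step) % n           # visit entries from oldest to youngest
        if requests[slot]:
            if older_requests < num_fus:
                grants[slot] = True
            older_requests += 1
    return grants


# Hypothetical request pattern: entries 0 (D), 1 (A), and 3 (C) request an FU.
requests = [True, True, False, True]
print(schedule(requests, oldest=1))  # [False, True, False, True]
```

With two functional units, the two oldest requesters A and C are granted, and D waits for a later cycle.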

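  49. Evaluation results • Designed a 128-entry instruction window • Commit logic: 1.41 ns (709 MHz) • Wakeup logic: 1.54 ns (649 MHz) • Schedule logic: 1.69 ns (591 MHz) • Achieves an operating speed above 500 MHz using current process technology

  50. Comments • Demonstrates the feasibility of a 128-entry instruction window. • Increasing the number of instruction window entries had been considered difficult. • Interesting in that it overturns this assumption.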
