ISCA-2000 Overseas Survey Report

ISCA-2000 Overseas Survey Report
Kenji Kise, Graduate School of Information Systems, The University of Electro-Communications
kis@is.uec.ac.jp

Conference Overview
• The 27th Annual International Symposium on Computer Architecture, Vancouver, Canada, June 10-14
• 1 keynote
• 1 panel
• 29 regular papers (acceptance rate 17%)
• 444 attendees (13 from Japan, 4 of them from universities)
• http://www.cs.rochester.edu/~ISCA2k/

Papers Covered
• Multiple-Banked Register File Architecture
• On the Value Locality of Store Instructions
• Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance
• Circuits for Wide-Window Superscalar Processors
• Trace Preconstruction
• A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots
• A Fully Associative Software-Managed Cache Design
• Performance Analysis of the Alpha 21264-based Compaq ES40 System
• Piranha: A Scalable Architecture Based on Single Chip Multiprocessing

Session 8 – Microarchitecture Innovations
Multiple-Banked Register File Architecture
Jose-Lorenzo Cruz et al., Universitat Politecnica de Catalunya, Spain
ISCA-2000, pp. 316-325

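Register File
Structure of a register file
(Figure: 32 registers, Reg0 to Reg31, each holding a 64-bit value. A 5-bit register number, RegNo, feeds a decoder, and tri-state buffers drive the selected value onto the read port, Read out1.)
Key parameters:
• the number of registers
• the number of ports (read, write)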

Motivation
• The register file access time is one of the critical delays.
• The access time of the register file depends on the number of registers and the number of ports.
  • Instruction window size -> number of registers
  • Issue width -> number of ports

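Research Goals
• Increase the number of register file ports.
• Come close to a register file that can be accessed in a single cycle.
(Figure: timing of a request and the returned value for register file accesses, relative to the machine cycle.)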

Impact of Register File Architecture (SPECint95)

Observation
• A processor needs many physical registers, but only a very small number are actually required at any given moment.
  • Registers with no value
  • Value used by later instructions
  • Last-use and overwrite
  • Bypass only or never read

Multiple-Banked Register File Architecture
(Figure: two organizations of Bank 1 and Bank 2: (a) one-level; (b) multi-level, i.e. a register file cache with an uppermost level and a lowest level.)

Register File Cache
• The lowest level is always written.
• Data is moved only from the lower to the upper level.
• Values are cached in the upper level based on heuristics.
• There is a prefetch mechanism.
(Figure: uppermost level, 16 registers; lowest level, 128 registers.)

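Caching and Fetching Policy
The locality properties of registers and memory are very different.
Caching policies:
• Non-bypass caching: only results that were not read from the bypass logic are stored in the upper level.
• Ready caching: only values needed by instructions that have not yet issued are stored in the upper level.
Fetching policies:
• Fetch-on-demand: a value is transferred to the upper level at the moment it is needed.
• Prefetch-first-pair (see the next slide)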
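The interaction of non-bypass caching and fetch-on-demand can be sketched with a toy two-level model. This is only an illustration under assumptions that are not in the paper summary: the class and method names are hypothetical, the upper level evicts in FIFO order, and a hit is charged 1 cycle while a miss is charged 2.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy two-level register file: small upper level, full lower level."""

    def __init__(self, upper_size=16):
        self.upper_size = upper_size
        self.lower = {}             # lowest level: every result is written here
        self.upper = OrderedDict()  # uppermost level: FIFO eviction (assumed)
        self.misses = 0

    def _install(self, reg, value):
        # Make room first if the register is new to a full upper level.
        if reg not in self.upper and len(self.upper) >= self.upper_size:
            self.upper.popitem(last=False)
        self.upper[reg] = value

    def write(self, reg, value, read_from_bypass):
        """Write a result; non-bypass caching decides about the upper level."""
        self.lower[reg] = value
        self.upper.pop(reg, None)   # never keep a stale upper copy
        if not read_from_bypass:
            self._install(reg, value)

    def read(self, reg):
        """Return (value, cycles); a miss triggers fetch-on-demand."""
        if reg in self.upper:
            return self.upper[reg], 1
        self.misses += 1
        value = self.lower[reg]
        self._install(reg, value)   # data moves only from lower to upper
        return value, 2
```

A value consumed through the bypass network is written only to the lower level, so a later read of it misses once and is then served from the upper level.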

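Prefetch-first-pair
• Instructions (1) to (3) have already passed through the rename stage of the processor.
• P1 to P8 are physical register numbers assigned by the hardware.
(1) P1 = P2 + P3
(2) P4 = P3 + P6
(3) P7 = P1 + P8
Instruction (3) is the first instruction to use P1, the result register of instruction (1), so its other source register, P8, is prefetched. The prefetch of P8 is started when instruction (1) issues.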
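The selection rule of prefetch-first-pair can be sketched as follows. The function name and the (dest, src1, src2) tuple layout are hypothetical choices for illustration; the sketch only picks the register to prefetch and does not model the transfer itself.

```python
def prefetch_first_pair(instructions, issuing_index):
    """Return the register to prefetch when instructions[issuing_index] issues.

    Each instruction is a (dest, src1, src2) tuple of physical register names.
    Returns None if no later instruction in the window reads the result.
    """
    dest = instructions[issuing_index][0]
    for _, src1, src2 in instructions[issuing_index + 1:]:
        if dest in (src1, src2):
            # First consumer found: prefetch its other source operand.
            other = src2 if src1 == dest else src1
            return None if other == dest else other
    return None


# The three renamed instructions from the slide.
window = [("P1", "P2", "P3"), ("P4", "P3", "P6"), ("P7", "P1", "P8")]
to_prefetch = prefetch_first_pair(window, 0)
```

For the slide's example, issuing instruction (1) selects P8, because instruction (3) is the first consumer of P1.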

Evaluation Results (conf. C3)
• One-cycle single-banked
  • Area 18855, cycle 5.22 ns (191 MHz)
  • 4 read ports, 3 write ports
• Two-cycle single-banked
  • Area 18855, cycle 2.61 ns (383 MHz)
  • 4 read ports, 3 write ports
• Register file cache
  • Area 20529, cycle 2.61 ns (382 MHz)
  • Upper level: 4 read ports, 4 write ports
  • Lower level: 4 write ports, 2 buses

Evaluation Results

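Novelty
• Proposal of the register file cache
  • Allows high-speed operation.
  • Reducing the miss rate of the upper level brings the number of access cycles close to 1.
• Proposal of two caching policies and two fetching policies
• Performance evaluation that takes area and cycle time into account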

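Comments
• The requirements are a huge register file, more ports, and a shorter access time.
• Even the register file, which used to have a simple organization, now needs a multi-level organization like a cache.
• Is growing complexity unavoidable in future large-scale ILP architectures?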

Session 5a – Analysis of Workloads and Systems
On the Value Locality of Store Instructions
Kevin M. Lepak et al., University of Wisconsin
ISCA-2000, pp. 182-191

Value Locality (Value Prediction)
• Value locality
  • A program attribute that describes the likelihood of the recurrence of previously seen program values.
  • The data value that an instruction produces this time is correlated with the value it produced last time.
History of the results written to P1: ..., 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, ?
• P1 = P2 + P3
• P4 = P3 + P6
• P7 = P1 + P8
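One way to exploit such a history is a context-based predictor: remember which value followed the most recent values last time. The sketch below is a generic illustration, not a mechanism taken from the paper; the function name and the `order` parameter are assumptions.

```python
def predict_next(history, order=1):
    """Predict the next value from what followed the same context last time.

    `order` is the number of most recent values used as the context.
    Returns None when the context has never been seen before.
    """
    if len(history) <= order:
        return None
    table = {}
    for i in range(order, len(history)):
        context = tuple(history[i - order:i])
        table[context] = history[i]  # remember what followed this context
    return table.get(tuple(history[-order:]))


# The result history of P1 shown on the slide.
p1_history = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1]
guess = predict_next(p1_history)
```

For the slide's history the predictor answers 2, because the value 1 was followed by 2 the last time it appeared. A predictor that simply repeats the last value would answer 1 here.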

Research Goals
• Many publications have focused on load instruction outcomes.
• Examine the implications of store value locality.
• Introduce the notion of silent stores.
• Introduce two new definitions of multiprocessor true sharing.
• Reduce multiprocessor data and address traffic.

Memory-centric and Producer-centric Locality
• Program structure store value locality
  • The locality of values written by a particular static store.
• Message-passing store value locality
  • The locality of values written to a particular address in data memory.

20%-62% of stores are silent stores
A silent store is a store that does not change the system state.

Silent Store Removal Mechanism
• Realistic method
  • All previous store addresses must be known.
  • Load the data from the memory subsystem.
  • If the data values are equal, the store is update-silent.
  • Remove it from the LSQ and flag the RUU entry.
  • If the store is silent, it retires with no memory access.
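The compare-before-write step can be sketched as a software model. The function below is hypothetical and models only the value comparison on a dictionary standing in for memory; the LSQ, the RUU and all timing are left out.

```python
def retire_stores(memory, stores):
    """Apply (address, value) stores in order, squashing update-silent ones.

    `memory` is a dict that is updated in place.
    Returns the number of stores that were squashed.
    """
    squashed = 0
    for address, value in stores:
        if memory.get(address) == value:  # load the old value and compare
            squashed += 1                 # silent: retire without writing
        else:
            memory[address] = value
    return squashed


memory = {0x10: 5}
trace = [(0x10, 5), (0x10, 7), (0x10, 7), (0x20, 0)]
squashed = retire_stores(memory, trace)
```

In this trace the first and third stores rewrite the value already in memory, so they are squashed; the other two change the state and are performed.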

Evaluation Results
• Writeback reduction
  • The reduction ranges from 0% to 81%.
  • Average reduction: 33%
• Instruction throughput
  • Speedups of 6.3% and 6.9% for the realistic and perfect removal mechanisms, respectively.

New Definition of False Sharing
• Multiprocessor applications
  • All of the previous definitions rely on the specific addresses in the same block.
  • No attempt is made to determine when the invalidation of a block is unnecessary because the value stored in the line does not change.
• Silent stores and stochastically silent stores

Address-based Definition of Sharing [Dubois 1993]
• Cold miss
  • The first miss to a given block by a processor.
• Essential miss
  • A cold miss is an essential miss.
  • If, during the lifetime of a block, the processor accesses a value defined by another processor since the last essential miss to that block, the miss is an essential miss.
• Pure true sharing miss (PTS)
  • An essential miss that is not cold.
• Pure false sharing miss (PFS)
  • A non-essential miss.

Update-based False Sharing (UFS)
• Essential miss
  • A cold miss is an essential miss.
  • If, during the lifetime of a block, the processor accesses an address which has had a different data value defined by another processor since the last essential miss to that block, the miss is an essential miss.
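The effect of ignoring stores that do not change a value can be illustrated with a much-simplified coherence model. This is not the miss classification defined above: the sketch assumes a write-invalidate protocol, tracks sharers per address instead of per block, and uses a hypothetical trace format of (processor, operation, address, value).

```python
def count_misses(trace, update_based):
    """Count misses in a toy write-invalidate model.

    With update_based=True, a store that leaves the value unchanged does
    not invalidate the copies held by other processors.
    """
    memory = {}
    sharers = {}
    misses = 0
    for proc, op, addr, value in trace:
        holders = sharers.setdefault(addr, set())
        if proc not in holders:  # no valid copy: miss and fetch
            misses += 1
            holders.add(proc)
        if op == "store":
            silent = addr in memory and memory[addr] == value
            if not (update_based and silent):
                memory[addr] = value
                sharers[addr] = {proc}  # invalidate all other copies
    return misses


trace = [
    ("P0", "store", 0x100, 1),
    ("P1", "load", 0x100, None),
    ("P0", "store", 0x100, 1),  # rewrites the same value
    ("P1", "load", 0x100, None),
]
address_based = count_misses(trace, update_based=False)
value_based = count_misses(trace, update_based=True)
```

With address-based invalidation the second load by P1 misses again; when the silent store is ignored, P1 keeps its copy and only the two cold misses remain.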

Stochastic False Sharing (SFS)
• It seems intuitive that if we define false sharing to compensate for the effect of silent stores, we could also define it in the presence of stochastically silent stores (values that are trivially predictable via some mechanism).

Novelty
• Overall characterization of store value locality
• Notion of silent stores
• Uniprocessor speedup by squashing silent stores
• Definition of UFS and SFS
• How to exploit UFS to reduce address and data bus traffic on shared-memory multiprocessors

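Comments
• The paper brings together a range of observations on the value locality of store instructions.
• The evaluation uses a preliminary configuration and serves as motivation for future research.
• Exploiting this locality in parallel machines needs more detailed study.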

Session 2a – Exploiting Traces
Completion Time Multiple Branch Prediction for Enhancing Trace Cache Performance
Ryan Rakvic et al., Carnegie Mellon University
ISCA-2000, pp. 47-58

Branch Prediction and Multiple Branch Prediction
(Figure: control flow graph of basic blocks connected by taken and not-taken edges.)
• Branch prediction: T or N?
• Multiple branch prediction: (T or N) (T or N) (T or N)?

Motivation and Goals
• Wide instruction fetching
  • 4-way -> 8-way
• Multiple branch prediction
  • Branch execution rate: about 15%
  • One branch prediction per cycle is not enough.
• Tree-Based Multiple Branch Predictor (TMP)
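A back-of-the-envelope calculation shows why one prediction per cycle stops being enough at 8-way fetch. It assumes, as a simplification, that about 15% of fetched instructions are branches and that they are spread evenly; the function name is hypothetical.

```python
def expected_branches_per_cycle(fetch_width, branch_rate=0.15):
    """Average number of branches in one fetch group of `fetch_width`."""
    return fetch_width * branch_rate


four_way = expected_branches_per_cycle(4)
eight_way = expected_branches_per_cycle(8)
```

Under these assumptions a 4-way fetch group holds about 0.6 branches on average, while an 8-way group holds about 1.2, which is more than one prediction per cycle can cover.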

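Example of a Branch Predictor in Use: gshare
(Figure: the global history, i.e. the taken / not-taken outcomes of the preceding branches B-3, B-2 and B-1, is used together with the current branch B0 to form an index into a table of tags and two-bit bimodal counters (2BC). The counter states 00 and 01 predict not taken; 10 and 11 predict taken.)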
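A gshare predictor can be sketched in a few lines. The sketch follows the common textbook formulation (branch address XOR global history indexes a table of two-bit saturating counters) and omits the tag field shown in the figure; the class name, table size and initial counter value are assumptions.

```python
class Gshare:
    """Minimal gshare: (pc XOR global history) indexes 2-bit counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start at weakly not-taken
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """True means predict taken (counter state 10 or 11)."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the history changes the index, an always-taken branch needs a few updates before the history settles and one counter is trained consistently.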

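Tree-Based Multiple Branch Predictor (TMP)
(Figure: the global history of the preceding branches B-3, B-2 and B-1 and the current branch B0 select an entry, Tree(i), of the Tree-based Pattern History Table (Tree-PHT). The selected tree gives the predicted path through the following branches B0 to B4, for example TTNT.)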

Tree-based Pattern History Table (Tree-PHT)
(Figure: each Tree-PHT entry holds a tag, a predicted path such as NTNT, and a tree whose nodes are two-bit bimodal branch predictors with states 00, 01, 10 and 11.)

Updating of 2-bit Bimodal Tree Nodes
• Old predicted path: NTNT
• Recently completed path: TNNN
• New predicted path: TNTT
(Figure: the tree of two-bit counters before and after the update.)
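The update can be sketched with a complete binary tree of two-bit counters stored as an array. The class is hypothetical, and the counter values in the example are assumptions chosen so that the tree reproduces the old and new predicted paths on the slide; the slide's actual counter values could not be recovered.

```python
class BranchTree:
    """Binary tree of 2-bit counters; node i has children 2i+1 (N), 2i+2 (T)."""

    def __init__(self, depth=4, initial=1):
        self.depth = depth
        self.counters = [initial] * ((1 << depth) - 1)

    def predicted_path(self):
        """Walk from the root, following each node's current prediction."""
        path, node = "", 0
        for _ in range(self.depth):
            taken = self.counters[node] >= 2
            path += "T" if taken else "N"
            node = 2 * node + (2 if taken else 1)
        return path

    def update(self, completed_path):
        """Train only the nodes that lie on the completed path."""
        node = 0
        for outcome in completed_path:
            taken = outcome == "T"
            c = self.counters[node]
            self.counters[node] = min(3, c + 1) if taken else max(0, c - 1)
            node = 2 * node + (2 if taken else 1)


tree = BranchTree(depth=4)
# Assumed counter values (not from the paper) giving the old path NTNT.
for node, value in {0: 1, 1: 3, 4: 0, 9: 3, 2: 1, 5: 3, 12: 3}.items():
    tree.counters[node] = value
old_path = tree.predicted_path()
tree.update("TNNN")
new_path = tree.predicted_path()
```

With these values the predicted path changes from NTNT to TNTT after TNNN completes. The new path differs from the completed one because the third node on the path was strongly taken and only weakens by one step.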

Tree-PHT with a Second-level PHT (Node-PHT) for Tree-node Prediction
(Figure: a Tree-PHT entry holds a tag, a predicted path and a tree. Each tree node keeps n bits of local history, for example 01...1, which index a Node Pattern History Table (Node-PHT) of 2-bit bimodal counters.)
Node-PHT variants:
• global (g)
• per-branch (p)
• shared (s)

Novelty and Evaluation Results
• Proposal of TMP
  • Three-level branch predictor
  • Maintains a tree structure
  • Completion-time update
• TMPs-best (shared)
  • Number of entries in the Tree-PHT: 2K
  • Local history bits: 6
  • 72KB of memory
  • 96%: 1 block
  • 93%: 2 consecutive blocks
  • 87%: 3 consecutive blocks
  • 82%: 4 consecutive blocks

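Comments
• Proposes a three-level prediction mechanism in order to predict the targets of multiple branch instructions per cycle.
• Branch prediction becomes even more complex, but the performance gain is solid.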

Session 6 – Circuit Considerations
Circuits for Wide-Window Superscalar Processors
Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh and Rahul Sami
ISCA-2000, pp. 236-247

Instruction Window
Out-of-order superscalar processors and the instruction window
(Figure: instruction supply feeds the instruction window, which sends instructions and data to execution. Each window entry holds Src1-Tag, Valid, Src2-Tag, Valid and Op. Execution results, as (tag, value) pairs, are sent back to the window. Example instruction: P1 = P2 + P3.)
• Wake-up
• Schedule
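The wake-up step can be sketched as a tag broadcast over the window entries. The entry fields follow the figure (two source tags, two valid bits and an operation); the class and function names are hypothetical, and the selection of which ready entries actually issue is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    op: str
    src1: str
    src2: str
    valid1: bool = False
    valid2: bool = False

    def ready(self):
        """An entry may be scheduled once both source operands are valid."""
        return self.valid1 and self.valid2


def wake_up(window, result_tag):
    """Broadcast a result tag; return the entries that are now ready."""
    for entry in window:
        if entry.src1 == result_tag:
            entry.valid1 = True
        if entry.src2 == result_tag:
            entry.valid2 = True
    return [entry for entry in window if entry.ready()]


# P7 = P1 + P8 waits for P1; P8 is already available.
consumer = Entry("add", "P1", "P8", valid2=True)
ready_now = wake_up([consumer], "P1")
```

Every entry compares every broadcast tag against both of its source tags, which is why this logic becomes harder to build as the window grows.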

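Motivation
• Enlarging the instruction window increases the opportunity to exploit instruction-level parallelism.
  • The window size of the Alpha 21264 is 35.
  • The window size of the MIPS R10000 is ??
• Building a large instruction window is difficult.
  • Power4, two 4-issue processors
  • Intel Itanium, VLIW techniques
(Figure: the executed instruction stream and the instruction window.)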

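Research Goals
• Realize a large (128-entry) instruction window that runs at high speed.
• Propose a log-depth cyclic segmented prefix (CSP) circuit.
• Discuss the relation between the log-depth cyclic segmented prefix circuit and cycle time.
• Discuss the performance gain from a large instruction window.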

Gate-delay Cyclic Segmented Prefix (CSP)
A 4-entry wrap-around reordering buffer with adjacent, linear gate-delay cyclic segmented prefix.
(Figure: entries 0 to 3, each with an input in i, a segment bit s i and an output out i; Head and Tail pointers mark the occupied part of the buffer.)

Commit Logic Using CSP
(Figure: the 4-entry buffer with Head and Tail pointers. Each entry feeds its done / not done status into in i, and out i carries the combined status of the older entries.)
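The commit logic can be sketched with a sequential model of a cyclic segmented prefix. The recurrence used here, out[i+1] = in[i] if s[i] else combine(out[i], in[i]) around the ring, is an assumed reading of the linear circuit, and the function names are hypothetical. The sketch assumes a full window, sets the only segment bit at the head, and treats the head entry specially.

```python
def cyclic_segmented_prefix(inputs, segment, combine):
    """Sequential model of a cyclic segmented prefix over a ring.

    out[i+1] = inputs[i] if segment[i] else combine(out[i], inputs[i]),
    with indices taken modulo the ring size.
    """
    n = len(inputs)
    starts = [i for i in range(n) if segment[i]]
    if not starts:
        raise ValueError("at least one segment bit must be set")
    out = [None] * n
    carry = inputs[starts[0]]
    for step in range(1, n + 1):
        j = (starts[0] + step) % n
        out[j] = carry
        carry = inputs[j] if segment[j] else combine(carry, inputs[j])
    return out


def commit_ready(done, head):
    """An entry may commit if it and all older entries are done."""
    n = len(done)
    segment = [i == head for i in range(n)]
    older_done = cyclic_segmented_prefix(done, segment, lambda a, b: a and b)
    return [done[i] and (i == head or older_done[i]) for i in range(n)]


# Head at entry 1; entries 1 and 2 are done, entry 3 is not, entry 0 is done.
can_commit = commit_ready([True, True, True, False], head=1)
```

Entry 0 is done but is younger than the unfinished entry 3, so only entries 1 and 2 may commit. The paper's contribution is a log-depth circuit for the same function; this loop is the linear version.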

Wake-up Logic for Logical Register R5
• Entry 0 (Tail): D: R4 = R5 + R7, not done
• Entry 1 (Head): A: R5 = R8 + R1, done
• Entry 2: B: R1 = R5 + R1, not done
• Entry 3: C: R5 = R3 + R3, not done
(Figure: each entry is connected to the CSP circuit through in i, s i and out i.)

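Scheduler Logic Scheduling Two FUs
• Entry 0 (Tail): D: R4 = R5 + R7
• Entry 1 (Head): A: R5 = R8 + R1
• Entry 2: B: R1 = R5 + R1
• Entry 3: C: R5 = R3 + R3
(Figure: the request lines of the entries feed a chain of adders that counts the requests.)
• For logarithmic gate-delay implementations, see pp. 240-241 of the paper.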
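The scheduling rule can be sketched as a count of older requests: an entry gets a functional unit if fewer than the available number of units have been claimed by older requesting entries. The function is hypothetical and sequential, whereas the paper uses a logarithmic-depth circuit. The request pattern in the example (D, A and C request, B does not) is an assumed reading of the figure.

```python
def schedule(requests, head, num_units):
    """Grant functional units to the oldest requesting entries.

    `requests[i]` is True if entry i asks for a unit; age order starts
    at `head` and wraps around the buffer.
    """
    n = len(requests)
    grants = [False] * n
    used = 0
    for step in range(n):
        i = (head + step) % n
        if requests[i] and used < num_units:
            grants[i] = True
            used += 1
    return grants


# Entries 0..3 hold D, A, B, C; A at entry 1 is the oldest (head).
grants = schedule([True, True, False, True], head=1, num_units=2)
```

With two functional units the grants go to A and C, the two oldest requesters; D is the youngest and has to wait.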

Evaluation Results
• Designed a 128-entry instruction window.
  • Commit logic: 1.41 ns (709 MHz)
  • Wakeup logic: 1.54 ns (649 MHz)
  • Schedule logic: 1.69 ns (591 MHz)
• Achieves an operating speed above 500 MHz with current process technology.
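The frequencies in parentheses follow from the delays as f = 1 / t. The short check below assumes each logic block must fit in one clock cycle, so the slowest block bounds the clock; the variable and function names are hypothetical.

```python
def max_frequency_mhz(delay_ns):
    """Highest clock frequency (MHz) for a path with the given delay (ns)."""
    return 1000.0 / delay_ns


delays_ns = {"commit": 1.41, "wakeup": 1.54, "schedule": 1.69}
frequencies = {name: max_frequency_mhz(d) for name, d in delays_ns.items()}
clock_mhz = min(frequencies.values())  # the slowest block limits the clock
```

The schedule logic is the slowest of the three, and its delay still leaves the window above the 500 MHz mark quoted on the slide.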

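Comments
• Demonstrates that a 128-entry instruction window is feasible.
• Until now, increasing the number of instruction window entries was considered difficult.
• The work is interesting in that it overturns this view.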
