スーパースカラ Super Scalar

スーパースカラSuper Scalar • From CPI（Clock/Instruction）to IPC（Instruction/clock） • スーパースカラ/Super Scalar • 考え方 - 複数命令の同時実行構造Basic idea: Simultaneous issues of several instructions • インオーダとアウトオブオーダー発行In order/Out of order issues • Stage Decoupling with the pipeline • Reservation Station（FIFO） • インオーダとアウトオブオーダー完了In order/Out of order completion • データコンフリクトと例外処理Data conflict and exception handling • ROB（ReOrder Buffer） • レジスタリネーミング; register renaming • 分岐予測; Branch prediction 福永　力; Chikara Fukunaga

From CPI to IPC • MIPS=動作周波数（MHz）÷CPIMIPS=Machine Frequency (MHz)/CPI • もし複数の命令を実行できるようになればCPIという概念は拡張される（命令を1クロックで複数こなせるようになる）．するとIf several instructions can be issued at once, several instructions will be executed seemingly in a clock cycle. Then we may regard IPC=1/CPI（Instructions per clock）がCPUを評価する基本単位となりうる．その場合as a basic unit for the machine performance. In this case MISP will be redefinedMIPS=周波数（MHz）×IPC • その複数命令同時実行構造をスーパースカラと呼ぶ．Structure of this idea (Execution of Several instructions at once) is referred to as the Super Scalar. • （注）MIPS評価には他の方法もある（Dhrystone MIPS）．この評価をもとにしてIPCを計算すると2本の命令が同時実行可能なスーパースカラでもIPC>2などという数字がでてくるので注意．（Note） MIPS can be estimated with the another way (Dhrystone MIPS). This Dhrystone over-estimates IPC>2, for example, for superscalar with double instruction stream. 福永　力; Chikara Fukunaga

スーパースカラの概念Idea of Super scalar • 複数パイプラインを設け複数命令の同時発行（issue）を可能にするSimultaneous Issues of several instructions with multiple pipelines • 命令は相互に独立してあるべきが理想Ideally an instruction must be independent with the other ones. • プログラムから独立した複数の命令を選ぶ（独立した命令をピックアップし矛盾なく並び替える仕事はCPUハードウェアによる）．もちろんプログラム側（コンパイラ）の配慮も必要Multiple independent instructions should be selected and rearranged the order by CPU 福永　力; Chikara Fukunaga

スーパースカラの方式・形態（１）Methods and types of Super scalar (1) • 一般的に命令処理にかかるクロック数（レイテンシー）は命令により異なることに注意（今までは簡単のため同じとして扱ってきたが）．Note the clock count of instructions may not be fixed (variable), although so far we implicitly assumed it as constant. • 複数の命令フェッチ（issueあるいはway数）Fetch of Multiple instructions; Number of instructions fetch simultaneously is called number of issues or ways. • フェッチとフェッチ以後のステージを分離しここに命令キュー（FIFO）を挟む（このFIFOをReservation StationあるいはCentral Instruction Window（集中型命令ウィンドウ）と呼ぶ）のが一般的．An instruction queue (FIFO) is implemented after Fetch stage but before other stages. This FIFO is called Reservation Station or Central Instruction Window 福永　力; Chikara Fukunaga

スーパースカラの方式・形態（2）Methods and types of Super scalar (2) • 命令の発行，完了がオリジナルプログラム通りをインオーダー（In-order），違ってくる場合をアウトオブオーダー(Out-of-order）と呼ぶ．若干異なる構造をとる．If the execution of instructions is done in order with the original program, we call “In-order”, otherwise “Out-of-Order” • 命令依存性は命令キュー内待機命令に対して専用ハードウェアにより確認あるいは積極的に解消させる（レジスタリネーミング；後述）．この時点でのデータハザードによるストールは原則としておきないようにする．Dependency is checked with instructions waited in the FIFO queue. • ロード/ストア命令の実行ユニットは他の命令のそれと別にしてストール発生を抑えるようにしている（キャッシュ，メモリアクセスの遅さ考慮）．Load/Store instruction execution unit is normally independently installed in order to avoid the stall for data access in the main storage. 福永　力; Chikara Fukunaga

命令のインオーダー発行In order issue of instructions • 命令キュー出口より2つの命令の（オペランドの）依存関係をチェック，独立であれば2つを発行．IF two instructions out of the queue are confirmed independet, two are issued. • 依存性があれば1命令のみ発行．次の命令はその次の命令とペアを組む．If there is an dependency, only one instr. is taken, one with the later one is paired. • 命令の発行順はオリジナルプログラムのそれがキープされる．Issued order is kept in execution. 福永　力; Chikara Fukunaga

メモリ研究Study（1）FIFO（Fast In Fast Out）形態 • メモリに入った順序でデータが保持されるData are kept in order with the ones entered. • 待ち行列の構造Waiting queue structure • カウンタに多くの人が列を作ると行列は伸びるIf many people come to a counter, queue is longer. • しかし列の先頭から処理がなされていくと行列は縮むBut the queue is shrink if the process is done turn by turn • 押し掛ける人の数/時間（Producer）と処理される人の数/時間（Consumer）でFIFOの大きさ（深さ）を決めて両サイドの単位時間あたりのデータ量の違いを調整Producer (number of people to counter) and Consumer (number of people accessed) rates will be adjusted with this queue. 福永　力; Chikara Fukunaga

FIFO 構造 structure • 待ち行列のh/wでの実現をFIFO．アドレスを持つ必要がないので単純な構造でメモリを構築できる．It is implemented with less hardware ingredient than usual memory sincethis structure needs not the address. • s/wでも実現できる（循環バッファ）The circulation Buffer can be implementedwith the software. 福永　力; Chikara Fukunaga

命令のアウトオブオーダー発行Out of order issue of instructions • 命令キュー全体（あるいは既定数の命令）について（オペランドの）依存関係を調べるAll the instructions in the queue are checked with the independency • 依存関係のない命令をissue数分パイプラインステージに送るIndependent instructions for number of ways are send to the pipelines． • 発行される順序はプログラムのそれと異なる可能性がある（アウトオブオーダー）．Execution order will be different with the one arranged in the program. • すべてに依存関係があれば1命令発行も止むなし．If all the instructions havedependency, only one instruction will be employed 福永　力; Chikara Fukunaga

命令の依存関係チェックのハードウェア機構H/W check system for Dependency of instructions （Wake-up logic） • Instruction format is taken as “instruction D S1 S2”, Destination & sourcesFor example,“add r1,r3,r1”= r1←r3+r1． • ある命令のDと他の命令のSの関係より命令の依存関係が判明する．Relation of D of an instruction with S of other instruction are used to check the dependency • 実行されなかった命令を実行させるようにするという意味でウェイクアップロジックとも呼ばれる．The instructions that are not able to be executed can be executed => Wakeup logic.s • このロジックの高速化と回路規模の縮小が課題Fast processing and down-sizing of this logic is an issue for this circuit. 福永　力; Chikara Fukunaga

（常に）アウトオブオーダー完了(Constant) Out of order completion • 命令を並列処理するとレイテンシ（命令実行に要するクロック数）の違いで完了（リタイア）時のタイミングは異なる．これはインオーダー発行でも同じである．つまり基本的にスーパースカラではアウトオブオーダー完了の形をとる．Completion timing is different for instructions executed in parallel due to the difference of latency (Number of clock cycles required for execution). Thus the completion timing is always out of order in the super scalar machine. • 注意すべきはWAW hazardと割込み発生後の再開命令It should be noted WAW hazard as well as the restart instruction after interrupt. • Data hazard with WAW（Write After Write）mulf r5,r3,r2 ;lateori r5,r4,0 ;early • 発生後の再開 Restart after interruptmulf r1,r2,r3 ; lateori r3,r4,0 ; earlyInterrupt occurs at this moment（この時点でr3は更新され，r1はまだだと，再開はmulfからでないとまずい; At this moment r3is updated, but r1 is kept old, so the restarting after the interrupt should be from mulf instruction 福永　力; Chikara Fukunaga

ROBを入れてインオーダー完了But in order completion is possible with ROB • スーパースカラはアウトオブオーダー完了が基本（前スライド）だが，前述の如く種々の問題を含んでいる．そこで命令完了の最終ステージであるレジスタライトはプログラム順番通りにさせる．これをインオーダー完了There are problems in the out of order completion as seen in the previous slide.An idea to keep in order for Register writing at the last stage of instruction execution • このために（さらに）新たなるハードウェアを導入する．これをリオーダーバッファ（ReOrder Buffer; ROB）と呼ぶ．ここに各命令のプログラム上での順番を記録，各命令に対して実行が終了したかどうか記憶させる．In order to do so, an another hardware must be implemented. “Reorder Buffer; ROB” to memorize the execution of the instructions 福永　力; Chikara Fukunaga

インオーダーかアウトオブオーダーかIn order or Out of order? • Out-of-order発行/完了のほうがプロセッサのパフォーマンスは確実に向上する．Processor performance is better with out of order than in order. • しかしそのためには例えば動的スケジューリング機能としてのウェークアップロジックのようなハードウェア回路を必要とする．But need more hardware components like wakeup logic etc. • In order発行では複雑なハードウェアを組込む必要はないが平行して処理された命令に予期せぬ遅滞が生じた場合（キャッシュミスヒットとか）その完了をいたずらに待たなければならない．効率的ではない．An instruction must be waited even completed but if the other instructions take longer time with, for example, cache miss-hit in the case of in order issue. 福永　力; Chikara Fukunaga

レジスタリネーミングRegister renaming • アウトオブオーダー発行により生じる新たなデータハザード（WAR,WAW）の完全解消を目指すOut of order will make new data hazard with WAR and WAW • WAW（Write After Write）:後続命令のレジスタ変更を遅れて実行された先行命令が書き直す． • WAR（Write After Read）:先行する命令のソースレジスタを後続の命令実行で書き直す． • これらは相前後する命令で同じレジスタを利用している．そのレジスタの利用を他のレジスタで置き換えればこれらのハザードは回避できる．The hazards occur because instructions uses the same registers.We can avoid these hazard using other registers instead. • そのため（ハードウェアとして密かに）複数のレジスタを用意しそのような状況に陥ったときそれを利用すれば（ソフトウェアのコンシステンシは保たれたまま）ハザードを回避できる．If the hardware can prepare more (shadow) registers, WAR and WAW will be avoided. 福永　力; Chikara Fukunaga

WAW WAR レジスタリネーミング例Example of register renaming • If latency of mulf is longer than addi with 4-issues Multi scalar processor, WAR or WAW will be happenedmultr3,r3,r5addi r4,r3,1addi r3,r5,1multr7,r3,r4 • レジスタリネーミングを行いデスティネーションレジスタを置き換えるとWAWは解消できるWAW can be eliminated with register renamingmultP1,r3,r5addiP2,P1,1addiP3,r5,1multP4,P3,P2 • RAWはレジスタリネーミングでも解消できないRAW can not be eliminated in principle RAW RAW RAW RAW 福永　力; Chikara Fukunaga

一般的なスーパースカラ構造Super scalar structure 福永　力; Chikara Fukunaga

ここでも分岐予測は問題Branch prediction is still serious issue • スーパースカラでも分岐命令の取り扱いには手を焼く（前回パイプラインと同じ状況）The situation is the same as one with the simple pipeline. • 静的予測/Static Prediction • 例：通常taken ,ループでnot takenと固定Normally default is set to not-taken，but taken in a loop • 動的予測/Dynamic Prediction • Branch Prediction Buffer（BPB or BHT；takenするかどうか予測）&Branch Target Buffer（BTB；分岐先アドレス情報）方式の併用 • 投機実行/Speculative Predictiontaken/not-takenどちらの分岐先の命令も先を見越（予測）して実行させる福永　力; Chikara Fukunaga

ウェイクアップロジックについてWake-up Logic Structure • ここではメモリの別形態CAMを用いた処理がなされるのが一般的In general CAM will be used for this logic • CAMとは（means）Contents Addressable Memoryあるいは（or）Contents Associative Memory • アドレス参照ではなく内容参照のメモリ；連想メモリとも呼ばれるA memory for referring data not from the address but from the data themselves • 例えば電話番号からその持ち主を知る（telephone book; number → name） • CPUではキャッシュヒット/TLBの判定で一般的に使われるCAM is used in Cash or TLB hit decision 福永　力; Chikara Fukunaga

Cacheの内部構成Internal structure of Cache • 抽象化されたCacheの構造Abstractive structure of Cache • コントロール部（アドレスデコーダとtag，コンパレータ）とCacheメモリ部Control (Address decoder and Tag comparator) stage and Cache memory • メインストレージのアドレスとtag情報の素早い参照が重要（マッピング）From quick search of tag field, CPU knows whether the address in MS is involved in Cache or not. • データの格納法にもさまさまな工夫Careful specification for Tag associated from the address 福永　力; Chikara Fukunaga

メモリ研究Study（2）CAM構造/Structure • 基本構成は右図のようにまとめられる：Basic idea of CAM is summarized in this figure. • Input Dataは（必要とあれば）マスクで不必要な情報を無視できるMask registers will mask unnecessarybit if it is needed. • 結果はマスクをかけずHit（1）/No hit（0）だけでなくオリジナルとどれだけ食い違っているかの判定するものもあるThere is a tagging without the maskbut with estimating number of bit (un)matched. 比較福永　力; Chikara Fukunaga

CAMの比較機構Bit comparator used in CAM • 入力データとサンプルメモリのデータをビットごとに高速で比較する．High speed bit by bit comparison of Input data with reference • Word単位で並列化できるParallelization is indispensable • 回路構成は単純だが1word並列処理でそのフルビット分回路を用意しなければならないSimple logic but need for all the bit • 右図では各ビットで比較しそのデータを下位から来た結果（Li-1）と論理積をとりLiとしさらに上位に送ることにより並列化を達成AND (Li and Li-1) enables parallelization 福永　力; Chikara Fukunaga

最小距離検索連想メモリLSICAM LSI with the Minimum Distance Search（広島大学ナノデバイスシステム研究センターHiroshima Univ. Nano-Device Research Center ; 2004） • 連想メモリのパターン認識への応用Application of CAM for pattern recognition • 検索（入力）データとメモリデータの比較に範囲（距離）を導入Tagging of a Search word to reference data not exactly but with distance • 入力データにもっとも近傍のメモリデータを探索するPick up a reference data for search word with minimum distance with or 福永　力; Chikara Fukunaga

Minimum-Distance-Search(MDS) CAMCore part • Word length （語長）1≦i<W bit • Number of References （標本数）1≦j≦R • Samples (Ref) are stored in advance in Storage Cells SC (SCij) • Search word aredistributed vertically downward. • Then all the comparsionare done paralell in Unit Comparator CellsUC (UCij) • Results are accumulatedin analog manner inWord Comparator(WCi) 福永　力; Chikara Fukunaga

MDS CAM 回路構成Circuits in detail • Unit Comparator and Word Comparator(Manhattan Distance case) • Self Adaptive Winner Line-up Amplification 福永　力; Chikara Fukunaga

スーパースカラ Super Scalar