Looking Back on Hack The Cell 2009

白石匡央  2009/3/28

Score

▼ Results of the final submitted version

sum        ORIGNAL (ticks)   MINE (ticks)
3c927c56   294950998         2448310
2e987a4d   425483257         3531709
ef1b6aef   313079651         2598780
eedd2516   290962965         2415205
f7e967a8   14411793          119837
1f37a7db   214886708         1783769
c7d41f36   295887479         2456073
aa9d2e9f   260377511         2161355
8abd398a   251629392         2088731
a374bd58   6129418           51100

▼ cf. 小寺さん

sum        ORIGNAL (ticks)   MINE (ticks)
3c927c56   294032966         2435961
2e987a4d   424158954         3513919
ef1b6aef   312105208         2585719
eedd2516   290057342         2403032
f7e967a8   14366933          119478
1f37a7db   214217873         1774866
c7d41f36   294966530         2443734
aa9d2e9f   259567100         2150455
8abd398a   250846200         2078200
a374bd58   6110333           51035

※ Reproduced from http://longlong.way-nifty.com/blog/2009/03/post-7872.html

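bitslice

▼ Transposed mt registers
Generated with a combination of shuffle + gb + shift. A fairly heavy operation.

▼ Layout of mt[624]

mt[] index:   1-128      129-256    257-384    385-512    513-623
size:         128bit×32  128bit×32  128bit×32  128bit×32  128bit×32

・mt[0] is not placed in the buffer; it is kept in a separate work variable.
・The leftover part is 0.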
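To make the transposed layout concrete, here is a minimal sketch in portable C. It is not the SPU implementation from the slides (which uses shuffle + gb + shift on 128-bit registers); it assumes 32 lanes per plane instead of 128, and the names `bitslice`, `unbitslice`, `LANES` and `BITS` are hypothetical. Plane 0 holds the most significant bit, matching the a00 ... a31 numbering used later in the slides.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 32, BITS = 32 };

/* Transpose LANES words into BITS bit planes.
 * Assumptions of this sketch: planes[i] collects bit (31 - i) of every
 * word (plane 0 = most significant bit), and lane j of a plane is bit j
 * of the plane word. */
void bitslice(const uint32_t words[LANES], uint32_t planes[BITS])
{
    assert(words && planes);
    for (int i = 0; i < BITS; i++) {
        uint32_t plane = 0;
        for (int j = 0; j < LANES; j++) {
            uint32_t bit = (words[j] >> (31 - i)) & 1u;
            plane |= bit << j;
        }
        planes[i] = plane;
    }
}

/* Inverse transposition: rebuild the LANES words from the bit planes. */
void unbitslice(const uint32_t planes[BITS], uint32_t words[LANES])
{
    assert(words && planes);
    for (int j = 0; j < LANES; j++) {
        uint32_t w = 0;
        for (int i = 0; i < BITS; i++) {
            uint32_t bit = (planes[i] >> j) & 1u;
            w |= bit << (31 - i);
        }
        words[j] = w;
    }
}
```

Once the data is in this form, a 32-bit shift becomes a renaming of planes and a single XOR on a plane acts on every lane at once, which is what the following slides rely on.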

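mt[] update

▼ mt update

mt[] index:   1-128      129-256    257-384    385-512    513-623
size:         128bit×32  128bit×32  128bit×32  128bit×32  128bit×32
updated to:   624-751    752-879    880-1007   1008-1135  1136-1246

・Updated block by block.
・Update formula: mt[kk+624] = f(mt[kk], mt[kk+1])
・Taking mt[kk+1] as the axis reduces the amount of computation (mt[kk] contributes only 1 bit), so the update is treated as mt[kk+624] = f(mt[kk+1]).
※ mt[kk+M] is handled separately.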
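The remark that mt[kk] contributes only 1 bit can be checked on the scalar form of the update. The sketch below covers only the (y >> 1) ^ mag01[y & 1] part, leaving out the XOR with mt[kk+M] just as the slide does. The function names `twist` and `twist_axis` are hypothetical; the masks and the mag01 constant are the standard MT19937 values.

```c
#include <assert.h>
#include <stdint.h>

#define UPPER_MASK 0x80000000u /* most significant bit */
#define LOWER_MASK 0x7fffffffu /* least significant 31 bits */

/* (y >> 1) ^ mag01[y & 1], with y built from mt[kk] and mt[kk+1]. */
uint32_t twist(uint32_t mt_kk, uint32_t mt_kk1)
{
    static const uint32_t mag01[2] = { 0x0u, 0x9908b0dfu };
    uint32_t y = (mt_kk & UPPER_MASK) | (mt_kk1 & LOWER_MASK);
    return (y >> 1) ^ mag01[y & 0x1u];
}

/* The same value computed with mt[kk+1] as the axis: mt[kk] enters only
 * through its top bit, passed here as 0 or 1. */
uint32_t twist_axis(uint32_t msb_of_mt_kk, uint32_t mt_kk1)
{
    assert(msb_of_mt_kk <= 1u);
    uint32_t low = mt_kk1 & LOWER_MASK;
    uint32_t r = (low >> 1) | (msb_of_mt_kk << 30);
    return (low & 1u) ? (r ^ 0x9908b0dfu) : r;
}
```

Changing any of the low 31 bits of mt[kk] leaves the result unchanged, so a block of mt[kk+1] values plus one extra bit per lane is enough to produce the corresponding output block.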

Computation up to (y>>1)^mag01[y&0x1UL]

▼ y = (mt[kk]&UPPER_MASK)|(mt[kk+1]&LOWER_MASK);

    a00 = spu_sel( a00, prev, 0x1 );
    a00 = spu_rlqwbyte( a00, 15 );
    a00 = spu_rlqw( a00, 7 );

[Figure: the most significant bit plane, showing prev and a00 with lane labels 128, 0, 1]

▼ (y >> 1) ^ mag01[y & 0x1UL];

    a02 = spu_xor( a02, a31 );
    a03 = spu_xor( a03, a31 );
    a06 = spu_xor( a06, a31 );
    a11 = spu_xor( a11, a31 );
    a15 = spu_xor( a15, a31 );
    a17 = spu_xor( a17, a31 );
    a18 = spu_xor( a18, a31 );
    a23 = spu_xor( a23, a31 );
    a24 = spu_xor( a24, a31 );
    a26 = spu_xor( a26, a31 );
    a27 = spu_xor( a27, a31 );
    a28 = spu_xor( a28, a31 );
    a29 = spu_xor( a29, a31 );
    a30 = spu_xor( a30, a31 );

    mag01[2]={0x0UL, 0x9908b0df};

The result ends up in the register set "a31, a00, ..., a30" (in that order).
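The 14 XORs and the reordered result set can be reproduced in plain C. This is a sketch under assumptions: 32 lanes per plane instead of the SPU's 128, plane a[i] holding bit (31 - i) of y, and hypothetical function names. The right shift costs nothing because it is only a renaming of planes. The top result bit is the low-bit plane a31 itself, since bit 31 of 0x9908b0df is set and y >> 1 has a zero there, which is why the constant has 15 set bits but only 14 XORs are needed.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for one lane. */
uint32_t twist_scalar(uint32_t y)
{
    static const uint32_t mag01[2] = { 0x0u, 0x9908b0dfu };
    return (y >> 1) ^ mag01[y & 0x1u];
}

/* Bitsliced form. a[i] is the plane of bit (31 - i) of y (a[0] = MSB,
 * a[31] = LSB); out[] uses the same convention for the result.
 * a and out must not overlap. */
void twist_planes(const uint32_t a[32], uint32_t out[32])
{
    /* Planes that the slide XORs with a31 (a02, a03, ..., a30). */
    static const int xor_targets[14] = {
        2, 3, 6, 11, 15, 17, 18, 23, 24, 26, 27, 28, 29, 30
    };
    assert(a != out);

    /* y >> 1 is a renaming: the result order is a31, a00, ..., a30. */
    out[0] = a[31];
    for (int i = 1; i < 32; i++)
        out[i] = a[i - 1];

    /* After the renaming, a[k] sits at out[k + 1]. */
    for (int k = 0; k < 14; k++)
        out[xor_targets[k] + 1] ^= a[31];
}
```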

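mt[] update and tempering computation

▼ Pipelined loop of mt update and tempering computation

[Figure: timeline of the mt update and tempering stages over the mt blocks (1-128, 129-256, 257-384, 385-512, 513-623, then 624-751), each stage marked with the register set it uses: register set A (32 registers) or register set B (32 registers)]

・mt update: mostly odd-pipeline instructions.
・tempering: mostly even-pipeline instructions.
・As long as the two do not depend on each other, they are easy to pair.

◎ At the end of the loop, "0-31" is in a different register set than at the start.
→ Unroll the loop body to 10 sets (twice the figure above).
(※ Ran out of time and could not implement this. The cost of copying set B into set A is being paid.)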
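The reason for unrolling to 10 sets can be seen with a little arithmetic. The sketch below assumes that the two register sets strictly alternate from one block to the next and that one pass over mt[] covers 5 blocks; both the model and the names are illustrative, not taken from the original code.

```c
#include <assert.h>

enum { BLOCKS_PER_PASS = 5, REGISTER_SETS = 2 };

/* Register set (0 = A, 1 = B) that the n-th processed block lands in,
 * assuming the sets strictly alternate block by block. */
int register_set_of(unsigned n)
{
    return (int)(n % REGISTER_SETS);
}

/* Smallest number of whole passes after which the first block of a pass
 * is back in the same register set as block 0. */
unsigned passes_until_realigned(void)
{
    unsigned passes = 1;
    while (register_set_of(passes * BLOCKS_PER_PASS) != register_set_of(0))
        passes++;
    assert(passes <= REGISTER_SETS);
    return passes;
}
```

With 5 blocks per pass the first block of the second pass lands in set B, so a loop body written for one pass needs either a copy from B to A or two passes (10 blocks) unrolled together.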

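Computing mt[kk+M]

[Figure: the mt blocks 1-128, 129-256, 257-384, 385-512, 513-623, with mt[kk+M] marked at 397; mt[kk+M] straddles block boundaries]

・Obtained with selb + rot applied to two blocks.
・Ideally, two sets of 32 registers would be used to reduce memory loads. → Not enough registers.
・Handled by rotating through 5 sets of 8 registers (8 bits' worth each).
・mt[513] to mt[623] need special-case handling. => Insufficiently optimized, and slow...

[Figure: register rotation for the bit groups 31-24, 23-16, 15-8, 7-0. Register sets A to E are combined from the previous block and the next block (loaded); each set is XORed with y, becomes free, and is reused for the next computation]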
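The "select, then rotate" idea for values that straddle two blocks can be sketched in portable C. The assumptions here are 32 lanes per plane instead of 128, lane j stored in bit j, and hypothetical helper names; `sel` and `rotr` only mimic what selb and the rotate instructions do and are not SPU intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: bits of b where mask is 1, bits of a elsewhere. */
static uint32_t sel(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Rotate right by s bits (s may be 0). */
static uint32_t rotr(uint32_t x, unsigned s)
{
    s &= 31u;
    return s ? (x >> s) | (x << (32u - s)) : x;
}

/* Window of 32 lanes that starts at lane s of prev and continues into
 * next: result lane j is prev lane j+s, or next lane j+s-32 once the
 * index runs off the end of prev. */
uint32_t window(uint32_t prev, uint32_t next, unsigned s)
{
    assert(s < 32u);
    uint32_t mask = (1u << s) - 1u;           /* low s lanes come from next */
    return rotr(sel(prev, next, mask), s);    /* then realign to lane 0 */
}
```

One select plus one rotate per plane yields the shifted view, so the cost is dominated by how many planes of the neighbouring block must be kept in registers or reloaded.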

tempering

▼ Tempering computation
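The slide body for this step did not survive, so the following is only a sketch of how the standard MT19937 tempering can be expressed on bit planes, not a reconstruction of the author's code. It assumes 32 lanes per plane and plane a[i] holding bit (31 - i); `PLANE`, `temper_scalar` and `temper_planes` are hypothetical names. Shifts become plane renamings, and the AND masks only decide which planes receive an XOR.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: standard MT19937 tempering of one 32-bit word. */
uint32_t temper_scalar(uint32_t y)
{
    y ^= (y >> 11);
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= (y >> 18);
    return y;
}

/* Plane of bit b, with a[0] = most significant bit (assumed convention). */
#define PLANE(a, b) ((a)[31 - (b)])

/* Bitsliced tempering, in place, on 32 lanes at once. */
void temper_planes(uint32_t a[32])
{
    int b;
    assert(a);

    /* y ^= y >> 11: ascending order keeps source plane b+11 unmodified. */
    for (b = 0; b + 11 < 32; b++)
        PLANE(a, b) ^= PLANE(a, b + 11);

    /* y ^= (y << 7) & mask: descending order keeps source plane b-7 old. */
    for (b = 31; b >= 7; b--)
        if ((0x9d2c5680u >> b) & 1u)
            PLANE(a, b) ^= PLANE(a, b - 7);

    /* y ^= (y << 15) & mask: same ordering argument. */
    for (b = 31; b >= 15; b--)
        if ((0xefc60000u >> b) & 1u)
            PLANE(a, b) ^= PLANE(a, b - 15);

    /* y ^= y >> 18 */
    for (b = 0; b + 18 < 32; b++)
        PLANE(a, b) ^= PLANE(a, b + 18);
}
```

In fully unrolled code the mask tests disappear at compile time and only the XORs remain.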

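Sum

▼ Sum of the tempered values

・Tally each bit separately: s0 s1 s2 s3 ... s31 (spu_cntb can be used).
・Then accumulate s0 to s31 block by block:

    total0 += s0
    total1 += s1
    total2 += s2
    total3 += s3
    ...
    total31 += s31

・total0 to total15 (upper bits) are accumulated as short.
・total16 to total31 (lower bits) are accumulated as int.
→ 6 registers used.
・The final result is obtained at the end by adding up the totals while shifting each one to its bit position.
・SUM cost: cntb×32, sumb×20, shuffle×12, a×6 (even=58, odd=12)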
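The per-bit tally can be sketched in portable C as follows. This assumes 32 lanes per plane, plane index 0 for the most significant bit, a final sum taken modulo 2^32, and hypothetical names (`BitTotals`, `totals_add_block`, `totals_finish`); a software popcount stands in for cntb/sumb. One reading of the short/int split is that a total for bit 16 or above is shifted left by at least 16, so only its low 16 bits can affect a 32-bit result.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t upper[16]; /* totals for bits 31..16 (total0..total15)  */
    uint32_t lower[16]; /* totals for bits 15..0  (total16..total31) */
} BitTotals;

/* Number of set bits in x (software stand-in for cntb + sumb). */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1u;
        n++;
    }
    return n;
}

/* Add one block: planes[i] is the plane of bit (31 - i). */
void totals_add_block(BitTotals *t, const uint32_t planes[32])
{
    assert(t && planes);
    for (int i = 0; i < 16; i++)
        t->upper[i] = (uint16_t)(t->upper[i] + popcount32(planes[i]));
    for (int i = 16; i < 32; i++)
        t->lower[i - 16] += popcount32(planes[i]);
}

/* Combine the totals into the 32-bit sum, shifting each to its bit. */
uint32_t totals_finish(const BitTotals *t)
{
    uint32_t sum = 0;
    assert(t);
    for (int i = 0; i < 16; i++)
        sum += (uint32_t)t->upper[i] << (31 - i);
    for (int i = 16; i < 32; i++)
        sum += t->lower[i - 16] << (31 - i);
    return sum;
}
```

The 16-bit totals are allowed to wrap around; the wrapped-off part would be shifted out of the 32-bit result anyway.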

Assembler function

The function is written entirely in assembler so that all 128 registers can be used.

▼ Written as an entry-point label + extern

    extern RESULT_SUM sum_rand_asm( UINTV loop_cnt, UINTV preve );
    __asm__ (
    "sum_rand_asm:\n\t"

▼ prologue: save registers $80 to $127

    "stqd $80,-16($sp)\n\t"
    "stqd $81,-32($sp)\n\t"
    "stqd $82,-48($sp)\n\t"
    ・・・
    "stqd $127,-768($sp)\n\t"

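Assembler function

▼ Arguments

    $0 = $lr   return address
    $1 = $sp   stack pointer
    $2         environment pointer (frame pointer?)
    $3 onward  arguments, stored in order (one register per argument)

▼ Return value
Placed in $3. For a struct, the members are stored in order starting from $3, one register per member.

▼ epilogue: restore the saved registers $80 to $127 and jump to the return address

    "hbr FUNCTION_END,$lr\n\t"
    "lqd $80,-16($sp)\n\t"
    "lqd $81,-32($sp)\n\t"
    ・・・
    "lqd $127,-768($sp)\n\t"
    "FUNCTION_END:\n\t"
    "bi $lr\n\t"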
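The save and restore lines follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes the offsets continue linearly between the lines shown on the slides ($80 at -16, $82 at -48, $127 at -768); the elided lines are not reproduced in the source, and `save_slot_offset` and `format_save_line` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>

/* Stack offset of the save slot for register reg, assuming the linear
 * pattern seen on the slides: $80 -> -16, ..., $127 -> -768. */
int save_slot_offset(int reg)
{
    assert(reg >= 80 && reg <= 127);
    return -16 * (reg - 79);
}

/* Format one save ("stqd") or restore ("lqd") instruction into buf.
 * Returns what snprintf returns (length of the formatted text). */
int format_save_line(char *buf, size_t size, const char *op, int reg)
{
    assert(buf && op);
    return snprintf(buf, size, "%s $%d,%d($sp)", op, reg,
                    save_slot_offset(reg));
}
```

Calling `format_save_line` with "stqd" for registers 80 to 127 yields the prologue lines, and with "lqd" the matching epilogue lines.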

End

Thank you for your attention.
