
Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads

Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry. School of Computer Science, Carnegie Mellon University. † Dept. of Electrical & Computer Engineering, University of Toronto.


Presentation Transcript


  1. Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry School of Computer Science Carnegie Mellon University †Dept. Elec. & Comp. Engineering University of Toronto

  2. Motivation
Chip-level multiprocessing is becoming commonplace:
• UltraSPARC IV (2 UltraSPARC III cores)
• IBM POWER4
• Sun MAJC
• SiByte SB-1250
Can multithreaded processors improve the performance of a single application? → We need parallel programs

  3. Why Is Automatic Parallelization Difficult?
Automatic parallelization today:
• Must statically prove threads are independent
• Constructing proofs is difficult due to ambiguous data dependences:
  • Complex control flow
  • Pointers and indirect references
  • Runtime inputs
An optimistic compiler would be limited only by true dependences → One solution: Thread-Level Speculation

  4. Example
[Diagram: four processors run speculative threads; each thread loads and stores entries of hash[] (e.g. thread 1: ... = hash[3], hash[10] = ...; thread 4: ... = hash[10], hash[25] = ...; each thread ends with check_dep()). When a later thread's load conflicts with an earlier thread's store to the same entry, the later thread is retried.]
while (...) {
  ...
  x = hash[index1];
  ...
  hash[index2] = y;
  ...
}

  5. Frequently Dependent Scalars
[Diagram: the producer thread writes a (a = ...); the consumer thread reads it (... = a) too early, causing a violation.]
→ Can identify scalars that always cause dependences

  6. Frequently Dependent Scalars
[Diagram: the consumer executes Wait(a) before ... = a; the producer executes Signal(a) after a = ..., forwarding the value.]
Dependent scalars should be synchronized [ASPLOS'02]

  7. Frequently Dependent Scalars
[Diagram: multiple definitions and uses of a along different control-flow paths between producer and consumer.]
Dataflow analysis allows us to deal with complex control flow [ASPLOS'02]

  8. Communicating Memory-Resident Values
[Diagram: the producer executes Store *q, the consumer executes Load *p. Synchronize, or speculate?]
Will speculation succeed?

  9. Speculation vs. Synchronization
[Diagram: sequential execution vs. speculative parallel execution when Load *p and Store *q do not conflict; the parallel threads overlap fully.]
Speculation succeeds: efficient

  10. Speculation vs. Synchronization
[Diagram: when Store *q and Load *p do conflict, the consumer suffers a violation and must re-execute.]
Speculation fails: inefficient

  11. Speculation vs. Synchronization
[Diagram: synchronizing the dependent Store *q / Load *p pair partially serializes the threads but avoids violations.]
• Frequent dependences: synchronize
• Infrequent dependences: speculate

  12. Performance Potential
[Bar chart: normalized regional execution time for go, gcc, gap, ijpeg, mcf, crafty, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, and gzip_decomp, comparing original execution against perfect memory value prediction.]
Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency
Reducing failed speculation improves performance

  13. Hardware vs. Compiler-Inserted Synchronization
[Diagram: three producer/consumer timelines. Speculation: the consumer's Load *p may run before the producer's Store *q. Hardware-inserted synchronization [HPCA'02]: the consumer's load stalls in hardware until the store completes. Compiler-inserted synchronization [CGO'04]: explicit Wait()/Signal() bracket the dependence.]

  14. Issues in Synchronizing Memory-Resident Values
• Static analysis
  • Which instructions to synchronize?
  • Inter-procedural dependences
• Runtime
  • Detecting and recovering from improper synchronization

  15. Outline
• Static analysis
• Runtime checks
• Results
• Conclusions

  16. Compiler Passes
foo.c → Front End → Profile Data Dependences → Create Threads → Decide What to Synchronize → Insert Synchronization → Schedule Instructions → Back End → foo.exe

  17. Example
do {
  push(&set, element);
  work();
} while (test);
[Call graph: the loop calls work() and push(head, entry).]

  18. Example
do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}
[Call graph: the loop calls push(head, entry) both directly and through work().]

  19. Example
do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry) {
  entry->next = *head;   /* Load *head */
  *head = entry;         /* Store *head */
}
[Diagram: cross-iteration dependences arise between Store *head and Load *head, whether push is reached directly (push) or through work (work, push).]

  20. Compiler Passes
foo.c → Front End → Profile Data Dependences → Create Threads → Decide What to Synchronize → Insert Synchronization → Schedule Instructions → Back End → foo.exe

  21. Example
push(head, entry) {
  entry->next = *head;   /* Load *head */
  *head = entry;         /* Store *head */
}

Profile Information
========================================================
Source                     Destination                Frequency
Store *head (push)         Load *head (push)          990
Store *head (push)         Load *head (work, push)    10
Store *head (work, push)   Load *head (push)          10

  22. Compiler Passes
foo.c → Front End → Profile Data Dependences → Create Threads → Decide What to Synchronize → Insert Synchronization → Schedule Instructions → Back End → foo.exe

  23. Dependence Graph
[Graph: Store *head (push) → Load *head (push), frequency 990; Store *head (push) → Load *head (work, push), frequency 10; Store *head (work, push) → Load *head (push), frequency 10.]
Infrequent dependences: occur in less than 5% of iterations.
Pairs that need to be synchronized can be extracted from the dependence graph.

  24. Compiler Passes
foo.c → Front End → Profile Data Dependences → Create Threads → Decide What to Synchronize → Insert Synchronization → Schedule Instructions → Back End → foo.exe

  25. Example
The frequent dependence, Store *head (push) → Load *head (push) with frequency 990, is synchronized by cloning push for the direct call site:

do {
  push_clone(&set, element);
  work();
} while (test);

push_clone(head, entry) {
  wait();
  entry->next = *head;
  *head = entry;
  signal(head, *head);
}

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry) {
  entry->next = *head;
  *head = entry;
}

  26. Outline
• Static analysis
• Runtime checks
• Results
• Conclusions

  27. Runtime Checks
The synchronized pair is truly dependent only if:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
The producer forwards the address along with the value (Signal(q, *q)) to ensure a match between the load and the store.

  28. Ensuring Correctness
[Diagram: an intervening Store *x on the consumer between Store *q and Load *p.]
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
• Hardware support: similar to the memory conflict buffer [Gallagher et al., ASPLOS'94]

  29. Ensuring Correctness
[Diagram: an intervening Store *y on the producer before Store *q.]
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
• Hardware support: TLS hardware already knows which locations are stored to

  30. Outline
• Static analysis
• Runtime checks
• Results
• Conclusions

  31. Experimental Framework
[Diagram: four processors connected by a crossbar, each with a private cache.]
Underlying architecture:
• 4-processor, single-chip multiprocessor
• Speculation supported through coherence
Simulator:
• Superscalar, similar to MIPS R14K
• 10-cycle communication latency
• Models all bandwidth and contention (detailed simulation)
Benchmarks:
• SPECint95 and SPECint2000, -O3 optimization

  32. Parallel Region Coverage
[Bar chart: parallel region coverage for go, gcc, gap, ijpeg, mcf, crafty, parser, m88ksim, perlbmk, vpr_place, gzip_comp, bzip2_comp, and gzip_decomp.]
• Coverage is significant
• Average coverage: 54%

  33. Compiler-Inserted Synchronization
[Bar chart: normalized regional execution time, broken down into failed speculation, synchronization stall, other, and busy. U = no synchronization inserted; C = compiler-inserted synchronization.]
Seven benchmarks speed up by 5% to 46%

  34. Compiler- vs. Hardware-Inserted Synchronization
[Bar chart: normalized regional execution time; the hardware does better on some benchmarks, the compiler on others. C = compiler-inserted synchronization; H = hardware-inserted synchronization.]
Compiler and hardware [HPCA'02] each benefit different benchmarks

  35. Combining Hardware and Compiler Synchronization
[Bar chart: normalized regional execution time for go, gap, perlbmk, m88ksim, gzip_comp, and gzip_decomp. C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both combined.]
The combination is more robust than either technique individually

  36. Related Work
Compiler-inserted: Cytron, ICPP'86; Tsai & Yew, PACT'96; Zhai et al., CGO'04
Hardware-inserted (distributed vs. centralized table designs): Moshovos et al., ISCA'97; Steffan et al., HPCA'02; Cintra & Torrellas, HPCA'02

  37. Conclusions
Compiler-inserted synchronization for memory-resident value communication:
• Effective in reducing speculation failure
  • Half of the benchmarks speed up by 5% to 46% (regional)
• Combining hardware and compiler techniques is more robust
  • Neither consistently outperforms the other
  • Can be combined to track the best performer
→ Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware

  38. Questions? - 38 -

  39. The Potential of Instruction Scheduling
[Bar chart: normalized regional execution time, broken down into failed speculation, synchronization stall, other, and busy, for each benchmark. E = early; C = compiler-inserted synchronization; L = late.]
Scheduling instructions has additional benefit for some benchmarks

  40. Program Performance
[Bar chart: whole-program execution time, broken down into failed speculation, synchronization stall, other, and busy. U = un-optimized; C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both compiler and hardware.]

  41. Which Technique Synchronizes This Load?
[Bar chart: loads broken down into those synchronized by neither technique, by the compiler only, by the hardware only, or by both. U = un-optimized; C = compiler-inserted synchronization; H = hardware-inserted synchronization; B = both compiler and hardware.]

  42. Ensuring Correctness
[Diagram: an intervening Store *x on the consumer between Store *q and Load *p.]
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
• Hardware support: similar to the memory conflict buffer [Gallagher et al., ASPLOS'94]

  43. Ensuring Correctness
[Flowchart on the consumer: if there is a local store to *p, use the memory value; otherwise, if q == p, use the forwarded value, else use the memory value.]
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
• Hardware support
Use the forwarded value only if the synchronized pair is dependent

  44. Issues in Synchronizing Memory-Resident Values
• Inserting synchronization using compilers
• Ensuring correctness
• Reducing synchronization cost

  45. Reducing the Cost of Synchronization
[Diagram: producer/consumer timelines before and after instruction scheduling.]
• Instruction scheduling algorithms are described in [ASPLOS'02]

  46. The Potential of Instruction Scheduling
[Bar chart: normalized regional execution time for ijpeg, gap, m88ksim, gzip_comp, vpr_place, and gzip_decomp, broken down into failed speculation, synchronization stall, other, and busy. E = perfectly predicting synchronized memory-resident values; C = compiler-inserted synchronization; L = consumer stalls until the previous thread commits.]
Scheduling instructions could offer additional benefit

  47. Using More Accurate Profiling Information
[Bar chart: normalized regional execution time for gzip_comp. U = no instruction scheduling; C = compiler-inserted synchronization; R = compiler-inserted synchronization, profiled with the ref input set.]
gzip_comp is the only benchmark sensitive to the profiling input
