
Reducing OLTP Instruction Misses with Thread Migration



Presentation Transcript


  1. Reducing OLTP Instruction Misses with Thread Migration Islam Atta Pınar Tözün Anastasia Ailamaki Andreas Moshovos University of Toronto École Polytechnique Fédérale de Lausanne

  2. OLTP on an Intel Xeon 5660 • Shore-MT • Hyper-threading disabled • IPC < 1 on a 4-issue machine • 70-80% of stalls are instruction stalls

  3. OLTP L1 Instruction Cache Misses • Trace simulation • 4-way L1-I cache • Shore-MT [Chart: misses vs. L1-I size; ~512KB is enough for the OLTP instruction footprint, while the L1-I sizes most common today are far smaller]

  4. Reducing Instruction Stalls at the Hardware Level • Larger L1-I cache size – Higher access latency • Different replacement policies – Do not really affect OLTP workloads • Advanced prefetching – Too much space overhead (40KB per core) • Simultaneous multi-threading – Increases IPC per hardware context – Pollutes the cache

  5. Alternative: Thread Migration • Enables use of the aggregate L1-I capacity – Large cache size without increased latency • Can exploit instruction commonality – Localizes common transaction instructions • Dynamic hardware solution – More general purpose

  6. Transactions Running in Parallel [Figure: a transaction is split into instruction parts that each fit into the L1-I; concurrent threads T1, T2, T3 share common instructions]

  7. Scheduling Threads [Figure: L1-I timelines for threads T1–T3 on cores 0–3; traditional scheduling accumulates 10 total misses, while TMi accumulates 4 by reusing instructions already cached on other cores]
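The effect sketched on this slide can be reproduced with a toy cache model (an illustrative sketch only: the part counts, core counts, and resulting miss totals are assumptions for the example, not the talk's measured figures). Traditional scheduling makes every core fetch the whole transaction and thrash its L1-I; migrating each thread to the core that "owns" the current instruction part leaves only cold misses:

```python
from collections import OrderedDict

class L1ICache:
    """Tiny LRU cache of instruction 'parts' (toy model)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, part):
        if part in self.lines:
            self.lines.move_to_end(part)        # LRU hit
        else:
            self.misses += 1
            self.lines[part] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used

PARTS = ["P1", "P2", "P3"]   # one transaction's instruction parts
THREADS = 3                  # concurrent threads running the same transaction
CAPACITY = 1                 # each L1-I holds only one part -> thrashing

# Traditional scheduling: each thread runs every part on its own core.
traditional = [L1ICache(CAPACITY) for _ in range(THREADS)]
for core in range(THREADS):
    for part in PARTS:
        traditional[core].access(part)

# TMi-style scheduling: each part is localized on one core and every
# thread migrates there, so the part stays resident after one cold miss.
tmi = [L1ICache(CAPACITY) for _ in range(len(PARTS))]
for _thread in range(THREADS):
    for i, part in enumerate(PARTS):
        tmi[i].access(part)

print(sum(c.misses for c in traditional))  # 9: every access misses
print(sum(c.misses for c in tmi))          # 3: one cold miss per part
```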

  8. TMi • Group threads • Wait till the L1-I is almost full • Count misses • Record the last N misses • Misses > threshold ⇒ migrate [Figure: threads T1–T4 of transactions A and B on cores 0–1 over time]
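The migration trigger above can be sketched as per-core bookkeeping. This is an illustrative model, not the paper's hardware design; only the default values (last 6 misses, threshold 256) come from the experimental-setup slide:

```python
from collections import deque

class TMiCoreState:
    """Per-core TMi bookkeeping (sketch): count L1-I misses, remember
    the last N miss addresses, and suggest migration past a threshold."""
    def __init__(self, n_recorded=6, threshold=256):
        self.recent_misses = deque(maxlen=n_recorded)
        self.miss_count = 0
        self.threshold = threshold

    def on_l1i_miss(self, block_addr):
        """Record a miss; return True when the thread should try to migrate."""
        self.miss_count += 1
        self.recent_misses.append(block_addr)
        return self.miss_count > self.threshold

    def reset(self):
        """Called once the thread has migrated away."""
        self.miss_count = 0
        self.recent_misses.clear()

# Tiny threshold so the trigger fires quickly in this demo.
core = TMiCoreState(n_recorded=3, threshold=2)
print([core.on_l1i_miss(a) for a in (0x100, 0x140, 0x180)])  # [False, False, True]
```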

  9. TMi: Where to Migrate? • Check the last N misses recorded in the other caches: 1) No matching cache ⇒ move to an idle core if one exists 2) Matching cache ⇒ move to that core 3) Neither ⇒ do not move
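The three-case destination choice can be written out directly (a sketch; the function name and data structures are assumptions for illustration, not hardware detail — a matching cache is preferred, then an idle core, otherwise the thread stays put):

```python
def choose_destination(my_recent, other_recent, idle_cores):
    """Pick a migration target following the slide's three cases.

    my_recent:    recent miss addresses of the migrating thread
    other_recent: {core_id: set of that core's recently recorded misses}
    idle_cores:   list of currently idle core ids
    """
    # Case 2: some other cache recorded the same misses -> move there.
    for core_id, recorded in other_recent.items():
        if set(my_recent) & recorded:
            return core_id
    # Case 1: no matching cache, but an idle core exists -> move there.
    if idle_cores:
        return idle_cores[0]
    # Case 3: neither -> do not move.
    return None

print(choose_destination([0x100, 0x140], {1: {0x140}, 2: set()}, []))  # 1
print(choose_destination([0x100], {1: {0x200}}, [3]))                  # 3
print(choose_destination([0x100], {1: {0x200}}, []))                   # None
```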

  10. Experimental Setup • Trace simulation – PIN to extract instruction & data accesses per transaction • 16-core system • 32KB 8-way set-associative L1 caches • Miss threshold is 256 • Last 6 misses are kept • Shore-MT as the storage manager • Workloads: TPC-C, TPC-E

  11. Impact on L1-I Misses [Chart] Instruction misses are reduced by half

  12. Impact on L1-D Misses [Chart] The increased data misses cannot be ignored

  13. TMi’s Challenges • Dealing with the data left behind – Prefetching – Depends on thread identification • Software assisted • Hardware detection • OS support needed – Disabling OS control over thread scheduling

  14. Conclusion • ~50% of the time OLTP stalls on instructions • Spread computation through thread migration • TMi – Halves L1-I misses – Time-wise, ~30% improvement expected – Data misses must be handled Thank you!

  15. Backup

  16. L1-I Misses per K-Instruction

  17. L1-D Misses per K-Instruction

  18. Replacement Policies

  19. Experimental Setup • Intel VTune 2011 – Interface for hardware counters • Working set fits in RAM • Log flushed to RAM • Each run: – Starts with the initial database – Each worker executes 1000 transactions before VTune starts collecting numbers for 60 seconds

  20. Formulas • IPC = INST_RETIRED.ANY_P / CPU_CLK_UNHALTED.THREAD • Data Stalls = RESOURCE_STALLS.ANY • Instruction Stalls = UOPS_ISSUED.CORE_STALL_CYCLES - RESOURCE_STALLS.ANY
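Plugging sample counter readings into these formulas shows how observations like "IPC < 1" and "70-80% of stalls are instruction stalls" fall out (the counter values below are illustrative assumptions, not measured data from the talk):

```python
# Hypothetical hardware-counter readings (illustrative, not measured)
INST_RETIRED_ANY_P = 4_000_000_000
CPU_CLK_UNHALTED_THREAD = 6_000_000_000
RESOURCE_STALLS_ANY = 1_000_000_000
UOPS_ISSUED_CORE_STALL_CYCLES = 4_000_000_000

# The slide's formulas, applied directly
ipc = INST_RETIRED_ANY_P / CPU_CLK_UNHALTED_THREAD
data_stalls = RESOURCE_STALLS_ANY
instruction_stalls = UOPS_ISSUED_CORE_STALL_CYCLES - RESOURCE_STALLS_ANY

print(round(ipc, 2))                                       # 0.67 -> IPC < 1
print(instruction_stalls / UOPS_ISSUED_CORE_STALL_CYCLES)  # 0.75 of stall cycles
```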

  21. OLTP L1 Instruction Cache Misses • Trace simulation • 4-way L1-I cache • Shore-MT [Chart: misses vs. L1-I size; ~512KB is enough for the OLTP instruction footprint, while the L1-I sizes most common today are far smaller]
