1 / 58

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split

LG 전자 세미나 Mar 18, 2014 Paper presented in USENIX FAST ’14, Santa Clara, CA. Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split. Wook-Hee Kim 1 , Beomseok Nam 1 , Dongil Park 2 , Youjip Won 2

barid
Download Presentation

Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LG 전자 세미나 Mar 18, 2014 Paper presented in USENIX FAST ’14, Santa Clara, CA Resolving Journaling of Journal Anomaly in Android I/O: Multi-Version B-tree with Lazy Split Wook-Hee Kim1, Beomseok Nam1, Dongil Park2, Youjip Won2 1 Ulsan National Institute of Science and Technology, Korea 2 Hanyang University, Korea

  2. Outline • Motivation • Journaling of Journal Anomaly • How to resolve Journaling of Journal anomaly • Multi-Version B-Tree (MVBT) • Optimizations of Multi-Version B-Tree • Lazy Split • Reserved Buffer Space • Lazy Garbage Collection • Metadata Embedding • Disabling Sibling Redistribution • Evaluation • Demo

  3. Android is the most popular mobile platform

  4. Storage in Android platform Performance Bottleneck Major cause for device break down. 1 1 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 1 0 1 0 Cause = Excessive IO

  5. Android I/O Stack Insert/Update/ Delete/Select Misaligned Interaction SQLite Read/Write EXT4 Read/Write Block Device Driver

  6. SQLite insert (Truncate mode) insert SQLite O W fsync W fsync W T Open rollback journal. Record the data to journal. Put commit mark to journal. Insert entry to DB Truncate journal.

  7. EXT4 Journaling (ordered mode) W SQLite write() D J M C EXT4 Write data block Write EXT4 journal Write journal metadata Write journal commit

  8. Journaling of Journal Anomaly insert fsync O W SQLite W fsync W D J M C D D J M C EXT4 One insert of 100 Byte  9 Random Writes of 4KByte Write data block Write EXT4 journal Write journal metadata Write journal commit

  9. How to Resolve Journaling of Journal? insert SQLite fsync O W W fsync W Database Journaling Database Insertion D J M C D D J M C EXT4

  10. How to Resolve Journaling of Journal? Reducing the number of fsync() calls fsync() fsync() fsync() Database Insertion Database Journaling Journaling + Database = Version-based B-Tree (MVBT)

  11. Version-based B-Tree (MVBT) Insert 10 Update 10 with 20 Time T2 T1 20 [T2, ∞) 10 [T1, ∞) 10 [T1, T2) 10 [T1, T2) Do not overwrite old version data Do not need a rollback journal

  12. Node Split in Multi-Version B-Tree Page 1 Key = 25 = 5 12 [4~∞) 40 [1~∞) 5 [3~∞) 10 [2~∞)

  13. Node Split in Multi-Version B-Tree Dead Node Page 1 Key = 25 = 5 12 [4~5) 40 [1~5) 5 [3~5) 10 [2~5)

  14. Node Split in Multi-Version B-Tree Page 1 Key = 25 = 5 12 [4~5) 40 [1~5) 5 [3~5) 10 [2~5)

  15. Node Split in Multi-Version B-Tree Page 3 Page 2 Page 1 Key = 25 = 5 12 [5~∞) 12 [4~5) 40 [5~∞) 40 [1~5) 5 [5~∞) 5 [3~5) 10 [5~∞) 10 [2~5)

  16. Node Split in Multi-Version B-Tree Page 2 Page 4 Page 1 Page 3 ∞ [0~5) : Page 1 12 [5~∞) 12 [4~5) 40 [1~5) 40 [5~∞) 5 [5~∞) 5 [3~5) 10 [5~∞) : Page 3 10 [5~∞) 10 [2~5) ∞ [5~∞) : Page 2 25 [5~∞)

  17. Node Split in Multi-Version B-Tree One more dirty page than original B-Tree dirty Page 3 Page 4 Page 2 Page 1 ∞ [0~5) : Page 1 12 [5~∞) 12 [4~5) 40 [5~∞) 40 [1~5) dirty dirty dirty 10 [5~∞) : Page 3 5 [3~5) 5 [5~∞) 10 [2~5) ∞ [5~∞) : Page 2 10 [5~∞) 25 [5~∞)

  18. I/O Traffic in MVBT DB buffer cache fsync() MVBT Reduce the number of dirty pages!

  19. I/O Traffic in MVBT DB buffer cache fsync() MVBT DB buffer cache fsync() LS -MVBT

  20. Optimizations in Android I/O tread >> twrite Read Write Write Transaction 1 Multiple Insert Write Transaction 2 Single Insert Transaction Write Transaction 3

  21. Lazy Split Multi-Version B-Tree (LS-MVBT) Optimizations Lazy Split LS-MVBT Reserved Buffer Space Disabling Sibling Redistribution Lazy Garbage Collection Metadata Embedding

  22. Optimization1: Lazy Split • Legacy Split in MVBT dirty Page 4 Page 2 Page 1 Page 3 ∞ [0~5) : Page 1 4 dirty pages 12 [5~∞) 12 [4~5) 40 [5~∞) 40 [1~5) dirty dirty dirty 5 [5~∞) 10 [5~∞) : Page 3 5 [3~5) ∞ [5~∞) : Page 2 10 [2~5) 10 [5~∞) 25 [5~∞)

  23. Optimization1: Lazy Split Page 1 Key = 25 = 5 12 [4~∞) 40 [1~∞) 5 [3~∞) 10 [2~∞)

  24. Optimization1: Lazy Split Lazy Node: Half-dead, Half-live Page 1 Key = 25 5 [3~∞) 10 [2~∞) = 5 12 [4~5) 40 [1~5)

  25. Optimization1: Lazy Split Page 1 Key = 25 5 [3~∞) 10 [2~∞) = 5 12 [4~5) 40 [1~5)

  26. Optimization1: Lazy Split Page 2 Page 1 Key = 25 5 [3~∞) 10 [2~∞) = 5 12 [5~∞) 12 [4~5) 40 [5~∞) 40 [1~5)

  27. Optimization1: Lazy Split Page 2 Page 1 Page 3 5 [3~∞) 10 [5~∞) : Page 1 12 [5~∞) 10 [2~∞) ∞ [0~5) : Page 1 12 [4~5) ∞ [5~∞) : Page 2 40 [5~∞) 40 [1~5) 25 [5~∞)

  28. Optimization1: Lazy Split • Lazy Node Overflow Key = 8 Page 3 Page 1 10 [5~∞) : Page 1 5 [3~∞) = 6 10 [2~∞) ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 12 [4~5) 40 [1~5) Page 2 12 [5~∞) Dead entries 25 [5~∞) 40 [5~∞)

  29. Optimization1: Lazy Split • Lazy Node Overflow → Garbage collect dead entries Key = 8 Page 3 10 [5~∞) : Page 1 = 6 ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 Page 1 Page 2 5 [3~∞) 10 [2~∞) 12 [5~∞) 25 [5~∞) 12 [4~5) 40 [5~∞) 40 [1~5)

  30. Optimization1: Lazy Split • Lazy Node Overflow → Garbage collect dead entries Key = 8 Page 3 10 [5~∞) : Page 1 = 6 ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 Page 1 Page 2 5 [3~∞) 8 [6~∞) 12 [5~∞) 25 [5~∞) 10 [2~∞) 40 [5~∞) But, what if the dead entries are being accessed by other transactions?

  31. Optimization2: Reserved Buffer Space • Option 1. Wait for read transactions to finish Key = 8 Page 3 Page 2 10 [5~∞) : Page 1 = 6 12 [5~∞) ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 40 [5~∞) Page 1 5 [3~∞) 10 [2~∞) 25 [5~∞) 12 [4~5) 40 [1~5)

  32. Optimization2: Reserved Buffer Space • Option 2. Split as in legacy MVBT split Key = 8 Page 3 Page 2 10 [5~∞) : Page 1 = 6 12 [5~∞) ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 40 [5~∞) Page 4 Page 1 5 [6~∞) 5 [3~6) 10 [6~∞) 10 [2~6) 12 [4~5) 25 [5~∞) 40 [1~5)

  33. Optimization2: Reserved Buffer Space • Option 2. Split as in legacy MVBT split Page 2 Page 3 10 [5~6) : Page 1 10 [6~∞) : Page 3 12 [5~∞) ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 40 [5~∞) Page 4 Page 1 5 [6~∞) 5 [3~6) 8 [6~∞) 10 [2~6) 25 [5~∞) 10 [6~∞) 12 [4~5) 40 [1~5)

  34. Optimization2: Reserved Buffer Space • Option 3. Pad some buffer space in tree nodes Key = 8 Page 3 10 [5~∞) : Page 1 = 6 ∞ [0~5) : Page 1 ∞ [5~∞) : Page 2 Page 1 Page 2 12 [5~∞) 10 [2~∞) Buffer space is used when lazy node is full 25 [5~∞) 12 [4~5) 40 [1~5) 40 [5~∞) If buffer space is also full, split as in legacy MVBT 8 [6~∞)

  35. Rollback in LS-MVBT • Similar to rollback in MVBT Rollback Page 3 Page 3 Page 1 Page 3 Page 1 Page 3 Page 1 Page 1 Txn 5 crashes 5 [3~∞) 5 [3~∞) ∞ [0~∞) : Page 1 5 [3~∞) 5 [3~∞) 10 [5~∞) : Page 1 10 [5~∞) : Page 1 10 [2~∞) 10 [2~∞) 10 [2~∞) ∞ [0~5) : Page 1 10 [2~∞) ∞ [0~5) : Page 1 ∞ [0~5) : Page 1 • Number of dirty nodes touched by rollback of LS-MVBT is also smaller than that of MVBT. 12 [4~5) ∞ [5~∞) : Page 2 12 [4~5) 12 [4~5) 12 [4~∞) ∞ [5~∞) : Page 2 40 [1~ ∞) 40 [1~5) 40 [1~5) 40 [1~5) Page 2 12 [5~∞) 25 [5~∞) 40 [5~∞)

  36. Optimization3: Lazy Garbage Collection Periodic Garbage Collection Transaction 1 Transaction buffer cache P1 P2 P3 P1

  37. Optimization3: Lazy Garbage Collection Periodic Garbage Collection Transaction 1 Stopped Transaction buffer cache P1 P2 P3 P1 P2 P3 fsync() Garbage Collection Extra Dirty Pages GC buffer cache

  38. Optimization3: Lazy Garbage Collection Lazy Garbage Collection Do not garabge collect if no space is needed Transaction 1 fsync() Insert P2 P3 P1 P1 Transaction buffer cache No Extra Dirty Page by GC

  39. Optimization4: Metadata embedding Version = “File Change Counter” in DB header page Header Page 1 dirty 5 6 ++ Write Transaction 6 Page 2 ... 2 dirty pages Commit Read Transaction 5 dirty Page N 15 [6~∞) 25 [5~∞) 40 [1~∞)

  40. Optimization4: Metadata embedding Flush “File Change Counter” to the last modified page and RAMDISK Header Page 1 Write Transaction 6 Page 2 ... 5 Commit Read Transaction 5 6 Page N 5 RAMDISK 15 [6~∞) 25 [5~∞) 40 [1~∞) 6 5

  41. Optimization4: Metadata embedding Flush “File Change Counter” to the last modified page and RAMDISK Header Page 1 Write Transaction 6 Page 2 ... 5 CRASH Commit Read Transaction 5 Page N 6 dirty RAMDISK 15 [6~∞) 25 [5~∞) 40 [1~∞) 1 dirty page 6 6 Volatile

  42. Optimization5: Disable Sibling Redistribution • Sibling redistribution hurts insertion performance • Disable sibling redistribution → avoid dirtying 4 pages Page 5: Page 5: dirty dirty 12 10 10 : Page 4 10 : Page 4 Insertion of key 20 40 25 • : Page 3 • : Page 3 ∞ : Page 2 ∞ : Page 2 4 dirty pages Page 3: dirty dirty dirty Page 4: Page 2: 12 55 ∞ 5 10 15 20 25 40

  43. Optimization5: Disable Sibling Redistribution • Disabled sibling redistribution Page 5: dirty 10 : Page 4 • : Page 3 15 : Page 5 Insertion of key 20 90 : Page 2 Page 3: dirty Page 4 Page 2 dirty Page 5: 20 12 55 90 5 10 15 25 40 3 dirty pages

  44. Summary Optimizations Lazy Split Avoid dirtying an extra page when split occurs Reserved Buffer Space Disabling Sibling Redistribution Reduce the probability of node split LS-MVBT Do not touch siblings to make search faster Lazy Garbage Collection Metadata Embedding Delete dead entries on the next mutation Avoid dirtying header page

  45. Performance Evaluation • Implemented LS-MVBT in SQLite 3.7.12 • Testbed: Samsung Galaxy-S4 (JellyBean 4.3) • Exynos 5 Octa 5410 Quad Core 1.6GHz • 2GB Memory • 16GB eMMC • Performance Analysis Tools • Mobibench • MOST

  46. Analysis of insert Performance SQLite Insert (Response Time) LS-MVBT fsync WAL fsync() MVBT fsync() B-tree computation LS-MVBT: 30~78% faster than WAL mode WAL Checkpointing Interval

  47. Analysis of insert Performance SQLite Insert (# of Dirty Pages per Insert) LS-MVBT flushes only one dirty page

  48. I/O Traffic at Block Device Level SQLite Insert (I/O Traffic per 10 msec) 1000 insertions (LS-MVBT) 1000 insertions (WAL) LS-MVBT saves 67% I/O traffic: 9.9 MB (LS-MVBT) vs 31 MB (WAL) x3 longer life time of NAND flash

  49. How about Search performance? • LS-MVBT is optimized for write performance.

  50. Search Performance Transaction Throughput with varying Search/Insert ratio LS-MVBT wins unless more than 93% of transactions are search

More Related