HBase at Xiaomi


  1. HBase at Xiaomi Liang Xie / Honghua Feng {xieliang, fenghonghua}@xiaomi.com www.mi.com

  2. About Us Honghua Feng Liang Xie www.mi.com

  3. Outline • Introduction • Latency practice • Some patches we contributed • Some ongoing patches • Q&A www.mi.com

  4. About Xiaomi • Mobile internet company founded in 2010 • Sold 18.7 million phones in 2013 • Over $5 billion revenue in 2013 • Sold 11 million phones in Q1, 2014 www.mi.com

  5. Hardware www.mi.com

  6. Software www.mi.com

  7. Internet Services www.mi.com

  8. About Our HBase Team • Founded in October 2012 • 5 members • Liang Xie • Shaohui Liu • Jianwei Cui • Liangliang He • Honghua Feng • Resolved 130+ JIRAs so far www.mi.com

  9. Our Clusters and Scenarios • 15 Clusters : 9 online / 2 processing / 4 test • Scenarios • MiCloud • MiPush • MiTalk • Perf Counter www.mi.com

  10. Our Latency Pain Points • Java GC • Stable page write in OS layer • Slow buffered IO (FS journal IO) • Read/Write IO contention www.mi.com

  11. HBase GC Practice • Bucket cache with off-heap mode • Xmn/SurvivorRatio/MaxTenuringThreshold • PretenureSizeThreshold & repl src size • GC concurrent thread number GC time per day: [2500, 3000]s -> [300, 600]s !!! www.mi.com

  12. Write Latency Spikes
  HBase client put
    -> HRegion.batchMutate
    -> HLog.sync
    -> SequenceFileLogWriter.sync
    -> DFSOutputStream.flushOrSync
    -> DFSOutputStream.waitForAckedSeqno   <Stuck here often!>
  ===================================================
  DataNode pipeline write, in BlockReceiver.receivePacket():
    -> receiveNextPacket
    -> mirrorPacketTo(mirrorOut)   // write packet to the mirror
    -> out.write/flush             // write data to local disk   <- buffered IO
  Added instrumentation (HDFS-6110) showed the stalled write() was the culprit; strace results also confirmed it. www.mi.com

  13. Root Cause of Write Latency Spikes • write() is expected to be fast • But blocked by write-back sometimes! www.mi.com

  14. Stable Page Write Issue Workaround Workaround: downgrade kernel 2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (6.2), or upgrade 2.6.32-279 (6.3) -> 2.6.32-358 (6.4) Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster! www.mi.com

  15. Root Cause of Write Latency Spikes
  ...
  0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
  0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
  0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
  0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
  0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
  0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
  0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
  0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
  0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
  0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
  0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
  0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
  0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
  0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
  XFS on recent kernels can relieve the journal IO blocking issue and is more friendly to metadata-heavy scenarios like HBase + HDFS. www.mi.com

  16. Write Latency Spikes Testing 8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DataNodes; kernel 3.12.17. Collect statistics on stalled write() calls that cost > 100 ms. The largest write() latency on ext4: ~600 ms! www.mi.com

  17. Hedged Read (HDFS-5776) www.mi.com
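
  The hedged read feature (HDFS-5776) lets the DFSClient fire a second read at another replica when the first one is slow and take whichever answer comes back first. A minimal Java sketch of enabling it from a client; the configuration keys come from HDFS-5776, while the pool size, threshold and file path below are illustrative placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HedgedReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // A thread pool size > 0 enables hedged reads; if the first replica has not
            // answered within the threshold, a second read is sent to another replica.
            conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
            conf.setLong("dfs.client.hedged.read.threshold.millis", 10);
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/hbase/some-hfile"))) {
                byte[] buf = new byte[4096];
                in.read(0L, buf, 0, buf.length);   // positional read, candidate for hedging
            }
        }
    }

  In an HBase deployment the same two keys would normally go into the region server's hbase-site.xml so the embedded DFSClient picks them up.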

  18. Other Meaningful Latency Work • Long first “put” issue (HBASE-10010) • Token invalid (HDFS-5637) • Retry/timeout setting in DFSClient • Reduce write traffic? (HLog compression) • HDFS IO Priority (HADOOP-10410) www.mi.com

  19. Wish List • Real-time HDFS, especially priority-related • GC-friendly core data structures • More off-heap; Shenandoah GC • TCP/disk IO characteristic analysis Need more eyes on the OS. Stay tuned… www.mi.com

  20. Some Patches Xiaomi Contributed • New write thread model (HBASE-8755) • Reverse scan (HBASE-4811) • Per table/CF replication (HBASE-8751) • Block index key optimization (HBASE-7845) www.mi.com

  21. 1. New Write Thread Model Old model: each of the 256 WriteHandler threads appends its edits to the local buffer, writes to HDFS, and syncs to HDFS itself. Problem: the WriteHandler does everything, causing severe lock contention! www.mi.com

  22. New Write Thread Model New model: the 256 WriteHandler threads only append to the local buffer; 1 AsyncWriter thread writes to HDFS, 4 AsyncSyncer threads sync to HDFS, and 1 AsyncNotifier thread notifies the waiting handlers. www.mi.com
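
  A minimal, self-contained sketch of the decoupled pipeline described on this slide: handler threads only enqueue edits, one writer thread appends, a small syncer pool syncs, and completion of the sync wakes the waiting handlers. The class and method names are illustrative, not the actual HBASE-8755 code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class AsyncWalSketch {
        // One pending WAL edit plus the future its handler blocks on.
        static final class PendingEdit {
            final byte[] payload;
            final CompletableFuture<Void> synced = new CompletableFuture<>();
            PendingEdit(byte[] payload) { this.payload = payload; }
        }

        private final BlockingQueue<PendingEdit> queue = new LinkedBlockingQueue<>();
        // Plays the roles of the 4 AsyncSyncer threads and the AsyncNotifier.
        private final ExecutorService syncers = Executors.newFixedThreadPool(4);

        public AsyncWalSketch() {
            // The single AsyncWriter: drains the queue, appends, hands batches to syncers.
            Thread writer = new Thread(this::writeLoop, "AsyncWriter");
            writer.setDaemon(true);
            writer.start();
        }

        // Called by the many WriteHandler threads; they no longer write or sync themselves.
        public Future<Void> append(byte[] edit) {
            PendingEdit p = new PendingEdit(edit);
            queue.add(p);
            return p.synced;
        }

        private void writeLoop() {
            while (true) {
                try {
                    List<PendingEdit> batch = new ArrayList<>();
                    batch.add(queue.take());      // block until at least one edit arrives
                    queue.drainTo(batch);         // group whatever else is already pending
                    appendToLog(batch);           // "write to HDFS" (stubbed here)
                    syncers.submit(() -> {
                        syncLog();                // "sync to HDFS" (stubbed here)
                        // AsyncNotifier role: wake up the handlers waiting on these edits.
                        batch.forEach(e -> e.synced.complete(null));
                    });
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        private void appendToLog(List<PendingEdit> batch) { /* stub: buffered append */ }
        private void syncLog() { /* stub: hflush/hsync */ }
    }

  Because only one thread ever appends and only a few threads ever sync, the per-edit lock contention of the old model disappears, which is where the heavy-load gain on the next slide comes from.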

  23. New Write Thread Model • Low load: no improvement • Heavy load: huge improvement (3.5x) www.mi.com

  24. 2. Reverse Scan 1. All scanners seek to their ‘previous’ rows (SeekBefore) 2. Figure out the next row: the max ‘previous’ row 3. All scanners seek to the first KV of that next row (SeekTo) [diagram: the KVs of Row1 through Row6 spread across several HFiles] Performance: 70% of forward scan www.mi.com
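
  An illustrative sketch of step 2 above, the part that decides which row a reverse scan returns next. In-memory sorted sets stand in for the per-HFile/memstore scanners, and the names are hypothetical rather than taken from the HBASE-4811 patch:

    import java.util.List;
    import java.util.TreeSet;

    public class ReverseScanSketch {
        // Step 1: every "scanner" seeks to its greatest row strictly before currentRow
        // (SeekBefore). Step 2: the next row to return is the maximum of those candidates.
        static String previousRow(List<TreeSet<String>> storeRowSets, String currentRow) {
            String next = null;
            for (TreeSet<String> rows : storeRowSets) {
                String candidate = rows.lower(currentRow);   // greatest row < currentRow
                if (candidate != null && (next == null || candidate.compareTo(next) > 0)) {
                    next = candidate;
                }
            }
            // Step 3 would then SeekTo the first KV of this row in every scanner.
            return next;
        }
    }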

  25. 3. Per Table/CF Replication [diagram: Source (T1: cfA, cfB; T2: cfX, cfY) replicating to PeerA (a full backup) and PeerB, which only wants T2:cfX] • If PeerB creates T2 only: replication can’t work! • If PeerB creates T1 & T2: all data gets replicated! Need a way to specify which data to replicate! www.mi.com

  26. Per Table/CF Replication • add_peer ‘PeerA’, ‘PeerA_ZK’ • add_peer ‘PeerB’, ‘PeerB_ZK’, ‘T2:cfX’ [diagram: Source replicates all of T1 and T2 to PeerA, but only T2:cfX to PeerB] www.mi.com

  27. 4. Block Index Key Optimization [diagram: Block 1 ends with k1 = “ab”; Block 2 starts with k2 = “ah, hello world”] Before: ‘Block 2’ block index key = “ah, hello world/…” Now: ‘Block 2’ block index key = “ac/…” (k1 < key <= k2) • Reduces block index size • Saves seeking the previous block if the search key is in [‘ac’, ‘ah, hello world’] www.mi.com
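
  An illustrative way to compute a short separator key with k1 < key <= k2, matching the “ab” / “ah, hello world” -> “ac” example above. This is a simplified sketch, not the actual HFile block index writer code:

    import java.util.Arrays;

    public class ShortSeparatorSketch {
        // left = last key of the previous block (k1), right = first key of this block (k2);
        // requires left < right in unsigned byte order.
        static byte[] shortSeparator(byte[] left, byte[] right) {
            int min = Math.min(left.length, right.length);
            for (int i = 0; i < min; i++) {
                int l = left[i] & 0xff, r = right[i] & 0xff;
                if (l != r) {
                    // First differing byte: bump the left key's byte and truncate there,
                    // e.g. "ab" vs "ah, hello world" -> "ac"; left < result <= right holds.
                    byte[] sep = Arrays.copyOf(left, i + 1);
                    sep[i] = (byte) (l + 1);
                    return sep;
                }
            }
            return right;   // left is a prefix of right: fall back to the full first key
        }
    }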

  28. Some Ongoing Patches • Cross-table cross-row transaction (HBASE-10999) • HLog compactor (HBASE-9873) • Adjusted delete semantic (HBASE-8721) • Coordinated compaction (HBASE-9528) • Quorum master (HBASE-10296) www.mi.com

  29. 1. Cross-Row Transaction: Themis http://github.com/xiaomi/themis • Google Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications • Two-phase commit: strong cross-table/cross-row consistency • Global timestamp server: globally, strictly incremental timestamps • No touching of HBase internals: built on the HBase client and coprocessors • Read: 90%, Write: 23% (same downgrade as Google Percolator) • More details: HBASE-10999 www.mi.com
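
  A hypothetical outline of the Percolator-style two-phase commit flow summarized above. The class and method names are illustrative only and do not match the real Themis client API (see the GitHub link on the slide); the HBase-level work happens inside the stubbed methods via coprocessors:

    import java.util.List;

    public class PercolatorStyleTxnSketch {
        interface TimestampOracle { long next(); }   // the global, strictly incremental timestamp server

        static final class Mutation {
            final byte[] row, family, qualifier, value;
            Mutation(byte[] r, byte[] f, byte[] q, byte[] v) { row = r; family = f; qualifier = q; value = v; }
        }

        private final TimestampOracle oracle;
        PercolatorStyleTxnSketch(TimestampOracle oracle) { this.oracle = oracle; }

        // Phase 1 locks every cell at the start timestamp; phase 2 commits the primary cell
        // (the atomic commit point) and then the secondaries, which can be retried lazily.
        boolean commit(List<Mutation> mutations) {
            long startTs = oracle.next();
            Mutation primary = mutations.get(0);
            for (Mutation m : mutations) {
                if (!prewrite(m, primary, startTs)) {   // conflicting lock/write -> abort
                    rollback(mutations, startTs);
                    return false;
                }
            }
            long commitTs = oracle.next();
            commitPrimary(primary, startTs, commitTs);
            for (Mutation m : mutations.subList(1, mutations.size())) {
                commitSecondary(m, startTs, commitTs);
            }
            return true;
        }

        private boolean prewrite(Mutation m, Mutation primary, long startTs) { return true; }
        private void commitPrimary(Mutation m, long startTs, long commitTs) { }
        private void commitSecondary(Mutation m, long startTs, long commitTs) { }
        private void rollback(List<Mutation> mutations, long startTs) { }
    }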

  30. 2. HLog Compactor [diagram: Region x gets few writes, but they scatter across HLogs 1, 2, 3] PeriodicMemstoreFlusher: flushes old memstores forcefully • ‘flushCheckInterval’ / ‘flushPerChanges’: hard to configure • Results in ‘tiny’ HFiles • HBASE-10499: a problematic region can’t be flushed! www.mi.com

  31. HLog Compactor • Compact: HLog 1, 2, 3, 4 -> HLog x • Archive: HLog 1, 2, 3, 4 (only the compacted HLog x remains) [diagram: Region x’s scattered edits now live only in HLog x, while its memstore and HFiles are untouched] www.mi.com
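
  A hypothetical sketch of the compaction step above (the interfaces are stand-ins, not the HBASE-9873 code): only the edits that are newer than what their region has already flushed get rewritten into the new HLog; everything else is durable in HFiles, so the old logs can be archived without forcing tiny flushes.

    import java.util.List;

    public class HLogCompactorSketch {
        interface WalEntry { String regionName(); long sequenceId(); }
        interface WalReader extends AutoCloseable { WalEntry next() throws Exception; }
        interface WalWriter extends AutoCloseable { void append(WalEntry e) throws Exception; }
        interface FlushTracker { long lastFlushedSeqId(String regionName); }

        static void compact(List<WalReader> oldLogs, WalWriter compactedLog, FlushTracker tracker)
                throws Exception {
            for (WalReader log : oldLogs) {
                for (WalEntry e = log.next(); e != null; e = log.next()) {
                    if (e.sequenceId() > tracker.lastFlushedSeqId(e.regionName())) {
                        compactedLog.append(e);   // still needed for recovery
                    }
                }
                log.close();
            }
            compactedLog.close();
            // HLog 1..4 can now be archived; only the compacted HLog x has to be kept.
        }
    }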

  32. 3. Adjusted Delete Semantic
  Scenario 1: 1. Write kvA at t0  2. Delete kvA at t0, flush to HFile  3. Write kvA at t0 again  4. Read kvA  Result: kvA can’t be read out
  Scenario 2: 1. Write kvA at t0  2. Delete kvA at t0, flush to HFile  3. Major compact  4. Write kvA at t0 again  5. Read kvA  Result: kvA can be read out
  Fix: “a delete can’t mask KVs with a larger mvcc (put later)” www.mi.com
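
  A small client-side illustration of scenario 1, written against the standard HBase client API; the table, family and qualifier names are placeholders:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteSemanticSketch {
        public static void main(String[] args) throws Exception {
            byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("f"), q = Bytes.toBytes("q");
            long t0 = 1000L;
            TableName name = TableName.valueOf("t");
            try (Connection conn = ConnectionFactory.createConnection();
                 Admin admin = conn.getAdmin();
                 Table table = conn.getTable(name)) {
                table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v1")));  // 1. write kvA at t0
                Delete d = new Delete(row);
                d.addColumn(cf, q, t0);                                             // 2. delete kvA at t0 ...
                table.delete(d);
                admin.flush(name);                                                  //    ... and flush to HFile
                table.put(new Put(row).addColumn(cf, q, t0, Bytes.toBytes("v2")));  // 3. write kvA at t0 again
                Result r = table.get(new Get(row));                                 // 4. read kvA
                // Under the old semantics the flushed delete marker still masks the newer
                // put, so r is empty; after a major compaction the same steps return "v2".
                System.out.println("kvA visible? " + !r.isEmpty());
            }
        }
    }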

  33. 4. Coordinated Compaction [diagram: several RegionServers compacting at once against HDFS (a global resource): compaction storm!] • Compact uses a global HDFS, while whether to compact is decided locally! www.mi.com

  34. Coordinated Compaction [diagram: each RegionServer asks the Master “Can I?” and gets OK or NO before compacting against HDFS (the global resource)] • Compaction is scheduled by the master, no compaction storms any longer www.mi.com
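
  A hypothetical sketch of the “Can I?” handshake (not the HBASE-9528 implementation): the master hands out a bounded number of compaction permits, so the region servers cannot all hit HDFS at the same time.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class CompactionCoordinatorSketch {
        // Cluster-wide budget of concurrent compactions; the value is illustrative.
        private final Semaphore permits = new Semaphore(8);

        // A region server asks "Can I?" before starting a compaction; NO simply means
        // it retries later, so compactions get spread out instead of forming a storm.
        public boolean requestPermit(long waitMs) throws InterruptedException {
            return permits.tryAcquire(waitMs, TimeUnit.MILLISECONDS);
        }

        // The region server reports completion so the slot is freed for someone else.
        public void releasePermit() {
            permits.release();
        }
    }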

  35. 5. Quorum Master [diagram: an active and a standby Master, a ZooKeeper ensemble (zk1, zk2, zk3) holding the info/states they read, and the RegionServers] • When the active master serves, the standby master stays ‘really’ idle • When the standby master becomes active, it needs to rebuild the in-memory state www.mi.com

  36. Quorum Master [diagram: a quorum of Masters (Master 1, 2, 3) serving the RegionServers directly, with failover inside the quorum] • Better master failover performance: no phase to rebuild the in-memory state • Better restart performance for BIG clusters (10K+ regions) • No external (ZooKeeper) dependency • No potential consistency issues • Simpler deployment www.mi.com

  37. Acknowledgement Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li, Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen www.mi.com

  38. Thank You! xieliang@xiaomi.com fenghonghua@xiaomi.com www.mi.com
