
HBase



  1. HBase

  2. HBase Overview
  • HBase is a distributed, column-oriented database built on top of HDFS.
  • It is easy to scale to demand.
  • HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
  • MapReduce is used to search the data.
  • HBase depends on ZooKeeper and, by default, manages a ZooKeeper instance as the authority on cluster state.
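
  To make the random-access model concrete, the sketch below writes and then reads back a single cell through the classic HBase Java client API. The table name "flows", the column family "stats", and the qualifier "tcp" are hypothetical, not taken from the presentation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml, including the ZooKeeper quorum address.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "flows");  // hypothetical table name

        // Random write: one cell addressed by (row key, family, qualifier).
        Put put = new Put(Bytes.toBytes("row-001"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("tcp"), Bytes.toBytes("42"));
        table.put(put);

        // Random read of the same cell.
        Get get = new Get(Bytes.toBytes("row-001"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("tcp"))));

        table.close();
      }
    }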

  3. Data Model
  • The data model is similar to Bigtable's.
  • A data row has a sortable row key and an arbitrary number of columns.
  • The table is stored sparsely, so rows in the same table can have widely varying numbers of columns (see the sketch after this slide).
  (Figures: Conceptual View and Physical Storage View)
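
  To illustrate the sparsity, the fragment below (continuing the sketch above) writes two rows with different column sets, modeled on the well-known web-table example from the Bigtable paper; the row keys, families, and values are illustrative only. Physically, HBase stores each column family's cells together, and absent columns occupy no space at all.

    // Row 1: one "contents" column plus two "anchor" columns.
    Put row1 = new Put(Bytes.toBytes("com.cnn.www"));
    row1.add(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
    row1.add(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), Bytes.toBytes("CNN"));
    row1.add(Bytes.toBytes("anchor"), Bytes.toBytes("my.look.ca"), Bytes.toBytes("CNN.com"));
    table.put(row1);

    // Row 2: no "anchor" columns at all, and nothing is stored for them.
    Put row2 = new Put(Bytes.toBytes("org.example.www"));
    row2.add(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
    table.put(row2);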

  4. Example
  • Capture network packets into HDFS, saving to a new file every minute.
  • Run a MapReduce application to estimate flow status (see the sketch after this list):
  • count the number of TCP, UDP, and ICMP packets
  • compute the TCP, UDP, or total packet flow
  • The results are saved to HBase.
  • The row key and timestamp are the capture time.
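
  A minimal sketch of such a counting job, assuming one tab-separated packet record per input line with the protocol name in the second field; the record layout and class names are hypothetical. The job driver would then write each total into HBase under a row key equal to the capture time.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (protocol, 1) for every captured packet record.
    public class PacketCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t"); // hypothetical record layout
        String protocol = fields[1];                    // "tcp", "udp", or "icmp"
        context.write(new Text(protocol), ONE);
      }
    }

    // Reducer: sum the per-protocol counts.
    class PacketCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text protocol, Iterable<LongWritable> counts, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) sum += c.get();
        context.write(protocol, new LongWritable(sum));
      }
    }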

  5. Display
  • Specify a start time and a stop time to scan the table, then aggregate the data and display it as a flow graph (see the sketch below).
  (Figure: sample output)
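
  Because the row keys are capture times, a start/stop time query is simply a row-range scan. A minimal sketch, continuing the earlier "flows" example; the time-formatted row keys are hypothetical.

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // [startRow, stopRow) doubles as the [start time, stop time) range.
    Scan scan = new Scan(Bytes.toBytes("20100101-0000"),   // hypothetical start time
                         Bytes.toBytes("20100101-0100"));  // hypothetical stop time
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] tcp = row.getValue(Bytes.toBytes("stats"), Bytes.toBytes("tcp"));
      // ... accumulate per-minute values for the flow graph ...
    }
    scanner.close();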

  6. The performance of accessing files in HDFS directly versus through an HDFS-based FTP server

  7. Accessing files in HDFS directly (1/7)
  • SSH into the namenode and issue commands there.
  • To upload a file to HDFS:
  • hadoop fs -Ddfs.block.size=<block size in bytes> -Ddfs.replication=<replica count> -put <local file> <HDFS directory>
  • To download a file from HDFS:
  • hadoop fs -get <file on HDFS> <local directory>
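
  The same transfers can also be driven programmatically. A minimal sketch using the Hadoop FileSystem API, assuming a 64 MB block size, two replicas, and hypothetical paths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 64L * 1024 * 1024); // block size in bytes
        conf.setInt("dfs.replication", 2);                 // replica count
        FileSystem fs = FileSystem.get(conf);

        // Upload: the equivalent of "hadoop fs -put <local file> <HDFS directory>".
        fs.copyFromLocalFile(new Path("/tmp/capture.pcap"),      // hypothetical local file
                             new Path("/user/demo/captures/"));  // hypothetical HDFS dir

        // Download: the equivalent of "hadoop fs -get <HDFS file> <local directory>".
        fs.copyToLocalFile(new Path("/user/demo/captures/capture.pcap"),
                           new Path("/tmp/out/"));
        fs.close();
      }
    }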

  8. Accessing files in HDFS directly (2/7)
  • By tuning HDFS parameters, we observe HDFS file read/write performance under different conditions. In the slide titles that follow, a label such as R=1 indicates the file's replication (backup) count in HDFS.

  9. Accessing files in HDFS directly (3/7, R=1)
  (Chart. x-axis: data block size, in bytes; y-axis: time to completely write one file into HDFS, in seconds)

  10. Accessing files in HDFS directly (4/7, R=1)
  (Chart. x-axis: data block size, in bytes; y-axis: time to completely read one file out of HDFS, in seconds)

  11. Accessing files in HDFS directly (5/7, R=2)
  (Chart. x-axis: data block size, in bytes; y-axis: time to completely write one file into HDFS, in seconds)

  12. Accessing files in HDFS directly (6/7, R=2)
  (Chart. x-axis: data block size, in bytes; y-axis: time to completely read one file out of HDFS, in seconds)

  13. Accessing files in HDFS directly (7/7)
  • Conclusions:
  • When uploading and downloading files directly on the namenode server running the NameNode daemon, a data block size of 64 MB or 128 MB generally gives the best performance.
  • A higher replication count makes writes take longer, but slightly improves read speed.

  14. Accessing files through an HDFS-based FTP server (1/3)
  • The user connects to the FTP server with an FTP client.
  • "lfs" denotes an ordinary FTP server daemon that accesses the local file system directly.
  • "HDFS" denotes our own FTP server daemon, which accesses HDFS by communicating with the NameNode daemon running on the same server.
  • Every total upload/download time reported below is the average of three measurements.
  • Network bandwidth stayed at roughly 10 Mb/s to 12 Mb/s throughout.

  15. Accessing files through an HDFS-based FTP server (2/3)
  (Chart. x-axis: size of the uploaded file, in GB; y-axis: total upload time, in seconds. HDFS settings: 128 MB block size, replication = 2)

  16. Accessing files through an HDFS-based FTP server (3/3)
  (Chart. x-axis: size of the downloaded file, in GB; y-axis: total download time, in seconds. HDFS settings: 128 MB block size, replication = 2)

  17. Hadoop Authentication Analysis
  • The name node has no notion of the identity of the real user.
  • User identity:
  • The user name is the equivalent of "whoami".
  • The group list is the equivalent of "bash -c groups".
  • The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user.
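
  A minimal sketch of how a client can inspect the identity that Hadoop will report for it, using UserGroupInformation from hadoop-common; details vary across Hadoop versions, so treat this as illustrative rather than definitive.

    import java.util.Arrays;
    import org.apache.hadoop.security.UserGroupInformation;

    public class WhoAmI {
      public static void main(String[] args) throws Exception {
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        // Derived from the OS login: the equivalent of "whoami".
        System.out.println("user:   " + ugi.getUserName());
        // The equivalent of "bash -c groups".
        System.out.println("groups: " + Arrays.toString(ugi.getGroupNames()));
      }
    }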

  18. Why Use a Proxy to Connect to the Name Node
  • DataNodes do not enforce any access control on accesses to their data blocks: a client can connect to a datanode directly and read or write a block just by supplying its Block ID.
  • Any user with a Hadoop client can access HDFS or submit a MapReduce job.
  • Hadoop only works with SOCKS v5 (on the client side, for ClientProtocol and SubmissionProtocol).
  • Conclusion: Hadoop on a private-IP cluster + RADIUS + a SOCKS proxy.

  19. Architecture
  (Diagram)

  20. Architecture
  (Diagram)

  21. Hadoop over SOCKS
  • Only the Hadoop client needs a SOCKS configuration; the Namenode requires no changes (see the sketch below).
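
  A minimal sketch of that client-side setup, assuming Hadoop's stock SocksSocketFactory and its hadoop.socks.server property; the proxy host and port are hypothetical, and the same two keys can instead be set in the client's core-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class SocksClient {
      public static void main(String[] args) throws Exception {
        // Client side only: route Hadoop RPC sockets through a SOCKS v5 proxy.
        Configuration conf = new Configuration();
        conf.set("hadoop.rpc.socket.factory.class.default",
                 "org.apache.hadoop.net.SocksSocketFactory");
        conf.set("hadoop.socks.server", "socks.example.com:1080"); // hypothetical proxy

        // Any FileSystem (or job client) built from this conf now reaches
        // the namenode via the proxy.
        FileSystem fs = FileSystem.get(conf);
        System.out.println(fs.getUri());
        fs.close();
      }
    }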

  22. User Authentication
  • The SOCKS protocol's username/password method is used to decide whether a proxy transfer is permitted.
  • The RADIUS server records whether a user may access Hadoop (by user group).
  • The user accesses Hadoop under the identity (whoami) that runs the Hadoop client.

  23. SOCKS Proxy: Pros and Cons
  • Pros:
  • Supports user authentication.
  • Can filter by IP range, restricting which networks may use the proxy.
  • Does not store the transferred packets; it simply forwards them.
  • Cons:
  • The client must support the SOCKS protocol.
  • The proxy may become a bottleneck; transfer speed depends on the hardware and the chosen SOCKS software.
