Agenda

essien

Presentation Transcript


  1. Agenda
     - Lab time
       - Work on Hadoop Problems (week 5)
       - Due next week (May 13)
       - Answer 15 questions to pass, more to learn a lot
       - Ask questions as needed!
     - Lecture on HBase

  2. Last Time
     - Wrap up Hadoop
     - Introduce Distributed Key/Value stores
       - Memcache
     - Introduce HBase

  3. This Week
     - HBase
       - Pros and Cons
       - Architecture
       - Schema
       - Examples

  4. Next Week
     - Lab
       - Hadoop Problems are due
       - HBase Problems assigned
     - Lecture
       - HBase usage and examples
       - Rest of the Hadoop Ecosystem
         - Cassandra
         - Hive
         - Pig
         - Mahout, Katta, etc.
       - Move into clouds
         - Virtualization
         - Amazon EC2

  5. Review
     - Hadoop
       - Batch processing, no random access
       - Not real-time
       - Free form (no concept of a schema)
     - Distributed Key-Value stores
       - Map some value to some other value
       - Pairs are distributed across servers
     - Distributed Column-Oriented Databases
       - Impose more structure than a DHT
       - More freedom than a relational database
       - Organize/group data by column rather than row
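
The "pairs are distributed across servers" point can be illustrated with a small sketch. It assumes the simplest placement rule (hash the key, take it modulo the number of servers); the slides do not say which rule Memcache or any other store uses, and the names `server_for` and `DistributedKV` are made up for this example.

```python
import hashlib


def server_for(key, servers):
    """Pick a server for a key by hashing it (simple modulo placement, an assumption)."""
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


class DistributedKV:
    """Toy key-value store: one in-memory dict stands in for each server."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.stores = {name: {} for name in self.servers}

    def put(self, key, value):
        # The same hash decides where a pair is written ...
        self.stores[server_for(key, self.servers)][key] = value

    def get(self, key):
        # ... and where it is read back from, so no central index is needed.
        return self.stores[server_for(key, self.servers)].get(key)


kv = DistributedKV(["server-a", "server-b", "server-c"])
kv.put(b"LAStation", b"68")
print(kv.get(b"LAStation"))
```

There is no ordering between keys in this scheme, which is one reason a column-oriented store such as HBase sorts rows instead of hashing them.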

  6. HBase: Key Features
     - Distributed (Fast and Scalable)
     - Column-Oriented
     - Versioned (Multi-Dimensional w/ Time)
     - Highly Available (Robust)
     - Integration with Hadoop for performance
     - Wide and sparsely populated tables
       - NULLs are stored for free

  7. HBase: Limitations
     - Not SQL! No joins, queries, types
     - Fairly new, unlike an RDBMS
       - Secondary indexing is slow
       - Transactions are not as robust
     - No data types
       - Not always a bad thing
     - Consider the trade-offs against a relational database!

  8. HBase Architecture
     - Table is made up of any number of regions
     - Region has a startKey and endKey
       - (WeatherTable, LAStation, JanuaryTemp) → (WeatherTable, NYStation, JanuaryTemp)
     - Regions are distributed to different nodes
     - Nodes store regions as 1 or more files in HDFS
       - Each file is broken into blocks by HDFS
       - HDFS replicates each block to other HDFS nodes
     - Two types of nodes:
       - Master
       - Region Server
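
To make the startKey/endKey idea concrete, here is a minimal sketch of finding the region that holds a given row. It simplifies the slide's (table, row, column) example to plain row keys and assumes start keys are inclusive, end keys are exclusive, and an empty key means "open-ended". The `Region` class, `find_region` function, and server names are hypothetical; this is not the HBase client API.

```python
import bisect


class Region:
    def __init__(self, start_key, end_key, server):
        self.start_key = start_key  # inclusive; b"" means "from the beginning"
        self.end_key = end_key      # exclusive; b"" means "to the end"
        self.server = server        # region server currently hosting this range


def find_region(regions, row_key):
    """Return the region whose key range contains row_key.

    `regions` must be sorted by start_key and cover the key space without gaps.
    """
    starts = [r.start_key for r in regions]
    # Rightmost region whose start_key is <= row_key.
    index = bisect.bisect_right(starts, row_key) - 1
    return regions[index]


regions = [
    Region(b"", b"LAStation", "regionserver-1"),
    Region(b"LAStation", b"NYStation", "regionserver-2"),
    Region(b"NYStation", b"", "regionserver-3"),
]

print(find_region(regions, b"MiamiStation").server)
```

Because the regions are sorted by key, the lookup is a binary search, and a scan over a key range only has to touch the regions that overlap it.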

  9. HBase Architecture
     - Tables are sorted by row
       - e.g. WeatherStation (LA, NY, etc.)
     - Table schema defines column families
       - e.g. temperature, humidity, precipitation
     - Family consists of zero or more columns
       - e.g. temperature:current, temperature:high, temperature:low, temperature:average
     - Families are sorted and stored together for performance
       - Tend to look at all 'attributes' of a group together
     - Columns are versioned
       - Changes are stored as a 3rd dimension, which is the timestamp
       - Designed for timestamps, but the version does not really have to be one
     - Columns only exist when inserted; NULLs are free
     - Everything is a byte[]
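
One consequence of "tables are sorted by row" combined with "everything is a byte[]" is that rows sort in byte order, not numeric order. The sketch below shows the effect with numeric station IDs; the IDs, the `to_bytes` helper, and the padding width of 5 are assumptions chosen for illustration.

```python
def to_bytes(value):
    """Encode a value as UTF-8 text bytes (one simple encoding choice)."""
    return str(value).encode("utf-8")


station_ids = [9, 10, 100, 2]

# Plain text encoding: byte order puts b"10" and b"100" before b"2".
naive_keys = sorted(to_bytes(n) for n in station_ids)
print(naive_keys)   # [b'10', b'100', b'2', b'9']

# Zero-padding to a fixed width makes byte order match numeric order.
padded_keys = sorted(to_bytes(f"{n:05d}") for n in station_ids)
print(padded_keys)  # [b'00002', b'00009', b'00010', b'00100']
```

Since there are no data types, choosing an encoding whose byte order matches the order you want to scan in is the schema designer's job.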

  10. HBase Data Model
      - (Table, Row, Family:Column, Timestamp) → Value
      - SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
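
The nested-map view of the data model can be sketched as a toy in-memory table. This only illustrates the (Row, Family:Column, Timestamp) → Value mapping; it is not how HBase stores data, and the `Table` class and its methods are invented for this example. Versions are kept as (timestamp, value) pairs, newest first.

```python
import time


class Table:
    """Toy model: row key -> "family:column" -> [(timestamp, value), ...]."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = time.time_ns() if timestamp is None else timestamp
        # A column only comes into existence when something is written to it.
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self.rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None  # missing cells cost nothing to "store"

    def scan(self):
        """Yield (row, columns) in sorted row-key order."""
        for row in sorted(self.rows):
            yield row, dict(sorted(self.rows[row].items()))


weather = Table()
weather.put(b"NY", b"temperature:current", b"41", timestamp=1)
weather.put(b"LA", b"temperature:current", b"68", timestamp=1)
weather.put(b"LA", b"temperature:current", b"71", timestamp=2)

print(weather.get(b"LA", b"temperature:current"))               # b'71'
print(weather.get(b"LA", b"temperature:current", timestamp=1))  # b'68'
print([row for row, _ in weather.scan()])                       # [b'LA', b'NY']
```

Writing a new value does not overwrite the old one; it adds a version, which is what makes the timestamp a third dimension.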
