
Software Systems Development


Presentation Transcript


  1. Software Systems Development MAP-REDUCE, Hadoop, HBase

  2. The problem • Batch (offline) processing of huge data sets using commodity hardware • Linear scalability • Need infrastructure that handles all the mechanics, allowing the developer to focus on the processing logic/algorithms

  3. Data Sets • The New York Stock Exchange: 1 Terabyte of data per day • Facebook: 100 billion photos, 1 Petabyte (1000 Terabytes) • Internet Archive: 2 Petabytes of data, growing by 20 Terabytes per month • Data this large can't fit on a single node; a distributed file system is needed to hold it

  4. Batch processing • Data is written or appended once, then read many times • Example: analyze log files for the most frequent URL • Each data entry is self-contained • At each step, each data entry can be treated individually • After aggregation, each aggregated data set can be treated individually

  5. Grid Computing • Cluster of processing nodes attached to shared storage through fiber (typically a Storage Area Network) • Works well for compute-intensive tasks, but struggles with huge data sets as the network becomes a bottleneck • Programming paradigm: low-level Message Passing Interface (MPI)

  6. Hadoop • Open-source implementation of 2 key ideas • HDFS: Hadoop Distributed File System • Map-Reduce: programming model • Built on the design of Google's infrastructure (GFS and Map-Reduce papers published in 2003/2004) • Java/Python/C interfaces, several projects built on top of it

  7. Approach • A limited but simple model that fits a broad range of applications • Communication, redundancy, and scheduling are handled by the infrastructure • Move computation to the data instead of moving data to the computation

  8. Who is using Hadoop?

  9. Distributed File System (HDFS) • Files are split into large blocks (128 MB or 64 MB) • Compare with a typical FS block of 512 bytes • Blocks are replicated among Data Nodes (DN) • 3 copies by default • The Name Node (NN) keeps track of files and their blocks • Single master node • Stream-based I/O • Sequential access

  10. HDFS: File Read
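
The original slide showed the read path as a diagram. As a minimal client-side sketch (the Name Node address and file path are illustrative): the client asks the Name Node for block locations, then streams each block sequentially from a nearby Data Node.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:8020/logs/access.log"; // illustrative path
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      // open() contacts the Name Node for block locations; the data itself
      // is streamed directly from Data Nodes, sequentially.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}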

  11. HDFS: File Write
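
The write path, sketched the same way (paths again illustrative): create() asks the Name Node to allocate blocks, and the client pushes bytes through a pipeline of Data Nodes, giving the 3 replicas mentioned on slide 9.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    String dst = "hdfs://namenode:8020/data/stocks.csv"; // illustrative path
    InputStream in = new BufferedInputStream(new FileInputStream("stocks.csv"));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    // The Name Node allocates blocks; bytes flow through a replication
    // pipeline of Data Nodes (3 copies by default).
    OutputStream out = fs.create(new Path(dst));
    IOUtils.copyBytes(in, out, 4096, true); // true: close both streams when done
  }
}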

  12. HDFS: Data Node Distance

  13. Map Reduce • A programming model • Decomposes a processing job into Map and Reduce stages • The developer provides code for the Map and Reduce functions, configures the job, and lets Hadoop handle the rest

  14. Map-Reduce Model

  15. MAP function • Maps each data entry into a <key, value> pair • Examples • Map each log file entry into <URL, 1> • Map each day's stock trading record into <STOCK, price>
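
A minimal sketch of the log-file example as a Hadoop Mapper. The assumption that the URL is the first whitespace-separated field is ours, not the slide's.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <URL, 1> for every log line; assumes the URL is the first
// whitespace-separated field of the line.
public class UrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length > 0 && !fields[0].isEmpty()) {
      url.set(fields[0]);
      context.write(url, ONE); // <URL, 1>
    }
  }
}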

  16. Hadoop: Shuffle/Merge phase • Hadoop merges (shuffles) the output of the MAP stage into • <key, value1, value2, value3> • Examples • <URL, 1, 1, 1, 1, 1, 1> • <STOCK, price on day 1, price on day 2, …>

  17. Reduce function • Reduces the entries produced by Hadoop's merge phase into a <key, value> pair • Examples • Reduce <URL, 1, 1, 1> into <URL, 3> (count) • Reduce <STOCK, 3, 2, 10> into <STOCK, 10> (maximum)
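
The matching Reducer for the URL example, summing the 1s that the shuffle grouped under each key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <URL, [1, 1, 1, ...]> from the shuffle and emits <URL, count>.
public class UrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text url, Iterable<IntWritable> ones, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable one : ones) {
      count += one.get();
    }
    context.write(url, new IntWritable(count)); // e.g. <URL, 3>
  }
}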

  18. Map-Reduce Flow

  19. Hadoop Infrastructure • Replicates/distributes data among the nodes • Input • Output • Map/shuffle output • Schedules processing • Partitions data • Assigns processing nodes (PN) • Moves code to the PN (e.g., sends the Map/Reduce code) • Manages failures (block CRC checks, reruns Map/Reduce tasks if necessary)

  20. Example: Trading Data Processing • Input: • Historical stock data • Records are CSV (comma-separated values) text files • Each line: stock_symbol, low_price, high_price • 1987-2009 data for all stocks, one record per stock per day • Output: • Maximum intraday delta (high_price - low_price) for each stock

  21. Map Function: Part I

  22. Map Function: Part II
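
The original slides 21-22 showed the Map function as images that did not survive the transcript. Below is a plausible reconstruction under the CSV layout given on slide 20; the class and variable names are our own, not the author's.

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Parses "stock_symbol,low_price,high_price" lines and emits
// <stock_symbol, high - low>, the price delta for that day.
public class MaxDeltaMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
  private final Text symbol = new Text();
  private final FloatWritable delta = new FloatWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split(",");
    if (fields.length < 3) {
      return; // skip malformed records
    }
    try {
      float low = Float.parseFloat(fields[1]);
      float high = Float.parseFloat(fields[2]);
      symbol.set(fields[0]);
      delta.set(high - low);
      context.write(symbol, delta);
    } catch (NumberFormatException e) {
      // skip header lines or records with bad numbers
    }
  }
}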

  23. Reduce Function
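
A matching reconstruction of the Reduce function, keeping only the maximum delta per stock:

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <stock_symbol, [delta on day 1, delta on day 2, ...]> from the
// shuffle and emits the maximum delta for that stock.
public class MaxDeltaReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
  @Override
  protected void reduce(Text symbol, Iterable<FloatWritable> deltas, Context context)
      throws IOException, InterruptedException {
    float max = Float.NEGATIVE_INFINITY;
    for (FloatWritable d : deltas) {
      max = Math.max(max, d.get());
    }
    context.write(symbol, new FloatWritable(max));
  }
}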

  24. Running the Job : Part I

  25. Running the Job: Part II
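
Slides 24-25 showed the driver code as images. A sketch using the org.apache.hadoop.mapreduce API (the original may have used the older JobConf-based API); input and output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the job and hands it to Hadoop, which handles the rest:
// splitting input, scheduling, shuffling, and failure recovery.
public class MaxDeltaJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max intraday delta");
    job.setJarByClass(MaxDeltaJob.class);
    job.setMapperClass(MaxDeltaMapper.class);
    job.setReducerClass(MaxDeltaReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(FloatWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/stocks
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}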

  26. Inside Hadoop

  27. Datastore: HBASE • Distributed column-oriented database on top of HDFS • Modeled after Google's BigTable data store • Random reads/writes on top of the sequential, stream-oriented HDFS • Billions of rows * millions of columns * thousands of versions
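
A minimal sketch of a random write and read by row key, using the current HBase client API (which postdates this deck); the table, column family, and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Random write and read by row key on top of HDFS.
public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("stocks"))) {
      // Random write: one cell in column family "price".
      Put put = new Put(Bytes.toBytes("GOOG-2009-01-02"));
      put.addColumn(Bytes.toBytes("price"), Bytes.toBytes("high"), Bytes.toBytes("321.82"));
      table.put(put);

      // Random read by row key.
      Result result = table.get(new Get(Bytes.toBytes("GOOG-2009-01-02")));
      byte[] high = result.getValue(Bytes.toBytes("price"), Bytes.toBytes("high"));
      System.out.println(Bytes.toString(high));
    }
  }
}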

  28. HBASE: Logical View

  29. Physical View

  30. HBASE: Region Servers • Tables are split into horizontal regions • Each region comprises a subset of rows • HDFS • NameNode, DataNode • MapReduce • JobTracker, TaskTracker • HBASE • Master Server, Region Server

  31. HBASE Architecture

  32. HBASE vs RDBMS • HBase tables are similar to RDBMS tables, with a few differences • Rows are sorted by row key • Only cells are versioned • Columns can be added on the fly by the client, as long as the column family they belong to preexists
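
To illustrate the last point: writing to a qualifier that has never been seen before creates the column on the fly, with no schema change, as long as the column family exists. A sketch, with the same illustrative table and family as above:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// No ALTER TABLE needed: a Put with a brand-new qualifier inside an
// existing column family simply creates that column for this row.
public class NewColumnOnTheFly {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("stocks"))) {
      Put put = new Put(Bytes.toBytes("GOOG-2009-01-02"));
      // "adjusted_close" never appeared before; only the family "price"
      // must preexist in the table schema.
      put.addColumn(Bytes.toBytes("price"), Bytes.toBytes("adjusted_close"),
                    Bytes.toBytes("320.51"));
      table.put(put);
    }
  }
}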
