
Programming on Hadoop



  1. Programming on Hadoop

  2. Outline • Different perspectives of Cloud Computing • The Anatomy of a Data Center • The Taxonomy of Computation • Computation-intensive • Data-intensive • The Hadoop Ecosystem • Limitations of Hadoop

  3. Cloud Computing • From the user's perspective • A service that enables users to run their applications on the Internet • From the service provider's perspective • A resource pool used to deliver cloud services through the Internet • The resource pool is hosted in the provider's data center • What does the data center (DC) look like?

  4. An Example of a DC • Google's data center in 2009. From Jeffrey Dean's talk at WSDM 2009

  5. A Closer Look at the DC – Overview. Figure is copied from [4]

  6. A Closer Look at the DC – Cooling. Figure is copied from [4]

  7. A Closer Look at the DC – Computing Resources. Figure is copied from [4]

  8. The Commodity Server • A commodity server is NOT a low-end server • Standard components vs. proprietary hardware • Common configuration in 2008 • Processor: two quad-core Intel Xeon 2.0 GHz CPUs • Memory: 8 GB ECC RAM • Storage: 4 × 1 TB SATA disks • Network: Gigabit Ethernet

  9. Approaches to Deliver Service • The dedicated approach • Serve each customer with dedicated computing resources • The shared approach (multi-tenant architecture) • Serve customers from a shared resource pool

  10. The Dedicated Approach • Pros: • Easy to implement • Performance & security guarantees • Cons: • Painful for customers to scale their applications • Poor resource utilization

  11. The Shared Approach • Pros: • No pain for customers to scale their applications • Better resource utilization • Better performance in some cases • Lower service cost per customer • Cons: • Requires a complicated software layer • Performance isolation/tuning may be complicated • To achieve better performance, customers should be somewhat familiar with the software/hardware architecture

  12. The Hadoop Ecosystem • A software infrastructure that delivers a DC as a service through the shared-resources approach • Customers can use Hadoop to develop/deploy data-intensive applications on the cloud • We focus on the Hadoop core in this lecture • Hadoop == Hadoop core hereafter • The stack: Extensions – HBase, Chukwa, Hive, Pig; Core – the Hadoop Distributed File System (HDFS) and MapReduce

  13. The Taxonomy of Computations • Computation-intensive tasks • Small data (in memory), many CPU cycles per data item processed • Example: machine learning • Data-intensive tasks • Large-volume data (on disk), relatively few CPU cycles per data item processed • Example: DBMS

  14. The Data-intensive Tasks • Streaming-oriented data access • Read/write a large portion of the dataset in a streaming manner (sequentially) • Characteristics: • No seeks, high throughput • Optimized for a high data-transfer rate • Random-oriented data access • Read/write a small number of data items randomly located in the dataset • Characteristics: • Seek-oriented • Optimized for low-latency access to each data item

  15. What Hadoop Does & Doesn't • Hadoop can perform • High-throughput streaming data access • Limited low-latency random data access through HBase • Large-scale analysis through MapReduce • Hadoop cannot do • Transactions • Certain time-critical applications

  16. Hadoop Quick Start • Very simple • Download the Hadoop package from Apache • http://hadoop.apache.org/ • Unpack into a folder • Do some configuration in hadoop-site.xml • fs.default.name → selects the default file system (e.g., HDFS) • mapred.job.tracker → points to the JobTracker of the MapReduce cluster • Start • Format the file system only once (in a fresh installation) • bin/hadoop namenode -format • Launch the HDFS & MapReduce cluster • bin/start-all.sh
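  For concreteness, a minimal hadoop-site.xml for a single-node setup might look like the sketch below (the hdfs://localhost:9000 and localhost:9001 addresses are illustrative placeholders, not values from the slides):

    <?xml version="1.0"?>
    <configuration>
      <!-- Default file system: point clients at the HDFS NameNode -->
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <!-- JobTracker of the MapReduce cluster -->
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>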

  17. The Launched HDFS Cluster

  18. The Launched MapReduce Cluster

  19. The Hadoop Distributed Filesystem • Wraps the DC as a resource pool and provides a set of APIs that let users read/write data from/into the DC sequentially

  20. A Closer Look at the API • Aha, writing "hello world!" • Run with: bin/hadoop jar test.jar

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Main {
      public static void main(String[] args) throws Exception {
        // Obtain the file system configured in hadoop-site.xml (e.g., HDFS)
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream fsOut = fs.create(new Path("testFile"));
        fsOut.writeBytes("Hello Hadoop");
        fsOut.close();
      }
    }

  21. A Closer Look at the API (cont.) • Reading data from HDFS

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Main {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Open the file written in the previous example and read it back
        FSDataInputStream fsIn = fs.open(new Path("testFile"));
        byte[] buf = new byte[1024];
        int len = fsIn.read(buf);
        System.out.println(new String(buf, 0, len));
        fsIn.close();
      }
    }

  22. Inside HDFS • A single-NameNode, multiple-DataNode architecture (see [5] for reference) • Each file is chopped into a set of fixed-size blocks, and the blocks are stored across the available DataNodes • NameNode → hosts all file system metadata (file-to-block mapping, block locations, etc.) in memory • DataNode → hosts the file data for reading/writing
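  The block layout is observable from the client side. As a small sketch (reusing the testFile from the earlier examples; the ShowBlocks class name is illustrative), a client can ask the NameNode where each block of a file is stored:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("testFile"));
        // Ask the NameNode for the location of every block of the file
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println(loc);  // offset, length, and hosting DataNodes
        }
      }
    }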

  23. Inside HDFS – Architecture • Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html

  24. Inside HDFS – Writing Data. Figure is copied from [2]

  25. Inside HDFS – Reading Data • What is the problem with reading/writing? Figure is copied from [2]

  26. The HDFS Cons • A single reader/writer • Reads or writes a single block at a time • Touches only ONE DataNode • Data transfer rate == disk bandwidth of a SINGLE node • Too slow for a large file • Suppose disk bandwidth == 100 MB/sec • Reading/writing a 1 TB file takes ~3 hrs (1 TB ÷ 100 MB/s ≈ 10,000 s ≈ 2.8 hours) • How to fix it?

  27. Multiple Readers/Writers • Read/write a large data set using multiple processes • Each process reads/writes a subset of the whole data set and materializes its subset as a file • The whole data set becomes a collection of files • Typically, the file collection is stored in a directory named after the data set

  28. Multiple Readers/Writers (cont.) • Question – what is the proper number of readers and writers? • In the figure: data set A is stored in /root/datasetA; sub-set i is handled by process i, which materializes it as part-000i (part-0001 through part-0004)
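  For illustration only, a hand-rolled version of this pattern might look like the sketch below: four writer threads, each materializing its subset as a part file under /root/datasetA (the content written is a placeholder, and there is no coordination or failure handling, which is precisely the pain described on the next slide):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelWriters {
      public static void main(String[] args) throws Exception {
        final FileSystem fs = FileSystem.get(new Configuration());
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
          final int id = i + 1;
          writers[i] = new Thread(() -> {
            try {
              // Each writer materializes its subset as one part file
              FSDataOutputStream out =
                  fs.create(new Path(String.format("/root/datasetA/part-%04d", id)));
              out.writeBytes("subset " + id + "\n");  // placeholder content
              out.close();
            } catch (Exception e) {
              throw new RuntimeException(e);  // no real failure handling here
            }
          });
          writers[i].start();
        }
        for (Thread t : writers) t.join();
      }
    }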

  29. Multiple Readers/Writers (cont.) • Reading/writing a large data set with multiple readers/writers and materializing it as a collection of files is a common pattern in HDFS • But it is too painful! • Invoking multiple readers/writers across the cluster • Coordinating those readers/writers • Handling machine failures • … • Rescue: MapReduce

  30. The MapReduce System • MapReduce is a programming model and its associated implementation for processing and generating large data sets [1] • The computation performs key/value-oriented operations and consists of two functions • Map: transforms an input key/value pair into a set of intermediate key/value pairs • Reduce: merges the intermediate values associated with the same key and produces another key/value pair

  31. The MapReduce Programming Model • Map: (k0, v0) -> [(k1, v1)] • Reduce: (k1, [v1]) -> (k2, v2)
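  As a concrete trace of this model, take the classic word-count example from [1] on the single input pair (doc1, "to be or not to be"): Map emits [(to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)]; the framework groups the pairs by key, so Reduce is invoked with (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1]) and emits (to, 2), (be, 2), (or, 1), (not, 1).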

  32. The System Architecture • One JobTracker for job submission • Multiple TaskTrackers for invoking the mappers and reducers. Figure is from Google Images

  33. The Mapper Interface • Mapper/Reducer is defined as a generic Java interface in Hadoop

    public interface Mapper<K1, V1, K2, V2> {
      void map(K1 key, V1 value,
               OutputCollector<K2, V2> output, Reporter reporter);
    }

    public interface Reducer<K2, V2, K3, V3> {
      void reduce(K2 key, Iterator<V2> values,
                  OutputCollector<K3, V3> output, Reporter reporter);
    }
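  To make these interfaces concrete, here is a sketch of the word-count example from [1] written against the classic org.apache.hadoop.mapred API (the real interfaces also declare throws IOException and implementations conventionally extend the MapReduceBase helper; the class names are illustrative and the job driver is omitted):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Map: (offset, line) -> (word, 1) for every word in the line
    class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }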

  34. The Data Types of MapReduce • MapReduce makes no assumptions about the data types • It does not know what constitutes a key/value pair • Users must specify the appropriate input/output data types • The runtime data-interpreting pattern • Achieved by implementing two Hadoop interfaces • RecordReader<K, V> for parsing input key/value pairs • RecordWriter<K, V> for serializing output key/value pairs

  35. The RecordReader/Writer Interface

    interface RecordReader<K, V> {
      // Other functions omitted
      boolean next(K key, V value);
    }

    interface RecordWriter<K, V> {
      // Other functions omitted
      void write(K key, V value);
    }
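  For intuition, a minimal line-oriented reader's next() might look like the sketch below (a deliberate simplification: Hadoop's real LineRecordReader also implements the omitted interface methods and handles split boundaries and compression; SimpleLineReader is a made-up name):

    import java.io.BufferedReader;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Sketch of the next(key, value) contract for a line-oriented reader:
    // each record is (byte offset of the line, contents of the line)
    class SimpleLineReader {
      private final BufferedReader in;
      private long pos = 0;

      SimpleLineReader(BufferedReader in) {
        this.in = in;
      }

      boolean next(LongWritable key, Text value) throws IOException {
        String line = in.readLine();
        if (line == null) {
          return false;            // no more records in this split
        }
        key.set(pos);              // key = offset where the line starts
        value.set(line);           // value = the line itself
        pos += line.length() + 1;  // advance past the newline
        return true;
      }
    }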

  36. The Overall Picture • The data set is split into many parts (InputSplits) • Each part is processed by one mapper • The intermediate results are shuffled/merged and processed by the reducers • Each reducer writes its results as a file part-000n • Pipeline: InputSplit-n → RecordReader → map → shuffle/merge → reduce → RecordWriter → part-000n

  37. Performance Tuning • Many factors … • At the architecture level • Record parsing, map-side sorting, …, see [3] • Shuffling: see the many research papers at VLDB and SIGMOD • Parameter tuning • Memory buffers for mappers/reducers • The rule of thumb for concurrent mappers and reducers • Map: one map task per file block • Reduce: a small multiple of the available TaskTrackers
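  On the parameter side, a sketch of what such knobs look like in the classic JobConf API (io.sort.mb controls the map-side sort buffer in pre-2.x Hadoop; the concrete values here are placeholders, not recommendations from the slides):

    import org.apache.hadoop.mapred.JobConf;

    public class TuningExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Map-side sort buffer in MB; a larger buffer means fewer spills to disk
        conf.setInt("io.sort.mb", 200);
        // Rule of thumb from the slide: a small multiple of available TaskTrackers
        conf.setNumReduceTasks(8);
      }
    }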

  38. Limitations of Hadoop • HDFS • No reliable appends yet • Files are immutable • MapReduce • Basically row-oriented • Support for complicated computations is weak

  39. References • [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters • [2] Tom White. Hadoop: The Definitive Guide • [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study • [4] Luiz André Barroso, Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines • [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System

  40. Thank You!
