
Hadoop



Presentation Transcript


  1. Hadoop Ida Mele

  2. Parallel programming • Parallel programming is used to improve performance and efficiency • In a parallel program, the processing is broken up into parts that run concurrently on different machines • Super-computers can be replaced by large clusters of CPUs; these CPUs may be on the same machine or spread across a network of computers

  3. Parallel programming [figure: a super computer (web graphic, Janet E. Ward, 2000) compared with a cluster of desktops]

  4. Parallel programming • We have to identify the sets of tasks that can run concurrently • Parallel programming is suitable for large datasets that we can split into equal-sized portions • The number of tasks we can perform concurrently depends on the size of the original dataset and on how many CPUs (nodes) we have

  5. Parallel programming • One node is the master: it divides the problem into sub-problems and assigns them to the other nodes, also called workers • Each worker node processes its small problem and returns the partial result to the master • Once the master has received the partial results from all the workers, it combines them to compute the final result

  6. Parallel programming: examples • A big array can be split into small sub-arrays • A large document collection can be split into documents, and each document can in turn be broken up into paragraphs, lines, … • Note: not all problems can be parallelized, for example when each value to compute depends on the previous one

  7. MapReduce • MapReduce was developed by Google for processing large amounts of raw data • It is used for distributed computing on clusters of computers • MapReduce is an abstraction that allows programmers to implement distributed programs • Distributed programming is complex, so MapReduce hides the issues related to parallelization, data distribution, load balancing, and fault tolerance

  8. MapReduce • MapReduce is inspired by the map and reduce combinators of Lisp • Map: (key1, val1) → (key2, val2). The map function takes as input <key, value> pairs and produces a set of zero or more intermediate <key, value> pairs • The framework groups together all the intermediate values associated with the same intermediate key and passes them to the reducer • Reduce: (key2, [val2]) → [val3]. The reduce function aggregates the values of a key by using a binary operation, such as the sum (see the small trace below)
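A minimal hand-worked trace may help; the toy intermediate pairs below are illustrative and not part of the original slides:

  map output (intermediate pairs):   ("a", 1), ("b", 2), ("a", 3)
  grouped by intermediate key:       ("a", [1, 3]), ("b", [2])
  reduce with the sum operation:     ("a", 4), ("b", 2)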

  9. MapReduce: dataflow • Input reader: it reads the data from stable storage and divides it into portions (splits or shards), then it assigns one split to each map function • Map function: it takes a series of <key, value> pairs and processes each of them to create zero or more intermediate <key, value> pairs • Partition function: between the map and the reduce stages, data is shuffled: parallel-sorted and exchanged among nodes. Data is moved from the node of the map to the correct shard in which it will be reduced. To do that, the partition function receives the key and the number of reducers, and it returns the index of the desired reducer → load balancing (a sketch is given below)
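A minimal sketch of such a partition function, assuming a hash-based scheme in the spirit of Hadoop's default HashPartitioner (the method name and key type here are illustrative):

  // Returns the index of the reducer that will receive this key.
  // Masking the sign bit keeps the result non-negative for any hashCode() value.
  int getPartition(Text key, int numReducers) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
  }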

  10. MapReduce: dataflow • Comparison function: the input of each reducer is sorted using the comparison function • Reduce function: it takes the sorted intermediate pairs and aggregates the values by key, in order to produce a single output for each key • Output writer: it writes the output of the reducer to stable storage, which is usually the distributed file system
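As an illustration, a comparison function for textual keys could simply use lexicographic order; in Hadoop the key type's own compareTo is used by default, so this sketch is only indicative:

  // Orders two keys; the resulting order is the order in which pairs reach the reducer.
  int compare(Text a, Text b) {
      return a.toString().compareTo(b.toString());  // lexicographic comparison
  }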

  11. Example • Consider the following problem: given a large collection of documents, we want to compute the number of occurrences of each term in the documents • How can we solve it in parallel?

  12. Example • Assume we have a set of workers • Divide the collection of documents among them, for example one document for each worker • Each worker returns the count of a given word in a document • Sum up the counts from all the documents to get the overall number of occurrences of a word in the collection of documents

  13. Example Pseudo-code

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

  14. Example

  15. Example

  16. Execution overview

  17. Execution overview • The map invocations are distributed across multiple CPUs by automatically partitioning the input data into M splits (or shards). The shards can be processed on different CPUs concurrently • One copy of the program is the master and it assigns the work to the other copies (workers). In particular, it has M map tasks and R reduce tasks to assign, so the master picks the idle workers and assigns each one a map or a reduce task • The worker that has a map task reads the content of the input shard, applies the user-defined operations in parallel, and produces the intermediate <key, value> pairs

  18. Execution overview • The intermediate pairs are periodically written to the local disk. Then, they are partitioned into R regions by the partitioning function. The locations of these pairs are passed to the master, which forwards them to the reduce workers • The reduce worker reads the intermediate pairs and sorts them by the intermediate key • The reduce worker iterates over the sorted pairs and applies the reduce function, which aggregates the values for each key. Then, the output of the reducer can be appended to the output file

  19. Reliability • MapReduce provides high reliability: to detect failures, the master pings every worker periodically. If a worker is silent for longer than a given time interval, the master marks it as failed and reassigns the failed worker's work to another node • When a failure occurs, completed map tasks have to be re-executed, since their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, because their output is stored in the global file system • Some operations are atomic to ensure no conflicts

  20. Hadoop • Apache Hadoop is an open-source framework for reliable, scalable, and distributed computing. It implements the computational paradigm named MapReduce • Useful links: • http://hadoop.apache.org/ • http://hadoop.apache.org/docs/stable/ • http://hadoop.apache.org/releases.html#Download

  21. Hadoop • The project includes several modules: • Hadoop Common: the common utilities that support the other Hadoop modules • Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data • Hadoop YARN: a framework for job scheduling and cluster resource management • Hadoop MapReduce: a YARN-based system for parallel processing of large data sets

  22. Install Hadoop • Download the release 1.1.1 of Hadoop, available at: http://www.dis.uniroma1.it/~mele/WebIR.html or one of the latest releases of Hadoop, available at: http://hadoop.apache.org/releases.html#Download • The directory conf contains all configuration files • Set JAVA_HOME by editing the file conf/hadoop-env.sh: export JAVA_HOME= %JavaPath
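For instance, on a Linux machine the line could look like the following; the JDK path is only an assumed example, use the path of your own Java installation in place of %JavaPath:

  # conf/hadoop-env.sh
  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # example path, adjust to your system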

  23. Install Hadoop • Optional: (if needed) we can specify the classpath • Optional: we can specify the maximum amount of heap to use. The default is 1000 MB, but we can increase it by editing the file conf/hadoop-env.sh: export HADOOP_HEAPSIZE=2000

  24. Install Hadoop • Optional: we can specify the directory for temporary output • Edit the file conf/core-site.xml adding the following lines:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-tmp-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>

  25. Example: WordCounter • Download WordCounter.jar and text.txt, available at: http://www.dis.uniroma1.it/~mele/WebIR.html • Put WordCounter.jar in the Hadoop directory • In the Hadoop directory, create the sub-directory einput and copy the input file text.txt into it • Run the word counter by issuing the following command:

  bin/hadoop jar WordCounter.jar mapred.WordCount einput/ eoutput/

  Note: make sure that the output directory doesn't already exist

  26. Example: WordCounter • The Map class:
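A minimal sketch of such a Map class, modeled on the classic Hadoop 1.x WordCount mapper (old mapred API); the actual class shipped in WordCounter.jar may differ:

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // For each word of the input line, emit the intermediate pair <word, 1>
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }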

  27. Example: WordCounter • The Reduce class:
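Similarly, a minimal sketch of the Reduce class, again modeled on the classic Hadoop 1.x WordCount reducer; the actual class in WordCounter.jar may differ:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sum the counts emitted for each word and emit the pair <word, total count>
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }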

  28. Example: WordCounter more eoutput/part-00000

  '1500,'    1
  'Come,     1
  'Go        1
  'Hareton   1
  'Here      1
  'I         1
  'If        1
  'Joseph!'  1
  'Mr.       2
  'No        1
  'No,       1
  'Not       1
  'She's     1

  (left column: words, right column: number of occurrences)

  29. Example: WordCounter sort -k2 -n -r eoutput/part-00000 | more

  the    93
  of     73
  and    64
  a      60
  I      57
  to     47
  my     27
  in     23
  his    19
  with   18
  have   16
  that   15

  Most frequent words: word frequencies sorted in decreasing order
