
Hadoop Tutorial




  1. Hadoop Tutorial. Jian Wang. Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das, Yahoo! Inc. Bangalore & Apache Software Foundation

  2. Why should we use Hadoop? • Need to process 10TB datasets • On 1 node: scanning @ 50MB/s = 2.3 days • On a 1000-node cluster: scanning @ 50MB/s = 3.3 min • Need an efficient, reliable, and usable framework: • Google File System (GFS) paper • Google's MapReduce paper
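
The arithmetic behind those numbers: 10 TB at 50 MB/s is 200,000 seconds, i.e. roughly 2.3 days on a single node; divide the scan across 1,000 nodes and it drops to about 200 seconds, roughly 3.3 minutes.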

  3. HDFS - Hadoop Distributed FS • Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system • Files are divided into large blocks (64 MB by default) and distributed across the cluster • Blocks are replicated to handle hardware failure; the default replication factor is 3 (configurable) • HDFS cannot be directly mounted by an existing operating system • Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}; e.g. if your id is jwang30, your “home dir” is /user/jwang30
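
Both the block size and the replication factor are cluster configuration settings; a minimal sketch of the relevant properties as they would appear in conf/hadoop-site.xml for the 0.19 series (the values shown are the defaults):

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB, in bytes -->
  </property>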

  4. Hadoop Architecture • Master-Slave architecture • The master node (irkm-1) runs the HDFS Namenode and the MapReduce Jobtracker • The Jobtracker accepts MR jobs submitted by users, assigns Map and Reduce tasks to Tasktrackers, monitors task and tasktracker status, and re-executes tasks upon failure • The slave nodes (irkm-1 to irkm-6) run HDFS Datanodes and MapReduce Tasktrackers • Tasktrackers run Map and Reduce tasks upon instruction from the Jobtracker, and the slaves manage storage and transmission of intermediate output

  5. Hadoop Paths • Hadoop is locally “installed” on each machine • Version 0.19.2 • The installation location is /home/tmp/hadoop • Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)

  6. Format Namenode • The first time you use HDFS, you need to format the namenode: • log in to irkm-1 • cd /home/tmp/hadoop • bin/hadoop namenode -format • Most commands follow the same pattern: bin/hadoop <command> [options] • Typing bin/hadoop by itself prints a list of all possible commands (including undocumented ones)
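
A few instances of that pattern, all run from /home/tmp/hadoop (the options shown are illustrative):

  bin/hadoop namenode -format   # format the namenode (first use only)
  bin/hadoop dfs -ls            # file system operations
  bin/hadoop jar <jobjar> ...   # run a MapReduce job packaged as a jar
  bin/hadoop job -list          # list currently running jobs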

  7. Using HDFS • hadoop dfs • [-ls <path>] • [-du <path>] • [-cp <src> <dst>] • [-rm <path>] • [-put <localsrc> <dst>] • [-copyFromLocal <localsrc> <dst>] • [-moveFromLocal <localsrc> <dst>] • [-get [-crc] <src> <localdst>] • [-cat <src>] • [-copyToLocal [-crc] <src> <localdst>] • [-moveToLocal [-crc] <src> <localdst>] • [-mkdir <path>] • [-touchz <path>] • [-test -[ezd] <path>] • [-stat [format] <path>] • [-help [cmd]]
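
A typical session using a few of these, assuming a local file named data.txt in your working directory (the file name is hypothetical):

  bin/hadoop dfs -mkdir input                  # create a dir under /user/<your id>
  bin/hadoop dfs -put data.txt input/data.txt  # upload the local file
  bin/hadoop dfs -ls input                     # confirm it arrived
  bin/hadoop dfs -cat input/data.txt           # print its contents
  bin/hadoop dfs -rm input/data.txt            # remove it again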

  8. Starting / Stopping Hadoop • bin/start-all.sh – starts the master node and all slave nodes • bin/stop-all.sh – stops the master node and all slave nodes • Run jps to check which daemons are running
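
On this cluster's layout, jps on irkm-1 should show the master daemons (NameNode, JobTracker, SecondaryNameNode) alongside the slave daemons the node also runs, while the other nodes show only DataNode and TaskTracker. A sketch of what to expect on the master (PIDs will differ):

  $ jps
  4721 NameNode
  4892 JobTracker
  5010 SecondaryNameNode
  5123 DataNode
  5201 TaskTracker
  5388 Jps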

  9. Copying Local files to HDFS • Log in to irkm-1 • rm -fr /tmp/hadoop/$userID • cd /home/tmp/hadoop • bin/hadoop dfs -ls • bin/hadoop dfs -copyFromLocal example example • After that: • bin/hadoop dfs -ls

  10. Running jobs on Hadoop

  11. Wordcount in Python • Mapper.py
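
The mapper script appeared only as an image in the original deck; below is a minimal sketch of a Hadoop Streaming wordcount mapper. Streaming pipes each input line to the script on stdin, and every tab-separated key/value pair written to stdout becomes a map output record.

  #!/usr/bin/env python
  # mapper.py - emit "<word><TAB>1" for every word on stdin.
  import sys

  for line in sys.stdin:
      # split on whitespace; each word becomes a key with count 1
      for word in line.strip().split():
          print('%s\t%s' % (word, 1))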

  12. Wordcount in Python • Reducer.py
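
Likewise a sketch of the reducer: streaming delivers the map output sorted by key, so all counts for a given word arrive on consecutive lines and can be summed in a single pass.

  #!/usr/bin/env python
  # reducer.py - sum the counts for each word, emit "<word><TAB><total>".
  import sys

  current_word = None
  current_count = 0

  for line in sys.stdin:
      word, count = line.strip().split('\t', 1)
      if word == current_word:
          current_count += int(count)
      else:
          # key changed: flush the previous word's total
          if current_word is not None:
              print('%s\t%d' % (current_word, current_count))
          current_word = word
          current_count = int(count)

  # flush the final word
  if current_word is not None:
      print('%s\t%d' % (current_word, current_count))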

  13. Execution code • bin/hadoop dfs -ls • bin/hadoop dfs -copyFromLocal example example • bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py -input example -output java-output • bin/hadoop dfs -cat java-output/part-00000 • bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
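
A note on the flags: each -file option ships the named script to the compute nodes with the job, -mapper and -reducer name the commands to run, and -input and -output are HDFS paths relative to /user/<your id>. The job will refuse to start if the output directory (java-output here) already exists, so remove it between runs.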

  14. Web interface • Hadoop job tracker • http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp • Hadoop task tracker • http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp • Hadoop dfs checker • http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp
