
Hadoop: Introduction, Installation and Configuration




  1. Hadoop: Introduction, Installation and Configuration 数据挖掘研究组 Data Mining Group @ Xiamen University

  2. Hadoop: a distributed, data-intensive programming framework, combining distributed storage with parallel computing

  3. Introduction to HDFS • Hadoop Distributed File System (HDFS) • An open-source implementation of GFS • has many similarities with existing distributed file systems • However, it also differs from them in important ways • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware • HDFS provides high-throughput access to application data and is suitable for applications that have large data sets

  4. How does it work?

  5. Design features • An important feature of the design: • data is never moved through the NameNode • Instead, all data transfer occurs directly between clients and DataNodes

  6. MapReduce? Let’s talk about it next time………

  7. “Running Hadoop”? What does it mean? “Running Hadoop” means running a set of daemons: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker

  8. Who works for whom? • NameNode • Secondary NameNode • JobTracker • DataNode • TaskTracker

  9. NameNode • Hadoop employs a master/slave architecture for both distributed storage and distributed computation. • The NameNode is the master of HDFS, directing the slave DataNode daemons to perform the low-level I/O tasks • The NameNode is the bookkeeper of HDFS • keeps track of how your files are broken down into file blocks • keeps track of the overall health of the distributed filesystem

  10. DataNode • reads and writes HDFS blocks for clients • communicates with other DataNodes to replicate its data blocks for redundancy

  11. NameNode and DataNode

  12. Secondary NameNode • The SNN is an assistant daemon for monitoring the state of the cluster’s HDFS • differs from the NameNode in that this process doesn’t receive or record any real-time changes to HDFS • communicates with the NameNode to take snapshots of the HDFS metadata • Recovery: if the NameNode fails, we reconfigure the cluster to use the SNN as the primary NameNode

  13. JobTracker • the liaison between your application and Hadoop • when you submit your code to the cluster, the JobTracker determines the execution plan • determines which files to process • assigns nodes to different tasks • monitors all tasks as they’re running • if a task fails, the JobTracker will relaunch it on a different node

  14. TaskTracker • Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns

  15. JobTracker and TaskTracker

  16. Installation and Configuration • Pseudo-distributed mode: all daemons run on one machine • Fully distributed mode What’s the difference?

  17. Installation for pseudo-distributed mode • Prerequisites • Ubuntu Linux • Hadoop 0.20.2 • Sun Java 6 $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner" $ sudo apt-get update $ sudo apt-get install sun-java6-jdk

  18. Configuring SSH • Hadoop requires SSH access to manage its nodes: remote machines, plus your local machine if you want to run Hadoop on it • $ sudo apt-get install openssh-server • $ ssh-keygen -t rsa -P "" • The second command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction, since you don’t want to enter the passphrase every time Hadoop interacts with its nodes.

  19. Configuring SSH • $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys • $ ssh localhost • The authenticity of host 'localhost (::1)' can't be established. RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'localhost' (RSA) to the list of known hosts. Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux Ubuntu 10.04 LTS [...snipp...]

  20. Extract the Hadoop package • $ cd /usr/local • $ sudo tar xzf hadoop-0.20.2.tar.gz • $ sudo chown -R dm:dm hadoop-0.20.2

  21. Update ~/.bashrc • $ vim ~/.bashrc • # Set Hadoop-related environment variables export HADOOP_HOME=/usr/local/hadoop • # Set JAVA_HOME export JAVA_HOME=/usr/lib/jvm/java-6-sun • # Add Hadoop bin/ directory to PATH export PATH=$PATH:$HADOOP_HOME/bin
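Putting the fragments above together, the ~/.bashrc additions can be sketched as a single testable snippet. The paths are the ones used in this guide (adjust HADOOP_HOME and JAVA_HOME to your own install locations); writing them to a temporary file first lets you verify the variables before touching ~/.bashrc:

```shell
# Sketch: Hadoop environment variables as a standalone fragment.
# Paths follow this guide; adjust for your own installation.
cat > /tmp/hadoop-env.sh <<'EOF'
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_HOME/bin
EOF

# Load it into the current shell and check the result.
. /tmp/hadoop-env.sh
echo "HADOOP_HOME is $HADOOP_HOME"
```

Once the variables look right, append the same three export lines to ~/.bashrc and re-login (or run `source ~/.bashrc`) so every new shell picks them up.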

  22. hadoop.tmp.dir • Create /app/hadoop/tmp. • Hadoop’s default configuration uses hadoop.tmp.dir as the base temporary directory, both for the local file system and for HDFS • $ sudo mkdir -p /app/hadoop/tmp • $ sudo chown dm:dm /app/hadoop/tmp

  23. Configuring hadoop-env.sh • Configure the JAVA_HOME environment variable for Hadoop • Change • # The java implementation to use. Required. # export JAVA_HOME=/usr/lib/j2sdk1.5-sun • to • # The java implementation to use. Required. export JAVA_HOME=/usr/lib/jvm/java-6-sun

  24. Key stage • Configure the key properties for the Hadoop daemons • These properties should be set in XML files, which are located in /usr/local/hadoop-0.20.2/conf: core-site.xml mapred-site.xml hdfs-site.xml

  25. Key properties for Hadoop daemons • fs.default.name (core-site.xml) • hadoop.tmp.dir (core-site.xml) • mapred.job.tracker (mapred-site.xml) • dfs.data.dir (hdfs-site.xml) • dfs.replication (hdfs-site.xml)

  26. Configuring core-site.xml • Add the following lines between the <configuration> ... </configuration> tags:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>

  27. Configuring mapred-site.xml
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>

  28. Configuring hdfs-site.xml
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>

  29. Formatting the NameNode • Formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster” • $ bin/hadoop namenode -format Installation done!

  30. Fully distributed mode

  31. Fully distributed mode

  32. Networking • Assign a static IP to every host • Update /etc/hosts on both machines with the following lines (for master AND slaves): 192.168.0.1 master 192.168.0.2 slave

  33. SSH access • Add hduser@master’s public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in that user’s $HOME/.ssh/authorized_keys)

  34. Masters vs. Slaves • One machine in the cluster is designated as the NameNode and another machine (possibly the same one) as the JobTracker. These are the actual “masters”. • The rest of the machines in the cluster act as both DataNode and TaskTracker. These we call “slaves”

  35. Masters vs. Slaves • conf/masters (master only): master • conf/slaves (master only): master slave
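The two files above are plain text, one hostname per line. As a sketch they can be created like this (HADOOP_CONF is a placeholder for your real conf directory, e.g. /usr/local/hadoop-0.20.2/conf; the hostnames follow the two-node topology in this guide):

```shell
# Sketch: create conf/masters and conf/slaves for the master node.
# conf/masters lists where the Secondary NameNode runs; conf/slaves
# lists the DataNode/TaskTracker hosts ("master" also acts as a slave here).
HADOOP_CONF=${HADOOP_CONF:-/tmp/hadoop-conf}
mkdir -p "$HADOOP_CONF"

printf 'master\n' > "$HADOOP_CONF/masters"
printf 'master\nslave\n' > "$HADOOP_CONF/slaves"

cat "$HADOOP_CONF/slaves"
```

A common point of confusion: despite the name, conf/masters controls where the Secondary NameNode daemon is started, not where the NameNode runs; the NameNode runs wherever you execute the start scripts.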

  36. conf/*-site.xml (all machines) How?
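A minimal sketch of the answer: the same three files as in pseudo-distributed mode, with localhost replaced by the master’s hostname and the replication factor raised. The values below are assumptions consistent with the earlier single-node examples; adjust the host, ports, and replication to your cluster.

```xml
<!-- core-site.xml (all machines): point HDFS at the master -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- mapred-site.xml (all machines): point MapReduce at the master -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- hdfs-site.xml (all machines): two DataNodes, so replication 2 -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```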

  37. Formatting the NameNode $ bin/hadoop namenode -format $ bin/start-all.sh $ jps $ bin/stop-all.sh

  38. Thank you! Any questions? 数据挖掘研究组 Data Mining Group @ Xiamen University
