Comprehensive Guide to Setting Up Hadoop on Multiple Operating Systems
This guide outlines the prerequisites, setup, and execution of Hadoop on several systems, including Mac OS, Linux, and Windows via Cygwin. It emphasizes the need for a stable environment (only Ubuntu is supported by the TA) and provides detailed instructions for both single-node and cluster setups. Users will learn to configure essential properties, generate SSH keys, format the distributed filesystem, and start the Hadoop daemons. It also covers the layout of the configuration files and links to further resources.
Presentation Transcript
Prerequisites
• System: Mac OS / Linux / Cygwin on Windows
• Notice:
  1. Only Ubuntu will be supported by the TA. You may try other environments as a challenge.
  2. Cygwin on Windows is not recommended because of its instability and unforeseen bugs.
• Java Runtime Environment: Java 1.6.x recommended.
• ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
Hadoop Setup
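The prerequisites above can be verified before unpacking anything. The script below is a minimal sketch (not part of Hadoop): it only checks that the `java`, `ssh`, and `sshd` commands are on the PATH; it does not check the Java version or whether sshd is actually running.

```shell
#!/bin/sh
# Minimal prerequisite check (a sketch, not part of Hadoop itself).
# Hadoop's cluster scripts log in to each node over ssh, and
# Hadoop 1.0.x expects Java 1.6 or later.
missing=""
command -v java >/dev/null 2>&1 || missing="$missing java"
command -v ssh  >/dev/null 2>&1 || missing="$missing ssh"
command -v sshd >/dev/null 2>&1 || missing="$missing sshd"

status="all prerequisites found"
if [ -n "$missing" ]; then
    status="missing:$missing"
fi
echo "$status"
```

On a correctly prepared Ubuntu machine this prints "all prerequisites found"; otherwise it names the missing commands.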
Single Node Setup (usually for debugging)
• Untar hadoop-*.**.*.tar.gz to your user path.
• About the version: the latest stable version, 1.0.1, is recommended.
• Edit the file conf/hadoop-env.sh to define at least JAVA_HOME as the root of your Java installation.
• Edit the following files to configure properties:

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
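The JAVA_HOME edit above amounts to one line in conf/hadoop-env.sh. The path below is an assumed example; substitute the root of your own Java installation (the directory containing bin/java).

```shell
# conf/hadoop-env.sh (fragment) -- a sketch.
# The path below is an assumed example; point JAVA_HOME at the root of
# *your* Java installation, i.e. the directory that contains bin/java.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
```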
Cluster Setup (the only acceptable setup for the homework)
• Same steps as the single-node setup.
• Set the dfs.name.dir and dfs.data.dir properties in hdfs-site.xml.
• Add the master's node name to conf/masters.
• Add all the slaves' node names to conf/slaves.
• Edit /etc/hosts on each node: add an IP and node-name entry for every node.
  Suppose your master's node name is ubuntu1 and its IP is 192.168.0.2; then add the line "192.168.0.2 ubuntu1" to the file.
• Copy the Hadoop folder to the same path on all nodes.
• Notice: JAVA_HOME may not be the same on every node.
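For the dfs.name.dir and dfs.data.dir step, a fragment along these lines goes inside the &lt;configuration&gt; element of conf/hdfs-site.xml. The directory paths are example assumptions; pick locations that exist and are writable by the Hadoop user on every node.

```xml
<!-- conf/hdfs-site.xml (fragment, a sketch): the paths below are example
     assumptions. dfs.name.dir holds the namenode's metadata and
     dfs.data.dir holds the datanodes' block storage. -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
</property>
```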
Execution
• Generate an ssh key pair with an empty passphrase, so no prompt appears at startup:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  $ ssh localhost
• Format a new distributed filesystem:
  $ bin/hadoop namenode -format
• Start the Hadoop daemons:
  $ bin/start-all.sh
• The Hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
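The log-directory default mentioned above can be sketched as plain shell parameter expansion; the HADOOP_HOME path below is an assumed example, not a required location.

```shell
# How the log location is resolved, sketched: HADOOP_LOG_DIR falls back to
# ${HADOOP_HOME}/logs when it is not set. HADOOP_HOME here is an assumed
# example path.
HADOOP_HOME=/home/hadoop/hadoop-1.0.1
unset HADOOP_LOG_DIR
HADOOP_LOG_DIR="${HADOOP_LOG_DIR:-$HADOOP_HOME/logs}"
echo "$HADOOP_LOG_DIR"    # prints /home/hadoop/hadoop-1.0.1/logs
```

Checking this directory is the first step when a daemon fails to come up.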
Execution (continued)
• Copy the input files into the distributed filesystem:
  $ bin/hadoop fs -put conf input
• Run some of the examples provided:
  $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
• Examine the output files on the distributed filesystem:
  $ bin/hadoop fs -cat output/*
• When you're done, stop the daemons with:
  $ bin/stop-all.sh
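To see what the regex in the grep example matches, here is a sketch using plain GNU grep on made-up sample data (the real job applies the same pattern across the HDFS input and counts the matches):

```shell
# Sketch of the grep example's pattern on made-up sample lines.
# 'dfs[a-z.]+' matches "dfs" followed by lowercase letters and dots,
# so it picks out dfs.* property names. Requires GNU grep (-o).
printf 'dfs.replication=1\nmapred.job.tracker=localhost:9001\ndfs.name.dir=/tmp/name\n' > sample.txt
matches=$(grep -oE 'dfs[a-z.]+' sample.txt)
echo "$matches"
rm -f sample.txt
```

Only the two dfs.* keys match; mapred.job.tracker is skipped because it does not start with "dfs".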
Details About Configuration Files
• Hadoop configuration is driven by two types of configuration files:
• Read-only default configuration:
  src/core/core-default.xml
  src/hdfs/hdfs-default.xml
  src/mapred/mapred-default.xml
  conf/mapred-queues.xml.template
• Site-specific configuration:
  conf/core-site.xml
  conf/hdfs-site.xml
  conf/mapred-site.xml
  conf/mapred-queues.xml
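Values in the site-specific files override the read-only defaults. As a sketch, a site property can also be marked final so that job submissions cannot override it; the value shown is the one used earlier in this guide.

```xml
<!-- conf/core-site.xml (fragment, a sketch): this site-specific value
     overrides the entry in core-default.xml, and <final> prevents
     individual jobs from overriding it again. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <final>true</final>
</property>
```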
Details About Configuration Files (continued)
[Slides showing property tables for conf/core-site.xml, conf/hdfs-site.xml, and conf/mapred-site.xml]
You may get detailed information from:
• The official site: http://hadoop.apache.org
• Course slides & textbooks: http://www.cs.sjtu.edu.cn/~liwujun/course/mmds.html
• Michael G. Noll's blog (a good guide): http://www.michael-noll.com/
If you have good materials to share, please send them to the TA.