
Introduction to Hadoop Architecture

Learn about the architecture of Hadoop and how it supports big data processing and algorithms. Explore HDFS, installation and configuration, running Hadoop jobs, and examples like word count and K-means under Mahout.


Presentation Transcript


  1. BIG DATA PROGRAMMING & ALGORITHMS Part II: Hadoop Architecture 10:45 – 12:00, Wed, June 24 IFI Summer School 2015

  2. What we’ll cover… • Hadoop • Overview • Architecture • Hadoop distributed file system (HDFS) • Hands-on • Installation and configuration • Running Hadoop jobs • Examples • Word count example • K-means under Mahout (if time allows) IFI Summer School 2015

  3. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable version: 1.0.3 • Reliable, Performant Distributed file system • MapReduce Programming framework • Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro … IFI Summer School 2015

  4. Who uses Hadoop? IFI Summer School 2015

  5. Bandwidth to data • Scan a 100TB dataset on a 1000-node cluster (≈100GB per node) • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement • Need a fault-tolerant store with reasonable availability guarantees • Handle hardware faults transparently IFI Summer School 2015

  6. Hadoop goals • Scalable: Petabytes (10^15 bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable: fault tolerance IFI Summer School 2015

  7. IFI Summer School 2015

  8. Hadoop big picture IFI Summer School 2015

  9. Hadoop architecture IFI Summer School 2015

  10. HDFS • Master-Worker architecture • Single NameNode • Many (thousands of) DataNodes • Files are split into fixed-size blocks and stored on DataNodes (64MB by default in Hadoop 1.x, commonly configured to 128MB) • Data blocks are replicated for fault tolerance and fast access (default replication factor is 3) IFI Summer School 2015
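
  A quick way to see how a file is split into blocks and replicated on a running cluster (a hedged sketch; the file name and path are hypothetical):
  $ hadoop fs -put bigfile.txt /user/hadoop/bigfile.txt
  $ hadoop fsck /user/hadoop/bigfile.txt -files -blocks -locations
  fsck lists each block of the file together with the DataNodes holding its replicas.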

  11. HDFS – Master (NameNode) • Manages filesystem namespace • File metadata • Mapping of a file to its list of blocks + locations • Authorization & Authentication • Checkpoints namespace changes • Mapping of each DataNode to its list of blocks • Monitors DataNode health • Replicates missing blocks • Keeps ALL namespace metadata in memory IFI Summer School 2015

  12. HDFS – Slave (DataNode) • Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to NameNode • Blocks are stored as underlying OS’s files IFI Summer School 2015

  13. Data replication IFI Summer School 2015

  14. Data replication • First copy is written to the local node (write affinity). • Second copy is written to a DataNode within a remote rack. • Third copy is written to a DataNode in the same remote rack. • Additional replicas are placed randomly. Objectives: load balancing, fast access, fault tolerance. IFI Summer School 2015
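
  The replication factor can also be changed per file; a small sketch (path hypothetical):
  $ hadoop fs -setrep -w 2 /user/hadoop/bigfile.txt
  The -w flag makes the command wait until the target replication is reached.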

  15. MapReduce: Hadoop execution layer • JobTracker knows everything about submitted jobs • Divides jobs into tasks and decides where to run each task • Continuously communicates with the TaskTrackers • TaskTrackers execute tasks (multiple tasks per node) • Monitor the execution of each task • Continuously send feedback to the JobTracker IFI Summer School 2015
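
  A minimal way to exercise the JobTracker/TaskTracker path is to submit the word-count job bundled with the Hadoop 1.0.3 examples jar (a sketch; the input/output paths are assumptions):
  $ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount /user/hadoop/input /user/hadoop/output
  $ hadoop job -list                         # jobs currently known to the JobTracker
  $ hadoop fs -cat /user/hadoop/output/part-*   # word counts produced by the reducers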

  16. HDFS filesystem commands • List the contents of a directory • $ hadoop fs -ls • Create a directory in HDFS at given path(s) • $ hadoop fs -mkdir <directory name> • Upload and download a file in HDFS • Upload: $ hadoop fs -put <local file> <remote path> • Download: $ hadoop fs -get <file in HDFS> <local path> • See contents of a file • $ hadoop fs -cat <filename> • Delete a file/directory in HDFS • $ hadoop fs -rm / -rmr <file or directory> IFI Summer School 2015
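
  A short worked session combining the commands above (paths and file names are hypothetical):
  $ hadoop fs -mkdir /user/hadoop/input
  $ hadoop fs -put words.txt /user/hadoop/input
  $ hadoop fs -ls /user/hadoop/input
  $ hadoop fs -cat /user/hadoop/input/words.txt
  $ hadoop fs -get /user/hadoop/input/words.txt ./words-copy.txt
  $ hadoop fs -rmr /user/hadoop/input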

  17. HDFS filesystem commands • Move a file from source to destination • $ hadoop fs -mv <src> <dst> • Report the amount of space used and available • $ hadoop fs -df hdfs:/ • How much space a directory occupies • $ hadoop fs -du -s -h <dir name> • Change permission of files • $ sudo hadoop fs -chmod 600 <file> • Change owner and group of files • $ sudo hadoop fs -chown root:root <file> IFI Summer School 2015

  18. HDFS admin commands • DFSAdmin command • -report: reports basic statistics of HDFS • -safemode: though usually not required, an administrator can manually enter or leave safe mode • enter, leave, get, wait • -refreshNodes: updates the set of hosts allowed to connect to the NameNode • Usage: hadoop dfsadmin [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-help [cmd]] IFI Summer School 2015

  19. Installation and configuration • Use the stable version, e.g., 1.0.3 • Local (standalone) mode • Fully-distributed mode • Pseudo-distributed mode Please refer to PDF file Hadoop_installation_and_configuration.pdf IFI Summer School 2015

  20. Fully distributed mode • Assume we have three machines with the following configuration • Create a hadoop user account on each machine IFI Summer School 2015
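
  Creating the hadoop account on each machine might look like this (the hostnames Master.Hadoop, Slave1.Hadoop and Slave2.Hadoop used in the sketches below are assumptions for a three-machine cluster):
  $ sudo useradd -m hadoop      # create the user with a home directory
  $ sudo passwd hadoop          # set its password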

  21. Modify machine names • The command differs per OS (Ubuntu, Fedora, Mac OS; Ubuntu and Fedora are sketched below) • Mac OS: $ sudo scutil --set HostName Master.Hadoop IFI Summer School 2015
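
  A hedged sketch of the equivalent steps on the other systems (file locations as on Ubuntu and pre-systemd Fedora):
  # Ubuntu: set the name now and make it persistent
  $ sudo hostname Master.Hadoop
  $ echo "Master.Hadoop" | sudo tee /etc/hostname
  # Fedora (pre-systemd): edit /etc/sysconfig/network and set HOSTNAME=Master.Hadoop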

  22. Configure DNS • Modify /etc/hosts on the master machine IFI Summer School 2015
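
  The entries typically look like the following; 192.168.1.141 is the master address used later in these slides, the slave addresses are assumptions:
  192.168.1.141   Master.Hadoop
  192.168.1.142   Slave1.Hadoop
  192.168.1.143   Slave2.Hadoop
  The slaves need the same mapping so that they can resolve Master.Hadoop by name.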

  23. Required software • JDK Download: http://www.oracle.com/technetwork/java/javase/index.html e.g., jdk-7u25-linux-i586.tar.gz • Hadoop Download: http://hadoop.apache.org/common/releases.html e.g., hadoop-1.0.3.tar.gz • SSH • $ sudo apt-get install openssh-server (for Ubuntu) • $ yum install openssh-server (for Fedora) IFI Summer School 2015
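
  Unpacking the two tarballs might look like this (the install locations /usr/java and /usr/hadoop are assumptions, chosen to match the paths used in the configuration slides below):
  $ sudo mkdir -p /usr/java
  $ sudo tar -xzf jdk-7u25-linux-i586.tar.gz -C /usr/java
  $ sudo tar -xzf hadoop-1.0.3.tar.gz -C /usr
  $ sudo mv /usr/hadoop-1.0.3 /usr/hadoop
  $ sudo chown -R hadoop:hadoop /usr/hadoop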

  24. Passwordless login for SSH (1) • $ ssh-keygen -t rsa -P '' (no spaces between the single quotes) • $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys • $ chmod 600 ~/.ssh/authorized_keys IFI Summer School 2015

  25. Passwordless login for SSH (2) • $ sudo vi /etc/ssh/sshd_config • Make sure the following lines are uncommented • Restart the ssh service • $ sudo service ssh restart IFI Summer School 2015
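
  The lines to uncomment are typically these (exact option names depend on the OpenSSH version):
  RSAAuthentication yes
  PubkeyAuthentication yes
  AuthorizedKeysFile      .ssh/authorized_keys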

  26. Passwordless login for SSH (3) • Run ssh-add if ssh is still not working IFI Summer School 2015
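
  A minimal sketch of loading the key into an agent:
  $ eval $(ssh-agent)
  $ ssh-add ~/.ssh/id_rsa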

  27. Passwordless login for SSH (4) • Copy the generated public key to all slave machines to enable passwordless access from the master to the slaves • Enter Slave1.Hadoop’s password when prompted IFI Summer School 2015
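
  One way to copy the key, repeated for every slave (Slave2.Hadoop is an assumed second slave):
  $ scp ~/.ssh/id_rsa.pub hadoop@Slave1.Hadoop:~/
  $ ssh hadoop@Slave1.Hadoop "mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
  Where available, ssh-copy-id hadoop@Slave1.Hadoop achieves the same in one step.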

  28. Passwordless login for SSH (5) • Test that the slave machines can be reached over SSH without a password IFI Summer School 2015
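
  For example:
  $ ssh hadoop@Slave1.Hadoop hostname   # should print the slave's hostname without asking for a password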

  29. Passwordless login for SSH (6) • To enable passwordless access from the slave machines to the master machine, add the public key of each slave to ‘authorized_keys’ in the .ssh folder on the master machine using the following command • $ cat ~/.ssh/id_rsa.pub | ssh hadoop@Master.Hadoop 'cat >> ~/.ssh/authorized_keys' • This is equivalent to the two commands: • $ scp ~/.ssh/id_rsa.pub hadoop@Master.Hadoop:~/ • $ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys IFI Summer School 2015

  30. IFI Summer School 2015

  31. Set Java environment variables • Edit /etc/profile (Ubuntu) or ~/.bash_profile (Mac) • Apply the changes with • $ source /etc/profile or . /etc/profile IFI Summer School 2015
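
  The added lines typically look like this (the JDK path assumes the jdk-7u25 archive was unpacked under /usr/java):
  export JAVA_HOME=/usr/java/jdk1.7.0_25
  export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
  export PATH=$PATH:$JAVA_HOME/bin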

  32. Install Hadoop • Modify /etc/profile to add some environment variables IFI Summer School 2015
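
  A sketch of the Hadoop-related variables, assuming Hadoop was unpacked to /usr/hadoop:
  export HADOOP_HOME=/usr/hadoop
  export PATH=$PATH:$HADOOP_HOME/bin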

  33. Configure Hadoop (1) • Edit /usr/hadoop/conf/hadoop-env.sh IFI Summer School 2015
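
  The one setting that must be present in hadoop-env.sh is JAVA_HOME, e.g. (path as assumed above):
  export JAVA_HOME=/usr/java/jdk1.7.0_25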

  34. Configure Hadoop (2) • Edit /usr/hadoop/conf/core-site.xml IFI Summer School 2015
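
  On Hadoop 1.x this file typically contains the NameNode address and a temporary directory; the port 9000 and the tmp path are conventional choices, not taken from the slides:
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://Master.Hadoop:9000</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/usr/hadoop/tmp</value>
    </property>
  </configuration>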

  35. Configure Hadoop (3) • Edit /usr/hadoop/conf/hdfs-site.xml IFI Summer School 2015
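
  A typical minimal hdfs-site.xml for this setup (the replication factor and storage paths are assumptions; keep dfs.replication no larger than the number of DataNodes):
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/usr/hadoop/hdfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/usr/hadoop/hdfs/data</value>
    </property>
  </configuration>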

  36. Configure Hadoop (4) • Edit /usr/hadoop/conf/mapred-site.xml IFI Summer School 2015
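
  On Hadoop 1.x this file points the MapReduce layer at the JobTracker; port 9001 is a conventional choice:
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>Master.Hadoop:9001</value>
    </property>
  </configuration>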

  37. Configure Hadoop (5) • Edit ‘masters’ file • Configure ‘slaves’ file (only needed on the master machine) IFI Summer School 2015
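
  Both files simply list hostnames, one per line (Slave2.Hadoop is an assumed second slave):
  # /usr/hadoop/conf/masters
  Master.Hadoop
  # /usr/hadoop/conf/slaves
  Slave1.Hadoop
  Slave2.Hadoop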

  38. Start Hadoop (1) IFI Summer School 2015

  39. Start Hadoop (2) IFI Summer School 2015
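
  On Hadoop 1.x the usual sequence, run as the hadoop user on the master, is (a sketch):
  $ hadoop namenode -format     # format HDFS once, before the first start
  $ start-all.sh                # starts NameNode, DataNodes, JobTracker and TaskTrackers
  $ stop-all.sh                 # the matching shutdown command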

  40. Check Hadoop status IFI Summer School 2015
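
  Two quick checks (jps is part of the JDK):
  $ jps                         # master should show NameNode, SecondaryNameNode, JobTracker; slaves show DataNode, TaskTracker
  $ hadoop dfsadmin -report     # summary of live DataNodes and HDFS capacity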

  41. IFI Summer School 2015

  42. Web access Hadoop • MapReduce/JobTracker web UI: http://192.168.1.141:50030 IFI Summer School 2015

  43. Web access Hadoop • HDFS/NameNode web UI: http://192.168.1.141:50070 IFI Summer School 2015

  44. Installation and configuration • Local (standalone) mode • Fully-distributed mode • Pseudo-distributed mode Please refer to PDF file Hadoop_installation_and_configuration.pdf IFI Summer School 2015

  45. Example • Using K-means to cluster text documents under Mahout and Hadoop Please refer to PDF file Kmeans_hadoop_mahout.pdf IFI Summer School 2015
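
  For orientation, a rough sketch of the Mahout 0.x command-line pipeline for clustering text documents (directory names and parameters are assumptions; see the accompanying PDF for the exact steps used in the course):
  $ mahout seqdirectory -i /user/hadoop/docs -o /user/hadoop/docs-seq          # plain text files -> SequenceFiles
  $ mahout seq2sparse -i /user/hadoop/docs-seq -o /user/hadoop/docs-vectors    # SequenceFiles -> TF-IDF vectors
  $ mahout kmeans -i /user/hadoop/docs-vectors/tfidf-vectors -c /user/hadoop/kmeans-seeds -o /user/hadoop/kmeans-out -k 10 -x 20 -cl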
