IDS594 Special Topics in Big Data Analytics
Week 3

What is Hadoop?
  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  • Hadoop is an open-source implementation of Google's MapReduce
  • Hadoop is based on a simple programming model called MapReduce
  • Hadoop is based on a simple data model; any data format will fit
  • The Hadoop framework consists of two main layers:
    • Distributed file system (HDFS)
    • Execution engine (MapReduce)
Hadoop Infrastructure
  • Hadoop is a distributed system, like distributed databases.
  • However, there are several key differences between the two infrastructures:
    • Data model
    • Computing model
    • Cost model
    • Design objectives
HDFS: Hadoop Distributed File System
  • HDFS has a master-slave architecture
    • Master: namenode
    • Slaves: datanodes (100s or 1000s of nodes)
    • Single namenode and many datanodes
    • The namenode maintains the file system metadata
    • Files are split into fixed-size blocks (default 64 MB) and stored on datanodes
    • Data blocks are replicated for fault tolerance and fast access (default replication factor: 3); the block layout of a file can be inspected with hadoop fsck, as sketched below
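
A quick way to see how a particular file was split into blocks and where its replicas ended up is the fsck tool (the path below is a placeholder):

$ bin/hadoop fsck /user/alice/data.txt -files -blocks -racks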
HDFS Architecture
  • Default placement policy: where to put a given block?
    • First copy is written to the node creating the file (write affinity)
    • Second copy is written to a datanode within the same rack
    • Third copy is written to a datanode in a different rack
    • Objectives: load balancing, fast access, fault tolerance
MapReduce: Hadoop Execution Layer
  • The JobTracker knows everything about submitted jobs
    • Divides jobs into tasks and decides where to run each task
    • Continuously communicates with the TaskTrackers
  • TaskTrackers execute tasks (multiple per node)
    • Monitor the execution of each task
    • Continuously send feedback to the JobTracker
  • MapReduce is a master-slave architecture
    • Master: JobTracker
    • Slaves: TaskTrackers (100s or 1000s of tasktrackers)
  • Every datanode also runs a TaskTracker
Hadoop Computing Model
  • Mappers and Reducers consume and produce (key, value) pairs
  • Users define the data types of the key and the value
  • Shuffling and sorting phase
    • Map output is shuffled so that all records with the same key go to the same reducer
    • Each reducer may receive multiple key sets
    • Each reducer sorts its records to group records with the same key, then processes each group; a hypothetical word-count run is traced below
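
As a hypothetical illustration with word count (the input text is made up):

Map output, possibly from different mappers: (the, 1), (cat, 1), (the, 1), (sat, 1), (the, 1)
After shuffling and sorting, one reducer receives (the, [1, 1, 1]); others receive (cat, [1]) and (sat, [1])
Reduce output: (the, 3), (cat, 1), (sat, 1)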
conf/hadoop-env.sh
  • Configures the environment of the Hadoop daemons; a hypothetical fragment is sketched below
    • JAVA_HOME - the root of the Java installation
    • Options for individual daemons can be set with the HADOOP_*_OPTS variables (e.g. HADOOP_NAMENODE_OPTS)
    • HADOOP_LOG_DIR - the directory where the daemons' log files are stored. Log files are created automatically if they don't exist.
    • HADOOP_HEAPSIZE - the maximum heap size to use, in MB (e.g. 1000). This configures the heap size for the Hadoop daemons; the default is 1000 MB.
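
A hypothetical conf/hadoop-env.sh fragment putting these options together (all paths and values below are assumptions; adjust them for your machine):

# Root of the Java installation (path is an assumption)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Where the daemons write their log files
export HADOOP_LOG_DIR=/var/log/hadoop
# Maximum daemon heap size in MB (default is 1000)
export HADOOP_HEAPSIZE=2000
# Extra JVM options for the NameNode only
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"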
conf/core-site.xml
  • Configuring the Hadoop daemons

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/${user.name}/hadoop-store</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml
  • dfs.name.dir
    • Path on the local file system where the NameNode stores the namespace and transaction logs persistently.
    • If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
  • dfs.data.dir
    • Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.
    • If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
  • dfs.replication
    • Number of copies kept for each data block, for redundancy (default is 3). A sample conf/hdfs-site.xml putting these properties together is sketched below.
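
A minimal conf/hdfs-site.xml sketch for a single-node setup (the paths and the replication factor of 1 are assumptions, not recommendations):

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/Users/${user.name}/hadoop-store/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/${user.name}/hadoop-store/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>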
conf/mapred-site.xml
  • mapred.job.tracker
    • Host (or IP) and port of the JobTracker, given as a host:port pair.
  • mapred.tasktracker.{map|reduce}.tasks.maximum
    • The maximum number of map / reduce tasks run simultaneously on a given TaskTracker, set individually for maps and for reduces.
    • Defaults to 2 (2 maps and 2 reduces); vary it depending on your hardware.
  • Other properties are documented online; a sample conf/mapred-site.xml is sketched below.
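
A minimal conf/mapred-site.xml sketch (the JobTracker address localhost:9001 and the task limits are assumptions for a single-node setup):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>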
Slaves File
  • Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and a TaskTracker and are referred to as slaves.
  • List all slave hostnames or IP addresses in your conf/slaves file, one per line, as in the sketch below.
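
A hypothetical conf/slaves file for a three-slave cluster (the hostnames are made up):

slave01.example.com
slave02.example.com
slave03.example.com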
Start/Stop Hadoop
  • Format a new distributed filesystem first:
    • $ bin/hadoop namenode -format
  • Start the hadoop daemons (you can verify which daemons came up with jps, as shown below):
    • $ bin/start-all.sh
  • Stop the hadoop daemons:
    • $ bin/stop-all.sh
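
After start-all.sh you can check the running daemons with the JDK's jps tool; a healthy single-node setup typically shows something like this (the process IDs are made up):

$ jps
4235 NameNode
4318 DataNode
4402 SecondaryNameNode
4480 JobTracker
4561 TaskTracker
4650 Jps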

Online Documentation

Windows: http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

Mac: http://hadoop.apache.org/docs/stable/single_node_setup.html

MapReduce
  • Map function
  • Reduce function
  • Job configuration

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

Map Function

Input (key, value) type: <LongWritable, Text> (byte offset in the file, one line of text)

Output (key, value) type: <Text, IntWritable> (a word, the count 1)

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);   // emit (word, 1) for every token in the line
    }
  }
}
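
The Map, Reduce, and job-configuration pieces on these slides belong in a single WordCount class. A sketch of the imports that class needs, assuming the old org.apache.hadoop.mapred API used here:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;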
Reduce Function

Input (key, values) type: <Text, Iterator<IntWritable>> (a word, the counts collected for it)

Output (key, value) type: <Text, IntWritable> (a word, its total count)

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // add up the counts emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
Job Configuration

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // types of the job's output (key, value) pairs
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  // wire up the Map and Reduce classes; Reduce also serves as a combiner
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  // read plain text lines and write plain text results
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  // input and output paths come from the command line
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
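
Assuming the Map, Reduce, and main pieces above are placed in a single WordCount.java, compiling and running the job typically looks roughly like this (the jar names, class name, and paths are assumptions for a Hadoop 1.x install):

$ mkdir wordcount_classes
$ javac -classpath hadoop-core-1.2.1.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ bin/hadoop fs -put local_input_dir input
$ bin/hadoop jar wordcount.jar WordCount input output
$ bin/hadoop fs -cat output/part-00000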

HDFS Commands

https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/FileSystemShell.html
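
A few commonly used FileSystem shell commands from that page, for quick reference (the paths are placeholders):

$ bin/hadoop fs -ls /                       # list the HDFS root directory
$ bin/hadoop fs -mkdir input                # create a directory under /user/<you>
$ bin/hadoop fs -put localfile.txt input/   # copy a local file into HDFS
$ bin/hadoop fs -cat input/localfile.txt    # print a file stored in HDFS
$ bin/hadoop fs -rm input/localfile.txt     # delete a file from HDFS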