CHAPTER 9: Implementing MapReduce with Hadoop

Outline
• Setting up the development environment
• Creating a new project
• MapReduce program architecture
• Basic MapReduce example
• Advanced MapReduce example
• Hadoop MapReduce project


Setting Up the Development Environment
• A Hadoop development environment can be set up in one of two ways:
  • without an Integrated Development Environment (IDE)
  • with an IDE

Without an IDE (1/2)
• First, set the environment variables:
  • Add the Java and Hadoop CLASSPATH to /etc/profile:
      ~# vi /etc/profile
      CLASSPATH=/opt/hadoop/hadoop-0.20.2-core.jar    ← add these two lines
      export CLASSPATH
  • Apply the new profile settings with source, or simply log in again:
      ~# source /etc/profile
• Compile the finished Java program into a class file:
      ~# javac [program name].java

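Without an IDE (2/2)
• A program that consists of a single class file can be run on Hadoop directly.
• If there are several class files, package them into a jar file first:
      ~# jar cvf [jar file name].jar [program name].class
• Run the packaged jar file on Hadoop:
      /hadoop# bin/hadoop jar [jar file name].jar [main class name] [arg 0] [arg 1] …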

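Using an IDE (Eclipse) (1/2)
• Download the Eclipse installer from the Eclipse website (http://www.eclipse.org)
• On Linux, a graphical desktop package must be installed first
• The version downloaded here is Eclipse Classic 3.6.2
• After downloading, extract the archive and move the extracted directory to /opt/eclipse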

Using an IDE (Eclipse) (2/2)
• Alternatively, open a terminal and enter the following commands:
      ~# wget http://ftp.cs.pu.edu.tw/pub/eclipse/eclipse/downloads/drops/R-3.6.2-201102101200/eclipse-SDK-3.6.2-linux-gtk.tar.gz
      ~# tar zxvf eclipse-SDK-3.6.2-linux-gtk.tar.gz
      ~# mv eclipse /opt/
• Create a link to the eclipse executable in /usr/local/bin/:
      ~# ln -sf /opt/eclipse/eclipse /usr/local/bin/
• Copy the Eclipse plugin shipped in /opt/hadoop into the Eclipse plugins directory:
      ~# cp /opt/hadoop/contrib/eclipse-plugin/hadoop-0.20.2-eclipse-plugin.jar /opt/eclipse/plugins/
• Eclipse can now be started:
      ~# eclipse &

Creating a New Project (1/15 to 15/15)
[Fifteen screenshot slides; the images are not included in this transcript.]

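MapReduce Program Architecture
• A MapReduce program has three main parts:
  • MapReduce Driver
    • Acts as the main function of the whole MapReduce program; the job's settings are defined in this class
  • Mapper
    • Takes a set of input key/value pairs and produces a set of output (intermediate) key/value pairs
  • Reducer
    • Receives an intermediate key together with the set of values associated with that key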
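The division of labor among the three parts can be sketched in plain Java, without any Hadoop classes. This is only an illustration of the data flow (map each record, group the intermediate pairs by key, reduce each group); the class name, method names, and the "key,value" record format are assumptions made up for this sketch, and it requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> group-by-key -> reduce flow.
// No Hadoop classes are used; every name here is illustrative only.
class MapReduceFlowSketch {

    // "Mapper": turns one input record ("key,value") into one intermediate pair.
    static Map.Entry<String, Integer> mapRecord(String record) {
        String[] parts = record.split(",");
        return Map.entry(parts[0], Integer.parseInt(parts[1]));
    }

    // "Reducer": receives one intermediate key and all values collected for it.
    static int reduceKey(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "Driver": wires the phases together and returns the final key/value pairs.
    static Map<String, Integer> run(List<String> records) {
        // Group intermediate values by key (the framework does this in real Hadoop).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> kv = mapRecord(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceKey(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a,1", "b,2", "a,3"))); // prints {a=4, b=2}
    }
}
```

In a real job the grouping step is performed by the framework between the map and reduce phases, so only the Mapper, the Reducer, and the Driver settings are written by hand.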

MapReduce Driver (template)
    class <driver class name> {
        main() {
            Configuration conf = new Configuration();
            Job job = new Job(conf, <job name>);
            job.setJarByClass(<driver class, i.e. this class>);
            job.setMapperClass(<Mapper class>);
            job.setReducerClass(<Reducer class>);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // other settings
            job.waitForCompletion(true);
        }
    }

Mapper Program Structure (template)
    class <Mapper class name> extends Mapper<<input key type>, <input value type>, <output key type>, <output value type>> {
        // member variables
        public void map(<input key type> key, <input value type> value, Context context)
                throws IOException, InterruptedException {
            // map logic
            context.write(IntermediateKey, IntermediateValue);
        }
    }

Reducer Program Structure (template)
    class <Reducer class name> extends Reducer<<input key type>, <input value type>, <output key type>, <output value type>> {
        // member variables
        public void reduce(<input key type> key, Iterable<<input value type>> values, Context context)
                throws IOException, InterruptedException {
            // reduce logic
            context.write(ResultKey, ResultValue);
        }
    }

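Basic MapReduce Example (1/2)
• This example uses a simple maxCPU program to show how to develop a MapReduce program with Eclipse.
• The system records CPU utilization to a log file once an hour. The maxCPU program analyzes the log file with MapReduce and finds the highest CPU utilization for each day.
• The log file is the input of this example. First create a log directory on HDFS, then create a log file in the format shown on the next slide and upload it to the log directory.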

Basic MapReduce Example (2/2)
• Each log record has three fields: date, time slot, and CPU utilization. Part of the log file looks like this:
      2011/01/01 00:00 40
      2011/01/01 01:00 30
      …
      2011/01/02 22:00 40
      2011/01/02 23:00 30
      …
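Because the fields have fixed widths, one record can be split by character offset. The following plain-Java sketch shows the offsets implied by the sample lines, assuming a single space between fields; the class name LogLineSketch and its fields are hypothetical and exist only for this illustration.

```java
// Sketch: splitting one log record by fixed character offsets.
// Assumes the layout "yyyy/MM/dd HH:mm <cpu>" with single spaces, as in the sample lines.
class LogLineSketch {
    final String day;   // characters 0..9
    final String time;  // characters 11..15
    final int cpu;      // everything from character 17 onward

    LogLineSketch(String day, String time, int cpu) {
        this.day = day;
        this.time = time;
        this.cpu = cpu;
    }

    static LogLineSketch parse(String line) {
        String day = line.substring(0, 10);
        String time = line.substring(11, 16);
        // trim() guards against trailing whitespace at the end of the record
        int cpu = Integer.parseInt(line.substring(17).trim());
        return new LogLineSketch(day, time, cpu);
    }

    public static void main(String[] args) {
        LogLineSketch r = parse("2011/01/01 00:00 40");
        System.out.println(r.day + " | " + r.time + " | " + r.cpu); // prints 2011/01/01 | 00:00 | 40
    }
}
```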

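Adding the Mapper Class (1/3)

Adding the Mapper Class (2/3)
• A new package MR_Lab and a file mymapper.java now appear in the HadoopLab project. Change the contents of mymapper.java to:

    package MR_Lab;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text tday = new Text();                 // output key: the date
        private IntWritable idata = new IntWritable();  // output value: CPU utilization

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String day = line.substring(0, 10);   // date field
            String data = line.substring(17);     // CPU utilization field
            tday.set(day);
            idata.set(Integer.valueOf(data));
            context.write(tday, idata);           // emit (date, utilization)
        }
    }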

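Adding the Mapper Class (3/3)

Adding the Reducer Class (1/3)

Adding the Reducer Class (2/3)
• A file myreducer.java appears in the MR_Lab package. Change the contents of myreducer.java to:

    package MR_Lab;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class myreducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable cpuUtil = new IntWritable();  // output value: the day's maximum

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable val : values) {      // scan every value recorded for this date
                maxValue = Math.max(maxValue, val.get());
            }
            cpuUtil.set(maxValue);
            context.write(key, cpuUtil);          // emit (date, maximum utilization)
        }
    }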

Adding the Reducer Class (3/3)

Adding the MapReduce Driver Class (1/4)

Adding the MapReduce Driver Class (2/4 and 3/4)
• A file maxCPU.java appears in the MR_Lab package. Change the contents of maxCPU.java to:

    package MR_Lab;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class maxCPU {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Strip generic Hadoop options; what remains is <in> <out>
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: maxCPU <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "max CPU");
            job.setJarByClass(maxCPU.class);
            job.setMapperClass(mymapper.class);
            job.setCombinerClass(myreducer.class);
            job.setReducerClass(myreducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            // Block until the job finishes and report the outcome
            boolean status = job.waitForCompletion(true);
            if (status) {
                System.exit(0);
            } else {
                System.err.print("Not Complete!");
                System.exit(1);
            }
        }
    }
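The driver registers myreducer as both the combiner and the reducer. That is safe here because taking a maximum gives the same answer whether it is applied once to all values or first to partial groups and then to the partial results. The plain-Java sketch below illustrates this property; the class name, method names, and sample numbers are made up for the illustration and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: why a "max" reducer can also serve as a combiner.
class CombinerSketch {
    // Same logic as the reduce step: the largest value in a list.
    static int max(List<Integer> values) {
        int m = Integer.MIN_VALUE;
        for (int v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    // No combiner: every value emitted by every map task reaches the reducer.
    static int withoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : mapOutputs) {
            all.addAll(part);
        }
        return max(all);
    }

    // With a combiner: each map task reduces its own values first,
    // so the reducer only sees one partial maximum per task.
    static int withCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> part : mapOutputs) {
            partials.add(max(part));
        }
        return max(partials);
    }

    public static void main(String[] args) {
        // Values for one date as two map tasks might emit them (made-up numbers).
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(40, 30, 100),
                Arrays.asList(55, 70));
        System.out.println(withoutCombiner(mapOutputs) + " " + withCombiner(mapOutputs)); // prints 100 100
    }
}
```

An operation such as an average would not have this property, so reusing the reducer as a combiner is a choice that has to be checked per job. Hadoop treats the combiner as an optional optimization, which is another reason the result must not depend on whether it runs.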

Adding the MapReduce Driver Class (4/4)

Running the MapReduce Program on Hadoop (1/3)

Running the MapReduce Program on Hadoop (2/3)

Running the MapReduce Program on Hadoop (3/3)
• When the job finishes, the final result can be found in the output directory on HDFS:
      2011/01/01 100
      2011/01/02 90
      2011/01/03 80
      2011/01/04 30
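To sanity-check a job result on a small input, the same per-day maximum can be computed locally in plain Java. The sketch below is not part of the Hadoop job; the class and method names are hypothetical, and it requires Java 9 or later. It is fed the four log lines quoted earlier in this chapter, which are only an excerpt, so its maxima differ from the full-log result above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: computing the per-day maximum locally, without Hadoop,
// to cross-check a job's output on a small sample of the log.
class LocalMaxCpuSketch {
    static Map<String, Integer> maxPerDay(List<String> lines) {
        Map<String, Integer> max = new TreeMap<>();   // sorted by date
        for (String line : lines) {
            String day = line.substring(0, 10);
            int cpu = Integer.parseInt(line.substring(17).trim());
            max.merge(day, cpu, Integer::max);        // keep the larger value per date
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "2011/01/01 00:00 40",
                "2011/01/01 01:00 30",
                "2011/01/02 22:00 40",
                "2011/01/02 23:00 30");
        // One "date<TAB>maximum" line per day
        maxPerDay(sample).forEach((day, cpu) -> System.out.println(day + "\t" + cpu));
    }
}
```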

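Advanced MapReduce Example
• This example extends the MapReduce program introduced above (maxCPU) with HDFS and HBase operations.
[Figure: a seven-step data flow (steps 1 to 7) among the local host, HDFS, the MapReduce Mapper and Reducer, and HBase.]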

Adding the maxCPU Class (1/2 and 2/2)
• Add a MapReduce Driver class named maxCPU. The maxCPU class is responsible for the MapReduce settings and the overall workflow.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    // CheckDir, LocalToHdfs, CheckTable, and OutputResult are helper classes
    // of this example that are not shown in this section.
    public class maxCPU {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: maxCPU <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "max CPU");
            job.setJarByClass(maxCPU.class);
            job.setMapperClass(mymapper.class);
            job.setCombinerClass(myreducer.class);
            job.setReducerClass(myreducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            // HDFS preparation: input directory, upload from the local host, output directory
            CheckDir.check(otherArgs[0], conf);
            LocalToHdfs.localToHdfs(otherArgs[0], otherArgs[0], conf);
            CheckDir.check(otherArgs[1], conf);
            // HBase preparation: table "CPU" with column family "CPUUtil"
            CheckTable.check("CPU");
            CheckTable.addFamily("CPU", "CPUUtil");
            boolean status = job.waitForCompletion(true);
            if (status) {
                OutputResult.output(otherArgs[1], conf);
                System.exit(0);
            } else {
                System.err.print("Not Complete!");
                System.exit(1);
            }
        }
    }

Adding the mymapper Class
• Add a Mapper class named mymapper. It prepares the input key/value pairs and, on line 10, calls the AddData class to store the data in HBase.

    01. public class mymapper extends Mapper<Object, Text, Text, IntWritable> {
    02.     private Text tday = new Text();
    03.     private IntWritable idata = new IntWritable();
    04.     public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    05.         String line = value.toString();
    06.         String day = line.substring(0, 10);
    07.         String time = line.substring(11, 16);
    08.         String data = line.substring(17);
    09.         try {
    10.             AddData.add("CPU", "CPUUtil", day + " " + time, data);
    11.         } catch (Exception e) {
    12.             System.err.print("ERROR! (add data to HBase)");
    13.         }
    14.         tday.set(day);
    15.         idata.set(Integer.valueOf(data));
    16.         context.write(tday, idata);
    17.     }
    18. }
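The mapper builds two different keys from each record: day + " " + time for the AddData.add call, and day alone for the pair sent to the reducer. The plain-Java sketch below only illustrates the difference in granularity between the two; it does not call HBase or AddData, the class and method names are hypothetical, and it requires Java 9 or later.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: the two keys the mapper derives from one log record.
class KeyGranularitySketch {
    // Key built for the AddData.add call: distinct for every hourly record.
    static String hourlyKey(String line) {
        return line.substring(0, 10) + " " + line.substring(11, 16);
    }

    // Key emitted to the reducer: shared by all records of the same day.
    static String dailyKey(String line) {
        return line.substring(0, 10);
    }

    // Collects the distinct keys of either kind over a list of records.
    static Set<String> distinctKeys(List<String> lines, boolean hourly) {
        Set<String> keys = new LinkedHashSet<>();
        for (String line : lines) {
            keys.add(hourly ? hourlyKey(line) : dailyKey(line));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "2011/01/01 00:00 40",
                "2011/01/01 01:00 30",
                "2011/01/02 22:00 40",
                "2011/01/02 23:00 30");
        System.out.println(distinctKeys(sample, true).size());  // prints 4
        System.out.println(distinctKeys(sample, false).size()); // prints 2
    }
}
```

Every hourly reading therefore keeps its own key on the storage side, while the reducer still receives all readings of a day grouped under one key.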
