
Poly Hadoop


Presentation Transcript


  1. Poly Hadoop
  CSC 550, May 22, 2007
  Scott Griffin, Daniel Jackson, Alexander Sideropoulos, Anton Snisarenko

  2. Accomplishments
  • ITS Grid Account
    • OpenPBS, Java, Subversion, Bash, Perl, Vim
  • Hadoop on ITS Grid Account
    • HDFS, Node Configurations
    • MapReduce Code
    • Hadoop Running Natively on ITS Grid
  • Hadoop on VMware Images
    • Fedora 6, Image & Hadoop Configuration

  3. Grid Properties
  • All Jobs Queued Through the Management Node
    • qsub <resource_list> script.bsh (see the sample below)
    • The resource list can specify physical node assignment, number of processors, allowed execution time, etc.
  • Script Executes on Only One Physical Node
  • User Environment Replicated on All Nodes
  • Shared File System
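For concreteness, a minimal submission might look like the following; the resource values are illustrative, and $PBS_NODEFILE is where OpenPBS records the nodes granted to a job:

    # Illustrative request: 4 nodes, 2 processors each, at most one hour.
    qsub -l nodes=4:ppn=2,walltime=01:00:00 script.bsh

    # script.bsh executes on only one of the granted nodes, but it can
    # read the full node list that OpenPBS wrote to $PBS_NODEFILE:
    cat "$PBS_NODEFILE"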

  4. Hadoop on Grid: Issues & Solutions
  • Shared File System vs. Local File System
    • Issues
      • Single Configuration File Shared by All Hadoop Nodes
      • Hadoop DataNodes Need “Local” Directories
      • The File System is Shared
    • Solution
      • Create Separate Directories Using the Node’s Hostname
      • Supply the Hostname via Java System Properties
      • Use Java System Property Expansion in the Hadoop Configuration File (see the snippet below)
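A sketch of the expansion trick; the property name (hostname) and the path are illustrative, and it relies on Hadoop's Configuration class expanding ${...} references against Java system properties:

    <!-- hadoop-site.xml: give each DataNode its own directory under the
         shared file system, keyed on a "hostname" system property. -->
    <property>
      <name>dfs.data.dir</name>
      <value>/shared/hadoop/${hostname}/dfs/data</value>
    </property>

Each node then starts its daemons with the property set, e.g. export HADOOP_OPTS="-Dhostname=$(hostname)", so every DataNode resolves a distinct “local” directory even though the underlying file system is shared.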

  5. Hadoop on Grid: Issues & Solutions (cont.)
  • Pseudo-Dynamic Namenode Selection
    • Issues
      • Physical Node Assignments Not Guaranteed
      • Hadoop Configuration File Specifies Which Nodes to Use
    • Solution
      • On-the-Fly Modification of the Hadoop Configuration File
        • Yay for XML!
      • On-the-Fly Modification of the Hadoop masters and slaves Files (see the sketch below)
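One plausible shape for this, assuming the job script derives the node list from $PBS_NODEFILE; the sed call stands in for update_sitexml.pl, and the __MASTER__ placeholder is an illustrative assumption:

    #!/bin/bash
    # Use the first node OpenPBS assigned as the namenode/JobTracker;
    # the remaining nodes become datanode/TaskTracker slaves.
    NODES=$(sort -u "$PBS_NODEFILE")
    MASTER=$(echo "$NODES" | head -n 1)

    echo "$MASTER" > conf/masters
    echo "$NODES" | tail -n +2 > conf/slaves

    # Rewrite hadoop-site.xml so fs.default.name and the JobTracker
    # address point at the chosen master.
    sed -i "s/__MASTER__/$MASTER/g" conf/hadoop-site.xml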

  6. Hadoop on Grid Scripts
  • run_createdirs.sh
    • Creates dirs for each physical node
  • update_sitexml.pl
    • Dynamically updates hadoop-site.xml
  • run_real_test.sh (outlined below)
    • Formats HDFS
    • Starts job management and DFS
    • Puts dataset on DFS
    • Runs MapReduce jobs
    • Exports output
    • Stops MapReduce and DFS
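The slide only names the steps; a minimal run_real_test.sh in that shape might read as follows (the jar, class, and path names are illustrative; the commands are the Hadoop 0.x equivalents of each bullet):

    #!/bin/bash
    HADOOP=bin/hadoop

    $HADOOP namenode -format            # format HDFS
    bin/start-all.sh                    # start DFS and job management
    $HADOOP dfs -put dataset input      # put dataset on DFS
    $HADOOP jar polyhadoop.jar UserRatingJob input output   # run MapReduce job
    $HADOOP dfs -get output results     # export output to the shared FS
    bin/stop-all.sh                     # stop MapReduce and DFS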

  7. MapReduce Progress
  • Pushing Dataset Onto Hadoop FS
    • Simple command done in the qsub script
  • MapReduce Java Code
  • Selecting the Number of Jobs (wired up in the driver sketch after slide 9)
    • Map jobs = 10 per node
    • Reduce jobs = 2 per node

  8. Map Code

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.log4j.Logger;

    public class UserRatingMapper extends MapReduceBase implements Mapper {
        // Matches a rating line of the form "<userId>,<rating>,<yyyy-mm-dd>".
        private static Pattern userRatingDate =
                Pattern.compile("^(\\d+),(\\d+),\\d{4}-\\d{2}-\\d{2}$");
        private Logger log = Logger.getLogger(this.getClass());

        public void map(WritableComparable key, Writable values,
                        OutputCollector output, Reporter reporter) throws IOException {
            String line = ((Text) values).toString();
            Matcher userRating = userRatingDate.matcher(line);
            IntWritable userId = new IntWritable();
            IntWritable rating = new IntWritable();
            if (line.matches("^\\d+:$")) {
                // Header line of the form "<id>:" (e.g., a movie id): nothing to emit.
            } else if (userRating.matches()) {
                userId.set(Integer.parseInt(userRating.group(1)));
                rating.set(Integer.parseInt(userRating.group(2)));
                output.collect(userId, rating);  // emit (userId, rating)
            } else {
                log.error("Unexpected input: " + line);
            }
        }
    }

  9. Reduce Code

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class AverageValueReducer extends MapReduceBase implements Reducer {
        public void reduce(WritableComparable key, Iterator values,
                           OutputCollector output, Reporter reporter) throws IOException {
            int sum = 0, count = 0;
            // Sum every rating emitted for this user.
            while (values.hasNext()) {
                sum += ((IntWritable) values.next()).get();
                ++count;
            }
            // Emit the user's average rating.
            output.collect(key, new FloatWritable(((float) sum) / count));
        }
    }
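The deck does not show the driver that ties the mapper and reducer together; below is a minimal sketch against the same old-style org.apache.hadoop.mapred API the code slides use, folding in the per-node task counts from slide 7. The class name UserRatingJob, the numNodes system property, and the path arguments are assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class UserRatingJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(UserRatingJob.class);
            conf.setJobName("average-user-rating");

            conf.setMapperClass(UserRatingMapper.class);
            conf.setReducerClass(AverageValueReducer.class);

            // Mapper emits (userId, rating); reducer emits (userId, average).
            conf.setMapOutputKeyClass(IntWritable.class);
            conf.setMapOutputValueClass(IntWritable.class);
            conf.setOutputKeyClass(IntWritable.class);
            conf.setOutputValueClass(FloatWritable.class);

            // Slide 7: 10 map tasks and 2 reduce tasks per node; how numNodes
            // reaches the job is an assumption (here, a system property).
            int numNodes = Integer.getInteger("numNodes", 1);
            conf.setNumMapTasks(10 * numNodes);
            conf.setNumReduceTasks(2 * numNodes);

            conf.setInputPath(new Path(args[0]));    // dataset directory on HDFS
            conf.setOutputPath(new Path(args[1]));   // results directory on HDFS

            JobClient.runJob(conf);
        }
    }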

  10. VMware Image Progress
  • Set up a Fedora Core 6 VM image
  • Configured the image to always create a new key when moved (see the .vmx note below)
  • Turned the firewall off on the image
  • Installed Hadoop and configured it
    • Master, slaves, HDFS namespace, output directories, formatting HDFS
  • Successfully started HDFS and MapReduce with a master and 1 slave
  • Ran a test job with 99 input files
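The “new key when moved” behavior is most likely VMware's UUID policy; a guess at the relevant line of the image's .vmx file:

    # Always generate a fresh UUID when the image is copied or moved,
    # which also yields a fresh auto-assigned MAC address per copy.
    uuid.action = "create"

This is presumably also how each copy ends up with its own MAC address, as slide 11 requires.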

  11. VMware Setup on Grid
  • Need multiple copies of images on the grid
    • Namenode/JobTracker image (1 copy)
    • Datanode/TaskTracker images (many copies)
    • Different MAC address for each copy
  • Starting up Hadoop
    • Start each image copy on a separate blade
    • Obtain image IPs from the DHCP server and place them in each image’s config files
    • Start HDFS and MapReduce from the master

  12. VMware Issues
  • Issues
    • Slaves would not connect to the master
    • Master would not start after formatting HDFS
    • Need root access to install VMware Player on the grid
    • Images too big / not enough disk space
  • Solutions
    • Turn off the firewall
    • Delete all files from the namespace dir, then format HDFS (see the commands below)
    • E-mail the admin
    • Shrink the image’s virtual hard drive
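The namespace fix amounts to two commands; the dfs.name.dir path below is illustrative:

    # A stale namespace directory left the namenode unable to start after
    # a reformat; clearing it first makes the format clean.
    rm -rf /shared/hadoop/$(hostname)/dfs/name/*
    bin/hadoop namenode -format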

  13. Evaluation Techniques
  • Processing time across the different configurations
  • Optimizations that can be made
    • Number of Map tasks vs. Reduce tasks per node
  • Explanation of preliminary data
    • Overhead of redundancy on the grid
  • We’re all set up and ready to start our experiments
    • As soon as jkempena gives us our nodes back

  14. Timeline
  • Weeks 5-6
    • Install/configure environment
    • Develop code
  • Weeks 7-8
    • Run experiments
  • Weeks 9-10
    • Analyze data
    • Write paper
    • Present results

  15. Questions?
