
Map reduce with Hadoop streaming and/or Hadoop


Presentation Transcript


  1. Map reduce with Hadoop streaming and/or Hadoop

  2. [Architecture diagram] A streaming Hadoop job layered on a plain Hadoop job: the data lives in the Hadoop FileSystem; the Streaming Hadoop job wraps your code in an HS Mapper adaptor and an HS Reducer adaptor whose API is text lines in, lines out; underneath, the Hadoop job runs the Hadoop Mapper and Hadoop Reducer, with shuffle/sort, a partitioner, and a combiner in between, all built on Java and the Unix file system.

  3. [Diagram] Stream & sort on a Unix file system: an SS Mapper and an SS Reducer whose API is text lines in, lines out, wired together as cat input | mapper | sort | reducer > output, where mapper = java [-cp myJar.jar] MyMainClass1 and reducer = java [-cp myJar.jar] MyMainClass2.

  4. [Architecture diagram, annotated] The same picture as slide 2, with the adaptors bound to your own classes: the streaming mapper is mapper = java [-cp myJar.jar] MyMainClass1 and the streaming reducer is reducer = java [-cp myJar.jar] MyMainClass2; shuffle/sort, the partitioner, and the combiner run between them on top of the Hadoop FileSystem.

  5. Breaking this down… • What actually is a key-value pair? How do you interface with Hadoop? • One very simple way: Hadoop’s streaming interface. • The mapper outputs key-value pairs as one pair per line, with key and value tab-separated. • The reducer reads data in the same format. • Lines are sorted so that lines with the same key are adjacent.
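
For concreteness, a hypothetical fragment of mapper output in this format might look like the following (the tokens and counts are invented for illustration; \t stands for the tab character):

  apple\t1
  apple\t1
  banana\t1

After the sort, the two "apple" lines are adjacent, so the reducer can accumulate a total for each key simply by watching for the key to change.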

  6. An example: • SmallStreamNB.java and StreamSumReducer.java: • the code you just wrote.

  7. To run locally:
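
The original slide showed the exact command as a screenshot; as a rough sketch, assuming the two classes from the previous slide are packaged in myJar.jar, read stdin, write stdout, and take no extra arguments (train.txt and model.txt are hypothetical file names), a local stream-and-sort run looks like:

  cat train.txt \
    | java -cp myJar.jar SmallStreamNB \
    | sort \
    | java -cp myJar.jar StreamSumReducer \
    > model.txt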

  8. To train with streaming Hadoop: • First, you need to prepare the corpus by splitting it into shards • … and distributing the shards to different machines:

  9. To train with streaming Hadoop: • One way to shard text: • hadoop fs -put LocalFileName HDFSName • then run a streaming job with ‘cat’ as mapper and reducer • and specify the number of shards you want with -numReduceTasks
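
A sketch of that recipe, with hypothetical paths (corpus.txt, /data/corpus.txt, /data/sharded) and a streaming-jar location that varies by Hadoop version and installation; using 'cat' as both mapper and reducer makes this an identity job, so its only effect is to rewrite the data as one output shard per reduce task:

  hadoop fs -put corpus.txt /data/corpus.txt
  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/corpus.txt \
    -output /data/sharded \
    -mapper cat \
    -reducer cat \
    -numReduceTasks 10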

  10. To train with streaming Hadoop: • Next, prepare your code for upload and distribution to the machines in the cluster

  11. To train with streaming Hadoop: • Next, prepare your code for upload and distribution to the machines in the cluster
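
The slides showed the concrete steps as screenshots; one common arrangement (a sketch, assuming your mapper and reducer are the Java main classes above and your compiled .class files sit under a hypothetical bin/ directory) is to package everything into a jar, which can then either be kept on HDFS or shipped with each job via the streaming -file option (newer releases prefer the generic -files option):

  jar cf myJar.jar -C bin .
  hadoop fs -put myJar.jar /user/me/myJar.jar    (optional copy on HDFS; /user/me is hypothetical)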

  12. Now you can run streaming Hadoop:
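
Again, the slide's exact invocation was a screenshot; a hedged sketch, reusing the hypothetical paths above and shipping the jar to the workers with -file so the java commands can find it in each task's working directory:

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file myJar.jar \
    -mapper 'java -cp myJar.jar SmallStreamNB' \
    -reducer 'java -cp myJar.jar StreamSumReducer' \
    -input /data/sharded \
    -output /data/model \
    -numReduceTasks 10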

  13. On EC2, what’s different (aside from the interface)? • Your jar is stored in S3, e.g. s3://…./myjar.jar, rather than on a local disk or in HDFS.
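
A sketch of what that might look like, assuming inputs and outputs are also kept in S3 (a common EC2/EMR setup); bucket and path names are hypothetical, the streaming-jar location depends on the cluster image, and older Hadoop versions may need s3n:// instead of s3://. First copy the jar from S3 onto the machine that submits the job, then point -input and -output at S3:

  aws s3 cp s3://…./myjar.jar .     (path left elided, as on the slide)
  hadoop jar /path/to/hadoop-streaming.jar \
    -file myjar.jar \
    -mapper 'java -cp myjar.jar SmallStreamNB' \
    -reducer 'java -cp myjar.jar StreamSumReducer' \
    -input s3://mybucket/sharded \
    -output s3://mybucket/model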

  14. “Real” Hadoop • Streaming is simple but • There’s no typechecking of inputs/outputs • You need to parse strings a lot • You can’t use compact binary encodings • … • basically you have limited control over what you’re doing

  15. Other input formats: • KeyValueInputFormat • SequenceFileInputFormat
