
An introduction to Apache Crunch

A short introduction to Apache Crunch: what it is, and how it simplifies the creation of Hadoop pipelines.

semtechs

Presentation Transcript


  1. Apache Crunch
     • What is it?
     • How does it work?
     • Why use it?
     • Hadoop MapReduce pipelines
     • Scrunch
     • Joins
     www.semtech-solutions.co.nz info@semtech-solutions.co.nz

  2. Apache Crunch – Pipeline
     • Crunch is based on Google's FlumeJava
     • Provides a Java-based API for M/R pipelines
     • It uses an MST (multiple serializable type) data model
     • Good for processing complex data types
     • Better for "non-tuple" data types such as
       • Images
       • Audio
       • Seismic data

  3. Apache Crunch – Pipeline
     • What is a MapReduce pipeline?
       • Map
       • Shuffle
       • Reduce
       • Combine
     • Stages are arranged in sequence and/or in parallel
     • Potentially very long chains
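
The stages named on this slide can be sketched in plain Java. This is a toy, single-JVM word count that uses neither Hadoop nor Crunch; the class name `MapReduceSketch` and its methods are illustrative assumptions, not part of any library. It also assumes the usual ordering in which the combine step pre-aggregates each mapper's output before the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-JVM illustration of the four stages; no Hadoop or Crunch involved.
final class MapReduceSketch {

    // Map: turn one input split into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Combine: pre-aggregate one mapper's output locally, before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partial;
    }

    // Shuffle: group the partial counts from every mapper by key.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapperOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapperOutputs) {
            output.forEach((word, count) ->
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(count));
        }
        return grouped;
    }

    // Reduce: collapse each key's list of values into a single total.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((word, counts) ->
            totals.put(word, counts.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Runs the stages in sequence, treating each inner list as one input split.
    static Map<String, Integer> wordCount(List<List<String>> splits) {
        List<Map<String, Integer>> mapperOutputs = new ArrayList<>();
        for (List<String> split : splits) {
            mapperOutputs.add(combine(map(split)));
        }
        return reduce(shuffle(mapperOutputs));
    }
}
```

Calling `MapReduceSketch.wordCount(List.of(List.of("a b a"), List.of("b c")))` yields the counts a=2, b=2, c=1. A real pipeline chains many such map and reduce steps, which is the bookkeeping a library like Crunch is meant to take over.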

  4. Apache Crunch – Scala
     • Scrunch is a Scala wrapper for Apache Crunch
     • Less code
     • Functional and OO styles
     • Uses type inference for Map / Reduce
     • Incorporates the Java Materialize functionality
     • Includes a REPL (read-eval-print loop)

  5. Apache Crunch – Joins
     • Joins available in Crunch
     • Inner and outer joins, as in SQL
     • Likewise left, right and full joins
     • The map-side join is an in-memory join
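
The join types on this slide, and the idea of an in-memory join, can be illustrated with a small hash join in plain Java. This is a sketch of the semantics only, not Crunch's implementation: `JoinSketch`, `JoinType` and the string row format are assumptions made for the example. One side is loaded into a hash map (the part that must fit in memory) and the other side is streamed past it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy in-memory hash join; illustrates join semantics only, not Crunch's implementation.
final class JoinSketch {

    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // Joins two lists of (key, value) pairs; an unmatched side is reported as "null".
    static List<String> join(List<Map.Entry<String, String>> left,
                             List<Map.Entry<String, String>> right,
                             JoinType type) {
        // Build phase: load the right-hand side into memory, indexed by key.
        Map<String, List<String>> rightIndex = new HashMap<>();
        for (Map.Entry<String, String> r : right) {
            rightIndex.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }

        List<String> rows = new ArrayList<>();
        Set<String> matchedKeys = new HashSet<>();

        // Probe phase: stream the left-hand side past the in-memory index.
        for (Map.Entry<String, String> l : left) {
            List<String> matches = rightIndex.get(l.getKey());
            if (matches != null) {
                matchedKeys.add(l.getKey());
                for (String rightValue : matches) {
                    rows.add(l.getKey() + ":" + l.getValue() + "," + rightValue);
                }
            } else if (type == JoinType.LEFT_OUTER || type == JoinType.FULL_OUTER) {
                rows.add(l.getKey() + ":" + l.getValue() + ",null");
            }
        }

        // Right and full outer joins also keep right-hand rows that found no partner.
        if (type == JoinType.RIGHT_OUTER || type == JoinType.FULL_OUTER) {
            for (Map.Entry<String, String> r : right) {
                if (!matchedKeys.contains(r.getKey())) {
                    rows.add(r.getKey() + ":null," + r.getValue());
                }
            }
        }
        return rows;
    }
}
```

With a left side of (1, alice), (2, bob) and a right side of (2, sales), (3, ops), the inner join keeps only key 2, the left outer join adds key 1, the right outer join adds key 3, and the full outer join keeps all three keys.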

  6. Apache Crunch – Performance
     • A lightweight API that runs efficiently
     • Crunch is a thin veneer on top of MapReduce
     • Two serialization implementations available
       • Hadoop Writables
       • Avro
     • The Avro implementation is much faster

  7. Apache Crunch – API
     • Operators
       • DoFn
       • CombineFn
       • FilterFn
       • Joins
       • Cartesian
       • Sort
       • Secondary Sort
       • PObject
       • BloomFilters
     • Data Model
       • Pipeline
         • MRPipeline
         • MemPipeline
       • PCollection
       • PTable
       • PGroupedTable
       • Source
       • Target
       • Emitter
       • PType
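
To give a feel for how a DoFn and an Emitter fit together, here is a minimal plain-Java sketch. The types `ToyEmitter` and `ToyDoFn` and the `parallelDo` helper are hypothetical stand-ins defined locally for illustration; they imitate the general shape of the idea (a per-record function that emits zero or more outputs) and are not the Crunch classes themselves, whose exact signatures should be taken from the Crunch documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that imitate the shape of a DoFn and an Emitter.
// These are local toy types, not the classes in the org.apache.crunch package.
final class DoFnSketch {

    // Receives zero or more output records for each input record.
    interface ToyEmitter<T> {
        void emit(T value);
    }

    // A per-record function: it may emit none, one, or many outputs per input.
    interface ToyDoFn<S, T> {
        void process(S input, ToyEmitter<T> emitter);
    }

    // Applies fn to every element and collects whatever it emits, all in memory.
    static <S, T> List<T> parallelDo(List<S> input, ToyDoFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S record : input) {
            fn.process(record, output::add);
        }
        return output;
    }

    // Example function: split lines into words (one input, many outputs).
    static final ToyDoFn<String, String> SPLIT_WORDS = (line, emitter) -> {
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                emitter.emit(word);
            }
        }
    };

    // Example filter written in the same style: emit the input only if it passes a test.
    static final ToyDoFn<String, String> LONG_WORDS_ONLY = (word, emitter) -> {
        if (word.length() > 3) {
            emitter.emit(word);
        }
    };
}
```

Chaining the two, `DoFnSketch.parallelDo(DoFnSketch.parallelDo(lines, DoFnSketch.SPLIT_WORDS), DoFnSketch.LONG_WORDS_ONLY)` first splits lines into words and then drops the short ones. In Crunch the equivalent chain would operate on distributed collections rather than on in-memory lists.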

  8. Contact Us
     • Feel free to contact us at
       • www.semtech-solutions.co.nz
       • info@semtech-solutions.co.nz
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours you need to solve your problems
