SIMR Spark In MapReduce

SIMR: Spark In MapReduce. Ali Ghodsi, Ahir Reddy. UC Berkeley, Databricks. Background: it is hard to try out Spark on MapReduce v1 clusters (separate machines, installing Scala and Spark, admin rights), and generally difficult to try it out on a cluster.


Presentation Transcript


  1. SIMR: Spark In MapReduce. Ali Ghodsi, Ahir Reddy. UC Berkeley, Databricks

  2. Background • Hard to try out Spark on MapReduce v1 clusters • Separate machines • Installing Scala, Spark • Admin rights • Generally difficult to try out on a cluster • Configure, compile, Standalone, Mesos, YARN

  3. SIMR • MapReduce job with Spark inside it • Launches Spark, Scala, your job, Spark-shell
     bash> ./simr --shell
     bash> ./simr my.jar testClass %spark_url%

  4. How does it work?

  5. Under the hood • Ship to all mappers • Scala & Spark fat jar • Your job jar

  6. Setting up Spark • Mappers write their ID to HDFS • Lowest timestamped mapper becomes leader • Leader mapper executes • Spark driver • Other mappers execute • Executors
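The election step on this slide can be sketched as follows. This is a minimal illustration, not SIMR's actual code: a local directory stands in for HDFS, and the names `register`, `elect_leader`, and `role_for` are hypothetical. Each mapper drops a marker file named after its ID, and the mapper whose marker has the oldest modification time is treated as the leader.

```python
import os
import tempfile


def register(election_dir, mapper_id):
    """Write an empty marker file named after this mapper's ID.

    Assumption: a local directory stands in for a shared HDFS directory.
    """
    path = os.path.join(election_dir, mapper_id)
    with open(path, "w"):
        pass
    return path


def elect_leader(election_dir):
    """Return the ID whose marker file has the lowest timestamp.

    Ties are broken by the ID itself so that every mapper reaches the
    same answer. Assumes at least one mapper has registered.
    """
    entries = []
    for name in os.listdir(election_dir):
        mtime = os.stat(os.path.join(election_dir, name)).st_mtime_ns
        entries.append((mtime, name))
    return min(entries)[1]


def role_for(mapper_id, leader_id):
    """The leader runs the Spark driver; every other mapper runs an executor."""
    return "driver" if mapper_id == leader_id else "executor"


if __name__ == "__main__":
    shared = tempfile.mkdtemp()
    for mapper in ("mapper_0", "mapper_1", "mapper_2"):
        register(shared, mapper)
    leader = elect_leader(shared)
    for mapper in ("mapper_0", "mapper_1", "mapper_2"):
        print(mapper, role_for(mapper, leader))
```

Because every mapper evaluates the same directory listing with the same rule, no extra coordination service is needed in this sketch; a real deployment would also have to wait until all mappers have registered before electing.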

  7. Connecting everyone • Connect executors with driver • Driver writes URL to HDFS, executors busy-read • Spark is ready!
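The rendezvous on this slide can be sketched as a publish-and-poll pair. This is an assumption-laden illustration rather than SIMR's implementation: a local file stands in for the HDFS file, `publish_url` and `wait_for_url` are hypothetical names, and the write-then-rename step is a design choice of this sketch so that pollers never read a half-written URL.

```python
import os
import tempfile
import time


def publish_url(path, url):
    """Driver side: make the URL visible at `path` in one step.

    Writing to a temporary file and renaming it means a reader sees
    either no file or the complete URL (an assumption of this sketch).
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(url)
    os.replace(tmp, path)


def wait_for_url(path, poll_interval=0.1, timeout=30.0):
    """Executor side: poll `path` until the driver URL appears."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                url = f.read().strip()
            if url:
                return url
        except FileNotFoundError:
            pass  # driver has not published yet
        if time.monotonic() >= deadline:
            raise TimeoutError("driver URL not published at " + path)
        time.sleep(poll_interval)


if __name__ == "__main__":
    shared = tempfile.mkdtemp()
    url_file = os.path.join(shared, "driver_url")
    publish_url(url_file, "spark://leader-host:7077")  # hypothetical URL
    print(wait_for_url(url_file))
```

The timeout is there so an executor does not spin forever if the driver never starts; the poll interval trades startup latency against load on the shared file system.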

  8. Interacting with Spark • Relay keyboard input & screen output • Relay Server executed on leader mapper • Relay Client executed on client machine • Connecting the two • Relay server writes URL to HDFS • Relay client reads and connects to server • Relay all input/output between client/driver
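The relay idea can be sketched with plain TCP sockets. This is only an illustration under stated assumptions: `start_relay_server` and `relay_client` are hypothetical names, the `handle_line` callback stands in for the Spark driver's shell, and the server address is returned directly instead of being written to and read from HDFS.

```python
import socket
import threading


def start_relay_server(handle_line, host="127.0.0.1"):
    """Start a one-connection relay server and return its (host, port).

    `handle_line` is a stand-in for the driver: it receives one line of
    client input and returns one line of output.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    address = srv.getsockname()

    def serve():
        conn, _ = srv.accept()
        # Separate reader and writer objects keep buffered input intact.
        with conn, conn.makefile("r", encoding="utf-8") as reader, \
                conn.makefile("w", encoding="utf-8") as writer:
            for line in reader:  # ends when the client disconnects
                writer.write(handle_line(line.rstrip("\n")) + "\n")
                writer.flush()
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return address


def relay_client(address, lines):
    """Send each input line to the server and collect the replies."""
    outputs = []
    with socket.create_connection(address) as conn, \
            conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in lines:
            writer.write(line + "\n")
            writer.flush()
            outputs.append(reader.readline().rstrip("\n"))
    return outputs


if __name__ == "__main__":
    # str.upper plays the role of the driver in this toy example.
    addr = start_relay_server(str.upper)
    print(relay_client(addr, ["val x = 1", "x + 1"]))
```

The sketch is line-oriented and strictly request/reply for simplicity; an interactive shell relay would stream bytes in both directions independently.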

  9. Hadoop versions • Precompiled for Hadoop • 1.0.4 (HDP 1.0-1.2) • 1.2.x (HDP 1.3) • 0.20 (CDH3) • 2.0.0 (CDH4) • Instructions on how to compile your own • http://databricks.github.io/simr

  10. DEMO TIME
