
An introduction to Apache Spark

An introduction to Apache Spark: what it is, how it works, why use it, and some examples of its use.


Presentation Transcript


  1. Apache Spark
  • What is it?
  • How does it work?
  • Benefits
  • Tuning
  • Examples
  www.semtech-solutions.co.nz info@semtech-solutions.co.nz

  2. Spark – What is it?
  • Open source
  • An alternative to MapReduce for certain applications
  • A low-latency cluster computing system
  • For very large data sets
  • May be up to 100 times faster than MapReduce for
    • Iterative algorithms
    • Interactive data mining
  • Used with Hadoop / HDFS
  • Released under the BSD License

  3. Spark – How does it work?
  • Uses in-memory cluster computing
  • Memory access is faster than disk access
  • Has APIs written in
    • Scala
    • Java
    • Python
  • Can be accessed from the Scala and Python shells
  • Currently an Apache incubator project
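The in-memory model above can be sketched with the classic RDD caching pattern. This is a minimal sketch, assuming the current `org.apache.spark` package layout (the incubator-era deck used the older `spark.SparkContext` import); the app name, master URL and file path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Cache Sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Load a text file and mark it for in-memory caching
    val lines = sc.textFile("/var/log/syslog").cache()

    // The first action reads from disk and populates the cache;
    // the second action reuses the in-memory partitions
    val total  = lines.count()
    val errors = lines.filter(_.contains("error")).count()

    println(s"$errors of $total lines mention 'error'")
    sc.stop()
  }
}
```

This reuse of cached partitions across actions is where the speed-up over MapReduce comes from for iterative and interactive workloads.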

  4. Spark – Benefits
  • Scales to very large clusters
  • Uses in-memory processing for increased speed
  • High-level APIs
    • Java, Scala, Python
  • Low-latency shell access

  5. Spark – Tuning
  • Bottlenecks can occur in the cluster via
    • CPU, memory or network bandwidth
  • Tune the data serialization method, e.g.
    • Java ObjectOutputStream vs Kryo
  • Memory tuning
    • Use primitive types
    • Set JVM flags
    • Store objects in serialized form, e.g.
      • RDD persistence
      • MEMORY_ONLY_SER
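The two tuning knobs named above (Kryo serialization and serialized RDD persistence) can be sketched as follows. This assumes the current `org.apache.spark` API; the app name and file path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Tuning Sketch")
      .setMaster("local[*]")
      // Kryo is generally faster and more compact than
      // Java's default ObjectOutputStream serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Persist partitions as serialized byte arrays: slightly slower
    // to access, but far more space-efficient in memory than
    // deserialized Java objects
    val data = sc.textFile("/var/log/syslog")
    data.persist(StorageLevel.MEMORY_ONLY_SER)

    println(data.count())
    sc.stop()
  }
}
```

Serialized persistence trades CPU (deserialization on access) for memory footprint, which helps when cached data would otherwise not fit on the cluster.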

  6. Spark – Examples
  Example from spark-project.org: a Spark job in Scala, showing a simple text count from a system log.

    /*** SimpleJob.scala ***/
    import spark.SparkContext
    import SparkContext._

    object SimpleJob {
      def main(args: Array[String]) {
        val logFile = "/var/log/syslog" // Should be some file on your system
        val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
          List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }

  7. Contact Us
  • Feel free to contact us at
    • www.semtech-solutions.co.nz
    • info@semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You pay only for the hours that you need to solve your problems
