
Scala/Spark Review


Presentation Transcript


  1. Scala/Spark Review (6/27/2014): Small Scale Scala Spark Demo, DataRepos, Spark Executors

  2. Scala 2.9.3: partition
  // withoutTen({1, 10, 10, 2}) → {1, 2, 0, 0}
  // withoutTen({10, 2, 10}) → {2, 0, 0}
  // withoutTen({1, 99, 10}) → {1, 99, 0}
  def withoutTen(nums: Array[Int]): Array[Int] = {
    // can't get span or dropWhile/takeWhile to work correctly in Scala 2.9.3
    nums.partition(_ == 10)._2.padTo(nums.size, 0)
  }
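As a side note (not on the slide), the same contract can be met without partition by filtering out the 10s and padding back to the original length; a minimal sketch, with a made-up name withoutTenAlt:

  // Sketch only: drop the 10s, then pad with zeros back to the original size.
  def withoutTenAlt(nums: Array[Int]): Array[Int] =
    nums.filter(_ != 10).padTo(nums.length, 0)

  // withoutTenAlt(Array(1, 10, 10, 2))  // Array(1, 2, 0, 0)
  // withoutTenAlt(Array(10, 2, 10))     // Array(2, 0, 0)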

  3. Partition to split into 2 arrays/lists
  scala> a
  res37: Array[Int] = Array(1, 2, 3, 10, 10, 10, 1)
  scala> a.partition(_ == 10)
  res38: (Array[Int], Array[Int]) = (Array(10, 10, 10),Array(1, 2, 3, 1))
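A small sketch (not on the slide, names are illustrative) showing that the two halves returned by partition can be bound directly with a tuple pattern, and that relative order is preserved in both:

  val a = Array(1, 2, 3, 10, 10, 10, 1)
  val (tens, rest) = a.partition(_ == 10)
  // tens: Array(10, 10, 10)
  // rest: Array(1, 2, 3, 1)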

  4. Span should do the same thing
  • scala> a.span(_ == 10)
  • res44: (Array[Int], Array[Int]) = (Array(),Array(1, 2, 3, 10, 10, 10, 1))
  • Only works when the matching elements sit at the front of the collection
  • scala> a.span(_ == 1)
  • res45: (Array[Int], Array[Int]) = (Array(1),Array(2, 3, 10, 10, 10, 1))
  • takeWhile/dropWhile have the same limitation (see the sketch below)
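The reason is that span(p) is just (takeWhile(p), dropWhile(p)): it stops at the first element that fails the predicate rather than collecting every match. A short sketch of the equivalence, using the same array as the slides:

  val a = Array(1, 2, 3, 10, 10, 10, 1)

  // span stops at the first non-matching element, so the 10s are never reached:
  a.span(_ == 10)                                // (Array(), Array(1, 2, 3, 10, 10, 10, 1))
  (a.takeWhile(_ == 10), a.dropWhile(_ == 10))   // same pair

  // Only the leading 1 is split off; the trailing 1 stays in the second half:
  a.span(_ == 1)                                 // (Array(1), Array(2, 3, 10, 10, 10, 1))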

  5. Using fold/reduce for accumulation
  • foldLeft/foldRight take a binary operator; use a named add function rather than the ( _ + _ ) from the web post, so each step can be traced
  • def add(res: Int, acc: Int) = { println("res:" + res + " acc:" + acc); res + acc }
  • val a = Array(1, 2, 3, 10, 10, 10, 1)
  • Add an if statement to add so that only even values are accumulated:
  • scala> def add(res: Int, acc: Int) = { println("res:" + res + " acc:" + acc); if (acc % 2 == 0) res + acc else res }
  • add: (res: Int, acc: Int)Int
  • scala> a.foldLeft(0)(add)

  6. foldLeft
  scala> a.foldLeft(0)(add)
  • res:0 acc:1
  • res:0 acc:2
  • res:2 acc:3
  • res:2 acc:10
  • res:12 acc:10
  • res:22 acc:10
  • res:32 acc:1
  • res51: Int = 32

  7. reduceLeft
  scala> a.reduceLeft(add)
  • res:1 acc:2
  • res:3 acc:3
  • res:3 acc:10
  • res:13 acc:10
  • res:23 acc:10
  • res:33 acc:1
  • res49: Int = 33
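The two results differ by one because foldLeft starts from the supplied zero, so the leading 1 is run through the even-only check and discarded, while reduceLeft seeds the accumulator with the first element itself and keeps that 1 unconditionally. A self-contained sketch of both calls (the object name is made up):

  object FoldVsReduce extends App {
    val a = Array(1, 2, 3, 10, 10, 10, 1)

    // Accumulate only the even values, tracing each step.
    def add(res: Int, acc: Int): Int = {
      println("res:" + res + " acc:" + acc)
      if (acc % 2 == 0) res + acc else res
    }

    println(a.foldLeft(0)(add))  // 32: every element, including the leading 1, goes through add
    println(a.reduceLeft(add))   // 33: the leading 1 becomes the initial accumulator as-is
  }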

  8. Test using embedded functions
  • Can't add extra logic to _ + _ on the same line
  • Limited to functions which return a boolean, a count, or another data collection
  • For reduceRight/foldRight the arguments have to be reversed: add(acc: Int, res: Int) (see the sketch below)
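A short sketch of the reversed argument order (addR is a made-up name): for foldRight the current element comes first and the running accumulator second, the mirror image of the foldLeft version above:

  def addR(x: Int, acc: Int): Int =
    if (x % 2 == 0) acc + x else acc

  val a = Array(1, 2, 3, 10, 10, 10, 1)
  a.foldRight(0)(addR)   // 32: the even values summed, walking right to left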

  9. Spark
  • CDK, MR parallelism vs Spark Executors in Mesos/YARN
  • Spark Job Server demo
  • Change Dependencies.scala (sketched below):
  • lazy val commonDeps = Seq(...
  •   "org.apache.hadoop" % "hadoop-common" % "2.3.0",
  •   "org.apache.hadoop" % "hadoop-client" % "2.3.0",
  •   "org.apache.hadoop" % "hadoop-hdfs" % "2.3.0"
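The slide only lists the lines being added; a rough sketch of how they might slot into commonDeps in the spark-jobserver build (the file location and the other entries are assumptions, only the three Hadoop coordinates come from the slide):

  // project/Dependencies.scala (sbt build code) -- sketch only
  lazy val commonDeps = Seq(
    // ...the dependencies spark-jobserver already declares stay as they are...
    "org.apache.hadoop" % "hadoop-common" % "2.3.0",
    "org.apache.hadoop" % "hadoop-client" % "2.3.0",
    "org.apache.hadoop" % "hadoop-hdfs"   % "2.3.0"
  )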

  10. Test HDFS access: HelloWorld.scala
  object HelloWorld extends SparkJob {
    def main(args: Array[String]) {
      println("asdf") // won't see this
      val sc = new SparkContext("local[2]", "HelloWorld")
      val config = ConfigFactory.parseString("")
      val results = runJob(sc, config)
      println("results:" + results)
    }
    // validate and runJob continue on slide 12

  11. IPC error
  Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
  Wrong version of the Hadoop client libs: IPC version 9 is the Hadoop 2.x server protocol, while client version 4 means a Hadoop 1.x client jar is still on the classpath, hence the dependency change on slide 9.

  12. Validate, runJob
  def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    Try(config.getString("input.string")).map(x => SparkJobValid).getOrElse(SparkJobInvalid("No input.string config param"))
  }
  override def runJob(sc: SparkContext, config: Config): Any = {
    val dd = sc.textFile("hdfs://localhost:8020/user/dc/books")
    dd.count()
  }
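Putting slides 10 and 12 together, a sketch of what the complete job file might look like against the spark-jobserver 0.3.x API. The imports and the SparkJobInvalid message are reconstructed rather than taken from the slides:

  import com.typesafe.config.{Config, ConfigFactory}
  import org.apache.spark.SparkContext
  import scala.util.Try
  import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

  object HelloWorld extends SparkJob {

    // Only used for a local smoke test; the job server builds its own context
    // and calls validate/runJob directly, so main is never invoked there.
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "HelloWorld")
      val config = ConfigFactory.parseString("")
      println("results:" + runJob(sc, config))
    }

    // Reject the job unless input.string was supplied in the POSTed config.
    override def validate(sc: SparkContext, config: Config): SparkJobValidation =
      Try(config.getString("input.string"))
        .map(_ => SparkJobValid)
        .getOrElse(SparkJobInvalid("No input.string config param"))

    // Count the lines of the HDFS file used in the demo.
    override def runJob(sc: SparkContext, config: Config): Any = {
      val dd = sc.textFile("hdfs://localhost:8020/user/dc/books")
      dd.count()
    }
  }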

  13. Results
  Test in the spark shell first: count the number of lines in the HDFS file books.
  1) sbt package to create a jar
  2) Start the spark job server with >re-start and verify you see a UI at localhost:8090
  3) Load the jar you packaged in 1):
  [dc@localhost spark-jobserver-master]$ curl --data-binary @job-server-tests/target/job-server-tests-0.3.1.jar localhost:8090/jars/test
  OK

  14. Jobserver Hadoop HelloWorld
  4) Run the jar in Spark:
  [dc@localhost spark-jobserver-master]$ curl -d "input.string = a a a a a a a b b" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.HelloWorld'
  {
    "status": "STARTED",
    "result": {
      "jobId": "ce208815-f445-4a77-866c-0be46fdd5df9",
      "context": "70b92cb1-spark.jobserver.HelloWorld"
    }
  }

  15. Query JobServer for results
  [dc@localhost spark-jobserver-master]$ curl localhost:8090/jobs/ce208815-f445-4a77-866c-0be46fdd5df9
  {
    "status": "OK",
    "result": 5
  }
