
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training | Edureka

This Edureka Spark Hadoop Tutorial will help you understand how to use Spark and Hadoop together. This Spark Hadoop tutorial is ideal for beginners as well as professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:

1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)


Presentation Transcript


  1. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  2. What to expect?
  - Spark Overview
  - Hadoop Overview
  - Spark vs Hadoop
  - Why Spark Hadoop?
  - Using Hadoop With Spark
  - Use Case
  - Conclusion

  3. Spark Overview

  4. What is Spark?
  - Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation (Figure: Real Time Processing In Spark)
  - Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
  - It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations (Figure: Data Parallelism In Spark, showing the reduction in time from serial to parallel execution)

  5. Spark Overview
  - Spark is used in real-time processing
  - Polyglot: can be programmed in Scala, Java, Python and R
  - Real-time computation and low latency because of in-memory computation
  - Lazy Evaluation: delays evaluation until needed
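A plain-Scala sketch (not actual Spark code; `LazyEvalDemo` is a made-up name) can illustrate what "delays evaluation till needed" means: like Spark transformations, a lazy `map` only records work, and nothing runs until the result is forced, which plays the role of a Spark action such as `collect()`.

```scala
// Minimal sketch of lazy evaluation using Scala's LazyList (no Spark needed).
object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    var computed = 0
    // Like an RDD transformation, this map is only recorded, not executed
    val doubled = LazyList.from(1).take(5).map { x => computed += 1; x * 2 }
    println(s"after map: $computed computations")    // 0 computations so far
    // Forcing the sequence plays the role of an action such as collect()
    println(doubled.toList)                          // List(2, 4, 6, 8, 10)
    println(s"after force: $computed computations")  // now 5
  }
}
```

The payoff in Spark is the same as here: because nothing runs until it is needed, the engine can see the whole chain of transformations and optimize it before executing.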

  6. Spark Ecosystem
  - Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
  - Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
  - Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
  - MLlib (Machine Learning): machine learning libraries built on top of Spark
  - GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
  - SparkR (R on Spark): package for the R language that enables R users to leverage Spark's power from the R shell

  7. Spark Features
  - Speed: 100x faster than MapReduce for large-scale data processing
  - Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
  - Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
  - Polyglot: can be programmed in Scala, Java, Python and R

  8. Spark Use Cases
  - Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience; sentiment analysis helps in crisis management, service adjusting and target marketing
  - NYSE: Real Time Analysis of Stock Market Data
  - Banking: Credit Card Fraud Detection
  - Genomic Sequencing

  9. Hadoop Overview

  10. What is Hadoop?
  - Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion
  - HDFS (Storage): allows us to dump any kind of data across the cluster
  - MapReduce (Processing): allows parallel processing of the data stored in HDFS
  - A Hadoop cluster consists of a master node and slave nodes

  11. Hadoop Ecosystem

  12. Hadoop Features
  - Reliability: Hadoop infrastructure has in-built fault tolerance features
  - Economical: usage of commodity hardware minimizes the cost of ownership
  - Scalability: in-built capability of integrating seamlessly with cloud-based services
  - Flexibility: flexible with all kinds of data

  13. Hadoop Use Cases
  - E-Commerce Data Analytics
  - Politics: US Presidential Election
  - Banking: Credit Card Fraud Detection
  - Healthcare

  14. Spark vs Hadoop

  15. Spark vs Hadoop
  Use cases for real-time analytics: Healthcare, Stock Market, Telecommunications, Banking, Government
  Our requirements:
  - Process data in real time
  - Handle input from multiple sources
  - Easy to use
  - Faster processing

  16. Spark vs Hadoop: PageRank Performance
  - Spark runs up to 100x faster than Hadoop
  - The in-memory processing in Spark is what makes it faster than MapReduce
  - Spark is not considered a replacement for Hadoop but an extension to it
  (Figure: Iteration time in seconds for Hadoop, basic Spark, and Spark with controlled partitioning)
  The best case as per our chart is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.

  17. Why use Spark with Hadoop?

  18. Why Spark Hadoop?
  Using Spark and Hadoop together helps us leverage Spark's processing while utilizing the best of Hadoop's HDFS and YARN.
  (Figure: Input data in storage formats such as CSV, Sequence File, Avro and Parquet is stored in HDFS; YARN handles resource allocation; Spark and Spark Streaming process the data, with MapReduce as an optional processing framework.)

  19. Using Hadoop with Spark
  - Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
  - Spark applications can also be run on YARN (Hadoop NextGen)
  - Spark can run on top of HDFS to leverage the distributed replicated storage
  - MapReduce and Spark are used together, where MapReduce handles batch processing and Spark handles real-time processing

  20. YARN Deployment With Spark
  YARN Cluster Mode:
  - In YARN-Cluster mode, the Spark driver runs inside an application master process which is managed by YARN
  - The client can go away after initiating the application
  (Figure: Cluster Deployment Mode)
  YARN Client Mode:
  - In YARN-Client mode, the Spark driver runs in the client process
  - The application master is only used for requesting resources from YARN
  (Figure: Client Deployment Mode)
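The two deployment modes above are selected with the `--deploy-mode` flag of `spark-submit`. A minimal sketch, assuming the application class and jar names used in this tutorial (the jar path is a placeholder):

```shell
# Cluster mode: the driver runs inside the YARN application master;
# the client can exit once the application is submitted
spark-submit --master yarn --deploy-mode cluster \
  --class basketball target/basketball-assembly.jar

# Client mode: the driver runs in the local client process;
# useful for interactive work, since driver output appears locally
spark-submit --master yarn --deploy-mode client \
  --class basketball target/basketball-assembly.jar
```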

  21. Use Case – Sports Analysis Using Spark Hadoop

  22. Use Case Problem Statement
  To build a sports analysis system using Spark Hadoop for predicting game results and player rankings for sports like basketball, football, cricket, soccer, etc. We will demonstrate the same using basketball for our use case.
  Players pictured: Stephen Curry (NBA MVP 2015 & 2016), Kevin Durant (NBA MVP 2014), Joe Hassett (highest normalized 3-point score), LeBron James (NBA MVP '10, '12 & '13)

  23. Use Case – Flow Diagram
  1. Huge amount of sports data
  2. Data stored in HDFS
  3. Using Spark processing for analysis
  4. Queries: Query 1 predicts the NBA Most Valuable Player (MVP); Query 2 calculates top scorers per season; Query 3 compares teams to predict winners

  24. Use Case – Dataset

  25. Use Case – Dataset
  Figure: Dataset from http://www.basketball-reference.com/leagues/NBA_2016_per_game.html

  26. Use Case – Initializing Spark Packages

  // Importing the necessary packages (duplicates removed)
  import org.apache.spark.SparkConf
  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd._
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql._
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._
  import org.apache.spark.sql.Row
  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.util.StatCounter
  import org.apache.spark.util.IntParam
  import org.apache.spark.storage.StorageLevel
  import scala.collection.mutable.{HashMap, ListBuffer}
  import scala.io.Source
  import java.io.File

  27. Use Case – Reading Data From HDFS

  // Creating an object basketball containing our main() method
  object basketball {
    def main(args: Array[String]) {
      val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
      val sc = new SparkContext(sparkConf)
      for (i <- 1980 to 2016) {
        println(i)
        val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
        yearStats.filter(x => x.contains(","))
          .map(x => (i, x))
          .saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
      }

  28. Use Case – Parsing Data And Broadcasting

  // Read in all the statistics (same path the previous step wrote to)
  val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear/*/*")
    .repartition(sc.defaultParallelism)

  // Filter out the junk rows and clean up data for errors
  val filteredStats = stats.filter(line => !line.contains("FG%"))
    .filter(line => line.contains(","))
    .map(line => line.replace("*", "").replace(",,", ",0,"))
  filteredStats.cache()

  // Parse statistics and save as a Map
  val txtStat = Array("FG","FGA","FG%","3P","3PA","3P%","2P","2PA","2P%","eFG%","FT","FTA","FT%","ORB","DRB","TRB","AST","STL","BLK","TOV","PF","PTS")
  val aggStats = processStats(filteredStats, txtStat).collectAsMap

  // Collect the RDD into a map and broadcast it as 'broadcastStats'
  val broadcastStats = sc.broadcast(aggStats)
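The helper `processStats` is called here but never shown in the deck. As a hedged sketch of what such a helper might compute, the snippet below aggregates a per-year mean and standard deviation for each tracked statistic, keyed as `"<year>_<stat>"` like the lookups used later (e.g. `"2016_3P_avg"`). `StatLine` and `aggregate` are hypothetical names, and this works on a plain Scala collection rather than an RDD.

```scala
// Hypothetical per-year aggregation, roughly what a processStats-style helper
// would need to produce for later z-score normalization.
case class StatLine(year: Int, values: Map[String, Double])

def aggregate(lines: Seq[StatLine], stats: Seq[String]): Map[String, (Double, Double)] = {
  lines.groupBy(_.year).flatMap { case (year, rows) =>
    stats.map { s =>
      val xs = rows.map(_.values(s))
      val mean = xs.sum / xs.size
      // Population standard deviation for the league that year
      val stdev = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
      (s"${year}_$s", (mean, stdev))
    }
  }
}
```

For example, two 2016 rows with 10.0 and 20.0 points would aggregate to a mean of 15.0 and a standard deviation of 5.0 under the key `"2016_PTS"`.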

  29. Use Case – Player Statistics Transformations

  // Parse stats and track weights
  val txtStatZ = Array("FG","FT","3P","TRB","AST","STL","BLK","TOV","PTS")
  val zStats = processStats(filteredStats, txtStatZ, broadcastStats.value).collectAsMap

  // Collect the RDD into a Map and broadcast it as 'zBroadcastStats'
  val zBroadcastStats = sc.broadcast(zStats)

  // Parse stats and normalize (note: zBroadcastStats must be defined before this step)
  val nStats = filteredStats.map(x => bbParse(x, broadcastStats.value, zBroadcastStats.value))

  // Map the RDD to RDD[Row] so that we can turn it into a DataFrame
  val nPlayer = nStats.map(x => Row.fromSeq(Array(x.name, x.year, x.age, x.position, x.team, x.gp, x.gs, x.mp)
    ++ x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
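The z-scores (`statsZ`, `z3p`, `zTot`) used throughout the queries are a standard normalization: a player's stat minus the league mean for that year, divided by the league standard deviation. A minimal sketch of that formula, assuming the guard for a zero stdev (the function name `zScore` is ours, not the tutorial's):

```scala
// Normalize a raw stat against its league-year mean and standard deviation.
// A z-score of 2.0 means two standard deviations above the league average.
def zScore(value: Double, mean: Double, stdev: Double): Double =
  if (stdev == 0.0) 0.0 else (value - mean) / stdev
```

This is what makes cross-era comparisons (like Joe Hassett's normalized 3-point ranking later in the deck) meaningful: each player is measured against the league of their own season.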

  30. Use Case – Querying through Spark SQL

  31. Use Case – Getting All Player Statistics

  // Create the schema for the DataFrame
  val schemaN = StructType(
    StructField("name", StringType, true) ::
    StructField("year", IntegerType, true) ::
    ...
    StructField("nTOT", DoubleType, true) :: Nil
  )

  // Create DataFrame 'dfPlayersT' and register it as 'tPlayers'
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val dfPlayersT = sqlContext.createDataFrame(nPlayer, schemaN)
  dfPlayersT.registerTempTable("tPlayers")

  // Create DataFrame 'dfPlayers' and register it as 'Players'
  val dfPlayers = sqlContext.sql("select age - min_age as exp, tPlayers.* from tPlayers join (select name, min(age) as min_age from tPlayers group by name) as t1 on tPlayers.name = t1.name order by tPlayers.name, exp")
  dfPlayers.registerTempTable("Players")
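The self-join above derives an `exp` (experience) column: a player's age in a given season minus the youngest age at which that player appears in the data. The same logic can be sketched in plain Scala (the names `Season` and `withExperience` are ours, for illustration only):

```scala
// One row per player-season, mirroring the (name, age) columns of tPlayers.
case class Season(name: String, age: Int)

// Returns (name, age, exp) triples, where exp = age - min(age) per player,
// matching the SQL self-join on min_age above.
def withExperience(seasons: Seq[Season]): Seq[(String, Int, Int)] = {
  val minAge = seasons.groupBy(_.name).map { case (n, ss) => n -> ss.map(_.age).min }
  seasons.map(s => (s.name, s.age, s.age - minAge(s.name)))
}
```

So a player's rookie season gets `exp = 0`, their second season `exp = 1`, and so on.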

  32. Use Case – Storing Best Players Into HDFS

  // Calculate the best players of 2016
  val mvp = sqlContext.sql("Select name, zTot from Players where year=2016 order by zTot desc").cache
  mvp.show

  // Store the best players of 2016 into HDFS
  mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")

  // List the full numbers of LeBron James
  sqlContext.sql("Select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)

  // Rank the top 10 players on the average 3-pointers scored per game in 2016
  sqlContext.sql("select name, 3p, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)

  33. Use Case – Storing Best Players Into HDFS
  Figures: Best Player Of 2016; Most 3 Pointers In 2016; All Stats Of LeBron James

  34. Use Case – Sample Result File in HDFS
  Figure: Output file containing top NBA players of 2016
  Figure: Output directory in the HDFS file system, showing the output directory path

  35. Use Case – Highest 3 Point Shooters

  // All-time 3-point shooting ranking
  sqlContext.sql("select name, 3p, z3p from Players order by 3p desc").take(10).foreach(println)

  // All-time 3-point shooting ranking, normalized to their leagues
  sqlContext.sql("select name, 3p, z3p from Players order by z3p desc").take(10).foreach(println)

  // Calculate the average number of 3-pointers per game in 2016
  broadcastStats.value("2016_3P_avg")

  // Calculate the average number of 3-pointers per game in 1981
  broadcastStats.value("1981_3P_avg")

  36. Use Case – Highest 3 Point Shooters
  Figures: Best All Time 3 Point Shooter; Best All Time 3 Point Shooter Normalized To Their Season

  37. Use Case – Prediction Analysis Results

  38. Use Case – Who Will Be The 2016 NBA MVP?

  sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").take(10).foreach(println)

  Players in the result: LeBron James, James Harden, Dwyane Wade, Kobe Bryant, Russell Westbrook, Stephen Curry

  39. Use Case – Predicting MVP 2016
  As our model predicts, Stephen Curry is the MVP of the NBA in 2016. It matches the actual NBA MVP of 2016.

  40. Summary

  41. Summary
  - Spark Overview
  - Hadoop Overview
  - Spark vs Hadoop
  - Why Spark Hadoop?
  - YARN Spark Deployment
  - Sports Analysis

  42. Conclusion

  43. Conclusion
  Congrats! We have demonstrated the power of Spark Hadoop in prediction analytics. The hands-on examples will give you the confidence to work on any future projects you encounter in Apache Spark and Hadoop.

  44. Thank You … Questions/Queries/Feedback
