
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Training | Edureka

This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial: <br><br>1) Big Data Analytics <br>2) What is Apache Spark? <br>3) Why Apache Spark? <br>4) Using Spark with Hadoop <br>5) Apache Spark Features <br>6) Apache Spark Architecture <br>7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX <br>8) Demo: Analyze Flight Data Using Apache Spark





Presentation Transcript


  1. What is Apache Spark?

  2. What to expect?
     1. Why Apache Spark?
     2. Spark Features
     3. Spark Ecosystem
     4. Use Case
     5. Hands-On Examples

  3. Big Data Analytics

  4. Data Generated Every Minute!

  5. Big Data Analytics
     ➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information
     ➢ Big Data Analytics is of two types:
        1. Batch Analytics
        2. Real-Time Analytics
     (Figure: Batch Analytics vs Real-Time Analytics)

  6. Spark For Real-Time Analysis
     Use cases for real-time analytics: Healthcare, Stock Market, Telecommunications, Banking, Government
     Our requirements:
     • Process data in real-time
     • Handle input from multiple sources
     • Easy to use
     • Faster processing

  7. What Is Spark?

  8. What Is Spark?
     • Apache Spark is an open-source cluster-computing framework for real-time processing developed by the Apache Software Foundation (Figure: Real-Time Processing In Spark)
     • Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, turning serial work into parallel work for a reduction in time (Figure: Data Parallelism In Spark)
     • It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations

  9. Why Spark?
     • Speed: 100x faster than Hadoop MapReduce for large-scale data processing
     • Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
     • Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
     • Polyglot: can be programmed in Scala, Java, Python and R

  10. Spark Success Story

  11. Spark Success Story
      • NYSE: real-time analysis of stock market data
      • Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience; sentiment helps in crisis management, service adjusting and target marketing
      • Banking: credit card fraud detection
      • Genomic sequencing

  12. Using Hadoop Through Spark

  13. Spark And Hadoop
      • Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework
      • Spark applications can also be run on YARN (Hadoop NextGen)
      • Spark can run on top of HDFS to leverage the distributed replicated storage
      • MapReduce and Spark are used together, where MapReduce handles batch processing and Spark handles real-time processing
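      As a small, hedged illustration of Spark reading from HDFS (the namenode host and file path below are hypothetical placeholders), the same replicated storage a MapReduce job consumes can be read directly from the spark-shell:

      //Read a file directly from HDFS (hypothetical URI and path)
      val lines = sc.textFile("hdfs://namenode:8020/user/edureka/events.log")
      //Count the error lines across the distributed, replicated blocks
      println(lines.filter(_.contains("ERROR")).count())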

  14. Spark Features: Speed, Multiple Languages, Advanced Analytics, Real Time, Hadoop Integration, Machine Learning

  15. Spark Features
      • Speed: Spark runs up to 100x faster than MapReduce
      • Supports multiple data sources
      • Real-time computation & low latency because of in-memory computation
      • Lazy Evaluation: delays evaluation till needed
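      A minimal sketch of lazy evaluation in the spark-shell (the numbers are purely illustrative): transformations such as map and filter only record lineage, and nothing executes until an action like count is called.

      val numbers = sc.parallelize(1 to 1000000)
      //Transformations: nothing is computed yet, only lineage is recorded
      val squares = numbers.map(n => n.toLong * n)
      val evens = squares.filter(_ % 2 == 0)
      //Action: the whole pipeline is evaluated only now
      println(evens.count())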

  16. Spark Features
      • Hadoop Integration
      • Machine Learning for iterative tasks

  17. Spark Components: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

  18. Spark Components
      • Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components
      • Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
      • Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
      • MLlib (Machine Learning): machine learning libraries built on top of Spark
      • GraphX (Graph Computation): graph computation engine (similar to Giraph); combines data-parallel and graph-parallel concepts
      • SparkR (R on Spark): package for the R language to enable R users to leverage Spark power from the R shell

  19. Spark Components
      • DataFrames: a tabular data abstraction introduced by Spark SQL
      • ML Pipelines: make it easier to combine multiple algorithms or workflows

  20. Spark Core

  21. Spark Core
      Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
      • Memory management and fault recovery
      • Scheduling, distributing and monitoring jobs on a cluster
      • Interacting with storage systems
      (Figure: Spark Core Job Cluster)
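      As a hedged sketch of the memory management Spark Core exposes to users (the HDFS path is a hypothetical placeholder), an RDD can be kept in memory with persist and reused across actions:

      import org.apache.spark.storage.StorageLevel

      val logs = sc.textFile("hdfs://namenode:8020/logs/app.log") //hypothetical path
      //Ask Spark Core to cache the RDD in memory, spilling to disk if it does not fit
      logs.persist(StorageLevel.MEMORY_AND_DISK)
      //The first action materializes and caches; later actions reuse the cached partitions
      println(logs.count())
      println(logs.filter(_.contains("WARN")).count())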

  22. Spark Architecture
      A Spark cluster consists of a Driver Program (holding the Spark Context), a Cluster Manager, and Worker Nodes; each worker runs an Executor with a Cache, and each executor runs Tasks.
      (Figure: Components of a Spark cluster)
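      In code, the driver program is simply the application that creates the Spark Context; a minimal standalone sketch (the app name and master URL are hypothetical placeholders):

      import org.apache.spark.{SparkConf, SparkContext}

      object ArchitectureDemo {
        def main(args: Array[String]): Unit = {
          //The driver creates the SparkContext, which asks the cluster manager
          //(Standalone, YARN or Mesos) for executors on the worker nodes
          val conf = new SparkConf()
            .setAppName("ArchitectureDemo")
            .setMaster("spark://master-host:7077") //hypothetical Standalone master
          val sc = new SparkContext(conf)

          //Tasks of this job run inside the executors
          println(sc.parallelize(1 to 100).sum())
          sc.stop()
        }
      }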

  23. Spark Streaming

  24. Spark Streaming
      • Spark Streaming is used for processing real-time streaming data
      • It is a useful addition to the core Spark API
      • Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams
      • The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data
      (Figure: Streams In Spark Streaming)
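      A minimal DStream sketch, assuming a socket text source on a hypothetical host and port, that counts words over one-second batches:

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      //Reuse the existing SparkContext with a batch interval of 1 second
      val ssc = new StreamingContext(sc, Seconds(1))

      //A DStream is a series of RDDs, one per batch (host/port are placeholders)
      val lines = ssc.socketTextStream("localhost", 9999)
      val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      wordCounts.print()

      ssc.start()
      ssc.awaitTermination()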

  25. Spark Streaming
      (Figure: Overview Of Spark Streaming: streaming and static data sources feed Spark Streaming, which interoperates with Spark SQL (SQL + DataFrames) and MLlib (Machine Learning) before writing to data storage systems)

  26. Spark Streaming
      (Figure: Incoming streams of data divided into batches: an input data stream enters the streaming engine as batches of input data and leaves as batches of processed data)
      (Figure: Data from a variety of sources, such as Kafka, Flume, HDFS/S3, Kinesis and Twitter, to various storage systems such as HDFS, databases and dashboards)
      (Figure: Input data stream divided into discrete chunks of data: a DStream is a sequence of RDDs, e.g. RDD @ Time 1 through RDD @ Time 4)
      (Figure: Extracting words from an InputStream: a flatMap operation turns a lines DStream into a words DStream)

  27. Spark SQL

  28. Spark SQL Features
      1. Spark SQL integrates relational processing with Spark's functional programming
      2. Spark SQL is used for structured/semi-structured data analysis in Spark
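      A short sketch of this mix of relational and functional code, assuming a Spark 2.x spark-shell where a SparkSession named spark is available and people.json is a hypothetical input file:

      import spark.implicits._

      //DataFrame from semi-structured JSON (hypothetical file)
      val people = spark.read.json("people.json")
      people.createOrReplaceTempView("people")

      //Relational processing with SQL...
      val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
      //...freely mixed with functional transformations
      adults.filter($"age" < 65).show()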

  29. Spark SQL Features
      3. Support for various data formats
      4. SQL queries can be converted into RDDs for transformations; invoking RDD 2 computes all partitions of RDD 1 (Figure: RDD 1 to RDD 2 via a shuffle transform, with a drop split point)

  30. Spark SQL Overview
      5. Performance and scalability

  31. Spark SQL Features
      6. Standard JDBC/ODBC connectivity
      7. User Defined Functions: let users define new column-based functions to extend the Spark vocabulary
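      A hedged sketch of a user defined function (Spark 2.x API assumed; the function name and the people table are hypothetical):

      //Define and register a simple column-based function
      spark.udf.register("initials", (name: String) =>
        name.split(" ").map(_.head).mkString("."))

      //Once registered, the UDF is usable from SQL like any built-in function
      spark.sql("SELECT initials(name) FROM people").show()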

  32. Spark SQL Flow Diagram
      • Spark SQL has the following libraries:
        1. Data Source API
        2. DataFrame API
        3. Interpreter & Optimizer
        4. SQL Service
      • The flow diagram represents a Spark SQL process using all four libraries in sequence: the Data Source API, the DataFrame API (named columns), the Interpreter & Optimizer, and the SQL Service, operating over a Resilient Distributed Dataset

  33. MLlib

  34. MLlib
      Machine Learning may be broken down into two classes of algorithms:
      • Supervised: supervised algorithms use labelled data in which both the input and output are provided to the algorithm
        - Classification: Naïve Bayes, SVM
        - Regression: Linear, Logistic
      • Unsupervised: unsupervised algorithms do not have the outputs in advance; these algorithms are left to make sense of the data without labels
        - Clustering: K-Means
        - Dimensionality Reduction: Principal Component Analysis, SVD

  35. MLlib - Techniques
      There are 3 common techniques for Machine Learning:
      1. Classification: a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes. Common use cases for classification include: i) credit card fraud detection, ii) email spam detection
      2. Clustering: in clustering, an algorithm groups objects into categories by analyzing similarities between input examples
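      As a hedged clustering sketch with the RDD-based MLlib API (the input file and the parameters are hypothetical), K-Means groups points by similarity:

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      //Each line holds space-separated numeric features (hypothetical file)
      val data = sc.textFile("kmeans_data.txt")
      val parsed = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

      //Group the points into 2 clusters using 20 iterations
      val model = KMeans.train(parsed, 2, 20)
      model.clusterCenters.foreach(println)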

  36. MLlib - Techniques
      3. Collaborative Filtering: collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part)
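      A minimal collaborative filtering sketch using MLlib's ALS implementation (the ratings file, rank and iteration count are hypothetical):

      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      //Each line: userId,productId,rating (hypothetical file)
      val ratings = sc.textFile("ratings.csv").map { line =>
        val Array(user, product, rating) = line.split(',')
        Rating(user.toInt, product.toInt, rating.toDouble)
      }

      //Train a matrix factorization model (rank 10, 10 iterations),
      //then recommend 5 products for user 1
      val model = ALS.train(ratings, 10, 10)
      model.recommendProducts(1, 5).foreach(println)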

  37. GraphX

  38. GraphX – Graph Concepts
      A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them. (Figure: vertices John and Sam joined by an edge with the relationship "Friends")
      A directed graph is a graph where the edges have a direction associated with them. E.g. user Sam follows John on Twitter. (Figure: Sam follows John)

  39. GraphX – Triplet View
      GraphX has a Graph class that contains members to access edges and vertices. The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class. (Figure: the vertices, edges and triplets views of a graph)

  40. GraphX – Property Graph
      GraphX is the Spark API for graphs and graph-parallel computation. GraphX extends the Spark RDD with a Resilient Distributed Property Graph. The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices.
      (Figure: a property graph with vertex properties such as LAX and SJC, and edge properties)

  41. GraphX – Example
      To understand GraphX, let us consider the below graph.
      • The vertices have the names and ages of people: Alice (28), Bob (27), Charlie (65), David (42), Ed (55) and Fran (50).
      • The edges represent whether a person likes another person, and the edge weight is a measure of the likeability.
      (Figure: a property graph of the six people with weighted "likes" edges)

  42. GraphX – Example: Display names and ages
      //Assumed setup (not shown on the slide): vertex and edge arrays
      //reconstructed from the graph on the previous slide; weights are illustrative
      import org.apache.spark.graphx._
      import org.apache.spark.rdd.RDD

      val vertexArray = Array((1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
        (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50)))
      val edgeArray = Array(Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3),
        Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3))

      val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
      val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
      val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

      //Keep only people older than 30 and print their names and ages
      graph.vertices.filter { case (id, (name, age)) => age > 30 }
        .collect.foreach { case (id, (name, age)) => println(s"$name is $age") }

      Output:
      David is 42
      Fran is 50
      Ed is 55
      Charlie is 65

  43. GraphX – Example: Display relations
      //For each triplet, print who likes whom
      for (triplet <- graph.triplets.collect) {
        println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
      }

      Output:
      Bob likes Alice
      Bob likes David
      Charlie likes Bob
      Charlie likes Fran
      David likes Alice
      Ed likes Bob
      Ed likes Charlie
      Ed likes Fran

  44. Use Case: Analyze Flight Data Using Spark GraphX

  45. Use Case: Problem Statement
      Problem Statement: to analyse real-time flight data using Spark GraphX, provide near real-time computation results and visualize the results using Google Data Studio.
      Computations to be done:
      • Compute the total number of flight routes
      • Compute and sort the longest flight routes
      • Display the airport with the highest degree vertex
      • List the most important airports according to PageRank
      • List the routes with the lowest flight costs
      We will use Spark GraphX for the above computations and visualize the results using Google Data Studio; a sketch of the degree and PageRank queries follows below.
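      Once the route graph is built (its construction begins on the next slides), the degree and PageRank computations reduce to short GraphX calls; a hedged sketch, assuming a graph: Graph[String, Int] whose vertices carry airport names and whose edges carry distances:

      //Airport with the highest degree vertex (most routes in and out)
      graph.degrees.join(graph.vertices)
        .sortBy(_._2._1, ascending = false).take(1)
        .foreach { case (id, (degree, name)) => println(s"$name: $degree routes") }

      //Most important airports according to PageRank (0.05 tolerance is illustrative)
      val ranks = graph.pageRank(0.05).vertices
      ranks.join(graph.vertices)
        .sortBy(_._2._1, ascending = false).take(3)
        .foreach { case (id, (rank, name)) => println(s"$name: $rank") }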

  46. Use Case: Flight Dataset
      The attributes of each particular row are as below:
      1. Day Of Month
      2. Day Of Week
      3. Carrier Code
      4. Unique ID - Tail Number
      5. Flight Number
      6. Origin Airport ID
      7. Origin Airport Code
      8. Destination Airport ID
      9. Destination Airport Code
      10. Scheduled Departure Time
      11. Actual Departure Time
      12. Departure Delay In Minutes
      13. Scheduled Arrival Time
      14. Actual Arrival Time
      15. Arrival Delay Minutes
      16. Elapsed Time
      17. Distance
      (Figure: USA Airport Flight Data)

  47. Use Case: Flow Diagram
      1. Huge amount of flight data
      2. Database storing real-time flight data
      3. Creating a graph using GraphX
      4. Queries: compute the longest flight routes (Query 1), calculate the top busiest airports (Query 2), calculate routes with the lowest flight costs (Query 3)
      5. Visualizing the USA flight mapping using Google Data Studio

  48. Use Case: Starting Spark Shell
      //Importing the necessary classes
      import org.apache.spark._
      import org.apache.spark.rdd.RDD
      import org.apache.spark.util.IntParam
      import org.apache.spark.graphx._
      import org.apache.spark.graphx.util.GraphGenerators

      //Creating a case class 'Flight'
      case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
        flnum: Int, org_id: Long, origin: String, dest_id: Long, dest: String,
        crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
        arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

      //Defining a 'parseFlight' function to parse an input CSV line into a 'Flight' object
      def parseFlight(str: String): Flight = {
        val line = str.split(",")
        Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5).toLong,
          line(6), line(7).toLong, line(8), line(9).toDouble, line(10).toDouble,
          line(11).toDouble, line(12).toDouble, line(13).toDouble, line(14).toDouble,
          line(15).toDouble, line(16).toInt)
      }

  49. Use Case: Starting Spark Shell
      (Screenshot: the above imports, case class and parse function being executed step-by-step, 1–7, in the spark-shell)

  50. Use Case: Creating Edges For Graph Mapping
      //Load the data into an RDD 'textRDD'
      val textRDD = sc.textFile("/home/edureka/Downloads/AirportDataset.csv")

      //Parse the RDD of CSV lines into an RDD of Flight classes
      val flightsRDD = textRDD.map(parseFlight).cache()

      //Create airports RDD with ID and name
      val airports = flightsRDD.map(flight => (flight.org_id, flight.origin)).distinct
      airports.take(1)

      //Defining a default vertex called 'nowhere' and mapping airport ID to name for printlns
      val nowhere = "nowhere"
      val airportMap = airports.map { case ((org_id), name) => (org_id -> name) }.collect.toList.toMap

      //Create routes RDD with sourceID, destinationID and distance
      val routes = flightsRDD.map(flight => ((flight.org_id, flight.dest_id), flight.dist)).distinct
      routes.take(2)

      //Create edges RDD with sourceID, destinationID and distance
      val edges = routes.map { case ((org_id, dest_id), distance) => Edge(org_id.toLong, dest_id.toLong, distance) }
      edges.take(1)
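      A hedged sketch of the step that would naturally follow: assembling the property graph from the airports and edges RDDs (using the 'nowhere' default vertex defined above) and answering the first two queries.

      //Build the graph: airport vertices, route edges, 'nowhere' as the default vertex
      val graph = Graph(airports, edges, nowhere)

      //Query: total number of distinct flight routes
      println(s"Total routes: ${graph.numEdges}")

      //Query: the longest flight routes, sorted by distance
      graph.triplets.sortBy(_.attr, ascending = false).take(3)
        .foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}: ${t.attr} miles"))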
