
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka

This Edureka Spark SQL Tutorial will help you understand how Apache Spark brings the power of SQL to real-time processing. The tutorial also demonstrates a use case on stock market analysis using Spark SQL. Below are the topics covered in this tutorial:

1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL


Presentation Transcript


  1. Spark SQL Tutorial

  2. What to expect?
     1) Why Spark SQL
     2) Spark SQL Libraries
     3) Spark SQL Features
     4) Hands-On Examples
     5) Use Case

  3. Why Spark SQL?

  4. Why Do We Need Spark SQL? Spark SQL was built to overcome the limitations of Apache Hive running on top of Spark. Limitations of Apache Hive:
     - Hive uses MapReduce, which lags in performance on medium and small sized datasets (< 200 GB)
     - No resume capability
     - Hive cannot drop encrypted databases

  5. Spark SQL Advantages Over Hive
     1) Faster execution: a batch query that takes 600 secs in Hive runs in about 50 secs with Spark SQL
     2) No migration hurdles
     3) Real-time querying
     How? Spark SQL uses the metastore services of Hive to query the data stored and managed by Hive.
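To make the "no migration hurdles" point concrete, here is a minimal sketch (not from the slides) of enabling Hive support on a SparkSession so that existing Hive-managed tables remain queryable; the table name some_hive_table is a placeholder.

// Point Spark SQL at an existing Hive metastore instead of migrating data.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive metastore example")
  .enableHiveSupport()   // reuse Hive's metastore, so Hive tables stay visible
  .getOrCreate()

// Query a table that Hive created and manages.
spark.sql("SELECT * FROM some_hive_table LIMIT 10").show()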

  6. Spark SQL Success Story

  7. Spark SQL Success Story
     - Twitter sentiment analysis with Spark SQL: sentiment helps in crisis management, service adjusting and target marketing; trending topics can be used to create campaigns and attract a larger audience
     - NYSE: real-time analysis of stock market data
     - Banking: credit card fraud detection
     - Genomic sequencing

  8. Spark SQL Features
     - SQL integration with Spark
     - Uniform data access
     - Seamless support for transformations
     - Performance
     - Standard connectivity
     - User defined functions

  9. Spark SQL Features
     1) Spark SQL integrates relational processing with Spark's functional programming.
     2) Spark SQL is used for structured/semi-structured data analysis in Spark.

  10. Spark SQL Features
     3) Support for various data formats
     4) SQL queries can be converted into RDDs for transformations
     (Diagram: a shuffle transform produces RDD 2 from RDD 1; invoking RDD 2 computes all partitions of RDD 1.)
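As an illustration of point 4 (a sketch, assuming the 'employee' temporary view created later in this deck): the result of a SQL query is a DataFrame, and its .rdd accessor exposes the underlying RDD[Row] for low-level transformations.

// Drop a query result down to an RDD for plain RDD-style transformations.
val resultDF  = spark.sql("SELECT name, age FROM employee")
val resultRDD = resultDF.rdd                              // RDD[org.apache.spark.sql.Row]
resultRDD.map(row => "Name: " + row.getString(0)).take(5) // ordinary RDD map/take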

  11. Spark SQL Overview
     5) Performance and scalability

  12. Spark SQL Features
     6) Standard JDBC/ODBC connectivity
     7) User Defined Functions let users define new column-based functions to extend the Spark vocabulary

  13. UDF Example: creating a UDF 'toUpperCase' to convert a string to upper case, then registering the UDF in the list of functions.
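The slide's screenshot is not reproduced in the transcript; below is a minimal sketch of what it describes. The 'df' DataFrame and 'employee' view are the ones built on later slides.

import org.apache.spark.sql.functions.udf

// Creating the UDF: wrap an ordinary Scala function in a column expression.
val toUpperCase = udf((s: String) => s.toUpperCase)

// Registering the UDF so it can also be called by name from SQL.
spark.udf.register("toUpperCase", (s: String) => s.toUpperCase)

// Using it from the DataFrame API and from SQL:
df.select(toUpperCase(df("name"))).show()
spark.sql("SELECT toUpperCase(name) FROM employee").show()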

  14. Spark SQL Architecture

  15. Spark SQL Architecture. Architecture of Spark SQL: Spark SQL & HQL sit on top of the DataFrame DSL and DataFrame API, which in turn use the Data Source API to read sources such as CSV, JSON and JDBC.

  16. Spark SQL Libraries

  17. Spark SQL Libraries. Spark SQL has the following libraries:
     1) Data Source API
     2) DataFrame API
     3) Interpreter & Optimizer
     4) SQL Service

  18. Data Source API

  19. Data Source API. The Data Source API is used to read and store structured and semi-structured data into Spark SQL. Features:
     - Structured/semi-structured data
     - Multiple formats
     - 3rd-party integration
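A short sketch of the "multiple formats" point; file paths and JDBC connection details below are placeholders, not from the slides.

// Reading the same kind of data through the Data Source API from different formats.
val csvDF  = spark.read.option("header", "true").csv("data/people.csv")
val jsonDF = spark.read.json("data/people.json")
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/db")   // assumed connection URL
  .option("dbtable", "people")
  .option("user", "user").option("password", "password")
  .load()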

  20. DataFrame API

  21. DataFrame API. The DataFrame API converts the data read through the Data Source API into tabular columns to help perform SQL operations. Features:
     - Distributed collection of data organized into named columns
     - Equivalent to a relational table in SQL
     - Lazily evaluated
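A small sketch of lazy evaluation, assuming the 'df' DataFrame created from employee.json on slide 28.

// select/where only build up a logical plan; no Spark job runs yet.
val adults = df.select("name", "age").where(df("age") >= 18)

// The action triggers planning and execution, and prints the result.
adults.show()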

  22. SQL Interpreter & Optimizer

  23. SQL Interpreter & Optimizer. The SQL Interpreter & Optimizer handles the functional programming part of Spark SQL. It transforms DataFrames into RDDs to get the required results in the required formats. Features:
     - Functional programming
     - Transforming trees
     - Faster than plain RDDs
     - Processes data of all sizes
     e.g. Catalyst: a modular library for optimization
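One way to watch the interpreter and optimizer at work (a sketch, assuming the 'employee' view defined later in the deck) is to ask Spark to print the plans Catalyst produces.

// explain(true) prints the parsed, analyzed and optimized logical plans,
// plus the physical plan Catalyst selects for execution.
spark.sql("SELECT name FROM employee WHERE age BETWEEN 18 AND 30").explain(true)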

  24. SQL Service

  25. SQL Service
     - SQL Service is the entry point for working with structured data in Spark
     - SQL Service is used to fetch the result from the interpreted and optimized data
     We have thus used all four libraries in sequence. This completes a Spark SQL process.

  26. Querying Using Spark SQL
     - Starting Up Spark Shell
     - Creating Dataset
     - Adding Schema To RDD
     - JSON Dataset
     - Hive Tables

  27. Starting Up Spark Shell

  28. Starting Up Spark Shell - Initialization

//We first import SparkSession into the shell.
import org.apache.spark.sql.SparkSession

//Creating a SparkSession 'spark' using the builder() function.
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

//Importing the implicits object of our 'spark' session.
import spark.implicits._

//Creating a DataFrame 'df' from the 'employee.json' file.
val df = spark.read.json("examples/src/main/resources/employee.json")

//Displaying the DataFrame 'df'. The result is a table of ages and names from our 'employee.json' file.
df.show()
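A note on running this (not on the slide): inside the interactive spark-shell a SparkSession is already created for you as the variable spark, and recent Spark versions pre-import spark.implicits._ as well, so the builder() call above is chiefly needed in standalone applications. The shell itself is launched from the Spark installation directory:

$ ./bin/spark-shell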

  29. Starting Up Spark Shell – Spark Session

  30. Creating Datasets

  31. Creating Dataset - Case Class & Dataset

After understanding DataFrames, let us now move on to the Dataset API. The code below creates a Dataset in Spark SQL.

//Creating a case class 'Employee' to store the name and age of an employee.
case class Employee(name: String, age: Long)

//Creating a Dataset 'caseClassDS' to store the record of Andrew.
val caseClassDS = Seq(Employee("Andrew", 55)).toDS()

//Displaying the Dataset 'caseClassDS'.
caseClassDS.show()

//Creating a primitive Dataset to demonstrate mapping of DataFrames into Datasets.
val primitiveDS = Seq(1, 2, 3).toDS()

//Incrementing each element and collecting the result into an array.
primitiveDS.map(_ + 1).collect()
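For reference, the last line above returns Array(2, 3, 4): the map increments each element of the sequence, and collect() brings the results back to the driver as an array.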

  32. Creating Dataset - Case Class & Dataset

  33. Creating Dataset – Reading File

//Setting the path to our JSON file 'employee.json'.
val path = "examples/src/main/resources/employee.json"

//Creating a Dataset 'employeeDS' from the file.
val employeeDS = spark.read.json(path).as[Employee]

//Displaying the contents of the 'employeeDS' Dataset.
employeeDS.show()

  34. Creating Dataset – Reading File

  35. Adding Schema To RDDs

  36. Adding Schema To RDDs – Initialization

//Importing the ExpressionEncoder for RDDs, the Encoder library and the implicits object into the shell.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
import spark.implicits._

//Creating an 'employeeDF' DataFrame from 'employee.txt', splitting the columns on the comma delimiter ','.
val employeeDF = spark.sparkContext.textFile("examples/src/main/resources/employee.txt").map(_.split(",")).map(attributes => Employee(attributes(0), attributes(1).trim.toInt)).toDF()

//Creating the temporary view 'employee'.
employeeDF.createOrReplaceTempView("employee")

//Defining a DataFrame 'youngstersDF' which will contain all the employees between the ages of 18 and 30.
val youngstersDF = spark.sql("SELECT name, age FROM employee WHERE age BETWEEN 18 AND 30")

//Mapping over 'youngstersDF' to display the names of the youngsters.
youngstersDF.map(youngster => "Name: " + youngster(0)).show()

  37. Adding Schema To RDDs – Initialization

  38. Adding Schema To RDDs - Transformation

//Converting the mapped names to strings for transformations.
youngstersDF.map(youngster => "Name: " + youngster.getAs[String]("name")).show()

//Using a Kryo-based mapEncoder from the Encoders class to map each row to a Map of names and ages.
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]

//Mapping the names to the ages in our 'youngstersDF' DataFrame. The result is an array with names mapped to their respective ages.
youngstersDF.map(youngster => youngster.getValuesMap[Any](List("name", "age"))).collect()

  39. Adding Schema To RDDs - Transformation

  40. Adding Schema – Reading File & Adding Schema

//Importing the 'types' package into the Spark Shell.
import org.apache.spark.sql.types._

//Importing the 'Row' class into the Spark Shell. Row is used in mapping the RDD to the schema.
import org.apache.spark.sql.Row

//Creating an RDD 'employeeRDD' from the text file 'employee.txt'.
val employeeRDD = spark.sparkContext.textFile("examples/src/main/resources/employee.txt")

//Defining the schema string as "name age". This is used to name the columns of the RDD.
val schemaString = "name age"

//Defining 'fields', an array of StructFields obtained by mapping each name in 'schemaString' to a nullable string field.
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))

//Building the StructType 'schema' from 'fields'.
val schema = StructType(fields)

  41. Adding Schema – Reading File & Adding Schema

  42. Adding Schema – Transformation Result

//We now create an RDD called 'rowRDD' by transforming 'employeeRDD' with the 'map' function.
val rowRDD = employeeRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))

//We define a DataFrame 'employeeDF' by applying the schema to 'rowRDD'.
val employeeDF = spark.createDataFrame(rowRDD, schema)

//Creating a temporary view 'employee' from 'employeeDF'.
employeeDF.createOrReplaceTempView("employee")

//Performing a SQL query on 'employee' to fetch its contents.
val results = spark.sql("SELECT name FROM employee")

//Displaying the names returned by the previous query on the 'employee' view.
results.map(attributes => "Name: " + attributes(0)).show()

  43. Adding Schema – Transformation Result

  44. JSON Dataset

  45. JSON Data – Loading File

//Importing the implicits object into the shell.
import spark.implicits._

//Creating an 'employeeDF' DataFrame from our 'employee.json' file.
val employeeDF = spark.read.json("examples/src/main/resources/employee.json")

  46. JSON Data – Loading File

  47. JSON Data – Parquet File

//Writing the DataFrame out as a Parquet file.
employeeDF.write.parquet("employee.parquet")

//Reading the Parquet file back into a DataFrame.
val parquetFileDF = spark.read.parquet("employee.parquet")

//Creating a 'parquetFile' temporary view of the DataFrame.
parquetFileDF.createOrReplaceTempView("parquetFile")

//Selecting the names of people between the ages of 18 and 30 from our Parquet file.
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 18 AND 30")

//Displaying the result of the Spark SQL operation.
namesDF.map(attributes => "Name: " + attributes(0)).show()

  48. JSON Data – Parquet File

  49. JSON Dataset – Creating DataFrame

//Setting the path to our 'employee.json' file.
val path = "examples/src/main/resources/employee.json"

//Creating a DataFrame 'employeeDF' from our JSON file.
val employeeDF = spark.read.json(path)

//Printing the schema of 'employeeDF'.
employeeDF.printSchema()

//Creating a temporary view 'employee' from the DataFrame.
employeeDF.createOrReplaceTempView("employee")

//Defining a DataFrame 'youngsterNamesDF' which stores the names of all the employees between the ages of 18 and 30 present in 'employee'.
val youngsterNamesDF = spark.sql("SELECT name FROM employee WHERE age BETWEEN 18 AND 30")

//Displaying the contents of our DataFrame.
youngsterNamesDF.show()

  50. JSON Dataset – Creating DataFrame
