
Containerized Spark at RBC

Learn how containerizing Spark at RBC enables customizability, provisioning, predictability, security, and infrastructure optimization for efficient big data processing. Explore containerized Spark deployment modes, cluster architecture, and comparison with native Kubernetes and YARN. Discover how Kerberized HDFS ensures data storage security and see the benefits of containerized Spark on Openshift with logging, monitoring, and improved scalability.


Presentation Transcript


  1. Containerized Spark at RBC Raj Channa & Dhwanil Raval

  2. Big Data and Spark — components of the platform stack: Security • Data Formats (Parquet, Avro, ORC, Arrow) • Metadata Management • Coordination & Management • Scheduler • SQL-over-Hadoop • Scripting • Stream Processing • Machine Learning • In-Memory Processing • NoSQL Database • Search Engine • Data Piping • Resource Management • Storage

  3. Why Containerize Spark? • Customizability: Ability to run different versions of Spark on the same platform • Provisioning: Provision a Spark cluster on demand in an automated manner for each production job • Predictability: Dedicated cores allow for consistent run times and predictable SLAs for batch jobs • Security: Vulnerability assessment and scanning of containers at deployment time • Infrastructure Optimization: Efficient infrastructure utilization and resource sharing

  5. High-level overview of what we ended up doing: Spark workloads (Spark jobs, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX) run on the Spark Core Engine with the Spark Standalone scheduler, deployed alongside other container workloads on Kubernetes — chosen over YARN and Mesos as the resource manager.

  5. Apache Spark deployment modes and building blocks (driver/master and executor/worker). In cluster deploy mode, spark-submit is issued from the job-launching environment and the Spark driver runs on a cluster node alongside the executors. In client deploy mode, the driver runs inside the job-launching environment with spark-submit, and only the executors run on cluster nodes.
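The two deploy modes above differ only in the `--deploy-mode` flag passed to spark-submit; a minimal sketch, where the master URL, class name, and jar path are placeholders for this environment:

```shell
# Cluster mode: the driver is launched on a node inside the cluster.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar

# Client mode (the default): the driver runs in the submitting process,
# and only executors are launched on cluster nodes.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --class com.example.MyJob \
  my-job.jar
```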

  6. Openshift Architecture

  7. Spark cluster on OpenShift in cluster mode with Kubernetes (K8S) as the resource manager: the client runs spark-submit against the Kubernetes API, which schedules the driver pod (D) and the executor pods (E).
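With native Kubernetes support (Spark 2.3+), spark-submit talks directly to the Kubernetes API server using a `k8s://` master URL. A sketch of such a submission — the API server URL, namespace, image, and jar path are placeholders:

```shell
spark-submit \
  --master k8s://https://openshift-api.example.com:6443 \
  --deploy-mode cluster \
  --name my-spark-job \
  --class com.example.MyJob \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4 \
  local:///opt/spark/jars/my-job.jar
```

Kubernetes then creates one driver pod and the requested number of executor pods, and tears them down when the job completes.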

  8. What did we learn?

  9. Oshinko using the Spark Standalone cluster manager in client mode: spark-submit targets the Spark master (not K8S). The Oshinko CLI deploys a Spark cluster with a master pod (M) and zero worker replicas, then scales up the worker pods (W) before the client submits the job.
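In this setup the submission goes to the Spark master service created by Oshinko rather than to the Kubernetes API. A sketch, assuming a cluster named `my-cluster` whose master service listens on the standard standalone port 7077 (names are placeholders; the exact Oshinko CLI flags for creating and scaling a cluster vary by version and are not shown here):

```shell
# Submit to the standalone Spark master service, not to Kubernetes.
spark-submit \
  --master spark://my-cluster:7077 \
  --deploy-mode client \
  --class com.example.MyJob \
  my-job.jar
```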

  10. How did Oshinko compare against native K8S? The deciding criterion was support for keytab-based HDFS security in standalone mode (https://issues.apache.org/jira/browse/SPARK-5158).

  11. Kerberized HDFS as storage for Spark — how YARN does this: 1. The Spark client gets a Kerberos ticket from the KDC using kinit. 2. The Spark client executes spark-submit with the --keytab and --principal options, and the YARN Application Master obtains HDFS delegation tokens. 3. YARN distributes the HDFS delegation tokens to all worker nodes (node managers running the Spark executors). 4. Spark reads and stores data from and to HDFS using the delegation tokens.
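Steps 1 and 2 above map directly onto the standard spark-submit options for YARN; a sketch, where the principal, keytab path, and jar are placeholders for this environment:

```shell
# 1. Obtain an initial Kerberos ticket for the submitting user.
kinit -kt /etc/security/keytabs/spark.keytab spark@EXAMPLE.COM

# 2. On YARN, Spark obtains and renews HDFS delegation tokens itself
#    when given the principal and keytab.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal spark@EXAMPLE.COM \
  --keytab /etc/security/keytabs/spark.keytab \
  --class com.example.MyJob \
  my-job.jar
```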

  12. Kerberized HDFS as storage for Spark — scenario for standalone Spark on OpenShift: 1. The Spark processes start with kinit and get an HDFS service ticket from the KDC. 2. The Spark workers convert the Kerberos tickets to delegation tokens at startup using post hooks (configured via spark-env.sh on the master and each worker). 3. The Spark client submits the job with the token location. 4. The Spark executors read and store data from and to HDFS using the generated delegation tokens.
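A minimal sketch of the kind of startup post hook described above, for a worker pod. The principal, keytab, and file paths are placeholders; the point is the ticket-to-delegation-token conversion and handing the token file to the Hadoop client libraries:

```shell
# 1. Get a Kerberos ticket for the worker's service principal.
kinit -kt /etc/security/keytabs/spark.keytab spark@EXAMPLE.COM

# 2. Use that ticket to fetch an HDFS delegation token into a file.
hdfs fetchdt --renewer spark /tmp/hdfs.token

# 3. Point Hadoop clients (and thus Spark executors) at the token file;
#    Hadoop honors HADOOP_TOKEN_FILE_LOCATION for credential lookup.
export HADOOP_TOKEN_FILE_LOCATION=/tmp/hdfs.token
```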

  13. Custom template using the Spark Standalone cluster manager in client mode: spark-submit targets the Spark master (not K8S). The Spark cluster is deployed from Spark-template.yaml with a master pod (M) and zero worker replicas; the worker pods (W) are then scaled up before the client submits the job.
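On OpenShift, deploying from a template and scaling workers from zero can be sketched with the `oc` CLI. The template parameter and resource names here are assumptions for illustration (the source names only Spark-template.yaml); slide 14 suggests the workers run as a StatefulSet:

```shell
# Instantiate the Spark cluster from the custom template.
oc process -f Spark-template.yaml -p CLUSTER_NAME=my-cluster | oc apply -f -

# Workers start at zero replicas; scale up just before submitting a job.
oc scale statefulset/my-cluster-worker --replicas=6

# ...run the job, then scale back to zero to release the resources.
oc scale statefulset/my-cluster-worker --replicas=0
```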

  14. How do the 3 options (native K8S, Oshinko, and the custom standalone StatefulSet template) compare? Again, the deciding criterion was support for keytab-based HDFS security in standalone mode (https://issues.apache.org/jira/browse/SPARK-5158).

  15. Spark cluster on OpenShift — logging across the master pod (M) and worker pods (W).

  16. Spark cluster on OpenShift — monitoring using Prometheus and Grafana: all pods (master and workers) expose metrics on port 7777, which Prometheus scrapes and Grafana visualizes.
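A quick way to sanity-check that a pod is exporting metrics on the agreed port; the pod IP is a placeholder, and the `/metrics` path is an assumption (it is the convention used by Prometheus exporters such as the JMX exporter, which the slide does not name):

```shell
# Fetch the first few exposed metrics from a Spark pod.
curl -s http://10.128.0.12:7777/metrics | head
```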

  17. Demo

  18. What was the result? Before vs. after containerization: • Time to provision environment: up to 3 months before; 45 seconds after • Scaling down to zero: not possible before; 30 seconds and automated after • Hardware requirements: 33% less hardware after • Security scanning for each run: none before; automated scanning of every container on each deployment after • Run times: variable before; predictable after
