
Running Spark Clusters in Containers with Docker

Silicon Valley Big Data Association Meetup, February 16, 2016. Tom Phelan (tap@bluedata.com) and Kartik Mathur (kartik@bluedata.com).


Presentation Transcript


  1. Running Spark Clusters in Containers with Docker Silicon Valley Big Data Association Meetup February 16, 2016 Tom Phelan tap@bluedata.com Kartik Mathur kartik@bluedata.com

  2. Outline • Vocabulary • Big Data New Realities • Apache Spark • Anatomy of a Spark Cluster • Deployment Options: Public Cloud, On-Premises • Demo • Trade-Offs and Choices

  3. Vocabulary • Bare-Metal • Virtual Machine (VM) • Container • Docker • Microservice • Monolithic (service)

  4. Apache Spark Apache Spark™ is a fast and general engine for large-scale data processing. Source: www.spark.apache.org

  5. Big Data Deployment Options Source: Enterprise Strategy Group (ESG) Survey, 2015

  6. Spark On-Premises • Individual developers or data scientists build their own infrastructure on laptops, VMs, or bare-metal machines • IT takes a bottom-up approach where everyone gets the same infrastructure/platform irrespective of their skill or use case

  7. Why Change this Approach? As the number of Spark users grows … • IT needs to scale the deployment for additional use cases • Application lifecycle requires dev/test/QA/prod environments • Complexity overwhelms the organization, restricting adoption

  8. Spark Adoption On-Premises • Prototyping: Get started with Spark for initial use cases and users; Evaluation, testing, development, and QA; Prototype multiple data pipelines quickly • Departmental: Spin up dev/test clusters with replica image of production; QA/UAT using production data without duplication; Offload specific users and workloads from production • Spark-as-a-Service: LOB multi-tenancy with strict resource allocations; Bare-metal performance for business critical workloads; Self-service, shared infrastructure with strict access controls • Captions: Multi-Tenant Spark Deployment On-Premises; Spark in a Secure Production Environment; Dev/Test and Pre-Production

  9. Big Data New Realities (Traditional Assumptions → New Realities → New Benefits and Value) • Bare-metal → Containers and VMs → Big-Data-as-a-Service • Data locality → Compute and storage separation → Agility and cost savings • Data on local disks → In-place access on remote data stores → Faster time-to-insights

  10. New Realities, New Requirements • Software flexibility • Multiple distros, Hadoop and Spark, multiple configurations • Support new versions and apps as soon as they are available • Multi-tenant support • Data access and network security • Differential Quality of Service (QoS) • Stability, Scalability, Cost, Performance, and Security are always important

  11. Big Data Deployment – Public Cloud • Hadoop-as-a-Service • Amazon Web Services EC2 and EMR • Microsoft Azure HDInsight • Google Cloud Dataproc • IBM Bluemix... and others • Spark-as-a-Service • All of the above • Databricks

  12. Big Data Deployment – On-Premises • Bare-Metal • Virtual Machines • VMware Big Data Extensions • OpenStack Sahara • Containers • Mesos • BlueData

  13. Apache Spark - Anatomy of a Spark Cluster

  14. Running Spark in Cluster Mode Source: http://spark.apache.org/docs/1.3.0/cluster-overview.html

  15. Common Deployment Patterns Most Common Spark Deployment Environments (Cluster Managers) • Standalone mode: 48% • YARN: 40% • Mesos: 11% Source: Spark Survey Report, 2015 (Databricks)
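
The three cluster managers named on this slide differ mainly in the `--master` value passed to `spark-submit`. The following is a minimal sketch of that mapping, not something taken from the slides: the helper names `master_url` and `submit_command` and the host names are hypothetical, and the ports are the usual defaults (7077 for a standalone master, 5050 for a Mesos master), which a given site may override.

```python
def master_url(manager, host=None, port=None):
    """Return the --master value for a cluster manager (hypothetical helper)."""
    if manager == "yarn":
        # YARN finds the ResourceManager through the Hadoop configuration
        # directory, so no host or port appears in the master value.
        return "yarn"
    # Scheme and commonly used default port per manager (assumed defaults).
    defaults = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if manager not in defaults:
        raise ValueError("unknown cluster manager: %s" % manager)
    if host is None:
        raise ValueError("%s mode needs a master host" % manager)
    scheme, default_port = defaults[manager]
    chosen_port = port if port is not None else default_port
    return "%s://%s:%d" % (scheme, host, chosen_port)


def submit_command(manager, app, host=None, port=None):
    """Build a spark-submit argument list for the given cluster manager."""
    return ["spark-submit", "--master", master_url(manager, host, port), app]


if __name__ == "__main__":
    # Example host names are placeholders, not real endpoints.
    print(" ".join(submit_command("standalone", "app.py", "master.example.com")))
    print(" ".join(submit_command("yarn", "app.py")))
    print(" ".join(submit_command("mesos", "app.py", "mesos.example.com")))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting.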

  16. Avoid Solution Mismatch

  17. Spark Cluster – Standalone Mode [Diagram: a Spark Client and a Spark Master, with three Spark Slaves running tasks; each component is hosted on bare metal or a virtual machine]

  18. Spark Cluster – Hadoop YARN [Diagram: Spark Client, Resource Manager, Spark Master, and three Node Managers, each hosting a Spark Executor running tasks]

  19. Spark MultiCluster + YARN [Diagram: Controller and Worker nodes]

  20. Spark Cluster – Mesos [Diagram: Spark Client, Spark Scheduler, Mesos Master, and three Mesos Slaves, each hosting a Spark Executor running tasks]

  21. Spark Cluster – Mesos [Same diagram, with the Spark Client and Spark Scheduler labeled as the Spark Framework for Mesos]

  22. Spark Cluster – Mesos [Same diagram as slide 20]

  23. Apache Spark - Deployment options

  24. Public Cloud – Spark-as-a-Service (e.g. AWS)

  25. Spark Cluster – Hadoop YARN [Diagram: Spark Client/Zeppelin, Resource Manager, Spark Master, and three Node Managers, each hosting a Spark Executor running tasks; all components run on virtual machines]

  26. AWS Spark-as-a-Service: Benefits • Amazon EC2 Elastic Container Service (ECS) • Launch containers on EC2 • Amazon Elastic Container Registry (ECR): Docker images • Amazon Elastic MapReduce (EMR) • Easy to use • Low startup costs: Hardware and human • Expandable

  27. AWS Spark-as-a-Service: Challenges • Data access • Already exists in S3 • Ingest time • Data security • Software versions • Spark 1.6.0, Hadoop 2.7.1; MapR • Cost • Short running vs. long running clusters

  28. On-Premises – Spark + Containers + DCOS Microservices deployment Spark with Docker and Kubernetes/Swarm/Mesos

  29. Spark Cluster – Mesos [Diagram: Spark Client, Spark Scheduler, Mesos Master, and three Mesos Slaves, each hosting a Spark Executor running tasks; the executors run in containers]

  30. Spark + Docker + DCOS: Benefits • Easy to set up a dev/demonstration environment • Mesos framework for Spark available • Container isolation • Most of the pieces are available • Complete control • Customization
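
One way the "Mesos framework for Spark" and "container isolation" points fit together is Spark's `spark.mesos.executor.docker.image` property, which asks Mesos to launch each executor inside a Docker image. The sketch below only assembles the `spark-submit` arguments; the helper name `mesos_docker_submit`, the Mesos master address, and the image name are hypothetical, and the image is assumed to already contain a Spark installation compatible with the driver.

```python
def mesos_docker_submit(mesos_master, image, app, extra_conf=None):
    """Build spark-submit arguments for Docker-based executors on Mesos.

    mesos_master is "host:port" of the Mesos master (placeholder value),
    image is a Docker image assumed to bundle Spark (hypothetical name).
    """
    # Real Spark property: run executors inside this Docker image.
    conf = {"spark.mesos.executor.docker.image": image}
    conf.update(extra_conf or {})

    cmd = ["spark-submit", "--master", "mesos://%s" % mesos_master]
    # Sort keys so the generated command is deterministic.
    for key in sorted(conf):
        cmd += ["--conf", "%s=%s" % (key, conf[key])]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    args = mesos_docker_submit(
        "mesos-master.example.com:5050",   # placeholder address
        "example/spark:1.6.0",             # hypothetical image name
        "app.py",
        {"spark.executor.memory": "2g"},
    )
    print(" ".join(args))
```

This covers only executor packaging; the multi-tenancy, QoS, and network concerns listed as challenges are outside what this property addresses.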

  31. Spark + Docker + DCOS: Challenges • Can be difficult to set up a production environment • Multi-tenancy, QoS • Software interoperability • Container cluster network connectivity and security

  32. Spark + Docker + Mesos: Challenges [Diagram: Mesos Master with a Marathon Scheduler and several Mesos Schedulers; Mesos Slave #1 and Mesos Slave #2 run Mesos Execs hosting containerized Tasks, a Name Node, and containerized Data Nodes]

  33. Spark + Docker + Mesos + Myriad [Diagram: as on slide 32, with a Myriad Scheduler added and many more containerized Tasks spread across Mesos Slave #1 and Mesos Slave #2]

  34. Spark + Docker + Mesos (microservice) [Diagram: Myriad Scheduler, Marathon Scheduler, and Mesos Schedulers on the Mesos Master; Mesos Slave #1 and Mesos Slave #2 run Mesos Execs hosting a Job, containerized Tasks, a Name Node, and containerized Data Nodes]

  35. On-Premises – Spark + Containers + BlueData Monolithic deployment Spark-as-a-Service in an On-Premises Deployment

  36. Spark – Standalone with Containers [Diagram: a Spark Client, a Spark Master, and three Spark Slaves running tasks; each component runs in a container hosted on bare metal or a virtual machine]
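
As an illustration of the standalone-in-containers layout, the sketch below builds `docker run` commands that start one master and several workers using Spark's own `Master` and `Worker` entry points. It is a generic Docker sketch under stated assumptions and is not BlueData's implementation: the image name, container names, and network name are hypothetical, and the image is assumed to have Spark installed with its home directory as the working directory.

```python
MASTER_CLASS = "org.apache.spark.deploy.master.Master"
WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def master_container(image, name="spark-master", network="spark-net"):
    """docker run arguments for a standalone Spark master (sketch)."""
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", network,          # user-defined network, assumed to exist
        image,
        "bin/spark-class", MASTER_CLASS,
    ]


def worker_container(image, index, master="spark-master", network="spark-net"):
    """docker run arguments for one standalone Spark worker (sketch)."""
    name = "spark-worker-%d" % index
    return [
        "docker", "run", "-d",
        "--name", name, "--hostname", name,
        "--network", network,
        image,
        # The worker registers with the master at its default port 7077.
        "bin/spark-class", WORKER_CLASS, "spark://%s:7077" % master,
    ]


def cluster_commands(image, workers=3):
    """All commands for a master plus the requested number of workers."""
    cmds = [master_container(image)]
    cmds += [worker_container(image, i) for i in range(1, workers + 1)]
    return cmds


if __name__ == "__main__":
    for cmd in cluster_commands("example/spark:1.6.0"):  # hypothetical image
        print(" ".join(cmd))
```

Giving each container a stable hostname on a shared network lets the workers reach the master by name, which is a simple stand-in for the persistent, externally visible addressing discussed on the following slides.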

  37. Spark + Docker + BlueData: Benefits • Enterprise quality • Deployment flexibility (on physical servers or VMs) • Network connectivity • Persistent IP addresses • Externally visible IP addresses • No NATing required • Cloud-like experience: Spark-as-a-Service • Self-service access to instant clusters, simple Web UI

  38. Spark + Docker + BlueData: Benefits • Docker packaging of images • Distribution agnostic • Spark, Kafka, Cassandra, Zeppelin, and more • With or without YARN • Bring your own BI/analytics tool • Currently only on-premises • Future: on-premises, public cloud, or hybrid

  39. Spark + Docker + BlueData: Benefits • Multi-tenancy • Per tenant QoS, not per service • Private VLAN per Tenant • Limit Data Access • HA, software upgrades, data access, … • BlueData’s DataTap isolates data from compute • Upgrade compute independent of data

  40. BlueData EPIC Software: Demo

  41. Trade-offs and Choices

  42. Trade-Offs (Not Unique to Spark) [Diagram: Open Source vs. Proprietary and On-Premises vs. Public Cloud, annotated with the labels Less Stable / More Stable, Less Cost / More Cost, More Now / Less Now, and Less Later / More Later]

  43. Use Cases and Choice of Deployment • Just Spark, Just Works, no Customizations: Public Cloud • Lots of Customizations, Willing to Tinker, Limited QoS: Open source, microservice, Mesos • Configurable, Flexible, Enterprise Multi-Tenancy: Monolithic (for the moment) container deployment

  44. Thank You www.bluedata.com Try BlueData EPIC for Free: bluedata.com/free
