
Deep Learning with TensorFlow and Spark using GPUs and Docker Containers


Presentation Transcript


  1. Deep Learning with TensorFlow and Spark using GPUs and Docker Containers • Tom Phelan • BlueData CTO / HPE Fellow • (BlueData was recently acquired by HPE) • thomas.phelan@hpe.com, @tapbluedata

  2. Agenda • Big Data, Artificial Intelligence, Machine Learning, and Deep Learning • Common Enterprise Use Cases • Deep Learning Application Requirements • Distributed TensorFlow & Horovod in Containers with GPUs • Use of Docker Containers to meet Requirements • Challenges and Solutions • Lessons Learned • Key Takeaways and Recommendations

  3. Big Data Big Data refers to data sets that are so voluminous and complex that traditional data-processing application software are inadequate to deal with them. Characterized by: volume, variety, velocity Source: https://en.wikipedia.org/wiki/Big_Data

  4. Let’s Get Grounded … What is AI? • Artificial intelligence (AI): mimics human behavior; any technique that enables machines to solve a task the way humans do. Example: Siri • Machine learning (ML): algorithms that allow computers to learn from examples without being explicitly programmed. Example: Google Maps • Deep learning (DL): subset of ML, using deep artificial neural networks as models, inspired by the structure and function of the human brain. Example: self-driving cars

  5. Game Changing Innovation Gartner 2019 CIO Agenda Q: Which technology areas do you expect will be a game changer for your organization? Source: Gartner, Insights From the 2019 CIO Agenda Report, by Andy Rowsell-Jones, et al.

  6. Why should you be interested in AI / ML / DL? Everyone wants AI / ML / DL and advanced analytics… • AI and advanced analytics represent 2 of the top 3 CIO priorities • AI and advanced analytics infrastructure could constitute 15-20% of the market by 2021 (1) • Enterprise AI adoption has grown 2.7x in the last 4 years (2) …but enterprises face many challenges: use cases, new roles and skill gaps, culture and change, data preparation, legacy infrastructure • Sources: (1) IDC, Goldman Sachs, HPE Corporate Strategy, 2018; (2) Gartner, “2019 CIO Survey: CIOs Have Awoken to the Importance of AI”

  7. Example Enterprise Deep Learning Use Cases • Fraud detection: credit cards, car loans • Medical diagnosis: detection / prediction of cancer, Alzheimer’s, pneumonia • Prediction: weather forecast, gas & oil location, buyer / trader action

  8. AI / ML / DL Adoption in the Enterprise • Financial services: fraud detection, ID verification • Government: cyber-security, smart cities and utilities • Energy: seismic and reservoir modeling • Retail: video surveillance, shopping patterns • Health: personalized medicine, image analytics • Consumer tech: chatbots • Service providers: media delivery • Manufacturing: predictive and prescriptive maintenance

  9. Financial Services Use Cases • Wide range of ML / DL use cases for wholesale / commercial banking, credit card / payments, retail banking, etc. • Fraud detection: real-time transactions, credit card, merchant, collusion, impersonation, social engineering fraud • Risk modeling & credit worthiness check: loan defaults, delayed payments, liquidity, market & currencies, purchases and payments, time series • Customer segmentation: understanding customer quadrant, effective messaging & improved engagement, targeted customer support, enhanced retention • CLV prediction and recommendation: historical purchase view, pattern recognition, retention strategy, upsell, cross-sell, nurturing • Other: image recognition, NLP, security, video analysis, behavioral analysis

  10. Fraud Detection Use Case • One of the most common use cases for ML / DL in Financial Services is to detect and prevent fraud • This requires: • Distributed Big Data processing frameworks such as Spark • ML / DL tools such as TensorFlow, H2O, and others • Continuous model training and deployment • Multiple large data sets

  11. ML / DL in Healthcare – Use Cases • Precision Medicine and Personal Sensing • Disease prediction, diagnosis, and detection (e.g. genomics research) • Using data from local sensors (e.g. mobile phones) to identify human behavior • Electronic Health Record (EHR) correlation • “Smart” health records • Improved Clinical Workflow • Decision support for clinicians • Claims Management and Fraud Detection • Identify fraudulent claims • Drug Discovery and Development

  12. 360° View of the Patient • Data sources surrounding the patient: demographics, visits, labs, Rx, diagnoses, care site, genomics, studies

  13. Sample Architecture & Applications • Data sources: electronic health record systems, monitors / devices • Ingest: Kafka Connect, database access, publishers • Storage: secure HDFS data lake, local store, centralized publisher / subscriber hub • Modeling: model build, model promotion, model scoring in a speed layer, with results / feedback

  14. Requirements for Deep Learning Q: What hardware is commonly used for deep learning and matrix multiplications? A: Graphics Processing Units (aka GPUs)

  15. Matrix Multiplication Source: https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/a/multiplying-matrices
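Matrix multiplication is the core operation behind the network layers shown on the next slide: every output element is an independent dot product, which is why GPUs, with thousands of cores, can compute many of them in parallel. A minimal worked example in Python (NumPy only; not taken from the slides):

```python
# Multiply two small matrices; each output element is the dot product of a
# row of A with a column of B. Deep learning frameworks run the same
# operation on the GPU via CUDA kernels, computing many elements at once.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

C = A @ B
print(C)
# [[19 22]
#  [43 50]]
```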

  16. Deep Neural Network (DNN) Source: https://www.kdnuggets.com/wp-content/uploads/deep-neural-network.jpg
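For reference, a fully connected deep neural network like the one pictured can be written in a few lines of Keras. This is only a sketch; the layer sizes and the 784-dimensional input are illustrative choices, not taken from the deck:

```python
# A minimal fully connected DNN: input layer, two hidden layers, output layer.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                    # e.g. a flattened 28x28 image
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax")  # output layer, 10 classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```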

  17. NVIDIA GPUs and CUDA • NVIDIA is the most common manufacturer of GPUs • CUDA is the software that allows applications to access the NVIDIA GPU hardware • CUDA library or toolkit • CUDA kernel device driver • GPU-hardware specific
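A quick way to confirm that the CUDA toolkit and the kernel driver are installed and compatible is to ask TensorFlow what it can see. The snippet below uses the TensorFlow 2.x API (the stack in this deck used TensorFlow 1.9, whose equivalent calls differ):

```python
# Check that TensorFlow was built against CUDA and that GPUs are visible.
import tensorflow as tf

print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```

If no GPUs are listed even though nvidia-smi works on the host, the usual culprit is a mismatch between the CUDA library version TensorFlow expects and the installed driver.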

  18. Architectural Challenges • Complexity, lack of repeatability and reproducibility across environments • Sharing data, not duplicating data • Need agility to scale compute resources up and down • Deploying multiple distributed platforms, libraries, applications, and versions • One size environment fits none • Need a flexible and future-proof solution that spans laptop, on-prem cluster, and off-prem cluster

  19. Cost Challenges • How to maximize the use of expensive hardware resources • How to run clusters on heterogeneous host hardware • CPUs and GPUs, including multiple GPU versions • How to minimize manual operations • Automating the cluster creation and deployment process • Creating reproducible clusters and reproducible results • Enabling on-demand provisioning and elasticity

  20. Support and Security Challenges • How to support the latest versions of software • Compatibility across version upgrades • How to ensure enterprise-class security • Network, storage, user authentication, and access • Applying security patches

  21. Addressing The Challenges • Simplify Deployments • Innovate Faster • Deploy Anywhere • Docker is software that performs operating-system-level virtualization, also known as containerization. Containerization allows multiple isolated user-space instances to coexist on a single server. Source: https://en.wikipedia.org/wiki/docker_(software)

  22. Surfacing a GPU device into a Container • Use of the docker run --device command line option • Use of a vendor-specific Docker wrapper such as nvidia-docker (introduced in 2016) • Use of the NVIDIA Container Runtime (introduced mid-2018) • Support for container runtimes other than Docker
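Whichever of these mechanisms is used, the end result should be that the NVIDIA device nodes and kernel driver are visible inside the container. A small sanity check, using only the Python standard library, that can be run inside the container (the paths are the standard locations exposed by the NVIDIA driver; the snippet is an illustration, not part of the deck):

```python
# Verify that the GPU devices and NVIDIA driver were surfaced into this
# container (e.g. via `docker run --device`, nvidia-docker, or the NVIDIA
# Container Runtime).
import glob
import os

devices = glob.glob("/dev/nvidia*")
print("NVIDIA device nodes:", devices or "none found")

driver_info = "/proc/driver/nvidia/version"
if os.path.exists(driver_info):
    with open(driver_info) as f:
        print("Driver:", f.readline().strip())
else:
    print("NVIDIA kernel driver not visible in this container")
```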

  23. GPU Access from within a Container Source: http://www.nvidia.com/object/docker-container.html

  24. Spark • A unified analytics engine for large-scale data processing • Easy to use • Java, Python, Scala, R • Runs everywhere • Hadoop, Mesos, Kubernetes, standalone
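As a minimal illustration of the "easy to use" point, and of the fraud detection use case from earlier slides, here is a tiny PySpark job that reads a transaction file and counts the fraud-labeled rows. The file path and column name are hypothetical:

```python
# Count fraud-labeled transactions with PySpark (illustrative sketch).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fraud-count")
         .getOrCreate())

# Hypothetical input: a CSV of transactions with an is_fraud label column.
df = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)
print("Fraudulent transactions:", df.filter(df["is_fraud"] == 1).count())

spark.stop()
```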

  25. TensorFlow • Distributed TensorFlow: trains models in parallel, on multiple devices, using GPUs • Goal is to improve training speed and accuracy
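As a sketch of what "training in parallel on multiple devices" looks like in code, the snippet below uses tf.distribute.MirroredStrategy to mirror a Keras model across all GPUs on one node. This is the TensorFlow 2.x API; the TensorFlow 1.9 stack shown later in the deck used a different distribution mechanism, so treat it as illustrative:

```python
# Data-parallel training across the GPUs visible on a single node.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created here are mirrored on every GPU; gradients are
    # averaged across replicas after each step.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1)
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset) would then split each batch across the replicas.
```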

  26. Horovod • An open source framework developed by Uber that supports allreduce • Distributed training framework for • TensorFlow, PyTorch, Keras • Separates infrastructure capabilities from ML • Installs easily on an existing ML framework • pip install horovod • Uses a bandwidth-optimal communication protocol • RDMA, InfiniBand if available
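The core Horovod pattern with Keras looks roughly like the sketch below, launched with something like horovodrun -np 4 python train.py. The model and data are placeholders; only the Horovod-specific steps matter here:

```python
# Horovod + Keras skeleton: init, pin one GPU per process, wrap the optimizer
# so gradients are averaged with allreduce, broadcast the initial weights.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one process per GPU

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(1)
])

# Scale the learning rate by the number of workers, then wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_data, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```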

  27. Spark, TensorFlow, NVIDIA, Horovod, on Docker • Horovod cluster running across multiple GPUs, containers, and machines • Spark driver coordinating Spark executors, one per container • Identical software stack in each executor container: TensorFlow 1.9, MPI 3.1.3, NCCL 2.3.7, GPU / CUDA 9 • Shared data accessible to all executors
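One way to wire these layers together is Horovod's Spark integration, where the Spark driver launches one Horovod training task per executor. A hedged sketch using horovod.spark.run; the train() body is a placeholder for real TensorFlow training code, and num_proc should match the executor / GPU count:

```python
# Launch a Horovod job from a Spark driver (illustrative sketch).
from pyspark.sql import SparkSession
import horovod.spark

def train():
    # Runs once per Spark executor; real code would build and fit a model here.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    return hvd.rank()

spark = SparkSession.builder.appName("horovod-on-spark").getOrCreate()

results = horovod.spark.run(train, num_proc=3)  # one process per executor
print("Ranks that completed:", results)

spark.stop()
```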

  28. Deep Learning and Docker Containers • ML / DL applications are compute-hardware intensive as well as complex to set up and administer • They can benefit from the flexibility, agility, and resource sharing attributes of containerization • But care must be taken in how this is done, especially in a large-scale distributed environment

  29. AI-Driven Solutions for the Enterprise • Example industry use cases: video surveillance, fraud detection, genome research, customer 360 • Solution layers: data science and ML / DL tools, data platforms, infrastructure, HDFS / NFS data store • Enterprise concerns: data duplication, cloud, IT, user access, security, time to deploy, multi-tenancy

  30. Distributed Deep Learning on Docker • On-demand GPU-enabled clusters for deep learning • Submit jobs using notebooks, web UI, or API • Web-based SSH access for CLI jobs • Shared storage access with security • Cloud-ready architecture, using Docker containers • Support for parallelism: prototype a model, then save, load, share, and run it; quickly deploy sandbox nodes in minutes • Secure, multi-tenant, elastic environments on shared infrastructure, with compute isolation and data isolation • Containers mount GPU devices and surface the device drivers (CUDA runtime, cuDNN) • Runs on DIY / K8s / DC/OS containerized infrastructure for AI / ML / DL, across GPU and non-GPU hosts • Data layer: HDFS data lake or NFS enterprise storage

  31. Challenges Solved • Pre-built Docker images with CUDA and automated cluster creation for the entire stack • Easy to deploy, repeatable • Appropriate NVIDIA kernel module surfaced automatically to the containers • Scalable access to resources (e.g. single node, single GPU, multi-node, multi-GPU combinations) • UI, CLI, and API use patterns (notebooks, web, SSH) • Fast access to shared data • Support for hybrid on-premises and cloud infrastructure

  32. Faster ML / DL Deployment Time: from ~45 days to ~10 minutes • Legacy deployment (weeks): physical server, operating system, hardware, networking, storage, security (KDC, AD/LDAP, SSL), download / install, init.d configuration, cluster configuration, add services, add / configure libraries, load balancing, port mapping, user access (SSH, SSL), onboard users, software management, then the end user submits a job / model via SSH / UI • Deployment with BlueData (~10 minutes): Docker-based application image on shared infrastructure; the end user submits a job / model via SSH / UI

  33. Lessons Learned • Enterprises are using ML / DL today to solve difficult problems • Example use cases: fraud detection, disease prediction, etc. • Distributed ML / DL in the enterprise requires a complex stack, with multiple different tools • TensorFlow is one popular option • Deployments are challenging, with many potential pitfalls • Containerization can deliver agility and cost saving benefits

  34. Takeaways • Distributed deep learning applications can be deployed on Docker containers • GPU resources can be effectively shared between multiple applications • Deep learning requires a complex software / hardware stack • DevOps complexity can be abstracted from the data scientist with self-service provisioning and automation • Data resources should be decoupled from compute resources in order to maximize platform flexibility and scalability

  35. Recommendations • Start with a few key use cases to explore deep learning • Be prepared – the only constant is change • Business needs and use cases constantly evolve • Tools evolve in dog years • Leverage a flexible, scalable, and elastic platform for success • Containerization can deliver agility and cost saving benefits • But there are significant challenges … DIY will likely be DOA

  36. Please Rate This Session Session page on conference website O’Reilly Events App

  37. Tom Phelan @tapbluedata

  38. For more information, visit www.bluedata.com • Thank You
