Pass4sure CCD-410 Cloudera Study Material

Big data company Cloudera is preparing to launch major new open-source software for storing and serving lots of different kinds of unstructured data, with an eye toward challenging heavyweights in the database business, VentureBeat has learned. https://www.pass4sureexam.com/ccD-410.html


Presentation Transcript


  1. Cloudera Certified Developer for Apache Hadoop (CCDH)

  2. Who We Are
     Mission: To help organizations profit from their data
     How We Do It: We deliver relevant products and services.
     Credentials: The Apache Hadoop experts.
     Technical Team: Unmatched knowledge and experience.
     Leadership: Strong executive team with proven abilities.
       – Mike Olson, CEO
       – Jeff Hammerbacher, Chief Scientist
       – Kirk Dunn, COO
       – Charles Zedlewski, VP, Product
       – Mary Rorabaugh, CFO
       – Amr Awadalla, VP Engineering
       – Doug Cutting, Chief Architect
       – Omer Trajman, VP, Customer Solutions
     • A distribution of Apache Hadoop that is tested, certified and supported
     • Number 1 distribution of Apache Hadoop in the world
     • Founders, committers and contributors to Hadoop
     • Largest contributor to the open source Hadoop ecosystem
     • A wealth of experience in the design and delivery of production software
     • Comprehensive support and professional service offerings
     • More committers on staff than any other company
     • A suite of management software for Hadoop operations
     • More than 100 customers across a wide variety of industries
     • Training and certification programs for developers, administrators, managers and data scientists
     • Strong growth in revenue and new accounts

  3. Users of Cloudera
     Industries: Retail & Consumer, Financial, Web, Telecom, Media

  4. What is Apache Hadoop?
     Hadoop is a platform for data storage and processing that is…
     • Scalable
     • Fault tolerant
     • Open source
     CORE HADOOP COMPONENTS
     • Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
     • MapReduce: Distributed Computing Across Physical Servers
     Flexibility
     • A single repository for storing, processing & analyzing any type of data
     • Not bound by a single schema
     Scalability
     • Scale-out architecture divides workloads across multiple nodes
     • Flexible file system eliminates ETL bottlenecks
     Low Cost
     • Can be deployed on commodity hardware
     • Open source platform guards against vendor lock-in

  5. What Makes Hadoop Different?
     • Ability to scale out to petabytes in size using commodity hardware
     • Processing (MapReduce) jobs are sent to the data, rather than shipping the data to be processed
     • Hadoop doesn’t impose a single data format, so it can easily handle structured, semi-structured and unstructured data
     • Manages fault tolerance and data replication automatically

  6. Why the Need for Hadoop?
     1.8 trillion gigabytes of data was created in 2011…
     • More than 90% is unstructured data
     • Approx. 500 quadrillion files
     • Quantity doubles every 2 years
     [Chart: gigabytes of data created (in billions), 2005 to 2015, structured vs. unstructured data. Source: IDC 2011]
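The "doubles every 2 years" claim can be turned into a quick back-of-the-envelope projection. The sketch below is illustrative only: it assumes the doubling rate holds steadily from the slide's 2011 baseline of 1.8 trillion gigabytes, and the function name `projected_data_volume` is hypothetical.

```python
def projected_data_volume(base_volume: float, base_year: int, target_year: int,
                          doubling_period_years: float = 2.0) -> float:
    """Project a quantity that doubles every `doubling_period_years` years."""
    elapsed_years = target_year - base_year
    return base_volume * 2 ** (elapsed_years / doubling_period_years)


# Baseline from the slide: 1.8 trillion gigabytes created in 2011.
# Assumption: the doubling rate stays constant over the projected years.
for year in (2011, 2013, 2015):
    volume = projected_data_volume(1.8, 2011, year)
    print(year, round(volume, 1), "trillion GB")
```

Under that assumption the 2015 figure comes out at roughly 7.2 trillion gigabytes, which fits within the 10,000-billion-gigabyte axis of the chart.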

  7. Hadoop Use Cases
     Industry         Advanced Analytics               Data Processing
     Web              Social Network Analysis          Clickstream Sessionization
     Media            Content Optimization             Clickstream Sessionization
     Telco            Network Analytics                Mediation
     Retail           Loyalty & Promotions Analysis    Data Factory
     Financial        Fraud Analysis                   Trade Reconciliation
     Federal          Entity Analysis                  SIGINT
     Bioinformatics   Sequencing Analysis              Genome Mapping

  8. Hadoop in the Enterprise
     [Diagram labels]
     • Users: Operators, Engineers, Analysts, Business Users, Customers
     • Tools: Management Tools, IDEs, BI / Analytics, Enterprise Reporting, Web Application
     • Systems: Enterprise Data Warehouse, Relational Databases
     • Data sources: Logs, Files, Web Data

  9. What is CDH?
     Cloudera’s Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is…
     • 100% Apache open source
     • Contains all components needed for deployment
     • Fully documented and supported
     • Released on a reliable schedule
     Fastest Path to Success / Stable and Reliable / Community Driven:
     • No need to write your own scripts or do integration testing on different components
     • Extensive Cloudera QA systems, software & processes
     • Incorporates only main-line components from the Apache Hadoop ecosystem – no forks or proprietary underpinnings
     • Tested & run in production at scale
     • Works with a wide range of operating systems, hardware, databases and data warehouses
     • Proven at scale in dozens of enterprise environments
     • FREE

  10. Cloudera’s Commitment to the Open Source Community
      Component    Cloudera Committers   Cloudera Founder   2011 Commits (rank)
      Common       6                     Yes                #1
      HDFS         6                     Yes                #2
      MapReduce    5                     Yes                #1
      HBase        2                     No                 #2
      Zookeeper    1                     Yes                #2
      Oozie        1                     Yes                #1
      Pig          0                     No                 #3
      Hive         1                     No                 #2
      Sqoop        2                     Yes                #1
      Flume        3                     Yes                #1
      Hue          3                     Yes                #1
      Snappy       2                     No                 #1
      Bigtop       8                     Yes                #1
      Avro         4                     Yes                #1
      Whirr        2                     Yes                #1

  11. Components of CDH
      [Diagram of CDH components by function, also labeled "Cloudera Enterprise"]
      • User Interface: Hue
      • Workflow: Apache Oozie
      • Scheduling: Apache Oozie
      • File System Mount: FUSE-DFS
      • Languages / Compilers: Apache Pig, Apache Hive
      • Data Integration: Apache Flume, Apache Sqoop
      • Fast Read/Write Access: Apache HBase
      • Coordination: Apache ZooKeeper

  12. Hadoop Distributed File System
      • Block Size = 64MB
      • Replication Factor = 3
      • Cost is $400-$500/TB
      [Diagram: a file split into blocks 1 to 5, with copies of each block spread across HDFS nodes]
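To make the block size and replication factor concrete, here is a minimal sketch of the arithmetic, using the slide's values (64 MB blocks, replication factor 3) as defaults. The function name `hdfs_footprint` and the 1 GB example file are illustrative assumptions, not part of the presentation.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb: float,
                   block_size_mb: int = BLOCK_SIZE_MB,
                   replication: int = REPLICATION_FACTOR) -> dict:
    """Estimate how many blocks a file occupies and how much raw disk it uses."""
    # A file is split into fixed-size blocks; the last block may be partly full.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # Every block is stored `replication` times across the cluster.
        "block_replicas": blocks * replication,
        # A partly filled last block only consumes the space of its data.
        "raw_storage_mb": file_size_mb * replication,
    }


# Hypothetical example: a 1 GB (1024 MB) file.
print(hdfs_footprint(1024))
```

With these defaults a 1024 MB file maps to 16 blocks, 48 stored block replicas, and 3072 MB of raw disk.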

  13. Components of Hadoop
      • NameNode
        – Holds all metadata for HDFS
        – Needs to be a highly reliable machine
          • RAID drives – typically RAID 10
          • Dual power supplies
          • Dual network cards – bonded
        – The more memory the better – typically 36GB to 64GB
      • Secondary NameNode
        – Provides checkpointing for the NameNode. The same hardware as the NameNode should be used

  14. Components of Hadoop
      • DataNodes
        – Hardware will depend on the specific needs of the cluster
        – No RAID needed; JBOD (just a bunch of disks) is used
        – Typical ratio is:
          • 1 hard drive
          • 2 cores
          • 4GB of RAM
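The 1 drive : 2 cores : 4GB ratio can be applied mechanically when sketching a DataNode configuration. The snippet below is only a rule-of-thumb calculator based on that ratio; `datanode_spec` is a hypothetical helper name, and real sizing depends on the workload, as the slide notes.

```python
def datanode_spec(hard_drives: int) -> dict:
    """Scale cores and RAM from the drive count using the slide's typical ratio."""
    if hard_drives < 1:
        raise ValueError("a DataNode needs at least one hard drive")
    return {
        "hard_drives": hard_drives,
        "cores": 2 * hard_drives,    # 2 cores per drive
        "ram_gb": 4 * hard_drives,   # 4GB of RAM per drive
    }


# Hypothetical example: a node with 12 JBOD drives.
print(datanode_spec(12))
```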

  15. Networking
      • One of the most important things to consider when setting up a Hadoop cluster
      • Typically, top-of-rack switches are used with Hadoop, together with a core switch
      • Be careful not to oversubscribe the backplane of the switch!

  16. Map
      • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs: e.g., (filename, line).
      • map() produces one or more intermediate values along with an output key from the input.
      [Diagram: (key 1, values), (key 2, values), (key 3, values) → Map Task → (key 1, int. values) → Shuffle Phase → Reduce Task → Final (key, values)]

  17. Reduce
      • After the map phase is over, all the intermediate values for a given output key are combined together into a list
      • reduce() combines those intermediate values into one or more final values for that same output key
      [Diagram: (key 1, values), (key 2, values), (key 3, values) → Map Task → (key 1, int. values) → Shuffle Phase → Reduce Task → Final (key, values)]
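The map, shuffle and reduce steps described on these two slides can be imitated in a few lines of plain Python. This is a single-process sketch for intuition only, not the Hadoop API: the word-count task and the names `map_fn`, `shuffle`, `reduce_fn` and `run_job` are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(filename: str, line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one (filename, line) record into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: gather all intermediate values for the same output key into a list."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the list of intermediate values into one final value."""
    return key, sum(values)


def run_job(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Run map, then shuffle, then reduce, in a single process."""
    intermediate = [pair for filename, line in records
                    for pair in map_fn(filename, line)]
    grouped = shuffle(intermediate)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input records in the (filename, line) form used on the Map slide.
records = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog the end")]
print(run_job(records))
```

In a real cluster the map and reduce functions run as many parallel tasks on different nodes and the framework performs the shuffle; the sketch only preserves the data flow.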

  18. MapReduce Execution

  19. Sqoop
      SQL to Hadoop
      • Tool to import/export any JDBC-supported database into Hadoop
      • Transfer data between Hadoop and external databases or EDW
      • High performance connectors for some RDBMS
      • Developed at Cloudera

  20. Flume
      Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
      • Suited for gathering logs from multiple systems
      • Inserting them into HDFS as they are generated
      Design goals
      • Reliability, Scalability, Manageability, Extensibility
      Developed at Cloudera

  21. Flume: high-level architecture
      Components: Agents, Processors, Collector(s), Master
      • Master sends configuration to all Agents
      • Configurable levels of reliability; guarantee delivery in event of failure
      • Deployable, centrally administered
      • Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
      • Flexibly deploy decorators (batch, compress, encrypt) at any step to improve performance, reliability or security
      • Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
      • Parallelized writes across many collectors – as much write throughput as…

  22. HBase
      Column-family store. Based on design of Google BigTable
      • Provides interactive access to information
      • Holds extremely large datasets (multi-TB)
      • Constrained access model
        – (key, value) lookup
        – Limited transactions (only one row)
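The constrained access model is easier to picture with a toy in-memory model. The class below is not the HBase API; `TinyColumnFamilyStore` and its methods are hypothetical names, and the sketch only mirrors the ideas on the slide: data is addressed by row key, grouped into column families, and each write touches a single row.

```python
from typing import Dict, Iterable, Optional


class TinyColumnFamilyStore:
    """Toy model of a column-family store: row key -> family -> qualifier -> value."""

    def __init__(self, column_families: Iterable[str]) -> None:
        # Column families are fixed up front in this model.
        self._families = set(column_families)
        self._rows: Dict[str, Dict[str, Dict[str, str]]] = {}

    def put_row(self, row_key: str, family: str, cells: Dict[str, str]) -> None:
        """Write several cells of ONE row; there is no multi-row operation."""
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        row = self._rows.setdefault(row_key, {})
        row.setdefault(family, {}).update(cells)

    def get(self, row_key: str, family: str, qualifier: str) -> Optional[str]:
        """(key, value) style lookup; returns None when the cell is absent."""
        return self._rows.get(row_key, {}).get(family, {}).get(qualifier)


# Hypothetical usage with made-up row keys and families.
store = TinyColumnFamilyStore(["info", "stats"])
store.put_row("user#42", "info", {"name": "Ada", "city": "London"})
print(store.get("user#42", "info", "name"))
print(store.get("user#99", "info", "name"))
```

There is deliberately no query language or join here: everything goes through the row key, which is the trade-off the slide calls a constrained access model.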

  23. HBase

  24. Hive
      SQL-based data warehousing application
      • Language is SQL-like
        – Supports SELECT, JOIN, GROUP BY, etc.
      • Features for analyzing very large data sets
        – Partition columns, Sampling, Buckets
      • Example:
          SELECT s.word, s.freq, k.freq
          FROM shakespeare s JOIN kjv k ON (s.word = k.word)
          WHERE s.freq >= 5;

  25. Pig
      Data-flow oriented language – “Pig Latin”
      • Datatypes include sets, associative arrays, tuples
      • High-level language for routing data; allows easy integration of Java for complex tasks
      • Example:
          emps = LOAD 'people.txt' AS (id, name, salary);
          rich = FILTER emps BY salary > 100000;
          srtd = ORDER rich BY salary DESC;
          STORE srtd INTO 'rich_people.txt';
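For readers who do not know Pig Latin, the data flow in the example (load, filter, order, store) can be mirrored step by step in plain Python. This is only an analogy run on an in-memory list; the sample employees and the function name `rich_people` are made up for illustration.

```python
from typing import List, Tuple

Employee = Tuple[int, str, int]  # (id, name, salary), matching the Pig schema


def rich_people(emps: List[Employee], threshold: int = 100000) -> List[Employee]:
    """Keep employees earning more than `threshold`, highest salary first."""
    # FILTER emps BY salary > 100000
    rich = [emp for emp in emps if emp[2] > threshold]
    # ORDER rich BY salary DESC
    return sorted(rich, key=lambda emp: emp[2], reverse=True)


# Hypothetical stand-in for the contents of people.txt.
emps = [(1, "Ann", 120000), (2, "Bob", 90000), (3, "Cy", 250000)]
print(rich_people(emps))
```

Pig expresses the same pipeline declaratively and compiles it into MapReduce jobs, so the filter and sort run across the cluster rather than in one process.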

  26. Oozie
      Oozie is a workflow/coordination service to manage data processing jobs for Hadoop

  27. ZooKeeper
      ZooKeeper is a distributed consensus engine
      • Provides well-defined concurrent access semantics:
        – Leader election
        – Service discovery
        – Distributed locking / mutual exclusion
        – Message board / mailboxes

  28. Pipes and Streaming
      Multi-language connector libraries for MapReduce
      • Write native-code MapReduce in C++
      • Write MapReduce passes in any scripting language, including
        – Perl
        – Python
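With Streaming, a map or reduce pass is a program that reads text lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below imitates that contract with two Python generators so it can be tried locally; the log-level counting task and the names `mapper` and `reducer` are assumptions for the example, and a real job would run each function as its own script fed from stdin.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'key<TAB>1' for the first field of each non-empty line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per key; input lines must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local dry run: sorted() stands in for the framework's shuffle-and-sort step.
logs = ["ERROR disk full", "INFO started", "ERROR timeout", "WARN slow"]
result = list(reducer(sorted(mapper(logs))))
print(result)
```

Because the reducer relies on sorted input, it only needs to hold one key's running total in memory at a time.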

  29. FUSE-DFS
      Allows mounting of HDFS volumes via the Linux FUSE file system
      • Does allow easy integration with other systems for data import/export
      • Does not imply HDFS can be used as a general-purpose file system

  30. Hadoop Security
      • Authentication is secured by Kerberos v5 and integrated with LDAP
      • Hadoop server can ensure that users and groups are who they say they are
      • Job Control includes Access Control Lists, which means jobs can specify who can view logs, counters, and configurations, and who can modify a job
      • Tasks now run as the user who launched the job

  31. Cloudera Enterprise
      Cloudera Enterprise makes open source Hadoop enterprise-easy
      • Simplify and Accelerate Hadoop Deployment
      • Reduce Adoption Costs and Risks
      • Lower the Cost of Administration
      • Increase the Transparency and Control of Hadoop
      • Leverage the Experience of Our Experts
      CLOUDERA ENTERPRISE COMPONENTS
      • Cloudera Manager: End-to-End Management Application for Apache Hadoop
      • Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
      EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
      EFFICIENCY: Enabling You to Affordably Run Hadoop in Production

  32. Cloudera Manager
      The industry’s first … for Apache Hadoop; automates the … of Apache Hadoop; … the Apache Hadoop stack
      • Components managed: HDFS, HBase, MapReduce, ZooKeeper, Hue, Oozie
      • Cycle: Discover, Diagnose, Act, Optimize

  33. Cloudera Enterprise, Including Cloudera Support
      Feature: Benefit
      • Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
      • Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
      • Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
      • Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
      • Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
      • Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community

  34. Cloudera University
      Public and Private Training to Enable Your Success
      Class: Description
      • Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
      • System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring, and monitoring an Apache Hadoop cluster
      • HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
      • Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
      • Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as “when is Hadoop appropriate?”, “what are people using Hadoop for?” and “what do I need to know about choosing Hadoop?”

  35. Cloudera Consulting Services
      Put Our Expertise To Work For You. Cloudera’s team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
      Service: Description
      • Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
      • New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
      • Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
      • Production Pilot: Deploy your first production-level project using Hadoop
      • Process and Team Development: Define the requirements and processes for creating a new Hadoop team
      • Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters

  36. Journey of the Cloudera Customer
      • Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
      • Cloudera’s Distribution: The fastest, surest path to success with Apache Hadoop
      • Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment

  37. Cloudera in Production
      [Diagram labels]
      • Cloudera Services: Consulting Services, Cloudera University
      • Cloudera Enterprise: Cloudera Management Suite, Cloudera Support
      • Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express
      • Users: Customers, Analysts, Business Users, Operators, Engineers
      • Tools: Web Application, Management Tools, Enterprise Reporting, BI / Analytics, IDEs
      • Systems: Enterprise Data Warehouse, Operational Rules Engines, Relational Databases
      • Data sources: Logs, Files, Web Data

  38. Get Hadoop
      Cloudera helps you profit from all your data.
      • cloudera.com
      • twitter.com/cloudera
      • facebook.com/cloudera
      • +1 (888) 789-1488
      • sales@cloudera.com

  39. Cloudera Manager
      The … Hadoop management application that:
      • Manages the …
      • Manages and monitors the …
      • Incorporates comprehensive …
      • Has built-in …

  40. Cloudera Manager: Key … and …
      • [ONLY CLOUDERA] Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
      • Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface
      • [ONLY CLOUDERA] Set server roles, configure services and manage security across the cluster
      • Gracefully start, stop and restart services as needed
      • [ONLY CLOUDERA] Maintains a complete record of configuration changes for SOX compliance
      • [ONLY CLOUDERA] Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
      • [ONLY CLOUDERA] Gather, view and search Hadoop logs collected from across the cluster
      • Scans Hadoop logs for irregularities and warns you before they impact the cluster

  41. Cloudera Manager: Key … and …
      • [ONLY CLOUDERA] Establishes the time context globally for almost all views
      • Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
      • [ONLY CLOUDERA] Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
      • [ONLY CLOUDERA] Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
      • Generates email alerts when certain events occur
      • [ONLY CLOUDERA] Visualize current and historical disk usage by user, group and directory
      • Track MapReduce activity on the cluster by job or user
      • View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles

  42. Two Editions: Free Edition and Enterprise Edition**
      • Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
      [Comparison table; per-edition checkmarks not preserved. Features listed:]
      • Automated Deployment
      • Host-Level Monitoring
      • Secure Communication Between Server & Agents
      • Configuration Management
      • Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
      • Audit Trails
      • Start/Stop/Restart Services
      • Add/Restart/Decommission Role Instances
      • Configuration Versioning & History
      • Support for Kerberos
      • Service Monitoring
      • Proactive Health Checks
      • Status & Health Summary
      • Intelligent Log Management
      • Events Management & Alerts
      • Activity Monitoring
      • Operational Reporting
      • Global Time Control
      • Support Integration
      ** Part of the Cloudera Enterprise subscription

  43. View Service Health and Performance

  44. Get Host-Level Snapshots

  45. Monitor and Diagnose Cluster Workloads

  46. Gather, View and Search Hadoop Logs

  47. Track Events From Across the Cluster

  48. Run Reports on System Performance & Usage

  49. New in Cloudera Manager 3.7
      • Proactive Health Checks [ONLY CLOUDERA]: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
      • Intelligent Log Management [ONLY CLOUDERA]: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
      • Global Time Control [ONLY CLOUDERA]: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
      • Support Integration [ONLY CLOUDERA]: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
      • Event Management [ONLY CLOUDERA]: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
      • Alerts: Generates email alerts when certain events occur
      • Audit Trails [ONLY CLOUDERA]: Maintains a complete record of configuration changes for SOX compliance
      • Operational Reporting [ONLY CLOUDERA]: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user

  50. Cloudera Support
      Our team of experts on call to help you meet your SLAs
      Feature: Benefit
      • Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
      • Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
      • Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
      • Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
      • Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
      • Proactive Notification of New Developments and Events: Stay up to speed with what’s going on in the Apache Hadoop community
