Data Management Platform on Hadoop

Presentation Transcript


  1. Data Management Platform on Hadoop (Incubating). Srikanth Sundarrajan, Venkatesh Seetharam

  2. whoami

  3. Agenda: 1 Motivation, 2 Falcon Overview, 3 Case Studies, 4 Questions & Answers

  4. MOTIVATION

  5. Data Processing Landscape: Data Processing (Transform/Pipeline); Acquire (Import); External data source; Replicate (Copy); Export; Archive; Eviction

  6. Core Services

  7. Process Management – Relays (picture courtesy: http://istockphoto.com/)

  8. Late Data Management (picture courtesy: http://iwebask.com)

  9. Data Retention As Service (picture courtesy: http://vimeo.com/)

  10. Data Replication As Service (picture courtesy: http://boylesmedia.com)

  11. Data Acquisition As Service (picture courtesy: http://wmpu.org)

  12. Operability – Dashboard (picture courtesy: http://www.opentrack.ch/)

  13. FALCON OVERVIEW

  14. Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)

  15. Entity Dependency Graph: Hadoop / HBase … Cluster; External data source; feed; Process (edges labeled "depends")
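The dependency graph on slide 15 can be sketched in a few lines. This is a minimal illustration, not Falcon code: it assumes feeds depend on a cluster and a process depends on its cluster and feeds, and every entity name below is hypothetical. It shows how such a graph yields a safe submission order and answers "what still references this entity?".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical entities: each key depends on the entities in its set.
dependencies = {
    "cluster:primary": set(),
    "feed:raw-clicks": {"cluster:primary"},
    "feed:hourly-summary": {"cluster:primary"},
    "process:summarizer": {
        "cluster:primary",
        "feed:raw-clicks",
        "feed:hourly-summary",
    },
}


def submission_order(deps):
    """Return an order in which every entity follows what it depends on."""
    return list(TopologicalSorter(deps).static_order())


def dependents_of(deps, entity):
    """Return the entities that reference `entity` directly."""
    return sorted(name for name, needs in deps.items() if entity in needs)


order = submission_order(dependencies)
```

With a graph like this, an entity would only be safe to delete once `dependents_of` returns an empty list for it.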

  16. High Level Architecture: Apache Falcon; Hadoop; Config store; Entity; Oozie; CLI/REST; Entity status; HCatalog; Process status / notification; Messaging; JMS

  17. Feed Schedule: Cluster xml; Falcon; Falcon config store / Graph; Feed xml; Retention / Replication workflow; Oozie Scheduler; HDFS; Instance Management; Catalog service; JMS Notification per action

  18. Process Schedule: Cluster/feed xml; Falcon; Falcon config store / Graph; Process xml; Process workflow; Oozie Scheduler; HDFS; Instance Management; Catalog service; JMS Notification per available feed

  19. Physical Architecture: Falcon Colo1; Scheduler; Scheduler; Falcon Colo2; Falcon – Prism (Global view); Falcon Colo3; Scheduler

  20. CASE STUDY: Multi Cluster Failover

  21. Multi Cluster – Failover
      Diagram: Primary Hadoop Cluster (Staged Data, Cleansed Data, Conformed Data, Presented Data, BI and Analytics); Replication; Failover Hadoop Cluster (Staged Data, Presented Data)
      • Falcon manages workflow, replication or both.
      • Enables business continuity without requiring full data reprocessing.
      • Failover clusters require less storage and CPU.

  22. Retention Policies
      Diagram: Staged Data (Retain 5 Years); Cleansed Data (Retain 3 Years); Conformed Data (Retain 3 Years); Presented Data (Retain Last Copy Only)
      • Sophisticated retention policies expressed in one place.
      • Simplify data retention for audit, compliance, or for data re-processing.
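To make "retention policies expressed in one place" on slide 22 concrete, here is a small sketch. It is not Falcon's implementation or configuration format. The pairing of data sets with retention periods is read off the slide in listing order, which is an assumption, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

# One table holds every policy. The pairing follows the slide's listing order
# (an assumption). None means "retain last copy only".
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}


def instances_to_evict(dataset, instance_times, now):
    """Return the instance timestamps that fall outside the dataset's policy."""
    limit = RETENTION[dataset]
    ordered = sorted(instance_times)
    if limit is None:
        # Keep only the most recent copy.
        return ordered[:-1]
    return [t for t in ordered if now - t > limit]
```

Keeping the periods in a single table means an eviction job never hard-codes a retention period of its own, which is the point the slide makes.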

  23. CASE STUDY: Distributed Processing Example: Digital Advertising @ InMobi

  24. Hadoop @ InMobi
      • About InMobi
        • World's leading independent mobile advertising company
      • Hadoop usage at InMobi
        • ~ 6 clusters
        • > 1PB of storage
        • > 5TB new data ingested each day
        • > 20TB data crunched each day
        • > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
        • > 175K Hadoop jobs / day
        • > 60K Oozie workflows / day
        • 300+ Falcon feed definitions
        • 100+ Falcon process definitions

  25. Processing – Single Data Center: Ad Request data; Impression render event; Click event; Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary

  26. Global Aggregation: Data Center 1 … Data Center N, each with Ad Request data; Impression render event; Click event; Conversion event; Continuous Streaming (minutely); Enrichment (minutely/5 minutely); Summarizer; Hourly summary. Output: Consumable global aggregate

  27. HIGHLIGHTS

  28. Future: 1 Security, 2 Embed Pig/Hive scripts, 3 Data Acquisition – file-based, 4 Monitoring/Management Dashboard

  29. Summary

  30. Questions?
      • Apache Falcon
        • http://falcon.incubator.apache.org
        • mailto: dev@falcon.incubator.apache.org
      • Srikanth Sundarrajan
        • sriksun@apache.org
        • #sriksun
      • Venkatesh Seetharam
        • venkatesh@apache.org
        • #innerzeal