Hortonworks: We Do Hadoop

Presentation Transcript

  1. Hortonworks: We Do Hadoop
Our mission is to enable your Modern Data Architecture by delivering One Enterprise Hadoop.
January 2014

  2. Our Mission: Enable your Modern Data Architecture by delivering One Enterprise Hadoop
Our Commitment:
• Open Leadership: Drive innovation in the open, exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA. Employees: 250+ and growing. Trusted partners.

  3. A Traditional Approach Under Pressure
APPLICATIONS: Custom Applications, Packaged Applications, Business Analytics
DATA SYSTEM REPOSITORIES: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Data growth (Source: IDC): 2.8 ZB in 2012; 40 ZB by 2020; 85% from new data types; 15x machine data by 2020

  4. Modern Data Architecture Enabled
APPLICATIONS: Custom Applications, Packaged Applications, Business Analytics
OPERATIONAL TOOLS: Manage & Monitor. DEV & DATA TOOLS: Build & Test
DATA SYSTEM REPOSITORIES: RDBMS, EDW, MPP
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

  5. Drivers of Hadoop Adoption
• A Modern Data Architecture: Complement your existing data systems: the right workload in the right place
• New Business Applications
• Architectural

  6. Most Common NEW TYPES OF DATA
• Sentiment: Understand how your customers feel about your brand and products – right now
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines
• Geographic: Analyze location-based data to manage operations where they occur
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
Added value: keep existing data longer!

  7. Enterprise Requirements: Key Services
1 Key Services: Platform, operational and data services essential for the enterprise
2 Skills: Leverage your existing skills: development, analytics, operations
3 Integration: Interoperable with existing data center investments
HORTONWORKS DATA PLATFORM (HDP):
• OPERATIONAL SERVICES: AMBARI, OOZIE, FALCON*
• DATA SERVICES: PIG, HIVE & HCATALOG, HBASE, KNOX*; load & extract with FLUME and SQOOP
• PLATFORM SERVICES (CORE): MAPREDUCE, TEZ, YARN, HDFS (NFS, WebHDFS); Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• Deployment: OS/VM, Cloud, Appliance

  8. Powering the Modern Data Architecture
HADOOP 1.0 (Single Use System: batch apps): MapReduce (distributed data processing & cluster resource management); HDFS 1 (redundant, reliable storage)
HADOOP 2.0 (Multi Use Data Platform: batch, interactive, online, streaming, …): interact with all data in multiple ways simultaneously
• Data processing frameworks (Hive, Pig, Cascading, …): Batch (MapReduce), Interactive (Tez)
• Standard SQL processing: Hive
• Online data processing: HBase, Accumulo, others
• Real-time stream processing: Storm
• Cluster resource management: YARN
• Redundant, reliable storage: HDFS 2

  9. Apache YARN: the data operating system for Hadoop 2.0
• Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
• Efficient: Double the processing in Hadoop on the same hardware while providing predictable performance & quality of service
• Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data processing engines run natively in Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), ONLINE (HBase), STREAMING (Storm, S4, …), GRAPH (Giraph), OTHERS (Microsoft REEF, SAS LASR/HPA)
YARN: cluster resource management. HDFS2: redundant, reliable storage.

  10. Driving Our Innovation Through Apache
[Charts: total net lines contributed to Apache Hadoop (449,768, 614,041 and 147,933 lines); total committers across Apache projects; end users]
Hortonworks’ mission is to enable your modern data architecture by delivering one Enterprise Hadoop that deeply integrates with your data center technologies.

  11. Enterprise Requirements: Leverage Skills
1 Key Services: Platform, operational and data services essential for the enterprise
2 Skills: Leverage your existing skills: development, analytics, operations
• DEVELOP: collect, process, build
• ANALYZE: explore, query, deliver
• OPERATE: provision, manage, monitor
3 Integration: Interoperable with existing data center investments

  12. Enterprise Requirements: Leverage Skills
1 Key Services: Platform, operational and data services essential for the enterprise
2 Skills: Leverage your existing skills: development, analytics, operations
• DEVELOP: collect, process, build
• ANALYZE: explore, query, deliver (e.g. BusinessObjects BI)
• OPERATE: provision, manage, monitor
3 Integration: Interoperable with existing data center investments

  13. Enterprise Requirements: Integration
3 Integration: Interoperable with existing data center investments. Integrate with:
• Applications: business intelligence, developer IDEs, data integration
• Systems: data systems & storage, systems management
• Platforms: operating systems, virtualization, cloud, appliances
APPLICATIONS: Custom Applications, Packaged Applications, Business Analytics. OPERATIONAL TOOLS: Manage & Monitor. DEV & DATA TOOLS: Build & Test. DATA SYSTEM REPOSITORIES: RDBMS, EDW, MPP. SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

  14. One Hadoop: Deep Integration
APPLICATIONS. DEV & DATA TOOLS. OPERATIONAL TOOLS. DATA SYSTEM: RDBMS, EDW, MPP, HANA, BusinessObjects BI. INFRASTRUCTURE. SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

  15. Sensor Data Monitors Buildings for Efficiency
Improving Efficiency. Data: Sensor
Problem: Managing service calls on HVAC in commercial buildings
• More than 70K systems in buildings around the US
• Systems transmit data, but it is mostly kept on site or discarded
• Servicing costs are high, due to limited data on each unit
• Data on work orders, sales orders and service orders is stored in different databases and not correlated
Solution: Data consolidation and predictive analytics for efficiency
• Raw data from HVAC sensors will land in HDP, along with work order, sales order and service call data
• System will predict component failures, enabling product upsell (increased revenue), service call efficiency (reduced costs) and management insight for a new service offering
Customer: Building management; building efficiency and power solutions. >$420B in revenue; >140 employees

  16. Sensor Data From Smart Electricity Meters
Improving Efficiency. Data: Sensor
Problem: Utility needs to match electricity supply with demand
• Utilities cannot store power; it needs to be used
• Some energy load is predictable, some is unpredictable
• Overproduction requires cutting back, running below capacity
• Underproduction risks starting less efficient “peaker plants”
• Smart meter data allows real-time analysis that can help effectively match energy production with consumption
Solution: Predict demand spikes by analyzing real-time sensor data
• Hive + Storm on YARN streams data into Hadoop
• R + Mahout analyze aggregate consumption trends for predictive algorithms
• More effective matching of energy production and consumption reduces energy costs and emissions
Customer: Energy; one of the world’s largest producers of electricity. >$100B in revenue; >39 million customers; >150K employees
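The streaming piece of this pattern can be sketched without a cluster. Below is a minimal Python stand-in for the kind of per-meter logic a Storm bolt might apply before R and Mahout model the aggregate trends; the `window` and `threshold` values, and the sample readings, are illustrative assumptions, not figures from the slide.

```python
from collections import deque

def spike_detector(readings, window=4, threshold=1.5):
    """Flag readings that exceed `threshold` times the rolling mean.

    A hedged stand-in for per-meter Storm bolt logic; parameters are
    illustrative, not from the deck.
    """
    recent = deque(maxlen=window)
    spikes = []
    for t, kwh in readings:
        # Only compare once a full window of history has accumulated.
        if len(recent) == window and kwh > threshold * (sum(recent) / window):
            spikes.append((t, kwh))
        recent.append(kwh)
    return spikes

readings = [(0, 1.0), (1, 1.1), (2, 0.9), (3, 1.0), (4, 3.2), (5, 1.0)]
print(spike_detector(readings))  # → [(4, 3.2)]
```

In a real topology each meter's state would live in a bolt and flagged spikes would be emitted downstream; the windowed-mean comparison is the same.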

  17. Powering Music Recommendations
Creating Opportunity. Data: Clickstream & Server Log
Problem: CDH cluster failed, causing downtime
• Highly technical team was running a CDH cluster without support
• CDH failed; the CTO asked the team to research support options
• Hive table stores data on all music streamed by users
• Data in Hive is mission-critical: used to recommend music and to pull the monthly reports used to pay each music label
• Data expertise is their only sustainable competitive advantage
Solution: HDP powers the music recommendation engine
• Stable recommendation engine and reconciliation reports
• Proactive technology partnership with their engineers, who are consumers of and contributors to Hadoop
• Twice a year, Hortonworks reviews the cluster for optimization
• Data was migrated from CDH to HDP quickly and easily
Customer: Entertainment; online music streaming. >$500M in revenue; >24M users

  18. Donor and Voter Analytics for Political Org
Creating Opportunity. Data: Unstructured
Problem: Limited insight into donor behavior & voter mobilization
• Fundraising phone services lack analysis of why donors give
• For campaign management, needed analysis of what factors cause constituents to register and vote
• Client knew they needed Hadoop for storage and analysis
• Needed education on roadmap, use cases and execution
Solution: Donor data store improves revenue from tele-fundraising
• Speed: rapid delivery of the donor data store
• Deployment flexibility: runs in a Windows environment
• Targeted: phone reps talk to donors about the issues important to them
• Discovery: explore and enrich data from campaign operations
Customer: Fundraising; political organization dedicated to tele-fundraising, voter contact and media services. >$1M in revenue; ~100 employees

  19. Analysis of Gamer Data for Future Innovation
Creating Opportunity. Data: ETL
Problem: Social gaming platform needs more storage, more stability
• 4 million monthly gamers generate customer interaction data
• Existing CDH cluster was going down every month
• Desired tight integration with Datameer analytics tools
• Needed interactive query; Impala was not meeting that need
• Rapidly growing user base; need to manage the cluster as it scales
Solution: HDP for stability at scale, tight integration with Datameer
• Stable cluster that doesn’t fall down like CDH did
• Easy data extracts from SQL Server
• Datameer analytics tools certified on HDP
• High-performing Hive queries
• Ambari for provisioning and maintenance as the cluster scales up
Customer: Gaming; online strategy & role-playing games. ~4M users; ~$325M in revenue; ~500 employees

  20. Clearing the Federal ETL Consulting Backlog
Improving Efficiency. Data: ETL
Problem: Federal consulting practice faces an ETL backlog
• Sequestration budget cuts created demand for ETL offload from SAS
• Consulting practice faces a backlog worth millions of dollars, consulting on SAS offload at 20 federal civilian agencies
• After offload, all data must still be easily accessible
Solution: Rationalized data storage saves taxpayer money
• Federal civilian agencies reduce ongoing data storage costs
• No loss of data or disruption to operations
• Base SAS and SAS/ACCESS are two out-of-the-box solutions for connectivity between SAS and Hadoop, via Hive
Customer: Government; professional service provider consulting on federal projects. >$13B in revenue; >50K employees

  21. Sentiment Analysis for Government Programs
Creating Opportunity. Data: Social
Problem: Ministry of Education felt removed from public sentiment on programs
• In-person events lacked reach and persistence
• Ministry of Education wanted to understand sentiment from the citizenry on specific issues such as childhood obesity
• Two dedicated analysts pored over the social media stream and provided daily reports to a member of parliament
• IT team sought improvement over the limitations of manual analysis
Solution: Powerful “same day” sentiment analysis helps outreach
• Team produces daily memos on public sentiment, now with:
• Reach: includes opinions from a broader base of the citizenry
• Confidence: more data, more confidence in opinion analysis
• Frequency: daily reads show policy-makers changes over time
• Precision: allows micro-analysis of specific issues and geographies
• Solution aligns with the government’s support for open source
• Individual social media authors receive invitations to in-person meetings with government ministers
Customer: Government; European national government

  22. Sensor Data for Healthcare Supply Chain
Improving Efficiency. Data: Sensor
Problem: Medical products have limited shelf life; tracking is essential
• Medical products delivered to pharmacies and hospitals
• Epidemics require agile changes to delivery schedules
• Materials are time-sensitive and climate-controlled
• Delivery logistics are complex and subject to risks outside the company’s control (product availability, weather, traffic, etc.)
• Slow delivery can harm supplies and medical outcomes
Solution: Sensor data protects the supply chain, improves efficiency
• Sensor data from individual items and vehicles will give the company unprecedented supply chain visibility
• Analytic platform enables predictive algorithms for infrastructure planning, disease forecasting and supply chain forecasts
• Better tracking reduces waste, improves customer confidence and patient health
Customer: Healthcare; supplier of pharmaceuticals & medical products to pharmacies & hospitals. >$100B in revenue; >30K employees

  23. Predictive Analytics & Real-time Monitoring
Improving Efficiency. Data: Sensor, Social & ETL
Problem: Unable to store sufficient data for decision support
• 22 years of data for 1.2 million patients, ~9 million records
• Data on the legacy system was neither searchable nor retrievable
• Cohort selection for research projects was slow
• For decision support, clinicians had minimal access to historical data gathered across all patients
Solution: Unified repository provides data to both researchers & clinicians
• “View only” legacy system retired, saving $500K
• 9 million historical records now searchable & retrievable
• Records stored with patient identification for clinical use; the same data is presented anonymously to researchers for cohort selection
• Social data and sensor data now saved and incorporated in analysis
• Real-time monitoring: patches record vital signs every minute; algorithms notify clinicians if numbers cross risk thresholds
• Readmission reduction: heart patients weigh themselves daily; algorithms notify doctors about unsafe weight changes
Customer: Healthcare; public university teaching hospital, consistently rated by US News & World Report as among America’s best hospitals. >17K patient admissions; >400 physicians; ~12K surgeries (‘12)

  24. Affordable, Scalable Data for Healthcare Analytics
Creating Opportunity. Data: ETL
Problem: Relational database architecture limited data exploration
• Develops and maintains analytic applications for doctors
• Company couldn’t access the volume or variety of data it wanted for those applications
• Analyzing huge data sets on relational databases was too slow
Solution: HDP provides cost savings and flexibility at scale
• Per-node TCO of data on HDP was 25% that of the current relational DB
• Open-source Hadoop ecosystem gives multiple hardware and software integration options as the company scales its architecture
Customer: Healthcare; analytics tools and decision support for the healthcare industry. ~$130M in revenue; >2K employees

  25. Data Science on Text-based Claims Records
Improving Efficiency. Data: Unstructured
Problem: Claims data in PDFs, hard to identify coding errors
• Produces applications for medical decision support
• Goal is marrying electronic health records with claims data
• 300K daily connections with individuals around unstructured data in PDFs (claims records and patient-reported outcomes)
• Data analysis is disjointed; difficult to identify patients and events that have been mis-coded or incompletely coded
Solution: Datasets unified in Hadoop to improve health outcomes
• Optical character recognition & natural language processing
• All of the unstructured, text-based data stored on HDP
• Coding errors will be identified much more efficiently
• Partially coded records can also be identified
• Coding efficiency will improve revenue
• Analysis of the underlying data will improve health outcomes
Customer: Insurance (health); large US medical insurer. >$100B in revenue; >100K employees

  26. Insurance Data Lake to Manage Risk
Creating Opportunity. Data: Structured, Clickstream & Server Log
Problem: Challenges merging new & old data hamper analysis
• Traditional and newer types of data were both growing quickly but were difficult to combine in the EDW
• “Schema on load” requirements of the EDW platform limited ingest of some data with significant predictive power
• Company missed data-driven ways to serve customers
• Process of separating legitimate from fraudulent claims created a “needle-in-a-haystack” problem
Solution: Common platform for all types of data improves up-sell and reduces fraud
• “Schema on read” Hadoop architecture means that more data sources can be easily ingested to enrich predictive analytics
• Agents use big data insights to determine the best action for valued customers and recommend it in real time
• Claims analysts and underwriters process streaming data to quickly flag fraud risks and fast-track legitimate claims
Customer: Insurance (health); large US medical insurer. >$30B in revenue; >20M members; ~35K employees
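The “schema on load” vs. “schema on read” distinction is easy to see in miniature. In this hedged Python sketch (the claim records and the `read_with_schema` helper are invented for illustration, not an API from the slide), raw events are stored exactly as they arrive, and a schema is only projected onto them at query time, so a record with a missing field is never rejected at ingest.

```python
import json

# Raw events land untouched (as they would in HDFS); no schema enforced on load.
raw_lines = [
    '{"claim_id": 1, "amount": 250.0, "channel": "web"}',
    '{"claim_id": 2, "amount": 980.5}',                      # missing field: still ingested
    '{"claim_id": 3, "amount": 120.0, "channel": "phone", "note": "follow up"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: project each record onto `fields`,
    filling gaps with None. Hypothetical helper illustrating the idea."""
    records = (json.loads(line) for line in lines)
    return [{f: r.get(f) for f in fields} for r in records]

for rec in read_with_schema(raw_lines, ["claim_id", "channel"]):
    print(rec)
```

Under schema-on-load, the second record would have been rejected or forced into a rigid table at ingest; here it simply surfaces with `channel=None` when that projection is requested.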

  27. Speeding Analysis for Usage-Based Insurance
Creating Opportunity. Data: Clickstream & ETL
Problem: Risk analysis lagged because of architecture gaps
• Business insight from data analysis was too slow
• Growing volume, velocity and variety of incoming data taxed existing systems and processes
• ETL process across disparate systems only captured 25% of the dataset and took 5-7 days to complete
Solution: Speed time-to-insight with clickstream analytics & faster ETL
• Clickstream analytics: moving from a hosted Azure platform to HDP on site will improve performance and analytical functions (with Apache Hive)
• ETL acceleration: process 100% of the data, in three days or less
Customer: Insurance (property & casualty); personal auto & other property-casualty insurance. >$17B in revenue; ~28K employees

  28. Data Lake for P&C Insurance Claim Analysis
Improving Efficiency. Data: Structured, Social & Unstructured
Problem: Structured data analysis scaled, unstructured data analysis did not
• Large P&C insurance provider had systems for analyzing structured data at scale
• Unstructured data from claims notes and social media had the potential to add valuable information to claims analysis
• Structured data analysis scaled, but joining this information with hand-written or social media data did not
• Limited data visibility hampered underwriting and claims
Solution: Merge structured & unstructured data for better decisions
• “Schema on read” Hadoop architecture means that more data sources can be easily ingested (text and social media)
• Previously disparate data sets are joined for greater insight
• Larger data sets fed to front-end business tools provided by Hortonworks partners: SAS, Tableau and QlikView
Customer: Insurance (property & casualty); major provider of property casualty, life and mortgage insurance. >$65B in revenue; >60K employees; operations in >100 countries

  29. Maintaining SLAs for Equity Trading Information
Improving Efficiency. Data: Server Logs & ETL
Problem: Meeting 12-millisecond SLAs for the “ticker plant”
• Daily ingest: 50GB of server log data from 10,000 feeds
• Four times daily, this data is pushed into DB2
• Applications query this data 35K times per second
• 70% of queries are for data <1 year old, 30% for data >1 year old
• Current architecture can only hold 10 years of trading data
• Growing volume puts performance at risk of missing SLAs
Solution: Meeting SLAs with confidence
• HBase provides super-fast queries within SLA targets
• ETL offloading to Hadoop allows longer data retention without jeopardizing fast response times
Customer: Investment services; highly trafficked website providing business and financial information. ~15K employees
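How HBase can hit millisecond targets on the recent-heavy query mix above comes down largely to row-key design. One common pattern, sketched here in plain Python with an invented `ACME` feed (HBase itself is not involved, and this key layout is an assumption, not the customer's actual schema), is to embed a reversed timestamp in the key so the newest ticks sort first and a short prefix scan serves the hot 70% of queries.

```python
# HBase returns rows sorted by key, so for workloads that mostly read
# recent data, a common trick is to key rows by (MAX - timestamp):
# the newest rows sort first under the symbol prefix.
MAX_TS = 10**10  # illustrative ceiling larger than any real timestamp

def row_key(symbol, ts):
    # Zero-padded so string order matches numeric order.
    return f"{symbol}:{MAX_TS - ts:010d}"

rows = {}
for sym, ts, price in [("ACME", 100, 9.5), ("ACME", 200, 9.7), ("ACME", 300, 9.6)]:
    rows[row_key(sym, ts)] = price

# A scan from the symbol prefix yields the newest ticks first.
latest = [rows[k] for k in sorted(rows) if k.startswith("ACME:")]
print(latest)  # → [9.6, 9.7, 9.5] (newest to oldest)
```

In HBase the `sorted(rows)` step is free: a `Scan` over the `ACME:` prefix walks keys in exactly this order, so "latest N ticks" needs only a tiny, bounded read.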

  30. Banking Data Lake for 100s of Use Cases
Creating Opportunity. Data: Server Log
Problem: Architecture unsuited to capitalize on server log data
• Huge investments company generates valuable data assets which are largely unavailable across the organization
• Current EDW solutions are appropriate for some data workloads but too expensive for others
• Financial log data is difficult to aggregate & analyze at scale
• Short retention hampers price history & performance analysis
• Limited visibility into the cost of acquiring customers
Solution: Multi-tenant Hadoop cluster to merge data across groups
• Server log data will be merged with structured data to uncover trends across assets, traders and customers
• ETL offload will save money for Hadoop-appropriate workloads
• Longer data retention enables price history analysis
• Joining data sets for insight into customer acquisition costs
• Accumulo enforces read permissions on individual data cells
Customer: Investment services; global investments company. >$1.5 trillion assets under management; >$14B in revenue; ~50K employees

  31. Anti-Laundering & Trade Surveillance
Creating Opportunity. Data: Structured
Problem: Lags in the back office system limit intraday risk analysis
• 15M transactions and 300K trades every day
• Storage limitations required archiving, limiting data availability
• Trading data not available for risk analysis until end of day, which hampers intraday risk analytics and creates a time window of unacceptable exposure
Solution: Data lake accelerates time-to-analytics & extends retention
• Shared data repository combines more comprehensive data sets about all firm activities, improving data transparency
• Operational data available to risk analysts earlier, same day
• Trading risk group will process more position, execution and balance data and hold that data for five years
• Hadoop enables ingest of data from recent acquisitions despite disparate data definitions and infrastructures
Customer: Investment services; trading services for millions of client accounts. >$16B in assets; >4,000 advisors

  32. Customer Insight Through Product Usage Data
Creating Opportunity. Data: Geolocation, Clickstream, Server Log, Sensor & Unstructured
Problem: Lacked a central repository for efficient data storage & analysis
• Rivers of data flow from millions of consumer electronic products
• Company lacked a platform to capture new types of data: geolocation, clickstream, server log, sensor & unstructured
• Unable to exploit a key competitive advantage: unique customer insight from troves of big data
Solution: Efficient data storage unlocks value in company data
• Hadoop data lake permits a view into how customers use products, across multiple types of data
• Lower cost of storage improves the margin for retaining data
• Powerful cluster includes many key ecosystem projects: Hive, HBase, HCatalog, Pig, Flume, Sqoop, Ambari, Oozie, Knox, Falcon, Tez and YARN
Customer: Manufacturing; consumer electronics. >$180B in revenue; >400K employees

  33. Optimizing High-Tech Manufacturing
Improving Efficiency. Data: Sensor
Problem: Data scarcity for root cause analysis of product defects
• 200 million digital storage devices manufactured yearly
• Devices not passing QA are scrapped at the end of the line
• >10K faulty devices returned by customers every month
• Limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections)
• Subset of sensor data from QA testing retained 3-12 months
Solution: Data retention doubled, with 10x processing improvement
• Repository of sensor data now holds a larger portion of total data
• Dashboard created 10x more quickly than before Hadoop
• Data retained for at least 24 months
• Manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second
Customer: Manufacturing; digital storage devices. >$15B in revenue; >85K employees

  34. Social Site Speeds Processing, Reduces Cost
Creating Opportunity. Data: Clickstream & Server Log
Problem: Data growth outpaced the existing Greenplum solution
• 20M monthly unique visitors, and growing
• Greenplum storage solution was slow and expensive
• Operations team challenged by data growth
• Analytics team hampered by slow processing speed
Solution: Processing speeds doubled, storage cost decreased
• Operations team saw processing speeds 2x those of Greenplum
• Significant cost savings from moving data to HDP
• During this second year of the support relationship, plans to move more workloads to HDP, for better insights at a lower cost
Customer: Online community; online social network. >$50M in revenue; >300M members; 2nd year with Hortonworks

  35. Powering Professional Network Recommendations
Creating Opportunity. Data: Clickstream, Server Log & Social
Problem: Lack of a recommendation engine to promote connections
• >13M non-English-speaking members find jobs & connections
• User interactions generate semi-structured data
• Clickstream, server log and social data could feed recommendations
• Company lacked a stable platform to store, refine & enrich that raw data
Solution: Hadoop recommendation engine to compete with LinkedIn
• Replaced the existing CDH cluster
• New types of data feed a superior recommendation engine that enhances the value of belonging to the community
• YARN, Tez and the Stinger initiative provide near-term functionality and long-term confidence
Customer: Online community; online professional network. >$90M in revenue; >13M members

  36. Better Romantic Matches with Data Science
Creating Opportunity. Data: Server Logs & ETL
Problem: Newer types of data unavailable for matchmaking algorithms
• Unable to store clickstream data and user-entered content
• Other types of data only retained for seven days
• Recommendations would help users craft attractive profiles
• High costs to store an ever-growing amount of data
• Relational data platform did not fulfill their requirements
Solution: Hadoop cluster for A/B testing, device analysis, text mining
• A/B testing: consolidate email & clickstream data from SQL databases
• Usage patterns across devices, browsers and applications; understand who uses their mobile app
• Mine user-created text (profile language and user-to-user communications) for the recommendation engine
• Longer data retention: find subtle trends with a longer time window
Customer: Online community; online dating site. >300 employees

  37. 360° View of Customer for Call Center Sales
Creating Opportunity. Data: Unstructured
Problem: Call center sales reps unable to recommend the best product
• 2000+ product lines
• Multiple customer interaction channels (web, Salesforce, face-to-face, phone)
• Poor visibility causes sales reps to miss opportunities, and customer satisfaction suffers
Solution: Improve sales conversions with optimal product recommendations
• Call center reps will understand every interaction with the customer, to improve service calls
• Natural language analysis of rep emails to customers identifies the best response language and coaching opportunities
• Recommendation engine predicts the next best product for each customer
Customer: Retail; IT solution and equipment reseller. >$10B in revenue; >6K employees

  38. 360° Customer View for Home Supply Retailer
Creating Opportunity. Data: Clickstream, Unstructured & Structured
Problem: Lack of a unified customer record across all channels
• Global distribution online, in home and across 2000+ stores
• Unable to create a “golden record” for analytics on customer buying behavior across all channels
• Data repositories on website traffic, POS transactions and in-home services existed in isolation from each other
• Limited ability for targeted marketing to specific segments
• Data storage costs increasing
Solution: HDP delivers targeted marketing & data storage savings
• Golden record enables targeted marketing capabilities: customized coupons, promotions and emails
• Data warehouse offload saved millions in recurring expense
• Customer team continues to find unexpected, unplanned uses for its 360-degree view of customer buying behavior
Customer: Retail; major home improvement retailer. >$74B in revenue; >300K employees; >2,200 stores

  39. Using In-Store Location Data to Improve Cross-Sell
Creating Opportunity. Data: Sensor & Geolocation
Problem: Retailer lacks data on how customers move through stores
• Placement of product within department stores affects sales
• Sales data is not specific enough to suggest specific changes
• Online retailers can compare what shoppers view with what they buy, but retailers lack this insight in brick-and-mortar stores
• Result: critical decisions about store layout and inventory are made with limited data
Solution: Micro-data on shopper location enables in-store analysis similar to website analysis: locations visited vs. purchases
• Apple iBeacon app captures in-store location data for shoppers that have the app on their iPhones
• Data streams into HDFS on how customers move through stores, relative to the location of particular products
• Enables real-time promotions to customers with smartphones, based on who they are and where they stand in the store
• Historical data across all shoppers and their purchases provides valuable insight regarding store design
Customer: Retail; major omni-channel retailer. >$27B in revenue; >175K employees; >800 stores

  40. Unified Data for Online Recommendation Engine
Creating Opportunity. Data: Structured, Clickstream, Server Log & Unstructured
Problem: 5 data sets are fragmented, hampering product recommendations
• 5 major data sets: inventory data, transactional data, user behavior data, customer profiles & log data
• Unified view needed to recommend items to users
• Currently lack an analytics dashboard across all types of data
• Storing non-transactional data on the EDW is expensive
Solution: Unified data lake for increased sales and lower costs
• Unified 360° view for recommendations of similar products
• Analytics dashboard joins clickstream with transactional data
• Summary data stored in HBase can be queried by web apps
• Offload some data from the Teradata EDW to lower storage costs
• Actively partnering with engineers to improve Hadoop
Customer: Retail; eCommerce marketplace. >$12B in revenue; >30K employees

  41. Predicting Car Prices With High Confidence
Creating Opportunity. Data: Server Logs & ETL
Problem: Achieving 99.1% confidence in car price estimates
• Goal is to provide consumers & dealers reliable car price guides
• Promise: 99.1% confidence that the projected price paid will be within $20 of the average national price paid in a given week
• As the network of dealers grew, the existing SQL Server data warehouse was expensive and difficult to scale
Solution: Cost savings & data reliability at scale in a data lake
• Mission-critical price data moved to a Hadoop architecture
• Server log data flows into HDP with Flume
• Analysis of this data allows analysts to further improve the accuracy of estimates
Customer: Retail; online eCommerce service for buying and selling cars. ~300 employees
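The 99.1% promise above is a statistical claim, and the sample-size behavior behind it can be sketched with a standard normal-approximation confidence interval (z ≈ 2.61 for a two-sided 99.1% interval). This is a generic illustration of the math with invented sample prices, not the company's actual model.

```python
import math
from statistics import mean, stdev

def price_interval(prices, z=2.61):
    """Mean price and half-width of a ~99.1% confidence interval,
    via the normal approximation. Generic statistics, illustrative data."""
    margin = z * stdev(prices) / math.sqrt(len(prices))
    return mean(prices), margin

# More observed sales in a week shrink the margin toward the $20 target,
# roughly as 1/sqrt(n).
few = [21000, 21400, 20800, 21200] * 2    # 8 sales
many = few * 25                            # 200 sales, same spread
print(price_interval(few)[1] > price_interval(many)[1])  # → True
```

The data-lake angle is that retaining every observed transaction (rather than a sample) is what pushes n up and the margin down, which is why longer, fuller retention directly serves the accuracy promise.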

  42. Recommendation Engine Improves Conversion
Improving Efficiency. Data: ETL
Problem: Need to create better product recommendations
• Multiple touch points: store, kiosk, web and mobile app
• Wants to offer customized promotions, coupons & recommendations
• Data was not integrated, making a 360° view of customer behaviors impossible
Solution: Recommendations to all channels, based on a data lake
• Ingest all raw data from different product lines into HDP: real-time data ingestion and structured data ingestion
• Transform raw data: ETL processing with Pig and Hive
• Use Mahout and R to make recommendations
• Recommendations will be fed to all channels: HBase serves recommendations to the web site, kiosk and mobile app
Customer: Retail; specialty department store. >$19B in revenue; >130K employees
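The recommendation step can be illustrated with the co-occurrence intuition behind Mahout-style item-based recommendation: count how often items appear in the same basket, then rank a product's most frequent companions. The baskets and item names below are invented for illustration; a real deployment would run this at scale in Mahout or R and serve the results from HBase, as the slide describes.

```python
from collections import Counter
from itertools import combinations

# Toy purchase baskets (invented); each set is one customer transaction.
baskets = [
    {"boots", "socks", "jacket"},
    {"boots", "socks"},
    {"jacket", "scarf"},
    {"boots", "scarf"},
]

# Count item-to-item co-occurrences in both directions.
cooc = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(item, k=2):
    """Items most often bought alongside `item`."""
    scores = Counter({b: n for (a, b), n in cooc.items() if a == item})
    return [b for b, _ in scores.most_common(k)]

print(recommend("boots"))  # "socks" ranks first: bought with boots twice
```

The same co-occurrence matrix, precomputed in batch, is exactly the kind of artifact that gets written into HBase so that the web site, kiosk and mobile app can look up recommendations with a single row read.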

  43. Faster Real Estate Reports for Agents
Improving Efficiency. Data: Clickstream & ETL
Problem: Accelerate reports on movers for real estate agents
• 20 million monthly visitors to the family of websites
• Reports on movers not consistently generated quickly enough
• Pressure from newer market entrants
• High data storage costs reduce margins on data
Solution: More data for faster reports at a lower cost
• Improved analytical efficiency speeds report turnaround
• Data storage costs lower than before
• Improved visibility into macro trends in real estate
• Refine, explore and enrich the data better than competitors
Customer: Software; operator of real estate websites. ~$200M in revenue; >1,000 employees

  44. Unified View Across Products, for Product Managers
  Creating Opportunity | Data: ETL
  Problem: Data fragmentation across products and verticals
  • More than 20 product lines
  • Multiple verticals: retail, financial services, healthcare, manufacturing, communications, utilities & government
  • Each product line has a separate data repository
  • Unified analysis across product lines was impossible
  Solution: Data consolidation for cross-product customer analysis
  • Product managers will have unified data for analysis
  • Raw data from different products will land in HDP, then be refined and transformed
  • Real-time data ingestion with Flume
  • Batch data movement with Sqoop
  • ETL processing with Pig and Hive
  Software: Data security software, cloud computing, ~$130M in revenue, ~1,100 employees
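The refine-and-transform step on this slide amounts to mapping each product line's raw records onto one shared customer schema so cross-product analysis becomes possible. A minimal Python sketch of that normalization, with field names and records invented for illustration:

```python
# Illustrative raw records from two product lines with incompatible schemas.
product_a = [{"cust_id": "17", "email": "a@example.com", "spend_usd": "120.50"}]
product_b = [{"customerId": 17, "contact": "a@example.com", "revenue_cents": 9900}]

def from_product_a(rec):
    # Product A stores everything as strings.
    return {"customer_id": int(rec["cust_id"]),
            "email": rec["email"],
            "spend_usd": float(rec["spend_usd"])}

def from_product_b(rec):
    # Product B uses different field names and cents instead of dollars.
    return {"customer_id": rec["customerId"],
            "email": rec["contact"],
            "spend_usd": rec["revenue_cents"] / 100.0}

# One unified table keyed by customer enables cross-product analysis.
unified = [from_product_a(r) for r in product_a] + [from_product_b(r) for r in product_b]
total_by_customer = {}
for row in unified:
    total_by_customer[row["customer_id"]] = (
        total_by_customer.get(row["customer_id"], 0.0) + row["spend_usd"])

print(total_by_customer)  # → {17: 219.5}
```

In the HDP pipeline the same mapping would be expressed as Pig or Hive transformations over the raw landed data rather than Python functions.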

  45. Launching New Data Analysis Products
  Creating Opportunity | Data: ETL
  Problem: Enterprise customers have no visibility into performance
  • Platforms connect 3.4 billion transactions per year
  • Currently storing 90TB, growing at 20% YoY
  • All divisions retain 36 months of data, except the healthcare network: 7 years
  • Customers have no visibility into their companies' activity on the commerce platforms
  • Client wants to add analytics services to cross-sell to existing customers and attract new ones
  Solution: HDP data lake enables launch of new information products
  • Shorten data processing workloads from days to hours
  • Enable ad hoc analytics queries
  • Create data analysis products and services for customers of the promotion, supply chain and healthcare networks
  • New product: anonymous reports that benchmark a customer against competitors in the same industry
  Software: Operator of intelligent ecommerce networks, >1,400 customers, ~5K employees
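The 90TB-at-20%-YoY figure compounds quickly, which is why the 7-year healthcare retention window is the stress case. A quick projection sketch, assuming the 20% applies as simple annual compounding (the deck does not say how the rate compounds):

```python
# Project storage needs from the slide's figures: 90 TB today, growing 20% YoY.
# Assumes simple annual compounding of the 20% growth rate.
current_tb = 90.0
growth = 0.20

projection = {year: round(current_tb * (1 + growth) ** year, 1)
              for year in range(0, 8)}  # healthcare network retains 7 years

print(projection[7])  # → 322.5
```

Under that assumption, the data held at the start of a 7-year retention window is joined by roughly 3.6x as much arriving by its end, which is the scaling pressure the HDP data lake is meant to absorb.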

  46. Product Managers Speed Product Innovation
  Creating Opportunity | Data: Server Log
  Problem: Product managers needed to analyze server logs
  • 130K clients drive 780M transactions per day
  • Services incorporate streams from the core CRM and 3rd-party platforms like Twitter, Facebook and YouTube
  • Product managers need to capture and interpret server log data to analyze new feature adoption & performance
  • Unable to process current volume using relational data stores
  • Unable to retain enough data because of cost
  Solution: HDP gives PMs power, reliability and liberty
  • Power: analysis of more than 30TB per month
  • Reliability: the previous system broke every 2 weeks; no longer
  • Liberty: an open source solution prevents vendor lock-in
  • HDP increases Product Management storage and analysis without a corresponding increase in IT spend
  Software: Sales & CRM software, cloud computing, ~$3B in revenue, ~10K employees

  47. eCommerce Platform Uses Data Lake for Insight
  Creating Opportunity | Data: Server Log
  Problem: New types of data difficult to store, unavailable for analysis
  • Millions of payments processed every day
  • Fraudsters sell fake items or extract buyer account info
  • Some buyers on credit default, resulting in losses
  • Unable to store current volume using relational data stores
  • Unable to retain vintage data because of RDBMS storage cost
  Solution: HDP data lake accelerates multiple analysis projects
  • Platform stores all new types of data: clickstream, social, sensor, geolocation, server logs and unstructured data
  • Detects and prevents theft: fraudsters stealing from members
  • Assesses credit risk: server log analysis & machine learning
  • Manages offers: aggregates data for advertisers
  • User experience: social sentiment analysis on usability
  • Site optimization: analyze clickstream for site improvements
  Software: eCommerce payments platform, ~$6B in revenue, >130M users, ~13K employees
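One simple building block of the payment-fraud detection mentioned above is a velocity check: flag an account that makes too many payments inside a short window. A minimal Python sketch, where the threshold, window, and events are illustrative and not the platform's actual rules:

```python
from collections import defaultdict

# Illustrative payment events: (account, unix_timestamp). Real events would
# stream into the data lake at millions per day.
events = [
    ("acct_1", 100), ("acct_1", 130), ("acct_1", 150), ("acct_1", 170),
    ("acct_2", 100), ("acct_2", 500),
]

WINDOW_SECONDS = 120   # assumed sliding window
MAX_PAYMENTS = 3       # assumed threshold before flagging

def flag_fast_accounts(events):
    """Return accounts exceeding MAX_PAYMENTS within any WINDOW_SECONDS span."""
    by_account = defaultdict(list)
    for account, ts in events:
        by_account[account].append(ts)
    flagged = set()
    for account, times in by_account.items():
        times.sort()
        for i in range(len(times)):
            # Count payments in the window starting at times[i].
            in_window = sum(1 for t in times
                            if times[i] <= t < times[i] + WINDOW_SECONDS)
            if in_window > MAX_PAYMENTS:
                flagged.add(account)
    return flagged

print(flag_fast_accounts(events))  # → {'acct_1'}
```

A production system would replace this rule with machine-learned models over the full server-log history, but the retained raw data in the lake is what makes either approach possible.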

  48. Offloading Clickstream Data from Netezza
  Improving Efficiency | Data: ETL & Clickstream
  Problem: EDW near capacity, burdened by exhaust data
  • Netezza EDW operating near capacity
  • Netezza housing exhaust data not required for intended reporting and analytics, leading to unnecessary expense
  • Enterprise IT maintained redundant data stores
  • Unable to store clickstream data to enrich consumer intelligence
  Solution: Longer storage, lower cost & better consumer intelligence
  • Hadoop will recover premium Netezza cycles, currently used for transformations and data movement
  • Projected cost savings of >$1M by offloading exhaust data
  • Analysis of clickstream adds a new dimension to the customer view
  • Improved service efficiency: bill processing & reporting
  Telecom: Major telecom provider, ~$25B in revenue, >40M customers

  49. Unified Household View of the Customer
  Creating Opportunity | Data: ETL, Social, Sensor & Clickstream
  Problem: Acquisitions & data explosion fragment the view of the customer
  • Recent acquisitions and a proliferation of data types caused a fragmented view of customers
  • Data exists across multiple applications & data stores
  • Semi-structured data: social, sensors & networked devices
  • Difficult to integrate structured, semi-structured & unstructured data sets from so many distinct sources
  Solution: HDP data lake delivers a 360° unified household view
  • Stable environment for exploring and enriching the data
  • Store all of the data and retain it for longer
  • Parse on demand: no need to pre-parse data before loading
  • Analysis on demand: analysts explore raw data and find unexpected truths in the data
  Telecom: Major telecom provider offering data networks & services, >$100B in revenue, >200K employees
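"Parse on demand" is the schema-on-read pattern: raw records are stored untouched, and a schema is applied only when an analyst queries them. A minimal Python sketch of the idea, with a log format and fields invented for illustration:

```python
# Raw, unparsed lines land in storage as-is (schema-on-read: no upfront parsing).
raw_store = [
    "2014-01-05|mobile|acct=42|event=call_start",
    "2014-01-05|web|acct=42|event=login",
    "2014-01-06|mobile|acct=7|event=call_start",
]

def parse_on_demand(line):
    """Apply a schema only at query time; the stored bytes never change."""
    date, channel, acct, event = line.split("|")
    return {"date": date,
            "channel": channel,
            "account": int(acct.split("=")[1]),
            "event": event.split("=")[1]}

# An analyst's ad hoc query: all activity for household account 42.
records = [parse_on_demand(line) for line in raw_store]
household_42 = [r for r in records if r["account"] == 42]
print(len(household_42))  # → 2
```

Because parsing is deferred, the same raw store can later be re-parsed with a richer schema without reloading any data, which is what lets analysts "find unexpected truths" in fields nobody modeled up front.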

  50. Call Record Analysis for Improved Cell Service
  Creating Opportunity | Data: Sensor
  Problem: System receives millions of call detail records per second
  • System enables proactive management of phone call quality
  • Call detail records (CDRs) are the raw data used for analysis
  • Millions of CDRs stream in every second
  • Storage is expensive & ingest rates are increasing 20% YoY
  • 24-hour data retention is not sufficient to discover long-term trends
  Solution: Longer storage & rich analysis improve customer service
  • HDP's 10:1 compression allows affordable 6-month retention
  • Improved forensics on instances of poor call quality drive:
    • Informed decisions on expansion of transmission infrastructure
    • Predictive analytics on when to repair or replace equipment
  • Access to more data helps service reps solve customer issues in near real-time
  Telecom: Major telecom provider offering data networks & services, >$100B in revenue, >200K employees
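The jump from 24-hour to 6-month retention follows directly from the 10:1 compression figure. A back-of-the-envelope sketch, where the daily raw CDR volume and storage budget are illustrative assumptions (the deck gives only the compression ratio):

```python
# How far a fixed storage budget stretches once data is compressed 10:1.
# daily_raw_tb and storage_budget_tb are illustrative assumptions.
daily_raw_tb = 50.0
compression_ratio = 10.0
storage_budget_tb = 1000.0

days_uncompressed = storage_budget_tb / daily_raw_tb
days_compressed = storage_budget_tb / (daily_raw_tb / compression_ratio)

print(days_uncompressed, days_compressed)  # → 20.0 200.0
```

With these assumed figures, the same budget holds 20 days of raw CDRs but about 200 days (roughly 6.5 months) compressed, consistent with the slide's retention claim; the 20% YoY ingest growth then eats into that window unless capacity grows with it.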