
Deep Analytics Pipeline: A Benchmark Proposal


Presentation Transcript


  1. Deep Analytics Pipeline: A Benchmark Proposal • Milind Bhandarkar • Chief Scientist, Pivotal • (mbhandarkar@gopivotal.com) • (Twitter: @techmilind)

  2. Why “Representative” Benchmarks? • Benchmarks are most relevant when they represent real-world applications • Design, architect, and tune systems for broadly applicable workloads • Rule of system design: make common tasks fast, other tasks possible

  3. Drivers for Big Data • Ubiquitous Connectivity – Mobile • Sensors Everywhere – Industrial Internet • Content Production – Social

  4. Big Data Sources • Events • Direct – Human-Initiated • Indirect – Machine-Initiated • Software Sensors (Clickstreams, Locations) • Public Content (blogs, tweets, status updates, images, videos)
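
To keep the later sketches concrete, here is one plausible shape for a raw event record. The field names are my assumptions; the deck only lists the event sources above.

```python
# Hypothetical event record (field names are assumptions, not from the
# deck). Events may be direct (human-initiated, e.g. a click) or
# indirect (machine-initiated, e.g. a location ping).
from dataclasses import dataclass, field

@dataclass
class Event:
    user_id: str        # the business entity the event belongs to
    timestamp: float    # seconds since the epoch
    event_type: str     # "click", "page_view", "location_ping", ...
    attributes: dict = field(default_factory=dict)  # free-form payload
```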

  5. Typical Big Data Use Case • Gain a better understanding of “business entities” from their past behavior • Use “All Data” to model entities • Lower costs mean even incremental improvements in the model are worthwhile

  6. Domains & Business Entities

  7. “User” Modeling • Objective: determine user interests by mining user activities • Large dimensionality of possible user activities • A typical user has a sparse activity vector • Event attributes change over time
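
A minimal sketch of a sparse activity vector, storing only the dimensions a user actually touched. The key scheme is an illustrative assumption, and it reuses the Event record sketched above.

```python
# Sparse activity vector: with a huge space of possible activities,
# store only non-zero counts per user. The (event_type, page) key
# scheme is an illustrative assumption.
from collections import Counter

def activity_vector(events):
    """Map a user's events to a sparse {activity_key: count} vector."""
    vec = Counter()
    for ev in events:
        key = f"{ev.event_type}:{ev.attributes.get('page', 'unknown')}"
        vec[key] += 1
    return vec
```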

  8. User-Modeling Pipeline • Data Acquisition • Sessionization, Normalization • Feature and Target Generation • Model Training • Model Testing • Upload to serving
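
A sketch of the six stages wired together, with identity stubs so the skeleton runs as-is; real stage bodies are sketched after the corresponding slides below, and the per-stage loop anticipates slide 19's point that each stage should be measurable on its own.

```python
# Pipeline skeleton: six stages, mirroring the bullets above. Identity
# stubs stand in for the real implementations so this runs as written.
def identity(data):
    return data

PIPELINE = [
    ("data_acquisition", identity),
    ("sessionization_and_normalization", identity),
    ("feature_and_target_generation", identity),
    ("model_training", identity),
    ("model_testing", identity),
    ("upload_to_serving", identity),
]

def run_pipeline(data):
    for name, stage in PIPELINE:
        data = stage(data)  # each stage can be timed and reported separately
    return data
```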

  9. Data Acquisition

  10. Data Extraction & Sessionization
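
The slide carries no detail beyond its title, so this is a sketch of the standard approach: split each user's time-ordered events into sessions wherever the gap between consecutive events exceeds an inactivity timeout. The 30-minute value is a common default, not something the deck specifies.

```python
# Gap-based sessionization; the 30-minute timeout is an assumed default.
SESSION_TIMEOUT = 30 * 60  # seconds

def sessionize(events):
    """Split one user's events into sessions; yields lists of events."""
    session = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if session and ev.timestamp - session[-1].timestamp > SESSION_TIMEOUT:
            yield session          # inactivity gap: close the session
            session = []
        session.append(ev)
    if session:
        yield session
```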

  11. Normalization
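
Again the slide is title-only, so what follows is a guess at scope: canonicalizing raw event attributes so that equivalent events compare equal, here by stripping URL query strings and lowercasing. It reuses the Event record sketched earlier.

```python
# One plausible reading of "normalization": canonicalize attributes so
# equivalent events compare equal. The scope here is an assumption.
from urllib.parse import urlsplit

def normalize(ev):
    """Return a copy of the event with canonicalized attributes."""
    attrs = dict(ev.attributes)
    if "url" in attrs:
        parts = urlsplit(attrs["url"])              # drop query/fragment
        attrs["url"] = (parts.netloc + parts.path).lower()
    return Event(ev.user_id, ev.timestamp, ev.event_type.lower(), attrs)
```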

  12. Targets • User actions of interest • Clicks on ads & content • Site & page visits • Conversion events • Purchases, quote requests • Sign-ups for membership, etc.
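
A labeling sketch following the list above: a session is a positive example if it contains any target action. The concrete action names are illustrative assumptions.

```python
# Target labeling: a session is positive if it contains a target action.
# The action names below are illustrative, not from the deck.
TARGET_ACTIONS = {"ad_click", "purchase", "quote_request", "signup"}

def has_target(session):
    """True if any event in the session is a target action."""
    return any(ev.event_type in TARGET_ACTIONS for ev in session)
```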

  13. Features • Summary of user activities over a time window • Aggregates, moving averages, and rates over various time windows • Incrementally updated
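
One way to make window features incrementally updatable is an exponentially decayed count, which approximates a moving rate without storing history; the half-lives below are my choices, since the deck only names aggregates, moving averages, and rates.

```python
# Incrementally updated feature: exponentially decayed event counts
# approximate per-window rates without replaying history. Half-lives
# are assumed values, standing in for the deck's "various time-windows".
HALF_LIVES_DAYS = (1.0, 7.0, 28.0)

def update_counts(feats, event_type, days_since_last_update):
    """Decay the old counts toward zero, then add the new event."""
    for h in HALF_LIVES_DAYS:
        key = f"{event_type}_rate_{int(h)}d"
        decay = 0.5 ** (days_since_last_update / h)
        feats[key] = feats.get(key, 0.0) * decay + 1.0
    return feats
```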

  14. Joining Targets & Features • Target rates are very low: 0.01% to 1% • First, find users with a target in their sessions • Filter out activity from users without targets • Join feature vectors with targets
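
A sketch of the join in the order the slide gives it: drop users with no targets first, then pair feature vectors with labels. Plain dicts stand in for what would be a distributed join at scale, and it reuses has_target from the targets sketch.

```python
# Join targets and features, filtering first since target rates are low.
def join_targets_features(feature_store, sessions_by_user):
    """Yield (feature_dict, 0/1 label) pairs for users with targets."""
    for user, sessions in sessions_by_user.items():
        if not any(has_target(s) for s in sessions):
            continue                      # per the slide's filter step
        for s in sessions:
            yield feature_store.get(user, {}), int(has_target(s))
```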

  15. Model Training • Regressions • Boosted Decision Trees • Naive Bayes • Support Vector Machines • Maximum Entropy modeling
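
A training sketch using one of the listed techniques: logistic regression, which in its multinomial form is a maximum-entropy model. scikit-learn's DictVectorizer turns the sparse feature dicts from the earlier sketches into a sparse matrix.

```python
# Train a logistic-regression model on sparse feature dicts.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train(examples):
    """examples: iterable of (feature_dict, 0/1 label) pairs."""
    feats, labels = zip(*examples)
    vec = DictVectorizer()
    X = vec.fit_transform(feats)          # sparse, as slide 7 anticipates
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    return vec, model
```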

  16. Model Testing • Apply models to features in hold-out data • Pleasantly parallel • Measure model effectiveness
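
An evaluation sketch: score the hold-out set and report AUC, which stays meaningful at the very low target rates noted on slide 14. Scoring is pleasantly parallel since each row is independent, so the hold-out data can be sharded across workers; the metric choice is mine.

```python
# Hold-out evaluation; AUC is an assumed choice of effectiveness metric.
from sklearn.metrics import roc_auc_score

def test(vec, model, holdout_examples):
    """Return AUC of the model on (feature_dict, label) hold-out pairs."""
    feats, labels = zip(*holdout_examples)
    scores = model.predict_proba(vec.transform(feats))[:, 1]
    return roc_auc_score(labels, scores)
```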

  17. Upload to Serving • Upload scores to serving systems • Benchmark large-scale serving systems • NoSQL systems with rolling batch upserts and read-mostly access
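
An upload sketch with Redis standing in for the unnamed NoSQL store; pipelined SETs give the rolling batch-upsert pattern, and the key scheme and batch size are assumptions.

```python
# Batch-upsert user scores into a key-value store (Redis as a stand-in).
import redis

def upload_scores(scores, host="localhost", batch_size=10_000):
    r = redis.Redis(host=host)
    pipe = r.pipeline(transaction=False)     # batch the round trips
    for i, (user_id, score) in enumerate(scores.items(), 1):
        pipe.set(f"score:{user_id}", score)  # SET behaves as an upsert
        if i % batch_size == 0:
            pipe.execute()                   # flush one batch
    pipe.execute()                           # flush the remainder
```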

  18. Proposal: 5 Classes • Tiny (100K entities, 10 events per entity) • Small (1M entities, 10 events per entity) • Medium (10M entities, 100 events per entity) • Large (100M entities, 1000 events per entity) • Huge (1B entities, 1000 events per entity)
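
The five classes as a data-generator config; the total event counts follow directly from entities times events per entity.

```python
# The proposed scale classes: (entities, events per entity).
SCALE_CLASSES = {
    "tiny":   (100_000,       10),    # 1 million events total
    "small":  (1_000_000,     10),    # 10 million
    "medium": (10_000_000,    100),   # 1 billion
    "large":  (100_000_000,   1000),  # 100 billion
    "huge":   (1_000_000_000, 1000),  # 1 trillion
}

def total_events(cls):
    entities, per_entity = SCALE_CLASSES[cls]
    return entities * per_entity
```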

  19. Proposal: Publish results for every stage • Data pipelines are constructed by mixing and matching the stages above • Different modeling techniques per class • Publish performance numbers for every stage

  20. Questions?
