From The Lab to the Factory

Building a Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera.

Presentation Transcript


  1. From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera

  2. About Me

  3. What Do Data Scientists Do?

  4. What I Think I Do

  5. What Other People Think I Do

  6. What I Actually Do

  7. Data Science In the Lab

  8. Data Science as Statistics

  9. Investigative Analytics

  10. Tools for Investigative Analytics

  11. Inputs and Outputs

  12. On Actionable Insights

  13. Data Science in the Factory

  14. Building Data Products

  15. A Shift In Perspective
     Analytics in the Lab: • Question-driven • Interactive • Ad-hoc, post-hoc • Fixed data • Focus on speed and flexibility • Output is embedded into a report or in-database scoring engine
     Analytics in the Factory: • Metric-driven • Automated • Systematic • Fluid data • Focus on transparency and reliability • Output is a production system that makes customer-facing decisions

  16. Data Science as Decision Engineering

  17. All* Products Become Data Products

  18. From the Lab to the Factory: First Steps

  19. Step 1: Choose a Good Problem

  20. Step 2: DTSTCPWTM

  21. Step 3: Log Everything

  22. Step 4: Hire (More) Data Scientists

  23. Workflow Optimization

  24. The Data Science Workflow

  25. Identifying the Bottlenecks

  26. Myrrix

  27. Introducing Oryx

  28. Generational Thinking

  29. Oryx ALS Recommender Demo

  30. Rolling to Production

  31. The Limits of Our Models

  32. Space Exploration

  33. Data Science Needs DevOps

  34. Introducing Gertrude
     • Multivariate Testing
     • Define and explore a space of parameters
     • Overlapping Experiments
     • Tang et al. (2010)
     • Runs multiple independent experiments on every request
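
The "multiple independent experiments on every request" idea can be sketched roughly as follows. This is a minimal illustration of layered diversion in the spirit of Tang et al. (2010), not Gertrude's actual code: the layer names, experiment names, bucket ranges, and the `bucket` and `assign` helpers are all hypothetical. Each layer hashes the request with its own salt, so the assignment in one layer is effectively independent of the assignment in another.

```python
import hashlib

def bucket(request_id: str, layer_salt: str, num_buckets: int = 1000) -> int:
    """Map a request to a bucket in [0, num_buckets), using a per-layer salt."""
    digest = hashlib.sha256(f"{layer_salt}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Hypothetical layers. Each layer owns its own parameters, and each experiment
# claims a half-open bucket range [start, end) inside its layer.
LAYERS = {
    "ranking": [("rank_control", 0, 500), ("rank_new_model", 500, 1000)],
    "ui": [("ui_control", 0, 900), ("ui_big_button", 900, 1000)],
}

def assign(request_id: str) -> dict:
    """Return the experiment selected in every layer for this request."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(request_id, layer)
        for name, start, end in experiments:
            if start <= b < end:
                assignment[layer] = name
                break
    return assignment
```

Because the hash is deterministic, the same request identifier always lands in the same experiments, while different salts keep the layers from lining up with each other.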

  35. Simple Conditional Logic
     • Declare experiment flags in compiled code
     • Settings that can vary per request
     • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
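
One way to picture the split between flags declared in code and rules kept in a config file is the sketch below. It is an assumption-laden illustration rather than Gertrude's format: the `Flag` class, the flag names, the rule layout (`when` / `value`), and the `evaluate` function are all invented for this example, and the config is written inline as a dict where a real system would load it from a file.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flag:
    """A per-request setting declared in code, with a safe default."""
    name: str
    default: object

# Flags declared in code (hypothetical names).
MAX_RESULTS = Flag("max_results", 10)
USE_NEW_RANKER = Flag("use_new_ranker", False)

# Hypothetical rule config: for each flag, an ordered list of rules.
# The first rule whose conditions all match the request wins.
CONFIG = {
    "max_results": [
        {"when": {"experiment": "more_results"}, "value": 20},
        {"when": {"country": "US", "device": "mobile"}, "value": 5},
    ],
    "use_new_ranker": [
        {"when": {"experiment": "new_ranker"}, "value": True},
    ],
}

def evaluate(flag: Flag, request: dict, config: dict = CONFIG):
    """Compute the flag value for one request, falling back to the default."""
    for rule in config.get(flag.name, []):
        if all(request.get(key) == want for key, want in rule["when"].items()):
            return rule["value"]
    return flag.default
```

For example, `evaluate(MAX_RESULTS, {"country": "US", "device": "mobile"})` picks the second rule, while a request that matches no rule gets the default declared in code.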

  36. Separate Data Push from Code Push
     • Validate config files and push updates to servers
     • Zookeeper via Curator
     • File-based
     • Servers pick up new configs, load them, and update experiment space and flag value calculations
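
The file-based variant of "servers pick up new configs" might look something like this sketch. It assumes a JSON config file and a polling server, neither of which the slides specify; `ConfigHolder`, `validate`, and `maybe_reload` are hypothetical names, and the ZooKeeper/Curator path is not shown. The point is only that a config is validated before it replaces the live one, so a bad push leaves the previous settings in place.

```python
import json
import os

class ConfigHolder:
    """Holds the live config and swaps in a new one only if it validates."""

    def __init__(self, path, known_flags):
        self.path = path
        self.known_flags = set(known_flags)
        self.current = {}
        self._mtime = None  # modification time of the last file we looked at

    def validate(self, config):
        """Reject configs that are not objects or that name undeclared flags."""
        if not isinstance(config, dict):
            raise ValueError("config must be a JSON object")
        unknown = set(config) - self.known_flags
        if unknown:
            raise ValueError(f"unknown flags: {sorted(unknown)}")

    def maybe_reload(self):
        """Reload if the file changed; return True only when a new config is live."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return False
        self._mtime = mtime
        try:
            with open(self.path, encoding="utf-8") as f:
                config = json.load(f)
            self.validate(config)
        except ValueError:
            # Malformed JSON or failed validation: keep serving the old config.
            return False
        self.current = config
        return True
```

A server would call `maybe_reload()` on a timer or before handling a batch of requests, so new experiment settings take effect without a code deploy.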

  37. The Experiments Dashboard

  38. A Few Links I Love
     • http://research.google.com/pubs/pub36500.html
       The original paper on the overlapping experiments infrastructure at Google
     • http://www.exp-platform.com/
       Collection of all of Microsoft’s papers and presentations on their experimentation platform
     • http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
       Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

  39. Josh Wills, Director of Data Science, Cloudera @josh_wills Thank you!
