
Big Data Use Cases in the cloud




Presentation Transcript

  1. Big Data Use Cases in the cloud • Peter Sirota, GM Elastic MapReduce • @petersirota

  2. What is Big Data?

  3. Computer generated data • Application server logs (web sites, games) • Sensor data (weather, water, smart grids) • Images/videos (traffic, security cameras)

  4. Human generated data • Twitter “Firehose” (50 million tweets/day, 1,400% growth per year) • Blogs/Reviews/Emails/Pictures • Social graphs • Facebook, LinkedIn, contacts

  5. Big Data is full of valuable, unanswered questions!

  6. Why is Big Data Hard (and Getting Harder)?

  7. Why is Big Data Hard (and Getting Harder)? • Data Volume • Unconstrained growth • Current systems don’t scale

  8. Why is Big Data Hard (and Getting Harder)? • Data Structure • Need to consolidate data from multiple data sources in multiple formats across multiple businesses

  9. Why is Big Data Hard (and Getting Harder)? • Changing Data Requirements • Faster response times on fresher data • Sampling is not good enough and history is important • Increasing complexity of analytics • Users demand inexpensive experimentation

  10. We need tools built specifically for Big Data!

  11. Innovation #1: • Apache Hadoop • The MapReduce computational paradigm • Open source, scalable, fault-tolerant, distributed system. Hadoop lowers the cost of developing a distributed system for data processing.
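
The MapReduce paradigm named on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: Hadoop runs the same three phases (map, group by key, reduce) across many machines and adds fault tolerance. The log format, the sample lines, and the names map_reduce, status_mapper, and sum_reducer are assumptions made up for this illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy, in-memory version of the map, shuffle, and reduce phases."""
    # Map + shuffle: every record emits (key, value) pairs, grouped by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: collapse the list of values for each key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log format: "<path> <http_status>".
    _path, status = line.split()
    yield status, 1


def sum_reducer(_key, values):
    return sum(values)


# Hypothetical application server log lines.
logs = ["/home 200", "/cart 500", "/home 200", "/search 404"]
status_counts = map_reduce(logs, status_mapper, sum_reducer)
print(status_counts)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be split across nodes, which is what lets a cluster grow with the data.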

  12. Innovation #2: • Amazon Elastic Compute Cloud (EC2) • “provides resizable compute capacity in the cloud.” Amazon EC2 lowers the cost of operating a distributed system for data processing

  13. Amazon Elastic MapReduce = Amazon EC2 + Hadoop

  14. Elastic MapReduce applications • Targeted advertising / Clickstream analysis • Security: anti-virus, fraud detection, image recognition • Pattern matching / Recommendations • Data warehousing / BI • Bio-informatics (Genome analysis) • Financial simulation (Monte Carlo simulation) • File processing (resize jpegs, video encoding) • Web indexing

  15. Clickstream Analysis – • Big Box Retailer came to Razorfish • 3.5 billion records • 71 million unique cookies • 1.7 million targeted ads required per day • Problem: Improve Return on Ad Spend (ROAS)

  16. Clickstream Analysis – • User recently purchased a sports movie and is searching for video games • Targeted Ad (1.7 million per day)

  17. Clickstream Analysis – • Lots of experimentation, but final design: • 100-node on-demand Elastic MapReduce cluster running Hadoop

  18. Clickstream Analysis – • Processing time dropped from 2+ days to 8 hours (with lots more data)

  19. Clickstream Analysis – • Increased Return On Ad Spend by 500%

  20. World’s largest handmade marketplace • 8.9 million items • 1 billion page views per month • $320MM 2010 GMS

  21. Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes [Diagram labels: Web event logs, Production DB snapshots, ETL Step 1, ETL Step 2, Job, Job, Job]

  22. Recommendations • The Taste Test • http://www.etsy.com/tastetest

  23. Recommendations • Gift Ideas for Facebook Friends • etsy.com/gifts

  24. Yelp • Yelp generates close to 400 GB of logs per day

  25. MapReduce at Yelp • Yelp does not have a physical MapReduce cluster • Running 250 production clusters per week • All of those run on Elastic MapReduce

  26. Features driven by MapReduce

  27. Features driven by MapReduce

  28. More MapReduce uses • Analyze ad stats (reporting, billing, algorithm inputs) • Analyze A/B test results • Detect duplicate business listings • Email bounce processing • Identify bots based on traffic patterns

  29. Big Data @ foursquare

  30. How do we use EMR? • Map-Reduce • Run algorithms on our entire dataset • Streaming jobs, complex analyses • Hive • Business intelligence • Exploratory analyses • Infographics!

  31. How big is our data? • Global reach (North Pole, Space) • Native app for almost every smartphone, SMS, web, mobile-web • 10M+ users, 15M+ venues, ~1B check-ins • Terabytes of log data

  32. Our Stack

  33. Computing venue-to-venue similarity • Spin up 40 node cluster • Submit Ruby streaming job • Invert User x Venue matrix • Grab Co-occurrences • Compute similarity • Spin down cluster • Load data to app server
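
The middle steps of this slide (invert the User x Venue matrix, grab co-occurrences, compute similarity) can be sketched in a few lines. This is a single-process illustration under stated assumptions, not foursquare's job: the slide describes a Ruby streaming job on a 40 node cluster and does not say which similarity measure is used. Cosine similarity over binary visit vectors is swapped in here as one common choice, and the function name venue_similarity and the sample check-ins are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def venue_similarity(checkins):
    """Return {(venue_a, venue_b): similarity} from (user, venue) pairs."""
    # Step 1: group check-ins by user, so each user's venues sit together
    # (one reading of "invert the User x Venue matrix").
    venues_by_user = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)

    # Step 2: count visitors per venue and co-occurrences per venue pair.
    visitors = defaultdict(int)
    cooccurrences = defaultdict(int)
    for venues in venues_by_user.values():
        for venue in venues:
            visitors[venue] += 1
        for a, b in combinations(sorted(venues), 2):
            cooccurrences[(a, b)] += 1

    # Step 3: cosine similarity (an assumption; the slide names no measure).
    return {
        (a, b): shared / sqrt(visitors[a] * visitors[b])
        for (a, b), shared in cooccurrences.items()
    }


# Hypothetical check-in data.
checkins = [
    ("ann", "cafe"), ("ann", "gym"),
    ("bob", "cafe"), ("bob", "gym"), ("bob", "park"),
    ("cat", "cafe"),
]
similarity = venue_similarity(checkins)
for pair, score in sorted(similarity.items()):
    print(pair, round(score, 3))
```

In a MapReduce setting, step 1 corresponds to a job keyed by user and step 2 to a job keyed by venue pair, which is why the work spreads naturally over a cluster that is spun up for the run and shut down afterwards.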

  34. Who is checking in?

  35. What are people doing?

  36. Where are our users?

  37. When do people go to a place? [Chart panels: Thursday, Friday, Saturday, Sunday]

  38. Why are people checking in? • Explore their city, discover new places • Find friends, meet up • Save with local deals • Get insider tips on venues • Personal analytics, diary • Follow brands and celebrities • Earn points, badges, gamification of life • The list grows…

  39. How can we leverage these insights?

  40. Join us! • Justin Moore • @injust • justin@foursquare.com • foursquare is hiring • www.foursquare.com/jobs

  41. http://aws.amazon.com/elasticmapreduce/
