Big Data Use Cases in the cloud

Peter Sirota, GM Elastic MapReduce @petersirota

What is Big Data?

Computer generated data

    • Application server logs (web sites, games)
    • Sensor data (weather, water, smart grids)
    • Images/videos (traffic, security cameras)

Human generated data

    • Twitter “Firehose” (50 million tweets/day, 1,400% growth per year)
    • Blogs/Reviews/Emails/Pictures
    • Social graphs
      • Facebook, LinkedIn, contacts
Why is Big Data Hard (and Getting Harder)?
  • Data Volume
    • Unconstrained growth
    • Current systems don’t scale
  • Data Structure
    • Need to consolidate data from multiple data sources in multiple formats across multiple businesses
  • Changing Data Requirements
    • Faster response times on fresher data
    • Sampling is not good enough and history is important
    • Increasing complexity of analytics
    • Users demand inexpensive experimentation
Innovation #1:
  • Apache Hadoop
    • The MapReduce computational paradigm
    • Open source, scalable, fault‐tolerant, distributed system

Hadoop lowers the cost of developing a distributed system for data processing
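
To make the MapReduce paradigm concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the `map_reduce` helper, the mapper and reducer names, and the "<path> <status>" log format are all assumptions chosen for illustration. Hadoop runs the same three phases (map, shuffle, reduce) across many machines and adds fault tolerance.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, in-memory MapReduce (illustrative only, not Hadoop's API)."""
    groups = defaultdict(list)
    # Map phase: every record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(log_line):
    # Assumed log format for this sketch: "<path> <status>".
    _path, status = log_line.split()
    yield status, 1


def sum_reducer(_key, values):
    return sum(values)


SAMPLE_LOGS = ["/home 200", "/cart 200", "/missing 404"]
status_counts = map_reduce(SAMPLE_LOGS, status_mapper, sum_reducer)
print(status_counts)  # {'200': 2, '404': 1}
```

Because the mapper sees one record at a time and the reducer sees one key at a time, the work can be split across nodes without changing the user-written functions.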

Innovation #2:
  • Amazon Elastic Compute Cloud (EC2)
    • “Provides resizable compute capacity in the cloud.”

Amazon EC2 lowers the cost of operating a distributed system for data processing

Amazon Elastic MapReduce = Amazon EC2 + Hadoop

Elastic MapReduce applications
  • Targeted advertising / Clickstream analysis
  • Security: anti-virus, fraud detection, image recognition
  • Pattern matching / Recommendations
  • Data warehousing / BI
  • Bio-informatics (Genome analysis)
  • Financial simulation (Monte Carlo simulation)
  • File processing (resize jpegs, video encoding)
  • Web indexing
Clickstream Analysis
  • A big-box retailer came to Razorfish
    • 3.5 billion records
    • 71 million unique cookies
    • 1.7 million targeted ads required per day

Problem: Improve Return on Ad Spend (ROAS)

Example: a user who recently purchased a sports movie and is searching for video games is shown a targeted ad (1.7 million per day).

  • Lots of experimentation, but the final design:
    • 100-node on-demand Elastic MapReduce cluster running Hadoop
  • Processing time dropped from 2+ days to 8 hours (with lots more data)
  • Increased Return On Ad Spend by 500%
Etsy

World’s largest handmade marketplace

    • 8.9 million items
    • 1 billion page views per month
    • $320MM in 2010 gross merchandise sales (GMS)

Easy to ‘backfill’ and run experiments: just boot up a cluster with 100, 500, or 1000 nodes

[Pipeline diagram. Labels: Web event logs; Production DB snapshots; ETL – Step 1; ETL – Step 2; Job (×3)]

Recommendations

The Taste Test http://www.etsy.com/tastetest

Gift Ideas for Facebook Friends

etsy.com/gifts

Yelp

Yelp generates close to 400GB of logs per day

MapReduce at Yelp

Yelp does not have a physical MapReduce cluster

Running 250 production clusters per week

All of those run on Elastic MapReduce

More MapReduce uses

Analyze ad stats (reporting, billing, algorithm inputs)

Analyze A/B test results

Detect duplicate business listings

Email bounce processing

Identify bots based on traffic patterns

How do we use EMR?
  • MapReduce
    • Run algorithms on our entire dataset
    • Streaming jobs, complex analyses
  • Hive
    • Business intelligence
    • Exploratory analyses
    • Infographics!
How big is our data?

Global reach (North Pole, Space)

Native app for almost every smartphone, SMS, web, mobile-web

10M+ users, 15M+ venues, ~1B check-ins

Terabytes of log data

Computing venue-to-venue similarity
  • Spin up 40 node cluster
  • Submit Ruby streaming job
    • Invert User x Venue matrix
    • Grab Co-occurrences
    • Compute similarity
  • Spin down cluster
  • Load data to app server
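
The three streaming-job steps above can be sketched in a few lines of Python. This is a single-machine illustration, not the Ruby streaming job from the talk: the `venue_similarity` function, the `(user, venue)` input format, and the choice of cosine similarity over binary visit vectors are all assumptions, since the slides do not say which similarity measure was used.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def venue_similarity(checkins):
    """checkins: iterable of (user, venue) pairs (assumed input format)."""
    # Step 1: invert the User x Venue data into per-user venue sets,
    # and keep per-venue visitor sets for normalisation later.
    venues_by_user = defaultdict(set)
    users_by_venue = defaultdict(set)
    for user, venue in checkins:
        venues_by_user[user].add(venue)
        users_by_venue[venue].add(user)

    # Step 2: grab co-occurrences, i.e. count the users who visited both venues.
    co_visits = defaultdict(int)
    for venues in venues_by_user.values():
        for a, b in combinations(sorted(venues), 2):
            co_visits[(a, b)] += 1

    # Step 3: compute similarity (cosine over binary vectors, an assumption).
    return {
        (a, b): shared / sqrt(len(users_by_venue[a]) * len(users_by_venue[b]))
        for (a, b), shared in co_visits.items()
    }


SAMPLE_CHECKINS = [
    ("u1", "cafe"), ("u1", "bar"),
    ("u2", "cafe"), ("u2", "bar"),
    ("u3", "cafe"), ("u3", "gym"),
]
similarities = venue_similarity(SAMPLE_CHECKINS)
for pair, score in sorted(similarities.items()):
    print(pair, round(score, 3))
```

Venue pairs that no user has in common (here "bar" and "gym") never appear in the output, which keeps the result sparse even when the venue count is large.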
When do people go to a place?

Thursday

Friday

Saturday

Sunday
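
A day-by-day view like the one above could be produced by bucketing check-in timestamps by weekday. The following is a rough sketch under assumptions: the `checkins_by_weekday` function and the ISO 8601 timestamp format are hypothetical, and the sample timestamps are made up for illustration rather than taken from the talk.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def checkins_by_weekday(timestamps):
    """Count check-ins per weekday from ISO 8601 strings (assumed format)."""
    return Counter(
        WEEKDAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )


# Made-up sample timestamps for one venue.
SAMPLE_TIMESTAMPS = [
    "2011-06-02T19:30:00",  # a Thursday
    "2011-06-03T21:00:00",  # a Friday
    "2011-06-03T23:15:00",  # a Friday
    "2011-06-04T12:00:00",  # a Saturday
]
weekday_counts = checkins_by_weekday(SAMPLE_TIMESTAMPS)
print(weekday_counts.most_common())
```

Grouping on `(weekday, hour)` instead of weekday alone would give a finer-grained profile with the same structure.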

Why are people checking in?

Explore their city, discover new places

Find friends, meet up

Save with local deals

Get insider tips on venues

Personal analytics, diary

Follow brands and celebrities

Earn points, badges, gamification of life

The list grows…

Join us!

Justin Moore

@injust

justin@foursquare.com

foursquare is hiring

www.foursquare.com/jobs