
Hadoop on Azure 101: What is the Big Deal?





Presentation Transcript


Hadoop on Azure 101: What is the Big Deal?

Dennis Mulder

Solution Architect – Global Windows Azure Center of Excellence

Microsoft Corporation



Agenda

Why Big Data?

Understanding the Basics

Microsoft and Hadoop



Why Big Data?


1.8 ZETTABYTES of information will be created in 2011

  • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011


7.9 ZETTABYTES by 2015

  • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011



Bing ingests > 7 petabytes a month

The Twitter community generates over 1 terabyte of tweets every day

Cisco predicts that by 2013 annual internet traffic will reach 667 exabytes

Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp



Example Scenarios



The Potential: Solving Specific Industry Problems

eCommerce: mining web logs: collaborative filtering, user experience optimisation…

Manufacturing: detecting trends and anomalies in sensor data: predicting and understanding faults

Capital Markets: joining market and external data: correlation detection for investment strategy identification, risk calculations…

Retail Banking: historical transaction mining: fraud detection, customer segmentation…

 Industry-specific data-sets leveraged to improve decision making and generate new revenue streams



Traditional E-Commerce Data Flow

OPERATIONAL DATA

NEW USER REGISTRY

NEW PURCHASE

NEW PRODUCT

Data Warehouse

ETL Some Data

Logs

Excess Data



New E-Commerce Big Data Flow

OPERATIONAL DATA

NEW USER REGISTRY

NEW PURCHASE

NEW PRODUCT

Data Warehouse

Logs

Raw Data

“Store it All” Cluster

How much do views for certain products increase when our TV ads run?


Understanding the Basics: Move the Compute to the Data



So How Does It Work?

FIRST, STORE THE DATA

[Diagram: Files are distributed across four Servers]
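As a rough illustration of "store the data", the sketch below splits a file into fixed-size blocks and spreads copies of each block across several servers. The block size, the replication factor, the server names, and the round-robin placement are all assumptions made for this example; they are not the placement policy Hadoop itself uses.

```javascript
// Sketch: split a file into blocks and place replicas on servers.
// BLOCK_SIZE, REPLICATION, and round-robin placement are illustrative
// assumptions, not Hadoop's actual defaults or policy.
const BLOCK_SIZE = 4;   // characters per block (tiny, for readability)
const REPLICATION = 2;  // copies kept of each block

function splitIntoBlocks(content, blockSize) {
    const blocks = [];
    for (let offset = 0; offset < content.length; offset += blockSize) {
        blocks.push(content.slice(offset, offset + blockSize));
    }
    return blocks;
}

function placeBlocks(blocks, servers, replication) {
    // Maps each server name to the list of block indexes it stores.
    const placement = {};
    servers.forEach(function (server) { placement[server] = []; });
    blocks.forEach(function (_, blockIndex) {
        for (let r = 0; r < replication; r++) {
            const server = servers[(blockIndex + r) % servers.length];
            placement[server].push(blockIndex);
        }
    });
    return placement;
}

const servers = ["server1", "server2", "server3", "server4"];
const blocks = splitIntoBlocks("abcdefghijklmnopqr", BLOCK_SIZE);
const placement = placeBlocks(blocks, servers, REPLICATION);
console.log(blocks);
console.log(placement);
```

Because every block lives on more than one server, losing a single server does not lose data, and any server holding a block can later run the processing for it.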



So How Does It Work?

SECOND, TAKE THE PROCESSING TO THE DATA

RUNTIME

// Map Reduce function in JavaScript

// map: split each input line into words and emit (word, 1) for every word
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// reduce: add up the counts emitted for one word
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

[Diagram: the Code is sent to each of the four Servers]
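To try the word-count functions without a cluster, a small local stand-in for the runtime can be sketched as follows. The `runLocally` helper and the sample input lines are hypothetical and exist only for this example; they mimic the `context.write` and `values.hasNext()` / `values.next()` calls the slide code relies on and are not part of any Hadoop API.

```javascript
// Word-count map and reduce in the shape used on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};

// Hypothetical local stand-in for the cluster runtime (not a Hadoop API).
function runLocally(lines, mapFn, reduceFn) {
    // Collect every value the map function writes, grouped by key.
    var grouped = {};
    var mapContext = {
        write: function (k, v) {
            if (!Object.prototype.hasOwnProperty.call(grouped, k)) {
                grouped[k] = [];
            }
            grouped[k].push(v);
        }
    };
    lines.forEach(function (line, lineNumber) {
        mapFn(lineNumber, line, mapContext);
    });

    // Call the reduce function once per key, in sorted key order.
    var results = {};
    var reduceContext = { write: function (k, v) { results[k] = v; } };
    Object.keys(grouped).sort().forEach(function (k) {
        var position = 0;
        var values = {
            hasNext: function () { return position < grouped[k].length; },
            next: function () { return grouped[k][position++]; }
        };
        reduceFn(k, values, reduceContext);
    });
    return results;
}

var counts = runLocally(
    ["The code is brought to the data", "Store the data"],
    map,
    reduce
);
console.log(counts);
```

On a real cluster the same two functions would run on the servers that hold the data; only the surrounding runtime differs.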



MapReduce – Workflow

A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner

The framework sorts the outputs of the maps, which are then input to the reduce tasks

The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks

[Diagram: Input Domain -> Map (x3) -> Intermediate Domain -> Reduce (x3) -> Intermediate Domain -> Reduce -> Output Domain]
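The "independent chunks" and "re-executes the failed tasks" points can be sketched in a few lines. This is a simplified, sequential illustration: `splitIntoChunks`, `runWithRetry`, the retry limit of 3, and the simulated failure are all assumptions made for the example, and a real framework would run the tasks in parallel on different machines.

```javascript
// Split the input into independent chunks (one per map task).
function splitIntoChunks(records, chunkSize) {
    const chunks = [];
    for (let i = 0; i < records.length; i += chunkSize) {
        chunks.push(records.slice(i, i + chunkSize));
    }
    return chunks;
}

// Run a task and re-execute it if it throws, up to maxAttempts times.
function runWithRetry(task, maxAttempts) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return { result: task(attempt), attempts: attempt };
        } catch (err) {
            lastError = err; // remember the failure and try again
        }
    }
    throw lastError;
}

const chunks = splitIntoChunks([1, 2, 3, 4, 5, 6, 7], 3);

// Hypothetical map task: sum one chunk. The second chunk fails on its
// first attempt to simulate a lost task.
const outcomes = chunks.map(function (chunk, index) {
    return runWithRetry(function (attempt) {
        if (index === 1 && attempt === 1) {
            throw new Error("simulated task failure");
        }
        return chunk.reduce(function (a, b) { return a + b; }, 0);
    }, 3);
});
console.log(outcomes);
```

Because each chunk is processed independently, a failed task can be rerun on its own without redoing the work of the others.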



Map

Scenario: Get the sum of sales grouped by zipCode

(custId, zipCode, amount)

[Figure: blocks of the Sales file in HDFS are spread across DataNode1, DataNode2, and DataNode3. On each node a Mapper reads its local blocks of (custId, zipCode, amount) rows, and a Group By step sorts them into one output bucket per reduce task.]

Map tasks


Reduce

[Figure: the Mapper outputs are shuffled to the Reducers. Each Reducer sorts the rows it receives by zipCode and applies SUM to produce one total per zipCode. Done!]

Reduce tasks
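The sales-by-zipCode scenario can be sketched end to end as below. The sample rows, the number of reduce tasks, and the `partition`, `mapTask`, and `reduceTask` helpers are hypothetical and chosen only for illustration; in particular the hash used to pick a bucket is a simple stand-in, not the partitioner Hadoop uses.

```javascript
const NUM_REDUCERS = 2; // assumed number of reduce tasks

// Pick the reduce task for a key (simple illustrative hash).
function partition(key, numReducers) {
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
        hash = (hash * 31 + key.charCodeAt(i)) % 1000003;
    }
    return hash % numReducers;
}

// Map task: read one block and write (zipCode, amount) pairs into
// one output bucket per reduce task.
function mapTask(block, numReducers) {
    const buckets = [];
    for (let r = 0; r < numReducers; r++) {
        buckets.push([]);
    }
    block.forEach(function (rec) {
        buckets[partition(rec.zipCode, numReducers)].push([rec.zipCode, rec.amount]);
    });
    return buckets;
}

// Reduce task: shuffle (collect this reducer's bucket from every
// mapper), sort by zipCode, then sum the amounts per zipCode.
function reduceTask(reducerIndex, allMapOutputs) {
    const pairs = [];
    allMapOutputs.forEach(function (buckets) {
        buckets[reducerIndex].forEach(function (pair) { pairs.push(pair); });
    });
    pairs.sort(function (a, b) {
        return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
    });
    const totals = {};
    pairs.forEach(function (pair) {
        totals[pair[0]] = (totals[pair[0]] || 0) + pair[1];
    });
    return totals;
}

// Hypothetical blocks of the Sales file, one per data node.
const blocks = [
    [{ custId: 5, zipCode: "02115", amount: 75 },
     { custId: 6, zipCode: "53705", amount: 55 },
     { custId: 7, zipCode: "54235", amount: 95 }],
    [{ custId: 5, zipCode: "53705", amount: 55 },
     { custId: 2, zipCode: "53705", amount: 22 },
     { custId: 9, zipCode: "02115", amount: 30 }],
    [{ custId: 8, zipCode: "44313", amount: 15 },
     { custId: 3, zipCode: "10025", amount: 25 },
     { custId: 6, zipCode: "54235", amount: 10 }]
];

const mapOutputs = blocks.map(function (block) {
    return mapTask(block, NUM_REDUCERS);
});
const totals = {};
for (let r = 0; r < NUM_REDUCERS; r++) {
    Object.assign(totals, reduceTask(r, mapOutputs));
}
console.log(totals);
```

Since every row for a given zipCode lands in the same bucket, each reduce task can total its zipCodes without consulting the others.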



Hadoop



Hadoop Architecture

MapReduce Layer: Job tracker, Task trackers

HDFS Layer: Name node, Data nodes

Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png



Traditional RDBMS vs. MapReduce

  • Reference: Tom White’s Hadoop: The Definitive Guide



The Hadoop Ecosystem

ETL Tools

BI Reporting

RDBMS

ZooKeeper (Coordination)

Pig (Data Flow)

Hive (SQL)

Sqoop

Avro (Serialization)

MapReduce (Job Scheduling/Execution System)

HBase (Column DB)

HDFS (Hadoop Distributed File System)

  • Reference: Tom White’s Hadoop: The Definitive Guide



Microsoft and Hadoop


Hadoop on Azure

  • What does Hadoop in the Cloud mean?

    • Where is HDFS?

    • Where is my data stored?

    • Azure Blob Storage vs. HDFS

Cluster: Name Node, Data Nodes, HDFS

Storage and end points: Azure Blob Storage, SQL Azure, S3, Application end point

  • On Premise Enterprise Content

    • Transactional DBs

    • On Prem logs

    • Internal sensors

  • Cloud Enterprise Content

    • Generated in Azure

    • Generated/stored elsewhere

  • 3rd Party Content

    • Azure Datamarket

    • Public content

    • Delivered online



Detailed Offerings

INSIGHTS

Hive ODBC Driver & Hive Add-in for Excel

Integration with Microsoft PowerPivot

Hadoop based distribution for Windows Server & Azure

Strategic Partnership with Hortonworks

ENTERPRISE READY

JavaScript framework for Hadoop

RTM of Hadoop connectors for SQL Server and PDW

BROADER ACCESS



Microsoft Big Data Solution

FAMILIAR END USER TOOLS

Excel with PowerPivot

Power View

Predictive Analytics

Embedded BI

BI PLATFORM

SSAS

SSRS

Microsoft EDW

Connectors

Hadoop On Windows Azure

Hadoop On Windows Server

UNSTRUCTURED & STRUCTURED DATA

Sensors

Devices

Bots

Crawlers

ERP

CRM

LOB

APPs



Deploying and Interacting With a Hadoop Cluster on Azure

demo


Hadoop on Windows: Insights to all users by activating new types of data

Differentiation

INSIGHTS

Integrate with Microsoft Business Intelligence

Choice of deployment on Windows Server + Windows Azure

Integrate with Windows Components (AD, Systems Center)

ENTERPRISE READY

Easy installation and configuration of Hadoop on Windows

Simplified programming with .NET & JavaScript integration

Integrate with SQL Server Data Warehousing

BROADER ACCESS

  • Contributions proposed back to community distribution



Summary

Hadoop is about massive compute and massive data

The code is brought to the data

Map -> Split the work

Reduce -> Combine the results

Relational databases vs. Hadoop?

Wrong question - Serve different needs



Resources

http://www.hadooponazure.com/

http://hadoop.apache.org/

