Hadoop on Azure 101: What is the Big Deal?


Presentation Transcript

Hadoop on Azure 101: What is the Big Deal?

Dennis Mulder

Solution Architect – Global Windows Azure Center of Excellence

Microsoft Corporation

Agenda

Why Big Data?

Understanding the Basics

Microsoft and Hadoop

1.8 ZETTABYTES
  • of information will be created in 2011
  • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011
7.9 ZETTABYTES

By 2015

  • Source: CenturyLink resource center, as reported in ReadWriteWeb, Nov 17, 2011
Bing ingests > 7 petabytes a month

The Twitter community generates over 1 terabyte of tweets every day

Cisco predicts that by 2013, annual internet traffic will reach 667 exabytes

Sources: The Economist, Feb ‘10; DBMS2; Microsoft Corp

The Potential: Solving Specific Industry Problems

eCommerce: mining web logs: collaborative filtering, user experience optimisation…

Manufacturing: detecting trends and anomalies in sensor data: predicting and understanding faults

Capital Markets: joining market and external data: correlation detection for investment strategy identification, risk calculations…

Retail Banking: historical transaction mining: fraud detection, customer segmentation…

Industry-specific data sets can be leveraged to improve decision making and generate new revenue streams

Traditional E-Commerce Data Flow

[Diagram: operational data (new user registry, new purchase, new product) is fed through ETL of some of the data into the Data Warehouse; logs and excess data stay outside it]

New E-Commerce Big Data Flow

[Diagram: operational data (new user registry, new purchase, new product) still feeds the Data Warehouse, but logs and raw data now also land in a "store it all" cluster, which can answer questions such as "How much do views for certain products increase when our TV ads run?"]

So How Does It Work?

FIRST, STORE THE DATA

[Diagram: files are distributed across many servers]

So How Does It Work?

SECOND, TAKE THE PROCESSING TO THE DATA

// Map and Reduce functions in JavaScript (word count)

var map = function (key, value, context) {
    // Split the input value into words and emit (word, 1) for each one
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum the counts emitted for a single word
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

[Diagram: the runtime pushes the code out to each server that stores the data]

MapReduce – Workflow

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner

The framework sorts the outputs of the maps, which are then input to the reduce tasks

The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail

[Diagram: input domain → map tasks → intermediate domain → reduce tasks → output domain]
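
As a rough local illustration of that workflow, here is a plain JavaScript sketch of the split / map / sort-and-shuffle / reduce steps. It is not the Hadoop runtime, and the sample input lines are made up for the example; it only shows how data flows between the phases.

// Simulate the MapReduce word-count data flow on a single machine.
var input = ["Big data on Azure", "Hadoop on Azure", "Big deal"];

// 1. Split the input data set into independent chunks (here: one line per chunk).
var chunks = input;

// 2. Apply the map function to every chunk, collecting (key, value) pairs.
var mapped = [];
chunks.forEach(function (line) {
    line.split(/[^a-zA-Z]/).forEach(function (word) {
        if (word !== "") {
            mapped.push([word.toLowerCase(), 1]);
        }
    });
});

// 3. Sort/shuffle the map output so that all values for a key end up together...
var grouped = {};
mapped.forEach(function (pair) {
    var key = pair[0], value = pair[1];
    (grouped[key] = grouped[key] || []).push(value);
});

// 4. ...and reduce each group to a single output value.
var result = {};
Object.keys(grouped).forEach(function (key) {
    result[key] = grouped[key].reduce(function (sum, v) { return sum + v; }, 0);
});

console.log(result); // { big: 2, data: 1, on: 2, azure: 2, hadoop: 1, deal: 1 }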

Map

Scenario: get the sum of sales grouped by zipCode; each input record is (custId, zipCode, amount)

[Diagram: blocks of the Sales file in HDFS, holding sample (custId, zipCode, amount) records, are spread across DataNode1, DataNode2 and DataNode3; a group-by mapper runs against each block (the map tasks) and writes one output bucket per reduce task]
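
Written in the style of the word-count functions shown earlier, a map function for this scenario might look roughly like the sketch below; the comma-separated record layout and the key/value/context parameters are assumptions for illustration, not code from the deck.

// Map: for each sales record "custId,zipCode,amount", emit (zipCode, amount).
var map = function (key, value, context) {
    var fields = value.split(",");        // assumed layout: custId, zipCode, amount
    var zipCode = fields[1];
    var amount = parseFloat(fields[2]);
    context.write(zipCode, amount);
};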

Reduce

[Diagram: the mappers' output buckets are shuffled to the reducers; each reduce task sorts its input by zipCode and sums the amounts, so that every zipCode (02115, 10025, 44313, 53705, 54235) ends up with a single sales total. Done!]
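
The matching reduce step, again sketched in the style of the earlier example and assuming the same values iterator interface, simply sums whatever the shuffle has grouped under each zipCode:

// Reduce: sum all amounts emitted for one zipCode.
var reduce = function (key, values, context) {
    var total = 0;
    while (values.hasNext()) {
        total += parseFloat(values.next());
    }
    context.write(key, total);            // one (zipCode, total) pair per key
};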

Hadoop Architecture

[Diagram: the MapReduce layer (a Job tracker coordinating Task trackers) runs on top of the HDFS layer (a Name node coordinating Data nodes)]

Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png

Traditional RDBMS vs. MapReduce
  • Reference: Tom White’s Hadoop: The Definitive Guide
The Hadoop Ecosystem

[Diagram: the Hadoop ecosystem stack, with ETL tools, BI reporting and RDBMS sitting alongside it]

  • ZooKeeper (coordination)
  • Pig (data flow)
  • Hive (SQL)
  • Sqoop
  • Avro (serialization)
  • HBase (column DB)
  • MapReduce (job scheduling / execution system)
  • HDFS (Hadoop Distributed File System)

  • Reference: Tom White’s Hadoop: The Definitive Guide
Hadoop on Azure

  • What does Hadoop in the cloud mean?
    • Where is HDFS?
    • Where is my data stored?
    • Azure Blob Storage vs. HDFS

[Diagram: a Hadoop cluster on Azure (Name Node plus Data Nodes, backed by HDFS and Azure Blob Storage) is fed by several kinds of content:]

  • On-premise enterprise content: transactional DBs, on-prem logs, internal sensors
  • Cloud enterprise content: generated in Azure, SQL Azure
  • 3rd party content: Azure DataMarket, public content delivered online
  • Content generated/stored elsewhere: S3, application end points
Detailed Offerings

  • INSIGHTS: Hive ODBC Driver & Hive Add-in for Excel; integration with Microsoft PowerPivot
  • ENTERPRISE READY: Hadoop-based distribution for Windows Server & Azure; strategic partnership with Hortonworks
  • BROADER ACCESS: JavaScript framework for Hadoop; RTM of Hadoop connectors for SQL Server and PDW

Microsoft Big Data Solution

[Diagram: the Microsoft Big Data stack]

  • Familiar end-user tools: Excel with PowerPivot, Power View, predictive analytics, embedded BI
  • BI platform: SSAS, SSRS
  • Microsoft EDW, linked through connectors to Hadoop on Windows Azure and Hadoop on Windows Server
  • Unstructured & structured data: sensors, devices, bots, crawlers, ERP, CRM, LOB apps

Hadoop on Windows: Insights to all users by activating new types of data

Differentiation

  • INSIGHTS: integrate with Microsoft Business Intelligence
  • ENTERPRISE READY: choice of deployment on Windows Server + Windows Azure; integrate with Windows components (AD, System Center)
  • BROADER ACCESS: easy installation and configuration of Hadoop on Windows; simplified programming with .NET & JavaScript integration; integrate with SQL Server data warehousing

  • Contributions proposed back to community distribution
Summary

Hadoop is about massive compute and massive data

The code is brought to the data

Map -> Split the work

Reduce -> Combine the results

Relational databases vs. Hadoop?

Wrong question: they serve different needs

Resources

http://www.hadooponazure.com/

http://hadoop.apache.org/
