

Disclaimer

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle Corporation.


How to Set Up a Hadoop Cluster with Oracle Solaris [HOL10182]

Orgad Kimchi, Principal Software Engineer


Agenda

  • Lab Overview

  • Hadoop overview

  • The benefits of using Oracle Solaris technologies for a Hadoop cluster


Lab Overview

  • In this hands-on lab, we will present and demonstrate through exercises how to set up a Hadoop cluster using Oracle Solaris 11 technologies such as Zones, ZFS, DTrace, and network virtualization.

  • Key topics include the Hadoop Distributed File System and MapReduce.

  • We will also cover the Hadoop installation process and the cluster building blocks: the NameNode, a secondary NameNode, and DataNodes.


Lab Overview – Cont’d

  • During the lab, users will learn how to load data into the Hadoop cluster and run a MapReduce job.

  • This hands-on training lab is for system administrators and others responsible for managing Apache Hadoop clusters in production or development environments.


Lab Main Topics

    1. How to install Hadoop
    2. Edit the Hadoop configuration files
    3. Configure the Network Time Protocol
    4. Create the Virtual Network Interfaces
    5. Create the NameNode and the Secondary NameNode Zones
    6. Configure the NameNode
    7. Set up SSH between the Hadoop cluster members
    8. Format the HDFS file system
    9. Start the Hadoop cluster
   10. Run a MapReduce job
   11. How to secure data at rest using ZFS encryption
   12. Performance monitoring using Solaris DTrace


What is Big Data?

  • Big Data is both: large and variable datasets + a new set of technologies

  • Extremely large files of unstructured or semi-structured data

  • Large and highly distributed datasets that are otherwise difficult to manage as a single unit of information

  • A new set of technologies that can economically acquire, organize, store, analyze, and extract value from Big Data datasets – thus facilitating better, more informed business decisions


Data is Everywhere!

Facts & Figures

  • 234M Web sites

  • Facebook

    • 500M Users

    • 40M photos per day

    • 30 billion new pieces of content per month

  • 7M New sites in 2010

  • New York Stock Exchange

    • 1 TB of data per day

  • Web 2.0

    • 147M Blogs and growing

    • Twitter – 12TB of data per day



Introduction To Hadoop


What is Hadoop?

  • Originated at Google in 2003

    • Generation of search indexes and web scores

  • Top-level Apache project, consisting of two key services:

    1. Hadoop Distributed File System (HDFS): highly scalable, fault-tolerant, distributed

    2. MapReduce API (Java); can be scripted in other languages

  • Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.


Components of Hadoop


HDFS

  • HDFS is the file system responsible for storing data on the cluster

  • Written in Java (based on Google’s GFS)

  • Sits on top of a native file system (ext3, ext4, xfs, etc.)

  • POSIX-like file permissions model

  • Provides redundant storage for massive amounts of data

  • HDFS is optimized for large, streaming reads of files
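To make the storage model above concrete, here is a minimal Python sketch that simulates how a file is split into fixed-size blocks and how each block could be replicated across DataNodes. This is an illustration only: the round-robin placement and the node names are hypothetical assumptions for the example, while real HDFS placement is rack-aware and accessed through the Hadoop API.

```python
# Simulation of the HDFS storage model: fixed-size blocks, replicated
# across DataNodes. Not the real HDFS API, just the underlying idea.

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size (64 MB)
REPLICATION = 3                # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Toy round-robin replica placement (real HDFS is rack-aware)."""
    placement = {}
    for index, _ in blocks:
        placement[index] = [datanodes[(index + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
print(len(blocks))                              # 4 blocks (64+64+64+8 MB)
print(place_replicas(blocks, nodes)[0])         # replicas of block 0
```

Note how the last block is smaller than 64 MB: HDFS blocks only occupy the space they need, and losing any single node still leaves two replicas of every block.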


The Five Hadoop Daemons

Hadoop consists of five separate daemons:

  • NameNode: Holds the metadata for HDFS

  • Secondary NameNode: Performs housekeeping functions for the NameNode

  • DataNode: Stores actual HDFS data blocks

  • JobTracker: Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker. Coordinates MapReduce stages.

  • TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks


Hadoop Architecture


MapReduce

  • Map: accepts an input key/value pair; emits intermediate key/value pairs

  • Reduce: accepts an intermediate key with the list of all its values; emits output key/value pairs

[Diagram: very big data → Map → partitioning function → Reduce → result]


MapReduce Example

Counting word occurrences in a document:

Input:

how many chucks could a woodchuck chuck if a woodchuck could chuck wood

Map (4 nodes):

  • Node 1: how,1 many,1 chucks,1 could,1

  • Node 2: a,1 woodchuck,1 chuck,1

  • Node 3: could,1 chuck,1 wood,1

  • Node 4: if,1 a,1 woodchuck,1

Group by key, then Reduce (2 nodes):

  • Node 1: if,1 many,1 wood,1 woodchuck,1:1

  • Node 2: a,1:1 chuck,1:1 chucks,1 could,1:1 how,1

Output:

a,2 chuck,2 chucks,1 could,2 how,1 if,1 many,1 wood,1 woodchuck,2
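The word-count data flow above can be sketched in a few lines of Python. Real Hadoop jobs implement Mapper and Reducer classes in Java; this in-memory simulation only mirrors the map, group-by-key, and reduce steps on the same sentence.

```python
# In-memory sketch of MapReduce word count: map emits (word, 1) pairs,
# the pairs are grouped by key, and reduce sums the counts per word.
from itertools import groupby

def map_phase(text):
    """Emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    """Group intermediate pairs by key and sum each key's values."""
    pairs = sorted(pairs)  # the "group by key" (shuffle/sort) step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

text = ("how many chucks could a woodchuck chuck "
        "if a woodchuck could chuck wood")
counts = reduce_phase(map_phase(text))
print(counts["woodchuck"])   # 2
print(counts["chuck"])       # 2
```

On a cluster the map calls run in parallel on different nodes and the sort/group step happens across the network, but the logical result is exactly the output shown on the slide.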


MapReduce Functions

  • MapReduce partitions data into 64 MB chunks (default)

  • Distributes data and jobs across thousands of nodes

  • Tasks scheduled based on location of data

  • Master writes periodic checkpoints

  • If a map worker fails, the master restarts its tasks on a new node

  • Barrier - no reduce can begin until all maps are complete

  • HDFS manages data replication for redundancy

  • MapReduce library does the hard work for us!
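The partitioning step mentioned above decides which reducer receives each intermediate key, so that all values for a given key meet at the same reducer. The following Python sketch uses a CRC32 hash as a stand-in for Hadoop's default hash partitioner (which, roughly, takes the key's hash code modulo the number of reduce tasks in Java); the pairs and reducer count are made up for illustration.

```python
# Sketch of hash partitioning: deterministically assign each
# intermediate key to one of R reducers.
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers)."""
    # zlib.crc32 is a stable hash, so the assignment is reproducible
    # (Python's built-in hash() is randomized for strings by default).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

pairs = [("chuck", 1), ("wood", 1), ("chuck", 1), ("a", 1)]
buckets = {}
for key, value in pairs:
    buckets.setdefault(partition(key, 2), []).append((key, value))

# Both ("chuck", 1) pairs land in the same bucket, whichever index
# that is, so a single reducer can sum all of chuck's counts.
```

Because the assignment depends only on the key, the barrier described above is safe: once every map has finished, each reducer knows it has received every value for its keys.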


RDBMS compared to MapReduce


The benefits of using Oracle Solaris technologies for a Hadoop cluster


The benefits of using Oracle Solaris Zones for a Hadoop cluster

  • Oracle Solaris Zones benefits:

  • Fast provisioning of new cluster members using the Solaris Zones cloning feature

  • Very high network throughput between zones for DataNode replication


The benefits of using Oracle Solaris ZFS for a Hadoop cluster

  • Oracle Solaris ZFS benefits:

  • Immense data capacity: a 128-bit file system, perfect for big datasets

  • Optimized disk I/O utilization for better I/O performance with ZFS built-in compression

  • Secure data at rest using ZFS encryption


The benefits of using Oracle Solaris technologies for a Hadoop cluster

  • Multithread awareness – Oracle Solaris understands the correlation between cores and threads, and it provides a fast and efficient thread implementation.

  • DTrace – a comprehensive, advanced tracing tool for troubleshooting systemic problems in real time.

  • SMF – allows building dependencies between Hadoop services (e.g., starting the MapReduce daemons after the HDFS daemons).


For more information

  • How to Set Up a Hadoop Cluster Using Oracle Solaris Zones

  • How to Build Native Hadoop Libraries for Oracle Solaris 11

  • Hadoop for Big Data Analytics on SPARC T5 Servers [CON4582]

    Thursday, Sep 26, 3:30 PM - 4:30 PM

    Moscone South - 304

