An Internet Traffic Analysis Method with MapReduce

An Internet Traffic Analysis Method with MapReduce. Youngseok Lee, Wonchul Kang and Hyeongu Son, Chungnam National University. Presented by Venkata Patlolla, Old Dominion University, CS 775 Distributed Systems, Dr. Mukkamala, April 18, 2011.


An Internet Traffic Analysis Method with MapReduce

Youngseok Lee, Wonchul Kang and Hyeongu Son

Chungnam National University

Presented by

Venkata Patlolla

Old Dominion University

CS 775 Distributed Systems, Dr. Mukkamala

April 18, 2011


Agenda

  • Introduction

  • Related Work

  • MapReduce-Based Flow Analysis

    • Overview

    • Flow Analysis Method with MapReduce

  • Performance Evaluation

    • Experimental Environment

    • Flow Statistics Computation Time

    • Recovery of a single node failure

  • Conclusion


Introduction

  • Flow-based traffic monitoring methods are widely used by ISPs.

    • E.g., Cisco NetFlow monitors the flows passing through routers and switches without observing each packet.

  • NetFlow-compatible flow generators such as “nProbe” monitor packet streams in flow units.

  • As networks grow, more routers and switches must be monitored for security, quality of service, and accounting.

  • Typically, ISPs use high-performance servers with large storage systems to collect and analyze flow data from many routers.

  • Computing traffic statistics from many large flow files in a short time is not easy. Packet sampling and aggregation techniques are used to reduce the continuous stream of flow data.


Cont..

  • A single-server approach is not efficient when

    • analyzing flow data for a large network (terabytes or petabytes), or

    • a global Internet worm or DDoS (distributed denial-of-service) attack occurs.

  • Cluster file systems and cloud computing platforms provide

    • distributed parallel computing

    • fault tolerance

  • Google, Yahoo, Facebook, and Amazon are actively developing cluster file systems and cloud computing platforms.


Introduction to MapReduce

MapReduce: a software framework that supports distributed computing on large datasets across clusters, using two functions, Map and Reduce.

  • Google first developed the MapReduce programming model for page ranking and web log analysis.

  • Yahoo released an open-source system for cloud computing platforms, called “Hadoop”.

  • Amazon provides cloud computing services, such as Elastic Compute Cloud (EC2) and Simple Storage Service (S3), on which Hadoop can run.

  • Facebook also uses Hadoop to analyze the web logs of its network.

  • All of these companies use cloud computing on cluster file systems because it provides fault tolerance and makes huge data sets easier to manage.


Related work

  • Flow analysis tools such as flow-tools, CoralReef, and FlowScan generate flow statistics such as port breakdowns.

  • These tools run on a single server with a large storage system such as RAID or network-attached storage (NAS).

  • They are not efficient at processing terabytes or petabytes of flow data.

  • Traffic can be analyzed with parallel processing in several ways.

    • One is DIPStorage, which uses a P2P platform of storage tanks. However, associating each tank with a flow processing rule increases computation overhead.

  • MapReduce programs are used to reduce computation time and to analyze huge amounts of data.


MapReduce-Based Flow Analysis

Architecture of flow measurement and analysis system


Architecture Description

  • Cloud platform: provides the cluster file system and cloud computing functions.

  • Flow data from routers is delivered to the cluster by unicast or anycast.

  • The master node directs the cluster nodes to store and process the flow data.

  • The master node also handles the cluster configuration.

  • Once flow data is archived on the cluster file system, the MapReduce flow analysis program runs on the cloud platform.

  • The architecture of each cluster node is shown below.


Functional components of Cluster node


Functionality

  • Flow collector: stores received flow packets in files on the local disk and periodically moves them to the cluster file system.

    • Routers send NetFlow packets to cluster nodes by unicast over UDP, which is not reliable. SCTP can be used for reliability.

    • Anycast can be used to balance the load across cluster nodes when receiving NetFlow packets.

  • The flow collector uses flow-tools to collect and process NetFlow data.

  • The Mapper and Reducer analyze flow data with the Hadoop MapReduce library.

  • The authors use HDFS to manage huge data sets and to provide a fault-tolerant service.
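
As a rough illustration of what a flow collector receives, the Python sketch below decodes one NetFlow v5 export packet (a 24-byte header followed by 48-byte flow records) into the attributes used later in the analysis. The function name and dictionary keys are illustrative assumptions; this is not the flow-tools code or the authors' collector.

```python
import socket
import struct
from typing import Dict, List

# NetFlow v5 layout: a 24-byte header, then `count` flow records of 48 bytes each.
HEADER = struct.Struct("!HHIIIIBBH")
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def parse_netflow_v5(packet: bytes) -> List[Dict[str, object]]:
    """Decode one NetFlow v5 export packet into a list of flow dicts.

    The dictionary keys are names chosen for this sketch.
    """
    version, count, *_ = HEADER.unpack_from(packet, 0)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 packet (version={version})")
    flows = []
    for i in range(count):
        f = RECORD.unpack_from(packet, HEADER.size + i * RECORD.size)
        flows.append({
            "src_ip": socket.inet_ntoa(f[0]),
            "dst_ip": socket.inet_ntoa(f[1]),
            "packets": f[5],    # dPkts
            "octets": f[6],     # dOctets
            "src_port": f[9],
            "dst_port": f[10],
            "tcp_flags": f[12],
            "protocol": f[13],
        })
    return flows
```

A collector would call a function like this on each datagram read from its UDP socket before appending the flows to a file.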


Cont..

  • HDFS follows a write-once, read-many-times pattern.

  • HDFS has

    • Name node: manages file system metadata and provides management and control services.

    • The name node at the master performs recovery and automatic backup of the name nodes.

    • Data node: supplies block storage and retrieval services.


Flow Analysis method with MapReduce

  • A MapReduce computation consists of Map and Reduce functions.

    • Map takes an input key/value pair and produces intermediate key/value pairs.

    • The Hadoop MapReduce library groups the intermediate values by key.

    • Reduce merges the intermediate values for each key into a smaller set of values.

  • To implement various flow analysis programs with MapReduce, we have to determine appropriate input key/value pairs.

  • E.g., to analyze traffic by port breakdown, which sums the octet count for each port number, the key/value pair is (port, octets).

  • This is shown in the following figure.
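
The Map, group-by-key, and Reduce steps can be sketched in a few lines of single-process Python. This is only a toy model of the programming model, not Hadoop itself; `run_map_reduce`, `port_map`, `octet_sum`, and the sample flows are all hypothetical names and made-up values.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Flow = Dict[str, int]


def run_map_reduce(
    records: Iterable[Flow],
    map_fn: Callable[[Flow], Iterator[Tuple[int, int]]],
    reduce_fn: Callable[[int, List[int]], int],
) -> Dict[int, int]:
    """Run map, group the intermediate values by key, then reduce each group."""
    intermediate: Dict[int, List[int]] = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):      # Map: emit intermediate pairs
            intermediate[key].append(value)    # Group values under the same key
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def port_map(flow: Flow) -> Iterator[Tuple[int, int]]:
    """Emit the (port, octets) pair for one flow record."""
    yield flow["port"], flow["octets"]


def octet_sum(port: int, octets: List[int]) -> int:
    """Merge all octet counts seen for one port into a single total."""
    return sum(octets)


# Made-up sample flows, for illustration only.
flows = [
    {"port": 80, "octets": 1200},
    {"port": 53, "octets": 90},
    {"port": 80, "octets": 300},
]
totals = run_map_reduce(flows, port_map, octet_sum)
print(totals)  # {80: 1500, 53: 90}
```

In a real cluster the grouping step is performed by the framework across machines; only the two small functions are written by the analyst.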


A MapReduce flow analysis program for destination port breakdown


Explanation of example in detail

  • Input flow files: after flow data is stored on the local disk, the raw NetFlow v5 files are moved to the cluster file system, HDFS.

  • Because the Hadoop mapper supports only text files, the NetFlow files are converted to text. Because text files are large, the mapper should also accept binary input; otherwise gzip files cannot be used as input.

  • Mapper: reads each flow record, split by newline. Each record has a timestamp, ports, IP addresses, flags, octet count, and packet count.

    • After reading, the mapper extracts the flow attributes needed for the flow analysis job.

    • For a job that sums octet counts per destination port number, the key/value pair is set to (dst port, octets).

    • The flow map task writes its temporary results to the local disk.
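
A minimal mapper sketch for this step is given below, assuming each text flow record is one comma-separated line. The column order in `FIELDS` and the function names are assumptions made for illustration; the slides do not specify the actual text layout. The output uses tab-separated key/value lines, the default convention of Hadoop Streaming.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical column order of one text flow record (an assumption for this sketch).
FIELDS = ("timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
          "tcp_flags", "octets", "packets")


def map_flow_line(line: str) -> Optional[Tuple[int, int]]:
    """Keep only the attributes the job needs: (dst_port, octets)."""
    parts = line.strip().split(",")
    if len(parts) != len(FIELDS):
        return None                      # skip malformed records
    record = dict(zip(FIELDS, parts))
    try:
        return int(record["dst_port"]), int(record["octets"])
    except ValueError:
        return None                      # skip records with non-numeric fields


def flow_mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'dst_port<TAB>octets' line per valid flow record."""
    for line in lines:
        pair = map_flow_line(line)
        if pair is not None:
            yield f"{pair[0]}\t{pair[1]}"
```

Run as a streaming mapper, the function would be fed `sys.stdin` and its results printed line by line.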


Cont..

  • Reducer: reads its input from the temporary files, i.e., the intermediate values generated by the flow mappers.

    • The list of octet counts belonging to the same destination port number is summed.

    • After merging the octet values associated with each destination port, the flow reducer writes the total octet count for each port number.
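
The reducer side can be sketched the same way. The sketch assumes the intermediate data arrives as "port<TAB>octets" lines already sorted by port, which is how a streaming-style shuffle phase would deliver them; `parse_pair` and `flow_reducer` are hypothetical names, not the authors' code.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def parse_pair(line: str) -> Tuple[int, int]:
    """Split one 'port<TAB>octets' intermediate line into integers."""
    port, octets = line.rstrip("\n").split("\t")
    return int(port), int(octets)


def flow_reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the octet counts of consecutive lines that share a destination port.

    Input must already be sorted by port; groupby only merges adjacent keys.
    """
    pairs = (parse_pair(line) for line in sorted_lines if line.strip())
    for port, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(octets for _, octets in group)
        yield f"{port}\t{total}"
```

Because the input is sorted, the reducer holds only one running total at a time instead of a table of all ports.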


Performance Evaluation

  • The testbed consists of a master node and four data nodes. Each node has a quad-core 2.83 GHz CPU, 4 GB of memory, and a 1.5 TB hard disk. HDFS is used as the cluster file system. All Hadoop nodes are connected by 1 Gigabit Ethernet.

  • A 5-minute flow file is not enough to assess the performance of MapReduce.


Cont..

  • Thus, to evaluate the flow statistics computation time for large data sets, we used input flow files collected for one day, one week, and one month.

  • The binary flow files are used as inputs to flow-tools, whereas the text flow files are used as inputs to our MapReduce program.


Flow statistics computation time

  • Comparison between flow-tools on a single server and the MapReduce program.

  • The aim is to compute the octet count for each destination port number.

  • We executed the flow-tools command “flow-cat /flowdirectory/ | flow-stat -f 5 > result” to concatenate the binary flow files stored in a directory and to calculate the flow statistics for the destination port.

  • MapReduce program reads text flow files and produces the octet count for each destination port.


Destination port breakdown completion time: flow-tools vs. MapReduce


Recovery of a single node failure


Cont..

  • With a large data set of 108.2 million flows, MapReduce with four data nodes took only 1.5 times as long to complete the job while recovering from Map/Reduce failures.

  • The experiments show that the flow computation job finishes successfully despite a single node failure, thanks to the Hadoop fault-tolerant service.


Conclusion

  • A MapReduce-based flow analysis method for large-scale networks can analyze big flow data efficiently and quickly, even when failures occur.

  • Flow computation time improved by 72% compared with typical flow analysis tools.

  • Fault-tolerant service against a single machine failure is easily provided by MapReduce-based flow analysis.

  • Future work: address a few drawbacks of the current MapReduce-based approach, such as batch processing jobs and text input file formats, and develop convenient flow analysis tools based on MapReduce.


QUESTIONS?

