
Big data Analysis in R using Hadoop


Presentation Transcript


  1. Big data Analysis in R using Hadoop
  HAMS Technologies
  www.hams.co.in
  director@hams.co.in
  priyank@hams.co.in
  vivek@hams.co.in

  2. Data size is continuously increasing due to
  • Multiple sources of data
  • Easy and fast availability of data sources
  • R is a widely known and accepted language in the field of statistical analysis and prediction techniques, but performing analysis over Big Data is always a challenging task.
  • MapReduce is a mechanism for handling large volumes of data. Hadoop is an open-source MapReduce implementation provided by Apache.
  • R and Hadoop can be used together with the help of a package called RHadoop.

  3. Before moving further, let's take a look at the MapReduce architecture and its Hadoop implementation.
  • General flow in the MapReduce architecture:
  • Create a clustered network
  • Load the data into the cluster
  • Process the partitioned data with Map (mapper tasks)
  • Aggregate the results with Reduce (reducer tasks)
  [Diagram: Local Data → Map → Partial Result-1/2/3 → Reduce → Aggregated Result]
  A minimal sketch of this flow in plain R is shown below.
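To make the flow concrete, here is a minimal sketch of the same map/reduce idea in plain R, with no cluster involved. The input lines, the split into three "local data" chunks and the word-count task are illustrative assumptions, not taken from the slides.

```r
# Minimal map/reduce sketch in plain R (no cluster): count words per chunk,
# then aggregate the partial counts into one result.
lines  <- c("big data in r", "r and hadoop", "big data and hadoop")
chunks <- split(lines, rep(1:3, length.out = length(lines)))   # "local data" blocks

# Map: every chunk independently emits partial (word, count) pairs
map_fn <- function(chunk) {
  words <- unlist(strsplit(chunk, " "))
  tab   <- table(words)
  setNames(as.numeric(tab), names(tab))
}
partials <- lapply(chunks, map_fn)          # Partial Result-1, -2, -3

# Reduce: aggregate the partial results into the final result
reduce_fn <- function(parts) {
  v <- do.call(c, unname(parts))            # concatenate all (word, count) pairs
  tapply(v, names(v), sum)                  # sum the counts per word
}
reduce_fn(partials)                         # Aggregated Result
```

Hadoop performs the same two phases, but distributes the chunks, the mapper calls and the final aggregation across the nodes of the cluster.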

  4. General attributes of the MapReduce architecture
  • Distributed file system (DFS)
  • Data locality
  • Data redundancy for fault tolerance
  • Map tasks are applied to partitioned data and are scheduled so that their input blocks sit on the same machine
  • Reduce tasks process the data partitioned by the Map tasks
  [Diagram: Local Data → Map → Partial Result-1/2/3 → Reduce → Aggregated Result, as on the previous slide]

  5. Hadoop is an open-source implementation of the MapReduce architecture maintained by Apache.
  [Diagram: Hadoop = HDFS (Hadoop Distributed File System) + MapReduce job tracking. Master nodes host the name node/s and job tracker node/s; slave nodes host the data node/s and tracker node/s; HIVE (Hadoop interactIVE) sits on top of the framework.]

  6. Hadoop streaming allows MapReduce jobs to be created and run with any executable acting as the Mapper and/or the Reducer.
  • HDFS (Hadoop Distributed File System) is the clustered storage layer. HDFS replicates and tracks the individual data blocks. An HDFS write is shown below; reads follow the same path in reverse.
  [Diagram: HDFS write. A client holding hams.txt (Block-1, Block-2, Block-3) asks the Name Node where to write the blocks; the Name Node assigns them to Data Node-1, Data Node-2 and Data Node-3.]
  A sketch of a streaming mapper and reducer written in R is shown below.
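As an illustration of Hadoop streaming, here is a hedged sketch of a word-count mapper and reducer written as standalone R scripts. The file names mapper.R and reducer.R and the word-count task are assumptions for this example; Hadoop streaming only requires that each script read lines from standard input and write key<TAB>value lines to standard output.

```r
#!/usr/bin/env Rscript
## mapper.R (hypothetical file name) -- reads input lines from stdin and
## emits one "word<TAB>1" pair per word.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in unlist(strsplit(line, "[[:space:]]+")))
    if (nzchar(word)) cat(word, "\t", 1, "\n", sep = "")
}
close(con)
```

and the matching reducer:

```r
#!/usr/bin/env Rscript
## reducer.R (hypothetical file name) -- sums the 1s per word. Hadoop
## streaming delivers the mapper output sorted by key, so identical keys
## arrive consecutively.
con <- file("stdin", open = "r")
current <- NULL; total <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  kv <- strsplit(line, "\t")[[1]]
  if (!identical(kv[1], current)) {
    if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
    current <- kv[1]; total <- 0
  }
  total <- total + as.numeric(kv[2])
}
if (!is.null(current)) cat(current, "\t", total, "\n", sep = "")
close(con)
```

Such scripts are typically handed to the hadoop-streaming jar along the lines of hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper mapper.R -reducer reducer.R -file mapper.R -file reducer.R (the exact jar name and path depend on the installation).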

  7. HIVE (Hadoop InteractIVE) provides high-level functions for handling the Hadoop framework, such as hive_start(), hive_create(), etc.
  • It provides hive_stream() to support Hadoop streaming
  • DFS functions in R such as DFS_put(), DFS_list(), ...
  • To start working with HIVE:
  • Download Hadoop from https://hadoop.apache.org/core
  • Configure the files:
  • mapred-site.xml – to configure MapReduce
  • core-site.xml – to configure basic Hadoop
  • hdfs-site.xml – for configuration related to HDFS

  8. R and Hadoop: the hive package (Hadoop InteractIVE) is available on R-Forge
  • At the R prompt, the package can be loaded with
  • library('hive')
  • To start Hadoop
  • hive_start()
  • Put the data, list the data
  • DFS_put('source_data', '/router_list')
  • DFS_list('/router_list')

  9. A Hive stream can be initialized as
  • hive_stream(mapper = <mapper_function_name>, reducer = <reducer_function_name>, input = <routers>, output = <routers_out>)
  • Other important functions:
  • DFS_put_object()
  • DFS_cat()
  • hive_create()
  • hive_get_parameter()
  A fuller, hedged example of such a run is sketched below.
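Putting the pieces together, the following is a hedged sketch of a complete hive_stream() run over the /router_list data loaded on the previous slide. The mapper and reducer bodies (counting how often each router appears) are hypothetical, and the sketch assumes hive_stream() executes them as R functions that read key/value lines from standard input, following the skeleton above.

```r
library("hive")
hive_start()

# Hypothetical mapper: emits "<router>\t1" for every line of /router_list.
mapper <- function() {
  con <- file("stdin", open = "r")
  while (length(line <- readLines(con, n = 1)) > 0)
    if (nzchar(trimws(line))) cat(trimws(line), "\t", 1, "\n", sep = "")
  close(con)
}

# Hypothetical reducer: sums the counts for each router key.
reducer <- function() {
  counts <- list()
  con <- file("stdin", open = "r")
  while (length(line <- readLines(con, n = 1)) > 0) {
    kv <- strsplit(line, "\t")[[1]]
    counts[[kv[1]]] <- (if (is.null(counts[[kv[1]]])) 0 else counts[[kv[1]]]) +
      as.numeric(kv[2])
  }
  close(con)
  for (k in names(counts)) cat(k, "\t", counts[[k]], "\n", sep = "")
}

hive_stream(mapper = mapper, reducer = reducer,
            input = "/router_list", output = "/routers_out")

DFS_list("/routers_out")                 # the part-* files hold the result
DFS_cat("/routers_out/part-00000")       # typical part file name; may differ
```

The output directory can then be inspected with the DFS functions listed above; the part-00000 name is the usual Hadoop convention and may vary between jobs.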

  10. Thank you
  Kindly drop us a mail at the addresses below for any suggestions or clarifications. We would like to hear from you.
  HAMS Technologies
  www.hams.co.in
  director@hams.co.in
  priyank@hams.co.in
  vivek@hams.co.in
