Current Research and Issues in Big Data Handling: Challenges and Solutions

IE 5331 – Grad Survey Paper – Issues and Current Research in BIG DATA
Abisheik Kar Ramachandran Akhila Thota
11/19/2013

BIG DATA
• What is BIG DATA?

BIG DATA

Large – Is it really LARGE??
• Google processes > 20 PB a day (2012)
• Facebook has 2.5 PB of user data + 500 TB/day (2012)
• 300 million photos are uploaded to Facebook every day
• eBay has 7.5 PB of user data + 50 TB/day (5/2009)
• Nearly 35 zettabytes of data!
• Need an analogy?

Large – Is it really LARGE??
• 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars.

Who generates these data ?

BIG DATA – Issues and Current Research
Characteristics of Big Data

Characteristics of Big Data

Volume
• Volume represents the amount of data.
• 44x increase from 2009 to 2020
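As a rough worked example, the 44x figure can be turned into an implied yearly growth rate. The sketch below assumes smooth compound growth over the 11 years from 2009 to 2020, which is a simplification; the function name is illustrative and not taken from the slides.

```python
def annual_growth_rate(total_factor: float, years: int) -> float:
    """Return the constant yearly growth rate that compounds to total_factor."""
    return total_factor ** (1.0 / years) - 1.0

# 44x growth spread evenly over 2009..2020 (assumed compound growth).
rate = annual_growth_rate(44, 2020 - 2009)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the data volume grows by roughly 41% every year, which means it doubles about every two years.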

Velocity
• Velocity represents the speed at which data is being created, accessed or streamed.
• A few decades ago, real-time streaming was beyond our imagination, but thanks to advances in technology, data is streamed in real time today.
• Data is being generated fast and should be processed fast.

Variety
• Unstructured data.
• Text files, audio files, video files.

BIG DATA – Issues and Current Research
Big Data Issues

Big Data Issues
• General Issues
• Fundamental Issues
• Storage & Transport Issues
• Processing Issues
• Management Issues
• Design Issues
• High Availability
• Privacy Issues

General Issues
• Handling a wide range of unstructured data, combined with its sheer size, poses a major challenge in handling big data.
• Since we are talking about zettabytes of data, efficient mechanisms to store and retrieve it are vital.
• In an RDBMS, data has to be stored in tables and retrieved using queries; an RDBMS is designed for structured data and structured queries.
• In contrast to an RDBMS, big data is a huge collection of unstructured data that cannot simply be defined in tables.

Fundamental Issues – Storage & Transport Issues
• Due to the enormous amount of data created each second, storing this data becomes a major issue.
• Storage media cannot keep up with the growth in data size.
• To illustrate, processing an exabyte of data on a single system would need nearly 25,000 disks.
• With current communication networks at a transfer rate of 1 gigabyte per second and an effective sustainable transfer rate of 80%, transferring an exabyte of data would take nearly 2800 hours.
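The disk and transfer estimates depend heavily on the units assumed, so it helps to redo the arithmetic explicitly. The sketch below is a back-of-the-envelope check, not the cited authors' calculation: the decimal byte definitions, the 4 TB disk capacity, and the two readings of the link speed (gigabyte versus gigabit per second) are all assumptions made here for illustration.

```python
EXABYTE = 10**18   # bytes, decimal definition (assumption)
PETABYTE = 10**15  # bytes
TERABYTE = 10**12  # bytes

def disks_needed(size_bytes: int, disk_bytes: int) -> int:
    """Number of whole disks required to hold size_bytes (ceiling division)."""
    return -(-size_bytes // disk_bytes)

def transfer_hours(size_bytes: float, link_bytes_per_sec: float,
                   efficiency: float = 0.8) -> float:
    """Hours needed to move size_bytes over a link at the given efficiency."""
    effective_rate = link_bytes_per_sec * efficiency
    return size_bytes / effective_rate / 3600

# Assumed 4 TB disks; the capacity is a placeholder, not from the slides.
disks_4tb = disks_needed(EXABYTE, 4 * TERABYTE)

# Reading 1: the link carries 1 gigabyte (10**9 bytes) per second.
hours_gbyte = transfer_hours(EXABYTE, 1e9)
# Reading 2: the link carries 1 gigabit per second (1.25 * 10**8 bytes/s).
hours_gbit = transfer_hours(EXABYTE, 1e9 / 8)
# For comparison: a petabyte over the 1 gigabit per second link.
hours_pb_gbit = transfer_hours(PETABYTE, 1e9 / 8)

print(f"4 TB disks for 1 EB:      {disks_4tb:,}")
print(f"1 EB at 1 GB/s, 80%:      {hours_gbyte:,.0f} hours")
print(f"1 EB at 1 Gbit/s, 80%:    {hours_gbit:,.0f} hours")
print(f"1 PB at 1 Gbit/s, 80%:    {hours_pb_gbit:,.0f} hours")
```

Under these assumptions an exabyte takes hundreds of thousands of hours or more to transfer, while a figure near 2,800 hours corresponds to a petabyte over a 1 gigabit per second link. The quoted numbers are therefore best read as order-of-magnitude illustrations whose exact values depend on the disk size and link speed assumed.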

Fundamental Issues – Processing Issues
• Now that the technology era has given rise to huge amounts of data, processing it becomes essential.
• Effective processing of data in the exabyte range would require extensive parallel processing capabilities and effective algorithms to handle it.

Management Issues
• A decade ago, 1,000 MB of data was read at a rate of 4 MB/sec.
• Now, read speeds have risen from 4 MB/sec to 100 MB/sec.
• Even at this speed, a single system would take impractically long to read zettabytes of data.
• Reading the disks continuously leads to another potential problem: hardware failure.
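The read-speed figures can be checked with a short calculation. The sketch below takes about 1,000 MB at 4 MB/sec and a 100 MB/sec modern rate from the slide; the 1 TB drive size, the 100-drive parallel setup, and the decimal unit definitions are assumptions added for illustration.

```python
MB = 10**6   # bytes, decimal definition (assumption)
TB = 10**12
ZB = 10**21

def read_seconds(size_bytes: float, rate_bytes_per_sec: float,
                 drives: int = 1) -> float:
    """Seconds to read size_bytes when it is spread evenly over `drives` disks."""
    return size_bytes / (rate_bytes_per_sec * drives)

old_drive = read_seconds(1000 * MB, 4 * MB)            # full scan of the old drive
one_tb = read_seconds(1 * TB, 100 * MB)                # assumed 1 TB drive today
one_tb_parallel = read_seconds(1 * TB, 100 * MB, 100)  # same data on 100 drives
one_zb_years = read_seconds(1 * ZB, 100 * MB) / (3600 * 24 * 365)

print(f"1,000 MB at 4 MB/s:        {old_drive:.0f} s")
print(f"1 TB at 100 MB/s:          {one_tb / 3600:.1f} h")
print(f"1 TB over 100 drives:      {one_tb_parallel:.0f} s")
print(f"1 ZB at 100 MB/s, 1 drive: {one_zb_years:,.0f} years")
```

Capacity has grown much faster than read speed, so a full scan takes longer today than it used to, and a zettabyte on a single drive works out to hundreds of thousands of years. Spreading the data over many drives and reading them in parallel is what makes the problem tractable, at the cost of more hardware that can fail.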

BIG DATA – Issues and Current Research
Available Technologies to Handle These Issues

Google File System (GFS)
• Designed by Google to handle its big data.
• GFS has two types of nodes:
  • Master node
  • Chunk nodes
• Data is divided into chunks and stored in the chunk nodes.
• The master node stores the metadata of all file chunks in the chunk nodes.
• Let's see them in detail!

Google File System
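The master/chunk split can be sketched as a toy in-memory model. This is not GFS code: the class names, the round-robin placement, and the 64-byte chunk size are assumptions chosen for illustration (real GFS uses much larger chunks and replicates each one), and for brevity the data passes through the master here, whereas in GFS clients exchange data with the chunk nodes directly.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64  # bytes, toy value for illustration only

@dataclass
class ChunkNode:
    """Stores chunk contents, keyed by chunk id."""
    name: str
    chunks: dict = field(default_factory=dict)

@dataclass
class MasterNode:
    """Keeps only metadata: which chunk ids make up a file and where they live."""
    nodes: list
    metadata: dict = field(default_factory=dict)

    def write(self, filename: str, data: bytes) -> None:
        locations = []
        for offset in range(0, len(data), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            chunk_id = f"{filename}#{index}"
            node = self.nodes[index % len(self.nodes)]  # round-robin placement
            node.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
            locations.append((chunk_id, node.name))
        self.metadata[filename] = locations

    def read(self, filename: str) -> bytes:
        by_name = {node.name: node for node in self.nodes}
        return b"".join(by_name[node_name].chunks[chunk_id]
                        for chunk_id, node_name in self.metadata[filename])

master = MasterNode([ChunkNode("chunk-a"), ChunkNode("chunk-b"), ChunkNode("chunk-c")])
data = bytes(range(200))
master.write("log.txt", data)
print(master.metadata["log.txt"])  # chunk ids and the node holding each one
```

The point of the design is that the master holds a small map while the bulk of the data lives on the chunk nodes, so storage grows by adding chunk nodes.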

Hadoop
• Hadoop is open source software for distributed computing.
• It can be used to query a large set of data and get results faster using a reliable and scalable architecture.
• In a traditional non-distributed architecture, data is stored on one server, and every client program accesses this central data server to retrieve the data.
• This architecture is not reliable: if the main server fails, you have to go back to the backup to restore the data.
• In Hadoop, every server has local computation and storage.

Hadoop

Hadoop
[Figure: Hadoop architecture showing a user, a master node running the job tracker, and slave nodes 1 through N, each running a task tracker with its workers.]
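The job tracker / task tracker arrangement can be illustrated with a small word-count sketch. This is not the Hadoop API: `run_job` stands in for the job tracker, `map_task` for the work a task tracker runs on its node, and local threads stand in for slave nodes; all of these names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_task(lines):
    """Work done on one slave node: count the words in its own input split."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def run_job(lines, num_slaves=3):
    """Stand-in for the job tracker: split input, dispatch tasks, merge results."""
    splits = [lines[i::num_slaves] for i in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(map_task, splits))
    total = Counter()
    for partial in partials:  # reduce step: combine the per-node counts
        total.update(partial)
    return total

result = run_job(["big data is big", "data moves fast", "big systems fail"])
print(result.most_common(2))
```

Each task only touches its own split, which mirrors the idea that every server computes on locally stored data and only the small partial results travel back to be merged.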

Design Issues and High Availability
• The major questions that arise when designing a big data system include the following:
  • Deciding what data is relevant to the system.
  • Deciding how much data is needed to successfully predict the result.
  • Deciding the value of the data in the decision-making process.
• High Availability
  • Making sure the system is available to the user.
  • Distributed systems offer a good solution to this but require more power.

Privacy Issues
• The growth of social media and its use of big data have raised privacy concerns
• Photos, location trackers, etc.
• Location-based tagging and geotagging-focused photo sharing sites
• Cyber security
• Mobile tracking
• And a lot more…

Future Work
• Big data is still an emerging field, and much can be done to improve efficiency.
• Systems like Google File System and Hadoop have taken a step toward solving these issues.
• Efficient algorithms must be developed so that big data can be processed faster.
• In Hadoop, if the job tracker machine fails, all currently running jobs fail. There must be a way to handle such scenarios.


Questions ???
