
IE 5331 – Grad Survey Paper – Issues and Current Researches in BIG DATA



Presentation Transcript


  1. IE 5331 – Grad Survey Paper – Issues and Current Researches in BIG DATA Abisheik Kar Ramachandran Akhila Thota 11/19/2013

  2. BIG DATA • What is BIG DATA ?

  3. BIG DATA

  4. Large – Is it really LARGE ?? • Google processes > 20 PB a day (2012) • Facebook has 2.5 PB of user data + 500 TB/day (2012) • 300 million photos are uploaded to Facebook every day • eBay has 7.5 PB of user data + 50 TB/day (5/2009) • Nearly 35 zettabytes of data! • Need an analogy?

  5. Large – Is it really LARGE ?? • 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars.

  6. Who generates these data ?

  7. BIG DATA – Issues and current Research Characteristics of Big Data

  8. Characteristics of Big Data

  9. Volume • Volume represents the amount of data. • Projected 44x increase from 2009 to 2020
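
The 44x figure can be turned into a yearly rate with a little arithmetic. The sketch below assumes growth is a constant compound rate over the 11 years from 2009 to 2020, which the slide does not state; the function names are illustrative only.

```python
import math


def annual_growth_rate(total_factor: float, years: int) -> float:
    """Constant yearly rate that compounds to total_factor over the given years."""
    return total_factor ** (1 / years) - 1


def doubling_time_years(rate: float) -> float:
    """Years needed for the data volume to double at a constant yearly rate."""
    return math.log(2) / math.log(1 + rate)


# Assumption: steady compound growth across 2009-2020 (11 yearly steps).
rate = annual_growth_rate(44, 2020 - 2009)
print(f"Implied annual growth: {rate:.1%}")
print(f"Implied doubling time: {doubling_time_years(rate):.1f} years")
```

Under that assumption, a 44x increase works out to roughly 41% growth per year, or a doubling of data volume about every two years.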

  10. Velocity • Velocity represents the speed at which data is created, accessed, or streamed. • A few decades ago, real-time streaming was beyond our imagination, but advances in technology now allow data to be streamed in real time. • Data is generated fast and must be processed fast.

  11. Variety • Unstructured data. • Text files, audio files, video files.

  12. BIG DATA – Issues and current Research Big Data Issues

  13. Big Data Issues • General Issues • Fundamental Issues • Storage & Transport issues • Processing issues • Management issues • Design Issues • High Availability • Privacy Issues

  14. General Issues • Handling a wide range of unstructured data, combined with its sheer size, poses a major challenge. • Since we are talking about zettabytes of data, efficient mechanisms to store and retrieve data are vital. • In an RDBMS, data must be stored in tables and retrieved using queries; an RDBMS is designed to handle structured queries. • In contrast, big data is a huge collection of unstructured data that cannot readily be defined as tables.

  15. Fundamental Issues – Storage & Transport Issues • Because an enormous amount of data is created each second, storing it becomes a major issue. • Storage media cannot keep up with the growth in data size. • For example, to process an exabyte of data on a single system, we would need nearly 25,000 disks. • Over current communication networks with a transfer rate of 1 gigabyte per second and an effective sustainable transfer rate of 80%, transferring an exabyte of data would take roughly 350,000 hours (about 40 years).
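
A back-of-the-envelope check of the storage and transport estimates can be sketched as follows. The decimal unit sizes, the 4 TB disk capacity, and the helper names are assumptions made for illustration; the slide does not state a disk capacity.

```python
# Decimal (SI) unit sizes in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18


def disks_needed(total_bytes: int, disk_bytes: int) -> int:
    """Number of disks of a given capacity needed to hold total_bytes."""
    return -(-total_bytes // disk_bytes)  # ceiling division


def transfer_hours(total_bytes: float, link_bytes_per_sec: float,
                   efficiency: float = 0.8) -> float:
    """Hours to move total_bytes over a link at the given sustained efficiency."""
    return total_bytes / (link_bytes_per_sec * efficiency) / 3600


ASSUMED_DISK = 4 * TB  # hypothetical capacity, not taken from the slides

print("Disks for 1 EB at 4 TB each:", disks_needed(EB, ASSUMED_DISK))
print(f"1 EB over 1 gigabyte/s: {transfer_hours(EB, GB):,.0f} hours")
print(f"1 EB over 1 gigabit/s:  {transfer_hours(EB, GB / 8):,.0f} hours")
print(f"1 PB over 1 gigabit/s:  {transfer_hours(PB, GB / 8):,.0f} hours")
```

With these inputs, an exabyte over a 1 gigabyte-per-second link at 80% efficiency takes about 347,000 hours, while a figure near 2,800 hours arises when moving a petabyte over a 1 gigabit-per-second link. Likewise, 4 TB disks give 250,000 disks per exabyte, so a count of 25,000 would imply about 40 TB per disk. The estimates are very sensitive to the units assumed.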

  16. Fundamental Issues – Processing Issues • Because modern technology generates huge amounts of data, processing that data becomes essential. • Effective processing of data in the exabyte range requires extensive parallel processing capabilities and efficient algorithms.

  17. Management Issues • A decade ago, about 1,000 MB of data was read at a rate of 4 MB/sec. • Today, read speeds have risen from 4 MB/sec to 100 MB/sec. • Even at this speed, a single system would need far longer than days to read zettabytes of data. • Reading the disks continuously also leads to another potential problem: hardware failure.
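
The read-speed figures above can be made concrete with a short calculation. This is a minimal sketch assuming decimal units and perfectly parallel drives with no overhead; the function name and the 100-drive case are illustrative and not from the slides.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def read_seconds(size_mb: float, rate_mb_per_sec: float, drives: int = 1) -> float:
    """Seconds to read size_mb when the work is split evenly across drives."""
    return size_mb / (rate_mb_per_sec * drives)


MB_PER_TB = 10**6
MB_PER_ZB = 10**15

print("1,000 MB at 4 MB/sec:", read_seconds(1000, 4), "seconds")
print("1 TB at 100 MB/sec:", read_seconds(MB_PER_TB, 100) / 3600, "hours")
print("1 TB on 100 drives:", read_seconds(MB_PER_TB, 100, drives=100), "seconds")

# One zettabyte read by a single 100 MB/sec drive, expressed in years.
zb_years = read_seconds(MB_PER_ZB, 100) / SECONDS_PER_YEAR
print(f"1 ZB at 100 MB/sec: {zb_years:,.0f} years")
```

On a single 100 MB/sec drive, a zettabyte works out to hundreds of thousands of years, which is why reads have to be spread across many disks. Spreading them across many disks in turn raises the likelihood that some disk fails during the read.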

  18. BIG DATA – Issues and current Research Available Technologies to handle these issues

  19. Google File System (GFS) • Designed by Google to handle its big data. • GFS has two types of nodes: • Master node • Chunk nodes • Data is divided into chunks and stored on the chunk nodes. • The master node stores the metadata for all file chunks held on the chunk nodes. • Let's see them in detail.
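
The division of labor between the master node and the chunk nodes can be illustrated with a toy in-memory model. This is a simplified sketch, not the real GFS design: the class names, the 4-byte chunk size, and the round-robin placement are assumptions, and it omits the chunk replication and direct client-to-chunk-node data transfer that real GFS uses.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; toy value for illustration (real GFS uses 64 MB chunks)


@dataclass
class ChunkNode:
    """Stores chunk contents only; knows nothing about files."""
    name: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes


@dataclass
class MasterNode:
    """Keeps metadata only: which chunks make up a file and where they live."""
    nodes: list
    metadata: dict = field(default_factory=dict)  # filename -> [(chunk_id, node name)]

    def write(self, filename: str, data: bytes) -> None:
        placements = []
        for offset in range(0, len(data), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            chunk_id = f"{filename}#{index}"
            node = self.nodes[index % len(self.nodes)]  # round-robin placement
            node.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
            placements.append((chunk_id, node.name))
        self.metadata[filename] = placements

    def read(self, filename: str) -> bytes:
        by_name = {node.name: node for node in self.nodes}
        return b"".join(by_name[name].chunks[chunk_id]
                        for chunk_id, name in self.metadata[filename])


master = MasterNode([ChunkNode("node-1"), ChunkNode("node-2"), ChunkNode("node-3")])
master.write("log.txt", b"big data everywhere")
print(master.metadata["log.txt"])
print(master.read("log.txt"))
```

The point of the split is that the master holds only the small chunk-location table, while the bulk data is spread across the chunk nodes.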

  20. Google File System

  21. Hadoop • Hadoop is open source software for distributed computing. • It can query a large data set and return results faster using a reliable and scalable architecture. • In a traditional non-distributed architecture, data is stored on one server, and every client program accesses this central data server to retrieve the data. • That architecture is also unreliable: if the main server fails, the data must be restored from backup. • In Hadoop, every server has local computation and storage.

  22. Hadoop

  23. Hadoop • [Diagram: the user and the master node, which runs the job tracker; slave nodes 1, 2, ..., N each run a task tracker with its workers.]
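
The master/slave layout on this slide can be sketched as a toy word count in the MapReduce style that Hadoop uses. This is an illustration only, not Hadoop's API: threads stand in for slave nodes, each partition stands in for a node's local data, and the function names simply echo the roles in the diagram.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def task_tracker(local_lines):
    """Slave node role: count words in this node's local data (map step)."""
    counts = Counter()
    for line in local_lines:
        counts.update(line.lower().split())
    return counts


def job_tracker(partitions):
    """Master role: give each partition to a task tracker, then merge (reduce step)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(task_tracker, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical data: each inner list is the data stored on one slave node.
partitions = [
    ["big data is big"],
    ["data is everywhere"],
    ["big clusters"],
]
result = job_tracker(partitions)
print(result.most_common())
```

Each task tracker touches only its own partition, so the computation runs where the data is stored and only the small per-node counts travel back to the master.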

  24. Design Issues and High Availability • The major questions that arise while designing a big data system include the following: • Deciding what data is relevant to the system. • Deciding how much data is needed to predict the result successfully. • Deciding the value of the data in the decision-making process. • High Availability • Making sure the system is available to the user. • Distributed systems offer a good solution to this but need more power.

  25. Privacy Issues • The growth of social media and its use of big data have raised privacy concerns • Photos, location trackers, etc. • Location-based tagging and geotagging on photo-sharing sites • Cyber security • Mobile tracking • And a lot more ...

  26. Future Work • Big data is still an emerging field, and much can be done to improve efficiency. • Systems like Google File System and Hadoop have taken a step toward solving these issues. • Efficient algorithms must be developed so that big data can be processed faster. • In Hadoop, if the job tracker machine fails, all currently running jobs fail. There must be a way to handle such scenarios.

  27. References • Kaisler, S.; Armour, F.; Espinosa, J.A.; Money, W., "Big Data: Issues and Challenges Moving Forward," System Sciences (HICSS), 2013 46th Hawaii International Conference on, pp. 995-1004, 7-10 Jan. 2013. doi: 10.1109/HICSS.2013.645 • Gantz, J.; Reinsel, E., "Extracting Value from Chaos," IDC's Digital Universe Study, sponsored by EMC, 2011. • Stonebraker, M.; Hong, J., "Researchers' Big Data Crisis; Understanding Design and Functionality," Communications of the ACM, 55(2):10-11, 2012. • Smith, M.; Szongott, C.; Henne, B.; von Voigt, G., "Big data privacy issues in public social media," Digital Ecosystems Technologies (DEST), 2012 6th IEEE International Conference on, pp. 1-6, 18-20 June 2012. doi: 10.1109/DEST.2012.6227909 • Eldawy, A.; Khandekar, R.; Wu Kun-Lung, "Clustering Streaming Graphs," Distributed Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on, 18-21 June 2012. • Prokaj, J.; Xuemei Zhao; Jongmoo Choi; Medioni, G., "Big Data Scalability Issues in WAAS," Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pp. 399-406, 23-28 June 2013. doi: 10.1109/CVPRW.2013.67 • Apache Hadoop, http://hadoop.apache.org. • White, Tom, "Hadoop: The Definitive Guide," Sebastopol: O'Reilly Media, Inc., 2010. http://www.UTXA.eblib.com/patron/FullRecord.aspx?p=590867.

  28. Questions ???
