
Buzz word



Presentation Transcript


  1. AGENDA Buzz word

  2. What is BIG DATA?
     • Big Data refers to massive, often unstructured data sets that are beyond the processing capabilities of traditional database management tools.
     • Big Data can occupy terabytes or petabytes of storage in diverse formats, including text, video, sound, and images.
     • Traditional relational database management systems were not designed to handle data at this scale.
     • Examples: user updates on Facebook; clicks across the internet.
     • Up next: the 3 V's of big data; structured vs. unstructured data.

  3. Volume
     • Volume refers to the huge amount of data being generated every minute.
     • 90% of the data we have now was created in just the past two years.
     • IP traffic is expected to grow to four times its current level by 2015.
     • 3 billion people are expected to be online by 2015.
     • Notes: 2.7 zettabytes; Hadron experiment.

  4. Velocity
     • Velocity refers to the speed at which new data is generated and moves around.
     • It covers real-time systems such as online banking.
     • Such systems need low response times.
     • In-memory analytics is used to deal with data in motion.
     • Notes: 90k YouTube, 45k Google/sec.

  5. Variety
     • Variety refers to the many data types we can now use.
     • Earlier, the focus was on neat, structured data kept in RDBMS tables.
     • 80% of the data available now is unstructured.
     • Data types are heterogeneous, ranging from text to video, audio, and pictures.
     • Notes: portable devices, sensors, and social media. How do we gain? Video.

  6. Transform problems into possibilities
     • Up next: big data analytics.

  7. Big Data Analytics
     • The process of examining large amounts of data of many types (big data) to uncover hidden patterns, unknown correlations, and other real-time insights.
     • Uses of big data analytics: Google Search recommendations, Satyamev Jayate.
     • Future scope: reading genes to help cure deadly diseases such as cancer.
     • Up next: types of analytics.

  8. Leading Technologies
     • Traditional relational databases struggle to store and process Big Data.
     • As a result, a new class of big data technologies has emerged and is used in many big data analytics environments.
     • Technologies associated with big data analytics include:
       • Hadoop
       • MapReduce
       • NoSQL

  9. Hadoop
     • Hadoop is an open-source framework, written mainly in Java.
     • It stores and processes large data sets.
     • It runs in a distributed computing environment.
     • Components of Hadoop:
       • HDFS (Hadoop Distributed File System)
       • MapReduce

  10. HDFS (Hadoop Distributed File System)
     • HDFS stores data in a distributed, scalable, and fault-tolerant way.
     • The NameNode holds metadata about the data stored on the DataNodes.
     • The DataNodes hold the actual data in the form of blocks and can communicate with the NameNode and with each other.
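The NameNode/DataNode split on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS itself: the class name `ToyCluster`, the 8-byte block size, the replication factor of 3, and the round-robin placement are all assumptions chosen for illustration, and real HDFS placement is considerably more involved.

```python
BLOCK_SIZE = 8     # bytes per block; tiny on purpose (illustrative assumption)
REPLICATION = 3    # copies kept of each block (illustrative assumption)


class ToyCluster:
    """Toy model: a NameNode metadata table plus DataNodes that hold blocks."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        # NameNode side: metadata only, (filename, block_no) -> DataNode ids.
        self.block_map = {}
        # DataNode side: each node stores the actual block bytes.
        self.datanodes = {node: {} for node in self.datanode_ids}

    def put(self, filename, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        for block_no, block in enumerate(blocks):
            # Hypothetical round-robin placement onto REPLICATION distinct nodes.
            targets = [self.datanode_ids[(block_no + r) % len(self.datanode_ids)]
                       for r in range(REPLICATION)]
            self.block_map[(filename, block_no)] = targets
            for node in targets:
                self.datanodes[node][(filename, block_no)] = block

    def get(self, filename, failed=()):
        """Reassemble a file, skipping DataNodes listed in `failed`."""
        parts = []
        block_no = 0
        while (filename, block_no) in self.block_map:
            live = [n for n in self.block_map[(filename, block_no)] if n not in failed]
            if not live:
                raise IOError(f"no live replica for block {block_no} of {filename}")
            parts.append(self.datanodes[live[0]][(filename, block_no)])
            block_no += 1
        return b"".join(parts)


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"])
cluster.put("log.txt", b"hello distributed file system")
print(cluster.get("log.txt", failed={"dn1", "dn2"}))
```

With four nodes and three replicas per block, the file can still be read after any two nodes fail, which is the fault tolerance the slide refers to.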

  11. Hadoop SQL
     • Any questions?

  12. Benefits of Hadoop
     • HDFS keeps several copies of each block on different nodes (three by default). Doesn't that seem like a waste of space?
     • It is not wasted storage, for two reasons:
       • If one node fails, the system keeps working because the data is still available on other nodes.
       • A query is spread across the nodes, so parallel processing returns results faster.
     • Example: count all the words in my Twitter history to see what I talk about most.
     • The query is split across multiple servers by some criterion (here, by month), and the results are consolidated.
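The Twitter example can be sketched as follows. This is only an illustration under stated assumptions: the sample tweets are made up, the function names are hypothetical, and a local thread pool stands in for the separate servers that a real cluster would use.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

# Made-up sample data: (month, tweet text).
tweets = [
    ("2014-01", "big data is big"),
    ("2014-01", "hadoop stores data"),
    ("2014-02", "data beats opinion"),
    ("2014-03", "big clusters need hadoop"),
]


def count_words(texts):
    """Work done independently on one partition (one month of tweets)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def most_used_words(tweets, top_n=3):
    # Split the work by month, the criterion used on the slide.
    by_month = defaultdict(list)
    for month, text in tweets:
        by_month[month].append(text)

    # Each partition is counted in parallel; threads stand in for servers.
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, by_month.values()))

    # Consolidate the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total.most_common(top_n)


print(most_used_words(tweets))
```

Each partition is processed without knowing about the others, which is what lets the work spread across machines; only the final merge needs all the partial counts.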

  13. Map-Reduce Algorithm
     • MapReduce is a programming model for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
     • In the previous example, the Twitter data was processed on different hosts, split by month.
     • Hadoop provides an open-source implementation of MapReduce.
     • A Hadoop job combines two user-written functions, typically in Java: Mapper() and Reducer().
     • Example: checking the popularity of words in a text with a word count.

  14. MR - Word count example

  15. Mapper() and Reducer()
     • The mapper processes each file split and emits key-value pairs as input for the reducer.
         Mapper(split_filename, file_contents):
             for each word in file_contents:
                 emit(word, 1)
     • The reducer combines all the values emitted for a key and produces the output.
         Reducer(word, values):
             sum = 0
             for each value in values:
                 sum = sum + value
             emit(word, sum)
     • Can anyone think of any disadvantages?
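A runnable Python version of the slide's pseudocode, as a single-process sketch. The `run_word_count` driver and the sample splits are assumptions added for illustration; in Hadoop the framework itself performs the grouping step (the shuffle) between the map and reduce phases, across machines.

```python
from collections import defaultdict


def mapper(split_filename, file_contents):
    """Emit (word, 1) for every word in one file split."""
    for word in file_contents.lower().split():
        yield word, 1


def reducer(word, values):
    """Sum all the counts emitted for one word."""
    total = 0
    for value in values:
        total = total + value
    return word, total


def run_word_count(splits):
    """Hypothetical driver: map every split, group by key, then reduce."""
    grouped = defaultdict(list)
    for filename, contents in splits.items():
        for word, one in mapper(filename, contents):
            grouped[word].append(one)      # stand-in for the shuffle step
    return dict(reducer(word, values) for word, values in grouped.items())


splits = {
    "split-0": "Deer Bear River",
    "split-1": "Car Car River",
    "split-2": "Deer Car Bear",
}
print(run_word_count(splits))
```

Because each mapper call sees only its own split and each reducer call sees only one word, both phases can run in parallel on different nodes.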

  16. Disadvantages of Hadoop
     • Hadoop had two major disadvantages when it was first developed; both have since been addressed.
     • HDFS depended on a single NameNode.
       • Solution: Hadoop 2 added a standby NameNode for high availability. (The Secondary NameNode only checkpoints metadata; it is not a failover node.)
     • MapReduce is a Java framework and did not support SQL queries.
       • Solution: Facebook developed Hive, which lets analysts run SQL-like queries over data stored in Hadoop.

  17. NoSQL
     • Stands for "Not only SQL".
     • Non-relational database management systems.
     • Used where no fixed schema is required and data is scaled horizontally.
     • Four categories of NoSQL databases:
       • Key-value stores
       • Columnar databases
       • Graph databases
       • Document databases

  18. NoSQL Categories: KEY-VALUE PAIR
     • Keys are used to fetch values, which are stored as opaque data blocks.
     • Works like a hash map.
     • Extremely fast lookups.
     • Drawback: no provision for content-based queries.
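A minimal in-memory sketch of the key-value model, assuming a Python dict as the hash map. The class `ToyKeyValueStore` and its method names are hypothetical and do not correspond to any particular product's API.

```python
class ToyKeyValueStore:
    """In-memory key-value store; values are opaque bytes it never inspects."""

    def __init__(self):
        self._data = {}                       # a Python dict is a hash map

    def put(self, key, value: bytes):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)   # hash lookup by key

    def keys_where_value_contains(self, needle: bytes):
        # The drawback from the slide: with no index on contents,
        # a content-based question forces a scan of every value.
        return [key for key, value in self._data.items() if needle in value]


store = ToyKeyValueStore()
store.put("user:1", b'{"name": "asha", "city": "Pune"}')
store.put("user:2", b'{"name": "ravi", "city": "Delhi"}')

print(store.get("user:1"))
print(store.keys_where_value_contains(b"Delhi"))
```

Lookup by key touches one entry, while the content search touches all of them, which is why key-value stores are fast for the first and unsuited to the second.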

  19. DOCUMENT DATABASE
     • Also a key-value store, but each value is a document.
     • Documents do not have a fixed schema.
     • Documents can be nested.
     • Queries can be based on content as well as keys.
     • Use case: blogging websites.
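The document model can be illustrated with nested Python dicts, using the slide's blogging use case. The sample posts and the helpers `get_path` and `find` are hypothetical; real document databases provide their own query languages and indexes.

```python
# Made-up blog posts: each value is a document, and schemas differ per post.
posts = {
    "post:1": {"title": "Intro to Hadoop", "author": {"name": "asha"},
               "tags": ["hadoop", "hdfs"]},
    "post:2": {"title": "NoSQL notes", "author": {"name": "ravi"}},
    "post:3": {"title": "MapReduce tips", "author": {"name": "asha"},
               "comments": [{"by": "ravi"}]},
}


def get_path(doc, path):
    """Follow a dotted path such as 'author.name' into a nested document."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, path, value):
    """Content-based query: keys of documents whose field equals `value`."""
    return [key for key, doc in collection.items() if get_path(doc, path) == value]


print(posts["post:2"]["title"])             # query by key
print(find(posts, "author.name", "asha"))   # query by content
```

Unlike the plain key-value case, the store understands the structure of each value, so it can answer questions about fields inside the documents.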

  20. COLUMNAR DATABASE
     • Organized by attributes (columns) rather than tuples (rows).
     • The key is the column name, and the value is that column's values stored contiguously.
     • Best for aggregation queries.
     • Typical query: select one or two columns' values where the same or another column equals some value.
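A small sketch of why the column layout suits aggregation, assuming made-up order data and hypothetical helper names. It mirrors the slide's query pattern: aggregate one column where another column equals some value.

```python
# Row-oriented layout: one dict per tuple (made-up sample data).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "customer": "asha"},
    {"order_id": 2, "region": "US", "amount": 80.0, "customer": "ravi"},
    {"order_id": 3, "region": "EU", "amount": 45.5, "customer": "meera"},
    {"order_id": 4, "region": "US", "amount": 10.0, "customer": "john"},
]


def to_columns(rows):
    """Column-oriented layout: column name -> contiguous list of values."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_where(columns, sum_col, filter_col, filter_value):
    """Aggregate one column, filtered on another; other columns are never read."""
    return sum(value
               for value, flag in zip(columns[sum_col], columns[filter_col])
               if flag == filter_value)


columns = to_columns(rows)
print(sum_where(columns, "amount", "region", "EU"))
```

The query reads only the `amount` and `region` lists and skips `order_id` and `customer` entirely, whereas a row layout would have to walk through every full tuple.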

  21. GRAPH DATABASES
     • A collection of nodes and edges.
     • Nodes represent data, while edges represent the links between them.
     • The most dynamic and flexible category.
     • Up next: BASE vs. ACID properties.
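The node-and-edge model can be sketched with an adjacency map and a breadth-first traversal. The sample friendships and the function names are assumptions for illustration; graph databases expose their own query languages for this kind of traversal.

```python
from collections import deque

# Made-up data: each edge links two people (nodes).
edges = [("asha", "ravi"), ("ravi", "meera"), ("meera", "john"), ("asha", "li")]


def build_graph(edges):
    """Undirected graph as an adjacency map: node -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first search: every node reachable in at most `max_hops` edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue                      # do not expand past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


graph = build_graph(edges)
print(within_hops(graph, "asha", 2))
```

Relationship questions such as "friends of friends" follow edges directly from node to node, which is the kind of query this category is built for.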

  22. CONCLUSION
     • Data is the new oil.
     • Without big data analysis, companies are deaf and dumb, mere wanderers on the web, like cattle on the highway!
     • Thank you! Keep dreaming BIG :D

