
Big Data course

This course introduces the characteristics of Big Data, presents the 3V model, discusses data variety, velocity, and volume, and explores different sources of Big Data. It also provides an overview of the Apache Hadoop framework and other frameworks for handling Big Data.




  1. Big Data course Imam Khomeini International University, 2019. Dr. Ali Khaleghi | Kamran Mahmoudi

  2. Outline

  3. Session One: Introduction to Big Data • Session objectives: • Introducing the characteristics of Big Data • Presenting the 3V model for defining Big Data • Discussing the variety of data structures, the velocity of data generation, and the volume of data • Introducing different sources of Big Data • Introducing frameworks for handling Big Data • Presenting a snapshot of the Apache Hadoop framework • Technical information on running Hadoop

  4. Big Data, massive data, small data!! What's the difference? Big Data: a large volume of a variety of UNSTRUCTURED or SEMI-STRUCTURED data sets that are produced and processed rapidly!

  5. Big Data: the 3 (and more!) V characteristics and examples

  6. 4.6 billion camera phones worldwide • 30 billion RFID tags today (1.3B in 2005) • 12+ TBs of tweet data every day • 100s of millions of GPS-enabled devices sold annually • ? TBs of data every day • 2+ billion people on the Web by end of 2011 • 25+ TBs of log data every day • 76 million smart meters in 2009, 200M by 2014

  7. Volume (Scale) • Data volume is increasing exponentially • A 44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB • Exponential increase in collected/generated data

  8. Variety (Complexity) • Relational data (tables/transactions/legacy data) • Text data (Web) • Semi-structured data (XML) • Graph data: social networks, the Semantic Web (RDF), … • Streaming data: you can only scan the data once • A single application can be generating/collecting many types of data • Big public data (online, weather, finance, etc.) • To extract knowledge, all these types of data need to be linked together

  9. Velocity (Speed) • Data is being generated fast and needs to be processed fast • Online data analytics • Late decisions → missed opportunities • Examples: • E-promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction • Twitter Firehose (6,000 tweets per second)

  10. Real-time/Fast Data • Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion • Mobile devices (tracking all objects all the time) • Social media and networks (all of us are generating data) • Scientific instruments (collecting all sorts of data) • Sensor technology and networks (measuring all kinds of data)

  11. Big scientific data • EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. • CERN's Large Hadron Collider (LHC) generates 15 PB a year

  12. Top 5 Big Data Platforms

  13. Search Interest over time

  14. Hadoop • First up is the all-time classic, and one of the top frameworks in use today. So prevalent is it that it has almost become synonymous with Big Data. • If your data can be processed in batch, split into smaller processing jobs, spread across a cluster, and recombined in a logical manner, Hadoop will probably work just fine for you. https://www.kdnuggets.com/2016/03/top-big-data-processing-frameworks.html

  15. Nutch, 2002: Apache Nutch was started as a part of the Lucene project • GFS, 2003: Google published the details about its distributed file system • MapReduce, 2004: Google published a paper introducing MapReduce • NDFS, 2004: work started on an open-source version of GFS called the Nutch Distributed File System • MapReduce, 2005: Nutch algorithms reported to run with MapReduce & NDFS • Leaving Nutch, 2006: the developers moved out of Nutch to form an independent subproject of Lucene called Hadoop • Independence, 2008: Hadoop became its own top-level project

  16. Ecosystem snapshot

  17. Hadoop's Distributed File System (HDFS)
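The slide itself is a figure; as a quick orientation, below are a few standard HDFS file-system shell commands. The paths are made up for illustration.

```bash
hdfs dfs -mkdir -p /user/hadoop/input          # create a directory in HDFS
hdfs dfs -put access.log /user/hadoop/input/   # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/input                # list the directory
hdfs dfs -cat /user/hadoop/input/access.log    # print a file's contents
hdfs dfs -rm -r /user/hadoop/input             # remove a directory recursively
```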

  18. MapReduce

  19. MapReduce: an Example in IR
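The example on this slide is a figure, so the sketch below is one plausible reconstruction of the classic IR use case, an inverted index (term → list of documents), written against the standard Hadoop MapReduce API. Class names and the tokenization rule are my own choices, not from the slides; the job driver is omitted.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Map phase: for every term in a line, emit (term, documentName).
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
      for (String term : line.toString().toLowerCase().split("\\W+")) {
        if (!term.isEmpty()) {
          context.write(new Text(term), new Text(doc));
        }
      }
    }
  }

  // Reduce phase: collect the distinct documents that contain each term.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docs, Context context)
        throws IOException, InterruptedException {
      Set<String> postings = new HashSet<>();
      for (Text doc : docs) {
        postings.add(doc.toString());
      }
      context.write(term, new Text(String.join(", ", postings)));
    }
  }
}
```

The shuffle between the two phases groups all (term, document) pairs by term, which is what makes the per-term posting list possible.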

  20. HDFS Architecture

  21. ZooKeeper

  22. What is Hive • A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. • Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. • HiveQL (HQL) is similar to SQL, and Hive automatically translates SQL-like queries into MapReduce jobs
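To see that last point in action, here is a hypothetical HiveQL session; the table, columns, and input path are invented for illustration. The final aggregation is exactly the kind of SQL-like query Hive compiles into MapReduce jobs.

```sql
-- Define a table over tab-separated data (schema is illustrative)
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a file that already sits in HDFS into the table
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

-- Hive translates this GROUP BY into one or more MapReduce jobs
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```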

  23. HBase • HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
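A minimal HBase shell session sketch; the table, row key, and column names are made up. Note that an "update" in HBase is simply a new put to the same cell.

```text
create 'users', 'info'                      # table with one column family
put 'users', 'u1', 'info:name', 'Alice'     # insert
put 'users', 'u1', 'info:name', 'Alicia'    # update (writes a newer cell version)
get 'users', 'u1'                           # low-latency point lookup
delete 'users', 'u1', 'info:name'           # delete the cell
```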

  24. Apache Solr

  25. Apache Sqoop • Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
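For a flavor of how that looks in practice, here is a sketch of a Sqoop import from MySQL into HDFS; the JDBC URL, credentials, table, and target directory are placeholders.

```bash
# -P prompts for the password; --num-mappers controls parallel map tasks
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4
```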

  26. Apache Flume • Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
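To show how an agent is wired together, below is a minimal single-agent Flume configuration sketch (agent, component, and path names are illustrative): an exec source tails a web-server log, a memory channel buffers events, and an HDFS sink writes them out.

```properties
agent1.sources  = weblog
agent1.channels = mem
agent1.sinks    = sink1

# Source: follow a log file as it grows
agent1.sources.weblog.type = exec
agent1.sources.weblog.command = tail -F /var/log/nginx/access.log
agent1.sources.weblog.channels = mem

# Channel: in-memory buffer between source and sink
agent1.channels.mem.type = memory

# Sink: roll events into date-partitioned HDFS directories
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs/%Y-%m-%d
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.channel = mem
```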

  27. Apache Mahout • Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.

  28. Apache Spark • Languages: Scala, Java, Python, R, SQL • Stack: Spark Core; Spark SQL (DataFrames); Spark Streaming; MLlib (ML Pipelines); GraphX • Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming, JSON, MySQL, and HPC-style (GlusterFS, Lustre) • Apache Spark supports data analysis, machine learning, graphs, streaming data, etc. It can read/write from a range of data types and allows development in multiple languages.
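As a small taste of the DataFrame API in one of the languages listed above, here is a minimal Spark sketch in Java; the application name, input path, and column name are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class EventsByCountry {
  public static void main(String[] args) {
    // Entry point for the DataFrame/SQL API
    SparkSession spark = SparkSession.builder()
        .appName("events-by-country")
        .getOrCreate();

    // Read JSON from HDFS, aggregate, and write the result as Parquet
    Dataset<Row> events = spark.read().json("hdfs:///data/events.json");
    events.groupBy(col("country"))
          .count()
          .orderBy(col("count").desc())
          .write()
          .parquet("hdfs:///data/events_by_country");

    spark.stop();
  }
}
```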

  29. Hadoop up and running • Quick Start VMs • Cloud Based Implementations • Your own set-up

  30. Installing Hadoop • Setup modes • Single node: the NameNode and a DataNode are installed on the same machine • Multi node: at least two different machines, one running the NameNode as master and the others running DataNodes as slaves • Requirements • Linux/Unix operating system • OpenSSH server • Java Development Kit

  31. Minimum configurations • To have a running cluster, at least the following must be configured (a minimal sketch of these files follows below): • Hadoop Core (core-site.xml) • HDFS (hdfs-site.xml) • YARN (yarn-site.xml) • MapReduce (mapred-site.xml)
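A minimal single-node sketch of those four files; the hostname, port, and values shown are common single-node settings, not prescriptions.

```xml
<!-- core-site.xml: URI of the default filesystem -->
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
</configuration>

<!-- hdfs-site.xml: replication factor 1 is enough for a single node -->
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>

<!-- mapred-site.xml: run MapReduce on top of YARN -->
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>

<!-- yarn-site.xml: auxiliary shuffle service that MapReduce jobs need -->
<configuration>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
```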

  32. Installation steps (see the command sketch after this list) • Set up the Linux environment (Ubuntu 16.04) • Network configuration • Set a static IPv4 address • Edit /etc/hosts • Set up SSH • Install OpenSSH server • Configure a passwordless SSH connection • Install Java • Install Hadoop • Extract the binary distribution • Edit the configuration files
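A condensed command sketch of those steps on Ubuntu 16.04; the Hadoop version and install path are assumptions, and JAVA_HOME must also be set in etc/hadoop/hadoop-env.sh.

```bash
# Install the prerequisites
sudo apt-get install openssh-server openjdk-8-jdk

# Passwordless SSH to localhost (used by the Hadoop start scripts)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Extract the binary distribution and put it on the PATH
tar -xzf hadoop-2.8.5.tar.gz -C /opt
export HADOOP_HOME=/opt/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# After editing the *-site.xml files: format HDFS and start the daemons
hdfs namenode -format
start-dfs.sh
start-yarn.sh
```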

  33. Hands-on: installing a single-node Hadoop instance
