
Big Data



  1. Big Data

  2. Introduction • The 5 Vs of Big Data • Hadoop • SQL on Hadoop • Apache Spark

  3. The 5 Vs of Big Data • Every minute: • more than 300,000 tweets are created • Netflix subscribers are streaming more than 70,000 hours of video at once • Apple users download 30,000 apps • Instagram users like almost two million photos • Big Data encompasses both structured and highly unstructured forms of data

  4. The 5 Vs of Big Data • Volume: the amount of data, also referred to as the data “at rest” • Velocity: the speed at which data comes in and goes out, data “in motion” • Variety: the range of data types and sources that are used, data in its “many forms” • Veracity: the uncertainty of the data; data “in doubt” • Value: the total cost of ownership (TCO) and return on investment (ROI) of the data

  5. The 5 Vs of Big Data • Examples: • Large-scale enterprise systems (e.g., ERP, CRM, SCM) • Social networks (e.g., Twitter, Weibo, WeChat) • Internet of Things • Open data

  6. Hadoop • Open-source software framework used for distributed storage and processing of big datasets • Can be set up over a cluster of computers built from normal, commodity hardware • Many vendors offer their implementation of a Hadoop stack (e.g., Amazon, Cloudera, Dell, Oracle, IBM, Microsoft)

  7. Hadoop • History of Hadoop • The Hadoop stack

  8. History of Hadoop • Key building blocks: • Google File System: a file system that could be easily distributed across commodity hardware, while providing fault tolerance • Google MapReduce: a programming paradigm to write programs that can be automatically parallelized and executed across a cluster of different computers • Nutch web crawler prototype developed by Doug Cutting • Later renamed to Hadoop • In 2008, Yahoo! open-sourced Hadoop as “Apache Hadoop”

  9. The Hadoop Stack • Four modules: • Hadoop Common: a set of shared programming libraries used by the other modules • Hadoop Distributed File System (HDFS): a Java-based file system to store data across multiple machines • MapReduce framework: a programming model to process large sets of data in parallel • YARN (Yet Another Resource Negotiator): handles the management and scheduling of resource requests in a distributed environment

  10. Hadoop Distributed File System (HDFS) • Distributed file system to store data across a cluster of commodity machines • High emphasis on fault tolerance • HDFS cluster is composed of a NameNode and various DataNodes

  11. Hadoop Distributed File System (HDFS) • NameNode • A server that holds all the metadata regarding the stored files (i.e., a registry) • Manages incoming file system operations • Maps data blocks (parts of files) to DataNodes (see the example command below) • DataNode • Handles file read and write requests • Creates, deletes, and replicates data blocks among its disk drives • Continuously loops, asking the NameNode for instructions • Note: the size of one data block is typically 64 megabytes
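  The block-to-DataNode mapping can be inspected with the HDFS file system checking utility. A minimal illustration (not part of the original slides; the file path is a made-up example):

  hdfs fsck /data/all_my_customers.csv -files -blocks -locations
  # For each block of the file, fsck reports its id, its size, and the DataNodes holding its replicas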

  12. Hadoop Distributed File System (HDFS)

  13. Hadoop Distributed File System (HDFS)

  14. Hadoop Distributed File System (HDFS)

  15. Hadoop Distributed File System (HDFS)

  16. Hadoop Distributed File System (HDFS) • HDFS provides a native Java API to allow for writing Java programs that can interface with HDFS

  String filePath = "/data/all_my_customers.csv";
  Configuration config = new Configuration();
  org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
  org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
  org.apache.hadoop.fs.FSDataInputStream inputStream = hdfs.open(path);
  byte[] received = new byte[inputStream.available()];
  inputStream.readFully(received);

  17. Hadoop Distributed File System (HDFS)

  // ...
  org.apache.hadoop.fs.FSDataInputStream inputStream = hdfs.open(path);
  byte[] buffer = new byte[1024]; // Only handle 1KB at once
  int bytesRead;
  while ((bytesRead = inputStream.read(buffer)) > 0) {
      // Do something with the buffered block here
  }
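  Writing works analogously through the same Java API. A minimal sketch (not in the original slides; the path and file contents are made-up examples), assuming the config and hdfs objects are obtained as in the fragments above:

  org.apache.hadoop.fs.Path outPath = new org.apache.hadoop.fs.Path("/data/hello.txt");
  // create() returns an output stream; by default an existing file at this path is overwritten
  org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(outPath);
  outputStream.write("hello HDFS".getBytes(java.nio.charset.StandardCharsets.UTF_8));
  outputStream.close();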

  18. Hadoop Distributed File System (HDFS)

  19. MapReduce • Programming paradigm made popular by Google and subsequently implemented by Apache Hadoop • Focus on scalability and fault tolerance • A map–reduce pipeline starts from a series of values and maps each value to an output using a given mapper function

  20. MapReduce • High-level Python example
  • Map
  >>> numbers = [1, 2, 3, 4, 5]
  >>> list(map(lambda x: x * x, numbers)) # Map a function over our list
  [1, 4, 9, 16, 25]
  • Reduce
  >>> from functools import reduce
  >>> reduce(lambda x, y: x + y, numbers) + 1 # Reduce the list using a given function, then add 1
  16

  21. MapReduce • A MapReduce pipeline in Hadoop starts from a list of key–value pairs, and maps each pair to one or more output elements • The output elements are also key–value pairs • Next, the output entries are grouped so all output entries belonging to the same key are assigned to the same worker (e.g., physical machine) • These workers then apply the reduce function to each group, producing a new list of key–value pairs • The resulting, final outputs can then be sorted
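  To make this flow concrete, here is a small illustrative walk-through (added for clarity, not part of the original slides) of a word count over the two input lines "big data" and "big deal":

  Input:   (0, "big data"), (9, "big deal")      // key = byte offset of the line, value = the line itself
  Map:     ("big", 1), ("data", 1), ("big", 1), ("deal", 1)
  Group:   "big" -> [1, 1]    "data" -> [1]    "deal" -> [1]
  Reduce:  ("big", 2), ("data", 1), ("deal", 1)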

  22. MapReduce • Reduce workers can already get started on their work even though not all mapping operations have finished yet • Implications: • The reduce function should output the same key–value structure as the one emitted by the map function • The reduce function itself should be built in such a way that it provides correct results, even if called multiple times
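  These are exactly the conditions under which the reducer can also be registered as a combiner, so that partial reductions are already performed on the map side before data is shuffled over the network. A minimal sketch (not in the original slides), referring to the WordCount job and MyReducer class shown later in this deck:

  // In the driver, next to setMapperClass and setReducerClass:
  job.setCombinerClass(MyReducer.class); // also run the reducer locally on each mapper's output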

  23. MapReduce • In Hadoop, MapReduce tasks are written in Java • To run a MapReduce task, a Java program is packaged as a JAR archive and launched as:

  hadoop jar myprogram.jar TheClassToRun [args...]

  24. MapReduce • Example: Java program to count the appearances of a word in a file

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapreduce.*;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
      // Following fragments will be added here
  }

  25. MapReduce • Define mapper function as a class extending the built-in Mapper<KeyIn, ValueIn, KeyOut, ValueOut> class, indicating which type of key–value input pair we expect and which type of key–value output pair our mapper will emit

  26. MapReduce

  public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
      // Our input key is not important here, so it can just be any generic object.
      // Our input value is a piece of text (a line).
      // Our output key will also be a piece of text (a word). Our output value will be an integer.
      public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
          // Take the value, get its contents, convert to lowercase,
          // and remove every character except for spaces and a-z values:
          String document = value.toString().toLowerCase().replaceAll("[^a-z\\s]", "");
          // Split the line up in an array of words
          String[] words = document.split(" ");
          // For each word...
          for (String word : words) {
              // "context" is used to emit output values
              // Note that we cannot emit standard Java types such as int, String, etc. Instead, we need to use
              // an org.apache.hadoop.io.* class such as Text (for string values) and IntWritable (for integers)
              Text textWord = new Text(word);
              IntWritable one = new IntWritable(1);
              // ... simply emit a (word, 1) key-value pair:
              context.write(textWord, one);
          }
      }
  }

  27. MapReduce

  28. MapReduce • Reducer function is specified as a class extending the built-in Reducer<KeyIn, ValueIn, KeyOut, ValueOut> class

  29. MapReduce

  public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          IntWritable result = new IntWritable();
          // Summarise the values so far...
          for (IntWritable val : values) {
              sum += val.get();
          }
          result.set(sum);
          // ... and output a new (word, sum) pair
          context.write(key, result);
      }
  }

  30. MapReduce

  31. MapReduce

  public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Set up a MapReduce job with a sensible short name:
      Job job = Job.getInstance(conf, "wordcount");
      // Tell Hadoop which JAR it needs to distribute to the workers.
      // We can easily set this using setJarByClass
      job.setJarByClass(WordCount.class);
      job.setMapperClass(MyMapper.class);
      job.setReducerClass(MyReducer.class);
      // What does the output look like?
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      // Our program expects two arguments; the first one is the input file on HDFS
      // Tell Hadoop our input is in the form of TextInputFormat
      // (every line in the file will become a value to be mapped)
      TextInputFormat.addInputPath(job, new Path(args[0]));
      // The second argument is the output directory on HDFS
      Path outputDir = new Path(args[1]);
      // Tell Hadoop what our desired output structure is
      FileOutputFormat.setOutputPath(job, outputDir);
      // Delete the output directory if it exists
      FileSystem fs = FileSystem.get(conf);
      fs.delete(outputDir, true);
      // Stop after our job has completed
      System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  32. MapReduce hadoop jar wordcount.jar WordCount /users/me/dataset.txt /users/me/output/

  33. MapReduce

  $ hadoop fs -ls /users/me/output
  Found 2 items
  -rw-r--r--   1 root hdfs       0 2017-05-20 15:11 /users/me/output/_SUCCESS
  -rw-r--r--   1 root hdfs    2069 2017-05-20 15:11 /users/me/output/part-r-00000

  $ hadoop fs -cat /users/me/output/part-r-00000
  and 2
  first 1
  is 3
  line 2
  second 1
  the 2
  this 3

  34. MapReduce • Constructing MapReduce programs requires a certain skillset in terms of programming • It involves trade-offs in terms of speed, memory consumption, and scalability

  35. Yet Another Resource Negotiator (YARN) • Yet Another Resource Negotiator (YARN) distributes a MapReduce program across different nodes and takes care of coordination • Three important services • ResourceManager: a global YARN service that receives and runs applications (e.g., a MapReduce job) on the cluster • JobHistoryServer: keeps a log of all finished jobs • NodeManager: responsible for overseeing resource consumption on a node

  36. Yet Another Resource Negotiator (YARN)

  37. Yet Another Resource Negotiator (YARN)

  38. Yet Another Resource Negotiator (YARN)

  39. Yet Another Resource Negotiator (YARN)

  40. Yet Another Resource Negotiator (YARN) • Complex setup • Allows running programs and applications other than MapReduce
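  As a small illustration of interacting with these YARN services (not in the original slides; the application id is a placeholder), the ResourceManager can be queried from the command line:

  yarn application -list                       # applications currently known to the ResourceManager
  yarn application -status <application-id>    # detailed status of a single application
  yarn logs -applicationId <application-id>    # aggregated logs of a finished application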

  41. SQL on Hadoop • Writing MapReduce programs is very complex compared to formulating an SQL query • Hence the need for a more database-like setup on top of Hadoop

  42. SQL on Hadoop • HBase • Pig • Hive

  43. HBase • The first Hadoop database, inspired by Google’s Bigtable • Runs on top of HDFS • NoSQL-like data storage platform • No typed columns, triggers, advanced query capabilities, etc. • Offers a simplified structure and query language in a way that is highly scalable and can tackle large volumes

  44. HBase • Similar to RDBMSs, HBase organizes data in tables with rows and columns • HBase table consists of multiple rows • A row consists of a row key and one or more columns with values associated with them • Rows in a table are sorted alphabetically by the row key

  45. HBase • Each column in HBase is denoted by a column family and qualifier (separated by a colon, “:”) • A column family physically co-locates a set of columns and their values • Every row has the same column families, but not all column families need to have a value per row • Each cell in a table is hence defined by a combination of the row key, column family and column qualifier, and a timestamp

  46. HBase • Example: HBase table to store and query users • The row key will be the user id • column families:qualifiers • name:first • name:last • email (without a qualifier)

  47. HBase

  hbase(main):001:0> create 'users', 'name', 'email'
  0 row(s) in 2.8350 seconds
  => Hbase::Table - users

  hbase(main):002:0> describe 'users'
  Table users is ENABLED
  users
  COLUMN FAMILIES DESCRIPTION
  {NAME => 'email', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  {NAME => 'name', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  2 row(s) in 0.3250 seconds

  48. HBase

  hbase(main):003:0> list 'users'
  TABLE
  users
  1 row(s) in 0.0410 seconds
  => ["users"]

  49. HBase

  hbase(main):006:0> put 'users', 'seppe', 'name:first', 'Seppe'
  0 row(s) in 0.0200 seconds
  hbase(main):007:0> put 'users', 'seppe', 'name:last', 'vanden Broucke'
  0 row(s) in 0.0330 seconds
  hbase(main):008:0> put 'users', 'seppe', 'email', 'seppe.vandenbroucke@kuleuven.be'
  0 row(s) in 0.0570 seconds
  hbase(main):009:0> scan 'users'
  ROW    COLUMN+CELL
   seppe column=email:, timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
   seppe column=name:first, timestamp=1495293050816, value=Seppe
   seppe column=name:firstt, timestamp=1495293047100, value=Seppe
   seppe column=name:last, timestamp=1495293067245, value=vanden Broucke
  1 row(s) in 0.1170 seconds

  • Note: the extra column name:firstt presumably stems from an earlier, mistyped put (note its older timestamp); HBase simply creates a new column for any qualifier it receives

  50. HBase

  hbase(main):011:0> get 'users', 'seppe'
  COLUMN       CELL
   email:      timestamp=1495293082872, value=seppe.vandenbroucke@kuleuven.be
   name:first  timestamp=1495293050816, value=Seppe
   name:firstt timestamp=1495293047100, value=Seppe
   name:last   timestamp=1495293067245, value=vanden Broucke
  4 row(s) in 0.1250 seconds

  hbase(main):018:0> put 'users', 'seppe', 'email', 'seppe@kuleuven.be'
  0 row(s) in 0.0240 seconds
  hbase(main):019:0> get 'users', 'seppe', 'email'
  COLUMN CELL
   email: timestamp=1495293303079, value=seppe@kuleuven.be
  1 row(s) in 0.0330 seconds
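  Besides the shell, HBase also exposes a Java client API. A minimal sketch (not in the original slides; the class name UsersClient is a made-up example) that performs the same put and get on the users table as above:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class UsersClient {
      public static void main(String[] args) throws IOException {
          Configuration config = HBaseConfiguration.create();
          try (Connection connection = ConnectionFactory.createConnection(config);
               Table users = connection.getTable(TableName.valueOf("users"))) {
              // Put: row key "seppe", column family "email" with an empty qualifier
              Put put = new Put(Bytes.toBytes("seppe"));
              put.addColumn(Bytes.toBytes("email"), Bytes.toBytes(""), Bytes.toBytes("seppe@kuleuven.be"));
              users.put(put);
              // Get the row back and print the email cell
              Result result = users.get(new Get(Bytes.toBytes("seppe")));
              System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("email"), Bytes.toBytes(""))));
          }
      }
  }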
