
Big Data Analytics Map-Reduce and Spark


Presentation Transcript


  1. Big Data Analytics: Map-Reduce and Spark. Slides by: Joseph E. Gonzalez (jegonzal@cs.berkeley.edu). Guest Lecturer: Vikram Sreekanti

  2. From SQL to Big Data (with SQL)
A few weeks ago…
• Databases
• (Relational) Database Management Systems
• SQL: Structured Query Language
Today
• More on databases and database design
• Enterprise data management and the data lake
• Introduction to distributed data storage and processing
• Spark

  3. Data in the Organization A little bit of buzzword bingo!

  4. Inventory How we like to think of data in the organization

  5. The reality… separate operational stores: Sales (Asia), Sales (US), Inventory, Advertising

  6. Operational Data Stores (Sales (Asia), Sales (US), Inventory, Advertising):
• Capture the now
• Many different databases across an organization
• Mission critical… be careful! Serving live ongoing business operations (e.g., managing inventory)
• Different formats (e.g., currency)
• Different schemas (acquisitions ...)
• Live systems often don’t maintain history
Data Warehouse: collects and organizes historical data from multiple sources. We would like a consolidated, clean, historical snapshot of the data.

  7. Data Warehouse: collects and organizes historical data from multiple sources. Data is periodically ETLed from the operational stores (Sales (Asia), Sales (US), Inventory, Advertising) into the data warehouse:
• Extracted from remote sources
• Transformed to standard schemas
• Loaded into the (typically) relational (SQL) data system

  8. Extract Transform  Load (ETL) Extract & Load: provides a snapshot of operational data • Historical snapshot • Data in a single system • Isolates analytics queries (e.g., Deep Learning) from business critical services (e.g., processing user purchase) • Easy! Transform: clean and prepare data for analytics in a unified representation • Difficult  often requires specialized code and tools • Different schemas, encodings, granularities

  9. Data Warehouse: collects and organizes historical data from multiple sources (ETLed from Sales (Asia), Sales (US), Inventory, Advertising). How is data organized in the Data Warehouse?

  10. Example Sales Data
• Big table: many columns and rows
• Substantial redundancy → expensive to store and access
• Easy to make mistakes while updating
• Could we organize the data more efficiently?

  11. Multidimensional Data Model
• Sales Fact Table with (Location Id, Product Id, Time Id) keys into Dimension Tables: Locations, Products, Time
• A multidimensional “cube” of data
[Figure: sales quantities arranged as a 3-D cube over the Location, Product, and Time axes]

  12. Multidimensional Data Model
• Fact Table: minimizes redundant info, reduces data errors
• Dimensions (Locations, Products, Time): easy to manage and summarize; rename: Galaxy1 → Phablet
• Normalized representation
• How do we do analysis? Joins!

  13. The Star Schema: the Sales Fact Table in the center, linked to the Products, Time, and Locations dimension tables. This looks like a star…


  16. The Snowflake Schema: the dimension tables are themselves normalized into sub-dimension tables. This looks like a snowflake…?

  17. Which schema illustration would best organize this data? http://bit.ly/ds100-sp18-star
Data(calid, student_name, year, major, major_grade_req, asg1_name, asg1_pts, asg1_score, asg1_grader_name, asg2_name, asg2_pts, asg2_score, asg2_grader_name, avg_grade)
[Figure: three candidate schema diagrams, (A), (B), and (C), linking Students, Majors, Graders, Grades, and Assignments tables]

  18. Data(calid, student_name, year, major, major_grade_req, asg1_name, asg1_pts, asg1_score, asg1_grader_name, asg2_name, asg2_pts, asg2_score, asg2_grader_name, avg_grade)
Which schema illustration would best organize this data? One normalized answer:
• Grades(calid, asg_name, grader_id, score)
• Graders(grader_id, grader_name)
• Student(calid, name, year, major_name, avg_grade)
• Majors(major_name, grade_req)
• Assignments(asg_name, asg_pts)
[Figure: the same three candidate diagrams, (A), (B), and (C)]
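To see why the normalized tables still support analysis, here is a small sketch using Python's sqlite3 with three of the relations above; the rows are made up:

```python
import sqlite3

# Build part of the normalized schema from the slide and run one join.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE Students(calid INT, name TEXT, year INT,
                      major_name TEXT, avg_grade REAL);
CREATE TABLE Assignments(asg_name TEXT, asg_pts INT);
CREATE TABLE Grades(calid INT, asg_name TEXT, grader_id INT, score INT);
""")
cur.execute("INSERT INTO Students VALUES (1, 'Ada', 2, 'CS', 3.9)")
cur.execute("INSERT INTO Assignments VALUES ('hw1', 100)")
cur.execute("INSERT INTO Grades VALUES (1, 'hw1', 7, 92)")

# Join the fact-like Grades table back to its dimensions to
# reassemble the wide, denormalized view on demand.
wide = cur.execute("""
SELECT s.name, g.asg_name, g.score, a.asg_pts
FROM Grades g
JOIN Students s ON s.calid = g.calid
JOIN Assignments a ON a.asg_name = g.asg_name
""").fetchall()
```

The wide table is never stored; it is recomputed by joins, which is exactly the trade the star schema makes.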

  19. Online Analytics Processing (OLAP). Users interact with multidimensional data by:
• Constructing ad-hoc and often complex SQL queries
• Using graphical tools to construct queries
• Sharing views that summarize data across important dimensions

  20. Cross Tabulation (Pivot Tables)
• Aggregate data across pairs of dimensions
• Pivot Tables: a graphical interface to select dimensions and an aggregation function (e.g., SUM, MAX, MEAN)
• GROUP BY queries
• Related to contingency tables and marginalization in stats
• What about many dimensions?
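A cross tabulation is just a GROUP BY over two dimensions. A minimal sqlite3 sketch with an invented Furniture table (the table echoes the CUBE example on the next slide):

```python
import sqlite3

# Cross-tabulate quantity sold by (item, color) with a GROUP BY.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Furniture(item TEXT, color TEXT, quantity INT)")
con.executemany("INSERT INTO Furniture VALUES (?, ?, ?)", [
    ("table", "red", 2), ("table", "blue", 3),
    ("chair", "red", 4), ("table", "red", 1),
])
rows = con.execute("""
SELECT item, color, SUM(quantity)
FROM Furniture
GROUP BY item, color
ORDER BY item, color
""").fetchall()
print(rows)  # [('chair', 'red', 4), ('table', 'blue', 3), ('table', 'red', 3)]
```

A pivot-table UI generates essentially this query, then lays the groups out as a grid.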

  21. Cube Operator
• Generalizes cross-tabulation to higher dimensions
• In SQL:
SELECT Item, Color, SUM(Quantity) AS QtySum
FROM Furniture
GROUP BY CUBE (Item, Color);
What is * here?
[Figure: the data cube extended with * cells holding totals aggregated over each dimension]
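Not every engine supports CUBE (SQLite, for one, does not), but the idea can be sketched in plain Python: aggregate over every subset of the grouping dimensions, with None standing in for the * ("all values") cells:

```python
from itertools import combinations
from collections import defaultdict

# Rows of (item, color, quantity); compute SUM(quantity) for every
# subset of {item, color} — what GROUP BY CUBE(Item, Color) returns.
rows = [("table", "red", 2), ("table", "blue", 3), ("chair", "red", 4)]
dims = [0, 1]  # column indices of Item and Color

cube = defaultdict(int)
for r in rows:
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            # Dimensions not in `kept` are rolled up to * (None here).
            key = tuple(r[d] if d in kept else None for d in dims)
            cube[key] += r[2]

print(cube[(None, None)])     # grand total: 9
print(cube[("table", None)])  # all tables, any color: 5
```

With d dimensions this produces 2^d groupings, which is why cubes over many dimensions get expensive fast.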

  22. OLAP Queries
• Slicing: selecting a value for a dimension
• Dicing: selecting a range of values in multiple dimensions
[Figure: a slice (one plane of the cube) and a dice (a sub-cube) cut from the data cube]
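In code, slicing and dicing are just filters on the cube's coordinates (in SQL, WHERE clauses on the fact table). A toy sketch over an invented {(location, product, day): quantity} cube:

```python
# A small cube stored sparsely as {(location, product, day): qty}.
cube = {
    (1, 11, 1): 25, (1, 11, 2): 30, (1, 12, 1): 8,
    (2, 11, 1): 10, (2, 12, 2): 20, (2, 12, 3): 50,
}

# Slice: fix one dimension to a single value (location == 1).
slice_loc1 = {k: v for k, v in cube.items() if k[0] == 1}

# Dice: restrict ranges on multiple dimensions (product 11, days 1-2).
dice = {k: v for k, v in cube.items() if k[1] == 11 and 1 <= k[2] <= 2}

print(sum(slice_loc1.values()))  # 63
print(sum(dice.values()))        # 65
```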

  23. OLAP Queries
• Rollup: aggregating along a dimension
• Drill-Down: de-aggregating along a dimension
[Figure: per-day cubes (Day 1, Day 2, Day 3) rolled up into a single aggregated cube, and drilled back down]
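A rollup is a GROUP BY that aggregates one dimension away; drilling down means returning to the finer-grained table. A toy Python sketch with invented numbers:

```python
from collections import defaultdict

# Roll up a {(product, day): qty} table along the day dimension.
sales = {(11, 1): 25, (11, 2): 30, (12, 1): 8, (12, 2): 20}

rollup = defaultdict(int)
for (product, day), qty in sales.items():
    rollup[product] += qty  # aggregate away `day`

print(dict(rollup))  # {11: 55, 12: 28}
# Drill-down is the reverse: go back to `sales` to split a total by day.
```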

  24. Reporting and Business Intelligence (BI)
• Users interact with their data through high-level tools that automatically generate SQL queries
• Queries can get big!
• Common!

  25. Data Warehouse: collects and organizes historical data from multiple sources (ETLed from Sales (Asia), Sales (US), Inventory, Advertising). So far: star schemas, data cubes, OLAP queries.

  26. Data Warehouse: collects and organizes historical data from multiple sources
• How do we deal with semi-structured and unstructured data (text/log data, photos & videos)?
• Do we really want to force a schema on load?
It is terrible!

  27. For the semi-structured and unstructured data (text/log data, photos & videos):
• How do we clean and organize this data? Depends on use…
• How do we load and process this data in a relational system? Depends on use… can be difficult… requires thought…
It is terrible!

  28. Data Lake*: store a copy of all the data
• in one place
• in its original “natural” form
Enable data consumers to choose how to transform and use data: “schema on read”. What could go wrong?
*Still being defined… [buzzword disclaimer]

  29. The Dark Side of Data Lakes
• Cultural shift: Curate → Save Everything!
• Noise begins to dominate signal
• Limited data governance and planning. Example: hdfs://important/joseph_big_file3.csv_with_json — what does it contain? When was it created, and by whom?
• No cleaning and verification → lots of dirty data
• New tools are more complex and old tools no longer work
Enter the data scientist

  30. A Brighter Future for Data Lakes. Enter the data scientist
• Data scientists bring new skills: distributed data processing and cleaning; machine learning, computer vision, and statistical sampling
• Technologies are improving: SQL over large files; self-describing file formats & catalog managers
• Organizations are evolving: tracking data usage and file permissions; new job title: data engineers

  31. How do we store and compute on large unstructured datasets?
Requirements:
• Handle very large files spanning multiple computers
• Use cheap commodity devices that fail frequently
• Process distributed data quickly and easily
Solutions:
• Distributed file systems → spread data over multiple machines; assume machine failure is common → redundancy
• Distributed computing → load and process files on multiple machines concurrently; assume machine failure is common → redundancy
• Functional programming computational pattern → parallelism

  32. Distributed File Systems: storing very large files

  33. Fault Tolerant Distributed File Systems. How do we store and access very large files across cheap commodity devices? [Ghemawat et al., SOSP’03]

  34. Fault Tolerant Distributed File Systems [Figure: one big file and four machines, Machine 1 – Machine 4] [Ghemawat et al., SOSP’03]

  35. Fault Tolerant Distributed File Systems
• Split the file into smaller parts (A, B, C, D). How?
• Ideally at record boundaries
• What if records are big?
[Ghemawat et al., SOSP’03]

  36.–39. [Figure: the parts A–D are spread across Machines 1–4, with each part replicated onto three different machines] [Ghemawat et al., SOSP’03]

  40. Fault Tolerant Distributed File Systems
• Split large files over multiple machines → easily support massive files spanning machines
• Read parts of the file in parallel → fast reads of large files
• Often built using cheap commodity storage devices
Cheap commodity storage devices will fail! [Ghemawat et al., SOSP’03]
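The split-and-replicate idea can be sketched in a few lines of Python. This is a toy round-robin placement policy for illustration, not the actual GFS/HDFS algorithm:

```python
# Toy block placement: split a file into fixed-size parts and assign
# each part to `replicas` distinct machines, round-robin.

def split(data, part_size):
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def place(parts, machines, replicas=3):
    # Assumes replicas <= len(machines), so the copies of a part
    # always land on distinct machines.
    placement = {m: [] for m in machines}
    for i, part in enumerate(parts):
        for r in range(replicas):
            m = machines[(i + r) % len(machines)]
            placement[m].append((i, part))
    return placement

parts = split(b"big file contents...", part_size=5)
placement = place(parts, ["m1", "m2", "m3", "m4"])
# Every part now lives on 3 different machines, so losing any single
# machine still leaves at least two copies of each part.
```

Reading the file back means fetching each part index from any machine that holds it, which is also why reads parallelize so well.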

  41.–44. [Figure: a failure event takes down two of the machines; because every part was replicated, the surviving machines still hold copies of A–D, and the lost replicas are re-created from them] [Ghemawat et al., SOSP’03]

  45. Map-Reduce: distributed aggregation, computing on very large files

  46. How would you compute the number of occurrences of each word in all the books using a team of people?


  50. Simple Solution
1) Divide books across individuals
2) Compute counts locally
3) Aggregate tables
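The three steps above are exactly the map-reduce pattern. A minimal sketch in plain Python, where each "book" is just a string:

```python
from collections import Counter
from functools import reduce

books = ["the cat sat", "the dog sat", "the cat ran"]

# 1) Divide: each "individual" gets one book (here, one list element).
# 2) Map: compute word counts locally, one table per book.
local_counts = [Counter(book.split()) for book in books]

# 3) Reduce: aggregate the local tables into one global table.
total = reduce(lambda a, b: a + b, local_counts, Counter())
print(total["the"])  # 3
print(total["cat"])  # 2
```

A real map-reduce framework runs the map step on many machines at once and shuffles the partial tables to reducers, but the logic per word is the same.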
