
COP 6727: Advanced Database Systems Spring 2013 Dr. Tao Li Florida International University



Presentation Transcript


  1. COP 6727: Advanced Database Systems, Spring 2013, Dr. Tao Li, Florida International University

  2. Student Self-Introduction • Name • I will try to remember your names, but if you have a long name, please let me know how I should address you  • Anything you want us to know COP6727

  3. Course Overview • Meeting time • Tuesday and Thursday, 12:30pm – 1:45pm • Office hours: • Thursday 2:30pm – 4:30pm, or by appointment • Course Webpage: • http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html COP6727

  4. Course Objectives • This is an advanced database course • Assumes you have already taken COP5725 • Assumes knowledge of the fundamental concepts of relational databases • Covers the core principles and techniques of data and information management • Discusses advanced techniques that can be applied to traditional database systems to efficiently support new and emerging applications COP6727

  5. Tentative Topics • Query processing and optimization • Transaction management • Database tuning • Data stream systems • Spatial databases • XML • Information retrieval and Web data management • Scalable data processing • Readings in recent developments in database systems and applications • SQL vs. NoSQL databases • Nearest neighbor queries • High-dimensional indexing • Database retrieval and ranking • Stream processing • Big Data • Incremental and online query processing • Mobile databases COP6727

  6. Assignments and Grading • Reading/Written Assignments • Programming Projects • Midterm Exam • Final Project/Presentations • Class attendance is mandatory • Evaluation will be partly subjective • Effort is an important component • Regular In-class Students • Quizzes and Class Participation: 5% • Midterm Exam: 30% • Final Project: 30% • Assignments and Projects: 35% • Online Students • Midterm Exam: 30% • Final Project: 30% • Homework Assignments: 40% COP6727

  7. Text and References Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw Hill, 2003. ISBN: 0-07-246563-8. In addition, the course materials will also be drawn from recent research literature. COP6727

  8. Lecture 1 & 2 • Lecture 1 & 2: Introduction to MapReduce (Most slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials) COP6727

  9. Outline • Motivation for MapReduce • What is MapReduce? • What is Hadoop? • What is Hive? COP6727

  10. Motivation for MapReduce • The Big Data • How to handle big data? COP6727

  11. The Big Data • Big data is everywhere • Documents • Blogs (77 million Tumblr blogs and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews • Images • Instagram, Flickr (more than 6 billion images) • Videos • YouTube, broadcast media • Others • Maps (Google Maps) • Human genome data • Aeronautics and space data COP6727

  12. Another view on “big” • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/ day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day COP6727

  13. Why do we care about those data? • Modeling and predicting information flow • Recommend/predict links in social networks • Relevance classification / information filtering • Sentiment analysis and opinion mining • Topic modeling and evolution • Measuring influence in social networks • Concept mapping • Search • … COP6727

  14. Big data analysis • Scalability (with reasonable cost) • Algorithms improvement • Intuitive way: divide and conquer COP6727

  15. Divide and Conquer COP6727

  16. Challenges • Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? • How do we handle distributed synchronization? COP6727

  17. Challenges (cont'd) • Data storage is not trivial • Traditional databases are not reliable enough on their own • Data volumes are massive • Reliably storing PBs of data is challenging • Disk/hardware/network failures • The probability of a failure event increases with the number of machines • For example: • 1000 hosts, each with 10 disks; each disk lasts 3 years • How many disk failures per day? COP6727
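A quick back-of-the-envelope answer to the question above (a sketch only; it assumes disk failures are independent and spread uniformly over each disk's lifetime):

```python
# Expected disk failures per day for the example cluster on this slide.
hosts = 1000
disks_per_host = 10
disk_lifetime_days = 3 * 365        # each disk lasts about 3 years

total_disks = hosts * disks_per_host            # 10,000 disks in the cluster
failures_per_day = total_disks / disk_lifetime_days

print(round(failures_per_day, 1))   # roughly 9 disk failures every single day
```

With failures arriving daily, the storage layer must treat them as the normal case, which is exactly the design assumption behind HDFS later in the lecture.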

  18. What is MapReduce? • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop COP6727

  19. Workflow of Large Data Problem COP6727

  20. MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • The framework handles (almost) everything else • Values with the same key go to the same reducer COP6727
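The two-function contract above can be simulated in a few lines of plain Python (an in-memory sketch, not Hadoop itself), using word count as the map/reduce pair:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, records):
    """Simulate the paradigm: map -> shuffle (group values by key) -> reduce."""
    intermediate = defaultdict(list)
    # Map phase: each (k1, v1) record yields zero or more (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)   # shuffle: same key -> same reducer
    # Reduce phase: each key is reduced together with all of its values.
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}

# Word count expressed as the two required functions.
def wc_map(doc_id, text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(wc_map, wc_reduce, [("d1", "a b a"), ("d2", "b c")])
# result == {"a": 2, "b": 2, "c": 1}
```

Everything outside `wc_map` and `wc_reduce` is what the framework provides for free: grouping, distribution, and fault tolerance.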

  21. MapReduce Flow COP6727

  22. An Example COP6727

  23. MapReduce paradigm – Cont'd • There's more! • Partitioners decide which key goes to which reducer • partition(k', numPartitions) -> partNumber • Divides the key space into chunks, one per parallel reducer • Default is hash-based • Combiners can combine mapper output locally before sending it to the reducers • Same signature as Reduce: (k2, list(v2)) -> list(v2) COP6727
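Both ideas fit in a few lines of plain Python (an illustrative sketch; in Hadoop these correspond to the Partitioner and Combiner, and the real implementations live in Java):

```python
from collections import defaultdict

def partition(key, num_partitions):
    """Default hash-based partitioner: partition(k', numPartitions) -> partNumber.
    The same key always hashes to the same reducer."""
    return hash(key) % num_partitions

def combine(mapper_output):
    """A word-count combiner: sum counts locally on the mapper node
    before anything crosses the network to the reducers."""
    local = defaultdict(int)
    for key, value in mapper_output:
        local[key] += value
    return sorted(local.items())

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(pairs)   # [("a", 3), ("b", 1)] -- 2 pairs shipped instead of 4
```

The combiner is purely an optimization: the reducer's answer is the same with or without it, but far less intermediate data is shuffled.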

  24. MapReduce Flow COP6727

  25. MapReduce additional details • Reduce starts after all mappers complete • Mapper output gets written to disk • Intermediate data can be copied sooner • Each reducer gets its keys in sorted order • Keys are not sorted across reducers • A global sort requires 1 reducer or smart partitioning COP6727

  26. MapReduce is good at • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset COP6727

  27. MapReduce can do • Iterative jobs (e.g., PageRank, k-means clustering) • Each iteration must read/write data to disk • The I/O and latency cost of each iteration is high COP6727

  28. MapReduce is not good at • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared state requires a scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records COP6727

  29. Summary of MapReduce • Simple programming model • Scalable, fault-tolerant • Ideal for (pre-)processing large volumes of data COP6727

  30. What is Hadoop? • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004 COP6727

  31. Hadoop provides • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination COP6727

  32. Hadoop Stack COP6727

  33. Who uses Hadoop? • Yahoo! • Facebook • Last.fm • Rackspace • Digg • Apache Nutch • ... COP6727

  34. HDFS • The Hadoop Distributed File System • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts COP6727

  35. Some Concepts about HDFS • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data • DataNodes (DN) store and serve blocks COP6727
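With the defaults above, a file's block count and raw storage footprint are simple to compute. A small sketch (the numbers assume the 64 MB block size and 3-way replication stated on this slide; both are configurable):

```python
import math

BLOCK_SIZE_MB = 64   # default HDFS block size in this era of Hadoop
REPLICATION = 3      # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw MB consumed once every block
    is replicated REPLICATION times)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(200)
# A 200 MB file -> 4 blocks (three full 64 MB blocks plus one 8 MB block),
# and 600 MB of raw cluster storage with 3x replication.
```

This is also why HDFS is "intended for large files": a tiny file still costs one block's worth of NameNode metadata, so millions of small files strain the NN long before the disks fill up.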

  36. Write COP6727

  37. Read COP6727

  38. If a DataNode fails • DNs check in with the NN to report health • Upon failure, the NN orders DNs to replicate under-replicated blocks COP6727

  39. Jobs and Tasks in Hadoop • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks get retried automatically • Tasks run local to their data, ideally • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks COP6727

  40. Architecture COP6727

  41. How to handle failed tasks? • The JT will retry failed tasks up to N attempts • After N failed attempts for a task, the job fails • Some tasks are slower than others • Speculative execution: the JT starts multiple attempts of the same task • The first one to complete wins; the others are killed COP6727

  42. Data locality • Move computation to the data • Moving data between nodes has a cost • Hadoop tries to schedule tasks on the nodes that hold their data • When that is not possible, the TT has to fetch the data from a DN COP6727

  43. Hadoop execution environment • Local machine (standalone or pseudo-distributed) • Virtual machine • Cloud (e.g. Amazon EC2) • Own cluster COP6727

  44. Demo: word count • Demo COP6727
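As a sketch of what the demo computes, here is word count written in the style of Hadoop Streaming, where the mapper and reducer are scripts that read lines and emit tab-separated pairs (an assumption for illustration; the in-class demo may well use the Java API instead):

```python
def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every word, one pair per line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Streaming reducer: input arrives sorted by key, so all counts for a
    word are adjacent and can be summed in a single pass."""
    current, total = None, 0
    for pair in sorted_pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate map -> sort (the shuffle) -> reduce on two lines of input:
out = list(reducer(sorted(mapper(["hello world", "hello hadoop"]))))
# out == ["hadoop\t1", "hello\t2", "world\t1"]
```

The `sorted(...)` call stands in for the framework's shuffle/sort; on a real cluster that step is distributed and the same scripts run unchanged.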

  45. Homework • Write a Hadoop program to index the words within the text document dataset • Example: • Input: • Doc1: Hello World! • Doc2: Hello Java! • Expected output: • Hello \t Doc1 Doc2 • World \t Doc1 • Java \t Doc2 • Due: beginning of class on 01/10 • If you have any questions, send email to Jingxuan Li (jli003@cs.fiu.edu) COP6727
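The homework has the same map/shuffle/reduce shape as word count, except the mapper emits (word, doc_id) and the reducer joins document IDs. A plain-Python sketch of the logic on the example documents (illustrative only, not a substitute for the required Hadoop program; the helper names and the punctuation-stripping choice are my own assumptions):

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Emit (word, doc_id) per word; strip trailing punctuation as in the example.
    return [(word.strip("!.,?"), doc_id) for word in text.split()]

def index_reduce(word, doc_ids):
    # Deduplicate while preserving first-seen order, then join with spaces.
    return " ".join(dict.fromkeys(doc_ids))

docs = [("Doc1", "Hello World!"), ("Doc2", "Hello Java!")]

# Shuffle: group doc ids by word, as the framework would.
grouped = defaultdict(list)
for doc_id, text in docs:
    for word, d in index_map(doc_id, text):
        grouped[word].append(d)

index = {word: index_reduce(word, ids) for word, ids in grouped.items()}
# index == {"Hello": "Doc1 Doc2", "World": "Doc1", "Java": "Doc2"}
```

In the Hadoop version the grouping loop disappears: the framework delivers `(word, list_of_doc_ids)` straight to the reducer.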

  46. Login Info • Below is the login information for our Hadoop cluster • Server: datamining-node03.cs.fiu.edu • U:dbstudent p:******* (announced during the class) • To access the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent • Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset • Output directory format (include your source code and the indexing results): /user/dbstudent/output-PID COP6727

  47. What is Hive? • A data warehousing tool on top of Hadoop • Originally developed at Facebook • Now a Hadoop sub-project • Data warehouse infrastructure • Execution: MapReduce • Storage: HDFS files • Large datasets, e.g. Facebook daily logs: 30GB (Jan '08), 200GB (Mar '08), 15+TB (2009) • Hive QL: an SQL-like query language COP6727

  48. Motivation • Missing components when using Hadoop MapReduce jobs to process data • Command-line interface for “end users” • Ad-hoc query support • … without writing full MapReduce jobs • Schema information COP6727

  49. Hive Applications • Log processing • Text mining • Document indexing • Customer-facing business intelligence (e.g., Google Analytics) • Predictive modeling, hypothesis testing COP6727

  50. Hive Components • Shell: allows interactive queries, like the MySQL shell connected to a database • Also supports web and JDBC clients • Driver: session handles, fetch, execute • Compiler: parse, plan, optimize • Execution engine: DAG of stages (M/R, HDFS, or metadata) • Metastore: schema, location in HDFS COP6727
