
MapReduce Theory and Practice

MapReduce Theory and Practice. http://net.pku.edu.cn/~course/cs402/2010/ Peng Bo (彭波) pb@net.pku.edu.cn School of Information Science and Technology, Peking University (北京大学信息科学技术学院) 7/15/2010.


Presentation Transcript


  1. MapReduce Theory and Practice http://net.pku.edu.cn/~course/cs402/2010/ Peng Bo (彭波) pb@net.pku.edu.cn School of Information Science and Technology, Peking University 7/15/2010

  2. Last Course Review

  3. Quiz • What are they? • Data • Bit • Byte • Data types • Information

  4. Data • The term data refers to groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. • Data (plural of "datum", which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. • Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. • Raw data are unprocessed numbers, characters, images, or other outputs from devices that convert physical quantities into symbols.

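  5. Bit • A bit, also called a binary digit, is a single digit in the binary system and the smallest unit of information. "Bit" is short for "binary digit". • Suppose an event occurs as either A or B, and A and B are equally likely, each with probability 0.5. Then one bit can represent which of A or B occurred. For example: • A bit can represent a simple positive or negative sign • A switch with two states (such as a light switch) • A transistor being on or off • The presence or absence of voltage on a wire • An abstract logical yes or no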

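  6. Byte • A byte (字节) is a group of eight bits on modern machines. The name is sometimes explained as an abbreviation of "Binary Term", but it was coined as a deliberate respelling of "bite". The byte is the usual unit for measuring computer information, regardless of the type of data being stored.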

  7. History of “Information” • Latin origin: a representation implanted in the mind -> idea • Language and coding: hide information in messages and then decode them. Example: Morse code • Mathematics: in his work on channel transmission, Shannon defined the amount of information carried by a message as the negative log2 of its probability of occurrence at the source, measured in “bits” • Logic and linguistics: the communication-oriented sense of information involves semantic meaning and knowledge • Society: information as something that is contained in the message used to inform. “Information is the tennis ball of communication”
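The Mathematics bullet can be made concrete with a minimal sketch. The function name `self_information` is a hypothetical helper chosen for illustration, not something from the slides; it computes the negative base-2 logarithm of a message's probability.

```python
import math

def self_information(p: float) -> float:
    """Information content, in bits, of a message with probability p.

    Hypothetical helper for illustration: Shannon's measure is -log2(p).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)

# An event with two equally likely outcomes (A or B, each 0.5) carries one bit.
one_of_two = self_information(0.5)
# One outcome out of four equally likely ones carries two bits.
one_of_four = self_information(0.25)
```

Rarer messages carry more information: halving the probability adds exactly one bit.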

  8. How much data? • “640K ought to be enough for anybody.” • Google processes 20 PB a day (2008) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • CERN’s LHC will generate 15 PB a year (??)

  9. “We are living in exponential times”

  10. Information Overloading • Political theorist Neil Postman spoke to the German Informatics Society in 1990, claiming that we are informing ourselves to death.  He argued that the development of computer technology is not as positive as it has been heralded to be.  With our focus on technology, we are forfeiting our humanity.  We are drowning in information that contains empty promises of improving our lives. (Postman 1990).

  11. How do we cope with information overload?

  12. What does this have to do with ME?! • What would you do with 1,000 PCs, or even 100,000 PCs?

  13. Cloud is coming… • Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law • “Data Center is a Computer” • Parallelism everywhere • Massive • Scalable • Reliable • Resource Management • Data Management • Programming Model & Tools

  14. What’s MapReduce? • A parallel/distributed computing programming model • [Figure: data flow with stages labeled input, split, shuffle, output]

  15. Word Frequencies in Web pages • Input: one document per record • The user implements the map function, whose input is • key = document URL • value = document contents • map outputs (potentially many) key/value pairs. • For each occurrence of a word in the document, it emits one record <word, “1”>

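  16. Example continued: • The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort) • The user implements the reduce function, which computes over the values for one key • Here: sum them • Reduce outputs <key, sum>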
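The word-frequency example can be sketched in a single process. This is an illustration under stated assumptions, not a real MapReduce runtime: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, the example URLs and documents are made up, and the shuffle/sort step is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(url, contents):
    """User-defined map: emit <word, "1"> for every word occurrence."""
    for word in contents.split():
        yield word.lower(), "1"

def reduce_fn(word, values):
    """User-defined reduce: sum the values collected for one key."""
    return word, sum(int(v) for v in values)

def run_mapreduce(records):
    """Stand-in for the runtime: map, group by key (shuffle/sort), reduce."""
    groups = defaultdict(list)
    for url, contents in records:
        for key, value in map_fn(url, contents):
            groups[key].append(value)
    # Sorting the keys mimics the sort half of shuffle/sort.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Made-up input: one (document URL, document contents) record per document.
docs = [
    ("http://example.com/a", "the cat sat"),
    ("http://example.com/b", "the cat ran"),
]
counts = run_mapreduce(docs)
```

In a real deployment the map and reduce calls would run on many machines and the grouping would happen over the network; only the two user functions would stay essentially the same.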

  17. Homework Reading

  18. Checklist • What’s the title? • What’s the main point of view? • What had the most impact on you?

  19. Introduction to Distributed System Design • How many times does “physicist” occur in this document? • Tell me something about Remote Procedure Calls • Tell me something about the types of failures that can occur in a distributed system

  20. Introduction to Parallel Programming and MapReduce • MASTER/WORKER technique • approximating pi • MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

  21. End
