
Data-Intensive Text Processing with MapReduce J. Lin & C. Dyer


Presentation Transcript


  1. Data-Intensive Text Processing with MapReduce, J. Lin & C. Dyer. Chapter 1

  2. MapReduce • Programming model for distributed computations on massive amounts of data • Execution framework for large-scale data processing on clusters of commodity servers • Developed by Google – built on older principles of parallel and distributed processing • Hadoop – the widely adopted open-source implementation, driven by Yahoo (now an Apache project)
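The programming model can be illustrated with the canonical word-count example. The sketch below is a minimal single-machine stand-in (the function names and the in-memory "shuffle" are illustrative; a real framework partitions, shuffles, and fault-tolerates across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    # The "what": emit (word, 1) for every word in a document.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # The "what": sum the partial counts collected for one word.
    yield (word, sum(counts))

def run_mapreduce(docs, mapper, reducer):
    # The "how": a single-machine stand-in for the execution framework.
    # A real cluster would partition inputs, shuffle over the network,
    # and recover from machine failures.
    mapped = [kv for doc in docs for kv in mapper(doc)]
    mapped.sort(key=itemgetter(0))              # the shuffle/sort phase
    out = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.update(reducer(key, (v for _, v in group)))
    return out

print(run_mapreduce(["the cat", "the dog"], map_fn, reduce_fn))
# {'cat': 1, 'dog': 1, 'the': 2}
```

The programmer supplies only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the framework's responsibility.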

  3. Big Data • Big data – an issue every organization must grapple with • Web-scale processing is synonymous with data-intensive processing • Vast public and private data repositories • Behavioral data is especially important for business intelligence (BI)

  4. 4th paradigm • Manipulating, exploring, and mining massive data – the 4th paradigm of science (after theory, experiments, and simulations) • In CS, systems must be able to scale • Increases in storage capacity have outpaced improvements in bandwidth

  5. Problems/Solutions • NLP and IR • Data-driven algorithmic approach – capture statistical regularities in the data • Data – corpora (NLP), collections (IR) • Representations of the data – features (from superficial to deep) • Method – algorithms • Examples: Is this email spam or not? Is this word part of an address or a location?
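As a sketch of the data-driven approach, here is a toy spam detector that compares smoothed unigram probabilities under two classes (the corpora, names, and decision rule are invented for illustration; real systems train on large labeled collections):

```python
from collections import Counter
import math

# Toy labeled corpora, invented purely for illustration.
spam_docs = ["win free money now", "free offer win money"]
ham_docs = ["lunch meeting at noon", "see you at the meeting"]

def unigram_counts(docs):
    return Counter(w for d in docs for w in d.split())

def log_prob(doc, counts, vocab_size):
    # Add-one (Laplace) smoothing: unseen words still get some probability.
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in doc.split())

spam_c, ham_c = unigram_counts(spam_docs), unigram_counts(ham_docs)
vocab = len(set(spam_c) | set(ham_c))

def is_spam(doc):
    # Pick whichever class assigns the message higher smoothed probability.
    return log_prob(doc, spam_c, vocab) > log_prob(doc, ham_c, vocab)

print(is_spam("free money"))     # True
print(is_spam("lunch meeting"))  # False
```

The point is the method: statistical regularities in data, not hand-coded rules, drive the decision.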

  6. Problems/Solutions • Who shot Lincoln? • Classic NLP – sophisticated linguistic analysis (syntactic, semantic) • Circa 2001 – look at the words to the left of "shot Lincoln" and tally up the candidates: a redundancy-based approach • Language model – probability distribution over sequences of words • Training and smoothing • Markov assumption • N-gram language model – the conditional probability of a word given the n−1 previous words
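A bigram model under the Markov assumption can be sketched as follows (toy corpus and maximum-likelihood estimates, invented for illustration; real models add smoothing so unseen n-grams get nonzero probability):

```python
from collections import Counter

# Toy training text, invented for illustration.
corpus = "who shot lincoln who shot kennedy".split()

bigrams = Counter(zip(corpus, corpus[1:]))
history = Counter(corpus[:-1])

def p_next(word, prev):
    # Markov assumption: P(word | all previous words) ~= P(word | prev).
    # Maximum-likelihood estimate; real systems smooth these counts.
    return bigrams[(prev, word)] / history[prev]

print(p_next("shot", "who"))      # 1.0 -> "who" is always followed by "shot"
print(p_next("lincoln", "shot"))  # 0.5 -> "shot" precedes lincoln/kennedy equally
```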

  7. MapReduce (MR) • Provides a level of abstraction and a beneficial division of labor • Programming model – a powerful abstraction that separates the what from the how of data-intensive processing

  8. Big Ideas behind MapReduce • Scale out, not up • Purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (100s) and large shared memory (GBs) is not cost effective • Why? A machine with 2x the processors costs more than 2x as much • Barroso & Hölzle analysis using TPC benchmarks • SMP – communication an order of magnitude faster • Still, a cluster of low-end servers is about 4x more cost effective than the high-end approach • However, even low-end machines see only 10-50% utilization – not energy efficient
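The cost comparison can be made concrete with a back-of-the-envelope calculation (the dollar and performance figures below are invented to reproduce the roughly 4x ratio the slide cites; they are not the actual Barroso & Hölzle / TPC numbers):

```python
# Invented illustrative figures, not the Barroso & Holzle numbers.
smp_cost, smp_perf = 320_000, 8        # one high-end SMP machine
node_cost, node_perf = 10_000, 1       # one low-end commodity server
cluster_nodes = 40

smp_cost_per_perf = smp_cost / smp_perf
cluster_cost_per_perf = (node_cost * cluster_nodes) / (node_perf * cluster_nodes)

print(smp_cost_per_perf / cluster_cost_per_perf)  # 4.0
```

The superlinear price curve of big SMP machines, not any per-node advantage, is what makes the cluster win.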

  9. Big Ideas behind MapReduce • Assume failures are common • If each machine has a mean time between failures of 1000 days, a 10,000-server cluster sees about 10 failures a day • MR copes with failure • Move processing to the data • MR assumes an architecture where processors and storage are co-located • Run code on the processor attached to the data it needs
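The failure arithmetic on this slide works out directly:

```python
# With a per-machine mean time between failures of 1000 days,
# a 10,000-server cluster fails somewhere essentially every day.
mtbf_days = 1000
machines = 10_000

failures_per_day = machines / mtbf_days
print(failures_per_day)  # 10.0
```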

  10. Big Ideas behind MapReduce • Process data sequentially, not randomly • Consider a 1 TB database of 10^10 100-byte records • Updating 1% of the records via random access takes about a month • Reading the entire DB and rewriting all records sequentially with the updates takes less than one work day on a single machine • Solid state won't change this picture • MR is designed for batch processing – trade latency for throughput
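The arithmetic behind this claim, using assumed hardware figures that are not stated on the slide (~10 ms per random seek, ~100 MB/s sequential throughput):

```python
# Assumed figures: ~10 ms per random seek, ~100 MB/s sequential read.
records = 10**10                 # 100-byte records -> 1 TB in total
updated = records // 100         # update 1% of the records
seek_s = 0.010

# Random access: one seek per update (a separate read and write per
# update roughly doubles this, giving on the order of a month).
random_days = updated * seek_s / 86_400
print(round(random_days, 1))     # 11.6

# Sequential: stream the whole terabyte once.
seq_hours = 10**12 / (100 * 2**20) / 3600
print(round(seq_hours, 1))       # 2.6  (well under a work day even doubled)
```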

  11. Big Ideas behind MapReduce • Hide system-level details from the application developer • Writing distributed programs is difficult • Details span threads, processes, and machines • Code running concurrently is unpredictable • Deadlocks, race conditions, etc. • MR isolates the developer from system-level details • No locking, starvation, etc. • Well-defined interfaces • Separates the what (programmer) from the how (responsibility of the execution framework) • The framework is designed once and verified for correctness

  12. Big Ideas behind MapReduce • Seamless scalability • Ideally: given 2x the data, an algorithm takes at most 2x as long to run; given a cluster 2x as large, it takes half the time • Such scaling is unobtainable for most algorithms • 9 women can't have a baby in 1 month • E.g. on 2x the machines a program can even take longer, since a greater degree of parallelization increases communication • MR is a small step toward attaining it • The algorithm stays fixed; the framework executes it over more machines • If 10 machines take 10 hours, 100 machines take 1 hour
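One standard way to quantify why ideal linear scaling breaks down is Amdahl's law; it is not on the slide, but it makes the point concrete:

```python
# Amdahl's law: if a fraction s of the work is inherently serial,
# n machines give a speedup of 1 / (s + (1 - s) / n).
def speedup(n, serial_fraction):
    return 1 / (serial_fraction + (1 - serial_fraction) / n)

# Even a 5% serial fraction caps 100 machines at ~17x, not 100x.
print(round(speedup(100, 0.05), 1))  # 16.8
```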

  13. Motivation for MapReduce • We are still waiting for parallel processing to replace sequential processing • While Moore's law held, most problems could be solved on a single computer, so parallelism was largely ignored • Around 2005 this was no longer true • The semiconductor industry ran out of opportunities to improve single-threaded performance • Faster clocks, deeper pipelines, superscalar architectures • Then came multi-core • Not matched by advances in software

  14. Motivation • Parallel processing is the only way forward • MapReduce to the rescue • Anyone can download the open-source Hadoop implementation of MapReduce • Rent a cluster from a utility cloud • Process terabytes within the week • Multiple cores in a chip, multiple machines in a cluster

  15. Motivation • MapReduce: an effective data-analysis tool • The first widely adopted step away from the von Neumann model • We can't treat a multi-core processor or a cluster as a conglomeration of many von Neumann machine images communicating over a network • Wrong abstraction • MR – organize computations not over individual machines, but over clusters • The datacenter is the computer

  16. Motivation • Previous models of parallel computation • PRAM – an arbitrary number of processors sharing an unboundedly large memory, operating synchronously on a shared input • LogP, BSP • MR is the most successful abstraction for large-scale computing resources • Manages complexity, hides details, presents well-defined behavior • Makes certain tasks easier, others harder • MapReduce is the first in a new class of programming models
