
Large Scale Data Analysis



  1. Large Scale Data Analysis with Map/Reduce http://www.allsoftsolution.in

  2. Contents
  • Map/Reduce
  • Dryad
  • Sector/Sphere
  • Open source M/R frameworks & tools
    – Hadoop (Yahoo/Apache)
    – Cloud MapReduce (Accenture)
    – Elastic MapReduce (Hadoop on AWS)
    – MR.Flow
  • Some M/R algorithms
    – Graph algorithms, text indexing & retrieval

  3. Contents – Part I: Distributed computing frameworks

  4. Scalability & Parallelisation
  • Scalability approaches
    – Scale up (vertical scaling): only one direction of improvement (bigger box)
    – Scale out (horizontal scaling): two directions – add more nodes + scale up each node
      • Can achieve x4 the performance of a similarly priced scale-up system (ref?)
    – Hybrid ("scale out in a box")
  • Not all algorithms parallelise
    – Algorithms with state
    – Dependencies from one iteration to another (recurrence, induction)

  5. Parallelisation approaches
  • Task decomposition
    – Distribute coarse-grained (synchronisation-wise) and computationally expensive tasks (otherwise too much coordination/management overhead)
    – Dependencies: execution order vs. data dependencies
    – Move the data to the processing (when needed)
  • Data decomposition
    – Each parallel task works with a data partition assigned to it (no sharing)
    – Data has regular structure, i.e. chunks are expected to need the same amount of processing time
    – Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours)
    – Move the processing to the data

  6. Amdahl's law
  • Impossible to achieve linear speedup
    – Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part
  • Amdahl's law
    – max_speedup = 1 / ((1 - P) + P/N)
    – P: proportion of the program that can be parallelised (1 - P remains serial or overhead)
    – N: number of processors / parallel nodes
  • Example: P = 75% (i.e. 25% serial or overhead) => the speedup can never exceed 1/(1-P) = 4, no matter how many nodes are added
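The bound above is easy to sanity-check in a few lines of Python (the helper name is illustrative, not from the deck):

```python
def max_speedup(p, n):
    """Amdahl's law: speedup bound for parallel fraction p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# P = 75% (25% serial/overhead): the bound approaches but never reaches 4
print(max_speedup(0.75, 4))          # ≈ 2.286
print(max_speedup(0.75, 1_000_000))  # just under 4
```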

  7. Map/Reduce
  • Google (2005), US patent (2010)
  • General idea – co-locate data with computation nodes
    – Data decomposition (parallelisation) – no data/order dependencies between tasks (except the Map-to-Reduce phase)
    – Try to utilise data locality (bandwidth is $$$)
  • Implicit data flow (higher abstraction level than MPI)
  • Partial failure handling (failed map/reduce tasks are re-scheduled)
  • Structure
    – Map: for each input (Ki, Vi) produce zero or more output pairs (Km, Vm)
    – Combine: optional intermediate aggregation (less M->R data transfer)
    – Reduce: for input pair (Km, list(V1, V2, …, Vn)) produce zero or more output pairs (Kr, Vr)

  8. Map/Reduce (2) (C) Jimmy Lin

  9. Map/Reduce – examples
  • In other words…
    – Map = partitioning of the data (compute part of a problem across several servers)
    – Reduce = processing of the partitions (aggregate the partial results from all servers into a single result set)
    – The M/R framework takes care of grouping the partitions by key
  • Example: word count
    – Map (1 task per document in the collection)
      • In: docx
      • Out: (term1, count1,x), (term2, count2,x), …
    – Reduce (1 task per term in the collection)
      • In: (term1, <count1,x, count1,y, …, count1,z>)
      • Out: (term1, SUM(count1,x, count1,y, …, count1,z))
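The word-count pairs above can be simulated in plain Python; the in-memory `shuffle` dict stands in for the framework's group-by-key stage (all names here are illustrative, not framework API):

```python
from collections import defaultdict

def wc_map(doc_id, text):
    # Map: one (term, 1) pair per token in the document
    for term in text.lower().split():
        yield term, 1

def wc_reduce(term, counts):
    # Reduce: SUM of all partial counts for one term
    return term, sum(counts)

docs = {"doc1": "the chicken and the egg"}

shuffle = defaultdict(list)  # stands in for the framework's shuffle & sort
for doc_id, text in docs.items():
    for term, count in wc_map(doc_id, text):
        shuffle[term].append(count)

totals = dict(wc_reduce(t, cs) for t, cs in shuffle.items())
# totals["the"] == 2
```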

  10. Map/Reduce – examples (2)
  • Example: shortest path in a graph (naïve)
    – Map: in (nodein, dist); out (nodeout, dist + 1) where nodein -> nodeout
    – Reduce: in (noder, <dista,r, distb,r, …, distc,r>); out (noder, MIN(dista,r, distb,r, …, distc,r))
    – Multiple M/R iterations required; start with (nodestart, 0)
  • Example: inverted indexing (full text search)
    – Map
      • In: docx
      • Out: (term1, (docx, pos'1,x)), (term1, (docx, pos''1,x)), (term2, (docx, pos2,x)), …
    – Reduce
      • In: (term1, <(docx, pos'1,x), (docx, pos''1,x), (docy, pos1,y), …, (docz, pos1,z)>)
      • Out: (term1, <(docx, <pos'1,x, pos''1,x, …>), (docy, <pos1,y>), …, (docz, <pos1,z>)>)
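One iteration of the naïve shortest-path scheme can be sketched as below, on a made-up three-node graph (the graph and names are hypothetical, for illustration only):

```python
from collections import defaultdict

INF = float("inf")
# hypothetical toy graph as adjacency lists
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": []}

def sp_map(node, dist):
    yield node, dist                   # keep the node's own distance
    if dist < INF:
        for target in GRAPH[node]:     # nodein -> nodeout: emit dist + 1
            yield target, dist + 1

def sp_iteration(dist):
    shuffle = defaultdict(list)
    for node, d in dist.items():
        for k, v in sp_map(node, d):
            shuffle[k].append(v)
    return {n: min(ds) for n, ds in shuffle.items()}   # Reduce: MIN

dist = {"A": 0, "B": INF, "C": INF}    # start with (nodestart, 0)
for _ in range(len(GRAPH)):            # enough iterations to converge
    dist = sp_iteration(dist)
# dist == {"A": 0, "B": 1, "C": 1}
```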

  11. Map/Reduce – examples (3)
  • Inverted index example rundown
  • Input
    – Doc1: "Why did the chicken cross the road?"
    – Doc2: "The chicken and egg problem"
    – Doc3: "Kentucky Fried Chicken"
  • Map phase (3 parallel tasks)
    – map1 => ("why",(doc1,1)), ("did",(doc1,2)), ("the",(doc1,3)), ("chicken",(doc1,4)), ("cross",(doc1,5)), ("the",(doc1,6)), ("road",(doc1,7))
    – map2 => ("the",(doc2,1)), ("chicken",(doc2,2)), ("and",(doc2,3)), ("egg",(doc2,4)), ("problem",(doc2,5))
    – map3 => ("kentucky",(doc3,1)), ("fried",(doc3,2)), ("chicken",(doc3,3))

  12. Map/Reduce – examples (4)
  • Inverted index example rundown (cont.)
  • Intermediate shuffle & sort phase
    – ("why", <(doc1,1)>)
    – ("did", <(doc1,2)>)
    – ("the", <(doc1,3), (doc1,6), (doc2,1)>)
    – ("chicken", <(doc1,4), (doc2,2), (doc3,3)>)
    – ("cross", <(doc1,5)>)
    – ("road", <(doc1,7)>)
    – ("and", <(doc2,3)>)
    – ("egg", <(doc2,4)>)
    – ("problem", <(doc2,5)>)
    – ("kentucky", <(doc3,1)>)
    – ("fried", <(doc3,2)>)

  13. Map/Reduce – examples (5)
  • Inverted index example rundown (cont.)
  • Reduce phase (11 parallel tasks)
    – ("why", <(doc1,<1>)>)
    – ("did", <(doc1,<2>)>)
    – ("the", <(doc1,<3,6>), (doc2,<1>)>)
    – ("chicken", <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
    – ("cross", <(doc1,<5>)>)
    – ("road", <(doc1,<7>)>)
    – ("and", <(doc2,<3>)>)
    – ("egg", <(doc2,<4>)>)
    – ("problem", <(doc2,<5>)>)
    – ("kentucky", <(doc3,<1>)>)
    – ("fried", <(doc3,<2>)>)
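The whole rundown fits in a short Python sketch that reproduces the postings above (the driver loop stands in for the framework; function names are made up):

```python
from collections import defaultdict

DOCS = {
    "doc1": "why did the chicken cross the road",
    "doc2": "the chicken and egg problem",
    "doc3": "kentucky fried chicken",
}

def idx_map(doc_id, text):
    # Map: emit (term, (doc, position)) for every token, positions from 1
    for pos, term in enumerate(text.split(), start=1):
        yield term, (doc_id, pos)

def idx_reduce(term, postings):
    # Reduce: group positions per document -> (term, {doc: [positions]})
    by_doc = defaultdict(list)
    for doc_id, pos in sorted(postings):
        by_doc[doc_id].append(pos)
    return term, dict(by_doc)

shuffle = defaultdict(list)  # intermediate shuffle & sort phase
for doc_id, text in DOCS.items():
    for term, posting in idx_map(doc_id, text):
        shuffle[term].append(posting)

index = dict(idx_reduce(t, ps) for t, ps in shuffle.items())
# index["the"] == {"doc1": [3, 6], "doc2": [1]}
```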

  14. Map/Reduce – pros & cons
  • Good for
    – Lots of input, intermediate & output data
    – Little or no synchronisation required
    – "Read once", batch-oriented datasets (ETL)
  • Bad for
    – Fast response time
    – Large amounts of shared data
    – Fine-grained synchronisation required
    – CPU-intensive operations (as opposed to data-intensive)

  15. Dryad
  • Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
  • General purpose distributed execution engine
    – Focus on throughput, not latency
    – Automatic management of scheduling, distribution & fault tolerance
  • Simple DAG model
    – Vertices -> processes (processing nodes)
    – Edges -> communication channels between the processes
  • DAG model benefits
    – Generic scheduler
    – No deadlocks / deterministic
    – Easier fault tolerance

  16. Dryad DAG jobs (C) Michael Isard

  17. Dryad (3)
  • The job graph can mutate during execution (?)
  • Channel types (one way)
    – Files on a DFS
    – Temporary file
    – Shared memory FIFO
    – TCP pipes
  • Fault tolerance
    – Node fails => re-run
    – Input disappears => re-run upstream node
    – Node is slow => run a duplicate copy at another node, get the first result

  18. Dryad architecture & components (C) Mihai Budiu

  19. Dryad programming
  • C++ API (incl. Map/Reduce interfaces)
  • SQL Server Integration Services (SSIS)
    – Many parallel SQL Server instances (each is a vertex in the DAG)
  • DryadLINQ
    – LINQ to Dryad translator
  • Distributed shell
    – Generalisation of the Unix shell & pipes
    – Many inputs/outputs per process!
    – Pipes span multiple machines

  20. Dryad vs. Map/Reduce (C) Mihai Budiu

  21. Contents – Part II: Open source Map/Reduce frameworks

  22. Hadoop
  • Apache Nutch (2004); Yahoo is currently the major contributor
  • http://hadoop.apache.org/
  • Not only a Map/Reduce implementation!
    – HDFS – distributed filesystem
    – HBase – distributed column store
    – Pig – high level query language (SQL-like)
    – Hive – Hadoop-based data warehouse
    – ZooKeeper, Chukwa, Pipes/Streaming, …
  • Also available on Amazon EC2
  • Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)

  23. Hadoop – Map/Reduce
  • Components
    – Job client
    – JobTracker
      • Only one
      • Scheduling, coordinating, monitoring, failure handling
    – TaskTracker
      • Many
      • Executes tasks received from the JobTracker
      • Sends "heartbeats" and progress reports back to the JobTracker
    – TaskRunner
      • The actual Map or Reduce task, started in a separate JVM
      • Crashes & failures do not affect the TaskTracker on the node!

  24. Hadoop – Map/Reduce (2) (C) Tom White

  25. Hadoop – Map/Reduce (3)
  • Integrated with HDFS
    – Map tasks are executed on the HDFS node where the data is (data locality => reduced traffic)
    – Data locality is not possible for Reduce tasks
    – Intermediate outputs of Map tasks are not stored on HDFS, but locally, and then sent to the proper Reduce task (node)
  • Status updates
    – TaskRunner => TaskTracker: progress updates every 3s
    – TaskTracker => JobTracker: heartbeat + progress for all local tasks every 5s
    – If a task has no progress report for too long, it will be considered failed and re-started

  26. Hadoop – Map/Reduce (4)
  • Some extras
  • Counters
    – Gather stats about a task
    – Globally aggregated (TaskRunner => TaskTracker => JobTracker)
    – M/R counters: M/R input records, M/R output records
    – Filesystem counters: bytes read/written
    – Job counters: launched M/R tasks, failed M/R tasks, …
  • Joins
    – Copy the small set to each node and perform the joins locally. Useful when one dataset is very large and the other very small (e.g. "Scalable Distributed Reasoning using MapReduce" from VUA)
    – Map-side join – data is joined before the Map function; very efficient but less flexible (datasets must be partitioned & sorted in a particular way)
    – Reduce-side join – more general but less efficient (Map generates (K,V) pairs using the join key)
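The "copy the small set" idea can be sketched as a broadcast join in a few lines; the lookup table and records below are made up for illustration and are not from any real dataset:

```python
# hypothetical small dataset, "copied" to every node that runs map tasks
COUNTRY_NAMES = {"us": "United States", "bg": "Bulgaria"}

def join_map(user, country_code):
    # join locally against the broadcast table -- no shuffle needed for the join
    yield user, (country_code, COUNTRY_NAMES.get(country_code, "unknown"))

records = [("alice", "us"), ("boyan", "bg"), ("carol", "de")]
joined = {u: v for user, code in records for u, v in join_map(user, code)}
# joined["alice"] == ("us", "United States")
```

The trade-off matches the slide: this only works while the small table fits in each node's memory; otherwise a reduce-side join on the join key is needed.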

  27. Hadoop – Map/Reduce (5)
  • Built-in mappers and reducers
    – Chain – run a chain/pipe of sequential Maps (M+RM*); the last Map output is the Task output
    – FieldSelection – select a list of fields from the input dataset to be used as M/R keys/values
    – TokenCounterMapper, SumReducer – (remember the "word count" example?)
    – RegexMapper – matches a regex in the input key/value pairs

  28. Cloud MapReduce
  • Accenture (2010)
  • http://code.google.com/p/cloudmapreduce/
  • Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
    – Fast (reported as up to 60 times faster than Hadoop/EC2 in some cases)
    – Scalable & robust (no single point of bottleneck or failure)
    – Simple (3 KLOC)
  • Features
    – No need for a centralised coordinator (JobTracker); job status is kept in the cloud datastore (SimpleDB)
    – All data transfer & communication is handled by the Cloud
    – All I/O and storage is handled by the Cloud

  29. Cloud MapReduce (2) (C) Ricky Ho

  30. Cloud MapReduce (3)
  • Job client workflow
    – Store input data (S3)
    – Create a Map task for each data split & put it into the Mapper Queue (SQS)
    – Create Multiple Partition Queues (SQS)
    – Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue
    – Create the Output Queue (SQS)
    – Create a Job Request (ref to all queues) and put it into SimpleDB
    – Start EC2 instances for Mappers & Reducers
    – Poll SimpleDB for job status
    – When the job completes, download the results from S3

  31. Cloud MapReduce (4)
  • Mapper workflow
    – Dequeue a Map task from the Mapper Queue
    – Fetch the data from S3
    – Perform the user-defined map function; add the output (Km, Vm) pairs to a Multiple Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue!
    – When done, remove the Map task from the Mapper Queue
  • Reducer workflow
    – Dequeue a Reduce task from the Reducer Queue
    – Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
    – Perform the user-defined reduce function and add the output pairs (Kr, Vr) to the Output Queue
    – When done, remove the Reduce task from the Reducer Queue
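The hash(Km)-to-queue assignment can be sketched as follows; the queue count and key names are illustrative, and a stable hash is used because several keys must land deterministically in a fixed pool of queues:

```python
import hashlib

NUM_PARTITION_QUEUES = 4   # illustrative; a real job would configure this

def partition_queue(key):
    # stable hash (Python's built-in hash() is salted per process)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITION_QUEUES

# more map-output keys than queues => some keys must share a queue
keys = ["chicken", "egg", "road", "kentucky", "fried"]
assignment = {k: partition_queue(k) for k in keys}
```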

  32. MR.Flow
  • Web based M/R editor
  • http://www.mr-flow.com
  • Reusable M/R modules
  • Execution & status monitoring (Hadoop clusters)

  33. Contents – Part III: Some Map/Reduce algorithms

  34. General considerations
  • Map execution order is not deterministic
  • Map processing time cannot be predicted
  • Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned)
  • Not suitable for continuous input streams
  • There will be a spike in network utilisation after the Map / before the Reduce phase
  • Number & size of key/value pairs
    – Object creation & serialisation overhead (Amdahl's law!)
  • Aggregate partial results when possible!
    – Use Combiners
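The Combiner advice is easy to see on the word-count example: summing counts locally before the shuffle sends one pair per distinct term instead of one pair per token (the function names below are made up for the sketch):

```python
from collections import Counter

def wc_map_raw(text):
    # without a combiner: one (term, 1) pair per token crosses the network
    return [(t, 1) for t in text.lower().split()]

def wc_map_combined(text):
    # with a local combiner: one (term, local_count) pair per distinct term
    return list(Counter(text.lower().split()).items())

text = "the chicken and the egg and the road"
print(len(wc_map_raw(text)))       # 8 pairs shuffled
print(len(wc_map_combined(text)))  # 5 pairs shuffled
```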

  35. Graph algorithms
  • Very suitable for M/R processing
    – Data (graph node) locality
    – "Spreading activation" type of processing
  • Some algorithms with sequential dependencies are not suitable for M/R
    – Breadth-first search algorithms work better than depth-first
  • General approach
    – Graph represented by adjacency lists
    – Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result
    – Reduce task – aggregate values by key
    – Perform multiple iterations (with a termination criterion)

  36. Social Network Analysis
  • Problem: recommend new friends (friend-of-a-friend, FOAF)
  • Map task
    – U (the target user) is fixed and its friends list is copied to all cluster nodes ("copy join"); each cluster node stores part of the social graph
    – In: (X, <friendsX>), i.e. the local data for the cluster node
    – Out:
      • if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U
      • nil otherwise
  • Reduce task
    – In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
    – Out: (U, <(X1, N1), (X2, N2), …>), where each Xi is a FOAF for U, and Ni is its total number of occurrences in all FOAF lists (sort/rank the result!)
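The FOAF scheme above can be sketched on a tiny made-up social graph; the graph, user names, and function names are all hypothetical:

```python
from collections import Counter

# hypothetical social graph: user -> set of friends
FRIENDS = {
    "u": {"a", "b"},
    "a": {"u", "c", "d"},
    "b": {"u", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

def foaf_map(target, node, friends_of_node):
    # emit only when the node is a direct friend of the target:
    # its friends, minus the target's existing friends and the target itself
    if target in friends_of_node:
        yield target, friends_of_node - FRIENDS[target] - {target}

def foaf_reduce(target, candidate_lists):
    counts = Counter()
    for candidates in candidate_lists:
        counts.update(candidates)
    return target, counts.most_common()  # rank by number of mutual friends

lists = [cands for node, fr in FRIENDS.items()
         for _, cands in foaf_map("u", node, fr)]
user, ranked = foaf_reduce("u", lists)
# ranked == [("c", 2), ("d", 1)]: "c" is reachable via both "a" and "b"
```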

  37. PageRank with M/R (C) Jimmy Lin

  38. Text Indexing & Retrieval
  • Indexing is very suitable for M/R
    – Focus on scalability, not on latency & response time
    – Batch oriented
  • Map task
    – Emit (Term, (DocID, position))
  • Reduce task
    – Group pairs by Term and sort by DocID

  39. Text Indexing & Retrieval (2) (C) Jimmy Lin

  40. Text Indexing & Retrieval (3)
  • Retrieval is not suitable for M/R
    – Focus on response time
    – Startup of Mappers & Reducers is usually prohibitively expensive
  • Katta
    – http://katta.sourceforge.net/
    – Distributed Lucene indexing with Hadoop (HDFS)
    – Multicast querying & ranking

  41. Useful links
  • "MapReduce: Simplified Data Processing on Large Clusters"
  • "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"
  • "Cloud MapReduce Technical Report"
  • Data-Intensive Text Processing with MapReduce
  • Hadoop – The Definitive Guide
