Cloud Computing: Other MapReduce issues

Keke Chen

Outline

  • MapReduce performance tuning

  • Comparing MapReduce with DBMS

    • Negative arguments from the DB community

    • Experimental study (DB vs. MapReduce)

Web interfaces

  • http://localhost:50030/ - web UI for the job tracker (tracking MapReduce jobs)

  • http://localhost:50060/ - web UI for task tracker(s)

  • http://localhost:50070/ - web UI for HDFS name node(s)

  • From a text-only terminal: lynx http://localhost:50030/
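
As a quick way to check which of these web UIs are answering, the ports above can be probed from a script. This is a minimal sketch: the role names and the helper functions `web_ui_urls` and `is_up` are made up for illustration, and only the ports come from the slide.

```python
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen

# Ports taken from the slide; the role names are labels chosen for this sketch.
HADOOP_WEB_UI_PORTS = {
    "jobtracker": 50030,
    "tasktracker": 50060,
    "namenode": 50070,
}

def web_ui_urls(host="localhost"):
    """Build the web UI URL for each daemon on the given host."""
    return {role: f"http://{host}:{port}/"
            for role, port in HADOOP_WEB_UI_PORTS.items()}

def is_up(url, timeout=2.0):
    """Return True if the URL answers an HTTP request within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (URLError, OSError, HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
```

Calling `is_up(url)` for each entry of `web_ui_urls()` gives a rough up/down view of a single-node setup; on a real cluster the host name would differ per daemon.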

Configuration files

  • Directory /usr/local/hadoop/conf/

    • hdfs-site.xml

    • mapred-site.xml

    • core-site.xml

  • Some parameter tuning


  • Some tips
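
The files listed above all share one layout: a `<configuration>` root holding `<property>` entries with a `<name>` and a `<value>`. The sketch below generates such a file from a dictionary. The helper `build_site_xml` is hypothetical, the property names are classic Hadoop 1.x settings, and the values are illustrative assumptions rather than recommendations from the slides.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of property names/values as Hadoop *-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Illustrative values only; suitable settings depend on the cluster hardware.
mapred_site = build_site_xml({
    "mapred.tasktracker.map.tasks.maximum": 4,     # map slots per task tracker
    "mapred.tasktracker.reduce.tasks.maximum": 2,  # reduce slots per task tracker
    "mapred.compress.map.output": "true",          # compress intermediate map output
    "io.sort.mb": 200,                             # map-side sort buffer size in MB
})
print(mapred_site)
```

The resulting text could be saved as mapred-site.xml in the configuration directory; whether these particular values help depends on the workload.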


7 tips

  • Configure your cluster correctly

  • Use LZO compression

  • Tune the number of maps/reduces

  • Use a combiner

  • Use appropriate and compact Writables

  • Reuse Writables

  • Profile tasks
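
To illustrate why the combiner tip matters, here is a small simulation of word count with and without map-side pre-aggregation. It is a plain Python sketch rather than Hadoop code; the function names and the sample input are made up for illustration.

```python
from collections import Counter, defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the map side, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(shuffled):
    """Reduce step: sum all counts that arrive for each word."""
    totals = defaultdict(int)
    for word, count in shuffled:
        totals[word] += count
    return dict(totals)

# Each string stands in for the input of one map task.
splits = ["a b a a", "b b a"]

without = [pair for split in splits for pair in map_words(split)]
with_combiner = [pair for split in splits for pair in combine(map_words(split))]

# Records sent through the shuffle in each case.
print(len(without), len(with_combiner))  # prints: 7 4
```

Both variants reduce to the same totals; the combiner only shrinks the number of records that cross the network.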

Debates on MapReduce

  • Most opinions are from the database camp

  • Database techniques for distributed data stores

    • Parallel databases

  • Traditional database approaches

    • Schema

    • SQL

DB guys’ initial response to MapReduce

  • “MapReduce: a giant step backward”

    • A sub-optimal implementation, in that it uses brute force instead of indexing

    • Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

    • Missing most of the features that are routinely included in current DBMS

    • Incompatible with all of the tools DBMS users have come to depend on

  • A giant step backward in the programming paradigm for large-scale data intensive applications

    • Schema

    • Separating schema from code

    • High-level language

  • Responses to “a giant step backward in the programming paradigm”

    • MR handles large data sets that have no schema

    • It takes time to clean large data and load it into a DB

    • High-level languages have been developed: Pig, Hive, etc.

    • Some problems cannot be handled by SQL

    • Unix-style programming (pipelined processing) is used by many users

  • Responses to “a sub-optimal implementation”

    • Google’s experience has shown it can scale well

    • Indexing is not possible if the data has no schema

    • MapReduce is used to generate the web index

    • Writing back to disk increases fault tolerance

  • Responses to “not novel at all”

    • The authors do not claim it is novel.

    • Many users are already using similar ideas in their own distributed solutions

    • MapReduce serves as a well-developed library

  • Responses to “missing features” and “incompatible with DBMS tools”

    • MapReduce is not a DBMS; it is designed for different purposes

    • In practice, it does not prevent engineers from implementing solutions quickly

    • Engineers usually take more time to learn a DBMS

    • DBMSs do not scale to the level of MapReduce applications

Some important problems

  • Experimental study on scalability

  • High-level language

Experimental study

  • SIGMOD ’09: “A comparison of approaches to large-scale data analysis”

    • Compares parallel SQL DBMSs (the anonymous DBMS-X and Vertica) with MapReduce (Hadoop)

    • Tasks

      • Grep

      • Typical DB tasks: selection, aggregation, join, UDF

Grep task

  • 2 settings: 535 MB/node, 1 TB/cluster

  • Hadoop is much faster in loading data

Grep task execution

  • Hadoop is the slowest…

Selection task

  • SELECT pageURL, pageRank FROM Rankings WHERE pageRank > x
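
On the MapReduce side, a selection like this needs only a map step that filters records. The following is a minimal Python sketch of that idea, not code from the study; the function name, the record layout, and the sample rows are assumptions.

```python
def select_map(record, threshold):
    """Map step: emit (pageURL, pageRank) only when pageRank exceeds the threshold."""
    page_url, page_rank = record
    if page_rank > threshold:
        yield page_url, page_rank

# Made-up rows standing in for the Rankings table.
rankings = [("a.com", 12), ("b.com", 3), ("c.com", 40)]

# No reduce step is needed: the map output is already the query result.
selected = [out for row in rankings for out in select_map(row, threshold=10)]
print(selected)  # prints: [('a.com', 12), ('c.com', 40)]
```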

Aggregation task

  • SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
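
The GROUP BY query maps naturally onto one map step and one reduce step: the map emits (sourceIP, adRevenue) and the reduce sums per key. The sketch below simulates this in plain Python; the function names and sample rows are illustrative assumptions.

```python
from collections import defaultdict

def agg_map(visit):
    """Map step: key each visit by sourceIP and carry its adRevenue."""
    source_ip, ad_revenue = visit
    yield source_ip, ad_revenue

def shuffle(pairs):
    """Stand-in for the shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def agg_reduce(source_ip, revenues):
    """Reduce step: total revenue for one sourceIP."""
    return source_ip, sum(revenues)

# Made-up rows standing in for UserVisits, reduced to (sourceIP, adRevenue).
visits = [("1.1.1.1", 2.0), ("2.2.2.2", 1.5), ("1.1.1.1", 0.5)]

pairs = [pair for visit in visits for pair in agg_map(visit)]
result = dict(agg_reduce(ip, revs) for ip, revs in shuffle(pairs).items())
print(result)  # prints: {'1.1.1.1': 2.5, '2.2.2.2': 1.5}
```

Because summation is associative, the same reduce logic could also serve as a combiner on the map side.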

Join task

  • MapReduce takes 3 phases to do it
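
The slide does not spell out the three phases. The sketch below assumes they are (1) filter UserVisits by date and join with Rankings on the URL, (2) aggregate per sourceIP, and (3) pick the sourceIP with the largest total revenue. The column layouts, date range, and sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Made-up rows. Assumed layouts:
#   rankings: (pageURL, pageRank)
#   visits:   (sourceIP, destURL, visitDate, adRevenue)
rankings = [("a.com", 10), ("b.com", 30)]
visits = [
    ("1.1.1.1", "a.com", "2000-01-16", 1.0),
    ("1.1.1.1", "b.com", "2000-01-17", 2.0),
    ("2.2.2.2", "b.com", "2000-01-18", 0.5),
    ("2.2.2.2", "a.com", "2000-02-01", 9.0),  # outside the date range, dropped
]

def phase1_filter_and_join(rankings, visits, start, end):
    """Reduce-side join on URL: group rows from both tables by URL, then pair them."""
    groups = defaultdict(lambda: {"ranks": [], "visits": []})
    for url, rank in rankings:
        groups[url]["ranks"].append(rank)
    for ip, url, date, revenue in visits:
        if start <= date <= end:  # ISO dates compare correctly as strings
            groups[url]["visits"].append((ip, revenue))
    for group in groups.values():
        for rank in group["ranks"]:
            for ip, revenue in group["visits"]:
                yield ip, (rank, revenue)

def phase2_aggregate(joined):
    """Group by sourceIP: average pageRank and total adRevenue."""
    by_ip = defaultdict(list)
    for ip, value in joined:
        by_ip[ip].append(value)
    for ip, values in by_ip.items():
        ranks = [rank for rank, _ in values]
        total = sum(revenue for _, revenue in values)
        yield ip, sum(ranks) / len(ranks), total

def phase3_top(aggregated):
    """Single reducer: keep the sourceIP with the largest total revenue."""
    return max(aggregated, key=lambda row: row[2])

joined = phase1_filter_and_join(rankings, visits, "2000-01-15", "2000-01-22")
top = phase3_top(phase2_aggregate(joined))
print(top)  # prints: ('1.1.1.1', 20.0, 3.0)
```

Each function stands in for one MapReduce job, so the intermediate results would be written out and re-read between phases, which is part of why a join costs more here than in a parallel DBMS.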

UDF task

  • Extract URLs from documents, then count

  • Vertica does not support UDFs, so a special program is used instead
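
In MapReduce terms, the UDF task is close to word count: the map step extracts URLs from each document and the reduce step counts them. The sketch below uses a deliberately simple regular expression; the pattern, function names, and sample documents are assumptions, not the program used in the study.

```python
import re
from collections import Counter

# Deliberately simple pattern: a URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def udf_map(document):
    """Map step: emit (url, 1) for every URL found in the document."""
    for url in URL_PATTERN.findall(document):
        yield url, 1

def udf_reduce(pairs):
    """Reduce step: count occurrences per URL."""
    counts = Counter()
    for url, n in pairs:
        counts[url] += n
    return counts

# Made-up documents for illustration.
documents = [
    'see <a href="http://a.com/x">x</a> and http://b.com/',
    "http://a.com/x again",
]

counts = udf_reduce(pair for doc in documents for pair in udf_map(doc))
print(dict(counts))  # prints: {'http://a.com/x': 2, 'http://b.com/': 1}
```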

Discussion

  • System level

    • MR is easy to install and configure; parallel DBMSs are more challenging to install

    • Performance-tuning tools are available for parallel DBMSs

    • MR takes more time in task start-up

    • Compression does not help MR

    • MR has a short loading time, while DBMSs take some time to reorganize the input data

    • MR has better fault tolerance

Discussion (continued)

  • User level

    • MR is easier to program

    • Cost of maintaining MR code

    • SQL DBMSs have better tool support