
CPS 216: Data-intensive Computing Systems



  1. CPS 216: Data-intensive Computing Systems
     Information about Project 1
     Shivnath Babu

  2. Project 1: Overview
     • Project 1 (Sept to late Nov):
       • Processing collections of records: Systems like Pig, Hive, Jaql, Cascading, Cascalog, HadoopDB
       • Matrix and graph computations: Systems like Rhipe, Ricardo, SystemML, Mahout, Pregel, Hama
       • Data stream processing: Systems like Flume, FlumeJava, S4, STREAM, Scribe, STORM
       • Data serving systems: Systems like BigTable/HBase, Dynamo/Cassandra, CouchDB, MongoDB, Riak, VoltDB
     • Project 1 will have regular milestones. The final report will include:
       1. What are properties of the data encountered?
       2. What are concrete examples of workloads that are run? Develop a benchmark workload that you will implement and use in Step 5.
       3. What are typical goals and requirements?
       4. What are typical systems used, and how do they compare with each other?
       5. Install some of these systems and do an experimental evaluation of 1, 2, 3, & 4.
     • Project 2 (late Nov to end of class): Of your own choosing. Could be a significant new feature added to Project 1.

  3. Group 1: Processing Collections of Records
     • Workloads:
       • See “The Case for Evaluating MapReduce Performance Using Workload Suites” for pointers to a number of possible MapReduce workloads: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.html
       • Citation 12 in the paper: Pavlo, Paulson, and others (comes with data)
       • TPC-H: http://www.tpc.org/tpch/ (comes with data)
       • If things work out: a real Hadoop+HBase workload that Akamai uses
     • Systems:
       • Hadoop
       • Pig
       • Hive
       • A hybrid system like HadoopDB

  4. Group 2: Matrix and Graph Computations
     • Workloads:
       • Matrix computations, e.g., PLSA
       • Graph computations, e.g., PageRank
       • Machine-learning workloads (are of interest to Groups 1 and 2)
     • Systems:
       • Hadoop
       • Spark / Twister
       • RHIPE
       • (Mahout)
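     As a point of reference for the PageRank workload named above, here is a minimal single-machine sketch in Python. It is not taken from the course materials: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from dangling nodes, and the toy graph are all assumptions chosen for illustration. A Hadoop or Spark version would distribute the same per-iteration update across machines.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it links to. Every link
    target must also appear as a key. The damping value and iteration
    count are illustrative assumptions, not course requirements.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node splits its current rank equally among its out-links.
        for node in nodes:
            out_links = graph[node]
            for target in out_links:
                new_rank[target] += damping * rank[node] / len(out_links)
        rank = new_rank
    return rank


# Hypothetical four-page graph used only to exercise the function.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

     The ranks always sum to 1, which gives a quick sanity check when the same computation is ported to one of the systems listed for this group.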

  5. Group 3: Data Stream Processing
     • Workloads:
       • Behavioral Targeting: http://research.microsoft.com/apps/pubs/default.aspx?id=150002
       • Linear Road Benchmark: http://pages.cs.brandeis.edu/~linearroad/
     • Systems:
       • Hadoop
       • Flume and FlumeBase
       • Hadoop + HBase

  6. Group 4: Data Serving Systems
     • Workloads:
       • YCSB: https://github.com/brianfrankcooper/YCSB/wiki
       • YCSB++
     • Systems (no need to do them all):
       • HDFS (not the full Hadoop) or MapR
       • HBase (original design comes from Google BigTable)
       • Cassandra / Riak (original design comes from Amazon Dynamo)
       • VoltDB (parallel in-memory database)
       • CouchDB / MongoDB (document stores)

  7. Upcoming Milestones
     1. Read about the workloads, performance goals, etc. Discuss within your group. Pick one workload or come up with your own. Write a report by Sept 23. You can do it as part of a group or on your own.
     2. One part of programming assignment 2 will involve writing and running the workload using Hadoop/HDFS/MapR. This assignment will be done on Amazon EC2. It is done individually, but group discussion is fine.
     3. Later in Project 1, you will compare the performance on Hadoop/HDFS/MapR seen in Step 2 with that of the other systems you will use.
