Cloud Computing

Presentation Transcript


  1. Cloud Computing: Other High-Level Parallel Processing Languages • Keke Chen

  2. Outline • Sawzall • Dryad and DryadLINQ (MS, abandoned) • Hive

  3. Sawzall • Simplifies MapReduce programming • Filters (mapper) + aggregators (reducer)

  4. Example • Mappers: convert the input record to float • Reducers: aggregate the emitted values

  5. Input • A Sawzall program works on a single record • It acts as a filter over the data stream • Input can be parsed into • Values, e.g., float • Data structures • Syntax: variable: type = input; e.g., x: float = input;
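
The per-record filter model above can be imitated in Python. This is a minimal sketch and not Sawzall itself: `process_record` stands in for the Sawzall program, which sees one record at a time, and the table names `count`, `total`, and `sum_of_squares` are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable


def process_record(raw: str, emit: Callable[[str, float], None]) -> None:
    """Handle exactly one input record, like a Sawzall program body."""
    x = float(raw)                  # plays the role of: x: float = input;
    emit("count", 1)                # hypothetical sum table
    emit("total", x)                # hypothetical sum table
    emit("sum_of_squares", x * x)   # hypothetical sum table


def run(records: Iterable[str]) -> Dict[str, float]:
    """Drive the filter over a stream and aggregate what it emits."""
    tables: Dict[str, float] = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}

    def emit(name: str, value: float) -> None:
        # Every table in this sketch is a plain sum aggregator.
        tables[name] += value

    for raw in records:
        process_record(raw, emit)
    return tables
```

Because `process_record` keeps no state between records, the stream could be split across many mappers without changing the result.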

  6. Aggregators • Definition • table agg_name of data_type/variable • Examples: • c: table collection of string; • S: table sample(100) of string; • S: table sum of {count: int, revenue: float}; • More aggregators • maximum, quantile, top, unique
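
To make the three example tables concrete, here is a rough Python sketch of what each aggregator might do with emitted values. The class names are hypothetical, and reservoir sampling is a stand-in chosen for this sketch; the slides do not say how Sawzall's `sample` is implemented.

```python
import random
from typing import List, Sequence


class Collection:
    """Like 'table collection of string': keeps every emitted value."""

    def __init__(self) -> None:
        self.items: List[str] = []

    def emit(self, value: str) -> None:
        self.items.append(value)


class Sum:
    """Like 'table sum of {count: int, revenue: float}': element-wise sum."""

    def __init__(self, width: int) -> None:
        self.total: List[float] = [0] * width

    def emit(self, values: Sequence[float]) -> None:
        for i, v in enumerate(values):
            self.total[i] += v


class Sample:
    """Like 'table sample(100) of string': keeps at most n values.

    Uses reservoir sampling (an assumption of this sketch).
    """

    def __init__(self, n: int, seed: int = 0) -> None:
        self.n = n
        self.seen = 0
        self.items: List[str] = []
        self.rng = random.Random(seed)

    def emit(self, value: str) -> None:
        self.seen += 1
        if len(self.items) < self.n:
            self.items.append(value)
        else:
            # Replace a kept value with probability n / seen.
            j = self.rng.randrange(self.seen)
            if j < self.n:
                self.items[j] = value
```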

  7. Indexed aggregators • Similar to “group by”; the index is the group id • Example: t1: table sum[country: string] of int; country: string = input; emit t1[country] <- 1;
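
The "group by" behavior of an indexed sum table can be sketched in Python with a dictionary keyed by the index. This is only an analogy to the `t1` example above; the function name `count_by_country` is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def count_by_country(records: Iterable[str]) -> Dict[str, int]:
    """Mimic: t1: table sum[country: string] of int; emit t1[country] <- 1;"""
    t1: Dict[str, int] = defaultdict(int)
    for country in records:   # each record is one country string
        t1[country] += 1      # the index selects the group, the sum adds 1
    return dict(t1)
```

Each distinct index value gets its own running sum, which is the same result a SQL `GROUP BY country` with `COUNT(*)` would give.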

  8. More examples • proto "querylog.proto" • queries_per_degree: table sum[lat: int][lon: int] of int; • log_record: QueryLogProto = input; • loc: Location = locationinfo(log_record.ip); • emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

  9. Performance • Single-CPU speed • Also 51 times slower than compiled C++

  10. Performance

  11. Dryad and DryadLINQ • Dryad provides a low-level parallel data flow processing interface • Acyclic data flow graphs • Data communication methods include pipes, files, messages, and shared memory • DryadLINQ • A high-level language for application developers • Hides the data flow details

  12. Job = Directed Acyclic Graph • Inputs • Processing vertices • Channels (file, pipe, shared memory) • Outputs
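
The job-as-DAG idea can be illustrated with a small single-process Python sketch. It is not Dryad's API: `run_job` and its arguments are hypothetical, channels are modeled as plain in-memory values, and vertices run one after another in topological order rather than in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Tuple


def run_job(
    vertices: Dict[str, Callable[[List[Any]], Any]],
    edges: List[Tuple[str, str]],
    inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Run each processing vertex once all of its upstream channels are ready.

    vertices: vertex name -> function taking the list of upstream outputs
    edges:    (source, destination) pairs, i.e. the channels
    inputs:   input node name -> its data
    """
    preds: Dict[str, List[str]] = {n: [] for n in list(vertices) + list(inputs)}
    for src, dst in edges:
        preds[dst].append(src)

    results: Dict[str, Any] = dict(inputs)
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(preds).static_order():
        if name in results:
            continue  # input nodes already hold their data
        results[name] = vertices[name]([results[p] for p in preds[name]])
    return results
```

A real runtime would schedule ready vertices on different machines and choose a channel type per edge; the ordering constraint is the part this sketch preserves.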

  13. Runtime • Services • Name server • Daemon • Job Manager (JM) • Centralized coordinating process • User application to construct graph • Linked with Dryad libraries for scheduling vertices • Vertex executable • Dryad libraries to communicate with JM • User application sees channels in/out • Arbitrary application code, can use local FS

  14. Graph operators

  15. Hive • Developed by Facebook (open source) • Mimics the SQL language • Built on Hadoop/MapReduce

  16. Hive data model: tables, partitions, buckets • Table • Similar to a DB table • Stored in Hadoop directories • Built-in compression, serialization/deserialization • Partitions • Groups in the table • Subdirectory in the table directory • Buckets • Files in the partition directory • Key (column) based partition • /table/partition/bucket1
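
As a rough illustration of the `/table/partition/bucket1` layout, the Python sketch below maps a row's bucketing key to a file path. Everything here is an assumption made for illustration: the helper name `bucket_path`, the use of CRC32 as the hash (Hive uses its own hash function), and the `bucketN` file naming, which simply follows the slide.

```python
import zlib


def bucket_path(table: str, partition: str, key: object, num_buckets: int) -> str:
    """Return a path of the form /table/partition/bucketN for one row.

    The key column is hashed and reduced modulo the bucket count, so rows
    with the same key always land in the same bucket file.
    """
    # CRC32 is a stand-in hash for this sketch, not Hive's actual function.
    bucket = zlib.crc32(str(key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition}/bucket{bucket + 1}"
```

For example, `bucket_path("status_updates", "ds=2009-03-20", 42, 4)` yields a path under `/status_updates/ds=2009-03-20/`, with the bucket number determined by the hash of the key.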

  17. Hive data model: Column type • integers, floating point numbers, generic strings, dates and booleans • nestable collection types: array and map.

  18. Architecture • The metastore stores the schema of databases • It uses a non-HDFS data store

  19. Query processing • Steps (similar to a DBMS) • Parser • Semantic analyzer • Logical plan generator (algebra tree) • Optimizer • Physical plan generator (to MapReduce jobs)

  20. Operations: DDL and DML • HiveQL: SQL-like, with slightly different syntax • User-defined filtering and aggregation functions • Java only • Map/reduce plugins for stream processing • Can be implemented in any language
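
A streaming plugin is just a program that reads rows on standard input and writes rows on standard output, which is why any language works. The Python sketch below assumes tab-separated rows with the three columns `userid`, `status`, `ds`; the output (a per-row word count) and the function names are hypothetical, picked only to show the shape of such a script.

```python
import sys
from typing import Iterable, Iterator


def map_rows(lines: Iterable[str]) -> Iterator[str]:
    """Turn each 'userid<TAB>status<TAB>ds' row into 'userid<TAB>word_count'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows instead of failing the whole job
        userid, status, _ds = fields
        yield f"{userid}\t{len(status.split())}"


def main() -> None:
    """Entry point when run as a streaming script: stdin in, stdout out."""
    for out in map_rows(sys.stdin):
        sys.stdout.write(out + "\n")
```

Keeping the row logic in `map_rows` and the I/O in `main` makes the transformation easy to test without a cluster.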

  21. Example • Facebook status updates • Tables: • status_updates(userid int, status string, ds string) • profiles(userid int, school string, gender int) • Operations • Load data: LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20') • Count status updates by school and by gender
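
The "count status updates by school and by gender" operation amounts to joining the two tables on `userid` and then grouping twice. The Python sketch below shows that logic on in-memory tuples; it illustrates the computation only, not how Hive executes it, and the function name `count_updates` is made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

StatusRow = Tuple[int, str, str]    # (userid, status, ds)
ProfileRow = Tuple[int, str, int]   # (userid, school, gender)


def count_updates(
    status_updates: Iterable[StatusRow],
    profiles: Iterable[ProfileRow],
) -> Tuple[Dict[str, int], Dict[int, int]]:
    """Join on userid, then count updates per school and per gender."""
    by_user = {userid: (school, gender) for userid, school, gender in profiles}
    school_counts: Counter = Counter()
    gender_counts: Counter = Counter()
    for userid, _status, _ds in status_updates:
        if userid not in by_user:
            continue  # inner join: drop updates with no matching profile
        school, gender = by_user[userid]
        school_counts[school] += 1
        gender_counts[gender] += 1
    return dict(school_counts), dict(gender_counts)
```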

  22. More query examples

  23. Query examples

  24. Query examples using Hadoop streaming
