
SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets



Presentation Transcript


  1. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets Prepared by: Kholoud Alsmearat

  2. Outline • SCOPE • Large-scale Distributed Computing • Map-Reduce programming model • SCOPE / Cosmos • Input and Output • Select and Join • User Defined Operators • Process • Reduce • Combine • Importing script

  3. SCOPE • Structured Computations Optimized for Parallel Execution • A declarative scripting language targeted at massive data analysis • Easy to use: SQL-like syntax plus MapReduce-like extensions • Highly extensible: • Fully integrated with the .NET framework • Flexible programming style: nested expressions or a series of simple transformations

  4. SCOPE cont. • SCOPE borrows several features from SQL. • Users can easily define their own functions and implement their own versions of operators: • Extractor: parsing rows from a file • Processor: row-wise processing • Reducer: group-wise processing • Combiner: combining rows from two inputs

  5. Large-scale Distributed Computing • Because the data sets are so large, a traditional parallel database solution can be prohibitively expensive, so several companies have developed distributed data storage and processing systems on large clusters of shared-nothing commodity machines.

  6. Map-Reduce • The Map-Reduce programming model • A good abstraction of group-by-aggregation operations • Map function -> grouping • Reduce function -> aggregation • Map-Reduce achieves parallel processing. • Limitations: • For some applications it is unnatural to use the Map-Reduce model • Such custom code is error-prone and hardly reusable.

  7. SCOPE / Cosmos • Cosmos Storage System: a distributed storage subsystem designed to efficiently store large sequential files. • Data is compressed and replicated. • Cosmos Execution Environment: an environment for deploying, executing, and debugging distributed applications.

  8. An Example: QCount • Compute the popular queries that have been requested at least 1000 times.
  Scenario 1:
    SELECT query, COUNT(*) AS count
    FROM "search.log" USING LogExtractor
    GROUP BY query
    HAVING count > 1000
    ORDER BY count DESC;
    OUTPUT TO "qcount.result"
  Scenario 2:
    e  = EXTRACT query FROM "search.log" USING LogExtractor;
    s1 = SELECT query, COUNT(*) AS count FROM e GROUP BY query;
    s2 = SELECT query, count FROM s1 WHERE count > 1000;
    s3 = SELECT query, count FROM s2 ORDER BY count DESC;
    OUTPUT s3 TO "qcount.result"
  • Every rowset has a well-defined schema.

  9. Input and Output • EXTRACT and OUTPUT commands provide a relational abstraction of underlying data sources • Built-in/customized extractors and outputters (C# classes)
    EXTRACT column[:<type>] [, …]
    FROM <input_stream(s)>
    USING <Extractor> [(args)]
    [HAVING <predicate>]

    OUTPUT [<input>]
    TO <output_stream>
    [USING <Outputter> [(args)]]

    public class LineitemExtractor : Extractor
    {
        …
        public override Schema Produce(string[] requestedColumns, string[] args)
        { … }
        public override IEnumerable<Row> Extract(StreamReader reader, Row outputRow, string[] args)
        { … }
    }
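
  A minimal usage sketch following the syntax above (not from the slides: the column list, type annotations, and output stream name are illustrative, and LogExtractor is reused from the QCount example):

    e = EXTRACT query:string, time:string
        FROM "search.log"
        USING LogExtractor;

    OUTPUT e TO "extracted_queries.result";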

  10. Select and Join • Supports different aggregation functions: COUNT, COUNTIF, MIN, MAX, SUM, AVG, STDEV, VAR, FIRST, LAST. • No subqueries (but the same functionality is available via outer joins)
    SELECT [DISTINCT] [TOP count] select_expression [AS <name>] [, …]
    FROM { <input stream(s)> USING <Extractor> |
           {<input> [<joined input> […]]} [, …] }
    [WHERE <predicate>]
    [GROUP BY <grouping_columns> [, …]]
    [HAVING <predicate>]
    [ORDER BY <select_list_item> [ASC | DESC] [, …]]

    joined input: <join_type> JOIN <input> [ON <equijoin>]
    join_type: [INNER | {LEFT | RIGHT | FULL} OUTER]
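
  A hedged join-plus-aggregation sketch following this template (the click.log stream, its columns, the reuse of LogExtractor, and the threshold are assumptions made for illustration):

    queries = EXTRACT query, userId FROM "search.log" USING LogExtractor;
    clicks  = EXTRACT clickUrl, userId FROM "click.log" USING LogExtractor;

    result = SELECT query, COUNT(*) AS clickCount
             FROM queries INNER JOIN clicks ON queries.userId == clicks.userId
             GROUP BY query
             HAVING clickCount > 10
             ORDER BY clickCount DESC;

    OUTPUT result TO "query_clicks.result";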

  11. Deep Integration with .NET (C#) • SCOPE supports C# expressions and built-in .NET functions/libraries
    R1 = SELECT A+C AS ac, B.Trim() AS B1
         FROM R
         WHERE StringOccurs(C, "xyz") > 2

    #CS
    public static int StringOccurs(string str, string ptrn)
    { … }
    #ENDCS
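
  The slide elides the body of StringOccurs; one possible implementation (an assumption, shown only to make the example self-contained) counts non-overlapping occurrences of ptrn in str:

    #CS
    public static int StringOccurs(string str, string ptrn)
    {
        // Count non-overlapping occurrences of ptrn in str.
        if (string.IsNullOrEmpty(ptrn)) return 0;
        int count = 0;
        int pos = 0;
        while ((pos = str.IndexOf(ptrn, pos)) >= 0)
        {
            count++;
            pos += ptrn.Length;
        }
        return count;
    }
    #ENDCS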

  12. User Defined Operators • SCOPE supports three highly extensible commands: PROCESS, REDUCE, and COMBINE • Easy to customize by extending built-in C# components • Easy to reuse code in other SCOPE scripts.

  13. Process • The PROCESS command takes a rowset as input, processes each row, and outputs a sequence of rows (zero, one, or multiple rows). • A flexible command that lets users implement processing that is difficult or impossible to express in SQL.
    PROCESS [<input>]
    USING <Processor> [(args)]
    [PRODUCE column [, …]]
    [WHERE <predicate>]
    [HAVING <predicate>]

    public class MyProcessor : Processor
    {
        public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
        { … }
        public override IEnumerable<Row> Process(RowSet input, Row outRow, string[] args)
        { … }
    }
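
  A script-level usage sketch (MyProcessor is the processor skeleton from the slide; the input stream, the produced term column, and the filter are illustrative assumptions):

    raw = EXTRACT query FROM "search.log" USING LogExtractor;

    terms = PROCESS raw
            USING MyProcessor
            PRODUCE query, term
            HAVING term != "";

    OUTPUT terms TO "terms.result";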

  14. Reduce • REDUCE command takes a grouped rowset, processes each group, and outputs zero, one, or multiple rows per group
    REDUCE [<input> [PRESORT column [ASC|DESC] [, …]]]
    ON grouping_column [, …]
    USING <Reducer> [(args)]
    [PRODUCE column [, …]]
    [WHERE <predicate>]
    [HAVING <predicate>]

    public class MyReducer : Reducer
    {
        public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
        { … }
        public override IEnumerable<Row> Reduce(RowSet input, Row outRow, string[] args)
        { … }
    }
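
  A usage sketch for REDUCE (MyReducer is the skeleton from the slide; the input stream, the presort on a time column, and the sessionCount output column are illustrative assumptions):

    raw = EXTRACT userId, query, time FROM "search.log" USING LogExtractor;

    grouped = REDUCE raw PRESORT time ASC
              ON userId
              USING MyReducer
              PRODUCE userId, sessionCount;

    OUTPUT grouped TO "per_user.result";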

  15. Combine • COMBINE command takes two matching input rowsets, combines them in some way, and outputs a sequence of rows
    COMBINE <input1> [AS <alias1>] [PRESORT …]
    WITH <input2> [AS <alias2>] [PRESORT …]
    ON <equality_predicate>
    USING <Combiner> [(args)]
    PRODUCE column [, …]
    [HAVING <expression>]

    COMBINE S1 WITH S2
    ON S1.A == S2.A AND S1.B == S2.B AND S1.C == S2.C
    USING MyCombiner
    PRODUCE A, B, C

    public class MyCombiner : Combiner
    {
        public override Schema Produce(string[] requestedColumns, string[] args, Schema leftSchema, string leftTable, Schema rightSchema, string rightTable)
        { … }
        public override IEnumerable<Row> Combine(RowSet left, RowSet right, Row outputRow, string[] args)
        { … }
    }

  16. Importing Scripts • Similar to a SQL table function. • Improves reusability and allows parameterization • Provides a security mechanism
    IMPORT <script_file>
    [PARAMS <par_name> = <value> [, …]]
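
  A sketch of how an import might look (the script file name, parameter names, and literal forms are hypothetical):

    IMPORT "qcount_template.script"
    PARAMS inputStream = "search.log",
           minCount = 1000;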

  17. Life of a SCOPE Query • [Architecture diagram: SCOPE queries → Parser / Compiler / Security → Optimizer → Job Manager]

  18. Example Query Plan (QCount)
    SELECT query, COUNT(*) AS count
    FROM "search.log" USING LogExtractor
    GROUP BY query
    HAVING count > 1000
    ORDER BY count DESC;
    OUTPUT TO "qcount.result"
  • Extract the input cosmos file • Partially aggregate at the rack level • Partition on "query" • Fully aggregate • Apply filter on "count" • Sort results in parallel • Merge results • Output as a cosmos file

  19. Conclusions • SCOPE: a new scripting language for large-scale analysis • Strong resemblance to SQL: easy to learn and to port existing applications • Very extensible • Fully benefits from the .NET library • Supports built-in C# templates for customized operations • High-level declarative language • Implementation details (including parallelism and system complexity) are transparent to users

  20. Thanks for listening. Any questions?
