
Open Source!

Jaql → pipes: Unix pipes for the JSON data model

Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata

IBM Almaden Research Center
http://code.google.com/p/jaql/

http://jaql.org/

Goals for Jaql
  • Provide a simple, yet powerful language to manipulate semi-structured data.
    • Use JSON as a data model (see the sketch after this list)
      • Data is usually converted to/from a JSON view
      • Most data has a natural JSON representation
  • Easily extended using Java, Python, JavaScript, …
  • Exploit massive parallelism using Hadoop
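
For reference in the examples below, the $people input can be pictured as a JSON array of records. The field names (id, name, type, birthdate, gender, rating) all appear on later slides; the values here are invented for illustration, and whether date(...) may appear inside a literal is an assumption:

[ { id: 1, name: 'Jack', type: 'friendly', birthdate: date('1974-03-01'), gender: 'M', rating: 7 },
  { id: 2, name: 'Jill', type: 'friendly', birthdate: date('1992-07-15'), gender: 'F', rating: 9 },
  … ]
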
What is in the upcoming release?
  • User feedback on previous release
    • Too XQuery-like (yuck factor)
    • Too complex
      • Too composable, too nested, too verbose
    • Unclear what is parallelized
  • Next release (planned 10/30/2008)
    • Vastly simplified syntax
      • Inspired by Unix Pipes
A query is a pipeline

[Diagram: source → operator → operator → sink]

$people = file …;                      // declare files
$greetings = file …;

$people                                // read input (json array)
  -> filter $.type = 'friendly'        // find friendly people
  -> map { hello: $.name }             // keep just name
  -> write $greetings;                 // write output

Operations are listed in natural order, rather than last operation first.

Compiles to one map job.

Aggregate
  • Aggregate the input into a single value
    • Using a push-based, streaming, combining API for aggregate functions

$people
  -> filter by $.birthdate < date('1990-01-01')
  -> aggregate count($);               // count the older people

Compiles to one map / combine / reduce job.

Partition
  • Partition one or more inputs
  • Send each individual partition through a sub-pipe
  • Merge the results

$people
  -> filter by $.birthdate < date('1990-01-01')
  -> partition by $t = $.type                      // partition the older people by type
     |- aggregate { type: $t, n: count($) } -|;    // aggregate per partition

Compiles to one map / combine / reduce job.

User-defined operators
  • Call user code
    • Similar to calling user program / script in Unix
  • Input and output are pipelined
    • Like “Hadoop streaming”

$people
  -> myBestMatches($, 3);              // pass "standard input" to external code

Not parallel!

Per partition sub-pipe

[Diagram: partition ("split") → per-partition sub-pipe → merge]
  • Partition one or more inputs on a key
  • Send each partition through (duplicate) sub-pipe
  • Merge the results

$people
  -> partition by $.type               // partition people by type
     |- sort by $.rating               // sort partition by rating
        -> top 100                     // keep just the first 100 in partition
        -> myBestMatches($,3) -|;      // find best matches per partition

Compiles to one map / reduce job.

Partition by default
  • Run sub-pipe on each partition of the input
    • If the input is a file, use its partitioning; otherwise the partitioning is arbitrary
  • Expresses parallelism of user-defined operator

$file
  -> partition by default              // run per file partition
     |- buildPartialModel($) -|        // partial model built per partition
  -> unifyModels($);                   // unify all the partial models into one

Compiles to one map job + serial unify.

Join

People:   [ { id: 1, name: 'Jack' }, { id: 2, name: 'Jill' }, … ]
Children: [ { id: 3, name: 'Becky', father: 1, mother: 2 }, … ]

$people = file …;
$children = file …;

join $people on $people.id,
     $children on $children.mother;

[ { people:   { id: 2, name: 'Jill' },
    children: { id: 3, name: 'Becky', father: 1, mother: 2 } }, … ]

  • The result is a record with the inputs as values
  • Joins on multiple inputs with multiple conditions
  • Inner, left-, right-, and full-outer joins

Compiles to one map / reduce job.

Composite Operators
  • Join
    • Join two or more inputs on a key
    • Inner/outer/full
    • Multi-predicate, multi-way
  • Merge
    • Concatenate all inputs in any order
  • User-defined operator (function)
  • Union, Intersect, Difference…

One input can come from the current pipe; the remaining inputs are pipe variables or nested pipes.

[Diagram: a composite operator combining several input pipes]
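
The concrete examples for this slide appear to have lived in the figure. As a hedged sketch that reuses only constructs shown elsewhere in the deck (the join from the previous slide, plus map and write), a composite operator's output might continue down the pipe like any other operator; piping out of join and the $pairs output variable are assumptions, not something the slides show:

join $people on $people.id,
     $children on $children.mother                            // inputs are pipe variables, as on the Join slide
  -> map { parent: $.people.name, child: $.children.name }    // field names follow the join result shown above
  -> write $pairs;                                            // hypothetical output file variable
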

Composite sinks
  • Tee
    • Send each input item to all output pipes

$people
  -> tee
     |- filter $.gender == 'F' -> write $women
     |- map { $.name } -> write $names
     -|;

  • Split
    • Send each input item to one pipe
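
No code for split is shown in the deck, so its syntax is not reproduced here. As a hedged sketch of the same effect using only constructs from the tee example above (tee plus mutually exclusive filters, so each person reaches exactly one output), one might write the following; $men is a hypothetical output variable:

$people
  -> tee
     |- filter $.gender == 'F' -> write $women
     |- filter $.gender == 'M' -> write $men
     -|;
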
Rough Unix analogs of Jaql

  • Unix: a stream of bytes / lines
  • Jaql: a stream of JSON items, with more structure and more types
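
As a hedged side-by-side sketch of the analogy, here is the friendly-people pipeline from earlier in the deck next to a rough Unix counterpart; the Unix command line is illustrative and not taken from the slides:

// Unix: the shell pipes bytes/lines between programs
//   cat people.txt | grep friendly | awk '{ print "hello:", $1 }' > greetings.txt

// Jaql: the same shape of pipeline, but over JSON items
$people
  -> filter $.type = 'friendly'
  -> map { hello: $.name }
  -> write $greetings;
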

Summary
  • Unix pipes revolutionized scripting
  • If you know Unix pipes, you understand Jaql
Questions?

Comments?