
Open Source!

Jaql → pipes: Unix pipes for the JSON data model

Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata

IBM Almaden Research Center
http://code.google.com/p/jaql/

http://jaql.org/

Goals for Jaql
  • Provide a simple, yet powerful language to manipulate semi-structured data.
    • Use JSON as a data model (see the sketch after this list)
      • Data is usually converted to/from a JSON view
      • Most data has a natural JSON representation
  • Easily extended using Java, Python, JavaScript, …
  • Exploit massive parallelism using Hadoop
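
For reference in the examples below, the $people input can be pictured as a JSON array of records. The field names (id, name, type, birthdate, gender, rating) all appear on later slides; the values here are invented for illustration, and whether date(...) may appear inside a literal is an assumption:

[ { id: 1, name: 'Jack', type: 'friendly', birthdate: date('1974-03-01'), gender: 'M', rating: 7 },
  { id: 2, name: 'Jill', type: 'friendly', birthdate: date('1992-07-15'), gender: 'F', rating: 9 },
  … ]
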
What is in the upcoming release?
  • User feedback on previous release
    • Too XQuery-like (yuck factor)
    • Too complex
      • Too composable, too nested, too verbose
    • Unclear what is parallelized
  • Next release (planned 10/30/2008)
    • Vastly simplified syntax
      • Inspired by Unix Pipes
A query is a pipeline

[Diagram: source → operator → operator → sink]

$people = file …;                      // declare files
$greetings = file …;

$people                                // read input (json array)
  -> filter $.type = 'friendly'        // find friendly people
  -> map { hello: $.name }             // keep just name
  -> write $greetings;                 // write output

Operations are listed in natural order, rather than last operation first.

Compiles to one map job.

Aggregate
  • Aggregate the input into a single value
    • Using a push-based, streaming, combining API for aggregate functions

$people
  -> filter by $.birthdate < date('1990-01-01')
  -> aggregate count($);               // count the older people

Compiles to one map / combine / reduce job.

Partition
  • Partition one or more inputs
  • Send each individual partition through a sub-pipe
  • Merge the results

$people
  -> filter by $.birthdate < date('1990-01-01')
  -> partition by $t = $.type                      // partition the older people by type
     |- aggregate { type: $t, n: count($) } -|;    // aggregate per partition

Compiles to one map / combine / reduce job.

User-defined operators
  • Call user code
    • Similar to calling user program / script in Unix
  • Input and output are pipelined
    • Like “Hadoop streaming”

$people
  -> myBestMatches($, 3);              // pass "standard input" to external code

Not parallel!

Per partition sub-pipe

[Diagram: partition ("split") → per-partition sub-pipe → merge]
  • Partition one or more inputs on a key
  • Send each partition through (duplicate) sub-pipe
  • Merge the results

$people
  -> partition by $.type               // partition people by type
     |- sort by $.rating               // sort partition by rating
        -> top 100                     // keep just the first 100 in partition
        -> myBestMatches($,3) -|;      // find best matches per partition

Compiles to one map / reduce job.

Partition by default
  • Run sub-pipe on each partition of the input
    • If the input is a file, use its partitioning; otherwise the partitioning is arbitrary
  • Expresses parallelism of user-defined operator

$file
  -> partition by default              // run per file partition
     |- buildPartialModel($) -|        // partial model built per partition
  -> unifyModels($);                   // unify all the partial models into one

Compiles to one map job + serial unify.

Join

People:   [ { id: 1, name: 'Jack' }, { id: 2, name: 'Jill' }, … ]
Children: [ { id: 3, name: 'Becky', father: 1, mother: 2 }, … ]

$people = file …;
$children = file …;

join $people on $people.id,
     $children on $children.mother;

[ { people:   { id: 2, name: 'Jill' },
    children: { id: 3, name: 'Becky', father: 1, mother: 2 } }, … ]

  • The result is a record with the inputs as values
  • Joins on multiple inputs with multiple conditions
  • Inner, left-, right-, and full-outer joins

Compiles to one map / reduce job.

Composite Operators
  • Join
    • Join two or more inputs on a key
    • Inner/outer/full
    • Multi-predicate, multi-way
  • Merge
    • Concatenate all inputs in any order
  • User-defined operator (function)
  • Union, Intersect, Difference…

One input can come from the current pipe; the remaining inputs are pipe variables or nested pipes.

[Diagram: a composite operator combining several input pipes]
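
The concrete examples for this slide appear to have lived in the figure. As a hedged sketch that reuses only constructs shown elsewhere in the deck (the join from the previous slide, plus map and write), a composite operator's output might continue down the pipe like any other operator; piping out of join and the $pairs output variable are assumptions, not something the slides show:

join $people on $people.id,
     $children on $children.mother                            // inputs are pipe variables, as on the Join slide
  -> map { parent: $.people.name, child: $.children.name }    // field names follow the join result shown above
  -> write $pairs;                                            // hypothetical output file variable
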

Composite sinks
  • Tee
    • Send each input item to all output pipes

$people
  -> tee
     |- filter $.gender == 'F' -> write $women
     |- map { $.name } -> write $names
     -|;

  • Split
    • Send each input item to one pipe
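
No code for split is shown in the deck, so its syntax is not reproduced here. As a hedged sketch of the same effect using only constructs from the tee example above (tee plus mutually exclusive filters, so each person reaches exactly one output), one might write the following; $men is a hypothetical output variable:

$people
  -> tee
     |- filter $.gender == 'F' -> write $women
     |- filter $.gender == 'M' -> write $men
     -|;
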
Rough Unix analogs of Jaql

  • Unix: a stream of bytes / lines
  • Jaql: a stream of JSON items, with more structure and more types
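
As a hedged side-by-side sketch of the analogy, here is the friendly-people pipeline from earlier in the deck next to a rough Unix counterpart; the Unix command line is illustrative and not taken from the slides:

// Unix: the shell pipes bytes/lines between programs
//   cat people.txt | grep friendly | awk '{ print "hello:", $1 }' > greetings.txt

// Jaql: the same shape of pipeline, but over JSON items
$people
  -> filter $.type = 'friendly'
  -> map { hello: $.name }
  -> write $greetings;
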

Summary
  • Unix pipes revolutionized scripting
  • If you know Unix pipes, you understand Jaql
Questions?

Comments?