CS 347: Parallel and Distributed Data Management
Notes X: S4

Hector Garcia-Molina
S4
  • Platform for processing unbounded data streams
    • general purpose
    • distributed
    • scalable
    • partially fault tolerant (whatever this means!)

Data Stream

Terminology: event (data record), key, attribute

Question: Can an event have duplicate attributes for the same key? (I think so...)

The stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...

S4 Processing Workflow

[Diagram: a user-specified workflow of “processing units” through which events flow.]

Inside a Processing Unit

[Diagram: a processing unit contains processing elements PE1, PE2, ..., PEn, one per key value (key=a, key=b, ..., key=z).]

Example:
  • Stream of English quotes
  • Produce a sorted list of the top K most frequent words
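
To make this concrete, here is a toy, self-contained version of the workflow in plain Python (the PE class names and the driver loop are invented for illustration; S4's real API is Java and event-driven): one PE instance per distinct word counts occurrences, and a single top-K PE aggregates.

import heapq

class WordCountPE:
    """One PE instance per distinct word (the key is the word itself)."""
    def __init__(self, word):
        self.word, self.count = word, 0

    def process(self, event, emit):
        self.count += 1
        emit({"word": self.word, "count": self.count})   # push update downstream

class TopKPE:
    """A single PE instance (one key) that keeps the running counts."""
    def __init__(self, k):
        self.k, self.counts = k, {}

    def process(self, event):
        self.counts[event["word"]] = event["count"]

    def top(self):
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])

# Driver loop standing in for the S4 platform: route each word event to the
# PE instance for that key, creating the PE the first time the key is seen.
word_pes, topk = {}, TopKPE(k=3)
for quote in ["to be or not to be", "to thine own self be true"]:
    for word in quote.lower().split():
        pe = word_pes.setdefault(word, WordCountPE(word))
        pe.process({"word": word}, emit=topk.process)

print(topk.top())   # sorted list of the top K most frequent words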

Processing Nodes

[Diagram: events are routed to processing nodes by hashing the key (hash(key)=1 ... hash(key)=m); each node hosts the PEs for the keys that hash to it.]
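A tiny sketch of the routing step, assuming m processing nodes and a stable hash function (illustrative only, not the S4 API):

import hashlib

def node_for_key(key, m):
    """Map a key to one of m processing nodes with a stable hash."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return h % m

# Every event with the same key lands on the same node, so the PE instance
# for that key sees the whole sub-stream for that key.
print(node_for_key("be", m=4), node_for_key("be", m=4))   # same node twice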

Dynamic Creation of PEs
  • As a processing node sees new key attributes, it dynamically creates new PEs to handle them
  • Think of PEs as threads

Failures
  • Communication layer detects node failures and provides failover to standby nodes
  • What happens to events in transit during a failure? (My guess: events are lost!)

How do we do DB operations on top of S4?
  • Selects & projects easy!

  • What about joins?
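
To make "easy" concrete: selection and projection are stateless, per-event PEs, sketched below in plain Python (invented names, not the S4 API); joins are the harder case taken up on the next slides.

def select_pe(event, predicate, emit):
    """Selection: forward only the events that satisfy the predicate."""
    if predicate(event):
        emit(event)

def project_pe(event, attrs, emit):
    """Projection: keep only the requested attributes of each event."""
    emit({a: event[a] for a in attrs if a in event})

out = []
select_pe({"rel": "R", "item": "book", "price": 12},
          lambda e: e["price"] > 10, out.append)
project_pe(out[0], ["item", "price"], print)   # {'item': 'book', 'price': 12}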

What is a Stream Join?
  • For a true join, we would need to store all inputs forever! Not practical...
  • Instead, define a window join:
    • at time t a new R tuple arrives
    • it joins only with the previous w S tuples

[Diagram: streams R and S; an arriving R tuple joins the window of the last w S tuples.]

One Idea for Window Join

code for PE (“key” is the join attribute):
  for each event e:
    if e.rel = R:
      store e in Rset (keep last w)
      for s in Sset: output join(e, s)
    else ...   (symmetric for S events)
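
A small executable sketch of this idea (plain Python, invented names, not the S4 API). Note that each key value gets its own Rset/Sset, which is exactly the per-key-window concern raised on the next slide.

from collections import defaultdict, deque

W = 3   # window size w

# One (Rset, Sset) pair per join-key value, as if each key had its own PE.
rsets = defaultdict(lambda: deque(maxlen=W))
ssets = defaultdict(lambda: deque(maxlen=W))

def window_join_pe(event):
    """event = {'rel': 'R' or 'S', 'key': join-attribute value, ...}"""
    k = event["key"]
    if event["rel"] == "R":
        rsets[k].append(event)
        for s in ssets[k]:
            print("join:", event, s)
    else:   # symmetric for S events
        ssets[k].append(event)
        for r in rsets[k]:
            print("join:", r, event)

window_join_pe({"rel": "S", "key": 7, "val": "s1"})
window_join_pe({"rel": "R", "key": 7, "val": "r1"})   # joins with s1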

One Idea for Window Join

(same PE code as above; “key” is the join attribute)

Is this right??? (this enforces the window on a per-key-value basis)

Maybe add sequence numbers to events to enforce the correct window?

Another Idea for Window Join

code for PE (all R & S events have key = “fake”; say the join key is C):
  for each event e:
    if e.rel = R:
      store e in Rset (keep last w)
      for s in Sset:
        if e.C = s.C then output join(e, s)
    else if e.rel = S ...   (symmetric)

Another Idea for Window Join

(same PE code as above; all R & S events have key = “fake”, and the join key is C)

Entire join is done in one PE; no parallelism.
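
For contrast, a sketch of this single-PE variant (again plain Python, invented names): the window is now global across keys, but all events funnel through one PE instance.

from collections import deque

W = 3
Rset, Sset = deque(maxlen=W), deque(maxlen=W)   # one global window per relation

def single_pe_join(event):
    """All events carry key='fake', so this one PE sees the whole stream."""
    if event["rel"] == "R":
        Rset.append(event)
        for s in Sset:
            if event["C"] == s["C"]:
                print("join:", event, s)
    else:   # event["rel"] == "S", symmetric
        Sset.append(event)
        for r in Rset:
            if r["C"] == event["C"]:
                print("join:", r, event)

single_pe_join({"rel": "S", "C": 7})
single_pe_join({"rel": "R", "C": 7})   # joins; all work done by a single PE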

Final Comment: Managing state of PE

[Diagram: a PE with its internal state.]

Who manages state?  S4: the user does.  Mupet: the system does.

Is state persistent?


CS 347: Parallel and Distributed Data Management
Notes X: Hyracks

Hector Garcia-Molina

Hyracks
  • Generalization of map-reduce
  • Infrastructure for “big data” processing
  • Material here based on:

Appeared in ICDE 2011

A Hyracks data object:
  • Records partitioned across N sites
  • Simple record schema is available (more than just key-value)

[Diagram: a data object’s records partitioned across site 1, site 2, site 3.]

Operations

[Diagram: an operation = an operator plus a distribution rule for its output.]

Operations (parallel execution)

[Diagram: the operator runs as parallel instances op 1, op 2, op 3.]

Example: Hyracks Specification

[Diagram of a job specification: initial partitions feed the operators; each operator’s input & output is a set of partitions; connectors map them to new partitions; the final output here is 2 replicated partitions.]

Notes
  • Job specification can be done manually or automatically

Example: Activity Node Graph

[Diagram: the activity node graph expands each operator into activities, with sequencing constraints between them.]

Example: Parallel Instantiation

[Diagram: activities are grouped into stages; a stage starts only after its input stages finish.]

Library of Operators:
  • File reader/writers
  • Mappers
  • Sorters
  • Joiners (various types)
  • Aggregators
  • Can add more

Library of Connectors:
  • N:M hash partitioner (see the sketch after this list)
  • N:M hash-partitioning merger (input sorted)
  • N:M range partitioner (with partition vector)
  • N:M replicator
  • 1:1
  • Can add more!
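
A toy version of the N:M hash partitioner's routing decision (illustrative Python, not the Hyracks connector API):

def hash_partition(records, m, key):
    """Route each record from a producer to one of m consumer partitions."""
    out = [[] for _ in range(m)]
    for r in records:
        out[hash(r[key]) % m].append(r)
    return out

producer_output = [{"k": 1, "v": "a"}, {"k": 2, "v": "b"}, {"k": 5, "v": "c"}]
print(hash_partition(producer_output, m=2, key="k"))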

Hyracks Fault Tolerance: Work in progress?

[Diagram: each operator saves its output to files.]

The Hadoop/MR approach: save partial results to persistent storage; after a failure, redo all work needed to reconstruct the missing data.

Hyracks Fault Tolerance: Work in progress?

Can we do better?

Maybe: each process retains previous results until no longer needed?

[Diagram: Pi’s output r1, r2, r3, r4, r5, r6, r7, r8; the earlier results have “made their way” to the final result, the later ones are the current output.]


CS 347: Parallel and Distributed Data Management
Notes X: Pregel

Hector Garcia-Molina

Material based on:
  • In SIGMOD 2010
  • Note there is an open-source version of Pregel called GIRAPH

Pregel
  • A computational model/infrastructure for processing large graphs
  • Prototypical example: Page Rank

[Diagram: a small graph with vertices a, b, d, e, f, x, where a and b have edges into x.]

PR[i+1, x] = f( PR[i, a]/n_a , PR[i, b]/n_b )


Pregel
  • Synchronous computation in iterations
  • In one iteration, each node:
    • gets messages from neighbors
    • computes
    • sends data to neighbors

[Same diagram: x computes PR[i+1, x] = f( PR[i, a]/n_a , PR[i, b]/n_b ) from the messages sent by a and b.]

Pregel vs Map-Reduce/S4/Hyracks/...
  • In Map-Reduce, S4, Hyracks, ... the workflow is separate from the data
  • In Pregel, the data (graph) drives the data flow

Pregel Motivation
  • Many applications require graph processing
  • Map-Reduce and other workflow systems are not a good fit for graph processing
  • Need to run graph algorithms on many processors

Termination
  • After each iteration, each vertex votes to halt or not
  • If all vertexes vote to halt, computation terminates

Vertex Compute (simplified)
  • Available data for iteration i:
    • InputMessages: { [from, value] }
    • OutputMessages: { [to, value] }
    • OutEdges: { [to, value] }
    • MyState: value
  • OutEdges and MyState are remembered for next iteration

Max Computation
change := false
for [f, w] in InputMessages do
  if w > MyState.value then
    [ MyState.value := w ; change := true ]
if (superstep = 1) OR change then
  for [t, w] in OutEdges do
    add [t, MyState.value] to OutputMessages
else
  vote to halt
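
A toy, self-contained version of this Max computation under a synchronous superstep driver (plain Python; Vertex and run_supersteps are invented names for illustration, not Pregel's actual C++ API):

class Vertex:
    def __init__(self, vid, value, out_edges):
        self.id, self.value, self.out_edges = vid, value, out_edges
        self.active = True

    def compute(self, superstep, in_messages, send):
        change = False
        for w in in_messages:
            if w > self.value:
                self.value, change = w, True
        if superstep == 1 or change:
            for t in self.out_edges:
                send(t, self.value)
        else:
            self.active = False          # vote to halt

def run_supersteps(vertices):
    """vertices: dict vid -> Vertex; runs until all halt and no messages remain."""
    inbox, superstep = {vid: [] for vid in vertices}, 1
    while any(v.active for v in vertices.values()) or any(inbox.values()):
        outbox = {vid: [] for vid in vertices}
        for vid, v in vertices.items():
            if inbox[vid]:
                v.active = True          # an incoming message reactivates a vertex
            if v.active:
                v.compute(superstep, inbox[vid],
                          send=lambda t, w: outbox[t].append(w))
        inbox, superstep = outbox, superstep + 1
    return {vid: v.value for vid, v in vertices.items()}

g = {"a": Vertex("a", 3, ["b"]),
     "b": Vertex("b", 6, ["a", "c"]),
     "c": Vertex("c", 1, ["b"])}
print(run_supersteps(g))   # every vertex converges to the maximum value, 6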

Page Rank Example

[Code figure: the PageRank compute() checks the iteration count, iterates through InputMessages to sum the contributions, updates MyState.value, and then (shorthand) sends the same message to all out-edges.]
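
A hedged sketch of what that compute() might look like, reusing the toy Vertex/run_supersteps driver from the Max example above (the damping factor 0.85 and the 30-superstep cutoff are illustrative assumptions, not values from the slide):

class PageRankVertex(Vertex):
    NUM_VERTICES = 3      # assumed known to every vertex
    MAX_SUPERSTEPS = 30   # illustrative stopping rule (the "iteration count" check)

    def compute(self, superstep, in_messages, send):
        if superstep > 1:
            # iterate through InputMessages and update MyState.value
            self.value = 0.15 / self.NUM_VERTICES + 0.85 * sum(in_messages)
        if superstep < self.MAX_SUPERSTEPS and self.out_edges:
            for t in self.out_edges:                    # send same msg to all
                send(t, self.value / len(self.out_edges))
        else:
            self.active = False                         # vote to halt

pr = {v: PageRankVertex(v, 1.0 / 3, out)
      for v, out in {"a": ["b"], "b": ["a", "c"], "c": ["a"]}.items()}
print(run_supersteps(pr))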

Architecture

[Diagram: a master and workers A, B, C; input data 1 and data 2 are files of records like [a, value]; the graph has nodes a, b, c, d, ...]

Architecture

[Diagram: partition the graph and assign vertexes to workers: worker A gets a, b, c; worker B gets d, e; worker C gets f, g, h.]

Architecture

[Diagram: the workers read the input data; worker A forwards input values to the appropriate workers.]

Architecture

[Diagram: run superstep 1 on all workers.]

Architecture

[Diagram: at the end of superstep 1, workers send messages and report to the master whether to halt.]

Architecture

[Diagram: run superstep 2.]

Architecture

[Diagram: checkpoint.]

Architecture

[Diagram: to checkpoint, each worker writes to stable store: MyState, OutEdges, InputMessages (or OutputMessages).]

Architecture

[Diagram: if a worker dies, find a replacement and restart from the latest checkpoint.]

Architecture

[Diagram: master and workers A, B, C with their assigned vertexes, as before.]

Interesting Challenge
  • How best to partition graph for efficiency?

[Diagram: the example graph with vertices a, b, d, e, f, g, x.]

CS 347: Parallel and Distributed Data Management
Notes X: BigTable, HBase, Cassandra

Hector Garcia-Molina

Sources
  • HBase: The Definitive Guide, Lars George, O’Reilly Publishers, 2011.
  • Cassandra: The Definitive Guide, Eben Hewitt, O’Reilly Publishers, 2011.
  • BigTable: A Distributed Storage System for Structured Data, F. Chang et al., ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008.

Lots of Buzz Words!
  • “Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon’s dynamo and its data model on Google’s Big Table.”
  • Clearly, it is buzz-word compliant!!

Basic Idea: Key-Value Store

[Figure: Table T, rows of (key, value) pairs.]

Basic Idea: Key-Value Store
  • API:
    • lookup(key) → value
    • lookup(key range) → values
    • getNext → value
    • insert(key, value)
    • delete(key)
  • Each row has a timestamp
  • Single-row actions are atomic (but not persistent in some systems?)
  • No multi-key transactions
  • No query language!

[Figure: Table T; keys are sorted.]
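
A toy single-node realization of this API (plain Python, keys kept sorted; purely illustrative, not the BigTable/HBase/Cassandra client API):

import bisect, time

class KVTable:
    """Sorted keys; each row carries a timestamp."""
    def __init__(self):
        self.keys, self.rows = [], {}

    def insert(self, key, value):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = (value, time.time())

    def lookup(self, key):
        return self.rows.get(key, (None, None))[0]

    def lookup_range(self, lo, hi):
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.rows[k][0]) for k in self.keys[i:j]]

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self.keys.remove(key)

t = KVTable()
t.insert("b", {"price": 10}); t.insert("a", {"price": 7})
print(t.lookup("a"), t.lookup_range("a", "b"))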

Fragmentation (Sharding)

[Diagram: the sorted key space is split into tablets, assigned to server 1, server 2, server 3.]

  • use a partition vector
  • “auto-sharding”: vector selected automatically
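
A sketch of routing a key with a partition vector (illustrative; the split keys "g" and "p" are made up):

import bisect

split_keys = ["g", "p"]                           # partition vector (assumed)
servers    = ["server 1", "server 2", "server 3"] # tablet i -> servers[i]

def server_for(key):
    """Tablet i holds keys < split_keys[i]; the last tablet holds the rest."""
    return servers[bisect.bisect_right(split_keys, key)]

print(server_for("apple"), server_for("melon"), server_for("zebra"))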

Tablet Replication

[Diagram: a tablet with its primary copy on server 3 and backup copies on servers 4 and 5.]

  • Cassandra:
    • Replication Factor (# copies)
    • R/W Rule: One, Quorum, All
    • Policy (e.g., Rack Unaware, Rack Aware, ...)
    • Read all copies (return fastest reply, do repairs if necessary)
  • HBase: Does not manage replication, relies on HDFS
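
A tiny illustration of how the R/W rule interacts with the replication factor (the R + W > N overlap rule is standard quorum reasoning, not something stated on the slide):

def quorum(n):
    """Cassandra-style Quorum level for n replicas: floor(n/2) + 1."""
    return n // 2 + 1

def read_sees_latest_write(n, r, w):
    """Reads overlap the latest write whenever R + W > N."""
    return r + w > n

n = 3
print(quorum(n))                                        # 2
print(read_sees_latest_write(n, quorum(n), quorum(n)))  # True  (Quorum/Quorum)
print(read_sees_latest_write(n, 1, 1))                  # False (One/One)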

Need a “directory”
  • Table Name: Key  Server that stores key Backup servers
  • Can be implemented as a special table.

Tablet Internals

[Diagram: a tablet’s data split between memory and disk.]

Design Philosophy (?): the primary scenario is where all data is in memory; disk storage was added as an afterthought.

Tablet Internals

[Diagram: an in-memory buffer (with tombstones for deletes) is flushed periodically to segment files on disk.]

  • the tablet is the merge of all segments (files)
  • disk segments are immutable
  • writes are efficient; reads are only efficient when all data is in memory
  • periodically reorganize into a single segment
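
A toy version of this design (illustrative Python, not HBase/Cassandra code): an in-memory table plus immutable flushed segments; reads merge them newest-first, deletes write tombstones, and a periodic compaction folds everything into one segment.

TOMBSTONE = object()

class Tablet:
    def __init__(self, flush_threshold=2):
        self.mem, self.segments = {}, []     # segments, oldest to newest
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.mem[key] = value
        if len(self.mem) >= self.flush_threshold:
            self.segments.append(dict(self.mem))   # immutable snapshot on "disk"
            self.mem = {}

    def delete(self, key):
        self.put(key, TOMBSTONE)             # a tombstone, not an in-place delete

    def get(self, key):
        for layer in [self.mem] + self.segments[::-1]:   # newest first
            if key in layer:
                v = layer[key]
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        merged = {}
        for seg in self.segments + [self.mem]:           # newest wins
            merged.update(seg)
        self.segments = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]
        self.mem = {}

t = Tablet()
t.put("a", 1); t.put("b", 2); t.delete("a")
print(t.get("a"), t.get("b"))   # None 2
t.compact(); print(t.segments)  # [{'b': 2}]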

Column Family

Column Family
  • for storage, treat each row as a single “super value”
  • API provides access to sub-values (use family:qualifier to refer to sub-values, e.g., price:euros, price:dollars)
  • Cassandra allows “super-column”: two level nesting of columns (e.g., Column A can have sub-columns X & Y )

Vertical Partitions

[Figure: vertical partitioning can be manually implemented by splitting the table column-wise into several tables.]

Vertical Partitions

[Figure: each vertical partition corresponds to a column family.]

  • good for sparse data
  • good for column scans
  • not so good for tuple reads
  • are atomic updates to a row still supported?
  • API supports actions on full table; mapped to actions on column tables
  • API supports column “project”
  • To decide on vertical partition, need to know access patterns

Failure Recovery (BigTable, HBase)

[Diagram: the master node pings each tablet server (a spare tablet server stands by); tablet data lives in memory, with write-ahead logging to a log stored in GFS or HDFS.]

Failure recovery (Cassandra)
  • No master node, all nodes in “cluster” equal

[Diagram: servers 1, 2, 3 forming a cluster of equals.]

Failure recovery (Cassandra)
  • No master node, all nodes in “cluster” equal

[Diagram: a client can access any table in the cluster at any server; that server sends requests to the other servers.]


CS 347: Parallel and Distributed Data Management
Notes X: MemCacheD

Hector Garcia-Molina

MemCacheD
  • General-purpose distributed memory caching system
  • Open source

What MemCacheD Should Be (but ain't)

[Diagram: a client calls get_object(X) against a distributed cache (cache 1, cache 2, cache 3) backed by data source 1, 2, 3.]

What MemCacheD Should Be (but ain't)

[Diagram: same setup; on get_object(X), the cache layer fetches X from the data source that has it and keeps a copy in one of the caches.]

What MemCacheD Is (as far as I can tell)

[Diagram: the client itself calls put(cache 1, myName, X) and later get_object(cache 1, MyName), naming a specific cache; the data sources are not involved.]

What MemCacheD Is

  • each cache is a hash table of (name, value) pairs
  • put(cache 1, myName, X) and get_object(cache 1, MyName) go to the named cache
  • the cache can purge MyName whenever it wants
  • there is no connection between the caches and the data sources
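
A minimal usage sketch with the classic python-memcached client (an assumption: the library is installed and memcached daemons are running on these ports; the key point is that the client library, not the servers, maps each name to a cache, and the application goes back to the data source itself on a miss):

import memcache

# Client-side hashing picks which cache holds each (name, value) pair.
mc = memcache.Client(["127.0.0.1:11211", "127.0.0.1:11212"])

mc.set("myName", "X", time=60)   # the cache may purge this entry at any time
value = mc.get("myName")         # returns None on a miss
if value is None:
    value = "X"                  # re-read from the data source (not shown)
    mc.set("myName", value, time=60)
print(value)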


CS 347: Parallel and Distributed Data Management
Notes X: ZooKeeper

Hector Garcia-Molina

ZooKeeper
  • Coordination service for distributed processes
  • Provides clients with a high-throughput, highly available, memory-only file system

[Diagram: several clients talk to ZooKeeper, which holds a tree of znodes rooted at /, e.g. znode /a/d/e; each znode has state.]

ZooKeeper Servers

[Diagram: each client connects to one of several ZooKeeper servers; every server keeps a full state replica.]

ZooKeeper Servers

[Diagram: a read is served locally by the server the client is connected to, from its state replica.]

ZooKeeper Servers

Writes are totally ordered, using the Zab algorithm.

[Diagram: a write goes to the client’s server and is propagated & synched to the other state replicas.]

Failures

If your server dies, just connect to a different one!

[Diagram: clients fail over to another server, each of which has its own state replica.]

ZooKeeper Notes
  • Differences with file system:
    • all nodes can store data
    • storage size limited
  • API: insert node, read node, read children, delete node, ...
  • Can set triggers on nodes
  • Clients and servers must know all servers
  • ZooKeeper works as long as a majority of servers are available
  • Writes totally ordered; read ordered w.r.t. writes
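
A small usage sketch with the kazoo Python client (assumptions: kazoo is installed and a ZooKeeper ensemble is reachable at these addresses; the znode path mirrors the /a/d/e example above):

from kazoo.client import KazooClient

# The client connects to the ensemble and fails over between servers.
zk = KazooClient(hosts="server1:2181,server2:2181,server3:2181")
zk.start()

zk.ensure_path("/a/d")                    # create intermediate znodes if needed
if not zk.exists("/a/d/e"):
    zk.create("/a/d/e", b"some state")    # a znode with a small data payload

data, stat = zk.get("/a/d/e")             # read node (data + metadata)
children = zk.get_children("/a")          # read children

# A "trigger": the callback runs now and again whenever the znode changes.
@zk.DataWatch("/a/d/e")
def on_change(data, stat):
    print("znode changed:", data)

zk.set("/a/d/e", b"new state")            # write; totally ordered by the ensemble
zk.stop()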


CS 347: Parallel and Distributed Data Management
Notes X: Kestrel

Hector Garcia-Molina

Kestrel
  • Kestrel server handles a set of reliable, ordered message queues
  • A server does not communicate with other servers (advertised as good for scalability!)

[Diagram: clients talk to Kestrel servers; each server holds its own set of queues.]

[Diagram (continued): one client does “put q” to a server while another client does “get q”; each server persists its queues to a log.]
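
Kestrel speaks the memcached protocol, so a basic usage sketch can reuse the same python-memcached client as before (hedged: assumes a Kestrel server on its default memcached port 22133; SET enqueues onto the named queue and GET dequeues the next item):

import memcache

kestrel = memcache.Client(["127.0.0.1:22133"])   # assumed Kestrel address/port

kestrel.set("work_q", "job-1")    # enqueue onto queue "work_q"
kestrel.set("work_q", "job-2")

print(kestrel.get("work_q"))      # dequeue next item -> "job-1" (ordered)
print(kestrel.get("work_q"))      # -> "job-2"; returns None when the queue is empty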