Cleaning and Processing Physical Time-Series Data

Stephen Dawson-Haggerty

Computer Science Division, University of California, Berkeley

stevedh@eecs.berkeley.edu

Introduction

Lots of sMAP frontends/apps growing up

  • Enabled by architectural decoupling

Local Summer Retreat 2012

Services Communicating with sMAP

Many deployments share common infrastructure

[Figure: architecture diagram. Labels: 6LoWPAN networks, RS-485 bus, BACnet/IP, sMAP (three instances), control, web, models, mgmt, Archiver, RDBMS, TSDB. Caption: Lines of decoupling]

slicr: Tags Generate Multiple Views

[ { tag: "Metadata/SourceName",
    restrict: "has Metadata/Extra/EndUse" },
  { tag: "Metadata/Extra/EndUse" },
  { tag: "Metadata/Extra/Category",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
  { tag: "Metadata/Extra/ProductType",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Location/Room", "Metadata/Extra/Load"] },
  { tag: "Metadata/Instrument/PartNumber",
    defaultSubStream: "Properties/UnitofMeasure = 'mW'",
    seriesLabel: ["Metadata/Instrument/PartNumber",
                  "Metadata/Location/Room", "Metadata/Extra/Load"] },
  "Properties/UnitofMeasure"
]


[Figure: storage stack diagram. Operations: query, aggregate, resample, streaming pipeline, insert. Components: Time-series Interface, Bucketing, Compression, Storage mapper, RPC, readingdb, Key-Value Store, SQL, Page Cache, Lock Manager, Storage Alloc., MySQL]

Motivation
  • Very common operations
    • Resample/subsample
    • Aggregate
    • Filter/smooth
    • Exploratory interactive analysis
    • Visualization
    • Recalibration/post-calibration
  • Get data into MATLAB
  • What are the right data access primitives with these facts in mind?

Larry Ellison’s query
  • Extract data from 100 streams
  • Interpolate onto a 5-minute time basis
  • Combine the streams into a matrix
  • Filter missing data
  • Load into MATLAB/R/numpy

apply missing < paste < window(count, field="minute", width=15)
to data in ("4/20/2012", "4/21/2012")
where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
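
The five steps listed above can be sketched in plain Python. This is an illustration only, not the sMAP implementation: the function names `interpolate` and `to_matrix`, the use of linear interpolation, and the choice to treat grid points outside a stream's observed range as missing are all assumptions made for the example.

```python
from bisect import bisect_left

STEP = 300  # 5 minutes, in seconds


def interpolate(readings, grid):
    """Linearly interpolate sorted (timestamp, value) readings onto grid.

    Grid points outside the observed range are reported as None (missing).
    """
    times = [t for t, _ in readings]
    out = []
    for g in grid:
        i = bisect_left(times, g)
        if i < len(times) and times[i] == g:
            out.append(readings[i][1])          # exact hit
        elif i == 0 or i == len(times):
            out.append(None)                    # no neighbours on both sides
        else:
            (t0, v0), (t1, v1) = readings[i - 1], readings[i]
            out.append(v0 + (v1 - v0) * (g - t0) / (t1 - t0))
    return out


def to_matrix(streams, start, end, step=STEP):
    """Combine streams into rows of [timestamp, v1, v2, ...], dropping
    any row in which at least one stream is missing."""
    grid = list(range(start, end, step))
    columns = [interpolate(s, grid) for s in streams]
    rows = [[g] + [col[k] for col in columns] for k, g in enumerate(grid)]
    return [row for row in rows if None not in row]


streams = [
    [(0, 10.0), (600, 20.0)],
    [(300, 1.0), (900, 3.0)],
]
matrix = to_matrix(streams, 0, 1200)
# matrix == [[300, 15.0, 1.0], [600, 20.0, 2.0]]
```

The resulting list of rows is the kind of dense matrix that can be handed to MATLAB, R or numpy without further cleaning.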

Design goals
  • Replace error-prone and poorly-performing application code
    • Windowed filters, merging
  • Optimize common data cleaning operations
    • e.g., materialize subsampled time-series for low latency
  • Enable actions on streaming, processed data
    • Work seamlessly on historical + streaming

Approaches
  • SQL:2003 windowing functions

SELECT OVER (ORDER BY time ROWS 10 PRECEDING)

  • Language toolkit approaches
    • Python/pandas
    • R
    • Stata
    • MATLAB
  • Distributed frameworks
    • Pig, Hive
  • Database approaches
    • SciDB

Processing model

pipes of operators

  • Unix philosophy
  • process metadata alongside data

each stage defines a new set of distillate streams

What is an operator
  • An operator reads a set of input streams
  • And produces a set of distillate streams
    • May mutate any of the dimensions
    • Each output stream is uniquely named

Example: unit

Read a set of input streams and apply a common set of unit conversions

unit: (S, T, W) → (S, T, W)
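
A minimal sketch of a shape-preserving operator like `unit`, assuming each stream is a list of (timestamp, value) pairs and its metadata is a dict with `id` and `unit` keys. The conversion table (only C and Deg F, normalized to C) and the decision to keep the input ids are assumptions for illustration; the slides say each output stream is uniquely named, which a real implementation would have to handle.

```python
def to_celsius(value, unit):
    """Convert a temperature reading to degrees Celsius."""
    if unit == "C":
        return value
    if unit == "Deg F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError("no conversion known for unit %r" % unit)


def unit(metadata, data):
    """Apply a common unit conversion to every stream.

    The shape is unchanged: same number of streams, same timestamps,
    same width. Only the values and the unit tag are rewritten.
    """
    new_meta = [dict(m, unit="C") for m in metadata]
    new_data = [
        [(t, to_celsius(v, m["unit"])) for t, v in series]
        for m, series in zip(metadata, data)
    ]
    return new_meta, new_data


meta = [{"id": 1, "unit": "C"}, {"id": 2, "unit": "Deg F"}]
data = [[(0, 20.0)], [(0, 68.0)]]
out_meta, out_data = unit(meta, data)
# out_data == [[(0, 20.0)], [(0, 20.0)]]
```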

Processing model

[Figure: an operator (op) applied to a block of data with axes "streams" and "time"]

Dimensionality:
  • S: Streams
  • T: Time
  • W: "Width" (= 2)

Example input streams:
  • ID 1: Type OAT, Unit C
  • ID 2: Type OAT, Unit Deg F
  • ID 3: Type OAT, Unit C

Operator construction
  • Specialize: bind to arguments

op = add(10)

  • Instantiate: bind to stream meta-data; generate new metadata

op([{id: 1, unit: "C"}, {id: 2, unit: "Deg F"}])

→ [{id: 47, unit: "C"}]

  • Process

op([[[1337628952, 23]], [[1337628950, 70]]])
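
The three phases above can be sketched as two small Python classes. The class names `Add` and `BoundAdd`, the `instantiate` and `process` method names, and the id counter are hypothetical. The sketch also assumes one output stream per input stream, whereas the slide's example shows a single output stream (id 47), so treat this as an illustration of the phases rather than of sMAP's actual behaviour.

```python
import itertools

# Hypothetical source of fresh ids for distillate streams.
_next_id = itertools.count(100)


class Add:
    """Phase 1, specialize: bind the operator to its arguments."""

    def __init__(self, constant):
        self.constant = constant

    def instantiate(self, input_meta):
        """Phase 2, instantiate: bind to stream metadata."""
        return BoundAdd(self.constant, input_meta)


class BoundAdd:
    def __init__(self, constant, input_meta):
        self.constant = constant
        # Generate new metadata: copy the input tags, assign a new id.
        self.output_meta = [dict(m, id=next(_next_id)) for m in input_meta]

    def process(self, batches):
        """Phase 3, process: one list of [timestamp, value] per stream."""
        return [[[t, v + self.constant] for t, v in stream]
                for stream in batches]


op = Add(10)
bound = op.instantiate([{"id": 1, "unit": "C"}, {"id": 2, "unit": "Deg F"}])
out = bound.process([[[1337628952, 23]], [[1337628950, 70]]])
# out == [[[1337628952, 33]], [[1337628950, 80]]]
```

Separating instantiation from processing means the output metadata is known before any data flows, which is what lets a later stage bind to it.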

Operators are first class
  • Pass operators as arguments
    • window(mean, field="minute", width=15)
    • Apply the mean operator to windows in time 15 minutes long
    • This produces time vectors with a common timebase

apply sum(axis=1) < paste < window(count, field="minute", width=15)
to data in ("4/20/2012", "4/21/2012")
where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
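
A minimal sketch of an operator that takes another operator as an argument, assuming a stream is a list of (timestamp, value) pairs with integer timestamps in seconds. Aligning windows to multiples of the window length and labelling each window by its start time are assumptions of this example, not documented sMAP behaviour.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def count(values):
    return len(values)


def window(op, field="minute", width=15):
    """Return an operator that applies `op` to fixed-length time windows."""
    seconds = {"second": 1, "minute": 60, "hour": 3600}[field] * width

    def apply(series):
        buckets = defaultdict(list)
        for t, v in series:
            buckets[t - t % seconds].append(v)   # start of the window
        return [(start, op(vals)) for start, vals in sorted(buckets.items())]

    return apply


series = [(0, 1.0), (60, 3.0), (900, 5.0)]
means = window(mean, field="minute", width=15)(series)
counts = window(count, field="minute", width=15)(series)
# means  == [(0, 2.0), (900, 5.0)]
# counts == [(0, 2), (900, 1)]
```

Because every stream passed through the same `window` is labelled with the same window starts, the outputs share a common timebase and can be merged directly.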

Other operators work on streams
  • For instance, merging on timestamps
    • paste: (S, T, W) → (1, |unique(T)|, sum(W))

apply sum(axis=1) < paste < window(count, field="minute", width=15)
to data in ("4/20/2012", "4/21/2012")
where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
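
Merging on timestamps can be sketched as follows, assuming each stream is a list of (timestamp, value) pairs. The sketch keeps a single shared timestamp column followed by one value column per input stream and marks absent readings with `None`; both choices are assumptions, and the exact output width in sMAP may differ.

```python
def paste(streams):
    """Merge several streams into one, joined on timestamp.

    Output has one row per distinct timestamp across all inputs.
    """
    timestamps = sorted({t for s in streams for t, _ in s})
    lookups = [dict(s) for s in streams]
    return [[t] + [lk.get(t) for lk in lookups] for t in timestamps]


a = [(0, 1.0), (900, 2.0)]
b = [(900, 5.0), (1800, 6.0)]
pasted = paste([a, b])
# pasted == [[0, 1.0, None], [900, 2.0, 5.0], [1800, None, 6.0]]
```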

Specialization
  • Operators specialize their type based on args
    • sum(axis=0): (S, T, W) → (S, 1, W)
    • sum(axis=1): (S, T, W) → (S, T, 2)

apply sum(axis=1) < paste < window(count, field="minute", width=15)
to data in ("4/20/2012", "4/21/2012")
where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
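
A sketch of how one operator name can specialize into two differently shaped functions depending on its argument. A stream here is a list of rows `[timestamp, v1, v2, ...]`. The name `sum_op` is hypothetical, and so is the handling of the timestamp column: the sketch never adds timestamps together, and for `axis=0` it labels the single output row with the first timestamp.

```python
def sum_op(axis):
    """Specialize `sum` on its axis argument."""
    if axis == 0:
        def over_time(stream):
            # (T, W) -> (1, W): collapse time, keep one total per column.
            value_columns = list(zip(*stream))[1:]
            return [[stream[0][0]] + [sum(col) for col in value_columns]]
        return over_time
    if axis == 1:
        def over_columns(stream):
            # (T, W) -> (T, 2): keep the timestamp, add up the value columns.
            return [[row[0], sum(row[1:])] for row in stream]
        return over_columns
    raise ValueError("axis must be 0 or 1")


pasted = [[0, 1.0, 2.0], [900, 3.0, 4.0]]
totals_per_time = sum_op(axis=1)(pasted)
totals_per_column = sum_op(axis=0)(pasted)
# totals_per_time   == [[0, 3.0], [900, 7.0]]
# totals_per_column == [[0, 4.0, 6.0]]
```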

Operators can be pipelined
  • Since the outputs are just more streams

sum(axis=1) < paste < window(count, field="minute", width=15)

  • Each operator mutates the output metadata
    • By default, you get the intersection of the shared metadata

apply sum(axis=1) < paste < window(count, field="minute", width=15)
to data in ("4/20/2012", "4/21/2012")
where Metadata/Extra/System = 'datacenter' and Properties/UnitofMeasure = 'kW'
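
Pipelining and the default metadata rule can be sketched together. The helper names `pipeline` and `shared_metadata` are hypothetical. The sketch assumes flat metadata (string tags with hashable values) and assumes that in `a < b < c` data flows from right to left, so `c` runs first, as the ordering in the slides suggests.

```python
def shared_metadata(metas):
    """Keep only the tag/value pairs that appear in every input stream."""
    common = set(metas[0].items())
    for m in metas[1:]:
        common &= set(m.items())
    return dict(common)


def pipeline(*stages):
    """Compose stages written left to right as in `a < b < c`.

    The rightmost stage is applied to the data first.
    """
    def run(data):
        for stage in reversed(stages):
            data = stage(data)
        return data
    return run


metas = [
    {"Properties/UnitofMeasure": "kW", "Metadata/Location/Room": "410"},
    {"Properties/UnitofMeasure": "kW", "Metadata/Location/Room": "420"},
]
merged = shared_metadata(metas)
# merged == {"Properties/UnitofMeasure": "kW"}

double_then_increment = pipeline(lambda x: x + 1, lambda x: x * 2)
result = double_then_increment(3)
# result == 7
```

Taking the intersection means a merged stream never claims a tag, such as a room, that is true of only some of its inputs.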

Implementation
  • Integrated into query language for locating/selecting streams
    • Data exploration and selection (Winter '12)
  • Integrated with backend time-series database (readingdb)
    • Range queries of time-series data at line rates (Summer '11)
  • Streaming output to clients

Insights and key questions
  • Pushing tuples through operators is a good fit
    • But need to allow batching for efficiency
    • Heavy use of C libraries
  • Maintain provenance as metadata
  • Working with the time axis is different from the other axes
    • When do you produce a record?
  • Many optimizations are possible
    • Materialize subsampled view of data for interactive exploration
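
One way to picture the materialization optimization: precompute a few decimated copies of a series and answer each range query from the finest copy that stays under a point budget. The function names, the decimation-by-stride strategy and the subsampling factors below are assumptions for illustration; the slides do not say how sMAP would materialize these views.

```python
def materialize(series, factors=(1, 10, 100)):
    """Precompute decimated copies of a series, keyed by subsampling factor."""
    return {f: series[::f] for f in sorted(factors)}


def query(views, start, end, max_points):
    """Answer a range query from the finest view that fits in max_points.

    Falls back to the coarsest view if none fits.
    """
    for factor in sorted(views):
        points = [(t, v) for t, v in views[factor] if start <= t < end]
        if len(points) <= max_points:
            break
    return factor, points


series = [(t, float(t)) for t in range(1000)]
views = materialize(series)

wide_factor, wide = query(views, 0, 1000, max_points=200)
narrow_factor, narrow = query(views, 0, 50, max_points=200)
# wide_factor == 10 and len(wide) == 100
# narrow_factor == 1 and len(narrow) == 50
```

A zoomed-out plot is served from a coarse view and a zoomed-in plot from the raw data, so the amount of data sent to the client stays roughly constant.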

Status

Part of sMAP 2.0.300rc1!

http://code.google.com/p/smap-data
