PADS: Processing Arbitrary Data Streams

Kathleen Fisher, Robert Gruber

The big picture
  • Plethora of high-volume data streams, from which valuable information can be extracted.
    • Call-detail data, web logs, provisioning streams, tcpdump data, etc.
  • Desired operations:
    • Programmatic manipulation
    • Format translation (into XML, relational database, etc.)
    • Declarative interaction
      • Filtering, querying, aggregation, statistical profiling
Technical challenges
  • Data arrives “as is.”
    • Format determined by data source, not consumers.
    • Often has little documentation.
    • Some percentage of data is “buggy.”
  • Often streams have high volume, so parsers need to:
    • Detect relevant errors (without necessarily halting program)
    • Control how data is read (e.g. read header but skip body vs. read entire record).
  • Parsing routines must be written to support any of the desired operations.
Why not use C / Perl / shell scripts?

Problems with hand-coded parsers:

  • Writing them is time consuming and error prone.
  • Reading them a few months later is difficult.
  • Maintaining them in the face of even small format changes can be difficult.
  • Programs break in subtle and machine-specific ways (endianness, word sizes).
  • Such programs are often incomplete, particularly with respect to errors.
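The machine-specific breakage listed above can be illustrated with a small C sketch. It is not part of PADS, and the names `read_u32_be` and `read_u32_le` are hypothetical. Decoding a binary field one byte at a time gives the same value on every host, whereas casting the buffer to an integer pointer depends on the host's byte order and alignment rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not PADS functions. Each decodes a 32-bit field
 * one byte at a time, so the result does not depend on the host's
 * byte order, word size, or alignment rules. */
uint32_t read_u32_be(const unsigned char *buf) {
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

uint32_t read_u32_le(const unsigned char *buf) {
    assert(buf != NULL);
    return  (uint32_t)buf[0]        | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}
```

The same four bytes decode to different values depending on which convention the data source used, which is exactly the kind of detail a format description has to state explicitly.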
Solution: PADS System (In Progress)

One person writes declarative description of data source:

  • Physical format information
  • Semantic constraints.

Many people use PADS data description and generated library.

PADS system generates

  • C library interface for processing data.
    • Reading (original / binary / XML / …)
    • Writing (original / binary / XML / …)
    • Accumulators
  • Application for querying stream.
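The slides do not show what an accumulator looks like. As a rough sketch only, assuming an accumulator collects a statistical profile of one field (counts of good and bad values, minimum, maximum, mean), it could resemble the following. All names here (`u32_acc`, `u32_acc_add`, and so on) are hypothetical and are not the generated PADS interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator for one unsigned 32-bit field. */
typedef struct {
    uint64_t good;   /* values read without error */
    uint64_t bad;    /* values whose read reported an error */
    uint32_t min;
    uint32_t max;
    double   sum;    /* sum of good values, used for the mean */
} u32_acc;

void u32_acc_init(u32_acc *a) {
    assert(a != NULL);
    a->good = 0;
    a->bad = 0;
    a->min = UINT32_MAX;
    a->max = 0;
    a->sum = 0.0;
}

/* nerr is the error count reported when the value was read. */
void u32_acc_add(u32_acc *a, uint32_t v, int nerr) {
    assert(a != NULL);
    if (nerr != 0) {      /* erroneous values are counted, not profiled */
        a->bad++;
        return;
    }
    a->good++;
    if (v < a->min) a->min = v;
    if (v > a->max) a->max = v;
    a->sum += (double)v;
}

double u32_acc_mean(const u32_acc *a) {
    assert(a != NULL);
    return a->good ? a->sum / (double)a->good : 0.0;
}
```

Feeding every `contentLength` value of a log through such a structure would give a quick profile of the field, including how often it fails to parse.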
PADS language
  • Can describe ASCII, EBCDIC (Cobol), binary, and mixed data formats.
  • Allows arbitrary boolean constraint expressions to describe expected properties of data.
  • Type-based model: each type indicates how to read associated data.
  • Provides rich and extensible set of base types.
    • Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8
    • Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:)
  • Supports user-defined compound types to describe file structure:
    • Pstruct, Parray, Punion, Ptypedef, Penum
PADS compiler
  • Converts description to C header and implementation files.
  • For each built-in/user-defined type:
    • Functions (read, accumulate, write, test data generation)
    • In-memory representation
    • Error description
    • Mask (check constraints, set representation, suppress printing)
  • Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.
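To make the roles of mask, error description, and in-memory representation concrete, here is a hand-written C sketch for a single 3-digit field. It is an illustration only, not compiler output: the mask bits `M_SET` and `M_CHECK`, the type `resp_ed`, the function `resp_read`, and the constraint 100 to 599 are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

enum { M_SET = 1, M_CHECK = 2 };                 /* hypothetical mask bits */
enum { ERR_NONE = 0, ERR_SYNTAX = 1, ERR_CONSTRAINT = 2 };

typedef struct {
    int nerr;      /* number of errors found while reading */
    int errCode;   /* kind of the first error */
} resp_ed;

/* Read a 3-digit response code from s.
 * Returns the number of characters consumed (0 on a syntax error).
 * The assumed constraint is 100 <= code <= 599. */
size_t resp_read(const char *s, unsigned mask, resp_ed *ed, unsigned *rep) {
    unsigned v = 0;
    ed->nerr = 0;
    ed->errCode = ERR_NONE;
    for (size_t i = 0; i < 3; i++) {
        if (!isdigit((unsigned char)s[i])) {     /* also stops at end of string */
            ed->nerr = 1;
            ed->errCode = ERR_SYNTAX;
            return 0;
        }
        v = v * 10 + (unsigned)(s[i] - '0');
    }
    if ((mask & M_CHECK) && (v < 100 || v > 599)) {
        ed->nerr = 1;                            /* reported, but reading goes on */
        ed->errCode = ERR_CONSTRAINT;
    }
    if (mask & M_SET)
        *rep = v;                                /* fill the representation */
    return 3;
}
```

With `M_SET | M_CHECK` and `ed.nerr == 0`, the value stored in `rep` is guaranteed to satisfy the constraint, which mirrors the reading invariant. Dropping `M_CHECK` skips the constraint test, so no such guarantee holds.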
Example: CLF web log
  • Common Log Format from Web Protocols and Practice.
  • Fields:
    • IP address of remote host, either resolved (as in the example record) or symbolic
    • Remote identity (usually ‘-’ to indicate name not collected)
    • Authenticated user (usually ‘-’ to indicate name not collected)
    • Time associated with request
    • Request (request method, request-uri, and protocol version)
    • Response code
    • Content length

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
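For a sense of what a hand-coded parser for this record involves, the sketch below reads the fields with a single `sscanf` call. The type `clf_entry` and the function `clf_parse` are hypothetical names chosen for illustration, and the buffer sizes are arbitrary. Note how little it says about errors: it reports only success or failure, and it rejects a record whose content length is `-`, which the Common Log Format permits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one CLF record. */
typedef struct {
    char host[64], remote_id[64], auth[64], date[64];
    char method[16], uri[256], version[16];
    int response;
    unsigned long content_length;
} clf_entry;

/* Returns 1 if all nine fields were read, 0 otherwise. */
int clf_parse(const char *line, clf_entry *e) {
    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof *e);
    /* %63[^]] reads up to the closing bracket of the timestamp;
     * %15[^\"] reads the protocol version up to the closing quote. */
    int n = sscanf(line,
                   "%63s %63s %63s [%63[^]]] \"%15s %255s %15[^\"]\" %d %lu",
                   e->host, e->remote_id, e->auth, e->date,
                   e->method, e->uri, e->version,
                   &e->response, &e->content_length);
    return n == 9;
}
```

The format string is compact but hard to read and easy to get wrong, which is the maintenance problem a declarative description is meant to avoid.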

Example: CLF web log in PADS

Precord Pstruct http_weblog {
  host client;                    /- Client requesting service
  ' '; auth_id remoteID;          /- Remote identity
  ' '; auth_id auth;              /- Name of authenticated user
  " ["; Pdate(:']':) date;        /- Timestamp of request
  "] "; http_request request;     /- Request
  ' '; Puint16_FW(:3:) response;  /- 3-digit response code
  ' '; Puint32 contentLength;     /- Bytes in response
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

PADSL example: user constraint

int checkVersion(http_v version, method_t meth) {
  /* HTTP/1.1 requests may use any method. */
  if ((version.major == 1) && (version.minor == 1)) return 1;
  /* Any other version must not use LINK or UNLINK. */
  if ((meth == LINK) || (meth == UNLINK)) return 0;
  return 1;
}

Pstruct http_request {
  '\"'; method_t meth;            /- Request method
  ' '; Pstring(:' ':) req_uri;    /- Requested uri.
  ' '; http_v version : checkVersion(version, meth);
                                  /- HTTP version number of request
  '\"';
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

PADSL example: arrays and unions

Parray nIP {
  Puint8 [4] : Psep == '.';
};

Parray sIP {
  Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' ';
};

Punion host {
  nIP resolved;   /- 135.207.23.32
  sIP symbolic;   /- www.research.att.com
};

Punion auth_id {
  Pchar unauthorized : unauthorized == '-';
                  /- non-authenticated http session
  Pstring(:' ':) id;
                  /- login supplied during authentication
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
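The `host` union and the `nIP` array can be approximated by hand in C, which may help show what the description asks a parser to do. This is a simplified sketch under stated assumptions, not generated code: the names `host_t`, `read_nip`, and `read_host` are hypothetical, the branches are assumed to be tried in declaration order with the first success winning, and the symbolic branch is reduced to "everything up to the next space" rather than an array of dot-separated labels.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { HOST_RESOLVED, HOST_SYMBOLIC, HOST_ERROR } host_tag;

typedef struct {
    host_tag tag;          /* which branch matched */
    unsigned char ip[4];   /* valid only when tag == HOST_RESOLVED */
    size_t len;            /* characters consumed from the input */
} host_t;

/* Array branch: four decimal octets with '.' as separator.
 * Returns characters consumed, or 0 if the input does not match. */
static size_t read_nip(const char *s, unsigned char ip[4]) {
    size_t pos = 0;
    for (int i = 0; i < 4; i++) {
        unsigned v = 0;
        size_t digits = 0;
        while (digits < 3 && isdigit((unsigned char)s[pos])) {
            v = v * 10 + (unsigned)(s[pos] - '0');
            pos++;
            digits++;
        }
        if (digits == 0 || v > 255) return 0;   /* not a Puint8-style value */
        ip[i] = (unsigned char)v;
        if (i < 3) {                            /* separator between elements */
            if (s[pos] != '.') return 0;
            pos++;
        }
    }
    return pos;
}

/* Union: try the numeric branch first, then fall back to symbolic. */
host_t read_host(const char *s) {
    host_t h = { HOST_ERROR, {0, 0, 0, 0}, 0 };
    assert(s != NULL);
    size_t n = read_nip(s, h.ip);
    if (n > 0 && (s[n] == ' ' || s[n] == '\0')) {
        h.tag = HOST_RESOLVED;
        h.len = n;
        return h;
    }
    n = 0;
    while (s[n] != '\0' && s[n] != ' ') n++;    /* terminator is a space */
    if (n > 0) {
        h.tag = HOST_SYMBOLIC;
        h.len = n;
    }
    return h;
}
```

Roughly forty lines of C stand in for the two short `Parray` and `Punion` declarations, and this version still reports nothing about where or why a branch failed.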

Generated type declarations

typedef struct {
  host client;          /* Client requesting service */
  auth_id remoteID;     /* Remote identity */
} http_weblog;

typedef struct {
  host_m client;
  auth_id_m remoteID;
} http_weblog_m;

typedef struct {
  int nerr;
  int errCode;
  PDC_loc loc;
  int panic;
  host_ed client;
  auth_id_ed remoteID;
  …;
} http_weblog_ed;

Sample use

PDC_t *pdc;
http_weblog entry;
http_weblog_m mask;
http_weblog_ed ed;

PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);
PDC_IO_fopen(pdc, fileName);
... call init functions ...
http_weblog_mask(&mask, PCheck & PSet);

while (!PDC_IO_at_EOF(pdc)) {
  http_weblog_read(pdc, &mask, &ed, &entry);
  if (ed.nerr != 0) { ... Error handling ... }
  ... Process/query entry ...
}

... call cleanup functions ...
PDC_IO_fclose(pdc);
PDC_close(pdc);

Related work
  • ASN.1, ASDL
    • Describe logical representation, generate physical.
  • DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]
    • Binary only
    • Stop on first error
PADS to do
  • Allow library generation to be customized with application-specific information:
    • Repair errors, ignore certain fields, customize in-memory representation, etc.
  • Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel).
  • Support data translation
    • Requires mapping from one in-memory representation to another.
  • Develop user-base and integrate feedback.
    • What would you want in such a tool?
Getting PADS

PADS will be available shortly for download with a non-commercial-use license.

http://www.research.att.com/projects/pads
