PADS:
This presentation is the property of its rightful owner.
Sponsored Links
1 / 16

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on
  • Presentation posted in: General

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber. The big picture. Plethora of high-volume data streams, from which valuable information can be extracted. Call-detail data, web logs, provisioning streams, tcpdump data, etc . Desired operations:

Download Presentation

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Pads processing arbitrary data streams kathleen fisher robert gruber

PADS:

Processing Arbitrary Data Streams

Kathleen Fisher

Robert Gruber


The big picture

The big picture

  • Plethora of high-volume data streams, from which valuable information can be extracted.

    • Call-detail data, web logs, provisioning streams, tcpdump data, etc.

  • Desired operations:

    • Programmatic manipulation

    • Format translation (into XML, relational database, etc.)

    • Declarative interaction

      • Filtering, querying, aggregation, statistical profiling


Technical challenges

Technical challenges

  • Data arrives “as is.”

    • Format determined by data source, not consumers.

    • Often has little documentation.

    • Some percentage of data is “buggy.”

  • Often streams have high volume.

    • Detect relevant errors (without necessarily halting program)

    • Control how data is read (e.g. read header but skip body vs. read entire record).

  • Parsing routines must be written to support any of the desired operations.


Why not use c perl shell scripts

Why not use C / Perl / Shell scripts… ?

Problems with hand-coded parsers:

  • Writing them is time consuming and error prone.

  • Reading them a few months later is difficult.

  • Maintaining them in the face of even small format changes can be difficult.

  • Programs break in subtle and machine-specific ways (endien-ness, word-sizes).

  • Such programs are often incomplete, particularly with respect to errors.


Solution pads system in progress

Solution: PADS System (In Progress)

One person writes declarative description of data source:

  • Physical format information

  • Semantic constraints.

    Many people use PADS data description and generated library.

    PADS system generates

  • C library interface for processing data.

    • Reading ( original / binary / XML / …)

    • Writing ( original / binary / XML / … )

    • Accumulators

  • Application for querying stream.


Pads language

PADS language

  • Can describe ASCII, EBCDIC (Cobol) , binary, and mixed data formats.

  • Allows arbitrary boolean constraint expressions to describe expected properties of data.

  • Type-based model: each type indicates how to read associated data.

  • Provides rich and extensible set of base types.

    • Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8

    • Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:)

  • Supports user-defined compound types to describe file structure:

    • Pstruct, Parray, Punion, Ptypedef, Penum


Pads compiler

PADS compiler

  • Converts description to C header and implementation files.

  • For each built-in/user-defined type:

    • Functions (read, accumulate, write, test data generation)

    • In-memory representation

    • Error description

    • Mask (check constraints, set representation, suppress printing)

  • Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.


Example clf web log

Example: CLF web log

  • Common Log Format from Web Protocols and Practice.

  • Fields:

    • IP address of remote host, either resolved (as above) or symbolic

    • Remote identity (usually ‘-’ to indicate name not collected)

    • Authenticated user (usually ‘-’ to indicate name not collected)

    • Time associated with request

    • Request (request method, request-uri, and protocol version)

    • Response code

    • Content length

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013


Example clf web log in pads

Example: CLF web log in PADS

PrecordPstruct http_weblog {

host client; /- Client requesting service

' '; auth_id remoteID; /- Remote identity

' '; auth_id auth; /- Name of authenticated user

“ [”; Pdate(:']':) date; /- Timestamp of request

“] ”; http_request request; /- Request

' '; Puint16_FW(:3:) response; /- 3-digit response code

' '; Puint32 contentLength; /- Bytes in response

};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013


Padsl example user constraint

PADSL example: user constraint

int checkVersion(http_v version, method_t meth) {

if ((version.major == 1) && (version.minor == 1)) return 1;

if ((meth == LINK) || (meth == UNLINK)) return 0;

return 1;

}

Pstruct http_request {

'\"'; method_t meth; /- Request method

' '; Pstring(:' ':) req_uri; /- Requested uri.

' '; http_v version : checkVersion(version, meth);

/- HTTP version number of request

'\"';

};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013


Padsl example arrays and unions

PADSL example: arrays and unions

ParraynIP {

Puint8 [4] : Psep == '.';

};

Parray sIP {

Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' ';

}

Punionhost {

nIP resolved; /- 135.207.23.32

sIP symbolic; /- www.research.att.com

};

Punionauth_id {

Pchar unauthorized : unauthorized == '-';

/- non-authenticated http session

Pstring(:' ':) id;

/- login supplied during authentication

};

207.136.97.50- - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013


Generated type declarations

Generated type declarations

typedefstruct {

host client; /* Client requesting service */

auth_id remoteID; /* Remote identity */

} http_weblog;

typedefstruct {

host_m client;

auth_id_m remoteID;

} http_weblog_m;

typedefstruct {

int nerr;

int errCode;

PDC_loc loc;

int panic;

host_ed client;

auth_id_ed remoteID;

…;

} http_weblog_ed;


Sample use

Sample use

PDC_t *pdc;

http_weblog entry;

http_weblog_m mask;

http_weblog_ed ed;

PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */);

PDC_IO_fopen(pdc, fileName);

... call init functions ...

http_weblog_mask(&mask, PCheck & PSet);

while (!PDC_IO_at_EOF(pdc)) {

http_weblog_read(pdc, &mask, &ed, &entry);

if (ed.nerr != 0) { ... Error handling ... }

... Process/query entry ...

};

... call cleanup functions ...

PDC_IO_fclose(pdc);

PDC_close(pdc);


Related work

Related work

  • ASN.1, ASDL

    • Describe logical representation, generate physical.

  • DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]

    • Binary only

    • Stop on first error


Pads to do

PADS to do

  • Allow library generation to be customized with application-specific information:

    • Repair errors, ignore certain fields, customize in-memory representation, etc.

  • Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel).

  • Support data translation

    • Requires mapping from one in-memory representation to another.

  • Develop user-base and integrate feedback.

    • What would you want in such a tool?


Getting pads

Getting PADS

PADS will be available shortly for download with a non-commercial-use license.

http://www.research.att.com/projects/pads


  • Login