1 / 16

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber. The big picture. Plethora of high-volume data streams, from which valuable information can be extracted. Call-detail data, web logs, provisioning streams, tcpdump data, etc . Desired operations:

thelma
Download Presentation

PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. PADS: Processing Arbitrary Data Streams Kathleen Fisher Robert Gruber

  2. The big picture • Plethora of high-volume data streams, from which valuable information can be extracted. • Call-detail data, web logs, provisioning streams, tcpdump data, etc. • Desired operations: • Programmatic manipulation • Format translation (into XML, relational database, etc.) • Declarative interaction • Filtering, querying, aggregation, statistical profiling

  3. Technical challenges • Data arrives “as is.” • Format determined by data source, not consumers. • Often has little documentation. • Some percentage of data is “buggy.” • Often streams have high volume. • Detect relevant errors (without necessarily halting program) • Control how data is read (e.g. read header but skip body vs. read entire record). • Parsing routines must be written to support any of the desired operations.

  4. Why not use C / Perl / Shell scripts… ? Problems with hand-coded parsers: • Writing them is time consuming and error prone. • Reading them a few months later is difficult. • Maintaining them in the face of even small format changes can be difficult. • Programs break in subtle and machine-specific ways (endien-ness, word-sizes). • Such programs are often incomplete, particularly with respect to errors.

  5. Solution: PADS System (In Progress) One person writes declarative description of data source: • Physical format information • Semantic constraints. Many people use PADS data description and generated library. PADS system generates • C library interface for processing data. • Reading ( original / binary / XML / …) • Writing ( original / binary / XML / … ) • Accumulators • … • Application for querying stream.

  6. PADS language • Can describe ASCII, EBCDIC (Cobol) , binary, and mixed data formats. • Allows arbitrary boolean constraint expressions to describe expected properties of data. • Type-based model: each type indicates how to read associated data. • Provides rich and extensible set of base types. • Pa_uint8, Pa_int8, Pa_uint16, …, Pe_uint8, …, Pb_int8, …, Pint8 • Pstring(:term-char:), Pstring_FW(:size:), Pstring_RE(:reg_exp:) • Supports user-defined compound types to describe file structure: • Pstruct, Parray, Punion, Ptypedef, Penum

  7. PADS compiler • Converts description to C header and implementation files. • For each built-in/user-defined type: • Functions (read, accumulate, write, test data generation) • In-memory representation • Error description • Mask (check constraints, set representation, suppress printing) • Reading invariant: If mask is check and set and error description reports no errors, then in-memory representation satisfies all constraints in data description.

  8. Example: CLF web log • Common Log Format from Web Protocols and Practice. • Fields: • IP address of remote host, either resolved (as above) or symbolic • Remote identity (usually ‘-’ to indicate name not collected) • Authenticated user (usually ‘-’ to indicate name not collected) • Time associated with request • Request (request method, request-uri, and protocol version) • Response code • Content length 207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

  9. Example: CLF web log in PADS PrecordPstruct http_weblog { host client; /- Client requesting service ' '; auth_id remoteID; /- Remote identity ' '; auth_id auth; /- Name of authenticated user “ [”; Pdate(:']':) date; /- Timestamp of request “] ”; http_request request; /- Request ' '; Puint16_FW(:3:) response; /- 3-digit response code ' '; Puint32 contentLength; /- Bytes in response }; 207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

  10. PADSL example: user constraint int checkVersion(http_v version, method_t meth) { if ((version.major == 1) && (version.minor == 1)) return 1; if ((meth == LINK) || (meth == UNLINK)) return 0; return 1; } Pstruct http_request { '\"'; method_t meth; /- Request method ' '; Pstring(:' ':) req_uri; /- Requested uri. ' '; http_v version : checkVersion(version, meth); /- HTTP version number of request '\"'; }; 207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

  11. PADSL example: arrays and unions ParraynIP { Puint8 [4] : Psep == '.'; }; Parray sIP { Pstring(:"[. ]":) [] : Psep == '.' && Pterm == ' '; } Punionhost { nIP resolved; /- 135.207.23.32 sIP symbolic; /- www.research.att.com }; Punionauth_id { Pchar unauthorized : unauthorized == '-'; /- non-authenticated http session Pstring(:' ':) id; /- login supplied during authentication }; 207.136.97.50- - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

  12. Generated type declarations typedefstruct { host client; /* Client requesting service */ auth_id remoteID; /* Remote identity */ … } http_weblog; typedefstruct { host_m client; auth_id_m remoteID; … } http_weblog_m; typedefstruct { int nerr; int errCode; PDC_loc loc; int panic; host_ed client; auth_id_ed remoteID; …; } http_weblog_ed;

  13. Sample use PDC_t *pdc; http_weblog entry; http_weblog_m mask; http_weblog_ed ed; PDC_open(&pdc, 0 /* PADS disc */, 0 /* PADS IO disc */); PDC_IO_fopen(pdc, fileName); ... call init functions ... http_weblog_mask(&mask, PCheck & PSet); while (!PDC_IO_at_EOF(pdc)) { http_weblog_read(pdc, &mask, &ed, &entry); if (ed.nerr != 0) { ... Error handling ... } ... Process/query entry ... }; ... call cleanup functions ... PDC_IO_fclose(pdc); PDC_close(pdc);

  14. Related work • ASN.1, ASDL • Describe logical representation, generate physical. • DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000] • Binary only • Stop on first error

  15. PADS to do • Allow library generation to be customized with application-specific information: • Repair errors, ignore certain fields, customize in-memory representation, etc. • Explore declarative querying via integration with XQuery (joint work with Mary Fernandez and Ricardo Medel). • Support data translation • Requires mapping from one in-memory representation to another. • Develop user-base and integrate feedback. • What would you want in such a tool?

  16. Getting PADS PADS will be available shortly for download with a non-commercial-use license. http://www.research.att.com/projects/pads

More Related