
Ad Hoc Data: From Uggh to Smug

David Walker, Princeton University


Presentation Transcript


  1. Ad Hoc Data: From Uggh to Smug
     David Walker, Princeton University
       00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
       00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
       00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
       00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
       00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
       00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
       00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
       00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
       00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............

  2. Ad Hoc Data is Everywhere
     • Lots of data in databases ==> even more data that isn’t
     • Ad Hoc Data: sets of semi-structured data files for which standard data processing tools are unavailable
     • Tasks: “getting the data into a database” (and other kinds of transformations), data cleaning, querying, editing, parsing, ...
     • Troubles: error prone, limited documentation, evolving formats, huge volume, ...
     • Examples: router configs, network monitoring, web logs, billing info, cosmology data

  3. Two New Systems
     • Anne: A “Mark-up Language” for Ad Hoc Data [PLDI 2010]
       • with Qian Xi (Princeton)
     • Forest: A Language for Specifying Environmental Assumptions
       • with Kathleen Fisher (AT&T), Nate Foster (Princeton), Kenny Zhu (Shanghai Jiao Tong University)

  4. Anne: A Context-free Mark-up Language for Ad Hoc Data [PLDI 2010]
     Qian Xi

  5. The Problem
     What is the fastest, most reliable way to go from data like this:
       207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
       polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
       ...
     to a parse tree like this:
       [Figure: parse tree with nodes EntryList, Entry, IP, Message, Sort, URL, Protocol, Code, Size over the leaves 207.136.97.49, GET, /turkey/amnty1.gif, HTTP/1.0, 200, 3013]
     and generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...

  6. Our Solution: Anne
     • Develop a “mark-up language” for ordinary text
       • programmers annotate raw text using a set of “grammatical directives”
       • a simple, predictable algorithm generates a complete grammar & processing tools from directives + the surrounding raw data
     Pros:
     • really easy to use
       • directives are simple -- applied when & where needed
       • you can do it at 3am
     • predictable
     • documentation and tools may be generated automatically
     Cons:
     • not completely automatic
       • but I’m skeptical any other more magical bullet exists anyway

  7. Document:
       207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
       207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
       polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
       152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
       ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
       ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
     Generated Grammar:

  8. Document: (edit document to add directives)
       {Entry:207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
       207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
       polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
       152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
       ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
       ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
     Generated Grammar:
       Entry ::= int . int . int . int ' ' - ' ' - ' ' '"' word ... int ' ' int
     • Default tokenization of tagged data
     • Non-terminal name drawn from directive
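
The default tokenization step can be sketched in Python as follows. The token classes (int, word, quoted literal) follow the grammar on the slide; the names `default_tokens` and `rule_for` and the exact regular expressions are assumptions made for illustration, not Anne's actual lexer.

```python
import re

# Hypothetical token classes; Anne's real default tokenizer may differ.
TOKEN_RE = re.compile(r"(?P<int>\d+)|(?P<word>[A-Za-z]+)|(?P<lit>.)", re.DOTALL)

def default_tokens(text):
    """Map raw text to grammar symbols: 'int', 'word', or a quoted literal."""
    symbols = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup == "lit":
            symbols.append(repr(m.group()))  # e.g. ' ' or '.'
        else:
            symbols.append(m.lastgroup)
    return symbols

def rule_for(name, text):
    """Render one grammar rule from a tagged span {name:text}."""
    return f"{name} ::= " + " ".join(default_tokens(text))
```

For instance, `rule_for("Entry", "200 3013")` yields `Entry ::= int ' ' int`, which is the tail of the rule shown above.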

  9. Document: (second directive)
       {Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
       207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
       polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
       152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
       ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
       ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
     Generated Grammar:
       ID ::= '-'
       Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int
     • New grammar rule (ID)
     • Default grammar now includes new non-terminal

  10. Document: (multiple identical name occurrences imply union of grammars)
        {Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
        207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
        polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
        152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
        ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
        ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
      Generated Grammar:
        ID ::= '-' + word
        Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int
      • ID is now a union of grammars
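
One way to picture the union rule is as a table from non-terminal names to lists of alternatives. The following Python sketch assumes that representation; `add_directive` and `render` are hypothetical helper names, and the `+` separator mirrors the slide's notation.

```python
def add_directive(grammar, name, body):
    """Record one alternative for `name`; repeated names accumulate as a union."""
    alternatives = grammar.setdefault(name, [])
    if body not in alternatives:  # identical alternatives are kept once
        alternatives.append(body)
    return grammar

def render(grammar):
    """Print each rule with its alternatives joined by '+', as on the slide."""
    return [f"{name} ::= " + " + ".join(alts) for name, alts in grammar.items()]
```

Tagging `-` and then `amnesty` (tokenized as `word`) with the name ID produces the single rule `ID ::= '-' + word`.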

  11. Document: (= denotes presence of constant string)
        {Entry:207.136.97.49 - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
        207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
        polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
        152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
        ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
        ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
      Generated Grammar:
        ID ::= '-' + word
        Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int

  12. Document: ($ directs the system to infer a terminating symbol; here a space follows the closing brace)
        {Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
        207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
        polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
        152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
        ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
        ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
      Generated Grammar:
        Loc ::= {[^ ]*}
        ID ::= '-' + word
        Entry ::= Loc ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int
      • Loc matches any string terminated by a space
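
The inferred-terminator idea can be illustrated with a small Python sketch: take the character that follows the closing brace and build a pattern for "anything up to that character". The function name `terminated_by` is an assumption for illustration; how Anne actually derives the rule is not shown on the slide.

```python
import re

def terminated_by(next_char):
    """Build a pattern matching any run of characters other than next_char."""
    return re.compile("[^" + re.escape(next_char) + "]*")

# A space follows the closing brace of {Loc$:...}, so Loc becomes [^ ]*.
loc = terminated_by(" ")
```

With this pattern, both numeric addresses and host names such as `polux.entel.net` match Loc, which the earlier `int . int . int . int` rule did not allow.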

  13. Interjection: The Config File
      • A config file provides a mechanism for defining regular expressions and giving them names
        • def is an internal definition
        • exp is an exported named regular expression
      • The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ...)
      default.config:
        def db [0-9][0-9]
        def zone [+-][0-1][0-9]00
        def ampm am\|AM\|pm\|PM
        def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
        ...
        exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
        exp IP {trip}\.{trip}\.{trip}\.{trip}
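
To make the def/exp distinction concrete, here is a hedged Python sketch that loads the entries shown on the slide and expands `{name}` references. The translation of `\|`, `\(` and `\)` into Python's `|`, `(?:` and `)` is an assumed mapping, and `load_config` is a hypothetical name; the real config loader may behave differently.

```python
import re

CONFIG = r"""
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
"""

def load_config(text):
    """Return {name: compiled regex} for the exported (exp) entries only."""
    defs, exported = {}, {}
    for line in text.strip().splitlines():
        kind, name, body = line.split(None, 2)
        # Assumed mapping from the slide's operators to Python regex syntax.
        body = body.replace(r"\|", "|").replace(r"\(", "(?:").replace(r"\)", ")")
        # Inline earlier definitions referenced as {name}.
        body = re.sub(r"\{(\w+)\}", lambda m: "(?:" + defs[m.group(1)] + ")", body)
        defs[name] = body
        if kind == "exp":
            exported[name] = re.compile(body)
    return exported
```

Only Time and IP are visible to directives in this sketch; db, zone, ampm and trip stay internal helpers.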

  14. Document: (pre-defined token)
        {Entry:{IP:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gi .... 200 3013}
        207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
        polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
        152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
        ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
        ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
      Generated Grammar:
        IP ::= ... from config file ...
        ID ::= '-' + word
        Entry ::= IP ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int
      • Definition of IP drawn from config file

  15. XML Generation & Debugging

  16. Other Features
      • Most features inspired by similar constructs found in PADS
        • Enumerations
        • Recursion (context-freedom)
        • Kleene Star (with optional element definitions, separators, and terminators)
        • Options
        • Prioritized Unions
        • Assertions
        • Tables
      • Generated Artifacts:
        • PADS description (and from there, the PADS tool suite)
        • XML & CSS for debugging
      • Semantics: connections to Relevance Logic [see PLDI 10]

  17. Repetition (1): Kleene Star with elements separated by '|' and defined by first element
        {Record*[|]:9152271|9152271|1|0|0|0|0|1}
        Elem ::= int
        Record ::= (Elem ('|' Elem)* )?
      Repetition (2): Kleene Star with elements separated by '|' and defined by Item
        {Record/Item*[|]:9152271|{Item:9152271}|1|0|0|0|0|1}
        Item ::= int
        Record ::= (Item ('|' Item)* )?
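
A rule of the shape `Record ::= (Elem ('|' Elem)* )?` can be recognized with a few lines of Python. This is a sketch under assumptions: `parse_separated` is a hypothetical name, and the element pattern is passed in as an ordinary Python regex rather than derived from a directive.

```python
import re

def parse_separated(text, sep="|", elem=r"\d+"):
    """Parse (Elem (sep Elem)*)? and return the elements, or raise ValueError."""
    if text == "":
        return []  # the outer '?' allows an empty record
    items = text.split(sep)
    elem_re = re.compile(elem)
    for item in items:
        if not elem_re.fullmatch(item):
            raise ValueError(f"bad element: {item!r}")
    return items
```

Passing `elem=r"\d*"` makes each element optional, so a record with empty fields such as `1|0||0||1` is accepted as well.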

  18. Optional Data: ? denotes optional data
        {Record/Item*[|]:9152271|{Item?:9152271}|1|0||0||1}   (note the missing elements)
        Item ::= int?
        Record ::= (Item ('|' Item)* )?
      Assertions & Context-Freedom: ! claims underlying data will satisfy nonterminal Parens
        {Parens?:({Parens!:(((())))})}
        Parens ::= ('(' Parens ')')?
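
The recursive rule `Parens ::= ('(' Parens ')')?` is what takes the language beyond regular expressions. A minimal recursive-descent recognizer in Python, written for illustration only (the function names are hypothetical and this is not how the generated parser is implemented):

```python
def parse_parens(s, i=0):
    """Recognize Parens ::= ('(' Parens ')')? from index i; return the index after it."""
    if i < len(s) and s[i] == "(":
        j = parse_parens(s, i + 1)  # the nested Parens
        if j < len(s) and s[j] == ")":
            return j + 1
        raise ValueError(f"expected ')' at position {j}")
    return i  # the empty alternative

def is_parens(s):
    """True if the whole string is derivable from Parens."""
    try:
        return parse_parens(s) == len(s)
    except ValueError:
        return False
```

Note that the rule only describes nesting: `(((())))` is accepted, while a sequence such as `()()` is not.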

  19. Table (1)
        {E#:Jason Blake, 78 25 38 63 -2
        Alexei Ponikarovsky, 82 23 38 61 6
        ...}
        Row ::= Word ' ' Word ',' '\t' int ...
        Record ::= Row (NL Row)*
      Table (2)
        {E#h:Name GP Goals Assists Points +/-
        Jason Blake, 78 25 38 63 -2
        Alexei Ponikarovsky, 82 23 38 61 6
        ...}
        Row ::= ...
        Header ::= 'Name' '\t' ...
        Record ::= Header NL Row*

  20. Forest: A Specification Language for Environmental Assumptions [work in progress!]
      Kathleen Fisher, Nate Foster, Kenny Zhu

  21. PADS Web Site

  22. Various causes for errors: • Missing files • Directories/files in wrong locations • Wrong permissions • Links to wrong targets

  23. If only we could... • Describe required file and directory structure, including permissions, etc. • Check that the actual file system matches the spec. • Eliminate a whole class of errors!
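
As a rough illustration of what such a check could look like, the Python sketch below compares a directory tree against a small table of expectations and reports the error classes listed earlier (missing files, wrong kind, wrong permissions, wrong link targets). The spec format, the paths in `SPEC`, and the function name `check` are all assumptions invented for this example; they are not Forest syntax.

```python
import os
import stat

# Hypothetical spec format: relative path -> expectations.
SPEC = {
    "conf": {"kind": "dir"},
    "conf/site.cfg": {"kind": "file", "mode": 0o644},
    "current": {"kind": "link", "target": "conf"},
}

def check(root, spec):
    """Return a list of human-readable mismatches between `root` and `spec`."""
    errors = []
    for rel, want in spec.items():
        path = os.path.join(root, rel)
        if not os.path.lexists(path):
            errors.append(f"missing: {rel}")
            continue
        st = os.lstat(path)  # lstat so that links are not followed
        kind = ("link" if stat.S_ISLNK(st.st_mode)
                else "dir" if stat.S_ISDIR(st.st_mode) else "file")
        if kind != want["kind"]:
            errors.append(f"wrong kind: {rel} is {kind}, expected {want['kind']}")
        elif kind == "link" and os.readlink(path) != want["target"]:
            errors.append(f"wrong link target: {rel}")
        elif "mode" in want and stat.S_IMODE(st.st_mode) != want["mode"]:
            errors.append(f"wrong permissions: {rel}")
    return errors
```

An empty result means the file system fragment conforms to the table; anything else names the offending path.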

  24. CORAL Monitoring System • Monitoring system for an “Internet-scale, self-organizing, web-content distribution network” developed by Mike Freedman, Princeton.

  25. Observations on Monitoring
      • Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
      • Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
      • File and directory names encode meta-data.
      • Want to ask questions such as:
        • what was the total load on planetlab1 last week?
        • on what days and at what times are files missing?
        • what is the maximum memory usage?
      • Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)

  26. Other Possible Examples
      • Filesystem Hierarchy Standard (FHS) for Unix-like installations
      • Haskell code base, PADS Source Tree
        • source code, data, examples, executables, ...
      • Cabal system for GHC libraries
      • Disk cache for browser history, IMAP mail
      • Scientific data sets
      • CVS, SVN, other source control systems

  27. To Do!
      • We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
        • specify files
        • directory structure
        • dependencies (config files determine file system structure)
        • meta-data (permissions, sizes, owners, modification times)
      • The Plan
        • Build such a specification language on top of PADS
        • Generate a checker from the specifications
        • Interface that allows programs to slurp up specified data from the file system
        • Stand-alone tools: query engine, monitor, etc...

  28. Back to CORAL

  29. Example: CORAL
        ptype conf_t = ... {- pads description -}
        ptype corald_t = ... {- pads description -}
        ptype dns_t = ... {- pads description -}
        ptype web_t = ... {- pads description -}
        ptype probe_t = ... {- pads description -}

  30. Example: CORAL
        ptype conf_t = ... {- pads description -}
        ptype corald_t = ... {- pads description -}
        ptype dns_t = ... {- pads description -}
        ptype web_t = ... {- pads description -}
        ptype probe_t = ... {- pads description -}
        ptype date_d(t::pdate) = pdirectory {
          corald is "corald.log" :: corald_t <| timestamp >= t |>;
          coraldns is "nssrv.log" :: dns_t <| timestamp >= t |>;
          coralweb is "websrv.log" :: web_t <| timestamp >= t |>;
          probe is "probed.log" :: probe_t <| timestamp >= t |>;
          time :: pdate = t;
        }

  31. Example: CORAL
        ptype conf_t = ... {- pads description -}
        ptype corald_t = ... {- pads description -}
        ptype dns_t = ... {- pads description -}
        ptype web_t = ... {- pads description -}
        ptype probe_t = ... {- pads description -}
        ptype date_d(t::pdate) = pdirectory { ... as before ... }
        ptype host_d = pdirectory {
          times is [t::date_d(t) | t <- pdate];
        }

  32. Example: CORAL
        ptype conf_t = ... {- pads description -}
        ptype corald_t = ... {- pads description -}
        ptype dns_t = ... {- pads description -}
        ptype web_t = ... {- pads description -}
        ptype probe_t = ... {- pads description -}
        ptype host_d(h::phostname, t::pdate) = pdirectory { ... as before ... }
        ptype host_d () = pdirectory {
          hosts is [t::date_d(t) | t <- pdate];
        }
        ptype coral_d () = pdirectory {
          hostNames is "Config" :: conf_t;
          hosts is [h::host_d | h <- hostNames];
        }
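
To show how such a description reads as a program over the file system, here is a Python sketch of the same shape: a Config file names the hosts, each host has one directory, and each date-named subdirectory should hold the four log files. Several details are assumptions made only for this example: Config is taken to list one host name per line (the slide elides conf_t), date directories are taken to look like YYYY-MM-DD, and the `timestamp >= t` constraints are not checked. `load_coral` is a hypothetical name and nothing here is generated by Forest.

```python
import os
import re

LOGS = ["corald.log", "nssrv.log", "websrv.log", "probed.log"]
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed on-disk form of pdate

def load_coral(root):
    """Return ({host: {date: [logs present]}}, errors) for a CORAL-style tree."""
    errors, tree = [], {}
    # hostNames is "Config" :: conf_t   (assumed: one host name per line)
    with open(os.path.join(root, "Config")) as f:
        host_names = [line.strip() for line in f if line.strip()]
    # hosts is [h :: host_d | h <- hostNames]
    for h in host_names:
        host_dir = os.path.join(root, h)
        if not os.path.isdir(host_dir):
            errors.append(f"missing host directory: {h}")
            continue
        tree[h] = {}
        # [t :: date_d(t) | t <- pdate]: entries whose names parse as dates
        for t in sorted(os.listdir(host_dir)):
            if not DATE_RE.fullmatch(t):
                continue
            present = []
            for name in LOGS:
                if os.path.isfile(os.path.join(host_dir, t, name)):
                    present.append(name)
                else:
                    errors.append(f"missing: {h}/{t}/{name}")
            tree[h][t] = present
    return tree, errors
```

The dependency is the interesting part: the set of host directories to look for is not fixed in the description but read from the contents of Config.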

  33. Current & Future Plans
      • Designing a semantics based on a classical logic of trees
        • We considered using one of the substructural (“separating”) tree logics but we discarded it as the substructural logics gave us the wrong defaults & made the system harder to design and understand (especially in the presence of parent pointers)
      • Building a “file system parser” & tool generation infrastructure in Haskell
        • Leverage type-directed programming.
        • Leverage laziness in loading structures.
      • Envision a collection of file system management tools based on descriptions
        • valid -desc d -- check for conformance to d
        • ls -desc d -- list files described by d
        • grep pattern -desc d -- grep for pattern in files described by d
        • mv -desc d foo bar -- move files described by d rooted at foo to bar
      • Thinking about a query engine & continuous monitoring system
      • Considering extensions to handle other elements of the programming environment: environment variables

  34. The End
