
From Dirt to Shovels: Inferring PADS descriptions from ASCII Data

Presentation Transcript


  1. From Dirt to Shovels: Inferring PADS descriptions from ASCII Data. Kathleen Fisher, David Walker, Peter White, Kenny Zhu. December 2008.

  2. Learning: Goals & Approach. [Diagram: raw data (email, ASCII log files, binary traces, CSV, XML) flows through the learning system to a data description (struct { ... }), standard formats & schemas, and end-user tools.] Problem: producing useful tools for ad hoc data takes a lot of time. Solution: a learning system to generate data descriptions and tools automatically.

  3. PADS Reminder. Inferred data formats are described using PADS types.
  • Base types describe atomic data:
    • Pint8, Puint8, …    // -123, 44
    • Pstring(:‘|’:)      // hello |
    • Pstring_FW(:3:)     // catdog
    • Pdate, Ptime, Pip, …
  • Type constructors describe structured data:
    • sequences: Pstruct, Parray
    • choices: Punion, Penum, Pswitch
    • constraints: arbitrary predicates to describe expected properties.
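
  To make the constructors concrete, here is a minimal Python sketch of an in-memory representation for inferred types. The class names and fields are illustrative assumptions, not the PADS implementation's actual IR.

      from dataclasses import dataclass
      from typing import List, Optional, Union

      @dataclass
      class Base:
          name: str                  # atomic data, e.g. "Pint8", "Pdate", "Pstring(:'|':)"

      @dataclass
      class Struct:
          fields: List["Ty"]         # a fixed sequence of fields (Pstruct)

      @dataclass
      class Choice:
          branches: List["Ty"]       # alternatives (Punion)

      @dataclass
      class Array:
          elem: "Ty"                 # repeated element (Parray)
          sep: Optional[str] = None  # separator literal, e.g. "|"

      Ty = Union[Base, Struct, Choice, Array]

      # An "int, int" record: an integer, a comma literal, and an integer.
      row = Struct([Base("Pint8"), Base("','"), Base("Pint8")])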

  4. Format inference overview. Pipeline: Raw Data → Chunking Process → Tokenization → Structure Discovery → Format Refinement (driven by a Scoring Function) → IR to PADS Printer → PADS Description → PADS Compiler, which generates tools such as an XMLifier (producing XML) and an Accumulator (producing an Analysis Report).

  5. Chunking Process. • Convert raw input into a sequence of “chunks.” • Supported divisions: various forms of “newline”; file boundaries. • Also possible: user-defined “paragraphs.”
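
  A minimal sketch of this step, assuming newline- and blank-line-based divisions; the function name and interface are invented for illustration.

      def chunk(raw: str, division: str = "newline"):
          """Convert raw input into a sequence of chunks."""
          if division == "newline":        # various forms of "newline"
              return raw.splitlines()
          if division == "paragraph":      # user-defined: blank-line separated
              return [p for p in raw.split("\n\n") if p.strip()]
          return [raw]                     # file boundary: the whole file is one chunk

      chunk("0, 24\nfoo, beg\nbar, end\n")   # -> ['0, 24', 'foo, beg', 'bar, end']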

  6. Tokenization. • Tokens are expressed as regular expressions. • Basic tokens: integers, white space, punctuation, strings. • Distinctive tokens: IP addresses, dates, times, MAC addresses, ...
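
  A sketch of such a tokenizer; the token set and patterns below are simplified stand-ins for the real definitions. Distinctive tokens are tried first so that, for example, an IP address does not lex as four integers and three dots.

      import re

      TOKENS = [                                   # distinctive tokens before basic ones
          ("IP",    r"\d{1,3}(?:\.\d{1,3}){3}"),
          ("DATE",  r"\d{4}-\d{2}-\d{2}"),         # one of several possible date forms
          ("INT",   r"-?\d+"),
          ("WHITE", r"[ \t]+"),
          ("STR",   r"[A-Za-z]+"),
          ("PUNCT", r"[^\sA-Za-z0-9]"),
      ]
      LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKENS))

      def tokenize(chunk: str):
          return [(m.lastgroup, m.group()) for m in LEXER.finditer(chunk)]

      tokenize("0, 24")   # [('INT', '0'), ('PUNCT', ','), ('WHITE', ' '), ('INT', '24')]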

  7. Histograms
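
  The histogram charts from this slide are not reproduced here. The idea, per the next slide, is one frequency distribution per token: how many chunks contain the token exactly k times. A sketch of how such histograms might be computed (the representation is an assumption):

      from collections import Counter

      def histograms(token_streams):
          """hist[kind][k] = number of chunks where token `kind` occurs exactly k times."""
          kinds = {kind for toks in token_streams for kind, _ in toks}
          hist = {kind: Counter() for kind in kinds}
          for toks in token_streams:
              counts = Counter(kind for kind, _ in toks)
              for kind in kinds:
                  hist[kind][counts.get(kind, 0)] += 1
          return hist

      streams = [[("INT", "0"), ("PUNCT", ","), ("INT", "24")],
                 [("STR", "foo"), ("PUNCT", ","), ("STR", "beg")]]
      histograms(streams)["PUNCT"]   # Counter({1: 2}): one comma in each of the 2 chunks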

  8. Clustering. Group tokens with similar frequency distributions into clusters [charts: Cluster 1, Cluster 2, Cluster 3]. Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height. Rank clusters by a metric that rewards high coverage and narrower distributions; choose the cluster with the highest score.
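
  One plausible instantiation of the similarity test and ranking metric; the exact tolerance test and scoring formula here are assumptions, not the system's precise definitions.

      def shape(hist):
          """Column heights sorted descending: the distribution's shape."""
          return sorted(hist.values(), reverse=True)

      def similar(h1, h2, tol):
          """Same shape within an error tolerance, padding the shorter with zeros."""
          s1, s2 = shape(h1), shape(h2)
          width = max(len(s1), len(s2))
          s1 += [0] * (width - len(s1))
          s2 += [0] * (width - len(s2))
          return all(abs(a - b) <= tol for a, b in zip(s1, s2))

      def cluster_score(hists, n_chunks):
          """Reward coverage (chunks showing the modal count) and narrowness."""
          coverage = min(max(h.values()) / n_chunks for h in hists)
          return coverage / max(len(h) for h in hists)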

  9. Partition chunks. Chunks are grouped by the order in which the selected cluster’s tokens occur in them; each group becomes a branch of a union. In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

  10. Find subcontexts. Tokens in the selected cluster: Quote (×2), Comma, White. Splitting each chunk around these tokens yields the subcontexts that are analyzed recursively.

  11. Then Recurse...

  12. Inferred type

  13. Finding arrays. A single cluster with high coverage but a wide distribution signals repetition: its tokens occur in most chunks but a variable number of times per chunk, suggesting an array.

  14. Partitioning. Inferred array type: String [] sep(‘|’). Selected tokens for the array cluster: String, Pipe. Contexts 1 and 2: String * Pipe. Context 3: String.

  15. Scoring. • Goal: a quantitative metric to evaluate the quality of inferred descriptions and drive refinement. • Challenge, underfitting: Pstring(Peof) describes the data, but is too general to be useful. • Challenge, overfitting: a type that exhaustively describes the data (‘H’, ‘e’, ‘r’, ‘m’, ‘i’, ‘o’, ‘n’, ‘e’, …) is too precise to be useful. • Sweet spot: reward compact descriptions that predict the data well.

  16. Minimum Description Length (MDL). • A standard metric from machine learning: the cost of transmitting the syntax of a description plus the cost of transmitting the data given the description: cost(T, d) = complexity(T) + complexity(d | T). • Both complexity functions are defined inductively over the structure of the type T and the data d, respectively. • Normalized MDL gives a compression factor. • The scoring function triggers rewriting rules.
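
  A toy instantiation of the cost function over a simplified type language; the constants and encodings are assumptions for illustration, since the real complexity functions range over full PADS types.

      import math

      # Toy types: ("base", name) | ("struct", [types]) | ("union", [types]).
      def type_complexity(ty):
          tag, arg = ty
          if tag == "base":
              return 1.0                       # one code word per base type
          return 1.0 + sum(type_complexity(t) for t in arg)

      def data_complexity(ty, d):
          tag, arg = ty
          if tag == "base":
              return 8.0 * len(d)              # toy encoding: one byte per character
          if tag == "struct":                  # d carries one piece per field
              return sum(data_complexity(t, p) for t, p in zip(arg, d))
          if tag == "union":                   # name the branch, then send its data
              return math.log2(len(arg)) + min(data_complexity(t, d) for t in arg)

      def cost(ty, data):
          # cost(T, d) = complexity(T) + complexity(d | T)
          return type_complexity(ty) + sum(data_complexity(ty, d) for d in data)

      ty = ("struct", [("union", [("base", "int"), ("base", "alpha")]),
                       ("base", "punct")])
      cost(ty, [["24", ","], ["foo", ","]])    # -> 63.0 bits under this toy encoding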

  17. Format Refinement. Format refinement rewrites descriptions to:
  • Optimize information-theoretic complexity.
  • Simplify presentation: merge adjacent structures and unions (sketched below).
  • Improve precision: identify constant values; introduce enumerations.
  • Refine types: find termination conditions for strings; determine integer sizes (8 bit, 32 bit, etc.); identify array separators & terminators.
  • Detect dependencies: array lengths; switched unions.
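
  A sketch of one simplification rule, merging nested structures into their parent, over the toy type representation used above; the rule set and representation are illustrative, not the system's actual rewriting engine.

      def flatten_structs(ty):
          """("struct", [a, ("struct", [b, c]), d]) ==> ("struct", [a, b, c, d])"""
          tag, arg = ty
          if tag != "struct":
              return ty
          fields = []
          for f in (flatten_structs(t) for t in arg):
              if f[0] == "struct":
                  fields.extend(f[1])          # splice child fields into the parent
              else:
                  fields.append(f)
          return ("struct", fields)

      flatten_structs(("struct", [("base", "int"),
                                  ("struct", [("base", "punct"), ("base", "int")])]))
      # -> ('struct', [('base', 'int'), ('base', 'punct'), ('base', 'int')])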

  18. Testing and Evaluation. • Evaluated overall results qualitatively: compared with Excel, a manual process with limited facilities for representing hierarchy or variation; and with hand-written descriptions, where performance varies with tokenization choices and format complexity. • Evaluated accuracy quantitatively: implemented infrastructure that uses the generated accumulator programs to determine the error rates of inferred descriptions. • Evaluated performance quantitatively: tokenization and rough structure inference perform well (less than 1 second on 300K of data); dependency analysis can take a long time on complex formats (but can be cut down easily).

  19. Benchmark Formats

  20. Execution Times (SD: structure discovery; Ref: refinement; Tot: total; HW: hand-written).

  21. Training Time

  22. Normalized MDL Scores (SD: structure discovery; Ref: refinement; HW: hand-written).

  23. Training Accuracy

  24. Type Complexity and Min. Training Size

  25. Technical Summary. • Format inference is feasible for many ASCII data formats. • Our current tools infer sufficient structure that descriptions may be piped into the PADS compiler and used to generate tools for XML conversion and simple statistical analysis. [Diagram from slide 2 repeated: email, ASCII log files, binary traces, CSV, and XML flowing through inferred descriptions to generated tools.]

  26. Thanks & Acknowledgements. • Collaborators: Kenny Zhu (Princeton), Peter White (Galois). • Other contributors: Alex Aiken (Stanford), David Blei (Princeton), David Burke (Galois), Vikas Kedia (Stanford), John Launchbury (Galois), Rob Schapire (Princeton).

  27. Problem: Tokenization. • Technical problem: different data sources assume different tokenization strategies; useful token definitions sometimes overlap, can be ambiguous, and aren’t always easily expressed using regular expressions; matching the tokenization of the underlying data source can make a big difference in structure discovery. • Current solution: parameterize the learning system with customizable configuration files; automatically generate the lexer file and basic token types. • Future solutions: use existing PADS descriptions and data sources to learn probabilistic tokenizers; incorporate probabilities into a sophisticated back-end rewriting system, since the back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead.

  28. Scaling to Larger Data Sets (in progress). • The original algorithm keeps the entire data set in memory, so it won’t scale to large data sets. • Proposed conceptual architecture to permit incremental learning, illustrated on the following slides:

  29. Example input chunks: “0, 24”, “foo, beg”, “bar, end”, “0, 56”, “baz, middle”, “0, 12”, “0, 33”, …

  30. Structure discovery over those chunks yields a struct: an opening quote, a union of int and alpha, a comma, a second union of int and alpha, and a closing quote.

  31. Tagging / table generation: each union gets a tag column recording which branch parsed (id1, id2; 1 = int, 2 = alpha), and each leaf gets a value column (id3, id5: int; id4, id6: alpha). Every chunk becomes a row, with “--” marking fields that are absent:

  id1  id2  id3  id4  id5  id6
  1    1    0    --   24   --
  2    2    --   foo  --   beg
  ...  ...  ...  ...  ...  ...

  32. Constraint inference over the table discovers id3 = 0 and id1 = id2 (the first union is “int” whenever the second union is “int”).
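
  A sketch of this constraint-inference step over the tag table above, checking just two dependency forms: constant columns and pairwise-equal columns. The table encoding and function name are illustrative.

      def infer_constraints(table):
          """table maps column ids to value lists; "--" marks an absent field."""
          constraints = []
          cols = list(table)
          for c in cols:                               # constant columns: id = v
              vals = {v for v in table[c] if v != "--"}
              if len(vals) == 1:
                  constraints.append(f"{c} = {vals.pop()}")
          for i, a in enumerate(cols):                 # equal columns: id_a = id_b
              for b in cols[i + 1:]:
                  if table[a] == table[b]:
                      constraints.append(f"{a} = {b}")
          return constraints

      table = {"id1": [1, 2, 2, 1], "id2": [1, 2, 2, 1],
               "id3": [0, "--", "--", 0], "id4": ["--", "foo", "bar", "--"],
               "id5": [24, "--", "--", 56], "id6": ["--", "beg", "end", "--"]}
      infer_constraints(table)   # -> ['id3 = 0', 'id1 = id2']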

  33. Rule-based structure rewriting uses those constraints to rewrite the description into a union of two structs: one for “0, int” records and one for “alpha-string, alpha-string” records. The result is more accurate: the first int is the constant 0, and the union-of-structs form rules out ill-formed “int, alpha-string” records.
