
From Dirt to Shovels: Inferring PADS descriptions from ASCII Data



Presentation Transcript


  1. From Dirt to Shovels: Inferring PADS descriptions from ASCII Data Kathleen Fisher David Walker Peter White Kenny Zhu July 2007

  2. Government stats "MSN","YYYYMM","Publication Value","Publication Unit","Column Order" "TEAJBUS",197313,-0.456483,Quadrillion Btu,4 "TEAJBUS",197413,-0.482265,Quadrillion Btu,4 "TEAJBUS",197513,-1.066511,Quadrillion Btu,4 "TEAJBUS",197613,-0.177807,Quadrillion Btu,4 "TEAJBUS",197713,-1.948233,Quadrillion Btu,4 "TEAJBUS",197813,-0.336538,Quadrillion Btu,4 "TEAJBUS",197913,-1.649302,Quadrillion Btu,4 "TEAJBUS",198013,-1.0537,Quadrillion Btu,4

  3. Train Stations Southern California Regional Railroad Authority,"Los Angeles, CA", U,45,46,46,47,49,51,U,45,46,46,47,49,51 Connecticut Department of Transportation ,"New Haven, CT", U,U,U,U,U,U,8,U,U,U,U,U,U,8 Tri-County Commuter Rail Authority ,"Miami, FL", U,U,U,U,U,U,18,U,U,U,U,U,U,18 Northeast Illinois Regional Commuter Railroad Corporation,"Chicago, IL",226,226,226,227,227,227,227,91,104,104,111,115,125,131 Northern Indiana Commuter Transportation District,"Chicago, IL",18,18,18,18,18,18,20,7,7,7,7,7,7,11 Massachusetts Bay Transportation Authority,"Boston, MA", U,U,117,119,120,121,124,U,U,67,69,74,75,78 Mass Transit Administration - Maryland DOT ,"Baltimore, MD", U,U,U,U,U,U,42,U,U,U,U,U,U,22 New Jersey Transit Corporation ,"New York, NY",158,158,158,162,162,162,167,22,22,41,46,46,46,51

  4. Web logs 207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013 207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/clear.gif HTTP/1.0" 200 76 207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/back.gif HTTP/1.0" 200 224 207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/women.html HTTP/1.0" 200 17534 208.196.124.26 - Dbuser [15/Oct/2006:18:46:55 -0700] "GET /candatop.html HTTP/1.0" 200 - 208.196.124.26 - - [15/Oct/2006:18:46:57 -0700] "GET /images/done.gif HTTP/1.0" 200 4785 www.att.com - - [15/Oct/2006:18:47:01 -0700] "GET /images/reddash2.gif HTTP/1.0" 200 237 208.196.124.26 - - [15/Oct/2006:18:47:02 -0700] "POST /images/refrun1.gif HTTP/1.0" 200 836 208.196.124.26 - - [15/Oct/2006:18:47:05 -0700] "GET /images/hasene2.gif HTTP/1.0" 200 8833 www.cnn.com - - [15/Oct/2006:18:47:08 -0700] "GET /images/candalog.gif HTTP/1.0" 200 - 208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/nigpost1.gif HTTP/1.0" 200 4429 208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/rally4.jpg HTTP/1.0" 200 7352 128.200.68.71 - - [15/Oct/2006:18:47:11 -0700] "GET /amnesty/usalinks.html HTTP/1.0" 143 10329 208.196.124.26 - - [15/Oct/2006:18:47:11 -0700] "GET /images/reyes.gif HTTP/1.0" 200 10859

  5. Learning: Goals & Approach [Diagram: raw data (email, ASCII log files, binary traces, CSV) flows through an inferred data description to end-user tools, standard formats & schema (XML), and visual information.] Problem: Producing useful tools for ad hoc data takes a lot of time. Solution: A learning system to generate data descriptions and tools automatically.

  6. PADS Reminder Inferred data formats are described using a specialized language of types. • Provides a rich base type library; many types are specialized for systems data: Pint8, Puint8, … (e.g. -123, 44); Pstring(:'|':) (e.g. hello |); Pstring_FW(:3:) (e.g. cat, dog); Pdate, Ptime, Pip, … • Provides type constructors to describe data source structure: sequences (Pstruct, Parray), choices (Punion, Penum, Pswitch), and constraints, which allow arbitrary predicates to describe expected properties. • The PADS compiler generates stand-alone tools, including XML conversion, XQuery support & statistical analysis, directly from data descriptions.

  7. Go to demo

  8. Format inference overview [Pipeline diagram: Raw Data → Chunking Process → Tokenization → Structure Discovery → Format Refinement (driven by the Scoring Function) → IR-to-PADS Printer → PADS Description → PADS Compiler → generated tools: XMLifier (XML) and Accumulator (Analysis Report).]

  9. Chunking Process • Convert raw input into sequence of “chunks.” • Supported divisions: • Various forms of “newline” • File boundaries • Also possible: user-defined “paragraphs”
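As a minimal sketch (the helper name is hypothetical, not the PADS implementation), newline-based chunking might look like:

```python
# Sketch of the chunking step: convert raw input into a sequence of
# "chunks", here using the various forms of "newline" as divisions.
# (The real system also supports file boundaries and user-defined
# paragraphs.)
def chunk_by_newline(raw_data: str) -> list[str]:
    # Normalize \r\n and \r to \n, then split; drop empty chunks.
    normalized = raw_data.replace("\r\n", "\n").replace("\r", "\n")
    return [line for line in normalized.split("\n") if line != ""]

# Each resulting chunk is one record candidate for tokenization.
chunks = chunk_by_newline('"TEAJBUS",197313,-0.456483\r\n"TEAJBUS",197413,-0.482265\n')
```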

  10. Tokenization • Tokens expressed as regular expressions. • Basic tokens • Integer, white space, punctuation, strings • Distinctive tokens • IP addresses, dates, times, MAC addresses, ...
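A hedged sketch of such a tokenizer, with illustrative regular expressions (the patterns and token names are assumptions, not the system's actual token library); distinctive tokens are tried before basic ones so that IP addresses and dates are not broken into bare integers and punctuation:

```python
import re

# Illustrative token patterns: distinctive tokens first, basic ones after.
TOKEN_PATTERNS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE",   r"\d{1,2}/[A-Za-z]{3}/\d{4}"),
    ("TIME",   r"\d{1,2}:\d{2}:\d{2}"),
    ("INT",    r"-?\d+"),
    ("STRING", r"[A-Za-z]+"),
    ("WHITE",  r"\s+"),
    ("PUNCT",  r"[^\w\s]"),
]
# Alternation order encodes priority: earlier patterns win at each position.
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(chunk: str) -> list[tuple[str, str]]:
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(chunk)]
```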

  11. Histograms
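The histograms of slide 11 record, for each token, the distribution of its per-chunk frequencies. A small sketch (illustrative, not the actual implementation):

```python
from collections import Counter

# For each token type, build a histogram of how often it occurs per
# chunk: column f of token t's histogram counts the chunks in which
# t appears exactly f times.
def token_histograms(tokenized_chunks: list[list[str]]) -> dict[str, Counter]:
    all_tokens = {tok for chunk in tokenized_chunks for tok in chunk}
    histograms = {tok: Counter() for tok in all_tokens}
    for chunk in tokenized_chunks:
        counts = Counter(chunk)
        for tok in all_tokens:
            histograms[tok][counts[tok]] += 1
    return histograms
```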

  12. Clustering Group tokens with similar frequency distributions into clusters (Cluster 1, Cluster 2, Cluster 3). Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height. Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
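The similarity test on this slide can be sketched as follows (the normalization and tolerance handling are assumptions, not the system's exact definition):

```python
# Two frequency distributions are "similar" if, after normalizing and
# sorting their columns by height, corresponding columns differ by at
# most a tolerance.
def similar(hist_a: dict[int, int], hist_b: dict[int, int], tol: float = 0.1) -> bool:
    def sorted_shape(h):
        total = sum(h.values())
        return sorted((v / total for v in h.values()), reverse=True)
    a, b = sorted_shape(hist_a), sorted_shape(hist_b)
    # Pad the shorter shape with zero-height columns.
    n = max(len(a), len(b))
    a += [0.0] * (n - len(a))
    b += [0.0] * (n - len(b))
    return all(abs(x - y) <= tol for x, y in zip(a, b))
```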

  13. Partition chunks In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

  14. Find subcontexts Tokens in selected cluster: Quote(2), Comma, White

  15. Then Recurse...

  16. Inferred type

  17. Finding arrays Single cluster with high coverage, but wide distribution.

  18. String [] sep('|') Partitioning Selected tokens for array cluster: String, Pipe. Contexts 1, 2: String * Pipe. Context 3: String.

  19. Format inference overview [Pipeline diagram: Raw Data → Chunking Process → Tokenization → Structure Discovery → Format Refinement (driven by the Scoring Function) → IR-to-PADS Printer → PADS Description → PADS Compiler → generated tools: XMLifier (XML) and Accumulator (Analysis Report).]

  20. Format Refinement • Rewrite format description to: • Optimize information-theoretic complexity • Simplify presentation • Merge adjacent structures and unions • Improve precision • Identify constant values • Introduce enumerations and dependencies • Fill in missing details • Find completions where structure discovery stops • Refine types • Termination conditions for strings • Integer sizes • Identify array element separators & terminators

  21. Example records: “0, 24” “foo, beg” “bar, end” “0, 56” “baz, middle” “0, 12” “0, 33” …

  22. [Diagram: structure discovery turns the example records (“0, 24”, “foo, beg”, “bar, end”, “0, 56”, “baz, middle”, …) into a struct: quote, union(int | alpha), comma, space, union(int | alpha), quote.]

  23. [Diagram: tagging/table generation labels the two unions (id1, id2) and their branches (int id3, alpha id4; int id5, alpha id6), then builds a table of parses: record “0, 24” gives id1=1, id2=1, id3=0, id5=24; record “foo, beg” gives id1=2, id2=2, id4=foo, id6=beg; and so on, with “--” marking untaken branches.]

  24. [Diagram: constraint inference over the generated table discovers id3 = 0 (the first int is always 0) and id1 = id2 (the first union is “int” whenever the second union is “int”).]
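The constraint inference step (finding id3 = 0 and id1 = id2 from the table) can be sketched as below; the helper is hypothetical and checks only the two kinds of constraints in the example, constant columns and always-equal columns:

```python
# Sketch of constraint inference over the tag/value table: find columns
# that are constant (ignoring "--", which marks "branch not taken") and
# pairs of columns whose values are always equal.
def infer_constraints(table: list[dict]) -> list[str]:
    cols = list(table[0].keys())
    constraints = []
    for c in cols:
        vals = {row[c] for row in table if row[c] != "--"}
        if len(vals) == 1:
            constraints.append(f"{c} = {vals.pop()}")
    for c1 in cols:
        for c2 in cols:
            if c1 < c2 and all(row[c1] == row[c2] for row in table):
                constraints.append(f"{c1} = {c2}")
    return constraints

# The table from the slide (abridged): union tags id1, id2 and leaf
# values id3..id6, one row per record.
table = [
    {"id1": 1, "id2": 1, "id3": 0,    "id4": "--",  "id5": 24,   "id6": "--"},
    {"id1": 2, "id2": 2, "id3": "--", "id4": "foo", "id5": "--", "id6": "beg"},
    {"id1": 2, "id2": 2, "id3": "--", "id4": "bar", "id5": "--", "id6": "end"},
    {"id1": 1, "id2": 1, "id3": 0,    "id4": "--",  "id5": 56,   "id6": "--"},
]
```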

  25. [Diagram: rule-based structure rewriting applies the inferred constraints (id3 = 0, id1 = id2), rewriting the description into a quoted union of two structs: (0, “, ”, int) and (alpha-string, “, ”, alpha-string). The result is more accurate: the first int is the constant 0, and mixed “int , alpha-string” records are ruled out.]

  26. Format inference overview [Pipeline diagram: Raw Data → Chunking Process → Tokenization → Structure Discovery → Format Refinement (driven by the Scoring Function) → IR-to-PADS Printer → PADS Description → PADS Compiler → generated tools: XMLifier (XML) and Accumulator (Analysis Report).]

  27. Scoring • Goal: A quantitative metric to evaluate the quality of inferred descriptions and drive refinement. • Challenges: • Underfitting. Pstring(Peof) describes data, but is too general to be useful. • Overfitting. Type that exhaustively describes data (‘H’, ‘e’, ‘r’, ‘m’, ‘i’, ‘o’, ‘n’, ‘e’,…) is too precise to be useful. • Sweet spot: Reward compact descriptions that predict the data well.

  28. Minimum Description Length • Standard metric from machine learning. • Cost of transmitting the syntax of a description plus the cost of transmitting the data given the description: cost(T,d) = complexity(T) + complexity(d|T) • Functions defined inductively over the structure of the type T and data d respectively. • Normalized MDL gives compression factor. • Scoring function triggers rewriting rules.
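A toy instance of cost(T,d) = complexity(T) + complexity(d|T); the two complexity functions below are invented stand-ins for illustration, not the paper's definitions:

```python
import math

# Toy MDL cost: type cost charges a few bits per construct in the
# description's syntax; data cost charges bits to transmit each integer
# field given the description. (Both functions are made-up stand-ins.)
def type_complexity(description_tokens: list[str]) -> float:
    return 8.0 * len(description_tokens)

def data_complexity(records: list[list[int]]) -> float:
    return sum(math.log2(abs(v) + 2) for rec in records for v in rec)

def cost(description_tokens: list[str], records: list[list[int]]) -> float:
    return type_complexity(description_tokens) + data_complexity(records)

# The sweet spot minimizes the sum: an overly general description has
# tiny type cost but huge data cost; an overfit one has huge type cost.
```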

  29. Testing and Evaluation • Evaluated overall results qualitatively • Compared with Excel -- a manual process with limited facilities for representation of hierarchy or variation • Compared with hand-written descriptions -- performance varies depending on tokenization choices & complexity • Evaluated accuracy quantitatively • Implemented infrastructure to use generated accumulator programs to determine inferred description error rates • Evaluated performance quantitatively • Tokenization & rough structure inference perform well: less than 1 second on 300K • Dependency analysis can take a long time on complex formats (but can be cut down easily).

  30. Benchmark Formats

  31. Execution Times SD: structure discovery Ref: refinement Tot: total HW: hand-written

  32. Training Time

  33. Normalized MDL Scores SD: structure discovery Ref: refinement HW: hand-written

  34. Training Accuracy

  35. Type Complexity and Min. Training Size

  36. Problem: Tokenization • Technical problem: • Different data sources assume different tokenization strategies • Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions • Matching the tokenization of the underlying data source can make a big difference in structure discovery. • Current solution: • Parameterize the learning system with customizable configuration files • Automatically generate the lexer file & basic token types • Future solutions: • Use existing PADS descriptions and data sources to learn probabilistic tokenizers • Incorporate probabilities into the back-end rewriting system • The back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead

  37. Structure Discovery Analysis • Usually identifies top-level structure sufficiently well to be of some use • When tokenization is accurate, this phase performs well • When tokenization is inaccurate, this phase performs less well • Descriptions are more complex than hand-coded ones • Intuitively: one or two well-chosen tokens in a hand-coded description are represented by a complex combination of unions, options, arrays and structures • Technical Problems: • When to give up & bottom out • Choosing between unions and arrays • Current Solutions: • User-specified recursion depth • Structs prioritized over arrays, which are prioritized over unions • Future Solutions: • Information-theory-driven bottoming out • Expand infrastructure to enable “search” and evaluation of several options

  38. Scoring Analysis • Technical Problem: It is unclear how to weigh type complexity vs. data complexity to predict human preference in description structure • Current Solution: • Final type complexity and final data complexity are weighted equally in the total cost function • However, final data complexity grows linearly with the amount of data used in the experiment • Future Solutions: • Observation: some of our experiments suggest that humans weight type complexity more heavily than data complexity • Introduce a hyperparameter h and perform experiments, varying h until the cost of inferred results and expert descriptions matches expectations: cost = h * type-complexity + data-complexity • Bottom Line: Information theory is a powerful and general tool, but more research is needed to tune it to our application domain
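The proposed weighted cost is a one-line formula; sketched directly (the function name is ours):

```python
# cost = h * type-complexity + data-complexity, with hyperparameter h
# tuned until inferred and expert descriptions rank as expected.
def weighted_cost(type_complexity: float, data_complexity: float, h: float) -> float:
    return h * type_complexity + data_complexity

# h = 1 recovers the current equal weighting; raising h penalizes
# complex descriptions more, favoring simpler inferred formats.
```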

  39. Related work • Grammar Induction • Extracting Structure from Web Pages [Arasu & Garcia-Molina, SIGMOD, 2003]. • Language Identification in the Limit [Gold, Information and Control, 1967]. • Grammatical Inference for Information Extraction and Visualization on the Web [Hong, PhD Thesis, Imperial College, 2003]. • Current Trends in Grammatical Inference [de la Higuera, LNCS, 2001]. • Functional dependencies • TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies [Huhtala et al., The Computer Journal, 1999]. • Information Theory • Information Theory, Inference, and Learning Algorithms [MacKay, Cambridge University Press, 2003]. • Advances in Minimum Description Length [Grünwald, MIT Press, 2004].

  40. Technical Summary • Format inference is feasible for many ASCII data formats • Our current tools infer sufficient structure that descriptions may be piped into the PADS compiler and used to generate tools for XML conversion and simple statistical analysis. [Diagram: raw data (email, ASCII log files, binary traces, CSV) becomes a description and then XML and other tools.]

  41. Thanks & Acknowledgements • Collaborators • Kenny Zhu (Princeton) • Peter White (Galois) • Other contributors • Alex Aiken (Stanford) • David Blei (Princeton) • David Burke (Galois) • Vikas Kedia (Stanford) • John Launchbury (Galois) • Rob Schapire (Princeton) www.padsproj.org
