
From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data

David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu. www.padsproj.org


Presentation Transcript


  1. From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data
     David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
     www.padsproj.org

  2. Data, data, everywhere
     • AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”
     • Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available
       • not free text; not HTML; not XML
     • Common problems: no documentation, evolving formats, huge volume, error-filled ...
     [Figure labels: Router Configs, Network Monitoring, Web Logs, Billing Info, Call Details]

  3. Data, data, everywhere (web server common log format)
       207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
       tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
       234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
       240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
       188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
       214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -

  4. Data, data, everywhere (AT&T phone call provisioning data)
       9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
       9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982

  5. Data, data, everywhere (www.opradata.com)
       HA00000000START OF TEST CYCLE
       aA00000001BXYZ U1AB0000040000100B0000004200
       HE00000005START OF SUMMARY
       f 00000006NYZX B1QB00052000120000070000B000050000000520000
       00490000005100+00000100B00000005300000052500000535000
       HF00000007END OF SUMMARY
       k00000008LYXW B1KB0000065G0000009900100000001000020000
       HB00000009END OF TEST CYCLE

  6. Data, data, everywhere (www.geneontology.org)
       format-version: 1.0
       date: 11:11:2005 14:24
       auto-generated-by: DAG-Edit 1.419 rev 3
       default-namespace: gene_ontology
       subsetdef: goslim_goa "GOA and proteome slim"
       [Term]
       id: GO:0000001
       name: mitochondrion inheritance
       namespace: biological_process
       def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
       is_a: GO:0048308 ! organelle inheritance
       is_a: GO:0048311 ! mitochondrion distribution

  7. Goal
     [Diagram labels: Raw Data (Billing Info, ASCII log files, Call Detail); Standard formats & schema (CSV, XML); End-user tools (Visual Information)]
     We want to create this arrow.

  8. Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]
     [Diagram labels: “Ad Hoc” Data Source; PADS Data Description; PADS Compiler; Generated Libraries (Parsing, Printing, Traversal); PADS Runtime System (I/O, Error Handling); generic description-directed programs, coded once: XML Converter, Data Profiler, Graphing Tool, Query Engine; Custom App; outputs: XML, Analysis Report, Graph Information, ?]

  9. PADS Language Overview
     Data formats are described using a specialized language of types.
     • Rich base type library:
       • integers: Pint8, Puint32, …
       • strings: Pstring('|'), Pstring_FW(3), ...
       • systems data: Pdate, Ptime, Pip, …
     • Type constructors describe complex data sources:
       • sequences: Pstruct, Parray
       • choices: Punion, Penum, Pswitch
       • constraints: arbitrary predicates describe expected semantic properties
       • parameterization: allows definition of generic descriptions
     A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.

  10. The Last Mile: The PADS System 2.0
      [Diagram labels: Raw Data; Format Inference Engine (Chunking & Tokenization, Structure Discovery, Format Refinement, Scoring Function); PADS Data Description; PADS Compiler; XMLifier, producing XML; Profiler, producing Analysis Report]

  11. Chunking Process
      • Convert raw input into a sequence of “chunks.”
      • Supported divisions:
        • Various forms of “newline”
        • File boundaries
      • Also possible: user-defined “paragraphs”
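
The chunking step can be sketched in a few lines of Python. This is only an illustration, not the PADS implementation: the function names are hypothetical, and treating a "paragraph" as a blank-line-separated block is an assumption, since the slide does not say how user-defined paragraphs are specified.

```python
import re

# Hypothetical helper: one chunk per line, accepting \r\n, \r, or \n
# as the "various forms of newline". Empty chunks are dropped.
def chunk_by_newline(raw: str) -> list[str]:
    return [c for c in re.split(r"\r\n|\r|\n", raw) if c]

# Hypothetical helper: one chunk per "paragraph", assumed here to mean
# a block of text separated from the next by at least one blank line.
def chunk_by_paragraph(raw: str) -> list[str]:
    parts = re.split(r"(?:\r\n|\r|\n){2,}", raw)
    return [p.strip() for p in parts if p.strip()]

if __name__ == "__main__":
    print(chunk_by_newline("a|1\r\nb|2\nc|3\r"))
    print(chunk_by_paragraph("[Term]\nid: 1\n\n[Term]\nid: 2\n"))
```

File-boundary chunking would simply treat the full contents of each input file as one chunk.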

  12. Tokenization
      • Tokens/base types are expressed as regular expressions.
      • Basic tokens
        • Integer, white space, punctuation, strings
      • Distinctive tokens
        • IP addresses, dates, times, MAC addresses, ...
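
A regular-expression tokenizer of this kind might look like the sketch below. The token names and patterns are assumptions chosen for illustration (the slides do not list the actual definitions); the one design point it shows is that distinctive tokens are tried before basic ones, so an IP address is not broken into integers and dots.

```python
import re

# Assumed token definitions; order matters because the first
# alternative that matches at a position wins.
TOKEN_SPECS = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),    # distinctive tokens first
    ("Date",   r"\d{1,2}/[A-Za-z]{3}/\d{4}"),
    ("Time",   r"\d{2}:\d{2}:\d{2}"),
    ("Int",    r"-?\d+"),                      # basic tokens after
    ("White",  r"[ \t]+"),
    ("String", r"[A-Za-z][A-Za-z0-9_]*"),
    ("Quote",  r'"'),
    ("Comma",  r","),
    ("Punct",  r"\S"),                         # any other visible character
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))

def tokenize(chunk: str) -> list[tuple[str, str]]:
    """Return (token_name, matched_text) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(chunk)]

if __name__ == "__main__":
    for name, text in tokenize('"123, 24"'):
        print(name, repr(text))
```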

  13. Histograms
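
The histogram slide survives only as a title. One reading, consistent with the "compute frequency distribution for each token" step in the review slide, is that each token gets a histogram recording how many chunks contain it 0 times, 1 time, 2 times, and so on. A minimal sketch under that assumption (the function name and data layout are hypothetical):

```python
from collections import Counter

def token_histograms(chunks: list[list[str]]) -> dict[str, Counter]:
    """Map each token name to {occurrences_in_a_chunk: number_of_chunks}.

    `chunks` holds one list of token names per chunk.
    """
    names = {tok for chunk in chunks for tok in chunk}
    hists = {name: Counter() for name in names}
    for chunk in chunks:
        counts = Counter(chunk)
        for name in names:
            # counts[name] is 0 when the token is absent from this chunk
            hists[name][counts[name]] += 1
    return hists

if __name__ == "__main__":
    # Token names for the chunks "123, 24", "345, begin", "574, end"
    chunks = [
        ["Quote", "Int", "Comma", "White", "Int", "Quote"],
        ["Quote", "Int", "Comma", "White", "String", "Quote"],
        ["Quote", "Int", "Comma", "White", "String", "Quote"],
    ]
    for name, hist in sorted(token_histograms(chunks).items()):
        print(name, dict(hist))
```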

  14. Clustering
      • Group tokens with similar frequency distributions into clusters (Cluster 1, Cluster 2, Cluster 3 in the figure).
      • Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
      • Rank clusters by a metric that rewards high coverage and narrower distributions.
      • Choose the cluster with the highest score.
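
The similarity test and ranking can be sketched as follows. The "sort columns by height, then compare within a tolerance" test follows the slide; the greedy grouping, the tolerance value, and in particular the scoring formula (coverage divided by the number of histogram columns) are assumptions made for illustration, since the slides do not give the actual metric.

```python
Hist = dict[int, int]  # {occurrences_in_a_chunk: number_of_chunks}

def shape(hist: Hist, total: int) -> list[float]:
    """Column heights as fractions of all chunks, tallest first."""
    return sorted((n / total for n in hist.values()), reverse=True)

def similar(h1: Hist, h2: Hist, total: int, tol: float = 0.1) -> bool:
    s1, s2 = shape(h1, total), shape(h2, total)
    width = max(len(s1), len(s2))
    s1 += [0.0] * (width - len(s1))  # pad the narrower shape with zeros
    s2 += [0.0] * (width - len(s2))
    return all(abs(a - b) <= tol for a, b in zip(s1, s2))

def cluster(hists: dict[str, Hist], total: int, tol: float = 0.1) -> list[list[str]]:
    """Greedy grouping: join the first cluster whose first member is similar."""
    clusters: list[list[str]] = []
    for name in sorted(hists):
        for group in clusters:
            if similar(hists[name], hists[group[0]], total, tol):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def score(group: list[str], hists: dict[str, Hist], total: int) -> float:
    """Assumed metric: high coverage and few columns score well."""
    def one(h: Hist) -> float:
        coverage = 1 - h.get(0, 0) / total  # chunks containing the token
        return coverage / len(h)            # fewer columns = narrower
    return min(one(hists[name]) for name in group)

if __name__ == "__main__":
    hists = {"Quote": {2: 3}, "Comma": {1: 3}, "White": {1: 3},
             "Int": {2: 1, 1: 2}, "String": {0: 1, 1: 2}}
    groups = cluster(hists, total=3)
    print(max(groups, key=lambda g: score(g, hists, 3)))
```

Because the columns are sorted by height before comparison, a token that appears exactly twice per chunk and one that appears exactly once per chunk have the same shape, which is how Quote, Comma, and White can land in a single cluster.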

  15. Partition chunks In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

  16. Find subcontexts
      Tokens in selected cluster: Quote(2), Comma, White
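
The partition and subcontext steps can be illustrated together. In this sketch, chunks are grouped by the order in which the selected cluster's tokens occur (each distinct order would become one union branch), and the token runs between those tokens become the subcontexts handed to the recursive step. The function name and data layout are hypothetical.

```python
from collections import defaultdict

def partition_and_split(chunks: list[list[str]], cluster_tokens: set[str]):
    """Group chunks by cluster-token order; return per-group subcontexts.

    Result: {order: [subcontext_0, subcontext_1, ...]}, where subcontext_i
    holds, for every chunk in the group, the tokens found in gap i.
    """
    groups = defaultdict(list)
    for chunk in chunks:
        order = tuple(t for t in chunk if t in cluster_tokens)
        gaps, current = [], []
        for tok in chunk:
            if tok in cluster_tokens:
                gaps.append(current)   # close the gap before this token
                current = []
            else:
                current.append(tok)
        gaps.append(current)           # gap after the last cluster token
        groups[order].append(gaps)
    # Transpose so that gap i of every chunk ends up in one subcontext.
    return {order: [list(col) for col in zip(*rows)]
            for order, rows in groups.items()}

if __name__ == "__main__":
    chunks = [
        ["Quote", "Int", "Comma", "White", "Int", "Quote"],
        ["Quote", "Int", "Comma", "White", "String", "Quote"],
        ["Quote", "Int", "Comma", "White", "String", "Quote"],
    ]
    result = partition_and_split(chunks, {"Quote", "Comma", "White"})
    for order, subcontexts in result.items():
        print(order)
        for i, sub in enumerate(subcontexts):
            print("  gap", i, sub)
```

For the running example there is a single group, matching the "degenerate union" remark: the gap after the first Quote holds an Int in every chunk, while the gap before the closing Quote holds either an Int or a String and is what the recursion must describe next.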

  17. Then Recurse...

  18. Inferred type

  19. Structure Discovery Review
      • Compute frequency distribution for each token.
      • Cluster tokens with similar frequency distributions.
      • Create hypothesis about data structure from cluster distributions
        • Struct
        • Array
        • Union
        • Basic type (bottom out)
      • Partition data according to hypothesis & recurse
      • Once structure discovery is complete, later phases massage & rewrite the candidate description to create the final form
      Example chunks:
        "123, 24"
        "345, begin"
        "574, end"
        "9378, 56"
        "12, middle"
        "-12, problem"
        …
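
The slides name the four hypotheses but not the rule that picks among them. The sketch below uses an assumed heuristic purely to make the step concrete: a cluster whose tokens appear a fixed number of times in nearly every chunk suggests a struct, one whose tokens appear in nearly every chunk but a varying number of times suggests an array, anything else suggests a union, and an empty cluster bottoms out in a base type. The function name and the 0.9 coverage threshold are hypothetical.

```python
Hist = dict[int, int]  # {occurrences_in_a_chunk: number_of_chunks}

def hypothesize(cluster_hists: dict[str, Hist], total: int,
                min_coverage: float = 0.9) -> str:
    """Return 'struct', 'array', 'union', or 'base' (assumed heuristic)."""
    if not cluster_hists:
        return "base"                      # nothing left to split on

    def coverage(h: Hist) -> float:
        return 1 - h.get(0, 0) / total     # fraction of chunks with the token

    def fixed_count(h: Hist) -> bool:
        return len([k for k in h if k > 0]) == 1

    hists = list(cluster_hists.values())
    if all(coverage(h) >= min_coverage for h in hists):
        if all(fixed_count(h) for h in hists):
            return "struct"                # same tokens, same counts everywhere
        return "array"                     # present everywhere, counts vary
    return "union"                         # tokens only present in some chunks

if __name__ == "__main__":
    print(hypothesize({"Quote": {2: 3}, "Comma": {1: 3}, "White": {1: 3}}, 3))
    print(hypothesize({"Comma": {1: 1, 2: 1, 5: 1}}, 3))
    print(hypothesize({"String": {0: 2, 1: 1}}, 3))
```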

  20. Testing and Evaluation
      • Evaluated overall results qualitatively
        • Compared with Excel: a manual process with limited facilities for representing hierarchy or variation
        • Compared with hand-written descriptions: performance varies depending on tokenization choices & complexity
      • Evaluated accuracy quantitatively
        • For many formats: 95%+ accuracy from 5% of available data
      • Evaluated performance quantitatively
        • Hand-writing formats takes hours to days
        • After fixing the format, run time appears to scale linearly with data size
        • <1 min on 300K data

  21. Technical Summary [www.padsproj.org]
      • PADS 1.0 is an effective implementation framework for many data processing tasks
      • PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
      [Diagram labels: Email; ASCII log files; Binary Traces; struct { ... }; CSV; XML]

  22. End

  23. Execution Time
      [Table legend: SD = structure discovery; Ref = refinement; Tot = total; HW = hand-written]

  24. Training Time

  25. Minimum Necessary Training Sizes

  26. Problem: Tokenization
      • Technical problem:
        • Different data sources assume different tokenization strategies
        • Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
        • Matching the tokenization of the underlying data source can make a big difference in structure discovery
      • Current solution:
        • Parameterize learning system with customizable configuration files
        • Automatically generate lexer file & basic token types
      • Future solutions:
        • Use existing PADS descriptions and data sources to learn probabilistic tokenizers
        • Incorporate probabilities into sophisticated back-end rewriting system
          • The back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without lookahead
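
The overlap problem is easy to reproduce with a configurable lexer. In this sketch the lexer is built from a list of (name, pattern) pairs standing in for a configuration file; the token names, patterns, and `build_lexer` helper are all hypothetical. Adding a Time token helps on web-log timestamps but misfires on the `11:11:2005` date from the gene ontology sample.

```python
import re

def build_lexer(specs: list[tuple[str, str]]):
    """Build a tokenizer from (name, regex) pairs; earlier pairs win."""
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in specs))
    return lambda text: [m.lastgroup for m in pattern.finditer(text)]

# Two assumed configurations: basic tokens only, and basic plus Time.
BASIC = [("Int", r"\d+"), ("White", r"\s+"), ("Punct", r"[^\w\s]")]
WITH_TIME = [("Time", r"\d{2}:\d{2}:\d{2}")] + BASIC

if __name__ == "__main__":
    basic, with_time = build_lexer(BASIC), build_lexer(WITH_TIME)
    print(basic("18:46:51"))         # three integers separated by punctuation
    print(with_time("18:46:51"))     # a single Time token
    print(with_time("11:11:2005"))   # the date is split into Time + Int
```

The last call shows why the choice of token set has to match the data source: the Time pattern consumes `11:11:20` and leaves `05` behind as a stray integer.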
