From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data

David Walker

Pamela Dragosh

Mary Fernandez

Kathleen Fisher

Andrew Forrest

Bob Gruber

Yitzhak Mandelbaum

Peter White

Kenny Q. Zhu

www.padsproj.org


Data, data, everywhere

  • AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”

  • Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available

    • not free text; not html; not xml

  • Common problems: no documentation, evolving formats, huge volume, error-filled ...

Router Configs

Network Monitoring

Web Logs

Billing Info

Call Details


Data, data, everywhere

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30

tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/[email protected]/confirm HTTP/1.0" 200 941

234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409

240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178

188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082

214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -

Web server common log format


Data, data, everywhere

9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291

9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982

AT&T phone call provisioning data


Data, data, everywhere

HA00000000START OF TEST CYCLE

aA00000001BXYZ U1AB0000040000100B0000004200

HE00000005START OF SUMMARY

f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000

HF00000007END OF SUMMARY

k00000008LYXW B1KB0000065G0000009900100000001000020000

HB00000009END OF TEST CYCLE

www.opradata.com


Data, data, everywhere

format-version: 1.0

date: 11:11:2005 14:24

auto-generated-by: DAG-Edit 1.419 rev 3

default-namespace: gene_ontology

subsetdef: goslim_goa "GOA and proteome slim"

[Term]

id: GO:0000001

name: mitochondrion inheritance

namespace: biological_process

def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]

is_a: GO:0048308 ! organelle inheritance

is_a: GO:0048311 ! mitochondrion distribution

www.geneontology.org


Goal

[Diagram: Raw Data (Billing Info, ASCII log files, Call Detail); Standard formats & schema (CSV, XML); End-user tools; Visual Information. Annotation: “We want to create this arrow.”]


Half-way there: The PADS System 1.0 [FG PLDI 05, FMW POPL 06, MFWFG POPL 07]

[Diagram: “Ad Hoc” Data Source; PADS Data Description; PADS Compiler; Generated Libraries (Parsing, Printing, Traversal); PADS Runtime System (I/O, Error Handling). Generic description-directed programs, coded once: XML Converter, Data Profiler, Graphing Tool, Query Engine, Custom App. Outputs: XML, Analysis Report, Graph Information, ?]


PADS Language Overview

  • Rich base type library:

    • integers: Pint8, Puint32, …

    • strings: Pstring('|'), Pstring_FW(3), ...

    • systems data: Pdate, Ptime, Pip, …

  • Type constructors describe complex data sources:

    • sequences: Pstruct, Parray

    • choices: Punion, Penum, Pswitch

    • constraints: arbitrary predicates describe expected semantic properties

    • parameterization: allows definition of generic descriptions

Data formats are described using a specialized language of types

A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.
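To make the idea of describing data with types more concrete, here is a minimal Python sketch. It is not PADS syntax and not the PADS implementation: the class names `Base`, `Struct`, and `Choice` and the `parse` function are hypothetical stand-ins for base types, sequences, and choices, and a constraint is modeled as an arbitrary predicate on the matched text.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


def _always(_text: str) -> bool:
    """Default constraint: accept any matched text."""
    return True


@dataclass
class Base:
    """Hypothetical base type: a regular expression plus a semantic predicate."""
    name: str
    pattern: str
    constraint: Callable[[str], bool] = _always


@dataclass
class Struct:
    """Hypothetical sequence: fields (descriptions or literal strings) in order."""
    fields: List[object]


@dataclass
class Choice:
    """Hypothetical choice: the first branch that matches wins."""
    branches: List[object]


def parse(desc, text: str, pos: int = 0) -> Optional[int]:
    """Return the position after a successful match, or None on failure."""
    if isinstance(desc, str):  # literal separator
        return pos + len(desc) if text.startswith(desc, pos) else None
    if isinstance(desc, Base):
        m = re.compile(desc.pattern).match(text, pos)
        return m.end() if m and desc.constraint(m.group()) else None
    if isinstance(desc, Struct):
        for field in desc.fields:
            pos = parse(field, text, pos)
            if pos is None:
                return None
        return pos
    if isinstance(desc, Choice):
        for branch in desc.branches:
            end = parse(branch, text, pos)
            if end is not None:
                return end
        return None
    raise TypeError(f"unknown description: {desc!r}")


# An integer, a ", " separator, then either an integer or a lowercase word.
entry = Struct([
    Base("int", r"-?\d+"),
    ", ",
    Choice([Base("int", r"-?\d+"), Base("word", r"[a-z]+")]),
])
# A base type with a constraint: an unsigned integer that fits in 8 bits.
uint8 = Base("uint8", r"\d+", lambda s: int(s) < 256)
```

With these definitions, `parse(entry, "345, begin")` consumes the whole string, while `parse(uint8, "300")` fails because the predicate rejects the value even though the pattern matches.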


The Last Mile: The PADS System 2.0

[Diagram: Raw Data; Format Inference Engine (Chunking & Tokenization, Structure Discovery, Format Refinement, Scoring Function); PADS Data Description; PADS Compiler; XMLifier; Profiler; XML; Analysis Report.]


Chunking Process

  • Convert raw input into sequence of “chunks.”

  • Supported divisions:

    • Various forms of “newline”

    • File boundaries

  • Also possible: user-defined “paragraphs”
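As a rough illustration of chunking, the sketch below splits raw text into chunks in two ways. The function names are hypothetical, and treating blank lines as the "paragraph" delimiter is an assumption made for the example, not something the slides specify.

```python
import re
from typing import List


def chunk_by_newline(raw: str) -> List[str]:
    """One chunk per non-blank line.

    str.splitlines recognizes \n, \r\n, and \r (among other line boundaries),
    which covers the "various forms of newline" case.
    """
    return [line for line in raw.splitlines() if line.strip()]


def chunk_by_paragraph(raw: str) -> List[str]:
    """Assumed "paragraph" rule: blank lines separate chunks."""
    normalized = "\n".join(raw.splitlines())  # unify newline conventions first
    return [p for p in re.split(r"\n\s*\n", normalized) if p.strip()]
```

For file-boundary chunking, each input file would simply be read whole and treated as a single chunk.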


Tokenization

  • Tokens/Base types expressed as regular expressions.

  • Basic tokens

    • Integer, white space, punctuation, strings

  • Distinctive tokens

    • IP addresses, dates, times, MAC addresses, ...
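A regular-expression tokenizer along these lines might look like the following sketch. The token names and patterns are illustrative assumptions, not the token set used by the actual system; the one design point taken from the slide is that distinctive tokens (IP address, date, time) are tried before the basic ones so that, for example, an IP address is not split into integers and dots.

```python
import re
from typing import List, Tuple

# Order matters: distinctive tokens come before the basic ones.
# All names and patterns here are assumptions made for illustration.
TOKEN_SPEC = [
    ("IP",     r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("DATE",   r"\d{2}/[A-Za-z]{3}/\d{4}"),
    ("TIME",   r"\d{2}:\d{2}:\d{2}"),
    ("INT",    r"-?\d+"),
    ("WHITE",  r"[ \t]+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("PUNCT",  r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(chunk: str) -> List[Tuple[str, str]]:
    """Return (token name, matched text) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(chunk)]
```

For instance, `tokenize("207.136.97.49 - 200")` yields an `IP` token, then `WHITE`, `PUNCT`, `WHITE`, and `INT`.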



Clustering

Group tokens with similar frequency distributions into clusters (the slide shows Cluster 1, Cluster 2, and Cluster 3).

Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.

Rank clusters by a metric that rewards high coverage and narrower distributions. Choose the cluster with the highest score.
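The similarity test can be sketched as follows. The slides do not give the distance measure or the tolerance, so the sum of absolute differences and the default tolerance of 0.1 are assumptions, as are the function names.

```python
from collections import Counter
from typing import Dict, List


def histogram(counts_per_chunk: List[int]) -> Dict[int, float]:
    """Map 'occurrences of a token in a chunk' to the fraction of chunks with that count."""
    total = len(counts_per_chunk)
    return {count: n / total for count, n in Counter(counts_per_chunk).items()}


def similar(h1: Dict[int, float], h2: Dict[int, float], tolerance: float = 0.1) -> bool:
    """Compare shapes only: sort column heights, pad to equal width, then
    accept if the summed absolute difference is within the tolerance
    (the distance measure is an assumption made for this sketch)."""
    a = sorted(h1.values(), reverse=True)
    b = sorted(h2.values(), reverse=True)
    width = max(len(a), len(b))
    a += [0.0] * (width - len(a))
    b += [0.0] * (width - len(b))
    return sum(abs(x - y) for x, y in zip(a, b)) <= tolerance


quote = histogram([2, 2, 2, 2])          # two quotes in every chunk
comma = histogram([1, 1, 1, 1])          # one comma in every chunk
integer = histogram([2, 1, 1, 2, 1, 1])  # one or two integers per chunk
```

Here `quote` and `comma` have the same shape (a single column covering all chunks) even though the counts differ, so they are similar; `integer` is spread over two columns and is not similar to either.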


Partition chunks

In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.


Find subcontexts

Tokens in selected cluster: Quote(2), Comma, White




Structure Discovery Review

  • Compute frequency distribution for each token.

  • Cluster tokens with similar frequency distributions.

  • Create hypothesis about data structure from cluster distributions

    • Struct

    • Array

    • Union

    • Basic type (bottom out)

  • Partition data according to hypothesis & recurse

  • Once structure discovery is complete, later phases massage & rewrite candidate description to create final form

“123, 24”

“345, begin”

“574, end”

“9378, 56”

“12, middle”

“-12, problem”
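The first steps of structure discovery can be tried on the six example chunks above. This sketch covers only the simplest case, tokens that occur the same nonzero number of times in every chunk, which suggests a struct; the real system clusters by histogram similarity and scores clusters, which is not reproduced here. The token patterns and function names are assumptions, and the chunks use straight quotes for convenience.

```python
import re
from collections import Counter
from typing import Dict, List

CHUNKS = ['"123, 24"', '"345, begin"', '"574, end"',
          '"9378, 56"', '"12, middle"', '"-12, problem"']

# Illustrative token set (an assumption), tried left to right.
TOKEN_RE = re.compile(
    r'(?P<Int>-?\d+)|(?P<White>[ \t]+)|(?P<Word>[A-Za-z]+)'
    r'|(?P<Quote>")|(?P<Comma>,)|(?P<Other>.)'
)


def token_counts(chunk: str) -> Counter:
    """Count how often each token type occurs in one chunk."""
    return Counter(m.lastgroup for m in TOKEN_RE.finditer(chunk))


def struct_tokens(chunks: List[str]) -> Dict[str, int]:
    """Tokens with the same nonzero count in every chunk, with that count."""
    per_chunk = [token_counts(c) for c in chunks]
    names = set().union(*per_chunk)
    result = {}
    for name in names:
        counts = {pc[name] for pc in per_chunk}  # missing tokens count as 0
        if len(counts) == 1 and 0 not in counts:
            result[name] = counts.pop()
    return result
```

On `CHUNKS`, `struct_tokens` selects two quotes, one comma, and one white-space token per chunk, matching the cluster Quote(2), Comma, White; integers and words vary from chunk to chunk and are left for the recursive step.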


Testing and Evaluation

  • Evaluated overall results qualitatively

    • Compared with Excel -- a manual process with limited facilities for representation of hierarchy or variation

    • Compared with hand-written descriptions -- performance variable depending on tokenization choices & complexity

  • Evaluated accuracy quantitatively

    • For many formats: 95%+ accuracy from 5% of available data

  • Evaluated performance quantitatively

    • Hours to days to hand-write formats

    • after fixing the format, appears to scale linearly with data size

    • <1 min on 300K data


Technical Summary [www.padsproj.org]

  • PADS 1.0 is an effective implementation framework for many data processing tasks

  • PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries

[Diagram: Email, ASCII log files, Binary Traces; struct { ... }; CSV, XML.]



Execution Time

Legend: SD = structure discovery; Ref = refinement; Tot = total; HW = hand-written.




Problem: Tokenization

  • Technical problem:

    • Different data sources assume different tokenization strategies

    • Useful token definitions sometimes overlap, can be ambiguous, aren’t always easily expressed using regular expressions

    • Matching tokenization of underlying data source can make a big difference in structure discovery.

  • Current solution:

    • Parameterize learning system with customizable configuration files

    • Automatically generate lexer file & basic token types

  • Future solutions:

    • Use existing PADS descriptions and data sources to learn probabilistic tokenizers

    • Incorporate probabilities into sophisticated back-end rewriting system

      • Back end has more context for making final decisions than the tokenizer, which reads one character at a time without lookahead

