PADS: A System for Managing Ad Hoc Data

Kathleen Fisher*

AT&T Labs Research

www.padsproj.org

*And many many others…


Kenny Zhu

  • Dr. Zhu has been one of the main contributors to the PADS project.

  • He is finishing his Post Doc at Princeton and looking for jobs, both in North America and Asia.

http://www.cs.princeton.edu/~kzhu/


Data, Data, Everywhere!

Incredible amounts of data stored in well-behaved formats:

  • Databases

  • XML

Tools:

  • Schema

  • Browsers

  • Query Languages

  • Standards

  • Libraries

  • Books, documentation

  • Training courses

  • Conversion tools

  • Vendor support

  • Consultants...


We’re not always so lucky!

Vast amounts of chaotic ad hoc data:

Tools

  • Perl

  • Awk

  • C

  • ...


Web Logs

207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

207.136.97.49 - - [15/Oct/2006:18:46:51 -0700] "GET /turkey/clear.gif HTTP/1.0" 200 76

207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/back.gif HTTP/1.0" 200 224

207.136.97.49 - - [15/Oct/2006:18:46:52 -0700] "GET /turkey/women.html HTTP/1.0" 200 17534

208.196.124.26 - Dbuser [15/Oct/2006:18:46:55 -0700] "GET /candatop.html HTTP/1.0" 200 -

208.196.124.26 - - [15/Oct/2006:18:46:57 -0700] "GET /images/done.gif HTTP/1.0" 200 4785

www.att.com - - [15/Oct/2006:18:47:01 -0700] "GET /images/reddash2.gif HTTP/1.0" 200 237

208.196.124.26 - - [15/Oct/2006:18:47:02 -0700] "POST /images/refrun1.gif HTTP/1.0" 200 836

208.196.124.26 - - [15/Oct/2006:18:47:05 -0700] "GET /images/hasene2.gif HTTP/1.0" 200 8833

www.cnn.com - - [15/Oct/2006:18:47:08 -0700] "GET /images/candalog.gif HTTP/1.0" 200 -

208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/nigpost1.gif HTTP/1.0" 200 4429

208.196.124.26 - - [15/Oct/2006:18:47:09 -0700] "GET /images/rally4.jpg HTTP/1.0" 200 7352

128.200.68.71 - - [15/Oct/2006:18:47:11 -0700] "GET /amnesty/usalinks.html HTTP/1.0" 143 10329

208.196.124.26 - - [15/Oct/2006:18:47:11 -0700] "GET /images/reyes.gif HTTP/1.0" 200 10859


Haskell HI Files

00000000: 0001 face 0000 0073 0400 0000 3600 0000 .......s....6...

00000010: 3000 0000 3500 0000 3000 0000 0000 0000 0...5...0.......

00000020: 0001 0000 0000 0100 0000 0043 0001 0000 ...........C....

00000030: 0002 0200 0000 0200 0000 0300 0000 0200 ................

00000040: 0000 0400 0000 4800 0100 0000 0200 0000 ......H.........

00000050: 0502 0000 0000 0006 0000 0000 0007 0000 ................

00000060: 0001 0000 0000 6800 0000 0000 006f 0000 ......h......o..

00000070: 0000 0100 0000 0800 0000 0968 6173 6b65 ...........haske

00000080: 6c6c 3938 0000 0007 4350 5554 696d 6500 ll98....CPUTime.

00000090: 0000 0462 6173 6500 0000 0847 4843 2e42 ...base....GHC.B

000000a0: 6173 6500 0000 0e47 4843 2e46 6f72 6569 ase....GHC.Forei

000000b0: 676e 5074 7200 0000 0e53 7973 7465 6d2e gnPtr....System.

000000c0: 4350 5554 696d 6500 0000 0a67 6574 4350 CPUTime....getCP

000000d0: 5554 696d 6500 0000 1063 7075 5469 6d65 UTime....cpuTime

000000e0: 5072 6563 6973 696f 6e Precision


Ad Hoc Data from AT&T


And Many Others...

  • Gene ontology data

  • Cosmology data

  • Financial trading data

  • Telecom billing data

  • Router config files

  • System logs

  • Call detail data

  • Netflow packets

  • DNS packets

  • Java JAR files

  • Jazz recording info

  • ...


Why a data description language?

  • Ad hoc data is difficult to manage

    • Data arrives “as is” in a wide variety of encodings and formats.

    • Documentation is out of date or non-existent.

    • Data is buggy and potentially malicious.

    • Processing must detect errors and respond in application-specific ways.

    • Data sources often have high volume.

  • Existing solutions are insufficient

    • Lex/Yacc-like technologies target language syntax, rather than data.

    • Hand-coded C/Perl programs are time-consuming to produce, brittle with respect to changes, and fail to handle errors well.

  • Data description languages (DDLs) address these issues

    • Data expert writes declarative description rather than a parser.

    • Description serves as living documentation.

    • Parser exhaustively detects errors without cluttering user code.

    • Parser can be proven correct with respect to its handling of buggy data.

    • From declarative specification, compiler can generate auxiliary tools.

Data description languages facilitate managing ad hoc data.


The PADS/C Data Description Language

  • Provides rich and extensible set of base types for describing atomic data.

    • Pint8, Puint8, … // -123, 44

    • Pstring(:'|':) // hello|

    • Pstring_FW(:3:) // catdog

    • Pstring_ME(:"/a*/":) // aaaaaab

    • Pdate, Ptime, Pip, …

  • Provides type constructors to describe structured data, by analogy with C:

    • Pstruct, Parray, Punion, Ptypedef, Penum

  • Allows arbitrary predicates to describe expected properties.

  • Compiler generates parser, printer, and other useful tools in a type-directed fashion.

In the PADS/C DDL, each piece of data is described by a type, which specifies the physical format and semantic constraints of the data.

PADS uses a type metaphor to declaratively describe ad hoc data.
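To make the base types above concrete, here is a minimal Python sketch of what each one accepts. It is only an illustration, not the PADS runtime: the function names (p_int8, p_string, p_string_fw, p_string_me) and the convention of returning a (value, next position) pair or None are assumptions made for this example.

```python
import re

_INT = re.compile(r"-?\d+")

def p_int8(data, pos):
    """Signed 8-bit decimal integer, in the spirit of Pint8."""
    m = _INT.match(data, pos)
    if not m or not -128 <= int(m.group()) <= 127:
        return None
    return int(m.group()), m.end()

def p_string(term):
    """String that runs up to (not including) `term`, like Pstring(:'|':)."""
    def parse(data, pos):
        end = data.find(term, pos)
        if end < 0:
            return None
        return data[pos:end], end
    return parse

def p_string_fw(width):
    """Fixed-width string, like Pstring_FW(:3:)."""
    def parse(data, pos):
        if pos + width > len(data):
            return None
        return data[pos:pos + width], pos + width
    return parse

def p_string_me(pattern):
    """String matching a regular expression, like Pstring_ME(:"/a*/":)."""
    rx = re.compile(pattern)
    def parse(data, pos):
        m = rx.match(data, pos)
        return (m.group(), m.end()) if m else None
    return parse
```

Applied to the sample inputs from the slide, p_string("|") consumes "hello" from "hello|", p_string_fw(3) consumes "cat" from "catdog", and p_string_me("a*") consumes the run of a's from "aaaaaab".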


Common Log Format in PADS/C

A complete PADS/C description of web server log data such as:

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Parray Phostname {
  Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
};

Punion host {
  Pip ip;            /- 135.207.23.32
  Phostname host;    /- www.research.att.com
};

Punion auth_id {
  Pchar unauthorized : unauthorized == '-';
  Pstring(:' ':) id;
};

Penum method {
  GET, PUT, POST, HEAD,
  DELETE, LINK, UNLINK
};

Pstruct version {
  "HTTP/";
  Puint8 major; '.';
  Puint8 minor;
};

int chkVersion(version v, method m) {
  if ((v.major == 1) && (v.minor == 0)) return 1;
  if ((m == LINK) || (m == UNLINK)) return 0;
  return 1;
};

Pstruct request {
  '\"'; method meth;
  ' ';  Pstring(:' ':) req_uri;
  ' ';  version version : chkVersion(version, meth);
  '\"';
};

Ptypedef Puint16_FW(:3:) response :
  response x => { 100 <= x && x < 600};

Punion length {
  Pchar unavailable : unavailable == '-';
  Puint32 len;
};

Precord Pstruct entry {
  host client;
  ' ';   auth_id remoteID;
  ' ';   auth_id auth;
  " [";  Pdate(:']':) date;
  "] ";  request request;
  ' ';   response response;
  ' ';   length length;
};

Psource Parray clf {
  entry [];
}

PADS allows concise, precise, and intuitive data specifications.
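For comparison, the same constraints can be approximated by hand in Python. The sketch below is not output of the PADS compiler; it is a regex-based stand-in, and the names ENTRY, chk_version, and parse_entry are hypothetical. It mirrors the description's checks: the method enumeration, the chkVersion predicate, the response range 100 <= x < 600, and the '-' alternative for length.

```python
import re

METHODS = {"GET", "PUT", "POST", "HEAD", "DELETE", "LINK", "UNLINK"}

# One Common Log Format entry; field names follow the PADS description.
ENTRY = re.compile(
    r'(?P<client>\S+) (?P<remoteID>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<meth>[A-Z]+) (?P<req_uri>\S+) HTTP/(?P<major>\d+)\.(?P<minor>\d+)" '
    r'(?P<response>\d{3}) (?P<length>-|\d+)$'
)

def chk_version(major, minor, meth):
    """Same logic as chkVersion: LINK and UNLINK are only valid in HTTP/1.0."""
    if major == 1 and minor == 0:
        return True
    return meth not in ("LINK", "UNLINK")

def parse_entry(line):
    """Return (record, errors); record is None when the line has the wrong shape."""
    m = ENTRY.match(line)
    if m is None:
        return None, ["syntax error"]
    rec = m.groupdict()
    errors = []
    if rec["meth"] not in METHODS:
        errors.append("unknown method")
    if not chk_version(int(rec["major"]), int(rec["minor"]), rec["meth"]):
        errors.append("method not allowed for this HTTP version")
    rec["response"] = int(rec["response"])
    if not 100 <= rec["response"] < 600:
        errors.append("response code out of range")
    rec["length"] = None if rec["length"] == "-" else int(rec["length"])
    return rec, errors
```

Even this small version mixes syntax, semantic checks, and error reporting in one function, which is the kind of clutter a declarative description is meant to avoid.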


PADS Parsing and Printing

From a data description, the PADS compiler generates

  • a parser, which maps raw input data and a mask to a pair of an in-memory representation and a parse descriptor.

  • a parse descriptor, which records meta-data about a parse, including location and error information.

  • a mask, which allows dynamic customization of parser behavior.

PADS has a formal semantics, so we can prove formal properties about the generated parsers, such as:

  • If the mask specifies “check all properties” and “set all representations,” and the parse descriptor indicates no errors, then the in-memory representation is correct.

  • Malicious data cannot corrupt the parser.

The PADS compiler also generates a printer, which maps an in-memory representation and a parse descriptor back to raw form. We’d like printing and parsing to be inverses, but that is a hard problem in general…

PADS uses meta-data to manage buggy or malicious data.
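The relationship between parser, mask, and parse descriptor can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the generated PADS/C API: ParseDescriptor, its nerr and errors fields, and the boolean check argument (standing in for one mask bit) are all hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class ParseDescriptor:
    """Meta-data about one parse: an error count plus (offset, message) pairs."""
    nerr: int = 0
    errors: list = field(default_factory=list)

    def record(self, offset, message):
        self.nerr += 1
        self.errors.append((offset, message))

def parse_response(data, pos, check=True):
    """Parse a 3-digit response code and return (representation, descriptor).

    `check` plays the role of a mask bit: when False, the semantic
    constraint 100 <= x < 600 is skipped, but syntax is still enforced.
    """
    pd = ParseDescriptor()
    text = data[pos:pos + 3]
    if len(text) < 3 or not (text.isascii() and text.isdigit()):
        pd.record(pos, "expected 3 digits")
        return None, pd
    value = int(text)
    if check and not 100 <= value < 600:
        pd.record(pos, "response code out of range")
    return value, pd
```

The parser never raises on bad input. Errors travel in the descriptor, so the caller decides how to respond, and a descriptor with no errors under full checking is what justifies trusting the in-memory value.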


Leverage!

Given a data description, the computer essentially understands the data. We can leverage that understanding to generate many tools beyond a parser.

Tools generated by the PADS compiler from a PADS data description:

  • Parser

  • Formatter

  • Statistical Analysis

  • XQuery integration

  • Translator to XML

  • Visualization Tools

Type-directed programming provides this leverage.

For each base type, we have to specify the desired behavior.

The compiler then lifts the behavior to all structured types.

Type-directed programming allows generation of useful tools from descriptions.
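As a rough illustration of lifting, the Python sketch below defines XML conversion only for two base types and then derives it for structs and arrays by recursion over the description. The tuple encoding of descriptions, the BASE_TO_XML table, and the "elt" tag for array elements are assumptions made for this example, not PADS conventions.

```python
from xml.sax.saxutils import escape

# A description is one of:
#   ("base", type_name) | ("struct", [(field_name, desc), ...]) | ("array", desc)

# Behavior is specified by hand only for base types.
BASE_TO_XML = {
    "Puint32": lambda v: str(v),
    "Pstring": lambda v: escape(v),
}

def to_xml(desc, value, tag):
    """Lift the base-type behavior to any structured description."""
    kind = desc[0]
    if kind == "base":
        body = BASE_TO_XML[desc[1]](value)
    elif kind == "struct":
        body = "".join(to_xml(d, value[name], name) for name, d in desc[1])
    elif kind == "array":
        body = "".join(to_xml(desc[1], item, "elt") for item in value)
    else:
        raise ValueError(f"unknown description kind: {kind}")
    return f"<{tag}>{body}</{tag}>"
```

Adding a new tool (a printer, a statistics accumulator) means writing one more base-type table and one more recursive function; no per-format code is needed.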


Learning: Goals & Approach

[Diagram: raw data (email, ASCII log files, binary traces) is captured by a data description (struct { ... }), from which end-user tools, visual information, and standard formats & schema (CSV, XML) are produced.]

Problem: Producing useful tools for ad hoc data takes a lot of time.

Solution: A learning system to generate data descriptions and tools automatically.


Format Inference Overview

[Diagram: Raw Data → Chunking Process → Tokenization → Structure Discovery → Format Refinement (guided by a Scoring Function) → IR to PADS Printer → PADS Description → PADS Compiler, which generates the XMLifier (producing XML) and the Accumulator (producing an Analysis Report).]


Possible Additional Material

  • PADS in More Depth: The language, the tools, the semantics. [PLDI 05, POPL 06, POPL 07, PADL 08] (long talk).

  • Format Inference: Basic algorithm, small demo, and experimental evaluation [POPL 08] (long talk).

  • In Progress: (short talk)

    • Improving format inference by learning tokenizations [PADL 09]

    • Taking steps towards making inference incremental.

  • Learning Demo: Perhaps better offline.


Contributors

  • AT&T: Yitzhak Mandelbaum, Mary Fernandez, and Andrew Forest

  • Princeton: David Walker, Kenny Zhu, Qian Xi

  • Galois: Peter White and David Burke

  • Penn: Nate Foster and Michael Greenberg


Motivation: Token Ambiguity Problem (TAP)

  • Given a string, there are multiple ways to tokenize it.

  • Example 1: 127.0.0.1

    • IP

    • Float Dot Float

    • Int Dot Int Dot Int Dot Int

  • Example 2:

    • Message

    • Word White Word White Word White... White URL

    • Word White Quote Filepath Quote White Word White...
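The ambiguity in Example 1 can be reproduced with a small Python sketch that enumerates tokenizations of a string. The token set and its regular expressions are assumptions chosen for this example (they are not learnPADS's actual token definitions), and each token type contributes only its longest match at a given position, so the enumeration is a lower bound on the real ambiguity.

```python
import re

# Hypothetical token definitions, for illustration only.
TOKENS = [
    ("IP", re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("Float", re.compile(r"\d+\.\d+")),
    ("Int", re.compile(r"\d+")),
    ("Dot", re.compile(r"\.")),
]

def tokenizations(s, pos=0):
    """Yield each way to cover s[pos:] with a sequence of token names."""
    if pos == len(s):
        yield []
        return
    for name, rx in TOKENS:
        m = rx.match(s, pos)
        if m:
            for rest in tokenizations(s, m.end()):
                yield [name] + rest

if __name__ == "__main__":
    for toks in tokenizations("127.0.0.1"):
        print(" ".join(toks))
```

Among the sequences printed for 127.0.0.1 are the three readings listed on the slide: IP, Float Dot Float, and Int Dot Int Dot Int Dot Int.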


How does learnPADS deal with TAP?

  • Tokenization Phase:

    • Take the first, longest match.

    • A fixed order is assigned by the end user.

    • We have no order to pick.

[Diagram labels: Float, Int, ID, Path]

As a result, the current learning system:

  • can’t have ambiguous base tokens – Message, Text, ID.

  • sometimes produces descriptions that are too precise.
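One plausible reading of "first, longest match" is sketched below in Python: at each position, every token definition is tried in a fixed order, the longest match wins, and ties go to the earlier definition. The token list and its ordering are assumptions for illustration, not the definitions learnPADS ships with.

```python
import re

# Hypothetical token definitions; earlier entries win ties on length.
ORDERED_TOKENS = [
    ("IP", re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("Float", re.compile(r"\d+\.\d+")),
    ("Int", re.compile(r"\d+")),
    ("Word", re.compile(r"[A-Za-z]+")),
    ("White", re.compile(r"\s+")),
    ("Punct", re.compile(r".")),
]

def tokenize(s):
    """Commit to a single tokenization using first, longest match."""
    pos, out = 0, []
    while pos < len(s):
        best_name, best_end = None, pos
        for name, rx in ORDERED_TOKENS:
            m = rx.match(s, pos)
            # Strict '>' keeps the earlier definition when lengths tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at offset {pos}")
        out.append((best_name, s[pos:best_end]))
        pos = best_end
    return out
```

Because the tokenizer commits to one answer per position, a version string such as 10.20.30 comes out as Float, Punct, Int rather than three Ints, which shows how a fixed policy can lock in a reading the data does not support.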


Scaling to Larger Data Sets

  • Original algorithm keeps entire data set in memory, so won’t scale to large data sets.

  • Proposed conceptual architecture to permit incremental learning:

