The Structure of (Computer) Scientific Revolutions

Michael Franklin

UC Berkeley & Amalgamated Insight

Dow Jones Enterprise Ventures, May 2006


Data Management: Then

Structured Data Processing

Michael Franklin

Dow Jones EV Summit May 2006


Data Management: Now


The Structure Spectrum

  • Structured data (schema-first)

    • regular, known, conforming, …

    • e.g., Relational database

  • Unstructured data (schema-never)

    • freeform, irregular

    • e.g., plain text, images, audio, …

  • Semi-structured data (schema-later)

    • Provides structural information, but less constrained.

    • e.g., XML, tagged text/media
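
As a rough illustration of the three points on the spectrum, the sketch below holds the same made-up fact in each form, using only the Python standard library. The record contents, table name, and field names are hypothetical and do not come from the talk.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical example fact, invented purely for illustration.

# Structured (schema-first): the schema is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (tag_id TEXT, shelf INTEGER)")
conn.execute("INSERT INTO reading VALUES ('T1', 3)")
structured = conn.execute(
    "SELECT shelf FROM reading WHERE tag_id = 'T1'"
).fetchone()[0]

# Semi-structured (schema-later): tags carry structure, but nothing enforces
# which elements appear (note the extra <note> element).
doc = ET.fromstring(
    "<reading><tag>T1</tag><shelf>3</shelf><note>moved twice</note></reading>"
)
semi_structured = int(doc.findtext("shelf"))

# Unstructured (schema-never): free text; a program can only match substrings.
text = "Tag T1 was seen on shelf 3 this morning."
unstructured_hit = "shelf 3" in text
```

The first two forms let a program ask for the shelf by name; the free-text form only supports matching on the wording.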


Whither Structured Data?

  • Conventional Wisdom: ~20% of data is structured currently.

  • Consumer apps, enterprise search, media apps are placing downward pressure on this.


A Contrarian View?

Two reasons why structured data is where the action will be:

  • The “Data Industrial Revolution”: Data used to be “hand-crafted”, now it’s generated by computers!!!

  • The Data Integration quagmire: structure provides crucial cues for making data usable.


The New Landscape

Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.

  • Mainframes 1960s

  • Minicomputers 1970s

  • Microcomputers/PCs 1980s

  • Web-based computing 1990s

  • Devices (cell phones, PDAs, wireless sensors, RFID) 2000s

Enabling a new generation of applications for operational visibility, monitoring, and alerting.


Data Streams → Data Flood

[Diagram labels: PoS System, Barcodes, Phones, Sensors, RFID, Inventory, Transactional Systems, Telematics, Clickstream]

  • Exponential data growth

  • New challenges: continuous, inter-connected, distributed, physical

  • Shrinking business cycles

  • More complex decisions


State of the Art

  • Custom-coded implementations that are expensive and often unsuccessful.

  • Can we develop the right infrastructure to support large-scale data streaming apps?


High Fan In Systems

  • A data management infrastructure for large-scale data streaming environments.

  • Uniform Declarative Framework

    • Every node is a data stream processor that speaks SQL-ese: stream-oriented queries at all levels

    • Hierarchical, stream-based views as an organizing principle.

    • Can impose a “view” over messy devices.


HiFi - Taming the Data Flood

  • Hierarchical Aggregation

    • Spatial

    • Temporal

  • In-network Stream Query Processing and Storage

  • Fast Data Path vs. Slow Data Path

[Diagram: hierarchy levels, top to bottom: Headquarters; Regional Centers; Warehouses, Stores; Dock doors, Shelves; Receptors]
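
One way to picture the spatial side of hierarchical aggregation is a roll-up of counts from each level to the one above it. The sketch below is only an illustration under assumed names: the node names, the `PARENT` map, and the `roll_up` helper are hypothetical and are not part of HiFi.

```python
from collections import defaultdict

# Hypothetical hierarchy (names are made up): receptor -> store -> region.
PARENT = {
    "shelf_reader_1": "store_A",
    "dock_reader_1": "store_A",
    "shelf_reader_2": "store_B",
    "store_A": "region_west",
    "store_B": "region_west",
}


def roll_up(counts, parent):
    """Aggregate per-node counts one level up the hierarchy."""
    totals = defaultdict(int)
    for node, count in counts.items():
        totals[parent[node]] += count
    return dict(totals)


# Tag counts observed at the lowest level during one time window.
receptor_counts = {"shelf_reader_1": 4, "dock_reader_1": 2, "shelf_reader_2": 5}
store_counts = roll_up(receptor_counts, PARENT)   # {'store_A': 6, 'store_B': 5}
region_counts = roll_up(store_counts, PARENT)     # {'region_west': 11}
```

Each level only forwards a summary, so the volume of data shrinks as it moves toward headquarters. Temporal aggregation would apply the same idea across time windows rather than across locations.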


Device Issues: example

Shelf RFID Test - Ground Truth


Actual RFID Readings

“Restock every time inventory goes below 5”


Query-based Data Cleaning

Cleaning stage shown: Smooth (the diagram also labels a Point stage)

CREATE VIEW smoothed_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM cleaned_rfid_stream
        [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= count_T)
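
The Smooth query can be read as: within each 5-second window, keep a (receptor, tag) pair only if it was read at least `count_T` times. The Python sketch below mimics that logic outside any stream engine. It assumes that a range equal to the slide gives non-overlapping windows, and the value chosen for `COUNT_T`, the `smooth` helper, and the sample readings are all hypothetical.

```python
from collections import Counter

WINDOW_SEC = 5   # matches the '5 sec' range and slide in the query
COUNT_T = 2      # assumed value; the slide leaves count_T unspecified


def smooth(readings, window_sec=WINDOW_SEC, count_t=COUNT_T):
    """Filter raw readings given as (timestamp_sec, receptor_id, tag_id).

    Returns {window_index: set of (receptor_id, tag_id)} containing the pairs
    read at least count_t times within that non-overlapping window.
    """
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        counts[(int(ts // window_sec), receptor_id, tag_id)] += 1
    result = {}
    for (window, receptor_id, tag_id), n in counts.items():
        if n >= count_t:
            result.setdefault(window, set()).add((receptor_id, tag_id))
    return result


readings = [
    (0.5, "r1", "tagA"), (1.5, "r1", "tagA"), (3.0, "r1", "tagA"),
    (2.0, "r1", "tagB"),                      # single stray read, filtered out
    (6.0, "r1", "tagA"), (9.0, "r1", "tagA"),
]
smoothed = smooth(readings)
```

With these made-up readings, `tagA` survives in both windows while the single read of `tagB` is dropped.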


Query-based Data Cleaning

Cleaning stage shown: Arbitrate (the diagram also labels the Smooth and Point stages)

CREATE VIEW arbitrated_rfid_stream AS
  (SELECT receptor_id, tag_id
   FROM smoothed_rfid_stream rs
        [range by '5 sec', slide by '5 sec']
   GROUP BY receptor_id, tag_id
   HAVING count(*) >= ALL
     (SELECT count(*)
      FROM smoothed_rfid_stream
           [range by '5 sec', slide by '5 sec']
      WHERE tag_id = rs.tag_id
      GROUP BY receptor_id))
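
The Arbitrate query resolves a tag seen by several receptors in the same window by keeping the receptor whose count for that tag is at least as large as every other receptor's. A minimal Python sketch of that rule for a single window follows; the `arbitrate` helper and the sample data are hypothetical, and ties keep every winning receptor, as `>= ALL` would.

```python
from collections import Counter


def arbitrate(window_readings):
    """Assign each tag to the receptor(s) that read it most often.

    window_readings: iterable of (receptor_id, tag_id) pairs from one window.
    Returns the set of (receptor_id, tag_id) pairs whose count is >= the count
    of that tag at every receptor, mirroring HAVING count(*) >= ALL (...).
    """
    counts = Counter(window_readings)
    best = {}
    for (receptor_id, tag_id), n in counts.items():
        best[tag_id] = max(best.get(tag_id, 0), n)
    return {(r, t) for (r, t), n in counts.items() if n == best[t]}


# Made-up window: tagA is read by two neighboring shelves, tagB by one.
window = (
    [("shelf1", "tagA")] * 3
    + [("shelf2", "tagA")] * 1
    + [("shelf2", "tagB")] * 2
)
assigned = arbitrate(window)
```

Here `tagA` is attributed to `shelf1` only, which removes the duplicate sighting at `shelf2`.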


After Query-based Cleaning

“Restock every time inventory goes below 5”


Once you have the right abstractions…

  • “Soft Sensors”

  • Quality and lineage

  • Optimization (power, etc.)

  • Pushdown of external validation information

  • Data archiving

  • Model-based sensing

  • Imperative processing


Data Integration

  • Integration is the ultimate schema-first problem.

  • Structure is both a key enabler and a key impediment here.


Search vs. Query

What if you wanted to find out which actors donated to John Kerry’s presidential campaign?


  • “Search” can return only what’s been previously “stored”.


Also…

  • What if you wanted to find out the average donation of actors to each candidate?

  • What if you wanted to compare actor donations this campaign to the last one?

  • What if you wanted to find out who gave the most to each candidate?

  • What if you wanted to know where the information came from, and how old it was?
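
Questions like these are aggregates and comparisons, which become short queries once the data sits in a structured table. The sketch below uses Python's built-in `sqlite3` with an invented `donation` table; the table, its columns, and every row are placeholders for illustration, not real campaign data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donation (donor TEXT, candidate TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO donation VALUES (?, ?, ?)",
    [
        ("Actor A", "Candidate X", 1000.0),   # placeholder rows, not real data
        ("Actor B", "Candidate X", 500.0),
        ("Actor C", "Candidate Y", 2000.0),
    ],
)

# Average donation to each candidate.
avg_by_candidate = dict(conn.execute(
    "SELECT candidate, AVG(amount) FROM donation GROUP BY candidate"
))

# Who gave the most to each candidate.
top_donor = dict(conn.execute(
    "SELECT d.candidate, d.donor FROM donation d "
    "WHERE d.amount = "
    "(SELECT MAX(amount) FROM donation WHERE candidate = d.candidate)"
))
```

Answering the provenance question (where the information came from and how old it is) would need extra columns such as a source and a retrieval date, which this sketch leaves out.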


A “Deep-Web” Query Approach

SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
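
To show how such a join behaves, here is a small runnable sketch using Python's `sqlite3`. The table names follow the query above, but the columns are reduced to the two that are visible in it, and all rows are invented placeholders. It is not a reconstruction of the talk's demo or its results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Yahoo_Actors (name TEXT)")
conn.execute("CREATE TABLE FECInfo (name TEXT, occupation TEXT)")

# Placeholder rows; the second person is spelled differently in each source.
conn.executemany(
    "INSERT INTO Yahoo_Actors VALUES (?)",
    [("Pat Example",), ("Sam Sample",)],
)
conn.executemany(
    "INSERT INTO FECInfo VALUES (?, ?)",
    [("Pat Example", "Actor"), ("SAMPLE, SAM", "Actor")],
)

rows = conn.execute(
    "SELECT y.name, f.occupation "
    "FROM Yahoo_Actors y, FECInfo f "
    "WHERE y.name = f.name"
).fetchall()
```

In this made-up data the equality join returns only the row whose name is written identically in both tables, so the result depends on how consistently the two sources format names.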


“Yahoo Actors” JOIN “FECInfo”

Q: Did it Work?


The Fundamental Tradeoff

Structure enables computers to help users manipulate and maintain the data.

[Chart: Level of Functionality vs. Time (and cost), with curves labeled Structured (schema-first), Semi-Structured (schema-later), and Unstructured (schema-less)]


Dataspaces*

  • Deal with all the data from an enterprise – in whatever form

  • Data co-existence

    • no integrated schema, no single warehouse

  • Pay-as-you-go services

    • Keyword search is bare minimum.

    • Data manipulation and increased consistency as you add work.

* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
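
As a sketch of the "bare minimum" service, keyword search can run over sources that share no schema by falling back to each item's text form. The `keyword_search` helper and the sample sources below are hypothetical and are not taken from the dataspaces paper.

```python
def keyword_search(sources, keyword):
    """Return the names of sources whose content mentions the keyword.

    sources maps a source name to content of any type (text, dict, markup).
    With no shared schema, each item is matched on its string rendering.
    """
    needle = keyword.lower()
    hits = []
    for name, content in sources.items():
        if needle in str(content).lower():
            hits.append(name)
    return sorted(hits)


# Made-up sources in three different forms.
sources = {
    "email.txt": "Pallet 17 arrived late at dock 3",
    "inventory_row": {"item": "pallet", "count": 17},
    "sensor.xml": "<reading><temp>21</temp></reading>",
}
hits = keyword_search(sources, "pallet")
```

Richer services, such as structured queries or consistency checks, would be layered on only for the sources where someone has invested the work to describe their structure.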


Dataspaces vs. Databases

  • Dataspaces

    • Data Coexistence

    • Autonomous Sources

    • Search, Browse, Approximate Answer

    • Best Effort Guarantees

  • Databases

    • Single Schema

    • Centralized Administration

    • Structured Query

    • Strict Integrity Constraints


The World of Dataspaces

[Chart: systems placed by Administrative Proximity (Near to Far) and Semantic Integration (Low to High): Web Search, Virtual Organization, Federated DBMS, Desktop Search, DBMS]


Conclusions

  • Structured data is not going away.

    • In fact, there will be lots more of it.

    • And it must be processed as fast as it is created.

  • Structure is crucial for successful data integration and manipulation.

    • Much effort will be expended to add structural information to text and media.

  • Traditional (structured) database technology is not up to the task.

  • Great opportunities for innovation.

    • HiFi and Dataspaces are examples.
