The Structure of (Computer) Scientific Revolutions
Presentation Transcript

  1. The Structure of (Computer) Scientific Revolutions. Michael Franklin, UC Berkeley & Amalgamated Insight. Dow Jones Enterprise Ventures, May 2006

  2. Data Management: Then Structured Data Processing Michael Franklin Dow Jones EV Summit May 2006

  3. Data Management: Now

  4. The Structure Spectrum
     • Structured data (schema-first): regular, known, conforming, …; e.g., relational database
     • Unstructured data (schema-never): freeform, irregular; e.g., plain text, images, audio, …
     • Semi-structured data (schema-later): provides structural information, but is less constrained; e.g., XML, tagged text/media

  5. Whither Structured Data?
     • Conventional wisdom: ~20% of data is currently structured.
     • Consumer apps, enterprise search, and media apps are placing downward pressure on this share.

  6. A Contrarian View?
     Two reasons why structured data is where the action will be:
     • The “Data Industrial Revolution”: data used to be “hand-crafted”; now it’s generated by computers!
     • The data integration quagmire: structure provides crucial cues for making data usable.

  7. The New Landscape
     Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.
     • Mainframes: 1960s
     • Minicomputers: 1970s
     • Microcomputers/PCs: 1980s
     • Web-based computing: 1990s
     • Devices (cell phones, PDAs, wireless sensors, RFID): 2000s
     These enable a new generation of applications for operational visibility, monitoring, and alerting.

  8. Data Streams → Data Flood
     • Exponential data growth
     • New challenges: continuous, inter-connected, distributed, physical
     • Shrinking business cycles
     • More complex decisions
     [Figure labels: PoS System, Barcodes, Phones, Sensors, RFID, Inventory, Transactional Systems, Telematics, Clickstream]

  9. State of the Art
     • Custom-coded implementations that are expensive and often unsuccessful.
     • Can we develop the right infrastructure to support large-scale data streaming apps?

  10. High Fan-In Systems
     • A data management infrastructure for large-scale data streaming environments.
     • Uniform declarative framework
     • Every node is a data stream processor that speaks SQL-ese → stream-oriented queries at all levels.
     • Hierarchical, stream-based views as an organizing principle.
     • Can impose a “view” over messy devices.

  11. HiFi - Taming the Data Flood
     • Hierarchical aggregation: spatial, temporal
     • In-network stream query processing and storage
     • Fast data path vs. slow data path
     [Figure: hierarchy levels, top to bottom: headquarters; regional centers; warehouses, stores; dock doors, shelves; receptors]
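
The spatial side of hierarchical aggregation can be sketched as a tree in which each level rolls up the counts reported by the level below it. This is only an illustration of the idea, not HiFi's implementation; the `Node` class, its fields, and the node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One level of the hierarchy (hypothetical type, for illustration only)."""
    name: str
    local_count: int = 0  # items observed directly at this node
    children: List["Node"] = field(default_factory=list)

    def aggregate(self) -> int:
        """Roll up this node's count together with all of its descendants."""
        return self.local_count + sum(child.aggregate() for child in self.children)


# Hypothetical hierarchy: shelves report to a store, the store to headquarters.
shelf_1 = Node("shelf-1", local_count=3)
shelf_2 = Node("shelf-2", local_count=4)
store = Node("store", children=[shelf_1, shelf_2])
headquarters = Node("headquarters", children=[store])

print(headquarters.aggregate())
```

Each level sees only the summary from the level below, which is one way the volume of raw readings could shrink on the way up.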

  12. Device Issues: Example. Shelf RFID Test - Ground Truth

  13. Actual RFID Readings. “Restock every time inventory goes below 5”

  14. Query-based Data Cleaning: Smooth
        CREATE VIEW smoothed_rfid_stream AS
          (SELECT receptor_id, tag_id
           FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
           GROUP BY receptor_id, tag_id
           HAVING count(*) >= count_T)
     [Pipeline stage labels on slide: Smooth, Point]
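
As a rough illustration of what the Smooth view computes, the sketch below groups readings into non-overlapping 5-second windows (range equal to slide) and keeps only the (receptor, tag) pairs seen at least `count_t` times in a window. The tuple layout of the readings, the function name, and the default threshold are assumptions made for this example; the slide does not specify them.

```python
from collections import Counter, defaultdict


def smooth(readings, window_sec=5.0, count_t=2):
    """Keep (receptor_id, tag_id) pairs read at least count_t times per window.

    readings: iterable of (timestamp_sec, receptor_id, tag_id) tuples
    (an assumed layout). Returns {window_index: set of surviving pairs}.
    """
    counts = defaultdict(Counter)
    for timestamp, receptor_id, tag_id in readings:
        window = int(timestamp // window_sec)  # tumbling window index
        counts[window][(receptor_id, tag_id)] += 1  # GROUP BY receptor_id, tag_id
    return {
        window: {pair for pair, n in per_pair.items() if n >= count_t}  # HAVING
        for window, per_pair in counts.items()
    }


# Hypothetical readings: tag "a" is read twice in the first window, "b" once.
readings = [(0.1, "r1", "a"), (1.0, "r1", "a"), (2.0, "r1", "b"), (6.0, "r1", "a")]
print(smooth(readings))
```

A single stray read, such as tag "b" above, falls below the threshold and is dropped, which is the smoothing effect the view is after.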

  15. Query-based Data Cleaning: Arbitrate
        CREATE VIEW arbitrated_rfid_stream AS
          (SELECT receptor_id, tag_id
           FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
           GROUP BY receptor_id, tag_id
           HAVING count(*) >= ALL
             (SELECT count(*)
              FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
              WHERE tag_id = rs.tag_id
              GROUP BY receptor_id))
     [Pipeline stage labels on slide: Arbitrate, Smooth, Point]
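
The Arbitrate view resolves a tag that is reported by several receptors in the same window by keeping the receptor (or receptors, on a tie) with the highest count for that tag. A minimal sketch of that logic follows; the input tuple layout and the function name are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def arbitrate(rows, window_sec=5.0):
    """Assign each tag to the receptor(s) that reported it most often per window.

    rows: iterable of (timestamp_sec, receptor_id, tag_id) tuples (assumed layout).
    Returns {window_index: set of (receptor_id, tag_id) pairs}.
    """
    counts = defaultdict(Counter)
    for timestamp, receptor_id, tag_id in rows:
        counts[int(timestamp // window_sec)][(receptor_id, tag_id)] += 1

    result = {}
    for window, per_pair in counts.items():
        # Highest count seen for each tag across all receptors in this window.
        best = defaultdict(int)
        for (receptor_id, tag_id), n in per_pair.items():
            best[tag_id] = max(best[tag_id], n)
        # ">= ALL" keeps exactly the pairs that reach that maximum (ties survive).
        result[window] = {
            (receptor_id, tag_id)
            for (receptor_id, tag_id), n in per_pair.items()
            if n == best[tag_id]
        }
    return result


# Hypothetical rows: tag "a" is seen twice by r1 and once by r2, so r1 wins.
rows = [(0, "r1", "a"), (1, "r1", "a"), (2, "r2", "a"), (3, "r2", "b")]
print(arbitrate(rows))
```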

  16. After Query-based Cleaning. “Restock every time inventory goes below 5”
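
To see why cleaning matters for a rule such as “Restock every time inventory goes below 5”, the sketch below fires an alert each time a per-window item count crosses from at or above the threshold to below it. The function name and both count series are made up for illustration; they are not the data plotted on the slides.

```python
def restock_alerts(counts, threshold=5):
    """Return the window indices where the count drops below the threshold."""
    alerts = []
    previous = None
    for index, count in enumerate(counts):
        crossed = previous is None or previous >= threshold
        if count < threshold and crossed:
            alerts.append(index)
        previous = count
    return alerts


# Hypothetical per-window shelf counts: dropped readings make the raw series dip.
raw_counts = [6, 4, 6, 4, 3, 7]
cleaned_counts = [6, 6, 6, 6, 5, 5]

print(restock_alerts(raw_counts))      # spurious alerts caused by missed reads
print(restock_alerts(cleaned_counts))  # no alerts once the dips are smoothed out
```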

  17. Once you have the right abstractions…
     • “Soft Sensors”
     • Quality and lineage
     • Optimization (power, etc.)
     • Pushdown of external validation information
     • Data archiving
     • Model-based sensing
     • Imperative processing
     • …

  18. Data Integration
     • Integration is the ultimate schema-first problem.
     • Structure is both a key enabler and a key impediment here.

  19. Search vs. Query. What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

  20. Search vs. Query

  21. Search vs. Query. What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

  22. Search vs. Query
     • “Search” can return only what’s been previously “stored”.

  23. Also…
     • What if you wanted to find out the average donation of actors to each candidate?
     • What if you wanted to compare actor donations this campaign to the last one?
     • What if you wanted to find out who gave the most to each candidate?
     • What if you wanted to know where the information came from, and how old it was?
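
Questions like these are aggregations over structured records, which is what a query can compute and a keyword search cannot. The sketch below shows two of them, average donation per candidate and top donor per candidate, over a list of dictionaries. The field names (`donor`, `candidate`, `amount`) and all rows are hypothetical.

```python
from collections import defaultdict


def average_by_candidate(rows):
    """Average donation amount per candidate."""
    totals = defaultdict(lambda: [0.0, 0])  # candidate -> [sum, count]
    for row in rows:
        entry = totals[row["candidate"]]
        entry[0] += row["amount"]
        entry[1] += 1
    return {candidate: total / n for candidate, (total, n) in totals.items()}


def top_donor_by_candidate(rows):
    """Donor with the largest single donation to each candidate."""
    best = {}
    for row in rows:
        current = best.get(row["candidate"])
        if current is None or row["amount"] > current["amount"]:
            best[row["candidate"]] = row
    return {candidate: row["donor"] for candidate, row in best.items()}


# Hypothetical donation records.
donations = [
    {"donor": "A", "candidate": "X", "amount": 100.0},
    {"donor": "B", "candidate": "X", "amount": 300.0},
    {"donor": "C", "candidate": "Y", "amount": 50.0},
]
print(average_by_candidate(donations))
print(top_donor_by_candidate(donations))
```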

  24. A “Deep-Web” Query Approach
        SELECT y.name, f.occupation, …
        FROM Yahoo_Actors y, FECInfo f
        WHERE y.name = f.name
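
The join in this query can be sketched as a hash join on an exact name match. The record layout and the sample rows below are hypothetical. Note that exact string equality only pairs names that are spelled identically in both sources, which is worth keeping in mind when judging the result of such a join.

```python
def join_on_name(actors, donations):
    """Equality join of two record lists on the "name" field."""
    by_name = {}
    for donation in donations:
        by_name.setdefault(donation["name"], []).append(donation)
    return [
        {"name": actor["name"], "occupation": donation["occupation"]}
        for actor in actors
        for donation in by_name.get(actor["name"], [])
    ]


# Hypothetical rows: the second donor's name is formatted differently.
actors = [{"name": "Jane Doe"}, {"name": "John Roe"}]
donations = [
    {"name": "Jane Doe", "occupation": "Actor"},
    {"name": "JOHN ROE", "occupation": "Actor"},
]
print(join_on_name(actors, donations))
```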

  25. “Yahoo Actors” JOIN “FECInfo”. Q: Did it work?

  26. The Fundamental Tradeoff
     Structure enables computers to help users manipulate and maintain the data.
     [Figure: level of functionality vs. time (and cost) for structured (schema-first), semi-structured (schema-later), and unstructured (schema-less) data]

  27. Dataspaces*
     • Deal with all the data from an enterprise, in whatever form.
     • Data co-existence: no integrated schema, no single warehouse.
     • Pay-as-you-go services
     • Keyword search is the bare minimum.
     • Data manipulation and increased consistency as you add work.
     * “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
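
The “bare minimum” service, keyword search over co-existing sources with no integrated schema, can be sketched as a scan that treats every item as text regardless of its form. The function name, the source names, and the sample items are all hypothetical; a real dataspace platform would use indexes rather than a linear scan.

```python
def keyword_search(sources, keyword):
    """Return (source_name, item) pairs whose text form contains the keyword.

    sources: mapping of source name to a list of items in any form
    (strings, dictionaries, ...). No shared schema is assumed.
    """
    needle = keyword.lower()
    hits = []
    for source_name, items in sources.items():
        for item in items:
            if needle in str(item).lower():
                hits.append((source_name, item))
    return hits


# Hypothetical heterogeneous sources.
sources = {
    "email": ["Budget review Friday"],
    "crm": [{"account": "Acme", "note": "budget approved"}],
    "files": ["holiday photos"],
}
print(keyword_search(sources, "budget"))
```

Richer services, such as structured queries over the `crm` records, could then be layered on source by source as more integration work is invested.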

  28. Dataspaces vs. Databases
     Dataspaces                          | Databases
     Data coexistence                    | Single schema
     Autonomous sources                  | Centralized administration
     Search, browse, approximate answer  | Structured query
     Best-effort guarantees              | Strict integrity constraints

  29. The World of Dataspaces
     [Figure: systems placed by administrative proximity (far to near) and semantic integration (high to low): web search, virtual organization, federated DBMS, desktop search, DBMS]

  30. Conclusions
     • Structured data is not going away.
     • In fact, there will be lots more of it, and it must be processed as fast as it is created.
     • Structure is crucial for successful data integration and manipulation.
     • Much effort will be expended to add structural information to text and media.
     • Traditional (structured) database technology is not up to the task.
     • Great opportunities for innovation.
     • HiFi and Dataspaces are examples.