
The Sibdata Revolution

The Sibdata Revolution. Nick Roussopoulos DCS & UMIACS & Univ. of Maryland. September 2009. Data Management: Past to Current. Structured Data Structured architectures. Data Management: Huh???. The Landscape.


Presentation Transcript


  1. The Sibdata Revolution Nick Roussopoulos DCS & UMIACS & Univ. of Maryland September 2009

  2. Data Management: Past to Current • Structured Data • Structured architectures

  3. Data Management: Huh???

  4. The Landscape • Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect • Mainframes 1960s • Minicomputers 1970s • Microcomputers/PCs 1980s • Web-based computing 1990s • Devices (smart phones, PDAs, wireless sensors, RFID) 2000s • These enable a new generation of applications that mandate new data management methods & tools.

  5. Data Then and Now • The “Data Industrial Revolution”: Data used to be “hand-crafted”, now it’s generated by computers!!! • The Data Integration quagmire: 40 years of continuous successes (sic) and still a long way to the end. • Structure provides crucial understanding for making data usable and leads to discovery/innovation.

  6. Data Streaming → Data Explosion • Data sources (figure labels): PoS systems, barcodes, phones, sensors, RFID, inventory, transactional systems, telematics, clickstream • Exponential data growth • New challenges: continuous, inter-connected, distributed, physical • Shrinking business cycles • More complex decisions

  7. The Structure Spectrum • Structured data (schema-first) • regular, known, conforming, … • e.g., relational database • Unstructured data (schema-never) • freeform, irregular, … • e.g., plain text, images, audio, … • Semi-structured data (schema-later) • provides structural information, but less constrained • e.g., XML, tagged text/media

  8. Data Integration • Integration is the ultimate schema-first problem. • Requires complete understanding & disambiguation • Structure (semantics) is both a key enabler and a key impediment here.

  9. Structured Data: How Much? • Conventional wisdom: ~20% of data is currently structured. • Consumer apps, enterprise search, and multimedia apps are placing downward pressure on this share.

  10. State of the Art: Integration-in-the-large • Team work, huge & expensive effort, excruciating pain • Extremely long time lag between data generation and availability • Custom-coded implementations that are often unsuccessful • Clearing house of already discovered knowledge (the high overhead is for disambiguating the semantics of the heterogeneous data)

  11. Future: Integration-in-the-small • End-user, limited in scope, requires training • Continuous as the data sources and equipment evolve • End-user tools are needed • Small cost, enormous opportunity for discovery and innovation

  12. Sibling Data • Aggregation and naming of disparate data regardless of location • Includes actual data, references to external data, queries that generate data, & programs to process data • May include other sibdata • Open vs Closed • Open: continuous accumulation • Closed: fixed snapshot (archival) • Location-independent semantics
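The composition described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `Sibdata`, `Reference`, and `StoredQuery`, and the methods `add`, `close`, and `flatten`, are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Reference:
    """Pointer to external data, kept by reference rather than copied."""
    url: str


@dataclass
class StoredQuery:
    """A query that generates data when run; `run` is a stand-in callable."""
    text: str
    run: Callable[[], list]


@dataclass
class Sibdata:
    name: str
    is_open: bool = True  # open: keeps accumulating; closed: fixed snapshot
    members: List[object] = field(default_factory=list)

    def add(self, member: object) -> None:
        if not self.is_open:
            raise ValueError(f"sibdata {self.name!r} is closed")
        self.members.append(member)

    def close(self) -> None:
        """Freeze this sibdata into an archival snapshot."""
        self.is_open = False

    def flatten(self) -> Iterator[object]:
        """Yield every leaf member, descending into nested sibdata."""
        for member in self.members:
            if isinstance(member, Sibdata):
                yield from member.flatten()
            else:
                yield member


inner = Sibdata("reviews")
inner.add({"text": "placeholder record"})                # actual data
outer = Sibdata("moore")
outer.add(Reference("http://www.michaelmoore.com/"))     # reference to external data
outer.add(StoredQuery("placeholder query", lambda: []))  # query that generates data
outer.add(inner)                                         # nested sibdata
outer.close()                                            # now a fixed snapshot
```

In this sketch, closing `outer` does not close `inner`; whether closure should propagate to nested sibdata is a design question the slides leave open.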

  13. Web search results

  14. Content vs URL • Content • http://www.michaelmoore.com/

  15. Deep-Web Queries • SELECT m.title FROM Yahoo_Movies m WHERE m.title LIKE '%Moore%';

  16. Result vs. Query • Results are associated with the time the query was run • Queries can be captured in sibdata and executed at will; such a sibdata is open and captures a different result each time the query executes
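The difference between a stored result and a stored query can be sketched as follows. The class `CapturedQuery` and its `execute` hook are hypothetical names used only for illustration, and the in-memory `catalog` list stands in for a real, changing data source.

```python
from datetime import datetime, timezone
from typing import Callable, List, Tuple


class CapturedQuery:
    """Stores a query's text and re-runs it on demand, logging each result."""

    def __init__(self, text: str, execute: Callable[[str], list]):
        self.text = text
        self._execute = execute  # hypothetical hook into the real data source
        self.history: List[Tuple[datetime, list]] = []

    def run(self) -> list:
        rows = list(self._execute(self.text))
        # Each result is tied to the time the query was run.
        self.history.append((datetime.now(timezone.utc), rows))
        return rows


# Stand-in source whose contents grow between runs.
catalog = ["row-1"]
query = CapturedQuery("placeholder query text", lambda _text: list(catalog))
first = query.run()
catalog.append("row-2")
second = query.run()  # same query, different result
```

Keeping the timestamped history alongside the query text is one way to retain both views: the reproducible question and the dated answers.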

  17. Queries to Relational Databases • Yahoo_Actors

  18. Sibdata • Deal with all the data from everywhere & in whatever form they come • Data co-existence: no integrated schema, no single warehouse • Expand-as-you-go • Integrate little by little as you need • ETL: map and integrate data as you add more

  19. Sibdata Properties • Lightweight: metadata captures the encapsulation, name, and provenance data • Location-independent: accessible from anywhere • Isolated: generated with no interference • Durable: persists until dropped • Secure: guarantees the security defined by the creators and sources; composes multiple levels of security across its components

  20. Comparison to Transactions • Transactions • Grouping of many actions into an atomic transaction (ACID properties) • Substrate: database • Sibdata • Grouping of data into an atomic sibdata (LLADS) • Substrate: actions/transactions/data generators

  21. Sibdata Infrastructure

  22. Sibdata Servers • Establish a global sibdata ID and name • Create and maintain metadata with provenance, users, security, etc. • Provide a searchable catalog • Provide storage for non-sib-compliant data sources • Provide fault tolerance (replication)

  23. Sib Protocols • Establish Sibdata protocol • Concurrency-consistency issues (?) • Sharing of data • Name conventions • Dispute resolution • Distributed logging • Security using chits • Group and multi-valued ownership and visibility

  24. User Interface • Simple OS support • Query Languages • Graphical Languages • ETL tools • Extra functionality • High dimensional indexing • Mining

  25. Conclusions • Need to build Sib Infrastructure • Refine the sibdata semantics • Refine the security protocols • For data aggregates • User groups • Great opportunities for innovation

  26. Presentations & Project • 3 × 7 students = 21 presentations, ~2 per lecture • Lecture dates • Sep: 15, 22, 29 • Oct: 6, 13, 20, 27 • Nov: 3, 10, 17, 24 • Dec: 1, 8 • Project: proposal due Sep 29 • Discussion: at every lecture, be prepared to give a 2-3 min progress report (papers found, etc.)

  27. Network Data Independence (Hellerstein, Berkeley) • Physical Data Independence • Decoupling data from layout (not hard-coded in applications) • Permits reorganization of data w/o affecting the apps • Declarative query languages • Using the schema • Distributed Databases • Transparency hides location from the user, who acts as if accessing a centralized database • Limited number of sites: cannot accommodate mobility and constant change of the configuration

  28. Pillars of Data Independence • [Figure: table R and its occurrence file] • Indexes: offer indirection, allowing modification of the underlying structure • Schema-based, declarative query languages & optimization

  29. Sibdata Independence • Encapsulation of dissimilar data • Data can be moved, rearranged, altered • Additional indices on top of sibdata become part of the sibdata • Naming and provenance data are fixed • They do not change as seen from the outside world • Containment information (sibdata encapsulation within other sibdata) is guaranteed

  30. DHT (Chord) • Data-centric distribution • according to content: total data independence • very large number of distributed servers • Configuration changes rapidly (although this may not be really that important) • Fault tolerance (extra machines) • Limited to single-key searches (not range or join queries)
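A minimal sketch of the content-based placement behind a Chord-style DHT, assuming SHA-1 identifiers on a ring and a plain sorted list in place of Chord's finger tables and network hops. The names `Ring`, `ring_id`, and `successor` are illustrative and not taken from the slides.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def ring_id(key: str) -> int:
    """Map a key or node name to a 160-bit position on the identifier ring."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")


class Ring:
    """Centralized stand-in for the ring; a real DHT resolves this with routing."""

    def __init__(self, nodes: Iterable[str]):
        self._ids: List[Tuple[int, str]] = sorted((ring_id(n), n) for n in nodes)

    def successor(self, key: str) -> str:
        """Node responsible for key: first node at or after the key's position."""
        position = ring_id(key)
        index = bisect.bisect_left(self._ids, (position, ""))
        return self._ids[index % len(self._ids)][1]  # wrap around the ring


nodes = [f"node{i}" for i in range(8)]
ring = Ring(nodes)
owner = ring.successor("some-key")
```

Because hashing scatters adjacent keys across the ring, this lookup answers only exact-key questions, which matches the slide's remark about range and join queries. Removing a node reassigns only the keys that node owned.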

  31. Network Names & Services • Internet Indirection Infrastructure (i3) • Triggers (id, r), where id = global ID and r is an address to forward packets to • When a mobile user moves to r’, the user updates the trigger to (id, r’) • It also supports 1-to-n mappings (anycast) • Content Distribution Networks (Akamai) • Replicate heavy data (images, videos) to multiple sites and redirect user accesses to those that are closer (indirection via location independence)
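The trigger mechanism can be sketched as a lookup table from global IDs to forwarding addresses. The class `IndirectionTable` and its methods are hypothetical names, and delivering to every registered address is a simplification; an anycast variant would pick a single receiver instead.

```python
from typing import Dict, List, Set, Tuple


class IndirectionTable:
    """Toy rendezvous point: senders address an id, never a location."""

    def __init__(self) -> None:
        self._triggers: Dict[str, Set[str]] = {}

    def insert(self, ident: str, addr: str) -> None:
        """Register trigger (id, r)."""
        self._triggers.setdefault(ident, set()).add(addr)

    def remove(self, ident: str, addr: str) -> None:
        self._triggers.get(ident, set()).discard(addr)

    def move(self, ident: str, old_addr: str, new_addr: str) -> None:
        """Mobility: replace (id, r) with (id, r')."""
        self.remove(ident, old_addr)
        self.insert(ident, new_addr)

    def send(self, ident: str, packet: str) -> List[Tuple[str, str]]:
        """Return the (address, packet) deliveries for every matching trigger."""
        return [(addr, packet) for addr in sorted(self._triggers.get(ident, ()))]


table = IndirectionTable()
table.insert("user-42", "10.0.0.1")
before = table.send("user-42", "hello")
table.move("user-42", "10.0.0.1", "10.0.9.9")
after = table.send("user-42", "hello")  # sender's code did not change
```

The sender uses the same ID before and after the move, which is the location independence the slide refers to.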

  32. Relevant DB Technologies • Distributed Aggregation • Monitoring networks (collecting stats) • Computing synopses and passing them along • Adaptive execution plans • Feedback to the execution • Commutative tasks to avoid extended delays • Range search over DHT • Trie hashing • Still limited • P2P & Mobile Databases
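The idea of computing synopses and passing them along can be sketched with a small mergeable summary that each node combines with its children's summaries. The `Synopsis` class, the `(readings, children)` tree encoding, and the sample readings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Synopsis:
    """Fixed-size summary that can be merged without the raw readings."""
    count: int
    total: int
    lo: int
    hi: int

    @classmethod
    def of(cls, values: Iterable[int]) -> "Synopsis":
        vals = list(values)  # assumed non-empty in this sketch
        return cls(len(vals), sum(vals), min(vals), max(vals))

    def merge(self, other: "Synopsis") -> "Synopsis":
        return Synopsis(
            self.count + other.count,
            self.total + other.total,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )


def aggregate(node) -> Synopsis:
    """node is (local_readings, children); summaries flow from leaves to root."""
    readings, children = node
    result = Synopsis.of(readings)
    for child in children:
        result = result.merge(aggregate(child))
    return result


leaf_a = ([3, 9], [])
leaf_b = ([4], [])
root = ([7, 1], [leaf_a, leaf_b])
summary = aggregate(root)
```

Each link carries one constant-size synopsis rather than every reading, which is where the bandwidth saving comes from.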

  33. Pier: A P2P in situ Query Engine • Goals • Massively distributed processing • Scalability • Relaxed consistency (best effort) • Architecture • P2P built on top of DHT • Multicast to all related nodes (lscan) • Pipelining the intermediate results

  34. Pier Joins • Stored in DHT • Namespace = relation (NR, NS) • resourceID = primary key (PK) • instanceID = tuple # if not a PK • Assume R and S are already DHT-hashed using <NR,PKR,1> and <NS,PKS,1> • Symmetric Join building phase • lscan NR and NS, eliminating unqualified tuples and unneeded attributes • Rehash all of the above tuples using • namespace NQ • resourceID = R.pkey*S.pkey • Tuples are tagged with relation name • Symmetric Join probing phase • Probing runs locally, in parallel with building (with callbacks) • Satisfying tuples are either sent to the Qsite or DHT-ed for the pipelined op • Consumes a lot of bandwidth
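The interleaved build and probe phases can be sketched on a single machine, with a dictionary standing in for the rehashed namespace NQ. This sketch assumes tuples are rehashed on the value of the join attribute and does not model the slide's resourceID expression literally; the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, Tuple


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, tuple]],
    key_r: Callable[[tuple], Hashable],
    key_s: Callable[[tuple], Hashable],
) -> Iterator[Tuple[tuple, tuple]]:
    """arrivals yields ("R", row) or ("S", row) in any interleaved order."""
    # Stand-in for namespace NQ: one bucket per join value, rows tagged by relation.
    nq = defaultdict(lambda: {"R": [], "S": []})
    for relation, row in arrivals:
        key = key_r(row) if relation == "R" else key_s(row)
        bucket = nq[key]
        bucket[relation].append(row)              # build
        other = "S" if relation == "R" else "R"
        for match in bucket[other]:               # probe rows that arrived earlier
            yield (row, match) if relation == "R" else (match, row)


r_rows = [(1, "a"), (2, "b")]
s_rows = [(1, "x"), (3, "z"), (1, "y")]
arrivals = [
    ("R", r_rows[0]), ("S", s_rows[0]), ("S", s_rows[1]),
    ("R", r_rows[1]), ("S", s_rows[2]),
]
pairs = list(symmetric_hash_join(arrivals, lambda t: t[0], lambda t: t[0]))
```

Every qualifying tuple from both relations is rehashed before any match is known, which illustrates why the slide flags bandwidth consumption as the weakness.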

  35. Better Joins • Fetch Matches • Hash only S • lscan R and fetch NS tuples • Rewriting Join using 2-way semijoin • Project R & S on their PK and joining attribute • Do symmetric join on these projections • Rewriting Join using Bloom filters • Create and DHT the Bloom filters • Do lscan and access the Bloom filter to eliminate non-joinable tuples
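The Bloom-filter rewrite can be sketched as a pre-filtering step that runs before any tuples are rehashed. The filter size, hash construction, and the names `BloomFilter` and `bloom_prefilter` are assumptions for illustration; in the distributed setting the filters themselves would be published through the DHT.

```python
import hashlib
from typing import Callable, Hashable, Iterator, List, Sequence, Tuple


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: Hashable) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: Hashable) -> None:
        for position in self._positions(item):
            self.bits |= 1 << position

    def __contains__(self, item: Hashable) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def bloom_prefilter(
    r_rows: Sequence[tuple],
    s_rows: Sequence[tuple],
    key_r: Callable[[tuple], Hashable],
    key_s: Callable[[tuple], Hashable],
) -> Tuple[List[tuple], List[tuple]]:
    """Drop rows whose join key cannot appear on the other side."""
    filter_r, filter_s = BloomFilter(), BloomFilter()
    for row in r_rows:
        filter_r.add(key_r(row))
    for row in s_rows:
        filter_s.add(key_s(row))
    r_kept = [row for row in r_rows if key_r(row) in filter_s]
    s_kept = [row for row in s_rows if key_s(row) in filter_r]
    return r_kept, s_kept


r_rows = [(i, f"r{i}") for i in range(0, 50)]
s_rows = [(i, f"s{i}") for i in range(40, 90)]
r_kept, s_kept = bloom_prefilter(r_rows, s_rows, lambda t: t[0], lambda t: t[0])
```

Only the surviving rows need to be shipped for the actual join. A false positive costs some extra traffic but never a missed match.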

  36. Conclusions for Pier • P2P brings massive parallelism • Repetitive data comparison over DHT brings along a massive waste of bandwidth • Smarter in situ distillation (2-way semijoins, Bloom filters) works better
