1 / 49

DataSpaces : A New Abstraction for Data Management

DataSpaces : A New Abstraction for Data Management. Alon Halevy* DASFAA, 2006 Singapore. *Joint work with Mike Franklin and David Maier. Outline. Dataspaces: collections of heterogeneous (un)-structured data. examples and characteristics Dataspace Support Platforms:

mlittle
Download Presentation

DataSpaces : A New Abstraction for Data Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DataSpaces:A New Abstraction for Data Management Alon Halevy* DASFAA, 2006 Singapore *Joint work with Mike Franklin and David Maier

  2. Outline • Dataspaces: • collections of heterogeneous (un)-structured data. • examples and characteristics • Dataspace Support Platforms: • “Pay-as-you-go” data management • Getting there: • Recent progress and research challenges

  3. Outline • Dataspaces: • collections of heterogeneous (un)-structured data. • examples and characteristics • Dataspace Support Platforms: • “Pay-as-you-go” data management • Getting there: • Recent progress and research challenges

  4. Shrapnels in Baghdad Story courtesy of Phil Bernstein

  5. AttachedTo Recipient ConfHomePage CourseGradeIn ExperimentOf PublishedIn Sender Cites EarlyVersion ArticleAbout PresentationFor FrequentEmailer CoAuthor AddressOf OriginatedFrom BudgetOf HomePage Personal Information Management [Semex: Dong et al.]

  6. Google Base

  7. What do they have in common? • All dataspaces contain >20% porn. • The rest is spam.

  8. What do they have in common? • Must manage allthe data in the space • Need best-effort services with no setup time. • Data is heterogeneous, • possibly unstructured • Do not have control over the data, just access.

  9. Isn’t this Data Integration? Sequenceable Entity Structured Vocabulary Experiment Phenotype Gene Nucleotide Sequence Microarray Experiment Protein OMIM HUGO Swiss- Prot GO Gene- Clinics Locus- Link Entrez GEO

  10. No, it’s Data Co-existence • Data integration systems require semantic mappings.

  11. Semantic Mappings Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords BookCategories ISBN Category CDCategories ASIN Category CDs Album ASIN Price DiscountPrice Studio Artists ASIN ArtistName GroupName Inventory Database A Inventory Database B

  12. No, It’s Data Co-existence • Data integration systems require semantic mappings. • Dataspaces are “pay-as-you-go”: • Provide some services immediately • Create more tight integrations as needed.

  13. The Cost of Semantics Schema first vs. schema last Benefit Dataspaces Data integration solutions Investment (time, cost)

  14. Why Now? • Data management is moving towards dataspace-like applications. • Prediction: • Data management is about people, not enterprises. • In 5 years our community will figure it out. • We’ve made relevant progress: • Combining DB & IR • Creation and management of semantic mappings. • Uncertainty, lineage, inconsistency

  15. Dataspaces Fundamentals:Participants and Relationships sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB

  16. sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSSP Components • Seamless flow from search to query • Query about participants, locate data • Lineage, uncertainty, completeness • Set up workflows • Find participants • Discover and refine relationships • Maintain catalog • Heterogeneous index • Reference reconciliation • Additional associations • Cache for performance & availability • participants’ capabilities • Relationships • Quality of both Catalog Local Store & Index Relax Search & query Administration Discovery & Enhancement

  17. sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSSP Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

  18. Querying a Dataspace • Best effort, based on: • Approximate semantic mappings • Other mechanisms • Example -- searching for Beng Chin’s phone number: • Keyword search for “beng chin ooi” • Examine attributes of tuples/XML elements • Match attributes to ‘address’

  19. Querying a Dataspace • Best effort • Combine structured and unstructured data

  20. Two Kinds of Data SQL Keyword Queries (XQuery) Structured Data (or XML) Unstructured Data [Dong, Liu, Halevy]

  21. Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources

  22. Volvo Palo alto Volvo Palo alto

  23. Acura integra Palo alto

  24. Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources • Iterative -- sequences of queries: • Used cars palo alto • Saab for sale • Classified ads • Classified ads for Saabs near Palo Alto

  25. Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources • Iterative: sequences of queries • Reflective: need to explain and expose confidence in answers.

  26. Dataspace Reflection • Sources of uncertainty: • Unreliable sources • Data obtained by imprecise extractions • Inconsistent data • Approximate query answering (mappings) • A DSSP must: • Expose the lineage of an answer • Web search engines already do this • Reason about relationship between sources

  27. LUI Introspection • Currently, three distinct formalisms: • Uncertainty • Lineage • Inconsistency • Single formalism should do it all: • Uncertainty and lineage (Trio @ Stanford) • Lineage & inconsistency (Orchestra @ U.Penn)

  28. ULDB’s[Benjelloun, Das Sarma, Halevy, Widom] • Combine uncertainty and lineage • Based on x-tuples: {(t1 | t2 | t3)} • Queries can be answered with no additional complexity • You can do even better than uncertain DBs. • Because of lineage, you can sometimes obtain completeness.

  29. sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSS Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

  30. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List

  31. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: Dong

  32. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: FirstName “Dong”

  33. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: FirstName “Dong”

  34. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: name “Dong”

  35. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: name “Dong”

  36. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: Paper author “Dong”

  37. Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: Paper author “Dong”

  38. sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSS Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

  39. AttachedTo Recipient ConfHomePage CourseGradeIn ExperimentOf PublishedIn Sender Cites ComeFrom EarlyVersion ArticleAbout PresentationFor FrequentEmailer CoAuthor AddressOf OriginitatedFrom BudgetOf HomePage Enhancing a Dataspace • Creating associations • [Dong & Halevy, CIDR 2005] • Reference reconciliation • [Dong et al., SIGMOD 2005] • Very active field.

  40. sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSS Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

  41. Reusing Human Attention • Human attention is most expensive. • Reuse whenever possible. E.g.,: • Manual schema mapping • Annotations • Queries written on data • Temporary collections of items • Operations on the data (cut & paste) • Solicit semantic information selectively • The ESP game: [von Ahn et al.]

  42. Learning from Past Matches [Doan et. al, Transformic] • Every manual map is a learning example. • Learn models for elements in mediated schema. • Use multi-strategy learning. • Thousands of maps in very little time. Reuse for a very related task.

  43. Corpus-based Matching Product Music (no tuples) [Madhavan et al.]

  44. CD albumName prodID Corpus MusicCD CD Obtaining More Evidence Product

  45. MusicCD CD albumName prodID ASIN Title album artists artistName recordLabel recordCompany discount price 4Y6DF23 The Doors Columbia $12.99 The Best of the Doors Comparing with More Evidence Product Music

  46. Challenges • Learn from other kinds of user activities • Create other kinds of relationships between participants • Identify higher-level goals from user actions • Develop a formal framework for reusing human attention.

  47. Conclusion “Dataspaces: because that’s the size of the problem” • Pay-as-you-go data integration • Data management for the masses • Key: reuse of human attention • Need to be very data driven in our research

  48. Some References • www.cs.washington.edu/homes/alon • SIGMOD Record, December 2005: • Original dataspace vision paper • PODS 2006: • Specific technical challenges for dataspace research • Semex: CIDR 2005, SIGMOD 2005 • Teaching integration to undergraduates: • SIGMOD Record, September, 2003.

  49. 1. Build initial models T T S S Name: Instances: Type: … Name: Instances: Type: … Name: Instances: Type: … Name: Instances: Type: … M’s Ms M’t Mt t t s s 2. Find similar elements in corpus Corpus of schemas and mappings 3. Build augmented models 5. Use additional statistics (IC’s) to refine match 4. Match using augmented models

More Related