Intensional Associations in Dataspaces

Jens Dittrich (Saarland University), Lukas Blunschi (ETH Zurich), Marcos Vaz Salles (Cornell University). ICDE 2010.

Presentation Transcript


  1. Intensional Associations in Dataspaces. Jens Dittrich (Saarland University), Lukas Blunschi (ETH Zurich), Marcos Vaz Salles (Cornell University). ICDE 2010.

  2. Potentially relevant results. What is missing? Irrelevant results that sound like Kevin Spacey.

  3. Potentially relevant results: items connected to Kevin Spacey by relationships
     • Other members of the Spacey family in the trade
     • Colleagues who acted together with Kevin Spacey
     • Other folks from NJ in the trade

  4. Potentially relevant results: items connected to Kevin Spacey by relationships
     • Movies in common: Samuel L. Jackson (37), Tom Hanks (34), Robin Williams (34), Dustin Hoffman (34), Morgan Freeman (32)
     • Same last name: John Graham Spacey (great-great-uncle!)
     • Same place of birth: Zach Braff, Adam Horovitz, Andrew Shue, Joel Silver, Craig Kingsbury, Joseph Kraft, Drew Rosenhaus, Lauryn Hill, Stacey Kent

  5. The Problem
     • Keywords are not enough
       • If an item is not tagged, it is not returned
       • No meaningful definition of relatedness
     • Relationships are essential, but hard to get right
       • Searches do not include related items
       • Adding relationships to search queries hurts response time
       • The more flexible the definition of relatedness, the higher the cost

  6. Our Solution
     • Keywords are not enough
       • Declarative mini-language to define intensional associations
     • Relationships are essential, but hard to get right
       • Special class of neighborhood-enriched search queries over virtual associations
       • New index structure for neighborhood searches to process these queries efficiently

  7. Association Trails
     • Definition: A: QL ⇒ QR, with join predicate θ(L, R)
       • QL, QR: search queries that select elements in the dataspace
       • θ(L, R): join predicate that relates elements from the left with elements from the right
     • Meaning: every element from the query on the left has an intensional edge to the θ-matching elements from the query on the right
     • Example: actors in the same movies
       moviesInCommon: //person[type=“actor”] ⇒ //person[type=“actor”], θ1 = (∃ ml ∈ L/movies: ml ∈ R/movies)
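The definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `AssociationTrail` class, the dictionary-based item layout, the toy item names, and the choice to skip self-pairs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Item = Dict[str, Any]  # hypothetical item layout: a plain dict of attributes


@dataclass
class AssociationTrail:
    """A: QL => QR with join predicate theta(L, R) (illustrative only)."""
    name: str
    left_query: Callable[[Item], bool]    # QL: selects elements on the left
    right_query: Callable[[Item], bool]   # QR: selects elements on the right
    theta: Callable[[Item, Item], bool]   # join predicate theta(L, R)

    def edges(self, items: List[Item]) -> List[Tuple[str, str]]:
        """Enumerate the intensional edges implied by this trail."""
        left = [x for x in items if self.left_query(x)]
        right = [x for x in items if self.right_query(x)]
        # Assumption: an item is not considered related to itself.
        return [(l["name"], r["name"])
                for l in left for r in right
                if l is not r and self.theta(l, r)]


def is_actor(x: Item) -> bool:
    return x["type"] == "actor"


# Toy data with placeholder names; not taken from IMDb.
items = [
    {"name": "a1", "type": "actor", "movies": {"m1", "m2"}},
    {"name": "a2", "type": "actor", "movies": {"m2"}},
    {"name": "a3", "type": "actor", "movies": {"m3"}},
    {"name": "d1", "type": "director", "movies": {"m1"}},
]

# theta1: there is a movie in L/movies that is also in R/movies.
movies_in_common = AssociationTrail(
    "moviesInCommon", is_actor, is_actor,
    lambda l, r: bool(l["movies"] & r["movies"]),
)

print(movies_in_common.edges(items))
```

The edges are never stored here; they are computed on demand from the two queries and the predicate, which is what makes the association intensional rather than extensional.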

  8. Neighborhood Search Queries
     • Combine search with the pre-defined joins in association trails to get related items
     • Examples:
       • Search for “kevin spacey” also returns colleagues who acted together, other family members, etc.
       • Search for “actors who won the Oscar” also returns other actors strongly related to this set by virtual associations
     (Figure: search results and their related items)
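A neighborhood search query can be pictured as a keyword search followed by an expansion step over the association trails. The sketch below assumes a trivial substring match as the "search" and represents each trail only by its join predicate; the function names and the toy data are hypothetical.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Item = Dict[str, Any]
Theta = Callable[[Item, Item], bool]


def keyword_search(items: List[Item], keyword: str) -> List[Item]:
    """Stand-in for the original search: case-insensitive substring match."""
    return [x for x in items if keyword.lower() in x["name"].lower()]


def neighborhood_query(
    items: List[Item], keyword: str, trails: Dict[str, Theta]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the search results plus, per trail, the related items."""
    results = keyword_search(items, keyword)
    result_names = {x["name"] for x in results}
    related: Dict[str, Set[str]] = {}
    for trail_name, theta in trails.items():
        # Join the search results with all items on the trail's predicate,
        # leaving out items that are already in the result set.
        related[trail_name] = {
            r["name"]
            for l in results for r in items
            if r["name"] not in result_names and theta(l, r)
        }
    return result_names, related


# Toy data with placeholder names.
items = [
    {"name": "person A", "movies": {"m1", "m2"}, "placeOfBirth": "NJ"},
    {"name": "person B", "movies": {"m2"}, "placeOfBirth": "CA"},
    {"name": "person C", "movies": {"m3"}, "placeOfBirth": "NJ"},
    {"name": "person D", "movies": {"m4"}, "placeOfBirth": "TX"},
]
trails = {
    "moviesInCommon": lambda l, r: bool(l["movies"] & r["movies"]),
    "samePlaceOfBirth": lambda l, r: l["placeOfBirth"] == r["placeOfBirth"],
}

results, related = neighborhood_query(items, "person a", trails)
print(results, related)
```

This version evaluates the joins at query time, which is exactly the cost the indexing slides that follow aim to avoid.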

  9. Query Processing over Association Trails
     • Intuition: index at association trail definition time to avoid costly joins at runtime
     • Naive approach:
       • Materialize all association trails into a join index
       • Probe the join index to get related items
       • Given m association trails and n items, the index size is O(mn²) in the worst case
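The naive approach can be sketched as follows. The index layout (one adjacency map per trail) and the function names are assumptions for illustration; the point is only that every matching pair is stored explicitly, so a single trail over n items that all match each other already yields n(n-1) stored edges.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Set

Item = Dict[str, Any]
Theta = Callable[[Item, Item], bool]


def build_naive_join_index(
    items: List[Item], trails: Dict[str, Theta]
) -> Dict[str, Dict[int, Set[int]]]:
    """Materialize every trail as an explicit adjacency map (one per trail)."""
    index: Dict[str, Dict[int, Set[int]]] = {}
    for name, theta in trails.items():
        adjacency: Dict[int, Set[int]] = defaultdict(set)
        for l in items:
            for r in items:
                if l is not r and theta(l, r):
                    adjacency[l["id"]].add(r["id"])
        index[name] = adjacency
    return index


def probe(index: Dict[str, Dict[int, Set[int]]],
          result_ids: Iterable[int]) -> Set[int]:
    """Look up the neighbors of every search result in every trail."""
    result_set = set(result_ids)
    related: Set[int] = set()
    for adjacency in index.values():
        for rid in result_set:
            related |= adjacency.get(rid, set())
    return related - result_set


def index_size(index: Dict[str, Dict[int, Set[int]]]) -> int:
    """Number of stored edges across all trails."""
    return sum(len(nbrs) for adj in index.values() for nbrs in adj.values())


# Toy data: four items share a place of birth, one does not.
items = [{"id": i, "placeOfBirth": "NJ"} for i in range(1, 5)]
items.append({"id": 5, "placeOfBirth": "CA"})
trails = {"samePlaceOfBirth":
          lambda l, r: l["placeOfBirth"] == r["placeOfBirth"]}

naive = build_naive_join_index(items, trails)
print(index_size(naive), probe(naive, [1]))
```

With four items in the same group the index holds 4 × 3 = 12 edges for this one trail, which is the quadratic growth behind the O(mn²) bound.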

  10. Grouping-Compressed Index (GCI)
     • Still materializes the join, but in compressed form
     • Takes advantage of redundancy in the join output
     • O(mn) worst-case size on equi-joins
     • Intuition: for each clique, only represent the pivot, the edges from the pivot, and the elements in the clique
     • Example: samePlaceOfBirth, θ(L,R) = (L.placeOfBirth = R.placeOfBirth)
     (Figure: elements labeled with places of birth CA and NJ; elements with the same label form a clique)
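For an equi-join, the cliques are simply the groups of items that share a join key, so the compression idea can be approximated by storing each group once. The sketch below is a simplified stand-in for the GCI, not the structure from the paper: the `build_grouped_index` function and its two dictionaries are assumptions, with the group key playing the role of the pivot.

```python
from typing import Any, Dict, List, Tuple

Item = Dict[str, Any]


def build_grouped_index(
    items: List[Item], key: str
) -> Tuple[Dict[Any, List[int]], Dict[int, Any]]:
    """Group items by join key instead of storing every matching pair.

    groups:   join key (stands in for the pivot) -> ids of clique members
    group_of: item id -> join key (the edge from an item to its pivot)
    """
    groups: Dict[Any, List[int]] = {}
    group_of: Dict[int, Any] = {}
    for x in items:
        k = x.get(key)
        if k is None:
            continue  # items without the attribute join with nothing
        groups.setdefault(k, []).append(x["id"])
        group_of[x["id"]] = k
    return groups, group_of


# Toy data: four items born in NJ, one in CA.
items = [{"id": i, "placeOfBirth": "NJ"} for i in range(1, 5)]
items.append({"id": 5, "placeOfBirth": "CA"})

groups, group_of = build_grouped_index(items, "placeOfBirth")

# Each item is stored once, so the size is linear in n for this trail.
compressed_entries = sum(len(members) for members in groups.values())
# The explicit join would store one edge per ordered pair in each clique.
explicit_edges = sum(len(g) * (len(g) - 1) for g in groups.values())
print(compressed_entries, explicit_edges)
```

On this toy input the grouped form stores 5 entries where the explicit join stores 12 edges; across m trails the grouped form stays at O(mn).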

  11. Grouping-Compressed Index (GCI)
     • Technical challenge: answer neighborhood queries without decompressing the index
     • Intuition: probe each pivot only once
     • Details in the full version of the paper
     • Example: search for “actors who won the Oscar” over samePlaceOfBirth, θ(L,R) = (L.placeOfBirth = R.placeOfBirth)
     (Figure: search results mapped onto the CA and NJ cliques)
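The "probe pivot only once" intuition can be illustrated on a grouped layout. This is a sketch under assumptions, not the algorithm from the paper: `groups` maps a join key (standing in for the pivot) to its clique members, `group_of` maps each item to its key, and both are written out by hand here.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

# Hypothetical compressed index for samePlaceOfBirth.
groups: Dict[Any, List[int]] = {"NJ": [1, 2, 3], "CA": [4, 5]}
group_of: Dict[int, Any] = {1: "NJ", 2: "NJ", 3: "NJ", 4: "CA", 5: "CA"}


def related_via_groups(
    groups: Dict[Any, List[int]],
    group_of: Dict[int, Any],
    result_ids: Iterable[int],
) -> Tuple[Set[int], int]:
    """Expand search results to their neighbors, visiting each group once.

    Returns the related ids and the number of group probes performed.
    """
    result_set = set(result_ids)
    probed: Set[Any] = set()
    related: Set[int] = set()
    for rid in result_set:
        k = group_of.get(rid)
        if k is None or k in probed:
            continue  # no group, or this pivot was already probed
        probed.add(k)
        related.update(groups[k])
    # Items already in the search results are not reported as related.
    return related - result_set, len(probed)


related, probes = related_via_groups(groups, group_of, [1, 2, 4])
print(related, probes)
```

Results 1 and 2 fall into the same clique, so the NJ group is read once rather than twice; the pairwise edges are never expanded.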

  12. Experiments with IMDb Dataset
     • Dataset:
       • IMDb biographies and filmographies
       • ~2M people, ~1.5M movies
     • Queries:
       • Original search returns a subset of people
       • Neighborhood processing includes all people related to the original set through association trails
     • Association trails: moviesInCommon, samePlaceOfBirth, sameHeight, sameLastName, sameBirthdate

  13. Experiments with IMDb Dataset
     • Indexing: gains of over an order of magnitude
     • Querying:
       • Naive method very sensitive to selectivity
       • Querying the compressed index is comparable to querying the uncompressed one at high selectivity

  14. Related Work
     • Neighborhood queries in dataspaces / IR: Dong & Halevy [SIGMOD 2007], Carmel et al. [SIGIR 2003]
     • Intensional associations: Srivastava & Velegrakis [SIGMOD 2007]
     • Graph indexing: Trissl & Leser [SIGMOD 2007], Neumann & Weikum [VLDB 2008], Weiss et al. [VLDB 2008], XML
     • Recursive queries: Declarative Networking & Datalog [SIGMOD 2006]

  15. Conclusion
     • Association Trails
       • Declarative mini-language to specify intensional associations in dataspaces
     • Neighborhood Search Queries
       • Return associated items along with search results
       • Search combined with joins
     • Grouping-Compressed Index (GCI)
       • Efficient scheme to index intensional associations and process neighborhood search queries
     Thank you!

  16. Backup Slides

  17. Association Trail Examples
     • Actors in the same movies
       moviesInCommon: //person[type=“actor”] ⇒ //person[type=“actor”], θ1 = (∃ ml ∈ L/movies: ml ∈ R/movies)
     • Actors born in the same place
       samePOB: //person[type=“actor”] ⇒ //person[type=“actor”], θ2 = (L.placeOfBirth = R.placeOfBirth)
