
YARS2: A Federated Repository for Querying Graph Structured Data from the Web




  1. YARS2: A Federated Repository for Querying Graph Structured Data from the Web Andreas Harth, Juergen Umbrich, Aidan Hogan, Stefan Decker ISWC 2007, Busan, Korea Wednesday, November 14, 2007

  2. Outline • Motivation • System Architecture • Indexing • Distribution • Query Processing • Conclusion

  3. Problem Statement • Current search technology allows users to locate information resources via keyword searches • Users with complex information needs require the ability to browse information spaces • Browsing unknown information spaces allows for • learning about a subject area • discovering previously unknown associations • viewing data integrated from a number of sources

  4. Challenges • Browsing information spaces requires query processing • System has to scale, scale, scale • Build a system able to answer queries over web-scale data sets • Combine data from a large number of sources • Web data is scruffy • unknown schemas • varying quality • Ad-hoc query answering over combined data from millions of Web sources • Data mining operations over portions of the data • In database speak: build a data warehouse over Web data • Indexing and query processing are at the core of search engines - hint: how does Google index, and how do they do query processing? you don’t know? they haven’t published?

  5. Goal: SPARQL query processing E.g. all people working at DERI
     CONSTRUCT { ?s ?p ?o . }
     WHERE {
       ?s rdf:type foaf:Person .
       ?s foaf:workplaceHomepage <http://www.deri.org/> .
       ?s ?p ?o .
     }

  6. YARS2 Data Flow

  7. Index Manager

  8. Index File Organisation

  9. Index Organisation • Split between memory and disk allows lookups in O(1) disk seeks • binary search on in-memory data structure • Read-optimised (very fast); updates applied in batches • Sorting is the most expensive operation, O(n log n), and can be done offline at intervals
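The split described on this slide is a sparse index: only the first key of each sorted on-disk block is held in memory, binary search picks the block, and a single seek fetches it. A minimal sketch, with in-memory lists standing in for disk blocks (class and method names are illustrative, not YARS2's actual code):

```python
from bisect import bisect_right

class SparseIndex:
    """Sparse index: one in-memory key per sorted on-disk block.

    Lists stand in for disk blocks here; a real implementation would
    read each block from disk with a single seek."""

    def __init__(self, sorted_records, block_size=4):
        self.blocks = [sorted_records[i:i + block_size]
                       for i in range(0, len(sorted_records), block_size)]
        # Only the first key of each block is kept in memory.
        self.first_keys = [b[0] for b in self.blocks]

    def lookup(self, key):
        # Binary search over the in-memory keys (CPU only), then
        # O(1) "disk seeks": a single block read for the candidate.
        i = bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        block = self.blocks[i]  # the one disk read
        return key if key in block else None

idx = SparseIndex(list(range(0, 100, 2)))  # even keys 0..98
print(idx.lookup(42))  # present
print(idx.lookup(43))  # absent
```

Smaller blocks mean more in-memory keys but less data scanned per lookup, which is exactly the space/time trade-off discussed on slide 12.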

  10. Complete Index on Quads • Given prefix lookup capabilities, only 6 indexes are needed to cover all access patterns
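With prefix lookups, an index sorted in order `abcd` answers any access pattern whose bound positions form a prefix of that order. One six-index covering set for quads is shown below (an illustrative set; the paper's exact choice of orders may differ), together with a mechanical check that all 16 patterns over {s, p, o, c} are covered:

```python
from itertools import combinations

# Six quad index orders; with prefix lookups this set covers all
# 16 combinations of bound positions (illustrative covering set).
INDEXES = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]

def covering_index(bound):
    """Return an index whose sort order begins with the bound positions."""
    for order in INDEXES:
        if set(order[:len(bound)]) == set(bound):
            return order
    return None

# Every subset of {s, p, o, c} must have a covering index.
patterns = [set(c) for k in range(5) for c in combinations("spoc", k)]
assert all(covering_index(p) is not None for p in patterns)

# A pattern with s and o bound (e.g. "who links alice to bob?")
print(covering_index({"s", "o"}))  # -> 'ospc'
```

The check confirms that six orders suffice even though there are 16 access patterns, because each order serves every prefix of itself.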

  11. Index Lookups

  12. Discussion • CPU time is high at smaller block sizes; disk I/O becomes the bottleneck at larger block sizes • Possible to trade memory space for time • smaller block size -> faster lookups, but requires more memory • larger block size -> slower lookups, but uses less memory

  13. Data distribution

  14. Data Distribution • Distributed hash tables offer very good scaling properties • S, O, C values are typically well distributed (hash buckets end up about the same size) • P values are not well distributed (rdf:type is a notorious example) • Keywords are also not well distributed • Two distribution strategies: hash-based partitioning, and random partitioning (flooding)
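Hash-based placement can be sketched as follows: quads are placed by hashing a well-distributed component (the subject here), while a skewed component such as the predicate would instead be flooded to all machines. Names and the machine count are illustrative:

```python
import hashlib

NUM_MACHINES = 4

def machine_for(value):
    """Map a term to a machine via a stable hash (not Python's
    built-in hash(), which is randomized per process)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MACHINES

def place_by_subject(quad):
    # S, O, C hash evenly across machines; P (e.g. rdf:type) is
    # skewed, so a predicate-keyed index would be flooded instead.
    s, p, o, c = quad
    return machine_for(s)

q = ("http://example.org/alice", "rdf:type",
     "foaf:Person", "http://example.org/g1")
print(place_by_subject(q))  # same subject always lands on the same machine
```

Stable placement is what later enables lookups (and some joins) to be routed to exactly one machine instead of broadcast to all.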

  15. Pushing joins • A nice property of the index distribution is that it’s possible to compute some joins locally • e.g. ocsp ⋈ spoc (where o == s), because both o and s hash to the same machine • e.g. keyword ⋈ spoc (because the keyword index is not distributed randomly, but co-located with the spoc index machines)
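When the two indexes hash on the join key, each machine holds both sides of every matching pair and the join needs no data shipping. A minimal sketch of the o == s case as a local merge join, assuming both index scans arrive sorted on the join key (illustrative code, not YARS2's):

```python
def merge_join(left, right):
    """Merge join of two tuple lists sorted on their first field.

    left:  (o, c, s, p) entries from the ocsp index, sorted on o
    right: (s, p, o, c) entries from the spoc index, sorted on s
    Joins where left.o == right.s; both sides are on the same machine
    because o and s hash to the same partition."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Emit the cross product of the two matching runs.
            i2 = i
            while i2 < len(left) and left[i2][0] == key:
                j2 = j
                while j2 < len(right) and right[j2][0] == key:
                    out.append(left[i2] + right[j2])
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

left = [("a", 1, 2, 3), ("b", 4, 5, 6)]
right = [("b", 7, 8, 9), ("c", 0, 0, 0)]
print(merge_join(left, right))  # single match on key 'b'
```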

  16. Network lookup overhead • Overhead to initialise/shut down a connection • The typical model for query processing is tuple-at-a-time (using the iterator pattern) • Not suitable for network communication • Thus, query/result blocking that ships queries and results in batches
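Tuple-at-a-time iteration would pay one round trip per result; blocking amortises the connection overhead by shipping results in fixed-size batches. A sketch of the batching step (the block size, like the 2k rows discussed on slide 18, is a tuning knob):

```python
def batched(results, block_size):
    """Group an iterator of result rows into blocks, so each network
    round trip carries block_size rows instead of one."""
    block = []
    for row in results:
        block.append(row)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # ship the final, possibly partial block
        yield block

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```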

  17. Network Throughput

  18. Discussion • 2k-row blocking seems optimal • doesn’t scale linearly • possible cause: the network is the bottleneck (our router currently does only 100 Mbit/s) • or: the single queue requires synchronisation and creates a single hot-spot

  19. Query Processing

  20. Join Processing • Focus on join processing, because it is the most expensive operation • Index nested loops join: the left side of the query plan is evaluated, then queries are constructed for the right side and shipped to the remote machine • multiple lookup threads, coordinated using queues
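The index nested loops strategy can be sketched with worker threads that take left-side bindings from a task queue, issue the corresponding right-side lookup, and push joined results to an output queue. The `probe` callable stands in for a remote index lookup; all names are illustrative:

```python
import queue
import threading

def parallel_inl_join(left_bindings, probe, num_threads=4):
    """Index nested loops join: each left-side binding becomes a
    lookup against the right-side index (probe); worker threads
    coordinate via queues."""
    tasks, results = queue.Queue(), queue.Queue()
    for binding in left_bindings:
        tasks.put(binding)

    def worker():
        while True:
            try:
                binding = tasks.get_nowait()
            except queue.Empty:
                return  # no more left-side bindings
            for match in probe(binding):  # stands in for a remote lookup
                results.put((binding, match))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(results.queue)

# Toy right-side "index": subject -> list of (predicate, object) pairs
index = {"alice": [("knows", "bob")], "bob": [("knows", "carol")]}
out = parallel_inl_join(["alice", "bob", "dave"],
                        lambda s: index.get(s, []))
print(sorted(out))  # bindings for alice and bob; dave has no match
```

Result order is nondeterministic across threads, which is why a real system pairs this with the result blocking from slide 16 rather than streaming tuples one at a time.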

  21. Multithreaded Join Processing

  22. Distributed Join Processing

  23. Conclusion • RDF query processing is possible using adaptations of indexing and query processing techniques known from the 1970s to the 1990s • Scale = use basic operations, and optimise them well • Measure, measure, measure
