
Searching, Navigating, and Querying the Semantic Web with SWSE




Presentation Transcript


  1. Searching, Navigating, and Querying the Semantic Web with SWSE • Andreas, Aidan, Renaud, Juergen, Sean, Stefan • http://challenge.semanticweb.org/

  2. Deadline Friday, 13 July • 8 pages, Springer LNCS style • 1 slide will become 1 paragraph in the final paper • please send slides back by Tuesday • first paper draft by Wednesday • prototype by 13 July

  3. 1. Introduction • Sean/Andreas

  4. What is the Problem? • Current search engines, both for the Web and the intranet, are based on keyword searches over documents • More advanced systems have document clustering capabilities using topic taxonomies, or use shallow metadata (meta tags for topic description) • With the traditional approach to search, you only get documents matching keyword searches, but not answers to precise questions (e.g. the telephone number of person X, or the projects a person is working on) • Traditional search results (information locked in documents) are hard to process further with a program • It is not possible to combine two sources and derive an answer that neither source alone could provide (mashups)

  5. Why is it interesting and important? • Loads of data available (on the Web, on intranets) • Leverage and provide new insights into information assets • Connecting the dots across data sources can reveal previously unseen relations (data mashups) • Ultimately: transform the Web of documents into a Web of data

  6. Why is it hard? • Multitude of formats and data models: HTML (text documents), RSS, (relational) data, RDF • Difficult to integrate and consolidate data about entities – no common identifiers across sources • Scale: the Web is huge • How can users navigate massive data sets with an unknown schema?

  7. Why hasn’t it been solved before? • Sean?

  8. What are the key components of the approach? • Graph-structured data format (RDF) to merge data from multiple sources in multiple formats; entity-centric world view (talking about books, persons, stocks, locations rather than just documents) • Exact matching based on IFPs and fuzzy matching to consolidate entities • Distributed system, KISS architecture, highly optimised primitives, scale by adding hardware • Entity search and navigation interface oblivious to schema
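
To make the entity-centric view concrete, here is a minimal Python/rdflib sketch (not the SWSE/YARS implementation, which is a distributed Java system): data from several sources is merged into one RDF graph and then regrouped by subject, so that each entity carries all statements made about it anywhere. The URIs and sample data are invented for illustration.

```python
from collections import defaultdict
from rdflib import Graph

# Two hypothetical sources describing the same entity.
source_a = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/ah> foaf:name "Andreas Harth" .
"""
source_b = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/ah> foaf:currentProject <http://example.org/projects/swse> .
"""

merged = Graph()
for data in (source_a, source_b):
    merged.parse(data=data, format="turtle")

# Entity-centric view: group statements by subject, regardless of origin.
entities = defaultdict(list)
for s, p, o in merged:
    entities[s].append((p, o))

for subject, description in entities.items():
    print(subject)
    for predicate, value in description:
        print("  ", predicate, "->", value)
```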

  9. 2. Example Session

  10. User Interaction Model • Data model: entities with attributes, and relations between entities • UI primitives: • match keywords in attributes, display entities • filter by entity type • follow relations (incoming and outgoing links) of an entity – focus change
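
A minimal sketch of the three UI primitives over a small rdflib graph; the sample data, property names, and function signatures are assumptions for illustration, not the actual SWSE interface code.

```python
from rdflib import Graph, RDF, URIRef

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/p1> a foaf:Person ; foaf:name "Rudi Studer" .
<http://example.org/d1> foaf:maker <http://example.org/p1> .
""", format="turtle")

def keyword_match(graph, keyword):
    """Primitive 1: return entities with an attribute value containing the keyword."""
    return {s for s, p, o in graph if keyword.lower() in str(o).lower()}

def filter_by_type(graph, entities, cls):
    """Primitive 2: keep only entities of the given rdf:type."""
    return {e for e in entities if (e, RDF.type, cls) in graph}

def follow_links(graph, entity):
    """Primitive 3 (focus change): outgoing and incoming relations of an entity."""
    outgoing = list(graph.predicate_objects(entity))
    incoming = list(graph.subject_predicates(entity))
    return outgoing, incoming

hits = keyword_match(g, "studer")
people = filter_by_type(g, hits, URIRef("http://xmlns.com/foaf/0.1/Person"))
print(follow_links(g, next(iter(people))))
```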

  11. Example queries • Get the phone number of Rudi Studer (answers instead of links) • Explore and navigate Rudi Studer and surrounding entities (combination of different sources) • maybe show the ontology graph here for the Rudi Studer result set (authored 110 papers, 38 people know him, he knows 27 people, he's maker of a file, editor of something, workinfo homepage is this…) - Andreas
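
The first example query, phrased as SPARQL over a toy graph; the FOAF property names and the sample data are assumptions, but they show how the system can return an answer (a phone number) rather than a list of matching documents.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/p1> foaf:name "Rudi Studer" ; foaf:phone <tel:+0-000-0000000> .
""", format="turtle")

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?phone WHERE {
  ?person foaf:name ?name ;
          foaf:phone ?phone .
  FILTER regex(?name, "rudi studer", "i")
}
"""
for row in g.query(query):
    print(row.phone)   # prints the phone number, not a document link
```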

  12. 3. Architecture

  13. Two components • Data preparation and integration • Semantic search and query engine

  14. Architecture Overview

  15. 3.1. Data Preparation and Integration

  16. Crawling and data gathering • Juergen/Andreas • What? • Why? • How? • crawl during June/July 2007 • started with RDF, following rdfs:seeAlso links • added RSS feeds scraped from search engines (queried with common English words) • added DBLP URLs for HTML pages and crawled them to depth 2/3?
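
A simplified sketch of the rdfs:seeAlso-following part of the crawl, assuming rdflib for fetching and parsing; the real system uses the MultiCrawler framework described on the next slides, and the seed URL, depth limit and document cap here are placeholders.

```python
from collections import deque
from rdflib import Graph
from rdflib.namespace import RDFS

def crawl(seed_urls, max_depth=2, max_docs=100):
    store, seen = Graph(), set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue and len(seen) <= max_docs:
        url, depth = queue.popleft()
        try:
            doc = Graph()
            doc.parse(url)                    # fetch and parse one RDF document
        except Exception:
            continue                          # skip broken or non-RDF documents
        store += doc                          # merge into the global store
        if depth < max_depth:
            # follow rdfs:seeAlso links to further RDF documents
            for link in doc.objects(None, RDFS.seeAlso):
                if str(link) not in seen:
                    seen.add(str(link))
                    queue.append((str(link), depth + 1))
    return store

# e.g. crawl(["http://example.org/foaf.rdf"])   # hypothetical seed
```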

  17. Data Conversion • We convert structured data exported from the "Deep Web" under liberal licenses • Such data includes DBLP, CiteSeer, IMDb, Wikipedia, etc. • These datasets contain a wealth of data; combining information (e.g. DBLP and CiteSeer) makes the sum greater than the parts • Create a target ontology/schema in RDFS/OWL; write wrappers (e.g. XSLTs) to convert the data to RDF according to that schema
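
The wrappers in SWSE are XSLTs against a target RDFS/OWL schema; the following Python/rdflib sketch shows the same conversion idea on a hypothetical record, with an invented example namespace standing in for the target schema.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import DC, FOAF

EX = Namespace("http://example.org/schema#")   # stand-in target schema

record = {  # hypothetical publication record from a "Deep Web" export
    "key": "example/record/001",
    "title": "An Example Paper",
    "authors": ["Jane Doe"],
}

g = Graph()
pub = URIRef("http://example.org/publication/" + record["key"])
g.add((pub, RDF.type, EX.Publication))
g.add((pub, DC.title, Literal(record["title"])))
for name in record["authors"]:
    person = URIRef("http://example.org/person/" + name.replace(" ", "_"))
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))
    g.add((pub, DC.creator, person))

print(g.serialize(format="turtle"))
```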

  18. Crawling and data gathering (MultiCrawler) • We use the MultiCrawler architecture for gathering data: a crawler framework developed especially for obtaining structured data from various kinds of sources (RDF, XML, RSS, HTML, ...) • Supports crawling RDF documents by following rdfs:seeAlso links • Can obtain more structured data by crawling different data formats, converting them automatically into RDF using XSLTs, and extracting new URLs • Over two months (June/July 2007), XYZ documents (resulting in ABC quads/triples) were crawled from different RDF repositories, RSS feeds and selected HTML documents

  19. Entity consolidation (IFPs) • Need to integrate information about the same resources across sources • Can do so automatically if URIs are used correctly; often they are not • Can look at other unique keys (inverse-functional properties, IFPs) to match instances of the same resource; keys such as ISBN codes, IM chat usernames, etc. • IFPs are identified as such in ontologies • If two instances of a book have the same ISBN code, they are the same book • This way we attempt to have one instance (result) per entity; the total knowledge contributed about an entity from all sources is summarised in one result
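
A toy sketch of IFP-based consolidation, assuming foaf:mbox and foaf:homepage as the inverse-functional properties and picking an arbitrary canonical identifier per group; the data is invented and the actual object-consolidation code in SWSE is not shown here.

```python
from collections import defaultdict
from rdflib import Graph
from rdflib.namespace import FOAF

# Properties declared inverse-functional in the ontology (assumed set).
IFPS = {FOAF.mbox, FOAF.homepage}

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://a.example/me>  foaf:name "A. Harth" ; foaf:mbox <mailto:andreas@example.org> .
<http://b.example/#ah> foaf:name "Andreas Harth" ; foaf:mbox <mailto:andreas@example.org> .
""", format="turtle")

# Group subjects by (IFP, value); any group with >1 member denotes one entity.
key_to_subjects = defaultdict(set)
for s, p, o in g:
    if p in IFPS:
        key_to_subjects[(p, o)].add(s)

canonical = {}
for subjects in key_to_subjects.values():
    rep = sorted(subjects)[0]          # pick one identifier as canonical
    for s in subjects:
        canonical[s] = rep

print(canonical)
```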

  20. Entity Matching and Linking • How do we interlink various web documents with existing RDF entities? • RDF entities: Geonames, foaf:Agent, doap:Project • Web documents: HTML, RSS, RDF, ... • We want to annotate web documents (unstructured information) with existing RDF resources (semi-structured information) via rdfs:seeAlso links, on a large scale: • to get a better description of the entities identified in a document: a web document talks about a company named "DERI"; which company is it, and how can I find its description and contact information? • to find web documents based on entity conjunctions: I want to read documents that refer to events about the "SWSE" project of "Andreas Harth" from "DERI" in "Ireland". • How? Web documents are indexed with a normal IR engine; then named entities are matched against the inverted index (very efficient, hundreds of matches per second). The issue is how to avoid noisy matches: we need a disambiguation process. • Disambiguation: weighting scheme from the IR engine; basic reasoning using contextual information about the entity; statistics from the co-occurrence of rdfs:seeAlso links. • Evaluation? Random selection of RDF entities; manual verification of the links between the entity and the web documents.
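
A toy version of the document-to-entity matching step: a hand-rolled inverted index stands in for the IR engine (Lucene in the real system), entity labels are matched against it, and each hit becomes a candidate rdfs:seeAlso link that would still have to pass disambiguation. Documents, entities and labels are invented.

```python
from collections import defaultdict

docs = {   # hypothetical crawled documents
    "http://example.org/news/1": "DERI announces the SWSE search engine",
    "http://example.org/news/2": "Unrelated article about gardening",
}
entities = {  # hypothetical RDF entities with their labels
    "http://example.org/org/DERI": "DERI",
    "http://example.org/proj/SWSE": "SWSE",
}

# Build an inverted index: token -> documents containing it.
index = defaultdict(set)
for url, text in docs.items():
    for token in text.lower().split():
        index[token].add(url)

# Match entity labels against the index; each hit is a candidate
# <doc> rdfs:seeAlso <entity> link, to be filtered by disambiguation.
links = []
for entity_uri, label in entities.items():
    for doc_url in index.get(label.lower(), set()):
        links.append((doc_url, "rdfs:seeAlso", entity_uri))

print(links)
```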

  21. 3.2. Semantic Web Search and Query Engine

  22. Index Manager • The Index Manager maintains a local YARS index • The index contains a complete quad index and a keyword index • The keyword index (Lucene) provides identifiers of entities that match a keyword query • The complete quad index contains six quad indices in different orders (SPOC, POCS, etc.), which are required to offer lookups on any quad pattern • Each of these individual indices comprises a blocked, compressed, sorted file of quads with an in-memory sparse index • The sparse index in memory stores the first quad of each block and its position on disk, allowing direct access to the first block pertaining to a given quad pattern
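
A sketch of the sparse-index idea, assuming SPOC order and in-memory lists standing in for the on-disk blocked, compressed files; block size and data are arbitrary, and the real YARS index layout may differ in detail.

```python
import bisect

BLOCK_SIZE = 2
quads = sorted([                        # quads kept sorted in SPOC order
    ("s1", "p1", "o1", "c1"),
    ("s1", "p2", "o2", "c1"),
    ("s2", "p1", "o3", "c2"),
    ("s3", "p1", "o4", "c2"),
    ("s3", "p2", "o5", "c3"),
])

blocks = [quads[i:i + BLOCK_SIZE] for i in range(0, len(quads), BLOCK_SIZE)]
sparse_index = [block[0] for block in blocks]    # first quad of each block

def lookup(prefix):
    """Return quads matching a prefix pattern, e.g. ('s3',) or ('s1', 'p2')."""
    # Binary search for the last block whose first quad is <= the prefix.
    pos = max(bisect.bisect_right(sparse_index, prefix) - 1, 0)
    results = []
    for block in blocks[pos:]:           # scan forward from that block
        for quad in block:
            if quad[:len(prefix)] == prefix:
                results.append(quad)
            elif quad[:len(prefix)] > prefix:
                return results           # sorted order: we can stop early
    return results

print(lookup(("s3",)))                   # -> the two quads with subject s3
```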

  23. Index Manager • Specifically, the Index Manager: • creates the local indices given a file • re-orders and merge-sorts the quad indices • serialises the sparse index for each • creates the keyword index • offers query processing over the local index • The index manager can be accessed via an RMI client to pose lookups and SPARQL queries against the index

  24. Query Processing • provides a SPARQL interface enriched with keyword lookups to aid exploration of unknown data • distributed query processing based on flooding and hash partitioning, depending on the query and the data distribution • iterator model with batch transfer of data to improve performance • the UI uses the query processor; a SPARQL API is also available
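
A sketch of the routing decision behind flooding versus hash partitioning, assuming quads are hash-partitioned by subject over four index machines; the partitioning key and machine count are assumptions, not necessarily what the deployed system uses.

```python
import hashlib

NUM_MACHINES = 4

def machine_for(subject):
    """Hash a subject URI onto one of the index machines."""
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES

def route_lookup(pattern):
    """pattern = (s, p, o, c) with None for unbound positions."""
    s = pattern[0]
    if s is not None:
        return [machine_for(s)]            # bound subject: exactly one machine
    return list(range(NUM_MACHINES))       # unbound subject: flood to all

print(route_lookup(("http://example.org/p1", None, None, None)))
print(route_lookup((None, "http://xmlns.com/foaf/0.1/name", None, None)))
```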

  25. Ranking • The ranking approach used is called ReConRank • Need a way of ordering results according to importance • Also need trust metrics, since data can be provided by anyone, anywhere, about anything • Combine the data-source graph (physical layer) and the data graph (logical layer) into one large graph and apply link analysis to rank both entities (importance) and sources (trust) • Also include TF-IDF scores from the keyword index in a weighting scheme to improve the relevance of top-scored entities • Currently operates at run-time on the results returned from the index
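
The ReConRank formulation itself is not reproduced here; the sketch below only shows generic power-iteration link analysis over a combined graph whose nodes are both entities and data sources. The edges and damping factor are illustrative.

```python
def rank(edges, damping=0.85, iterations=30):
    """Simplified PageRank-style power iteration over a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += score[src] / out_degree[src]
        score = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return score

# Combined graph: entity-to-entity links plus source-to-entity provenance links.
edges = [
    ("source:foaf.rdf", "entity:AndreasHarth"),
    ("source:dblp.rdf", "entity:AndreasHarth"),
    ("entity:AndreasHarth", "entity:SWSE"),
    ("entity:SWSE", "source:dblp.rdf"),
]
for node, s in sorted(rank(edges).items(), key=lambda kv: -kv[1]):
    print(f"{node:25s} {s:.3f}")
```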

  26. Entity Search and Navigation Interface • Sean

  27. Prototype • Andreas/Renaud/Aidan • Online with > 1m sources (need to get a lot of HTML – currently have roughly 100k FOAF and 100k RSS sources) • > 1m entities • target: 500 million triples • index generation, object consolidation: xx hours, xx triples/second • distributed over 4 machines for query processing and 1 machine for the user interface • SPARQL endpoint

  28. Evaluation • response times for the user interface < 1 second • see http://jayant7k.blogspot.com/2006/06/benchmarking-results-of-mysql-lucene.html for an acceptable benchmark method • need to test concurrent access: • select 10,000 keyword searches • do HTTP lookups with 1, 2, 3 concurrent threads • measure the overall time • calculate the average response time (overall time / 10,000) • Evaluation of fuzzy matching (Renaud): how? • Evaluation of ranking (Aidan): how?
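
A possible harness for the concurrency test described above (endpoint URL and keyword set are placeholders): fire the keyword searches over HTTP with 1, 2 and 3 threads, measure the overall time, and divide by the number of queries.

```python
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/search?q="              # hypothetical SWSE endpoint
KEYWORDS = ["rudi studer", "semantic web", "deri"] * 10   # stand-in query set

def fetch(keyword):
    url = ENDPOINT + urllib.parse.quote(keyword)
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()

def benchmark(threads):
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(fetch, KEYWORDS))      # wait for all lookups to finish
    overall = time.time() - start
    return overall, overall / len(KEYWORDS)  # total and average per query

for n in (1, 2, 3):
    total, avg = benchmark(n)
    print(f"{n} thread(s): total {total:.2f}s, avg {avg*1000:.1f} ms/query")
```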

  29. Build that thing - TBD • get stats for current index (Aidan) – Monday • get class diagram for current index (Andreas) – Monday • object-consolidate index (Aidan) – Monday • crawl HTML pages (Andreas) – Monday • entity recognition for RSS and HTML (Renaud) – Tuesday • disambiguation? (Renaud) – Wednesday • rebuild index with additional information (Aidan) – Thursday • performance tests (Aidan/Renaud) – Friday • add ranking to UI (Andreas) – Tuesday • cut-off for Lucene or check how Lucene can return streaming results (Aidan) – Tuesday • hack local join of keyword and SPOC indices (Andreas/Aidan) – Tuesday

  30. Conclusion • Andreas/Sean • We have shown how to apply Semantic Web technology to a large-scale Web data integration scenario using a systems approach • Future work: complement the centralised warehousing approach with an on-demand approach that includes live data sources in query processing – the basics, namely distributed query processing, are already in place • We see commercial potential in a wide range of application areas for our system: general Web search, vertical search engines, and enterprise search

  31. Conclusion II • add the Semantic Web Challenge criteria to the conclusion
