
Searching, Navigating, and Querying the Semantic Web with SWSE




Presentation Transcript


  1. Searching, Navigating, and Querying the Semantic Web with SWSE • Andreas, Aidan, Renaud, Juergen, Sean, Stefan • http://challenge.semanticweb.org/

  2. Deadline Friday, 13 July • 8 pages, Springer LNCS style • 1 slide will become 1 paragraph in the final paper • please send slides back by Tuesday • first paper draft by Wednesday • prototype by 13 July

  3. 1. Introduction • Sean/Andreas

  4. What is the Problem? • Current search engines, both for the Web and the intranet, are based on keyword searches over documents • More advanced systems have document clustering capabilities using topic taxonomies, or use shallow metadata (meta tags for topic description) • With the traditional approach to search, you only get documents matching keyword searches, but not answers to precise questions (e.g. the telephone number of person X, or the projects a person is working on) • Traditional search results (information locked in documents) are hard to process further with a program • It is not possible to combine two sources and derive an answer that neither source alone could provide (mashups)

  5. Why is it interesting and important? • Loads of data available (on the Web, on intranets) • Leverage and provide new insights into information assets • Connecting the dots across data sources can reveal previously unseen relations (data mashups) • Ultimately: transform the Web of documents into a Web of data

  6. Why is it hard? • Multitude of formats and data models: HTML (text documents), RSS, (relational) data, RDF • Difficult to integrate and consolidate data about entities – no common identifiers across sources • Scale: the Web is huge • How can users navigate massive data sets with an unknown schema?

  7. Why hasn’t it been solved before? • Sean?

  8. What are the key components of the approach? • Graph-structured data format (RDF) to merge data from multiple sources in multiple formats; entity-centric world view (talking about books, persons, stocks, locations rather than just documents) • Exact matching based on IFPs and fuzzy matching to consolidate entities • Distributed system, KISS architecture, highly optimised primitives, scale by adding hardware • Entity search and navigation interface oblivious to schema
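
To make the entity-centric view concrete, here is a minimal Python/rdflib sketch (not the SWSE/YARS implementation, which is a distributed Java system): data from several sources is merged into one RDF graph and then regrouped by subject, so that each entity carries all statements made about it anywhere. The URIs and sample data are invented for illustration.

```python
from collections import defaultdict
from rdflib import Graph

# Two hypothetical sources describing the same entity.
source_a = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/ah> foaf:name "Andreas Harth" .
"""
source_b = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/people/ah> foaf:currentProject <http://example.org/projects/swse> .
"""

merged = Graph()
for data in (source_a, source_b):
    merged.parse(data=data, format="turtle")

# Entity-centric view: group statements by subject, regardless of origin.
entities = defaultdict(list)
for s, p, o in merged:
    entities[s].append((p, o))

for subject, description in entities.items():
    print(subject)
    for predicate, value in description:
        print("  ", predicate, "->", value)
```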

  9. 2. Example Session

  10. User Interaction Model • Data model: entities with attributes, and relations between entities • UI primitives: • match keywords in attributes, display entities • filter by entity type • follow relations (incoming and outgoing links) of an entity – focus change
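
A minimal sketch of the three UI primitives over a small rdflib graph; the sample data, property names, and function signatures are assumptions for illustration, not the actual SWSE interface code.

```python
from rdflib import Graph, RDF, URIRef

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/p1> a foaf:Person ; foaf:name "Rudi Studer" .
<http://example.org/d1> foaf:maker <http://example.org/p1> .
""", format="turtle")

def keyword_match(graph, keyword):
    """Primitive 1: return entities with an attribute value containing the keyword."""
    return {s for s, p, o in graph if keyword.lower() in str(o).lower()}

def filter_by_type(graph, entities, cls):
    """Primitive 2: keep only entities of the given rdf:type."""
    return {e for e in entities if (e, RDF.type, cls) in graph}

def follow_links(graph, entity):
    """Primitive 3 (focus change): outgoing and incoming relations of an entity."""
    outgoing = list(graph.predicate_objects(entity))
    incoming = list(graph.subject_predicates(entity))
    return outgoing, incoming

hits = keyword_match(g, "studer")
people = filter_by_type(g, hits, URIRef("http://xmlns.com/foaf/0.1/Person"))
print(follow_links(g, next(iter(people))))
```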

  11. Example queries • Get the phone number of Rudi Studer (answers instead of links) • Explore and navigate Rudi Studer and surrounding entities (combination of different sources) • maybe show the ontology graph here for the Rudi Studer result set (authored 110 papers, 38 people know him, he knows 27 people, he's maker of a file, editor of something, workinfo homepage is this…) - Andreas
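
The first example query, phrased as SPARQL over a toy graph; the FOAF property names and the sample data are assumptions, but they show how the system can return an answer (a phone number) rather than a list of matching documents.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/p1> foaf:name "Rudi Studer" ; foaf:phone <tel:+0-000-0000000> .
""", format="turtle")

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?phone WHERE {
  ?person foaf:name ?name ;
          foaf:phone ?phone .
  FILTER regex(?name, "rudi studer", "i")
}
"""
for row in g.query(query):
    print(row.phone)   # prints the phone number, not a document link
```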

  12. 3. Architecture

  13. Two components • Data preparation and integration • Semantic search and query engine

  14. Architecture Overview

  15. 3.1. Data Preparation and Integration

  16. Crawling and data gathering • Juergen/Andreas • What? • Why? • How? • crawl during June/July 2007 • started with RDF, following rdfs:seeAlso links • added RSS feeds scraped from search engines (queried with common English words) • added DBLP URLs for HTML pages and crawled them to depth 2/3?
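
A simplified sketch of the rdfs:seeAlso-following part of the crawl, assuming rdflib for fetching and parsing; the real system uses the MultiCrawler framework described on the next slides, and the seed URL, depth limit and document cap here are placeholders.

```python
from collections import deque
from rdflib import Graph
from rdflib.namespace import RDFS

def crawl(seed_urls, max_depth=2, max_docs=100):
    store, seen = Graph(), set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue and len(seen) <= max_docs:
        url, depth = queue.popleft()
        try:
            doc = Graph()
            doc.parse(url)                    # fetch and parse one RDF document
        except Exception:
            continue                          # skip broken or non-RDF documents
        store += doc                          # merge into the global store
        if depth < max_depth:
            # follow rdfs:seeAlso links to further RDF documents
            for link in doc.objects(None, RDFS.seeAlso):
                if str(link) not in seen:
                    seen.add(str(link))
                    queue.append((str(link), depth + 1))
    return store

# e.g. crawl(["http://example.org/foaf.rdf"])   # hypothetical seed
```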

  17. Data Conversion • We convert structured data exported from the "Deep Web" under liberal licenses • Such data includes DBLP, CiteSeer, IMDb, Wikipedia, etc. • These datasets contain a wealth of data; combining information (e.g. DBLP and CiteSeer) makes the sum greater than the parts • Create a target ontology/schema in RDFS/OWL; write wrappers (e.g. XSLTs) to convert the data to RDF according to that schema
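
The wrappers in SWSE are XSLTs against a target RDFS/OWL schema; the following Python/rdflib sketch shows the same conversion idea on a hypothetical record, with an invented example namespace standing in for the target schema.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import DC, FOAF

EX = Namespace("http://example.org/schema#")   # stand-in target schema

record = {  # hypothetical publication record from a "Deep Web" export
    "key": "example/record/001",
    "title": "An Example Paper",
    "authors": ["Jane Doe"],
}

g = Graph()
pub = URIRef("http://example.org/publication/" + record["key"])
g.add((pub, RDF.type, EX.Publication))
g.add((pub, DC.title, Literal(record["title"])))
for name in record["authors"]:
    person = URIRef("http://example.org/person/" + name.replace(" ", "_"))
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))
    g.add((pub, DC.creator, person))

print(g.serialize(format="turtle"))
```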

  18. Crawling and data gathering (MultiCrawler) • We use the MultiCrawler architecture for gathering data: a crawler framework developed especially for obtaining structured data from various kinds of sources (RDF, XML, RSS, HTML, ...) • Supports crawling RDF documents by following rdfs:seeAlso links • Can obtain more structured data by crawling different data formats, converting them automatically into RDF using XSLTs, and extracting new URLs • Over two months (June/July 2007), XYZ documents (resulting in ABC quads/triples) were crawled from different RDF repositories, RSS feeds and selected HTML documents

  19. Entity consolidation (IFPs) • Need to integrate information about the same resources across sources • Can do so automatically if URIs are used correctly; often they are not • Can look at other unique keys (inverse-functional properties, IFPs) to match instances of the same resource; keys such as ISBN codes, IM chat usernames, etc. • IFPs are identified as such in ontologies • If two instances of a book have the same ISBN code, they are the same book • This way we attempt to have one instance (result) per entity; the total knowledge contributed about an entity from all sources is summarised in one result
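
A toy sketch of IFP-based consolidation, assuming foaf:mbox and foaf:homepage as the inverse-functional properties and picking an arbitrary canonical identifier per group; the data is invented and the actual object-consolidation code in SWSE is not shown here.

```python
from collections import defaultdict
from rdflib import Graph
from rdflib.namespace import FOAF

# Properties declared inverse-functional in the ontology (assumed set).
IFPS = {FOAF.mbox, FOAF.homepage}

g = Graph()
g.parse(data="""
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://a.example/me>  foaf:name "A. Harth" ; foaf:mbox <mailto:andreas@example.org> .
<http://b.example/#ah> foaf:name "Andreas Harth" ; foaf:mbox <mailto:andreas@example.org> .
""", format="turtle")

# Group subjects by (IFP, value); any group with >1 member denotes one entity.
key_to_subjects = defaultdict(set)
for s, p, o in g:
    if p in IFPS:
        key_to_subjects[(p, o)].add(s)

canonical = {}
for subjects in key_to_subjects.values():
    rep = sorted(subjects)[0]          # pick one identifier as canonical
    for s in subjects:
        canonical[s] = rep

print(canonical)
```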

  20. Entity Matching and Linking • How do we interlink various web documents with existing RDF entities? • RDF entities: Geonames, foaf:Agent, doap:Project • Web documents: HTML, RSS, RDF, ... • We want to annotate web documents (unstructured information) with existing RDF resources (semi-structured information) via rdfs:seeAlso links, on a large scale: • to get a better description of the entities identified in a document: a web document talks about a company named "DERI"; which company is it, and how can I find its description and contact information? • to find web documents based on entity conjunctions: I want to read documents that refer to events about the "SWSE" project of "Andreas Harth" from "DERI" in "Ireland". • How? Web documents are indexed with a normal IR engine; then named entities are matched against the inverted index (very efficient, hundreds of matches per second). The issue is how to avoid noisy matches: we need a disambiguation process. • Disambiguation: weighting scheme from the IR engine; basic reasoning using contextual information about the entity; statistics from the co-occurrence of rdfs:seeAlso links. • Evaluation? Random selection of RDF entities; manual verification of the links between the entity and the web documents.
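
A toy version of the document-to-entity matching step: a hand-rolled inverted index stands in for the IR engine (Lucene in the real system), entity labels are matched against it, and each hit becomes a candidate rdfs:seeAlso link that would still have to pass disambiguation. Documents, entities and labels are invented.

```python
from collections import defaultdict

docs = {   # hypothetical crawled documents
    "http://example.org/news/1": "DERI announces the SWSE search engine",
    "http://example.org/news/2": "Unrelated article about gardening",
}
entities = {  # hypothetical RDF entities with their labels
    "http://example.org/org/DERI": "DERI",
    "http://example.org/proj/SWSE": "SWSE",
}

# Build an inverted index: token -> documents containing it.
index = defaultdict(set)
for url, text in docs.items():
    for token in text.lower().split():
        index[token].add(url)

# Match entity labels against the index; each hit is a candidate
# <doc> rdfs:seeAlso <entity> link, to be filtered by disambiguation.
links = []
for entity_uri, label in entities.items():
    for doc_url in index.get(label.lower(), set()):
        links.append((doc_url, "rdfs:seeAlso", entity_uri))

print(links)
```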

  21. 3.2. Semantic Web Search and Query Engine

  22. Index Manager • The Index Manager maintains a local YARS index • The index contains a complete quad index and a keyword index • The keyword index (Lucene) provides identifiers of entities that match a keyword query • The complete quad index contains six quad indices in different orders (SPOC, POCS, etc.), which are required to offer lookups on any quad pattern • Each of these individual indices comprises a blocked, compressed, sorted file of quads with an in-memory sparse index • The sparse index in memory stores the first quad of each block and its position on disk, allowing direct access to the first block pertaining to a given quad pattern
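
A sketch of the sparse-index idea, assuming SPOC order and in-memory lists standing in for the on-disk blocked, compressed files; block size and data are arbitrary, and the real YARS index layout may differ in detail.

```python
import bisect

BLOCK_SIZE = 2
quads = sorted([                        # quads kept sorted in SPOC order
    ("s1", "p1", "o1", "c1"),
    ("s1", "p2", "o2", "c1"),
    ("s2", "p1", "o3", "c2"),
    ("s3", "p1", "o4", "c2"),
    ("s3", "p2", "o5", "c3"),
])

blocks = [quads[i:i + BLOCK_SIZE] for i in range(0, len(quads), BLOCK_SIZE)]
sparse_index = [block[0] for block in blocks]    # first quad of each block

def lookup(prefix):
    """Return quads matching a prefix pattern, e.g. ('s3',) or ('s1', 'p2')."""
    # Binary search for the last block whose first quad is <= the prefix.
    pos = max(bisect.bisect_right(sparse_index, prefix) - 1, 0)
    results = []
    for block in blocks[pos:]:           # scan forward from that block
        for quad in block:
            if quad[:len(prefix)] == prefix:
                results.append(quad)
            elif quad[:len(prefix)] > prefix:
                return results           # sorted order: we can stop early
    return results

print(lookup(("s3",)))                   # -> the two quads with subject s3
```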

  23. Index Manager • Specifically, the Index Manager: • creates the local indices given a file • re-orders and merge-sorts the quad indices • serialises the sparse index for each • creates the keyword index • offers query processing over the local index • The index manager can be accessed via an RMI client to pose lookups and SPARQL queries against the index

  24. Query Processing • provides a SPARQL interface enriched with keyword lookups to aid exploration of unknown data • distributed query processing based on flooding and hash partitioning, depending on the query and the data distribution • iterator model with batch transfer of data to improve performance • the UI uses the query processor; a SPARQL API is also available
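
A sketch of the routing decision behind flooding versus hash partitioning, assuming quads are hash-partitioned by subject over four index machines; the partitioning key and machine count are assumptions, not necessarily what the deployed system uses.

```python
import hashlib

NUM_MACHINES = 4

def machine_for(subject):
    """Hash a subject URI onto one of the index machines."""
    digest = hashlib.md5(subject.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES

def route_lookup(pattern):
    """pattern = (s, p, o, c) with None for unbound positions."""
    s = pattern[0]
    if s is not None:
        return [machine_for(s)]            # bound subject: exactly one machine
    return list(range(NUM_MACHINES))       # unbound subject: flood to all

print(route_lookup(("http://example.org/p1", None, None, None)))
print(route_lookup((None, "http://xmlns.com/foaf/0.1/name", None, None)))
```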

  25. Ranking • The ranking approach used is called ReConRank • Need a way of ordering results according to importance • Also need trust metrics, since data can be provided by anyone, anywhere, about anything • Combine the data-source graph (physical layer) and the data graph (logical layer) into one large graph and apply link analysis to rank both entities (importance) and sources (trust) • Also include TF-IDF scores from the keyword index in a weighting scheme to improve the relevance of top-scored entities • Currently operates at run-time on the results returned from the index
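
The ReConRank formulation itself is not reproduced here; the sketch below only shows generic power-iteration link analysis over a combined graph whose nodes are both entities and data sources. The edges and damping factor are illustrative.

```python
def rank(edges, damping=0.85, iterations=30):
    """Simplified PageRank-style power iteration over a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += score[src] / out_degree[src]
        score = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                 for n in nodes}
    return score

# Combined graph: entity-to-entity links plus source-to-entity provenance links.
edges = [
    ("source:foaf.rdf", "entity:AndreasHarth"),
    ("source:dblp.rdf", "entity:AndreasHarth"),
    ("entity:AndreasHarth", "entity:SWSE"),
    ("entity:SWSE", "source:dblp.rdf"),
]
for node, s in sorted(rank(edges).items(), key=lambda kv: -kv[1]):
    print(f"{node:25s} {s:.3f}")
```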

  26. Entity Search and Navigation Interface • Sean

  27. Prototype • Andreas/Renaud/Aidan • Online with > 1m sources (need to get a lot of HTML – currently have roughly 100k FOAF and 100k RSS sources) • > 1m entities • target: 500 million triples • index generation, object consolidation: xx hours, xx triples/second • distributed over 4 machines for query processing and 1 machine for the user interface • SPARQL endpoint

  28. Evaluation • response times for the user interface < 1 second • see http://jayant7k.blogspot.com/2006/06/benchmarking-results-of-mysql-lucene.html for an acceptable benchmark method • need to test concurrent access: • select 10,000 keyword searches • do HTTP lookups with 1, 2, 3 concurrent threads • measure the overall time • calculate the average response time (overall time / 10,000) • Evaluation of fuzzy matching (Renaud): how? • Evaluation of ranking (Aidan): how?
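
A possible harness for the concurrency test described above (endpoint URL and keyword set are placeholders): fire the keyword searches over HTTP with 1, 2 and 3 threads, measure the overall time, and divide by the number of queries.

```python
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8080/search?q="              # hypothetical SWSE endpoint
KEYWORDS = ["rudi studer", "semantic web", "deri"] * 10   # stand-in query set

def fetch(keyword):
    url = ENDPOINT + urllib.parse.quote(keyword)
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()

def benchmark(threads):
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(fetch, KEYWORDS))      # wait for all lookups to finish
    overall = time.time() - start
    return overall, overall / len(KEYWORDS)  # total and average per query

for n in (1, 2, 3):
    total, avg = benchmark(n)
    print(f"{n} thread(s): total {total:.2f}s, avg {avg*1000:.1f} ms/query")
```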

  29. Build that thing - TBD • get stats for current index (Aidan) – Monday • get class diagram for current index (Andreas) – Monday • object-consolidate index (Aidan) – Monday • crawl HTML pages (Andreas) – Monday • entity recognition for RSS and HTML (Renaud) – Tuesday • disambiguation? (Renaud) – Wednesday • rebuild index with additional information (Aidan) – Thursday • performance tests (Aidan/Renaud) – Friday • add ranking to UI (Andreas) – Tuesday • cut-off for Lucene or check how Lucene can return streaming results (Aidan) – Tuesday • hack local join of keyword and SPOC indices (Andreas/Aidan) – Tuesday

  30. Conclusion • Andreas/Sean • We have shown how to apply Semantic Web technology to a large-scale Web data integration scenario using a systems approach • Future work: complement the centralised warehousing approach with an on-demand approach that includes live data sources in query processing – the basics, namely distributed query processing, are already in place • We see commercial potential in a wide range of application areas for our system: general Web search, vertical search engines, and enterprise search

  31. Conclusion II • add the Semantic Web Challenge criteria to the conclusion
