Structure

Structure • Query Processing • Data models • Query models • Approaches • Challenges • Keyword query processing on RDF • Structured query processing on RDF • Structured query processing on the Web • Routing information needs to linked data sources • Linked data query processing

Query Processing

Query Processing • Figure: a query is matched against data (Query, Matching, Data)

Data / Data Models • Textual • Bag-of-words • Represent documents, text in structured data, …, real-world objects (captured as structured data) • Miss “structured information” • in text, e.g. linguistic structure, hyperlinks, (positional information) • in structured data • Term (statistics) • Example text: “In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.” • Example bag of words: combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, …
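The bag-of-words model above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the tokenizer are assumptions, not part of the slides.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"in", "with", "for", "the", "of", "have", "are", "to", "and", "a"}

def bag_of_words(text):
    """Return term counts; word order and sentence structure are discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag = bag_of_words(
    "In combination with Cloud Computing technologies, promising solutions "
    "for the management of big data have emerged."
)
```

The resulting counter keeps terms such as `cloud` and `solutions` but no longer knows which word followed which, which is the “structured information” the model misses.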

Data / Data Models • Textual • Structured • Resource Description Framework (RDF) • Represent real-world objects, services, applications, …, documents • Resource attribute values and relationships between resources • Schema

Data / Data Models • Textual • Structured • Hybrid • Textual and structured data

Query / Query Models • Unstructured • Fully-structured • Hybrid: unstructured + structured

Query / Query Models • Unstructured • NL • Keywords, e.g. “book price 30”

Query / Query Models • Unstructured • Fully-structured • SQL: select, from, where • SELECT title, price FROM Books WHERE Price < 30
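To see the SQL example run, here is a small sketch using Python's built-in `sqlite3` module. The table contents are made up for illustration; only the query comes from the slide.

```python
import sqlite3

# In-memory database with a hypothetical Books table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Books VALUES (?, ?)",
    [("Book A", 25.0), ("Book B", 45.0)],
)

# The fully-structured query from the slide: select, from, where.
rows = conn.execute("SELECT title, price FROM Books WHERE price < 30").fetchall()
conn.close()
```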

Query / Query Models • Unstructured • Fully-structured • SQL: select, from, where • SPARQL: BGP, filter, optional, union, select, construct, ask, describe • PREFIX dc: <http://purl.org/dc/elements/1.1/> PREFIX ns: <http://example.org/ns#> SELECT ?title ?price WHERE { { ?x dc:title ?title . OPTIONAL { ?x ns:price ?price . FILTER (?price < 30) } } UNION { ?book dc:title ?title . ?book dc:creator ?author } }

Query / Query Models • Unstructured • Fully-structured • SQL • SPARQL • Conjunctive queries, e.g., graph patterns (BGP)

Query / Query Models • Fully-structured • Unstructured • Hybrid: content and structure constraints

Query Processing • Matching queries against data

Approaches – Taxonomy (1) • Complete • Sound • Approximate • Not complete • Not sound • Ranked • Best effort • Top-k • Query processing focuses on efficiency, whereas ranking deals with result quality! • Figure: Query, Matching, Data

Approaches – Taxonomy (2) • Textual data, unstructured query: keyword query on textual data (standard IR) • Textual data, structured query: structured query on textual data; hybrid query (XML IR) • Structured data, unstructured query: keyword query on structured data • Structured data, structured query: structured query on structured data (standard DB)

Keyword Query / Textual Data • Retrieve documents • Inverted list (inverted index) • keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...} • AND-semantics: top-k join
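A minimal sketch of an inverted index with AND-semantics follows. It stores only document ids; the positions and scores mentioned on the slide, and the top-k part of the join, are left out to keep the example short. Function names and sample documents are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def and_query(index, keywords):
    """AND-semantics: a document must appear in the posting list of every keyword."""
    postings = [set(index.get(k.lower(), [])) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical documents for illustration.
docs = {
    "doc1": "stanford researcher wins turing award",
    "doc2": "stanford article",
    "doc3": "music award",
}
index = build_inverted_index(docs)
hits = and_query(index, ["stanford", "award"])
```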

Structured Query / Structured Data • Retrieve data for triple patterns • Index on tables • Multiple “redundant” indexes to cover different access patterns, e.g. SP-index, PO-index • Join (conjunction of triples) • Blocking, e.g. linear merge join (requires sorted input) • Non-blocking, e.g. symmetric hash join • Materialized join indexes
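The non-blocking join named above can be illustrated with a symmetric hash join written as a Python generator. This is a sketch under simple assumptions (two in-memory streams read in alternation, hypothetical triples), not a description of any particular engine.

```python
from collections import defaultdict
from itertools import zip_longest

_END = object()  # marks an exhausted input stream

def symmetric_hash_join(left, right, key_left, key_right):
    """Yield joined pairs as soon as both matching tuples have arrived."""
    left_table, right_table = defaultdict(list), defaultdict(list)
    for l, r in zip_longest(left, right, fillvalue=_END):
        if l is not _END:
            k = key_left(l)
            left_table[k].append(l)          # remember for later right tuples
            for match in right_table[k]:     # probe what has arrived so far
                yield (l, match)
        if r is not _END:
            k = key_right(r)
            right_table[k].append(r)
            for match in left_table[k]:
                yield (match, r)

# Hypothetical bindings for two triple patterns: (pub, person) and (person, uni).
authors = [("pub1", "per1"), ("pub2", "per2")]
employ = [("per1", "uni1")]
joined = list(symmetric_hash_join(authors, employ, lambda t: t[1], lambda t: t[0]))
```

Unlike a merge join, neither input has to be sorted or fully read before the first result is reported, since each arriving tuple is both stored and used as a probe.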

Keyword Query / Structured Data • Retrieve keyword elements • Using inverted index • keyword → {<el1, score, ...>, <el2, score, ...>, …} • Exploration / “Join” • Data indexes for triple lookup • Materialized index (paths up to graphs) • Top-k Steiner tree search, top-k subgraph exploration

References • Günter Ladwig, Thanh Tran: Combining Query Translation with Query Answering for Efficient Keyword Search. ESWC 2010:288-303 • Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. ICDE 2009:405-416 • Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, Lizhu Zhou: EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. SIGMOD 2008:903-914 • Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer: Ontology-Based Interpretation of Keywords for Semantic Search. ISWC/ASWC 2007:523-536 • Hao He, Haixun Wang, Jun Yang, Philip S. Yu: BLINKS: ranked keyword searches on graphs. SIGMOD 2007:305-316 • Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar: Bidirectional Expansion For Keyword Search on Graph Databases. VLDB 2005:505-516

Structured Query / Textual Data • Based on offline IE (see Peter’s slides) • Based on online IE, i.e., “retrieve” works as follows • Derive keywords to retrieve relevant documents • On-the-fly information extraction, i.e., phrase pattern matching “X title Y” • Retrieve extracted data for structured part • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. • Index • Inverted index for document retrieval and pattern matching • Join index: inverted index for storing materialized joins between keywords • Neighborhood indexes for phrase patterns • Hybrid case

References • Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni: Structured Querying of Web Text Data: A Technical Challenge. CIDR 2007:225-234 • S. Chakrabarti, K. Puniyani, S. Das: Optimizing scoring functions and indexes for proximity search in type-annotated corpora. WWW 2006:717-726 • S. Agrawal, K. Chakrabarti, S. Chaudhuri, V. Ganti: Scalable ad-hoc entity extraction from text collections. PVLDB 1(1), 2008 • M. J. Cafarella: Extracting and querying a comprehensive web database. CIDR 2009 • G. Ramakrishnan, S. Balakrishnan, S. Joshi: Entity annotation using inverse index operations. EMNLP 2006 • M. Cafarella, O. Etzioni: A search engine for natural language applications. WWW 2006

Query Processing – Main Tasks • Retrieval • Documents, data elements, triples, paths, graphs • Inverted index, …, but also other indexes (B+ tree) • Index documents, triples, materialized join paths • Join • Different join implementations, efficiency depends on availability of indexes • Non-blocking join good for early result reporting and for “unpredictable” linked data scenario

Query Processing – More Tasks • Disjunction, aggregation, grouping • Join order optimization • Approximate • Approximate the search space • Retrieve only some results • Approximate the join • Parallelization • Top-k • Use only some entries in the input streams to produce k results • Multiple sources • On-the-fly mapping, similarity join • Federation, routing • Hybrid • Join text and data
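The top-k idea of reading only a prefix of each input stream can be shown with a lazy merge of ranked streams. Note the simplification: this sketch computes the top-k of a union of streams already sorted by descending score; a top-k join would additionally need a bound on the scores of unseen combinations. The streams and the `counting` helper are hypothetical.

```python
import heapq
from itertools import islice

def top_k(ranked_streams, k):
    """Return the k best (score, item) entries from streams sorted by descending score."""
    merged = heapq.merge(*ranked_streams, key=lambda entry: entry[0], reverse=True)
    return list(islice(merged, k))  # stops pulling from the streams after k results

def counting(stream, counter):
    """Wrap a stream and count how many entries were actually read."""
    for entry in stream:
        counter[0] += 1
        yield entry

consumed = [0]
stream_a = counting([(0.9, "doc1"), (0.4, "doc4"), (0.2, "doc6"), (0.1, "doc8")], consumed)
stream_b = counting([(0.8, "doc2"), (0.7, "doc3"), (0.3, "doc5"), (0.05, "doc9")], consumed)
best = top_k([stream_a, stream_b], k=2)
```

After the call, `consumed[0]` is smaller than the eight entries available, because the merge only advances a stream when it needs the next candidate from it.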

Query Processing on the Web: Research Challenges and Opportunities • Large amount of semantic data • Data inconsistent, redundant, and low quality • Large amount of data embedded in text • Large amount of sources • Large amount of links between sources • Optimization, parallelization • Approximation • Hybrid querying and data management • Federation, routing • Online schema mappings • Similarity join

Approaches • Textual data, unstructured query: keyword query on textual data (standard IR) • Textual data, structured query: structured query on textual data (DB-IR) • Structured data, unstructured query: keyword query on structured data (IR-DB); search space approximation • Structured data, structured query: structured query on structured data (standard DB); routing, approximation, adaptive optimization

Keyword Query Processing on Graph-Structured RDF Data

Keyword Search in DBs / Keyword Translation (Kacholia et al., VLDB05) • User information need: “stanford article turing award” • Keywords might produce a large number of matching elements in the data graphs • The data graphs might be large in size • Search complexity increases substantially with the size of the data graphs • Large number of results • Figure labels: Specification, Translation

Query Space (Tran et al., ICDE2009) • Figure captions: schema graph derived from data graph; query space = connecting keyword elements with schema elements • Main idea • Query space: more compact representation of the data graph • Online construction of query space out of schema graph • Match keywords against labels of resources to find keyword elements • Connect keyword elements with elements of schema graph to obtain query space • Online top-k query graph exploration • Exploration on a much reduced summary model called query space • Substantially decreases complexity • Top-k procedure for graph exploration to compute only the top-k most relevant results

Top-k Query Graph Exploration on Query Space • Figure caption: query space, three paths from keyword matching elements, and costs of elements • Cost-directed exploration of minimal Steiner graphs • Explore all possible distinct paths starting from keyword elements • At each exploration step, take the current path with the lowest cost • When a connecting element is found, merge paths to obtain a candidate • Top-k terminates when the highest cost in the candidate list (the cost of the k-ranked query graph) < the lowest possible cost that can be achieved with paths in the queues
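The cost-directed exploration and its termination test can be sketched as follows. This is a simplified illustration, not the algorithm of the cited paper: it assumes one keyword element per keyword, tracks only the cheapest path per keyword and element (Dijkstra-style) instead of all distinct paths, and reports connecting elements rather than full query graphs. The graph, costs, and function name are hypothetical.

```python
import heapq

def top_k_connections(graph, cost, keyword_elements, k):
    """Return up to k (total_cost, connecting_element) pairs, cheapest first.

    graph: dict node -> list of neighbours (undirected, both directions listed)
    cost:  dict node -> non-negative cost of including that node in a path
    """
    settled = [dict() for _ in keyword_elements]   # per keyword: node -> path cost
    queue = [(cost[e], i, e) for i, e in enumerate(keyword_elements)]
    heapq.heapify(queue)
    candidates = []                                # (total cost, connecting element)

    while queue:
        # Terminate once the k-th best candidate is cheaper than any queued path.
        if len(candidates) >= k and sorted(candidates)[k - 1][0] < queue[0][0]:
            break
        path_cost, i, node = heapq.heappop(queue)  # current path with lowest cost
        if node in settled[i]:
            continue                               # a cheaper path got here first
        settled[i][node] = path_cost
        if all(node in s for s in settled):        # reached from every keyword
            candidates.append((sum(s[node] for s in settled), node))
        for nxt in graph[node]:
            if nxt not in settled[i]:
                heapq.heappush(queue, (path_cost + cost[nxt], i, nxt))
    return sorted(candidates)[:k]

# Hypothetical query space: stanford - university - person - prize - award.
graph = {
    "stanford": ["university"],
    "university": ["stanford", "person"],
    "person": ["university", "prize"],
    "prize": ["person", "award"],
    "award": ["prize"],
}
cost = {"stanford": 2, "university": 2, "person": 1, "prize": 2, "award": 2}
best = top_k_connections(graph, cost, ["stanford", "award"], k=1)
```

The early exit is safe in this sketch because any candidate not yet found must use at least one path that is still in the queue, so its total cost cannot fall below the cheapest queued path.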

Structured Query Processing on Graph-Structured RDF Data

Query Processing • Structured query: conjunctive queries • Answering conjunctive queries on graph-structured data amounts to graph-pattern matching • A solution for determining matches requires exponential time • Search complexity increases substantially with the size of the graph • The size of the graph is very large on the Web of linked data

Answer Space (Tran et al., SemData@VLDB 2010) • Figure captions: an extended example of the data graph; the resulting answer space • Construction of answer space is based on bisimulation • Answer space • Comprises classes (extensions) and relations between them • Resources in an extension exhibit the same structure, i.e., have the same (incoming and outgoing) paths • Is a structural description more fine-grained than a schema • Summary model for general data graphs • Structure-based data partitioning to store data that share structures • Structure-aware processing to filter candidates and prune queries using a smaller answer space
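One way to picture the bisimulation-based construction is naive partition refinement: resources start in a single block, and blocks are split until all members of a block have the same labeled incoming and outgoing edges to the same blocks. The sketch below is an assumption-laden illustration (function name, edge list, and the simple fixpoint loop are not taken from the cited paper).

```python
def bisimulation_partition(nodes, edges):
    """Group nodes with the same incoming/outgoing structure.

    nodes: list of node ids; edges: list of (source, label, target) triples.
    Returns a dict mapping each node to the id of its block (extension).
    """
    block = {n: 0 for n in nodes}
    while True:
        # Signature: labeled edges to and from the current blocks.
        signature = {n: set() for n in nodes}
        for s, label, t in edges:
            signature[s].add(("out", label, block[t]))
            signature[t].add(("in", label, block[s]))
        # Nodes stay together only if they were together and their signatures agree.
        ids = {}
        new_block = {
            n: ids.setdefault((block[n], frozenset(signature[n])), len(ids))
            for n in nodes
        }
        if len(ids) == len(set(block.values())):
            return new_block                      # no block was split: stable
        block = new_block

# Hypothetical data graph in the spirit of the running example.
nodes = ["pub1", "pub2", "pub3", "per1", "per2", "per3", "uni1"]
edges = [
    ("pub1", "author", "per1"),
    ("pub2", "author", "per2"),
    ("pub3", "author", "per3"),
    ("uni1", "employ", "per1"),
    ("uni1", "employ", "per2"),
]
extension = bisimulation_partition(nodes, edges)
```

Here `per1` and `per2` end up in the same extension, while `per3` (which has no incoming `employ` edge) is separated, and that split propagates to `pub3`.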

Structure-aware Matching Using Answer Space • Figure captions: the answer space; an example query • Match query against answer space • Answer space matches contain elements satisfying the query structure • Focus on answer space matches to compute final answers • Prune query parts containing non-distinguished variables only • Match remaining query against data graph (i.e., focus on elements in the answer space matches identified and loaded before) • Advantages: reduction in IO cost and in the number of unions and joins

Query Processing on the Linked Data Web

Query Processing on the Web • Routing • Find combinations of sources • Federation • Query parts → sources • Combining results from different sources • Online schema mappings • Similarity join

Linked Data: More Links, More Data • 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links • As of 09-2010 + other linked data not covered by LOD cloud

Challenges • Information need: “Articles from awarded researchers at Stanford” • Large number of unknown, unprocessed & irrelevant sources! • What is in there? • What is out there? • What is relevant? • Formulating queries is a hard task! (USABILITY) • Which data sources? • Which schema elements? • Processing queries is expensive! (SCALABILITY) • Process against all data sources?

Searching Linked Data • Given the needs (expressed as sets of keywords), • are there answers in processed linked data? • which combinations of data sources produce them? • how to incorporate related unprocessed linked sources? • Identify valid combinations of sources • Identify schema elements • Keyword Query Routing • Let user choose a combination of sources • Focus on this combination of sources and related linked sources • Linked Data Query Processing

Keyword Query Routing (Tran et al., ISWC 2010)

Keyword Query Routing • Linked data (schema and data are linked) • Routing based on keywords • Find combinations of sources

LOD Data Graph • Web data modeled as a set of interlinked data graphs • Each data graph represents a source • Data graph vs. schema graph vs. source graph • Figure: data graphs of the sources Freebase, DBLP, and DBPedia. Resources: uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1, prize2. Edges: author, employ, prizes, sameAs, name, label, title. Literals: Stanford University, John McCarthy, John Mccarthy, John Smith, Turing Award, Music Award

LOD Schema Graph • Web data modeled as a set of interlinked data graphs • Each data graph represents a source • Data graph vs. schema graph vs. source graph • Figure: schema graph for Freebase, DBLP, and DBPedia. Classes: University, Written Work, Article, Person, Author, Prize. Edges: employ, author, prizes, sameAs

LOD Source Graph • Web data modeled as a set of interlinked data graphs • Each data graph represents a source • Data graph vs. schema graph vs. source graph • Figure: source graph in which Freebase, DBLP, and DBPedia are linked by author and sameAs edges

Keyword Query Answers • User information need: “stanford article award” • Figure: the data graphs of Freebase, DBLP, and DBPedia (resources uni1, pub1 to pub3, per1 to per4, prize1, prize2; edges type, author, employ, prizes, sameAs, name, label, title) with the keyword elements Stanford University, Article, and Turing Award

Problem Definition • A keyword query result (also called Steiner graph) is a subgraph of the data graph that contains, for every keyword, a matching data element (called keyword element), and these elements are pairwise connected over a path. • A d-max Steiner graph is a Steiner graph where every path between keyword elements has length d-max or less. • Keyword query routing: compute valid sets of data sources, called keyword routing plans. A plan is valid if its sources can be combined to produce non-empty keyword query results.
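The d-max condition in the definition above can be checked with a bounded breadth-first search. The sketch assumes an undirected adjacency-list graph and one chosen keyword element per keyword; function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import combinations

def distances_within(graph, start, d_max):
    """Hop distance from start to every node that is at most d_max hops away."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if dist[node] == d_max:
            continue                      # do not expand beyond the bound
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                frontier.append(nxt)
    return dist

def connects_keywords(graph, keyword_elements, d_max):
    """True if every pair of keyword elements is joined by a path of length <= d_max."""
    reach = {e: distances_within(graph, e, d_max) for e in keyword_elements}
    return all(b in reach[a] for a, b in combinations(keyword_elements, 2))

# Hypothetical fragment: uni1 - per1 - prize1.
graph = {"uni1": ["per1"], "per1": ["uni1", "prize1"], "prize1": ["per1"]}
ok = connects_keywords(graph, ["uni1", "prize1"], d_max=2)
too_tight = connects_keywords(graph, ["uni1", "prize1"], d_max=1)
```

If the check succeeds, the union of the connecting paths is a d-max Steiner graph for those keyword elements; the sources those paths pass through would then form a candidate routing plan.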

A Valid Keyword Routing Plan • User information need: “stanford article award” • Figure: the data graphs of Freebase, DBLP, and DBPedia (resources uni1, pub1 to pub3, per1 to per4, prize1, prize2; edges type, author, employ, prizes, sameAs, name, label, title) with the keyword elements Stanford University, Article, and Turing Award

The Search Space • Multi-level inter-relationship graphs capture the entire search space • Relationships between elements • and between different levels • A solution: apply existing approaches to keyword search for computing Steiner graphs • Steiner graphs might span several linked sources • Search space grows exponentially with the number of sources and their associated links • Search space is too large!

Keyword Sets • One keyword set for every data source • Elements stand for distinct keywords mentioned in a source • Figure: the data graphs of Freebase, DBLP, and DBPedia annotated with the keywords of each source, e.g. Stanford, University, John, McCarthy, Smith, Turing, Music, Award

Element-level Keyword-Element Relationship Graph (E-KERG) • A keyword-element captures a keyword k and the data element mentioning k • A relationship between two keyword-elements exists iff there is a path between their associated data elements • In d-max KERG, the paths to be considered have length d-max or less • Figure: keyword-elements pairing keywords (Stanford, University, John, McCarthy, Smith, Turing, Music, Award) with the data elements that mention them (uni1, per1 to per4, prize1, prize2, pub4)
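A d-max E-KERG as defined above could be built along the following lines. This is a sketch under assumptions: the data graph is an undirected adjacency list, `mentions` maps each data element to the keywords it mentions, and two keywords on the same element are treated as related (path of length 0). All names and data are hypothetical.

```python
from collections import deque

def build_e_kerg(graph, mentions, d_max):
    """Return (nodes, edges) of the d-max keyword-element relationship graph.

    nodes: list of (keyword, element) pairs
    edges: set of frozensets, each holding two related keyword-elements
    """
    def reachable(start):
        # Bounded breadth-first search over the data graph.
        dist = {start: 0}
        frontier = deque([start])
        while frontier:
            node = frontier.popleft()
            if dist[node] == d_max:
                continue
            for nxt in graph.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    frontier.append(nxt)
        return dist

    nodes = [(k, e) for e, kws in mentions.items() for k in kws]
    edges = set()
    for e1 in mentions:
        for e2 in reachable(e1):
            if e2 not in mentions:
                continue                  # element mentions no keyword
            for k1 in mentions[e1]:
                for k2 in mentions[e2]:
                    if (k1, e1) != (k2, e2):
                        edges.add(frozenset([(k1, e1), (k2, e2)]))
    return nodes, edges

# Hypothetical fragment: uni1 - per1 - prize1.
graph = {"uni1": ["per1"], "per1": ["uni1", "prize1"], "prize1": ["per1"]}
mentions = {"uni1": {"stanford", "university"}, "prize1": {"turing", "award"}}
kerg_nodes, kerg_edges = build_e_kerg(graph, mentions, d_max=2)
```

With `d_max=2` the keyword-elements `("stanford", "uni1")` and `("award", "prize1")` are related; with `d_max=1` only keyword-elements on the same data element remain connected.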
