CASS-MT Review: 6-Apr-2011 Task 3: Semantic Databases on the XMT

PNNL: David Haglin, Bob Adolf, Sinan al-Saffar, Cliff Joslyn
Cray: David Mizell
SNL: Eric Goodman, Edward Jimenez, Greg Mackey

HPC applied to Semantic Graph Databases
(Front end) User Interface
• Expressing queries as a graph
• SPARQL
• SGD as an Appliance
Search / Query
• Search Processing Approach
• Query Optimization
• On-the-fly Inferencing
Data Storage & Manipulation
• Dictionary Encoding
• Materialized Inference
• Paging graph portions / dictionary
Analysis
• Billion Triple size datasets
• Extant Ontological Scaling
• Motif Analysis

Outline
• Introduction (David Haglin)
  • Accomplishments
  • Focus this review: Query Search Process
• OWL Rules, Subgraph Isomorphism, Sprinkle-SPARQL (Eric Goodman)
• Generic Forward-Inferencing Capability (David Mizell)
• Graph Analysis and Extant Ontology (Sinan al-Saffar)
• What next? (David Haglin)

Accomplishments
Accepted Papers:
• Eric Goodman, Edward Jimenez, David Mizell, Sinan al-Saffar, Bob Adolf, and David Haglin. “High-performance Computing Applied to Semantic Databases”. Extended Semantic Web Conference (ESWC 2011), May 2011. (23% acceptance rate)
Submissions:
• Cliff Joslyn, Bob Adolf, Sinan al-Saffar, John Feo, Eric Goodman, David Haglin, Greg Mackey, and David Mizell. “High Performance Descriptive Semantic Analysis of Semantic Graph Databases”. Workshop on High-Performance Computing for the Semantic Web, ESWC 2011, May 2011.
• Sinan al-Saffar, Cliff Joslyn, Alan Chappell. “Extant Ontological Scaling and Descriptive Semantics for Semantic Structure Discovery in Large Graph Datasets.” IEEE/WIC/ACM International Conference on Web Intelligence.
Workshops Organized:
• HPCSW – Most of Task 3 personnel on program committee.
• Complex Query Workshop – scheduled for April 25/26 in Seattle, WA
Hybrid Database Planning Technical Meeting:
• Battelle Seattle Research Center, February 2011
• UW (Howe, Shaw), PNNL (CASS/SDB and TAI), SNL

CASS-MT Quarterly Review
Task 3: Semantic Databases on the XMT
Eric Goodman, Edward Jimenez, Greg Mackey
Update April 2011

Sprinkle SPARQL
• Sprinkle SPARQL presented in ESWC paper
• Paucity of scalability results in literature
  • 10 nodes running MapReduce
  • 1 node running BigOWLIM
Note: The MapReduce method did not operate on the inferred set. They hand-encoded expanded queries to catch the possibilities.

LUBM Query 1
SELECT ?X WHERE
{?X rdf:type ub:GraduateStudent}
{?X ub:takesCourse http://www.Department0.University0.edu/GraduateCourse0}
• First pattern: all the graduate students (20,157,119 matches)
• Second pattern: all the students that took a particular course (4 matches)

Sprinkle phase
• Create an array the same size as the order of the graph for each variable in each BGP
• Process each BGP
• If a node fulfills the constraint of the BGP, increment the counter in the associated array for the variable
• The point: constrain the problem before we start joining
Counter array, initial state:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Sprinkle phase
After "all the students that took a particular course":
0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Sprinkle phase
After "all the students that took a particular course" and "all the graduate students":
1 2 0 1 0 1 1 0 0 1 0 1 1 0 2 0 1 0 1 1
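
The sprinkle phase can be illustrated with a minimal Python sketch. It is not the XMT implementation from the ESWC paper: the function names, the integer node ids, and the toy triples are all assumptions made for illustration, and each pattern is simplified to the form (?X, predicate, object) with a single shared variable.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]        # (subject, predicate, object) as integer node ids
Pattern = Tuple[int, int]            # (predicate, object); the subject is the variable ?X


def sprinkle(num_nodes: int, triples: List[Triple], patterns: List[Pattern]) -> List[int]:
    """For every node, count how many patterns it satisfies as ?X."""
    counters = [0] * num_nodes       # one slot per node in the graph
    for pred, obj in patterns:
        # A node is counted at most once per pattern, even if several triples match.
        matched = {s for (s, p, o) in triples if p == pred and o == obj}
        for node in matched:
            counters[node] += 1
    return counters


def candidates(counters: List[int], num_patterns: int) -> List[int]:
    """Only nodes that satisfied every pattern can appear in the final answer."""
    return [node for node, count in enumerate(counters) if count == num_patterns]


# Hypothetical toy graph: ids 0-3 are people, 4 is a course, 5-7 are vocabulary terms.
COURSE, TYPE, TAKES, GRAD = 4, 5, 6, 7
toy_triples = [
    (0, TYPE, GRAD), (1, TYPE, GRAD), (2, TYPE, GRAD),
    (1, TAKES, COURSE), (3, TAKES, COURSE),
]
toy_patterns = [(TYPE, GRAD), (TAKES, COURSE)]

toy_counters = sprinkle(8, toy_triples, toy_patterns)
toy_candidates = candidates(toy_counters, len(toy_patterns))
```

In this toy case only node 1 reaches a count of 2, so the join that follows has one candidate to consider instead of every graduate student.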

Future Query Work
• Sprinkle-SPARQL
• In-depth analysis
• More discriminating use of Sprinkle
• Comparison to other approaches
• MTGL subgraph isomorphism algorithm
• Approach from Bob Adolf and David Haglin
• Array-based method from David Mizell for SC10 demo

Inference Work
• Multimap data structure
• OWL Horst rules
  • rdfp4: Transitivity
  • rdfp8: InverseOf
  • rdfp12: Equivalent Classes
  • rdfp15: SomeValuesFrom
• This is the set of rules required for LUBM
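
As an illustration of what a rule such as rdfp4 (transitivity) computes, here is a small Python sketch. It assumes the edges of one property already known to be transitive have been extracted as (u, v) pairs; the function name and the naive iterate-until-stable strategy are illustrative choices, not the approach used on the XMT.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Apply (u p v), (v p w) -> (u p w) until no new edge appears."""
    closure = set(edges)
    while True:
        # Index the current edges by source node so the second pattern is a lookup.
        by_source: Dict[int, Set[int]] = {}
        for u, v in closure:
            by_source.setdefault(u, set()).add(v)
        new_edges = {
            (u, w)
            for (u, v) in closure
            for w in by_source.get(v, ())
        } - closure
        if not new_edges:
            return closure
        closure |= new_edges


# Hypothetical chain 1 -> 2 -> 3 -> 4 for a single transitive property.
chain_closure = transitive_closure({(1, 2), (2, 3), (3, 4)})
```

The by-source index built here is a small instance of the key-to-many-values lookup that a multimap provides.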

Multimaps
• A mapping between keys and multiple values
• Comes up often in RDFS/OWL inferencing
  • Class hierarchies
  • Property hierarchies
  • SameAs relationships
  • Indices to find triples with certain subjects, predicates, or objects

Multimap: First Loop
[Figure, two animation steps: key counter, index, and keys arrays, shown inside the multimap class and external to the class]

Multimap: Initialize Storage
[Figure: the same key counter, index, and keys arrays, with the values storage added]
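
The slides name a first loop that counts keys and a step that initializes storage. One common way to build a multimap along those lines is the two-pass, counting-sort style layout sketched below in Python. This is an assumption about the general technique, not a transcription of the authors' XMT class; the names build_multimap and lookup are hypothetical.

```python
from typing import Any, List, Tuple


def build_multimap(pairs: List[Tuple[int, Any]], num_keys: int) -> Tuple[List[int], List[Any]]:
    """Build a flat multimap: values for key k live in values[offsets[k]:offsets[k + 1]]."""
    # First loop: count how many values each key will hold.
    counts = [0] * num_keys
    for key, _ in pairs:
        counts[key] += 1

    # Initialize storage: a prefix sum over the counts gives each key its slice.
    offsets = [0] * (num_keys + 1)
    for k in range(num_keys):
        offsets[k + 1] = offsets[k] + counts[k]
    values: List[Any] = [None] * offsets[num_keys]

    # Second loop: drop each value into the next free slot of its key's slice.
    cursor = offsets[:num_keys]          # slicing copies, so offsets stays intact
    for key, value in pairs:
        values[cursor[key]] = value
        cursor[key] += 1
    return offsets, values


def lookup(offsets: List[int], values: List[Any], key: int) -> List[Any]:
    """Return all values stored under key."""
    return values[offsets[key]:offsets[key + 1]]


# Hypothetical input: key 2 has three values, keys 0 and 1 have one each.
demo_pairs = [(2, "a"), (0, "b"), (2, "c"), (1, "d"), (2, "e")]
demo_offsets, demo_values = build_multimap(demo_pairs, 3)
```

Because the storage is sized exactly in advance, no slice ever needs to grow during the insert pass; a parallel version would additionally need atomic increments on the counters and cursors.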

Results
• Data set: ~5B Zipfian integers
• Value was “1” for each key
• Total time at 128 processors
  • Old method: 23.5 seconds (208e6 inserts/second)
  • New method: 11.5 seconds (422e6 inserts/second)
  • Comparison to hashing: 5.5 seconds (878e6 inserts/second)
• Speedup from 2 to 128 processors (ideal 64x)
  • Old: 37x
  • New: 53x
Note: Had to grab class member variables and pass them back in to get good scaling.
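
The arithmetic behind these figures can be checked with a few lines of Python. The numbers are copied from the slide; the variable names are illustrative, and "efficiency" here simply means measured speedup divided by the ideal 64x.

```python
# Total time (seconds) and reported inserts/second at 128 processors, from the slide.
runs = {
    "old": (23.5, 208e6),
    "new": (11.5, 422e6),
    "hashing": (5.5, 878e6),
}

# time * rate gives the implied number of inserts, which should sit near the "~5B" data set size.
implied_inserts = {name: seconds * rate for name, (seconds, rate) in runs.items()}

# Going from 2 to 128 processors, the ideal speedup is 128 / 2 = 64x.
ideal = 128 / 2
speedup = {"old": 37, "new": 53}
efficiency = {name: s / ideal for name, s in speedup.items()}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.0%} of ideal speedup")
```

Each implied insert count comes out between 4.8 and 4.9 billion, consistent with the stated data set size, and the new method reaches roughly 83% of ideal speedup against roughly 58% for the old one.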

OWL Horst Preliminary Results

Future Inference Work
• Compare with Chris Rickett and David Mizell’s strategy
• Prepare submission for ISWC 2011 (June deadline)
• Move to on-the-fly inference

Towards a Generic Forward-Inferencing Capability for Semantic Database Ontologies
David Mizell, Cray Inc.
working with
Chris Rickett, Cray Inc.
Eric Goodman, Sandia
Sinan al-Saffar, PNNL
Lake Union

The Main Idea
Develop an automated or semi-automated process for
• extracting the ontology from an RDF triples database
• translating the ontological rules into a simple syntax, e.g., Jena Rules
• using the translation to perform forward (later backward) inferencing on the database

Forward Inferencing: Computing the "Closure" of an Ontology on an RDF Triples Dataset
(also called “materialization”)
Triples database:
… ( David is-a Cray-employee ) …
… ( Shoaib is-a Cray-employee ) …
Ontology rules (get applied to the triples database):
… ( ?x is-a Cray-employee ) -> ( ?x has-a cell-phone ) …
( Cray-employee subset-of US-citizen ) …
New, inferred triples:
( David has-a cell-phone )
( Shoaib has-a cell-phone )
( David is-a US-citizen )
( Shoaib is-a US-citizen )

The Forward Inferencing Process
• Take each rule
  ( ?x is-a Cray-employee ) -> ( ?x has-a cell-phone )
• Search the database for triples that match the left-hand side of the rule
  ( ?x is-a Cray-employee ) matches ( David is-a Cray-employee )
• Add the new triple(s) to the database corresponding to the right-hand side
  ( David has-a cell-phone )
• (worst case) repeat until you reach a fixed point
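
The process above can be sketched in Python as a naive fixed-point loop. This is a toy under stated assumptions: triples are tuples of strings, each rule has one pattern on each side, and the helper names (match, substitute, forward_closure) are hypothetical. The subset-of statement from the earlier slide is hand-translated into a second rule for the example.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Rule = Tuple[Triple, Triple]             # (left-hand pattern, right-hand pattern)


def match(pattern: Triple, triple: Triple) -> Optional[Dict[str, str]]:
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings: Dict[str, str] = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if bindings.setdefault(p, t) != t:   # same variable must bind the same term
                return None
        elif p != t:
            return None
    return bindings


def substitute(pattern: Triple, bindings: Dict[str, str]) -> Triple:
    """Replace variables in the pattern with their bound terms."""
    s, p, o = (bindings.get(term, term) for term in pattern)
    return (s, p, o)


def forward_closure(triples: Set[Triple], rules: List[Rule]) -> Set[Triple]:
    """Apply every rule repeatedly until no new triple is inferred."""
    db = set(triples)
    while True:
        inferred = set()
        for lhs, rhs in rules:
            for triple in db:
                bindings = match(lhs, triple)
                if bindings is not None:
                    inferred.add(substitute(rhs, bindings))
        new = inferred - db                      # set difference removes duplicates
        if not new:
            return db
        db |= new


facts = {("David", "is-a", "Cray-employee"), ("Shoaib", "is-a", "Cray-employee")}
rules = [
    (("?x", "is-a", "Cray-employee"), ("?x", "has-a", "cell-phone")),
    # Assumption: ( Cray-employee subset-of US-citizen ) rewritten by hand as a rule.
    (("?x", "is-a", "Cray-employee"), ("?x", "is-a", "US-citizen")),
]
closure = forward_closure(facts, rules)
```

The loop scans the whole database on every pass, which is exactly the worst-case behavior the last bullet mentions.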

Rule Left-hand Side Matching is a Lot Like Querying
( ?x is-a Cray-employee ) && ( ?x is-a manager ) -> ( ?x has-a Blackberry )
… ( Shoaib is-a Cray-employee ) …
… ( Shoaib is-a manager ) …
Matching both patterns for the same ?x is a JOIN.
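
A two-pattern left-hand side can be evaluated as a join on the shared variable. The Python sketch below assumes both patterns have the form (?x, predicate, object) and uses a set intersection as the join; the function names are hypothetical and the data is the slide's example plus one extra employee.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Constraint = Tuple[str, str]             # (predicate, object) with ?x as the subject


def subjects_matching(triples: Set[Triple], predicate: str, obj: str) -> Set[str]:
    """All bindings of ?x for the pattern (?x predicate obj)."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def apply_two_pattern_rule(
    triples: Set[Triple], left: Constraint, right: Constraint, consequence: Constraint
) -> Set[Triple]:
    """Join the two patterns on ?x, then emit the right-hand side for each joined binding."""
    joined = subjects_matching(triples, *left) & subjects_matching(triples, *right)
    return {(s, consequence[0], consequence[1]) for s in joined}


staff = {
    ("Shoaib", "is-a", "Cray-employee"),
    ("Shoaib", "is-a", "manager"),
    ("David", "is-a", "Cray-employee"),
}
blackberry_owners = apply_two_pattern_rule(
    staff, ("is-a", "Cray-employee"), ("is-a", "manager"), ("has-a", "Blackberry")
)
```

Only Shoaib satisfies both patterns, so only one new triple is produced.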

What Eric Goodman and I (mostly Eric) Did Last Year
Goodman and Mizell, “Scalable In-Memory Closure on Billions of Triples,” International Workshop on Scalable Semantic Web Knowledge Bases, at the International Semantic Web Conference, Shanghai, Nov. 2010
• RDFS is a standard ontology with 13 rules. 6 of these have 2 triple patterns on the left-hand side (require join-like processing). We only used those.
• Wrote 6 functions with the same overall structure:
  • Search the database for matches to the left-hand side
  • Add the implied triples
• Eric cleverly scheduled the application of these functions to avoid fixpoint iteration

What Chris Rickett and I (mostly Chris) Did, for the SC 2010 Demo
Castagna, Dollin and Seaborne, “Vivisecting LUBM,” HP Laboratories, HPL-2009-348, Nov. 6, 2009
What the HP Labs researchers did:
• Extracted the LUBM ontology rules
• Re-wrote them in “Jena Rules” format
• Applied them in “streaming” fashion to the LUBM database
Example, the LUBM Chair class:
:Chair a owl:Class ;
  rdfs:label "chair" ;
  rdfs:subClassOf :Professor ;
  owl:intersectionOf ( :Person [ a owl:Restriction ; owl:onProperty :headOf ; owl:someValuesFrom :Department ] )
and its rules:
(?x rdf:type ub:Chair) -> (?x rdf:type ub:Professor) .
(?x rdf:type ub:Person) (?x ub:headOf ?y) (?y rdf:type ub:Department) -> (?x rdf:type ub:Chair) .
(?x rdf:type ub:Chair) -> exists ?y : (?x rdf:type ub:Person) (?x ub:headOf ?y) (?y rdf:type ub:Department) .

What Chris Rickett and I (mostly Chris) Did (2)
(?x rdf:type ub:Course) -> (?x rdf:type ub:Work) .
(?x rdf:type ub:Research) -> (?x rdf:type ub:Work) .
(?x rdf:type ub:GraduateCourse) -> (?x rdf:type ub:Course) (?x rdf:type ub:Work) .
(?x rdf:type ub:UndergraduateStudent) -> (?x rdf:type ub:Student) .
(?x rdf:type ub:ResearchAssistant) -> (?x rdf:type ub:Student) .
(?x rdf:type ub:GraduateStudent) -> (?x rdf:type ub:Person) .
(?x rdf:type ub:Faculty) -> (?x rdf:type ub:Employee) .
(?x rdf:type ub:Professor) -> (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
(?x rdf:type ub:AssistantProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
(?x rdf:type ub:AssociateProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
(?x rdf:type ub:Dean) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
(?x rdf:type ub:FullProfessor) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
(?x rdf:type ub:Chair) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) (?x rdf:type ub:Employee) .
…
• Grabbed their Jena-formatted rules from the paper’s appendix
• Chris wrote a parser for the rules and converted them to a triples-pattern (integer) data structure (using Eric Goodman’s “dictionary”)
• Iterated through the rules until no new triples were added
• Recently, I tuned the inferencer by substituting a hash table specialized to integer triples (written by Eric Goodman), used for duplicate elimination
• Time on LUBM8000, 1.1B triples before, 1.7B after (just inferencing, no I/O):
  • 350 sec/128p
  • 185 sec/256p
  • 148 sec/512p
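
To make the "parse the rules, then convert them to integer triple patterns" step concrete, here is a hedged Python sketch. It handles only the simple rule shape shown above (parenthesized triples on both sides of "->"), not the full Jena Rules syntax and not the "exists" form. The function name, the use of a plain dict as the dictionary, and the convention of negative integers for variables are all assumptions for illustration, not the parser Chris Rickett wrote.

```python
import re
from typing import Dict, List, Tuple

Pattern = Tuple[int, int, int]
GROUP = re.compile(r"\(([^()]*)\)")      # text between one pair of parentheses


def parse_rule(text: str, dictionary: Dict[str, int]) -> Tuple[List[Pattern], List[Pattern]]:
    """Turn '(s p o) ... -> (s p o) ... .' into integer patterns for both sides."""
    variables: Dict[str, int] = {}

    def encode(term: str) -> int:
        if term.startswith("?"):
            # Variables get negative ids that are local to this rule.
            return variables.setdefault(term, -(len(variables) + 1))
        # Constants share one dictionary across all rules (and, in practice, the data).
        return dictionary.setdefault(term, len(dictionary))

    def patterns(part: str) -> List[Pattern]:
        found = []
        for group in GROUP.findall(part):
            s, p, o = group.split()      # raises ValueError if a group is not a triple
            found.append((encode(s), encode(p), encode(o)))
        return found

    lhs, arrow, rhs = text.partition("->")
    if not arrow:
        raise ValueError("rule has no '->'")
    return patterns(lhs), patterns(rhs)


terms: Dict[str, int] = {}
chair_lhs, chair_rhs = parse_rule(
    "(?x rdf:type ub:Chair) -> (?x rdf:type ub:Professor) (?x rdf:type ub:Faculty) .",
    terms,
)
```

Once rules and data share the same integer encoding, duplicate elimination reduces to membership tests on integer triples, which is where a specialized hash table pays off.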

Open Issues
• How does this performance compare to the specific function-per-rule approach? Is there a programmer time vs. execution time tradeoff?
• How generalizable is this “generic” approach? Jena Rules are easy to parse, but…
  • Semantics can be quite tricky
  • Usually will have to combine some custom, database-specialized rules with a standard ontology such as
    • RDFS
    • OWL Lite
    • OWL DL
    • …
• What we learn from this may help us with on-the-fly (backwards) inferencing in the future

Graph Analysis and Extant Ontology Sinan al-Saffar, Cliff Joslyn

Informing the Design of a Future Database Engine
As with relational databases, in order to optimize any future graph database engine, we need to understand:
• Graph content and structure
• Queries and inference
Why? Because these influence the choice of data structures and algorithms needed to achieve efficient time and space utilization.
This has to happen in both:
• The overall design
• A dynamic query optimization component

Graph-O-Scope
We built a set of functions that compute statistical measures to help us understand the contents of semantic graphs.
The intention is to re-implement these functions in an API to be used from within a dynamic query optimization module.
Some of the statistics:
• Edge and node counts and graph density
• Literal, blank, and URI counts with breakdowns by subject/object
• Predicate and class distributions
• Counts of reification and ontological components
• In-degree / out-degree distributions
• Connected component sameAs cliques
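
A few of these statistics are easy to sketch in Python for a small in-memory triple list. This is not the Graph-O-Scope code: the function name is hypothetical, and density is taken here as edges divided by n(n-1), the usual definition for a simple directed graph, which the slide does not specify.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def graph_stats(triples: Iterable[Triple]) -> Dict[str, Any]:
    """Compute node/edge counts, density, predicate counts, and degree distributions."""
    triples = list(triples)
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    out_degree = Counter(s for s, _, _ in triples)
    in_degree = Counter(o for _, _, o in triples)
    n, m = len(nodes), len(triples)
    return {
        "nodes": n,
        "edges": m,
        # Assumed definition: fraction of possible directed node pairs that carry an edge.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "predicates": Counter(p for _, p, _ in triples),
        # Map degree -> number of nodes with that degree (nodes absent from a Counter count as 0).
        "out_degree_dist": Counter(out_degree[v] for v in nodes),
        "in_degree_dist": Counter(in_degree[v] for v in nodes),
    }


sample = [("a", "knows", "b"), ("a", "knows", "c"), ("b", "type", "Person")]
sample_stats = graph_stats(sample)
```

At billion-edge scale these would be computed in parallel over the dictionary-encoded graph rather than with Python sets, but the quantities are the same.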

What is in the graph?
Question: How can we “understand” a 2 billion edge graph?
We looked at three large datasets:
• BTC is the result of a semantic web crawl
• UniProt is a ten-year, primary bioinformatics reference
• LUBM is a synthetic dataset

Reification
Discovery: A good chunk of the data is reified
Design: Make database a hybrid
Primary Statement Annotation

Terminal Edges
Discovery: Literal nodes and their edges constitute a sizable share of the data
Design: Implement literals as node properties (outside the graph)

Class Coverage (BTC)
Discovery: 168k classes, but 16 cover 80% and 64 cover 95% of the data
Design: Implement types as node properties (huge effect on inference)

Predicate Coverage (BTC)
Discovery: 95k different predicates, but 64 cover 86% of the data
Design: Optimize graph data structure for a small range of edge labels
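
Coverage figures of this kind come from sorting labels by frequency and accumulating until a target share is reached. The Python sketch below shows the calculation on made-up counts; the function name and the toy numbers are hypothetical and are not the BTC data.

```python
from typing import Dict


def labels_needed(counts: Dict[str, int], target: float) -> int:
    """Smallest number of most-frequent labels whose counts reach `target` of the total."""
    total = sum(counts.values())
    covered = 0
    ranked = sorted(counts.values(), reverse=True)   # most frequent first
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(ranked)


# Hypothetical class counts, heavily skewed like the distributions described above.
toy_counts = {"A": 60, "B": 25, "C": 9, "D": 4, "E": 2}
top_for_80 = labels_needed(toy_counts, 0.80)
top_for_95 = labels_needed(toy_counts, 0.95)
```

With these toy counts, two labels cover 80% and four cover 95%, the same shape of result as the class and predicate discoveries on the slides.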

UniProt Extant Ontology I
A 243-edge graph as a statistical representation of the semantic structures present in the 2-billion-edge UniProt graph

Uniprot Extant Ontology I zoomed-in

Extant Ontological Scaling I

Extant Ontological Scaling II

Extant Ontological Scaling III

Level 1 Extant of Uniprot – Scaling applied rdfs:seeAlso 51.82%

Future Work: specific directions
• Continue working with Larry Holder (WSU) to find common ground on frequent subgraph mining and semantic database query
• Work with Bill Howe on query language and hybrid search strategies
• Expand our collaboration with Task 1
• Support Task 16 (Mayo)
• Engage with the bioinformatics domain to find/build an interestingly large and complex bio dataset (i.e., more complex than UniProt)
• Find collections of complex queries
• Continue work on search engine comparison:
  • Array-based
  • Subgraph-isomorphism (MTGL)
  • Sprinkle-SPARQL
• Explore query optimization strategies
• Extend study of larger path types (n=4,5) and/or non-linear motifs
