SHER: A Scalable Highly Expressive Reasoner and its Applications.

J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma, E. Schonberg, K. Srinivas
SHER: Scalable Highly Expressive Reasoner

Outline
• Background and motivation
• Core SHER technical innovations
  • Scalability via summarization
  • Refinement: resolving inconsistencies in a summary
  • Integration with incomplete reasoners
  • Conjunctive query evaluation
• SHER concrete applications
  • Automated clinical trials matching using ontologies
  • Anatomy Lens: semantic search over PubMed
  • Scalable text analytics cleanup
• Conclusion

Project background and motivation
• Emergence of OWL as a standardized language for expressing semantic relations in ontologies
  • 2004: OWL becomes a W3C Recommendation
• Emergence of standardized ontologies encoded in OWL, especially in healthcare and life sciences:
  • BioPAX
  • SNOMED
  • About 80 ontologies at OBO (e.g. GO, FMA)
• Emerging use of ontologies in search and retrieval of structured and unstructured data

Vision: Semantic Information Retrieval
[Figure: a Semantic Information Retrieval System offers a homogeneous view over an ontology definition (e.g. "Louvre located in Paris"), legacy data (DB2, RDF store), and unstructured data (e.g. "John Smith visited the Louvre"). Query: "Show me all people who visited France?" Answer: "John Smith!"]

Problems
• Computational complexity of reasoning
  • Intractable in the worst case
  • In 2005, intractable in practice on large and expressive KBs
• Imprecision/inconsistencies in ontologies
• Reasoner inability to scale consistency check
• Query answering in expressive ontologies

Dealing with complexity challenges
• Reducing the expressivity of DL languages
  • Why?
    • 80/20 rule: ~80% of use cases covered by 20% of the language constructs
    • Tractability
    • Ease of implementation
  • Result of this line of research:
    • DL-Lite family (Diego Calvanese et al.)
      • Covers: ER, UML
      • LogSpace complexity
      • Easy scalable implementation on top of relational DBMS
    • EL++ (Franz Baader et al.)
      • Covers most life science ontologies
      • Polynomial time complexity (satisfiability, subsumption, and instance checking)
      • Simple rule-style implementation
    • OWL 2.0 Profiles
• Approximate reasoning
  • Screech OWL Reasoner (Pascal Hitzler et al.)

SHER – A Highly Scalable SOUND and COMPLETE Reasoner for large OWL-DL KB
• Reasons over highly expressive ontologies
• Reasons over data in relational databases
• No inferencing on load
  • hence deals better with fast changing data
  • the downside: reasoning is performed at query time
• Highly scalable: reasons on 7.7M records in 7.9 s
  • State of the art cannot run on more than 1 million records on a 64-bit dual-processor machine with a 4 GB heap
• Can scale to more than 60 million triples
• Semantically indexed 300 million triples from the medical literature
• Tolerates inconsistencies
• Provides explanations


Scalability via Summarization (ISWC 2006)
[Figure. Legend: C = Course, P = Person, M = Man, W = Woman, H = Hobby. TBox: Functional(isTaughtBy), Disjoint(Man, Woman). Original ABox: C1, C2, M1, M2, P1, P2, H1, H2, linked by isTaughtBy and likes. Summary: C’{C1, C2}, M’, P’, H’, linked by isTaughtBy and likes.]
• The summary mapping function f satisfies the constraints:
  • If an individual a is an explicit member of a concept C in the original Abox, then f(a) is an explicit member of C in the summary Abox.
  • If a≠b is explicitly in the original Abox, then f(a)≠f(b) is explicitly in the summary Abox.
  • If a relation R(a, b) exists in the original ABox, then R(f(a), f(b)) exists in the summary.
• If the summary is consistent, then the original Abox is consistent (the converse is not true).
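
The mapping constraints above can be illustrated with a small sketch. It assumes that individuals sharing exactly the same set of explicit concepts are mapped to one summary node, and it leaves out a≠b assertions. The function name, node naming scheme, and the toy edges are illustrative assumptions, not SHER's API or the exact figure.

```python
def summarize(types, roles):
    """Return (f, summary_types, summary_roles) for a toy ABox.

    types: dict mapping each individual to its set of explicit concepts
    roles: list of (role, subject, object) assertions
    """
    node_for = {}  # concept set -> summary node name
    f = {}         # the summary mapping function
    for ind in sorted(types):
        key = frozenset(types[ind])
        if key not in node_for:
            # Assumption: one summary node per distinct concept set.
            node_for[key] = "+".join(sorted(key)) + "'"
        f[ind] = node_for[key]
    # If a is an explicit member of C, then f(a) is an explicit member of C.
    summary_types = {node: set(key) for key, node in node_for.items()}
    # If R(a, b) is in the original ABox, R(f(a), f(b)) is in the summary.
    summary_roles = {(r, f[a], f[b]) for (r, a, b) in roles}
    return f, summary_types, summary_roles


# Toy data loosely modelled on the slide's figure; the exact edges are assumed.
types = {
    "C1": {"Course"}, "C2": {"Course"},
    "M1": {"Man"}, "M2": {"Man"},
    "P1": {"Person"}, "P2": {"Person"},
    "H1": {"Hobby"}, "H2": {"Hobby"},
}
roles = [
    ("isTaughtBy", "C1", "M1"), ("isTaughtBy", "C1", "P1"),
    ("isTaughtBy", "C2", "M2"), ("isTaughtBy", "C2", "P2"),
    ("likes", "P1", "H1"), ("likes", "P2", "H2"), ("likes", "M2", "H2"),
]

f, summary_types, summary_roles = summarize(types, roles)
print(f["C1"], f["C2"])
print(len(roles), "role assertions ->", len(summary_roles))
```

In this toy case the eight individuals collapse to four summary nodes, which is the effect the summary relies on for scalability.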

Summarization effectiveness
[Table not recoverable. Legend: I = instances after summarization; RA = role assertions after summarization.]

Scalability via Filtering (ISWC 2006)
• For expressive ontologies, query answering can be reduced to a consistency check on the Abox.
• For the SHIN subset of DL (OWL-DL minus datatype reasoning and nominals), only certain types of relations are key to finding an inconsistency.
• Specifically, any relation R that appears in a universal restriction (∀S.C) or a maximum cardinality restriction (≤nS) is key to finding inconsistencies.
• All relations that do not participate in such concept expressions can be filtered, provided we can compute all relevant concepts in the ontology…
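
As a rough sketch of the filtering step, the snippet below drops every role assertion whose role is not in a given set of key roles. It assumes the key roles (including any sub-roles) have already been computed from the TBox, and it does not model the "relevant concepts" proviso. The function name and toy data are hypothetical.

```python
def filter_roles(roles, key_roles):
    """Keep only the role assertions whose role is a key role."""
    return [(r, a, b) for (r, a, b) in roles if r in key_roles]


# Functional(isTaughtBy) is the maximum cardinality restriction <=1 isTaughtBy,
# so isTaughtBy is a key role. Assumption: likes appears in no universal or
# maximum cardinality restriction in this toy TBox.
key_roles = {"isTaughtBy"}
roles = [
    ("isTaughtBy", "C1", "M1"),
    ("isTaughtBy", "C2", "P2"),
    ("likes", "P1", "H1"),
    ("likes", "P2", "H2"),
]

filtered = filter_roles(roles, key_roles)
print(len(roles), "->", len(filtered), "role assertions")
```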

Filtering effectiveness
[Table not recoverable. Legend: I = instances after filtering; RA = role assertions after filtering.]

Refinement (AAAI 2007)
• What if the summary is inconsistent? Either
  • the original ABox has a real inconsistency, or
  • the ABox was consistent but the process of summarization introduced a fake inconsistency in the summary
• Therefore, we follow a process of refinement to check for a real inconsistency
  • Refinement = selectively decompress portions of the summary
  • Use justifications for the inconsistency to select the portion of the summary to refine
  • Justification = minimal set of assertions responsible for the inconsistency
  • Repeat the process iteratively until the refined summary is consistent or the justification is “precise”

Refinement: Resolving inconsistencies in a summary
[Figure. Legend: C = Course, P = Person, M = Man, W = Woman, H = Hobby. TBox: Functional(isTaughtBy), Disjoint(Man, Woman).
Original ABox: C1, C2, C3, M1, M2, P1, P2, P3, W1, H1, H2, linked by isTaughtBy and likes.
Summary: C’{C1, C2, C3}, M’, P’, W’, H’. Summary is inconsistent.
After 1st refinement: Cx’{C1, C2} and Cy’{C3} replace C’. Summary still inconsistent!
After 2nd refinement: Px’{P1, P2} and Py’{P3} replace P’. Consistent summary.]

Refinement: Solving Membership Query (AAAI 2007)
Sample Q: PeopleWithHobby?
[Figure. Legend: C = Course, P = Person, M = Man, W = Woman, H = Hobby. TBox: Functional(isTaughtBy), Disjoint(Man, Woman).
Original ABox: C1, C2, C3, M1, M2, P1, P2, P3, W1, H1, H2, linked by isTaughtBy and likes.
Summary: C’{C1, C2, C3}, M’, P’, W’, H’, with Not(Q) attached to summary individuals. Summary is inconsistent.
After 1st refinement: Cx’{C1, C2} and Cy’{C3} replace C’. Summary still inconsistent!
After 2nd refinement: Px’{P1, P2} and Py’{P3} replace P’. Consistent summary.]
Solns: P1, P2

Results: Consistency Check

Results: Membership Query Answering

Improving SHER Performance through integration with a fast but incomplete reasoner
• Refinement
  • Critical for completeness
  • But time consuming (joins between large tables): majority of time spent in refinement
• However, a lot of solutions are “easily” detected using query expansion
  • e.g. ColonNeoplasm(x) = Disease(x) ^ hasAssociatedMorphology(x, y) ^ Neoplasm(y) ^ hasFindingSite(x, z) ^ Colon(z)
• Improved SHER performance by adding a Query Expansion module
• General idea:
  • Quickly find solutions to the query
  • Refine the summary to isolate solution individuals
  • Test remaining individuals
• Advantages:
  • Any sound technique to find solutions quickly can be used (QE, forward-chaining based rule system)
  • Much less refinement required if the above technique finds many solutions
  • Depending on the expressivity of the logic, you may not need refinement at all
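
The query expansion idea can be sketched as a plain match of the expanded ColonNeoplasm query against explicit assertions only. This is sound but incomplete: individuals it misses still have to be tested by the complete summarization/refinement procedure. The function name and toy data are hypothetical; this is not SHER's Query Expansion module.

```python
def colon_neoplasm_by_expansion(types, roles):
    """Match the expanded query against explicit assertions only."""
    def linked(x, role, concept):
        # True if x has an explicit `role` edge to an explicit member of `concept`.
        return any(
            r == role and a == x and concept in types.get(b, set())
            for (r, a, b) in roles
        )

    return {
        x for x, concepts in types.items()
        if "Disease" in concepts
        and linked(x, "hasAssociatedMorphology", "Neoplasm")
        and linked(x, "hasFindingSite", "Colon")
    }


# Toy ABox (assumed): d1 matches explicitly, d2 lacks an explicit morphology.
types = {
    "d1": {"Disease"}, "d2": {"Disease"},
    "m1": {"Neoplasm"}, "s1": {"Colon"}, "s2": {"Colon"},
}
roles = [
    ("hasAssociatedMorphology", "d1", "m1"),
    ("hasFindingSite", "d1", "s1"),
    ("hasFindingSite", "d2", "s2"),
]

quick = colon_neoplasm_by_expansion(types, roles)
# Everything not found quickly is left for the complete reasoner.
remaining = set(types) - quick
print(sorted(quick), sorted(remaining))
```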

SHER Hybrid Algorithm Evaluation
• Avg. query answering time for the Clinical Trials use case down to 15 minutes
• Huge reduction in the number of refinement steps

Conjunctive Query in SHER (ISWC 2008)
• SHER also supports grounded conjunctive queries (CQ), which combine membership/type queries (MQ) and relationship queries (RQ) of the form R(x, y)
  • GraduateStudent(x) ^ isMemberOf(x, y) ^ Department(y) ^ subOrganizationOf(y, z)
• Solving CQ is much harder than solving MQ
  • The summarization/refinement algorithm does not directly apply to RQ
  • Intuitively, summarization groups individuals based on type, which works well for type queries, but for relationship queries we need to consider pairs of individuals
• Alternate 3-step approach:
  • Use a Datalog rule engine to estimate potential relationships
  • Use various heuristics to find definite relationships
  • Test remaining relationships in the summary and solve by splitting
• Advantage:
  • Graceful degradation depending on the complexity of the query and ontology/data (very fast on realistic queries and datasets)

Conjunctive Query Evaluation Comparison with KAON2 on the UOBM Benchmark


Matching Patient Records to Clinical Trials Using Ontologies
• In collaboration with Columbia University Medical Center: Chintan Patel and James Cimino
• Work presented at ISWC 2007

Problem
• In complex domains such as healthcare, there is a “semantic gap” between data and queries. E.g.,
Patient data | Queries
Patient on Hydrocortisone 2% | Patients on drugs with steroids as ingredients
Patient tested positive for mycobacterium tuberculosis | Patients with tuberculosis meningitis
• Can ontology analytics using ontologies such as SNOMED be used to bridge this gap?
• Case study on the patient recruitment problem for clinical trials

Clinical Trials Matching
Current scenario: a day in the life of Columbia’s Clinical Trials Investigator
• Look at criteria in the trial protocol
• Pore through patient charts
• Call physician to discuss consent
Result: Poor participation in clinical trials!
Can ontology analytics be used to find patients that match clinical trial criteria to improve participation?

Clinical Trials Matching
What we want to do: a day in the life of Columbia’s Clinical Trials Investigator
• Look at criteria in the trial protocol
• Find patients automatically
• Call physician to discuss consent
[Figure: a query for the criteria goes to an ontology reasoner, which uses ontologies (SNOMED) and patient data to find matching patients.]

Technical challenges to using ontologies
• Knowledge engineering
  • NY Presbyterian Clinical Data Repository is coded in MED
  • Need to map local knowledge (MED) to domain knowledge (SNOMED), e.g., presence of MRSA on a lab test means lab test hasCausativeAgent MRSA.

Technical challenges to using ontologies
• Scalability of reasoning
  • ABox (patient data): 1 year of data at Columbia (250K patients), 60M RDF triples
  • TBox (SNOMED+MED, 461K concepts)
• Expressivity of reasoning: SNOMED is EL++, but the ABox contains negation, e.g.,
  • Lab results ruled out the presence of an organism
• Inconsistent, noisy and incomplete data
  • Lab results that indicate both the presence and absence of an organism.

Overall Solution
[Figure: patient data (e.g. "LabA MRSAOrganism Present") is coded in MED (local knowledge) and converted by ETL into the Abox (e.g. LabA: ∃causativeAgent.MRSAOrganism). A semi-automatic mapping links MED to SNOMED (domain knowledge) to form the integrated Tbox. The SHER ontology reasoner evaluates a query (e.g. ∃associatedObservation.MRSA) over the Tbox and Abox and returns the matching patients.]

Results for ~250K patients
• Optimizations to use query expansion to quickly compute and remove “obvious” solutions.

AlphaWorks Service: Anatomy Lens
• Ontology-based PubMed search
• GOAL: real-time ontology reasoning on the Web
• Overcome keyword search: poor precision & recall
• Link 3 large OWL ontologies: FMA, GO, MeSH
• Dataset size: 16 million MEDLINE articles, ~300M triples
• Support structured queries:
  • Find articles about “neuron development” (GO Process) in the “Hippocampus” region (FMA Part) of the brain
  • Possible solution article may be about “dendrite morphogenesis” in the “Archicortex”

Real-time Reasoner for Service
• EL+ reasoner in SHER for Anatomy Lens
• Many HCLS ontologies fall in this EL+ fragment of OWL
• Highly optimized:
  • Classification times:
    • GO (32K): 3 s
    • FMA (75K concepts): 30 s
    • SNOMED (350K concepts): 8 min
  • State-of-the-art reasoner CEL takes >2 hrs on SNOMED
• Additional features:
  • Incremental reasoning
  • Explanations support

Anatomy Lens Demo • Online Video: http://anatomylens.alphaworks.ibm.com/AnatomyLens/AnatomyLensVideo/AnatomyLensVideo.html

Scalable Cleanup of Information Extraction Data Using Ontologies
• In collaboration with Christopher Welty, James Fan, and William Murdock
• Presented at ISWC 2007

Problem
• Text extraction from natural language is imperfect
• Relationship extraction is especially problematic, e.g.
  • ...the decision in September 1991 to withdraw tactical nuclear bombs, missiles and torpedos from US Navy ships...
  • Text extraction:
    • nuclear ownerOf bombs
    • nuclear type Weapon, bombs type Weapon
• Can ontology reasoning be used to improve relationship extraction?

Background
• SemantiClean (ISWC-2006)
• Ontology
  • ownerOf domain (Person ⊔ Organization)
  • Person disjointFrom Organization
  • Person disjointFrom Weapon
  • Organization disjointFrom Weapon...
• Add a triple at a time
• Check with a DL reasoner
• Discard if inconsistent, e.g. nuclear ownerOf bomb
• Improves relationship extraction by 8-15%.
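
A minimal sketch of the triple-at-a-time loop follows. A real DL reasoner is replaced here by a hand-written check that only understands the domain and disjointness axioms listed above, so it is a stand-in and not SemantiClean itself. The US_Navy triples are extra illustrative data, and all names are hypothetical.

```python
from collections import defaultdict

# Toy ontology from the slide.
DOMAIN = {"ownerOf": {"Person", "Organization"}}
DISJOINT = {
    frozenset(("Person", "Organization")),
    frozenset(("Person", "Weapon")),
    frozenset(("Organization", "Weapon")),
}


def disjoint(c, d):
    return frozenset((c, d)) in DISJOINT


def is_consistent(triples):
    """Stand-in for a DL reasoner: handles only domain and disjointness."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "type":
            types[s].add(o)
    # No individual may have two asserted types that are disjoint.
    for concepts in types.values():
        if any(disjoint(c, d) for c in concepts for d in concepts):
            return False
    # A subject must be able to belong to at least one class of the domain.
    for s, p, o in triples:
        if p in DOMAIN:
            if all(any(disjoint(c, d) for c in types[s]) for d in DOMAIN[p]):
                return False
    return True


def clean(extracted):
    """Add one triple at a time and discard it if the result is inconsistent."""
    kept = []
    for triple in extracted:
        if is_consistent(kept + [triple]):
            kept.append(triple)
    return kept


extracted = [
    ("nuclear", "type", "Weapon"),
    ("bombs", "type", "Weapon"),
    ("nuclear", "ownerOf", "bombs"),       # clashes: a Weapon cannot own
    ("US_Navy", "type", "Organization"),   # assumed extra extraction
    ("US_Navy", "ownerOf", "ships"),       # assumed extra extraction
]
kept = clean(extracted)
print(kept)
```

Note that the outcome depends on the order in which triples arrive, since a triple is judged only against what has been kept so far.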

Evaluating the Triple at a Time Approach
• Scalability
  • Text extraction on a normal desktop can process a million documents/day.
  • Each document extracts ~70 entities, ~40 relations.
  • Consistency detection in DL reasoners does not scale to such large RDF graphs.

Computational Experience
[Table not recoverable; the only surviving column header is "Dataset".]

Conclusions
• SHER
  • Reasons over highly expressive ontologies
  • Reasons on data in relational databases
  • No inferencing on load, hence deals better with fast changing data
  • Integrates with fast incomplete reasoners
  • Highly scalable: reasons on 7.7M records in 7.9 s
  • Semantically indexed 300 million triples from the medical literature
  • Tolerates inconsistencies
  • Provides explanations
• Many applications:
  • Semantic matching for clinical trials
  • Semantic search over PubMed
  • Scalable text analytics cleanup
• What next?
  • SHER code release scheduled for the end of June 2008

THANKS! QUESTIONS?
More on SHER: http://domino.research.ibm.com/comm/research_projects.nsf/pages/iaa.index.html

BACKUP

Integrating MED and SNOMED
[Figure: MED (100,210 concepts) and SNOMED (361,824 concepts) are combined into an integrated Tbox through a mapping that uses UMLS, NLP (MMTX), and manual work.]
• 17,446 concepts in MED directly mapped by subclass relations to SNOMED concepts (17% of MED).
• Including subclasses of the 17,446 concepts, the coverage of MED is 75,514 concepts.
• 88% of concepts in the Abox were covered by the integrated Tbox.

Modeling patient data in SNOMED
• Modeling positive and negative results, e.g.
  • LabEventA MRSAOrganism Absent
  • is modeled as:
  • LabEventA: ∀causativeAgent.¬MRSAOrganism
• Modeling groupings of events
  • RadiologyEventA findingSite Colon
  • RadiologyEventA morphology Neoplasm
  • is modeled as:
  • RadiologyEventA: ∃roleGroup.(∃hasMorphology.Neoplasm ⊓ ∃hasFindingSite.Colon)
[Figure: ETL turns the patient data "LabA MRSAOrganism Present" into the Abox assertion LabA: ∃causativeAgent.MRSAOrganism.]

Validation (100 patient records)
• Misses primarily due to incorrect mappings.
• No false positives

Solutions to Challenges
• Large Aboxes (patient data): reason on a summarized version of the data (ISWC 2006).
• Large Tboxes (SNOMED+MED): compute the closure of concepts in the Abox, which is 22,561 concepts.
• Incomplete data. MRSA defined as
  • ∃hasCausativeAgent.MRSAOrganism ⊓ Infection
  • Patient record will never indicate infection, hence no matches.
  • Convert all conjuncts to disjuncts for user specified concepts (e.g., MRSA Disorder) in the query.
  • ∃hasCausativeAgent.MRSAOrganism ⊔ Infection

Justification based consistent subset
• Justification based consistent subset by example:
  • Justifications:
    • J1: x, y, m
    • J2: x, y, z
    • J3: y, q
  • Set of removed assertions:
    • x removed because of J1
    • y cannot be removed because of J2
    • BUT y can be removed due to J3
• BUT for knowledge bases filled with thousands of inconsistencies, even this justification based consistent subset computation may not scale.
• Approximate cleanup technique
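
One reading of the example is a single pass over the justifications: a justification that already contains a removed assertion is broken and needs nothing more, otherwise one of its assertions is removed. The sketch below follows that reading. Removing the first listed assertion is an arbitrary assumption, and the function name is hypothetical.

```python
def removals_for_consistent_subset(justifications):
    """Pick assertions to remove so that every justification is broken."""
    removed = []
    for justification in justifications:
        # Already broken by an earlier removal: nothing more to remove.
        if any(assertion in removed for assertion in justification):
            continue
        # Assumption: remove the first assertion listed in the justification.
        removed.append(justification[0])
    return removed


justifications = [
    ("x", "y", "m"),  # J1
    ("x", "y", "z"),  # J2
    ("y", "q"),       # J3
]
removed = removals_for_consistent_subset(justifications)
print(removed)
```

On the slide's data this removes x for J1, skips J2 because x is already gone, and removes y for J3.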

Approximate cleanup
• Summarization: perform consistency detection on a summarized version of the larger RDF graph.
[Figure: original data (Abox) with a:Nation, b:Organization, c:Nation, d:Person, e:Organization, f:Nation, linked by partOf, ownerOf, and residentOf; summary Abox with u:Nation, v:Organization, s:Person.]
• Mapping function f satisfies:
  • If a:C ∈ A, then f(a):C ∈ A’
  • If R(a, b) ∈ A, then R(f(a), f(b)) ∈ A’
  • If a≠b ∈ A, then f(a) ≠ f(b) ∈ A’
• If A’ is consistent, then A is consistent. The converse does not hold.

Isolating an inconsistency
• Check consistency
• Find a justification (minimal set of assertions that cause the inconsistency), e.g., u ownerOf s
• Refine the summary to make the justification more precise.
• To make a justification more precise, refine summary individuals in the justification by the sets of role assertions they have.
[Figure: summary Abox with u:Nation, v:Organization, s:Person and u ownerOf s, where a, c, f in A are mapped to u. Refined summary: a is mapped to u, c and f are mapped to u’, and u ownerOf s.]
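
Refining a summary individual by role assertion sets can be sketched as follows: the individuals mapped to one summary node are regrouped by the roles (and neighbouring summary nodes) they take part in. The edges of the toy ABox are assumed, since the figure's edges are not recoverable, and the function and naming scheme are hypothetical.

```python
from collections import defaultdict


def refine(node, f, roles):
    """Split summary node `node` by the role assertions of its members."""
    groups = defaultdict(list)
    members = sorted(ind for ind, n in f.items() if n == node)
    for ind in members:
        signature = frozenset(
            [("out", r, f[b]) for (r, a, b) in roles if a == ind]
            + [("in", r, f[a]) for (r, a, b) in roles if b == ind]
        )
        groups[signature].append(ind)
    new_f = dict(f)
    # The first group keeps the old name; later groups get primes appended.
    for k, group in enumerate(groups.values()):
        for ind in group:
            new_f[ind] = node + "'" * k
    return new_f


# Assumed toy ABox: only a owns something; c and f are residences of d.
f = {"a": "u", "c": "u", "f": "u", "b": "v", "e": "v", "d": "s"}
roles = [
    ("ownerOf", "a", "d"),
    ("residentOf", "d", "c"),
    ("residentOf", "d", "f"),
    ("partOf", "b", "e"),
]

refined = refine("u", f, roles)
print(refined)
```

With these assumed edges, a stays on u while c and f move to u', matching the refined summary described on the slide.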

Is the inconsistency real?
• Stop refining when a justification is precise.
• A justification J is precise when, for every summary individual s ∈ J and every role assertion R(s, t) ∈ J, each individual a ∈ A with f(a) = s has an individual b ∈ A such that f(b) = t and R(a, b) ∈ A.
• Key to scalable inconsistency detection: a precise justification J in which each individual in J has many thousands of individuals in A mapped to it.
[Figure: refined summary with a precise justification. a is mapped to u; c, f are mapped to u’; d is mapped to s; u ownerOf s; v is also shown.]
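
The precision condition translates fairly directly into a check over the role assertions of a justification. The sketch below implements the condition as stated on the slide; the data and names are assumptions chosen to mirror the u, u’, s example.

```python
def is_precise(justification_roles, f, roles):
    """Check the slide's precision condition for a justification.

    justification_roles: role assertions R(s, t) over summary individuals
    f: mapping from original individuals to summary individuals
    roles: role assertions (role, a, b) of the original ABox
    """
    role_set = set(roles)
    for (r, s, t) in justification_roles:
        sources = [a for a, node in f.items() if node == s]
        targets = [b for b, node in f.items() if node == t]
        for a in sources:
            # Every a mapped to s needs some b mapped to t with R(a, b).
            if not any((r, a, b) in role_set for b in targets):
                return False
    return True


# Assumed toy ABox: only a owns d.
roles = [
    ("ownerOf", "a", "d"),
    ("residentOf", "d", "c"),
    ("residentOf", "d", "f"),
]
justification = [("ownerOf", "u", "s")]

before = {"a": "u", "c": "u", "f": "u", "d": "s"}    # a, c, f all mapped to u
after = {"a": "u", "c": "u'", "f": "u'", "d": "s"}   # c, f split off to u'

print(is_precise(justification, before, roles))
print(is_precise(justification, after, roles))
```

Before refinement the check fails because c and f own nothing; after refinement only a is mapped to u, so the justification is precise.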

Cleaning up inconsistencies
• Once a precise justification is found, check if it is conclusive. A precise justification is conclusive if, for example:
  • it is acyclic
  • it is cyclic, but can be shown to be acyclic after the application of deterministic tableau rules
• No real use cases where justifications are not conclusive.
• Remove a single assertion of a precise, conclusive justification. Iterate to find all justifications, until the knowledge base is consistent.
