
An Algebraic Approach to Articulate Ontologies for Information Integration

This presentation discusses the use of ontologies in information integration: their functions, how they are established, and how heterogeneity and semantic mismatches among domains can be addressed. It also examines the alternatives for ontology sharing and proposes an ontology algebra based on articulation rules.


Presentation Transcript


  1. An Algebraic Approach to Articulate Ontologies for Information Integration
     February 2001
     Gio Wiederhold, Stanford University
     Gio Wiederhold SKC 1

  2. What are Ontologies?
     Ontologies list the terms and their relationships that allow communication among partners in enterprises (in machine-readable form).
     Relationships determine meaning: parent, school, company.
     Databases use ontologies (implicitly) during design in their E-R diagrams and represent the leaf nodes in their schemas.
     Knowledge bases use ontologies (often implicitly) and add class definitions (to hold instances), constraints, and, sometimes, operations among the terms.

  3. Functions of Ontologies
     • Enable Precision in Understanding
       - People = designers, implementors, users, maintainers
       - Systems = implementors = users = maintainers
     • Share the Cost of Knowledge Acquisition & Maintenance
       - reuse encoded knowledge, remain up-to-date as domains change
     • Enable Information Interoperation *
     * Define the terms that link domains

  4. Ancestors of Ontologies
     • Lexicons: collect terms used in information systems
     • Taxonomies: categorize, abstract, classify terms
     • Schemas of databases: attributes, ranges, constraints
     • Data dictionaries: systems with multiple files, owners
     • Object libraries: grouped attributes, inheritance, methods
     • Symbol tables: terms bound to implemented programs
     • Domain object models (XML DTD): interchange terms
     • . . .
     More Knowledge formalized

  5. Establishing Ontologies
     Top-down:
     • Commonly acceptable UPPER layers
     Domain-specific
     • Sharing tools
     • Object based
     Bottom-up
     • Pragmatic, TASK-specific collections
     • Database schemas and models

  6. Heterogeneity among Domains
     If interoperation involves distinct domains, mismatch ensues
     • Autonomy conflicts with consistency
     • Local Needs have Priority
     • Outside uses are a Byproduct
     Heterogeneity must be addressed
     • Platform and Operating Systems
     • Representation and Access Conventions
     • Naming and Ontology

  7. Semantic Mismatches
     Information comes from many autonomous sources
     • Differing viewpoints (by source)
       - differing terms for similar items: { lorry, truck }
       - same terms for dissimilar items: trunk (luggage, car)
       - differing coverage: vehicles (DMV, AIA)
       - differing granularity: trucks (shipper, manuf.)
       - different scope: student (museum fee, Stanford)
     • Hinders use of information from disjoint sources
       - missed linkages: loss of information, opportunities
       - irrelevant linkages: overload on user or application program
     • Poor precision when merged
     Still OK for web browsing, poor for business & science

  8. More precision is needed as data volume increases -- a small error rate still leads to too many errors
     • False positives have to be investigated (attractive-looking supplier that makes toys; apparent drug target with poor annotation)
     • False negatives cause lost opportunities, suboptimal to some degree
     • False positives (poor precision) typically cost more than false negatives (poor recall)
     [Figure: "Information Wall". Axes: information quantity, need for precision. Labels: human limit, human with tools?, data errors, acceptable limit. Adapted from Warren Powell, Princeton Un.]

  9. Ontology Sharing: Three Alternatives
     • Create a committee to define everybody’s terms
       - Takes many years, until people are worn out
       - Ignored when changes make deviation necessary
     • Get all terms and put them into a large model [ Cyc, UMLS, Federated Schemas, . . . ]
       - Can be rapid
       - Ignores conflicts
       - Hard to maintain (requires committee)
     • Keep all Terms distinct, except where sharing
       - Requires initial effort
       - Empowers participants

  10. Proposed Language Solutions
      Specify and define terminology usage: ontology
      • Domain-specific ontologies (XML DTD assumption)
        - Small, focused, cooperating groups
        - high quality, some examples: genomics, arthritis, Shakespeare plays
        - allows sharable, formal tools
        - ongoing, local maintenance affecting users: annual updates
        - poor interoperation, users still face inter-domain mismatches
      • Cannot achieve global consistency
        - wonderful for users and their programs
        - too many interacting sources
        - long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all ?
        - costly maintenance, since all sources evolve
        - no world-wide authority to dictate conformance

  11. An unsolved problem
      Common assumption in assembling and integrating distributed information resources
      • The language used by the resources is the same
      • Sublanguages used by the resources are subsets of a globally consistent language
      This assumption is provably false
      Working towards the goal of global consistency is
      1. naïve -- the goal cannot be achieved
      2. inefficient -- languages are efficient in local contexts

  12. General Ontologies?
      • Have all the Knowledge together
        - simple for customers of KBs
        - hard for owners of KBs
      • Large KB will cover multiple domains
        - created by a committee -- slow
        - maintained by a committee -- costly
      • Differences in level of abstraction -- efficiency
        - homeowner: nail
        - carpenter: sinker, brad, box nail, . . .

  13. Structural Heterogeneity

  14. Domains and Consistency
      • a domain will contain many objects
      • the object configuration is consistent
      • within a domain all terms are consistent &
      • relationships among objects are consistent
      • context is implicit
      No committee is needed to forge compromises * within a domain
      * Compromises hide valuable details
      [Diagram label: Domain Ontology]

  15. SKC grounded definition
      • Ontology: a set of terms and their relationships
      • Term: a reference to real-world and abstract objects
      • Relationship: a named and typed set of links between objects
      • Reference: a label that names objects
      • Abstract object: a concept which refers to other objects
      • Real-world object: an entity instance with a physical manifestation

  16. Domain-specific Expertise
      Knowledge needed is huge
      • Partition into natural domains
      • Determine domain responsibility and authority
      • Empower domain owners
      • Provide tools
      Consider interaction
      Society of specialists

  17. An Ontology Algebra
      A knowledge-based algebra for ontologies
      • Intersection: create a subset ontology (keep sharable entries)
      • Union: create a joint ontology (merge entries)
      • Difference: create a distinct ontology (remove shared entries)
      The Articulation Ontology (AO) consists of matching rules that link domain ontologies
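
The three operations of the ontology algebra can be illustrated with a minimal Python sketch. It is not the SKC implementation: it assumes an ontology is just a set of term names and an articulation rule is a pair of matched terms, and the function names and the `Store`/`Factory` qualifiers are hypothetical. The term lists are taken from the shoe store and shoe factory example in this deck.

```python
def intersection(a, b, rules):
    """Subset ontology: keep only the term pairs linked by an articulation rule."""
    return {(s, t) for (s, t) in rules if s in a and t in b}


def difference(a, b, rules):
    """Distinct ontology: remove from `a` the entries it shares with `b`."""
    shared = {s for (s, _) in intersection(a, b, rules)}
    return a - shared


def union(a, b, rules, a_name, b_name):
    """Joint ontology: merge entries, collapsing each linked pair into one entry.

    Terms are qualified by their source name so that unmatched homonyms
    (for example the two Employees terms) stay distinct.
    """
    linked = intersection(a, b, rules)
    merged = {frozenset({f"{a_name}.{s}", f"{b_name}.{t}"}) for s, t in linked}
    matched_a = {s for s, _ in linked}
    matched_b = {t for _, t in linked}
    merged |= {frozenset({f"{a_name}.{s}"}) for s in a - matched_a}
    merged |= {frozenset({f"{b_name}.{t}"}) for t in b - matched_b}
    return merged


store = {"Shoes", "Customers", "Employees"}
factory = {"Material inventory", "Employees", "Machinery", "Processes", "Shoes"}
rules = {("Shoes", "Shoes")}  # assumed single matching rule, for illustration only

shared = intersection(store, factory, rules)
local_only = difference(store, factory, rules)
joint = union(store, factory, rules, "Store", "Factory")
```

Only terms named in a rule are treated as shared, so identical spelling alone never links two terms; that mirrors the idea that the articulation, not the vocabulary, carries the match.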

  18. Sample Operation: INTERSECTION
      Source Domain 1: Owned and maintained by Store
      Source Domain 2: Owned and maintained by Factory
      Result contains shared terms
      Terms useful for purchasing

  19. INTERSECTION support
      Store Ontology, Factory Ontology
      Articulation ontology: matching rules that use terms from the 2 source domains
      Terms useful for purchasing

  20. Sample Intersections
      Shoe Factory
      • Material inventory { . . . }
      • Employees { . . . }
      • Machinery { . . . }
      • Processes { . . . }
      • Shoes { . . . }
      Shoe Store
      • Shoes { . . . }
      • Customers { . . . }
      • Employees { . . . }
      Articulation ontology matching rules:
      • size = size
      • color = table(colcode)
      • style = style
      Department Store
      • Anatomy { . . . }
      • Hardware
      [Diagram labels: foot = foot; Employees, Employees; Nail (toe, foot), Nail (fastener); . . .]
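
The matching rules on this slide (size = size, color = table(colcode), style = style) can be read as small translation functions. The sketch below is an illustration only: it assumes that the factory records a numeric `colcode` while the store uses colour names, and the table contents, record fields, and the name `articulate` are all hypothetical.

```python
# Hypothetical colour-code table; the slide only states color = table(colcode).
COLOR_TABLE = {1: "black", 2: "brown"}

# One rule per store attribute: how to derive it from a factory record.
SHOE_RULES = {
    "size": lambda rec: rec["size"],                    # size = size
    "color": lambda rec: COLOR_TABLE[rec["colcode"]],   # color = table(colcode)
    "style": lambda rec: rec["style"],                  # style = style
}


def articulate(factory_record, rules=SHOE_RULES):
    """Translate a factory record into the store's vocabulary.

    Attributes without a rule (here: the machine id) are not part of the
    articulation and stay under the factory's local control.
    """
    return {attr: rule(factory_record) for attr, rule in rules.items()}


factory_shoe = {"size": 42, "colcode": 2, "style": "loafer", "machine": "M7"}
store_view = articulate(factory_shoe)
```

Keeping the rules in a separate table, apart from both source ontologies, reflects the slide's point that the articulation ontology is its own maintained artifact.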

  21. Other Basic Operations
      DIFFERENCE: material fully under local control
      UNION: merging entire ontologies
      [Diagram labels: Articulation ontology; typically prior intersections]

  22. Features of an algebra
      Operations can be composed
      Operations can be rearranged
      Alternate arrangements can be evaluated
      Optimization is enabled
      The record of past operations can be kept and reused

  23. Sample Processing in HPKB
      What is the most recent year an OPEC member nation was on the UN security council (SC)?
      Related to DARPA HPKB Challenge Problem
      SKC resolves 3 Sources
      • CIA Factbook ‘96 (nation)
      • OPEC (members, dates)
      • UN (SC members, years)
      SKC obtains the Correct Answer: 1996 (Indonesia)
      Other groups obtained more, but factually wrong, answers
      Problems resolved by SKC
      • Factbook, a secondary source, has out-of-date OPEC & UN SC lists
        - Indonesia not listed
        - Gabon (left OPEC 1994)
      • different country names: Gambia => The Gambia
      • historical country names: Yugoslavia
      • UN lists future security council members: Gabon 1999
      • needed ancillary data

  24. Interoperation via Articulation
      At application definition time
      • Match ontologies
      • Establish articulation rules
      • Record the process
      At execution time
      • Query rewriting
      • Optimization based on an Ontology Algebra
      For maintenance
      • Regenerate rules using the stored formulation

  25. Semi-automatic approach
      Provide library of automatic match heuristics
      • Lexical Methods -- spelling
      • Structural Methods -- relative graph position
      • Reasoning-based Methods
      • Nexus
      • Hybrid Methods
      • Iterative/Non-iterative Methods
      GUI tool to
      - display matches and
      - verify generated matches using human expert
      - expert can also supply matching rules

  26. Articulation Generator
      Being built by Prasenjit Mitra
      [Diagram components: Ont1, Ont2, Thesaurus, Context-based Word Relator, Phrase Relator, Driver, Semantic Network (Nexus), Structural Matcher, Human Expert, OntA]

  27. Lexical Methods
      • Preprocessing rules
        - Expert-generated seed rules, e.g., (Match O1.President O2.PrimeMinister)
        - Context-based preprocessing directives
      • Thesaurus -- synonyms, relationships
      • Distance of words as measure of relatedness
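
As a rough illustration of how the three lexical ingredients above (seed rules, thesaurus, word distance) could be combined, the following sketch suggests candidate matches between two term lists. It is an assumption-laden toy, not the SKC matcher: the thesaurus entry, the 0.8 threshold, and the function names are hypothetical, and spelling similarity is approximated with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Seed rule taken from the slide: (Match O1.President O2.PrimeMinister).
SEED_RULES = {("President", "PrimeMinister")}
# Hypothetical thesaurus content: unordered synonym pairs, lower-cased.
THESAURUS = {frozenset({"lorry", "truck"})}


def spelling_similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for a word-distance measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_matches(terms1, terms2, threshold=0.8):
    """Return (term1, term2, reason) suggestions for an expert to verify."""
    suggestions = []
    for t1 in sorted(terms1):
        for t2 in sorted(terms2):
            if (t1, t2) in SEED_RULES:
                suggestions.append((t1, t2, "seed"))
            elif frozenset({t1.lower(), t2.lower()}) in THESAURUS:
                suggestions.append((t1, t2, "thesaurus"))
            elif spelling_similarity(t1, t2) >= threshold:
                suggestions.append((t1, t2, "spelling"))
    return suggestions


found = suggest_matches(
    {"President", "Lorry", "Vehicle"},
    {"PrimeMinister", "Truck", "Vehicles"},
)
```

The output is deliberately a list of suggestions with reasons rather than final rules, since the approach described here keeps a human expert in the loop.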

  28. Tools to create articulations
      Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)
      [Screenshot labels: Vehicle registration ontology; Vehicle sales ontology; Suggestions for articulations]

  29. Tools to create articulations
      Graph matcher for Articulation-creating Expert
      [Screenshot labels: Transport ontology; Vehicle ontology; Suggestions for articulations]

  30. continue from initial point
      • Also suggest similar terms for further articulation:
        - by spelling similarity,
        - by graph position
        - by term match repository
      • Expert response to this articulation:
        1. Okay
        2. False
        3. Irrelevant
      • All results are recorded
      • Okay’s are converted into articulation rules
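
The review loop described on this slide can be sketched as follows. This is a hypothetical outline, not the SKC tool: the `Verdict` enum, the `review` function, and the scripted expert are illustrative names, and the real tool collects responses through a GUI.

```python
from enum import Enum


class Verdict(Enum):
    OKAY = "okay"
    FALSE = "false"
    IRRELEVANT = "irrelevant"


def review(suggestions, ask_expert):
    """Record every expert response; only Okay responses become rules."""
    log = []    # all results are recorded, including rejections
    rules = []  # accepted matches, converted into articulation rules
    for pair in suggestions:
        verdict = ask_expert(pair)
        log.append((pair, verdict))
        if verdict is Verdict.OKAY:
            rules.append(pair)
    return log, rules


# Scripted stand-in for the human expert (illustrative answers only).
answers = {
    ("lorry", "truck"): Verdict.OKAY,
    ("trunk", "trunk"): Verdict.FALSE,
    ("student", "student"): Verdict.IRRELEVANT,
}
log, rules = review(list(answers), answers.get)
```

Keeping the rejected suggestions in the log means the same false or irrelevant match need not be shown to the expert again when rules are regenerated.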

  31. Candidate Match Repository
      Term linkages automatically extracted from 1912 Webster’s dictionary *
      Based on processing headwords → definitions using algebra primitives
      * free, other sources have been processed.
      Notice presence of 2 domains: chemistry, transport

  32. Using the match repository

  33. Navigating the match repository

  34. Relative Arc Importance
      • PageRank (Google) limitations
        - node oriented
        - high rank to words with little semantic value
          conjunctions, articles: And, The
          prepositions, pronouns: to, it
      • Relative arc importance
        - contribution of source rank to target rank

  35. ArcRank
      • For all source s and target t nodes in graph
        - sort outgoing arcs, rank by sorted order
        - sort incoming arcs, rank by sorted order
        - for each arc compute [formula lost in extraction]
      • In ranking
        - Equal values take same rank
        - Ranks numbered consecutively
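
The ranking convention on this slide (equal values take the same rank, ranks numbered consecutively) is what is often called dense ranking. The sketch below applies it to the outgoing arcs of each source and the incoming arcs of each target. Several points are assumptions, because the slide's formulas did not survive: the relative arc importance values are taken as given input, the highest importance is assumed to receive rank 1, and combining the two ranks by their mean is a guess, not a confirmed definition of ArcRank.

```python
from collections import defaultdict


def dense_rank(values, descending=True):
    """Map each value to its rank: ties share a rank, ranks are consecutive."""
    ordered = sorted(set(values), reverse=descending)
    return {value: position + 1 for position, value in enumerate(ordered)}


def arc_ranks(importance):
    """importance: {(source, target): relative arc importance}.

    Returns {(source, target): (rank_among_outgoing, rank_among_incoming, mean)}.
    """
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for (s, t), r in importance.items():
        outgoing[s].append(r)
        incoming[t].append(r)

    out_rank = {s: dense_rank(vals) for s, vals in outgoing.items()}
    in_rank = {t: dense_rank(vals) for t, vals in incoming.items()}

    result = {}
    for (s, t), r in importance.items():
        rs, rt = out_rank[s][r], in_rank[t][r]
        result[(s, t)] = (rs, rt, (rs + rt) / 2)  # mean: an assumed combination
    return result


# Made-up importance values, for illustration only.
ranks = arc_ranks({
    ("a", "b"): 0.5,
    ("a", "c"): 0.5,
    ("a", "d"): 0.2,
    ("e", "b"): 0.9,
})
```

With dense ranking, the two arcs leaving `a` with importance 0.5 both get rank 1 and the next arc gets rank 2, with no gap in the numbering.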

  36. All Pairs Similarity
      • Compute similarity value for all node pairs
        similarity = [formula garbled in extraction; surviving terms: product of inbound arc importance vectors, product of outbound arc importance vectors]
      • Similarity Matrix
        - Initial state: nodes similar only to themselves
        - Node substitution: terms replace similar ones
        - Iterative convergence: bounded substitution

  37. Contents
      • Examples
        - Verb (Educate)
        - Adverb (Ever)
        - Proper Noun (Scotland) -- undefined term

  38. Examples (Verb)

  39. Examples (Adverb)

  40. Examples (Proper Noun)

  41. Country Graphs

  42. To be matched to

  43. Knowledge Composition
      [Diagram: knowledge resources A, B, C, D, E, linked pairwise by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), (C ∩ E). Composed knowledge for applications using A, B, C, E: articulation knowledge for (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ : union, ∩ : intersection]

  44. Primitive Operations: Model and Instance
      • Unary
        - Summarize -- structure up
        - Glossarize -- list terms
        - Filter -- reduce instances
        - Extract -- circumscription
      • Binary
        - Match -- data corroboration
        - Difference -- distance measure
        - Intersect -- schema discovery
        - Blend -- schema extension
      • Constructors: create object, create set
      • Connectors: match object, match set
      • Editors: insert value, edit value, move value, delete value
      • Converters: object - value, object indirection, reference indirection

  45. Future: exploiting the result
      Avoid the n² problem of interpreter mapping, as stated by Swartout as an issue in HPKB year 1
      Result has links to source
      Processing & query evaluation is best performed within Source Domains & by their engines

  46. SKC Synopsis
      • Research:
        - Reliable query answers from heterogeneous, imperfect data sources
      • Sources:
        - General: CIA World Factbook ‘96, UN-www, OPEC-www, Webster’s Dictionary, Thesaurus, Oxford English Dictionary
        - Topical: OPEC, BattleSpace Sensors, Logistics Servers
      • Client:
        - DARPA High Performance Knowledge Base project
      • Theory:
        - Rule-based algebra
        - Translation & Composition primitives

  47. Domain Specialization
      • Knowledge Acquisition (20% effort) &
      • Knowledge Maintenance (80% effort *)
      to be performed by
      • Domain specialists
      • Professional organizations
      • Field teams of modest size
      * based on experience with software
      Empowerment: autonomously maintainable

  48. Innovation in SKC
      • No need to harmonize full ontologies
      • Focus on what is critical for interoperation
      • Rules specific for articulation
      • Tools for creation and maintenance
      • Maintenance is distributed
        - to n sources
        - to m articulation agents
      • Potentially many sets of articulation rules
        - is m < n² ? Depends on semantic architecture density; a research question

  49. Backup Viewgraphs

  50. Summary
      • Algebra enables Interoperation
        - by dealing explicitly with differences
        - by identifying knowledge maintenance domains
        - keeping sources autonomous
      • Assumes domain has a common ontology
        - composing domain ontologies requires the algebra to manage the linkages where articulation occurs
        - processes are best executed within the domains
      • Knowledge about articulation is disjoint
        - allows integration specialists to work independently
        - supports multiple intersections and views
      • Maintenance is structured and partitioned
