
Towards Scalable Information Integration with Instance Coreferences

Abir Qasem (1), Dimitre Dimitrov (2), Jeff Heflin (1). (1) Lehigh University, (2) Tech-X Corporation. 07/11/09. U.S. Department of Energy DE-FG02-05ER84171 SBIR grant.


Presentation Transcript


  1. Towards Scalable Information Integration with Instance Coreferences • Abir Qasem (1), Dimitre Dimitrov (2), Jeff Heflin (1) • (1) Lehigh University, (2) Tech-X Corporation • 07/11/09 • U.S. Department of Energy DE-FG02-05ER84171 SBIR grant

  2. The Semantic Web • Definition • "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Berners-Lee et al., Scientific American, May 2001) • Ontology • a key component of the Semantic Web • ontologies define the semantics of the terms used in semi-structured web pages • identify context, provide shared definitions • have a formal syntax and unambiguous semantics • can be used to describe alignments between heterogeneous schemas

  3. A Web of Ontologies • The answer to a user’s query might require the combination of data from S1, S2, S3, and S4.
     [Figure: ontologies (Dublin Core, Foaf, Region, Congress, Citeseer, DBLP, AIGP, NSF Awards) connected to one another by "extends" edges; data sources S1 to S7 connected to ontologies by "commits to" edges]

  4. Semantic Web Standards • World Wide Web Consortium (W3C) Recommendations • RDF(S) (1999, revised 2004) • essentially semantic networks with URIs • XML serialization syntax • OWL (2004) • extends RDF with more semantic primitives • based on description logics (DLs) • has a model theoretic semantics
     OWL example (a Band is a subset of the groups which only have Musicians as members):
     <owl:Class rdf:ID="Band">
       <rdfs:subClassOf>
         <owl:Restriction>
           <owl:onProperty rdf:resource="#hasMember" />
           <owl:allValuesFrom rdf:resource="#Musician" />
         </owl:Restriction>
       </rdfs:subClassOf>
     </owl:Class>
     [Figure: example RDF graph with nodes rdfs:Class, rdf:Property, g:Person, u:Chair, g:name and the literal "John Smith", connected by rdf:type, rdfs:domain, rdfs:subClassOf and g:name edges]

  5. Integrating RDF Sources • QUERY: Find all academic papers written by Marvin Minsky’s advisees. • Sources: AIGP - http://aigp.eecs.umich.edu/ and DBLP - http://www.informatik.uni-trier.de/~ley/db/
     [Figure: two RDF graphs. AIGP: aigp:researcher/show/21 has aigp:name “Marvin Minsky” and aigp:advisorOf aigp:researcher/show/93, which has aigp:name “Eugene Charniak”. DBLP: dblp:jrnl/aim/Charniak97 has dblp:title “Statistical Techniques for Natural Language Parsing” and dblp:hasAuthor dblp:c/Charniak:Eugene, which has dblp:name “Eugene Charniak”. An edge marked "=?" between aigp:researcher/show/93 and dblp:c/Charniak:Eugene asks whether the two URIs denote the same individual.]

  6. Coreference Information • owl:sameAs • states that two URIs denote the same individual • Linking Open Data initiative • ~100 sources with over 4 billion triples (i.e., facts) • >100 million explicit owl:sameAs statements • Many RDF users publish owl:sameAs statements with their data • Can use automated coreference resolution techniques to find others • allow for the possibility of human correction
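The role of owl:sameAs in the Minsky query can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the triples are toy data modelled on the AIGP/DBLP example, the owl:sameAs link between the two Charniak URIs is supposed to have been published, the function name `papers_by_advisees` is hypothetical, and only direct sameAs links are followed (no transitive closure).

```python
# Toy triples modelled on the AIGP/DBLP example; the tuple layout is an assumption.
aigp = [
    ("aigp:researcher/show/21", "aigp:name", "Marvin Minsky"),
    ("aigp:researcher/show/21", "aigp:advisorOf", "aigp:researcher/show/93"),
    ("aigp:researcher/show/93", "aigp:name", "Eugene Charniak"),
]
dblp = [
    ("dblp:jrnl/aim/Charniak97", "dblp:hasAuthor", "dblp:c/Charniak:Eugene"),
    ("dblp:jrnl/aim/Charniak97", "dblp:title",
     "Statistical Techniques for Natural Language Parsing"),
]
# Assumed coreference statement linking the two sources.
links = [("aigp:researcher/show/93", "owl:sameAs", "dblp:c/Charniak:Eugene")]


def papers_by_advisees(advisor_name, triples, same_as_links):
    """Return titles of papers written by advisees of the named advisor."""
    # Symmetric alias table built from direct owl:sameAs links only.
    aliases = {}
    for s, _, o in same_as_links:
        aliases.setdefault(s, {s}).add(o)
        aliases.setdefault(o, {o}).add(s)

    advisors = {s for s, p, o in triples if p == "aigp:name" and o == advisor_name}
    advisees = {o for s, p, o in triples if p == "aigp:advisorOf" and s in advisors}

    # Expand each advisee URI to every URI known to denote the same individual.
    authors = set()
    for uri in advisees:
        authors |= aliases.get(uri, {uri})

    papers = {s for s, p, o in triples if p == "dblp:hasAuthor" and o in authors}
    return sorted(o for s, p, o in triples if p == "dblp:title" and s in papers)


titles_with_links = papers_by_advisees("Marvin Minsky", aigp + dblp, links)
titles_without_links = papers_by_advisees("Marvin Minsky", aigp + dblp, [])
```

Without the sameAs link the join between the two sources has nothing to match on, so the second call finds no papers.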

  7. Scaling • AIGP and DBLP have about 4000 coreferent instances • Marvin Minsky has about 20 advisees • Only a small fragment of coreference information is relevant to any given query • Need to be selective about what information to use • Quantity of coreference information • 80K between DBpedia and Geonames • 100K between CIA Factbook and Geonames

  8. [Figure: system architecture (OBII). Semantic Web space: domain ontologies O1 … On, OWLII map ontologies Om1 … Omn, REL set R1 … Rn, data sources S1 … Sn. System startup or periodic update: ontologies and maps are loaded into IndexKB as LAV/GAV rules; class and property REL statements (Rc, Rp) are converted to LAV/GAV (REL statements are LAV + URL of data source); owl:sameAs REL statements go to an indexed equivalence closure (EQKB). Query phase: a SPARQL query is passed to GNS, which finds LAV/GAV matches in IndexKB and uses "get all" on EQKB; potentially relevant sources are read from the leaves, retrieved via HTTP calls and loaded into a reasoner (KAON2), which returns the result.]

  9. Potential Relevance • A summary of a source’s content that allows us to ignore sources that cannot possibly contribute to a query • Unless we look inside the source there is no way to guarantee its relevance • REL statements have three forms, one for each of the three kinds of assertion a source can contain (in the following, d is the URL of a data source, Cs is a class, CE is a class expression, Ps and Pq are property names, and {u1 … un} is a set of URIs) • For classes (Rc) the form is REL(d, Cs, CE) • For properties (Rp) the form is REL(d, Ps, Pq) • For owl:sameAs assertions the form is REL(d, {u1 … un})
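One way to picture how REL statements support source selection is the following Python sketch. It is not the system's implementation: the names `RelPredicate`, `RelSameAs`, `build_index` and `relevant_sources` are hypothetical, the class expression CE is simplified to a single class name, and the lookup is a plain dictionary match rather than LAV/GAV rule matching. The example values are borrowed from the REL example in the backup slides.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple


class RelPredicate(NamedTuple):
    """REL(d, Cs, CE) or REL(d, Ps, Pq), with CE simplified to one name."""
    source: str     # d: URL of the data source
    contained: str  # Cs or Ps: term used inside the source
    container: str  # CE or Pq: term the source's data is contained in


class RelSameAs(NamedTuple):
    """REL(d, {u1 ... un}): source d holds owl:sameAs assertions on these URIs."""
    source: str
    uris: FrozenSet[str]


def build_index(rels):
    """Build two inverted indexes: term -> sources and URI -> sources."""
    by_term = defaultdict(set)
    by_uri = defaultdict(set)
    for rel in rels:
        if isinstance(rel, RelSameAs):
            for uri in rel.uris:
                by_uri[uri].add(rel.source)
        else:
            by_term[rel.container].add(rel.source)
    return by_term, by_uri


def relevant_sources(query_terms, query_uris, by_term, by_uri):
    """Return sources that could contribute; all others can be ignored."""
    found = set()
    for term in query_terms:
        found |= by_term.get(term, set())
    for uri in query_uris:
        found |= by_uri.get(uri, set())
    return found


rels = [
    RelPredicate("http://U2", "O1:MtnBike", "O1:GreenTransport"),
    RelSameAs("http://U3", frozenset({"ex:DC", "ex:WashingtonDC"})),  # made-up URIs
]
by_term, by_uri = build_index(rels)
```

A query over O1:GreenTransport that mentions ex:DC would select both sources, while a query over an unrelated term selects none, which is the point of a potential-relevance summary: it can rule sources out but cannot guarantee that a selected source actually contributes answers.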

  10. Information Integration vs. Source Selection

  11. Equivalence KB • Implementation is a variation of disjoint set forest algorithm [Cormen et al. 01] • standard operations: union(x,y) and find-set(x) • Also supports isEquivalent and getAllEquivalent methods • The index is built by an update algorithm (with a set of seed URLs) • Uses an inverted document index for equivalence relevance information [Cormen et al. 01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms: Second Edition. The MIT Press, Cambridge, MA, 2001.
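The disjoint-set forest behind the Equivalence KB can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the class name and snake_case method names are assumptions, and keeping a member set at each root (so that all equivalents can be returned without scanning every URI) is one possible variation, not necessarily the one used in the system.

```python
class EquivalenceKB:
    """Disjoint-set forest over URIs with per-root member sets."""

    def __init__(self):
        self._parent = {}   # URI -> parent URI (a root points to itself)
        self._members = {}  # root URI -> set of all URIs in its class

    def find_set(self, x):
        """Return the representative of x's class, registering x if new."""
        if x not in self._parent:
            self._parent[x] = x
            self._members[x] = {x}
            return x
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path directly at the root.
        while self._parent[x] != root:
            self._parent[x], x = root, self._parent[x]
        return root

    def union(self, x, y):
        """Record that x and y denote the same individual."""
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        # Union by size: attach the smaller class under the larger one.
        if len(self._members[rx]) < len(self._members[ry]):
            rx, ry = ry, rx
        self._parent[ry] = rx
        self._members[rx] |= self._members.pop(ry)

    def is_equivalent(self, x, y):
        """True if x and y are known to be coreferent (or identical)."""
        if x == y:
            return True
        if x not in self._parent or y not in self._parent:
            return False
        return self.find_set(x) == self.find_set(y)

    def get_all_equivalent(self, x):
        """Return every URI known to be coreferent with x, including x."""
        if x not in self._parent:
            return {x}
        return set(self._members[self.find_set(x)])


kb = EquivalenceKB()
kb.union("aigp:researcher/show/93", "dblp:c/Charniak:Eugene")
kb.union("dblp:c/Charniak:Eugene", "citeseer:charniak")  # made-up third URI
```

Because owl:sameAs is symmetric and transitive, two union calls are enough for all three URIs to end up in one class.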

  12. Update of EquivalenceKB

  13. Preliminary Tests • We have used • 202,383 owl:sameAs statements that align data from the AIGP, DBLP and Citeseer data sources • Part of the Hawkeye Project • http://swat.cse.lehigh.edu/resources/index.html • 166 million facts and several “integration resources” • PC with 3 GB • EquivalenceKB is 7 MB • Buildup time 3 seconds • 1000 calls to getAllEquivalents return in less than half a second

  14. Query Answering Needs equivalence information

  15. GNS Extension Needs equivalence information

  16. GNS Extension • contains is used before expansion to avoid cyclic expansion • To avoid redundancy, we consider syntactic query containment • E.g., CONTAINS(cl, P(x,a)) is true if P(x,y) is in cl • Equivalence information is relevant • author (X, GNS) in Closed list • we should not expand author (X, GOAL-NODE-SEARCH) • assuming GNS = GOAL-NODE-SEARCH
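The containment check can be sketched in Python under some stated assumptions: atoms are represented as (predicate, arguments) tuples, variables are written with a leading "?" (a convention chosen here, not the slides' notation), and the coreference test is passed in as a function `same`. The names `contains` and `is_var` are hypothetical. As a simplification, a variable repeated in a closed-list atom must be matched by identical terms; equivalent but differently named constants are not accepted in that one case.

```python
def is_var(term):
    """Variables are marked with a leading '?' in this sketch."""
    return term.startswith("?")


def contains(closed, goal, same=lambda a, b: a == b):
    """True if some atom in the closed list syntactically contains the goal.

    A closed-list atom contains the goal when it has the same predicate and
    each of its arguments is either a variable or a constant that `same`
    accepts as coreferent with the goal's constant in that position.
    """
    pred, args = goal
    for c_pred, c_args in closed:
        if c_pred != pred or len(c_args) != len(args):
            continue
        binding = {}
        for c, g in zip(c_args, args):
            if is_var(c):
                # A repeated variable must be matched consistently.
                if binding.setdefault(c, g) != g:
                    break
            elif is_var(g) or not same(c, g):
                # A constant never contains a variable or a different constant.
                break
        else:
            return True
    return False


# Assumed coreference fact: GNS = GOAL-NODE-SEARCH.
coref = {frozenset({"GNS", "GOAL-NODE-SEARCH"})}


def same_individual(a, b):
    return a == b or frozenset({a, b}) in coref


closed_list = [("author", ("?x", "GNS"))]
goal = ("author", ("?y", "GOAL-NODE-SEARCH"))
expand_without_eq = not contains(closed_list, goal)
expand_with_eq = not contains(closed_list, goal, same_individual)
```

With plain syntactic equality the goal looks new and would be expanded again; with the equivalence-aware test it is recognised as already covered by the closed list.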

  17. GNS Extension • unifyEQ is like regular unify except it accounts for coreferences • When matching two constants we use isEqual of Equivalence KB • livesIn(X, DC) and livesIn (X, WashingtonDC) will not unify unless • we know DC = WashingtonDC
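A coreference-aware unifier for flat (function-free) atoms might look like the Python sketch below. The representation is assumed for illustration: atoms are (predicate, arguments) tuples, variables carry a leading "?", and `same` stands in for the Equivalence KB's equality test. The name `unify_eq` mirrors the slide's unifyEQ but the code is not the system's implementation.

```python
def is_var(term):
    """Variables are marked with a leading '?' in this sketch."""
    return term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a constant."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify_eq(a, b, same=lambda x, y: x == y, subst=None):
    """Unify two flat atoms; return a substitution dict or None on failure.

    Identical to ordinary unification except that two constants match
    whenever `same` reports them as coreferent.
    """
    subst = dict(subst or {})
    (pred_a, args_a), (pred_b, args_b) = a, b
    if pred_a != pred_b or len(args_a) != len(args_b):
        return None
    for x, y in zip(args_a, args_b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        elif not same(x, y):
            return None  # two constants not known to be coreferent
    return subst


# Assumed coreference fact: DC = WashingtonDC.
coref = {frozenset({"DC", "WashingtonDC"})}


def same_individual(x, y):
    return x == y or frozenset({x, y}) in coref


left = ("livesIn", ("?x", "DC"))
right = ("livesIn", ("?y", "WashingtonDC"))
plain_result = unify_eq(left, right)
eq_result = unify_eq(left, right, same_individual)
```

Here `plain_result` is a failure because the two constants differ as strings, while `eq_result` succeeds and binds one variable to the other.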

  18. Conclusion and Future Work • Scalable Instance Coreference Handling is an important issue • Initial work shows promise • Two important issues • Avoid pre-computation of equivalence closure and make the system more dynamic • Disk based implementation of EquivalenceKB • We are currently fine tuning a dynamic algorithm • UpdateEqualKB is not seeded with all URIs but rather with URIs from a query • Equivalence information is updated as new URIs are discovered due to rule expansion • Coming soon to a conference near you

  19. Backups

  20. OWLII in OWL/RDF
      Axiom type | Subject (left-hand side) | Object (right-hand side)
      owl:equivalentClass | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
      rdfs:subClassOf | All of the above + owl:unionOf | All of the above + owl:allValuesFrom
      owl:equivalentProperty, rdfs:subPropertyOf | named properties, owl:inverseOf | named properties, owl:inverseOf
      owl:inverseOf | named properties | named properties

  21. Map example
      O1:GreenTransport(X) :- O2:Transport(X), O2:greenRating(X, good)
      <owl:Class rdf:about="http://O1#GreenTransport">
        <rdfs:subClassOf rdf:resource="http://O2#Transport"/>
        <rdfs:subClassOf>
          <owl:Restriction>
            <owl:onProperty rdf:resource="http://O2#greenRating"/>
            <owl:hasValue rdf:resource="http://uri#good"/>
          </owl:Restriction>
        </rdfs:subClassOf>
      </owl:Class>

  22. REL example
      R4: O1:MtnBike(X) ⊑ O1:GreenTransport(X), U2
      <meta:RelStatement>
        <meta:source rdf:resource="http://U2"/>
        <meta:contained>
          <owl:Class rdf:about="http://O1#MtnBike" />
        </meta:contained>
        <meta:container>
          <owl:Class rdf:about="http://O1#GreenTransport" />
        </meta:container>
      </meta:RelStatement>
