Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink
University of Maryland, Baltimore County
Johns Hopkins University, Applied Physics Lab
04 January 2004
Information Retrieval and the Semantic Web
DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460 and ITR-IIS-0325464 provided partial research support for this work.
“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.” -- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.
“The web has made people smarter. We need to understand how to use it to make machines smarter, too.” -- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002
“The Semantic Web will globalize KR, just as the WWW globalized hypertext” -- Tim Berners-Lee
“The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.” -- Anonymous, 2001
Vision
• Semantic markup (e.g., OWL) as markup
  • Web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content
• Inference and retrieval are tightly bound
  • Inference over semantic markup improves retrieval, and text retrieval facilitates inference
• Agents should use the web like humans do
  • Think of a query, encode it to retrieve possibly relevant documents, read some and extract knowledge, repeat until objectives are met
Why use IR techniques?
• We will want to retrieve over structured and unstructured knowledge
• We should prepare for the appearance of text documents with embedded SW markup
• We may want to get our SWDs into conventional search engines, such as Google
• Mature, scalable, low cost, deployed infrastructure
• IR techniques also have some unique characteristics that may be very useful
  • e.g., ranking matches, document similarity, clustering, relevance feedback, etc.
Framework: Semantic Markup
[Figure: framework diagram. Labels: agent, Local KB, Inference Engine, Statement to be proved, Semantic Web Query, Encoder (“swangler”), Encoded Markup, Web Search Engine, Ranked Pages, Filters, Extractor, Semantic Markup.]
Framework: Incorporating Text
[Figure: the same framework diagram extended with text. Additional labels: Text Query, Text, and Filters applied to both text and semantic markup.]
Harnessing Google
• Google started indexing RDF documents some time in late 2003
• Can we take advantage of this?
• We’ve developed techniques to get some structured data indexed by Google
  • And then later retrieved
• Technique: give Google enhanced documents with additional annotations containing Swangle Terms™
Swangle definition
swan·gle
Pronunciation: ‘swa[ng]-g&l
Function: transitive verb
Inflected Forms: swan·gled; swan·gling /-g(&-)li[ng]/
Etymology: Postmodern English, from C++ mangle
Date: 20th century
1: to convert an RDF triple into one or more IR indexing terms
2: to process a document or query so that its content bearing markup will be indexed by an IR system
Synonym: see tblify
- swan·gler /-g(&-)l&r/ noun
Swangling
• Swangling turns a SW triple into seven word-like terms
  • One for each non-empty subset of the three components, with the missing elements replaced by the special “don’t care” URI
  • Terms are generated by a hashing function (e.g., SHA1)
• Swangling an RDF document means adding triples that carry the swangle terms
  • The result can be indexed and retrieved via conventional search engines like Google
• Allows one to search for a SWD with a triple that claims “Osama bin Laden is located at X”
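The subset-and-hash step can be sketched as follows. This is a minimal illustration, not the authors' encoder: the `DONT_CARE` URI, the space-joined hash input, and the base32 encoding truncated to 26 characters are all assumptions (the slides name SHA1 only as an example and do not specify the rest).

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder: the slides do not give the real "don't care" URI.
DONT_CARE = "http://example.org/swangle#dontCare"


def swangle(subject, predicate, obj, length=26):
    """Return the seven swangle terms for one (subject, predicate, object) triple."""
    triple = (subject, predicate, obj)
    terms = []
    # Each mask marks which components are kept; the all-False mask is skipped,
    # leaving the 7 non-empty subsets of the three components.
    for mask in product((True, False), repeat=3):
        if not any(mask):
            continue
        parts = [c if keep else DONT_CARE for c, keep in zip(triple, mask)]
        # Assumed encoding: SHA1 of the space-joined URIs, base32, truncated.
        digest = hashlib.sha1(" ".join(parts).encode("utf-8")).digest()
        terms.append(base64.b32encode(digest).decode("ascii")[:length])
    return terms


camera = "http://www.xfront.com/owl/ontologies/camera/#Camera"
sub_class_of = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
item = "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem"

for term in swangle(camera, sub_class_of, item):
    print(term)
```

Because the subject-only term depends on nothing but the subject, any two triples about the same subject share that term, which is what lets a partially specified query match.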
A Swangled Triple
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for [http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem]</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>
What’s the point?
• We’d like to get our documents into Google
  • Swangle terms look like words to Google and other search engines
• Cloaking avoids having to modify the document
  • Add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned. Caching makes this efficient
• A swangle term length of 7 may be acceptable for a Semantic Web of 10^10 triples (collision probability for a triple is about 2*10^-6)
• We could also use Swanglish: hashing each triple into N of the 50K most common English words
Student Event Scenario
• UMBC sends out descriptions of ~50 events a week to students
• Each student has a “standing query” used to route event messages
  • A student only receives announcements of events matching his/her interests and schedule
• Use LMCO’s AeroText system to automatically add DAML+OIL markup to event descriptions
  • Categorize text announcements into event types
  • Identify key elements and add DAML markup
• Use JESS to reason over the markup, drawing ontology-supported inferences
Event Ontology
• A simple ontology for University events
• Includes classes, subclasses, properties, etc.
• Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.
Event Categories
[Figure: event category hierarchy. Labels: Movie, Sport, Talk, . . ., Trip.]
OWLIR Architecture
[Figure: OWLIR pipeline diagram. Labels: Event Descriptions, Classification, Info Extraction (LMCO AeroText + Java), Text, Text + DAML, Jess (extract triples & reason), Text + triples, Convert triples to index terms, Index, SIRE, Query, User Interface, Expand, Agents, Must / OK / Must not, Retrieve, Results, Inference on results, Final Results.]
Swoogle
http://swoogle.umbc.edu/
Statistics as of November 2004

The web, like Gaul, is divided into three parts: the regular web (e.g. HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs). SWD = SWO + SWI.

Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS.

A SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.

Swoogle provides services to people via a web interface and to agents as web services. Swoogle puts documents into a character n-gram based IR engine to compute document similarity and do retrieval from queries.

[Figure: Swoogle architecture diagram. Labels: The Web (HTML documents, CGI scripts, images, audio files, video files, SWOs, SWIs), Web Crawler, Candidate URLs, SWD Reader, SWD Cache, SWD Metadata, SWD analyzer, IR analyzer, SWD Rank, SWD IR Engine, Web Server, Web Service, Swoogle Search, Ontology Dictionary, Swoogle Statistics, Human users, Intelligent Agents, SWOOGLE 2; stages labelled discovery, digest, analysis, service.]

Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649. November 2004.
Concepts
• Document
  • A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e., RDF and OWL).
  • An ontology document (SWO) is a SWD that contains mostly term definitions (i.e., classes and properties). It corresponds to the T-Box in Description Logic.
  • An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
• Term
  • A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
• Individual
  • An individual is a non-anonymous RDF resource which is the URI reference of a class member.
In Swoogle, a document D is a valid SWD iff JENA* correctly parses D and produces at least one triple.
*JENA is a Java framework for writing Semantic Web applications. http://www.hpl.hp.com/semweb/jena2.htm
[Figure: example triples. Term: foaf:Person rdf:type rdfs:Class. Individual: http://.../foaf.rdf#finin rdf:type foaf:Person.]
Demo
1. Find “Time” Ontology (Swoogle Search)
2. Digest “Time” Ontology
  • Document view
  • Term view
3. Find Term “Person” (Ontology Dictionary)
4. Digest Term “Person”
  • Class properties
  • (Instance) properties
5. Swoogle Statistics
Demo1 Find “Time” Ontology
We can use a set of keywords to search for an ontology. For example, “time, before, after” are basic concepts for a “Time” ontology.
Usage of Terms in SWD
[Figure: term usage across three documents. The ontology http://xmlns.com/foaf/1.0/ holds the defined Class (foaf:Person rdf:type rdfs:Class; rdfs:subClassOf wordNet:Agent) and the defined Property (foaf:mbox rdf:type rdf:Property; rdfs:domain foaf:Person). The instance documents http://www.cs.umbc.edu/~finin/foaf.rdf and http://foo.com/foaf.rdf each hold a defined Individual (e.g., http://foo.com/foaf.rdf#finin) with rdf:type foaf:Person (populated Class) and a foaf:mbox value (populated Property).]
Demo2(a) Digest “Time” Ontology (term view) TimeZone before …………. intAfter
Demo2(b) Digest “Time” Ontology (document view)
Demo3 Find Term “Person” Not capitalized! URIref is case sensitive!
Demo4 Digest Term “Person” 167 different properties 562 different properties
Demo5 Swoogle Statistics
Swoogle IR Search
• This is work in progress, not yet fully integrated into Swoogle
• Documents are put into an n-gram IR engine (after processing by Jena) in canonical XML form
  • Each contiguous sequence of N characters is used as an index term (e.g., N=5)
  • Queries are processed the same way
• Character n-grams work almost as well as words but have some advantages
  • No tokenization, so they work well with artificial languages and agglutinative languages => good for RDF!
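A minimal sketch of character n-gram indexing, assuming a plain in-memory inverted index. The function names and the two sample documents are illustrative and are not taken from the Swoogle engine.

```python
from collections import defaultdict


def char_ngrams(text, n=5):
    """Return every contiguous n-character substring of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=5):
    """Map each n-gram to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


# Two tiny stand-ins for documents in canonical XML form.
docs = {
    "a": '<owl:Class rdf:ID="TimeZone"/>',
    "b": '<owl:Class rdf:ID="Camera"/>',
}
index = build_index(docs)

print(char_ngrams("owl:Class"))  # ['owl:C', 'wl:Cl', 'l:Cla', ':Clas', 'Class']
print(sorted(index["TimeZ"]))    # ['a']
print(sorted(index["Class"]))    # ['a', 'b']
```

No tokenizer is involved: markup characters such as ":" and "=" simply become part of the index terms.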
Why character n-grams?
• Suppose we want to find ontologies for time
• We might use the following query:
  “time temporal interval point before after during day month year eventually calendar clock duration end begin zone”
• And have matches for documents with URIs like
  • http://foo.com/timeont.owl#timeInterval
  • http://foo.com/timeont.owl#CalendarClockInterval
  • http://purl.org/upper/temporal/t13.owl#timeThing
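The example above can be checked with a small overlap score. This is a sketch under assumptions: text is lowercased before n-grams are taken, the score is a bare count of shared distinct 5-grams (a real IR engine would weight terms), and the pizza URI is a made-up non-matching control.

```python
def ngram_set(text, n=5):
    """Distinct character n-grams of text, lowercased (an assumption)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def overlap(query, document, n=5):
    """Count the distinct n-grams that query and document have in common."""
    return len(ngram_set(query, n) & ngram_set(document, n))


query = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")

uris = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
    "http://foo.com/pizza.owl#Topping",  # hypothetical control, not from the slides
]

# Rank the URIs by how many n-grams they share with the query.
for uri in sorted(uris, key=lambda u: overlap(query, u), reverse=True):
    print(overlap(query, uri), uri)
```

The three time-related URIs score above zero because fragments such as "inter", "calen", and "tempo" occur inside their run-together names, where a word tokenizer would see a single unmatched token.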
Another approach: URIs as words
• Remember: ontologies define vocabularies
• In OWL, URIs of classes and properties are the words
• So, take a SWD, reduce it to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document
• Process queries in the same way
• Variation: include literal data (e.g., strings) too
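The pipeline above can be sketched as follows, assuming triples arrive as tuples of strings using N-Triples-style conventions: blank nodes start with "_:" and literals are wrapped in double quotes. Those conventions, the function name, and the sample triples are assumptions; parsing the SWD into triples is left out.

```python
import hashlib


def uri_tokens(triples, include_literals=False):
    """Hash the URIs of (subject, predicate, object) triples into index tokens.

    Duplicates are kept, so a URI used often contributes more tokens.
    """
    tokens = []
    for triple in triples:
        for node in triple:
            if node.startswith("_:"):
                continue  # discard blank nodes
            if node.startswith('"') and not include_literals:
                continue  # literals are indexed only in the variation
            tokens.append(hashlib.md5(node.encode("utf-8")).hexdigest())
    return tokens


CAMERA = "http://www.xfront.com/owl/ontologies/camera/#Camera"
triples = [
    (CAMERA,
     "http://www.w3.org/2000/01/rdf-schema#subClassOf",
     "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem"),
    ("_:b0", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", CAMERA),
    ("_:b0", "http://www.w3.org/2000/01/rdf-schema#label", '"my camera"'),
]

print(len(uri_tokens(triples)))                         # 6
print(len(uri_tokens(triples, include_literals=True)))  # 7
```

A query is turned into tokens by the same function, so a conventional word-based engine can match documents and queries on the hashed URIs.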
What we have done
• Developed Swoogle, a crawler-based retrieval system for SWDs
• Developed and implemented a technique to get Google to index and retrieve SWDs
• Prototyped (twice) an n-gram based IR engine for SWDs
• Explored the integration of inference and retrieval
• Used these in several demonstration systems