Information Retrieval and the Semantic Web

Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink University of Maryland, Baltimore County Johns Hopkins University, Applied Physics Lab 04 January 2004. Information Retrieval and the Semantic Web.


Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost and Clay Fink

University of Maryland, Baltimore County

Johns Hopkins University, Applied Physics Lab

04 January 2004

Information Retrieval and the Semantic Web

DARPA contract F30602-00-0591 and NSF awards ITR-IIS-0326460 and ITR-IIS-0325464 provided partial research support for this work.

“XML is Lisp's bastard nephew, with uglier syntax and no semantics. Yet XML is poised to enable the creation of a Web of data that dwarfs anything since the Library at Alexandria.”

-- Philip Wadler, Et tu XML? The fall of the relational empire, VLDB, Rome, September 2001.

“The web has made people smarter. We need to understand how to use it to make machines smarter, too.”

-- Michael I. Jordan (UC Berkeley), paraphrased from a talk at AAAI, July 2002

“The Semantic Web will globalize KR, just as the WWW globalized hypertext”

-- Tim Berners-Lee

“The multi-agent systems paradigm and the web both emerged around 1990. One has succeeded beyond imagination and the other has not yet made it out of the lab.”

-- Anonymous, 2001

Vision
  • Semantic markup (e.g., OWL) as markup
    • Web documents are traditional HTML documents, augmented with machine-readable semantic markup that describes their content
  • Inference and retrieval are tightly bound
    • Inference over semantic markup improves retrieval and text retrieval facilitates inference
  • Agents should use the web like humans do
    • Think of a query, encode it to retrieve possibly relevant documents, read some and extract knowledge, and repeat until the objectives are met
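
The query, retrieve, read, and extract cycle in the last bullet can be sketched roughly as below. This is only an illustration under stated assumptions: `agent_loop` and all of the `toy_*` helpers, the three-word "documents", and the goal test are hypothetical stand-ins for a real search engine, extractor, and knowledge base, none of which are specified in the slides.

```python
from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)

def agent_loop(goal: Callable[[Set[Fact]], bool],
               make_query: Callable[[Set[Fact]], str],
               search: Callable[[str], Iterable[str]],
               extract: Callable[[str], Iterable[Fact]],
               max_rounds: int = 5) -> Set[Fact]:
    """Query, retrieve, read and extract until the goal holds or rounds run out."""
    kb: Set[Fact] = set()
    for _ in range(max_rounds):
        if goal(kb):
            break
        query = make_query(kb)          # think of a query
        for doc in search(query):       # retrieve possibly relevant documents
            kb.update(extract(doc))     # read them and extract knowledge
    return kb

# Hypothetical toy corpus: each "document" is a single whitespace-separated triple.
CORPUS = ["camera subClassOf purchaseableItem", "slr subClassOf camera"]

def toy_search(query: str):
    words = query.split()
    return [doc for doc in CORPUS if any(w in doc.split() for w in words)]

def toy_extract(doc: str):
    s, p, o = doc.split()
    return [(s, p, o)]

def toy_make_query(kb: Set[Fact]) -> str:
    # Start from "slr", then also ask about every object learned so far.
    return " ".join({"slr"} | {o for _, _, o in kb})

def toy_goal(kb: Set[Fact]) -> bool:
    return ("camera", "subClassOf", "purchaseableItem") in kb

kb = agent_loop(toy_goal, toy_make_query, toy_search, toy_extract)
```

In this toy run the first query only finds the document about `slr`; the fact extracted from it widens the second query enough to reach the document that satisfies the goal.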
Why use IR techniques?
  • We will want to retrieve over structured and unstructured knowledge
    • We should prepare for the appearance of text documents with embedded SW markup
  • We may want to get our SWDs into conventional search engines, such as Google.
    • Mature, scalable, low cost, deployed infrastructure
  • IR techniques also have some unique characteristics that may be very useful
    • e.g., ranking matches, document similarity, clustering, relevance feedback, etc.
Framework–Semantic Markup

[Figure: framework diagram. Recoverable labels: agent; Local KB; Inference Engine; Statement to be proved; Semantic Web Query; Encoder (“swangler”); Encoded Markup; Web Search Engine; Ranked Pages; Filters; Extractor; Semantic Markup.]


Framework–Incorporating Text

[Figure: the same framework diagram extended with text. Recoverable labels: Local KB; Inference Engine; Statement to be proved; Semantic Web Query; Text Query; Encoder (“swangler”); Encoded Markup; Web Search Engine; Ranked Pages; Filters; Text; Extractor; Semantic Markup.]


Harnessing Google
  • Google started indexing RDF documents some time in late 2003
  • Can we take advantage of this?
  • We’ve developed techniques to get some structured data indexed by Google
  • And then retrieved later
  • Technique: give Google enhanced documents with additional annotations containing Swangle Terms™
Swangle definition

swan·gle

Pronunciation: ‘swa[ng]-g&l
Function: transitive verb
Inflected Forms: swan·gled; swan·gling /-g(&-)li[ng]/
Etymology: Postmodern English, from C++ mangle
Date: 20th century

1: to convert an RDF triple into one or more IR indexing terms

2: to process a document or query so that its content bearing markup will be indexed by an IR system

Synonym: see tblify

- swan·gler /-g(&-)l&r/ noun

Swangling
  • Swangling turns a SW triple into seven word-like terms
    • One for each non-empty subset of the three components with the missing elements replaced by the special “don’t care” URI
    • Terms generated by a hashing function (e.g., SHA1)
  • Swangling an RDF document means adding in triples with swangle terms.
    • This can be indexed and retrieved via conventional search engines like Google
  • Allows one to search for a SWD with a triple that claims “Osama bin Laden is located at X”
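
A minimal sketch of the seven-term construction described above, not the authors' implementation. The slides name SHA1 as one possible hash but give neither the “don’t care” URI nor the serialization, so `DONT_CARE`, the newline-joined encoding, and the base32 truncation to 26 characters (chosen only to resemble the sample terms on the “A Swangled Triple” slide) are all assumptions.

```python
import base64
import hashlib
from itertools import product

# Hypothetical placeholder: the slides do not give the real "don't care" URI.
DONT_CARE = "urn:example:swangle#dontCare"

def swangle_term(s: str, p: str, o: str, length: int = 26) -> str:
    """Hash one (possibly wildcarded) triple into a word-like index term."""
    digest = hashlib.sha1("\n".join((s, p, o)).encode("utf-8")).digest()
    # Base32 gives upper-case letters and digits 2-7, which look like words to an indexer.
    return base64.b32encode(digest).decode("ascii").rstrip("=")[:length]

def swangle(s: str, p: str, o: str) -> list:
    """Return one term per non-empty subset of {subject, predicate, object}."""
    terms = []
    for keep in product((True, False), repeat=3):
        if not any(keep):
            continue  # skip the empty subset, leaving 2**3 - 1 = 7 patterns
        parts = [c if k else DONT_CARE for c, k in zip((s, p, o), keep)]
        terms.append(swangle_term(*parts))
    return terms

terms = swangle(
    "http://www.xfront.com/owl/ontologies/camera/#Camera",
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem",
)
```

A query for "any triple whose predicate is subClassOf and whose object is PurchaseableItem" would be swangled the same way, with the subject replaced by `DONT_CARE`, and the resulting single term handed to the search engine.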
A Swangled Triple

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:s="http://swoogle.umbc.edu/ontologies/swangle.owl#">
  <s:SwangledTriple>
    <s:swangledText>N656WNTZ36KQ5PX6RFUGVKQ63A</s:swangledText>
    <rdfs:comment>Swangled text for [http://www.xfront.com/owl/ontologies/camera/#Camera, http://www.w3.org/2000/01/rdf-schema#subClassOf, http://www.xfront.com/owl/ontologies/camera/#PurchaseableItem]</rdfs:comment>
    <s:swangledText>M6IMWPWIH4YQI4IMGZYBGPYKEI</s:swangledText>
    <s:swangledText>HO2H3FOPAEM53AQIZ6YVPFQ2XI</s:swangledText>
    <s:swangledText>2AQEUJOYPMXWKHZTENIJS6PQ6M</s:swangledText>
    <s:swangledText>IIVQRXOAYRH6GGRZDFXKEEB4PY</s:swangledText>
    <s:swangledText>75Q5Z3BYAKRPLZDLFNS5KKMTOY</s:swangledText>
    <s:swangledText>2FQ2YI7SNJ7OMXOXIDEEE2WOZU</s:swangledText>
  </s:SwangledTriple>
</rdf:RDF>

What’s the point?
  • We’d like to get our documents into Google
    • Swangle terms look like words to Google and other search engines.
  • Cloaking obviates modifying the document
    • Add rules to the web server so that, when a search spider asks for document X, the document swangled(X) is returned. Caching makes this efficient.
  • A swangle term length of 7 may be acceptable for a Semantic Web of 10^10 triples: the collision probability for a triple is ~2×10^-6.
  • We could also use Swanglish – hashing each triple into N of the 50K most common English words
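
The Swanglish idea in the last bullet could look roughly like this. It is a sketch under assumptions: the slides do not say how a triple is mapped to words, so slicing a SHA-1 digest into 4-byte chunks is one arbitrary choice, and `VOCAB` is a tiny hypothetical stand-in for the 50K most common English words.

```python
import hashlib

# Tiny stand-in vocabulary (hypothetical); the slide proposes the 50K most common English words.
VOCAB = ["apple", "river", "stone", "cloud", "tiger", "maple", "ocean", "lamp"]

def swanglish(s: str, p: str, o: str, n: int = 3, vocab=VOCAB) -> list:
    """Map a triple to n ordinary words chosen by slices of its hash."""
    digest = hashlib.sha1("\n".join((s, p, o)).encode("utf-8")).digest()
    if not 1 <= n <= len(digest) // 4:
        raise ValueError("n must be between 1 and 5 for a 20-byte digest")
    words = []
    for i in range(n):
        chunk = digest[4 * i:4 * i + 4]                  # 4 bytes per word
        index = int.from_bytes(chunk, "big") % len(vocab)
        words.append(vocab[index])
    return words

words = swanglish("ex:Camera", "rdfs:subClassOf", "ex:PurchaseableItem")
```

With a vocabulary of V words and N words per triple there are V^N possible combinations, so both V and N trade off collision risk against query length.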
Student Event Scenario
  • UMBC sends out descriptions of ~50 events a week to students.
  • Each student has a “standing query” used to route event messages.
    • A student only receives announcements of events matching his/her interests and schedule.
  • Use LMCO’s AeroText system to automatically add DAML+OIL markup to event descriptions.
    • Categorize text announcements into event types
    • Identify key elements and add DAML markup
  • Use JESS to reason over the markup, drawing ontology-supported inferences
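
To make the routing step concrete, here is a small sketch of matching an event against standing queries with one ontology-supported inference (subclass reasoning). It is written in Python rather than JESS, and the class names, time-slot labels, and student data are all hypothetical; the scenario above does not specify them.

```python
# Hypothetical fragment of an event ontology: child class -> parent class.
SUBCLASS_OF = {
    "BasketballGame": "Sport",
    "Sport": "Event",
    "Talk": "Event",
    "Movie": "Event",
}

def ancestors(event_type: str) -> list:
    """Return the event type followed by all of its superclasses."""
    chain = []
    while event_type is not None and event_type not in chain:
        chain.append(event_type)
        event_type = SUBCLASS_OF.get(event_type)
    return chain

def matches(standing_query: dict, event: dict) -> bool:
    """An event matches if some superclass is of interest and the slot is free."""
    type_ok = any(t in standing_query["interests"] for t in ancestors(event["type"]))
    time_ok = event["slot"] in standing_query["free_slots"]
    return type_ok and time_ok

def route(event: dict, students: dict) -> list:
    """Return the students whose standing query matches the event."""
    return [name for name, query in students.items() if matches(query, event)]

students = {
    "alice": {"interests": {"Sport"}, "free_slots": {"Fri-19"}},
    "bob": {"interests": {"Talk"}, "free_slots": {"Fri-19"}},
}
event = {"type": "BasketballGame", "slot": "Fri-19"}
recipients = route(event, students)
```

Here `alice` receives the announcement even though her query never mentions basketball, because the ontology says a basketball game is a sport.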
Event Ontology
  • A simple ontology for University events
  • Includes classes, subclasses, properties, etc.
  • Can include instance data, e.g., UMBC, NEC, Fairleigh Dickinson, etc.
Event Categories: Movie, Sport, Talk, Trip, . . .


OWLIR Architecture

[Figure: OWLIR architecture diagram. Recoverable labels: Event Descriptions; Text; Classification; InfoExtraction (LMCO AeroText + Java); Text + DAML; Extract triples & reason; Jess; Text + triples; Convert triples to index terms; Index; SIRE; Query; User Interface; Must / OK / Must not; Retrieve; Inference on results; Final Results; Results User Interface; Agents; Expand Event Description.]


SWOOGLE 2

The web, like Gaul, is divided into three parts: the regular web (e.g., HTML), Semantic Web Ontologies (SWOs), and Semantic Web Instance files (SWIs). SWD = SWO + SWI.

[Figure: Swoogle architecture diagram. Recoverable labels: The Web (HTML documents, CGI scripts, images, audio files, video files, SWOs, SWIs); Web Crawler; Candidate URLs; SWD Reader; SWD Cache; SWD Metadata; SWD Rank; SWD analyzer; IR analyzer; SWD IR Engine; Web Server; Web Service; Swoogle Search; Ontology Dictionary; Swoogle Statistics; Human users; Intelligent Agents. Stage labels: discovery, digest, analysis, service.]

Swoogle uses four kinds of crawlers to discover semantic web documents and several analysis agents to compute metadata and relations among documents and ontologies. Metadata is stored in a relational DBMS. Services are provided to people and agents.

A SWD’s rank is a function of its type (SWO/SWI) and the rank and types of the documents to which it’s related.

Swoogle puts documents into a character n-gram based IR engine to compute document similarity and do retrieval from queries.

Swoogle provides services to people via a web interface and to agents as web services.

http://swoogle.umbc.edu/

Statistics as of November 2004

Contributors include Tim Finin, Anupam Joshi, Yun Peng, R. Scott Cost, Jim Mayfield, Joel Sachs, Pavan Reddivari, Vishal Doshi, Rong Pan, Li Ding, and Drew Ogle. Partial research support was provided by DARPA contract F30602-00-0591 and by NSF awards NSF-ITR-IIS-0326460 and NSF-ITR-IDM-0219649. November 2004.

Concepts
  • Document
    • A Semantic Web Document (SWD) is an online document written in semantic web languages (i.e. RDF and OWL).
    • An ontology document (SWO) is a SWD that contains mostly term definitions (i.e., classes and properties). It corresponds to the T-Box in Description Logic.
    • An instance document (SWI or SWDB) is a SWD that contains mostly class individuals. It corresponds to the A-Box in Description Logic.
  • Term
    • A term is a non-anonymous RDF resource which is the URI reference of either a class or a property.
  • Individual
    • An individual refers to a non-anonymous RDF resource which is the URI reference of a class member.

In Swoogle, a document D is a valid SWD iff JENA* correctly parses D and produces at least one triple.

*JENA is a Java framework for writing Semantic Web applications. http://www.hpl.hp.com/semweb/jena2.htm
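
The SWO/SWI distinction above can be illustrated with a rough classifier over already-parsed triples. This is a sketch, not Swoogle's actual rule: reading "mostly" as a strict majority of `rdf:type` statements is an assumption, as are the contents of `TERM_TYPES` and the `ex:` example names. Parsing itself (Jena in Swoogle) is out of scope here.

```python
RDF_TYPE = "rdf:type"

# Assumed set of meta-classes whose instances count as term definitions.
TERM_TYPES = {"rdfs:Class", "owl:Class", "rdf:Property",
              "owl:ObjectProperty", "owl:DatatypeProperty"}

def classify_swd(triples: list) -> str:
    """Label a parsed document as 'SWO', 'SWI', or 'not a SWD'."""
    if not triples:
        return "not a SWD"  # a valid SWD must yield at least one triple
    typed_objects = [o for _, p, o in triples if p == RDF_TYPE]
    term_defs = sum(1 for o in typed_objects if o in TERM_TYPES)
    individuals = len(typed_objects) - term_defs
    return "SWO" if term_defs > individuals else "SWI"

ontology_doc = [
    ("foaf:Person", "rdf:type", "rdfs:Class"),
    ("foaf:mbox", "rdf:type", "rdf:Property"),
    ("foaf:mbox", "rdfs:domain", "foaf:Person"),
]
instance_doc = [
    ("ex:finin", "rdf:type", "foaf:Person"),
    ("ex:finin", "foaf:mbox", "finin@umbc.edu"),
]
labels = (classify_swd(ontology_doc), classify_swd(instance_doc), classify_swd([]))
```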

Examples:
  • Term: foaf:Person rdf:type rdfs:Class
  • Individual: http://.../foaf.rdf#finin rdf:type foaf:Person

Demo
  1. Find “Time” Ontology (Swoogle Search)
  2. Digest “Time” Ontology
    • Document view
    • Term view
  3. Find Term “Person” (Ontology Dictionary)
  4. Digest Term “Person”
    • Class properties
    • (Instance) properties
  5. Swoogle Statistics


Demo1: Find “Time” Ontology

We can use a set of keywords to search for an ontology. For example, “time”, “before”, and “after” are basic concepts for a “Time” ontology.

Usage of Terms in SWD

[Figure: usage of FOAF terms across documents. Recoverable labels: documents http://www.cs.umbc.edu/~finin/foaf.rdf, http://foo.com/foaf.rdf, and http://xmlns.com/foaf/1.0/; resources http://foo.com/foaf.rdf#finin, finin@umbc.edu, foaf:Person, foaf:mbox, wordNet:Agent, rdfs:Class, rdf:Property; relations rdf:type, foaf:mbox, rdfs:subClassOf, rdfs:domain; annotations: defined Class, populated Class, defined Property, populated Property, defined Individual.]


Demo2(a): Digest “Time” Ontology (term view)

[Screenshot. Terms shown include TimeZone, before, …, intAfter]

Demo2(b): Digest “Time” Ontology (document view)

Demo3: Find Term “Person”

Not capitalized! URIref is case sensitive!

Demo4: Digest Term “Person”

167 different properties

562 different properties

Demo5: Swoogle Statistics


Swoogle IR Search
  • This is work in progress, not yet fully integrated into Swoogle
  • Documents are put into an n-gram IR engine (after processing by Jena) in canonical XML form
    • Each contiguous sequence of N characters is used as an index term (e.g., N=5)
    • Queries are processed the same way
  • Character n-grams work almost as well as words but have some advantages
    • No tokenization is needed, so they work well with artificial languages and agglutinative languages

=> good for RDF!
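
As a small illustration of the indexing step, the sketch below extracts every contiguous run of N characters from a string and counts the runs. The function names are hypothetical, and a real engine would apply this to the canonical XML form of each document rather than to a short string.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 5) -> list:
    """Return every contiguous sequence of n characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def index_terms(text: str, n: int = 5) -> Counter:
    """Count n-gram occurrences, as an IR engine would for term frequencies."""
    return Counter(char_ngrams(text, n))

grams = char_ngrams("rdf:type")
# grams == ["rdf:t", "df:ty", "f:typ", ":type"]
```

Because no tokenizer is involved, punctuation such as `:` and `#` simply becomes part of the index terms.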

Why character n-grams?
  • Suppose we want to find ontologies for time
  • We might use the following query

“time temporal interval point before after during day month year eventually calendar clock duration end begin zone”

  • And have matches for documents with URIs like
    • http://foo.com/timeont.owl#timeInterval
    • http://foo.com/timeont.owl#CalendarClockInterval
    • http://purl.org/upper/temporal/t13.owl#timeThing
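
The matching claimed above can be checked with a simple overlap score: count the distinct 5-grams that the query and a URI share. This is a sketch only. Lower-casing both sides is an assumption (it is what lets “calendar” match `CalendarClockInterval`), the raw overlap count stands in for whatever ranking function a real engine uses, and the `pets.owl` URI is a hypothetical non-matching control.

```python
def ngram_set(text: str, n: int = 5) -> set:
    """Distinct lower-cased character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap(query: str, document: str, n: int = 5) -> int:
    """Number of distinct n-grams shared by the query and the document."""
    return len(ngram_set(query, n) & ngram_set(document, n))

QUERY = ("time temporal interval point before after during day month year "
         "eventually calendar clock duration end begin zone")

URIS = [
    "http://foo.com/timeont.owl#timeInterval",
    "http://foo.com/timeont.owl#CalendarClockInterval",
    "http://purl.org/upper/temporal/t13.owl#timeThing",
    "http://foo.com/pets.owl#Dog",  # hypothetical control, not from the slides
]

ranked = sorted(URIS, key=lambda uri: overlap(QUERY, uri), reverse=True)
```

Each of the three time-related URIs shares at least one 5-gram with the query (for example via “interval” or “temporal”), while the control URI shares none, even though no URI was ever split into words.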
Another approach: URIs as words
  • Remember: ontologies define vocabularies
  • In OWL, URIs of classes and properties are the words
  • So, take a SWD, reduce to triples, extract the URIs (with duplicates), discard URIs for blank nodes, hash each URI to a token (use MD5Hash), and index the document.
  • Process queries in the same way
  • Variation: include literal data (e.g., strings) too.
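
The steps above can be sketched as follows, assuming the document has already been reduced to triples written in an N-Triples-like notation (`<...>` for URIs, `_:` for blank nodes, quoted strings for literals). That notation, the function name, and the example triples are assumptions made for illustration; only the use of MD5 comes from the slide.

```python
import hashlib

def uri_tokens(triples, include_literals: bool = False) -> list:
    """Hash each URI (duplicates kept) to a token; blank nodes are discarded."""
    tokens = []
    for triple in triples:
        for term in triple:
            if term.startswith("<") and term.endswith(">"):
                value = term[1:-1]            # a URI
            elif include_literals and term.startswith('"'):
                value = term                  # the variation: index literals too
            else:
                continue                      # blank node, or literal not requested
            tokens.append(hashlib.md5(value.encode("utf-8")).hexdigest())
    return tokens

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
PERSON = "<http://xmlns.com/foaf/0.1/Person>"
MBOX = "<http://xmlns.com/foaf/0.1/mbox>"

triples = [
    ("<http://foo.com/foaf.rdf#finin>", RDF_TYPE, PERSON),
    ("_:b0", RDF_TYPE, PERSON),
    ("_:b0", MBOX, '"finin@umbc.edu"'),
]
tokens = uri_tokens(triples)
```

Keeping duplicates means a URI that is used often in a document contributes a higher term frequency, just as a repeated word would in a text document. Queries would be passed through the same function before being sent to the IR engine.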
What we have done
  • Developed Swoogle – a crawler-based retrieval system for SWDs
  • Developed and implemented a technique to get Google to index and retrieve SWDs
  • Prototyped (twice) an n-gram-based IR engine for SWDs
  • Explored the integration of inference and retrieval
  • Used these in several demonstration systems