Query Broadening to improve IR

First we look at a method for Information Retrieval (IR) query broadening that requires input from the user.

Then we look at an automatic method for query broadening using a thesaurus.

By the end of the lecture you should understand what a thesaurus, a terminology bank and an ontology are, and how they are used to broaden queries.

Some issues to be resolved
Synonyms

football / soccer, tap / faucet: search for one, find both?

Homonyms

lead (metal or leash?), tap: find both, only want one?

Local/global contexts determine “good” terms

football articles: won’t mention the word ‘football’, but will give a particular meaning to the word ‘goal’

Precoordination (proximity query): multi-word terms

“Venetian blind” vs “blind Venetian”
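
The difference between a bag-of-words match and a precoordinated (phrase) match can be sketched as follows. This is only an illustration: the tokeniser is deliberately crude, and the function names and the sample document are hypothetical, not part of the lecture.

```python
def tokens(text):
    """Lowercase and split on whitespace (a deliberately crude tokeniser)."""
    return text.lower().split()


def bag_match(query, document):
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokens(document))
    return all(word in doc_words for word in tokens(query))


def phrase_match(query, document):
    """True if the query words occur adjacently and in the same order."""
    q, d = tokens(query), tokens(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical document text, for illustration only.
doc = "the blind venetian bought a new boat"

print(bag_match("Venetian blind", doc))     # True: both words are present
print(phrase_match("Venetian blind", doc))  # False: wrong order
print(phrase_match("blind Venetian", doc))  # True: adjacent and in order
```

A bag-of-words match cannot tell the two phrases apart; only the order-sensitive match does.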

Evaluation/Effectiveness measures
effort - required by the users in formulation of queries

time - between receipt of user query and production of list of ‘hits’

presentation - of the output

coverage - of the collection

recall - the fraction of relevant items retrieved

precision - the fraction of retrieved items that are relevant

user satisfaction – with the retrieved items
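
Recall and precision, as defined in the list above, can be computed directly from the set of retrieved items and the set of relevant items. The sketch below assumes both sets are known; the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers: two documents retrieved, two relevant overall.
retrieved = ["d1", "d4"]
relevant = ["d1", "d3"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.5
```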

Better hits: Query Broadening
A user who is unaware of the collection’s characteristics is likely to formulate a ‘naïve’ query

query broadening aims to replace the initial query with a new one featuring one or other of:

new index terms

adjusted term weights

One method uses feedback information from the user

Another method uses a thesaurus / term-bank / ontology

Relevance Feedback

From response to initial query, gather relevance information

H = set of all hits

HR = R = set of retrieved, relevant hits

HNR = H-R = set of retrieved, non-relevant hits

replace query q with replacement query q':

$$q' = \alpha\, q + \frac{\beta}{|HR|} \sum_{d_i \in HR} d_i - \frac{\gamma}{|HNR|} \sum_{d_i \in HNR} d_i$$

where α, β and γ are weights.

note: this moves the query vector closer to the centroid of the “relevant retrieved” document vectors and further from the centroid of the “non-relevant retrieved” documents.

Using terms from relevant documents
We expect documents that are similar to one another in meaning (or usefulness) to have similar index terms.

The system creates a replacement query (q’) based on q, but adds index terms that have been used to index known relevant documents, increases the relative weight of index terms in q that are also found in relevant documents, and reduces the weight of terms found in non-relevant documents.

How does this help?

It could help if documents were being missed because of the synonym problem. The user uses the word ‘jam’, but some recipes use ‘jelly’ instead. Once a hit that uses ‘jelly’ has been recognized as relevant, ‘jelly’ will appear in the next version of the query. Hits may now include recipes that use ‘jelly’ but not ‘jam’.

Conversely, it can help with the homonym problem. If the user wants references to ‘lead’ (the metal) and gets documents relating to dog-walking, then marking the dog-walking references as not relevant reduces the weight of key words associated with dog-walking.

Pros and cons of feedback

If γ is set to 0, non-relevant hits are ignored; this gives a positive feedback system, which is often preferred

the feedback formula can be applied repeatedly, asking user for relevance information at each iteration

relevance feedback is generally considered to be very effective for “high-use” systems

one drawback is that it is not fully automatic.
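
A positive-only feedback step, with non-relevant hits ignored, might look like the sketch below. It uses sparse term-weight dictionaries so that a new term such as ‘jelly’ can enter the query. The weights, the document and the default values of `alpha` and `beta` are assumptions chosen for illustration.

```python
def positive_feedback(query, relevant_docs, alpha=0.5, beta=0.5):
    """One positive-only feedback step on sparse term-weight dicts.

    Scales the old query by alpha and adds beta times the centroid of the
    documents the user marked as relevant. Non-relevant hits are ignored.
    """
    new_query = {term: alpha * weight for term, weight in query.items()}
    for doc in relevant_docs:
        for term, weight in doc.items():
            share = beta * weight / len(relevant_docs)
            new_query[term] = new_query.get(term, 0.0) + share
    return new_query


# Hypothetical weights, for illustration only.
query = {"jam": 1.0, "recipe": 0.5}
relevant = [{"jelly": 0.8, "recipe": 0.6, "sugar": 0.4}]

# The step can be repeated after each round of user judgements.
query = positive_feedback(query, relevant)
print(sorted(query))  # ['jam', 'jelly', 'recipe', 'sugar']
```

After one step the query contains ‘jelly’, so recipes that never mention ‘jam’ can now be matched.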

Simple feedback example:

T = {pudding, jam, traffic, lane, treacle}

d1 = (0.8, 0.8, 0.0, 0.0, 0.4) : Recipe for jam pudding

d2 = (0.0, 0.0, 0.9, 0.8, 0.0) : DoT report on traffic lanes

d3 = (0.8, 0.0, 0.0, 0.0, 0.8) : Recipe for treacle pudding

d4 = (0.6, 0.9, 0.5, 0.6, 0.0) : Radio item on traffic jam in Pudding Lane

Display first 2 documents that match the following query:

q = (1.0, 0.6, 0.0, 0.0, 0.0)

Retrieved documents are:

d1 : Recipe for jam pudding (relevant)

d4 : Radio item on traffic jam (not relevant)

r = (0.91, 0.0, 0.6, 0.73)
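
The ranking in this example can be reproduced with a few lines of code. The slide does not name the similarity measure; cosine similarity is assumed here because it agrees with the values of r to within 0.01.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Term order: pudding, jam, traffic, lane, treacle
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.4),
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),
    "d3": (0.8, 0.0, 0.0, 0.0, 0.8),
    "d4": (0.6, 0.9, 0.5, 0.6, 0.0),
}
q = (1.0, 0.6, 0.0, 0.0, 0.0)

scores = {name: cosine(q, vec) for name, vec in docs.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]

for name, score in scores.items():
    print(name, f"{score:.3f}")
print(top2)  # ['d1', 'd4']
```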

Positive and Negative Feedback

Suppose we set α and β to 0.5, γ to 0.2

$$
\begin{aligned}
q' &= \alpha\, q + \frac{\beta}{|HR|} \sum_{d_i \in HR} d_i - \frac{\gamma}{|HNR|} \sum_{d_i \in HNR} d_i \\
   &= 0.5\, q + 0.5\, d_1 - 0.2\, d_4 \\
   &= 0.5 \times (1.0, 0.6, 0.0, 0.0, 0.0) \\
   &\quad + 0.5 \times (0.8, 0.8, 0.0, 0.0, 0.4) \\
   &\quad - 0.2 \times (0.6, 0.9, 0.5, 0.6, 0.0) \\
   &= (0.78, 0.52, -0.1, -0.12, 0.2)
\end{aligned}
$$

(Note |HR| = 1 and |HNR| = 1)
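
The same arithmetic can be checked in code. The function below is a minimal sketch of the feedback formula for dense vectors; its name and signature are illustrative rather than taken from any library.

```python
def rocchio(q, relevant, non_relevant, alpha, beta, gamma):
    """Return alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""

    def centroid(vectors):
        # An empty set contributes nothing to the new query.
        if not vectors:
            return [0.0] * len(q)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    pos, neg = centroid(relevant), centroid(non_relevant)
    return [alpha * qi + beta * p - gamma * n for qi, p, n in zip(q, pos, neg)]


# Term order: pudding, jam, traffic, lane, treacle
q = [1.0, 0.6, 0.0, 0.0, 0.0]
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]  # marked relevant
d4 = [0.6, 0.9, 0.5, 0.6, 0.0]  # marked not relevant

q_new = rocchio(q, [d1], [d4], alpha=0.5, beta=0.5, gamma=0.2)
print([round(w, 2) for w in q_new])  # [0.78, 0.52, -0.1, -0.12, 0.2]
```

The negative weights on ‘traffic’ and ‘lane’ are what push the traffic report down the ranking on the next retrieval.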

Simple feedback example, continued (T and d1 to d4 as before):

Display first 2 documents that match the following query:

q’ = (0.78, 0.52, -0.1, -0.12, 0.2)

Retrieved documents are:

d1 : Recipe for jam pudding (relevant)

d3 : Recipe for treacle pudding (relevant)

r’ = (0.96, 0.0, 0.86, 0.63)

Thesaurus
a thesaurus or ontology may contain

controlled vocabulary of terms or phrases describing a specific restricted topic,

synonym classes,

hierarchy defining broader terms (hypernyms) and narrower terms (hyponyms)

classes of ‘related’ terms.

a thesaurus or ontology may be:

generic (such as Roget’s thesaurus, or WordNet)

specific to a certain domain of knowledge, e.g. medicine

Language normalisation

by replacing words from documents and query words with synonyms from a controlled language, we can improve precision and recall:

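[Diagram: content analysis of the documents yields uncontrolled keywords, which the thesaurus maps to index terms; the thesaurus also maps the user query to a normalised query, which is then matched against the index terms.]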
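
Normalisation against a controlled language can be sketched as a simple lookup. The mapping below is a hypothetical miniature thesaurus built from synonym pairs mentioned in this lecture; the choice of preferred term in each class is an assumption.

```python
# Hypothetical synonym classes: each variant maps to one preferred term.
CONTROLLED = {
    "data processor": "computer",
    "electronic computer": "computer",
    "information processing system": "computer",
    "soccer": "football",
    "faucet": "tap",
}


def normalise(terms):
    """Replace each term by its controlled-language form, if it has one."""
    return {CONTROLLED.get(term.lower(), term.lower()) for term in terms}


# Document keywords and query words use different surface forms...
index_terms = normalise(["Data processor", "faucet"])
query_terms = normalise(["electronic computer"])

# ...but meet on the same controlled term after normalisation.
print(index_terms & query_terms)  # {'computer'}
```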

Thesaurus / Ontology construction
Include terms likely to be of value in content analysis

for each term, form classes of related words

(separate classes for synonyms, hypernyms, hyponyms)

form separate classes for each relevant meaning of the word

terms in a class should occur with roughly equal frequency (not easy: word frequencies in natural language follow Zipf’s law)

avoid high-frequency terms

construction involves some expert judgment that will not be easy to automate.

Example thesaurus

A public-domain thesaurus (WORDNET) is available from:

http://www.cogsci.princeton.edu/~wn/

/home/cserv1_a/staff/nlplib/WordNet/2.0

/home/cserv1_a/staff/extras/nltk/1.4.2/corpora/wordnet

synonyms (sense 1):

data processor

electronic computer

computer

information processing system


synonyms (sense 2):

estimator

calculator

computer

reckoner

figurer

Terminology (from WordNet Help)

Hypernym is the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.

Hyponym is the generic term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.

Coordinate words are words that have the same hypernym.

Hypernym synsets are preceded by "->", and hyponym synsets are preceded by "=>".

Hypernyms

Sense 1

computer, data processor, electronic computer, information processing system
-> machine
  -> device
    -> instrumentality, instrumentation
      -> artifact, artefact
        -> object, physical object
          -> entity, something

Hyponyms

Sense 1

computer, data processor, electronic computer, information processing system
=> analog computer, analogue computer
=> digital computer
=> node, client, guest
=> number cruncher
=> pari-mutuel machine, totalizer, totaliser, totalizator, totalisator
=> server, host

Coordinate terms

Sense 1

computer, data processor, electronic computer, information processing system
-> machine
   => assembly
   => calculator, calculating machine
   => calendar
   => cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM
   => computer, data processor, electronic computer, information processing system
   => concrete mixer, cement mixer
   => corker
   => cotton gin, gin
   => decoder
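
The three relations can be modelled with one small table of direct hypernyms. The sketch below copies a few entries from the Sense 1 listings above; it keeps one label per synset and assumes each term has a single hypernym, which is a simplification of WordNet, and it does not call WordNet itself.

```python
# Each term maps to its direct hypernym (a subset of the Sense 1 listings).
HYPERNYM = {
    "computer": "machine",
    "machine": "device",
    "device": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
    "analog computer": "computer",
    "digital computer": "computer",
    "node": "computer",
    "server": "computer",
    "calculator": "machine",
    "cash machine": "machine",
    "decoder": "machine",
}


def hypernym_chain(term):
    """Follow hypernym links upwards until the top of the hierarchy."""
    chain = []
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain


def hyponyms(term):
    """Terms whose direct hypernym is the given term."""
    return sorted(t for t, parent in HYPERNYM.items() if parent == term)


def coordinates(term):
    """Other terms that share the given term's direct hypernym."""
    parent = HYPERNYM.get(term)
    if parent is None:
        return []
    return [t for t in hyponyms(parent) if t != term]


print(hypernym_chain("computer"))
# ['machine', 'device', 'instrumentality', 'artifact', 'object', 'entity']
print(hyponyms("computer"))
# ['analog computer', 'digital computer', 'node', 'server']
print(coordinates("computer"))
# ['calculator', 'cash machine', 'decoder']
```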

Thesaurus use

replace term in document and/or query with term in controlled language

replace term in query with related or broader term to increase recall

suggest to user narrower terms to increase precision

[Diagram (thesaurus link S): the document term <data processor> and the query term <electronic computer> are both mapped by the thesaurus to computer (sense 1), so the document and the query match.]

Thesaurus use (continued)

[Diagram (thesaurus link B): the thesaurus replaces the query <node (sense 6)> with the broader query <computer (sense 1)>; each query is matched against the whole collection.]

Thesaurus use (continued)

[Diagram (thesaurus link N): starting from the query <computer (sense 1)>, the thesaurus offers narrower terms to the user, who chooses the query "client"; each query is matched against the whole collection.]
Key points

A thesaurus or ontology can be used to normalise the vocabulary of queries (and perhaps of documents).

It can be used (with some human intervention) to increase recall and precision.

A generic thesaurus/ontology may not be effective for specialized collections and/or queries.

Semi-automatic construction of a thesaurus/ontology based on the retrieved set of documents has produced some promising results.