Introduction to Information Retrieval (IR) - PowerPoint PPT Presentation - Albert_Lan
Presentation Transcript

Introduction to Information Retrieval (IR)

Mark Craven

craven@cs.wisc.edu

craven@biostat.wisc.edu

5730 Medical Sciences Center


Documents and Corpora

  • document: a passage of free text or hypertext

    • Usenet posting

    • Web page

    • newswire story

    • MEDLINE abstract

    • journal article

  • corpus (pl. corpora): a collection of documents

    • MEDLINE

    • Reuters stories from 1999

    • the Web


The Ad-Hoc Retrieval Problem

  • given:

    • a document collection (corpus)

    • an arbitrary query

  • do:

    • return a list of relevant documents

  • this is the problem addressed by Web search engines


Typical IR System

[Diagram: a spider and a document processor build an inverted index from documents; a query processor answers queries against the index.]


The Index and Inverse Index

  • index: a relation mapping each document to the set of keywords it is about

  • inverse index: the converse relation, mapping each keyword to the set of documents that are about it

  • where do these come from?
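The slides do not show the construction step; here is a minimal Python sketch of building an inverted index from a toy corpus (the documents and IDs below are illustrative, not from the slides):

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each word to the set of document IDs containing it."""
    inverted = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)
    return inverted

# toy corpus: document ID -> text
corpus = {
    1: "the hungry zebra",
    2: "an anteater and an aardvark",
    3: "a zebra and an anteater",
}
index = build_inverted_index(corpus)
```

A real system would add tokenization, stop-word handling, and stemming before indexing, as the following slides discuss.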


Inverted Index

[Diagram: an inverted index mapping words (aardvark, anteater, hungry, ..., zebra) to documents in the corpus.]


A Simple Boolean Query

[Diagram: the inverted index entries for aardvark, anteater, hungry, ..., zebra.]

  • to answer query “hungry AND zebra”, get intersection of documents pointed to by “hungry” and documents pointed to by “zebra”
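The intersection step can be sketched in Python (the posting sets here are made-up examples):

```python
def boolean_and(inverted_index, term1, term2):
    """Answer 'term1 AND term2' by intersecting the two posting sets."""
    return inverted_index.get(term1, set()) & inverted_index.get(term2, set())

# hypothetical postings: word -> set of document IDs
inverted_index = {
    "hungry": {1, 4, 7},
    "zebra": {1, 3, 7},
}
result = boolean_and(inverted_index, "hungry", "zebra")  # {1, 7}
```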


Other Things to Consider

  • How can we search on phrases?

  • Should we treat these queries differently?

    • “a hungry zebra”

    • “the hungry zebra”

    • “hungry as a zebra”

  • If we query on “laugh zebra” should we return documents containing the following?

    • “laughing zebra”

    • “laughable zebra”

  • Boolean queries are too coarse: they return too many or too few relevant documents.


Handling Phrases

[Diagram: a positional inverted index over aardvark, anteater, hungry, ..., zebra, with word positions (e.g. 25, 26, 38, 40, 95) stored in the postings.]

  • store position information in the inverted index

  • to answer query “hungry zebra”, look for documents having “hungry” at position i and “zebra” at position i + 1
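The adjacency check above can be sketched in Python; the position lists are illustrative, loosely reusing the numbers from the slide's diagram:

```python
def phrase_match(positions1, positions2):
    """True if some occurrence of word 2 immediately follows word 1."""
    following = set(positions2)
    return any(p + 1 in following for p in positions1)

# hypothetical positional postings for one document: word -> token positions
doc_positions = {
    "hungry": [25, 95],
    "zebra": [26, 40],
}
match = phrase_match(doc_positions["hungry"], doc_positions["zebra"])  # True: 25 -> 26
```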


Handling Phrases

  • but this is a primitive notion of phrase

    • we might want “zebras that are hungry” to be considered a match to the phrase “hungry zebra”

    • this requires sentence analysis: determining parts of speech for words, etc.


Stop Words

  • Should we treat these queries differently?

    • “a hungry zebra”

    • “the hungry zebra”

    • “hungry as a zebra”

  • Some systems employ a list of stop words (a.k.a. function words) that are probably not informative for most searches.

    • a, an, the, that, this, of, by, with, to …

    • stop words in a query are ignored

    • but might be handled differently in phrases
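A minimal sketch of stop-word filtering, using a tiny hand-picked stop list rather than a full one:

```python
# a small illustrative stop list
STOP_WORDS = {"a", "an", "the", "as", "of", "by", "with", "to"}

def remove_stop_words(query):
    """Drop stop words from a query before matching."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

remove_stop_words("a hungry zebra")     # ['hungry', 'zebra']
remove_stop_words("hungry as a zebra")  # ['hungry', 'zebra']
```

Note that all three example queries above reduce to the same terms, which is exactly why a system might still treat them differently when phrase matching matters.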


Stop Words

a, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, around, as, aside, ask, asking, associated, at, available, away, awfully, b, be, became, because, become, becomes, becoming, been, before, beforehand, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, ...


A Special Purpose Stop List

  • organism names: Bos taurus, Botrytis cinerea, C. elegans, Chicken, Goat, Gorilla, Guinea pig, Hamster, Human, Mouse, Pig, Rat, Spinach

  • generic sequence terms: unknown gene, cDNA, DNA clone, BAC, PAC, cosmid, clone, genomic sequence, potentially degraded


Stemming

  • If we query on “laugh zebra” should we return documents containing the following?

    • “laughing zebra”

    • “laughable zebra”

  • Some systems perform stemming on words; truncating related words to a common stem.

    • laugh → laugh-

    • laughs → laugh-

    • laughing → laugh-

    • laughed → laugh-


Stemming

  • the Lovins stemmer

    • 260 suffix patterns

    • iterative longest match procedure

    • example rules:

      (.*)SSES → $1SS

      (.*[AEIOU].*)ED → $1

  • the Porter stemmer

    • about 60 patterns grouped into sets

    • apply patterns in each set before moving to next
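A toy illustration of suffix-rule stemming in Python, implementing only the two example rules shown for the Lovins stemmer (real Lovins and Porter stemmers have far more rules plus ordering and measure conditions):

```python
import re

# the two example rewrite rules, as (pattern, replacement) pairs
RULES = [
    (re.compile(r"^(.*)sses$"), r"\1ss"),
    (re.compile(r"^(.*[aeiou].*)ed$"), r"\1"),
]

def stem(word):
    """Apply the first matching suffix rule, if any."""
    for pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word)
    return word
```

For example, "caresses" becomes "caress", "laughed" becomes "laugh", but "bled" is untouched because the stem left after removing "ed" contains no vowel.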


Stemming

  • May be helpful

    • reduces vocabulary size by 10-50%

    • may increase recall

  • May not be helpful

    • for some queries, the sense of a word is important

    • stemming algorithms are heuristic; may conflate semantically different words (e.g. gall and gallery)

  • As with stop words, might want to handle stemming differently in phrases


The Vector Space Model

[Diagram: documents d1, d2, d3 and a query q represented as vectors in space.]

  • Boolean queries are too coarse: they return too many or too few relevant documents.

  • Most IR systems are based on the vector space model


The Vector Space Model

  • documents/queries represented by vectors in a high-dimensional space

  • each dimension corresponds to a word in the vocabulary

  • most relevant documents are those whose vectors are closest to query vector


Vector Similarity

  • one way to determine vector similarity is the cosine measure: cos(d, q) = (d · q) / (|d| |q|)

  • if the vectors are normalized, we can simply take their dot product
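The cosine measure can be sketched directly, taking documents and queries as plain lists of weights:

```python
import math

def cosine_similarity(d, q):
    """cos(d, q) = (d . q) / (|d| |q|)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0; if the vectors are pre-normalized to unit length, the dot product alone gives the same answer.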


Determining Word Weights

  • lots of heuristics

  • one well-established heuristic is TFIDF (term frequency, inverse document frequency) weighting

    • the numerator includes the number of occurrences of the word in the document

    • the denominator includes the total number of occurrences of the word in the corpus


TFIDF: One Form

(N = total number of words in the corpus)
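The formula image did not survive extraction; one common form consistent with the bullets above is w = tf × log(N / cf), sketched here under that assumption:

```python
import math

def tfidf_weight(tf, cf, n):
    """One assumed TFIDF form: tf * log(n / cf), where
    tf: occurrences of the word in the document,
    cf: total occurrences of the word in the corpus,
    n:  total number of words in the corpus."""
    return tf * math.log(n / cf)
```

The log factor rewards rare words: a word occurring twice in a 1000-word corpus gets a much larger weight per occurrence than one occurring 500 times.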


The Probability Ranking Principle

  • most IR systems are based on the premise that ranking documents in order of decreasing probability of relevance is the right thing to do

  • assumes documents are independent

    • does wrong thing with duplicates

    • doesn’t promote diversity in returned documents