Information Retrieval

February 3, 2003

Handout #2


Course Information

  • Instructor: Dragomir R. Radev ([email protected])

  • Office: 3080, West Hall Connector

  • Phone: (734) 615-5225

  • Office hours: M&F 11-12

  • Course page: http://tangra.si.umich.edu/~radev/650/

  • Class meets on Mondays, 1-4 PM in 409 West Hall


Queries and documents


Queries

  • Single-word queries

  • Context queries

    • Phrases

    • Proximity

  • Boolean queries

  • Natural Language queries
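
The query types above can be illustrated against a small positional inverted index. This is a minimal sketch rather than the course's implementation: the function names (`build_index`, `docs_with`, `near`) are hypothetical, whitespace tokenization is assumed, and the three short documents are borrowed from the relevance feedback example in this handout. Natural language queries are not covered.

```python
from collections import defaultdict

# Toy collection (assumption: whitespace-tokenized, already lowercased).
DOCS = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}

def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][doc_id].append(pos)
    return index

def docs_with(index, term):
    """Single-word query: the set of documents containing the term."""
    return set(index.get(term, {}))

def near(index, t1, t2, k):
    """Context query: t2 occurs at most k positions after t1.
    With k=1 this is the phrase query "t1 t2"."""
    hits = set()
    for doc_id in docs_with(index, t1) & docs_with(index, t2):
        first, second = index[t1][doc_id], index[t2][doc_id]
        if any(0 < b - a <= k for a in first for b in second):
            hits.add(doc_id)
    return hits

index = build_index(DOCS)
single = docs_with(index, "safety")
boolean = docs_with(index, "car") - docs_with(index, "safety")  # car AND NOT safety
phrase = near(index, "safety", "minivans", 1)
proximity = near(index, "car", "injury", 2)
```

Boolean operators map directly onto set operations here: AND is intersection, OR is union, and AND NOT is set difference.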


Pattern matching

  • Words, prefixes, suffixes, substrings, ranges, regular expressions

  • Structured queries (e.g., XML)
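
As a rough illustration of the pattern types listed above, the sketch below matches each kind of pattern against a sorted vocabulary. The vocabulary and the variable names are assumptions made for the example; structured (XML) queries are not covered.

```python
import bisect
import re

# Hypothetical sorted vocabulary of index terms.
VOCAB = sorted([
    "compute", "computer", "computers", "computing",
    "newsday", "newspaper", "newspapers", "paper",
])

# Prefix, suffix, and substring patterns are plain string tests.
prefix_hits = [w for w in VOCAB if w.startswith("comput")]
suffix_hits = [w for w in VOCAB if w.endswith("paper")]
substring_hits = [w for w in VOCAB if "news" in w]

# Range pattern: all words w with "n" <= w < "o" in lexicographic order.
# Binary search works because the vocabulary is sorted.
lo = bisect.bisect_left(VOCAB, "n")
hi = bisect.bisect_left(VOCAB, "o")
range_hits = VOCAB[lo:hi]

# Regular expression pattern, matched against the whole word.
pattern = re.compile(r"news(paper|day)s?")
regex_hits = [w for w in VOCAB if pattern.fullmatch(w)]
```

Prefix and range patterns can exploit the sorted order of the vocabulary, while suffix, substring, and regular expression patterns in this sketch scan every word.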


Relevance feedback

  • Query expansion

  • Term reweighting

  • Pseudo-relevance feedback

  • Latent semantic indexing

  • Distributional clustering


Document processing

  • Lexical analysis

  • Stopword elimination

  • Stemming

  • Index term identification

  • Thesauri
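
The first few steps above can be chained into a small pipeline. The sketch below is illustrative only: the stopword list is a tiny made-up sample, the tokenizer keeps runs of letters and digits, and stemming is left as a pluggable function (identity by default). Thesaurus lookup is not shown.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lexical analysis: lowercase the text and keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text, stem=lambda word: word):
    """Tokenize, drop stopwords, then stem what remains."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOPWORDS]

terms = index_terms("The safety of minivans is tested.")
```

Any stemmer with a `str -> str` signature can be passed as `stem`.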


Porter’s algorithm

  • 1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form

        [C](VC)^m[V]

    where the initial C and final V are optional and m is the number of VC repeats.

        m=0: free, why
        m=1: frees, whose
        m=2: prologue, compute

  • 2. *<X> - stem ends with letter X

  • 3. *v* - stem contains a vowel

  • 4. *d - stem ends in a double consonant

  • 5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
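
The measure can be computed by counting the places where a run of vowels is followed by a consonant. The sketch below assumes Porter's usual convention that "y" counts as a vowel when it follows a consonant (which is what makes "why" come out as m=0); the function names are hypothetical.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    """a, e, i, o, u are vowels; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        vowel = is_vowel(stem, i)
        if prev_was_vowel and not vowel:
            m += 1  # a vowel run just ended in a consonant: one more VC
        prev_was_vowel = vowel
    return m

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```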


Porter’s algorithm

  • Suffix conditions take the form current_suffix == pattern.

  • Actions take the form old_suffix -> new_suffix.

  • Rules are divided into steps that define the order in which they are applied. Some example rules:

    STEP  CONDITION             SUFFIX  REPLACEMENT    EXAMPLE
    1a    NULL                  sses    ss             stresses -> stress
    1b    *v*                   ing     NULL           making -> mak
    1b1   NULL                  at      ate            inflat(ed) -> inflate
    1c    *v*                   y       i              happy -> happi
    2     m>0                   aliti   al             formaliti -> formal
    3     m>0                   icate   ic             duplicate -> duplic
    4     m>1                   able    NULL           adjustable -> adjust
    5a    m>1                   e       NULL           inflate -> inflat
    5b    m>1 and *d and *<L>   NULL    single letter  controll -> control
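
A few of the example rules can be expressed as data and applied with one small function. This is a partial sketch, not a full Porter stemmer: it covers only steps 1a, 1b, 1c, 2, 3, 4, and 5a from the examples, omits 1b1 and 5b, reads *v* as "the stem contains a vowel", and treats "y" after a consonant as a vowel. All names are hypothetical.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Number of vowel-run-then-consonant transitions (the measure m)."""
    m, prev = 0, False
    for i in range(len(stem)):
        vowel = is_vowel(stem, i)
        if prev and not vowel:
            m += 1
        prev = vowel
    return m

def contains_vowel(stem):
    return any(is_vowel(stem, i) for i in range(len(stem)))

# (step, condition on the remaining stem, old suffix, new suffix)
RULES = [
    ("1a", lambda s: True, "sses", "ss"),
    ("1b", contains_vowel, "ing", ""),
    ("1c", contains_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def apply_step(word, step):
    """Apply the first rule of the given step whose suffix and condition match."""
    for name, condition, old, new in RULES:
        if name == step and word.endswith(old):
            stem = word[: len(word) - len(old)]
            if condition(stem):
                return stem + new
    return word  # no rule in this step fired

print(apply_step("adjustable", "4"))
print(apply_step("table", "4"))
```

The condition is tested on the stem that remains after the suffix is removed, which is why "table" is left unchanged by the step 4 rule: its remaining stem "t" has m=0.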


Porter’s algorithm

Example: the word “duplicatable”

    duplicat     rule 4
    duplicate    rule 1b1
    duplic       rule 3

Another rule in step 4, which removes “ic,” cannot be applied, because only one rule from each step may be applied.



Relevance feedback

  • Automatic

  • Manual

  • Method: identifying feedback terms

    Q’ = a1Q + a2R - a3N

    Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|


Example

  • Q = “safety minivans”

  • D1 = “car safety minivans tests injury statistics” - relevant

  • D2 = “liability tests safety” - relevant

  • D3 = “car passengers injury reviews” - non-relevant

  • R = ?

  • N = ?

  • Q’ = ?
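
One way to work this example through is sketched below. It rests on several assumptions that the slides do not spell out: documents and the query are represented as raw term-count vectors, R and N in the formula are the sums of the relevant and non-relevant document vectors, |R| and |N| are the numbers of documents in each set, and D3 forms the non-relevant set. The function name `feedback_query` is hypothetical.

```python
from collections import Counter

def feedback_query(query, relevant, nonrelevant, a1=1.0):
    """Compute Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|."""
    Q = Counter(query.split())
    R, N = Counter(), Counter()
    for doc in relevant:
        R.update(doc.split())   # sum of relevant document vectors
    for doc in nonrelevant:
        N.update(doc.split())   # sum of non-relevant document vectors
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(Q) | set(R) | set(N)
    # Counter returns 0 for terms it has not seen.
    return {t: a1 * Q[t] + a2 * R[t] - a3 * N[t] for t in terms}

new_query = feedback_query(
    "safety minivans",
    relevant=[
        "car safety minivans tests injury statistics",  # D1
        "liability tests safety",                       # D2
    ],
    nonrelevant=["car passengers injury reviews"],      # D3
)
for term, weight in sorted(new_query.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends up with the largest weight (2.0), and terms that appear only in the non-relevant document get negative weights. Systems often clip negative weights to zero; that step is left out here.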


Automatic query expansion

  • Thesaurus-based expansion

  • Distributional similarity-based expansion


WordNet and DistSim

wn reason -hypen - hypernyms

wn reason -synsn - synsets

wn reason -simsn - synonyms

wn reason -over - overview of senses

wn reason -famln - familiarity/polysemy

wn reason -grepn - compound nouns

/clair3/tools/relatedwords/relate reason


Related (substitutable) words

WordNet:

Book: publication, product, fact, dramatic composition, record

Computer: machine, expert, calculator, reckoner, figurer

Fruit: reproductive structure, consequence, product, bear

Politician: leader, schemer

Newspaper: press, publisher, product, paper, newsprint

Distributional clustering:

Book: autobiography, essay, biography, memoirs, novels

Computer: adobe, computing, computers, developed, hardware

Fruit: leafy, canned, fruits, flowers, grapes

Politician: activist, campaigner, politicians, intellectuals, journalist

Newspaper: daily, globe, newspapers, newsday, paper
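
Related-word lists like these can drive a simple form of automatic query expansion. The sketch below is only an illustration: it copies two of the distributional lists above into a dictionary, adds the top k related words for each query term, and gives them a reduced weight. The function name `expand`, the cutoff k=2, and the weight 0.5 are all arbitrary assumptions.

```python
# Two of the distributional-clustering lists above, keyed by lowercase term.
RELATED = {
    "book": ["autobiography", "essay", "biography", "memoirs", "novels"],
    "newspaper": ["daily", "globe", "newspapers", "newsday", "paper"],
}

def expand(query_terms, related, k=2, weight=0.5):
    """Return {term: weight}, adding up to k related words per query term."""
    expanded = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for candidate in related.get(term, [])[:k]:
            # Never overwrite the weight of an original query term.
            expanded.setdefault(candidate, weight)
    return expanded

expanded_query = expand(["book", "review"], RELATED)
print(expanded_query)
```

Terms with no entry in the table, such as "review" here, are simply left unexpanded.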


Indexing and searching


Computing term salience

  • Term frequency (TF)

  • Document frequency (DF)

  • Inverse document frequency (IDF)
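
The three quantities can be sketched in a few lines. The example assumes one common definition of IDF, log(N / df) with N the number of documents; other variants exist, and the slides do not say which one the course uses. The function names and the toy collection are illustrative only.

```python
import math
from collections import Counter

# Toy collection of tokenized documents (assumption for the example).
DOCS = [
    "car safety minivans tests injury statistics".split(),
    "liability tests safety".split(),
    "car passengers injury reviews".split(),
]

def term_frequencies(tokens):
    """TF: how many times each term occurs in one document."""
    return Counter(tokens)

def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def inverse_document_frequencies(docs):
    """IDF, assuming idf(t) = log(N / df(t))."""
    n = len(docs)
    return {t: math.log(n / d) for t, d in document_frequencies(docs).items()}

df = document_frequencies(DOCS)
idf = inverse_document_frequencies(DOCS)
print(df["safety"], round(idf["safety"], 3), round(idf["minivans"], 3))
```

A term that occurs in fewer documents gets a higher IDF: "minivans" (one document) scores higher than "safety" (two documents).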


Scripts to compute tf and idf

cd /clair4/class/ir-w03/hw2

./tf.pl 053.txt | sort -nr -k 2 | more

./tfs.pl 053.txt | sort -nr -k 2 | more

./stem.pl reasonableness

./build-idf.pl

./idf.pl | sort -n -k 3 | more


Applications of TFIDF

  • Cosine similarity

  • Indexing

  • Clustering
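
As an illustration of the first application, the sketch below builds TF*IDF vectors for three short documents and compares them with cosine similarity. It assumes raw term counts for TF and idf(t) = log(N / df(t)) for IDF; the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

DOCS = {
    "D1": "car safety minivans tests injury statistics".split(),
    "D2": "liability tests safety".split(),
    "D3": "car passengers injury reviews".split(),
}

def idf_table(docs):
    """idf(t) = log(N / df(t)), one common variant (an assumption here)."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return {t: math.log(len(docs) / d) for t, d in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse vector {term: tf * idf}."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

idf = idf_table(DOCS)
vectors = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in DOCS.items()}
print(cosine(vectors["D1"], vectors["D2"]))
print(cosine(vectors["D2"], vectors["D3"]))
```

D2 and D3 share no terms, so their similarity is zero. The same pairwise similarities could serve as input to a clustering algorithm.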

