Information Retrieval

February 3, 2003

Handout #2

Course Information
  • Instructor: Dragomir R. Radev ([email protected])
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: M&F 11-12
  • Course page: http://tangra.si.umich.edu/~radev/650/
  • Class meets on Mondays, 1-4 PM in 409 West Hall
Queries
  • Single-word queries
  • Context queries
    • Phrases
    • Proximity
  • Boolean queries
  • Natural Language queries
Pattern matching
  • Words, prefixes, suffixes, substrings, ranges, regular expressions
  • Structured queries (e.g., XML)
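The word-level pattern types listed above can be sketched as simple filters over a vocabulary. This is only an illustration: the vocabulary and the function names are hypothetical, and a real system would typically use an index structure rather than a linear scan.

```python
import re

# Hypothetical toy vocabulary, used only for illustration.
VOCABULARY = ["compute", "computer", "computing", "recompute", "retrieval", "retrieve"]

def match_prefix(prefix):
    """Words that start with the given prefix."""
    return [w for w in VOCABULARY if w.startswith(prefix)]

def match_suffix(suffix):
    """Words that end with the given suffix."""
    return [w for w in VOCABULARY if w.endswith(suffix)]

def match_substring(sub):
    """Words that contain the given substring anywhere."""
    return [w for w in VOCABULARY if sub in w]

def match_range(low, high):
    """Words that fall lexicographically between low and high (inclusive)."""
    return [w for w in VOCABULARY if low <= w <= high]

def match_regex(pattern):
    """Words that match the regular expression in full."""
    compiled = re.compile(pattern)
    return [w for w in VOCABULARY if compiled.fullmatch(w)]

if __name__ == "__main__":
    print(match_prefix("comput"))
    print(match_regex(r"re.*e"))
```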
Relevance feedback
  • Query expansion
  • Term reweighting
  • Pseudo-relevance feedback
  • Latent semantic indexing
  • Distributional clustering
Document processing
  • Lexical analysis
  • Stopword elimination
  • Stemming
  • Index term identification
  • Thesauri
Porter’s algorithm
  1. The measure, m, of a stem is a function of sequences of vowels followed by sequences of consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
    • m=0: free, why
    • m=1: frees, whose
    • m=2: prologue, compute
  2. *<X> - stem ends with letter X
  3. *v* - stem contains a vowel
  4. *d - stem ends in a double consonant
  5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
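The measure m can be computed by reducing a stem to its pattern of C and V runs and counting the VC pairs. The following is a minimal sketch, not the reference implementation; the function names are hypothetical, and it assumes Porter's usual convention that "y" counts as a vowel when it follows a consonant.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if the letter at position i acts as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: 'y' is a consonant at the start of the word or
        # after a vowel, and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats (m) in the form [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # Collapse runs of the same kind into a single symbol.
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")

if __name__ == "__main__":
    for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
        print(word, measure(word))
```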
Porter’s algorithm
  • Suffix conditions take the form: current_suffix == pattern
  • Actions take the form: old_suffix -> new_suffix
  • Rules are divided into steps that define the order in which the rules are applied. The following are some examples of the rules:

    STEP  CONDITION            SUFFIX  REPLACEMENT    EXAMPLE
    1a    NULL                 sses    ss             stresses -> stress
    1b    *v*                  ing     NULL           making -> mak
    1b1   NULL                 at      ate            inflat(ed) -> inflate
    1c    *v*                  y       i              happy -> happi
    2     m>0                  aliti   al             formaliti -> formal
    3     m>0                  icate   ic             duplicate -> duplic
    4     m>1                  able    NULL           adjustable -> adjust
    5a    m>1                  e       NULL           inflate -> inflat
    5b    m>1 and *d and *<L>  NULL    single letter  controll -> control
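A few of the example rules can be expressed as (step, condition, suffix, replacement) tuples and applied one at a time. This is a hedged sketch covering only some of the rules listed on this slide, not a complete Porter stemmer; the names RULES and apply_rule are hypothetical, and rules 1b1 and 5b are omitted for brevity.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if the letter at position i acts as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Assumption: 'y' is a vowel only when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats (m) in the form [C](VC)^m[V]."""
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")

def contains_vowel(stem):
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (step, condition on the stem, suffix, replacement); NULL becomes "".
RULES = [
    ("1a", lambda s: True, "sses", "ss"),
    ("1b", contains_vowel, "ing", ""),
    ("1c", contains_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def apply_rule(word, step):
    """Apply the rule for one step; return the word unchanged if it does not fire."""
    for name, condition, suffix, replacement in RULES:
        if name == step and word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    return word

if __name__ == "__main__":
    print(apply_rule("stresses", "1a"))
    print(apply_rule("adjustable", "4"))
```

The condition is always tested on the stem that remains after the suffix is removed, which is why "inflate" loses its final "e" in step 5a: the stem "inflat" has m=2.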
Porter’s algorithm

Example: the word “duplicatable”

duplicat (rule 4)

duplicate (rule 1b1)

duplic (rule 3)

Another rule in step 4, which would remove “ic,” cannot be applied, since only one rule from each step may be applied.

Relevance feedback
  • Automatic
  • Manual
  • Method: identifying feedback terms

Q’ = a1Q + a2R - a3N

Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

Example
  • Q = “safety minivans”
  • D1 = “car safety minivans tests injury statistics” - relevant
  • D2 = “liability tests safety” - relevant
  • D3 = “car passengers injury reviews” - non-relevant
  • R = ?
  • N = ?
  • Q’ = ?
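One way to work through this example is to treat the query and documents as raw term-count vectors and apply Q’ = a1Q + a2R - a3N with a1 = 1, a2 = 1/|R| and a3 = 1/|N|. The sketch below makes several assumptions that the slide does not state: R and N are the summed vectors of the relevant and non-relevant documents, |R| and |N| are the numbers of documents in each set, and terms are weighted by raw counts. The function names are hypothetical.

```python
from collections import Counter

def to_vector(text):
    """Raw term-count vector for a whitespace-tokenized text (an assumption)."""
    return Counter(text.split())

def feedback_query(query, relevant_docs, nonrelevant_docs, a1=1.0):
    """Compute Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|."""
    q = to_vector(query)
    r = sum((to_vector(d) for d in relevant_docs), Counter())
    n = sum((to_vector(d) for d in nonrelevant_docs), Counter())
    a2 = 1.0 / len(relevant_docs) if relevant_docs else 0.0
    a3 = 1.0 / len(nonrelevant_docs) if nonrelevant_docs else 0.0
    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for missing terms, so every term can be combined directly.
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}

if __name__ == "__main__":
    new_query = feedback_query(
        "safety minivans",
        ["car safety minivans tests injury statistics", "liability tests safety"],
        ["car passengers injury reviews"],
    )
    for term, weight in sorted(new_query.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{weight:+.2f}")
```

Under these assumptions, terms that occur only in the non-relevant document (such as "passengers") receive negative weights; many systems clip such weights to zero, which this sketch does not do.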
Automatic query expansion
  • Thesaurus-based expansion
  • Distributional similarity-based expansion
WordNet and DistSim

wn reason -hypen - hypernyms

wn reason -synsn - synsets

wn reason -simsn - synonyms

wn reason -over - overview of senses

wn reason -famln - familiarity/polysemy

wn reason -grepn - compound nouns

/clair3/tools/relatedwords/relate reason

Related (substitutable) words

WordNet:

Book: publication, product, fact, dramatic composition, record

Computer: machine, expert, calculator, reckoner, figurer

Fruit: reproductive structure, consequence, product, bear

Politician: leader, schemer

Newspaper: press, publisher, product, paper, newsprint

Distributional clustering:

Book: autobiography, essay, biography, memoirs, novels

Computer: adobe, computing, computers, developed, hardware

Fruit: leafy, canned, fruits, flowers, grapes

Politician: activist, campaigner, politicians, intellectuals, journalist

Newspaper: daily, globe, newspapers, newsday, paper

Computing term salience
  • Term frequency (TF)
  • Document frequency (DF)
  • Inverse document frequency (IDF)
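The three quantities above can be sketched for a small collection as follows. The slides do not give a formula for IDF, so this sketch assumes the common definition idf(t) = log(N / df(t)) with the natural logarithm, where N is the number of documents; the function names and the toy collection (reused from the relevance feedback example) are illustrative only.

```python
import math
from collections import Counter

def term_frequencies(doc):
    """TF: how many times each term occurs in one document."""
    return Counter(doc.lower().split())

def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def inverse_document_frequencies(docs):
    """IDF, assuming idf(t) = ln(N / df(t))."""
    n_docs = len(docs)
    return {t: math.log(n_docs / df) for t, df in document_frequencies(docs).items()}

def tf_idf(doc, idf):
    """TF*IDF weight of each term in a document; unseen terms get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in term_frequencies(doc).items()}

if __name__ == "__main__":
    collection = [
        "car safety minivans tests injury statistics",
        "liability tests safety",
        "car passengers injury reviews",
    ]
    idf = inverse_document_frequencies(collection)
    for term, weight in sorted(tf_idf(collection[0], idf).items()):
        print(f"{term}\t{weight:.3f}")
```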
Scripts to compute tf and idf

cd /clair4/class/ir-w03/hw2

./tf.pl 053.txt | sort -nr +1 | more

./tfs.pl 053.txt | sort -nr +1 | more

./stem.pl reasonableness

./build-idf.pl

./idf.pl | sort -n +2 | more

Applications of TFIDF
  • Cosine similarity
  • Indexing
  • Clustering
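As an illustration of the first application, cosine similarity compares two weighted term vectors by the angle between them. The sketch below assumes vectors are stored as term-to-weight dictionaries (for example TF*IDF weights); the function name and the sample weights are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here: similarity with an empty vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    # Illustrative raw-count vectors for a query and a document.
    query = {"safety": 1.0, "minivans": 1.0}
    document = {"liability": 1.0, "tests": 1.0, "safety": 1.0}
    print(cosine_similarity(query, document))
```

Because the score is normalized by vector length, a long document is not favored merely for containing more terms.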