

Information Retrieval (3)

Prof. Dragomir R. Radev

radev@umich.edu


SI650 Winter 2010

5. Evaluation of IR systems

Reference collections

TREC

Relevance
  • Difficult to change: fuzzy, inconsistent
  • Methods: exhaustive, sampling, pooling, search-based
Contingency table

                 retrieved      not retrieved
relevant         w = tp         x = fn           n1 = w + x
not relevant     y = fp         z = tn
                 n2 = w + y                      N

Precision and Recall

Recall: R = w/(w+x)

Precision: P = w/(w+y)
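
As a minimal sketch, precision and recall can be computed directly from the contingency-table counts w, x, and y. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def precision_recall(w, x, y):
    """Return (precision, recall) from contingency-table counts.

    w: relevant and retrieved (tp)
    x: relevant but not retrieved (fn)
    y: retrieved but not relevant (fp)
    """
    precision = w / (w + y) if (w + y) else 0.0
    recall = w / (w + x) if (w + x) else 0.0
    return precision, recall


# Hypothetical counts: 10 documents retrieved, 6 of them relevant,
# and 4 relevant documents that the system missed.
p, r = precision_recall(w=6, x=4, y=4)
print(p, r)
```

The guards against empty denominators are a design choice of this sketch; some evaluations leave precision undefined when nothing is retrieved.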

Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by the search engine.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.


Interpolated average precision (e.g., 11pt)

Interpolation – what is precision at recall=0.5?
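
One way to answer the interpolation question is sketched below. It assumes the common convention that interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r. The ranking and the number of relevant documents are hypothetical.

```python
def interpolated_precision_11pt(ranking, total_relevant):
    """ranking: list of booleans, True where the document at that rank is relevant.

    Returns interpolated precision at recall 0.0, 0.1, ..., 1.0.
    """
    points = []  # (recall, precision) measured after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for level in (i / 10 for i in range(11)):
        # best precision at any recall >= this level (0.0 if never reached)
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return interpolated


# Hypothetical ranked list of 10 results; 4 relevant documents exist in total.
ranking = [True, False, True, False, False, True, False, False, True, False]
curve = interpolated_precision_11pt(ranking, total_relevant=4)
average_11pt = sum(curve) / len(curve)
print(curve[5])  # interpolated precision at recall = 0.5
```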

Issues
  • Why not use accuracy A=(w+z)/N?
  • Average precision
  • Average P at given “document cutoff values”
  • Report when P=R
  • F measure: F=(b^2+1)PR/(b^2 P+R)
  • F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
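
The F measure can be sketched as a small function, which also shows that b = 1 reduces to the harmonic mean of P and R. The function name and the sample values are illustrative assumptions.

```python
def f_measure(p, r, b=1.0):
    """F = (b^2 + 1) * P * R / (b^2 * P + R); returns 0.0 when P = R = 0."""
    if p == 0 and r == 0:
        return 0.0
    return (b * b + 1) * p * r / (b * b * p + r)


# Hypothetical system with P = 0.5 and R = 0.25.
f1 = f_measure(0.5, 0.25)
harmonic = 2 / (1 / 0.25 + 1 / 0.5)
f2 = f_measure(0.5, 0.25, b=2.0)  # b > 1 weights recall more heavily
print(f1, harmonic, f2)
```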
Kappa
  • N: number of items (index i)
  • n: number of categories (index j)
  • k: number of annotators
Kappa (cont’d)
  • P(A) = 370/400 = 0.925
  • P (-) = (10+20+70+70)/800 = 0.2125
  • P (+) = (10+20+300+300)/800 = 0.7875
  • P (E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665
  • K = (0.925-0.665)/(1-0.665) = 0.776
  • Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
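
The arithmetic above can be reproduced with a short sketch. The four cell counts (300 yes/yes, 70 no/no, 20 and 10 disagreements) are inferred from the numbers on the slide, since the judgment table itself is not in the transcript. The function name is hypothetical.

```python
def kappa_two_annotators(yes_yes, yes_no, no_yes, no_no):
    """Kappa for two annotators and two categories, using pooled marginals."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pooled marginals: each annotator contributes n judgments, 2n in total.
    p_pos = (2 * yes_yes + yes_no + no_yes) / (2 * n)
    p_neg = (2 * no_no + yes_no + no_yes) / (2 * n)
    p_expected = p_pos ** 2 + p_neg ** 2
    return (p_agree - p_expected) / (1 - p_expected)


# Cell counts inferred from the slide's arithmetic (an assumption).
k = kappa_two_annotators(yes_yes=300, yes_no=20, no_yes=10, no_no=70)
print(round(k, 3))
```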
Sample TREC query

<top>

<num> Number: 305

<title> Most Dangerous Vehicles

<desc> Description:

Which are the most crashworthy, and least crashworthy, passenger vehicles?

<narr> Narrative:

A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.

</top>

LA031689-0177

FT922-1008

LA090190-0126

LA101190-0218

LA082690-0158

LA112590-0109

FT944-136

LA020590-0119

FT944-5300

LA052190-0048

LA051689-0139

FT944-9371

LA032390-0172

LA042790-0172

LA021790-0136

LA092289-0167

LA111189-0013

LA120189-0179

LA020490-0021

LA122989-0063

LA091389-0119

LA072189-0048

FT944-15615

LA091589-0101

LA021289-0208

<DOC>

<DOCNO> LA031689-0177 </DOCNO>

<DOCID> 31701 </DOCID>

<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>

<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>

<LENGTH><P>586 words </P></LENGTH>

<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>

<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>

<TEXT>

<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>

<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>

<P>Several Fatalities </P>

<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>

<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>

...

</TEXT>

<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>

<SUBJECT>

<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P>

</SUBJECT>

</DOC>

TREC (cont’d)
  • http://trec.nist.gov/tracks.html
  • http://trec.nist.gov/presentations/presentations.html
Most used reference collections
  • Generic retrieval: OHSUMED, CRANFIELD, CACM
  • Text classification: Reuters, 20newsgroups
  • Question answering: TREC-QA
  • Web: DOTGOV, wt100g
  • Blogs: Buzzmetrics datasets
  • TREC ad hoc collections, 2-6 GB
  • TREC Web collections, 2-100GB
Comparing two systems
  • Comparing A and B
  • One query?
  • Average performance?
  • Need: A to consistently outperform B

[this slide: courtesy James Allan]

The sign test
  • Example 1:
    • A > B (12 times)
    • A = B (25 times)
    • A < B (3 times)
    • p ≈ 0.035 (two-sided, ignoring ties; significant at the 5% level)
  • Example 2:
    • A > B (18 times)
    • A < B (9 times)
    • p ≈ 0.122 (two-sided; not significant at the 5% level)
    • http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html

[this slide: courtesy James Allan]
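
The two examples can be checked with a minimal sketch of an exact two-sided sign test. It assumes that ties (A = B) are discarded and that wins follow a Binomial(n, 0.5) distribution under the null hypothesis. The function name is hypothetical.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test p-value; ties must already be discarded."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p1 = sign_test_p(12, 3)   # Example 1 (the 25 ties are ignored)
p2 = sign_test_p(18, 9)   # Example 2
print(round(p1, 4), round(p2, 4))
```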

Other tests
  • Student t-test: takes into account the actual performances, not just which system is better
    • http://www.fon.hum.uva.nl/Service/Statistics/Student_t_Test.html
    • http://www.socialresearchmethods.net/kb/stat_t.php
  • Wilcoxon Matched-Pairs Signed-Ranks Test
    • http://www.fon.hum.uva.nl/Service/Statistics/Signed_Rank_Test.html
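
A rough sketch of the test statistics behind both tests follows, using only the standard library. The per-query scores are hypothetical. Turning either statistic into a p-value still requires the t distribution or a signed-rank table, which this sketch does not attempt.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-query differences (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def wilcoxon_statistic(scores_a, scores_b):
    """Signed-rank statistic min(W+, W-); zero differences are dropped
    and tied absolute differences receive their average rank."""
    diffs = sorted((a - b for a, b in zip(scores_a, scores_b) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)


# Hypothetical average precision (in percent) on five queries.
system_a = [35, 42, 50, 61, 28]
system_b = [30, 45, 40, 55, 20]
t_stat = paired_t_statistic(system_a, system_b)
w_stat = wilcoxon_statistic(system_a, system_b)
print(round(t_stat, 2), w_stat)
```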

IR Winter 2010

6. Automated indexing/labeling

Compression

Indexing methods
  • Manual: e.g., Library of Congress subject headings, MeSH
  • Automatic: e.g., TF*IDF based
LOC subject headings

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

http://www.loc.gov/catdir/cpso/lcco/lcco.html

Medicine

CLASS R - MEDICINE

Subclass R

R5-920 Medicine (General)

R5-130.5 General works

R131-687 History of medicine. Medical expeditions

R690-697 Medicine as a profession. Physicians

R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.

R711-713.97 Directories

R722-722.32 Missionary medicine. Medical missionaries

R723-726 Medical philosophy. Medical ethics

R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying

R727-727.5 Medical personnel and the public. Physician and the public

R728-733 Practice of medicine. Medical practice economics

R735-854 Medical education. Medical schools. Research

R855-855.5 Medical technology

R856-857 Biomedical engineering. Electronics. Instrumentation

R858-859.7 Computer applications to medicine. Medical informatics

R864 Medical records

R895-920 Medical physics. Medical radiology. Nuclear medicine

Automatic methods
  • TF*IDF: pick terms with the highest TF*IDF scores
  • Centroid-based: pick terms that appear in the centroid with high scores
  • The maximal marginal relevance principle (MMR)
  • Related to summarization, snippet generation
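
The first method can be sketched as follows. This assumes one common TF*IDF variant (raw term frequency times log(N/df)); other weightings exist. The toy documents and the function name are hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=3):
    """docs: list of token lists. Returns the k highest-scoring terms of one document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: count each term once per document
    tf = Counter(docs[doc_index])
    scores = {term: tf[term] * math.log(n / df[term]) for term in tf}
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical three-document collection.
docs = [
    "bronco bronco rollover crash study".split(),
    "samurai crash study".split(),
    "jeep rollover crash".split(),
]
labels = top_tfidf_terms(docs, 0, k=2)
print(labels)
```

In this toy collection "crash" occurs in every document, so its IDF is zero and it is never selected as an index term.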
Compression
  • Methods
    • Fixed length codes
    • Huffman coding
    • Ziv-Lempel codes
Fixed length codes
  • Binary representations
    • ASCII
    • Representational power (2^k symbols where k is the number of bits)
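
Read the other way, the relation gives the number of bits a fixed-length code needs for a given alphabet size. A small sketch, with hypothetical alphabet sizes:

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols (at least 1 bit)."""
    return max(1, (num_symbols - 1).bit_length())


print(bits_needed(26))   # letters only
print(bits_needed(36))   # letters and digits
print(bits_needed(128))  # 7-bit ASCII
```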
Variable length codes
  • Alphabet:

A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..

  • Demo:
    • http://www.scphillips.com/morse/
Most frequent letters in English
  • Most frequent letters:
    • E T A O I N S H R D L U
  • Demo:
    • http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
  • Also: bigrams:
    • TH HE IN ER AN RE ND AT ON NT
Huffman coding
  • Developed by David Huffman (1952)
  • Average of 5 bits per character (37.5% compression relative to 8-bit characters)
  • Based on frequency distributions of symbols
  • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
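
The tree-building step can be sketched with a priority queue. This is one straightforward implementation, not necessarily the one used to draw the course example. The sample text and the tie-breaking order are assumptions, so the exact bit patterns may differ from other Huffman codes for the same frequencies.

```python
import heapq
from collections import Counter


def huffman_code(frequencies):
    """frequencies: dict symbol -> count. Returns dict symbol -> bit string."""
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "peter_piper_picked_a_peck_of_pickled_peppers"
code = huffman_code(Counter(text))
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits versus", 8 * len(text), "bits at 8 bits per character")
```

Frequent symbols such as "p" and "e" end up with codes no longer than those of rare symbols such as "s".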
[figure: Huffman code tree with leaves a, b, c, d, e, f, g, h, i, j and branches labeled 0 and 1]

Exercise
  • Consider the bit string: 01101101111000100110001110100111000110101101011101
  • Use the Huffman code from the example to decode it.
  • Try inserting, deleting, and switching some bits at random locations and try decoding.
Extensions
  • Word-based
  • Domain/genre dependent models
Ziv-Lempel coding
  • Two types - one is known as LZ77 (used in GZIP)
  • Code: set of triples <a,b,c>
  • a: how far back in the decoded text to look for the upcoming text segment
  • b: how many characters to copy
  • c: new character to add to complete segment
  • <0,0,p> p
  • <0,0,e> pe
  • <0,0,t> pet
  • <2,1,r> peter
  • <0,0,_> peter_
  • <6,1,i> peter_pi
  • <8,2,r> peter_piper
  • <6,3,c> peter_piper_pic
  • <0,0,k> peter_piper_pick
  • <7,1,d> peter_piper_picked
  • <7,1,a> peter_piper_picked_a
  • <9,2,e> peter_piper_picked_a_pe
  • <9,2,_> peter_piper_picked_a_peck_
  • <0,0,o> peter_piper_picked_a_peck_o
  • <0,0,f> peter_piper_picked_a_peck_of
  • <17,5,l> peter_piper_picked_a_peck_of_pickl
  • <12,1,d> peter_piper_picked_a_peck_of_pickled
  • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
  • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
  • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
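
The trace above can be replayed with a minimal decoder sketch for <a,b,c> triples. The function name is hypothetical, and the triples are copied from the example.

```python
def lz77_decode(triples):
    """Decode (back, length, char) triples into a string."""
    text = []
    for back, length, char in triples:
        start = len(text) - back
        # Copy one character at a time so a copy may overlap its own output.
        for i in range(length):
            text.append(text[start + i])
        text.append(char)
    return "".join(text)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
decoded = lz77_decode(triples)
print(decoded)
```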
Links on text compression
  • Data compression:
    • http://www.data-compression.info/
  • Calgary corpus:
    • http://links.uwaterloo.ca/calgary.corpus.html
  • Huffman coding:
    • http://www.compressconsult.com/huffman/
    • http://en.wikipedia.org/wiki/Huffman_coding
  • LZ
    • http://en.wikipedia.org/wiki/LZ77
100 alternative search engines
  • http://rss.slashdot.org/~r/Slashdot/slashdot/~3/83468703/article.pl
Readings
  • 2: MRS9
  • 3: MRS13, MRS14
  • 4: MRS15, MRS16