
# CS 430: Information Discovery

CS 430: Information Discovery. Lecture 9 Extending the Boolean Model. Course Administration. Query languages. How would you formulate the following? What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland, on December 21, 1988?


Lecture 9

Extending the Boolean Model

How would you formulate the following?

What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland, on December 21, 1988?

Documents describing any charges, claims, or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.

1. Documents containing both "information" and "retrieval"

information and retrieval

2. Documents containing "information" or "retrieval" or both

information or retrieval

3. Documents containing "information" or "retrieval" but not both

(information or retrieval) and not (information and retrieval)
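The three queries above can be sketched as set operations over an inverted index. This is a minimal illustration, not the lecture's implementation: the `index` contents and the `postings` helper are made up for the example.

```python
# Toy inverted index: term -> set of document ids (illustrative data only).
index = {
    "information": {1, 2, 3},
    "retrieval": {2, 3, 4},
}

def postings(term):
    """Return the set of documents containing `term` (empty if the term is unseen)."""
    return index.get(term, set())

# information and retrieval
both = postings("information") & postings("retrieval")

# information or retrieval
either = postings("information") | postings("retrieval")

# (information or retrieval) and not (information and retrieval)
exactly_one = either - both
```

Each Boolean operator maps onto one set operation, which is why the result is a plain set with no ordering among the matching documents.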

1. Documents containing phrase "information retrieval"

information adj retrieval

2. Documents containing "information" and "retrieval" within four words of each other

information near 4 retrieval

By convention, stop words and punctuation are ignored.

swan adj 41 matches "John Swan, 41 Main Street."

information adj retrieval matches " ... information on retrieval methods ..."
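The proximity operators above might be sketched as follows. Several choices here are assumptions made for illustration: the tiny `STOP_WORDS` list, counting positions after stop words are removed, and reading "within k words" as a position difference of at most k.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "on", "the"}  # tiny illustrative list

def tokens(text):
    """Lower-cased words and numbers, with punctuation and stop words dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def near(text, a, b, k):
    """True if a and b occur within k positions of each other, in either order."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

def adj(text, a, b):
    """True if a is immediately followed by b once stop words are removed."""
    toks = tokens(text)
    return any(x == a and y == b for x, y in zip(toks, toks[1:]))
```

With these definitions, `adj("John Swan, 41 Main Street.", "swan", "41")` and `adj("information on retrieval methods", "information", "retrieval")` both hold, mirroring the two matches listed above.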

Prefix:

"comp?" matches any word beginning "comp"

Suffix:

"?tal" matches any word ending "tal"

Ranges:

"1920...1925" matches any number between 1920 and 1925

Regular expression:

A pattern built up by simple strings (which are matched as substrings) and operators

Union: If e1 and e2 are regular expressions, then (e1 | e2) matches whatever matches e1 or e2.

Concatenation: If e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 followed immediately by e2.

Repetition: If e is a regular expression, then e* matches a sequence of zero or more contiguous occurrences of e.

(wild card) matches "wildcard"

travel l* ed matches "traveled" or "travelled", but not "traveed"

192 (0 | 1 | 2 | 3 | 4 | 5) matches any string in the range "1920" to "1925"
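The truncation and regular-expression examples above can be tried out with Python's standard `re` module, which uses the same `|` and `*` operators. The translation of "comp?" and "?tal" into `\w*` patterns is an assumption made for this sketch, and `matches` is a hypothetical helper.

```python
import re

# Patterns from the slides, written in Python's `re` syntax.
concatenation = re.compile(r"(wild)(card)")   # (wild card)
repetition = re.compile(r"travell*ed")        # travel l* ed
union = re.compile(r"192(0|1|2|3|4|5)")       # 192 (0 | 1 | 2 | 3 | 4 | 5)

# Truncation expressed as regular expressions (assumed translation).
prefix = re.compile(r"comp\w*")               # "comp?"
suffix = re.compile(r"\w*tal")                # "?tal"

def matches(pattern, word):
    """True if the whole word matches the compiled pattern."""
    return pattern.fullmatch(word) is not None
```

Note that `travell*ed` also matches "travellled", since `*` allows any number of repetitions of the preceding `l`.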

Counter-intuitive results:

Query q = A and B and C and D and E

Document d has terms A, B, C and D, but not E

Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.

Query q = A or B or C or D or E

Document d1 has terms A, B, C, D and E

Document d2 has term A, but not B, C, D or E

Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
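A small sketch makes the two counter-intuitive cases concrete. The `overlap` function is not part of the Boolean model; it is an illustrative measure (fraction of query terms present) added here only to show what the strict Boolean test ignores.

```python
def boolean_and(query_terms, doc_terms):
    """Strict Boolean AND: every query term must be present."""
    return set(query_terms) <= set(doc_terms)

def boolean_or(query_terms, doc_terms):
    """Strict Boolean OR: at least one query term must be present."""
    return bool(set(query_terms) & set(doc_terms))

def overlap(query_terms, doc_terms):
    """Illustrative only: fraction of query terms found in the document."""
    query = set(query_terms)
    return len(query & set(doc_terms)) / len(query)

q = ["A", "B", "C", "D", "E"]
d = ["A", "B", "C", "D"]          # rejected by AND despite matching 4 of 5 terms
d1 = ["A", "B", "C", "D", "E"]    # accepted by OR
d2 = ["A"]                        # accepted by OR on equal terms with d1
```

Here `boolean_and(q, d)` is false although `overlap(q, d)` is 0.8, and `boolean_or` returns the same answer for `d1` and `d2` although their overlaps are 1.0 and 0.2.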

Boolean is all or nothing

• Boolean model has no way to rank documents.

• Boolean model allows for no uncertainty in assigning index terms to documents.

• The Boolean model has no provision for assigning weights to the importance of query terms.

d and q are either in the set A or not in A. There is no halfway!

[Figure: query q and document d in relation to set A]

Term weighting

• Give weights to terms in documents and/or queries.

• Combine standard Boolean retrieval with vector ranking of results

Fuzzy sets

• Relax the boundaries of the sets used in Boolean retrieval

SIRE (Syracuse Information Retrieval Experiment)

Term weights

• Add term weights to documents

Weights calculated by the standard method of term frequency * inverse document frequency.
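One common variant of the term frequency * inverse document frequency weighting is sketched below. The exact formula used by SIRE is not given in the slides, so the choices here (raw counts for term frequency, `log(N / df)` for inverse document frequency) are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["boolean", "retrieval", "retrieval"],
    ["boolean", "model"],
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "boolean" in this toy collection, gets weight 0 because it does nothing to distinguish one document from another.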

Ranking

• Calculate results set by standard Boolean methods

• Rank results by vector distances

SIRE (Syracuse Information Retrieval Experiment)

Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded

• Results set is created by standard Boolean retrieval

• User selects one document from results set

• Other documents in collection are ranked by vector distance from this document
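The two SIRE steps described above (Boolean filtering followed by vector ranking, then feedback from one selected document) might look like the sketch below. Cosine similarity stands in for "vector distance", and the function names and document vectors are hypothetical; none of this is taken from the SIRE system itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def boolean_then_rank(required_terms, query_vector, doc_vectors):
    """Keep documents containing every required term, then sort by similarity to the query."""
    hits = [doc_id for doc_id, vec in doc_vectors.items()
            if all(vec.get(t, 0.0) > 0 for t in required_terms)]
    return sorted(hits, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)

def more_like(selected_id, doc_vectors):
    """Rank every other document in the collection by similarity to the selected one."""
    selected = doc_vectors[selected_id]
    others = [d for d in doc_vectors if d != selected_id]
    return sorted(others, key=lambda d: cosine(selected, doc_vectors[d]), reverse=True)

doc_vectors = {
    "d1": {"information": 0.2, "retrieval": 0.1, "library": 0.9},
    "d2": {"information": 0.8, "retrieval": 0.7},
    "d3": {"information": 0.9},
    "d4": {"library": 1.0},
}
query_vector = {"information": 1.0, "retrieval": 1.0}
ranked = boolean_then_rank(["information", "retrieval"], query_vector, doc_vectors)
expanded = more_like("d1", doc_vectors)
```

In this toy data, `d4` fails the Boolean query but ranks first in `expanded`, which is the sense in which feedback lets the results set grow beyond the original Boolean match.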

q is more or less in A. There is a halfway!

[Figure: query q and document d in relation to set A]

• A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.

• Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)

• For a given query, calculate the similarity between the query and each document in the collection.

• This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

Fuzzy set theory

$d_A$ is the degree of membership of an element in set $A$

intersection (and)

$d_{A \cap B} = \min(d_A, d_B)$

union (or)

$d_{A \cup B} = \max(d_A, d_B)$

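Fuzzy set theory example

| | standard set theory | | | | fuzzy set theory | | | |
|---|---|---|---|---|---|---|---|---|
| $d_A$ | 1 | 1 | 0 | 0 | 0.5 | 0.5 | 0 | 0 |
| $d_B$ | 1 | 0 | 1 | 0 | 0.7 | 0 | 0.7 | 0 |
| and: $d_{A \cap B}$ | 1 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 |
| or: $d_{A \cup B}$ | 1 | 1 | 1 | 0 | 0.7 | 0.5 | 0.7 | 0 |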

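Terms: $A_1, A_2, \ldots, A_n$

Document $D$, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$

$Q_{or} = (A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n)$

Query-document similarity:

$S(Q_{or}, D) = C_{or1} \cdot \max(d_{A_1}, d_{A_2}, \ldots, d_{A_n}) + C_{or2} \cdot \min(d_{A_1}, d_{A_2}, \ldots, d_{A_n})$

where $C_{or1} + C_{or2} = 1$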

Terms: $A_1, A_2, \ldots, A_n$

Document $D$, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$

$Q_{and} = (A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_n)$

Query-document similarity:

$S(Q_{and}, D) = C_{and1} \cdot \min(d_{A_1}, \ldots, d_{A_n}) + C_{and2} \cdot \max(d_{A_1}, \ldots, d_{A_n})$

where $C_{and1} + C_{and2} = 1$

Experimental values:

$C_{and1}$ in range [0.5, 0.8]

$C_{or1} > 0.2$

Computational cost is low. Retrieval performance much improved.
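The two MMM similarity formulas above translate directly into code. This is a minimal sketch: the function and parameter names are chosen for illustration, and the second coefficient is derived from the first using the constraint that the two sum to 1.

```python
def mmm_or(weights, c_or1):
    """S(Q_or, D): mostly the maximum weight, softened by the minimum."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)

def mmm_and(weights, c_and1):
    """S(Q_and, D): mostly the minimum weight, softened by the maximum."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)

# Index-term weights of one document for the query terms A1..An (illustrative values).
doc_weights = [0.2, 0.6, 1.0]
or_score = mmm_or(doc_weights, c_or1=0.6)
and_score = mmm_and(doc_weights, c_and1=0.7)
```

Setting `c_or1` or `c_and1` to 1 recovers the pure fuzzy-set max and min rules, so the coefficients control how far the model softens them.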

The Paice model is a relative of the MMM model.

The MMM model considers only the maximum and minimum document weights.

The Paice model takes into account all of the document weights.

Computational cost is higher than for MMM. Retrieval performance is improved.

See Frakes, pages 396-397 for more details
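The slides do not give the Paice formula. The sketch below assumes the form usually quoted for it: the document weights are sorted (ascending for an and query, descending for an or query) and combined as a weighted average with coefficients $r^{i-1}$. Treat the formula, the parameter `r`, and the function name as assumptions to be checked against the cited pages.

```python
def paice(weights, r, operator):
    """Weighted average of all document weights with geometric coefficients r**i.

    operator: "and" sorts weights ascending (smallest counts most),
              "or" sorts weights descending (largest counts most).
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    ordered = sorted(weights, reverse=(operator == "or"))
    coeffs = [r ** i for i in range(len(ordered))]
    return sum(c * w for c, w in zip(coeffs, ordered)) / sum(coeffs)

doc_weights = [0.2, 0.6, 1.0]
```

Under this assumed form, `r = 0` leaves only the first sorted weight (the pure min or max rule), while `r = 1` gives the plain average of all weights, so every weight contributes rather than only the two extremes.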

Terms: A1, A2, . . . , An

Document $D$, with term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$

Query terms are given weights, $a_1, a_2, \ldots, a_n$, which indicate their relative importance.

Operators have coefficients that indicate their degree of strictness

Query-document similarity is calculated by considering each document and query as a point in n space.

See Frakes, pages 397-398 for details
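The slides describe the P-norm model without its formulas. The sketch below assumes the commonly cited form, in which the operator coefficient `p` (at least 1) sets the strictness; the formulas and names are assumptions to be checked against the cited pages.

```python
def pnorm_or(doc_weights, query_weights, p):
    """Assumed P-norm OR: normalized p-norm distance from the all-zeros point."""
    num = sum((a * d) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """Assumed P-norm AND: one minus the normalized p-norm distance from the all-ones point."""
    num = sum((a * (1.0 - d)) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

doc_weights = [0.2, 0.6, 1.0]      # d_A1..d_An for one document (illustrative)
query_weights = [1.0, 1.0, 1.0]    # a_1..a_n, equal importance
```

With `p = 1` both operators reduce to the same weighted average, so and and or are indistinguishable; as `p` grows, or moves toward the maximum weight and and toward the minimum, approaching strict Boolean behaviour.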

| | CISI | CACM | INSPEC |
|---|---|---|---|
| P-norm | 79 | 106 | 210 |
| Paice | 77 | 104 | 206 |
| MMM | 68 | 109 | 195 |

Percentage improvement over standard Boolean model (average best precision)

Lee and Fox, 1988

E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15

Methods based on fuzzy set concepts