
CS 430: Information Discovery






Presentation Transcript


  1. CS 430: Information Discovery Lecture 11 The Boolean Model

  2. Course Administration Assignment 2 A revised version of the assignment has been posted today. Assignment 1 If you have questions about your grading, send me email. The following are reasonable requests: the wrong files were graded, points were added up wrongly, comments are unclear, etc. We are not prepared to argue over details of judgment. If you ask for a regrade, the final grade may be lower than the original!

  3. Boolean Diagram [Venn diagram: two overlapping sets A and B, labeling the regions A and B (the intersection), A or B (the union), and not (A or B) (everything outside both sets)]

  4. Query Languages: Boolean Queries Boolean query: search terms related by logical operators, e.g., and, or, not. Examples: abacus and actor; abacus or actor; (abacus and actor) or (abacus and atoll); not actor
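The semantics above can be sketched in a few lines of Python, treating each document as the set of terms it contains; the document contents here are invented for illustration.

```python
# Boolean queries over documents modeled as sets of terms.
# Document contents are invented for illustration.
docs = {
    1: {"abacus", "actor"},
    2: {"abacus", "atoll"},
    3: {"actor"},
}

def matching(predicate):
    """Return the ids of documents whose term sets satisfy the predicate."""
    return {doc_id for doc_id, terms in docs.items() if predicate(terms)}

print(matching(lambda t: "abacus" in t and "actor" in t))  # {1}
print(matching(lambda t: "abacus" in t or "actor" in t))   # {1, 2, 3}
print(matching(lambda t: "actor" not in t))                # {2}
```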

  5. Query Languages: Proximity Operators abacus adj actor Terms abacus and actor are adjacent to each other, as in the string "abacus actor" abacus near 4 actor Terms abacus and actor are near to each other, as in the string "the actor has an abacus" Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

  6. Query Languages: Conventions By convention, stop words and punctuation are ignored. swan adj 41 matches "John Swan, 41 Main Street." information adj retrieval matches " ... information on retrieval methods ..."

  7. Query Languages: Pattern Matching Prefix: "comp?" matches any word beginning "comp" Suffix: "?tal" matches any word ending "tal" Ranges: "1920...1925" matches any number between 1920 and 1925

  8. Evaluation of Boolean Operators Precedence of operators must be defined: adj, near (high); and, not; or (low). Example: A and B or C and B is evaluated as (A and B) or (C and B)

  9. Thoughts on Evaluating a Boolean Expression General Approach Specify the grammar for valid expressions Write an interpreter defined by this grammar -> parse query to create an expression -> evaluate each document Simple approach for Assignment 2 For each document -> scan expression for highest priority operator and evaluate -> repeat until all operators have been evaluated
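The general approach can be sketched as a small recursive-descent interpreter in Python. This is only a sketch (not the required structure for Assignment 2); the grammar encodes the precedence of slide 8, with or binding loosest.

```python
# A tiny recursive-descent interpreter for Boolean queries.
# Grammar (precedence low to high):
#   or-expr  -> and-expr { "or" and-expr }
#   and-expr -> not-expr { "and" not-expr }
#   not-expr -> "not" not-expr | "(" or-expr ")" | term
def evaluate(tokens, terms):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def or_expr():
        nonlocal pos
        value = and_expr()
        while peek() == "or":
            pos += 1
            rhs = and_expr()
            value = value or rhs
        return value

    def and_expr():
        nonlocal pos
        value = not_expr()
        while peek() == "and":
            pos += 1
            rhs = not_expr()
            value = value and rhs
        return value

    def not_expr():
        nonlocal pos
        tok = peek()
        pos += 1
        if tok == "not":
            return not not_expr()
        if tok == "(":
            value = or_expr()
            pos += 1              # skip the closing ")"
            return value
        return tok in terms       # a term is true if the document contains it

    return or_expr()

doc = {"abacus", "atoll"}         # invented document term set
print(evaluate("( abacus and actor ) or ( abacus and atoll )".split(), doc))
# True
```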

  10. Use of Postings File for Query Matching [Diagram: index file entries 1 abacus, 2 actor, 3 aspen, 4 atoll, each pointing to its postings list of document numbers]

  11. Query Matching: Vector Ranking Methods Query: abacus asp? 1. From the index file (word list), find the postings file for "abacus" and for every word that begins "asp". 2. Merge these postings lists. 3. Calculate the similarity to the query for each document that occurs in any of the postings lists. 4. Sort the similarities to obtain the results in ranked order. Steps 2 and 3 should be carried out in a single pass.
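Steps 2 and 3 can be sketched in Python; the postings lists and term weights below are invented (a real system would use tf-idf weights, as on slide 19).

```python
# Single-pass merge: accumulate each document's similarity score
# while walking the postings lists of the query terms.
from collections import defaultdict

postings = {                       # term -> [(doc_id, term weight), ...]
    "abacus": [(3, 0.9), (19, 0.4), (22, 0.7)],
    "aspen":  [(5, 0.8), (19, 0.6)],   # one expansion of "asp?"
}
query_terms = ["abacus", "aspen"]

scores = defaultdict(float)
for term in query_terms:
    for doc_id, weight in postings.get(term, []):
        scores[doc_id] += weight   # inner-product similarity

ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print([doc_id for doc_id, score in ranked])   # [19, 3, 5, 22]
```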

  12. Query Matching: Boolean Methods Query: (abacus or asp?) and actor 1. From the index file (word list), find the postings file for "abacus", for every word that begins "asp", and for "actor". 2. Merge these postings lists. 3. For each document that occurs in any of the postings lists, evaluate the Boolean expression to see if it is true or false. Steps 2 and 3 should be carried out in a single pass.
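The same merging idea works for Boolean matching; the postings lists below are invented.

```python
# Gather candidate documents from the postings lists, then test the
# Boolean expression (abacus or asp?) and actor against each candidate.
postings = {
    "abacus": {3, 19, 22},
    "aspen":  {5, 19},            # one expansion of "asp?"
    "actor":  {19, 29, 45},
}

candidates = set().union(*postings.values())

def matches(doc_id):
    has = lambda term: doc_id in postings[term]
    return (has("abacus") or has("aspen")) and has("actor")

results = sorted(d for d in candidates if matches(d))
print(results)   # [19]
```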

  13. Query Languages: Regular Expressions Regular expression: A pattern built up by simple strings (which are matched as substrings) and operators Union: If e1 and e2 are regular expressions, then (e1 | e2) matches whatever matches e1 or e2. Concatenation: If e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 followed immediately by e2. Repetition: If e is a regular expression, then e* matches a sequence of zero or more contiguous occurrences of e.

  14. Regular Expression Examples (wild card) matches "wildcard" travel l* ed matches "traveled" or "travelled", but not "traveed" 192 (0 | 1 | 2 | 3 | 4 | 5) matches any string in the range "1920" to "1925" Techniques for processing regular expressions are taught in CS 381 and CS 481.
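These examples can be checked with Python's re module, whose syntax for union (|), concatenation, and repetition (*) matches the operators of slide 13.

```python
import re

# Repetition: "travel", then zero or more "l", then "ed".
pat = re.compile(r"travell*ed")
print(bool(pat.fullmatch("traveled")))    # True
print(bool(pat.fullmatch("travelled")))   # True
print(bool(pat.fullmatch("traveed")))     # False

# Union: 192(0|1|2|3|4|5) matches any of "1920" .. "1925".
year = re.compile(r"192(0|1|2|3|4|5)")
print(bool(year.fullmatch("1923")))       # True
print(bool(year.fullmatch("1927")))       # False
```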

  15. Problems with the Boolean model Counter-intuitive results: Query q = A and B and C and D and E Document d has terms A, B, C and D, but not E Intuitively, d is quite a good match for q, but it is rejected by the Boolean model. Query q = A or B or C or D or E Document d1 has terms A, B, C, D and E Document d2 has term A, but not B, C, D or E Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.

  16. Problems with the Boolean model (continued) Boolean is all or nothing • Boolean model has no way to rank documents. • Boolean model allows for no uncertainty in assigning index terms to documents. • The Boolean model has no provision for adjusting the importance of query terms.

  17. Boolean model as sets d is either in the set A or not in A. [Diagram: element d and the set A]

  18. Extending the Boolean model Term weighting • Give weights to terms in documents and/or queries. • Combine standard Boolean retrieval with vector ranking of results Fuzzy sets • Relax the boundaries of the sets used in Boolean retrieval

  19. Ranking methods in Boolean systems SIRE (Syracuse Information Retrieval Experiment) Term weights • Add term weights to documents. Weights calculated by the standard method of term frequency * inverse document frequency. Ranking • Calculate results set by standard Boolean methods • Rank results by vector distances

  20. Relevance feedback in SIRE SIRE (Syracuse Information Retrieval Experiment) Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded • Results set is created by standard Boolean retrieval • User selects one document from results set • Other documents in collection are ranked by vector distance from this document

  21. Boolean model as fuzzy sets d is more or less in A. [Diagram: element d and the fuzzy set A]

  22. Basic concept • A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document. • Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.) • For a given query, calculate the similarity between the query and each document in the collection. • This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

  23. MMM: Mixed Min and Max model Fuzzy set theory dA is the degree of membership of an element in the set A intersection (and): dA∩B = min(dA, dB) union (or): dA∪B = max(dA, dB)

  24. MMM: Mixed Min and Max model Fuzzy set theory example

                 standard set theory     fuzzy set theory
      dA          1    1    0    0       0.5  0.5  0    0
      dB          1    0    1    0       0.7  0    0.7  0
      and: dA∩B   1    0    0    0       0.5  0    0    0
      or:  dA∪B   1    1    1    0       0.7  0.5  0.7  0
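The fuzzy and/or of slide 23 reproduces the fuzzy-set example values of slide 24:

```python
# Fuzzy intersection and union as min and max of membership degrees.
def f_and(a, b): return min(a, b)
def f_or(a, b):  return max(a, b)

dA = [0.5, 0.5, 0.0, 0.0]
dB = [0.7, 0.0, 0.7, 0.0]
print([f_and(a, b) for a, b in zip(dA, dB)])  # [0.5, 0.0, 0.0, 0.0]
print([f_or(a, b)  for a, b in zip(dA, dB)])  # [0.7, 0.5, 0.7, 0.0]
```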

  25. MMM: Mixed Min and Max model Terms: A1, A2, . . . , An Document D, with index-term weights: dA1, dA2, . . . , dAn Qor = (A1 or A2 or . . . or An) Query-document similarity: S(Qor, D) = Cor1 * max(dA1, dA2, . . . , dAn) + Cor2 * min(dA1, dA2, . . . , dAn) where Cor1 + Cor2 = 1

  26. MMM: Mixed Min and Max model Terms: A1, A2, . . . , An Document D, with index-term weights: dA1, dA2, . . . , dAn Qand = (A1 and A2 and . . . and An) Query-document similarity: S(Qand, D) = Cand1 * min(dA1, . . . , dAn) + Cand2 * max(dA1, . . . , dAn) where Cand1 + Cand2 = 1
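Both similarity formulas can be written as a short Python sketch; the index-term weights and coefficient values below are invented (the coefficients fall within the experimental ranges of slide 27).

```python
# MMM query-document similarity (slides 25-26).
def s_or(weights, c_or1):
    # Cor1 * max + Cor2 * min, with Cor1 + Cor2 = 1
    return c_or1 * max(weights) + (1 - c_or1) * min(weights)

def s_and(weights, c_and1):
    # Cand1 * min + Cand2 * max, with Cand1 + Cand2 = 1
    return c_and1 * min(weights) + (1 - c_and1) * max(weights)

d = [0.5, 0.9, 0.2]   # index-term weights dA1, dA2, dA3 (invented)
print(s_or(d, 0.8))   # 0.8 * 0.9 + 0.2 * 0.2 = 0.76
print(s_and(d, 0.7))  # 0.7 * 0.2 + 0.3 * 0.9 = 0.41
```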

  27. MMM: Mixed Min and Max model Experimental values: Cand1 in range [0.5, 0.8] Cor1 > 0.2 Computational cost is low. Retrieval performance much improved.

  28. Other Models Paice model The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights. Computational cost is higher than MMM. P-norm model Document D, with term weights: dA1, dA2, . . . , dAn Query terms are given weights: a1, a2, . . . , an Operators have coefficients that indicate degree of strictness Query-document similarity is calculated by considering each document and query as a point in n-space.

  29. Test data Percentage improvement over standard Boolean model (average best precision), Lee and Fox, 1988:

               CISI   CACM   INSPEC
      P-norm    79    106     210
      Paice     77    104     206
      MMM       68    109     195

  30. Reading E. Fox, S. Betrabet, M. Koushik, W. Lee, "Extended Boolean Models", Frakes, Chapter 15. Methods based on fuzzy set concepts.
