Understanding Boolean and Vector Models in Information Retrieval

LIS618 lecture 6: Vector Model and ProQuest, Thomas Krichel, 2011-11-01

advantages of Boolean model • supposedly easy to grasp by the user • precise semantics of queries • implemented in the majority of commercial systems

problems of Boolean model • sharp distinction between relevant and irrelevant documents • no ranking possible • users find it difficult to formulate Boolean queries • users find it difficult to resolve Boolean queries

vector model • associates weights with each index term appearing in the query and in each database document. • relevance can be calculated as the cosine of the angle between the two vectors, i.e. their inner (dot) product divided by the product of their lengths, where the length of a vector is the square root of the sum of its squared weights. Because the weights are non-negative, this measure varies between 0 and 1.
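
As a minimal sketch of the cosine measure, the function below takes two equal-length lists of weights. The names `cosine`, `query`, `doc_a` and `doc_b` and the weights themselves are made up for illustration; returning 0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))          # inner (dot) product
    norm_u = math.sqrt(sum(a * a for a in u))       # length of u
    norm_v = math.sqrt(sum(b * b for b in v))       # length of v
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights over a three-term vocabulary.
query = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]   # points in the same direction as the query
doc_b = [0.0, 3.0, 0.0]   # shares no term with the query

print(cosine(query, doc_a))
print(cosine(query, doc_b))
```

A document whose weights are proportional to the query's gets a cosine of (nearly) 1, and a document sharing no weighted term with the query gets 0.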

tf/idf • stands for term frequency / inverse document frequency • This refers to a technique that gives a term a high rank in a document if • the term appears frequently in that document • the term does not appear frequently in other documents • We will look at each component one at a time.

absolute & maximum term frequency • Let F_t_d be the number of times term t appears in the document d. This is its absolute term frequency in the document. • Let m_d be the maximum absolute term frequency achieved by any term in document d. • Examples • Document 1: a b a a b c c d, so m_1 = 3, because "a" appears 3 times • Document 2: a b a f f f e d f a a, so m_2 = 4, because "a" and "f" each appear 4 times
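
The two counts can be checked with a short sketch. It assumes a document is a string of whitespace-separated terms; the function names `term_counts` and `max_frequency` are illustrative.

```python
from collections import Counter

def term_counts(text):
    """Absolute term frequencies F_t_d for a whitespace-separated document."""
    return Counter(text.split())

def max_frequency(text):
    """m_d: the highest absolute frequency reached by any term in the document."""
    return max(term_counts(text).values())

doc1 = "a b a a b c c d"
doc2 = "a b a f f f e d f a a"

print(term_counts(doc1)["a"], max_frequency(doc1))
print(term_counts(doc2)["f"], max_frequency(doc2))
```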

relative document term frequency • The relative term frequency f_t_d is given by f_t_d = F_t_d / m_d, that is, the absolute term frequency of term t in document d divided by the maximum absolute term frequency of document d. • This completes the "term frequency" part of the tf/idf formula. • Let us look at this part through an example.

main example, part I • Consider three documents • 1: a b c a f o n l p o f t y x • 2: a m o e e e n n n a n p l • 3: r a e e f n l i f f f f x l • First, look at the maximum frequency achieved by any term in a given document: m_1 = 2 ("a", "f" and "o" are each there twice), m_2 = 4 ("n" is there four times), m_3 = 5 ("f" is there five times)

main example part II • Now look at some examples of absolute term frequency: F_a_1 = 2, F_e_2 = 3, F_x_3 = 1 • and some examples of relative term frequency: f_a_1 = F_a_1 / m_1 = 2 / 2 = 1, f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75, f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
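
The relative term frequencies of the main example can be reproduced with a small sketch. The dictionary `docs` holds the three example documents; the function name `relative_tf` is illustrative.

```python
from collections import Counter

# The three documents of the main example, keyed by document number.
docs = {
    1: "a b c a f o n l p o f t y x",
    2: "a m o e e e n n n a n p l",
    3: "r a e e f n l i f f f f x l",
}

def relative_tf(term, text):
    """f_t_d = F_t_d / m_d for a whitespace-separated document."""
    counts = Counter(text.split())
    return counts[term] / max(counts.values())

print(relative_tf("a", docs[1]))
print(relative_tf("e", docs[2]))
print(relative_tf("x", docs[3]))
```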

inverse document frequency • Let N be the number of documents in the database. N = 3 in our example. • Let n_t be the number of documents where the term t appears. In our example n_a = 3, n_e = 2, n_x = 2 • N/n_t is an indication of the inverse document frequency of a term. The fewer documents in the database a term appears in, the larger it is.
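
A sketch of the two quantities on the main example follows. The function names are illustrative, and the code assumes the term occurs in at least one document (otherwise the division fails).

```python
# The three documents of the main example.
docs = [
    "a b c a f o n l p o f t y x",
    "a m o e e e n n n a n p l",
    "r a e e f n l i f f f f x l",
]

def document_frequency(term, docs):
    """n_t: the number of documents in which the term appears at least once."""
    return sum(1 for text in docs if term in text.split())

def inverse_ratio(term, docs):
    """N / n_t; raises ZeroDivisionError if the term appears in no document."""
    return len(docs) / document_frequency(term, docs)

for term in ("a", "e", "x"):
    print(term, document_frequency(term, docs), inverse_ratio(term, docs))
```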

intermezzo: the logarithm • The logarithm, written log() is a mathematical function. You should know that • log() is an increasing function, i.e. the bigger is x, the bigger is log(x). • log(1) = 0 • log(x) > 0 if x > 1 • Your calculator will tell you what the logarithm of a number is.
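
In place of a calculator, the listed properties can be checked with Python's standard `math` module. The sketch assumes the base-10 logarithm, which is the base implied by the value log(3/2) = 0.176 used in the main example; the sample values are arbitrary.

```python
import math

# Arbitrary sample values, in increasing order.
values = [0.5, 1, 1.5, 2, 3, 10]
logs = [math.log10(x) for x in values]

for x, y in zip(values, logs):
    print(f"log({x}) = {y:.3f}")
```

The printed logarithms increase with x, are 0 at x = 1, and are positive for every x above 1.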

tf/idf formula • Term frequency and inverse document frequency have to be combined. • The final formula for the weight combines the terms as follows w_t_d = f_t_d * log( N / n_t )

main example part III • N = 3 • w_a_1 = 1 * log(3/3) = log(1) = 0 ! • w_e_2 = 0.75 * log(3/2) = 0.132, approximately • w_x_3 = 0.2 * log(3/2) = 0.035, approximately • where log(3/2) = 0.176, approximately, using the base-10 logarithm
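
The three weights can be recomputed with the sketch below. It assumes the base-10 logarithm (consistent with log(3/2) = 0.176) and assigns weight 0 to a term that appears in no document, a convention chosen here; the function name `tfidf` is illustrative.

```python
import math
from collections import Counter

# The three documents of the main example, keyed by document number.
docs = {
    1: "a b c a f o n l p o f t y x",
    2: "a m o e e e n n n a n p l",
    3: "r a e e f n l i f f f f x l",
}

def tfidf(term, doc_id, docs):
    """w_t_d = f_t_d * log(N / n_t), with a base-10 logarithm."""
    counts = Counter(docs[doc_id].split())
    f_td = counts[term] / max(counts.values())            # relative term frequency
    n_t = sum(1 for text in docs.values() if term in text.split())
    if n_t == 0:
        return 0.0  # assumed convention for a term absent from the database
    return f_td * math.log10(len(docs) / n_t)

print(round(tfidf("a", 1, docs), 3))
print(round(tfidf("e", 2, docs), 3))
print(round(tfidf("x", 3, docs), 3))
```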

practical operation • The computer will search the documents for the query term and return the documents where the weight of the term in the index for that document is strictly positive, ordered by weight from highest to lowest. • If there are several query terms the computer will perform a more complicated operation that we will not study further here, so we limit ourselves to the case of one query term.
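
The single-term procedure can be sketched as follows, next to a Boolean lookup for comparison. The function names are illustrative, the logarithm is assumed to be base 10, and the query term "x" is used only as a demonstration.

```python
import math
from collections import Counter

# The three documents of the main example, keyed by document number.
docs = {
    1: "a b c a f o n l p o f t y x",
    2: "a m o e e e n n n a n p l",
    3: "r a e e f n l i f f f f x l",
}

def weight(term, text, all_texts):
    """tf/idf weight of a term in one document (base-10 logarithm assumed)."""
    counts = Counter(text.split())
    n_t = sum(1 for other in all_texts if term in other.split())
    if n_t == 0:
        return 0.0
    return counts[term] / max(counts.values()) * math.log10(len(all_texts) / n_t)

def vector_search(term, docs):
    """Document numbers with strictly positive weight, highest weight first."""
    texts = list(docs.values())
    scored = [(doc_id, weight(term, text, texts)) for doc_id, text in docs.items()]
    hits = [(doc_id, w) for doc_id, w in scored if w > 0]
    return [doc_id for doc_id, w in sorted(hits, key=lambda h: h[1], reverse=True)]

def boolean_search(term, docs):
    """Boolean model: the unordered set of documents containing the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

print(vector_search("x", docs))   # ranked list of document numbers
print(boolean_search("x", docs))  # unordered set of document numbers
```

The vector search returns a ranked list and drops documents whose weight is zero, while the Boolean lookup returns every document containing the term with no ordering.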

practical tests • You ask the computer to query the term "a" in our example. What documents are being returned? • Compare with the result of the Boolean model. • You ask the computer to query the term "e". What documents are being returned, and in what order?

advantages of vector model • term weighting improves performance • sorting is possible • easy to compute, therefore fast • results are difficult to improve without • query expansion • user feedback circle

ProQuest search targets • ProQuest searches “citations” and “documents”. • “citations” are descriptions of documents such as author names, titles, journal, etc. • “documents” contain the full text of documents. • Target differences imply different behavior of an expression when matched against a candidate.

ProQuest search • If you enter two search terms, they will be used as one phrase. • If you enter three terms, they are searched for in proximity to each other. • You can force phrase interpretation by placing the search expression in double quotes.

terms • A search term is something you type and that has a meaning on its own. • For example: house, or krichel. • Terms have a regular expression interpretation.

regular expressions • ‘*’ is used as a right-handed truncation character only; it will find all forms of a word. For example, searching for “econom*” finds words such as “economy” and “economics”. • ‘?’ is used to replace any single character, either inside the word or at the right end of the word. For example, searching for “wom?n” finds “woman” and “women”. • ‘?’ cannot be used to begin a word.
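
To illustrate the two wildcard rules, the sketch below translates a wildcard term into a Python regular expression. It is an approximation for illustration, not ProQuest's implementation: `wildcard_to_regex` and `matches` are hypothetical names, "*" is assumed to stand for zero or more word characters, and "?" for exactly one word character.

```python
import re

def wildcard_to_regex(term):
    """Translate a term with '*' and '?' wildcards into a compiled regex."""
    if term.startswith("?"):
        raise ValueError("'?' cannot begin a word")
    if "*" in term[:-1]:
        raise ValueError("'*' is allowed only at the right end of a word")
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")   # assumed: zero or more word characters
        elif ch == "?":
            parts.append(r"\w")    # assumed: exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None

print(matches("econom*", "economics"))
print(matches("wom?n", "women"))
print(matches("wom?n", "womn"))
```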

operators: and • AND finds all the words. • When searching for keywords in "Citation and Document Text," AND finds documents in which the words occur in the same paragraph (within approx. 1000 characters) or the words appear in any citation field.

operator: and not, or • “and not” is the same as “not” in Dialog. • “or” is a normal Boolean or.

proximity operators • W/number Find documents where these words are within some number of words of each other (either before or after). Use when searching for keywords within "Citation and Document Text" or "Document Text." Example: computer W/3 careers • NOT W/number does the opposite.

proximity operators • W/PARA Find documents where these words are within the same paragraph (within approx. 1000 characters). Use when searching for keywords within "Document Text." Example: internet W/PARA web

proximity operators • W/DOC Find documents where all the words appear within the document text. Use W/DOC in place of AND when searching for keywords within "Citation and Document Text" or "Document Text" to retrieve more comprehensive results. Example: Internet W/DOC education

proximity operators • PRE/number Find documents where the first word appears within some number of words before the second word. • Use when searching for keywords within "Citation and Document Text" or "Document Text." Example: world pre/3 web
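
The difference between W/number and PRE/number can be sketched on a tokenized text. This is a toy model rather than ProQuest's matching logic: the function names `within` and `precedes` are hypothetical, words are split on whitespace, and "n words apart" is assumed to mean that the word positions differ by at most n.

```python
def positions(words, target):
    """Indexes at which the target word occurs (case-insensitive)."""
    return [i for i, w in enumerate(words) if w == target.lower()]

def within(text, first, second, n):
    """W/n sketch: the two words occur at most n positions apart, in either order."""
    words = text.lower().split()
    return any(abs(i - j) <= n
               for i in positions(words, first)
               for j in positions(words, second))

def precedes(text, first, second, n):
    """PRE/n sketch: `first` occurs at most n positions before `second`."""
    words = text.lower().split()
    return any(0 < j - i <= n
               for i in positions(words, first)
               for j in positions(words, second))

print(within("careers in computer science", "computer", "careers", 3))
print(precedes("the world wide web", "world", "web", 3))
print(precedes("web of the world", "world", "web", 3))
```

In this model W/3 accepts "web of the world" for the pair world and web, because order does not matter, while PRE/3 rejects it.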

field syntax • It is possible to limit a search for a term to a field. • This is done by writing field(term).

abstract • ABS() searches article abstracts for your terms. • Examples: ABS(customer delight) ABS(ozone)

appendix • APX() searches the appendix of a document. The appendix usually comes at the end of the document, identified by a header • Use Keywords to search this field. • Example: APX(Michigan)

author • AU() is used to find articles written by an author or reviewer. • Example: AU(Thomas Krichel)

Classification code (ABI) • Use Classification Codes when searching business topics. Classification Codes are a fast way to precisely target a search by topic, industry or market, geographical area, or article type. • Examples: CC(1120) for Economic Policy & Planning • This only applies to a subset of data from ABI inform, which has these codes.

Coden • This is used to search the coden index. A coden is an alphanumeric code used for shelving/ordering books and journals in libraries, often based on a publication’s title. • Example: CODEN(EDUSBI)

Column / Document Column Head • The title of a column in a periodical or newspaper, such as “The Week in Review”. This search field finds all articles where the search words are in the column head. • Examples: COL(futures) COL("The Week In Review")

company / organization • CO() searches for an organization featured prominently in an article, such as: • Associations and cooperatives • Companies and their divisions • Governmental organizations and political parties • Sports teams, music bands and churches • Native American tribes • Comes with LCO({}) option for full matches.

publication date • PDN() searches the publication date in numeric format (mm/dd/yyyy). • You can use the < and > signs to indicate dates before and after a date, or between specific dates. • For example, PDN(>1/1/2002) AND PDN(<1/5/2002) will find results from publications with numeric dates between January 1 2002 and January 5 2002.
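
A date-range expression like the one above can be assembled from Python dates. The helper names `pdn_date` and `pdn_between` are hypothetical, and the sketch only builds the query string; it follows the unpadded month/day form used in the example.

```python
from datetime import date

def pdn_date(d):
    """Format a date as month/day/year without zero padding, e.g. 1/5/2002."""
    return f"{d.month}/{d.day}/{d.year}"

def pdn_between(after, before):
    """Build a PDN() expression for dates after `after` and before `before`."""
    return f"PDN(>{pdn_date(after)}) AND PDN(<{pdn_date(before)})"

print(pdn_between(date(2002, 1, 1), date(2002, 1, 5)))
```

Because the month comes first, 1/5/2002 is January 5, not May 1.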

dateline • DLN() searches article datelines. The dateline occurs frequently in newspapers, just after the article title, giving the date and place of the article's origin. You can use Boolean, proximity and truncation operators. • Example: DLN(lebanon pre/1 ohio)

document features • SF() is used to search document features, such as an index or auxiliary materials, that may be included in or accompany a document. • The document features indexed are: • Graphs and Illustrations • Maps • References • Tables

search by proquest handle • ID() Searches the unique database ID for articles and documents in ProQuest. • Examples: ID(356894)

document language • LA() is used to search the language index. This field contains the language in which the document was originally published. • Examples: LA(french) LN(french or english)

document text • Searches only the full text of articles for your search terms. Article abstracts are not included in this search. AND, OR, and other search operators are treated as such unless enclosed in quotes. • Examples: TEXT(Kofi Annan) TEXT("North Sea oil")

title searches • TI() searches the title of a document, such as “Seigniorage, Taxation and Myopia in EMU”

document type • DT() is used to look for search words or phrases in documents of a certain type. • Examples: DT(commentary) DT(editorial cartoon) DT(review) DT(arts/exhibits review) DT(television review-no opinion)

company number • DUNS() searches the Dun & Bradstreet trading partner identification number. These numbers provide a universal system for computer identification of companies. • Examples: DUNS(00 695 7856) DUN(03 575 3920)

footnote • FOOT() searches the article footnotes for your terms. • Examples: FOOT(326 U.S. 465)

volume • VO() searches the volume. • Examples: VO(100)

word count • WC() restricts the number of words in the article text. Use this search field to locate articles under (<) or over (>) a certain length. • Examples: • WC(<1000) • WC(>500) • WC(>750 AND <1000)

year • YR() searches the publication year. • Examples: YR(1986) YR(1986-1987) YR(>1998) YR(<1998)

location • GEO() looks for articles in which a geographical area or location figures prominently in the text. • Examples: GEO(Midwest) GN(UK) GEO(New South Wales) GN(Black Forest) • Comes with LGEO({})

headnote • HEAD() looks for words that occur in the headnotes of an article. Headnotes are short introductions, explanations, or comments at the beginning of an article. They are different from abstracts in that they do not attempt to summarize the content of the article. • Examples: HEAD(escalator accidents) HDN(digital tv) HEAD(Global Economy)

caption texts • CAP() looks for occurrences of search words in the caption text accompanying article illustrations, graphs, and photographs. • Examples: CAP(Chart)
