
INFO624 - Week 2 Models of Information Retrieval



Presentation Transcript


  1. INFO624 - Week 2: Models of Information Retrieval. Dr. Xia Lin, Associate Professor, College of Information Science and Technology, Drexel University

  2. Review of Last Week • Challenges of information retrieval • Translate the user’s information needs into queries. • Match queries to stored information. • Evaluate whether the query results match the user’s information needs. • Differences between • Data, information, and knowledge • Data retrieval and information retrieval

  3. Assignment 1 • Some of my favorite search software packages • IBM’s Content Management (high-cost) • AOL PLS Search Engine (free) • GreenStone Digital Library Software (open source) • SWISH (open source) • mnoGoSearch (free) • Apache Lucene (open-source components)

  4. Documents • Documents are logical units of text • Units of records (text & other components) • Units that can be stored, retrieved, and displayed as a unique entity • Units of semantic entity • units of text grouped together for a purpose • Units of unformatted text • Text as written by the authors of documents.

  5. Document Models • Documents need to be processed and represented in concise and identifiable formats/structures. • Documents are full of text. • Not every word of the text is meaningful for searching/retrieval. • Documents by themselves do not have explicitly identified attributes such as authors and titles.

  6. Figure 1.2: Logical view of a document: from full text to a set of index terms.

  7. Document Representation • Documents should be represented to help users identify and receive information from the system. • to identify authors and titles • to identify subjects • to provide summaries/abstracts • to classify subject categories

  8. Document Surrogates • Each document should have one or more short and descriptive labels/attributes • Level 1: • Title: • Author: • Keywords: • Level 2: • Level 1 + Abstract • Level 3: • Level 2 + full text

  9. A Formal IR Model • An information retrieval model is a quadruple (D, Q, F, R(qi, dj)) where • D is a set composed of logical views (or representations) for the documents in the collection. • Q is a set composed of logical views (or representations) for the information needs. Such representations are called queries. • F is a framework for modeling document representations, queries, and their relationships. • R(qi, dj) is a ranking function which associates a real number with a query qi and a document representation dj. Such a ranking defines an ordering among the documents with regard to the query qi.

  10. Computerized Indexing • Title indexing: • Sort all the titles alphabetically • Ignore a leading “a” or “the” • Convert all letters to uppercase. • Matching always starts from the beginning of the title (not individual words). • Most early IR systems (such as library catalogs) used title indexing

  11. Word Indexing • Parse every individual word from documents • First decision: What is a word? • Are digits words? • How about letter-digit combinations: B6, B12 • Is F-16 one word or two words? • Hyphens • Online, on-line, on line? • F-16 • Singular or plural? • List all the words alphabetically with pointers back to documents – inverted indexing.
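These choices are policy decisions rather than anything the slides prescribe; the following minimal sketch (a hypothetical regex-based tokenizer, not from the lecture) shows how two different definitions of "word" produce different index terms:

import re

text = "The F-16 manual is on-line; see B12 and B6 online."

# Definition 1: a word is any run of letters or digits (hyphens split terms).
simple = re.findall(r"[a-z0-9]+", text.lower())
# -> ['the', 'f', '16', 'manual', 'is', 'on', 'line', 'see', 'b12', 'and', 'b6', 'online']

# Definition 2: keep internal hyphens, so "f-16" and "on-line" stay single terms.
hyphen_aware = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
# -> ['the', 'f-16', 'manual', 'is', 'on-line', 'see', 'b12', 'and', 'b6', 'online']

print(simple)
print(hyphen_aware)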

  12. Inverted Indexing • An inverted index consists of an ordered list of indexing terms; each indexing term is associated with the identification numbers of the documents that contain it. • Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents.

  13. Example: Create an inverted index for the following:
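The document list used for this in-class exercise is not included in the transcript; a minimal sketch with three hypothetical documents illustrates the construction:

# The documents used in the exercise are not in the transcript;
# these three short hypothetical documents stand in for them.
docs = {
    1: "information retrieval systems",
    2: "boolean retrieval model",
    3: "vector space model for information retrieval",
}

# Build the inverted index: an ordered list of terms, each pointing to
# the identification numbers of the documents that contain the term.
inverted = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted.setdefault(term, set()).add(doc_id)

for term in sorted(inverted):
    print(term, sorted(inverted[term]))
# boolean [2]
# for [3]
# information [1, 3]
# model [2, 3]
# retrieval [1, 2, 3]
# space [3]
# systems [1]
# vector [3]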

  14. Boolean Logic • Logical operators defined on sets • True and false: • A set is a collection of items with certain common characteristics. • Any item either belongs to the set (true) or does not belong to the set (false). • AND • combines two sets, A and B, to create a smaller (or at least not larger) set C. • any item in C must be in BOTH set A and set B. • OR • union of two sets, A and B, to create a larger set C. • any item in C must be either in set A or in set B. • NOT • excludes items in a set.

  15. Example: • Given: A = {1, 3, 7, 12, 14, 25, 36} B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26} C = {2, 4, 6, 8, 10, 11, 12, 13, 14} • Derive: • A AND B • A OR B • A AND B AND C • (A AND B) NOT C • (A AND B) OR C • (A OR B) AND C • A AND (B OR C)
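One way to check the derivations is with Python's built-in set operations, which correspond directly to AND (intersection), OR (union), and NOT (difference); a minimal sketch using the sets given above:

A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

print(sorted(A & B))        # A AND B        -> [1, 3, 7, 12, 14, 25]
print(sorted(A | B))        # A OR B
print(sorted(A & B & C))    # A AND B AND C  -> [12, 14]
print(sorted((A & B) - C))  # (A AND B) NOT C -> [1, 3, 7, 25]
print(sorted((A & B) | C))  # (A AND B) OR C
print(sorted((A | B) & C))  # (A OR B) AND C
print(sorted(A & (B | C)))  # A AND (B OR C)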

  16. Boolean Logic • Venn Diagram • graphical representation of Boolean logic • A and (B or C) • A and B or (C and D)

  17. Boolean Query • Terms connected by Boolean operators • The system retrieves a set of documents based on the Boolean logic of the query. • Examples: • (network or networks or structured or system or systems) and (information or retrieval)
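Evaluating such a query against an inverted index reduces to set operations on posting lists: union the posting lists of the OR'ed terms, then intersect the two groups. A minimal sketch with hypothetical posting lists (not from the lecture):

# Posting lists (term -> set of document ids); made up for illustration.
index = {
    "network": {1, 4}, "networks": {2}, "structured": {5},
    "system": {3, 4}, "systems": {2, 6},
    "information": {1, 2, 3, 7}, "retrieval": {2, 4, 6},
}

def postings(term):
    return index.get(term, set())

# (network OR networks OR structured OR system OR systems)
# AND (information OR retrieval)
left = (postings("network") | postings("networks") | postings("structured")
        | postings("system") | postings("systems"))
right = postings("information") | postings("retrieval")
print(sorted(left & right))   # -> [1, 2, 3, 4, 6]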

  18. Advantages of Boolean Search • Simple and specific • Effective • AND reduces the number of hits very quickly • OR expands the search scope • Strong logical basis • proven mathematical foundations

  19. Problems of Boolean Search: • Boolean search is an exact search • a document is either retrieved or not retrieved. • Requesting “computer” would not find “computing” unless additional processing is done • No weighting can be done on terms • in the query “A AND B”, you can’t specify that A is more important than B.

  20. No Ranking • Retrieved sets cannot be ordered based on Boolean logic alone. • Every retrieved document is treated equally. • Possible operator-precedence confusion • A AND B OR C

  21. Vectors • A numerical representation for a point in a multi-dimensional space. • (x1, x2, ..., xn) • The dimensions of the space need to be defined. • A measure on the space needs to be defined.

  22. Vector Representation of Document Space • Each indexing term is a dimension • Each document is a vector • Di = (ti1, ti2, ti3, ti4, ..., tin) • Dj = (tj1, tj2, tj3, tj4, ..., tjn) • Document similarity is defined as the cosine of the angle between the two vectors: • S(Di, Dj) = (Σk tik*tjk) / (sqrt(Σk tik²) * sqrt(Σk tjk²))

  23. Example: • A document space is defined by three terms: • hardware, software, user • A set of documents is defined as: • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1) • If the query is “hardware and software”, • what documents should be retrieved?

  24. In Boolean query matching: • documents A4 and A7 will be retrieved by “ANDing” the two query terms • A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved if the two query terms are “ORed” together. • In vector query matching: • q=(1, 1, 0) • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 • Retrieved document set (in ranked order) = • {A4, A7, A1, A2, A5, A6, A8, A9}
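The similarity values above are consistent with the cosine measure defined on the previous slides; a minimal sketch that reproduces the ranking:

import math

def cosine(q, d):
    # Cosine similarity: dot product divided by the product of the vector lengths.
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software" in the (hardware, software, user) space

for name in sorted(docs, key=lambda n: cosine(q, docs[n]), reverse=True):
    print(name, round(cosine(q, docs[name]), 2))
# A4 1.0, A7 0.82, A1 0.71, A2 0.71, then A5, A6, A8, A9 at 0.5, and A3 at 0.0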

  25. Weights in the Vector Space • A main advantage of Vector representation is that items in vectors don’t have to be just 0 or 1 (true or false). • A1=(0.7, 0.5, 0.3) • A2=(0.5, 0.2, 0.7) • A3=(0.3, 0.6, 0.9) • A4=(0.7, 0.9, 1.0) • Queries may also be weighted: • Q=(0.7, 0.3, 0)

  26. TF and IDF • TF – term frequency • the number of times a term occurs in a document • DF – document frequency • the number of documents that contain the term • IDF – inverse document frequency • = log(N/ni) • N – the total number of documents • ni – the number of documents that contain term i.

  27. Salton’s Vector Space • A document is represented as a vector: • (W1, W2, ..., Wn) • Binary: • Wi = 1 if the corresponding term is in the document • Wi = 0 if the term is not in the document • TF (Term Frequency): • Wi = tfi, where tfi is the number of times the term occurs in the document • TF*IDF (Inverse Document Frequency): • Wi = tfi*idfi = tfi*(1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
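A minimal sketch of the TF*IDF weighting defined above, using the Wi = tfi*(1 + log(N/dfi)) variant from this slide (the toy collection is hypothetical, and the base of the logarithm is not fixed by the slides, so natural log is used here):

import math
from collections import Counter

# Hypothetical toy collection, for illustration only.
docs = [
    "hardware software user interface",
    "software user software",
    "hardware design",
]
N = len(docs)

# Document frequency: how many documents contain each term.
df = Counter()
for text in docs:
    df.update(set(text.split()))

def tfidf(text):
    # Wi = tfi * (1 + log(N / dfi))
    tf = Counter(text.split())
    return {t: count * (1 + math.log(N / df[t])) for t, count in tf.items()}

for text in docs:
    print(tfidf(text))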

  28. In vector space, documents and queries are treated the same. • It is easier to do similarity search: • “find documents like this one” • It is easier to do document clustering: • “group documents into categories and subcategories” • It is easier to display search results graphically • “Giving meaning to place or location in the multi-dimensional space”

  29. Web Indexing • Most web indexing is vector-based indexing, with variations: • robot (crawler) indexing software keeps traversing the web to collect more pages and terms • servers build a huge inverted-index and vector-index database • search engines conduct different types of vector query matching • only a few search engines implement true Boolean query matching

  30. The real differences among search engines are • their indexing weight schemes • their query processing methods • their ranking algorithms • None of these are published by any of the search engine firms.

  31. Alternative IR Models • Probabilistic Model • Given a document d, how likely would the user consider it relevant? • How likely would the user consider it not relevant? • If these two are known, the similarity of document d and query q can be defined as: • S(d, q) = P(d is relevant to q) / P(d is not relevant to q)

  32. Examples: • If a document is 80% likely to be relevant to query q, what is its (probabilistic) similarity? • If a document is only 30% likely to be relevant, what is the similarity?
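Working these through with the odds-ratio definition above: a document that is 80% likely to be relevant has S = 0.8 / 0.2 = 4, while a document that is only 30% likely to be relevant has S = 0.3 / 0.7 ≈ 0.43.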

  33. If there are 100 documents and 10 are relevant to a query, • what is the probability of relevance for a randomly selected document? • What is the similarity of this document to the query? • Any retrieval system must do much better than that. • In general, retrieval systems should retrieve those documents with S > 1.
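Worked through the same way: a randomly selected document has a 10/100 = 0.1 probability of being relevant, so its similarity is S = 0.1 / 0.9 ≈ 0.11, well below the S > 1 threshold a retrieval system should aim for.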

  34. Advantages of the Probabilistic Model • Documents can be ranked by their relevance probability. • Relevance probability can be improved through the interaction process. • Good mathematical model • Disadvantages: • Involves many assumptions • Not very practical

  35. Fuzzy Set Model • Fuzzy Set Theory • An extension of Boolean set theory • Instead of a binary membership definition, fuzzy set membership is defined continuously between 0 and 1. • Example: • {male students in our class} • {tall students in our class} • The first is a Boolean set; the second is a fuzzy set.
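A minimal sketch of the contrast, with made-up membership functions (the slides do not define any):

def is_male(student):
    # Boolean set: membership is exactly 0 or 1.
    return 1 if student["sex"] == "M" else 0

def tallness(student):
    # Fuzzy set: membership is a degree in [0, 1].
    # Hypothetical membership function: 160 cm or shorter -> 0, 190 cm or taller -> 1.
    return min(1.0, max(0.0, (student["height_cm"] - 160) / 30))

s = {"sex": "M", "height_cm": 178}
print(is_male(s), round(tallness(s), 2))   # 1 0.6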

  36. The set of retrieved documents should be considered a fuzzy set. • Documents are not just relevant or not relevant. • Documents can be somewhat relevant. • Documents can be 80% likely to be relevant. • A good mathematical model, but not widely implemented and tested.

  37. Latent Semantic Indexing Model • Maps documents from a high-dimensional space to a lower-dimensional space while maintaining document relationships. • For clustering • For visualization • It’s a popular advanced retrieval technique. • It’s computationally expensive.
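The slides do not give an algorithm, but LSI is commonly implemented with a truncated singular value decomposition of the term-document matrix; a minimal sketch with a hypothetical toy matrix:

import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [1, 0, 1, 0],   # hardware
    [1, 1, 0, 0],   # software
    [0, 1, 1, 1],   # user
    [0, 0, 1, 1],   # interface
], dtype=float)

# Truncated SVD: keep only the k largest singular values and vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T   # each document as a k-dimensional vector

print(doc_coords)   # the 4 documents mapped into a 2-dimensional latent space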

  38. Neural Network Model • Organize the document collection as a semantic network through learning • Use known queries/relevant documents to train the network, and later allow the network to predict relevance for new queries (supervised learning). • Use document-document relationships to “self-organize” the network and move relevant documents close to each other (unsupervised learning).

  39. The Fusion Model • Retrieve documents based on text indexing (Boolean model, Vector Space Model, etc.) • Retrieve documents based on link models (citations, Google’s PageRank, etc.) • Retrieve documents based on classification models (classification schemes, thesauri, Yahoo categories, etc.) • “Fuse” the results together before responding to the user

  40. Models for Browsing • Flat Model • No particular organization of materials • Hierarchical Model • Assign documents to a hierarchical structure. • Hypertext Model • Define appropriate links among related documents.
