Learn about theoretical models for Information Retrieval, their significance in research and practice, advantages, and classic and further models utilized in science and browsing. Understand formal specifications, Boolean and Vector models, and their advantages and disadvantages. Explore the relationship between queries and document representations. Find out about performance evaluation and different models like neural networks and belief networks in Information Retrieval.
 
                
Web Search - Summer Term 2006
II. Information Retrieval (Basics Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Organizational Remarks
Exercises: Please register for the exercises by sending me (huerst@informatik.uni-freiburg.de) an email containing
- your name,
- your student ID number (Matrikelnummer),
- your degree program (BA, MSc, Diploma, ...),
- your plans for the exam (yes, no, undecided).
This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course. Please register before the exercises start. Later registration might be possible under certain circumstances (contact me).
Recap: IR System & Tasks Involved
[Diagram of the IR system. Components shown: information need, user interface, query, query processing (parsing & term processing), logical view of the information need, documents, selection of data for indexing, parsing & term processing, index, searching, ranking, results, result representation, performance evaluation.]
Models for Information Retrieval
- Mainly used in science and research, (probably?) less often in real systems
- But: research results have significance for practice, e.g. because they increase our understanding, allow more fact-based statements, etc.
General advantages of theoretical models:
- Behavior can be clearly understood and reconstructed, characteristics can be proven, etc.
- Plug-and-play, i.e. it is easy to build on previous work; strong theoretical background and framework, etc.
Models for IR - Taxonomy
Classic models:
- Boolean model (based on set theory); extensions: fuzzy set model, extended Boolean model
- Vector space model (based on algebra); extensions: generalized vector model, latent semantic indexing, neural networks
- Probabilistic models (based on probability theory); extensions: inference networks, belief network
Further models:
- Structured models
- Models for browsing
- Filtering
SOURCE: R. BAEZA-YATES [1], PAGE 20+21
Formal Specification of the Task
Definition: An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
- D is a set composed of logical views (or representations) for the documents in the collection.
- Q is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.
- F is a framework for modeling document representations, queries, and their relationships.
- R(qi, dj) is a ranking function which associates a real number with a query qi in Q and a document representation dj in D. Such a ranking defines an ordering among the documents with regard to the query qi.
SOURCE: R. BAEZA-YATES [1], PAGE 23
Formal Specification of the Task (Cont.)
- Generally, we represent the query and the documents through a set of terms T = {t1, ..., tk}, where k is the number of all unique index terms in the system.
- We assume wi,j to be a weight for term ti in document dj, with wi,j = 0 if ti is not in dj.
- Document dj can be represented as an index term vector dj = (w1,j, w2,j, ..., wk,j).
- gi represents a function for which gi(dj) = wi,j (i.e. given a document dj, gi delivers the weight of term ti in dj).
CF. R. BAEZA-YATES [1], PAGE 25
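The index term vector and the function gi can be illustrated with a minimal Python sketch. The toy documents, the whitespace tokenization, the use of raw term counts as weights, and the helper names `build_vocabulary`, `term_vector`, and `g` are assumptions made for illustration only; they do not come from the lecture.

```python
def build_vocabulary(docs):
    """Return the sorted list of unique index terms T = [t1, ..., tk]."""
    return sorted({term for text in docs.values() for term in text.lower().split()})


def term_vector(text, vocabulary):
    """Return dj = (w1j, ..., wkj); here the weight is the raw term count (an assumption)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def g(i, vector):
    """g_i(dj): the weight of term t_i in dj (i is 1-based, as in the slides)."""
    return vector[i - 1]


# Hypothetical two-document collection.
docs = {"d1": "web search engine", "d2": "search search index"}
vocabulary = build_vocabulary(docs)
vectors = {name: term_vector(text, vocabulary) for name, text in docs.items()}

print(vocabulary)           # the k unique index terms
print(vectors["d2"])        # weight is 0 for every term that does not occur in d2
print(g(3, vectors["d2"]))  # weight of the third term in d2
```

Any other weighting scheme (binary, TF*IDF, ...) would only change the body of `term_vector`; the vector layout stays the same.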
Classic Retrieval Models 1. Boolean Model (set theoretic)
Boolean Retrieval Model - Queries
- Based on set theory and Boolean algebra
- Documents: index term vector dj = (w1,j, ..., wk,j) with wi,j ∈ {0,1}
- Queries: terms combined with AND, OR, NOT
- The query is a Boolean expression in disjunctive normal form (DNF)
- Example: [figure]
CF. R. BAEZA-YATES [1], CH. 2.5.2
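Boolean Retrieval Model - Definition
A query q is defined as a Boolean expression qdnf in DNF, with qcc being the conjunctive components of qdnf. wi,j ∈ {0,1} are the index term weight variables. We define the similarity sim of a document dj with query q as

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} : (q_{cc} \in q_{dnf}) \wedge (\forall t_i : g_i(d_j) = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

(A document is considered relevant if sim = 1 and irrelevant otherwise.)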
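As a rough sketch of this definition, the following Python snippet treats every conjunctive component qcc as a complete binary vector over all k terms and declares a document relevant when its binary term vector equals one of them. The three-term query t1 AND (t2 OR NOT t3), the sample documents, and the function name `boolean_sim` are hypothetical and chosen only for illustration.

```python
def boolean_sim(doc_vector, q_dnf):
    """Return 1 if doc_vector matches some conjunctive component of q_dnf, else 0."""
    return 1 if any(tuple(doc_vector) == tuple(q_cc) for q_cc in q_dnf) else 0


# Hypothetical query over terms (t1, t2, t3): t1 AND (t2 OR NOT t3).
# Written as a full DNF, it has one component per satisfying assignment.
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

# Hypothetical binary document vectors.
docs = {
    "d1": (1, 1, 0),  # contains t1 and t2
    "d2": (0, 1, 1),  # lacks t1
    "d3": (1, 0, 1),  # contains t1 and t3, but not t2
}

for name, vector in docs.items():
    print(name, boolean_sim(vector, q_dnf))
```

The result is strictly 0 or 1: a document that almost satisfies the query (such as d3) is treated exactly like one that shares no term with it.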
Boolean Retrieval Model
Advantages:
- Precise, clean formalism
- Offers great control and transparency
- Simplicity, easy math, easy implementation
- Good for domains where results are ranked by means other than relevance, e.g. chronologically
Disadvantages:
- Query might be hard to specify
- Binary decision (relevant or not)
- Often too many or too few results
Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic)
Vector Model - Definition
- Based on vector algebra
- Main advantage (compared to Boolean models): considers non-binary weights and calculates a similarity measure between query and document
Formal definition:
- wi,q is defined as the weight associated with the pair (ti, q), with wi,q ≥ 0
- k describes the number of all unique index terms
With this, we can define
- Query vector q = (w1,q, w2,q, …, wk,q)
- Document vector dj = (w1,j, w2,j, …, wk,j)
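Vector Model - Definition (Cont.)
The similarity between a query and a document can then be quantified by the correlation of the respective vectors, e.g.
- using the inner product (arithmetical):

$$\mathrm{sim}(d_j, q) = \vec{d_j} \cdot \vec{q} = \sum_{i=1}^{k} w_{i,j}\, w_{i,q}$$

- using the cosine of the angle between the two vectors:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{k} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{k} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{k} w_{i,q}^2}}$$

Weights: often TF*IDF (or variants of it)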
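The slides do not say which TF*IDF variant is meant, so the following Python sketch assumes one common form: raw term frequency multiplied by log(N / ni), where N is the number of documents and ni the number of documents containing term ti. The tokenized toy collection and the function name `tf_idf_vectors` are likewise assumptions for illustration.

```python
import math


def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with w_ij = tf_ij * log(N / n_i).

    This is one assumed TF*IDF variant; many others exist.
    """
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # n_i: number of documents that contain term t_i
    doc_freq = {t: sum(1 for doc in tokenized_docs if t in doc) for t in vocabulary}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = [[doc.count(t) * idf[t] for t in vocabulary] for doc in tokenized_docs]
    return vocabulary, vectors


# Hypothetical, already tokenized collection.
collection = [["web", "search"], ["web", "index", "index"]]
vocabulary, vectors = tf_idf_vectors(collection)

for doc, vector in zip(collection, vectors):
    print(doc, [round(w, 3) for w in vector])
```

With this variant a term that occurs in every document (here "web") gets weight 0, since it does not help to tell the documents apart.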
Vector Model - Illustration
Simple example with 3 terms: [figure]
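A three-term case can also be worked through numerically. The Python sketch below computes the inner product and the cosine of the angle between a query vector and three document vectors, then ranks the documents by cosine similarity. All weights, document names, and function names are made up for illustration and are not the values from the original slide.

```python
import math


def inner_product(d, q):
    """Sum over all terms of w_ij * w_iq."""
    return sum(wd * wq for wd, wq in zip(d, q))


def cosine_sim(d, q):
    """Cosine of the angle between d and q; 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(d, d)) * math.sqrt(inner_product(q, q))
    return inner_product(d, q) / norm if norm else 0.0


# Hypothetical weights over three terms (t1, t2, t3).
q = [1.0, 0.0, 1.0]
docs = {
    "d1": [2.0, 0.0, 2.0],  # same direction as q, only longer
    "d2": [0.0, 3.0, 0.0],  # shares no term with q
    "d3": [1.0, 1.0, 0.0],  # shares one term with q
}

ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)
for name in ranking:
    print(name, inner_product(docs[name], q), round(cosine_sim(docs[name], q), 3))
```

Unlike the inner product, the cosine is normalized by the vector lengths, so d1 scores as a perfect match although its weights are twice those of the query.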
Vector Model
Advantages:
- Fast and easy
- Finds similar documents (no binary decision)
- Ranking based on similarity
- Often better results than Boolean search (because of the term weighting)
Disadvantages:
- Terms are assumed to be independent
Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic) 3. Probabilistic Models (probabilistic)