Web Search - Summer Term 2006 II. Information Retrieval (Basics Cont.)

(c) Wolfgang Hürst, Albert-Ludwigs-University


Presentation Transcript


  1. Web Search - Summer Term 2006
     II. Information Retrieval (Basics Cont.)
     (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. Organizational Remarks
     Exercises: Please register for the exercises by sending me (huerst@informatik.uni-freiburg.de) an email containing:
     - Your name
     - Matrikelnummer (student ID number)
     - Studiengang (degree program: BA, MSc, Diploma, ...)
     - Plans for the exam (yes, no, undecided)
     This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course. Please register before the exercises start. Later registration might be possible under certain circumstances (contact me).

  3. Recap: IR System & Tasks Involved
     [Diagram of an IR system. Components shown: Information need, User interface, Query, Query processing (parsing & term processing), Logical view of the information need, Documents, Select data for indexing, Parsing & term processing, Index, Searching, Ranking, Result representation, Results, Performance evaluation]

  4. Models for Information Retrieval
     - Mainly used in science and research, probably less often in real systems.
     - But: research results matter for practice, e.g. because they increase our understanding and allow more fact-based statements.
     General advantages of theoretical models:
     - Behavior can be clearly understood and reconstructed, characteristics can be proven, etc.
     - Plug-and-play, i.e. it is easy to build on previous work; strong theoretical background and framework, etc.

  5. Models for IR - Taxonomy
     Classic models:
     - Boolean model (based on set theory)
     - Vector space model (based on algebra)
     - Probabilistic models (based on probability theory)
     Extensions of the classic models:
     - Set theoretic: Fuzzy set model, Extended Boolean model
     - Algebraic: Generalized vector model, Latent semantic indexing, Neural networks
     - Probabilistic: Inference networks, Belief network
     Further models:
     - Structured models
     - Models for browsing
     - Filtering
     SOURCE: R. BAEZA-YATES [1], PAGE 20+21

  6. Formal Specification of the Task
     Definition: An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where
     - $D$ is a set composed of logical views (or representations) of the documents in the collection.
     - $Q$ is a set composed of logical views (or representations) of the user information needs. Such representations are called queries.
     - $F$ is a framework for modeling document representations, queries, and their relationships.
     - $R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i$ in $Q$ and a document representation $d_j$ in $D$. Such a ranking defines an ordering among the documents with regard to the query $q_i$.
     SOURCE: R. BAEZA-YATES [1], PAGE 23

  7. Formal Specification of the Task (Cont.)
     - Generally, we represent the query and the documents through a set of terms $T = \{t_1, \ldots, t_k\}$, where $k$ is the number of unique index terms in the system.
     - We assume $w_{i,j}$ to be a weight for term $t_i$ in document $d_j$, with $w_{i,j} = 0$ if $t_i$ is not in $d_j$.
     - Document $d_j$ can be represented as an index term vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{k,j})$.
     - $g_i$ is a function with $g_i(d_j) = w_{i,j}$ (i.e. given a document $d_j$, $g_i$ delivers the weight of term $t_i$ in $d_j$).
     CF. R. BAEZA-YATES [1], PAGE 25
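
The index term vector representation can be sketched in a few lines of Python. This is only a minimal illustration, not part of the lecture: the helper names (`build_vocabulary`, `term_vector`, `g`), the whitespace tokenization, the use of raw term counts as weights, and the two sample documents are all assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of unique index terms T = {t_1, ..., t_k}."""
    return sorted({term for text in docs for term in text.lower().split()})

def term_vector(text, vocabulary):
    """Return d_j = (w_1j, ..., w_kj); here w_ij is simply the count of t_i in d_j."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for absent terms, so w_ij = 0 if t_i is not in d_j.
    return [counts[term] for term in vocabulary]

def g(i, vector):
    """g_i(d_j): the weight of term t_i in d_j (i is 1-based, as on the slide)."""
    return vector[i - 1]

# Hypothetical two-document collection.
docs = ["web search engines", "search the web for information retrieval"]
vocabulary = build_vocabulary(docs)
vectors = [term_vector(d, vocabulary) for d in docs]

print(vocabulary)
print(vectors)
```

Any other weighting scheme (binary weights, TF*IDF) only changes what `term_vector` stores; the vector layout stays the same.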

  8. Classic Retrieval Models 1. Boolean Model (set theoretic)

  9. Boolean Retrieval Model - Queries
     Based on set theory and Boolean algebra.
     - Documents: index term vector $d_j = (w_{1,j}, \ldots, w_{k,j})$ with $w_{i,j} \in \{0, 1\}$
     - Queries: terms combined with AND, OR, NOT, i.e. a Boolean expression that can be written in disjunctive normal form (DNF)
     Example: [not preserved in the transcript]
     CF. R. BAEZA-YATES [1], CH. 2.5.2

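  10. Boolean Retrieval Model - Definition
      A query $q$ is defined as a Boolean expression $q_{dnf}$ in DNF, with $q_{cc}$ being the conjunctive components of $q_{dnf}$. $w_{i,j} \in \{0, 1\}$ are the index term weight variables. We define the similarity $sim$ of a document $d_j$ with query $q$ as
      $$ sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, q_{cc} : (q_{cc} \in q_{dnf}) \wedge (\forall t_i : g_i(d_j) = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases} $$
      (A document is considered relevant if $sim = 1$ and irrelevant otherwise.)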
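
A minimal Python sketch of this binary decision follows. It assumes a hypothetical three-term vocabulary and the illustrative query `t1 AND (t2 OR NOT t3)`, written out by hand as a full DNF in which every conjunctive component fixes the weight of all three terms. The function name `sim` mirrors the slide; the documents `d1` to `d3` are made up.

```python
def sim(doc, q_dnf):
    """Return 1 if doc agrees with some conjunctive component on every term, else 0."""
    return 1 if any(tuple(doc) == tuple(q_cc) for q_cc in q_dnf) else 0

# t1 AND (t2 OR NOT t3), expanded by hand into full DNF over (t1, t2, t3).
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

# Hypothetical binary document vectors (w_1j, w_2j, w_3j).
docs = {
    "d1": (1, 1, 0),  # contains t1 and t2
    "d2": (0, 1, 0),  # t1 is missing
    "d3": (1, 0, 1),  # t2 is missing and t3 is present
}

relevant = [name for name, vec in docs.items() if sim(vec, q_dnf) == 1]
print(relevant)
```

Only `d1` is retrieved. The other two documents are rejected outright, no matter how close they come to the query, which is the binary-decision limitation of the model.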

  11. Boolean Retrieval Model
      Advantages:
      - Precise, clean formalism
      - Offers great control and transparency
      - Simplicity: easy math, easy implementation
      - Good for domains where results are ranked by means other than relevance, e.g. chronologically
      Disadvantages:
      - Query might be hard to specify
      - Binary decision (relevant or not)
      - Often too many or too few results

  12. Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic)

  13. Vector Model - Definition
      Based on vector algebra.
      Main advantage (compared to the Boolean model): considers non-binary weights and calculates a similarity measure between query and document.
      Formal definition:
      - $w_{i,q} \geq 0$ is the weight associated with the pair $(t_i, q)$.
      - $k$ is the number of unique index terms.
      With this, we can define
      - Query vector $q = (w_{1,q}, w_{2,q}, \ldots, w_{k,q})$
      - Document vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{k,j})$

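  14. Vector Model - Definition (Cont.)
      The similarity between a query and a document can then be quantified by the correlation of the respective vectors, e.g.
      - using the inner product (arithmetical):
        $$ sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{k} w_{i,j} \, w_{i,q} $$
      - using the cosine of the angle between the two vectors:
        $$ sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{k} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{k} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{k} w_{i,q}^2}} $$
      Weights: often TF*IDF (or variants of it)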
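
Both similarity measures and one possible TF*IDF weighting can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: the variant used here is raw term frequency times `log(N / n_i)`, which is only one of many TF*IDF variants, and the function names, the term counts, and the query are hypothetical.

```python
import math

def inner_product(d, q):
    """Sum over all index terms of w_ij * w_iq."""
    return sum(w_d * w_q for w_d, w_q in zip(d, q))

def cosine(d, q):
    """Cosine of the angle between d and q; 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(d, d)) * math.sqrt(inner_product(q, q))
    return inner_product(d, q) / norm if norm else 0.0

def tf_idf(term_counts, doc_freq, n_docs):
    """Assumed variant: w_i = tf_i * log(N / n_i), and 0 if the term occurs nowhere."""
    return [tf * math.log(n_docs / df) if df else 0.0
            for tf, df in zip(term_counts, doc_freq)]

# Hypothetical term counts for three documents over three index terms.
counts = [[2, 1, 0], [0, 1, 1], [1, 1, 0]]
n_docs = len(counts)
# n_i: number of documents that contain term t_i.
doc_freq = [sum(1 for doc in counts if doc[i] > 0) for i in range(3)]

weights = [tf_idf(doc, doc_freq, n_docs) for doc in counts]
query = tf_idf([1, 0, 0], doc_freq, n_docs)  # a query consisting of t_1 only

# Rank document indices by decreasing cosine similarity to the query.
ranking = sorted(range(n_docs), key=lambda j: cosine(weights[j], query), reverse=True)
print(ranking)
```

In this toy collection the second term occurs in every document, so its IDF factor is `log(1) = 0` and it contributes nothing to either measure. The cosine also normalizes away document length, whereas the inner product favors documents with larger weights.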

  15. Vector Model - Illustration
      Simple example with 3 terms: [figure not preserved in the transcript]
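
As a stand-in for a three-term illustration, the cosine measure can be worked through by hand. The vectors below are hypothetical binary weights chosen for easy arithmetic; they are not the values from the original slide.

```latex
\begin{align*}
q &= (1, 1, 0), \quad d_1 = (1, 1, 1), \quad d_2 = (0, 1, 0) \\
\cos(d_1, q) &= \frac{1 \cdot 1 + 1 \cdot 1 + 1 \cdot 0}{\sqrt{3}\,\sqrt{2}} = \frac{2}{\sqrt{6}} \approx 0.816 \\
\cos(d_2, q) &= \frac{0 \cdot 1 + 1 \cdot 1 + 0 \cdot 0}{\sqrt{1}\,\sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
\end{align*}
```

Under these assumptions $d_1$ is ranked above $d_2$, yet $d_2$ still receives a non-zero score for its partial match. A Boolean query $t_1$ AND $t_2$ would have rejected $d_2$ entirely.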

  16. Vector Model
      Advantages:
      - Fast and easy
      - Finds similar documents (no binary decision)
      - Ranking based on similarity
      - Often better results than Boolean search (because of the term weighting)
      Disadvantages:
      - Terms are assumed to be independent

  17. Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic)

  18. Classic Retrieval Models 1. Boolean Model (set theoretic) 2. Vector Model (algebraic) 3. Probabilistic Models (probabilistic)
