Digital Days

This presentation is an introduction to the models and techniques used in information retrieval, covering the logical (Boolean), vector processing, probabilistic, and cognitive models, the evaluation of retrieval results, and document clustering. It also discusses the drawbacks of the individual models, including the limitations that motivated the development of fuzzy set models.


Presentation Transcript


  1. Digital Days Information Retrieval Bassiou Nikoletta Artificial Intelligence and Information Analysis Lab Aristotle University of Thessaloniki Informatics Department

  2. Information Retrieval • Introduction • Models / Techniques • Evaluation of Results • Clustering • References Aristotle University of Thessaloniki Informatics Department

  3. Information Retrieval • Introduction • Research in developing algorithms and models for retrieving information from document repositories (document / text retrieval) • Main activities: • Indexing: the representation of documents • Searching: the way documents are examined in order to characterize them as relevant Aristotle University of Thessaloniki Informatics Department

  4. Information Retrieval • Models / Techniques • Logical • Vector Processing • Probabilistic • Cognitive Aristotle University of Thessaloniki Informatics Department

  5. Information Retrieval • Models / Techniques (cont.) • Logical (Boolean) Model • Documents: represented by index terms or keywords • Requests: logical combinations (AND, OR, NOT) of these terms • Document retrieved when it satisfies the logical expression of the request • Example: D1 = {A, B}, D2 = {B, C}, D3 = {A, B, C}; Q = A AND B AND (NOT C); answer set = {D1} Aristotle University of Thessaloniki Informatics Department
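
A minimal Python sketch of the Boolean model using the slide's example documents and query (the helper name satisfies_query is illustrative):

```python
# Boolean (logical) model: documents are sets of index terms and a query
# is a logical combination of terms, here Q = A AND B AND NOT C.

docs = {
    "D1": {"A", "B"},
    "D2": {"B", "C"},
    "D3": {"A", "B", "C"},
}

def satisfies_query(terms):
    # Q = A AND B AND NOT C
    return "A" in terms and "B" in terms and "C" not in terms

retrieved = [name for name, terms in docs.items() if satisfies_query(terms)]
print(retrieved)  # ['D1'] -- the answer set {D1} from the slide's example
```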

  6. Information Retrieval • Models / Techniques (cont.) • Logical (Boolean) Model (cont.) • Drawbacks: • Formulation of the query is difficult / trained intermediaries have to search on behalf of the user • Results: partition of the database into two discrete subsets, with no mechanism for ranking by decreasing probability of relevance • All query terms are considered to be equal: they are either present or not • Closed-World Assumption: the absence of an index term from a document makes that term false for the document • These drawbacks led to the development of fuzzy set models Aristotle University of Thessaloniki Informatics Department

  7. Information Retrieval • Models / Techniques (cont.) • Vector Processing Model • Documents/queries are represented in a high-dimensional space • Each dimension corresponds to a word in the document collection • Most relevant documents for a query: documents represented by the vectors closest to the query Aristotle University of Thessaloniki Informatics Department

  8. Information Retrieval • Models / Techniques (cont.) • Vector Processing Model (cont.) • Document: t-dimensional vector Di = (di1, di2, …, dit), where dij is the weight of the j-th term and dij = 0 when the j-th term is absent from document Di • Indexing of documents: number of term occurrences in each document, number of documents in which each term is present, or other measures • Query: Qj = (qj1, qj2, …, qjt) Aristotle University of Thessaloniki Informatics Department

  9. Information Retrieval • Models / Techniques (cont.) • Vector Processing Model (cont.) • Similarity Computation: • Inner Product: sim(Di, Q) = Σj dij qj • Cosine: cos(Di, Q) = Σj dij qj / (‖Di‖ ‖Q‖) • When applied to normalized vectors, the cosine gives the same ranking of similarities as the Euclidean distance Aristotle University of Thessaloniki Informatics Department
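
A short sketch of the two similarity measures, with made-up term weights for illustration:

```python
import math

def inner_product(d, q):
    # sim(D, Q) = sum over terms of d_j * q_j
    return sum(dj * qj for dj, qj in zip(d, q))

def cosine(d, q):
    # cos(D, Q) = inner product divided by the product of the vector norms
    norm_d = math.sqrt(sum(dj * dj for dj in d))
    norm_q = math.sqrt(sum(qj * qj for qj in q))
    return inner_product(d, q) / (norm_d * norm_q)

D1 = [0.5, 0.8, 0.3]   # hypothetical weights d11, d12, d13
Q  = [0.0, 1.0, 0.5]   # hypothetical query weights q1, q2, q3

print(inner_product(D1, Q))     # 0.95
print(round(cosine(D1, Q), 3))  # 0.858
```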

  10. Information Retrieval • Models / Techniques (cont.) • Vector Processing Model (cont.) • Drawbacks • Use of indexing terms to define the dimensions of the space involves the incorrect assumption that the terms are orthogonal • Practical limitations: a discriminating ranking needs several query terms, whereas in Boolean models two or three ANDed terms are enough • Difficulty of explicitly specifying synonym and phrasal relationships Aristotle University of Thessaloniki Informatics Department

  11. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model • Probability Ranking Principle: ranking of documents in order of decreasing probability of relevance to a user’s information need • Term-Weight Specification: selectivity is what makes a good term good, i.e. whether it can pick the few relevant documents out of the many non-relevant ones Aristotle University of Thessaloniki Informatics Department

  12. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Collection Frequency: terms occurring in few documents are more valuable; n: the number of documents in which term t(i) occurs, N: the number of documents in the collection • Term Frequency: terms occurring more often in a document are more likely to be important for that document Aristotle University of Thessaloniki Informatics Department

  13. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Term Frequency: TF(i, j), the number of occurrences of term t(i) in document d(j) • Document Length: DL(j), the total number of term occurrences in document d(j) Aristotle University of Thessaloniki Informatics Department

  14. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Normalized Document Length: used to normalize the Term Frequency contribution • Combined Weight: combination of the above weight measures, used for score calculation; k1 (= 2) controls the influence of Term Frequency and b (= 0.75) controls the influence of Document Length Aristotle University of Thessaloniki Informatics Department
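
A sketch of the combined weight in the form commonly given in the Robertson / Sparck Jones work cited in the references (the formula on the original slide was an image, so this reconstruction is an assumption); k1 = 2 and b = 0.75 are the values quoted above:

```python
import math

def combined_weight(tf, n_i, N, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """Assumed form of the combined weight:
       CFW(i)  = log(N / n_i)                      (collection frequency weight)
       NDL(j)  = DL(j) / average DL                (normalized document length)
       CW(i,j) = CFW(i) * TF(i,j) * (k1 + 1) /
                 (k1 * ((1 - b) + b * NDL(j)) + TF(i,j))
    """
    cfw = math.log(N / n_i)
    ndl = doc_len / avg_doc_len
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)

# Hypothetical numbers: the term occurs 3 times in a document of average
# length and appears in 100 of 10,000 collection documents.
print(combined_weight(tf=3, n_i=100, N=10_000, doc_len=250, avg_doc_len=250))
```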

  15. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Term-Weighting Components Aristotle University of Thessaloniki Informatics Department

  16. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Typical term-weighting formulas Aristotle University of Thessaloniki Informatics Department

  17. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Iterative searching: Term Reweighting / Query Expansion • Relevance weighting: based on the relation between the relevant and non-relevant documents for a search term; r = the number of known relevant documents in which term t(i) appears, R = the number of known relevant documents for a request Aristotle University of Thessaloniki Informatics Department
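
A sketch of the relevance weight in its commonly used smoothed form from the Robertson / Sparck Jones line of work in the references (the slide's own formula was an image, so the exact variant is an assumption):

```python
import math

def relevance_weight(r, R, n, N):
    """RW(i) = log( ((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)) )
    r: known relevant documents containing term t(i)
    R: known relevant documents for the request
    n: collection documents containing t(i)
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term that occurs in 8 of 10 known relevant documents but in only
# 100 of 10,000 collection documents gets a large positive weight.
print(relevance_weight(r=8, R=10, n=100, N=10_000))
```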

  18. Information Retrieval • Models / Techniques (cont.) • Probabilistic Model (cont.) • Iterative Combination: • Query Expansion: adding to a query new search terms taken from documents assessed as relevant Aristotle University of Thessaloniki Informatics Department

  19. Information Retrieval • Models / Techniques (cont.) • Cognitive Model • Focus on: • the user’s information-seeking behaviour • the ways in which IR systems are used in operational environments • Experiments on the way in which a user’s information needs may change during interaction with the IR system, leading to more flexible interfaces Aristotle University of Thessaloniki Informatics Department

  20. Information Retrieval • Evaluation of Results • Precision: proportion of retrieved documents that are relevant • Recall: proportion of relevant documents that are retrieved • Fallout: proportion of non-relevant documents that are retrieved • Generality: proportion of relevant documents within the entire collection Aristotle University of Thessaloniki Informatics Department
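
A small sketch computing the four measures from sets of document identifiers (the data are made up):

```python
def evaluate(retrieved, relevant, collection):
    # Precision, recall, fallout and generality as defined on the slide.
    non_relevant = collection - relevant
    precision  = len(retrieved & relevant) / len(retrieved)
    recall     = len(retrieved & relevant) / len(relevant)
    fallout    = len(retrieved & non_relevant) / len(non_relevant)
    generality = len(relevant) / len(collection)
    return precision, recall, fallout, generality

collection = {f"d{i}" for i in range(1, 21)}      # 20 documents
relevant   = {"d1", "d2", "d3", "d4"}             # 4 relevant documents
retrieved  = {"d1", "d2", "d5", "d6"}             # 4 retrieved, 2 of them relevant
print(evaluate(retrieved, relevant, collection))  # (0.5, 0.5, 0.125, 0.2)
```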

  21. Information Retrieval • Evaluation of Results (cont.) • Example: Aristotle University of Thessaloniki Informatics Department

  22. Information Retrieval • Evaluation of Results (cont.) • Precision-Recall graph Aristotle University of Thessaloniki Informatics Department

  23. Information Retrieval • Evaluation of Results (cont.) • Three-point average precision: averaging the precision at three recall levels (typically 0.25, 0.50 and 0.75) • Eleven-point average precision: averaging the precision at eleven recall levels (0.0, 0.1, …, 1.0) Aristotle University of Thessaloniki Informatics Department
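
A sketch of eleven-point average precision, using the common interpolation in which the precision at a recall level is the maximum precision achieved at any recall greater than or equal to that level (the interpolation choice is an assumption, not taken from the slide):

```python
def eleven_point_average_precision(ranking, num_relevant):
    """ranking: list of (doc_id, is_relevant) pairs, best match first."""
    points = []                       # (recall, precision) after each relevant hit
    hits = 0
    for rank, (_, is_relevant) in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    interpolated = []
    for level in (i / 10 for i in range(11)):          # 0.0, 0.1, ..., 1.0
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

ranking = [("d3", True), ("d7", False), ("d1", True), ("d9", False), ("d2", True)]
print(round(eleven_point_average_precision(ranking, num_relevant=3), 3))  # 0.764
```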

  24. Information Retrieval • Clustering • Non-Exclusive (Overlapping): e.g. fuzzy clustering, where each object has a degree of membership in every cluster • Exclusive • Extrinsic (Supervised) • Intrinsic (Unsupervised)*: Agglomerative / Divisive • Hierarchical: nested sequence of partitions • Partitional: single partition Aristotle University of Thessaloniki Informatics Department

  25. Information Retrieval • Clustering (cont.) • Hierarchical: transformation of the proximity matrix (similarity/dissimilarity indices) into a sequence of nested partitions • Threshold graph G(v) for each dissimilarity level v: insert an edge (i, j) between nodes i and j if objects i and j are less dissimilar than v, i.e. (i, j) ∈ G(v) if and only if d(i, j) ≤ v Aristotle University of Thessaloniki Informatics Department

  26. Information Retrieval • Clustering (cont.) • Single-Link Clustering Algorithm • Every object is placed in a unique cluster (G(0)). Set k ← 1. • G(k) formation: if the number of components (maximally connected subgraphs) in G(k) is less than the number of clusters in the current clustering, redefine the current clustering by naming each component of G(k) as a cluster. • If G(k) consists of a single connected graph, stop. Else, set k ← k+1 and go to the previous step. Aristotle University of Thessaloniki Informatics Department
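
A sketch of the single-link procedure above in graph terms: edges are added in order of increasing dissimilarity (growing the threshold graph G(v)), and clusters are the connected components, tracked here with a small union-find (the data and names are illustrative):

```python
def single_link(dissim):
    """dissim: dict mapping frozenset({i, j}) -> dissimilarity between objects i and j."""
    objects = sorted({x for pair in dissim for x in pair})
    parent = {x: x for x in objects}      # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for pair, v in sorted(dissim.items(), key=lambda kv: kv[1]):
        i, j = sorted(pair)
        ri, rj = find(i), find(j)
        if ri != rj:                      # edge joins two components -> merge the clusters
            parent[rj] = ri
            merges.append((i, j, v))
    return merges                         # the nested sequence of partitions (dendrogram)

d = {frozenset({"a", "b"}): 1.0, frozenset({"a", "c"}): 4.0, frozenset({"b", "c"}): 2.5}
print(single_link(d))  # [('a', 'b', 1.0), ('b', 'c', 2.5)]
```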

  27. Information Retrieval • Clustering (cont.) • Complete-Link Clustering Algorithm • Every object is placed in a unique cluster (G(0)). Set k ← 1. • G(k) formation: if two of the current clusters form a clique (maximally complete subgraph) in G(k), redefine the current clustering by merging these two clusters into a single cluster. • If k = n(n-1)/2, so that G(k) is the complete graph on the n nodes, stop. Else, set k ← k+1 and go to the previous step. Aristotle University of Thessaloniki Informatics Department

  28. Information Retrieval • Clustering (cont.) • Example Aristotle University of Thessaloniki Informatics Department

  29. Information Retrieval • Clustering (cont.) • Other Algorithms: • Hubert’s Algorithm for Single-Link and Complete Link • Graph Theory Algorithm for Single-Link Aristotle University of Thessaloniki Informatics Department

  30. Information Retrieval • Clustering (cont.) • Matrix Updating Algorithms for Single-Link and Complete-Link • Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0. • Find the least dissimilar pair of clusters in the current clustering, {(r), (s)}, according to d[(r), (s)] = min {d[(i), (j)]} • Set m ← m+1. Merge clusters (r) and (s). Set the level to L(m) = d[(r), (s)] • Update the proximity matrix by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column for the newly formed cluster. Aristotle University of Thessaloniki Informatics Department

  31. Information Retrieval • Clustering (cont.) • Matrix Updating Algorithms for Single-Link and Complete-Link (cont.) • The proximity between the new cluster (r, s) and an old cluster (k) is defined as follows: d[(k), (r, s)] = min {d[(k), (r)], d[(k), (s)]} (single-link); d[(k), (r, s)] = max {d[(k), (r)], d[(k), (s)]} (complete-link) • Generalized (Lance-Williams) formula: d[(k), (r, s)] = α_r d[(k), (r)] + α_s d[(k), (s)] + β d[(r), (s)] + γ |d[(k), (r)] - d[(k), (s)]| Aristotle University of Thessaloniki Informatics Department
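
A compact sketch of the matrix-updating scheme: repeatedly merge the least dissimilar pair of clusters and update the proximities with the single-link (min) or complete-link (max) rule, the two special cases of the generalized formula above (the data and names are illustrative):

```python
def agglomerate(dist, linkage=min):
    """dist: dict {frozenset({a, b}): dissimilarity}; linkage: min (single) or max (complete)."""
    clusters = {frozenset([x]) for pair in dist for x in pair}
    d = {frozenset({frozenset([a]) for a in pair}): v for pair, v in dist.items()}
    levels = []
    while len(clusters) > 1:
        # Find the least dissimilar pair {(r), (s)} and merge it at level d[(r), (s)].
        pair, level = min(d.items(), key=lambda kv: kv[1])
        r, s = tuple(pair)
        merged = r | s
        clusters -= {r, s}
        # Update the proximity between the new cluster (r, s) and every old cluster (k).
        for k in clusters:
            d[frozenset({k, merged})] = linkage(d.pop(frozenset({k, r})),
                                                d.pop(frozenset({k, s})))
        del d[pair]
        clusters.add(merged)
        levels.append((set(merged), level))
    return levels

dist = {frozenset({"a", "b"}): 1.0, frozenset({"a", "c"}): 4.0, frozenset({"b", "c"}): 2.5}
print(agglomerate(dist, linkage=min))  # single-link:   merges at levels 1.0 and 2.5
print(agglomerate(dist, linkage=max))  # complete-link: merges at levels 1.0 and 4.0
```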

  32. Information Retrieval • Clustering (cont.) • Coefficient Values for Matrix Updating Algorithms Aristotle University of Thessaloniki Informatics Department

  33. Information Retrieval • Clustering (cont.) • Main characteristics of the methods • Single-link: merges on the closest pair of objects, which can produce clusters with little homogeneity (chaining) • Complete-link (more conservative): merges on the most distant pair, which can produce clusters that are not well separated • UPGMA: weights the contribution of each object equally, taking into account the sizes of the clusters • WPGMA: weights objects in small clusters more heavily than objects in large clusters • UPGMC-WPGMC: • the proximity measure is the Euclidean distance • geometric interpretation: the distance between cluster centroids Aristotle University of Thessaloniki Informatics Department

  34. Information Retrieval • Clustering (cont.) • Example Aristotle University of Thessaloniki Informatics Department

  35. Information Retrieval • References • Manning C.D. and Schutze H., Foundations of Statistical Natural Language Processing, MIT Press, 1999. • Sparck Jones K. and Willett P., Readings in Information Retrieval, Morgan Kaufmann Publishers, San Francisco, California, 1997. • Salton G., Wong A., and Yang C.S., “A vector space model for automatic indexing,” Communications of the ACM, vol. 18, pp. 613–620, 1975. • Salton G. and Buckley C., “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, pp. 513–523, 1988. Aristotle University of Thessaloniki Informatics Department

  36. Information Retrieval • References (cont.) • Salton G., “The SMART environment for retrieval system evaluation - advantages and problem areas,” in K. Sparck Jones (Ed.), Information Retrieval Experiment, pp. 316–329, 1981. • Robertson S.E., “The probability ranking principle in IR,” Journal of Documentation, vol. 33, pp. 126–148, 1977. • Robertson S.E. and Sparck Jones K., “Simple, proven approaches to text retrieval,” Technical Report TR 356, Cambridge University Computer Laboratory, 1994. • Jain A.K. and Dubes R.C., Algorithms for Clustering Data, Prentice-Hall, 1988. Aristotle University of Thessaloniki Informatics Department
