
CS533 Information Retrieval

This lecture discusses the advantages and disadvantages of Boolean and ranked information retrieval systems. It examines the behavior, complexity, and output order of Boolean systems, as well as the understandability and usefulness of ranked systems. Various ranking models, such as vector space, fuzzy Boolean, probabilistic, knowledge-based, latent semantic indexing, inference networks, neural networks, and genetic algorithms, are explored. The concept of relevance, indexing effectiveness, stemming, n-grams, the vector space model, probabilistic information retrieval, fuzzy Boolean models, latent semantic indexing, knowledge-based IR, inference networks, and evaluation methods are also covered. The lecture concludes with discussions on building inverted files, alternative data structures, search engines, and metasearch engines.



Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #26 May 11, 2000

  2. AI and IR • Started at about the same time • Feigenbaum and Feldman - “Computers and Thought” McGraw Hill, 1963. • Minsky - “Semantic Information Processing” MIT Press, 1968. • Salton - “Automatic Information Organization and Retrieval” McGraw Hill, 1968.

  3. Advantages of Boolean Systems • Easy to understand behavior • Enables formulating complex, very specific queries

  4. Disadvantages of Boolean Systems • Difficult to formulate complex Boolean queries • Output order is not by relevance

  5. Disadvantages of Boolean Systems • All or nothing systems • When users specify (A and B and C and D) should an item with A, B, and C but not D be rejected? • Are all query terms equally important? • Difficult to control size of output. • Too much or too little

  6. The concept of rank • Retrieved documents ordered by decreasing "goodness" (increasing rank) • Rank often computed using a similarity function that compares a document and a query
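
The contrast between the all-or-nothing Boolean behavior on slide 5 and ranking by a similarity function can be sketched as follows. This is a minimal illustration only: the overlap count used as the similarity function, the function names, and the three toy documents are all assumptions made for the example, not material from the lecture.

```python
def boolean_and(query_terms, doc_terms):
    """Strict Boolean AND: the document must contain every query term."""
    return all(term in doc_terms for term in query_terms)


def overlap_score(query_terms, doc_terms):
    """Toy similarity function (an assumption): number of query terms present."""
    return sum(1 for term in query_terms if term in doc_terms)


# Hypothetical collection: each document is reduced to its set of terms.
docs = {
    "d1": {"a", "b", "c", "d"},
    "d2": {"a", "b", "c"},  # has A, B and C but not D
    "d3": {"a"},
}
query = ["a", "b", "c", "d"]

# Boolean retrieval rejects d2 outright.
boolean_hits = [name for name, terms in docs.items() if boolean_and(query, terms)]

# Ranked retrieval keeps every document and orders them by decreasing score.
ranked = sorted(docs, key=lambda name: overlap_score(query, docs[name]), reverse=True)

print(boolean_hits)  # ['d1']
print(ranked)        # ['d1', 'd2', 'd3']
```

The ranked output also gives a natural way to control output size: the user simply reads the top of the list.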

  7. Advantages of ranked systems • In successful IR systems a high percentage of the top documents are useful to users

  8. Disadvantages of ranked systems • Behavior of system harder to understand

  9. Ranking IR system - models • Vector space • Fuzzy Boolean • Probabilistic

  10. Ranking IR system - models • Knowledge based • Latent semantic indexing • Inference nets • Neural network and genetic algorithms *

  11. The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will have different judgements • Same users may judge differently at different times • Degree of relevance of different documents will vary

  12. The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) are relevant • A document is either relevant or not

  13. Indexing Effectiveness • Indexing exhaustivity • Term specificity

  14. Stop lists • A stop list is a list of terms which are not included in an index • Traditionally the most frequently occurring English words • May also be collection specific, e.g. “computer, machine, program, source, language” in a computer science collection • Some loss of content, e.g. “to be or not to be”
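
The loss of content mentioned on this slide is easy to demonstrate. The sketch below assumes a tiny hand-picked stop list and whitespace tokenization; both are simplifications chosen for the example, and `index_terms` is a hypothetical helper name.

```python
# Assumed, deliberately tiny stop list; real lists contain a few hundred words.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "in"}


def index_terms(text, stop_words=STOP_WORDS):
    """Return the terms of `text` that would be kept in the index."""
    return [word for word in text.lower().split() if word not in stop_words]


print(index_terms("The design of a compiler"))  # ['design', 'compiler']
print(index_terms("To be or not to be"))        # [] (the whole phrase is lost)
```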

  15. Stemming is used to: • Enhance query formulation (and improve recall) by providing term variants • Reduce the size of index files by combining term variants into a single index term
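
How stemming combines term variants into one index term can be sketched with a toy suffix stripper. This is not the Porter algorithm or any stemmer named in the lecture; the suffix list, the minimum stem length, and the function names are assumptions chosen to keep the example short.

```python
SUFFIXES = ("ing", "es", "ed", "s")  # assumed toy suffix list, checked in order


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def conflate(terms):
    """Group term variants under their shared stem (one index entry per stem)."""
    index = {}
    for term in terms:
        index.setdefault(toy_stem(term), set()).add(term)
    return index


groups = conflate(["computing", "computes", "computed", "computer"])
print(sorted(groups))  # ['comput', 'computer']
```

Three of the four variants collapse into the single entry `comput`, which shrinks the index and lets a query on any one variant retrieve the others.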

  16. n-grams • Fixed length consecutive series of “n” characters • Bigrams: • Sea colony -> (se ea co ol lo on ny) • Trigrams • Sea colony -> (sea col olo lon ony), or -> (#se sea ea# #co col olo lon ony ny#)
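
The bigram and trigram examples on this slide can be reproduced with a few lines of code. The sketch assumes n-grams are taken within each word separately (as the slide's examples do) and that `#` is the padding character; `char_ngrams` is a hypothetical name.

```python
def char_ngrams(text, n, pad=False):
    """Return the character n-grams of each word in `text`, in order.

    With pad=True each word is wrapped in '#' so that word boundaries
    are captured, as in the slide's second trigram example.
    """
    grams = []
    for word in text.lower().split():
        if pad:
            word = "#" + word + "#"
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("Sea colony", 2))
# ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(char_ngrams("Sea colony", 3))
# ['sea', 'col', 'olo', 'lon', 'ony']
print(char_ngrams("Sea colony", 3, pad=True))
# ['#se', 'sea', 'ea#', '#co', 'col', 'olo', 'lon', 'ony', 'ny#']
```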

  17. Usage of n-grams • Used in World War II by cryptographers • Spell checking • Text compression • Signature files • Stemming

  18. The Vector Space Model • Queries and documents are represented by vectors • Assumes document terms and query terms are independent • Term weight • Variants and meaning of tf and idf • Different normalization schemes
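
One common way to instantiate the vector space model is tf-idf weighting with cosine normalization. The sketch below assumes raw term frequency for tf and log(N/df) for idf; the slide notes that many variants and normalization schemes exist, so this is one choice among several. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def build_idf(docs):
    """idf(t) = log(N / df(t)), one of several idf variants."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weight(terms, idf):
    """Sparse tf-idf vector; terms unseen in the collection are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(terms).items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    ["information", "retrieval", "system"],
    ["database", "system"],
    ["information", "theory"],
]
idf = build_idf(docs)
doc_vectors = [weight(doc, idf) for doc in docs]
query_vector = weight(["information", "retrieval"], idf)

scores = [cosine(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

Document 1 shares no term with the query, so its score is exactly zero; document 0 matches both query terms and ranks first.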

  19. Probabilistic information retrieval • Binary independence model • Non-binary independence models

  20. Binary independence model
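
For reference, the textbook form of the binary independence model ranks a document $d$ for a query $q$ by the retrieval status value below. The notation is an assumption introduced here, not taken from the slides: $p_t$ is the probability that term $t$ occurs in a relevant document and $u_t$ the probability that it occurs in a non-relevant one.

```latex
\mathrm{RSV}(d, q) \;=\; \sum_{t \,\in\, q \cap d} \log \frac{p_t \,(1 - u_t)}{u_t \,(1 - p_t)}
```

Only terms that occur in both the query and the document contribute, and each contributes independently of the others, which is the independence assumption the model is named for.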

  21. Fuzzy Boolean Models • Limitations of the Boolean model • Fuzzy models • basic • MMM • Paice • p-norm
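
The first two fuzzy models on this slide can be sketched in a few lines. In the basic model AND is the minimum and OR the maximum of the document's term weights; the MMM (Mixed Min and Max) model softens this by taking a linear combination of the two. The coefficient value 0.6 below is an assumed illustration, as are the function names.

```python
def fuzzy_and(weights):
    """Basic fuzzy model: AND is the minimum term weight."""
    return min(weights)


def fuzzy_or(weights):
    """Basic fuzzy model: OR is the maximum term weight."""
    return max(weights)


def mmm_or(weights, c_or1=0.6):
    """MMM: OR leans towards the max but still rewards the min (c_or2 = 1 - c_or1)."""
    return c_or1 * max(weights) + (1.0 - c_or1) * min(weights)


def mmm_and(weights, c_and1=0.6):
    """MMM: AND leans towards the min but still rewards the max (c_and2 = 1 - c_and1)."""
    return c_and1 * min(weights) + (1.0 - c_and1) * max(weights)


# Weights of query terms A and B in one hypothetical document.
doc_weights = [0.8, 0.2]
print(fuzzy_and(doc_weights), fuzzy_or(doc_weights))  # 0.2 0.8
print(round(mmm_and(doc_weights), 2), round(mmm_or(doc_weights), 2))  # 0.44 0.56
```

With MMM a document that is strong on one term and weak on the other is no longer scored as if only its weakest (AND) or strongest (OR) term mattered.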

  22. Latent semantic indexing • Designed to overcome: • The language variability problem, where a user expresses a concept with different words than those used in a document • The multiple meanings of words • Uses SVD or two-mode factor analysis
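
The SVD step can be written out as follows. The notation is an assumption introduced for this note: $A$ is the term-document matrix, $k$ is the number of retained dimensions, and $q$ is a query expressed as a term vector.

```latex
A \;=\; U \Sigma V^{T}
\qquad\Longrightarrow\qquad
A \;\approx\; A_k \;=\; U_k \Sigma_k V_k^{T},
\qquad
\hat{q} \;=\; \Sigma_k^{-1} U_k^{T} q
```

Documents (the rows of $V_k$) and the folded-in query $\hat{q}$ are then compared in the $k$-dimensional space, typically with cosine similarity, so a document can match a query even when the two share no literal term.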

  23. Knowledge Based IR • Knowledge based information retrieval attempts to identify the occurrence of high level concepts in documents • Concepts and their relationships represent the knowledge needed for retrieval • Evidential reasoning provides the link between a document and its concepts

  24. Inference Networks for IR • Turtle and Croft introduced the inference network model for information retrieval • This is a probability-based method • Ranks documents by probability of satisfying a user's information need.

  25. Evaluation • Fallout • Recall and precision • 11 point recall/precision • Average precision
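
The measures listed on this slide can be computed as sketched below. The sketch assumes binary relevance judgements (as stated on slide 12), uses the usual interpolation rule for the 11-point curve (precision at recall level r is the maximum precision at any recall of at least r), and uses hypothetical function names and a toy ranking.

```python
def precision_recall_fallout(retrieved, relevant, collection_size):
    """Set-based measures for one query."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    non_relevant_total = collection_size - len(relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fallout = (len(retrieved) - hits) / non_relevant_total if non_relevant_total else 0.0
    return precision, recall, fallout


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return [0.0] * 11
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return [
        max((p for r, p in points if r >= level / 10), default=0.0)
        for level in range(11)
    ]


ranking = ["d1", "d2", "d3", "d4", "d5"]  # system output, best first
relevant = {"d1", "d3"}                   # judged relevant
print(precision_recall_fallout(ranking, relevant, collection_size=10))
# (0.4, 1.0, 0.375)
print(round(average_precision(ranking, relevant), 4))  # 0.8333
```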

  26. Building inverted files • Memory based • Sort based • Text partitioning • Lexical partitioning (FASTINV)
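
The memory based approach is the simplest of the four: the whole index is accumulated in a dictionary and written out at the end. The sketch below assumes whitespace tokenization and postings of the form (document id, term frequency); it does not attempt the sort based or partitioned methods, which exist precisely because this approach stops working once the index no longer fits in memory.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency)."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Sort the vocabulary and each postings list, as an on-disk file would be.
    return {term: sorted(entries.items()) for term, entries in sorted(postings.items())}


index = build_inverted_index(["sea colony", "sea sea shore"])
print(index)
# {'colony': [(0, 1)], 'sea': [(0, 1), (1, 2)], 'shore': [(1, 1)]}
```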

  27. Signature file • Alternative to inverted index • A compressed representation of documents • Uses n-grams and hashing • Enable searching for prefix and part of words • No ranks • Techniques to increase efficiency
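
A signature file built with superimposed coding can be sketched as follows. Each word sets a few hashed bit positions in a fixed-width bit string, and a document's signature is the bitwise OR of its word signatures. The width, the number of bits per word, and the use of MD5 as the hash are assumptions for this example. For simplicity the sketch hashes whole words; hashing each word's n-grams instead is what would allow the prefix and part-of-word searches mentioned on the slide.

```python
import hashlib


def word_signature(word, width=64, bits_per_word=3):
    """Set up to `bits_per_word` hashed bit positions in a `width`-bit signature."""
    signature = 0
    for i in range(bits_per_word):
        digest = hashlib.md5(f"{i}:{word.lower()}".encode("utf-8")).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % width)
    return signature


def text_signature(text, width=64, bits_per_word=3):
    """Superimpose (bitwise OR) the signatures of all words in the text."""
    signature = 0
    for word in text.split():
        signature |= word_signature(word, width, bits_per_word)
    return signature


def may_contain(doc_signature, query_signature):
    """True if every query bit is set in the document signature.

    A True result can be a false drop and must be verified against the text;
    a False result is definite.
    """
    return (doc_signature & query_signature) == query_signature


doc_sig = text_signature("Sea colony")
print(may_contain(doc_sig, word_signature("sea")))  # True
```

The test only says whether a document may match, which is why signature files give a yes/no answer and no ranks.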

  28. An alternative data structure to using inverted files • PAT trees (Patricia trees built over all suffixes of the text, closely related to suffix trees) • PAT arrays (also called suffix arrays)
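
A PAT array (suffix array) is simply the list of all suffix start positions, sorted by the suffix that begins there; any substring can then be located by binary search. The sketch below uses a naive construction that sorts full suffixes, which is fine for illustration but far too slow for real collections. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, ordered lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of `pattern`, found by binary search."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "banana"
suffix_array = build_suffix_array(text)
print(suffix_array)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, suffix_array, "ana"))   # [1, 3]
```

Unlike an inverted file, the structure is not tied to word boundaries, so it supports arbitrary substring and prefix queries.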

  29. Search engines • Robots and indexing • Using hypertext links to improve retrieval • PageRank - importance of documents • Hubs and Authorities • Webor
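
The PageRank idea of document importance can be sketched with a simple power iteration. The damping factor 0.85, the fixed iteration count, the treatment of pages without out-links (their rank is spread evenly over all pages), and the four-page link graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank uniformly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a -> b, c; b -> c; c -> a; d -> c.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # c
print(min(rank, key=rank.get))  # d
```

Page `c` collects links from three pages and ends up most important; page `d` has no in-links and keeps only the baseline share.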

  30. Metasearch Engine Two observations about search engines: • Web pages a user needs are frequently spread across multiple search engines. • The coverage of each search engine is limited. Combining multiple search engines may therefore increase the coverage, which makes a metasearch engine a good mechanism for addressing these problems.

  31. Metasearch Engines • Data selection problem • Query formulation problem • Result merging problem

  32. Clustering • Some clustering algorithms • Document clustering • Term clustering • Cluster based retrieval

  33. Phrases and Thesaurus • Usages • Phrase generation and recognition • Techniques for automatic building of corpus based thesaurus

  34. Relevance feedback • The main idea • Issues • Query modification examples
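
One widely used query modification scheme is Rocchio's formula, which moves the query vector towards the documents judged relevant and away from those judged non-relevant. The slide does not say which scheme the lecture used, so treating Rocchio as the example is an assumption, as are the coefficient values and the toy vectors below.

```python
def rocchio(query, relevant_docs, non_relevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector; all vectors are sparse dicts of term weights."""
    terms = set(query)
    for doc in relevant_docs + non_relevant_docs:
        terms |= set(doc)

    new_query = {}
    for term in terms:
        w = alpha * query.get(term, 0.0)
        if relevant_docs:
            w += beta * sum(d.get(term, 0.0) for d in relevant_docs) / len(relevant_docs)
        if non_relevant_docs:
            w -= gamma * sum(d.get(term, 0.0) for d in non_relevant_docs) / len(non_relevant_docs)
        if w > 0.0:  # negative weights are dropped, a common convention
            new_query[term] = w
    return new_query


query = {"sea": 1.0}
relevant_docs = [{"sea": 1.0, "colony": 1.0}]
non_relevant_docs = [{"sea": 1.0, "food": 1.0}]

new_query = rocchio(query, relevant_docs, non_relevant_docs)
print(sorted(new_query))  # ['colony', 'sea']
```

The modified query gains the term `colony` from the relevant document, while `food`, seen only in the non-relevant document, receives a negative weight and is dropped.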

  35. Extracts/intelligent abstracts • IR Extracts are lists of fragments of text • IE extracts - extracts words/phrases to generate an abstract • Intelligent abstracts re-phrase content coherently (no redundant text, may use generalizations, etc.)

  36. Themes and text traversals • Text traversals provide a reader with a path of text excerpts • The user can specify how large a text traversal should be • The traversal can also be in response to a query
