This collection of notes from the course INF 722 on Information Organisation covers the FOA process, including query formulation, retrieval algorithms, and relevance feedback. Key topics include query languages, methods for constructing answers, information need translation, and techniques such as lemmatization and the removal of stop words. The notes also discuss facets of documents, indexing strategies, Zipf’s Law, and the balance between precision and recall in information retrieval. This comprehensive guide serves as a valuable resource for students and professionals in the field of information science.
Inf 722 Information Organisation Class notes: Information Retrieval Jagdish S. Gangolly Inf 722 Information Organisation (Fall 2007) (Gangolly)
FOA (Finding Out About) Process • Asking a question (query formulation) • Constructing an answer (retrieval algorithms) • Assessing the answer (feedback on relevance)
FOA Process • Query language • Natural or artificial • Vocabulary • Syntax: operators, arguments • Query expansion, specialization, disambiguation, relevance feedback
FOA Process • Constructing the answer • Is the information need accurately translated into the query? • How can the answer be provided in a form suitable to the user? • Should background be provided so that the user can verbalise the information need better? • How can the query and the corpus be represented efficiently and effectively?
FOA Process • Constructing the answer (contd.) • Generate a set of index terms that render the documents in the collection as different from one another as possible • Conflation algorithms • Removal of function/fluff/stop words (usually closed-class words) • Stripping suffixes (stemming, as an approximation to lemmatization) • Detection of equivalent/associated words
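The conflation steps above can be sketched as follows. This is a minimal illustration, not the method used in the course: the stop list `STOP_WORDS`, the suffix list `SUFFIXES`, and the function names are all assumptions, and a real system would use a full stop list and a proper stemmer.

```python
import re

# Tiny illustrative stop list (assumed); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes tried in this order (assumed); a crude stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(text: str) -> list[str]:
    """Lowercase, tokenise, drop stop words, then strip suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
```

With these lists, "indexed", "indexes", and "indexing" all conflate to the single index term "index".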
FOA Process • Facets of documents: • Structure (DTD) • Format (CSS, XSL) • Content (XSD) • Unit of interest • Tagging of corpora • Content tagging, grammatical tagging
FOA Process • Step 1: Selection of corpora to build • Population from which the documents to be included are selected (domain, genre, ...) • Step 2: Selection of tagging, if necessary • Grammatical or other tagging schemes
FOA Process • Step 3: Indexing • Index: doc_i → {kw_j} • Index^-1 (inverted index): {kw_j} → doc_i • Extracting lexical features: • Step a: Selection of tokens, separators • Step b: Stemming decisions on number, gender (for some languages), hyphenation, phrases, idioms, morphological features, ... • Step c: Removal of stop words using a list
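A minimal sketch of the two mappings in Step 3, assuming whitespace tokenisation and string document ids. The function names `build_index` and `invert` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Forward index: document id -> set of keywords in that document."""
    return {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}


def invert(index: dict[str, set[str]]) -> dict[str, set[str]]:
    """Inverted index: keyword -> set of ids of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, keywords in index.items():
        for kw in keywords:
            inverted[kw].add(doc_id)
    return dict(inverted)
```

The inverted form is the one a retrieval algorithm consults at query time, since it goes directly from a query keyword to the candidate documents.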
FOA Process • Use of Zipf’s Law in indexing
FOA Process • Zipf’s Law
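The figure for this slide did not survive. As a rough illustration of what Zipf's Law predicts, the sketch below tabulates rank against frequency for a token list; under the law, the product rank × frequency stays roughly constant. The function name `rank_frequency` is an assumption made for this example.

```python
from collections import Counter


def rank_frequency(tokens: list[str]) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]
```

On a real corpus the product is only approximately constant, and it drifts at the very highest and lowest ranks.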
FOA Process • Explanations of Zipf’s Law • Zipf: Principle of Least Effort • Mandelbrot: a more general version of Zipf’s Law, and the similarity with Cantor dust (fractals)
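The slides' own formulas are not preserved here. The standard statements of the two laws, written with assumed symbols (f(r) for the frequency of the word of rank r, and constants C, b, α fitted to the corpus), are:

```latex
% Zipf: frequency falls off inversely with rank r
f(r) \approx \frac{C}{r}

% Mandelbrot: adds a rank offset b and an exponent \alpha
f(r) \approx \frac{C}{(r + b)^{\alpha}}, \qquad b \ge 0,\ \alpha > 0
```

Setting b = 0 and α = 1 in Mandelbrot's form recovers Zipf's Law.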
FOA Process • Word occurrences as Poisson process and the detection of stop words
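The slide does not spell out the method, so the following is one common reading, offered as an assumption: if a word's occurrences were scattered over the corpus as a Poisson process, the number of documents containing it could be predicted from its total count. Function words tend to match the prediction, while content words cluster in a few documents and fall well below it. The function names are hypothetical.

```python
import math


def poisson_expected_df(collection_freq: int, n_docs: int) -> float:
    """Documents expected to contain the word at least once if its
    occurrences were spread over the corpus as a Poisson process."""
    rate = collection_freq / n_docs          # mean occurrences per document
    return n_docs * (1.0 - math.exp(-rate))  # N * P(at least one occurrence)


def poisson_fit_ratio(collection_freq: int, doc_freq: int, n_docs: int) -> float:
    """Observed / expected document frequency; near 1 means Poisson-like."""
    return doc_freq / poisson_expected_df(collection_freq, n_docs)
```

A ratio close to 1 marks a stop-word candidate; where to draw the cut-off is a tuning decision that the sketch leaves open.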
FOA Process • Resolving power of words in discriminating between documents • Relationship between word frequency and word significance (for non-function words), i.e., authors use a word more often to signal its importance • To be index terms, words must help discriminate between documents
FOA Process • Precision v. Recall
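Using the standard definitions (precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved), the two measures can be sketched as below. The function name and the use of sets of document ids are assumptions for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off the slide title points to.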
FOA Process • Specificity v. Exhaustivity • An index is specific if it reflects the information needs of the users • An index is exhaustive if it reflects all topics covered by the documents • There is tension between the two
FOA Process • Word (term) frequency: the number of times that a word is used in a document • Document frequency: the number of documents in the corpus in which a word is used • Inverse document frequency: a weight that decreases as document frequency increases, so that words found in few documents count for more • Robertson - Sparck Jones weighting
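A minimal sketch of combining the two quantities into a single term weight. The form tf × log(N / df) used here is a common textbook choice and is an assumption; it is not the Robertson - Sparck Jones weighting, which the slide names but does not reproduce. The function name `tf_idf` is likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each word in each document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # word frequency within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights
```

A word that appears in every document gets weight zero, however often it is used, because it cannot help discriminate between documents.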
Vector Space Model
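The content of this slide is not preserved. In the usual formulation of the vector space model, documents and queries are vectors of term weights and are compared by the cosine of the angle between them. The sketch below assumes sparse vectors stored as word-to-weight dictionaries; the function name is hypothetical.

```python
import math


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)
```

Ranking documents by this score against the query vector gives an ordered answer list rather than a yes/no match.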