Inf 722 Information Organisation - PowerPoint PPT Presentation

Presentation Transcript

  1. Inf 722 Information Organisation Class notes: Information Retrieval Jagdish S. Gangolly Inf 722 Information Organisation (Fall 2007) (Gangolly)

  2. FOA Process • Asking a question (query formulation) • Constructing an answer (retrieval algorithms) • Assessing the answer (feedback on relevance)

  3. FOA Process • Query language • Natural or artificial • Vocabulary • Syntax: operators, arguments • Query expansion, specialization, disambiguation, relevance feedback

  4. FOA Process • Constructing the answer • Is the information need accurately translated into the query? • How to provide the answer in a form suitable to the user? • Provide background to the user so (s)he can verbalise the information need better? • How to represent the query as well as the corpus efficiently and effectively?

  5. FOA Process • Constructing the answer (contd.) • Generate a set of index terms which render the documents in the collection as different as possible • Conflation algorithms • Removal of function/fluff/stop words (usually closed-class words) • Stripping suffixes (stemming, lemmatization) • Detection of equivalent/associated words
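
The conflation steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the method used in the course: the stop list `STOP_WORDS`, the suffix list `SUFFIXES`, and the function names are all hypothetical, and the suffix stripping is far cruder than a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Hypothetical suffix list, tried in this order (longest first).
# This is crude suffix stripping, not a full stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Tokenize on letters, drop stop words, then strip suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]
```

For example, `index_terms("The indexing of documents")` conflates "indexing" and "documents" to "index" and "document" after dropping the two stop words.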

  6. FOA Process • Facets of documents: • Structure (DTD) • Format (CSS, XSL) • Content (XSD) • Unit of interest • Tagging of corpora • Content tagging, grammatical tagging

  7. FOA Process • Step 1: Selection of corpora to build • Population from which the documents to be included are selected (domain, genre, ...) • Step 2: Selection of tagging, if necessary • Grammatical or other tagging schemes

  8. FOA Process • Step 3: Indexing • Index: doc_i → {kw_j} • Index⁻¹: {kw_j} → doc_i • Extracting lexical features: • Step a: Selection of tokens, separators • Step b: Stemming decisions on number, gender (for some languages), hyphenation, phrases, idioms, morphological features, ... • Step c: Removal of stop words using a list
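
The two mappings on this slide (document to keywords, and keyword to documents) can be sketched as below. The function names and the whitespace tokenizer are assumptions made for illustration; a real indexer would apply the tokenizing, stemming, and stop-word steps first.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Forward index: document id -> set of keywords (naive whitespace tokens)."""
    return {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}


def invert(index: dict[str, set[str]]) -> dict[str, set[str]]:
    """Inverted index: keyword -> set of ids of documents containing it."""
    inverted = defaultdict(set)
    for doc_id, keywords in index.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return dict(inverted)
```

The inverted form is the one used at query time, since a query term leads directly to the documents that contain it.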

  9. FOA Process • Use of Zipf’s Law in indexing

  10. FOA Process • Zipf’s Law
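
Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays roughly constant across a corpus. A minimal sketch for inspecting that product on a token list follows; the function name `rank_frequency` is hypothetical, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter


def rank_frequency(tokens: list[str]) -> list[tuple[int, str, int, int]]:
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]
```

On a real corpus, the last column would be expected to vary much less than the frequency column itself; on very small samples the pattern is unreliable.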

  11. FOA Process • Explanations of Zipf’s Law • Zipf: Principle of Least Effort • Mandelbrot: a more general version of Zipf’s law, and its similarity to Cantor dust (fractals)
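
For reference, the two laws named on this slide are usually written as follows. The slide's own formulas are not in the transcript, so the symbols here (frequency f, rank r, constants C, b, and α) are the conventional ones rather than the author's.

```latex
% Zipf's law: the frequency f of the word with frequency rank r,
% where C is a corpus-dependent constant.
f(r) \approx \frac{C}{r}

% Zipf-Mandelbrot generalization: b shifts the rank and \alpha sets the slope.
% Setting b = 0 and \alpha = 1 recovers Zipf's law.
f(r) = \frac{C}{(r + b)^{\alpha}}, \qquad b \ge 0,\ \alpha > 0
```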

  12. FOA Process • Word occurrences as Poisson process and the detection of stop words
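
The slide's details are not in the transcript, so the following is only one plausible reading of the idea. If a word's occurrences were scattered over the corpus at random (a Poisson model), a word occurring cf times in N documents would be expected to appear in about N(1 − e^(−cf/N)) of them. Function words tend to sit close to that prediction, while content words tend to clump into fewer documents. The function names below are hypothetical.

```python
import math


def expected_doc_freq(collection_freq: int, n_docs: int) -> float:
    """Documents expected to contain the word if its occurrences were Poisson."""
    return n_docs * (1.0 - math.exp(-collection_freq / n_docs))


def poisson_ratio(collection_freq: int, doc_freq: int, n_docs: int) -> float:
    """Observed / expected document frequency.

    Values near 1 mean the word behaves like a randomly scattered
    (function-like) word; values well below 1 mean its occurrences
    are concentrated in few documents.
    """
    return doc_freq / expected_doc_freq(collection_freq, n_docs)
```

Any threshold on this ratio for declaring a word a stop word would have to be chosen empirically for the corpus at hand.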

  13. FOA Process • Resolving power of words in discriminating between documents • Relationship between word frequency and word significance (for non-function words), i.e., how often a word is used signals its importance • To be index terms, words must help discriminate between documents

  14. FOA Process • Precision v. Recall
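
Using the standard definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, with a hypothetical function name and documents represented by their ids:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    Both values are defined as 0.0 when their denominator is empty,
    which is a convention chosen here for convenience.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving more documents can only keep recall the same or raise it, but usually lowers precision, which is the trade-off the slide title refers to.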

  15. FOA Process • Specificity v. Exhaustivity • An index is specific if it reflects the information needs of the users • An index is exhaustive if it reflects all topics covered by the documents • There is tension between the two

  16. FOA Process • Word frequency: the number of times that a word is used in a document • Document frequency: the number of documents in the corpus in which a word is used • Inverse document frequency: a weight that falls as document frequency rises, commonly log(N/df) for a corpus of N documents • Robertson - Sparck-Jones weighting
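
The two counts on this slide are commonly combined into a tf × idf weight, where idf is taken as log(N/df) for a corpus of N documents. The sketch below shows that plain form only; it is not the Robertson-Sparck Jones relevance weighting, which additionally uses relevance judgments. The function name and the pre-tokenized input format are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Weight each word in each document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        weights[doc_id] = {
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
    return weights
```

A word that appears in every document gets weight 0 under this form, which matches the idea that such a word cannot discriminate between documents.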

  17. Vector Space Model
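
The slide's own figure is not in the transcript. In the usual formulation of the vector space model, the query and each document are vectors of term weights, and documents are ranked by the cosine of the angle between their vector and the query vector. A minimal sketch under that assumption, with hypothetical function names and sparse vectors stored as dictionaries:

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


def rank(query: dict[str, float], docs: dict[str, dict[str, float]]) -> list[str]:
    """Return document ids ordered from most to least similar to the query."""
    return sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
```

Because the cosine normalizes by vector length, a long document does not outrank a short one merely by containing more words.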