
Information Retrieval: Indexing



  1. Information Retrieval: Indexing • Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)

  2. Roadmap • What is a document? • Representing the content of documents • Luhn's analysis • Generation of document representatives • Weighting • Inverted files

  3. Indexing Language • Language used to describe documents and queries • index terms – a selected subset of words • derived from the text or arrived at independently • Keyword searching • statistical analysis of documents based on word occurrence frequency • automated, efficient, and potentially inaccurate • Searching using controlled vocabularies • more accurate results, but time consuming if documents are manually indexed

  4. Luhn's analysis • Resolving power of significant words: • the ability of words to discriminate document content • resolving power peaks at the rank-order position halfway between Luhn's upper and lower frequency cut-offs: very frequent and very rare words discriminate poorly
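A minimal sketch of Luhn-style significance filtering. The cut-off thresholds here are arbitrary illustrative values; Luhn's method establishes them empirically for a given collection:

```python
from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=50):
    """Keep only words whose frequency falls between the two cut-offs;
    the thresholds are illustrative, not Luhn's own values."""
    freq = Counter(tokens)
    return {w for w, f in freq.items() if lower_cutoff <= f <= upper_cutoff}

words = ["the"] * 60 + ["nuclear"] * 5 + ["zeugma"]
print(significant_words(words))   # {'nuclear'}
```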

  5. Generating document representatives

  6. Generating document representatives • Input text: full text, abstract, title • Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in input text • Document indexed by a class name if one of its significant words occurs as a member of that class • Phases: • identify words - Lexical Analysis (Tokenising) • removal of high frequency words • suffix stripping (stemming) • detecting equivalent stems • thesauri • others (noun-phrase, noun group, logical formula, structure) • Index structure creation

  7. Process view: Document → Lexical Analysis → Stop word removal → Stemming → Indexing features

  8. Lexical Analysis • The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms) • handles digits, hyphens, punctuation marks, and the case of letters
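A possible tokeniser along these lines; lower-casing everything and treating hyphens as separators is one policy choice among several:

```python
import re

def tokenise(text):
    """Lexical analysis sketch: lower-case the text and extract
    alphanumeric runs, so punctuation is dropped and hyphens act
    as word separators."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("State-of-the-art B2B search!"))
# ['state', 'of', 'the', 'art', 'b2b', 'search']
```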

  9. Stopword Removal • Removal of high frequency words • a list of stop words (implements Luhn's upper cut-off) • filters out words with very low discrimination value for retrieval purposes • examples: "been", "a", "about", "otherwise" • compare input text with the stop list • reduction in text size: between 30 and 50 per cent

  10. Conflation • Conflation reduces word variants to a single form • similar words generally have similar meanings • retrieval effectiveness is increased if the query is expanded with terms similar in meaning to those it originally contained • A stemming algorithm is a conflation procedure • it reduces all words with the same root to a single root

  11. Different forms – stemming • Stemming • matching the query term "forests" to "forest" and "forested" • "choke", "choking", "choked" • Suffix removal • removal of suffixes, e.g. to "worker" • Porter algorithm: remove the longest suffix • errors occur ("equal" → "eq") and are mitigated by heuristic rules • more effective than using ordinary word forms • Detecting equivalent stems • example: ABSORB- and ABSORPT- • Stemmers remove affixes • what about prefixes? e.g. megavolt

  12. Plural stemmer • Plurals in English • if the word ends in "ies" but not "eies", "aies" • "ies" → "y" • if the word ends in "es" but not "aes", "ees", "oes" • "es" → "e" • if the word ends in "s" but not "us" or "ss" • "s" → "" • The first applicable rule is the one used
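A direct transcription of these rules; the test words in the comments are our own examples:

```python
def plural_stem(word):
    """The plural-stemmer rules above; the first applicable rule is
    applied and the rest are skipped."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"     # "ies" -> "y":  ponies -> pony
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]           # "es"  -> "e":  drop the final s
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]           # "s"   -> "":   dogs -> dog
    return word

assert plural_stem("ponies") == "pony"
assert plural_stem("dogs") == "dog"
assert plural_stem("glass") == "glass"   # "ss" exception: unchanged
```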

  13. Processing • "The destruction of the amazon rain forests" • Case normalisation • Stop word removal (from a fixed list) • "destruction amazon rain forests" • Suffix removal (stemming) • "destruct amazon rain forest"
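A sketch of the whole pipeline. The slide names no library, so this assumes NLTK's Porter stemmer is installed, and uses a tiny illustrative stop list; with those assumptions it reproduces the slide's output:

```python
import re
from nltk.stem import PorterStemmer   # assumed dependency: pip install nltk

STOP_WORDS = {"the", "of", "a", "been", "about", "otherwise"}  # illustrative
stemmer = PorterStemmer()

def document_representative(text):
    """Case normalisation and tokenising, then stop word removal,
    then suffix removal (stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(document_representative("The destruction of the amazon rain forests"))
# ['destruct', 'amazon', 'rain', 'forest']
```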

  14. Thesauri • A collection of terms along with structure or relationships between them (scope notes, etc.) • provide a standard vocabulary for indexing and searching • assist the user in locating terms for proper query formulation • provide a classification hierarchy for broadening and narrowing the current query according to user need • Equivalence: synonyms, preferred terms • Hierarchical: broader/narrower terms (BT/NT) • Association: related terms across the hierarchy (RT)
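A minimal sketch of how such relationships might be stored and used for query expansion. The terms and structure here are hypothetical, not drawn from any real thesaurus:

```python
# BT = broader term, NT = narrower term, RT = related term.
THESAURUS = {
    "dwelling": {"NT": ["house", "apartment"]},
    "house": {"BT": ["dwelling"], "NT": ["cottage"], "RT": ["household"]},
}

def broaden(term):
    """Expand a query term with its broader and related terms."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get("BT", []) + entry.get("RT", [])

print(broaden("house"))   # ['house', 'dwelling', 'household']
```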

  15. Thesauri Examples: WordNet

  16. Faceted Classification

  17. Thesauri Examples: AAT Art and Architecture Thesaurus

  18. Hierarchical Classifications • Alphanumeric coding schemes • Subject classifications • A taxonomy that represents a classification or kind-of hierarchy • Examples: Dewey Decimal, AAT, SHIC, ICONCLASS
  ICONCLASS example:
    41A32 Door
    41A322 Closing the door (action associated with a door)
    41A323 Monumental door (kind of door)
    41A324 Metalwork of a door (something attached to a door)
    41A3241 Door-knocker
    41A325 Threshold
    41A327 Door-keeper, houseguard

  19. Terminology/Controlled vocabulary • The descriptors from a thesaurus form a controlled vocabulary • normalise indexing concepts • identification of indexing concepts with clear semantics • retrieval based on concepts rather than terms • good for specific domains (e.g., medical) • problematic for general domains (large, new, dynamic)

  20. No One Classification

  21. No One Classification

  22. Generating document representatives - Outcome • Class • words with the same stem • Class name • stem • Document representative: • list of class names (index terms or keywords) • Same process applied to query

  23. Precision and Recall • Precision • Ratio of the number of relevant documents retrieved to the total number of documents retrieved. • The number of hits that are relevant • Recall • Ratio of number of relevant documents retrieved to the total number of relevant documents • The number of relevant documents that are hits

  24. Precision and Recall • Figure: the document space, with the sets of relevant and retrieved documents overlapping to illustrate the four combinations: high precision/high recall, high precision/low recall, low precision/high recall, low precision/low recall

  25. Precision and Recall • Let R be the set of relevant documents and A the answer set of retrieved documents in the information space; R ∩ A is the set of relevant documents retrieved • Recall = |R ∩ A| / |R| • Precision = |R ∩ A| / |A| • The user isn't usually given the answer set A at once: the documents in A are sorted by degree of relevance (ranking), which the user examines • Recall and precision vary as the user proceeds with the examination of the answer set A
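These definitions translate directly into code; the document identifiers below are illustrative:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall, following the formulas above."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant            # R ∩ A, the relevant hits
    return len(hits) / len(retrieved), len(hits) / len(relevant)

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)   # 0.5 0.666...
```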

  26. Precision and Recall Trade-Off • Figure: precision typically falls as recall rises toward 100% • Increasing the number of documents retrieved • likely retrieves more of the relevant documents, and thus increases recall • but typically retrieves more inappropriate documents, and thus decreases precision

  27. Index term weighting • Effectiveness of an indexing language: • Exhaustivity • number of different topics indexed • high exhaustivity: high recall and low precision • Specificity • ability of the indexing language to describe topics precisely • high specificity: high precision and low recall

  28. Index term weighting • Exhaustivity • related to the number of index terms assigned to a given document • Specificity • number of documents to which a term is assigned in a collection • related to the distribution of index terms in collection • Index term weighting • index term frequency: occurrence frequency of a term in document • document frequency: number of documents in which a term occurs

  29. IR as Clustering • A query is a vague specification of a set of objects A • IR is reduced to the problem of determining which documents are in the set A and which are not • Intra-cluster similarity: • which features best describe the objects in A? • Inter-cluster dissimilarity: • which features best distinguish the objects in A from the remaining objects in the collection C? • Figure: the retrieved set A inside the document collection C

  30. Index term weighting

  31. Index term weighting • Intra-clustering similarity • the raw frequency of a term t inside a document d • a measure of how well the term describes the document contents • normalised term frequency: tf(t,d) = occ(t,d) / occ(tmax,d), where occ(tmax,d) is the frequency of the most frequent term in d • Inter-cluster dissimilarity • inverse document frequency: the inverse of the frequency of term t among the documents in the collection • terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one • idf(t) = log(N / n(t)), where N is the number of documents in the collection and n(t) is the number of documents containing t • Combined: weight(t,d) = tf(t,d) × idf(t)

  32. Term weighting schemes • Best known (term frequency × inverse document frequency): weight(t,d) = (occ(t,d) / occ(tmax,d)) × log(N / n(t)) • Variation for query term weights: weight(t,q) = (0.5 + 0.5 × occ(t,q) / occ(tmax,q)) × log(N / n(t))

  33. Example • Term frequencies in a document: nuclear 7, computer 9, poverty 5, unemployment 1, luddites 3, machines 19, people 25, and 49 • Taking occ(tmax,d) = 25 ("people"; the stop word "and" is excluded), N = 100 documents, and document frequencies n(machines) = 50, n(luddites) = 2, n(poverty) = 2: • weight(machines) = 19/25 × log(100/50) = 0.76 × 0.30103 ≈ 0.229 • weight(luddites) = 3/25 × log(100/2) = 0.12 × 1.69897 ≈ 0.204 • weight(poverty) = 5/25 × log(100/2) = 0.2 × 1.69897 ≈ 0.340
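A sketch reproducing this arithmetic. N = 100 and the document frequencies n(t) are read off the slide's calculations, and the logarithm is base 10, matching the numbers above:

```python
import math

N = 100                         # collection size implied by the slide
occ = {"machines": 19, "luddites": 3, "poverty": 5}
occ_max = 25                    # 'people', the most frequent content term
n = {"machines": 50, "luddites": 2, "poverty": 2}   # document frequencies

for t in occ:
    weight = occ[t] / occ_max * math.log10(N / n[t])
    print(t, round(weight, 3))
# machines 0.229, luddites 0.204, poverty 0.34
```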

  34. Inverted Files • Word-oriented mechanism for indexing text collections to speed up searching • Searching: • vocabulary search (query terms) • retrieval of occurrences • manipulation of occurrences

  35. Original document view (term–document matrix):

        cosmonaut  astronaut  moon  car  truck
    D1      1          0        1    1     1
    D2      0          1        1    0     0
    D3      0          0        0    1     1

  36. Inverted view (the matrix transposed):

               D1  D2  D3
    cosmonaut   1   0   0
    astronaut   0   1   0
    moon        1   1   0
    car         1   0   1
    truck       1   0   1

  37. Inverted index (term → postings list):
    cosmonaut → D1
    astronaut → D2
    moon → D1, D2
    car → D1, D3
    truck → D1, D3
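A sketch of the inversion step for this three-document example:

```python
from collections import defaultdict

# The three-document example above, as term lists per document.
docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

# Invert the document-term view: map each term to the list of
# documents (its postings list) in which it occurs.
index = defaultdict(list)
for doc_id, terms in docs.items():
    for term in sorted(set(terms)):
        index[term].append(doc_id)

print(index["moon"])   # ['D1', 'D2']
```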

  38. Inverted File • The speed of retrieval is maximised by considering only those terms that have been specified in the query • This speed is achieved only at the cost of very substantial storage and processing overheads

  39. Components of an inverted file • Header information: term, frequency, pointer into the postings file • Postings file: document number, term frequency, field type

  40. Producing an inverted file • Term–document incidence (Docs 1–8) and the resulting postings lists:

    Term    1 2 3 4 5 6 7 8   Postings
    aid     0 0 0 1 0 0 0 1   4, 8
    all     0 1 0 1 0 1 0 0   2, 4, 6
    back    1 0 1 0 0 0 1 0   1, 3, 7
    brown   1 0 1 0 1 0 1 0   1, 3, 5, 7
    come    0 1 0 1 0 1 0 1   2, 4, 6, 8
    dog     0 0 1 0 1 0 0 0   3, 5
    fox     0 0 1 0 1 0 1 0   3, 5, 7
    good    0 1 0 1 0 1 0 1   2, 4, 6, 8
    jump    0 0 1 0 0 0 0 0   3
    lazy    1 0 1 0 1 0 1 0   1, 3, 5, 7
    men     0 1 0 1 0 0 0 1   2, 4, 8
    now     0 1 0 0 0 1 0 1   2, 6, 8
    over    1 0 1 0 1 0 1 1   1, 3, 5, 7, 8
    party   0 0 0 0 0 1 0 1   6, 8
    quick   1 0 1 0 0 0 0 0   1, 3
    their   1 0 0 0 1 0 1 0   1, 5, 7
    time    0 1 0 1 0 1 0 0   2, 4, 6

  (The slide also shows the letter prefixes — A/AI/AL, B/BA/BR, … — used to organise the term dictionary for lookup.)

  41. An inverted file • Term → postings:
    aid → 4, 8
    all → 2, 4, 6
    back → 1, 3, 7
    brown → 1, 3, 5, 7
    come → 2, 4, 6, 8
    dog → 3, 5
    fox → 3, 5, 7
    good → 2, 4, 6, 8
    jump → 3
    lazy → 1, 3, 5, 7
    men → 2, 4, 8
    now → 2, 6, 8
    over → 1, 3, 5, 7, 8
    party → 6, 8
    quick → 1, 3
    their → 1, 5, 7
    time → 2, 4, 6

  42. Searching Algorithm • For each document D, set Score(D) = 0 • For each query term: • search the vocabulary list • pull out the postings list • for each document J in the list: Score(J) = Score(J) + 1
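A minimal implementation of this coordination-level matching; the tiny index is illustrative:

```python
def score_query(query_terms, index):
    """Coordination-level matching as in the algorithm above: each
    document gains one point per query term that occurs in it."""
    scores = {}
    for term in query_terms:               # vocabulary search
        for doc in index.get(term, []):    # pull out the postings list
            scores[doc] = scores.get(doc, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

index = {"moon": ["D1", "D2"], "car": ["D1", "D3"]}
print(score_query(["moon", "car"], index))
# [('D1', 2), ('D2', 1), ('D3', 1)]
```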

  43. What Goes in a Postings File? • Boolean retrieval • Just the document number • Ranked Retrieval • Document number and term weight (TF*IDF, ...) • Proximity operators • Word offsets for each occurrence of the term • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
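A sketch of positional postings and a naive proximity operator built on them; this in-memory layout is hypothetical, not any particular system's format:

```python
# Word offsets per document, as in the example above.
term_a = {"doc3": [17, 36], "doc13": [3, 45]}
term_b = {"doc3": [19], "doc7": [2]}

def near(postings_a, postings_b, window):
    """True if the two terms occur within `window` words of each
    other in at least one document."""
    for doc in postings_a.keys() & postings_b.keys():
        if any(abs(i - j) <= window
               for i in postings_a[doc] for j in postings_b[doc]):
            return True
    return False

print(near(term_a, term_b, window=3))   # True: offsets 17 and 19 in doc3
```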

  44. How Big Is the Postings File? • Very compact for Boolean retrieval • About 10% of the size of the documents • If an aggressive stopword list is used • Not much larger for ranked retrieval • Perhaps 20% • Enormous for proximity operators • Sometimes larger than the documents • But access is fast - you know where to look

  45. Storage: inverted index (process view) • Documents: tokenise → stop word removal → stemming → indexing features, stored in the inverted index (Term 1 → di, dj, dk; Term 2 → dj; Term 3 → di; …) • Query: tokenise → stop word removal → stemming → query features • Matching the query features against the index yields ranked document scores: s1 > s2 > s3 > …

  46. Similarity Matching • The process by which we compute the relevance of a document to a query • A similarity measure comprises: • a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance • a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
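The slide does not mandate a particular coefficient; one common choice is the cosine measure, sketched here over {term: weight} vectors (e.g. the tf-idf weights defined earlier):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine similarity coefficient over {term: weight} vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_similarity({"moon": 0.5, "car": 0.3}, {"moon": 0.4, "truck": 0.2}))
# ≈ 0.767
```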
