
Information Retrieval: Indexing



  1. Information Retrieval: Indexing • Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)

  2. Roadmap • What is a document? • Representing the content of documents • Luhn's analysis • Generation of document representatives • Weighting • Inverted files

  3. Indexing Language • Language used to describe documents and queries • index terms – a selected subset of words • derived from the text or arrived at independently • Keyword searching • statistical analysis of documents based on word occurrence frequency • automated, efficient, and potentially inaccurate • Searching using controlled vocabularies • more accurate results, but time consuming if documents are manually indexed

  4. Luhn's analysis • Resolving power of significant words: • the ability of words to discriminate document content • resolving power peaks at the rank-order position halfway between Luhn's upper and lower frequency cut-offs: very frequent and very rare words discriminate poorly
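A minimal sketch of Luhn-style significance filtering. The cut-off thresholds here are arbitrary illustrative values; Luhn's method establishes them empirically for a given collection:

```python
from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_cutoff=50):
    """Keep only words whose frequency falls between the two cut-offs;
    the thresholds are illustrative, not Luhn's own values."""
    freq = Counter(tokens)
    return {w for w, f in freq.items() if lower_cutoff <= f <= upper_cutoff}

words = ["the"] * 60 + ["nuclear"] * 5 + ["zeugma"]
print(significant_words(words))   # {'nuclear'}
```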

  5. Generating document representatives

  6. Generating document representatives • Input text: full text, abstract, title • Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in input text • Document indexed by a class name if one of its significant words occurs as a member of that class • Phases: • identify words - Lexical Analysis (Tokenising) • removal of high frequency words • suffix stripping (stemming) • detecting equivalent stems • thesauri • others (noun-phrase, noun group, logical formula, structure) • Index structure creation

  7. Process view: Document → Lexical Analysis → Stop word removal → Stemming → Indexing features

  8. Lexical Analysis • The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms) • handles digits, hyphens, punctuation marks, and the case of letters
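A possible tokeniser along these lines; lower-casing everything and treating hyphens as separators is one policy choice among several:

```python
import re

def tokenise(text):
    """Lexical analysis sketch: lower-case the text and extract
    alphanumeric runs, so punctuation is dropped and hyphens act
    as word separators."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("State-of-the-art B2B search!"))
# ['state', 'of', 'the', 'art', 'b2b', 'search']
```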

  9. Stopword Removal • Removal of high frequency words • a list of stop words (implements Luhn's upper cut-off) • filters out words with very low discrimination value for retrieval purposes • examples: "been", "a", "about", "otherwise" • compare input text with the stop list • reduction in text size: between 30 and 50 per cent

  10. Conflation • Conflation reduces word variants to a single form • similar words generally have similar meanings • retrieval effectiveness is increased if the query is expanded with terms similar in meaning to those it originally contained • A stemming algorithm is a conflation procedure • it reduces all words with the same root to a single root

  11. Different forms – stemming • Stemming • matching the query term "forests" to "forest" and "forested" • "choke", "choking", "choked" • Suffix removal • removal of suffixes, e.g. to "worker" • Porter algorithm: remove the longest suffix • errors occur ("equal" → "eq") and are mitigated by heuristic rules • more effective than using ordinary word forms • Detecting equivalent stems • example: ABSORB- and ABSORPT- • Stemmers remove affixes • what about prefixes? e.g. megavolt

  12. Plural stemmer • Plurals in English • if the word ends in "ies" but not "eies", "aies" • "ies" → "y" • if the word ends in "es" but not "aes", "ees", "oes" • "es" → "e" • if the word ends in "s" but not "us" or "ss" • "s" → "" • The first applicable rule is the one used
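A direct transcription of these rules; the test words in the comments are our own examples:

```python
def plural_stem(word):
    """The plural-stemmer rules above; the first applicable rule is
    applied and the rest are skipped."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"     # "ies" -> "y":  ponies -> pony
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]           # "es"  -> "e":  drop the final s
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]           # "s"   -> "":   dogs -> dog
    return word

assert plural_stem("ponies") == "pony"
assert plural_stem("dogs") == "dog"
assert plural_stem("glass") == "glass"   # "ss" exception: unchanged
```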

  13. Processing • "The destruction of the amazon rain forests" • Case normalisation • Stop word removal (from a fixed list) • "destruction amazon rain forests" • Suffix removal (stemming) • "destruct amazon rain forest"
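A sketch of the whole pipeline. The slide names no library, so this assumes NLTK's Porter stemmer is installed, and uses a tiny illustrative stop list; with those assumptions it reproduces the slide's output:

```python
import re
from nltk.stem import PorterStemmer   # assumed dependency: pip install nltk

STOP_WORDS = {"the", "of", "a", "been", "about", "otherwise"}  # illustrative
stemmer = PorterStemmer()

def document_representative(text):
    """Case normalisation and tokenising, then stop word removal,
    then suffix removal (stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(document_representative("The destruction of the amazon rain forests"))
# ['destruct', 'amazon', 'rain', 'forest']
```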

  14. Thesauri • A collection of terms along with structure or relationships between them (scope notes, etc.) • provide a standard vocabulary for indexing and searching • assist the user in locating terms for proper query formulation • provide a classification hierarchy for broadening and narrowing the current query according to user need • Equivalence: synonyms, preferred terms • Hierarchical: broader/narrower terms (BT/NT) • Association: related terms across the hierarchy (RT)
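A minimal sketch of how such relationships might be stored and used for query expansion. The terms and structure here are hypothetical, not drawn from any real thesaurus:

```python
# BT = broader term, NT = narrower term, RT = related term.
THESAURUS = {
    "dwelling": {"NT": ["house", "apartment"]},
    "house": {"BT": ["dwelling"], "NT": ["cottage"], "RT": ["household"]},
}

def broaden(term):
    """Expand a query term with its broader and related terms."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get("BT", []) + entry.get("RT", [])

print(broaden("house"))   # ['house', 'dwelling', 'household']
```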

  15. Thesauri Examples: WordNet

  16. Faceted Classification

  17. Thesauri Examples: AAT Art and Architecture Thesaurus

  18. Hierarchical Classifications • Alphanumeric coding schemes • Subject classifications • A taxonomy that represents a classification or kind-of hierarchy • Examples: Dewey Decimal, AAT, SHIC, ICONCLASS
  ICONCLASS example:
    41A32 Door
    41A322 Closing the door (action associated with a door)
    41A323 Monumental door (kind of door)
    41A324 Metalwork of a door (something attached to a door)
    41A3241 Door-knocker
    41A325 Threshold
    41A327 Door-keeper, houseguard

  19. Terminology/Controlled vocabulary • The descriptors from a thesaurus form a controlled vocabulary • normalise indexing concepts • identification of indexing concepts with clear semantics • retrieval based on concepts rather than terms • good for specific domains (e.g., medical) • problematic for general domains (large, new, dynamic)

  20. No One Classification

  21. No One Classification

  22. Generating document representatives - Outcome • Class • words with the same stem • Class name • stem • Document representative: • list of class names (index terms or keywords) • Same process applied to query

  23. Precision and Recall • Precision • Ratio of the number of relevant documents retrieved to the total number of documents retrieved. • The number of hits that are relevant • Recall • Ratio of number of relevant documents retrieved to the total number of relevant documents • The number of relevant documents that are hits

  24. Precision and Recall • Figure: the document space, with the sets of relevant and retrieved documents overlapping to illustrate the four combinations: high precision/high recall, high precision/low recall, low precision/high recall, low precision/low recall

  25. Precision and Recall • Let R be the set of relevant documents and A the answer set of retrieved documents in the information space; R ∩ A is the set of relevant documents retrieved • Recall = |R ∩ A| / |R| • Precision = |R ∩ A| / |A| • The user isn't usually given the answer set A at once: the documents in A are sorted by degree of relevance (ranking), which the user examines • Recall and precision vary as the user proceeds with the examination of the answer set A
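These definitions translate directly into code; the document identifiers below are illustrative:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall, following the formulas above."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant            # R ∩ A, the relevant hits
    return len(hits) / len(retrieved), len(hits) / len(relevant)

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
print(p, r)   # 0.5 0.666...
```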

  26. Precision and Recall Trade-Off • Figure: precision typically falls as recall rises toward 100% • Increasing the number of documents retrieved • likely retrieves more of the relevant documents, and thus increases recall • but typically retrieves more inappropriate documents, and thus decreases precision

  27. Index term weighting • Effectiveness of an indexing language: • Exhaustivity • number of different topics indexed • high exhaustivity: high recall and low precision • Specificity • ability of the indexing language to describe topics precisely • high specificity: high precision and low recall

  28. Index term weighting • Exhaustivity • related to the number of index terms assigned to a given document • Specificity • number of documents to which a term is assigned in a collection • related to the distribution of index terms in collection • Index term weighting • index term frequency: occurrence frequency of a term in document • document frequency: number of documents in which a term occurs

  29. IR as Clustering • A query is a vague specification of a set of objects A • IR is reduced to the problem of determining which documents are in the set A and which are not • Intra-cluster similarity: • which features best describe the objects in A? • Inter-cluster dissimilarity: • which features best distinguish the objects in A from the remaining objects in the collection C? • Figure: the retrieved set A inside the document collection C

  30. Index term weighting

  31. Index term weighting • Intra-clustering similarity • the raw frequency of a term t inside a document d • a measure of how well the term describes the document contents • normalised term frequency: tf(t,d) = occ(t,d) / occ(tmax,d), where occ(tmax,d) is the frequency of the most frequent term in d • Inter-cluster dissimilarity • inverse document frequency: the inverse of the frequency of term t among the documents in the collection • terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one • idf(t) = log(N / n(t)), where N is the number of documents in the collection and n(t) is the number of documents containing t • Combined: weight(t,d) = tf(t,d) × idf(t)

  32. Term weighting schemes • Best known (term frequency × inverse document frequency): weight(t,d) = (occ(t,d) / occ(tmax,d)) × log(N / n(t)) • Variation for query term weights: weight(t,q) = (0.5 + 0.5 × occ(t,q) / occ(tmax,q)) × log(N / n(t))

  33. Example • Term frequencies in a document: nuclear 7, computer 9, poverty 5, unemployment 1, luddites 3, machines 19, people 25, and 49 • Taking occ(tmax,d) = 25 ("people"; the stop word "and" is excluded), N = 100 documents, and document frequencies n(machines) = 50, n(luddites) = 2, n(poverty) = 2: • weight(machines) = 19/25 × log(100/50) = 0.76 × 0.30103 ≈ 0.229 • weight(luddites) = 3/25 × log(100/2) = 0.12 × 1.69897 ≈ 0.204 • weight(poverty) = 5/25 × log(100/2) = 0.2 × 1.69897 ≈ 0.340
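A sketch reproducing this arithmetic. N = 100 and the document frequencies n(t) are read off the slide's calculations, and the logarithm is base 10, matching the numbers above:

```python
import math

N = 100                         # collection size implied by the slide
occ = {"machines": 19, "luddites": 3, "poverty": 5}
occ_max = 25                    # 'people', the most frequent content term
n = {"machines": 50, "luddites": 2, "poverty": 2}   # document frequencies

for t in occ:
    weight = occ[t] / occ_max * math.log10(N / n[t])
    print(t, round(weight, 3))
# machines 0.229, luddites 0.204, poverty 0.34
```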

  34. Inverted Files • Word-oriented mechanism for indexing text collections to speed up searching • Searching: • vocabulary search (query terms) • retrieval of occurrences • manipulation of occurrences

  35. Original document view (term–document matrix):

        cosmonaut  astronaut  moon  car  truck
    D1      1          0        1    1     1
    D2      0          1        1    0     0
    D3      0          0        0    1     1

  36. Inverted view (the matrix transposed):

               D1  D2  D3
    cosmonaut   1   0   0
    astronaut   0   1   0
    moon        1   1   0
    car         1   0   1
    truck       1   0   1

  37. Inverted index (term → postings list):
    cosmonaut → D1
    astronaut → D2
    moon → D1, D2
    car → D1, D3
    truck → D1, D3
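A sketch of the inversion step for this three-document example:

```python
from collections import defaultdict

# The three-document example above, as term lists per document.
docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

# Invert the document-term view: map each term to the list of
# documents (its postings list) in which it occurs.
index = defaultdict(list)
for doc_id, terms in docs.items():
    for term in sorted(set(terms)):
        index[term].append(doc_id)

print(index["moon"])   # ['D1', 'D2']
```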

  38. Inverted File • The speed of retrieval is maximised by considering only those terms that have been specified in the query • This speed is achieved only at the cost of very substantial storage and processing overheads

  39. Components of an inverted file • Header information: term, frequency, pointer into the postings file • Postings file: document number, term frequency, field type

  40. Producing an inverted file • Term–document incidence (Docs 1–8) and the resulting postings lists:

    Term    1 2 3 4 5 6 7 8   Postings
    aid     0 0 0 1 0 0 0 1   4, 8
    all     0 1 0 1 0 1 0 0   2, 4, 6
    back    1 0 1 0 0 0 1 0   1, 3, 7
    brown   1 0 1 0 1 0 1 0   1, 3, 5, 7
    come    0 1 0 1 0 1 0 1   2, 4, 6, 8
    dog     0 0 1 0 1 0 0 0   3, 5
    fox     0 0 1 0 1 0 1 0   3, 5, 7
    good    0 1 0 1 0 1 0 1   2, 4, 6, 8
    jump    0 0 1 0 0 0 0 0   3
    lazy    1 0 1 0 1 0 1 0   1, 3, 5, 7
    men     0 1 0 1 0 0 0 1   2, 4, 8
    now     0 1 0 0 0 1 0 1   2, 6, 8
    over    1 0 1 0 1 0 1 1   1, 3, 5, 7, 8
    party   0 0 0 0 0 1 0 1   6, 8
    quick   1 0 1 0 0 0 0 0   1, 3
    their   1 0 0 0 1 0 1 0   1, 5, 7
    time    0 1 0 1 0 1 0 0   2, 4, 6

  (The slide also shows the letter prefixes — A/AI/AL, B/BA/BR, … — used to organise the term dictionary for lookup.)

  41. An inverted file • Term → postings:
    aid → 4, 8
    all → 2, 4, 6
    back → 1, 3, 7
    brown → 1, 3, 5, 7
    come → 2, 4, 6, 8
    dog → 3, 5
    fox → 3, 5, 7
    good → 2, 4, 6, 8
    jump → 3
    lazy → 1, 3, 5, 7
    men → 2, 4, 8
    now → 2, 6, 8
    over → 1, 3, 5, 7, 8
    party → 6, 8
    quick → 1, 3
    their → 1, 5, 7
    time → 2, 4, 6

  42. Searching Algorithm • For each document D, set Score(D) = 0 • For each query term: • search the vocabulary list • pull out the postings list • for each document J in the list: Score(J) = Score(J) + 1
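A minimal implementation of this coordination-level matching; the tiny index is illustrative:

```python
def score_query(query_terms, index):
    """Coordination-level matching as in the algorithm above: each
    document gains one point per query term that occurs in it."""
    scores = {}
    for term in query_terms:               # vocabulary search
        for doc in index.get(term, []):    # pull out the postings list
            scores[doc] = scores.get(doc, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

index = {"moon": ["D1", "D2"], "car": ["D1", "D3"]}
print(score_query(["moon", "car"], index))
# [('D1', 2), ('D2', 1), ('D3', 1)]
```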

  43. What Goes in a Postings File? • Boolean retrieval • Just the document number • Ranked Retrieval • Document number and term weight (TF*IDF, ...) • Proximity operators • Word offsets for each occurrence of the term • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
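A sketch of positional postings and a naive proximity operator built on them; this in-memory layout is hypothetical, not any particular system's format:

```python
# Word offsets per document, as in the example above.
term_a = {"doc3": [17, 36], "doc13": [3, 45]}
term_b = {"doc3": [19], "doc7": [2]}

def near(postings_a, postings_b, window):
    """True if the two terms occur within `window` words of each
    other in at least one document."""
    for doc in postings_a.keys() & postings_b.keys():
        if any(abs(i - j) <= window
               for i in postings_a[doc] for j in postings_b[doc]):
            return True
    return False

print(near(term_a, term_b, window=3))   # True: offsets 17 and 19 in doc3
```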

  44. How Big Is the Postings File? • Very compact for Boolean retrieval • About 10% of the size of the documents • If an aggressive stopword list is used • Not much larger for ranked retrieval • Perhaps 20% • Enormous for proximity operators • Sometimes larger than the documents • But access is fast - you know where to look

  45. Storage: inverted index (process view) • Documents: tokenise → stop word removal → stemming → indexing features, stored in the inverted index (Term 1 → di, dj, dk; Term 2 → dj; Term 3 → di; …) • Query: tokenise → stop word removal → stemming → query features • Matching the query features against the index yields ranked document scores: s1 > s2 > s3 > …

  46. Similarity Matching • The process by which we compute the relevance of a document to a query • A similarity measure comprises: • a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance • a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
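The slide does not mandate a particular coefficient; one common choice is the cosine measure, sketched here over {term: weight} vectors (e.g. the tf-idf weights defined earlier):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Cosine similarity coefficient over {term: weight} vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_similarity({"moon": 0.5, "car": 0.3}, {"moon": 0.4, "truck": 0.2}))
# ≈ 0.767
```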
