
Indexing




  1. Indexing

  2. Efficient Retrieval

  Documents x terms matrix:

          t1    t2   ...  tj   ...  tm    nf
    d1    w11   w12  ...  w1j  ...  w1m   1/|d1|
    d2    w21   w22  ...  w2j  ...  w2m   1/|d2|
    ...
    di    wi1   wi2  ...  wij  ...  wim   1/|di|
    ...
    dn    wn1   wn2  ...  wnj  ...  wnm   1/|dn|

  wij is the weight of term tj in document di; most of the wij's are zero.
  |di| is the Euclidean length of row i (the square root of the sum of the squared wij's), so the nf column stores the pre-computed normalization factor 1/|di|.

  3. Naïve Retrieval

  Given the query q = (q1, q2, ..., qj, ..., qm), with nf = 1/|q|, how do we evaluate q (i.e., compute the similarity between q and each document)?

  Method 1: compare q with each document.
  Data structure for a document (only terms with positive weights are stored, in alphabetical order):
    di : ((t1, wi1), (t2, wi2), ..., (tj, wij), ..., (tm, wim), 1/|di|)
  Data structure for the query:
    q : ((t1, q1), (t2, q2), ..., (tj, qj), ..., (tm, qm), 1/|q|)
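
  A minimal sketch of these data structures in Python (the helper name make_vector and the dict-based input are illustrative assumptions, not from the slides): each vector is kept as alphabetically sorted (term, weight) pairs together with its pre-computed normalization factor.

```python
import math

def make_vector(weights):
    """Build the ((t, w), ..., 1/|v|) representation described on the slide.

    weights: dict mapping term -> weight, e.g. {"t1": 2, "t3": 1}.
    Only terms with positive weights are kept, in alphabetical order.
    """
    pairs = tuple(sorted((t, w) for t, w in weights.items() if w > 0))
    norm = math.sqrt(sum(w * w for _, w in pairs))  # Euclidean length |v|
    return pairs, 1.0 / norm

# Example: the query q = {t1: 1, t3: 1} used later in the slides
q_pairs, q_nf = make_vector({"t1": 1, "t3": 1})     # q_nf ~= 0.7071
```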

  4. Naïve Retrieval

  Method 1: compare q with all the documents (cont.)
  Algorithm:
    initialize all sim(q, di) = 0;
    for each document di (i = 1, ..., n) {
      for each term tj (j = 1, ..., m)
        if tj appears in both q and di
          sim(q, di) += qj * wij;
      sim(q, di) = sim(q, di) * (1/|q|) * (1/|di|);
    }
    rank the documents in descending order of similarity and show the top k to the user.
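
  A direct transcription of Method 1 into Python, assuming the documents and the query are stored as built by make_vector above (a sketch, not the original course code):

```python
def naive_retrieval(query, docs, k):
    """Method 1: compare the query against every document.

    query: (pairs, nf) as returned by make_vector.
    docs:  list of (pairs, nf), one entry per document.
    Returns the top-k (similarity, document index) pairs.
    """
    q_pairs, q_nf = query
    q_weights = dict(q_pairs)
    sims = []
    for i, (d_pairs, d_nf) in enumerate(docs):
        s = 0.0
        for term, w in d_pairs:            # every stored term of d_i
            if term in q_weights:          # t_j appears in both q and d_i
                s += q_weights[term] * w   # sim += q_j * w_ij
        sims.append((s * q_nf * d_nf, i))  # multiply by 1/|q| and 1/|d_i|
    return sorted(sims, reverse=True)[:k]  # descending order, top k
```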

  5. Indexing

  Method 1 is not efficient: all the zero entries of the documents x terms matrix are accessed as well.

  Method 2: use a file of inverted indexes. Several data structures are needed:
    For each term tj, an inverted list is created with all the documents that contain tj:
      I(tj) = { (d1, w1j), (d2, w2j), ..., (di, wij), ..., (dn, wnj) }
    where di is the identifier of the i-th document and only non-zero entries are stored.

  6. Indexing

  Method 2: use of an inverted index file (cont.). Further data structures:
    The normalization factors of the documents are pre-computed and stored in a vector: nf[i] stores 1/|di|.
    A hash table is created for all the terms in the collection: the entry for tj points to its inverted list I(tj).
  Inverted lists are typically stored on disk, since the number of distinct terms is usually very large.
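
  These structures map naturally onto a Python dict for the term hash table and a list for the normalization factors (a sketch; build_inverted_index is an illustrative name):

```python
def build_inverted_index(docs):
    """docs: list of (pairs, nf) document vectors, as built by make_vector.

    Returns (index, nf) where index[t] is the inverted list I(t) as a list of
    (doc_id, weight) pairs, and nf[i] stores 1/|d_i|.
    """
    index = {}                     # hash table: term -> inverted list I(t)
    nf = []                        # nf[i] = 1/|d_i|
    for i, (pairs, d_nf) in enumerate(docs):
        nf.append(d_nf)
        for term, w in pairs:      # only non-zero entries are stored
            index.setdefault(term, []).append((i, w))
    return index, nf
```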

  7. Inverted file creation [figure: the "Dictionary" and "Pointers" files]

  8. Retrieval with inverted lists

  Algorithm:
    initialize all sim(q, di) = 0;
    for each term tj in q {
      find I(tj) using the hash table;
      for each (di, wij) in I(tj)
        sim(q, di) += qj * wij;
    }
    for each document di
      sim(q, di) = sim(q, di) * (1/|q|) * nf[i];
    rank the documents in descending order and show the top k to the user.
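
  The same algorithm expressed in Python over the index and nf vector built above (again a sketch under those assumptions):

```python
def inverted_retrieval(query, index, nf, k):
    """Method 2: accumulate similarities term by term over the inverted lists."""
    q_pairs, q_nf = query
    sims = {}                                  # doc_id -> accumulated sim(q, d)
    for term, q_w in q_pairs:                  # only the terms of the query
        for doc_id, w in index.get(term, []):  # walk the inverted list I(term)
            sims[doc_id] = sims.get(doc_id, 0.0) + q_w * w
    ranked = [(s * q_nf * nf[doc_id], doc_id) for doc_id, s in sims.items()]
    return sorted(ranked, reverse=True)[:k]    # descending order, top k
```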

  9. Retrieval with inverted lists

  Observations about Method 2:
    If a document d contains no term of the query q, then d is never touched during the evaluation of q;
    Only the non-zero entries of the documents x terms matrix are used for query evaluation;
    The similarities of several documents are accumulated simultaneously, one query term at a time.

  10. Retrieval with inverted lists: example (Method 2)

  q  = { (t1, 1), (t3, 1) },                    1/|q| = 0.7071
  d1 = { (t1, 2), (t2, 1), (t3, 1) },           nf[1] = 0.4082
  d2 = { (t2, 2), (t3, 1), (t4, 1) },           nf[2] = 0.4082
  d3 = { (t1, 1), (t3, 1), (t4, 1) },           nf[3] = 0.5774
  d4 = { (t1, 2), (t2, 1), (t3, 2), (t4, 2) },  nf[4] = 0.2774
  d5 = { (t2, 2), (t4, 1), (t5, 2) },           nf[5] = 0.3333

  I(t1) = { (d1, 2), (d3, 1), (d4, 2) }
  I(t2) = { (d1, 1), (d2, 2), (d4, 1), (d5, 2) }
  I(t3) = { (d1, 1), (d2, 1), (d3, 1), (d4, 2) }
  I(t4) = { (d2, 1), (d3, 1), (d4, 1), (d5, 1) }
  I(t5) = { (d5, 2) }

  11. Retrieval with inverted lists: example (cont.)

  After processing t1: sim(q, d1) = 2, sim(q, d2) = 0, sim(q, d3) = 1, sim(q, d4) = 2, sim(q, d5) = 0
  After processing t3: sim(q, d1) = 3, sim(q, d2) = 1, sim(q, d3) = 2, sim(q, d4) = 4, sim(q, d5) = 0
  After normalization: sim(q, d1) = .87, sim(q, d2) = .29, sim(q, d3) = .82, sim(q, d4) = .78, sim(q, d5) = 0
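
  Running the sketches above on this example reproduces the numbers on the slide (document indices are 0-based, so index 0 is d1):

```python
docs = [make_vector(d) for d in (
    {"t1": 2, "t2": 1, "t3": 1},             # d1
    {"t2": 2, "t3": 1, "t4": 1},             # d2
    {"t1": 1, "t3": 1, "t4": 1},             # d3
    {"t1": 2, "t2": 1, "t3": 2, "t4": 2},    # d4
    {"t2": 2, "t4": 1, "t5": 2},             # d5
)]
query = make_vector({"t1": 1, "t3": 1})
index, nf = build_inverted_index(docs)
print(inverted_retrieval(query, index, nf, k=5))
# ~ [(0.87, 0), (0.82, 2), (0.78, 3), (0.29, 1)]
# i.e. d1 = .87, d3 = .82, d4 = .78, d2 = .29; d5 never appears because it
# shares no term with the query.
```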

  12. Efficiency vs. flexibility

  Storing pre-computed weights is good for efficiency but bad for flexibility: the weights must be recomputed whenever the tf or idf formula changes.
  Flexibility can be improved by storing the raw tf and df information instead, but efficiency is then worse.
  A compromise exists: store the tf weights of the documents, and apply the idf weights to the tf weights of the query terms instead of to those of the document terms.
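
  A sketch of that compromise (assuming the common idf = log(N/df) formula, which the slides do not specify): the inverted lists keep only raw document tf values, and idf is multiplied into the query weights at query time, so changing the idf formula never forces the lists to be rebuilt.

```python
import math

def idf_weighted_query(query_tf, df, n_docs):
    """Fold idf into the query-term weights; the stored document tf stays untouched.

    query_tf: dict term -> tf of the term in the query.
    df:       dict term -> number of documents containing the term.
    """
    return {t: tf * math.log(n_docs / df[t])
            for t, tf in query_tf.items() if t in df}
```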

  13. Inverted indexes

  The inverted index is the main structure used for indexing.
  Main idea: invert the documents into one big index file.
  Basic steps:
    create a "dictionary" with all the tokens in the collection;
    for each token, list all the documents in which the token occurs;
    post-process the structure to remove redundancy.

  14. Inverted indexes

  An inverted file is composed of vectors in which each row corresponds to a term and its list of documents; it is essentially the transpose of the documents x terms matrix.

  15. Inverted index file creation

  Documents are analysed to extract tokens; each token is saved together with its doc-ID.

  Doc 1: Now is the time for all good men to come to the aid of their country
  Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  16. Inverted index file creation

  After all the documents have been analysed, the index file is sorted alphabetically by token.

  17. Inverted index file creation

  Multiple entries of the same term for a single document can be merged, and term-frequency information is compiled at the same time.

  18. Inverted index file creation

  The file can then be split into a "Dictionary" file and a "Pointers" file.
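
  A compact sketch of the whole creation pipeline (extract tokens with their doc-IDs, merge duplicates with their frequencies, then split the result into dictionary and pointers structures); the tokenizer here is a deliberately naive assumption:

```python
from collections import Counter

def build_inverted_file(docs):
    """docs: dict doc_id -> text.

    Returns (dictionary, pointers) where dictionary[token] is the number of
    documents containing the token and pointers[token] is the sorted list of
    (doc_id, term frequency) entries.
    """
    pointers = {}
    for doc_id, text in docs.items():
        tokens = text.lower().replace(".", " ").split()   # naive tokenization
        for token, tf in Counter(tokens).items():          # merge duplicates
            pointers.setdefault(token, []).append((doc_id, tf))
    dictionary = {t: len(p) for t, p in pointers.items()}  # document frequency
    return dictionary, {t: sorted(p) for t, p in pointers.items()}

dictionary, pointers = build_inverted_file({
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
})
# pointers["country"] == [(1, 1), (2, 1)];  pointers["manor"] == [(2, 1)]
```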

  19. Inverted index files

  They allow faster access to individual terms. For each term we obtain a list with:
    the identifier of the document (doc-ID);
    the term's frequency in the document;
    the positions of the term in the document.
  These lists can be used to answer Boolean queries:
    country -> d1, d2
    manor -> d2
    country AND manor -> d2
  They can also be employed by ranking algorithms.
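
  With the pointers built above, a Boolean AND reduces to intersecting the doc-ID sets of the query terms (boolean_and is an illustrative helper name):

```python
def boolean_and(pointers, *terms):
    """Return the doc-IDs that contain every term (empty set for unknown terms)."""
    sets = [{doc_id for doc_id, _ in pointers.get(t, [])} for t in terms]
    return set.intersection(*sets) if sets else set()

boolean_and(pointers, "country", "manor")   # {2}
boolean_and(pointers, "time", "dark")       # {2}, the query on the next slide
```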

  20. Use of inverted files [figure: the "Dictionary" and "Pointers" files for the two example documents]

  Query: "time" AND "dark"
    2 docs with "time" in the dictionary -> IDs 1 and 2 in the pointers file;
    1 doc with "dark" in the dictionary -> ID 2 in the pointers file.
  So only doc 2 satisfies the query.
