Algorithms for Information Retrieval

Is algorithmic design a 5-minute thinking task?

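Toy problem #1: Max Subarray
• Goal: Find the time window achieving the best “market performance”.
• Math problem: Find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7
P = 2 -3 3 4 2 6 9 -4 5 -1 6
M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4

Algorithm
• Compute the array P[1,n] of prefix sums over A.
• Compute the array M[1,n] of minima over P, i.e. M[i] = min_{x ≤ i} P[x].
• Find end such that P[end] - M[end] is maximum.
• start is the position x ≤ end such that P[start] is minimum; the window is A[start+1, end].

Note:
• max_{x ≤ y} sum A[x+1, y]
• = max_{x ≤ y} (P[y] - P[x])
• = max_y [ P[y] - (min_{x ≤ y} P[x]) ]

In the example, end = 7 and start = 2: P[7] - M[7] = 9 - (-3) = 12, which is the sum of A[3,7] = 6 1 -2 4 3. Setting P[0] = 0 also covers windows that start at position 1.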
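The prefix-sum idea of this slide can be sketched in Python as follows. This is a minimal sketch, not the author's code: the function name max_subarray_prefix is hypothetical, the input is assumed non-empty, and the sketch assumes P[0] = 0 and takes the minimum over the prefixes strictly before the current position, so that windows starting at position 1 are covered and the window is never empty.

```python
def max_subarray_prefix(a):
    """Return (best_sum, start, end) for the 1-based inclusive window A[start..end].

    Assumes a is non-empty. Keeps only the running prefix sum and the minimum
    prefix seen so far, instead of materializing the arrays P and M.
    """
    best_sum, best_start, best_end = a[0], 1, 1
    prefix = 0       # P[y], with P[0] = 0 (assumption of this sketch)
    min_prefix = 0   # min of P[0..y-1]
    min_pos = 0      # position x where that minimum is attained
    for y, value in enumerate(a, start=1):
        prefix += value
        # Best window ending at y starts right after the minimum prefix.
        if prefix - min_prefix > best_sum:
            best_sum, best_start, best_end = prefix - min_prefix, min_pos + 1, y
        if prefix < min_prefix:
            min_prefix, min_pos = prefix, y
    return best_sum, best_start, best_end


A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_prefix(A))
```

On the slide's array this reports the window A[3..7] with sum 12, using O(1) extra space and a single scan.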

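Toy problem #1 (solution 2)

Algorithm
• sum = 0; max_sum = 0;
• For i = 1,...,n do
    If (sum + A[i] ≤ 0) sum = 0;
    else { sum += A[i]; max_sum = MAX(max_sum, sum); }

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

[Figure: within A, the portion just before the optimum window sums to ≤ 0, while every prefix of the optimum window sums to ≥ 0.]

Note:
• sum = 0 when OPT starts;
• sum > 0 within OPT.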
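A minimal Python sketch of this single-scan solution follows. The function name max_subarray_scan is hypothetical; as in the slide's pseudocode, the running sum is reset to 0 whenever it would become non-positive, so the sketch returns 0 (the empty window) when no element of the array is positive.

```python
def max_subarray_scan(a):
    """Return the maximum subarray sum with one left-to-right scan."""
    max_sum = 0   # best sum found so far; 0 stands for the empty window
    running = 0   # sum of the current candidate window
    for value in a:
        if running + value <= 0:
            # A non-positive window can never be the prefix of an optimum.
            running = 0
        else:
            running += value
            max_sum = max(max_sum, running)
    return max_sum


A = [2, -5, 6, 1, -2, 4, 3, -13, 9, -6, 7]
print(max_subarray_scan(A))
```

Compared with the prefix-sum solution, this variant needs no auxiliary arrays, at the price of not reporting the window boundaries unless two extra indices are tracked.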

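Toy problem #2: Top-freq elements
• Goal: Top queries over a stream of n items (the alphabet Σ is large).
• Math problem: Find the item y whose frequency is > n/2, using the smallest space (i.e. the mode, if it occurs > n/2 times).

Algorithm
• Use a pair of variables <X,C>, set to <s,1> on the first item s of the stream.
• For each subsequent item s of the stream,
    if (X==s) then C++ else { C--; if (C==0) { X=s; C=1; } }
• Return X;

A = b a c c c d c b a a a c c b c c c
<X,C> after each item: <b,1> <a,1> <c,1> <c,2> <c,3> <c,2> <c,3> <c,2> <c,1> <a,1> <a,2> <a,1> <c,1> <b,1> <c,1> <c,2> <c,3>

Proof: Suppose the returned X ≠ y. Then every one of y’s occurrences has a “negative” mate, i.e. a distinct occurrence of another item that cancels it. Hence the mates are ≥ #occ(y), so n ≥ 2 * #occ(y), which contradicts 2 * #occ(y) > n.

Problems if the mode occurs ≤ n/2 times: the returned X is then not guaranteed to be the most frequent item.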
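The counter-based scan of this slide can be sketched in Python as follows. The names majority_candidate and is_majority are hypothetical, and None is used as a placeholder for the initial X (so the stream is assumed not to contain None). The second function is an addition of this sketch: it is the extra counting pass needed when it is not known in advance that some item occurs more than n/2 times.

```python
def majority_candidate(stream):
    """Return (X, states): the final candidate and the <X,C> pair after each item."""
    candidate, count = None, 1   # placeholder X; the first item always replaces it
    states = []
    for item in stream:
        if item == candidate:
            count += 1
        else:
            count -= 1
            if count == 0:
                candidate, count = item, 1
        states.append((candidate, count))
    return candidate, states


def is_majority(items, candidate):
    """Second pass: check that candidate really occurs more than n/2 times."""
    return 2 * sum(1 for s in items if s == candidate) > len(items)


items = list("bacccdcbaaaccbccc")
x, states = majority_candidate(items)
print(x, states, is_majority(items, x))
```

The list of states reproduces the sequence of <X,C> pairs shown on the slide, and the space used by the scan itself is just the pair of variables.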

Toy problem #3: Indexing
• Consider the following TREC collection:
  • N = 6 * 10^9 size
  • n = 10^6 documents
  • TotT = 10^9 total terms (avg term length is 6 chars)
  • t = 5 * 10^5 distinct terms
• What kind of data structure do we build to support word-based searches?

Solution 1: Term-Doc matrix
• A matrix of t = 500K terms by n = 1 million documents.
• An entry is 1 if the document (play) contains the word, 0 otherwise.
• Space is 500Gb! (t * n = 5 * 10^11 entries)

Solution 2: Inverted index

Brutus → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar → 13 16

• Typically <termID,docID,pos> use about 12 bytes
• We have 10^9 total terms → at least 12Gb space
• Compressing 6Gb documents gets 1.5Gb data
• Better index, but it is still >10 times the text!
• We can do still better: i.e. 30-50% of the original text
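To make the data structure concrete, here is a minimal Python sketch of an inverted index that maps each term to the sorted list of IDs of the documents containing it. The function name build_inverted_index and the three toy documents are hypothetical; positions, term IDs and compression, which the slide's space estimate refers to, are deliberately left out.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it.

    docs is a dict {doc_id: text}; tokenization is a plain whitespace split
    on lowercased text (an assumption of this sketch).
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy documents, invented for illustration only.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar married Calpurnia",
    3: "Brutus and Calpurnia",
}
index = build_inverted_index(docs)
print(index["brutus"], index["caesar"], index["calpurnia"])
```

Unlike the term-document matrix, the index stores only the (term, document) pairs that actually occur, which is why its size is proportional to the total number of term occurrences.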

Toy problem #4: Sorting
• How to sort tuples (objects) on disk
• 10^9 objects of 12 bytes each, hence 12 Gb
• Key observation:
  • The array A to sort is an “array of pointers to objects”, while the tuples (objects) live elsewhere in memory
  • For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the memory locations pointed to by A[i] and A[j]
  • If we use qsort, this is an indirect sort!
  • Ω(n log n) random memory accesses! (I/Os?)

Cost of Quicksort on large data
• Some typical parameter settings
  • N = 10^9 tuples of 12 bytes each
  • Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
• Analysis of qsort on disk:
  • qsort is an indirect sort: Ω(n log_2 n) random memory accesses
  • [5ms] * n log_2 n = 10^9 * log_2 (10^9) * 5ms ≥ 3 years
  • In practice a little bit better because of caching, but...

What about listing tuples in order? B-trees for sorting?
• Using a well-tuned B-tree library: Berkeley DB
• n = 10^9 insertions → data get distributed arbitrarily!
[Figure: B-tree internal nodes, B-tree leaves (“tuple pointers”), and the tuples they point to.]
• Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months
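The two back-of-the-envelope estimates (indirect quicksort and B-tree insertions) can be rechecked with a few lines of Python. This is only arithmetic on the slides' parameters; the constant and function names are hypothetical, and every random access is assumed to cost one full 5ms seek, ignoring caching.

```python
import math

SEEK_SECONDS = 5e-3   # seek time from the slides
N_TUPLES = 10**9      # number of tuples from the slides


def indirect_sort_years(n=N_TUPLES, seek=SEEK_SECONDS):
    """n * log2(n) random accesses, one seek each, expressed in years."""
    return n * math.log2(n) * seek / (365 * 24 * 3600)


def btree_insert_days(n=N_TUPLES, seek=SEEK_SECONDS):
    """One random I/O per inserted tuple, expressed in days."""
    return n * seek / (24 * 3600)


print(round(indirect_sort_years(), 1), "years;", round(btree_insert_days()), "days")
```

With these parameters the sketch gives roughly 4.7 years for the indirect sort, consistent with the slide's "≥ 3 years", and about 58 days for the B-tree, i.e. the slide's 2 months.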

Binary Merge-Sort

Merge-Sort(A)
  if length(A) > 1 then
    Copy the first half of A into array A1     (Divide)
    Copy the second half of A into array A2    (Divide)
    Merge-Sort(A1)                             (Conquer)
    Merge-Sort(A2)                             (Conquer)
    Merge(A, A1, A2)                           (Combine)
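A runnable Python counterpart of the Merge-Sort pseudocode is sketched below. It differs from the slide in one labeled respect: instead of merging back into A in place, it returns a new sorted list. The function names merge and merge_sort, and the order of the 16 sample keys, are choices of this sketch.

```python
def merge(left, right):
    """Combine: merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])    # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


def merge_sort(a):
    """Divide into halves, conquer each recursively, then combine."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))


keys = [10, 2, 5, 1, 13, 19, 9, 7, 15, 4, 8, 3, 12, 17, 6, 11]
print(merge_sort(keys))
```

The recursion has about log_2 n levels and each level scans all n keys once, which is the access pattern the external variants exploit.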

Merge-Sort Recursion Tree
[Figure: recursion tree of Merge-Sort on 16 keys, with log_2 n levels. At the last level the sorted halves 1 2 5 7 9 10 13 19 and 3 4 6 8 11 12 15 17 are merged into 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19.]
How do we exploit the disk features?

External Binary Merge-Sort
• Increase the size of the initial runs to be merged!
[Figure: main-memory sorts produce the initial sorted runs (1 2 5 10, 7 9 13 19, 3 4 8 15, 6 11 12 17); external two-way merges then combine them level by level into 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19.]
• N/M runs; each level is 2 passes (R/W) over the data

Cost of External Binary Merge-Sort
• Some typical parameter settings:
  • n = 10^9 tuples of 12 bytes each, N = 12 Gb of data
  • Typical disk (Seagate): seek time ~8ms
  • Avg transfer rate is 100Mb per sec = 10^-8 secs/byte
• Analysis of binary merge-sort on disk (M = 10Mb ≈ 10^6 tuples):
  • Data divided into (N/M) runs: ≈ 10^3 runs
  • #levels is log_2 (N/M) ≈ 10
  • It executes 2 * log_2 (N/M) ≈ 20 passes (R/W) over the data
  • I/O-scanning cost: 20 * [12 * 10^9] * 10^-8 ≈ 2400 sec = 40 min
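The scanning-cost estimate can be reproduced with the short Python sketch below. The function name is hypothetical and the parameters are the slide's, read as decimal units (12 * 10^9 bytes of data, 10^7 bytes of memory). One difference is worth stating: the sketch uses the exact 1200 runs and rounds the number of levels up, so it reports 11 levels and 44 minutes where the slide rounds to 10 levels and 40 minutes.

```python
import math


def binary_mergesort_scan(data_bytes, memory_bytes, secs_per_byte):
    """Return (runs, levels, passes, seconds) for external binary merge-sort.

    Only sequential scanning is charged; seeks are ignored, as on the slide.
    """
    runs = math.ceil(data_bytes / memory_bytes)   # initial sorted runs of size M
    levels = math.ceil(math.log2(runs))           # two-way merge levels
    passes = 2 * levels                           # each level reads and writes the data
    seconds = passes * data_bytes * secs_per_byte
    return runs, levels, passes, seconds


runs, levels, passes, seconds = binary_mergesort_scan(12 * 10**9, 10 * 10**6, 1e-8)
print(runs, levels, passes, round(seconds / 60), "min")
```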

Multi-way Merge-Sort
• Sort N items using internal memory M and disk pages of size B:
  • Pass 1: Produce (N/M) sorted runs.
  • Pass 2, ...: merge X = M/B runs each pass.
[Figure: X input buffers (INPUT 1, INPUT 2, ..., INPUT X) and one OUTPUT buffer, each of B items, in main memory between the input disk and the output disk.]

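Multiway Merging
• Each run i = 1, ..., X = M/B keeps its current page in a buffer Bfi, scanned by a pointer pi.
• Repeatedly write min(Bf1[p1], Bf2[p2], ..., BfX[pX]) to the output buffer Bfo at position po.
• Fetch the next page of run i if pi = B (until EOF of the run).
• Flush Bfo to the output file (the merged run) if Bfo is full.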
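The merging step can be illustrated with the following Python sketch. Two simplifications are assumptions of the sketch, not part of the slide: in-memory lists stand in for the runs on disk (so there is no page fetching or flushing), and a binary heap from the standard heapq module replaces the direct min(...) over the X buffer heads. The function name multiway_merge is hypothetical.

```python
import heapq


def multiway_merge(runs):
    """Yield the items of several sorted runs in globally sorted order."""
    iterators = [iter(run) for run in runs]
    heap = []
    # Load the head of every non-empty run, tagged with the run it came from.
    for run_id, it in enumerate(iterators):
        try:
            heap.append((next(it), run_id))
        except StopIteration:
            pass
    heapq.heapify(heap)
    while heap:
        value, run_id = heapq.heappop(heap)   # current minimum over all heads
        yield value
        try:
            # Advance the pointer of the run that supplied the minimum.
            heapq.heappush(heap, (next(iterators[run_id]), run_id))
        except StopIteration:
            pass   # this run has reached EOF


runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(list(multiway_merge(runs)))
```

With X runs, the heap finds each minimum in O(log X) comparisons; the I/O behaviour is unchanged, since every item is still read once and written once per pass.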

Cost of Multi-way Merge-Sort
• Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
• Cost of a pass = 2 * (N/B) I/Os
• Parameters:
  • M = 10Mb; B = 8Kb; N = 12 Gb;
  • N/M ≈ 10^3 runs; #passes = log_{M/B} (N/M) ≈ 1 !!!
  • I/O-scanning: 20 passes (40 min) → 2 passes (4 min)
• Tuning depends on disk features: increasing the fan-out (M/B) increases #I/Os per pass! (For a fixed M, a larger fan-out means smaller pages B, hence more I/Os per scan of the data.)
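The effect of the fan-out on the number of merge passes can be checked with this Python sketch. The function names are hypothetical, the slide's sizes are read as decimal units (B = 8 * 10^3 bytes, so M/B = 1250), and the number of passes is counted by repeatedly dividing the number of runs by the fan-out rather than by evaluating a floating-point logarithm.

```python
import math


def merge_passes(runs, fan_out):
    """Number of merge passes needed to reduce `runs` sorted runs to one."""
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_out)   # each pass merges fan_out runs into one
        passes += 1
    return passes


def scan_seconds(passes, data_bytes, secs_per_byte):
    """Each merge pass reads and writes the whole data set once."""
    return 2 * passes * data_bytes * secs_per_byte


N, M, B = 12 * 10**9, 10 * 10**6, 8 * 10**3   # bytes, decimal units assumed
runs = math.ceil(N / M)                       # 1200 initial runs
for fan_out in (2, M // B):
    p = merge_passes(runs, fan_out)
    print(fan_out, p, round(scan_seconds(p, N, 1e-8) / 60), "min")
```

With fan-out M/B = 1250 a single merge pass suffices (2 scans, about 4 minutes), against 11 merge passes with fan-out 2.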

Can compression help?
• Goal: enlarge M and reduce N
• #passes = O(log_{M/B} (N/M))
• Cost of a pass = O(N/B)

Please! Do not underestimate the features of disks in algorithmic design.
