
An Implementation of the Language Model Based IR System on the GPU




  1. An Implementation of the Language Model Based IR System on the GPU Sudhanshu Khemka

  2. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  3. Outline • Background and Related Work • The GPU Architecture • The structure of an IR System • Ponte and Croft’s document scoring model • Clustering • Related Work

  4. GPU Programming Model • Allows the programmer to define a grid of thread blocks. • Each thread block executes independently of other blocks. • All threads in a thread block can also execute independently of each other; however, one can synchronize their execution using barrier synchronization methods, such as __syncthreads().
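A minimal CUDA sketch of the programming model described above, with an illustrative blockSum kernel (not from the thesis) and a fixed block size of 256: each block sums its own tile of the input independently, and __syncthreads() is the barrier that lets threads within one block cooperate through shared memory.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each block sums its own 256-element tile of the input.
// Blocks execute independently; threads within a block synchronize with
// __syncthreads() before reading each other's shared-memory writes.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float tile[256];               // shared memory, visible to one block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;  // this thread's global index

    tile[tid] = in[idx];
    __syncthreads();                          // barrier: all loads done before summing

    if (tid == 0) {                           // one thread per block accumulates its tile
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

int main()
{
    const int blocks = 4, threads = 256, n = blocks * threads;
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));      // (input left uninitialized; this only
    cudaMalloc(&dOut, blocks * sizeof(float)); //  illustrates the grid/block launch)
    blockSum<<<blocks, threads>>>(dIn, dOut); // grid of 4 independent thread blocks
    cudaDeviceSynchronize();
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```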

  5. GPU Memory Hierarchy

  6. Ding et al.’s GPU based architecture for IR • The GPU cannot access main memory directly. • Thus, there is a transfer cost associated with moving the data from the CPU’s main memory to the GPU’s global memory. • In some cases, this transfer cost is higher than the speed up obtained by using the GPU.
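A minimal sketch of the transfer the slide refers to; the buffer name and size are illustrative, not from the thesis. The cudaMemcpy from host (CPU) memory to GPU global memory is the cost that, for small inputs, can outweigh the kernel speedup.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    size_t numBytes = 1 << 20;                 // 1 MB of postings data (illustrative)
    int *hPostings = (int *)malloc(numBytes);  // CPU main memory
    int *dPostings = nullptr;
    cudaMalloc(&dPostings, numBytes);          // GPU global memory
    cudaMemcpy(dPostings, hPostings, numBytes,
               cudaMemcpyHostToDevice);        // the CPU -> GPU transfer cost lives here
    // ... launch kernels that read dPostings ...
    cudaFree(dPostings);
    free(hPostings);
    return 0;
}
```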

  7. Structure of an IR system

  8. Inverted Index and Smoothing • The figure shows an inverted list (e.g., the entry for NG4 records that it occurs 6 times in Doc 2), the full inverted index, and the index after add-one smoothing. • Smoothing assigns a small non-zero probability to n-grams that were not seen in the document.
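As a concrete illustration of add-one smoothing (a hedged sketch: the dense per-document count array and all names are assumptions, not the thesis' layout), every count is incremented by one before normalizing, so unseen n-grams receive probability 1/(total + V).

```cuda
// One thread per n-gram in a dense, per-document count array.
// V = vocabulary size, total = number of n-gram tokens in the document.
__global__ void addOneSmooth(const int *counts, float *prob, int V, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= V) return;
    prob[i] = (counts[i] + 1.0f) / (total + V);   // unseen n-grams get 1 / (total + V)
}
```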

  9. The Language Model based approach to IR • Builds a probabilistic language model Md for each document d • Ranks documents according to the probability of generating the query Q given their language model representation, P(Q | Md) • Ponte and Croft’s model is an enhanced version of this basic approach
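In the standard query-likelihood formulation (with the usual unigram independence assumption, which is what the notation above suggests), the ranking score is:

```latex
% Score of document d for query Q under its language model M_d.
P(Q \mid M_d) \;=\; \prod_{t \in Q} \hat{P}(t \mid M_d)
```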

  10. Ponte and Croft’s model • As we are estimating from a document sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t, p_avg(t) = (1/df_t) Σ_{d ∈ d_t} p_ml(t | Md), and mixing it with the document's own maximum likelihood estimate via a risk factor R̂_{t,d}: • p̂(t | Md) = p_ml(t | Md)^{1 − R̂_{t,d}} × p_avg(t)^{R̂_{t,d}}

  11. Clustering • Enables a search engine to present information in a more effective manner by displaying similar documents together. • Particularly useful when the search term has different word senses • For example, consider the query “jaguar.” • jaguar can refer to a car, an animal, or the Apple operating system • If a user is searching for documents related to the animal jaguar, they will have to manually search through the top-k documents to find documents related to the animal. • Clustering alleviates this problem

  12. Related Work • Ding et al. propose data parallel algorithms for compressing, decompressing, and intersecting sorted inverted lists for a Vector Space model based information retrieval system. • Example of their list intersection algorithm for intersecting two lists A and B: • Randomly pick a few elements Ai from list A and, for each one, find the pair Bj, Bj+1 in B such that Bj < Ai ≤ Bj+1 • This implicitly partitions both A and B into segments • Intersect corresponding segments in parallel.
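For comparison, a much simpler (and less load-balanced) parallel intersection than Ding et al.'s segment-partitioning scheme assigns one thread to each element of A and binary-searches B; this sketch uses illustrative names and produces its output in arbitrary order.

```cuda
// One thread per element of sorted list A; each thread binary-searches sorted list B.
__global__ void intersect(const int *A, int nA, const int *B, int nB,
                          int *out, int *outCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nA) return;
    int lo = 0, hi = nB - 1, key = A[i];
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (B[mid] == key) {                       // common docID found
            out[atomicAdd(outCount, 1)] = key;     // append (result order is arbitrary)
            return;
        }
        if (B[mid] < key) lo = mid + 1;
        else              hi = mid - 1;
    }
}
```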

  13. Related Work contd • Chang et al. implement hierarchical clustering on the GPU. However: • 1) They apply clustering to DNA microarray experiments; we apply it to information retrieval. • 2) They use the Pearson correlation coefficient as the distance metric to compute the distance between two elements; we use cosine similarity. • 3) We present a more optimized version of their code.

  14. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  15. Motivation and Goal • No published papers propose an implementation of the LM based IR system on the GPU • However, a probabilistic language model based approach to retrieval significantly outperforms standard tf.idf weighting (Ponte and Croft, 1998) • Goal: We hope to be the first to contribute algorithms to realize a Language model based IR system on the GPU

  16. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  17. Good Turing Smoothing • Intuition: We estimate the probability of things that occur c times using the probability of things that occur c+1 times. • Smoothed count: c* = (c+1) · N_{c+1} / N_c • Smoothed probability: P(c*) = c* / N • In the above definition, N_c is the number of n-grams that occur c times and N is the total number of observed n-grams.

  18. Smoothed count: c* = (c+1) · N_{c+1} / N_c; smoothed probability: P(c*) = c* / N • Two phases: (1) calculate the N_c values, (2) smooth the counts

  19. Calculating the N_c values on the GPU • Sort the n-gram counts of Doc 1, record the positions where the count value changes, and use stream compaction to extract the run lengths; the run lengths are the N_c values. • For the example document Doc 1: N0 = 1, N1 = 2, N2 = 1
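A hedged sketch of the same count-of-counts computation using Thrust primitives (sort + reduce_by_key) rather than the thesis' custom sort and stream-compaction kernels; `counts` is assumed to hold one raw count per n-gram of the document.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Produces the pairs (c, N_c): each distinct count value c and how many
// n-grams have that count. Note: sorts `counts` in place.
void countOfCounts(thrust::device_vector<int>& counts,
                   thrust::device_vector<int>& c_out,
                   thrust::device_vector<int>& Nc_out)
{
    thrust::sort(counts.begin(), counts.end());              // group equal counts together
    c_out.resize(counts.size());
    Nc_out.resize(counts.size());
    auto ends = thrust::reduce_by_key(counts.begin(), counts.end(),
                                      thrust::constant_iterator<int>(1),
                                      c_out.begin(), Nc_out.begin());
    c_out.resize(ends.first - c_out.begin());                // keep only the distinct counts
    Nc_out.resize(ends.second - Nc_out.begin());
}
```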

  20. Smooth the n-gram counts • Assign one thread per n-gram of Doc 1 (threads 0–3 in the figure), each computing the smoothed count for its n-gram • Smoothed count: c* = (c+1) · N_{c+1} / N_c; smoothed probability: P(c*) = c* / N
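A hedged sketch of the smoothing kernel the slide describes, one thread per n-gram; Nc is assumed to be indexed by raw count with at least maxCount + 2 entries, and all names are illustrative.

```cuda
// count[i]   : raw count of the i-th n-gram of the document
// Nc[c]      : number of n-grams with raw count c (from the previous step)
// smoothed[i]: Good-Turing smoothed count c* = (c + 1) * N_{c+1} / N_c
__global__ void goodTuringSmooth(const int *count, const int *Nc,
                                 float *smoothed, int numNgrams)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numNgrams) return;
    int c = count[i];
    if (Nc[c] > 0)
        smoothed[i] = (c + 1) * (float)Nc[c + 1] / Nc[c];
    else
        smoothed[i] = (float)c;   // fall back to the raw count if N_c is zero
}
```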

  21. Experimental results

  22. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  23. Kneser Ney smoothing • The Good Turing algorithm assigns the same probability of occurrence to all zero-count n-grams • For example, if count(BURNISH THE) = count(BURNISH THOU) = 0, then using Good Turing • P(THE | BURNISH) = P(THOU | BURNISH) • However, intuitively, P(THE | BURNISH) > P(THOU | BURNISH), as THE is much more common than THOU • The Kneser Ney smoothing algorithm captures this intuition • It calculates P(wi | wi-1) based on the number of different contexts the word wi has appeared in (assuming count(wi-1wi) = 0)
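This intuition is usually captured by the continuation probability (standard Kneser-Ney formulation; the thesis' exact notation may differ slightly):

```latex
% Probability proportional to the number of distinct left contexts of w_i,
% normalized by the total number of distinct bigram types.
P_{\mathrm{continuation}}(w_i) \;=\;
  \frac{\bigl|\{\, w_{i-1} : c(w_{i-1} w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (w_{j-1}, w_j) : c(w_{j-1} w_j) > 0 \,\}\bigr|}
```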

  24. Kneser Ney smoothing • P_KN(wi | wi-1) is computed from count(wi-1wi) / count(wi-1) (suitably discounted) if count(wi-1wi) > 0, and otherwise from a backoff weight for wi-1 times the continuation probability of wi • The slide breaks this into four steps: Step 1 computes the numerator of the continuation probability for each wi, Step 2 its denominator, Step 3 the per-wi-1 backoff weight, and Step 4 selects the correct branch for each bigram

  25. GPU based implementation • Step 1: Compute the context count contextW[wi] for each wi: launch a kernel such that each thread visits one bigram in the bigram dictionary and checks whether count(wi-1wi) > 0. If yes, it increments contextW[wi] by 1.
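A minimal sketch of the Step 1 kernel, assuming the bigram dictionary is stored as parallel arrays (the second-word array w2 and the count array are illustrative names); atomicAdd is needed because several threads may increment the same word's counter.

```cuda
// One thread per bigram (w_{i-1}, w_i) in the dictionary.
__global__ void countContexts(const int *w2,          // second word of each bigram
                              const int *bigramCount, // count(w_{i-1} w_i)
                              int *contextW,          // output: distinct left contexts per word
                              int numBigrams)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBigrams) return;
    if (bigramCount[i] > 0)
        atomicAdd(&contextW[w2[i]], 1);  // w_i gains one more distinct left context
}
```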

  26. Step 2: Compute the denominator of the continuation probability by applying a GPU based parallel reduction to the result of Step 1. Please refer to the technical paper by Mark Harris [5] for an efficient implementation of the parallel reduction operation on the GPU. For our running example, the result is 2. • Step 3: Compute the backoff weight for each wi-1. As we have already completed Steps 1 and 2, this can easily be done by asking one thread to compute the value for each wi-1. • Step 4: According to the value of count(wi-1wi), we use the corresponding branch of the Kneser Ney formula to obtain the array of smoothed probabilities.
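Step 2 can also be expressed with a library reduction (a sketch using thrust::reduce; Harris's hand-tuned tree reduction, cited as [5], is the optimized alternative the slide refers to).

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Summing the per-word context counts from Step 1 gives the number of distinct
// bigram types with count > 0, i.e. the continuation-probability denominator.
int totalBigramTypes(const thrust::device_vector<int>& contextW)
{
    return thrust::reduce(contextW.begin(), contextW.end(), 0);
}
```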

  27. Experimental results

  28. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  29. Ponte and Croft’s document scoring model • Computation of the score of a document given the query is independent of the computation of the score of another document given the query • Embarrassingly parallel.
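A hedged sketch of the "one thread per document" scoring idea; purely for brevity it assumes a dense numDocs × vocabSize matrix of precomputed log-smoothed term probabilities rather than the thesis' inverted-index layout.

```cuda
// score[d] = log P(Q | M_d) = sum over query terms of log P_hat(t | M_d).
__global__ void scoreDocuments(const float *logProb,   // numDocs x vocabSize, row-major
                               const int *queryTerms,  // term ids of the query
                               int queryLen, int vocabSize,
                               float *score, int numDocs)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;      // one thread per document
    if (d >= numDocs) return;
    float s = 0.0f;
    for (int q = 0; q < queryLen; ++q)
        s += logProb[d * vocabSize + queryTerms[q]];
    score[d] = s;
}
```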

  30. Experimental results

  31. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  32. Single link hierarchical clustering • The algorithm can be divided into two phases: • Phase 1: Compute the pairwise similarity between documents, i.e., compute sim(di, dj) for all i, j ∈ {1…N} • Phase 2: Merging. During each iteration, merge the 2 most similar clusters; call the new cluster X. Update the similarity of X with all other active clusters, then find the new most similar cluster for X.

  33. Phase 1: Computing pairwise distances • The figure shows the input document-term matrix • We launch a 2 × 2 grid of thread blocks, where each block’s dimensions are also 2 × 2

  34. Focus on block 1 • Each thread computes the similarity between a pair of documents • However, as the threads within a block share common documents, they can synchronize their execution. E.g., both Thread 0 and Thread 1 in block 1 require document 0 • This is an important observation because it allows us to exploit the shared memory of a block: we only need to load d0 into the block’s shared memory once, and both thread 0 and thread 1 can then use it.

  35. Similarity computation for block 1 • Process the input matrix in chunks. • To process each chunk, each thread in block 1 loads 2 values into the block’s shared memory (the figure marks the values loaded by thread 0). • Do a partial similarity computation. E.g., for doc0 and doc2 we can compute the partial dot product (.2)(.5) + (.3)(.1). Store this result.

  36. After processing the first chunk, move to the second chunk (the figure shows the block’s shared memory for the new chunk) • Earlier we had computed the partial dot product (.2)(.5) + (.3)(.1) for doc0 and doc2 • Using the next chunk, we complete the dot product by adding (.1)(.2) + (.4)(.2); a kernel sketch of this chunked computation follows
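A hedged sketch of the chunked (tiled) computation just described, for plain dot products; TILE, the dense row-major document matrix, and the assumption that numDocs and numTerms are multiples of TILE are illustrative simplifications (cosine similarity would additionally divide by the two vector norms).

```cuda
#define TILE 2   // matches the 2 x 2 blocks in the slides

__global__ void pairwiseDot(const float *docs,   // numDocs x numTerms, row-major
                            float *sim,          // numDocs x numDocs output
                            int numDocs, int numTerms)
{
    __shared__ float rowA[TILE][TILE];           // chunk of the block's "row" documents
    __shared__ float rowB[TILE][TILE];           // chunk of the block's "column" documents

    int i = blockIdx.y * TILE + threadIdx.y;     // first document of this thread's pair
    int j = blockIdx.x * TILE + threadIdx.x;     // second document of this thread's pair
    float acc = 0.0f;

    for (int t = 0; t < numTerms; t += TILE) {   // walk the term dimension chunk by chunk
        rowA[threadIdx.y][threadIdx.x] =
            docs[(blockIdx.y * TILE + threadIdx.y) * numTerms + t + threadIdx.x];
        rowB[threadIdx.y][threadIdx.x] =
            docs[(blockIdx.x * TILE + threadIdx.y) * numTerms + t + threadIdx.x];
        __syncthreads();                         // the chunk is now in shared memory
        for (int k = 0; k < TILE; ++k)           // partial dot product for this chunk
            acc += rowA[threadIdx.y][k] * rowB[threadIdx.x][k];
        __syncthreads();
    }
    sim[i * numDocs + j] = acc;                  // completed dot product of documents i and j
}
```

The launch configuration matching the slides would be dim3 block(TILE, TILE); dim3 grid(numDocs / TILE, numDocs / TILE).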

  37. GPU based pairwise distance computation

  38. Phase 2: Merge clusters
      for n <- 1 to N-X do
          i1 <- argmax{i : I[i] = i} NBM[i].sim        // GPU: a parallel reduction directly returns i1 and i2
          i2 <- I[NBM[i1].index]
          merge i1 and i2
          for i <- 1 to N do                            // GPU: launch a kernel so this loop runs in parallel
              if I[i] = i and i ≠ i1 and i ≠ i2 then
                  C[i1][i].sim <- C[i][i1].sim <- max(C[i1][i].sim, C[i2][i].sim)
              if I[i] = i2 then I[i] <- i1
          NBM[i1] <- argmax{X ∈ {C[i1][i] : I[i] = i and i ≠ i1}} X.sim   // GPU: a parallel reduction directly returns NBM[i1]

  39. GPU based merging

  40. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  41. Discussion • From our experiments, we observed that GPU based algorithms are primarily useful when dealing with large datasets. • The GPU is suitable for solving problems that can be divided into non-overlapping subproblems • If one is running several iterations of the same GPU code, one should minimize the data transfer between the CPU and the GPU within those iterations

  42. Outline • Background and Related Work • Motivation and goal • Our contributions: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • Discussion • Conclusion

  43. Conclusion • We have contributed the following novel algorithms for GPU based IR: • A GPU based implementation of the Good Turing smoothing algorithm • A GPU based implementation of the Kneser Ney smoothing algorithm • An efficient implementation of Ponte and Croft’s document scoring model on the GPU • A GPU friendly version of the single link hierarchical clustering algorithm • We have experimentally shown that our GPU based implementations are significantly faster than similar CPU based implementations • Future work: • Implement pseudo relevance feedback on the GPU • Investigate methods to implement an image retrieval system on the GPU

  44. References [1] Cederman, D. and Tsigas, P. (2008). A Practical Quicksort Algorithm for Graphics Processors. In Proceedings of the 16th Annual European Symposium on Algorithms (ESA '08), Springer-Verlag, Berlin, Heidelberg, 246-258. [2] CUDPP. http://code.google.com/p/cudpp/ [3] Ding, S., He, J., and Suel, T. Using graphics processors for high performance IR query processing. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 421-430. [4] Fagin, R., Kumar, R. and Sivakumar, D. Comparing top k lists. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '03). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 28-36. [5] Harris, M. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf [6] Hoare, C.A.R. (1962). Quicksort. Computer Journal, Vol. 5, 1, 10-15.

  45. References [7] Indri. http://lemurproject.org/indri/ [8] Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-20. [9] Jurafsky, D. and Martin, J. Speech and Language Processing. [10] NVIDIA CUDA C Programming Guide. [11] Ponte, J.M. and Croft, W.B. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98). ACM, New York, NY, USA, 275-281. [12] Salton, G., Wong, A., and Yang, C.S. A vector space model for automatic indexing. Commun. ACM 18, 11 (November 1975), 613-620. [13] Sanders, J. and Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming. [14] Spink, A. U.S. versus European Web searching trends. [15] Thrust. http://code.google.com/p/thrust/

  46. Thank you!!!

  47. Ponte and Croft’s model • For non-occurring terms, estimate p̂(t | Md) = cf_t / cs • In the above, cf_t is the raw count of term t in the collection and cs is the total number of tokens in the collection • As we are estimating from a document sized sample, we cannot be very confident about our maximum likelihood estimates. Therefore, Ponte and Croft suggest using the mean probability of term t in documents containing t, p_avg(t) = (1/df_t) Σ_{d ∈ d_t} p_ml(t | Md), and mixing it with the document's own maximum likelihood estimate via a risk factor R̂_{t,d}: • p̂(t | Md) = p_ml(t | Md)^{1 − R̂_{t,d}} × p_avg(t)^{R̂_{t,d}}
