FYP Progress
Sudhanshu Khemka
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
CIKM
• Began my FYP by revamping my UROP report and submitting it for publication to CIKM
• Learnt the importance of succinct writing
  • Importance of re-drawing images
  • Importance of re-writing equations
• Comments by reviewers:
  • The definition of the language modeling approach is not clear
  • We should use a standard dataset
• Also noticed that we need to improve our smoothing model, which sets the direction for my FYP
Implemented smoothing models
• The Good-Turing smoothing algorithm
• The Kneser-Ney smoothing algorithm
Good-Turing Smoothing
• Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
• Smoothed count: c* = (c+1) · N_{c+1} / N_c
• Smoothed probability: P(c*) = c* / N
• In the above definition, N_c is the number of N-grams that occur exactly c times, and N is the total number of observed N-gram tokens.
Smoothed count: c* = (c+1) · N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
Two phases on the GPU (a serial sketch of both follows below):
• Phase 1: calculate the N_c values
• Phase 2: smooth the counts
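A minimal serial sketch of the two phases in Python, as a reference point for the GPU version, which parallelizes each phase; the function and variable names here are illustrative, not the project's actual code:

```python
from collections import Counter

def good_turing(counts):
    """Serial sketch of Good-Turing smoothing.
    counts: dict mapping each N-gram to its raw count c.
    Returns a dict mapping each N-gram to its smoothed count c*."""
    # Phase 1: N_c = number of distinct N-grams occurring exactly c times.
    nc = Counter(counts.values())
    # Phase 2: c* = (c + 1) * N_{c+1} / N_c for each N-gram.
    smoothed = {}
    for ngram, c in counts.items():
        if nc.get(c + 1, 0) > 0:
            smoothed[ngram] = (c + 1) * nc[c + 1] / nc[c]
        else:
            smoothed[ngram] = c  # no N-grams occur c+1 times; keep raw count
    return smoothed

print(good_turing({"the cat": 2, "cat sat": 1, "sat on": 1}))
```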
Calculating the N_c values
[Figure: Doc1's N-gram counts are sorted, the positions where the sorted value changes are flagged, and stream compaction extracts one (count value, run length) pair per run, yielding N_0 = 1, N_1 = 2, N_2 = 1 for Doc1.]
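The same sort + stream-compaction idea can be sketched on the CPU with NumPy primitives that mirror the GPU steps; the array names are illustrative:

```python
import numpy as np

def nc_values(counts, max_c):
    """Sketch of the sort + stream-compaction computation of N_c.
    counts: 1-D array of raw N-gram counts for one document.
    Returns nc where nc[c] = number of N-grams occurring exactly c times."""
    sorted_c = np.sort(counts)                   # step 1: sort the counts
    # Step 2: flag the positions where the sorted value changes.
    change = np.flatnonzero(np.diff(sorted_c)) + 1
    starts = np.concatenate(([0], change))       # start index of each run
    ends = np.concatenate((change, [len(sorted_c)]))
    # Step 3: stream compaction keeps one (value, run length) pair per run.
    nc = np.zeros(max_c + 1, dtype=np.int64)
    nc[sorted_c[starts]] = ends - starts
    return nc

print(nc_values(np.array([0, 1, 1, 2]), max_c=2))  # -> [1 2 1], as on the slide
```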
Smoothing the N-gram counts
[Figure: Doc1's N-grams assigned one per thread, Thread 0 through Thread 3.]
• Let one thread compute the smoothed count for each N-gram
• Smoothed count: c* = (c+1) · N_{c+1} / N_c
• Smoothed probability: P(c*) = c* / N
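Because each smoothed count depends only on its own raw count and the shared N_c table, the per-thread kernel is a pure element-wise map; a vectorized NumPy sketch of that map, with illustrative names:

```python
import numpy as np

def smooth_counts(counts, nc):
    """Element-wise map: conceptually one GPU thread per N-gram.
    counts: raw counts; nc: table with nc[c] = N_c."""
    nc_padded = np.append(nc, 0)        # so N_{c+1} = 0 beyond the table
    nc_next = nc_padded[counts + 1]
    # c* = (c + 1) * N_{c+1} / N_c, falling back to c when N_{c+1} = 0.
    return np.where(nc_next > 0, (counts + 1) * nc_next / nc[counts], counts)

nc = np.array([1, 2, 1])                       # N_0 = 1, N_1 = 2, N_2 = 1
print(smooth_counts(np.array([1, 1, 2]), nc))  # -> [1. 1. 2.]
```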
Kneser-Ney Discounting
• Intuition: assume we are smoothing bigram counts. To smooth the count for a bigram w_{i-1}w_i, do the following:
• If C(w_{i-1}w_i) > 0, subtract a fixed discount D from the count:
  P_KN(w_i | w_{i-1}) = (C(w_{i-1}w_i) − D) / C(w_{i-1})
• If C(w_{i-1}w_i) = 0, base the estimate on the number of different contexts the word w_i has appeared in:
  P_KN(w_i | w_{i-1}) = α(w_{i-1}) · |{w' : C(w' w_i) > 0}| / Σ_w |{w' : C(w' w) > 0}|
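A minimal serial sketch of this backoff form for bigrams, assuming an absolute discount D = 0.75 and the standard continuation counts; all names are illustrative, not the project's code:

```python
from collections import Counter

def kneser_ney(bigram_list, D=0.75):
    """Serial sketch of backoff Kneser-Ney for bigram probabilities.
    bigram_list: observed (w_prev, w) pairs from the training text."""
    big_c = Counter(bigram_list)                  # C(w_{i-1} w_i)
    uni_c = Counter(wp for wp, _ in bigram_list)  # C(w_{i-1})
    # Continuation count: number of distinct left contexts of each word.
    contexts = Counter(w for (_, w) in big_c)
    total_types = len(big_c)                      # distinct bigram types

    def p(w_prev, w):
        c = big_c[(w_prev, w)]
        if c > 0:
            return (c - D) / uni_c[w_prev]        # discounted estimate
        # Back off: mass freed by discounting w_prev's bigrams, spread
        # according to the continuation probability of w.
        followers = sum(1 for (wp, _) in big_c if wp == w_prev)
        alpha = D * followers / uni_c[w_prev]
        return alpha * contexts[w] / total_types
    return p

p = kneser_ney([("the", "cat"), ("the", "dog"), ("a", "cat")])
print(p("the", "cat"), p("a", "dog"))   # seen vs. unseen bigram
```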
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
Re-running experiments using the wt2g collection
• Provided by the University of Glasgow
• Cost: 350 pounds; size: 2 GB
[Figure: pipeline for re-running the experiments. Webpage → HTML parser → text → LM indexer → inverted index → results.]
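A minimal sketch of the inverted-index stage of this pipeline; the structure shown is a common one and an assumption here, since the slide only names the component:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> list of (already stemmed) terms.
    Returns term -> {doc_id: term frequency}, the structure the
    language-model retrieval code scores against."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] += 1
    return index

index = build_inverted_index({"d1": ["cat", "sat", "cat"], "d2": ["dog"]})
print(dict(index["cat"]))   # {'d1': 2}
```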
HTML Parser and LM Indexer
• Both written in Python
• Used the lxml API for HTML parsing: from lxml import html
• Avoid the inbuilt HTML parser provided by Python: it cannot handle broken HTML very well while extracting text
• Beautiful Soup is also a good option
• Used the nltk library for stemming (nltk.stem.porter) and for tokenization during indexing (nltk.word_tokenize); a short sketch follows below
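A condensed sketch of how the two pieces fit together; the function name is illustrative, while the lxml and nltk calls are the ones named above (nltk.word_tokenize additionally requires the "punkt" tokenizer data to be downloaded):

```python
from lxml import html
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def extract_terms(raw_html):
    """Parse (possibly broken) HTML with lxml, then tokenize and stem."""
    tree = html.fromstring(raw_html)      # lxml tolerates broken HTML
    text = tree.text_content()            # drop tags, keep visible text
    stemmer = PorterStemmer()
    return [stemmer.stem(tok.lower()) for tok in word_tokenize(text)]

print(extract_terms("<html><body><p>Running cats and dogs"))
```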
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
  • Implementing Ponte and Croft's model
  • Re-running experiments using the TREC GOV2 collection
Future Work
• Modify the code to implement Ponte and Croft's model
• Re-run experiments using the TREC GOV2 collection