
FYP Progress



Presentation Transcript


  1. FYP Progress Sudhanshu Khemka

  2. Outline • CIKM • Implementation of Smoothing techniques on the GPU • Re-running experiments using the wt2g collection • The Future

  3. CIKM • Began my FYP by revamping my UROP report and submitting it for publication to CIKM • Learnt the importance of succinct writing • Importance of re-drawing images • Importance of re-writing equations • Comments by reviewers: the definition of the language modeling approach is not clear; we should use a standard dataset • Also noticed that we need to improve our smoothing model: this sets the direction for my FYP

  4. Implemented smoothing models • The Good-Turing smoothing algorithm • The Kneser-Ney smoothing algorithm

  5. Good-Turing Smoothing • Intuition: we estimate the probability of things that occur c times using the probability of things that occur c + 1 times. • Smoothed count: c* = (c + 1) * N_{c+1} / N_c • Smoothed probability: P(c*) = c* / N • In the above definition, Nc is the number of N-grams that occur c times and N is the total number of observed N-grams.
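The formulas above can be illustrated with a short Python sketch; the toy counts below are hypothetical and only serve to show how the Nc values and the smoothed counts are computed.

    from collections import Counter

    # Hypothetical bigram counts, for illustration only.
    counts = {"the cat": 3, "the dog": 1, "a cat": 1, "cat sat": 2}

    # Nc = number of N-grams that occur exactly c times.
    nc = Counter(counts.values())               # {3: 1, 1: 2, 2: 1}
    total = sum(counts.values())                # total observed N-gram tokens

    def good_turing_count(c):
        # Smoothed count c* = (c + 1) * N_{c+1} / N_c.
        if nc[c] == 0 or nc[c + 1] == 0:
            return c                            # no evidence for a re-estimate: keep the raw count
        return (c + 1) * nc[c + 1] / nc[c]

    for ngram, c in counts.items():
        c_star = good_turing_count(c)
        print(ngram, c_star, c_star / total)    # smoothed count and smoothed probability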

  6. Smoothed count: c* = (c + 1) * N_{c+1} / N_c. Smoothed probability: P(c*) = c* / N. Two phases on the GPU: (1) calculate the Nc values, (2) smooth the counts.

  7. Calculating Nc values • For each document, sort the N-gram counts, mark the positions where the count value changes, and use stream compaction to collect how many N-grams share each count value. • Example from the slide (Doc1): N0 = 1, N1 = 2, N2 = 1
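The slide's worked diagram did not survive the transcript, so the NumPy sketch below reproduces the same sort / boundary-detection / stream-compaction steps on the CPU; the count array is hypothetical, chosen to reproduce N0 = 1, N1 = 2, N2 = 1.

    import numpy as np

    # Hypothetical per-document N-gram counts (one entry per distinct N-gram).
    counts = np.array([0, 2, 1, 1])

    sorted_counts = np.sort(counts)                                   # step 1: sort
    # Step 2: mark the positions where a new count value starts.
    starts = np.flatnonzero(np.r_[True, sorted_counts[1:] != sorted_counts[:-1]])
    # Step 3: stream compaction - keep one entry per distinct count value.
    values = sorted_counts[starts]                                    # distinct counts: [0, 1, 2]
    freqs = np.diff(np.r_[starts, sorted_counts.size])                # how many N-grams share each count

    nc = dict(zip(values.tolist(), freqs.tolist()))
    print(nc)                                                         # {0: 1, 1: 2, 2: 1}, i.e. N0 = 1, N1 = 2, N2 = 1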

  8. Smooth N-gram counts • Let one thread compute the smoothed count for each N-gram (in the slide's diagram, threads 0-3 each handle one of Doc1's N-grams). • Each thread applies: smoothed count c* = (c + 1) * N_{c+1} / N_c, smoothed probability P(c*) = c* / N
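The actual GPU code is not shown in the slides; the following Numba CUDA sketch is my own illustration of the one-thread-per-N-gram mapping (Numba is an assumption, and the kernel and array names are hypothetical; the original implementation may well be plain CUDA C).

    import numpy as np
    from numba import cuda

    @cuda.jit
    def smooth_counts(counts, nc, smoothed):
        i = cuda.grid(1)                              # one thread per N-gram
        if i < counts.shape[0]:
            c = counts[i]
            if c + 1 < nc.shape[0] and nc[c] > 0 and nc[c + 1] > 0:
                smoothed[i] = (c + 1) * nc[c + 1] / nc[c]   # c* = (c + 1) * N_{c+1} / N_c
            else:
                smoothed[i] = c                       # no higher-count evidence: keep the raw count

    counts = np.array([0, 2, 1, 1], dtype=np.int32)   # hypothetical per-N-gram counts
    nc = np.array([1, 2, 1, 0], dtype=np.float32)     # N0, N1, N2, N3 for this document
    smoothed = np.zeros(counts.shape[0], dtype=np.float32)
    smooth_counts[1, 32](counts, nc, smoothed)        # launch: 1 block of 32 threads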

  9. Experimental results

  10. Kneser-Ney Discounting • Intuition: assume we are smoothing bigram counts. To smooth the count for a bigram w_{i-1} w_i, do the following: • If C(w_{i-1} w_i) > 0, subtract a discount D from the count: P(w_i | w_{i-1}) = (C(w_{i-1} w_i) - D) / C(w_{i-1}) • Otherwise (C(w_{i-1} w_i) = 0), base the estimate on the number of different contexts w_i has appeared in, scaled by a back-off weight for w_{i-1}.
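To make the two cases concrete, here is a minimal Python sketch of back-off Kneser-Ney for bigrams; the discount D and the toy counts are hypothetical, and the back-off weight is simplified (the exact normalisation over unseen words is omitted).

    from collections import Counter

    D = 0.75                                                # hypothetical discount value
    bigrams = Counter({("the", "cat"): 3, ("a", "cat"): 1, ("the", "dog"): 2})
    unigrams = Counter()
    for (w1, _), c in bigrams.items():
        unigrams[w1] += c

    # Continuation counts: in how many distinct contexts has each word appeared?
    continuation = Counter(w2 for (_, w2) in bigrams)
    total_bigram_types = len(bigrams)

    def p_continuation(w):
        return continuation[w] / total_bigram_types

    def backoff_weight(w_prev):
        # Probability mass freed by discounting the seen bigrams of w_prev
        # (simplified: the normalisation over unseen words is omitted).
        seen_types = sum(1 for b in bigrams if b[0] == w_prev)
        return D * seen_types / unigrams[w_prev]

    def p_kn(w_prev, w):
        c = bigrams[(w_prev, w)]
        if c > 0:
            return (c - D) / unigrams[w_prev]               # case 1: subtract the discount D
        return backoff_weight(w_prev) * p_continuation(w)   # case 2: back off to the continuation probability

    print(p_kn("the", "cat"))   # seen bigram
    print(p_kn("a", "dog"))     # unseen bigram: based on how many contexts "dog" appears in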

  11. Experimental results

  12. Outline • CIKM • Implementation of Smoothing techniques on the GPU • Re-running experiments using the wt2g collection • The Future

  13. Re-run experiments using the wt2g collection • Provided by the University of Glasgow • Cost: 350 pounds; size: 2 GB • Pipeline: Webpage → HTML parser → Text → LM indexer → Inverted index → Results!

  14. HTML Parser and LM Indexer • Both written in Python • Used the lxml API for HTML parsing: from lxml import html • Do not use the inbuilt HTML parser provided by Python: it cannot handle broken HTML very well while extracting text • Beautiful Soup is also a good option • Used the nltk library for stemming (nltk.stem.porter) and for tokenization during indexing (nltk.word_tokenize)
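The parser and indexer code is not included in the slides; a minimal sketch of the lxml + nltk approach described above might look like this (function names, the sample page, and the index layout are my own assumptions).

    from collections import defaultdict
    from lxml import html
    from nltk import word_tokenize                 # may require nltk.download('punkt') once
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    def extract_text(raw_html):
        # lxml tolerates broken HTML far better than Python's built-in parser.
        return html.fromstring(raw_html).text_content()

    def index_document(doc_id, text, inverted_index):
        for position, token in enumerate(word_tokenize(text.lower())):
            inverted_index[stemmer.stem(token)].append((doc_id, position))

    inverted_index = defaultdict(list)
    page = "<html><body><p>Cats are sitting"       # deliberately broken HTML
    index_document("doc1", extract_text(page), inverted_index)
    print(dict(inverted_index))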

  15. Outline • CIKM • Implementation of Smoothing techniques on the GPU • Re-running experiments using the wt2g collection • The Future • Implementing Ponte and Croft's model • Re-running experiments using the TREC GOV2 collection

  16. Future Work • Modify the code to implement Ponte and Croft's model • Re-run experiments using the TREC GOV2 collection.
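As a rough sketch of the planned direction: Ponte and Croft's approach scores a document by the probability that its language model generates the query. The simplified scorer below uses plain linear interpolation with the collection model and is only a stand-in for their actual risk-based estimator; all names and parameters are assumptions.

    import math
    from collections import Counter

    def query_log_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
        # Score log P(Q | D) by interpolating the document and collection
        # language models (a stand-in for Ponte & Croft's estimator).
        doc_counts, doc_len = Counter(doc_terms), len(doc_terms)
        coll_counts, coll_len = Counter(collection_terms), len(collection_terms)
        score = 0.0
        for t in query_terms:
            p_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_coll = coll_counts[t] / coll_len if coll_len else 0.0
            p = lam * p_doc + (1 - lam) * p_coll
            score += math.log(p) if p > 0 else float("-inf")
        return score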
