FYP Progress
Sudhanshu Khemka
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
CIKM
• Began my FYP by revamping my UROP report and submitting it for publication to CIKM
• Learnt the importance of succinct writing
  • Importance of re-drawing images
  • Importance of re-writing equations
• Comments by reviewers:
  • The definition of the language modeling approach is not clear
  • We should use a standard dataset
• Also noticed that we need to improve our smoothing model, which sets the direction for my FYP
Implemented smoothing models
• The Good-Turing smoothing algorithm
• The Kneser-Ney smoothing algorithm
Good-Turing Smoothing
• Intuition: we estimate the probability of things that occur c times using the probability of things that occur c+1 times.
• Smoothed count: c* = (c+1) · N_{c+1} / N_c
• Smoothed probability: P(c*) = c* / N
• In the above definition, N_c is the number of N-grams that occur exactly c times, and N is the total number of observed N-gram tokens.
Smoothed count: c* = (c+1) · N_{c+1} / N_c
Smoothed probability: P(c*) = c* / N
Two phases on the GPU (a serial sketch of both follows below):
• Phase 1: calculate the N_c values
• Phase 2: smooth the counts
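A minimal serial sketch of the two phases in Python, as a reference point for the GPU version, which parallelizes each phase; the function and variable names here are illustrative, not the project's actual code:

```python
from collections import Counter

def good_turing(counts):
    """Serial sketch of Good-Turing smoothing.
    counts: dict mapping each N-gram to its raw count c.
    Returns a dict mapping each N-gram to its smoothed count c*."""
    # Phase 1: N_c = number of distinct N-grams occurring exactly c times.
    nc = Counter(counts.values())
    # Phase 2: c* = (c + 1) * N_{c+1} / N_c for each N-gram.
    smoothed = {}
    for ngram, c in counts.items():
        if nc.get(c + 1, 0) > 0:
            smoothed[ngram] = (c + 1) * nc[c + 1] / nc[c]
        else:
            smoothed[ngram] = c  # no N-grams occur c+1 times; keep raw count
    return smoothed

print(good_turing({"the cat": 2, "cat sat": 1, "sat on": 1}))
```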
Calculating the N_c values
[Figure: Doc1's N-gram counts are sorted, the positions where the sorted value changes are flagged, and stream compaction extracts one (count value, run length) pair per run, yielding N_0 = 1, N_1 = 2, N_2 = 1 for Doc1.]
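The same sort + stream-compaction idea can be sketched on the CPU with NumPy primitives that mirror the GPU steps; the array names are illustrative:

```python
import numpy as np

def nc_values(counts, max_c):
    """Sketch of the sort + stream-compaction computation of N_c.
    counts: 1-D array of raw N-gram counts for one document.
    Returns nc where nc[c] = number of N-grams occurring exactly c times."""
    sorted_c = np.sort(counts)                   # step 1: sort the counts
    # Step 2: flag the positions where the sorted value changes.
    change = np.flatnonzero(np.diff(sorted_c)) + 1
    starts = np.concatenate(([0], change))       # start index of each run
    ends = np.concatenate((change, [len(sorted_c)]))
    # Step 3: stream compaction keeps one (value, run length) pair per run.
    nc = np.zeros(max_c + 1, dtype=np.int64)
    nc[sorted_c[starts]] = ends - starts
    return nc

print(nc_values(np.array([0, 1, 1, 2]), max_c=2))  # -> [1 2 1], as on the slide
```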
Smoothing the N-gram counts
[Figure: Doc1's N-grams assigned one per thread, Thread 0 through Thread 3.]
• Let one thread compute the smoothed count for each N-gram
• Smoothed count: c* = (c+1) · N_{c+1} / N_c
• Smoothed probability: P(c*) = c* / N
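Because each smoothed count depends only on its own raw count and the shared N_c table, the per-thread kernel is a pure element-wise map; a vectorized NumPy sketch of that map, with illustrative names:

```python
import numpy as np

def smooth_counts(counts, nc):
    """Element-wise map: conceptually one GPU thread per N-gram.
    counts: raw counts; nc: table with nc[c] = N_c."""
    nc_padded = np.append(nc, 0)        # so N_{c+1} = 0 beyond the table
    nc_next = nc_padded[counts + 1]
    # c* = (c + 1) * N_{c+1} / N_c, falling back to c when N_{c+1} = 0.
    return np.where(nc_next > 0, (counts + 1) * nc_next / nc[counts], counts)

nc = np.array([1, 2, 1])                       # N_0 = 1, N_1 = 2, N_2 = 1
print(smooth_counts(np.array([1, 1, 2]), nc))  # -> [1. 1. 2.]
```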
Kneser-Ney Discounting
• Intuition: assume we are smoothing bigram counts. To smooth the count for a bigram w_{i-1}w_i, do the following:
• If C(w_{i-1}w_i) > 0, subtract a fixed discount D from the count:
  P_KN(w_i | w_{i-1}) = (C(w_{i-1}w_i) − D) / C(w_{i-1})
• If C(w_{i-1}w_i) = 0, base the estimate on the number of different contexts the word w_i has appeared in:
  P_KN(w_i | w_{i-1}) = α(w_{i-1}) · |{w' : C(w' w_i) > 0}| / Σ_w |{w' : C(w' w) > 0}|
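A minimal serial sketch of this backoff form for bigrams, assuming an absolute discount D = 0.75 and the standard continuation counts; all names are illustrative, not the project's code:

```python
from collections import Counter

def kneser_ney(bigram_list, D=0.75):
    """Serial sketch of backoff Kneser-Ney for bigram probabilities.
    bigram_list: observed (w_prev, w) pairs from the training text."""
    big_c = Counter(bigram_list)                  # C(w_{i-1} w_i)
    uni_c = Counter(wp for wp, _ in bigram_list)  # C(w_{i-1})
    # Continuation count: number of distinct left contexts of each word.
    contexts = Counter(w for (_, w) in big_c)
    total_types = len(big_c)                      # distinct bigram types

    def p(w_prev, w):
        c = big_c[(w_prev, w)]
        if c > 0:
            return (c - D) / uni_c[w_prev]        # discounted estimate
        # Back off: mass freed by discounting w_prev's bigrams, spread
        # according to the continuation probability of w.
        followers = sum(1 for (wp, _) in big_c if wp == w_prev)
        alpha = D * followers / uni_c[w_prev]
        return alpha * contexts[w] / total_types
    return p

p = kneser_ney([("the", "cat"), ("the", "dog"), ("a", "cat")])
print(p("the", "cat"), p("a", "dog"))   # seen vs. unseen bigram
```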
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
Re-running experiments using the wt2g collection
• Provided by the University of Glasgow
• Cost: 350 pounds; size: 2 GB
[Figure: pipeline for re-running the experiments. Webpage → HTML parser → text → LM indexer → inverted index → results.]
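A minimal sketch of the inverted-index stage of this pipeline; the structure shown is a common one and an assumption here, since the slide only names the component:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping doc_id -> list of (already stemmed) terms.
    Returns term -> {doc_id: term frequency}, the structure the
    language-model retrieval code scores against."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] += 1
    return index

index = build_inverted_index({"d1": ["cat", "sat", "cat"], "d2": ["dog"]})
print(dict(index["cat"]))   # {'d1': 2}
```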
HTML Parser and LM Indexer
• Both written in Python
• Used the lxml API for HTML parsing: from lxml import html
• Avoid the inbuilt HTML parser provided by Python: it cannot handle broken HTML very well while extracting text
• Beautiful Soup is also a good option
• Used the nltk library for stemming (nltk.stem.porter) and for tokenization during indexing (nltk.word_tokenize); a short sketch follows below
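A condensed sketch of how the two pieces fit together; the function name is illustrative, while the lxml and nltk calls are the ones named above (nltk.word_tokenize additionally requires the "punkt" tokenizer data to be downloaded):

```python
from lxml import html
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def extract_terms(raw_html):
    """Parse (possibly broken) HTML with lxml, then tokenize and stem."""
    tree = html.fromstring(raw_html)      # lxml tolerates broken HTML
    text = tree.text_content()            # drop tags, keep visible text
    stemmer = PorterStemmer()
    return [stemmer.stem(tok.lower()) for tok in word_tokenize(text)]

print(extract_terms("<html><body><p>Running cats and dogs"))
```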
Outline
• CIKM
• Implementation of smoothing techniques on the GPU
• Re-running experiments using the wt2g collection
• The Future
  • Implementing Ponte and Croft's model
  • Re-running experiments using the TREC GOV2 collection
Future Work
• Modify the code to implement Ponte and Croft's model
• Re-run experiments using the TREC GOV2 collection