Natural Language Processing Assignment
Group Members: Soumyajit De, Naveen Bansal, Sanobar Nishat
Outline
  • POS tagging
      • Tag-wise accuracy
      • Graph: tag-wise accuracy
      • Precision, recall, f-score
  • Improvements in POS tagging
      • Implementation of tri-gram
      • POS tagging with smoothing
      • Tag-wise accuracy
      • Improved precision, recall and f-score
  • Comparison between Discriminative and Generative Model
  • Next word prediction
      • Model #1
      • Model #2
      • Implementation method and details
      • Scoring ratio
      • Perplexity ratio
Outline
  • NLTK
  • Yago
      • Different examples by using Yago
  • Parsing
      • Different examples
      • Conclusions
  • A* Implementation – A Comparison with Viterbi
Precision, Recall, F-Score
  • Precision: tp/(tp+fp) = 0.92
  • Recall: tp/(tp+fn) = 1
  • F-score: (2 · precision · recall)/(precision + recall) = 0.958
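The arithmetic above can be checked with a short sketch; the raw counts (tp = 92, fp = 8, fn = 0) are hypothetical values chosen only to reproduce the slide's figures:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw true/false positive
    and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts consistent with precision 0.92 and recall 1.
p, r, f = prf(tp=92, fp=8, fn=0)
print(round(p, 2), round(r, 2), round(f, 3))  # 0.92 1.0 0.958
```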

Improvement in HMM Based POS Tagger
  • Implementation of trigram POS tagging
  • Issue: data sparsity
  • Solution: implementation of smoothing techniques
  • Result: increases overall accuracy up to 94%

Smoothing Technique
  • Implementation of smoothing: linear interpolation technique
  • Formula: P(t3 | t1, t2) = λ1·P̂(t3) + λ2·P̂(t3 | t2) + λ3·P̂(t3 | t1, t2), with λ1 + λ2 + λ3 = 1
  • The λ values are found by deleted interpolation (discussed in "TnT – A Statistical Part-of-Speech Tagger")
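A sketch of how the λ values can be estimated by deleted interpolation, following the algorithm described in the TnT paper; the function name and the toy tag sequence below are our own illustration, not the assignment's actual code:

```python
from collections import Counter

def deleted_interpolation(tags):
    """Estimate lambda1..lambda3 (unigram, bigram, trigram weights)
    by deleted interpolation, as described in the TnT paper."""
    uni = Counter(tags)
    bi = Counter(zip(tags, tags[1:]))
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    n = len(tags)
    lams = [0.0, 0.0, 0.0]  # weights for unigram, bigram, trigram
    for (t1, t2, t3), c in tri.items():
        # Compare the three maximum-likelihood estimates with the
        # current trigram "deleted" (its counts reduced by one).
        cases = [
            (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
            (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
            (c - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
        ]
        # Credit the winning estimate with this trigram's count.
        lams[cases.index(max(cases))] += c
    total = sum(lams)
    return [l / total for l in lams]

toy_tags = ["DT", "NN", "VB", "DT", "NN", "VB", "DT", "JJ", "NN"]
print(deleted_interpolation(toy_tags))
```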

Precision, Recall, F-Score
  • Precision: tp/(tp+fp) = 0.9415
  • Recall: tp/(tp+fn) = 1
  • F-score: (2 · precision · recall)/(precision + recall) = 0.97

Error Analysis (Tag Wise Accuracy)

VVB - finite base form of lexical verbs (e.g. forget, send, live, return)

Count: 9916

Error Analysis (cont..)

ZZ0 - Alphabetical symbols (e.g. A, a, B, b, c, d) (Accuracy - 63%)

Count: 337

Error Analysis (cont..)

ITJ - Interjection (Accuracy - 65%)

Count: 177

Reason: the ITJ tag appears very rarely, so although it is not misclassified often in absolute terms, even a few errors leave its accuracy percentage low.

Error Analysis (cont..)

UNC - Unclassified items (Accuracy - 23%)

Count: 756

Conclusion
  • Since it is a unigram model, the discriminative and generative models give the same performance, as expected.
Model # 1

When only the previous word is given.

Example: He likes -------

Model # 2

When the previous tag and the previous word are known.

Example: He_PP0 likes_VB0 --------

Previous Work

Model # 2 (cont..)

Current Work

Evaluation Method

1. Scoring Method

  • Divide the test corpus into bigrams
  • Match the second word of each bigram against the word predicted by each model
  • Increment a model's score whenever a match is found
  • The final evaluation is the ratio of the two scores, i.e. model1/model2
  • If ratio > 1, model 1 is performing better, and vice versa.
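The steps above can be sketched as follows; `m1` and `m2` are hypothetical stand-ins for the two models' next-word predictors (previous word → predicted next word), not the assignment's actual models:

```python
def score(predict, words):
    # Count how often the predicted next word matches the actual
    # second word of each bigram in the test corpus.
    return sum(1 for prev, actual in zip(words, words[1:])
               if predict(prev) == actual)

def scoring_ratio(predict_m1, predict_m2, words):
    # Ratio > 1 means model 1 performs better, and vice versa.
    return score(predict_m1, words) / score(predict_m2, words)

corpus = "he likes tea and he likes coffee".split()
m1 = lambda prev: "likes" if prev == "he" else "and"
m2 = lambda prev: "likes" if prev == "he" else "tea"
print(scoring_ratio(m1, m2, corpus))
```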
Implementation Detail

Look-Up Table

A look-up table is used to predict the next word.


2. Perplexity:

Comparison:
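A minimal sketch of how such a perplexity comparison can be computed, assuming each model supplies a probability for every test word; the probability lists below are made up for illustration:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability the
    model assigns to each test word; lower means a better model."""
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# Made-up per-word probabilities for illustration only.
model1_probs = [0.2, 0.1, 0.25, 0.05]
model2_probs = [0.3, 0.2, 0.25, 0.1]
print(perplexity(model1_probs) / perplexity(model2_probs))  # > 1: model 2 better
```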

Remarks
  • Model 2 performs worse than Model 1 because words are sparsely distributed across tags.
Remarks
  • Perplexity is found to decrease in this model.
  • The overall score has increased.
Example #1

Query : Amitabh and Sachin

wikicategory_Living_people -- <type> -- Amitabh_Bachchan -- <givenNameOf> -- Amitabh

wikicategory_Living_people -- <type> -- Sachin_Tendulkar -- <givenNameOf> -- Sachin

ANOTHER-PATH

wikicategory_Padma_Shri_recipients -- <type> -- Amitabh_Bachchan -- <givenNameOf> -- Amitabh

wikicategory_Padma_Shri_recipients -- <type> -- Sachin_Tendulkar -- <givenNameOf> -- Sachin
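One way such connecting paths can be found is to treat YAGO facts as (subject, relation, object) triples and intersect the categories reachable from both query names. The fact set below is a toy mirroring the example output above, not the actual YAGO data:

```python
# Toy triple store; the real YAGO knowledge base is far larger.
triples = [
    ("Amitabh_Bachchan", "type", "wikicategory_Living_people"),
    ("Sachin_Tendulkar", "type", "wikicategory_Living_people"),
    ("Amitabh_Bachchan", "type", "wikicategory_Padma_Shri_recipients"),
    ("Sachin_Tendulkar", "type", "wikicategory_Padma_Shri_recipients"),
    ("Amitabh_Bachchan", "givenNameOf", "Amitabh"),
    ("Sachin_Tendulkar", "givenNameOf", "Sachin"),
]

def full_entities(name):
    # Resolve a given name to the entities that carry it.
    return {s for s, r, o in triples if r == "givenNameOf" and o == name}

def shared_categories(name_a, name_b):
    # Categories reachable from both names yield connecting paths.
    types = lambda e: {o for s, r, o in triples if s == e and r == "type"}
    cats = lambda name: set().union(*(types(e) for e in full_entities(name)))
    return cats(name_a) & cats(name_b)

print(sorted(shared_categories("Amitabh", "Sachin")))
```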

Example#2

Query : India and Pakistan

PATH

wikicategory_WTO_member_economies -- <type> -- India

wikicategory_WTO_member_economies -- <type> -- Pakistan

ANOTHER-PATH

wikicategory_English-speaking_countries_and_territories -- <type> -- India

wikicategory_English-speaking_countries_and_territories -- <type> -- Pakistan

ANOTHER-PATH

Operation_Meghdoot -- <participatedIn> -- India

Operation_Meghdoot -- <participatedIn> -- Pakistan


ANOTHER-PATH

Operation_Trident_(Indo-Pakistani_War) -- <participatedIn> -- India

Operation_Trident_(Indo-Pakistani_War) -- <participatedIn> -- Pakistan

ANOTHER-PATH

Siachen_conflict -- <participatedIn> -- India

Siachen_conflict -- <participatedIn> -- Pakistan

ANOTHER-PATH

wikicategory_Asian_countries -- <type> -- India

wikicategory_Asian_countries -- <type> -- Pakistan


ANOTHER-PATH

Capture_of_Kishangarh_Fort -- <participatedIn> -- India

Capture_of_Kishangarh_Fort -- <participatedIn> -- Pakistan

ANOTHER-PATH

wikicategory_South_Asian_countries -- <type> -- India

wikicategory_South_Asian_countries -- <type> -- Pakistan

ANOTHER-PATH

Operation_Enduring_Freedom -- <participatedIn> -- India

Operation_Enduring_Freedom -- <participatedIn> -- Pakistan

ANOTHER-PATH

wordnet_region_108630039 -- <type> -- India

wordnet_region_108630039 -- <type> -- Pakistan

Example #3

Query: Tom and Jerry

wikicategory_Living_people -- <type> -- Tom_Green -- <givenNameOf> -- Tom

wikicategory_Living_people -- <type> -- Jerry_Brown -- <givenNameOf> -- Jerry

Example #5

Example #4

Conclusion

1. VBZ always comes at the end of the parse tree in Hindi and Urdu.

2. The structure in Hindi and Urdu either stays the same or is reordered so the verb comes last, e.g. S => NP VP (no change) OR VP => VBZ NP (constituents interchanged).

3. For exact translation into Hindi and Urdu, merging of sub-trees in English is sometimes required.

4. One-word-to-multiple-word mappings are common when translating from English to Hindi/Urdu, e.g. donor => aatiya shuda OR have => rakhta hai.

5. Phrase-to-phrase translation is sometimes required, so chunking is needed, e.g. hand in hand => choli daman ka saath (Urdu) => sath sath hain (Hindi).

6. DT NN or DT NP does not interchange.

7. In example #7, the correct translation does not require merging the two sub-trees MD and VP, e.g. could be => ja sakta hai.

NLTK Toolkit
  • NLTK is a suite of open-source Python modules
  • Components of NLTK: code and corpora (>30 annotated data sets)
      • corpus readers
      • tokenizers
      • stemmers
      • taggers
      • parsers
      • WordNet
      • semantic interpretation
A* - Heuristic

[Trellis diagram: tag columns A, B, C, D between the start marker (^) and end marker ($), showing the selected route and the transition probabilities.]

Fixed cost at each level (L) = (min cost) × (no. of hops)

Heuristic(h)

F(B) = g(B) + h(B)

where

h(B) = min(i,j){ -log( Pr(Cj | Bi) · Pr(Wc | Cj) ) }
     + min(i,j){ -log( Pr(tj | ti) · Pr(Wtj | tj) ) } · (n - 2)
     + min(k){ -log( Pr($ | Dk) · Pr($ | $) ) }

Here,

n = number of nodes from Bi to $ (including $)

Wc = word emitted from the next node, C

ti, tj = any combination of tags in the graph

Wtj = word emitted from the node tj

Result

Viterbi / A* Ratio :

score(Viterbi) / score(A*) = 1.0

Where,

score(Algo) = #correct predictions in the test corpus

Conclusion
  • Since we make a bigram assumption, and Viterbi is guaranteed to find the optimal path in a bigram HMM, it returns the optimal path. Because our A* heuristic underestimates the true cost (it is admissible) and satisfies the triangle inequality, A* also finds the optimal path in the graph.
  • Since A* has to backtrack at times, it needs more time and memory than Viterbi to find the solution.
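A minimal bigram-HMM Viterbi sketch, with toy probabilities of our own, illustrating why the algorithm is exact under the bigram assumption: at each position it keeps only the cheapest path ending in each tag, which is safe because the future depends only on the current tag:

```python
import math

def viterbi(words, tags, trans, emit, start):
    """Return the minimum negative-log-probability tag path for
    `words` under a bigram HMM with the given parameters."""
    # best[t] = (neg-log cost of the cheapest path ending in tag t, path)
    best = {t: (-math.log(start[t] * emit[t].get(words[0], 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        nxt = {}
        for t in tags:
            # Keep only the cheapest predecessor for each tag.
            cost, path = min(
                (best[p][0] - math.log(trans[p][t] * emit[t].get(w, 1e-9)),
                 best[p][1])
                for p in tags)
            nxt[t] = (cost, path + [t])
        best = nxt
    return min(best.values())[1]

# Toy two-tag model (N = noun, V = verb).
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.7}}
print(viterbi(["dogs", "bark"], tags, trans, emit, start))  # ['N', 'V']
```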