
Natural Language Processing Assignment – Final Presentation Varun Suprashanth , 09005063




  1. Natural Language Processing Assignment – Final Presentation Varun Suprashanth, 09005063 Tarun Gujjula, 09005068 Asok Ramachandran, 09005072

  2. Part 1 : POS Tagger

  3. Tasks Completed • Implementation of Viterbi – Unigram, Bigram. • Five Fold Evaluation. • Per POS Accuracy. • Confusion Matrix.
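The bigram Viterbi decoder listed above can be sketched as follows. This is a minimal illustration with hypothetical transition/emission tables, not the assignment's actual code; the names `viterbi`, `trans`, and `emit` are assumptions.

```python
import math

def viterbi(words, tags, trans, emit):
    """Bigram Viterbi decoding: argmax over tag sequences of
    prod_i P(t_i | t_{i-1}) * P(w_i | t_i), with '^' as the start tag."""
    # best[t] = (log-probability of the best partial path ending in tag t, that path)
    best = {}
    for t in tags:
        p = trans.get(('^', t), 0) * emit.get((t, words[0]), 0)
        if p > 0:
            best[t] = (math.log(p), [t])
    for w in words[1:]:
        new = {}
        for t in tags:
            cands = []
            for prev_tag, (score, path) in best.items():
                p = trans.get((prev_tag, t), 0) * emit.get((t, w), 0)
                if p > 0:
                    cands.append((score + math.log(p), path + [t]))
            if cands:
                new[t] = max(cands)  # keep only the best path into tag t
        best = new
    return max(best.values())[1]  # highest-scoring complete path
```

Working in log space avoids underflow on long sentences; keeping only the best path into each tag at each position is what makes the algorithm linear in sentence length.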

  4. Per POS Accuracy for Bigram Assumption.

  5. Screen shot of Confusion Matrix

  6. Part 2 : Discriminative VS Generative

  7. Problem Statement • Generate unigram parameters of P(t_i|w_i). You already have the annotated corpus. • Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem. • Compare the unigram performance of (2) with that of the HMM-based system.

  8. Tasks Completed • Generated unigram parameters of P(t_i|w_i). • Computed the argmax of P(T|W). • Compared the unigram performance of the above with that of the HMM-based system. • Better results were produced by the generative model in cases of ambiguous sentences.

  9. Discriminative • P(T|W) = P(t_1 … t_n | w_1 … w_n) • Assuming word–tag pairs to be independent, • P(T|W) = P(t_1|w_1) · P(t_2|w_2) · … · P(t_n|w_n) = ∏_i P(t_i|w_i) • Precision: 0.896788 • F-measure: 0.896788
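The discriminative unigram model can be sketched as counting word–tag co-occurrences and tagging each word with its most frequent tag, i.e. argmax_t P(t|w) = c(w, t)/c(w). This is a minimal sketch with hypothetical toy data; the names `train_pair_counts` and `tag_words` are assumptions.

```python
from collections import Counter

def train_pair_counts(tagged_sentences):
    """Count c(w, t) from an annotated corpus; since c(w) is constant per
    word, argmax_t P(t|w) reduces to argmax_t c(w, t)."""
    pair_counts = Counter()
    for sent in tagged_sentences:
        for word, t in sent:
            pair_counts[(word, t)] += 1
    return pair_counts

def tag_words(words, pair_counts, default='NN'):
    """Tag each word independently with argmax_t P(t|w); unseen words
    fall back to a default tag."""
    out = []
    for w in words:
        cands = [(c, t) for (word, t), c in pair_counts.items() if word == w]
        out.append(max(cands)[1] if cands else default)
    return out
```

Because each word is tagged in isolation, this model cannot use tag context, which is why the slides find the generative (HMM) model stronger on ambiguous sentences.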

  10. Per-PoS Accuracy

  11. Generative • By Bayes' theorem, P(T|W) ∝ P(W|T) · P(T) • Assuming the unigram assumption and word–tag pairs to be independent, • P(T|W) ∝ ∏_i P(w_i|t_i) · P(t_i)
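Under the unigram assumption, the generative score for each word factors as P(w|t) · P(t), both estimated by counting over the annotated corpus. A minimal sketch with a hypothetical flat list of (word, tag) pairs; the function name is an assumption.

```python
from collections import Counter

def generative_unigram_tag(words, tagged_corpus):
    """Generative unigram tagging: choose argmax_t P(w|t) * P(t), where
    P(w|t) = c(t, w)/c(t) and P(t) = c(t)/N, estimated from the corpus."""
    tag_counts = Counter()
    pair_counts = Counter()
    for word, t in tagged_corpus:
        tag_counts[t] += 1
        pair_counts[(t, word)] += 1
    total = sum(tag_counts.values())
    out = []
    for w in words:
        scores = {t: (pair_counts[(t, w)] / tag_counts[t]) * (tag_counts[t] / total)
                  for t in tag_counts}
        out.append(max(scores, key=scores.get))
    return out
```

Note that P(w|t) · P(t) algebraically reduces to c(t, w)/N; the two factors are kept separate here to mirror the formula on the slide.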

  12. Part 3 : Analysis of Corpora Using Word Prediction

  13. Tasks Completed • Predicted the next word on the basis of the patterns occurring in both the corpora. • First Corpus had untagged-word sentences and the second one had tagged-word sentences. • The corpus with the tagged words gives better results for word prediction.

  14. Untagged Corpus • P(w_n|w_{n−1}) = c(w_{n−1} w_n) / c(w_{n−1}) • where c(·) is the count in the corpus. • By the Bigram Assumption, • P(w_n|w_1 … w_{n−1}) = P(w_n|w_{n−1}) • By the Trigram Assumption, • P(w_n|w_1 … w_{n−1}) = P(w_n|w_{n−2} w_{n−1})

  15. Tagged Corpus • P(w_n|w_{n−1}, t_{n−1}) = c(w_{n−1} t_{n−1} w_n) / c(w_{n−1} t_{n−1}) • Using the Bigram Assumption, • P(w_n|w_1 … w_{n−1}, t_1 … t_{n−1}) = P(w_n|w_{n−1}, t_{n−1}) • Using the Trigram Assumption, • P(w_n|w_1 … w_{n−1}, t_1 … t_{n−1}) = P(w_n|w_{n−2} w_{n−1}, t_{n−2} t_{n−1})
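Both prediction schemes boil down to picking the continuation with the highest bigram count after the context, since the denominator c(context) is the same for every candidate. A minimal sketch with hypothetical toy corpora; if each token is a (word, tag) pair, the same counting code conditions on the previous word together with its tag.

```python
from collections import Counter

def predict_next(context, corpus_sentences):
    """Bigram word prediction: argmax_w c(context, w) / c(context).
    The denominator is constant, so the argmax over raw counts suffices.
    Tokens may be plain words or (word, tag) pairs."""
    bigrams = Counter()
    for sent in corpus_sentences:
        for a, b in zip(sent, sent[1:]):
            bigrams[(a, b)] += 1
    cands = {w: c for (a, w), c in bigrams.items() if a == context}
    return max(cands, key=cands.get) if cands else None
```

Running the same function over a tagged corpus (tuples as tokens) illustrates why the tagged LM can do better: the tag disambiguates contexts that share a surface word.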

  16. Examples • Example 1: • Tagged context: TO0_to VBI_be CJC_or XX0_not TO0_to → predicted: VBI_be • Untagged context: to be or not to → predicted: The • Example 2: • Tagged context: AJ0_complete CJC_and AJ0_utter → predicted: NN1_contempt • Untagged context: complete and utter → predicted: Loud

  17. Examples Cont. • Example 3: • Tagged context: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite → predicted: NN1_gardening • Untagged context: who is your favourite → predicted: is

  18. Results • Raw text LM : • Word Prediction Accuracy: 13.21% • POS tagged text LM : • Word Prediction Accuracy : 15.53%

  19. Part 4 : A-star Implementation

  20. Problem Statement • The goal is to see which algorithm is better for POS tagging, Viterbi or A*. • Look upon the columns of POS tags above all the words as forming the state space graph. • The start state S is '^' and the goal state G is '$'. • Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state space graph. • g(N) is the cost of the best path found so far to N from '^'. • Run A* with this heuristic and see the result. • Compare the result with Viterbi.

  21. A-Star Implementation • Precision: 0.937254 • F-measure: 0.937254

  22. Screen shot of Confusion Matrix

  23. Heuristics • h = g * (N − n) / n • where N is the length of the sentence and n is the index of the current word in the sentence.
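A* over the tag lattice with this heuristic can be sketched as follows. Edge costs are negative log probabilities; the heuristic from the slide is written with L for sentence length (to keep N free for "node"). This is an illustrative sketch with hypothetical toy tables, and note the heuristic is not guaranteed admissible, so A* may not always return the Viterbi-optimal path.

```python
import heapq
import math

def astar_tag(words, tags, trans, emit):
    """A* over the POS lattice: nodes are (position, tag), edge cost is
    -log(P(t|prev) * P(w|t)), and h = g * (L - n) / n as on the slide."""
    L = len(words)
    # frontier entries: (f = g + h, g, position, tag, path so far)
    frontier = []
    for t in tags:
        p = trans.get(('^', t), 0) * emit.get((t, words[0]), 0)
        if p > 0:
            g = -math.log(p)
            heapq.heappush(frontier, (g + g * (L - 1) / 1, g, 1, t, [t]))
    while frontier:
        f, g, n, t, path = heapq.heappop(frontier)
        if n == L:
            return path  # first goal popped: best f reaching '$'
        for nt in tags:
            p = trans.get((t, nt), 0) * emit.get((nt, words[n]), 0)
            if p > 0:
                g2 = g - math.log(p)
                h2 = g2 * (L - (n + 1)) / (n + 1)
                heapq.heappush(frontier, (g2 + h2, g2, n + 1, nt, path + [nt]))
    return None
```

Unlike Viterbi, which expands every (position, tag) cell, A* with a sharp heuristic can reach '$' while leaving much of the lattice unexplored.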

  24. A-star Vs. Viterbi

  25. Part 5 : YAGO

  26. Problem Statement • Take as input two words and show A PATH between them, listing all the concepts encountered on the way. • For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog-dog-mammal-cat-cheshire cat'. Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!!).
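Finding such a concept path reduces to a shortest-path search over the ontology's relation graph. A minimal sketch using breadth-first search over a tiny hand-made edge list standing in for YAGO; the edge data and function name are assumptions for illustration.

```python
from collections import deque

def concept_path(start, goal, edges):
    """Breadth-first search for a shortest path between two concepts in an
    undirected relation graph (a toy stand-in for the YAGO taxonomy)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):  # sorted for determinism
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # concepts not connected
```

BFS guarantees the returned path has the fewest hops, which matches the intuition that 'bulldog' and 'cheshire cat' meet at a shared ancestor like 'mammal'.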

  27. Part 6: Parser Projection

  28. Example • English: Dhoni is the captain of India. • Hindi: dhoni bhaarat ke kaptaan hai. • Hindi parse: [ [ [dhoni]NN ]NP [ [ [ [bhaarat]NNP ]NP [ke]P ]PP [kaptaan]NN ]NP [hai]VBZ ]VP ]S • English parse: [ [ [Dhoni]NN ]NP [ [is]VBZ [ [the]ART [captain]NN ]NP [ [of]P [ [India]NNP ]NP ]PP ]VP ]S

  29. Problems and Conclusions • Many idioms in English are translated word for word, even though they mean something else. • E.g. phrases like "break a leg", "he lost his head", "French kiss", "flip the bird". • Noise because of misalignments.

  30. Natural Language Toolkit • The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. • NLTK includes graphical demonstrations and sample data. • It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit. • It provides lexical resources such as WordNet. • It has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

