
Comparable corpora and its application

Comparable corpora and its application. Guide: Dr. Pushpak Bhattacharya. Presented by Srijit Dutt(10305056) Janardhan Singh(10305067) Ashutosh Nirala (10305906) Brijesh Bhatt(10405301). Outline. Motivation Comparable Corpora (Non-parallel Corpora) Basic Architecture


Presentation Transcript


  1. Comparable corpora and its application • Guide: Dr. Pushpak Bhattacharya • Presented by Srijit Dutt (10305056), Janardhan Singh (10305067), Ashutosh Nirala (10305906), Brijesh Bhatt (10405301)

  2. Outline • Motivation • Comparable Corpora (Non-parallel Corpora) • Basic Architecture • Geometrical view • Improvements

  3. Motivation • A corpus is the holy grail of NLP • Bilingual dictionary generation • Parallel corpora • One-to-one correspondence in content • Parallel corpora are rare • Resource-constrained language pairs (Punjabi-Spanish) • Monolingual corpora are readily available • World Wide Web (non-parallel corpus) • Techniques to work on non-parallel corpora

  4. Non-parallel corpora • Characteristics • No parallel sentences • No parallel paragraphs • Fewer overlapping terms and words • Four dimensions • Author • Domain • Topic • Time • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  5. Comparable Corpora OneIndia.in

  6. Comparable Corpora Navbharat Times

  7. Postulates for non-parallel corpora • Basic postulate (Fung et al. 1997) • [Diagram: terms A, B, C, D, E in text T and their counterparts A', B', C', D', E' in text T'] • If a domain-specific term A is related to another term B in some text T, then its counterpart A' is related to B' in some other text T' • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  8. Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Diagram: terms A to E in text T and counterparts A' to E' in text T'] • If A is less associated with E, then A' is less associated with E' • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  9. Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Diagram: terms A to E in text T and counterparts A' to E' in text T'] • Given a large set of words, a word is associated with only some of those words • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  10. Using non-parallel corpora • Basic postulate (Fung et al. 1997) • [Diagram: terms A to E in text T and counterparts A' to E' in text T'] • If A is closely associated with words B and C to varying degrees, then A' is closely associated with B' and C' to the same varying degrees • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  11. Histogram (Debenture) • [Figure: frequency over the seed words, one histogram each for Corpus1 and Corpus2]

  12. Histogram (Administration) • [Figure: frequency over the seed words, one histogram each for Corpus1 and Corpus2]

  13. Co-occurrence Relation • Known seed words of both languages (online dictionary) • किताब ↔ book

  14. Co-occurrence matrix • Target language matrix: rows are words in the corpus, columns are base lexicon (dictionary) words; the row for "book" over the columns Knowledge, Tree, Library reads 1 … 0 … 1 • Source language: the co-occurrence vector of the word किताब over the base lexicon words ज्ञान, पेड़, पुस्तकालय reads 1 0 1
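A minimal sketch of how the co-occurrence vectors above could be compared across languages: the source vector is re-labelled through the base lexicon, then scored against each row of the target matrix. The function names, the "forest" row, and the use of a plain dot product as the score are assumptions made for illustration; the slides discuss other similarity measures later.

```python
def to_target_space(source_vec, base_lexicon):
    """Re-label each source seed word with its dictionary translation."""
    return {base_lexicon[w]: c for w, c in source_vec.items() if w in base_lexicon}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(c * v.get(w, 0) for w, c in u.items())


def rank_candidates(source_vec, base_lexicon, target_matrix):
    """Return target words sorted by decreasing score against the source vector."""
    mapped = to_target_space(source_vec, base_lexicon)
    scored = [(dot(mapped, row), word) for word, row in target_matrix.items()]
    return [word for score, word in sorted(scored, reverse=True)]


# Seed pairs and the vector for किताब are taken from the slide.
base_lexicon = {"ज्ञान": "knowledge", "पेड़": "tree", "पुस्तकालय": "library"}
kitab = {"ज्ञान": 1, "पेड़": 0, "पुस्तकालय": 1}
target_matrix = {
    "book": {"knowledge": 1, "tree": 0, "library": 1},
    "forest": {"knowledge": 0, "tree": 1, "library": 0},  # hypothetical extra row
}
ranking = rank_candidates(kitab, base_lexicon, target_matrix)
```

With these toy counts, "book" is ranked ahead of the hypothetical "forest" row because its vector matches that of किताब on every seed dimension.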

  15. Improvements on Basic Architecture • Co-occurrence counts • Similarity measure • Window size • Is it the same for all words? • Dictionary • Polysemous and synonymous words • What if a dictionary is not available?

  16. Context vector • [Figure: co-occurrence counts of word A with the words around it, window size 3] • B is a word that occurs in the dictionary • X is any word • Source: Automatic Identification of Word Translations, Rapp, 1999
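The windowed counting on this slide can be sketched as follows. This is an illustration under stated assumptions, not the presenters' implementation: the window is read as 3 tokens on each side of the target word, and the function name and the toy sentence are made up for the example.

```python
from collections import Counter


def context_vector(tokens, target, seed_words, window=3):
    """Count how often each seed word appears within `window` tokens of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Only dictionary (seed) words contribute a dimension to the vector.
            if j != i and tokens[j] in seed_words:
                counts[tokens[j]] += 1
    return [counts[w] for w in seed_words]


# Hypothetical toy sentence; seed words play the role of "B" on the slide.
tokens = "the library lends every book and a book holds knowledge".split()
seeds = ["knowledge", "tree", "library"]
vec = context_vector(tokens, "book", seeds, window=3)
```

A smaller window drops the more distant seed words: with `window=1` neither "library" nor "knowledge" falls inside the context of "book" in this sentence.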

  17. Co-occurrence Counts • Mutual Information (Church et al 1989) • Conditional Probability (Fung et al 1996) • Chi-Square Test (Dunning et al 1993) • Log-likelihood Ratio (Rapp 1998) • TF-IDF (Fung et al 1997)

  18. Conditional Probability • k11 = number of co-occurrences of word ws and word wt • k12 = corpus frequency of word ws − k11 • k21 = corpus frequency of word wt − k11 • k22 = size of corpus (no. of tokens) − corpus frequency of ws − corpus frequency of wt • Marginal and joint probabilities are derived from these counts • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997
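The counts defined on this slide can be computed as in the sketch below. The k values follow the slide's definitions as written; the normalisation by corpus size for the joint and marginal probabilities, and the function names, are assumptions, since the slide's own formulas are not preserved in the transcript.

```python
def contingency(cooc, freq_s, freq_t, corpus_size):
    """Return (k11, k12, k21, k22) following the slide's definitions."""
    k11 = cooc
    k12 = freq_s - k11
    k21 = freq_t - k11
    k22 = corpus_size - freq_s - freq_t
    return k11, k12, k21, k22


def probabilities(cooc, freq_s, freq_t, corpus_size):
    """Joint, marginal and conditional estimates (assumed: relative frequencies)."""
    return {
        "joint": cooc / corpus_size,        # P(ws, wt)
        "p_s": freq_s / corpus_size,        # P(ws)
        "p_t": freq_t / corpus_size,        # P(wt)
        "cond_t_given_s": cooc / freq_s,    # P(wt | ws)
    }


# Hypothetical counts: ws occurs 30 times, wt 50 times, 10 times together.
table = contingency(10, 30, 50, 1000)
probs = probabilities(10, 30, 50, 1000)
```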

  19. Co-occurrence Counts • Mutual information • TF-IDF Finding terminology translations from non parallel corpora, Fung et al, 1997

  20. Co-occurrence Counts • Log-likelihood [formula not preserved in the transcript], where • k11 = number of co-occurrences of word ws and word wt • k12 = corpus frequency of word ws − k11 • k21 = corpus frequency of word wt − k11 • k22 = size of corpus (no. of tokens) − corpus frequency of ws − corpus frequency of wt • Source: Automatic Identification of Word Translations, Rapp, 1999
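Since the slide's formula did not survive, the sketch below uses the standard log-likelihood ratio statistic for a 2×2 contingency table, 2 · Σ k_ij · ln(k_ij · N / (R_i · C_j)), with R_i and C_j the row and column totals and N their sum. It is assumed, not confirmed by the transcript, that this is the form shown on the slide; the function name is illustrative.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """Standard log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k * ln k is 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


independent = log_likelihood(10, 20, 20, 40)   # counts proportional: no association
associated = log_likelihood(30, 0, 0, 60)      # words always occur together
```

The score is zero when the two words co-occur exactly as often as chance predicts and grows as the observed counts depart from that expectation.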

  21. Similarity Measures • Cosine similarity • Jaccard similarity • Euclidean (L2) • Manhattan (L1, city-block)
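The four measures listed above can be written out for plain count vectors as below. The slide does not say which Jaccard variant is meant; the weighted (min over max) form is used here as an assumption because it applies directly to count vectors.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(u, v):
    """Weighted Jaccard similarity: sum of minima over sum of maxima."""
    denom = sum(max(a, b) for a, b in zip(u, v))
    return sum(min(a, b) for a, b in zip(u, v)) / denom if denom else 0.0


def euclidean(u, v):
    """L2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """L1 (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


u, v, w = [1, 0, 1], [1, 0, 1], [0, 1, 0]
```

Cosine and Jaccard are similarities (higher means closer), while the L2 and L1 measures are distances (lower means closer), so candidate rankings have to be sorted in opposite directions depending on the choice.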

  22. Window Size • What is the ideal context size? • Should the same window size be used for all words? • “amount”: more frequent • “debenture”: less frequent • [Figure: axis labelled window size]

  23. Dependency Tree • [Figure: dependency tree] • Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

  24. Modeling context using dependency tree • The four position vectors are mapped as follows: • −1: immediate parent • +1: immediate child • −2: grandparent • +2: grandchild • Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009
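A small sketch of the position mapping above, assuming the parse is given as a child-to-head dictionary and that every token in the sentence is unique (so words can serve as node identifiers). The function name, the toy sentence, and its hand-written parse are illustrative assumptions, not output from a real parser.

```python
def dependency_context(heads, word):
    """Collect the words at tree positions -2, -1, +1 and +2 relative to `word`."""
    children = {}
    for child, head in heads.items():
        children.setdefault(head, []).append(child)

    parent = heads.get(word)
    grandparent = heads.get(parent) if parent is not None else None
    kids = children.get(word, [])
    grandkids = [g for k in kids for g in children.get(k, [])]
    return {
        -2: [grandparent] if grandparent is not None else [],
        -1: [parent] if parent is not None else [],
        +1: kids,
        +2: grandkids,
    }


# Hand-written parse of "the big dog chased a cat"; the root "chased" has no head.
heads = {"the": "dog", "big": "dog", "dog": "chased", "a": "cat", "cat": "chased"}
ctx_dog = dependency_context(heads, "dog")
ctx_root = dependency_context(heads, "chased")
```

In a full system each of the four positions would feed its own count vector over seed words, in place of the single window-based vector.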

  25. Context vector vs. dependency parsing • [Figure: comparison of the two context models] • Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

  26. Dependency Tree • Context is captured better by dependency information than by adjacent words • Long-distance dependencies capture associated words • Parent and child relationships carry over between languages with different word orders • Higher accuracy • Source: Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences, N. Garera et al., 2009

  27. Dictionary as seed word list (issues) • Multiple translations • Polysemous words • Words in one text may not be present in the other • Words may not appear in their dictionary form • Source: Finding terminology translations from non-parallel corpora, Fung et al., 1997

  28. Geometrical View (translation) • [Figure] • Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

  29. Geometric View (Extended Approach) (translation) • [Figure] • Source: A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora, E. Gaussier, J. M. Renders, I. Matveeva, C. Goutte, H. Dejean, 2004

  30. Translation without dictionary • What if a dictionary is not available? • Find a language for which a dictionary is available • Use that language as an intermediate language between the source and target languages

  31. Use of pivot language • Problem: a bilingual lexicon is unavailable • Use a pivot language for which a bilingual lexicon is available • [Diagram: X (source language) → Y (pivot language) → Z (target language)] • What if Y is polysemous? • Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010

  32. Use of pivot language • Source: Hindi • Pivot: English • X = प्रकाश • Y = light • [Diagram: X (source language) → Y (pivot language) → Z (target language)] • Lexicons are intransitive, which results in noisy translations • Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010
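The noise introduced by a polysemous pivot word can be seen in a few lines. The sketch composes a source-to-pivot lexicon with a pivot-to-target lexicon; the Spanish target entries and the function name are hypothetical additions for illustration, only the pair प्रकाश → light comes from the slide.

```python
def pivot_translate(word, src_to_pivot, pivot_to_tgt):
    """Compose two lexicons: collect every target word reachable via the pivot."""
    candidates = set()
    for pivot_word in src_to_pivot.get(word, []):
        candidates.update(pivot_to_tgt.get(pivot_word, []))
    return candidates


hindi_to_english = {"प्रकाश": ["light"]}
# Hypothetical target lexicon: English "light" is polysemous, so it maps to both
# "luz" (illumination) and "ligero" (not heavy).
english_to_spanish = {"light": ["luz", "ligero"]}

candidates = pivot_translate("प्रकाश", hindi_to_english, english_to_spanish)
```

Only "luz" is a valid translation of प्रकाश here; "ligero" enters the candidate set purely because the composition is not transitive.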

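  33. Corpus to handle intransitivity • C1: source corpus • C2: target corpus • [Diagram: X is mapped through the pivot to the candidates {Z1, Z2}; S(X) is computed from C1, the candidate signatures from C2] • S(X) = signature of X • Z1, Z2: target signatures • NAS(s, t) selects Z, the winning signature [formula not preserved in the transcript] • Source: Bilingual Lexicon Generation Using Non-Aligned Signatures, Shezaf et al., 2010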
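One way to read the signature comparison is sketched below: the score is the fraction of words in the source signature that have at least one lexicon translation inside the candidate's signature, and the candidate with the highest score wins. This is an assumed reading of the non-aligned signature score and should be checked against Shezaf and Rappoport's paper; all signatures, lexicon entries, and names in the code are hypothetical.

```python
def nas(sig_s, sig_t, lexicon):
    """Fraction of source-signature words with a translation in the target signature."""
    if not sig_s:
        return 0.0
    hits = sum(1 for w in sig_s if lexicon.get(w, set()) & sig_t)
    return hits / len(sig_s)


def best_candidate(sig_s, candidate_sigs, lexicon):
    """Pick the candidate whose signature scores highest against sig_s."""
    return max(candidate_sigs, key=lambda z: nas(sig_s, candidate_sigs[z], lexicon))


# Hypothetical signature of X = प्रकाश, drawn from the source corpus C1.
sig_x = {"सूर्य", "दीपक", "अंधेरा"}
# Hypothetical (possibly noisy) lexicon from source words to target words.
lexicon = {"सूर्य": {"sol"}, "दीपक": {"lámpara"}, "अंधेरा": {"oscuridad"}}
# Hypothetical signatures of the candidates Z1, Z2, drawn from the target corpus C2.
candidate_sigs = {
    "luz": {"sol", "lámpara", "sombra"},
    "ligero": {"peso", "pluma", "rápido"},
}
winner = best_candidate(sig_x, candidate_sigs, lexicon)
```

The signatures do not need to be aligned dimension by dimension, which is what lets a noisy pivot-derived lexicon still separate the two candidates.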

  34. Limitation of Context-based Approach • Relies on the lexical context around translation candidates • Words may appear in similar contexts without being translations of each other, which leads to false translations • E.g., on a Chinese-English comparable corpus (using the definition of Fung 1995), the distance between vectors 1 and 2 is 0.084, which is greater than the distance between vectors 1 and 3, which is 0.075 • Uses no syntactic information beyond the bag of words • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  35. Dependency Heterogeneity • Dependency heterogeneity phenomenon: a word in the source language shares similar heads and modifiers with its translation in the target language, whether or not they occur in similar contexts • Uses rich syntactic information • E.g.: • big (MOD) brown (MOD) dog (HEAD) • bird (MOD) song (HEAD) • song (MOD) bird (HEAD) • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009
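The head and modifier profiles behind this idea can be gathered from (modifier, head) pairs as in the sketch below, which reuses the slide's three example phrases. The function name and the pair representation are assumptions; a real system would take the pairs from a dependency parser.

```python
from collections import Counter


def head_modifier_profile(pairs, word):
    """Return (heads that `word` modifies, modifiers attached to `word`) as counts."""
    heads = Counter(h for m, h in pairs if m == word)
    modifiers = Counter(m for m, h in pairs if h == word)
    return heads, modifiers


# (modifier, head) pairs from the slide's examples.
pairs = [("big", "dog"), ("brown", "dog"), ("bird", "song"), ("song", "bird")]

dog_heads, dog_mods = head_modifier_profile(pairs, "dog")
bird_heads, bird_mods = head_modifier_profile(pairs, "bird")
```

"bird song" and "song bird" contain the same two words, yet "bird" is a modifier in one and a head in the other, a distinction that a bag-of-words context cannot represent.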

  36. Does it work? • [Table: frequently used modifiers and frequently used heads] • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  37. Comparable Corpora Preprocessing • Focus is on Chinese-English bilingual dictionary extraction for single nouns • Raw corpora: Chinese and English pages from Wikipedia with inter-language links • Pipeline: morphological analyzer → POS tagger → MaltParser (to get syntactic dependencies) → refinement (to get the preprocessed corpora) • Refinements: • Stemming of translation candidates • Removal of stop words • Removal of sentences with more than k (= 30) words • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  38. Dependency Heterogeneity Vector Calculation • [Formula not preserved in the transcript], where NMOD (noun modifier), SUB (subject) and OBJ (object) are the dependency labels produced by MaltParser • No bilingual dictionary is needed • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  39. Bilingual Dictionary Extraction (contd) • Dependency heterogeneity distance (DH) obtained with this method: • DH(经济学, economics) = 0.222 • DH(经济学, medicine) = 0.496 • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  40. Results of Bilingual Dictionary Extraction • Performed on 250 Chinese/English single-noun pairs • only-mod: (H_NMODMod) • only-head: (H_NMODHead, H_SUBHead, H_OBJHead) • only-NMOD: (H_NMODHead, H_NMODMod) • Source: Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity, Kun Yu & Junichi Tsujii, 2009

  41. Result

  42. Conclusion • The use of non-parallel corpora is inevitable and reduces the effort of developing parallel corpora • Modern techniques achieve accuracy of up to 70% with non-parallel corpora • Polysemy and sense disambiguation remain major challenges • Implementations are difficult to compare because the languages and corpora they use differ in nature

  43. References • Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202. • Fung, P.; Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. In: Proceedings of COLING-ACL 1998, Montreal, Vol. 1, 414-420. • R. Rapp. 1999. Automatic Identification of Word Translations from Unrelated English and German Corpora. In Proc. of the ACL-99. pp. 1–17. College Park, USA.

  44. References • Gaussier, Eric, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Herve Dejean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 527–534, Barcelona, Spain. • X. Robitaille, Y. Sasaki, M. Tonoike, S. Sato and T. Utsuro. 2006. Compiling French Japanese Terminologies from the Web. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics. • E. Morin, B. Daille, K. Takeuchi and K. Kageura. 2007. Bilingual Terminology Mining – Using Brain, not Brawn Comparable Corpora. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. pp. 664-671.

  45. References • Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Conference on Natural Language Learning (CoNLL), Boulder, Colorado. • K. Yu and J. Tsujii. 2009. Extracting Bilingual Dictionary from Comparable Corpora with Dependency Heterogeneity. Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT 2009). • Daphna Shezaf and Ari Rappoport. 2010. Bilingual lexicon generation using non-aligned signatures. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 98–107, Uppsala, Sweden, 11-16 July 2010.

  46. THANK YOU Questions ?
