
Using Technology Transfer to Advance Automatic Lemmatisation for Setswana


Presentation Transcript


  1. Using Technology Transfer to Advance Automatic Lemmatisation for Setswana

  2. Overview
     • Introduction
     • Lemmatisation
       • Lemmatisation in Setswana
       • Lemmatisation in Afrikaans
     • Methodology
       • Memory-based Learning
       • Architecture
       • Data
       • Implementation
     • Conclusion

  3. Introduction I (31 March 2009; Athens)
     • South Africa has 11 official languages
     • English has the most human language technology (HLT) resources
     • Situation is changing
     • SA Government is supporting initiatives to develop core linguistic resources and technologies

  4. Introduction II
     • Focus: using technology transfer for
       • Improving existing linguistic resources
       • Fast-tracking development
     • Improving an existing Setswana lemmatiser by applying a method developed for Afrikaans

  5. Lemmatisation: Overview
     • Process whereby the inflected forms of a word are normalised to their lemma, or base form
       • swim, swimming, swam -> swim
     • Lemmatisation is an important process for many NLP tasks
       • Information Retrieval
       • Morphological Analysis

  6. Lemmatisation: Overview
     • Not to be confused with stemming
       • Stemming: the process whereby a word is reduced to its stem by removing both inflectional and derivational morphemes
     • Two popular approaches to lemmatisation
       • Rule-based approach
       • Statistical/data-driven approach

  7. Lemmatisation: Setswana
     • First automatic lemmatiser for Setswana developed by Brits (2006)
     • Found that only stems (and not roots) can act independently as words
       • Stems should be accepted as lemmas
     • Brits formalised rules for determining lemmas
       • Implemented as finite-state transducers
     • Accuracy: 62.17% when evaluated on a dataset containing 295 randomly selected words

  8. Lemmatisation: Afrikaans
     • 2003: Ragel, with an accuracy of 67% when evaluated on a 1,000-word data set
     • Disappointing accuracy motivated development of another lemmatiser using a different approach
     • New lemmatiser called Lia
       • Based on a data-driven machine learning method
       • 73,000 lemma-annotated words
       • Accuracy of 92.8% on new data
     • Motivated the application of machine learning methods for lemmatisation in Setswana

  9. Methodology: Memory-based Learning
     • Based on the k-NN (k-nearest neighbours) algorithm
     • All instances of a certain problem correspond to points in an n-dimensional space
     • Nearest neighbours are computed by some form of distance metric
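
The k-NN idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the slides only say "some form of distance metric", so the overlap distance used here (the number of feature positions that differ) is an assumption, as are the function names and the toy data.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions at which two equal-length instances differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(instance, memory, k=3):
    """Return (majority class, class votes) among the k stored instances nearest to `instance`."""
    nearest = sorted(memory, key=lambda item: overlap_distance(instance, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], votes

# Toy memory of (feature vector, class) pairs; the values are illustrative only.
memory = [
    (("a", "b", "c"), "X"),
    (("a", "b", "d"), "X"),
    (("x", "y", "z"), "Y"),
]
label, votes = knn_classify(("a", "b", "e"), memory, k=3)
```

"Memory-based" here means that training only stores the instances; all the work happens at classification time, when the stored instances are ranked by distance to the new one.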

  10. Methodology: Architecture
     • Based on the k-NN (k-nearest neighbours) algorithm
     • All instances of a certain problem correspond to points in an n-dimensional space
     • Nearest neighbours are computed by some form of distance metric

  11. Methodology: Data
     • Memory-based learning (MBL) requires large amounts of data
     • Only 2,947 lemma-annotated Setswana words available (Brits’s evaluation set)
     • 2,947 words is a very small data set in memory-based learning terms

  12. Methodology: Data
     • MBL requires that lemmatisation be performed as a classification task
     • Data should consist of feature vectors with assigned class labels
       • Feature vectors: letters of the word
       • Class label: transformation from word to lemma
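
One way to turn "letters of the word" into a fixed-length feature vector is sketched below. The slide does not say how words of different lengths are aligned, so the right alignment, the width of 12 and the "_" padding symbol are all assumptions made for illustration, and `word_to_features` is a hypothetical helper name.

```python
def word_to_features(word, width=12, pad="_"):
    """Right-align the letters of `word` in a vector of exactly `width` features."""
    letters = list(word[-width:])  # keep only the last `width` letters of long words
    return [pad] * (width - len(letters)) + letters

features = word_to_features("swimming", width=10)
```

A fixed width is what lets every word be treated as a point in the same n-dimensional space, which a distance-based learner requires.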

  13. Methodology: Data
     • Deriving class labels
       • Based on the longest common substring of the word form and its lemma
       • The class indicates the string that needs to be removed, as well as any replacement string, in the transformation from word form to lemma
       • Positions of the character strings that need to be removed are indicated as L (left) or R (right)
       • If the word form and lemma are identical, the awarded class is “0”
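
The derivation of class labels can be sketched as follows, using the English example from the overview slide (swim, swimming, swam). The exact label syntax used in the original work is not given in the transcript, so the format here (`L<removed>><replacement>` and `R<removed>><replacement>`, joined with `+`) is a hypothetical encoding chosen for illustration; only the L/R markers and the "0" class come from the slides.

```python
from difflib import SequenceMatcher

def derive_class(word, lemma):
    """Derive a transformation class from a word form and its lemma."""
    if word == lemma:
        return "0"  # identical word form and lemma
    # Longest common substring of the two strings.
    m = SequenceMatcher(None, word, lemma, autojunk=False).find_longest_match(
        0, len(word), 0, len(lemma)
    )
    # Whatever surrounds the common substring must be removed (word side)
    # and possibly replaced (lemma side).
    w_left, w_right = word[:m.a], word[m.a + m.size:]
    l_left, l_right = lemma[:m.b], lemma[m.b + m.size:]
    parts = []
    if w_left or l_left:
        parts.append(f"L{w_left}>{l_left}")
    if w_right or l_right:
        parts.append(f"R{w_right}>{l_right}")
    return "+".join(parts)

classes = [derive_class(w, "swim") for w in ("swim", "swimming", "swam")]
```

With this encoding, "swimming" gets a class that removes "ming" on the right, and "swam" gets one that replaces "am" with "im" on the right.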

  14. Methodology: Data
     • Deriving classes

  15. Methodology: Implementation
     • Data
       • 90% for training
       • 10% for evaluation
     • First version (default algorithmic parameters)
       • 46.25% accuracy
     • Parameter optimisation
       • 58.98% accuracy
     • Accuracy is below that of Brits's rule-based version (62.17%)
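
The 90/10 split and the accuracy measure can be sketched as below. Whether the original split was random is not stated, so the seeded shuffle is an assumption, and the helper names are hypothetical. Applied to the 2,947 available words, a 90/10 split gives 2,652 training and 295 evaluation items, which matches the training-set size quoted in the conclusion.

```python
import random

def split_data(instances, train_fraction=0.9, seed=0):
    """Shuffle a copy of the data and split it into training and evaluation parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, gold):
    """Percentage of predictions that exactly match the gold-standard values."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

# Stand-in for the 2,947 annotated words: only the sizes matter here.
train, test = split_data(range(2947))
```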

  16. Methodology: Implementation
     • Error analysis indicated obvious mistakes

  17. Methodology: Implementation
     • Solution: add class distributions to the output and implement a “back-off” mechanism
     • Resulted in a further increase in accuracy, to 64.06%
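
The transcript does not spell out how the back-off works. One plausible reading, sketched here purely as an assumption, is that the classifier returns a distribution over classes and the lemmatiser walks down it from most to least likely, taking the first class that can actually be applied to the word. The label syntax (`L<removed>><replacement>`, `R<removed>><replacement>`, joined with `+`, and "0" for no change) and both function names are hypothetical.

```python
def apply_label(word, label):
    """Apply a transformation class to `word`; return None if it does not fit."""
    if label == "0":
        return word
    lemma = word
    for part in label.split("+"):
        side, rest = part[0], part[1:]
        remove, _, add = rest.partition(">")
        if side == "L" and lemma.startswith(remove):
            lemma = add + lemma[len(remove):]
        elif side == "R" and lemma.endswith(remove):
            lemma = lemma[:len(lemma) - len(remove)] + add
        else:
            return None  # the string to remove is not present at that position
    return lemma

def lemmatise_with_backoff(word, distribution):
    """Try classes from most to least likely; fall back to the word itself."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    for label, _ in ranked:
        lemma = apply_label(word, label)
        if lemma is not None:
            return lemma
    return word

# Illustrative class distribution (class -> neighbour votes) for one word.
result = lemmatise_with_backoff("swimming", {"Red>": 3, "Rming>": 2, "0": 1})
```

In this example the top-ranked class cannot apply, because "swimming" does not end in "ed", so the second-ranked class is used instead.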

  18. Conclusion
     • The machine learning-based lemmatiser is only 1.9 percentage points more accurate than the rule-based version (64.06% vs 62.17%)
     • Small in comparison to the increase of about 25 percentage points obtained for Afrikaans
     • Size of the training data
       • 2,652 words compared to 73,000 for Afrikaans
     • Increasing the amount of training data will increase the accuracy
     • Most important result: technology transfer

  19. Acknowledgements
     • The work of Jeanetta H. Brits, performed under the supervision of Rigardt Pretorius and Gerhard B. van Huyssteen
