
Building a Tagged Corpus of Russian: A Bazaar Approach



Presentation Transcript


  1. Building a Tagged Corpus of Russian: A Bazaar Approach • Chris Tessone, Modern Languages Department, Knox College • ctessone@knox.edu • Committee: Charles Mills, Don Blaheta, Jay Krumbholz, Steven Clancy (U. of Chicago)

  2. Corpora of Natural Language Texts • Applications: • Linguistics: empirical word-distribution data • Education: morphological aids for unfamiliar words • Corpora now available in many languages: • Czech National Corpus (100,000,000 words) • NEGRA (350,000 words) • Kyoto University Corpus (1,000,000 words)

  3. Problems with Traditional Corpus Development • Annotating data requires many hours of skilled labor • Resulting corpora are often licensed for non-commercial use only • Features may not match community expectations • Ethical problems with correcting errors

  4. The Bazaar Model • First used in software engineering • Fetchmail • Linux kernel • Users develop the features they need • Healthy projects require a critical mass of volunteers

  5. Advantages of the Bazaar Model for Corpus Development • Corpora can be developed quickly and inexpensively • Most important tools and data emerge first • Open licensing gives corporations incentive to help • IBM, SGI, and the Linux kernel • Netscape and Mozilla

  6. A Tagged Corpus of Russian • Texts taken from Russian LiveJournals • Varying registers • International community • Software licensed under GPL, data under Creative Commons • First such corpus of Russian texts in the world

  7. XML Basics • Similar to HTML • Each unit of data is marked by start and end tags • Start tags can include attributes • Wide range of XML-aware software available
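The XML ideas on this slide can be illustrated with Python's standard library. The element names (`sentence`, `token`) and the `pos` attribute below are invented for illustration; they are not the corpus's actual annotation schema.

```python
# A minimal sketch of the XML features listed above: start/end tags,
# attributes on start tags, and XML-aware tooling (here, xml.etree).
# The <sentence>/<token> markup is hypothetical, not the corpus schema.
import xml.etree.ElementTree as ET

doc = '<sentence id="1"><token pos="N">корпус</token></sentence>'
root = ET.fromstring(doc)

print(root.tag)            # sentence
print(root[0].get("pos"))  # N
print(root[0].text)        # корпус
```

Any XML-aware tool can read such markup, which is one reason annotated corpora commonly use it.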

  8. The Annotation Process • Removal of HTML markup • Sentence boundary annotation • Tokenization • Part of speech tagging
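The first stage of the pipeline above, HTML removal, can be sketched with the standard library's `html.parser`. The `TextExtractor` class and `strip_html` helper are hypothetical names for illustration, not the thesis's actual tooling.

```python
# Hedged sketch of the HTML-removal stage: collect only the text nodes
# of a LiveJournal-style post, discarding all markup.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of plain text between tags.
        self.parts.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)

print(strip_html("<p>Привет, <b>мир</b>!</p>"))  # Привет, мир!
```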

  9. Example Post: After HTML Removal

  10. Sentence Boundary Detection • Place sentence boundaries where ., ?, or ! precedes a capital letter • Also place boundaries where an emoticon precedes a capital letter • Disqualify a sentence boundary if the period belongs to a known abbreviation
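The three rules above can be sketched as a small regex-based splitter. The emoticon pattern and the abbreviation list are illustrative stand-ins, not the thesis's actual rules or lexicon.

```python
# Sketch of the slide's boundary rules: split after . ? ! or a simple
# emoticon when a capital letter follows, unless the period ends a
# known abbreviation. ABBREVS is an illustrative list only.
import re

ABBREVS = {"г.", "т.", "д."}  # e.g. "г. Москва", "и т. д." — illustrative

# Group 1: terminator (. ? ! or an emoticon like ":)" / ":-(" ),
# followed by whitespace and a lookahead for a Cyrillic/Latin capital.
BOUNDARY = re.compile(r'([.?!]|[:;]-?[)(])\s+(?=[А-ЯA-Z])')

def split_sentences(text):
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        candidate = text[start:m.end(1)]
        words = candidate.split()
        if words and words[-1] in ABBREVS:
            continue  # disqualify: period is part of an abbreviation
        sentences.append(candidate.strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Он пришёл. Она ушла!"))
```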

  11. Example Post: After Sentence Boundary Detection

  12. Tokenization • Any string surrounded by whitespace is tentatively considered a token • Most punctuation is also separated into its own token • Periods in abbreviations are not separated • Emoticons and ellipses are each treated as a single token
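The tokenization rules above can be sketched as follows; the emoticon pattern, the punctuation set, and the abbreviation list are illustrative assumptions, not the thesis's actual implementation.

```python
# Sketch of the slide's rules: whitespace-delimited strings are candidate
# tokens, trailing punctuation is split off, but abbreviations, emoticons,
# and ellipses are kept whole. ABBREVS is illustrative only.
import re

EMOTICON = re.compile(r'^[:;]-?[()DP]+$')
ABBREVS = {"г.", "т.", "д."}  # illustrative abbreviation list

def tokenize(sentence):
    tokens = []
    for chunk in sentence.split():
        if EMOTICON.match(chunk) or chunk == "..." or chunk in ABBREVS:
            tokens.append(chunk)  # keep as a single token
            continue
        # Separate trailing punctuation (including "...") from the word.
        m = re.match(r'^(.*?)([.,!?]+)$', chunk)
        if m and m.group(1):
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("Привет, мир!"))
```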

  13. Example Post: After Tokenization

  14. Part of Speech Tagging • Many words are ambiguous, even in Russian • segodnja 'today' • chto 'what; that' (pronoun vs. conjunction) • Part of speech depends on surrounding words • Suffix probabilities help in tagging previously unseen words
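The suffix idea can be sketched with a toy model that estimates P(tag | suffix) from a tagged training set. The tag labels and training words below are invented for illustration; they are not the thesis's tag set or data.

```python
# Toy sketch of suffix-based tag guessing for unseen words: count how
# often each word-final suffix occurs with each tag, then predict the
# most frequent tag for an unseen word's suffix. Tags/data are invented.
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, suffix_len=2):
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-suffix_len:]][tag] += 1
    return counts

def guess_tag(model, word, suffix_len=2, default="N"):
    tags = model.get(word[-suffix_len:])
    return tags.most_common(1)[0][0] if tags else default

train = [("книга", "N"), ("дорога", "N"), ("читает", "V"), ("думает", "V")]
model = train_suffix_model(train)
print(guess_tag(model, "работает"))  # V — shares the verbal suffix "ет"
```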

  15. The Viterbi Algorithm • Prunes unneeded combinations of tags, saving computation • Provably preserves the most likely tag sequence
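A minimal Viterbi sketch over a toy HMM shows both bullets: at each position only the best path into each tag is kept, yet the overall most likely sequence survives. The probabilities below are invented for illustration, not the thesis's trained model.

```python
# Minimal Viterbi over a toy HMM (hypothetical probabilities). V[i][t]
# holds the best probability of any tag path ending in tag t at word i,
# plus that path — so suboptimal prefixes are discarded early.
def viterbi(words, tags, start_p, trans_p, emit_p):
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            best = max(tags, key=lambda p: V[-1][p][0] * trans_p[p][t])
            prob = V[-1][best][0] * trans_p[best][t] * emit_p[t].get(w, 1e-6)
            row[t] = (prob, V[-1][best][1] + [t])
        V.append(row)
    return max(V[-1].values(), key=lambda x: x[0])[1]

tags = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"мама": 0.5}, "V": {"мыла": 0.5}}
print(viterbi(["мама", "мыла"], tags, start_p, trans_p, emit_p))  # ['N', 'V']
```

For n words and k tags this costs O(n·k²) instead of enumerating all kⁿ tag sequences.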

  16. Example Post: After Part of Speech Tagging

  17. Results • Training 1500 words, testing 500 words • Naïve method: 58.1% • With 2-letter suffix function: 72.6% • With variable-length suffix function: 74.0% • Training 2300 words, testing 500 words • Naïve method: 60.1% • With 2-letter suffix function: 76.7% • With variable-length suffix function: 78.3%

  18. Conclusions • Feedback has been positive • Results suggest automatic annotation is useful even with small training sets (1500 words) • New data annotated at roughly 450 words per hour overall: • Sentence boundaries at 18,000 words/hour • Tokenization at 10,000 words/hour • Part-of-speech tagging at 500 words/hour

  19. Chris Tessone Candidate for Honors in Russian Modern Languages Department Knox College ctessone@knox.edu
