
A Text Processing Tool for the Romanian Language

Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa. David Nadeau, Institute for Information Technology, National Research Council of Canada.


Presentation Transcript


  1. A Text Processing Tool for the Romanian Language • Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa, {ofrunza,diana}@site.uottawa.ca • David Nadeau, Institute for Information Technology, National Research Council of Canada, David.Nadeau@nrc-cnrc.gc.ca

  2. Outline • BALIE System • RO-BALIE • Capabilities • Improvements • Evaluation & Results • Future Work

  3. BALIE: BaseLine Information Extraction • Multilingual information extraction system • Language identification • Tokenization • Sentence boundary detection • Part-of-speech tagging for English, French, German, Spanish [1] • Trainable, open-source Java system • Uses WEKA [2], a machine learning toolkit • Uses QTag [3], a language-independent probabilistic part-of-speech tagger

  4. BALIE: BaseLine Information Extraction (cont.) • Input Example: 1.Introduction Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.

  5. BALIE: BaseLine Information Extraction (cont.) • Output
     <?xml version="1.0" ?>
     <balie>
       <tokenList>
         <s>
           <token type="2" pos="number" canon="1">1</token>
           <token type="1" pos="period" canon=".">.</token>
           <token type="2" pos="noun" canon="introduction">Introduction</token>
         </s>
         <s>
           <token type="2" pos="noun" canon="information">Information</token>
           …
         </s>
       </tokenList>
     </balie>

  6. RO-BALIE • Improvements • Easier manipulation of the input and output texts • A new tag set mapped from the numerical tag set that BALIE uses internally • More information in the output provided by the system • Available at: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html

  7. RO-BALIE • Language Identification • Character 2-grams (sequences of 2 characters) • Naïve Bayes classifier • Overall accuracy: 99.25%
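The language identification step on this slide can be illustrated with a small sketch. The code below is not the RO-BALIE implementation (which is written in Java and uses WEKA); it is a minimal Python illustration of a multinomial Naive Bayes classifier over character 2-grams. The class name `BigramNaiveBayes`, the add-one smoothing, and the two toy training sentences are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return the overlapping 2-character sequences of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (add-one smoothing)."""

    def __init__(self):
        self.bigram_counts = defaultdict(Counter)  # language -> bigram counts
        self.doc_counts = Counter()                # language -> training texts seen
        self.vocabulary = set()                    # all bigrams seen in training

    def train(self, text, language):
        grams = char_bigrams(text)
        self.bigram_counts[language].update(grams)
        self.doc_counts[language] += 1
        self.vocabulary.update(grams)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, float("-inf")
        for language, counts in self.bigram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in char_bigrams(text):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data invented for illustration; not the RO-BALIE corpus.
model = BigramNaiveBayes()
model.train("acesta este un text scris in limba romana si este foarte scurt", "Romanian")
model.train("this is a short text written in the english language", "English")

predicted = model.classify("limba romana este frumoasa")
print(predicted)
```

With realistic amounts of training text per language, the same scoring rule applies unchanged; only the counts grow.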

  8. RO-BALIE (cont.) • Tokenization • Split each compound word based on “-” and “/” • Examples: iat-o, socio-economic • Tokenization results: (results table not preserved in this transcript)
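The splitting rule for compound words can be sketched as follows. This is a hedged Python illustration, not RO-BALIE's tokenizer: the function name `tokenize` is hypothetical, and keeping the delimiter as a separate token is an assumption, since the slide does not say what happens to “-” and “/” after the split.

```python
import re

# Split on "-" and "/" inside each whitespace-separated word.
# Assumption of this sketch: the delimiter is kept as its own token.
COMPOUND_DELIMITERS = re.compile(r"([-/])")


def tokenize(text):
    """Split text on whitespace, then split compound words on '-' and '/'."""
    tokens = []
    for word in text.split():
        parts = COMPOUND_DELIMITERS.split(word)
        tokens.extend(part for part in parts if part)  # drop empty strings
    return tokens


tokens = tokenize("iat-o socio-economic")
print(tokens)
```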

  9. RO-BALIE (cont.) • Sentence Boundary Detection • Training – 106 hand-tagged English sentences • Decision Tree Classifier • Features • Beginning of the sentence – first token • Previous token • Current token • Next token

  10. RO-BALIE (cont.) • Sentence Boundary Detection (cont.) • Feature values • Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc. • A list of 510 Romanian abbreviations • Evaluation on Orwell’s novel 1984
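To make the feature design of the two sentence boundary slides concrete, here is a minimal Python sketch of how the four features (first token of the sentence, previous, current, and next token) could be mapped to the listed feature values. It only builds the feature dictionary that a decision tree would consume; it does not train a classifier. The function names, the `Other` and `None` values, the chosen quote characters, and the four-entry abbreviation set are all assumptions for illustration, not the system's actual 510-entry list or code.

```python
# Illustrative subset only; the 510-entry Romanian list is not reproduced here.
ABBREVIATIONS = {"dr", "nr", "etc", "dl"}
OPEN_QUOTES = {"„", "«"}
CLOSE_QUOTES = {"”", "»"}


def feature_value(token):
    """Map a token to one coarse feature value."""
    if token is None:
        return "None"  # no token at this position (start or end of text)
    if token == ".":
        return "Period"
    if token == "\n":
        return "NewLine"
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"


def boundary_features(tokens, index, sentence_start):
    """Features for deciding whether tokens[index] ends a sentence."""
    previous = tokens[index - 1] if index > 0 else None
    following = tokens[index + 1] if index + 1 < len(tokens) else None
    return {
        "first": feature_value(tokens[sentence_start]),
        "previous": feature_value(previous),
        "current": feature_value(tokens[index]),
        "next": feature_value(following),
    }


tokens = ["Dr", ".", "Popescu", "a", "sosit", ".", "Apoi", "a", "plecat", "."]
features_after_abbreviation = boundary_features(tokens, 1, 0)
features_at_sentence_end = boundary_features(tokens, 5, 0)
print(features_after_abbreviation)
print(features_at_sentence_end)
```

The two example positions differ only in the `previous` feature (`Abbreviation` versus `Other`), which is the kind of distinction the abbreviation list is meant to give the classifier.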

  11. RO-BALIE (cont.) • Part-of-speech tagging with the QTag tagger • Used a corpus of 40 million words of newspaper articles • Romanian newspapers covering a 3-year period • The training corpus is 98% accurate • Our system has a tagset of 14 tags for POS and 30 tags for punctuation

  12. RO-BALIE (cont.) • Output for: Apel tirziu si inutil NISTORESCU.
      <?xml version="1.0" ?>
      <balie>
        <Language ID="Romanian">
          <tokenList>
            <Tokens Count="896">
              <s id="1">
                <token type="2" pos="NN" canon="apel">Apel</token>
                <token type="2" pos="ADV" canon="tirziu">tirziu</token>
                <token type="2" pos="CJ" canon="si">si</token>
                <token type="2" pos="NN" canon="inutil">inutil</token>
                <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
                <token type="1" pos="PER" canon=".">.</token>
              </s>
            </Tokens>
          </tokenList>
        </Language>
      </balie>
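A downstream consumer could read this output with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` and a shortened copy of the slide's sample (three tokens instead of six); the variable names are illustrative, and nothing here is part of RO-BALIE itself.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample output shown on the slide.
SAMPLE_OUTPUT = """<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>
"""

root = ET.fromstring(SAMPLE_OUTPUT)

# The detected language is an attribute of the <Language> element.
language = root.find("Language").get("ID")

# Collect (surface form, POS tag, canonical form) for every token, per sentence.
sentences = []
for sentence in root.iter("s"):
    sentences.append(
        [(tok.text, tok.get("pos"), tok.get("canon")) for tok in sentence.iter("token")]
    )

print(language)
print(sentences)
```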

  13. RO-BALIE (cont.) • Future Work • Use machine learning for the tokenization task • Add new services: morphological analysis, named entity recognition, etc. • Add more specific information for each supported language.

  14. RO-BALIE (cont.) • References
      1. http://balie.sourceforge.net/index.html
      2. http://www.cs.waikato.ac.nz/~ml/weka/
      3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
      RO-BALIE: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html

  15. THANK YOU!
