arTenTen: A new, vast corpus for Arabic



arTenTen: A new, vast corpus for Arabic. Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel. MIT / Columbia / Lexical Computing Ltd. / Univ. Saarlandes / Masaryk Univ., CZ. We all want corpora to be bigger and better: more text types, richer metadata, cleaner.

Presentation Transcript


arTenTen: A new, vast corpus for Arabic

Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel

MIT / Columbia / Lexical Computing Ltd. / Univ. Saarlandes / Masaryk Univ., CZ


We all want corpora to be

  • Bigger

  • Better

    • More text types

    • Richer metadata

    • Cleaner

    • Better linguistic processing


Arabic

  • Since 2003: Arabic Gigaword

    • Good on most fronts except variety

    • Newswire only

  • Leeds

    • 2005 Arabic web corpus (oldish)

  • Others

    • Mostly

      • small

      • or not available

      • or newswire


    arTenTen

    • TenTen family

      • See paper in main conference

    • Web crawled

      • SpiderLing

        • Pomikalek and Suchomel, WAC 2012

    • Cleaning and deduplication

      • jusText, Onion (Pomikalek)
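The deduplication idea can be illustrated with a toy sketch. This is not the Onion algorithm or its API, only a simplified stand-in: it assumes a paragraph is dropped when more than half of its word n-grams have already been seen earlier in the corpus. The function names, the n-gram length and the threshold are all hypothetical.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in a whitespace-tokenised text."""
    tokens = text.split()
    if len(tokens) < n:
        # Short texts are treated as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is <= threshold."""
    seen = set()
    kept = []
    for para in paragraphs:
        grams = shingles(para, n)
        if not grams:
            continue  # skip empty paragraphs
        overlap = len(grams & seen) / len(grams)
        if overlap <= threshold:
            kept.append(para)
        seen |= grams  # remember n-grams from every paragraph, kept or not
    return kept


# Invented sample paragraphs: the second is an exact repeat,
# the third shares two of its three trigrams with the first.
paragraphs = ["a b c d e", "a b c d e", "a b c d q", "x y z w v"]
unique = deduplicate(paragraphs)
print(unique)
```

Working on n-grams rather than whole paragraphs lets the filter catch near-duplicates, such as a page re-posted with one word changed, as well as exact copies.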


    Size

    • 5.8 billion space-separated tokens

    • Fully processed: 200M words

      • Tokenised, lemmatised and POS-tagged with MADA (Columbia U)

      • Sketch grammar: new work (Belinkov)

    Varieties/dialects

    • We don’t know yet


    Availability

    • In Sketch Engine

    • demo


    Encoding

    • ‘Vertical’ format

      • Sketch Engine input format

    • One word per line, tab-separated columns

      • Twenty-nine columns

    • Structural markup: XML
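A minimal sketch of reading such a file, assuming that any line which starts with `<` and ends with `>` is structural markup and every other non-empty line is a token with tab-separated columns. The `<doc>` and `<s>` tag names and the three-column sample are invented for illustration; the real corpus has twenty-nine columns.

```python
def parse_vertical(lines):
    """Yield ('struct', line) for markup lines and ('token', columns) for word lines.

    Simplification: a token line whose text happens to start with '<' and
    end with '>' would be misread as markup.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", line)
        else:
            yield ("token", line.split("\t"))


# Invented sample: one document containing one two-word sentence.
sample = [
    '<doc id="1">',
    "<s>",
    "w1\tl1\tt1",
    "w2\tl2\tt2",
    "</s>",
    "</doc>",
]

events = list(parse_vertical(sample))
tokens = [cols for kind, cols in events if kind == "token"]
print(len(tokens), "tokens;", len(events) - len(tokens), "markup lines")
```

Keeping one word per line means the file can be streamed and filtered with ordinary line-based tools, while the XML-style lines carry document and sentence boundaries.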


    For each word

    word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw,

    pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag,

    person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
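The per-word columns can be mapped to named fields as sketched below. The names are read off the slide; where the transcript runs names together they are split on the assumption that the result should give the twenty-nine columns stated under Encoding. The helper `token_to_record` and the placeholder values are hypothetical.

```python
# Column names as read off the slide. Splitting run-together names
# (e.g. "transdiac" into "trans" and "diac") is an assumption.
FIELDS = [
    "word", "trans", "diac", "lemma", "lemma_ar",
    "non_voc_lemma", "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number",
    "state", "case", "enclitic", "gloss", "source",
]


def token_to_record(line):
    """Map one tab-separated token line to a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return dict(zip(FIELDS, values))


# Usage with placeholder values (not real corpus data).
example = "\t".join(f"v{i}" for i in range(len(FIELDS)))
record = token_to_record(example)
print(record["word"], record["lemma"], record["source"])
```

Raising an error on a wrong column count makes misaligned lines visible early instead of silently shifting every field.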

