arTenTen A new, vast corpus for Arabic

Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel

MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz

We all want corpora to be
  • Bigger
  • Better
    • More text types
    • Richer metadata
    • Cleaner
    • Better linguistic processing
Arabic
  • Since 2003: Arabic Gigaword
      • Good on most fronts except variety
      • Newswire only
  • Leeds
    • 2005 Arabic web corpus (oldish)
  • Others
    • Mostly
      • small
      • or not available
      • or newswire
arTenTen
  • TenTen family
      • See paper in main conference
    • Web crawled
      • SpiderLing
        • Pomikalek and Suchomel, WAC 2012
    • Cleaning and deduplication
      • jusText, Onion (Pomikalek)
Size
  • 5.8 billion space-separated tokens

Fully processed:

  • 200M words
      • Tokenised, lemmatised and POS-tagged with MADA (Columbia University)
      • Sketch grammar: new work (Belinkov)

Varieties/dialects

  • We don’t know yet
Availability
  • In Sketch Engine
  • demo
Encoding
  • ‘Vertical’ format
      • Sketch Engine input format
    • One word per line, tab-separated columns
      • Twenty-nine
    • Structural markup: XML
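
A minimal sketch of reading such a file, assuming the usual vertical layout: structural tags sit on their own lines and every other non-empty line is one token with tab-separated columns. The tag names (`<doc>`, `<s>`), the function name `read_vertical` and the sample lines are illustrative assumptions, not taken from the arTenTen data. The sample uses three placeholder columns for brevity.

```python
from typing import Iterable, Iterator, Tuple, Union

Event = Tuple[str, Union[str, list]]

def read_vertical(lines: Iterable[str]) -> Iterator[Event]:
    """Yield ("tag", text) for structural markup and ("token", columns) for word lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        # Assumption: a line that starts with "<" and ends with ">" is markup.
        if line.startswith("<") and line.endswith(">"):
            yield ("tag", line)
        else:
            # One word per line; columns are separated by tabs.
            yield ("token", line.split("\t"))

# Placeholder input, not real corpus data.
sample = [
    '<doc id="example">\n',
    "<s>\n",
    "w1\tlemma1\tnoun\n",
    "w2\tlemma2\tverb\n",
    "</s>\n",
    "</doc>\n",
]

events = list(read_vertical(sample))
tokens = [cols for kind, cols in events if kind == "token"]
```

Treating markup and tokens as a single event stream keeps the reader simple; grouping tokens into sentences or documents can then be layered on top by watching for the opening and closing tags.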
For each word

  • word (as written, in Arabic), trans, diac
  • lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem
  • tag, bw
  • pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag
  • person, aspect, vox, modus, gender, number, state, case
  • enclitic, gloss, source
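
One way to work with these columns is to give each position a name. The sketch below assumes the columns appear in the order shown on the slide, with run-together names split at the most likely boundaries (an assumption that happens to give twenty-nine names). `COLUMNS` and `to_record` are hypothetical names, and the sample row is a placeholder, not real corpus data.

```python
# Column names as read from the slide; the splitting of fused names is an assumption.
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar", "non_voc_lemma",
    "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number",
    "state", "case", "enclitic", "gloss", "source",
]

def to_record(line: str) -> dict:
    """Split one tab-separated token line and label each column by name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))

# Placeholder row: the value in column i is simply "v<i>".
row = "\t".join(f"v{i}" for i in range(len(COLUMNS)))
record = to_record(row)
```

Checking the column count up front means a malformed or truncated line fails loudly instead of silently shifting every attribute by one position.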
