
Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP .

Arabic Text Categorization Based on Arabic Wikipedia.


Presentation Transcript


  1. Arabic Text Categorization Based on Arabic Wikipedia Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP.

  2. Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation • A challenge due to the correlation between certain subcategories and the overlap between main categories. EX:

  4. Objectives • To solve this, we use a categorization algorithm and further adopt two approaches.

  5. CATEGORIZATION CORPORA - Training Data Related Tags Approach

  6. Testing Data • 10 categories with 40 documents in each category

  7. Methodology - PREPROCESSING TECHNIQUES • Root Extraction (RE) • Light Stemming (LS) • Special Expressions Extraction

  8. Methodology - CATEGORIZATION PROCESS • Categorize the input text in two phases. • Phase one: categorize the text into one of the main categories. • Phase two: further categorize the input text based on the subcategories.
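
The two phases above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word-frequency profiles, the overlap score, and the names `score`, `best_match`, and `categorize_two_phase` are all assumptions made for the example, and the toy data is in English for readability.

```python
def score(words, profile):
    """Sum the profile frequency of every input word (a simple overlap score)."""
    return sum(profile.get(w, 0) for w in words)


def best_match(words, profiles):
    """Return the name of the profile with the highest overlap score."""
    return max(profiles, key=lambda name: score(words, profiles[name]))


def categorize_two_phase(text, main_profiles, sub_profiles):
    """Phase one picks a main category; phase two picks one of its subcategories."""
    words = text.split()
    main = best_match(words, main_profiles)        # phase one
    sub = best_match(words, sub_profiles[main])    # phase two
    return main, sub


# Hypothetical toy profiles (word -> frequency in the training data).
main_profiles = {
    "sports": {"match": 5, "team": 4, "goal": 3},
    "economy": {"market": 5, "bank": 4, "price": 3},
}
sub_profiles = {
    "sports": {
        "football": {"goal": 4, "team": 2},
        "tennis": {"racket": 4, "set": 3},
    },
    "economy": {
        "banking": {"bank": 5, "loan": 3},
        "trade": {"export": 4, "price": 2},
    },
}

result = categorize_two_phase(
    "the team scored a goal in the match", main_profiles, sub_profiles
)
print(result)
```

In this sketch phase two only looks at the subcategories of the main category chosen in phase one, so an error in phase one cannot be recovered in phase two.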

  9. Methodology - Basic Categorization Algorithm (BCA)

  10. Methodology - Percentage and Difference Categorization (PDC) Algorithm has frequency 7 in the 300-word

  11. Methodology - Percentage and Difference Categorization (PDC) Algorithm The category with the highest sum of flag values is considered to be the best match for the input text.
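
A rough sketch of the "highest sum of flag values" rule is given below. The slides do not spell out how a flag is set, so the rule used here is an assumption: for each input word, the category where the word has the highest percentage of the category's words gets a flag of 1, provided it leads the runner-up by at least `min_diff` percentage points. The names `percentages`, `pdc_flags`, `pdc_categorize`, and `min_diff`, and the toy data, are hypothetical.

```python
def percentages(word, categories):
    """Percentage of each category's total word count taken up by `word`."""
    return {
        name: 100.0 * freqs.get(word, 0) / sum(freqs.values())
        for name, freqs in categories.items()
    }


def pdc_flags(words, categories, min_diff=0.5):
    """Sum the per-word flags for each category (needs at least two categories)."""
    totals = {name: 0.0 for name in categories}
    for word in words:
        pct = percentages(word, categories)
        ranked = sorted(pct, key=pct.get, reverse=True)
        top, runner_up = ranked[0], ranked[1]
        # Assumed rule: flag the leading category only if its lead is large enough.
        if pct[top] > 0 and pct[top] - pct[runner_up] >= min_diff:
            totals[top] += 1.0
    return totals


def pdc_categorize(words, categories, min_diff=0.5):
    """Return the category with the highest sum of flag values."""
    totals = pdc_flags(words, categories, min_diff)
    return max(totals, key=totals.get)


# Hypothetical 300-word categories; "goal" has frequency 7 in the first one.
categories = {
    "sports": {"goal": 7, "team": 5, "market": 1, "other": 287},
    "economy": {"market": 9, "bank": 6, "team": 1, "other": 284},
}
words = ["goal", "team", "market"]
totals = pdc_flags(words, categories)
winner = pdc_categorize(words, categories)
print(totals, winner)
```

Working in percentages rather than raw counts keeps categories of different sizes comparable, which is presumably why the method carries "Percentage" in its name.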

  12. Methodology – PDC Algorithm vs. BCA Algorithm

  13. Methodology – Enhancing Main/Subcategories Grouping • Problem: the possible high correlation between subcategories of different main categories. (1) Overlapping Main Categories for Phase Two

  14. Methodology – Enhancing Main/Subcategories Grouping (2) Replacing Main Categories by Groups of Related Categories

  15. Methodology – Enhancing Main/Subcategories Grouping

  16. Methodology - Word Filtration Techniques within Categories

  17. Methodology - The result of applying the three techniques

  18. Modified PDC with N Scales • Define a scaling of N flag values, e.g. 1, 0.5, 0 or 1, 0.75, 0.5, 0.25, 0.

  19. Further Testing on the PDC Algorithm • Tool: Root Extraction • Tool: Light Stemming & Light10 • Tool: Double Words • Tool: Expressions Extraction
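
One possible reading of the N-scale idea is sketched below. It assumes the N flag values are evenly spaced between 1 and 0 and are handed out by rank, so the best-matching category for a word gets 1, the next gets the next level, and so on. Both the spacing and the rank-based assignment are assumptions, and `scale_levels` and `graded_flags` are hypothetical names.

```python
def scale_levels(n):
    """Return n evenly spaced flag values from 1 down to 0 (n >= 2)."""
    return [1 - i / (n - 1) for i in range(n)]


def graded_flags(pct, n):
    """Assign a graded flag to each category by rank of its percentage.

    `pct` maps category name -> percentage of the word in that category.
    Categories ranked beyond the n levels, or with zero percentage, get 0.
    """
    levels = scale_levels(n)
    ranked = sorted(pct, key=pct.get, reverse=True)
    flags = {}
    for rank, name in enumerate(ranked):
        if rank < n and pct[name] > 0:
            flags[name] = levels[rank]
        else:
            flags[name] = 0.0
    return flags


# Hypothetical percentages of one word across four categories.
pct = {"sports": 2.0, "economy": 1.0, "politics": 0.5, "health": 0.0}
three = graded_flags(pct, 3)
five = graded_flags(pct, 5)
print(scale_levels(3), scale_levels(5))
print(three, five)
```

With more levels, a word that is fairly common in a second or third category still contributes something to that category's sum instead of being ignored.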

  20. Using Testing Data from the Reference Categories

  21. Training Data Characteristics

  22. COMPARISON WITH RELATED WORK

  23. Using Testing Data from the Reference Categories

  24. Conclusions • Using training and testing data from the same source, by splitting the corpus into test and training components, consistently gives better results. • However, we believe that the second method (different sources) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.

  25. Comments • Advantages • To. • Applications • Arabic Text Categorization.
