
Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Aasish Pappu, Language Technologies Institute, Carnegie Mellon University. PACLIC 2009. Outline: 1 Introduction, 2 Related Work, 3 Corpus Creation, 4 Features, 5 Experiments, 6 Conclusion.


Presentation Transcript


  1. Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC 2009

  2. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  3. Structured and organized encyclopedic corpus is a suitable training corpus. • a wide range of topics • provides hyperlinks 1 Introduction

  4. In this paper • Discuss the usability of Wikipedia • Induce the WordNet and Wikipedia domain taxonomies into the feature space • Use Maximum Entropy and SVM classifiers 1 Introduction

  5. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  6. Kazama and Torisawa (2007): extracted gloss text • Dakka and Cucerzan (2008): tagged the Wikipedia data • Bunescu and Pasca (2006): built a disambiguation system 2 Related Work

  7. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  8. 10-18-2007 English version of Wikipedia • 2 million articles • 292,384 categories • a taxonomy with a depth of about 10 • 5882 Wikipedia Stub categories • 105 domains 3 Corpus Creation

  9. 3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure

  10. taxonomy • constituted by categories • linked to other categories across depth and breadth • contains cycles (tackled by Zesch and Gurevych, 2007) • the Wikipedia taxonomy is not a tree 3.1 Categories in Wikipedia
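
Because the category graph contains cycles, any traversal of it has to detect them. The following is a minimal sketch of cycle detection by depth-first search on a toy category graph. It is not the method of Zesch and Gurevych (2007); the function name `find_cycle` and the category names are assumptions made for illustration.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    graph maps a category name to the list of its subcategories
    (a hypothetical toy representation, not the Wikipedia dump format).
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for child in graph.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                # child is already on the current path: a cycle closes here
                return path[path.index(child):] + [child]
            if state == WHITE:
                found = visit(child)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


toy_graph = {
    "Science": ["Physics"],
    "Physics": ["Mechanics"],
    "Mechanics": ["Science"],   # back-link that makes the graph cyclic
    "Arts": ["Music"],
}
cycle = find_cycle(toy_graph)
```

The recursive form is kept for readability; a graph with a depth of about 10 stays well within Python's default recursion limit, but a cyclic path could be longer, so an explicit stack may be safer on the full category graph.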

  11. 3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure

  12. the domain hierarchy • 17 basic domains • 88 sub-domains 3.2 Named entity categories

  13. to avoid bias towards any particular domain • rules to choose the set of categories • To ensure diversity in the categorization task • To ensure we select balanced categories • consider the category with each parameter closest to the mean value under that domain 3.2 Named entity categories
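
The "closest to the mean" rule can be sketched as follows. The slide does not name the parameters or say how deviations across several parameters are combined, so the parameter names (`pages`, `depth`), the summed relative deviation, and the function name `pick_balanced_category` are all assumptions made for illustration.

```python
from statistics import mean


def pick_balanced_category(categories):
    """Pick the category whose parameters lie closest to the domain means.

    categories maps a category name to a dict of numeric parameters;
    every category is assumed to carry the same parameter keys.
    """
    params = list(next(iter(categories.values())))
    means = {p: mean(stats[p] for stats in categories.values()) for p in params}

    def deviation(stats):
        # Sum of relative deviations from the mean (an assumed combination rule)
        return sum(abs(stats[p] - means[p]) / (means[p] or 1.0) for p in params)

    return min(categories, key=lambda name: deviation(categories[name]))


# Hypothetical categories under one domain
sports = {
    "Football clubs": {"pages": 900, "depth": 4},
    "Chess players": {"pages": 100, "depth": 6},
    "Olympic athletes": {"pages": 500, "depth": 5},
}
chosen = pick_balanced_category(sports)
```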

  14. 3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure

  15. extract named entity phrases • using Stanford POS tagger • extract typed dependency relationships • extract the content words around a named entity • collect the NPs (noun phrases) and VPs (verb phrases) 3.3 Procedure

  16. First, we look for redirected and disambiguated article titles matching the first name of the named entity. • If there is more than one such title, choose the target title using the minimum edit distance metric. • Pick all articles that fall under the same category as the target article. • Look for those articles that fall under the special categories chosen for the classification task. • Find the article that shares the maximum number of categories with the target article and label the target article with its special category. 3.3 Procedure
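
Two steps of this procedure, choosing a title by minimum edit distance and labeling by the largest category overlap, can be sketched as below. This is a minimal sketch, assuming Levenshtein distance as the edit distance and case-insensitive comparison; the function names, titles, and categories are hypothetical and not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def closest_title(name, candidate_titles):
    """Choose the candidate title with the smallest edit distance to name."""
    return min(candidate_titles,
               key=lambda title: edit_distance(name.lower(), title.lower()))


def label_by_shared_categories(target_categories, labeled_articles):
    """Return the special category of the article sharing the most categories.

    labeled_articles maps an article title to a pair
    (set of categories, special category); the layout is an assumption.
    """
    best = max(labeled_articles.values(),
               key=lambda item: len(target_categories & item[0]))
    return best[1]


target = closest_title("Jordan",
                       ["Michael Jordan", "Jordan (country)", "Jordan River"])
label = label_by_shared_categories(
    {"Rivers of Asia", "Jordan"},
    {
        "Nile": ({"Rivers of Africa", "Egypt"}, "Rivers"),
        "Euphrates": ({"Rivers of Asia", "Iraq"}, "Rivers"),
        "Amman": ({"Capitals in Asia"}, "Cities"),
    },
)
```

Plain edit distance favors titles of similar length, which is why the toy call selects "Jordan River" over the longer alternatives.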

  17. About 10,000 samples • Training 75% • Testing 25% 3.3 Procedure
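
A 75/25 split of this size can be sketched as follows. The slide does not say how the samples were assigned to the two sets, so the random shuffle, the fixed seed, and the function name `split_samples` are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in sample ids; the real corpus holds about 10,000 labeled entities
train, test = split_samples(range(10000))
```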

  18. 3.3 Procedure

  19. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  20. four types of feature sets • a syntactic feature set • three semantic features 4 Features

  21. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features

  22. phrase structure parse • nesting of multi-word constituents • dependency parse • dependencies between individual words • dependency relations give a clue about probable semantic relations that can be associated with the named entity. 4.1 Typed Dependency Feature

  23. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features

  24. preferred to have a hypernym feature which is semantically specific • hypernyms of all synsets are ordered inversely according to their depth in the hypernym tree • the deepest hypernym in the lot is chosen as the target feature for that content word 4.2 Hypernyms
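
The selection of the deepest hypernym can be sketched on a toy tree. This is an illustrative sketch only: the parent map below is hand-made and does not reproduce WordNet's actual hierarchy, and the function names are assumptions.

```python
def depth(node, parent):
    """Number of steps from node up to the root of the hypernym tree."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps


def deepest_hypernym(candidates, parent):
    """Order candidate hypernyms deepest-first and return the first one."""
    ranked = sorted(candidates, key=lambda h: depth(h, parent), reverse=True)
    return ranked[0]


# Toy hypernym tree: child -> parent (None marks the root)
toy_parent = {
    "entity": None,
    "organism": "entity",
    "person": "organism",
    "athlete": "person",
    "artifact": "entity",
}

# Hypothetical hypernyms collected from all synsets of one content word
feature = deepest_hypernym(["person", "athlete", "artifact"], toy_parent)
```

The deepest candidate is the most specific one, which matches the stated preference for a semantically specific hypernym feature.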

  25. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features

  26. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 4.3.1 WordNet domains 4.3.2 Wikipedia domains 4.3.3 WDH vs Wikipedia Domain System

  27. Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH) • There are 5 top-level domains and 46 basic domains in WDH. 4.3.1 WordNet domains

  28. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 4.3.1 WordNet domains 4.3.2 Wikipedia domains 4.3.3 WDH vs Wikipedia Domain System

  29. indexed Wikipedia • search for content words in the index to find the categories with the largest number of pages containing a content word • In particular, pages that contain the word as a hyperlink are weighted double the pages that contain the word without a hyperlink. 4.3.2 Wikipedia domains
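
The double weighting of hyperlinked occurrences can be sketched as a simple category score. This is a minimal sketch, assuming each indexed page is a record with its categories, the words it hyperlinks, and the words it contains as plain text; the record layout, the toy pages, and the function name `score_categories` are assumptions, not the paper's index format.

```python
from collections import Counter


def score_categories(word, pages):
    """Score each category by the pages under it that contain the word.

    A page counts 2 if the word appears as a hyperlink and 1 if it only
    appears as plain text.
    """
    scores = Counter()
    for page in pages:
        if word in page["linked"]:
            weight = 2
        elif word in page["plain"]:
            weight = 1
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    return scores


# Hypothetical indexed pages
toy_pages = [
    {"categories": ["Football"], "linked": {"goal"}, "plain": set()},
    {"categories": ["Psychology"], "linked": set(), "plain": {"goal"}},
    {"categories": ["Psychology"], "linked": set(), "plain": {"motivation"}},
]
scores = score_categories("goal", toy_pages)
top_category = scores.most_common(1)[0][0]
```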

  30. 4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 4.3.1 WordNet domains 4.3.2 Wikipedia domains 4.3.3 WDH vs Wikipedia Domain System

  31. 4.3.3 WDH vs Wikipedia Domain System

  32. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  33. 5 Experiments

  34. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis

  35. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis

  36. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis

  37. outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

  38. presented a named entity categorization system • employs Wikipedia categories as classes • adapted the hierarchical categorization of Wikipedia • mine relations among named entities
