
Mining Wiki Resources for Multilingual Named Entity Recognition

Mining Wiki Resources for Multilingual Named Entity Recognition. Alexander E. Richman & Patrick Schone, Department of Defense. ACL 2008. Reporter: Chia-Ying Lee. Advisor: Prof. Hsin-Hsi Chen.


Presentation Transcript


  1. Mining Wiki Resources for Multilingual Named Entity Recognition • Alexander E. Richman & Patrick Schone, Department of Defense • ACL 2008 • Reporter: Chia-Ying Lee • Advisor: Prof. Hsin-Hsi Chen

  2. Introduction • Uses the multilingual Wikipedia to automatically create an annotated corpus of text in any given language. • Languages: French, Ukrainian, Spanish, Polish, Russian, and Portuguese. • Does not use any non-English linguistic resources outside the Wikimedia domain, nor semantic resources such as WordNet or a POS tagger. • Uses an internally modified variant of BBN's IdentiFinder (Bikel et al., 1999), called “PhoenixIDF,” specifically modified to emphasize fast text processing.

  3. Related Work • Toral and Muñoz (2006) used Wikipedia to create lists of named entities. Their approach relies on WordNet and needs a manual supervision step. • Kazama and Torisawa (2007) used Wikipedia to build entity dictionaries. Their approach relies on a POS tagger. • Cucerzan (2007) used Wikipedia primarily for Named Entity Disambiguation, following the path of Bunescu and Pasca (2006). This work uses Category information, but is specific to English.

  4. Wikipedia • A multilingual, collaborative, freely available encyclopedia on the Web. • As of October 2007, there were over 2 million articles in English, 30 languages with at least 50,000 articles, and another 40 with at least 10,000 articles.

  5. Wikipedia - Features • Article links: links from one article to another in the same language. • Category links: links from an article to special “Category” pages. • Interwiki links: links from an article to a presumably equivalent article in another language. • Redirect pages: short pages which often provide equivalent names for an entity. • Disambiguation pages: pages with little content that link to multiple similarly named articles. • Example: http://en.wikipedia.org/wiki/FBI
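The first three link types can be told apart directly in an article's wikitext. The following is a minimal sketch, assuming the classic MediaWiki `[[Target|label]]` link syntax with inline `[[fr:...]]` interwiki links; the function name `classify_links` and the two-to-three-letter language-code rule are illustrative assumptions, and the sketch ignores templates, nested links, and longer language codes.

```python
import re

# [[Target]] or [[Target|label]]; group 1 is the link target.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# Assumption: interwiki targets start with a 2-3 letter lowercase language code.
LANG_PREFIX = re.compile(r"^[a-z]{2,3}:")

def classify_links(wikitext):
    """Sort link targets into article, category, and interwiki links."""
    found = {"article": [], "category": [], "interwiki": []}
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        if target.startswith("Category:"):
            found["category"].append(target)
        elif LANG_PREFIX.match(target):
            found["interwiki"].append(target)
        else:
            found["article"].append(target)
    return found

sample = (
    "The [[Federal Bureau of Investigation|FBI]] is based in "
    "[[Washington, D.C.]]. [[Category:Law enforcement agencies]] "
    "[[fr:Federal Bureau of Investigation]]"
)
links = classify_links(sample)
```

Redirect and disambiguation pages would need page-level metadata rather than link parsing, so they are left out of this sketch.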

  6. Training Data Generation • Initial Set-up • English Language Categorization • Multilingual Categorization • The Full System

  7. Initial Set-up • ACE Named Entity types: PERSON, GPE (Geo-Political Entities), ORGANIZATION, VEHICLE, WEAPON, LOCATION, FACILITY, DATE, TIME, MONEY, and PERCENT. • MUC-style tags, like <ENAMEX TYPE="GPE">Place Name</ENAMEX>. • Process: (1) Identify words and phrases that might represent entities. (2) Use category links and/or interwiki links to associate each phrase with an English language phrase or set of Categories. (3) Determine the appropriate type of the English language data and assume that the original phrase is of the same type.
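The MUC-style output format can be illustrated with a small tagging helper. This is a sketch, not the authors' code: the function names are hypothetical, and it assumes the standard MUC convention of TIMEX for DATE/TIME and NUMEX for MONEY/PERCENT, with every other type (including the ACE additions such as GPE) written as ENAMEX as in the slide's example.

```python
# Assumption: standard MUC element names; the slide shows only ENAMEX.
TIMEX_TYPES = {"DATE", "TIME"}
NUMEX_TYPES = {"MONEY", "PERCENT"}

def element_for(entity_type):
    """Pick the MUC element name for an entity type."""
    if entity_type in TIMEX_TYPES:
        return "TIMEX"
    if entity_type in NUMEX_TYPES:
        return "NUMEX"
    return "ENAMEX"

def tag_spans(text, spans):
    """Wrap non-overlapping (start, end, type) character spans in MUC-style tags."""
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        element = element_for(entity_type)
        pieces.append(text[cursor:start])
        pieces.append(f'<{element} TYPE="{entity_type}">{text[start:end]}</{element}>')
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

tagged = tag_spans("Oslo is in Norway.", [(0, 4, "GPE"), (11, 17, "GPE")])
```

Here `tagged` holds the sentence with both place names wrapped in `<ENAMEX TYPE="GPE">` elements.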

  8. English Language Categorization (1) • Wiki useful category => key category phrase => disambiguation pages? => Wiktionary • Useful categories: “Category:Living People”: PERSON; “Category:Cities in Norway”: GPE. • Useless category: “Category:1912 Establishments,” which includes articles on Fenway Park (a facility), the Republic of China (a GPE), and the Better Business Bureau (an organization).
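The idea of typing an English article from key phrases in its categories can be sketched as follows. The key-phrase table contains only the two examples from the slide; the majority vote across categories and the names `type_for_category` and `classify_article` are assumptions made for illustration, not a description of the actual system.

```python
from collections import Counter

# Key phrases taken from the slide's two examples; a real list would be much longer.
KEY_PHRASES = [
    ("living people", "PERSON"),
    ("cities in", "GPE"),
]

def type_for_category(category):
    """Return the entity type suggested by one category, or None if it is not useful."""
    name = category.split(":", 1)[-1].strip().lower()
    for phrase, entity_type in KEY_PHRASES:
        if phrase in name:
            return entity_type
    return None

def classify_article(categories):
    """Assumed voting scheme: the most frequent suggested type wins."""
    votes = Counter(
        t for t in (type_for_category(c) for c in categories) if t is not None
    )
    if not votes:
        return "UNKNOWN"
    return votes.most_common(1)[0][0]

person = classify_article(["Category:Living People", "Category:1912 Establishments"])
place = classify_article(["Category:Cities in Norway"])
unknown = classify_article(["Category:1912 Establishments"])
```

A category such as “Category:1912 Establishments” matches no key phrase, so it contributes no vote, which mirrors its treatment as useless on the slide.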

  9. English Language Categorization (2)

  10. Multilingual Categorization • Not all articles have an English equivalent, but many of the most useful categories do. • French: “Catégorie:Commune des Côtes-d'Armor,” “Catégorie:Ville portuaire de France,” “Catégorie:Port de plaisance,” and “Catégorie:Station balnéaire.” • English equivalents: “Category:Communes of Côtes-d'Armor,” UNKNOWN, “Category:Marinas,” and “Category:Seaside resorts.”
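The French example can be traced with a short sketch. The interwiki table below holds exactly the three mappings given on the slide, so the fourth category resolves to UNKNOWN. The entity types assigned to the English categories and the majority vote are hypothetical, added only to show how a type might then be chosen.

```python
from collections import Counter

# Category-level interwiki links from the slide; "Ville portuaire de France" has none.
INTERWIKI = {
    "Catégorie:Commune des Côtes-d'Armor": "Category:Communes of Côtes-d'Armor",
    "Catégorie:Port de plaisance": "Category:Marinas",
    "Catégorie:Station balnéaire": "Category:Seaside resorts",
}

# Hypothetical type assignments, for illustration only.
ENGLISH_TYPES = {
    "Category:Communes of Côtes-d'Armor": "GPE",
    "Category:Marinas": "FACILITY",
    "Category:Seaside resorts": "GPE",
}

def to_english(categories):
    """Map each non-English category to its English equivalent, or UNKNOWN."""
    return [INTERWIKI.get(category, "UNKNOWN") for category in categories]

def vote_type(categories):
    """Assumed scheme: majority vote over the types of the English equivalents."""
    votes = Counter(
        ENGLISH_TYPES[english]
        for english in to_english(categories)
        if english in ENGLISH_TYPES
    )
    return votes.most_common(1)[0][0] if votes else "UNKNOWN"

french = [
    "Catégorie:Commune des Côtes-d'Armor",
    "Catégorie:Ville portuaire de France",
    "Catégorie:Port de plaisance",
    "Catégorie:Station balnéaire",
]
english = to_english(french)
article_type = vote_type(french)
```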

  11. The Full System • The first pass uses the explicit article links within the text. • We then search an associated English language article, if available, for additional information. • A second pass checks for multi-word phrases that exist as titles of Wikipedia articles. • We look for certain types of person and organization instances. • We perform additional processing for alphabetic or space-separated languages, including a third pass looking for single-word Wikipedia titles. • We use regular expressions to locate additional entities such as numeric dates.
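The last step, locating numeric dates with regular expressions, might look like the sketch below. The slides do not give the actual patterns, so both date formats and the name `tag_numeric_dates` are assumptions, as is the use of the MUC `TIMEX` element for dates.

```python
import re

# Assumed formats: ISO-style 2007-10-01, or day/month/year with "/" or "." separators.
NUMERIC_DATE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}[./]\d{1,2}[./]\d{4})\b"
)

def tag_numeric_dates(text):
    """Wrap every numeric date found in the text in a TIMEX DATE tag."""
    return NUMERIC_DATE.sub(
        lambda match: f'<TIMEX TYPE="DATE">{match.group(0)}</TIMEX>', text
    )

tagged_date = tag_numeric_dates("Signed on 12/10/2007 in Kyiv.")
```

Because the pattern is purely numeric, it applies unchanged across languages, which fits a system that avoids language-specific resources.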

  12. Evaluation – All • Wiki test set • Three human-annotated newswire test sets: Spanish, French, and Ukrainian.

  13. Evaluation – Spanish (1) • Spanish is a substantial, well-developed Wikipedia, consisting of more than 290,000 articles as of October 2007. • Newswire: 25,000 words from the ACE 2007 test set, manually modified to conform to extended MUC-style standards. • Wiki test set: 335,000 words.

  14. Evaluation – Spanish (2) • Either Wikipedia is relatively poor in Organizations, or PhoenixIDF underperforms when identifying Organizations relative to other categories, or both. • Traditional Training: PhoenixIDF trained on ACE 2007 data converted to MUC-style tags.

  15. Evaluation – French • French is one of the largest Wikipedias, containing more than 570,000 articles as of October 2007. • Newswire: 25,000 words from Agence France Presse. • Wiki test set: 920,000 words. • Results are similar to Spanish.

  16. Evaluation – Ukrainian (1) • Ukrainian is a medium-sized Wikipedia with 74,000 articles as of October 2007. • The typical article is shorter and less well-linked to other articles than in the French or Spanish versions. • Newswire: approximately 25,000 words from various online news sites covering primarily political topics. • Wiki test set: 395,000 words. • Traditional Training: PhoenixIDF trained on newswire data.

  17. Evaluation – Ukrainian (2) • The Ukrainian newswire contained a much higher proportion of organizations than the French or Spanish versions. • The Ukrainian language Wikipedia contains very few articles on organizations relative to other types.

  18. Conclusion • Wikipedia can be used to create a NER system with performance comparable to one developed from human-annotated newswire, without requiring any linguistic expertise. • This level of performance can likely be obtained currently in 20-40 languages. • A Wikipedia-derived system could be used as a supplement to other systems for many more languages. • An automatically generated entity dictionary is embedded in the system.

  19. Future Work • Automatically generate the list of key words and phrases for useful English language categories. • The authors also believe performance could be improved by using higher order non-English categories and better disambiguation. • Lists of organizations might be particularly useful, and “List of” pages are common in many languages.

  20. Thank you!
