
20th of May 2004



Presentation Transcript


  1. 20th of May 2004. Mixed-Lingual Entity Recognition. Beatrice Alex, School of Informatics, The University of Edinburgh

  2. Named Entity Recognition
     • What is a named entity (NE)? A string that refers to a particular kind of object in the world, e.g.
       • “John Lennon” = NE of type person
       • “T-Mobile” = NE of type organisation
       • “Edinburgh” = NE of type location
     • How are they recognised? Use of internal and external context

  3. NER Methods
     • Rule-based
       • hand-written patterns
       • rely on punctuation, capitalisation and other features in the text
     • Statistical-based
       • data-driven approaches
       • exploit the statistical properties of real language to learn models
     • Hybrid Methods

  4. PhD Proposal (Supervisors: Claire Grover, Stephen Clark)
     • Proposed research topic: mixed-lingual NER, i.e. the detection and classification of NEs in a different language from the base language of the text
     • Examples:
       • „Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.“
       • “Germany's Die Welt reports that four people died in the heat wave last week.”

  5. Background and Motivation
     • Multi-lingual and language-independent NER: active research area in NLP circles (MET-1/2, CoNLL02/03)
     • Many errors in German NER due to amount of foreign language material in German articles (Rössler, 2002)
     • Mixed-lingual NER: unspecified or beyond capabilities of existing approaches

  6. Beneficiaries
     • Performance improvements of applications where NER is standardly applied (IE, QA, text summarisation, topic identification)
     • Valuable information to polyglot TTS synthesis
     • Pre-processing tool for MT systems

  7. Denglish
     • English: dominant language of science & technology, air-traffic control, advertising
     • Increasing influence on German
     • Example: The live event was really cool. There were tickets, fast food, drinks in the basement.

  8. Preliminary Research
     • Analysis of English inclusions in German newspaper articles on different domains:
       • (1) Internet & Telecoms, (2) EU and (3) space travel
     • Corpus: 16,000 tokens per domain from German newspaper (FAZ)
     • Automatic classification of English tokens (NN and FM) by means of a simple lookup procedure
     • More than 90% of all English inclusions are nouns (Yang, 1999; Yeandle, 2001; Corr, 2003)

  9. 1. Lookup Procedure
     • CELEX lookup (NN|FM) in German and English databases
       • only in German database → DE
       • only in English database → EN
       • in both databases:
         • Computer, Trend, Monster
         • Generation, Union, Mission
         • Art, Tag, Rat, Fall, All
       • in neither database → 2. lookup procedure
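The first lookup step on slide 9 could be sketched roughly as follows. This is only an illustration under stated assumptions: the two small word sets are hypothetical stand-ins for the German and English CELEX databases, the function name `lexicon_lookup` is invented here, and lower-casing tokens before lookup is an assumption. The slide does not say how tokens found in both databases are resolved, so the sketch simply reports them as `BOTH`.

```python
from typing import Set

# Hypothetical toy word lists standing in for the German and English
# CELEX databases; the real databases are far larger.
GERMAN_LEXICON: Set[str] = {"schicksal", "piloten", "computer", "tag", "union"}
ENGLISH_LEXICON: Set[str] = {"central", "command", "computer", "tag", "union"}


def lexicon_lookup(token: str) -> str:
    """Classify a noun or foreign-material token by lexicon membership.

    Returns "DE" or "EN" when the token is found in exactly one lexicon,
    "BOTH" when it is found in both, and "UNKNOWN" when it is found in
    neither (the case slide 9 hands on to the second lookup procedure).
    """
    key = token.lower()  # assumption: case-insensitive lookup
    in_german = key in GERMAN_LEXICON
    in_english = key in ENGLISH_LEXICON
    if in_german and in_english:
        return "BOTH"
    if in_german:
        return "DE"
    if in_english:
        return "EN"
    return "UNKNOWN"
```

With these toy lists, `lexicon_lookup("Schicksal")` gives `"DE"`, `lexicon_lookup("Command")` gives `"EN"`, and a compound such as `Shuttleflug` falls through as `"UNKNOWN"`.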

  10. 2. Lookup Procedure
      • Google lookup with language preference
        • German compounds: Mausklick (mouse click)
        • English unhyphenated compounds: Homepage
        • Mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight)
        • English nouns with German inflections: Receivern
        • Abbreviations and acronyms: GPS, UKW
        • Words with spelling mistakes: Abruch (abortion)
        • English words with American spelling: Center
      • Classification based on number of hits
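The slide states only that classification is "based on number of hits", so the decision rule below is an assumption: the language whose restricted search returns more hits wins, and ties stay unresolved. The function name `classify_by_hits` is hypothetical, and the hit counts are passed in as plain integers rather than fetched from any search service.

```python
def classify_by_hits(german_hits: int, english_hits: int) -> str:
    """Pick a language label from two search hit counts.

    Assumed rule (not confirmed by the slides): the larger count wins,
    and equal counts are left unresolved as "UNKNOWN".
    """
    if german_hits < 0 or english_hits < 0:
        raise ValueError("hit counts must be non-negative")
    if german_hits > english_hits:
        return "DE"
    if english_hits > german_hits:
        return "EN"
    return "UNKNOWN"
```

For instance, with invented counts, `classify_by_hits(12000, 300)` returns `"DE"` and `classify_by_hits(50, 90000)` returns `"EN"`. A raw comparison like this ignores the different sizes of the German and English web, which may be one reason slide 13 lists unreliable Google hits as a source of error.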

  11. Results
      • Output: Das <EN>Central</EN> <EN>Command</EN> erklärte, das <DE>Schicksal</DE> des <DE>Piloten</DE> sei noch ungeklärt.
        EN: Central Command explained, the fate of the pilot is still unclear.
        MT: CentralCommand explained, the fate of the pilot was still unsettled.
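The tagged output format shown on slide 11 can be reproduced with a small rendering helper. This is a minimal sketch: the function name `render_tagged` and the token representation (a list of text and label pairs, with `None` for tokens that were not classified) are assumptions made for illustration, and punctuation is kept attached to the preceding token to keep the example short.

```python
from typing import List, Optional, Tuple


def render_tagged(tokens: List[Tuple[str, Optional[str]]]) -> str:
    """Wrap each labelled token in <LABEL>...</LABEL> and join with spaces."""
    parts = []
    for text, label in tokens:
        if label is None:
            parts.append(text)  # token was not classified, leave as-is
        else:
            parts.append(f"<{label}>{text}</{label}>")
    return " ".join(parts)


# The example sentence from the slide, with labels on the classified nouns.
example = [
    ("Das", None), ("Central", "EN"), ("Command", "EN"),
    ("erklärte,", None), ("das", None), ("Schicksal", "DE"),
    ("des", None), ("Piloten", "DE"), ("sei", None),
    ("noch", None), ("ungeklärt.", None),
]
print(render_tagged(example))
```

Only the tokens that went through the lookup procedures carry a label here; every other token passes through unchanged.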

  12. English Inclusions

  13. Error Analysis
      • Sources of Error:
        • Wrong POS tags
        • Mixed-lingual unhyphenated compounds
        • New internationalisms
        • Abbreviations with several expansions
        • Unreliable Google hits
        • Inclusions from other languages
      • Need for better handling of NEs
      • Morpheme level analysis for compounds
      • Extension to other POS tags

  14. Future Work
      • Collection of more data and annotation for training and evaluation
      • Development of sequence modelling classifier, e.g. maximum entropy
      • Implementation of other languages
      • Application-based evaluation (e.g. MT)
