Named Entity Recognition gate.ac.uk/ nlp.shef.ac.uk/ Hamish Cunningham - PowerPoint PPT Presentation

Presentation Transcript

  1. Named Entity Recognition
  http://gate.ac.uk/ http://nlp.shef.ac.uk/
  Hamish Cunningham, Kalina Bontcheva
  RANLP, Borovets, Bulgaria, 8th September 2003

  2. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  3. Information Extraction
  Information Extraction (IE) pulls facts and structured information from the content of large text collections.
  IR - IE - NLU
  MUC: Message Understanding Conferences
  ACE: Automatic Content Extraction

  4. MUC-7 tasks
  NE: Named Entity recognition and typing
  CO: co-reference resolution
  TE: Template Elements
  TR: Template Relations
  ST: Scenario Templates

  5. An Example
  The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
  NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
  CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
  TE: the rocket is "shiny red" and Head's "brainchild"
  TR: Dr. Head works for We Build Rockets Inc.
  ST: a rocket-launching event occurred with the various participants

  6. Performance levels
  Vary according to text type, domain, scenario, language:
  NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
  CO: 60-70% resolution
  TE: 80%
  TR: 75-80%
  ST: 60% (but human level may be only 80%)

  7. What are Named Entities?
  NER involves the identification of proper names in texts, and their classification into a set of predefined categories of interest:
  Person names
  Organizations (companies, government organisations, committees, etc.)
  Locations (cities, countries, rivers, etc.)
  Date and time expressions

  8. What are Named Entities? (2)
  Other common types: measures (percent, money, weight, etc.), email addresses, Web addresses, street addresses, etc.
  Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.
  MUC-7 entity definition guidelines (Chinchor '97):
  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

  9. What are NOT NEs (MUC-7)
  Artefacts – Wall Street Journal
  Common nouns referring to named entities – the company, the committee
  Names of groups of people and things named after people – the Tories, the Nobel prize
  Adjectives derived from names – Bulgarian, Chinese
  Numbers which are not times, dates, percentages, or money amounts

  10. Basic Problems in NE
  Variation of NEs – e.g. John Smith, Mr Smith, John
  Ambiguity of NE types:
  John Smith (company vs. person)
  May (person vs. month)
  Washington (person vs. location)
  1945 (date vs. time)
  Ambiguity with common words, e.g. "may"

  11. More complex problems in NE
  Issues of style, structure, domain, genre, etc.
  Punctuation, spelling, spacing, formatting, ... all have an impact:
  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom
  Tell me more about Leonardo Da Vinci

  12. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  13. Applications
  Can help summarisation, ASR and MT
  Intelligent document access:
  Browse document collections by the entities that occur in them
  Formulate more complex queries than IR can answer
  Application domains:
  News
  Scientific articles, e.g. MEDLINE abstracts

  14. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  15. Some NE Annotated Corpora
  MUC-6 and MUC-7 corpora – English
  CONLL shared task corpora:
  http://cnts.uia.ac.be/conll2003/ner/ – NEs in English and German
  http://cnts.uia.ac.be/conll2002/ner/ – NEs in Spanish and Dutch
  TIDES surprise language exercise (NEs in Cebuano and Hindi)
  ACE – English – http://www.ldc.upenn.edu/Projects/ACE/

  16. The MUC-7 corpus
  100 documents in SGML, news domain
  1880 Organizations (46%)
  1324 Locations (32%)
  887 Persons (22%)
  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

  17. The MUC-7 Corpus (2)
  <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission. <p> Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.

  18. NE Annotation Tools - Alembic

  19. NE Annotation Tools – Alembic (2)

  20. NE Annotation Tools - GATE

  21. Corpora and System Development
  Corpora are typically divided into a training and a testing portion
  Rules/learning algorithms are trained on the training part
  Tuned on the testing portion in order to optimise:
  Rule priorities, rule effectiveness, etc.
  Parameters of the learning algorithm and the features used
  Evaluation set – the best system configuration is run on this data and the system performance is obtained
  No further tuning once the evaluation set is used!
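The split-and-tune workflow above can be sketched as follows. This is a minimal illustration, not part of any system described in the tutorial; the proportions and function name are our own assumptions.

```python
import random

def split_corpus(docs, train=0.7, tune=0.15, seed=0):
    """Split documents into training, tuning, and held-out evaluation
    portions, as described above. Proportions are illustrative."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train)
    n_tune = int(len(docs) * tune)
    return (docs[:n_train],                      # train rules / learner
            docs[n_train:n_train + n_tune],      # tune priorities, features
            docs[n_train + n_tune:])             # evaluation: run once, no tuning

train_docs, tune_docs, eval_docs = split_corpus(range(100))
print(len(train_docs), len(tune_docs), len(eval_docs))  # 70 15 15
```

The key discipline is the last comment: once the evaluation portion has been scored, no further tuning is allowed against it.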

  22. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  23. Performance Evaluation
  Evaluation metric – mathematically defines how to measure the system's performance against a human-annotated gold standard
  Scoring program – implements the metric and provides performance measures:
  For each document and over the entire corpus
  For each type of NE

  24. The Evaluation Metric
  Precision = correct answers / answers produced
  Recall = correct answers / total possible correct answers
  Trade-off between precision and recall
  F-Measure = (β² + 1)PR / (β²R + P) [van Rijsbergen 75]
  β reflects the weighting between precision and recall; typically β = 1
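Van Rijsbergen's F-measure translates directly into code. The sketch below is generic, not taken from the tutorial's own software:

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure: (beta^2 + 1) * P * R / (beta^2 * R + P).
    beta weights recall relative to precision; beta = 1 weights
    them equally (the harmonic mean of P and R)."""
    if precision + recall == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * recall + precision)

# With P = 0.9 and R = 0.6, F1 is the harmonic mean:
print(round(f_measure(0.9, 0.6), 3))  # 0.72
```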

  25. The Evaluation Metric (2)
  Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
  Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
  Why: NE boundaries are often misplaced, so some results are only partially correct

  26. The MUC scorer (1)
  Document: 9601020572
                      POS  ACT | COR PAR INC | MIS SPU NON | REC PRE
  SUBTASK SCORES
  enamex
    organization       11   12 |   9   0   0 |   2   3   0 |  82  75
    person             24   26 |  24   0   0 |   0   2   0 | 100  92
    location           27   31 |  25   0   0 |   2   6   0 |  93  81
  …
  * * * SUMMARY SCORES * * *
  TASK SCORES
  enamex
    organization     1855 1757 | 1553  0  37 | 265 167  30 |  84  88
    person            883  859 |  797  0  13 |  73  49   4 |  90  93
    location         1322 1406 | 1199  0  13 | 110 194   7 |  91  85
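The REC and PRE columns in such a report can be re-derived from the raw counts using the half-credit formulas of the previous slide. The function below is our own sketch of that arithmetic, not the MUC scorer itself:

```python
def muc_scores(cor, par, inc, mis, spu):
    """Recompute MUC-style recall/precision (as percentages) from
    scorer counts, giving half credit to partial matches.
    possible = COR + INC + PAR + MIS (what the answer key contains)
    actual   = COR + INC + PAR + SPU (what the system produced)"""
    possible = cor + inc + par + mis
    actual = cor + inc + par + spu
    rec = (cor + 0.5 * par) / possible if possible else 0.0
    pre = (cor + 0.5 * par) / actual if actual else 0.0
    return round(100 * rec), round(100 * pre)

# organization row of document 9601020572: COR=9, MIS=2, SPU=3
print(muc_scores(cor=9, par=0, inc=0, mis=2, spu=3))  # (82, 75)
```

The same call reproduces the summary rows, e.g. `muc_scores(1553, 0, 37, 265, 167)` gives the organization task's 84/88.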

  27. The MUC scorer (2)
  Tracking errors in each document, for each instance in the text:
  ENAMEX cor inc PERSON PERSON "Wernher von Braun" "Braun"
  ENAMEX cor inc PERSON PERSON "von Braun" "Braun"
  ENAMEX cor cor PERSON PERSON "Braun" "Braun"
  …
  ENAMEX cor cor LOCATI LOCATI "Saturn" "Saturn"
  …

  28. The GATE Evaluation Tool

  29. Regression Testing
  Need to track the system's performance over time
  When a change is made to the system, we want to know what the implications are over the entire corpus
  Why: because an improvement in one case can lead to problems in others
  GATE offers an automated tool to help with the NE development task over time

  30. Regression Testing (2)
  At corpus level – GATE's corpus benchmark tool – tracking the system's performance over time

  31. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  32. Pre-processing for NE Recognition
  Format detection
  Word segmentation (for languages like Chinese)
  Tokenisation
  Sentence splitting
  POS tagging
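Two of the stages above, sentence splitting and tokenisation, can be sketched with naive regular expressions for English. This is a deliberately minimal illustration (real pipelines such as the ones discussed later handle abbreviations, format detection, and POS tagging); all names are ours:

```python
import re

def preprocess(text):
    """Naive sketch of two pre-processing stages: sentence
    splitting, then tokenisation of each sentence."""
    # split after ., ! or ? followed by whitespace (fails on "Dr.", etc.)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # tokens are runs of word characters, or single punctuation marks
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

for sent in preprocess("The rocket was fired on Tuesday. It is shiny."):
    print(sent)
# ['The', 'rocket', 'was', 'fired', 'on', 'Tuesday', '.']
# ['It', 'is', 'shiny', '.']
```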

  33. Two kinds of NE approaches
  Knowledge Engineering:
  rule based
  developed by experienced language engineers
  make use of human intuition
  requires only small amount of training data
  development could be very time consuming
  some changes may be hard to accommodate
  Learning Systems:
  use statistics or other machine learning
  developers do not need LE expertise
  requires large amounts of annotated training data
  some changes may require re-annotation of the entire training corpus
  annotators are cheap (but you get what you pay for!)

  34. List lookup approach – baseline
  System that recognises only entities stored in its lists (gazetteers)
  Advantages – simple, fast, language independent, easy to retarget (just create lists)
  Disadvantages – collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
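The list-lookup baseline can be sketched in a few lines: scan the token stream and annotate the longest span found in any gazetteer. The lists and example text are illustrative, not from any real gazetteer:

```python
def gazetteer_lookup(tokens, gazetteers):
    """Baseline list-lookup NE tagger: annotate the longest token
    sequence found in any gazetteer list, left to right."""
    annotations = []
    i = 0
    while i < len(tokens):
        match = None
        # prefer the longest match starting at position i
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            for etype, names in gazetteers.items():
                if span in names:
                    match = (j, etype, span)
                    break
            if match:
                break
        if match:
            annotations.append(match[1:])   # (type, surface string)
            i = match[0]
        else:
            i += 1
    return annotations

gaz = {"Person": {"Dr. Head"}, "Organization": {"We Build Rockets"}}
tokens = "Dr. Head works at We Build Rockets Inc .".split()
print(gazetteer_lookup(tokens, gaz))
# [('Person', 'Dr. Head'), ('Organization', 'We Build Rockets')]
```

Note how the baseline's weaknesses show immediately: "Inc" is lost because the list stores only one variant, and nothing disambiguates a name that appears in two lists.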

  35. Creating Gazetteer Lists
  Online phone directories and yellow pages for person and organisation names (e.g. [Paskaleva02])
  Location lists:
  US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names (e.g. [Manov03])
  UN site: http://unstats.un.org/unsd/citydata
  Global Discovery database from Europa technologies Ltd, UK (e.g. [Ignat03])
  Automatic collection from annotated training data

  36. Structure of the Tutorial
  • task definition
  • applications
  • corpora, annotation
  • evaluation and testing
  • how to
  • preprocessing
  • approaches to NE
  • baseline
  • rule-based approaches
  • learning-based approaches
  • multilinguality
  • future challenges

  37. Shallow Parsing Approach (internal structure)
  Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g.:
  location: Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
  Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
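The "Cap. Word + key word" patterns above map naturally onto regular expressions. A minimal sketch, with the key-word lists taken from the slide (any real system would use much longer gazetteer-backed lists):

```python
import re

# internal-evidence patterns: a capitalised word followed by a
# location key word from the slide's two sets
NATURAL_KEYS = r"(?:City|Forest|Center|River)"
STREET_KEYS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"
LOCATION_RE = re.compile(
    rf"\b([A-Z][a-z]+\s+(?:{NATURAL_KEYS}|{STREET_KEYS}))\b")

text = "They met on Portobello Street, near Sherwood Forest."
print(LOCATION_RE.findall(text))  # ['Portobello Street', 'Sherwood Forest']
```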

  38. Problems with the shallow parsing approach
  Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
  Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
  Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

  39. Shallow Parsing Approach with Context
  Use of context-based patterns is helpful in ambiguous cases
  "David Walton" and "Goldman Sachs" are indistinguishable
  But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly

  40. Identification of Contextual Information
  Use KWIC index and concordancer to find windows of context around entities
  Search for repeated contextual patterns of either strings, other entities, or both
  Manually post-edit the list of patterns, and incorporate useful patterns into new rules
  Repeat with new entities

  41. Examples of context patterns
  [PERSON] earns [MONEY]
  [PERSON] joined [ORGANIZATION]
  [PERSON] left [ORGANIZATION]
  [PERSON] joined [ORGANIZATION] as [JOBTITLE]
  [ORGANIZATION]'s [JOBTITLE] [PERSON]
  [ORGANIZATION] [JOBTITLE] [PERSON]
  the [ORGANIZATION] [JOBTITLE]
  part of the [ORGANIZATION]
  [ORGANIZATION] headquarters in [LOCATION]
  price of [ORGANIZATION]
  sale of [ORGANIZATION]
  investors in [ORGANIZATION]
  [ORGANIZATION] is worth [MONEY]
  [JOBTITLE] [PERSON]
  [PERSON], [JOBTITLE]
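One such pattern applied in code: a minimal sketch of using "[PERSON] of [ORGANIZATION]" to type an otherwise unknown capitalised name, as in the Goldman Sachs example earlier. The function name, regex, and inputs are our own illustration:

```python
import re

def apply_of_pattern(text, persons):
    """Sketch of the '[PERSON] of [ORGANIZATION]' context pattern:
    once a Person is recognised, a capitalised sequence after 'of'
    can be typed as an Organization."""
    orgs = []
    for person in persons:
        # the org is a run of capitalised words right after "<person> of"
        m = re.search(re.escape(person) + r" of ((?:[A-Z][\w.&]*\s?)+)", text)
        if m:
            orgs.append(m.group(1).strip())
    return orgs

text = "A report by David Walton of Goldman Sachs was published."
print(apply_of_pattern(text, ["David Walton"]))  # ['Goldman Sachs']
```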

  42. Caveats
  Patterns are only indicators based on likelihood
  Can set priorities based on frequency thresholds
  Need training data for each domain
  More semantic information would be useful (e.g. to cluster groups of verbs)

  43. Rule-based Example: FACILE
  FACILE – used in MUC-7 [Black et al 98]
  Uses Inxight's LinguistiX tools for tagging and morphological analysis
  Database for external information, role similar to a gazetteer
  Linguistic info per token, encoded as a feature vector:
  Text offsets
  Orthographic pattern (first/all capitals, mixed, lowercase)
  Token and its normalised form
  Syntax – category and features
  Semantics – from database or morphological analysis
  Morphological analyses
  Example:
  (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (ˆPER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
  PER_CIV_F – female civilian (from database)

  44. FACILE (2)
  Context-sensitive rules written in a special rule notation, executed by an interpreter
  (Writing rules in PERL is too error-prone and hard)
  Rules of the kind A => B\C/D, where:
  A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector
  B and D are the left and right context respectively, and can be empty
  B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
  [syn=NP, sem=ORG] (0.9) =>
  \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] / ;

  45. FACILE (3)
  # Rule for the mark-up of person names when the first name is not
  # present or known from the gazetteers, e.g. 'Mr J. Cass':
  [SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]
  # _F, _I, _M, _S are variables, transfer info from RHS
  =>
  [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
  \ [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
  [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
  [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
  [SYN=NAME, TOKEN=_M]?,
  [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]  # proper name, not recognised by a rule
  / ;

  46. FACILE (4)
  Preference mechanism:
  The rule with the highest score is preferred
  Longer matches are preferred to shorter matches
  Results are always one semantic categorisation of the named entity in the text
  Evaluation (MUC-7 scores):
  Organization: 86% precision, 66% recall
  Person: 90% precision, 88% recall
  Location: 81% precision, 80% recall
  Dates: 93% precision, 86% recall
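The preference mechanism can be sketched as greedy selection among overlapping candidate analyses. This is our own simplification, not FACILE's actual code; in particular, folding "highest score, then longest match" into a single sort key is an assumption about how the two criteria combine:

```python
def prefer(candidates):
    """Greedy sketch of a preference mechanism: among overlapping
    candidate analyses (start, end, score, category), take the
    highest-scoring (ties broken by length), then fill in the
    remaining non-overlapping candidates."""
    chosen = []
    for cand in sorted(candidates,
                       key=lambda c: (c[2], c[1] - c[0]), reverse=True):
        # keep cand only if it does not overlap anything already chosen
        if all(cand[1] <= s or cand[0] >= e for s, e, _, _ in chosen):
            chosen.append(cand)
    return sorted(chosen)

candidates = [(0, 2, 0.7, "PER"),   # shorter, lower-scoring reading
              (0, 4, 0.9, "ORG"),   # higher-scoring reading wins
              (5, 6, 0.8, "LOC")]   # does not overlap, also kept
print(prefer(candidates))  # [(0, 4, 0.9, 'ORG'), (5, 6, 0.8, 'LOC')]
```

The result is a single semantic categorisation per span, matching the slide's "results are always one semantic categorisation" property.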

  47. Example Rule-based System – ANNIE
  Created as part of GATE
  GATE – Sheffield's open-source infrastructure for language processing
  GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
  GATE has a finite-state pattern-action rule language, used by ANNIE

  48. NE Components
  The ANNIE system – a reusable and easily extendable set of components

  49. Gazetteer lists for rule-based NE
  Needed to store the indicator strings for the internal structure and context rules
  Internal location indicators – e.g. {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
  Internal organisation indicators – e.g. company designators {GmbH, Ltd, Inc, …}
  Produces Lookup results of the given kind

  50. The Named Entity Grammars
  Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
  Hand-coded rules applied to annotations to identify NEs
  Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
  Use of contextual information
  Finds person names, locations, organisations, dates, addresses