
Multi-level NER in a CG framework

Eckhard Bick

Southern Denmark University

Lineb@hum.au.dk

1. NE string recognition at raw text level, pattern based

dancorp.avis, dancorp.pre, dan.pre

1.1. Name format

Recognition of author and source names (scanning & corpus impurities):

København N. Førerbevis til ældre

LUXEMBORG I over 700 år har ...

RADIO & TV Københavns Sommerunderholdning

Bedste da. resultat er opnået af Kurt Nielsen, der kom i finalen i 1953 og 1955. KuNi

INFORMATION. Det er ikke sandt ...

Kontokort Af BENNY SELANDER Stadig flere ...

HELGE ADAM MØLLER Medlem af folketinget

Headline separation problems:

Glimrende rolle Det er en ...

Forholdet til Walesa "Mit forhold til Walesa er, som det var for ti år siden," siger ...

Uppercase highlighting vs. abbreviations

MENS børnene venter, JOURNALIST Michael Larsen

NATO, UNPROFOR, USA (organisations), VHS (brand), DNA (chemical)

number of letters? Author position? Key small words?

lower-casing after triage

1.2. Name pattern recognition

Main principle: Fuse upper case strings

person names: Nyrup=Rasmussen

institutions: Odder=Lille=Friskole

organisations: ABN-AMRO=Asia=Equities

events: Australian=Open

Name chaining particles: prepositions, coordinators, …

Personal names: Maria dos Santos, Paul la Cour, Peter the Great, Margarete den Anden, Ras al Kafji, Osama bin Laden, initially: da Vinci, van Gogh, de la Vega, ten Haaf, von Weizsäcker

organisation names: Dansk Selskab for Akupunktur, University of Michigan, Golf-Centeret for Strategiske Studier, Organisationen af Olieeksporterende Lande

place names: Place de la Concorde

brands: Muscat de Beaumes de Venise

media: le Nouvel Observateur

events: Slaget på Reden

Mid-sentence name chain initiators:

Det ny Lademann Aps.
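
The fusion principle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dan.pre implementation: the function name `fuse_names` and the small `PARTICLES` set (taken from the examples on this slide) are hypothetical, and names that start with a particle (da Vinci, van Gogh) are not handled.

```python
# Hypothetical particle list, collected from the slide's examples only.
PARTICLES = {"dos", "la", "the", "den", "al", "bin", "da", "van",
             "de", "ten", "von", "for", "of", "af", "på"}


def is_capitalized(token: str) -> bool:
    """True if the token starts with an upper-case letter."""
    return token[:1].isupper()


def fuse_names(tokens):
    """Fuse runs of capitalized tokens with '=', letting a run continue
    across lower-case particles if a capitalized token follows them."""
    out = []
    i, n = 0, len(tokens)
    while i < n:
        if not is_capitalized(tokens[i]):
            out.append(tokens[i])
            i += 1
            continue
        chain = [tokens[i]]
        j = i + 1
        while j < n:
            if is_capitalized(tokens[j]):
                chain.append(tokens[j])
                j += 1
                continue
            # Look ahead over a run of particles followed by a capitalized token.
            k = j
            while k < n and tokens[k] in PARTICLES:
                k += 1
            if k > j and k < n and is_capitalized(tokens[k]):
                chain.extend(tokens[j:k + 1])
                j = k + 1
                continue
            break
        out.append("=".join(chain))
        i = j
    return out


print(fuse_names("Poul Nyrup Rasmussen besøgte University of Michigan".split()))
```

A purely pattern-based pass like this will also fuse sentence-initial function words with a following name, which is why the lexicon-supported steps in section 2 are needed.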

1.3. Name-internal punctuation as opposed to abbreviations and clause punctuation
  • initials: P.=Rostrup=Bøjesen, Carl=Th.=Pilipsen
  • web-addresses: http://www.corp.hum.sdu.dk
  • e-mails: lineb@hum.au.dk
  • personal name additions: Mr.=Bush, frk.=Nielsen, Bush=jr./sr., hr.=Jensen, ...
  • professional titles: Dr.=A.=Clarke, cand.=polit., mag.=art.
  • company names: Aps., D/S=Isbjørn
  • vehicles: H.M.S.=Polaris
  • geographicals: Nr.=Nissum, Kbh.=K, St.=Bernhard, Mt.=Everest
1.4. Name-internal numerals
  • yearly events: Landsstævne='98
  • card names: Ru=9, Sp=E
  • license plates: TF=34=322
  • town addresses: 8260=Viby=J
  • house addresses: 14a,=st.=tv.
  • kings: Christian IV
  • dated or versionized products: Windows 98
  • vehicle names: Honda Civic 1,4 GL sedan, Citroën ZX Aura 1.6i, Peugeot 206, 1.6 LX van, 1,9 TDI, DC10 fly, købe 50 V44 vindmøller
  • news channels: TV2, DR 1, Channel 4
  • bible quotes: Mt 28,1-10
1.5. In-name '&' and '/' or coordination?
  • K/S Storkemarken
  • Munster & Co., Møller & Baruah (fused company names?)
  • Hartree & V. Booth: Safety in Biological Laboratories (separate authors?)
  • NATO/FN
  • 1560 kJ/420 kcal

1.6. In-name apostrophe or quote?

  • Quotes make title-recognition easier …
    • Med den melankolske 'Light Years' anslås ...
    • Filmen "Stay Cool" blev trukket ...
    • Han havde læst Den uendelige historie.
  • … but create their own problems:
    • genitive: 'The Artist's Album', Bush' korte besked, 'Big Momma's House'
    • company names: Kellogg's
    • elision: Pic du Midi d'Osseau, Côte d'Azur, Montfort l'Amaury
    • fixed naming systems: O'Connor, O'Neill
2. NE string recognition at raw text level, lexicon supported

dan.pre

2.1. Low level lexicon: specialised word contexts

  • Left or right context: selskab/forening/institut … for
  • Name-internal 'og' vs. coordination:
    • (a) pattern: Petersen & Co.
    • (b) lexically governed: Told og Skat, Sol og Strand, Se og Hør.
  • NOT name integrating numbers:
    • (a) car companies: Peter købte en Peugeot=206. * Den gang købte Peter=206 pakkegaver.
    • (b) unit words to the right: ... tjente Toyota 6,8 procent; tallet for Europa er 60 procent og for USA 75 procent

2.2. Sentence initial non-name small words (also used for sentence separation)

  • fordi, derfor, siden, ifølge, har
  • Fordi=Peter=Jensen ikke havde sendt … -> Fordi Peter=Jensen ikke havde sendt …
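
A minimal sketch of this repair step, assuming the fused chunk is the first token of a sentence. The set `NON_NAME_INITIALS` only contains the five words listed above, and the function name is hypothetical.

```python
# Small words from the slide; the real list is presumably longer.
NON_NAME_INITIALS = {"fordi", "derfor", "siden", "ifølge", "har"}


def detach_initial_small_word(chunk: str):
    """Split a sentence-initial fused chunk if its first part is a known
    non-name small word; otherwise return the chunk unchanged."""
    head, sep, rest = chunk.partition("=")
    if sep and head.lower() in NON_NAME_INITIALS:
        return [head, rest]
    return [chunk]


print(detach_initial_small_word("Fordi=Peter=Jensen"))
```

The caller is expected to apply this only to the first chunk of a sentence; mid-sentence capitalized words are left to the other steps.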

2.3. Words, prefixes and suffixes with <+name> valency:

  • adjunkt, ...chef, historiker, institut, kollega,
  • ...erske, ...trice, virksomhed, ...ør, Hoved..., Vice....
  • lektor i børsret ved Københavns Universitet # Per Schaumburg Müller
  • Landsstyreformand # Jonathan Motzfeldt
  • Styresystemet # Windows
  • Stillehavsøen # Okinawa
  • Vicelagmand # Oli Nilsson
2.4. High level lexicon: A lexicon based chunk splitter
  • Old alternative: split [A-Z]…erne + uppercase, [A-Z]…ede + uppercase, with negative list of "forbidden words": Horsens, Jens, Vincens, Enschede
  • New alternative: check all potential substrings of a polylexical name candidate, AND the whole string, against the full name lexicon (44.000 entries)
  • Genitive splitting: allowed if first half is <hum>, <org>, <inst>, <civ> (= humanoids)
    • Sonofons <org> # GSM 900-net
    • Richard Strauss' <hum> # Zarathustra
    • New Yorks <civ> # Manhattan
    • Beach Boys' <org> # Brian Wilson <hum>
  • Refuse split genitives with certain second half geographicals: Jensens Plads, Rådmans Boulevard
  • Variable split points (more lexicon checks necessary than in genitives):
    • Kommende ambassadør i Kairo # Christian Oldenburg
    • Bagefter hentede Peter # Maria
    • Så ansatte IBM # Kevin Mondale
    • Derfor forlod Jensen (Peter Jensen) # (FC) København
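
The "new alternative" and the genitive condition can be illustrated with a toy lexicon. Everything here is an assumption made for the sketch: the `LEXICON` entries and their types, the function names, and the simplified genitive stripping (a trailing apostrophe or a trailing -s). The actual splitter works on a 44.000-entry lexicon and performs more checks.

```python
# Toy stand-in for the full name lexicon; entries and types are illustrative.
LEXICON = {
    "Christian=Oldenburg": "hum",
    "Kevin=Mondale": "hum",
    "Jensen": "hum",
    "Kairo": "civ",
    "New=York": "civ",
    "IBM": "org",
    "Sonofon": "org",
}
HUMANOID_TYPES = {"hum", "org", "inst", "civ"}
GEO_SECOND_HALVES = {"Plads", "Boulevard"}  # from the slide's two examples


def split_chunk(parts, lexicon=LEXICON):
    """Check the whole candidate, then every two-way split, against the lexicon."""
    whole = "=".join(parts)
    if whole in lexicon:
        return [whole]
    for cut in range(1, len(parts)):
        left, right = "=".join(parts[:cut]), "=".join(parts[cut:])
        if left in lexicon and right in lexicon:
            return [left, right]
    return [whole]  # no evidence for a split: keep the chunk as it is


def genitive_base(word):
    """Very rough genitive stripping: Strauss' -> Strauss, Sonofons -> Sonofon."""
    if word.endswith("'") or word.endswith("s"):
        return word[:-1]
    return None


def may_split_genitive(first, second, lexicon=LEXICON):
    """Allow a split after a genitive only for 'humanoid' first halves,
    and refuse it before certain geographical second halves."""
    if second in GEO_SECOND_HALVES:
        return False
    base = genitive_base(first)
    return base is not None and lexicon.get(base) in HUMANOID_TYPES


print(split_chunk(["IBM", "Kevin", "Mondale"]))
print(may_split_genitive("Sonofons", "GSM"))
```

Returning the unsplit chunk when no split is supported keeps the step conservative, which fits the later NE chaining stage acting as a repair mechanism.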
3. NE word recognition, morphological analyzer with lexical data and compositional rules

Dantag, danpost

3.1. Recognized names

6 major & 20 minor categories:

  • <hum> person names
  • <org> organisations
  • <top> place names
  • <occ> events
  • <tit> semantic product names
  • <brand> brands, objects

Full lexicon match ?

e.g. known first, unknown second name: Morten Kaminski; known company name with geographic extension: Toshiba Denmark

Partial lexicon match ?

1. Name – other: Hans, Otte

2. Name NOM – Name GEN

3. Cross type: Lund(place/person), Audi(company/vehicle)

4. Systematic/underspecified: <media> = <org>/<tit>

Ambiguity

3.2. Compositional analysis

No name recognized

Lower case run

Full inflexional & derivational analysis for all word classes

ANC-kontor N, G8-mødet N, Talebanstyrel N, Martinicocktails N, Marsåret N, AB'ernes [AB'er] N, AGF'ere N, EU-godkendt ADJ, Heisenberg'ske ADJ

Compositional analysis

1. PROP  N: EMSen (N DEF), EMS'en (N DEF)2. Frozen usage (lexicon): Sovjetunionen, Folketinget (PROP)

Inflected names

  • Hyphen  <hum>: Jean-Pierre Wallez, Blomster-Jensenbut: hovedvej Pec-Prizren <top>, Lolland-Falster <top>Al-Qaida <org>, CO-Industri <org>, Jyllands-Posten <media>
  • Iterations with exchanged and omitted '-' and '=': Al=Qaida
  • (c) Heuristic full-string name reading

Heuristics

4.1. NE semantic type prediction, semi-lexical compositional heuristics

Cg2adapt.dansk

  • Respects non-heuristic types from the analyzer/lexicon
  • Tries to verify/falsify semi-lexical type analyses
  • Uses patterns, suffixes, clue-word lists to predict types
  • To prevent interference between individual sections:
    • ordering type predictions (e.g. <tit> early, <top> late)
    • iterating certain classes (e.g. <hum>)
    • NOT-conditions quoting partial or overlapping patterns that would indicate other semantic name classes
  • Prepares cg-level: <non-hum> prediction
    • non-alphabetic characters, in-word capitals, coordinators (og, eller), certain English function words (of), non-human suffixes (-tion) etc.
4.2. NE semantic type predictor: Patterns

  • <tit> e.g. quotes, in-name function words (articles, pronouns etc.), "semantic things" (-loven, -brev, -song, -report, Circulære=, Redegørelse=, Dictionary= ...)
  • <media> e.g. -avis, -bladet, -tidende, Ugeskrift=, Kanal=, Channal=, Nyt= ...
  • <occ> e.g. Expedition, -freden, -krig, -krise, =Rundt, Projekt=, Konference=, Slaget=
  • <V> e.g. Boeing/Mercedes/Toyota=, =Combi, =Sedan, HMS=, USS=, M/S= ...
  • <brand> e.g. Macintosh/Phillips/Sanyo=[0-9]; wine types: =Appelation, =Cru, =Sec, Edition, Yamaha/Siemens=; quality markers: =Extra, =de=Luxe, =Ultra ...
  • <hum> e.g. suffixes: -sen, -sson, -sky, -owa; infixes: ibn, van, ter, y, zu, di; abbreviated and part-of-name titles: frk., hr., Madame, Mlle, Morbror, jr., sr., Mc=, Al=, =Khan
  • <A><B> e.g. [A-Z][a-z]+(=[a-z]+([ae]ns|ea|is|um|us))+
  • <civ> e.g. =SSR/Republik, =Town/Ville; suffixes: -ager, -borough, -bølle, -dorf, -hausen, -løse, -ville, -polis (a number of these will receive both <civ> and <hum> tags for later disambiguation)
  • <top> e.g. =Bahnhof, =Bakker, =Kirke, =Manor, =Sund, =Prospekt, Islas=, Ciudad=, Gammel=, Lake=, Rio=, Sønder/Vester/Øster/Nørre=; suffixes: -fors, -kanten, -kvarteret, -marken; addresses: -stien, -strasse, -torv, -gade, -vej (the latter are also used by dantag)
  • <org> e.g. in-word capitals: [a-z][A-Z] (MediaSoft); "suffixes": Amba, GmbH, A/S, AG, Bros., & Co ...; type indicators: =Holding, =Organisation, =Society, =Network, Bank=of=, Banco=d[eiao], K/S, I/S, Klub=, Fonden=; morphological indicators: -con, -com, -ex, -rama, -tech, -soft
  • <inst> e.g. =Ambassade, =Airport?, =Børnehave, =Institut, =Universitet, =Bibliotek, =Hotel, Chez=; morphologicals: -eriet, -værk, -handel
  • <mat> e.g. -[cpt]am, -[cz]id, -lax, -vent, Retard=; uppercase + number (NO2, H2O)
  • <common> e.g. =Collection, =Samling, Ugens=; cards: Spa?=, Ru=
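
To make the idea of pattern-based type prediction concrete, here is a small sketch that turns a handful of the clues above into regular expressions. It is not the cg2adapt.dansk rule set: the `RULES` table, its ordering and the function `predict_types` are assumptions for illustration, and only a few clues per type are included.

```python
import re

# (tag, pattern) pairs; only a small, illustrative selection of the slide's clues.
RULES = [
    ("media", re.compile(r"(avis|bladet|tidende)$")),
    ("occ", re.compile(r"(freden|krig|krise)$|^(Projekt|Konference|Slaget)=")),
    ("org", re.compile(r"[a-z][A-Z]|=(Holding|Organisation|Society|Network)$|(tech|soft)$")),
    ("inst", re.compile(r"=(Ambassade|Institut|Universitet|Bibliotek|Hotel)$")),
    ("hum", re.compile(r"(sen|sson|sky|owa)$")),
    ("civ", re.compile(r"(borough|bølle|dorf|hausen|løse|ville|polis)$")),
    ("top", re.compile(r"(stien|strasse|torv|gade|vej)$|^(Lake|Rio|Ciudad)=")),
]


def predict_types(name: str):
    """Return every type tag whose pattern matches the fused name string.
    More than one tag may be returned; disambiguation is left to later stages."""
    return [tag for tag, pattern in RULES if pattern.search(name)]


for example in ["MediaSoft", "Stiftstidende", "Amagerbrogade", "Slaget=på=Reden", "Rasmussen"]:
    print(example, predict_types(example))
```

Returning a list rather than a single tag mirrors the remark that some names receive both <civ> and <hum> tags for later disambiguation.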

5. NE word class and case disambiguation, rule and context based

dancg.morf (ca 3.300 rules)

Context based decisions are safer than pattern based predictions, and support each other

Full valency and semantic class context can be drawn upon

Iterated disambiguation creates safer context for more dangerous decisions

  • <+name> valency of preceding noun: filmen Tornfuglene PROP <tit>
  • semantic product class <sem> in preceding noun: Lynda La Plantes tv-serie "Mistænkt"
  • topologicals rather than topological-derived nouns: Amagerbrogade PROP <top>
  • establishment NOM (not hum-GEN), if no np-head to the right: vi spiste på Marion's i går
  • GEN - GEN and NOM - NOM coordination matches: Peters NOM og Jensen kom kørende
  • NOM name readings are discarded in favor of GEN names, if there is an IDF noun or NOM name to the right with only matching prenominals in between: Australiens mest kendte sangere.
  • Sentence-initially, names are discarded in favor of verbs and function words, if followed by an np
  • non-compound nouns are favoured over heuristic names
  • heuristic names are favoured over compound names in a left lower case context
  • non-heuristic names are favoured over compound or derived nouns sentence-initially or in left upper case context
6. NE chaining, a repair mechanism for faulty NE string recognition at levels (1) and (2)

cleanmorf.dan

Performs chunking choices too hard or too ambiguous to make before CG:

  • Fuses Hans=Jensen and Otte=Nielsen, but keeps Hans Porsche and Otte PC'er, using CG-recognition of Jensen PROP <hum>, Nielsen PROP <hum>, Porsche PROP <V> and PC'er N <cc-h>
  • Fuses PROP and certain semantic N-types, if upper case and so far unrecognized:
    • PROP + N <build> -> PROP <top>: Betty=Nansen Broen
    • PROP + N <HH> -> PROP <org>: Betty=Nansen Foreningen
    • PROP + N <sem> -> PROP <tit>: Betty=Nansen Prisen
  • Repairs erroneous PROP splitting by the preprocessor, if later contextual typing asks for fusion:
    • PROP <org, media> + PROP <top, civ>: Dansk=Røde=Kors Afrika
    • PROP <civ> + PROP <org, inst>: Danmarks Monetære Institut
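
The second fusion type (PROP + semantically typed N) can be sketched as a pass over already tagged tokens. The `Token` class, the `FUSION_TYPES` table and the function `chain` are hypothetical names for this illustration; the "so far unrecognized" condition is not modelled, and cleanmorf.dan itself is not shown in the slides.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str      # word form, possibly already fused with '='
    pos: str       # word class, e.g. "PROP" or "N"
    sem: str = ""  # semantic tag without angle brackets, e.g. "hum", "build"


# N type of the second word -> name type of the fused result (from the slide).
FUSION_TYPES = {"build": "top", "HH": "org", "sem": "tit"}


def chain(tokens):
    """Fuse PROP + upper-case N if the noun carries one of the listed types."""
    out = []
    for tok in tokens:
        prev = out[-1] if out else None
        if (prev is not None and prev.pos == "PROP" and tok.pos == "N"
                and tok.form[:1].isupper() and tok.sem in FUSION_TYPES):
            out[-1] = Token(prev.form + "=" + tok.form, "PROP", FUSION_TYPES[tok.sem])
        else:
            out.append(tok)
    return out


print(chain([Token("Betty=Nansen", "PROP", "hum"), Token("Broen", "N", "build")]))
```

Because the decision uses the noun's semantic type, it can only be made after CG disambiguation, which is the reason given above for postponing these chunking choices.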
7. NE function classes, mapped and disambiguated by context based rules

dancg.syn (ca. 4000 rules)

Handles, among other things, the syntactic function and attachment of names. The following are examples of functions relevant to the subsequent type mapper:

(i) @N< (nominal dependents)

præsident Bush, filmen "The Matrix"

(ii) @APP (identifying appositions)

Forældrebestyrelsens forman, Kurt Chistensen, anklager borgmester ...

(iii) @N<PRED (predicating appositions)

John Andersen, distrikschef, Billund, 60 år

8. NE semantic types, mapped and disambiguated by context based rules

dancg.prop (428 rules)
  • Type mapper (introduces ambiguity, instantiates earlier tags)
  • Type disambiguator (reduces ambiguity)
  • Uses the same 6 major and 20 subcategories used by the lexicon and pattern based name predictor
  • Draws on syntactic relations, sentence context and lexical knowledge
  • Can override previously assigned type readings
  • Can disambiguate previously ambiguous readings
8.1 Cross-nominal prototype transfer
  • Post-nominal attachment: i byen Rijnsburg
    MAP (<top>) TARGET (PROP @N<) (-1 (N NOM) LINK 0 N-TOP) ;
  • Missing hyphen: Uppenskij katedralen
    MAP (<top>) TARGET (PROP) (1 @N<FUSE LINK 0 N-TOP) ;
  • Subject complement inference: Moskva er en by i Rusland
    SELECT (<top>) (0 @SUBJ>) (*1 @MV LINK 0 <vk> LINK *1 @<SC LINK 0 N-TOP) ;
  • Mines semantic N-types from relative clauses: Strongyle, som de gamle grækere kaldte øjen
    SELECT (<top>) (0 NOM) (*1 (<rel> INDP @SUBJ>) BARRIER NON-KOMMA LINK *1 VFIN LINK 0 @FS-N< LINK -1 ALL LINK *1 @MV LINK 0 <vk> LINK *1 @<SC LINK 0 N-TOP) ;
    MAP (%top) TARGET (PROP NOM) (*1 ("som") BARRIER NON-KOMMA LINK 0 @OC> LINK *1 @MV LINK 0 ("kalde" AKT) LINK *1 N-TOP BARRIER NON-PRE-N/ADV LINK 0 @<ACC) ;
  • "Som"-comparison: tv-programmer som "Robinson-Ekspeditionen" (here, <tit> overrides previous <occ>)
    MAP (%tit) TARGET (PROP NOM) (0 @P< OR @AS<) (-1 ("som") LINK 0 @N< OR @AS-N<) (-2 N-SEM) (NOT -2 N-HUM) ;
8.2 Coordination based type inference

1. Maps "close coordinators" (&KC-CLOSE):
   ADD (&KC-CLOSE) TARGET (KC) (*1 @SUBJ> BARRIER @NON->N) (-1 @SUBJ>)

2. Then uses these tags in disambiguation rules, e.g. Arafat @SUBJ> og hans Palæstinas=Selvstyre @SUBJ>
   REMOVE %non-h (0 %hum-all) (*-1 &KC-CLOSE BARRIER @NON->N LINK -1C %hum OR N-HUM LINK 0 NOM) ;
   SELECT (<top>) (1 &KC-CLOSE) (*2C <top> BARRIER @NON->N) ;

3. Danish has <hum>-only and <non-hum> pronouns:
   SELECT %hum (0 @SUBJ>) (1 KC) (2 ("han" GEN) OR ("hun" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Hejberg og hans skole
   REMOVE %hum (0 @SUBJ>) (1 KC) (2 ("den" GEN) OR ("det" GEN)) (*3 @SUBJ> BARRIER @NON->N/KOMMA) ; # Anden Verdenskrig og dens mange slag
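
The effect of the two pronoun rules in point 3 can be imitated outside CG on a set of candidate types. This is a simplified sketch: `filter_types` and the word sets are hypothetical, the genitive forms are written out as surface words, and the requirement of a following coordinated subject is left out.

```python
HUMAN_GEN = {"hans", "hendes"}   # genitives of han / hun
NONHUMAN_GEN = {"dens", "dets"}  # genitives of den / det
COORDINATORS = {"og", "eller"}


def filter_types(candidates, next_tokens):
    """Narrow the candidate types of a subject name using a coordinated
    genitive pronoun to its right (name + coordinator + pronoun ...)."""
    if len(next_tokens) >= 2 and next_tokens[0] in COORDINATORS:
        pronoun = next_tokens[1].lower()
        if pronoun in HUMAN_GEN and "hum" in candidates:
            return {"hum"}                      # like SELECT %hum
        if pronoun in NONHUMAN_GEN:
            remaining = candidates - {"hum"}    # like REMOVE %hum
            return remaining or candidates      # never delete the last candidate
    return candidates


print(filter_types({"hum", "top"}, ["og", "hans", "skole"]))
print(filter_types({"hum", "occ"}, ["og", "dens", "mange", "slag"]))
```

Keeping the last remaining candidate is a deliberate choice in the sketch, so that a name is never left without any type.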

8.3 PP-contexts
  • Word-specific narrow context
    MAP (<top>) TARGET (PROP) (-1 ("for" PRP)) (-2 ("syd") OR ("vest") OR ("nord") OR ("øst")) ;
  • Np-level vs. clause level function
    ADD (<top>) TARGET (PROP @P<) (-1 ("i" PRP)) (NOT -1 @PIV) (NOT -2 <+i>) ; (safe, early rule)
    REMOVE (<top>) (0 @P<) (-1 ("i" PRP)) (-2 (<+i>)) ; (heuristic, later rule)
  • Pp-attachment inference, class based: godt 40 km fra Madras
    REMOVE %non-top (-1 ("fra" PRP) OR ("til" PRP)) (-2 N-DIST) (-3 NUM) ;
  • Pp-attachment inference, word list based
    MAP (%org) TARGET (PROP NOM @P<) (-1 ("i" PRP)) (-2 ("afdelingsleder") OR ("ansat") OR ("chef") OR ("direktør") OR ("forvaltningschef") OR ("koordinator") OR ("personalechef") OR ("souschef")) (NOT 0 <top> OR <civ>) ;
8.4 Genitive mapping
  • MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG) (NOT 0 <inst> OR <media> OR <party> OR <civ> OR <top>) ;# Microsofts generalforsamling/aktiekurs ("hard" GEN-ORG set) 
  • MAP (%org) TARGET (GEN @>N) (*1 (N IDF) BARRIER @NON->N LINK 0 GEN-ORG/HUM) (NOT 0 <inst> OR <media> OR <party> OR <civ> OR <top> OR <hum>) ;# Microsofts/ Bill=Gates advokat/hjemmeside ("soft" GEN-ORG set)
  • REMOVE %non-h (0 GEN LINK 0 %h) (*1 N BARRIER @NON->N/KOMMA LINK 0 (<p>) OR (<pp>)) ; # owning thoughts and "thought products". %non-h respects also "humanoids", <org>, <civ> etc.
8.5 Prenominal context: Using adjective classes

Uses semantic adjective classes, e.g.

  • Type based, more general, less safe: LIST ADJ-HUM = <Dphys> <Dpsych> <Dsoc> <Drel> ;
  • Word based, more specific and safer: LIST ADJ-HUM& = <alder> "adfærdsvanskelig" "adspredt" "affektlabil" "afklaret" "afmægtig" "afslappet" "afstumpet" "afvisende" "agtbar" "agtpågivende" "agtsom" "alert" "alfaderlig" "alkærlig" "altopgivende" "altopofrende" "alvorsfuld" ....

MAP (%hum) TARGET (<heur> PROP NOM) (-1 AD LINK 0 ADJ-HUM&) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura

ADD (%hum) TARGET (<heur> PROP NOM) (-1 AD LINK 0 ADJ-HUM) (*-2 (ART S DEF) BARRIER @NON->N) ; # Den langlemmede Kanako=Yonekura