A Robust Shallow Parser for Swedish

Ola Knutsson, Johnny Bigert, Viggo Kann. Royal Institute of Technology, Sweden.


Presentation Transcript


  1. A Robust Shallow Parser for Swedish. Ola Knutsson, Johnny Bigert, Viggo Kann, Royal Institute of Technology, Sweden

  2. Introduction. What is robustness? Robustness against noisy, ill-formed and partial natural language data.

  3. Shallow parsing. Many NLP applications do not need full parsing. Shallow parsing: a parsing approach; pre-processing for full parsing; a collection of techniques. Abney: finite-state cascades (1991). Currently, ML receives a lot of attention. Well suited to modularization.

  4. Chunking and phrase identification. Common modules in a shallow parser: tokenizer, PoS tagger, chunker, phrase identifier, grammatical function identifier.

  5. Chunking: [NP Den mycket gamla mannen][VC gillade][NP mat]
     Phrase identification: [NP Den [AP mycket gamla] mannen][VC gillade][NP mat]

  6. Parsers for Swedish. Full parsers: UCP (Sågvall Hein) and SLE (Gambäck). Shallow parsers (phrase structure): Cass-Swe (Kokkinakis), and Megyesi (using machine learning). Dependency parsers: CG (Birn) and FDG (Voutilainen).

  7. Granska Text Analyzer (GTA). Hand-crafted rules; context-free backbone; partly object-oriented notation.

  8. Major Phrase Categories
     NP: Han såg den lilla mannen på bänken
     VC: Han har spelat kort hela natten
     PP: Han såg spår i sanden
     AP: Han ogillade små vita lögner
     ADVP: Han vill inte gå på bio
     INFP: Han tycker om att spela

  9. Clause Boundary Identification. Based on Ejerhed’s algorithm; context-sensitive rules; uses only PoS information.

  10. Different kinds of rules. GTA contains 260 rules: 200 identify phrase structure, 20 identify clause boundaries, and 40 are selection rules (disambiguation).

  11. Example rule, [NP den lilla bilen]:
      NPmin@
      {
        X(wordcl=dt | wordcl=hd | wordcl=rg),
        X2(wordcl=ab | wordcl=rg)?,
        Y(wordcl=jj | wordcl=ro | wordcl=pc)*,
        Z(wordcl=nn)
        -->
        action(help, wordcl:=Z.wordcl, pnf:=undef, gender:=Z.gender,
               num:=Z.num, spec:=Z.spec, case:=Z.case)
      }
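
The word-class pattern in the rule above (X, an optional X2, any number of Y, then Z) can be sketched in Python. This is only an illustration of the matching part, not GTA's rule engine: the function name `match_np_min` and the example tag sequences are assumptions, and the feature assignments in the action part are left out.

```python
# Word classes accepted by each element of the NPmin rule on the slide.
X_TAGS = {"dt", "hd", "rg"}    # X: exactly one
X2_TAGS = {"ab", "rg"}         # X2: optional
Y_TAGS = {"jj", "ro", "pc"}    # Y: zero or more
Z_TAGS = {"nn"}                # Z: the head noun, exactly one


def match_np_min(tags, start=0):
    """Return the end index (exclusive) of an NPmin match at `start`, or None.

    `tags` is a list of word-class tags, one per token. The tag sets for
    X2, Y and Z are disjoint, so a single left-to-right pass is enough.
    """
    i = start
    if i >= len(tags) or tags[i] not in X_TAGS:
        return None
    i += 1
    if i < len(tags) and tags[i] in X2_TAGS:
        i += 1
    while i < len(tags) and tags[i] in Y_TAGS:
        i += 1
    if i < len(tags) and tags[i] in Z_TAGS:
        return i + 1
    return None


# Hypothetical tag sequences for "den lilla bilen" and "den mycket gamla mannen".
short_np = match_np_min(["dt", "jj", "nn"])
long_np = match_np_min(["dt", "ab", "jj", "nn"])
no_np = match_np_min(["jj", "nn"])
```

In the rule itself, the matched phrase then takes its word class, gender, number, species and case from Z, which is the token at index `end - 1` in this sketch.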

  12. Clause boundary rule:
      cl@
      {
        V(sed!=sen & text!="som" & wordcl!=sn),
        X((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |
          (wordcl=nn & case=nom & V.case!=gen) | wordcl=ab),
        ---endleftcontext---,
        Y(wordcl=kn),
        ---beginrightcontext---,
        Y2(((wordcl=pn & pnf=sub) | (wordcl=pm & case=nom) |
            (wordcl=nn & case=nom) | wordcl=ab) & wordcl=X.wordcl),
        Z(wordcl=vb & (vbf=prs | vbf=prt | vbf=imp))
        -->
        action(help, wordcl:=Y.wordcl)
      }

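  13. The Tetris Algorithm. Figure (layout lost in extraction) showing phrase blocks for the sentence "Fänrik Ax gav boken till general Claes Olsson": [PP till general], [PP till general Claes], [NP general Claes Olsson], and the full row [NP Fänrik Ax][VC gav][NP boken][PP till general Claes Olsson].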
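
The slide does not spell out how the Tetris algorithm chooses among overlapping phrases, so the following is only a guess at the general idea: a greedy selection that keeps the longest candidate spans first and drops anything that overlaps a span already kept. The function name `select_longest`, the span representation, and the candidate list are assumptions made for illustration.

```python
def select_longest(candidates):
    """Greedily keep the longest spans that do not overlap a kept span.

    Each candidate is (label, start, end) with `end` exclusive.
    Ties in length are broken by the leftmost start.
    """
    kept = []
    for label, start, end in sorted(candidates, key=lambda c: (-(c[2] - c[1]), c[1])):
        if all(end <= s or start >= e for _, s, e in kept):
            kept.append((label, start, end))
    return sorted(kept, key=lambda c: c[1])


tokens = "Fänrik Ax gav boken till general Claes Olsson".split()

# Candidate phrases, including the overlapping ones visible on the slide.
candidates = [
    ("NP", 0, 2), ("VC", 2, 3), ("NP", 3, 4),
    ("PP", 4, 6), ("PP", 4, 7), ("NP", 5, 8), ("PP", 4, 8),
]

chunks = select_longest(candidates)
bracketed = " ".join(f"[{label} {' '.join(tokens[s:e])}]" for label, s, e in chunks)
print(bracketed)
```

With these candidates the sketch keeps the four phrases of the full row on the slide. It discards the NP nested inside the PP, which a phrase identifier would presumably retain.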

  14. The IOB format (Ramshaw and Marcus 1995). A phrase/clause tag contains two parts: (1) the phrase/clause type, e.g. NP, PP; (2) one of two position tags: I = inside a phrase/clause, B = beginning of a phrase/clause. A word that does not belong to any phrase is tagged O = outside.
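
As a small sketch of the tagging scheme, the chunked sentence from slide 5 can be converted to IOB tags. The tag spelling (type followed by B or I, as in NPB) follows the output shown on the later slides; the function name `chunks_to_iob` and the span representation are assumptions, and only flat, non-nested chunks are handled.

```python
def chunks_to_iob(n_tokens, chunks):
    """Turn non-overlapping (label, start, end) chunks into one IOB tag per token.

    `end` is exclusive. Tokens not covered by any chunk are tagged "O".
    """
    tags = ["O"] * n_tokens
    for label, start, end in chunks:
        tags[start] = label + "B"          # first word begins the phrase
        for i in range(start + 1, end):
            tags[i] = label + "I"          # remaining words are inside it
    return tags


tokens = "Den mycket gamla mannen gillade mat".split()
chunks = [("NP", 0, 4), ("VC", 4, 5), ("NP", 5, 6)]

iob = chunks_to_iob(len(tokens), chunks)
for word, tag in zip(tokens, iob):
    print(word, tag)
```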

  15. Disagreement error
      De           dt.utr/neu.plu.def              NPB      CLB
      gamla        jj.pos.utr/neu.plu.ind/def.nom  APB|NPI  CLI
      äppelträdet  nn.neu.sin.def.nom              NPI      CLI
      kan          vb.prs.akt.mod                  VCB      CLI
      bli          vb.inf.akt.kop                  VCI      CLI
      som          kn                              O        CLI
      nya          jj.pos.utr/neu.plu.ind/def.nom  APB      CLI
      .            mad                             O        CLI
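
Each token in the output above carries four whitespace-separated fields: the word, its PoS tag, one or more phrase tags joined by `|`, and a clause tag. A minimal sketch of reading one such line into a record follows; the names `TaggedToken` and `parse_line` are hypothetical, and the format is inferred from the slides rather than from GTA documentation.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str
    phrases: list   # (type, position) pairs in the order printed, e.g. [("AP", "B"), ("NP", "I")]
    clause: str     # "CLB" or "CLI"


def parse_line(line):
    """Split one output line into word, PoS tag, phrase tags and clause tag."""
    word, pos, phrase_col, clause = line.split()
    if phrase_col == "O":
        phrases = []    # the word is outside every phrase
    else:
        # The last character is the position (B or I); the rest is the phrase type.
        phrases = [(tag[:-1], tag[-1]) for tag in phrase_col.split("|")]
    return TaggedToken(word, pos, phrases, clause)


nested = parse_line("gamla jj.pos.utr/neu.plu.ind/def.nom APB|NPI CLI")
outside = parse_line("som kn O CLI")
print(nested.phrases, outside.phrases)
```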

  16. Partial input
      Arrangör             nn.utr.sin.ind.nom  NPB      CLB
      var                  vb.prt.akt.kop      VCB      CLI
      Järfälla             pm.gen              NPB|NPB  CLI
      naturskyddsförening  nn.utr.sin.ind.nom  NPB|NPI  CLI
      där                  ab                  ADVPB    CLI
      är                   vb.prs.akt.kop      VCB      CLI
      medlem               nn.utr.sin.ind.nom  NPB      CLI
      .                    mad                 O        CLI

  17. Noisy data
      Inte         ab                      APB            CLB
      så           ab                      ADVPB|APB|API  CLI
      tjck         jj.pos.utr.sin.ind.nom  APB|API|API    CLI
      som          ha                      O              CLB
      det          pn.neu.sin.def.sub/obj  NPB            CLI
      ofta         ab.pos                  ADVPB          CLI
      står         vb.prs.akt              VCB            CLI
      i            pp                      PPB            CLI
      lärobökerna  nn.utr.plu.def.nom      NPB|PPI        CLI
      ;            mid                     O              CLI

  18. Word order violation
      Ympkvisten  nn.utr.sin.def.nom      NPB        CLB
      inte        ab                      ADVPB      CLI
      ska         vb.prs.akt.mod          VCB        CLI
      vara        vb.inf.akt.kop          VCI        CLI
      sådär       ab                      ADVPB|APB  CLI
      lång        jj.pos.utr.sin.ind.nom  APB        CLI
      ,           mid                     O          CLI

  19. Evaluation. Manually corrected output from GTA; untuned GTA in the evaluation; 15 000 words from SUC; 5 genres.

  20. F-scores for individual phrase types

  21. F-score for clause boundary identification. The F-score for a baseline identifier was 69.0%.
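
The slides do not say how the F-scores were computed. Assuming the usual balanced F-score (the harmonic mean of precision and recall) over exactly matching items, a sketch looks like this; the function name `f_score` and the toy gold and predicted spans are illustrative only and are not taken from the evaluation.

```python
def f_score(gold, predicted):
    """Balanced F-score over exact matches between two collections of items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)   # share of predicted items that are right
    recall = correct / len(gold)           # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical phrase spans (label, start, end) for one sentence.
gold = [("NP", 0, 4), ("VC", 4, 5), ("NP", 5, 6)]
predicted = [("NP", 0, 4), ("VC", 4, 5), ("NP", 4, 6)]

score = f_score(gold, predicted)
print(round(score, 3))
```

For clause boundaries the same function could be applied to sets of boundary positions instead of phrase spans.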

  22. Applications with GTA. We are using GTA in: grammar checking (statistical and rule-based); clustering of medical texts; CALL systems. What do you want to do with GTA?

  23. More information: www.nada.kth.se/theory/projects/xcheck
      Contact: Ola Knutsson, knutsson@nada.kth.se
