Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer. Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland. Treebank Workshop, NAACL 2007.

Treebank-Based Acquisition of Multilingual LFG Resources for Parsing, Generation and Transfer

Josef van Genabith, National Centre for Language Technology (NCLT), Dublin City University, Ireland

Treebank Workshop NAACL 2007

Treebank-Based Acquisition of Multilingual LFG Resources


Lexical-Functional Grammar (LFG)

  • “Shallow” grammar: defines language (set of strings)

  • “Deep” grammar: as above, plus a mapping from strings to a “meaning” representation (predicate-argument structure, dependencies, simple logical form, …); usually involves some form of long-distance dependency (LDD) resolution

  • Deep grammars (HPSG, LFG, CCG, TAG …) usually hand-crafted

  • Very difficult & expensive to scale to unrestricted text

  • Motivation for treebank-based deep grammar acquisition (LFG/CCG/HPSG/TAG/DepGr/…)!!

  • LFG: [Kaplan and Bresnan, 1982; Dalrymple, 2001; Bresnan, 2001]

  • Constraint-based (“unification”), lexicalised

  • c(onstituent)-str & f(unctional) structure

  • c-str: surface configuration (CFG trees)

  • f-str: abstract grammatical functions/relations (SUBJ, OBJ, OBL, COMP, XCOMP, ADJN, POSS, APP, …)

  • f-str: AVM (feature-structure) encoding of dependencies/pred-arg.

Lexical-Functional Grammar (LFG)

  • Treebank: trees

  • How do we get from trees to f-structures?

  • What’s missing is the equations!

  • Automatic f-structure annotation algorithm

  • Traverses tree and assigns LFG equations

  • Principle-based c-str/f-str interface
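To make the role of the equations concrete, here is a minimal sketch under a toy encoding that is not taken from the slides. Each node carries either the annotation ↑=↓ (the daughter shares its mother's f-structure) or the name of a grammatical function GF, standing for (↑ GF)=↓. The names `Node` and `build_fstr` are hypothetical, and real LFG constraint solving covers much more (sets, re-entrancies, constraining equations).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    cat: str                        # c-structure category, e.g. "NP"
    eq: str                         # "↑=↓", or a GF name standing for (↑ GF)=↓
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None      # set on lexical (leaf) nodes only


def build_fstr(node: Node, fstr: Optional[Dict] = None) -> Dict:
    """Fill `fstr`, the f-structure of `node`, from its annotated daughters."""
    if fstr is None:
        fstr = {}
    if node.word is not None:
        # Toy lexical entry (assumption): the word form itself serves as PRED.
        fstr["PRED"] = node.word
    for child in node.children:
        if child.eq == "↑=↓":
            # Head daughter: same f-structure as the mother.
            build_fstr(child, fstr)
        else:
            # (↑ GF)=↓: the daughter's f-structure is the value of GF.
            build_fstr(child, fstr.setdefault(child.eq, {}))
    return fstr


# Annotated tree for "John sees him".
tree = Node("S", "↑=↓", [
    Node("NP", "SUBJ", [Node("NNP", "↑=↓", word="John")]),
    Node("VP", "↑=↓", [
        Node("VBZ", "↑=↓", word="sees"),
        Node("NP", "OBJ", [Node("PRP", "↑=↓", word="him")]),
    ]),
])
fstr = build_fstr(tree)
print(fstr)
```

Without the `eq` annotations the tree alone does not determine the f-structure, which is why the treebank trees need an annotation step before f-structures can be read off.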

F-Structure Annotation Algorithm

  • Algorithm exploits:

    • Categorial information (NP, VP, VBZ, …)

    • Configurational information:

      • Local head, left/right of head

      • Leftmost NP sister to right of V(erbal) head: (↑ OBJ)=↓

    • Morphological information:

      • Him: (↑ OBJ)=↓

    • “Functional” tag information:

      • -LGS (↑ PASSIVE)=+ , -SBJ, -CLR, …

    • Trace/co-indexation information

      • Translate traces + co-indexation to corresponding re-entrancies at f-str.
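The following is a minimal sketch of how categorial, configurational and functional-tag information could be combined to annotate one local tree at a time. The head table, the tag table and the ADJUNCT default are heavily simplified assumptions made for illustration, and the names `Node`, `find_head` and `annotate` are hypothetical; this is not the actual annotation algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    cat: str                                    # e.g. "NP-SBJ", "VP", "VBZ"
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    equation: Optional[str] = None              # filled in by annotate()


# Assumed toy head table: mother category -> head categories by priority.
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VBZ", "VBD", "VBP", "VB", "VP"],
    "NP": ["NN", "NNS", "NNP", "PRP"],
}
# Assumed toy mapping from treebank functional tags to equations.
FUNCTIONAL_TAGS = {"SBJ": "(↑ SUBJ)=↓"}


def base(cat: str) -> str:
    """Strip functional tags: "NP-SBJ" -> "NP"."""
    return cat.split("-")[0]


def find_head(node: Node) -> Optional[int]:
    """Index of the head daughter according to HEAD_RULES, if any."""
    for wanted in HEAD_RULES.get(base(node.cat), []):
        for i, child in enumerate(node.children):
            if base(child.cat) == wanted:
                return i
    return None


def annotate(node: Node) -> None:
    """Assign an equation to every daughter, then recurse."""
    head = find_head(node)
    obj_assigned = False
    for i, child in enumerate(node.children):
        tags = child.cat.split("-")[1:]
        tag_eq = next((FUNCTIONAL_TAGS[t] for t in tags if t in FUNCTIONAL_TAGS), None)
        if i == head:
            child.equation = "↑=↓"                       # local head
        elif tag_eq is not None:
            child.equation = tag_eq                      # functional tag
        elif (base(node.cat) == "VP" and head is not None and i > head
              and base(child.cat) == "NP" and not obj_assigned):
            child.equation = "(↑ OBJ)=↓"                 # leftmost NP right of V head
            obj_assigned = True
        else:
            child.equation = "↓∈(↑ ADJUNCT)"             # assumed catch-all default
        annotate(child)


sentence = Node("S", [
    Node("NP-SBJ", [Node("NNP", word="John")]),
    Node("VP", [Node("VBZ", word="sees"), Node("NP", [Node("PRP", word="him")])]),
])
annotate(sentence)
for daughter in sentence.children + sentence.children[1].children:
    print(daughter.cat, daughter.equation)
```

Checking the head first and the functional tag second is a design choice of this sketch; the order in which the real algorithm consults its information sources is not specified here.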

F-Structure Annotation Algorithm

Components of the annotation algorithm (from the slide diagram):

  • Lemmatization + Macros; Lexical Entries

  • Defaults: “Functional Tags”

  • Head-Lexicalization [Magerman, 1994]

  • Left-Right Context Annotation Principles

  • Coordination Annotation Principles

  • Catch-All and Clean-Up

  • Traces

  • Outputs: Proto F-Structures, Proper F-Structures

Treebank Annotation: Control & Wh-Rel. LDD
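As a rough illustration of translating traces and co-indexation into re-entrancies, the sketch below treats f-structures as nested Python dicts and models a re-entrancy as one shared object reachable via two paths. The `INDEX` and `TRACE` keys and both function names are assumptions made for this example only. The control example corresponds to a Penn-II style analysis of "John wants to leave", where the empty embedded subject is co-indexed with the matrix subject.

```python
from typing import Any, Dict

FStr = Dict[str, Any]


def collect_antecedents(fstr: FStr, table: Dict[int, FStr]) -> None:
    """Record every f-structure that carries a co-indexation INDEX."""
    if "INDEX" in fstr:
        table[fstr["INDEX"]] = fstr
    for value in fstr.values():
        if isinstance(value, dict):
            collect_antecedents(value, table)


def link_traces(fstr: FStr, table: Dict[int, FStr]) -> None:
    """Replace each {"TRACE": i} placeholder by the antecedent object itself."""
    for attr, value in list(fstr.items()):
        if not isinstance(value, dict):
            continue
        if "TRACE" in value:
            fstr[attr] = table[value["TRACE"]]   # shared object = re-entrancy
        else:
            link_traces(value, table)


# Proto f-structure: the embedded SUBJ is still an unresolved trace.
want = {
    "PRED": "want",
    "SUBJ": {"PRED": "John", "INDEX": 1},
    "XCOMP": {"PRED": "leave", "SUBJ": {"TRACE": 1}},
}
antecedents: Dict[int, FStr] = {}
collect_antecedents(want, antecedents)
link_traces(want, antecedents)
print(want["XCOMP"]["SUBJ"] is want["SUBJ"])
```

Sharing the object, rather than copying it, means that any later information added to the matrix subject is also visible at the embedded subject position.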

Multilingual Treebank-Based LFG Resources

  • English + Penn-II: parsers (+ LDD resolution), generators, subcat-frame extraction, bootstrapping of new TB-resources (QuestionBank), transfer

  • Pilots/proof of concept: multilingual treebank-based LFG acquisition:

    • German: TIGER (Cahill et al 2003, 2005)

    • Chinese: CTB (Burke et al 2004)

    • Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)

  • GramLab Project (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German

Multilingual Treebank-Based LFG Resources

Language    Treebank     Size               Coding/Data

English     Penn-II      50,000             CFG+traces+FT

Chinese     CTB 5.1      18,000             CFG+traces+FT

Japanese    KTC 4.0      38,000             Dep (+traces)

German      TIGER 2.0    50,000             Graphs+CFG+Dep

German      TüBa-D/Z     22,000             CFG+Dep+f-traces

Spanish     Cast3LB      3,500              CFG+Dep+f-traces

Arabic      ATB          300,000 (words)

French      P7T          20,000             CFG+Dep+f-traces

Total                    > 200,000

Q2

  • What was missing in TB resource?

    • F-structures, pred-argument structure, dependencies => f-structure annotation algorithm

    • Limited domain in Penn-II (most treebanks …) => bootstrap grammar and QuestionBank (4000 questions from TREC and CCG)

    • GFs, active/passive, decl/interrog/imp, control, raising, LDDs, pro-drop, zero-anaphora, tense/aspect, …

  • What was done by hand?

    • F-structure annotation algorithm (principle-based c-/f-str interface)

    • No restructuring, no clean-up of TB (unlike CCG/HPSG/TAG – but see P7T)

    • No manual additions (unlike CCG/HPSG/TAG)

    • Future work …

Q3

  • Methodological Issues - Quality Assurance:

  • Evaluation against hand-crafted/corrected Gold Standard DepBanks

    • PARC 700

    • CBS 500

    • PropBank

    • Own Gold standard DepBanks for: English, Chinese, Japanese, German, Arabic, Spanish, French (200-500)

  • CCG-style evaluation against automatically annotated Gold (Silver-) Standard DepBanks based on WSJ Sec. 23 trees (CCG, HPSG)

  • Quality of annotation process and parsing resources: treebank-based LFG parsing statistically significantly outperforms XLE and RASP (PARC 700 & CBS 500)
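Evaluation against a gold-standard DepBank is commonly scored as precision, recall and f-score over dependency triples. The sketch below assumes that setup, which the slides do not spell out; the triple format (head, relation, dependent) and the function name `score_triples` are illustrative only, and the data are made up.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]   # (head, relation, dependent)


def score_triples(gold: Iterable[Triple], parsed: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) of parsed triples against gold triples."""
    gold_set, parsed_set = set(gold), set(parsed)
    correct = len(gold_set & parsed_set)
    precision = correct / len(parsed_set) if parsed_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: two of three parser triples match the gold standard.
gold_triples = [("see", "SUBJ", "John"), ("see", "OBJ", "him"), ("see", "TENSE", "pres")]
parsed_triples = [("see", "SUBJ", "John"), ("see", "OBJ", "him"), ("see", "ADJUNCT", "him")]
precision, recall, fscore = score_triples(gold_triples, parsed_triples)
print(round(precision, 3), round(recall, 3), round(fscore, 3))
```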

Q4

  • Phrase Structure or Dependencies?

  • Both!!!  Why?:

  • Phrase structure is good for parsing and generation => tap into lots of mature, efficient & well-understood technology (but see dependency parsing)

  • Dependencies close to f-structure/predicate-argument structures …

    • Penn-II: CFG-trees + traces/co-indexation + “functional” labels/tags

    • TIGER: graphs + CFG-categories + grammatical function labels + LDDs through crossing edges

    • Cast3LB/P7T/TüBa-D/Z: CFG trees + grammatical function labels + LDDs through GF paths

Q5 & Q6

  • Pros/Cons Formalism-Specific Treebank?

    • Formalism-Specific Treebank? Bad!  Limits usefulness/user group/…

    • Better to have generic TB with CFG + Dep Label + LDDs + other feature labels (as required). And then extract LFG/HPSG/CCG/TAG/Dependency Grammars

  • Grammar First vs. Treebank First?

    • Depends on what you want to do …

    • If you want high-quality, wide-coverage resources (that can parse unrestricted text), then it's definitely better to do treebanking first (or use bootstrapping)

    • Problem: many traditionally trained linguists see treebanking as a menial task

    • But it is a highly skilled and interesting task: empirical linguistics (confront rather than invent data)

    • Sociological task: how to make treebanking/bootstrapping sexy?

Some Resources

  • ESSLLI 2006 course material: Treebank-Based Acquisition of LFG, HPSG and CCG Resources. J. van Genabith, Y. Miyao and J. Hockenmaier

  • http://www.computing.dcu.ie/~josef/Malaga06.ppt

  • LFG parser demo:

  • http://lfg-demo.computing.dcu.ie/lfgparser.html

  • A. Cahill and J. van Genabith. Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations. COLING/ACL 2006, Sydney, Australia

  • J. Judge, A. Cahill and J. van Genabith. QuestionBank: Creating a Corpus of Parse-Annotated Questions. COLING/ACL 2006, Sydney, Australia

  • R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005

  • A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Kluwer Academic Press, 2005

  • R. O'Donovan, A. Cahill, J. van Genabith and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank. In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005

  • M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way. Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pp. 161-172, 2004

  • A. Cahill, M. Burke, R. O'Donovan, J. van Genabith and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of ACL-04, pp. 320-327, Barcelona, Spain, 2004

  • A. Cahill, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation. In M. Butt and T. Holloway King (eds.): LFG’02, Athens, Greece, CSLI Publications, Stanford, CA, pp. 76-95, 2002
