
Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models




Presentation Transcript


  1. Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models Elias Ponvert, Jason Baldridge and Katrin Erk The University of Texas at Austin

  2. Introduction • Grammar induction • Based on gold-standard POS tags • Fundamental model: the Constituent Context Model (CCM) • Based on raw text • The common cover links parser (CCL) • This paper: cascaded chunking

  3. Motivation of This Paper • CCL depends heavily on low-level constituents: • Simply extracting non-hierarchical multiword constituents from CCL’s output and placing a right-branching structure over them actually works better than CCL’s own higher-level predictions • Suggestion: improvements to low-level constituent prediction will ultimately lead to further gains in overall constituent parsing

  4. Two Investigations • Unsupervised partial parsing, i.e. unsupervised chunking • Full parsing via cascaded chunking (explained later)

  5. Data for Unsupervised Chunking • Two kinds of target chunks: • Constituent chunks • Multiword • Non-hierarchical (contain no sub-constituents) • Base NPs: NPs that do not contain nested NPs

  6. Method of Unsupervised Chunking • BIO tagging, plus STOP for sentence boundaries and phrasal punctuation • Models: • HMM • PRLG (probabilistic right-linear grammar)
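The BIO-plus-STOP scheme on this slide can be sketched as follows. This is an illustrative encoding only: the function name, the chunk-span representation, and the punctuation set are assumptions, not the paper's code.

```python
# Illustrative sketch of the BIO tagging scheme: B marks the first word of
# a chunk, I a subsequent word inside a chunk, O a word outside any chunk,
# and STOP marks sentence boundaries and phrasal punctuation.

def bio_encode(tokens, chunks):
    """Encode a tokenized sentence and its chunk spans as BIO/STOP tags.

    tokens: list of word strings
    chunks: list of (start, end) spans, end exclusive, non-overlapping
    """
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    # Phrasal punctuation gets the STOP tag (punctuation set is assumed)
    for i, tok in enumerate(tokens):
        if tok in {".", ",", ";", ":", "?", "!"}:
            tags[i] = "STOP"
    return tags

tokens = ["the", "big", "dog", "barked", "."]
chunks = [(0, 3)]          # "the big dog" is one multiword chunk
print(bio_encode(tokens, chunks))   # ['B', 'I', 'I', 'O', 'STOP']
```

Either the HMM or the PRLG can then be trained over these tag sequences with EM.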

  7. Finite States • State transitions • Uniform initialization
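A minimal sketch of uniform initialization over the chunker's state transitions. The set of legal transitions shown here is an assumption for illustration (e.g. chunks are multiword, so B must be followed by I); the constraint set and numbers are not taken from the paper.

```python
# Uniform initialization: each state's probability mass is split evenly
# over its legal successor states, with zero elsewhere. The LEGAL table
# below is an assumed constraint set, not the paper's.

STATES = ["B", "I", "O", "STOP"]

LEGAL = {
    "B":    ["I"],                   # chunks are multiword: B -> I only
    "I":    ["B", "I", "O", "STOP"],
    "O":    ["B", "O", "STOP"],      # I can never directly follow O
    "STOP": ["B", "O", "STOP"],
}

def uniform_transitions():
    """Uniform probability over each state's legal successors, 0 elsewhere."""
    table = {}
    for s in STATES:
        p = 1.0 / len(LEGAL[s])
        table[s] = {t: (p if t in LEGAL[s] else 0.0) for t in STATES}
    return table

trans = uniform_transitions()
print(trans["B"]["I"])   # 1.0: after B the only legal state is I
print(trans["O"]["I"])   # 0.0: illegal transition
```

Because the initialization is uniform, EM training is driven entirely by the data and the transition constraints rather than by a seeded bias.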

  8. Chunking Results

  9. Full Parsing via Cascaded Chunking • Pseudoword: the term in the chunk with the highest corpus frequency
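The pseudoword step of the cascade can be sketched as below: each predicted chunk is collapsed to its most corpus-frequent term, producing a shorter sentence to feed to the chunker at the next level. The `corpus_freq` table and both function names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of one cascade level: replace each chunk with its
# pseudoword (the chunk term with the highest corpus frequency).

from collections import Counter

# Toy corpus frequencies; a real run would count over the whole corpus.
corpus_freq = Counter({"the": 100, "big": 20, "dog": 15, "barked": 5})

def pseudoword(chunk_tokens):
    """The term in the chunk with the highest corpus frequency."""
    return max(chunk_tokens, key=lambda w: corpus_freq[w])

def collapse(tokens, chunks):
    """Replace each chunk span with its pseudoword; keep other tokens."""
    out, i = [], 0
    spans = dict(chunks)             # start -> end, assuming non-overlapping
    while i < len(tokens):
        if i in spans:
            out.append(pseudoword(tokens[i:spans[i]]))
            i = spans[i]
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["the", "big", "dog", "barked"]
print(collapse(tokens, [(0, 3)]))    # ['the', 'barked'] -> re-chunk next level
```

Iterating this collapse-and-rechunk loop until no further chunks are found yields the hierarchical constituent structure.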

  10. Full Parsing Results • No length limit • Sentences of ≤ 10 words
