
# LING 581: Advanced Computational Linguistics

LING 581: Advanced Computational Linguistics. Lecture Notes, January 26th. Penn Treebank. Bracketing guidelines. Ungraded Homework Exercise. Search for NP trace relative clauses as defined below. Be ready to compare search pattern and number found next time in class.


### LING 581: Advanced Computational Linguistics

Lecture Notes, January 26th

Penn Treebank

Bracketing guidelines

• Search for NP trace relative clauses as defined below:

• Be ready to compare your search pattern and the number found next time in class

• @NP < @NP < @SBAR: 12038 found
• @NP < @NP < @SBAR plus WH indices: 10956 found, down from 12038
• @NP < @NP < (@SBAR < /^-NONE-/): 529 found
• Note: -NONE- < *ICH*. Not all matches of @NP < @NP < (@SBAR < /^-NONE-/) are relative clauses
• @NP < @NP < (@SBAR < /^-NONE-/) plus *ICH*: count drops from 529 to 166
• Is 166 too low?
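To make the first search pattern concrete, here is a minimal Python sketch of what @NP < @NP < @SBAR asks for: a node whose basic category is NP and which immediately dominates both an NP and an SBAR. It is not Tregex; the helper names (`parse_trees`, `basic`, `count_np_np_sbar`) and the toy sentence are illustrative assumptions, and it counts matching nodes, which may differ from how Tregex reports matches.

```python
import re

def parse_trees(text):
    """Parse Penn Treebank style bracketed text into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = ""  # the outer wrapper "( (S ...))" has no label
            if tokens[pos] not in ("(", ")"):
                label = tokens[pos]
                pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # skip the closing ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return (word, [])  # a leaf (word or trace) has no children

    trees = []
    while pos < len(tokens):
        trees.append(read())
    return trees

def basic(label):
    """Strip function tags and indices: NP-SBJ-1 -> NP. Labels like -NONE- are kept."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def count_np_np_sbar(tree):
    """Count NP nodes that immediately dominate both an NP and an SBAR."""
    label, children = tree
    # Only non-leaf children carry category labels.
    kids = [basic(child[0]) for child in children if child[1]]
    hit = 1 if basic(label) == "NP" and "NP" in kids and "SBAR" in kids else 0
    return hit + sum(count_np_np_sbar(child) for child in children)

# Toy example (not taken from the WSJ PTB).
toy = """( (S (NP-SBJ (NP (DT the) (NN man))
                     (SBAR (WHNP-1 (WP who))
                           (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD left)))))
            (VP (VBD smiled))))"""
total = sum(count_np_np_sbar(t) for t in parse_trees(toy))
print(total)
```

Running the same counter over the merged treebank file gives a rough cross-check on a Tregex count, but only for this one pattern.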

Homework Exercise

Use the bracketing guides and choose three “interesting” constructions

Find all occurrences in the WSJ PTB

Homework Exercise
• 581 Homework rules
• Due next lecture
• Present your findings in class (slides)
Parsing

… from Treebank search to stochastic parsers trained on the WSJ Penn Treebank

Bikel Collins
• Java re-implementation of Collins’ parser
• Paper
• Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), pp. 479-511.
• http://www.cis.upenn.edu/~dbikel/papers/collins-intricacies.pdf
• Software
• http://www.cis.upenn.edu/~dbikel/
Bikel Collins
• File: install.sh
• Java code
• but at this point I think Windows won’t work because of the shell script (.sh)
• maybe after files are extracted?
Bikel Collins

parser doesn’t actually need a separate tagger…

Bikel Collins
• Training the parser with the WSJ PTB
• See guide

directory: TREEBANK_3/parsed/mrg/wsj

sections 02-21: concatenate into one single .mrg file

events: wsj-02-21.obj.gz
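One way to build the single training file is to concatenate every .mrg file in the numbered subdirectories 02 through 21. The following is a minimal Python sketch; the function name `merge_sections` is hypothetical, and the commented paths only follow the directory and file names shown on the slides.

```python
from pathlib import Path

def merge_sections(wsj_dir, out_path, first=2, last=21):
    """Concatenate all .mrg files in subdirectories first..last into out_path.

    Returns the number of files copied.
    """
    wsj_dir = Path(wsj_dir)
    copied = 0
    with open(out_path, "wb") as out:
        for section in range(first, last + 1):
            section_dir = wsj_dir / f"{section:02d}"  # 2 -> "02"
            if not section_dir.is_dir():
                continue  # skip missing sections instead of failing
            for mrg in sorted(section_dir.glob("*.mrg")):
                data = mrg.read_bytes()
                out.write(data)
                if not data.endswith(b"\n"):
                    out.write(b"\n")  # keep trees from different files apart
                copied += 1
    return copied

# Example call, assuming the layout from the slides:
# merge_sections("TREEBANK_3/parsed/mrg/wsj", "wsj-02-21.mrg")
```

Files are copied as bytes so that no assumption about the text encoding is needed.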

Bikel Collins
• Settings:
Bikel Collins
• Parsing
• Command
• Input file format (sentences)
Bikel Collins
• Verify the trainer and parser work on your machine
Bikel Collins
• File: bin/parse is a shell script that sets up program parameters and calls java
Bikel Collins
• File: bin/train is another shell script
Bikel Collins
• Relevant WSJ PTB files
Bikel Collins
• I use a Tcl/Tk wrapper to call Dan Bikel’s code (requires Tcl/Tk to be installed)

It makes it easy to run the parser without memorizing the command-line options

Bikel Collins
• For tree viewing, you can use tregex

For demos, I use my own viewer

Bikel Collins
• POS tagging (MXPOST, in directory jmx)
• tagger_input
• $prefix/jmx/mxpost $prefix/jmx/tagger.project < /tmp/test.txt 2> /tmp/err.txt
• Parsing
• set ddf "wsj-02-21.obj.gz"
• set properties "collins.properties"
• parser_input
• $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt 2>@ stdout
• Training
• set mrg "wsj-02-21.mrg"
• set properties "collins.properties"
• $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg 2>@ stdout
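The parse and train command lines above share one shape: script, a number, a settings file, then a data file. As a rough illustration of that argument order, here is a small Python sketch that only assembles the argument lists and does not run anything. The function names and the example install directory are hypothetical; the number (400 or 800) is passed through unchanged because the slides do not say what it controls.

```python
def parse_command(dbprefix, sentences="/tmp/test2.txt",
                  properties="collins.properties",
                  ddf="wsj-02-21.obj.gz", number="400"):
    """Argument list for bin/parse, in the order shown on the slide."""
    return [
        f"{dbprefix}/bin/parse",
        number,                              # first argument, meaning not stated in the slides
        f"{dbprefix}/settings/{properties}", # settings file
        f"{dbprefix}/bin/{ddf}",             # events file produced by training
        sentences,                           # input sentences to parse
    ]

def train_command(dbprefix, properties="collins.properties",
                  mrg="wsj-02-21.mrg", number="800"):
    """Argument list for bin/train, in the order shown on the slide."""
    return [
        f"{dbprefix}/bin/train",
        number,
        f"{dbprefix}/settings/{properties}",
        f"{dbprefix}/bin/{mrg}",             # merged treebank file
    ]

# "/opt/dbparser" is a made-up install location used only for illustration.
print(" ".join(parse_command("/opt/dbparser")))
print(" ".join(train_command("/opt/dbparser")))
```

Either list could be handed to `subprocess.run` once the parser is actually installed at the chosen location.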

Unix file descriptors

• 0: Standard input (stdin)
• 1: Standard output (stdout)
• 2: Standard error (stderr)
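These descriptor numbers explain the redirections in the commands: `2> /tmp/err.txt` sends descriptor 2 to a file, while Tcl's `2>@ stdout` sends it to the already open stdout channel so that error messages appear together with normal output. The sketch below shows the same two choices in Python, using a tiny child process as a stand-in for the tagger or parser; the helper names `run_split` and `run_merged` are illustrative.

```python
import subprocess
import sys

# Stand-in child process: writes one line to stdout and one to stderr.
CHILD = [sys.executable, "-c",
         "import sys; print('out'); print('err', file=sys.stderr)"]

def run_split(cmd):
    """Capture descriptors 1 and 2 separately (like redirecting 2> to its own file)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout, result.stderr

def run_merged(cmd):
    """Send descriptor 2 to the same place as descriptor 1 (like Tcl's 2>@ stdout)."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    return result.stdout

out, err = run_split(CHILD)
merged = run_merged(CHILD)
print(repr(out), repr(err))
print(repr(merged))
```

In the merged case the relative order of the two lines depends on buffering in the child, so it should not be relied on.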

GUI components

frame .input

text .input.t -height 4 -yscrollcommand {.input.s set}

scrollbar .input.s -command {.input.t yview}

frame .tagged

text .tagged.t -height 9 -yscrollcommand {.tagged.s set}

scrollbar .tagged.s -command {.tagged.t yview}

Code

proc tagger_input {} {

set lines [.input.t get 1.0 end]

set infile [open "/tmp/test.txt" w]

# write the widget text without trailing whitespace or a final newline

puts -nonewline $infile [string trimright $lines]

close $infile

}

proc parser_input {} {

set lines [.tagged.t get 1.0 end]

set infile [open "/tmp/test2.txt" w]

# write the tagged text without trailing whitespace or a final newline

puts -nonewline $infile [string trimright $lines]

close $infile

}