I256: Applied Natural Language Processing, Fall 2009
Sentence Structure
Barbara Rosario
Resources for IR: an excellent resource is the syllabus of the Stanford course Information Retrieval and Web Search (CS 276 / LING 286): http://www.stanford.edu/class/cs276/cs276-2009-syllabus.html
Acknowledgments: Some slides are adapted and/or taken from Klein’s CS 288 course
Phrase structure parsing organizes syntax into constituents or brackets
new art critics write reviews with computers
Hurricane Emily howled toward Mexico's Caribbean coast on Sunday packing 135 mph winds and torrential rain and causing panic in Cancun, where frightened tourists squeezed into musty shelters.
Words combine with other words to form units.
How do we know what nodes go in the tree?
What is the evidence that a sequence of words forms a unit?
Classic constituency tests:
Constituency isn’t always clear
Units of transfer:
think about ~ penser à
talk about ~ hablar de
I will go → I’ll go
I want to go → I wanna go
He went to and came from the store.
La vélocité des ondes sismiques (“the velocity of seismic waves”)
I cleaned the dishes from dinner
I cleaned the dishes with detergent
I cleaned the dishes in my pajamas
I cleaned the dishes in the sink
Prepositional phrases: They cooked the beans in the pot on the stove with handles.
Particle vs. preposition: The puppy tore up the staircase.
Gerund vs. participial adjective: Visiting relatives can be boring. / Changing schedules frequently confused passengers.
Modifier scope within NPs: impractical design requirements / plastic cup holder
Coordination scope: Small rats and mice can squeeze into holes or cracks in the wall.
Write symbolic or logical rules:
S → NP VP
NP → DT NN
NP → NN NNS
NP → NP PP
VP → VBP NP
VP → VBP NP PP
PP → IN NP
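These rules can be run directly. The sketch below (plain Python, no libraries) enumerates all parses of a short sentence under rules like the ones above; the POS lexicon and the extra unary rule NP → NNS are assumptions added so the toy sentence parses, not rules from the slides:

```python
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"], ["NN", "NNS"], ["NP", "PP"], ["NNS"]],  # NP -> NNS assumed
    "VP": [["VBP", "NP"], ["VBP", "NP", "PP"]],
    "PP": [["IN", "NP"]],
}
LEXICON = {  # toy POS tags, assumed for this example
    "critics": "NNS", "write": "VBP", "reviews": "NNS",
    "with": "IN", "computers": "NNS",
}

def parses(sym, words, i, j):
    """Return all trees rooted in `sym` covering words[i:j]."""
    trees = []
    if j - i == 1 and LEXICON.get(words[i]) == sym:
        trees.append((sym, words[i]))           # preterminal over one word
    for rhs in RULES.get(sym, []):
        for kids in splits(rhs, words, i, j):
            trees.append((sym,) + kids)
    return trees

def splits(rhs, words, i, j):
    """All ways to divide words[i:j] among the RHS symbols (>= 1 word each)."""
    if not rhs:
        return [()] if i == j else []
    out = []
    for k in range(i + 1, j - len(rhs) + 2):
        for first in parses(rhs[0], words, i, k):
            for rest in splits(rhs[1:], words, k, j):
                out.append((first,) + rest)
    return out

sent = "critics write reviews with computers".split()
trees = parses("S", sent, 0, len(sent))
print(len(trees))  # 2 parses: the PP attaches to the VP or to "reviews"
```

The two parses it finds are exactly the PP-attachment ambiguity discussed above: `VP → VBP NP PP` versus `NP → NP PP`.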
A context-free grammar is a tuple <N, T, S, R>
N : the set of non-terminals
Phrasal categories: S, NP, VP, ADJP, etc.
Parts-of-speech (pre-terminals): NN, JJ, DT, VB
T : the set of terminals (the words)
S : the start symbol
Often written as ROOT or TOP
R : the set of rules
Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
Examples: S → NP VP, VP → VP CC VP
Also called rewrites, productions, or local trees
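The tuple definition can be stated directly in code. A small sketch (the symbol sets are toy choices, not from the slides) representing ⟨N, T, S, R⟩ with a well-formedness check matching the definition above:

```python
# Non-terminals N, terminals T, start symbol S, and rules R.
N = {"S", "NP", "VP", "PP", "DT", "NN", "NNS", "VBP", "IN"}
T = {"the", "critics", "write", "reviews", "with", "computers"}
S = "S"
R = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NP", "PP")),
     ("VP", ("VBP", "NP")), ("VP", ("VBP", "NP", "PP")), ("PP", ("IN", "NP"))]

def well_formed(N, T, S, R):
    """Check S is in N, N and T are disjoint, and each rule is X -> Y1..Yk with X, Yi in N."""
    if S not in N or N & T:
        return False
    return all(lhs in N and all(y in N for y in rhs) for lhs, rhs in R)

print(well_formed(N, T, S, R))  # True
```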
Treebank grammars can be enormous
The raw grammar has ~10K states, excluding the lexicon
Better parsers usually make the grammars larger, not smaller
Parsing with the vanilla treebank grammar:
Why’s it worse in practice?
Longer sentences “unlock” more of the grammar
All kinds of systems issues don’t scale
[Plot: parsing time vs. sentence length with the raw treebank grammar (~20K rules; not an optimized parser); observed exponent: 3.6]
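Where do treebank grammars come from? They are read off the trees: one rule per local tree. A sketch of that extraction (the bracketed tree string here is a toy example, not actual Penn Treebank data):

```python
def read_tree(s):
    """Parse a bracketed string like (S (NP ...) (VP ...)) into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(pos):
        if tokens[pos] == "(":
            label, pos = tokens[pos + 1], pos + 2
            kids = []
            while tokens[pos] != ")":
                kid, pos = helper(pos)
                kids.append(kid)
            return [label] + kids, pos + 1
        return tokens[pos], pos + 1     # a leaf word
    tree, _ = helper(0)
    return tree

def rules(tree, out):
    """Collect one LHS -> RHS rule per local tree, skipping the lexicon."""
    if isinstance(tree, str):
        return
    kids = tree[1:]
    if not (len(kids) == 1 and isinstance(kids[0], str)):   # skip POS -> word
        out.add((tree[0], tuple(k[0] for k in kids)))
    for k in kids:
        rules(k, out)

grammar = set()
rules(read_tree("(S (NP (NNS critics)) (VP (VBP write) (NP (NNS reviews))))"),
      grammar)
print(sorted(grammar))
```

Run over the whole treebank instead of one toy tree, this same loop yields the ~20K-rule grammar the slide describes: every distinct local tree in the corpus becomes a rule.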
If we do no annotation, these trees differ only in one rule:
VP → VP PP
NP → NP PP
Parse will go one way or the other, regardless of words
Lexicalization allows us to be sensitive to specific words
Add “headwords” to each phrasal node
Syntactic vs. semantic heads
Headship not in (most) treebanks
Usually use head rules, e.g.:
NP:
Take leftmost NP
Take rightmost N*
Take rightmost JJ
Take right child
VP:
Take leftmost VB*
Take leftmost VP
Take left child
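Head rules like these can be applied recursively to percolate headwords up the tree. A minimal sketch (the tree encoding and rule table are assumptions, simplified from Collins-style head rules):

```python
# Each category maps to an ordered list of (direction, label-prefix) preferences;
# DEFAULT gives the left/right-child fallback when no preference matches.
HEAD_RULES = {
    "NP": [("left", "NP"), ("right", "N"), ("right", "JJ")],
    "VP": [("left", "VB"), ("left", "VP")],
    "S":  [("left", "VP")],
    "PP": [("left", "IN")],
}
DEFAULT = {"NP": "right", "VP": "left", "S": "left", "PP": "left"}

def head_word(tree):
    """tree = (label, word) for a preterminal, or (label, child, child, ...)."""
    label, kids = tree[0], tree[1:]
    if len(kids) == 1 and isinstance(kids[0], str):
        return kids[0]                      # preterminal: the word is the head
    for direction, prefix in HEAD_RULES.get(label, []):
        scan = kids if direction == "left" else reversed(kids)
        for kid in scan:
            if kid[0].startswith(prefix):
                return head_word(kid)
    kid = kids[0] if DEFAULT.get(label, "left") == "left" else kids[-1]
    return head_word(kid)                   # fallback: left or right child

tree = ("S",
        ("NP", ("NNS", "critics")),
        ("VP", ("VBP", "write"), ("NP", ("NNS", "reviews"))))
print(head_word(tree))  # "write": S takes its VP child, VP its leftmost VB*
```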
Problem: we now have to estimate probabilities like P(rule | category, headword)
Never going to get these atomically off of a treebank
Solution: break up derivation into smaller steps
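One simple way to break the estimate into smaller steps is to interpolate the sparse lexicalized estimate with its unlexicalized backoff. The counts and interpolation weight below are illustrative assumptions, not treebank figures:

```python
from collections import Counter

# Toy (category, headword, rule) observations, assumed for this example.
events = [("VP", "write", ("VBP", "NP")),
          ("VP", "write", ("VBP", "NP")),
          ("VP", "write", ("VBP", "NP", "PP")),
          ("VP", "see",   ("VBP", "NP"))]

rule_cat_head = Counter((c, h, r) for c, h, r in events)
cat_head = Counter((c, h) for c, h, _ in events)
rule_cat = Counter((c, r) for c, _, r in events)
cat = Counter(c for c, _, _ in events)

def p_rule(rule, category, head, lam=0.7):
    """Interpolate P(rule | cat, head) with the backoff P(rule | cat).
    (Assumes the (category, head) pair was seen; real estimators smooth this.)"""
    lexical = rule_cat_head[(category, head, rule)] / cat_head[(category, head)]
    backoff = rule_cat[(category, rule)] / cat[category]
    return lam * lexical + (1 - lam) * backoff

p = p_rule(("VBP", "NP", "PP"), "VP", "write")
```

Here the rare lexicalized count (1 of 3 for "write") is blended with the category-level estimate (1 of 4 VP rules), so unseen head/rule combinations still get non-zero mass from the backoff term.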
Dependency structure gives attachments.
In principle, can express any kind of dependency
How to find the dependencies?
Shaw Publishing acquired 30 % of American City in March
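One common way to obtain dependencies is to read them off a head-annotated constituency tree: every non-head child's headword becomes a dependent of the head child's headword. A compact sketch (the tree encoding, with the head child's index stored in each node, is an assumption):

```python
def deps(tree, arcs):
    """tree = (label, head_index, child, ...); a leaf is a word string.
    Returns the subtree's headword and appends (head, dependent) arcs."""
    if isinstance(tree, str):
        return tree
    head_idx, kids = tree[1], tree[2:]
    words = [deps(k, arcs) for k in kids]
    head = words[head_idx]
    arcs.extend((head, w) for i, w in enumerate(words) if i != head_idx)
    return head

# A simplified (assumed) analysis of part of the sentence above.
tree = ("S", 1,
        ("NP", 0, "Publishing"),
        ("VP", 0, "acquired", ("NP", 0, "City"), ("PP", 0, "in")))
arcs = []
root = deps(tree, arcs)
print(root, arcs)  # root "acquired"; its dependents: Publishing, City, in
```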
Words select other words (also) on syntactic grounds
Mutual information between words does not necessarily indicate syntactic selection.
expect brushbacks but no beanballs
a new year begins in new york
congress narrowly passed the amended bill
Idea: Word Classes
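The point about mutual information can be made concrete: pointwise mutual information over adjacent word pairs rewards collocations like "new york" even though "new" does not syntactically select "york". A toy sketch (the mini-corpus is invented):

```python
import math
from collections import Counter

corpus = ("a new year begins in new york . "
          "new york is busy in the new year .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """log of P(w1, w2) / (P(w1) P(w2)) over adjacent word pairs."""
    p_joint = bigrams[(w1, w2)] / n_bi
    return math.log(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

print(pmi("new", "york"))   # strongly positive: a collocation, not selection
```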