
Day 2: Pruning continued; begin competition models


Roger Levy

University of Edinburgh & University of California – San Diego

- Concept from probability theory: marginalization
- Complete Jurafsky 1996: modeling online data
- Begin competition models

- In many cases, a joint probability distribution will be more “basic” than the raw distribution of any member variable
- Imagine two dice with a weak spring attached
- No independence → joint more basic
- Summing the joint distribution P(X, Y) over all values of X gives a distribution over Y, known as the marginal distribution: P(Y) = Σ_x P(X = x, Y)
- Calculating P(Y) this way is called marginalizing over X
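The dice example can be made concrete with a small sketch. How the spring couples the dice is not specified in these notes, so the coupling used here (matching faces get twice the weight of mismatched ones) is purely an illustrative assumption, as are the names `joint`, `marginal_x`, and `marginal_y`.

```python
from fractions import Fraction

faces = range(1, 7)

# Hypothetical coupling: the spring makes matching faces twice as likely.
weights = {(x, y): (2 if x == y else 1) for x in faces for y in faces}
total = sum(weights.values())
joint = {xy: Fraction(w, total) for xy, w in weights.items()}

# Marginalize over X: P(Y = y) = sum over x of P(X = x, Y = y).
marginal_y = {y: sum(joint[(x, y)] for x in faces) for y in faces}
# Marginalize over Y in the same way to get P(X = x).
marginal_x = {x: sum(joint[(x, y)] for y in faces) for x in faces}

# What P(X = 1, Y = 1) would be if the dice were independent.
independent = marginal_x[1] * marginal_y[1]
```

Under this assumed coupling each die on its own still looks fair (every marginal is 1/6), yet P(X = 1, Y = 1) = 1/21 exceeds the 1/36 that independence would give. The marginals alone therefore do not determine the joint.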

- Concept from probability theory: marginalization
- Complete Jurafsky 1996: modeling online data
- Begin competition models

- Does this sentence make sense?
The complex houses married and single students and their families.

- How about this one?
The warehouse fires a dozen employees each year.

- And this one?
The warehouse fires destroyed all the buildings.

- fires can be either a noun or a verb. So can houses:
[NP The complex] [VP houses married and single students…].

- These are garden-path sentences
- Originally taken as some of the strongest evidence for serial processing by the human parser

Frazier and Rayner 1987

- Full-serial: keep only one incremental interpretation
- Full-parallel: keep all incremental interpretations
- Limited parallel: keep some but not all interpretations
- In a limited parallel model, garden-path effects can arise from the discarding of a needed interpretation

[S [NP The complex] [VP houses…] …] (discarded)

[S [NP The complex houses …] …] (kept)

- Pruning strategy for limited ranked-parallel processing
- Each incremental analysis is ranked
- Analyses falling below a threshold are discarded
- In this framework, a model must characterize
- The incremental analyses
- The threshold for pruning

- Jurafsky 1996: partial context-free parses as analyses
- Probability ratio as pruning threshold
- Ratio defined as P(I) : P(I_best), where I_best is the highest-ranked analysis

- (Gibson 1991: complexity ratio for pruning threshold)
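The ratio-based pruning step can be sketched as follows. This is a minimal illustration, not Jurafsky's implementation: the function name `prune`, the `beam_ratio` parameter, and the toy probabilities are all assumptions. The ratio is written here as P(I_best) / P(I), so a larger value means a weaker analysis.

```python
def prune(analyses, beam_ratio):
    """Keep analyses whose probability is within beam_ratio of the best one.

    analyses:   dict mapping an analysis label to its (positive) probability.
    beam_ratio: hypothetical threshold; an analysis I is discarded when
                P(I_best) / P(I) exceeds it.
    """
    best = max(analyses.values())
    return {label: p for label, p in analyses.items() if best / p <= beam_ratio}


# Toy probabilities, invented for illustration only.
candidates = {"A": 0.30, "B": 0.10, "C": 0.001}
kept = prune(candidates, beam_ratio=10.0)
```

With these invented numbers, analysis C is 300 times less probable than the best analysis and falls outside the beam, while A and B survive.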

- Each analysis is a partial PCFG tree
- Tree prefix probability used for ranking of analysis
- Partial rule probs marginalize over rule completions

[Figure annotation: these nodes are actually still undergoing expansion]

*implications for granularity of structural analysis
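Marginalizing over rule completions can be illustrated at the level of a single rule. The grammar fragment, its probabilities, and the helper `partial_rule_prob` below are hypothetical; a full tree prefix probability would also have to account for the incomplete expansions of child nodes, which this sketch leaves out.

```python
# Hypothetical PCFG fragment: expansions of NP with invented probabilities.
NP_RULES = {
    ("Det", "N"): 0.5,
    ("Det", "N", "PP"): 0.1,
    ("Det", "Adj", "N"): 0.3,
    ("Pronoun",): 0.1,
}


def partial_rule_prob(rules, prefix):
    """Sum the probabilities of all rules whose right-hand side starts with prefix."""
    prefix = tuple(prefix)
    return sum(p for rhs, p in rules.items() if rhs[:len(prefix)] == prefix)


# Probability of an NP whose expansion begins "Det N ...", whatever follows.
p_det_n = partial_rule_prob(NP_RULES, ["Det", "N"])
# Probability of an NP whose expansion begins "Det ...".
p_det = partial_rule_prob(NP_RULES, ["Det"])
```

After seeing only a determiner, three of the four toy rules remain compatible, so the partial rule carries their summed probability; seeing a noun next narrows the set to two rules.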

- Partial CF tree analysis of the complex houses…
- Analysis of houses as verb has much lower probability than analysis as noun (> 250:1)
- Hypothesis: the low-ranking alternative is discarded

- Note that top-down vs. bottom-up questions are immediately implicated, in theory
- Jurafsky includes the cost of generating the initial NP under the S
- of course, it’s a small cost as P(S -> NP …) = 0.92

- If parsing were bottom-up, that cost would not have been explicitly calculated yet

- The most famous garden-paths: reduced relative clauses (RRCs) versus main clauses (MCs)
- From the valence + simple-constituency perspective, MC and RRC analyses differ in two places:

The horse (that was) raced past the barn fell.

[Figure: MC and RRC analyses of the sentence, annotated with p=0.14 and p≈1; best intransitive: p=0.92; transitive valence: p=0.08]

- The 82 : 1 probability ratio means that the lower-probability analysis is discarded
- In contrast, some RRCs do not induce garden paths:

The bird found in the room died.

- Here, found is preferentially transitive (0.62)
- As a result, the probability ratio is much closer (≈ 4 : 1)
- Conclusion within pruning theory: beam threshold is between 4 : 1 and 82 : 1
- (granularity issue: when exactly does the probability cost of valence get paid? cf. the complex houses)

*note also that Jurafsky does not treat found as having POS ambiguity
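The 82 : 1 figure can be reproduced from the probabilities given for raced. The sketch assumes (this pairing is not stated explicitly in these notes) that p≈1 and p=0.92 belong to the main-clause analysis and that p=0.14 and p=0.08 belong to the reduced-relative analysis, and it treats p≈1 as exactly 1. The beam value of 10 is hypothetical, chosen only because it lies between the two bounds.

```python
# Main-clause analysis: constituency term (p≈1, taken as 1.0) times intransitive valence.
p_mc = 1.0 * 0.92
# Reduced-relative analysis: constituency term times transitive valence.
p_rrc = 0.14 * 0.08

raced_ratio = p_mc / p_rrc   # roughly 82
found_ratio = 4.0            # the approximate ratio quoted for "found"

# Hypothetical beam: any value between the two ratios gives the same outcome.
hypothetical_beam = 10.0
raced_rrc_pruned = raced_ratio > hypothetical_beam
found_rrc_pruned = found_ratio > hypothetical_beam
```

With any threshold in that range, the reduced-relative analysis is pruned for raced (garden path) but retained for found (no garden path), which is the pattern the pruning account is meant to capture.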

- Jurafsky 1996 is a product-of-experts (PoE) model
- Expert 1: the constituency model
- Expert 2: the valence model

- PoEs are flexible and easy to define, but…
- The Jurafsky 1996 model is actually deficient (loses probability mass), due to relative frequency estimation

sometimes approximated as

- Jurafsky 1996 predated most work on lexicalized parsers (Collins 1999, Charniak 1997)
- In a generative lexicalized parser, valence and constituency are often combined through decomposition & Markov assumptions, e.g.,
- The use of decomposition makes it easy to learn non-deficient models

- Syntactic comprehension is probabilistic
- Offline preferences explained by syntactic + valence probabilities
- Online garden-path results explained by same model, when beam search/pruning is assumed

- What is the granularity of incremental analysis?
- In [NP the complex houses], complex could be an adjective (= the houses are complex)
- complex could also be a noun (= the houses of the complex)
- Should these be distinguished, or combined?
- When does valence probability cost get paid?

- What is the criterion for abandoning an analysis?
- Should the number of maintained analyses affect processing difficulty as well?

- Concept from probability theory: marginalization
- Complete Jurafsky 1996: modeling online data
- Begin competition models

- Disambiguation: when different syntactic alternatives are available for a given partial input, each alternative receives support from multiple probabilistic information sources
- Competition: the different alternatives compete with each other until one wins, and the duration of competition determines processing difficulty
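One way such a competition process might be implemented is a loop that normalizes each information source, integrates the sources into an overall activation per alternative, and feeds that activation back into the sources until one alternative crosses a criterion. This is a generic sketch rather than a model specified in these slides: the function `compete`, the 0.9 threshold, the weights, and the support values are all assumptions.

```python
def compete(supports, weights, threshold=0.9, max_cycles=1000):
    """Run a normalize-integrate-feedback loop; return (cycles, integrated activations).

    supports: one list per information source, giving its support for each alternative.
    weights:  one weight per information source (assumed to sum to 1).
    """
    supports = [list(s) for s in supports]
    n_alts = len(supports[0])
    for cycle in range(max_cycles + 1):
        # Normalize each source so its support sums to 1 across alternatives.
        supports = [[v / sum(s) for v in s] for s in supports]
        # Integrate: weighted sum of support for each alternative.
        integrated = [sum(w * s[a] for w, s in zip(weights, supports))
                      for a in range(n_alts)]
        if max(integrated) >= threshold:
            return cycle, integrated
        # Feedback: each source is nudged toward the current overall preference.
        supports = [[s[a] + integrated[a] * w * s[a] for a in range(n_alts)]
                    for w, s in zip(weights, supports)]
    return max_cycles, integrated


# Invented inputs: two sources, two alternatives.
biased_cycles, _ = compete([[0.9, 0.1], [0.8, 0.2]], weights=[0.5, 0.5])
balanced_cycles, _ = compete([[0.55, 0.45], [0.5, 0.5]], weights=[0.5, 0.5])
```

In this sketch the number of cycles stands in for processing difficulty: when the sources agree strongly, the criterion is reached quickly, and when support is nearly balanced, the competition takes more cycles to resolve.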

- Parallel competition models of syntactic processing have their roots in lexical access research
- Initial question: process of word recognition
- are all meanings of a word simultaneously accessed?
- or are only some (or one) meanings accessed?

- Parallel vs. serial question, for lexical access

- Testing access models: priming studies show that subordinate (= less frequent) meanings are accessed as well as dominant (= more frequent) meanings
- Also, lexical decision studies show that more frequent meanings are accessed more quickly

- Lexical ambiguity in reading: does the amount of time spent on a word reflect its degree of ambiguity?
- Readers spend more time reading equibiased ambiguous words than non-equibiased ambiguous words (eye-tracking studies)
- Different meanings compete with each other

Of course the pitcher was often forgotten…


Rayner and Duffy (1986); Duffy, Morris, and Rayner (1988)

- Can this idea of competition be applied to online syntactic comprehension?
- If so, then multiple interpretations of a partial input should compete with one another and slow down reading
- does this mean increased difficulty of comprehension?
- [compare with other types of difficulty, e.g., memory overload]

- Configurational bias: MV vs. RR
- Thematic fit (initial NP to verb’s roles)
- i.e., Plaus(verb,noun), ranging from

- Bias of verb: simple past vs. past participle
- i.e., P(past | verb)*

- Support of by
- i.e., P(MV | <verb,by>) [not conditioned on specific verb]

- That these factors can affect processing in the MV/RR ambiguity is motivated by a variety of previous studies (MacDonald et al. 1993; Burgess et al. 1993; Trueswell et al. 1994, cf. Ferreira & Clifton 1986; Trueswell 1996)

*technically not calculated this way, but this would be the rational reconstruction