Automatic Extraction of Subcategorization Frames From Corpora

Jianguo Li, Department of Linguistics, The Ohio State University.

Presentation Transcript


  1. Automatic Extraction of Subcategorization Frames From Corpora. Jianguo Li, Department of Linguistics, The Ohio State University

  2. Task Description • Use the Charniak parser and machine learning techniques to acquire subcategorization frames (SCFs) for English verbs from the British National Corpus (BNC).

  3. Why do we need SCF information? • Parsing: Rule-based parsers use SCF information to constrain the number of analyses that are generated. Mary put [NP a book] [PP on the table].

  4. Why do we need SCF information?

  5. Why do we need to extract SCFs from corpora? • Hand-coded SCF lists are invariably incomplete: a. Dictionaries may miss SCFs present in the corpus data. - Korhonen (2002): add: He adds what he thinks is right. - Manning (1993): retire to/in: Mary retired to Malibu. Mary retired in Malibu.

  6. Why do we need to extract SCFs from corpora? b. Dictionaries do not show the relative frequencies of the SCFs they list. - Manning (1993): agree about/with; disagree about/with

  7. Why is it difficult? • Theoretical Issues: The complement/adjunct distinction gets murky in practice: a. clear boundary: The baby ate its cake [PP under the table]. b. ambiguous: The baby was crawling [PP under the table]. c. more complement-like or adjunct-like: He broke the vase [PP with a hammer].

  8. Why is it difficult? • Practical Issues: a. The Charniak parser makes no distinction between complements and adjuncts. b. I am working on the spoken-language part of the BNC.

  9. Two-Stage Task • Hypothesis Generation: Identify all candidate SCFs for every verb in the corpus. • Hypothesis Selection: Determine which candidate SCFs are genuine SCFs for a given verb.

  10. Different Approaches

  11. Description of the Acquisition System [Diagram. Stages: Hypothesis Generation, Hypothesis Selection. Components: Charniak Parser, SCF Extractor, Lemmatizer, SCF Evaluator.]

  12. Charniak Parser • Returns a very flat structure for each sentence in the corpus. • No complement/adjunct distinctions are made.

  13. SCF Extractor • Takes the parse tree as input and locates the verb and all its sisters. The sisters form the observed frame (OF) for that verb. • Further modifications are needed.
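A minimal sketch of the extraction step described on this slide, assuming Penn-style bracketed parser output. The helper names `parse_tree` and `observed_frames` are hypothetical, and the sketch makes two simplifying assumptions that the slide does not state: only sisters to the right of the verb are collected, and only tags starting with `VB` count as verbs.

```python
import re

def parse_tree(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                       # consume ")"
        return (label, children)

    return node()

def observed_frames(tree):
    """Yield (verb, sister labels) for every VB*-tagged child of a VP."""
    label, children = tree
    if label == "VP":
        for i, child in enumerate(children):
            if isinstance(child, tuple) and child[0].startswith("VB"):
                verb = child[1][0]
                # Assumption: only sisters to the right of the verb are kept.
                sisters = [c[0] for c in children[i + 1:] if isinstance(c, tuple)]
                yield verb, sisters
    for child in children:
        if isinstance(child, tuple):
            yield from observed_frames(child)

tree = parse_tree("(VP (VBD put) (NP (DT a) (NN book)) (PP (IN on) (NP (DT the) (NN table))))")
print(list(observed_frames(tree)))   # [('put', ['NP', 'PP'])]
```

The raw observed frame is deliberately left uncorrected here; clause types, passives, conjunctions, and phrasal verbs need the further modifications the slide mentions.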

  14. Finite and Infinite Clauses Example: asked me to say … Parse: (VP (VBN asked) (S (NP (PRP me)) (VP (TO to) (VP (VB say) …)))) Frames: asked: [S] → asked: [NP INF(to)]

  15. Finite and Infinite Clauses

  16. Passive Sentences Example: Over one thousand pounds had been saved. Parse: (NP (QP (IN over) (CD one) (CD thousand)) (NNS pounds)) (VP (AUX had) (VP (AUX been) (VP (VBN saved)))) Frames: saved: [ ] saved: [NP] saved: [ ] saved(pp): [NP]

  17. Verbs (going, got, used) Example: Who’s going to be recording our meeting? Parse: (WHNP (WP who)) (S (VP (AUX ’s) (VP (VBG going) (S (VP (TO to) (VP (AUX be) (VP (VBG recording) (NP (PRP$ our) (NN meeting))))))))) Frames: going: [S] → going: [INF(to)]

  18. Verbal Conjunctions Example: … bought and sold fruits. Parse: (VP (VBD bought) (CC and) (VBD sold) (NP (NNS fruits))) Frames: bought: [CC VBD NP] → bought: [NP]; sold: [NP]
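The conjunction correction on this slide can be sketched as a small post-processing step over observed frames. This is an illustrative sketch, not the system's actual code: `split_conjoined` is a hypothetical helper, and it assumes the only pattern to repair is a frame that begins with `CC` followed by a verb tag.

```python
def split_conjoined(frames):
    """Repair frames of conjoined verbs.

    frames: list of (verb, sister_labels) pairs taken from one VP, in order.
    A frame starting with CC + VB* means the verb shares the complements
    of the verb it is conjoined with, so those two labels are stripped.
    """
    fixed = []
    for verb, sisters in frames:
        sisters = list(sisters)
        while len(sisters) >= 2 and sisters[0] == "CC" and sisters[1].startswith("VB"):
            sisters = sisters[2:]      # drop the conjunction and the conjoined verb
        fixed.append((verb, sisters))
    return fixed

observed = [("bought", ["CC", "VBD", "NP"]), ("sold", ["NP"])]
print(split_conjoined(observed))   # [('bought', ['NP']), ('sold', ['NP'])]
```

Lists with more than two conjuncts separated by commas are not handled; they would need the comma label treated like `CC`.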

  19. Phrasal Verbs Example: as she goes along … Parse: (SBAR (IN as) (S (NP (PRP she)) (VP (VBZ goes) (ADVP (RB along)) Frames: goes: [ADVP(along)] → goes along: [ ]

  20. Lemmatizer • Use the English morphological analyzer MORPHA to lemmatize all verbs extracted from the corpus.

  21. SCF Evaluator • Why do we need a filter? The frames generated by the Extractor may: a. match correct SCFs. b. contain adjuncts. c. be wrong due to a tagging or parsing error.

  22. SCF Evaluator Two Components: • Binomial Hypothesis Test: (Brent 1993) • Back-Off Algorithm: (Sarkar & Zeman 2000)

  23. Binomial Hypothesis Test • Hypothesis Testing: H0: There is NOT a genuine correlation between verb_j and scf_i. H1: There is such a correlation.

  24. Binomial Hypothesis Test • n: the total number of SCF cues found for verb_j. • m: the number of these cues that are cues for scf_i. • p: the error probability that a cue for scf_i occurs with a verb which does not take scf_i.
  $$P(m, n, p) = \frac{n!}{m!\,(n-m)!}\; p^{m} (1-p)^{n-m}$$
  $$P(m+, n, p) = \sum_{k=m}^{n} P(k, n, p)$$
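The tail probability P(m+, n, p) on this slide can be computed directly from the definitions of n, m, and p. The sketch below is a minimal illustration using only the Python standard library; the function names and the significance threshold `alpha=0.05` are assumptions for the example, not values taken from the slides.

```python
from math import comb

def binomial_pmf(m, n, p):
    """P(m, n, p): probability of exactly m erroneous cues out of n."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def binomial_tail(m, n, p):
    """P(m+, n, p): probability of m or more erroneous cues out of n."""
    return sum(binomial_pmf(k, n, p) for k in range(m, n + 1))

def accept_scf(m, n, p, alpha=0.05):
    """Reject H0 (no genuine verb/SCF association) when the tail is below alpha.

    alpha is a hypothetical threshold chosen for illustration.
    """
    return binomial_tail(m, n, p) < alpha

# 8 of 20 cues for a frame is very unlikely to be noise at p = 0.05,
# whereas 1 of 20 is entirely consistent with noise.
print(accept_scf(8, 20, 0.05))   # True
print(accept_scf(1, 20, 0.05))   # False
```

A small tail probability means the observed cues are unlikely to be errors alone, so the frame is kept for that verb.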

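  25. Back-Off Algorithm [Figure: observed frames for the verb "remember", built from S, S(wh), NP and V-ING, each with a count. Labels as extracted: S, S(wh) NP, V-ING, S(wh), NP, V-ING; counts as extracted: 0, 25, 10, 12, 5, 1, 2, 4.]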

  26. Back-off Algorithm • Basic Idea: The correct SCFs are often subsets of OFs. - Long infrequent frames are suspected to contain adjuncts. - Short infrequent frames may have elided some arguments.

  27. Back-off Algorithm • Frames generated due to parsing errors: Example: introduce Mr. Johnson from Longman Company. Parse: (VP (VB introduce) (NP (NN Mr. Johnson)) (PP (IN from) (NP (NN Longman Company)))) OF: [NP PP(from)] SCF: [NP]
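The back-off idea (correct SCFs are often subsets of observed frames) can be illustrated with a deliberately simplified stand-in. This is not Sarkar & Zeman's algorithm as published: the hypothetical `back_off` helper below uses a plain count threshold (`min_count`, an assumed value) instead of a statistical test, and it assumes that the rightmost element of a rejected frame is the one to drop.

```python
from collections import Counter

def back_off(frame_counts, min_count=3):
    """Filter one verb's observed frames, backing off rejected ones.

    frame_counts: mapping from frame tuples, e.g. ("NP", "PP(from)"),
    to observed counts. Frames seen fewer than min_count times are
    rejected and their count is passed to the frame obtained by
    dropping the last element (assumption: adjuncts sit at the right edge).
    """
    counts = Counter(frame_counts)
    accepted = {}
    longest = max(map(len, counts), default=0)
    # Visit longest frames first so their counts can flow to shorter ones.
    for length in range(longest, -1, -1):
        for frame in [f for f in counts if len(f) == length]:
            c = counts[frame]
            if c >= min_count:
                accepted[frame] = c
            elif length > 0:
                counts[frame[:-1]] += c   # back off to the shorter frame
    return accepted

observed = {("NP", "PP(from)"): 1, ("NP",): 2}
print(back_off(observed))   # {('NP',): 3}
```

With the counts above, the infrequent frame [NP PP(from)] is rejected and its single occurrence is credited to [NP], mirroring the OF and SCF on this slide. The sketch does not address short infrequent frames that may have elided arguments.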

  28. Expected Performance

  29. Expected Performance

  30. Expected Performance

  31. Applications of the Current System • Parsing System • Verb Sense Disambiguation • Using syntactic variables as predictors for speech variability. (CogSci Summer Research Project)

  32. Future Work • Some linguistic issues: - Prepositional complements (PP): put: (locative/directional) lean: (toward/against) - more complement-like or more adjunct-like.

  33. Future Work • Hypothesis Generation: - correctly identify gaps in relative clauses. - correctly identify extraposed clausal complements.

  34. Future Work • Back-off methods: - semantic verb classification (Levin). - morphological analyzer (verb derivation)

  35. Future Work • More informative computational lexicon: Subcategorization Information: - syntactic frames - semantic predicate-argument structure - selectional restrictions on arguments - diathesis alternations

  36. Phrasal Verbs • A good method to extract phrasal verbs from a corpus. - disambiguate phrasal verbs - English teaching - translation
