
Grammar Extraction and Refinement from an HPSG Corpus

Grammar Extraction and Refinement from an HPSG Corpus. Kiril Simov BulTreeBank Project (www.BulTreeBank.org) Linguistic Modeling Laboratory, Bulgarian Academy of Sciences kivs@bultreebank.org ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics


Presentation Transcript


  1. Grammar Extraction and Refinement from an HPSG Corpus Kiril Simov BulTreeBank Project (www.BulTreeBank.org) Linguistic Modeling Laboratory, Bulgarian Academy of Sciences kivs@bultreebank.org ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics August 5 - 9, 2002

  2. Plan of the Talk
     • DOP model
     • An HPSG corpus: definition
     • Formalism for HPSG
     • Extraction of an HPSG grammar from an HPSG corpus
     • Refinement of an HPSG grammar
     • Conclusion

  3. DOP Model [Bod 1998]
     • A grammar formalism for the target grammar
     • A procedure for constructing sentence analyses in the chosen grammar formalism
     • A decomposition procedure, which extracts a grammar in the target grammar formalism from the structures in the corpus
     • A performance model guiding the analysis of new sentences with respect to some desirable conditions

  4. DOP Model (2)
     Two additional unspoken assumptions:
     • The structures in the corpus are decomposable into the grammar formalism
     • The extracted grammar should neither overgenerate nor undergenerate with respect to the training corpus
     This assumption refers to the quality of the corpus.

  5. Corpus in a Grammar Formalism
     A corpus C in a given grammatical formalism G is a sequence of analyzed sentences, where each analyzed sentence is a member of the set of structures defined as the strong generative capacity (SGC) of a grammar $\Gamma$ in this grammatical formalism:
     $\forall S.\; S \in C \rightarrow S \in SGC(\Gamma)$
     and
     $\forall S.\; S \in C \rightarrow$ […] S'.(S' […] […]([…](S)) […] S' […] C)

  6. HPSG Corpus
     Strong generative capacity in HPSG is defined by (King 1999) and (Pollard 1999) in technically the same way.
     In our work we take the elements of the strong generative capacity in HPSG to be a special kind of feature graphs, based on a logic for HPSG: King’s logic, SRL.

  7. Feature Graphs (1)
     $\Sigma = \langle S, F, A \rangle$ is an SRL finite signature.
     $G = \langle N, V, \rho, T \rangle$ is a feature graph iff G is a directed, connected and rooted graph such that:
     • $N$ is a set of nodes,
     • $V : N \times F \rightarrow N$ is a partial arc function,
     • $\rho$ is the root node,
     • $T : N \rightarrow S$ is a total species assignment function.
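
The definition on this slide can be written down as a small data structure with a well-formedness check. This is only a sketch under stated assumptions: the names `FeatureGraph` and `is_feature_graph` are hypothetical, the appropriateness component A of the signature is ignored because the slide does not use it, and "connected and rooted" is read as "every node is reachable from the root".

```python
from dataclasses import dataclass


@dataclass
class FeatureGraph:
    nodes: frozenset  # N: the set of nodes
    arcs: dict        # V: partial map (node, feature) -> node
    root: object      # the root node
    species: dict     # T: total map node -> species


def is_feature_graph(g, species_set, feature_set):
    """Check the slide's conditions against a signature (S, F); A is ignored."""
    if g.root not in g.nodes:
        return False
    # T must be total on N and take its values in S.
    if set(g.species) != set(g.nodes):
        return False
    if not set(g.species.values()) <= set(species_set):
        return False
    # V must stay inside N and use only features from F.
    for (src, feat), dst in g.arcs.items():
        if src not in g.nodes or dst not in g.nodes or feat not in feature_set:
            return False
    # Rooted and connected: every node is reachable from the root.
    seen, stack = {g.root}, [g.root]
    while stack:
        node = stack.pop()
        for (src, _feat), dst in g.arcs.items():
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen == set(g.nodes)


# Hypothetical two-node graph: a "word" node whose HEAD arc leads to a "noun" node.
g = FeatureGraph(
    nodes=frozenset({0, 1}),
    arcs={(0, "HEAD"): 1},
    root=0,
    species={0: "word", 1: "noun"},
)
print(is_feature_graph(g, {"word", "noun"}, {"HEAD"}))
```

Storing V as a dictionary keyed by (node, feature) makes it a partial function by construction, so that condition needs no separate check.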

  8. Feature Graphs (2)
     Some notions:
     • Subsumption, based on isomorphism
     • Unification: there is no most general unifier
     • Complete feature graphs: all information from the signature is present
     • Paths
     • Subgraphs
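
Subsumption based on isomorphism can be illustrated with a short check. The sketch below assumes one particular reading, which the slide does not spell out: a graph subsumes another if its nodes can be mapped one-to-one onto nodes of the other so that the root goes to the root, species are equal, and every arc is preserved. Graphs are written as hypothetical `(root, species, arcs)` triples.

```python
def subsumes(general, specific):
    """Return True if `general` embeds into `specific`, root to root."""
    g_root, g_species, g_arcs = general
    s_root, s_species, s_arcs = specific
    mapping = {g_root: s_root}  # node of `general` -> node of `specific`
    stack = [g_root]
    while stack:
        node = stack.pop()
        image = mapping[node]
        # Species must agree on corresponding nodes.
        if g_species[node] != s_species[image]:
            return False
        for (src, feat), dst in g_arcs.items():
            if src != node:
                continue
            # Every arc of `general` needs a counterpart in `specific`.
            target = s_arcs.get((image, feat))
            if target is None:
                return False
            if dst in mapping:
                if mapping[dst] != target:
                    return False
            else:
                mapping[dst] = target
                stack.append(dst)
    # One-to-one: distinct nodes may not collapse onto the same node.
    return len(set(mapping.values())) == len(mapping)


general = (0, {0: "phrase", 1: "word"}, {(0, "HEAD-DTR"): 1})
specific = (
    "a",
    {"a": "phrase", "b": "word", "c": "noun"},
    {("a", "HEAD-DTR"): "b", ("b", "HEAD"): "c"},
)
print(subsumes(general, specific))  # the smaller graph embeds into the larger one
print(subsumes(specific, general))  # the extra HEAD arc has no counterpart
```

Because the arc function is deterministic and every node is reachable from the root, the mapping is fixed once the roots are paired, so no search over candidate mappings is needed.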

  9. Feature Graphs (3)
     • Feature graphs can be interpreted via translation to SRL clauses
     • Exclusive matrices can be represented as a set of feature graphs (an exclusive set of graphs)
     • An SRL finite theory with respect to an SRL finite signature can be represented as a set of feature graphs
     • A sentence analysis can be represented as a complete feature graph

  10. Feature Graphs (4)
      • Complete feature graphs are a good representation for an HPSG corpus
      • Feature graphs are a good representation for an HPSG grammar (an exclusive set of graphs)
      Important property: for each node in a graph in the corpus, there exists exactly one graph in the grammar which subsumes the subgraph rooted at that node.

  11. Corpus Grammar
      A grammar $\Gamma$ such that the corpus C is a subset of its strong generative capacity is called a corpus grammar:
      $C \subseteq SGC(\Gamma)$
      In feature graph terms: for each complete graph in the corpus, the grammar contains a graph which subsumes it.
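
The feature-graph formulation of a corpus grammar amounts to a simple coverage test. The sketch below makes a loud simplifying assumption: each graph is a hypothetical dictionary from feature paths (tuples of feature names, with `()` for the root) to species, which ignores shared substructure, and subsumption is reduced to "every path of the general graph occurs in the specific graph with the same species".

```python
def subsumes(general, specific):
    """Toy subsumption: each (path, species) fact of `general` holds in `specific`."""
    return all(specific.get(path) == sp for path, sp in general.items())


def is_corpus_grammar(grammar, corpus):
    """True if every corpus graph is subsumed by at least one grammar graph."""
    return all(any(subsumes(g, c) for g in grammar) for c in corpus)


# Hypothetical corpus of two complete graphs.
corpus = [
    {(): "phrase", ("HEAD-DTR",): "word"},
    {(): "word", ("HEAD",): "noun"},
]
covering = [{(): "phrase"}, {(): "word"}]
partial = [{(): "phrase"}]

print(is_corpus_grammar(covering, corpus))
print(is_corpus_grammar(partial, corpus))
```

The first grammar covers both corpus graphs; the second leaves the `word` graph without a subsuming graph, so it is not a corpus grammar for this corpus.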

  12. Grammar Extraction (1)
      Grammar extraction from an HPSG corpus C is a graph fragmentation operation which produces a set of graphs from which a grammar can be constructed. The result is a set of graphs, GF.
      Each extracted fragment has to
      • contain all features for the root node, and
      • subsume at least one complete graph in the corpus
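
A fragmentation step satisfying the two conditions on this slide might look as follows. This is an illustrative sketch, not the procedure from the talk: graphs are hypothetical dictionaries from feature paths (tuples, `()` for the root) to species, shared substructure is ignored, only fragments anchored at the root of a complete graph are produced, and "all features for the root node" is read as "every arc leaving the root is kept". The names `fragments` and `extract_gf` are made up for this example.

```python
from itertools import combinations


def fragments(complete):
    """Enumerate root-anchored fragments of one complete graph."""
    # The root and all arcs leaving it are always kept.
    required = {p for p in complete if len(p) <= 1}
    optional = sorted(p for p in complete if len(p) > 1)
    result = []
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            paths = required | set(extra)
            # Keep only connected fragments: each path needs its parent path.
            if all(p[:-1] in paths for p in paths if p):
                result.append({p: complete[p] for p in paths})
    return result


def extract_gf(corpus):
    """Collect the distinct fragments of all complete graphs in the corpus."""
    seen, gf = set(), []
    for graph in corpus:
        for frag in fragments(graph):
            key = frozenset(frag.items())
            if key not in seen:
                seen.add(key)
                gf.append(frag)
    return gf


# Hypothetical complete graph for one analyzed sentence.
complete = {
    (): "phrase",
    ("HEAD-DTR",): "word",
    ("HEAD-DTR", "HEAD"): "noun",
    ("COMP-DTR",): "word",
}
for frag in extract_gf([complete]):
    print(sorted(frag))
```

Each fragment is a restriction of the complete graph it was cut from, so under this toy representation it subsumes that graph by construction. The enumeration is exponential in the number of optional paths, which is acceptable only for small illustrations.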

  13. Grammar Extraction (2)
      The set GF is ordered by the subsumption relation. The complete graphs from the corpus are at the bottom.
      Each set of graphs G such that G contains only graphs from GF and, for each complete graph M in GF, at least one feature graph in G subsumes M, is a corpus grammar of C.
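
One way to pick such a set G out of GF is to keep only the graphs at the top of the subsumption order. The sketch below assumes the same toy representation as a path-to-species dictionary (no shared substructure, subsumption as inclusion of path facts) and a GF without duplicates; `most_general` is a hypothetical helper, and the choice of the top elements is just one of many admissible sets G.

```python
def subsumes(general, specific):
    """Toy subsumption: each (path, species) fact of `general` holds in `specific`."""
    return all(specific.get(path) == sp for path, sp in general.items())


def most_general(gf):
    """Keep the graphs that no other graph in gf subsumes."""
    return [g for g in gf if not any(h != g and subsumes(h, g) for h in gf)]


# Hypothetical GF: one fragment plus two complete graphs from the corpus.
gf = [
    {(): "phrase", ("HEAD-DTR",): "word"},
    {(): "phrase", ("HEAD-DTR",): "word", ("HEAD-DTR", "HEAD"): "noun"},
    {(): "word", ("HEAD",): "verb"},
]
complete_graphs = [gf[1], gf[2]]

grammar = most_general(gf)
print(len(grammar))
print(all(any(subsumes(g, m) for g in grammar) for m in complete_graphs))
```

Since GF is finite and this toy subsumption is a partial order, every complete graph lies below some retained graph, so the selected set meets the covering condition stated on the slide.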

  14. Grammar Extraction (3)
      All corpus grammars extracted in this way can be ordered by set inclusion of their strong generative capacity.
      A grammar from this hierarchy can be chosen by specifying additional constraints over it, such as:
      • it is the most general one that doesn’t overgenerate or undergenerate over the corpus, or
      • it satisfies some external conditions, such as the shortest inference over the corpus

  15. The set GF as a Grammar
      This is the original idea behind the DOP model.
      • GF contains all generalizations over the corpus
      • GF will overgenerate over the corpus
      • GF will accept ungrammatical sentences
      Thus a special inference mechanism is necessary in order to use GF as a grammar.

  16. Grammar Refinement
      In the process of creating an HPSG corpus, the annotators use an HPSG grammar.
      This grammar can serve as a starting point for the extraction of a better grammar. This process is called grammar refinement.
      As the new grammar, we can choose among the most general grammars that refine the original grammar.

  17. Conclusions
      • We define an HPSG corpus as a set of complete graphs
      • We define an HPSG grammar as a set of graphs
      • We define a procedure for extraction of corpus grammars from the corpus
      • We define a refinement of a grammar on the basis of a corpus
