
A Dependency Treebank of Classical Chinese Poems

A Dependency Treebank of Classical Chinese Poems. John Lee and Yin Hei Kong, The Halliday Centre for Intelligent Applications of Language Studies, Department of Chinese, Translation and Linguistics


Presentation Transcript


  1. A Dependency Treebank of Classical Chinese Poems • John Lee and Yin Hei Kong • The Halliday Centre for Intelligent Applications of Language Studies, Department of Chinese, Translation and Linguistics • 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 191–199, Montréal, Canada, June 3-8, 2012. © 2012 Association for Computational Linguistics

  2. Outline • 1. Abstract • 3. Treebank design • 4. Data • 5. Parallel Couplets

  3. 1. Abstract • First large-scale dependency treebank for Classical Chinese literature • Dependency relations derived from the Stanford dependency types • Over 32K characters • Tang poetry (唐詩)

  4. 3. Treebank design • Classical Chinese and Modern Chinese • Similarity • Vocabulary • Grammar • POS tagset • Based on the Penn Chinese Treebank, with a slight revision of its 33 tags (Lee, 2012)

  5. A dependency framework is chosen for two reasons • Free word order • Dependency grammars can handle this phenomenon well • Helpful to students

  6. Dependency relations • Our set of dependency relations is based on those developed at Stanford University for Modern Chinese • Our approach is to map their 44 dependency relations, as much as possible, to Classical Chinese • Many function words of Modern Chinese, such as those marking tense, voice, and case, do not exist in Classical Chinese

  7. Dependency relations • (Figure; labels refer to Sections 3.1, 3.2, 3.3, 3.4, and 3.6)

  8. 3.1 Locative modifiers • The preposition is frequently omitted • A bare locative noun phrase modifies the verb directly • Example: “hill” occupies the position normally reserved for the subject, but it actually indicates a location • Example: the locative noun ‘alley’ is placed after the verb

  9. 3.2 Oblique objects • The obl relation marks nouns that directly modify a verb • They typically come after the verb • Example: the noun ‘cup’ is used in an instrumental sense to modify ‘drunk’ in an obl relation

  10. 3.3 Noun phrase as adverbial modifier • Floating reflexives (e.g., it is itself adequate) • Other PP-like NPs (e.g., two times a day) • Example: the noun ‘self’ as a reflexive • Example: the noun ‘year’ indicating repetition

  11. 3.4 Indirect objects • The double object construction contains two objects in a verb phrase • Direct object: the thing given (“a book” in “he gave me a book”) • Indirect object: the recipient (“me” in “he gave me a book”) • Classical Chinese does not have this linguistic device • The indirect object is unmarked • We distinguish it with the “indirect object” label (iobj) • Example: ‘word’ as the direct object, ‘person’ as the indirect object

  12. 3.4 Indirect objects

  13. 3.5 Absence of copular verbs • In “A is B”, A is considered the “topic” (top) of the copular verb “is” (Chang et al., 2009) • The copula, however, is rarely used in Classical Chinese (Pulleyblank, 1995) • In some cases, it is replaced by an adverb that functions as a copular verb • If so, that adverb is POS-tagged as such (VC) in our treebank • In other cases, the copula is absent altogether • For these cases we expand the usage of the top relation • Example: the relation top(‘capable’, ‘general’) would be assigned
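The relations on these slides are written as label(head, dependent), for example top(‘capable’, ‘general’) here and the obl relation between ‘drunk’ and ‘cup’ in Section 3.2. The following is a minimal Python sketch of that notation. The `Dependency` class and its field names are illustrative assumptions, not the treebank's release format, and the English glosses stand in for the Chinese characters.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    """One typed dependency, written label(head, dependent)."""
    label: str      # relation name, e.g. "top" or "obl"
    head: str       # governing word (English gloss used here)
    dependent: str  # modifying word (English gloss used here)

    def __str__(self) -> str:
        return f"{self.label}({self.head}, {self.dependent})"


# Examples taken from the slides; the class itself is hypothetical.
copula_free = Dependency("top", "capable", "general")  # Section 3.5: no copula present
instrumental = Dependency("obl", "drunk", "cup")       # Section 3.2: 'cup' modifies 'drunk'

print(copula_free)
print(instrumental)
```

Keeping the head and dependent as separate fields makes it easy to filter a treebank by relation label, for instance to list every top relation that lacks a VC-tagged head.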

  14. 3.6 Discourse relations • Even in the absence of explicit connectives, two adjacent clauses can still hold an implicit discourse relation

  15. 3.6 Discourse relations

  16. 4 Data • The Complete Shi Poetry of the Tang (Peng, 1960) • nearly 50,000 poems • more than two thousand poets

  17. 4.1 Material • Over 32,000 characters in 521 poems • Wang Wei (王維) and Meng Haoran (孟浩然) • Dependency relations • Word boundaries and POS tags • Metadata • Tone: level (ping 平) or oblique (ze 仄) • Title, author, and genre • Genre: ‘recent-style’ (近體詩) or ‘ancient-style’ (古體詩)

  18. 4.2 Inter-annotator agreement • Two annotators, both university graduates with a degree in Chinese, created this treebank • To measure inter-annotator agreement, we set apart a subset of about 1050 characters • Agreement rate on three tasks • POS tagging: 95.1% • Head selection: 92.3% • Dependency labeling: 91.2%
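The slide reports agreement rates but not the formula behind them. The sketch below assumes plain percentage agreement, i.e. the share of items on which both annotators chose the same label. The function name and the toy annotations are invented for illustration and are not data from the treebank.

```python
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of positions where two annotators chose the same label."""
    if len(a) != len(b):
        raise ValueError("annotations must cover the same items")
    if not a:
        raise ValueError("no items to compare")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Toy POS annotations for five characters (invented, not from the corpus).
annotator_1 = ["NN", "VV", "AD", "NN", "JJ"]
annotator_2 = ["NN", "VV", "VV", "NN", "NN"]

rate = agreement_rate(annotator_1, annotator_2)
print(f"POS agreement: {rate:.1%}")
```

The same function applies to head selection and dependency labeling if each list holds head indices (as strings) or relation labels instead of POS tags.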

  19. For POS tagging • The three main error categories are confusions • between adverbs (AD) and verbs with an adverbial force • between measure words (M) and nouns (NN) • between adjectives (JJ) and nouns • These differences in POS tags trickle down to head selection and dependency labeling

  20. Polysemy • 簞食伊何 • 簞 dan: ‘bowl’ or ‘blanket’ • Reading 1: ‘What food is contained in that bowl?’ • Here the relation clf is required for 簞 dan, and 伊 yi is the root word • Reading 2: ‘What food is placed on the blanket?’ • Here dan takes on the relation nn with ‘food’, and the root word would be 何 he instead

  21. 5. Parallel Couplets • Character-level parallelism. • Phrase-level parallelism.

  22. Character-level parallelism • Requiring exactly matched POS tags yields a parallel rate of only 74% in the corpus as a whole • ‘Equivalence sets’ of POS tags: two tags in the same set are considered parallel, even though they do not match • With equivalence sets, the parallel rate increases to 87% • Equivalence sets are not perfect: a polysemous character may be used with an ‘out-of-context’ meaning (jieyi 借義) • Example: “欲就終焉志,恭聞智者名” • 焉 yan is a sentence particle and 者 zhe is a noun, so their tags are not parallel • However, the poet apparently viewed them as parallel, because zhe can also function as a sentence particle in other contexts
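The difference between exact tag matching and matching through equivalence sets can be sketched as below. This is a rough illustration under stated assumptions: the slide does not list the actual equivalence sets, so `EQUIV_SETS` is hypothetical, the rate is assumed to be computed per character position, and the two tag sequences are invented rather than taken from the corpus.

```python
from typing import FrozenSet, Sequence

# Hypothetical equivalence sets; the sets used for the treebank are not given on the slide.
EQUIV_SETS = [frozenset({"NN", "NR"}), frozenset({"VV", "VA"})]


def tags_parallel(t1: str, t2: str, equiv_sets: Sequence[FrozenSet[str]] = ()) -> bool:
    """Two tags are parallel if identical or if both belong to one equivalence set."""
    if t1 == t2:
        return True
    return any(t1 in s and t2 in s for s in equiv_sets)


def parallel_rate(line1: Sequence[str], line2: Sequence[str],
                  equiv_sets: Sequence[FrozenSet[str]] = ()) -> float:
    """Share of character positions whose POS tags are parallel across a couplet."""
    if len(line1) != len(line2) or not line1:
        raise ValueError("lines must be non-empty and of equal length")
    hits = sum(tags_parallel(a, b, equiv_sets) for a, b in zip(line1, line2))
    return hits / len(line1)


# Invented tag sequences for the two lines of a pentasyllabic couplet.
first_line = ["NN", "NN", "VV", "AD", "VV"]
second_line = ["NR", "NN", "VA", "AD", "NN"]

exact = parallel_rate(first_line, second_line)
relaxed = parallel_rate(first_line, second_line, EQUIV_SETS)
print(f"exact match: {exact:.0%}, with equivalence sets: {relaxed:.0%}")
```

A jieyi case such as 焉 and 者 would still be counted as non-parallel here, because the check only sees the tag each character carries in context.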

  23. Character-level parallelism.

  24. Phrase-level parallelism • The character-level metric, however, still rejects some couplets that would be deemed parallel by scholars • Most of these couplets are parallel not at the character level, but at the phrase level • Pentasyllabic (5-character) line = disyllabic unit (the first two characters) + trisyllabic unit (the last three characters) • Example: consider two corresponding disyllabic units, 抱琴 and 垂釣 • 抱/VV 琴/NN vs. 垂/AD 釣/VV: the tags do not match character by character • Both units are verb phrases describing an activity (‘to hold a violin’ and ‘to fish while looking down’)
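One way to picture the phrase-level check is to split each line into its two units and compare the POS tag of each unit's head instead of comparing tags character by character. The sketch below is an assumption-laden illustration, not the authors' procedure: the `Token` class, the convention that `head == -1` means "headed outside the unit", and the head assignments for 抱琴 and 垂釣 are all hypothetical, chosen to fit the glosses on the slide.

```python
from typing import NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    char: str
    tag: str
    head: int  # index of the head inside the unit, or -1 if the head lies outside it


def split_pentasyllabic(line: str) -> Tuple[str, str]:
    """Split a 5-character line into its disyllabic and trisyllabic units."""
    if len(line) != 5:
        raise ValueError("expected a 5-character line")
    return line[:2], line[2:]


def unit_head_tag(unit: Sequence[Token]) -> str:
    """POS tag of the unit's head: the single token whose head is outside the unit."""
    heads = [tok for tok in unit if tok.head == -1]
    if len(heads) != 1:
        raise ValueError("expected exactly one token headed outside the unit")
    return heads[0].tag


def units_phrase_parallel(u1: Sequence[Token], u2: Sequence[Token]) -> bool:
    """Treat two units as parallel when their heads share a POS tag."""
    return unit_head_tag(u1) == unit_head_tag(u2)


# Assumed analyses: 琴 depends on 抱 (verb + object); 垂 depends on 釣 (adverb + verb).
unit_a = [Token("抱", "VV", -1), Token("琴", "NN", 0)]
unit_b = [Token("垂", "AD", 1), Token("釣", "VV", -1)]

print(split_pentasyllabic("欲就終焉志"))
print(units_phrase_parallel(unit_a, unit_b))
```

Under these assumptions both units are headed by a VV token, so they count as parallel at the phrase level even though their character-level tag sequences differ.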

  25. 5.3 Results

  26. Conclusion • We have presented the first large-scale dependency treebank of Classical Chinese literature, which encodes works by two poets in the Tang Dynasty. • We have described how the dependency grammar framework has been derived from existing treebanks for Modern Chinese, and shown a high level of inter-annotator agreement. Finally, we have illustrated the utility of the treebank with a study on parallelism in Classical Chinese poetry. • Future work will focus on parsing Classical Chinese poems of other poets, and on enriching the corpus with semantic information, which would facilitate not only deeper study of parallelism but also other topics such as imagery and metaphorical coherence (Zhu and Cui, 2010).
