
Hansaem Kim The National Institute of the Korean Language

2009. 6. 18. ISO/TC37/SC4/WG2 Word Segmentation Project Editorial Meeting Word Segmentation in Korean. Hansaem Kim The National Institute of the Korean Language. Contents for further work (09.4.24.). Part1 1. WU, WSU: check up 2. Figure1 -> change and check. 3. Figure4


Presentation Transcript


  1. 2009. 6. 18. ISO/TC37/SC4/WG2 Word Segmentation Project Editorial Meeting. Word Segmentation in Korean. Hansaem Kim, The National Institute of the Korean Language

  2. Contents for further work (09.4.24.)
     Part 1
     1. WU, WSU: check up
     2. Figure 1 -> change and check
     3. Figure 4
        1) lemma: delete
        2) other lexical items -> other character strings
        3) word forms -> lexical items
        4) bound morpheme: delete
     Part 2
     1. Terms & definitions: added, e.g. bunsetsu, eojeol, character, etc.
     2. Properties of CJK: add to introductory part
     3. In "Scope": Chinese scripts -> Chinese characters
     4. Application of Chinese general rules for JK in combination of Chinese characters
     5. Add examples of agglutinative unit in JK

  3. Table of contents (Part 2)
     Foreword
     1. Introduction: Kim
        1) difference of CJK
        2) interaction of CJK (nouns w/ Chinese characters)
     2. Scope: Choi
        Application oriented; refer to MAF, SynAF, etc.; linguistic layer & processing (vertical)
     3. Terms and definitions
        Bunsetsu: Kanzaki
        Eojeol: Kim
     4. Overview and motivation: Kanzaki (main), Sun, Kim
        Mapping table of CJK POS scheme (+ examples and definition)
     5. Chinese word segmentation
        5.1. General rules for identifying WUs in Chinese text
        5.2. Typology of WUs in Chinese
     6. Japanese
     7. Korean

  4. Basic concepts and general principles (Part1)

  5. Terms and definitions: Word unit (WU)
     Distinction between ‘word unit’ and ‘word segmentation unit’
       Y -> Terms and definition of WSU +
       N -> Correcting the definition of WU
     MWE (phrasal compound, fragment of sentence, …) ⊂ lexical item?
       Y -> No change, or changing ‘lexical items’ into ‘lexical items including MWEs’
       N -> Changing ‘lexical items’ into ‘lexical items, MWEs’

  6. Essential concept systems (Figure 1)

  7. Essential concept systems (Figure 4), changed. Figure labels: Word segmentation unit, Miscellaneous character strings, Word forms

  8. Word segmentation for CJK (Part2)

  9. Introduction: See the document.
     1) difference of CJK
     2) interaction of CJK (nouns w/ Chinese characters)

  10. Terms and definitions: Eojeol
      A linguistic unit delimited by white space in Korean text, consisting of a word followed by one or more particles or endings, or of a word alone.
      Example: In the sentence “나는 점심을 먹었다.”, “나 (I)” is a pronoun, “는” is a particle, “점심 (lunch)” is a noun, “을” is a particle, and “먹 (eat)” is a verbal stem followed by the endings “었” and “다”. The sentence contains three eojeols: “나는”, “점심을”, and “먹었다”.
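The eojeol definition above can be illustrated with a minimal Python sketch. The function name `split_eojeols`, the `EDGE_PUNCT` set, and the `ANALYSIS` table are assumptions made for illustration; the morpheme labels are copied by hand from the slide's example and are not produced by any analyzer.

```python
import string

# Punctuation trimmed from the edges of each chunk. This is an illustrative
# subset chosen for the sketch, not the official Korean punctuation list.
EDGE_PUNCT = string.punctuation + "“”‘’…"


def split_eojeols(sentence: str) -> list[str]:
    """Split a sentence on white space and trim edge punctuation."""
    chunks = sentence.split()
    trimmed = [chunk.strip(EDGE_PUNCT) for chunk in chunks]
    return [eojeol for eojeol in trimmed if eojeol]


# Hand-written analysis taken from the slide's example sentence.
ANALYSIS = {
    "나는": [("나", "pronoun"), ("는", "particle")],
    "점심을": [("점심", "noun"), ("을", "particle")],
    "먹었다": [("먹", "verbal stem"), ("었", "ending"), ("다", "ending")],
}

eojeols = split_eojeols("나는 점심을 먹었다.")
print(eojeols)

# Each eojeol is the concatenation of its word and the particles or endings.
for eojeol in eojeols:
    assert "".join(form for form, _ in ANALYSIS[eojeol]) == eojeol
```

Splitting on white space only recovers the eojeols; separating the word from its particles or endings needs morphological analysis, which this sketch does not attempt.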

  11. Mapping table of CJK POS scheme Overview and motivation

  12. 7.1. General rules for identifying WUs in Korean text

  13. 7.1.1. Punctuation
      Blank spaces and punctuation marks act as separators between word segmentation units in computer processing. The punctuation marks used as separators include the full stop (.), question mark (?), exclamation mark (!), comma (,), middle dot (․), colon (:), slash (/), quotation marks (“ ”, ‘ ’), brackets (( ), { }, [ ]), dash (―), hyphen (-), swung dash (~), ellipsis (……), etc. Korean punctuation marks are listed in the “Korean language regulations”.
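As a rough sketch of the punctuation rule, the following Python splits text at white space and at the marks named on the slide. The names `SEPARATORS` and `segment` are hypothetical, and the separator set covers only the marks listed above (the slide ends its list with "etc.", so the full set is not known from this source).

```python
import re

# Separator marks named on the slide; white space is matched by \s below.
SEPARATORS = ".?!,․:/“”‘’(){}[]―-~…"

# One or more separators or white-space characters in a row form one boundary.
_BOUNDARY = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def segment(text: str) -> list[str]:
    """Return the character strings found between separators."""
    return [piece for piece in _BOUNDARY.split(text) if piece]


print(segment("나는 점심을 먹었다."))
print(segment("[abc], {라}"))
```

Treating every hyphen or full stop as a boundary is a simplification: strings such as decimal numbers or hyphenated foreign words would be split, so a practical segmenter would likely need exceptions that this sketch leaves out.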

  14. 7.1.2. Combination of characters
      7.1.2.1. Numeric character strings: 1984, 2009
      7.1.2.2. Foreign character strings: GPS, EU, 同意
      7.1.2.3. Hangeul (Korean alphabet) characters (C & V): ㄱㄴㄷ, 가
      7.1.2.4. Combination of character strings or other symbols: [abc], {라}
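The four categories in 7.1.2 can be approximated by looking at each character's Unicode code point. The sketch below is an assumption-laden illustration: `char_class` and `classify` are hypothetical names, "foreign" is taken to mean any letter that is not Hangeul, and a string counts as a "combination" whenever it mixes classes.

```python
def char_class(ch: str) -> str:
    """Assign a single character to a coarse class."""
    code = ord(ch)
    if "0" <= ch <= "9":
        return "numeric"
    # Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo blocks.
    if (0xAC00 <= code <= 0xD7A3
            or 0x1100 <= code <= 0x11FF
            or 0x3130 <= code <= 0x318F):
        return "hangeul"
    if ch.isalpha():
        # Any other letter (Latin, Chinese characters, ...) is treated as foreign.
        return "foreign"
    return "symbol"


def classify(string: str) -> str:
    """Classify a non-empty character string by the classes it contains."""
    if not string:
        raise ValueError("empty string")
    classes = {char_class(ch) for ch in string}
    return classes.pop() if len(classes) == 1 else "combination"


for example in ["1984", "GPS", "同意", "ㄱㄴㄷ", "가", "[abc]", "{라}"]:
    print(example, classify(example))
```

The examples in the loop are the ones given on the slide; the class boundaries themselves are a design choice of this sketch and are not taken from the draft standard.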

  15. 7.1.3. Word
      7.1.3.1. Simplex: 사자, 밥
      7.1.3.2. Compound: 농목장, 검붉다
      7.1.3.3. Derivation: 풋사과, 신사적, 동의하다
      7.1.3.4. Abbreviation: 건교위, 노찾사
      7.1.3.5. Idiomatic expression w/ Chinese characters: 와신상담(臥薪嘗膽), 오십보백보(五十步百步)

  16. 7.1.4. Combination of words (MWEs)
      7.1.4.1. Phrasal compound
        1) General phrasal compound: 주민 번호
        2) Terminology: 민주 국가, 계급 사회
        3) Expressions related to proper nouns: 예술의 전당
      7.1.4.2. Idiom
        1) Lexical idiom: 무릎을 꿇다
        2) Grammatical idiom: ~로 인해, ~을 위해
      7.1.4.3. Fixed expression (proverb, motto, etc.): 낫 놓고 기역 자도 모른다
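One possible way to treat such multiword expressions as single units is a greedy longest-match lookup over adjacent eojeols. The slides do not describe a procedure, so this is only a sketch under stated assumptions: `MWE_LEXICON` is a hypothetical mini-lexicon built from the examples above, and `merge_mwes` is an illustrative name.

```python
# Hypothetical lexicon: each entry is a tuple of space-separated eojeols
# taken from the slide's examples.
MWE_LEXICON = {
    ("주민", "번호"),
    ("민주", "국가"),
    ("계급", "사회"),
    ("예술의", "전당"),
    ("낫", "놓고", "기역", "자도", "모른다"),
}
MAX_LEN = max(len(entry) for entry in MWE_LEXICON)


def merge_mwes(eojeols: list[str]) -> list[str]:
    """Merge adjacent eojeols that form a lexicon entry, longest match first."""
    merged = []
    i = 0
    while i < len(eojeols):
        # Try the longest candidate window first, down to two eojeols.
        for n in range(min(MAX_LEN, len(eojeols) - i), 1, -1):
            candidate = tuple(eojeols[i:i + n])
            if candidate in MWE_LEXICON:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword entry starts here; keep the single eojeol.
            merged.append(eojeols[i])
            i += 1
    return merged


print(merge_mwes("민주 국가 계급 사회".split()))
print(merge_mwes("낫 놓고 기역 자도 모른다".split()))
```

The lookup matches exact surface forms only, so inflected idioms (for example other forms of 무릎을 꿇다) would not be found without morphological analysis.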

  17. Typology of WUs in Korean

  18. Overall typology (See the document.)
      1. Noun
         1.1 Common noun
         1.2 Proper noun
         1.3 Bound noun
      2. Pronoun
      3. Numeral
      4. Verb
      5. Auxiliary verb
      6. Copula
      7. Adjective
      8. Auxiliary adjective
      9. Adnoun
      10. Adverb
      11. Exclamation
      12. Particle
         12.1 Case particle
         12.2 Auxiliary particle
