
Some Thoughts on an International Standard for Word Segmentation (关于分词国际标准的若干思考)
Sun Maosong (孙茂松), Tsinghua University
February 1, 2007, Xishuangbanna



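Purpose

Word segmentation (分词)

  • 据认为,俄罗斯方面已经进入控制首都的最后阶段。

  • 据 认为 , 俄罗斯 方面 已经 进入 控制 首都 的 最后 阶段 。

  • ロシア側は首都制圧の最終段階に入ったとみられる。

  • ロシア 側 は 首都 制圧 の 最終 段階 に 入った と み られる 。

    (Both sentences mean: "It is believed that the Russian side has entered the final stage of taking control of the capital.")

    Asia: Japanese, Vietnamese, Thai, Korean (*), etc.

    Within China: ethnic minority languages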



  • Why word segmentation, and for what purpose?

    • To define what the output of word segmentation should be for an input text, pursuing maximum consistency of segmentation within and across texts, so as to meet the requirements of a variety of natural-language information-processing applications, both mono-lingual and multi-lingual

    • Without word segmentation there is no "word", and hence no content computing or management of natural-language text

    • To provide a benchmark for evaluating word segmentation
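To make the benchmark idea concrete, here is a minimal sketch of how a segmenter's output could be scored against a reference segmentation. The slides do not specify a metric; word-level precision, recall and F1 over character spans are assumed here because they are commonly used for this task. The function names `to_spans` and `score` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def score(gold, pred):
    """Return (precision, recall, F1) of a predicted segmentation.

    Assumption: a predicted word counts as correct only if both of its
    boundaries agree with the reference (gold) segmentation.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must segment the same text")
    if not gold:
        raise ValueError("cannot score an empty segmentation")
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Reference segmentation taken from the example sentence on the first slide.
gold = ["据", "认为", ",", "俄罗斯", "方面", "已经", "进入",
        "控制", "首都", "的", "最后", "阶段", "。"]
# Hypothetical system output that merges the last two content words.
pred = ["据", "认为", ",", "俄罗斯", "方面", "已经", "进入",
        "控制", "首都", "的", "最后阶段", "。"]

p, r, f = score(gold, pred)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.917 0.846 0.880"
```

Here 11 of the 12 predicted words match the 13 reference words, so a single granularity disagreement lowers both precision and recall. This is one reason a shared standard matters for evaluation.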


Word Seg. Standard Series

  • Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 1: General principles and methods

    Proposed by: CNIS, China

    Project leaders:

    Prof. Sun Maosong (Tsinghua U., China)

    Prof. Sue Ellen Wright (Kent State University, USA)

    Prof. Budin (Vienna U., Austria)

    Experts and actively involved scholars:

    Dr. Galinski

    Experts from USA (ANSI,…)

    Prof. Benjamin Tsou (City U of Hong Kong)

    Prof. Chu-ren Huang (Academia Sinica, Taipei)

    Prof. Virach (Thailand)

    Ms. Song Min (CNIS, China)

    ……


Word Seg. Standard Series

  • Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 2: Word Segmentation for Chinese, Japanese and Korean

    Proposed by: CNIS, China

    Project leaders:

    Prof. Sun Maosong (Tsinghua U., China)

    Prof. Choi (KAIST, Korea)

    Dr. Isahara (NICT, Japan)

    Experts and actively involved scholars:

    Dr. Galinski

    Experts from USA (ANSI,…)

    Prof. Benjamin Tsou (City U of Hong Kong)

    Prof. Chu-ren Huang (Academia Sinica, Taipei)

    Ms. Song Min (CNIS, China)

    ……


Related Activities

  • Early August 2004, Paris: NWIP at the ISO TC37 meetings

  • Late Jan. 2005: NWIP approved by ISO

  • Late April 2005, Yantai: about 30 Chinese linguists and computational linguists held a two-day discussion on Chinese word segmentation

  • Early July 2005, CNIS: discussion with Dr. Galinski and Prof. Choi

  • Late July 2005, CNIS: discussion with Prof. Budin

  • Late July 2005, Japan: two-day discussion among Prof. Choi, Dr. Isahara and Sun Maosong, at the invitation of Dr. Isahara (NICT)


Related Activities

  • August 25, 2005, Warsaw: ISO TC37 meeting, including a one-day meeting on the word segmentation standard

  • Oct. 2005, Jeju, Korea: discussion in two related workshops, organized by Prof. Choi and presented by Sun Maosong (Dr. Isahara also attended):

    (1) Workshop on Asian Language Resources (ALR)

    (2) SIGHAN workshop (three hours of intensive discussion)

  • Nov. 2005, Tokyo: EFTerm

  • Jan. 17, 2006, Beijing: small-scale workshop with Chinese scholars

  • Jan. 20, 2006, Jeju, Korea: ISO TC37 SC4 meeting

  • August 2006, Beijing: ......


Today’s Discussion: Focus on Part 1: General Principles and Methods


Concept system (62 core, 13 peripheral)

  • Word

    A basic grammatical unit of a language, and a relatively independent carrier of meaning, that can stand alone to make up sentences. The unit is intuitively and mentally available to native speakers. In the context of a given language, a word is codified as a lexeme in the lexicon, with at least one part of speech. A word consists of at least one morpheme.

  • Lexeme

    A basic abstract unit of the lexicon which may be realized in different word forms. A simple(r) lexeme can be part of another, complex lexeme (through the processes of derivation and compounding), and free morphemes are the simplest lexemes. In its broader sense, lexeme is also used synonymously with word.

  • Word form

    The concretely realized grammatical form of a word, or equivalently of a lexeme in the lexicon, according to its grammatical categories in the context of a sentence.

    Example (English): find, found, and finding are word forms of the lexeme FIND.




General principles and methods

4.1 Principles in applying this Standard to the text

4.1.1 Principle of full coverage

The standard should be applicable to any text that needs word segmentation.

4.1.2 Principle of consistency

The standard should be applied consistently to any text, and the output of applying it should also be consistent.
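One way to make the consistency principle operational is to scan a segmented corpus for strings that appear as a single word in one place and as several adjacent words in another. The sketch below is only an illustration under that assumption. The function name `inconsistent_units` and the toy corpus are hypothetical, and the slides do not prescribe any particular check.

```python
def inconsistent_units(corpus):
    """Find strings segmented as one word in some sentence and as a
    sequence of two or more adjacent words in another.

    corpus: list of sentences, each a list of words.
    """
    vocab = {w for sent in corpus for w in sent}
    found = set()
    for sent in corpus:
        for i in range(len(sent)):
            joined = sent[i]
            # Concatenate sent[i..j] and test it against the vocabulary.
            for j in range(i + 1, len(sent)):
                joined += sent[j]
                if joined in vocab:
                    found.add(joined)
    return found


# Toy corpus: the same string is split in one sentence, kept whole in another.
corpus = [
    ["进入", "最后", "阶段"],
    ["进入", "最后阶段"],
]
print(inconsistent_units(corpus))  # prints {'最后阶段'}
```

A hit is only a candidate for review. Some strings are legitimately one word in one context and several words in another.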



4.2 The universal principle of morphology

All languages have words and all languages have morphemes.



4.3 Principles for validating the word-hood of a linguistic unit

4.3.1 Principles from the linguistic perspective

In general, all the linguistic principles regarding word formation hold.

(1) Principle of bound morpheme.

(2) Principle of lexical integrity hypothesis.

(3) Principle of unpredictability of a word meaning from its subparts.

(4) Principle of idiomatization.

(5) Principle of collocation.

(6) Principle of unproductivity.



4.3.2 Principles from the practical (pragmatic) perspective

(1) Principle of frequency: Frequency is a basic index of the degree of lexicalization of a linguistic unit.

(2) Gestalt principle in cognitive linguistics: Things are likely to be perceived as a whole.

(3) Principle of prototype members in categories: According to the prototype theory of the mental lexicon, prototype members of a category are more salient than non-prototype members; humans remember them more accurately in short-term memory and retain and access them more easily in long-term memory.

(4) Principle of language economy: If including a linguistic unit in the lexicon decreases the difficulty of later linguistic analysis, then it is likely to be a lexical item. Example: 大中小学 ("universities, secondary schools and primary schools").



4.4 The full entry principle of lexicon

All the words which ‘exist’ are listed in the lexicon. The lexicon should be dynamic, being adapted to the changes of language usage.



4.5 Principles for word segmentation output

(1) Principle of granularity.

  • 傣族风情园令人流连忘返. ("The Dai Folk Customs Garden makes visitors linger, reluctant to leave.")

  • 傣族风情园 | 令 | 人 | 流连忘返.

  • 傣族 | 风情 | 园 | 令 | 人 | 流连忘返.

  • (((傣 族) 风情) 园) 令 人 流连忘返.

(2) Principle of the scope maximization of affixation.

(3) Principle of the scope maximization of compounding with respect to a lexicon.
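The bracketed analysis in the granularity example can be read as a tree from which coarser or finer segmentations are derived. The sketch below assumes that reading. The nested-tuple encoding and the functions `text`, `cut` and `segment` are hypothetical illustrations, not part of the proposed standard.

```python
# (((傣 族) 风情) 园) 令 人 流连忘返, encoded as nested tuples.
SENTENCE = [((("傣", "族"), "风情"), "园"), "令", "人", "流连忘返"]


def text(node):
    """Concatenate all leaves under a node."""
    return node if isinstance(node, str) else "".join(text(c) for c in node)


def cut(node, depth):
    """Open `depth` levels of brackets under a node; keep the rest whole."""
    if isinstance(node, str) or depth == 0:
        return [text(node)]
    words = []
    for child in node:
        words.extend(cut(child, depth - 1))
    return words


def segment(sentence, depth):
    """Segment a sentence at the given granularity (0 = coarsest)."""
    return [w for node in sentence for w in cut(node, depth)]


print(" | ".join(segment(SENTENCE, 0)))  # 傣族风情园 | 令 | 人 | 流连忘返
print(" | ".join(segment(SENTENCE, 2)))  # 傣族 | 风情 | 园 | 令 | 人 | 流连忘返
```

Depths 0 and 2 reproduce the two flat segmentations on the slide, so a single nested annotation can serve applications that need different granularities.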



4.6.1 General architecture for word segmentation

(1) a dictionary built from a representative corpus, with high coverage of texts and, where applicable, morphological structures for some lemmas.

(2) a word formation specification.

(3) a complete prefix/semi-prefix list

(4) a complete suffix/semi-suffix list

(5) a complete free morpheme list

(6) a complete bound morpheme list

(7) special morpheme lists that have special functions in the process of word segmentation, for example, inflectional affixes for verbs in Japanese.

(8) corpora: to support the quantitative analysis of the lexicon (but not as a part of the Standards).
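To illustrate how the dictionary component drives segmentation, here is a minimal sketch using forward maximum matching, a classic dictionary-based method. The slides do not name a segmentation algorithm, so this technique is an assumption chosen for illustration. The function `forward_max_match` and the toy lexicons are hypothetical.

```python
def forward_max_match(text, lexicon):
    """Segment text greedily, taking the longest lexicon entry at each step.

    Characters not covered by any entry are emitted as single-character words.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


sentence = "傣族风情园令人流连忘返"

# Toy lexicons (assumptions): the second one also lists the full proper name.
fine_lexicon = {"傣族", "风情", "流连忘返"}
coarse_lexicon = fine_lexicon | {"傣族风情园"}

print(forward_max_match(sentence, fine_lexicon))
# ['傣族', '风情', '园', '令', '人', '流连忘返']
print(forward_max_match(sentence, coarse_lexicon))
# ['傣族风情园', '令', '人', '流连忘返']
```

The two runs differ only in the lexicon, so the output granularity follows from what the dictionary lists. A fuller system would also consult the affix and morpheme lists in items (3) to (7).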



4.6.2 The role and makeup of the lexicon

  • The lexicon serves as the foundation and gold standard for word segmentation, so as to keep segmentation as consistent as possible.

  • Regular word forms are in general not included in the lexicon.

  • Two lexical items that are homographs should have two separate entries in the lexicon.
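The requirement that homographs keep separate entries can be sketched as a lexicon keyed by written form, where each form maps to a list of entries. The `Entry` fields and the `Lexicon` class below are hypothetical, and the example word 行 (xíng "to walk" / háng "row, line") is an illustration that does not come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str    # written form
    pos: str     # part of speech
    gloss: str   # short English gloss (hypothetical field)


class Lexicon:
    """Maps each written form to one or more distinct entries."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, entry):
        # Homographs share a form but remain separate entries.
        if entry not in self._entries[entry.form]:
            self._entries[entry.form].append(entry)

    def lookup(self, form):
        return list(self._entries.get(form, []))


lexicon = Lexicon()
lexicon.add(Entry("行", "verb", "to walk (xíng)"))
lexicon.add(Entry("行", "noun", "row, line (háng)"))

for entry in lexicon.lookup("行"):
    print(entry.pos, entry.gloss)
```

Looking up 行 returns both entries, leaving the choice between them to later analysis.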



Q&A Thanks!

