A Proposed Tag Set for Exchanging Word-Segmented Text Corpora

Jing-Shin Chang
shin@bdc.com.tw
http://www.bdc.com.tw/~shin/
Behavior Design Corporation

Technical Issues in Designing a Tag Set for a Stratified WS Standard
• Why a stratified word segmentation (WS) standard?
  • “Words” are generated by various mechanisms
  • There is no common agreement on all WS criteria, because researchers and institutions use different processing models for these mechanisms
  • Stratification helps the exchange of corpora and their conversion to appropriate private word units
• What tags and attributes, and why?

Text Generation Mechanisms Behind Word Stratification
• Lexicon Selection
  • Basic Lexicon (“Standard Dictionary”)
  • Derivational Processes (non-enumerable)
    • simple variants (color/colour; 呆子/獃子; 兇手/凶手)
    • regular expressions (numbers, word patterns)
    • regular derivational processes (proper nouns, abbreviations, compounding, …)
• Text Planning
  • Writing Variants (symbols, punctuation)

What Tags/Attributes, and Why?
• Tags for carrying linguistic information
  • word boundary
  • level of stratification (in terms of a standard)
  • misc. (e.g., symbols and punctuation in text)
• Tags for carrying conformance information
  • the standard/substandards that the text conforms to
  • so that text can be converted to and from private systems easily
  • to allow user extension of (sub)standard(s) and to overcome time-variant issues

Tags for carrying linguistic information
• Tags:
  • <w0> (~信級詞, “faithfulness”-level words): words in the standard dictionary
  • <w1> (~達級詞, “expressiveness”-level words): morphologically derived words
  • <w2> (~雅級詞, “elegance”-level words): words derived through compounding regularity
• Attributes:
  • POS (part of speech)
  • tt (token type, derived word type)
  • hwds (embedded head words)
  • rel (relationship among embedded head words)

Tags for carrying conformance information
• Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (un-segmented paragraph)
  • paragraphs at various stratification levels, conforming to the specified standard/substandards
• Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
  • the “standard resources” conformed to
  • user extensions
    • e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"
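As a rough illustration of how a consumer of such a corpus might read a resource attribute, the following sketch splits a comma-separated value into official resources and user extensions. The function name `split_resources` is hypothetical, and treating an "X-" prefix as the marker of a user extension is an assumption inferred from the single example above, not something the proposal states.

```python
def split_resources(attr_value):
    """Split a comma-separated resource attribute into two lists.

    Assumption: names starting with "X-" are user extensions, as suggested
    by the example value "X-BDC-WS-Dict-1998-2"; everything else is treated
    as an official resource name.
    """
    names = [n.strip() for n in attr_value.split(",") if n.strip()]
    official = [n for n in names if not n.startswith("X-")]
    extensions = [n for n in names if n.startswith("X-")]
    return official, extensions


official, extensions = split_resources("CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2")
print("official:", official)
print("user extensions:", extensions)
```

A private system could use the two lists to decide which dictionaries it already supports and which extension resources it still has to obtain before converting the text.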

Attributes on Standardized Resources: Why?
• An official standard (and thus its tags) should be defined in terms of explicitly specified and unambiguously testable resources
  • with an (optional) mechanism for user extension
  • e.g., charset registry, RFCs (Internet standards)
• Every resource is assigned a symbolic name (referenced in an attribute)
  • for conformance testing
  • for conversion to, and evaluation in, private systems

Attributes on Standardized Resources (Cont.)
• WS: the WS standard, a collection of substandards (such as Dict, MR, ...)
  • e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
• Dict: standard dictionary (basic lexicon)
  • qualified basic words
  • POS: optional (referred to by other substandards)
• MR: morphological rules/standard
  • qualified affixes/prefixes/suffixes
  • qualified combination patterns
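The shorthand CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1 can be read as a name expansion. The sketch below shows one possible reading, assuming that every substandard inherits the version suffix of the WS name; the function `expand_ws` is hypothetical, and the list of substandards is passed in by the caller because the slide leaves it elided.

```python
def expand_ws(ws_name, substandards):
    """Expand a WS standard name into one symbolic name per substandard.

    Assumption: the name has the shape "<owner>-WS-<version>" and each
    substandard name is "<owner>-WS-<substandard>-<version>".
    """
    owner, sep, version = ws_name.partition("-WS-")
    if not sep or not version:
        raise ValueError(f"not a WS standard name: {ws_name!r}")
    return {s: f"{owner}-WS-{s}-{version}" for s in substandards}


# The substandard list here is only the part named explicitly on the slide.
print(expand_ws("CNS-WS-1998-1", ["Dict", "MR"]))
```

Note that the encoding examples in this deck give individual substandards their own version numbers, so a real corpus header may override any name produced this way.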

Attributes on Standardized Resources (Cont.) [& Arguable]
• NUM: numbering rules/patterns
• NAM: naming rules/patterns
  • qualified family names
  • length constraints, abbreviations, standard translations of foreign names, ...

Attributes on Standardized Resources (Cont.) [& Arguable]
• CMPR: compound formation rules/patterns
• DR: other derivational rules not in the above substandards
• GR: private rules/patterns/description

Example: Simplest Encoding
<wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
中文分詞標準必須一步一步小心地制定 .
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
中文分詞標準必須一步一步小心地制定 .
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
中文分詞標準必須一步一步小心地制定 .
</ws2p>
</wstxt>

Example: Using Word Tags
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified=TRUE>
<w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>
<w0 pos=quan>一</w0><w0>步</w0><w0>一</w0><w0>步</w0>
<w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified=TRUE>
<w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>
<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
<w1><w0>小心</w0><w0>地</w0></w1> <w0>制定</w0> <w0>.</w0>
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified=TRUE>
<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
<w0>必須</w0>
<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
<w1><w0>小心</w0><w0>地</w0></w1>
<w0>制定</w0><w0>.</w0>
</ws2p>
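The nesting of <w0> inside <w1> and <w2> is what lets a recipient convert the text to private word units at whatever level it prefers. The sketch below is one way this could be done, assuming the paragraph has first been normalized to well-formed XML (the slide's unquoted attribute values such as verified=TRUE would need quoting before a standard XML parser accepts them). The function name `segment` and the sample string are illustrative, not part of the proposal.

```python
import xml.etree.ElementTree as ET


def segment(paragraph_xml, level):
    """Return the word units of one paragraph at stratification level 0, 1 or 2.

    An element <wN> is kept as a single unit when N <= level; otherwise it is
    split into its embedded words. Elements with no children are always units.
    """
    root = ET.fromstring(paragraph_xml)
    units = []

    def visit(elem):
        tag_level = int(elem.tag[1:])  # "w0" -> 0, "w1" -> 1, "w2" -> 2
        if tag_level <= level or len(elem) == 0:
            # Join the text of all embedded words, ignoring layout whitespace.
            units.append("".join(t.strip() for t in elem.itertext()))
        else:
            for child in elem:
                visit(child)

    for word in root:
        visit(word)
    return units


# The <ws2p> paragraph from the example, with attributes omitted for brevity.
sample = (
    "<ws2p>"
    "<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>"
    "<w0>必須</w0>"
    "<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>"
    "<w1><w0>小心</w0><w0>地</w0></w1>"
    "<w0>制定</w0><w0>.</w0>"
    "</ws2p>"
)

for lvl in (0, 1, 2):
    print(lvl, " ".join(segment(sample, lvl)))
```

Because the most stratified paragraph still carries the lower-level boundaries, a single <ws2p> encoding is enough to recover the segmentations shown in the <ws0p> and <ws1p> paragraphs.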

Example: Derived Words and Token Type (tt) Attribute
<ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
<w1 tt=(common_noun,suffix) MR="CNS-WS-MR-1998-2.3">
<w0>孩子</w0><w0>們</w0>
</w1>
<w1>
<w0 pos=quan>一萬</w0>
<w0>朵</w0>
<w0 pos=quan>一萬</w0>
<w0>朵</w0>
</w1><w0>地</w0><w0>送</w0>
</ws1p>

Example: Application of (hwds, rel) Attributes for Punctuation
<w1 hwds="高中,高職" rel=AND_OR>
<w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
</w1>

<w1 hwds="中山南路,中山北路" rel=AND>
<w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
</w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel=AND>
<w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
</w1>
<w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>
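To show how the hwds and rel attributes might be consumed, the following sketch recovers the surface string, the embedded head words, and their relationship from a single <w1> element. It assumes the element has been normalized to well-formed XML with quoted attribute values (rel="AND_OR" rather than rel=AND_OR); the function name `head_words` is hypothetical.

```python
import xml.etree.ElementTree as ET


def head_words(word_xml):
    """Return (surface string, list of head words, relation) for one word element."""
    elem = ET.fromstring(word_xml)
    # The surface form is the concatenated text of the embedded <w0> units.
    surface = "".join(t.strip() for t in elem.itertext())
    # hwds is a comma-separated list; a missing attribute yields an empty list.
    hwds = [h for h in elem.get("hwds", "").split(",") if h]
    return surface, hwds, elem.get("rel")


word = (
    '<w1 hwds="高中,高職" rel="AND_OR">'
    "<w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>"
    "</w1>"
)
surface, hwds, rel = head_words(word)
print(surface, hwds, rel)
```

An indexing or retrieval system could, for instance, index the text under every head word in the list instead of under the abbreviated surface form that contains the punctuation.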

Future Issues
• Specification of the Official WS Standard
  • standard resources and substandards to be defined in the first official version
• Construction of the Basic Lexicon
  • basic vs. derivational words
  • standardization of the derivational parts
• Registration of User Extensions and Evolution of the Official Standard
