A Proposed Tag Set for Exchanging Word-Segmented Text Corpora
Jing-Shin Chang
shin@bdc.com.tw
http://www.bdc.com.tw/~shin/
Behavior Design Corporation
Technical Issues in Designing a Tag Set for a Stratified WS Standard
• Why a stratified word segmentation (WS) standard?
  • "Words" are generated by various mechanisms
  • There is no common agreement on all WS criteria, because researchers and institutions use different processing models for these mechanisms
  • Stratification helps the exchange of corpora and their conversion to appropriate private word units
• What Tags and Attributes, and Why?
Text Generation Mechanisms Behind Word Stratification
• Lexicon Selection
  • Basic Lexicon ("Standard Dictionary")
  • Derivational Processes (non-enumerable)
    • simple variants (color/colour; 呆子/獃子; 兇手/凶手)
    • regular expressions (numbers, word patterns)
    • regular derivational processes (proper nouns, abbreviations, compounding, …)
• Text Planning
  • Writing Variants (symbols, punctuation)
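The "regular expressions (numbers, word patterns)" item can be illustrated with a small Python sketch. The pattern and the helper name `is_number_token` are assumptions made for illustration only; the slides do not specify the actual numbering rules, and this pattern over-generates (it accepts any run of numeral characters).

```python
import re

# Hypothetical pattern, not taken from any standard: a run of Chinese numeral
# characters, or an Arabic-digit number with an optional decimal part.
NUMBER = re.compile(r"[零〇一二三四五六七八九十百千萬億兩]+|[0-9]+(?:\.[0-9]+)?")


def is_number_token(token):
    """Return True if the whole token matches the illustrative number pattern."""
    return NUMBER.fullmatch(token) is not None


for token in ["一萬", "1998", "孩子"]:
    print(token, is_number_token(token))
```

Because such words are produced by a pattern rather than listed in a dictionary, they cannot be enumerated, which is why the slide places them among the derivational processes.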
What Tags/Attributes, and Why?
• Tags for carrying linguistic information
  • word boundary
  • level of stratification (in terms of a standard)
  • misc. (e.g., symbols and punctuation in text)
• Tags for carrying conformance information
  • standard/substandard of conformance
    • so as to convert to and from private systems easily
    • to allow user extension of (sub)standard(s) and overcome time-variance issues
Tags for carrying linguistic information
• Tags:
  • <w0> (~信級詞, "faithfulness-level" words): words in the standard dictionary
  • <w1> (~達級詞, "expressiveness-level" words): morphologically derived words
  • <w2> (~雅級詞, "elegance-level" words): words derived through compounding regularity
• Attributes:
  • POS (part of speech)
  • tt (token type, i.e., derived word type)
  • hwds (embedded head words)
  • rel (relationship among embedded head words)
Tags for carrying conformance information
• Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (unsegmented paragraph)
  • paragraphs at various stratification levels, conforming to the specified standard/substandards
• Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
  • the "standard resources" conformed to
  • user extensions
    • e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"
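A resource attribute such as Dict can list several names, as in the example above. The following Python sketch separates official resources from user extensions. It assumes that an "X-" prefix marks a user extension, which is only inferred from the name X-BDC-WS-Dict-1998-2; the slides do not state this rule, and `split_resources` is a hypothetical helper.

```python
def split_resources(attr_value):
    """Split a comma-separated resource attribute into (official, extensions).

    Assumption: names starting with "X-" are user extensions.
    """
    names = [name.strip() for name in attr_value.split(",") if name.strip()]
    official = [name for name in names if not name.startswith("X-")]
    extensions = [name for name in names if name.startswith("X-")]
    return official, extensions


official, extensions = split_resources("CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2")
print("official:", official)
print("user extensions:", extensions)
```

A receiving system could use such a split to decide which parts of a corpus it can check against resources it already has.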
Attributes on Standardized Resources: Why?
• The official standard (and thus its tags) should be defined in terms of explicitly specified and unambiguously testable resources
  • with an (optional) mechanism for user extension
  • e.g., charset registries, RFCs (Internet standards)
• Every resource is assigned a symbolic name (referenced in an attribute)
  • for conformance testing
  • for conversion to, and evaluation in, private systems
Attributes on Standardized Resources (Cont.)
• WS: the WS standard, a collection of substandards (such as Dict, MR, ...)
  • e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
• Dict: standard dictionary (basic lexicon)
  • qualified basic words
  • POS: optional (referred to by other substandards)
• MR: morphological rules/standard
  • qualified affixes/prefixes/suffixes
  • qualified combination patterns
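The naming example CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1 can be read as a simple expansion rule. The Python sketch below follows that reading. The helper `expand_ws` and its default list of substandards are assumptions; the slide elides the full list, and the later examples show substandards carrying their own version numbers, so a shared version is only one possible convention.

```python
def expand_ws(ws_name, substandards=("Dict", "MR")):
    """Expand a WS standard name into per-substandard resource names.

    Assumption: the name has the form <prefix>-WS-<version>, and every
    substandard shares that version, as in the slide's own example.
    """
    prefix, sep, version = ws_name.partition("-WS-")
    if not sep or not prefix or not version:
        raise ValueError(f"not a WS standard name: {ws_name!r}")
    return {sub: f"{prefix}-WS-{sub}-{version}" for sub in substandards}


for attr, resource in expand_ws("CNS-WS-1998-1").items():
    print(attr, "=", resource)
```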
Attributes on Standardized Resources (Cont.) [& Arguable]
• NUM: numbering rules/patterns
• NAM: naming rules/patterns
  • qualified family names
  • length constraints, abbreviations, standard translations of foreign names, ...
Attributes on Standardized Resources (Cont.) [& Arguable]
• CMPR: compound formation rules/patterns
• DR: other derivational rules not in the above substandards
• GR: private rules/patterns/description
Example: Simplest Encoding
<!-- The whole segmented text is enclosed by the "wstxt" tag; the standard conformed to is specified with attributes. Words are space-delimited and conform to the "w0", "w1" or "w2" standard depending on whether they are enclosed in "ws0p", "ws1p" or "ws2p". -->
<wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
  <!-- use spaces as default word boundaries, without word tags -->
  <!-- sample sentence: "The Chinese word segmentation standard must be drawn up carefully, step by step." -->
  <ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified="TRUE">
    中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .
  </ws0p>
  <ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified="TRUE">
    中文 分詞 標準 必須 一步一步 小心地 制定 .
  </ws1p>
  <ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified="TRUE">
    中文分詞標準 必須 一步一步 小心地 制定 .
  </ws2p>
</wstxt>
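A reader for the space-delimited encoding can be sketched in Python with the standard library's xml.etree.ElementTree. This assumes the corpus is well-formed XML with quoted attribute values; `read_paragraphs` is a hypothetical helper, and the sample is abbreviated from the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<wstxt dict="CNS-WS-DICT-1998-1">
<ws0p verified="TRUE">中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .</ws0p>
<ws1p verified="TRUE">中文 分詞 標準 必須 一步一步 小心地 制定 .</ws1p>
<ws2p verified="TRUE">中文分詞標準 必須 一步一步 小心地 制定 .</ws2p>
</wstxt>"""


def read_paragraphs(xml_text):
    """Map each paragraph tag (ws0p/ws1p/ws2p) to a list of word lists."""
    root = ET.fromstring(xml_text)
    paragraphs = {}
    for para in root:
        if para.tag in ("ws0p", "ws1p", "ws2p"):
            # Words are separated by whitespace in this encoding.
            words = (para.text or "").split()
            paragraphs.setdefault(para.tag, []).append(words)
    return paragraphs


for tag, word_lists in read_paragraphs(SAMPLE).items():
    for words in word_lists:
        print(tag, len(words), words)
```

The same sentence yields fewer, longer units as the stratification level rises, which is the point of providing all three paragraph types.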
Example: Using Word Tags
<!-- use word tags to identify word boundaries -->
<ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified="TRUE">
  <w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>
  <w0 pos="quan">一</w0><w0>步</w0><w0>一</w0><w0>步</w0>
  <w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
</ws0p>
<ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified="TRUE">
  <w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>
  <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
  <w1><w0>小心</w0><w0>地</w0></w1> <w0>制定</w0> <w0>.</w0>
</ws1p>
<ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified="TRUE">
  <w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
  <w0>必須</w0>
  <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
  <w1><w0>小心</w0><w0>地</w0></w1>
  <w0>制定</w0><w0>.</w0>
</ws2p>
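Because higher-level words nest their lower-level parts, a paragraph tagged at the w2 level can be projected down to any coarser or finer segmentation. The Python sketch below illustrates this conversion. The function `segment` is a hypothetical helper, not part of the proposal, and it assumes well-formed XML input.

```python
import xml.etree.ElementTree as ET

LEVEL = {"w0": 0, "w1": 1, "w2": 2}

SAMPLE = """<ws2p>
<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
<w0>必須</w0>
<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
<w1><w0>小心</w0><w0>地</w0></w1>
<w0>制定</w0><w0>.</w0>
</ws2p>"""


def segment(elem, max_level):
    """List the words under elem, keeping units up to max_level whole."""
    words = []
    for child in elem:
        if child.tag not in LEVEL:
            continue  # ignore anything that is not a word tag
        if LEVEL[child.tag] <= max_level:
            # Join all nested text and drop layout whitespace.
            words.append("".join("".join(child.itertext()).split()))
        else:
            # Unit is above the requested level: descend into its parts.
            words.extend(segment(child, max_level))
    return words


paragraph = ET.fromstring(SAMPLE)
for level in (0, 1, 2):
    print(level, segment(paragraph, level))
```

A private system that only accepts dictionary words would call `segment(paragraph, 0)`, while one that treats compounds as single units would use level 2.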
Example: Derived Words and Token Type (tt) Attribute
<ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
  <!-- examples of derived w1 words (from "w0" words) -->
  <w1 tt="(common_noun,suffix)" MR="CNS-WS-MR-1998-2.3">
    <w0>孩子</w0><w0>們</w0>
  </w1>
  <w1>
    <w0 pos="quan">一萬</w0>
    <w0>朵</w0>
    <w0 pos="quan">一萬</w0>
    <w0>朵</w0>
  </w1><w0>地</w0><w0>送</w0>
</ws1p>
Example: Application of (hwds, rel) Attributes for Punctuation
<!-- examples of tagging words enclosed/delimited by punctuation -->
<w1 hwds="高中,高職" rel="AND_OR">
  <w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
</w1>
<!-- words with the same (hwds, rel) could be normalized to the same internal representation in a private system -->
<w1 hwds="中山南路,中山北路" rel="AND">
  <w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
</w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel="AND">
  <w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
</w1>
<w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>
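The normalization idea can be sketched in Python: two surface forms that differ only in punctuation map to the same (rel, hwds) key. The helpers `surface` and `normal_form`, the tuple representation, and the enclosing ws1p wrapper are assumptions for illustration; a private system may use any internal form it likes.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ws1p>
<w1 hwds="中山南路,中山北路" rel="AND"><w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0></w1>
<w0>與</w0>
<w1 hwds="中山南路,中山北路" rel="AND"><w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0></w1>
</ws1p>"""


def surface(word):
    """Return the text of a word element as written, without layout whitespace."""
    return "".join("".join(word.itertext()).split())


def normal_form(word):
    """Return (rel, head words); fall back to the surface text if no hwds."""
    hwds = word.get("hwds")
    if hwds is None:
        return (None, (surface(word),))
    heads = tuple(head.strip() for head in hwds.split(","))
    return (word.get("rel"), heads)


first, conjunction, second = list(ET.fromstring(SAMPLE))
print(surface(first), surface(second))
print(normal_form(first) == normal_form(second))
```

Here the two w1 elements have different surface strings but identical normal forms, so a downstream system can treat them as the same expression.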
Future Issues
• Specification of the Official WS Standard
  • standard resources and substandards to be defined in the first official version
• Construction of the Basic Lexicon
  • basic vs. derivational words
  • standardization of the derivational parts
• Registration of User Extensions & Evolution of the Official Standard