A Proposed Tag Set for Exchanging Word-Segmented Text Corpora

A Proposed Tag Set for Exchanging Word-Segmented Text Corpora. Jing-Shin Chang shin@bdc.com.tw http://www.bdc.com.tw/~shin/ Behavior Design Corporation. Technical Issues in Designing a Tag Set for a Stratified WS Standard. Why a Stratified WS Standard ?

Presentation Transcript


  1. A Proposed Tag Set for Exchanging Word-Segmented Text Corpora
     Jing-Shin Chang, Behavior Design Corporation
     shin@bdc.com.tw, http://www.bdc.com.tw/~shin/

  2. Technical Issues in Designing a Tag Set for a Stratified WS Standard
     • Why a stratified WS (word segmentation) standard?
       • "Words" are generated by various mechanisms
       • There is no common agreement on all WS criteria, because researchers and institutions model these mechanisms differently
       • Stratification helps the exchange of corpora and their conversion to the word units of a private system
     • What tags and attributes, and why?

  3. Text Generation Mechanisms Behind Word Stratification
     • Lexicon Selection
       • Basic Lexicon ("Standard Dictionary")
       • Derivational Processes (non-enumerable)
         • simple variants (color/colour; 呆子/獃子; 兇手/凶手)
         • regular expressions (numbers, word patterns)
         • regular derivational processes (proper nouns, abbreviations, compounding, …)
     • Text Planning
       • Writing Variants (symbols, punctuation)

  4. What Tags/Attributes, and Why?
     • Tags for carrying linguistic information
       • word boundary
       • level of stratification (in terms of a standard)
       • misc. (e.g., symbols and punctuation in text)
     • Tags for carrying conformance information
       • the standard/substandard conformed to
       • so as to convert to and from private systems easily
       • to allow user extension of (sub)standard(s) and overcome time-variant issues

  5. Tags for Carrying Linguistic Information
     • Tags:
       • <w0> (~信級詞): words in the standard dictionary
       • <w1> (~達級詞): morphologically derived words
       • <w2> (~雅級詞): words derived through compounding regularity
     • Attributes:
       • POS (part of speech), tt (token type, derived word type), hwds (embedded head words), rel (relationship among embedded head words)

  6. Tags for Carrying Conformance Information
     • Tags: <wstxt>, <ws0p>, <ws1p>, <ws2p>, <p> (un-segmented paragraph)
       • paragraphs of various stratification levels, each conforming to the specified standard/substandards
     • Attributes: WS, Dict, MR, NUM, NAM, CMPR, DR, GR, specifying:
       • the "standard resources" conformed to
       • user extensions
         • e.g., Dict="CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2"
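
A minimal Python sketch of how a consumer might read such a resource list. It assumes the value is comma-separated, as in the Dict example, and that an "X-" prefix marks a user extension; the slides show that prefix only in the example and do not state it as a rule. The helper name parse_resources is hypothetical.

```python
def parse_resources(value):
    """Split a comma-separated resource attribute into labelled entries."""
    entries = []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas
        entries.append({
            "name": name,
            # Assumption: an "X-" prefix marks a user extension.
            "user_extension": name.upper().startswith("X-"),
        })
    return entries


resources = parse_resources("CNS-WS-Dict-1998-1,X-BDC-WS-Dict-1998-2")
for entry in resources:
    kind = "user extension" if entry["user_extension"] else "standard"
    print(entry["name"], "->", kind)
```

A private system could use the flags to decide which resources it must load before it can verify or convert a paragraph.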

  7. Attributes on Standardized Resources: Why?
     • An official standard (and thus its tags) should be defined in terms of explicitly specified and unambiguously testable resources
       • with an (optional) mechanism for user extension
       • e.g., charset registry, RFCs (Internet standards)
     • Every resource is assigned a symbolic name (referenced in an attribute) for conformance testing
       • for conversion to, and evaluation in, private systems

  8. Attributes on Standardized Resources (Cont.)
     • WS: the WS standard, a collection of substandards (such as Dict, MR, ...)
       • e.g., CNS-WS-1998-1 = CNS-WS-{Dict, MR, …}-1998-1
     • Dict: standard dictionary (basic lexicon)
       • qualified basic words
       • POS: optional (referred to by other substandards)
     • MR: morphological rules/standard
       • qualified affixes/prefixes/suffixes
       • qualified combination patterns

  9. Attributes on Standardized Resources (Cont.) [& Arguable]
     • NUM: numbering rules/patterns
     • NAM: naming rules/patterns
       • qualified family names
       • length constraints, abbreviations, standard translations of foreign names, ...

  10. Attributes on Standardized Resources (Cont.) [& Arguable]
     • CMPR: compound formation rules/patterns
     • DR: other derivational rules not in the above substandards
     • GR: private rules/patterns/description

  11. Example: Simplest Encoding
      <!-- The whole segmented text is enclosed by the "wstxt" tag; the
           conforming standard is specified with attributes. Words are
           space-delimited and conform to the "w0", "w1" or "w2" standard
           depending on whether they are enclosed in "ws0p", "ws1p" or "ws2p" -->
      <wstxt dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3">
        <!-- use spaces as default word boundaries w/o using word tags -->
        <ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified="TRUE">
          中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .
        </ws0p>
        <ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified="TRUE">
          中文 分詞 標準 必須 一步一步 小心地 制定 .
        </ws1p>
        <ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified="TRUE">
          中文分詞標準 必須 一步一步 小心地 制定 .
        </ws2p>
      </wstxt>
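
The space-delimited encoding can be read with any XML parser. The Python sketch below is one possible reader, not part of the proposal: it keeps only one attribute on the root for brevity, quotes every attribute value so that a standard XML parser accepts the sample, and uses a hypothetical helper named read_paragraphs.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the slide's example; attribute values are quoted
# here (an assumption) so that the text is well-formed XML.
SAMPLE = """<wstxt dict="CNS-WS-DICT-1998-1">
<ws0p verified="TRUE">中文 分詞 標準 必須 一 步 一 步 小心 地 制定 .</ws0p>
<ws1p verified="TRUE">中文 分詞 標準 必須 一步一步 小心地 制定 .</ws1p>
<ws2p verified="TRUE">中文分詞標準 必須 一步一步 小心地 制定 .</ws2p>
</wstxt>"""

# Paragraph tag -> stratification level of the words it contains.
LEVEL_OF_TAG = {"ws0p": 0, "ws1p": 1, "ws2p": 2}


def read_paragraphs(xml_text):
    """Return (level, words) pairs for every segmented paragraph."""
    root = ET.fromstring(xml_text)
    result = []
    for para in root:
        level = LEVEL_OF_TAG.get(para.tag)
        if level is None:
            continue  # e.g. an un-segmented <p> paragraph
        words = (para.text or "").split()  # spaces are the word boundaries
        result.append((level, words))
    return result


paragraphs = read_paragraphs(SAMPLE)
for level, words in paragraphs:
    print(level, len(words), words)
```

All three paragraphs carry the same characters; only the word boundaries differ, which is what lets a consumer pick the level closest to its own word units.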

  12. Example: Using Word Tags
      <!-- use word tags to identify word boundaries -->
      <ws0p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" verified="TRUE">
        <w0>中文</w0><w0>分詞</w0><w0>標準</w0><w0>必須</w0>
        <w0 pos="quan">一</w0><w0>步</w0><w0>一</w0><w0>步</w0>
        <w0>小心</w0><w0>地</w0><w0>制定</w0><w0>.</w0>
      </ws0p>
      <ws1p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" verified="TRUE">
        <w0>中文</w0> <w0>分詞</w0> <w0>標準</w0> <w0>必須</w0>
        <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
        <w1><w0>小心</w0><w0>地</w0></w1> <w0>制定</w0> <w0>.</w0>
      </ws1p>
      <ws2p dict="CNS-WS-DICT-1998-1" NUM="CNS-WS-NUM-1998-1" MR="CNS-WS-MR-1998-2.3" CMPR="CNS-WS-CMPR-COMPOUND-DICT-1998-1.2" verified="TRUE">
        <w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>
        <w0>必須</w0>
        <w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>
        <w1><w0>小心</w0><w0>地</w0></w1>
        <w0>制定</w0><w0>.</w0>
      </ws2p>
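
Because higher-level words nest their lower-level parts, a single fully tagged paragraph can be projected onto any coarser or finer segmentation. The Python sketch below illustrates that idea under stated assumptions: attribute values are quoted so the sample is well-formed XML, conformance attributes are omitted for brevity, and segment is a hypothetical helper, not something the slides define.

```python
import xml.etree.ElementTree as ET

# The slide's <ws2p> paragraph, without conformance attributes.
SAMPLE = (
    '<ws2p verified="TRUE">'
    "<w2><w0>中文</w0><w0>分詞</w0><w0>標準</w0></w2>"
    "<w0>必須</w0>"
    "<w1><w0>一</w0><w0>步</w0><w0>一</w0><w0>步</w0></w1>"
    "<w1><w0>小心</w0><w0>地</w0></w1>"
    "<w0>制定</w0><w0>.</w0>"
    "</ws2p>"
)

WORD_LEVEL = {"w0": 0, "w1": 1, "w2": 2}


def segment(elem, max_level):
    """Return the words under elem, keeping units up to max_level whole."""
    words = []
    for child in elem:
        level = WORD_LEVEL.get(child.tag)
        if level is None:
            continue  # not a word tag
        if level <= max_level or len(child) == 0:
            # Keep this unit as one word: join the text of all its parts.
            words.append("".join(child.itertext()))
        else:
            # Unit is above the requested level: descend into its parts.
            words.extend(segment(child, max_level))
    return words


root = ET.fromstring(SAMPLE)
for max_level in (0, 1, 2):
    print(max_level, segment(root, max_level))
```

Projecting this paragraph to levels 0, 1 and 2 reproduces the word lists of the three space-delimited paragraphs in the simplest-encoding example.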

  13. Example: Derived Words and Token Type (tt) Attribute
      <ws1p MR="CNS-WS-MR-1998-2.3" DR="CNS-WS-DR-1988.1.2">
        <!-- examples of derived w1 words (from "w0" words) -->
        <w1 tt="(common_noun,suffix)" MR="CNS-WS-MR-1998-2.3">
          <w0>孩子</w0><w0>們</w0>
        </w1>
        <w1>
          <w0 pos="quan">一萬</w0>
          <w0>朵</w0>
          <w0 pos="quan">一萬</w0>
          <w0>朵</w0>
        </w1><w0>地</w0><w0>送</w0>
      </ws1p>

  14. Example: Application of (hwds, rel) Attributes for Punctuation
      <!-- examples of tagging punctuation-enclosed/delimited words -->
      <w1 hwds="高中,高職" rel="AND_OR">
        <w0>高中</w0><w0>(</w0><w0>職</w0><w0>)</w0>
      </w1>
      <!-- words with the same (hwds,rel) could be normalized to the same
           internal representation of a private system -->
      <w1 hwds="中山南路,中山北路" rel="AND">
        <w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0>
      </w1>
      <w0>與</w0>
      <w1 hwds="中山南路,中山北路" rel="AND">
        <w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0>
      </w1>
      <w0>意義</w0><w0>相同</w0><w1>...</w1><w1>...</w1>
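
The normalization mentioned in the comment can be sketched in Python as follows. This is an illustration only: the wrapper element sample and the helper normalize are hypothetical, attribute values are quoted so the fragment parses as XML, and the choice of a (head words, relation) tuple as the internal representation is an assumption about what a private system might use.

```python
import xml.etree.ElementTree as ET

# Two surface forms from the slide, wrapped in a hypothetical <sample>
# element so the fragment has a single root.
SAMPLE = (
    "<sample>"
    '<w1 hwds="中山南路,中山北路" rel="AND">'
    "<w0>中山</w0><w0>南</w0><w0>、</w0><w0>北</w0><w0>路</w0></w1>"
    "<w0>與</w0>"
    '<w1 hwds="中山南路,中山北路" rel="AND">'
    "<w0>中山</w0><w0>南</w0><w0>(</w0><w0>北</w0><w0>)</w0><w0>路</w0></w1>"
    "</sample>"
)


def normalize(word):
    """Map a derived word to a (head words, relation) key, or None."""
    hwds = word.get("hwds")
    if hwds is None:
        return None  # no embedded head words recorded
    return (tuple(hwds.split(",")), word.get("rel"))


root = ET.fromstring(SAMPLE)
derived = root.findall("w1")
keys = [normalize(w) for w in derived]
surfaces = ["".join(w.itertext()) for w in derived]
for surface, key in zip(surfaces, keys):
    print(surface, "->", key)
```

The two words differ on the surface (one uses an enumeration comma, the other parentheses) but yield the same key, so a private system could treat them as the same unit.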

  15. Future Issues
     • Specification of the Official WS Standard
       • standard resources and substandards to be defined in the first official version
     • Construction of Basic Lexicon
       • basic vs. derivational words
       • standardization of the derivational parts
     • Registration of User Extension & Evolution of the Official Standard
