
Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually-Segmented Corpora



  1. Word Frequency Approximation for Chinese Using Raw, MM-Segmented and Manually-Segmented Corpora Wei Qiao and Maosong Sun Department of Computer Science and Technology Tsinghua University

  2. Outline • Introduction • Motivations • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work

  3. Introduction: Motivations (1)
• Background: word frequency plays important roles in NLP applications, e.g. term frequency (TF) in information retrieval, word segmentation with statistical methods, and teaching Chinese as a second language
• The key point of the research: word frequency approximation is easy for English but hard for Chinese
• Chinese word frequency approximation needs a correctly manually-segmented Chinese corpus, but "What is a word in Chinese?" leads to inconsistency phenomena
• By Zipf's Law, a corpus of several hundred million characters is needed
• Where are we? The resources we can use:

  4. Introduction: Motivations (2) The resources we can use and their properties:
• Raw corpus, counting strings of characters: consistent, but the counts are much higher than the actual word frequencies
• MM segmenter producing a segmented corpus: precision about 90%, better consistency
• Complex segmenter producing a segmented corpus: precision about 95%, weak inconsistency
• Perfect segmenter producing manually segmented corpora: unrealistic
• Manually segmented corpora, counting word frequency directly: suffer from inconsistency between corpora
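The MM segmenter referred to here is the classic maximum matching algorithm. As a reference, a minimal Python sketch of forward and backward maximum matching follows; the toy dictionary, the maximum word length and the function names are illustrative assumptions, not the segmenter actually used by the authors.

```python
# Minimal sketch of forward / backward maximum matching (MM) segmentation.
# The toy dictionary and MAX_LEN are illustrative assumptions only.

DICTIONARY = {"清华", "大学", "清华大学", "词频", "估计"}
MAX_LEN = 4  # longest dictionary entry considered


def forward_mm(text, dictionary=DICTIONARY, max_len=MAX_LEN):
    """Greedy left-to-right segmentation: always take the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


def backward_mm(text, dictionary=DICTIONARY, max_len=MAX_LEN):
    """Greedy right-to-left segmentation: the mirror image of forward MM."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_len, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                j -= length
                break
    return list(reversed(words))


if __name__ == "__main__":
    print(forward_mm("清华大学词频估计"))   # ['清华大学', '词频', '估计']
    print(backward_mm("清华大学词频估计"))  # ['清华大学', '词频', '估计']
```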

  5. Where are we? • Introduction • The New Approximation Methods • Architecture • Data Set • Experiments and Result Analysis • Conclusion and Future Work

  6. Architecture (1)
• Manually segmented corpora: word frequencies are obtained by simply adding the counts from the corpora up
• Raw corpus and MM-segmented corpus: an MM approximation result is computed following Sun and Zhang (2006), as sketched below:
  word length 1-4: the average of the forward and backward MM counts
  word length 5: the backward MM count
  word length 6 and above: the count of the word as a string of characters in the raw corpus
• The two estimates are then merged by the combine method
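To make the Sun and Zhang (2006) length rule on this slide concrete, here is a hedged sketch that selects the estimate by word length; the count containers (forward_counts, backward_counts, raw_string_count) are assumed input structures, not part of the original paper.

```python
# Sketch of the length-dependent MM approximation described on this slide
# (after Sun and Zhang, 2006). forward_counts / backward_counts map words to
# their counts in the forward- and backward-MM-segmented corpus; raw_string_count
# counts a word as a plain character string in the raw corpus. All three
# inputs are assumptions about how the counts are stored.

def mm_approximation(word, forward_counts, backward_counts, raw_string_count):
    n = len(word)
    if n <= 4:
        # lengths 1-4: average of the forward and backward MM counts
        return (forward_counts.get(word, 0) + backward_counts.get(word, 0)) / 2.0
    if n == 5:
        # length 5: backward MM count only
        return float(backward_counts.get(word, 0))
    # length 6 and above: fall back to the raw character-string count
    return float(raw_string_count(word))
```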

  7. Architecture (2)
• Word length effect: the shorter the word, the better the estimate from the manually segmented corpora; quality descends over word lengths 1, 2, 3, 4+
• The combination factor starts from an initial value and is adjusted through experiments
• The factor balances corpus size against the word length effect (see the sketch below)
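The slides do not give an explicit formula for the combine method, so the sketch below is only one plausible reading: sum the counts from the manually segmented corpora and add the MM approximation scaled by a tunable factor, initialised at 1:1 and set to 1:3.9 after the experiments on slide 14. The function name and the exact combination form are assumptions.

```python
# Hedged sketch of the combine step: the slide only says that the manually
# segmented counts are "simply added up" and then combined with the MM
# approximation under a tunable factor (initial 1:1, 1:3.9 after tuning).
# The exact combination formula is an assumption for illustration.

def combined_frequency(word, manual_counts_list, mm_estimate, weight=3.9):
    # Sum the counts over all manually segmented corpora.
    manual_total = sum(counts.get(word, 0) for counts in manual_counts_list)
    # Weighted combination; `weight` reflects the 1:weight factor on slide 14.
    return manual_total + weight * mm_estimate
```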

  8. Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work

  9. Data Set • Data set for word frequency approximation • Manually segmented corpora: the Tsinghua and Peking Language University corpus (HUAYU) and the Peking University corpus (BEIDA) • Raw corpus (RC): 447,079,112 characters

  10. Data Set • Standard corpus (gold standard): the Institute of National Applied Linguistics corpus, denoted as YUWEI (25,000,309 words, 51,311,659 characters) • Distribution: 124 different fields, covering the years 1920 to 2001; large and relatively balanced • Select the words with frequency above 3 from YUWEI: 99,660 words in total • Obtain a word rank sequence, YWL, by sorting these words by frequency in descending order (a sketch follows below) • Distribution of YWL according to word length:
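A small sketch of how the gold-standard rank sequence YWL described above could be built from YUWEI frequency counts; the input dictionary yuwei_counts and the tie-breaking rule are assumptions.

```python
# Sketch of building the gold-standard rank list YWL from YUWEI counts:
# keep words with frequency above 3 and sort by frequency, descending.
# `yuwei_counts` (word -> frequency) is an assumed input structure.

def build_ywl(yuwei_counts, min_freq=3):
    kept = {w: f for w, f in yuwei_counts.items() if f > min_freq}
    # Rank 1 = most frequent word; ties broken alphabetically for determinism.
    return sorted(kept, key=lambda w: (-kept[w], w))
```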

  11. Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Experiments design • Results and analysis • Conclusion and Future Work

  12. Experiments design [Figure: each of the 99,660 words w1 ... wn of the gold-standard order YWL, with ranks r1 ... rn, is looked up in the rank lists produced by the different approximation methods; the resulting rank lists RM, RR and RZ are then compared against the gold ranks RYW.]

  13. Results and Analysis (1) [Figure: two rank lists over the same words w1 ... wn, the gold ranks r1 ... rn and an approximated ranking, are paired word by word.] • Spearman Rank Correlation Coefficient (SRCC) is used as the evaluation measure
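For reference, a minimal implementation of the Spearman rank correlation coefficient between the gold ranking and an approximated ranking. It assumes both lists order the same word set without ties, so the closed form rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)) applies; this is the standard formula, not code from the paper.

```python
# Spearman rank correlation coefficient between two rank lists over the same
# words. Assumes each list is an ordering of the same word set with no ties,
# so the closed form rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) applies.

def spearman_rcc(gold_ranking, approx_ranking):
    gold_rank = {w: i for i, w in enumerate(gold_ranking, start=1)}
    approx_rank = {w: i for i, w in enumerate(approx_ranking, start=1)}
    n = len(gold_ranking)
    d2 = sum((gold_rank[w] - approx_rank[w]) ** 2 for w in gold_ranking)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Example: identical orderings give rho = 1.0
assert spearman_rcc(["a", "b", "c"], ["a", "b", "c"]) == 1.0
```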

  14. Results and Analysis (2) • Adjusting the factor: initial value 1:1, value after experiments 1:3.9
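One plausible way to arrive at a value such as 1:3.9 is a simple grid search over candidate factors, keeping the one whose resulting rank list correlates best with YWL. The loop below is an assumption about the tuning procedure, not the authors' exact method; build_rank_list is a hypothetical helper that rebuilds the approximated ranking for a given weight.

```python
# Hedged sketch of tuning the combination factor: try candidate weights and
# keep the one whose resulting rank list correlates best (SRCC) with YWL.
# `build_rank_list` and `spearman_rcc` are assumed helpers from earlier sketches.

def tune_weight(candidates, ywl, build_rank_list, spearman_rcc):
    best_weight, best_rho = None, float("-inf")
    for w in candidates:
        rho = spearman_rcc(ywl, build_rank_list(weight=w))
        if rho > best_rho:
            best_weight, best_rho = w, rho
    return best_weight, best_rho


# e.g. tune_weight([x / 10 for x in range(10, 60)], ywl, build_rank_list, spearman_rcc)
```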

  15. Results and Analysis (3) • Overcast rate evaluation on YUWEI • Top 50,000 words
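The "overcast rate" is not defined on the slide; a reasonable reading, sketched below as an assumption, is a top-N coverage measure: the proportion of the top N gold-standard words that also appear among the top N words of an approximated rank list.

```python
# Sketch of a top-N coverage ("overcast") rate: how many of the top N words
# of the gold list YWL also appear among the top N words of an approximated
# rank list. This reading of the metric is an assumption.

def top_n_coverage(gold_ranking, approx_ranking, n=50_000):
    gold_top = set(gold_ranking[:n])
    approx_top = set(approx_ranking[:n])
    return len(gold_top & approx_top) / float(n)
```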

  16. Results and Analysis (4) • Test results on all YWL words

  17. Results and Analysis (5) The top N words of YWL are split into three frequency bands: high (8,076 words), middle (52,148 words) and low (39,436 words)
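Since the three band sizes sum to the 99,660 words of YWL, the split can be expressed as contiguous slices of the ranked list; treating the bands as contiguous rank ranges is an assumption consistent with the reported sizes.

```python
# Split the gold rank list YWL into the three frequency bands reported on
# this slide. Treating the bands as contiguous slices of the ranked list is
# an assumption consistent with the reported sizes (they sum to 99,660).

def split_bands(ywl, high=8_076, middle=52_148, low=39_436):
    assert len(ywl) == high + middle + low
    return ywl[:high], ywl[high:high + middle], ywl[high + middle:]
```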

  18. Results and Analysis (6) • Test results on high-frequency words

  19. Results and Analysis (7) • Test results on middle-frequency words

  20. Results and Analysis (8) • Test results on low-frequency words • The approximated frequencies are not believable for low-frequency words

  21. Where are we? • Introduction • The New Approximation Methods • Data Set • Experiments and Result Analysis • Conclusion and Future Work

  22. Conclusion and Future Work • We propose a trade-off method for Chinese word frequency approximation based on a raw corpus, an MM-segmented corpus and manually segmented corpora • Experiments show that it clearly benefits the word frequency approximation results under the condition that only a few small, unbalanced manually-segmented corpora are available • The results are still not fully satisfactory in some cases • Future work: further study of low-frequency words; evaluation in other NLP tasks

  23. Thank you! Questions and comments are welcome.
