
Personal Research in NLP --as a master of engineering student



Presentation Transcript


  1. Personal Research in NLP --as a master of engineering student
     Li Jun, Department of Computer Science and Technology, Tsinghua University

  2. Outline
     • TextMatrix Project
     • Sentiment Classification
     • Experiment1
     • Experiment2
     • Machine Translation

  3. Text Matrix (C++, cross-platform)
     • http://code.google.com/p/textmatrix/
     • Library and utilities for text mining
       • Text preprocessing
       • Character- and word-based n-gram and suffix-tree-based feature extraction
       • Dimension reduction
       • Evaluation
     • Text processing API for machine learning
     • Developed in my leisure time

  4. Sentiment Classification
     • Sentiment classification using machine learning techniques
       • Based on the overall sentiment of a text
       • Easily transferred to new domains given a training set
     • Applications:
       • Split reviews into positive and negative sets
       • Monitor mood trends among bloggers

  5. Corpus and Baseline Method
     • Corpus
       • Chinese reviews crawled from ctrip.com
       • Mostly in Chinese
       • Scored by customers
       • Training set: 12,000
       • Test set: 4,000
       • Label decided by rating
     • Baseline method
       • Predict by comparing the numbers of positive and negative sentiment words
       • Micro-averaged F1 0.7931, macro-averaged F1 0.7573
     Available on my website: http://nlp.csai.tsinghua.edu.cn/~lj/
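The baseline and its two reported scores can be sketched as follows. This is a minimal illustration, not the code behind the slides: the word lists `POSITIVE_WORDS` and `NEGATIVE_WORDS` are hypothetical, the input is assumed to be already segmented into words, and the tie-breaking rule (default to positive) is an assumption because the slides do not state one. Micro-averaging pools the true-positive, false-positive, and false-negative counts over both classes before computing F1; macro-averaging computes F1 per class and averages the results.

```python
# Hypothetical sentiment lexicons; the real word lists are not given in the slides.
POSITIVE_WORDS = {"好", "干净", "舒适", "满意"}
NEGATIVE_WORDS = {"差", "脏", "吵", "失望"}


def baseline_predict(tokens, tie_label="positive"):
    """Label a segmented review by comparing sentiment-word counts."""
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return tie_label  # assumption: ties fall back to a default label


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels=("positive", "negative")):
    """Return (micro-averaged F1, macro-averaged F1) over the given labels."""
    per_class = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        per_class.append((tp, fp, fn))
    macro = sum(f1(*counts) for counts in per_class) / len(labels)
    pooled = [sum(counts[i] for counts in per_class) for i in range(3)]
    micro = f1(*pooled)
    return micro, macro


gold = ["positive", "positive", "positive", "negative"]
pred = ["positive", "positive", "negative", "negative"]
print(micro_macro_f1(gold, pred))
```

The two averages differ whenever the classes are imbalanced or are recognized with different accuracy, which is consistent with the gap between the two baseline figures on the slide.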

  6. Experiment1 Workflow
     • Tried different parameter combinations at all levels
     • More than 1,400 experiments in the published paper (even more in practice)
     • Using TextMatrix
     Abbreviations:
       • Word-based unigram: WBU
       • Word-based bigram: WBB
       • Character-based bigram: CBB
       • Character-based trigram: CBT
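The four feature types abbreviated above can be illustrated with a short sketch. It is written in Python for brevity and is not TextMatrix code; the function names are hypothetical, the word tokens are assumed to come from an external Chinese word segmenter, and dropping whitespace before building character n-grams is an assumption.

```python
def word_ngrams(tokens, n):
    """Word-based n-grams over an already segmented sentence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character-based n-grams; whitespace is dropped first (an assumption)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def extract_features(text, tokens):
    """Return the four feature sets used in Experiment1, keyed by abbreviation."""
    return {
        "WBU": word_ngrams(tokens, 1),
        "WBB": word_ngrams(tokens, 2),
        "CBB": char_ngrams(text, 2),
        "CBT": char_ngrams(text, 3),
    }


features = extract_features("房间很干净", ["房间", "很", "干净"])
for name, grams in features.items():
    print(name, grams)
```

Character-based features need no word segmenter, which is one practical reason to compare them with word-based features on Chinese text.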

  7. Performance: WBU (i.e., bag of words)
     SVM, NB, ME, and ANN using WBU as features with different feature weights

  8. Performance (WBU, WBB, CBB, CBT)
     SVM, NB, ME, and ANN using WBU, WBB, CBB, and CBT as features, with the specified feature weighting scheme that obtained the best performance

  9. Some Conclusions of Experiment1
     • On average, NB outperforms all the other classifiers when using WBB and CBT
     • N-gram-based features relax the conditional independence assumption of the Naive Bayes model
       • They capture real, integral semantic content
     • People like to use combinations of words to express positive and negative sentiment
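A toy example shows why word combinations help Naive Bayes. The sketch below is a plain multinomial Naive Bayes with add-one smoothing, not the classifier used in the experiments; the class `NaiveBayes`, the helper `unigrams_and_bigrams`, and the four training reviews are all hypothetical. In the training data every single word appears equally often in both classes, so unigrams alone carry no signal; only the bigrams ("不 好", not good; "不 差", not bad) separate the classes.

```python
import math
from collections import Counter, defaultdict


def unigrams_and_bigrams(tokens):
    """Word unigrams plus word bigrams for one segmented review."""
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return list(tokens) + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, feature_lists, labels):
        for feats, label in zip(feature_lists, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for f in feats:
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + self.alpha) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy reviews: each word is equally frequent in both classes.
train = [
    (["很", "好"], "positive"),
    (["不", "好"], "negative"),
    (["很", "差"], "negative"),
    (["不", "差"], "positive"),
]
model = NaiveBayes().fit(
    [unigrams_and_bigrams(tokens) for tokens, _ in train],
    [label for _, label in train],
)
print(model.predict(unigrams_and_bigrams(["不", "好"])))  # negative
```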

  10. Experiment2: suffix-tree-based features
      • A suffix tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.
      Figure: suffix tree for the string BANANA (from http://en.wikipedia.org/wiki/Suffix_tree)
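The idea can be tried out with an uncompressed suffix trie for BANANA. This is a simplified stand-in, not a true suffix tree: a real suffix tree merges single-child chains into one edge and can be built in linear time, whereas the naive construction below takes quadratic time and space. The function names are hypothetical, and the terminator character `$` is assumed not to occur in the text.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_occurrences(trie, pattern, terminator="$"):
    """Count occurrences of pattern: one per suffix (leaf) below its node."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    leaves, stack = 0, [node]
    while stack:
        current = stack.pop()
        for ch, child in current.items():
            if ch == terminator:
                leaves += 1  # each terminator edge ends exactly one suffix
            else:
                stack.append(child)
    return leaves


trie = build_suffix_trie("BANANA")
print(count_occurrences(trie, "ANA"))  # 2
print(count_occurrences(trie, "NAB"))  # 0
```

Every substring of the text is a prefix of some suffix, so a lookup only walks down from the root, in time proportional to the pattern length.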

  11. Key-SubString-Group Concept
      • The substrings in an equivalence group have exactly identical distributions over the corpus; therefore, a substring group = a single feature
      • Input
        • A set of documents
        • The parameters
      • Output
        • The key-substring-groups for each document
      • Time complexity: O(n)
      Zhang D, Lee WS. Extracting key-substring-group features for text classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06).
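The grouping idea can be illustrated by brute force. The sketch below is not the O(n) suffix-tree algorithm from the cited paper, and it omits the paper's selection parameters; it enumerates all substrings in quadratic time. As an assumption, two substrings are treated as equivalent when they occur at exactly the same (document, start position) pairs, which mirrors how all strings ending on one suffix-tree edge share the same leaves. The function name `substring_groups` is hypothetical.

```python
from collections import defaultdict


def substring_groups(documents):
    """Group substrings that share the same set of (doc_id, start) occurrences."""
    occurrences = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for start in range(len(doc)):
            for end in range(start + 1, len(doc) + 1):
                occurrences[doc[start:end]].add((doc_id, start))
    groups = defaultdict(list)
    for substring, positions in occurrences.items():
        groups[frozenset(positions)].append(substring)
    # Within a group, list the shortest substring first.
    return [sorted(members, key=len) for members in groups.values()]


groups = substring_groups(["BANANA"])
for members in sorted(groups, key=lambda g: g[0]):
    print(members)
```

For BANANA the 15 distinct substrings collapse into 6 groups; for example, "AN" and "ANA" always occur together, so counting both would add a redundant feature.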

  12. Performance (part)
      A modified program that supports Chinese is available in TextMatrix v1.1
      The Text Classification Package is available at http://nlp.csai.tsinghua.edu.cn/~lj/

  13. Compared with n-gram features (Experiment1)
      • Conclusion:
        • Outperforms n-gram features with a lower number of features
        • "String" features provide a better representation of documents for classification
      To be published soon in my master's thesis

  14. Statistical Machine Translation: hands-on experience
      • GIZA++ (implementation of IBM Models 1-5)
      • SRILM (language modeling toolkit)
      • Moses (decoder, system, tuning by MERT)
      • Running on Linux with scripts
      • Execution example:
        echo "long time no see" | moses -config moses.ini
        BEST TRANSLATION: 好久 不见
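To give a feel for what the word-alignment step estimates, here is a minimal sketch of IBM Model 1 trained with EM. It is not GIZA++ code and leaves out the NULL word and Models 2-5. The function name and the three-sentence German-English toy corpus are illustrative assumptions and do not come from the slides.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate t[(f, e)] = P(f | e) from (f_tokens, e_tokens) sentence pairs."""
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for f_sent, e_sent in bitext:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Toy parallel corpus (illustrative only).
bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(bitext)
for pair in [("das", "the"), ("Haus", "the"), ("Buch", "book")]:
    print(pair, round(t[pair], 3))
```

Co-occurrence alone cannot tell whether "the" pairs with "das" or "Haus" in the first sentence; the second sentence resolves it, and each EM iteration shifts more probability toward the consistent pairing.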

  15. Thank You! Q&A
