Personal Research in NLP --as a master of engineering student

Li Jun, Department of Computer Science and Technology, Tsinghua University

Outline • TextMatrix Project • Sentiment Classification • Experiment1 • Experiment2 • Machine Translation

TextMatrix (C++, cross-platform)
• http://code.google.com/p/textmatrix/
• Library and utilities for text mining
  • Text preprocessing
  • Feature extraction: character-based and word-based n-grams, suffix-tree-based features
  • Dimension reduction
  • Evaluation
• Text processing API for machine learning
• Developed in my spare time

Sentiment Classification
• Sentiment classification using machine learning techniques
  • Based on the overall sentiment of a text
  • Transfers easily to new domains given a training set
• Applications:
  • Split reviews into positive and negative sets
  • Monitor mood trends among bloggers

Corpus and Baseline Method
• Corpus
  • Reviews crawled from ctrip.com
  • Most in Chinese
  • Scored by customers
  • Training set: 12000
  • Test set: 4000
  • Label decided by rating
• Baseline method
  • Predict by comparing the numbers of positive and negative sentiment words
  • Micro-averaged F1: 0.7931, macro-averaged F1: 0.7573
Available on my website: http://nlp.csai.tsinghua.edu.cn/~lj/
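The baseline and its two scores can be illustrated with a minimal sketch. The tiny English word lists, the tie-breaking rule (ties go to positive), and the function names are assumptions made for illustration only; they are not the lexicon or the code used in the experiments.

```python
from collections import Counter

# Hypothetical toy lexicons; the real baseline used its own sentiment word lists.
POSITIVE = {"good", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "noisy"}


def predict(tokens):
    """Label a tokenized review by comparing sentiment word counts."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Assumption: ties are resolved in favour of the positive class.
    return "pos" if pos >= neg else "neg"


def f1_scores(gold, pred, labels=("pos", "neg")):
    """Return (micro-averaged F1, macro-averaged F1) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 values.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro
```

Macro-averaging weights both classes equally, so it drops below the micro average when the smaller class is classified worse, which is consistent with the gap between the two baseline figures.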

Experiment 1 Workflow
• Tried different parameter combinations at every level
• More than 1400 experiments in the published paper (even more in practice)
• Using TextMatrix
Abbreviations:
• Word-based unigram: WBU
• Word-based bigram: WBB
• Character-based bigram: CBB
• Character-based trigram: CBT
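The four feature types can be sketched as follows. This is an illustration in Python, not the TextMatrix API; it assumes the review has already been segmented into words, and the function names are hypothetical.

```python
def ngrams(units, n):
    """Return all contiguous n-grams over a sequence of units."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def extract_features(words):
    """Build WBU, WBB, CBB and CBT features from a pre-segmented sentence."""
    chars = list("".join(words))  # character stream ignores word boundaries
    return {
        "WBU": ["_".join(g) for g in ngrams(words, 1)],
        "WBB": ["_".join(g) for g in ngrams(words, 2)],
        "CBB": ["".join(g) for g in ngrams(chars, 2)],
        "CBT": ["".join(g) for g in ngrams(chars, 3)],
    }


features = extract_features(["酒店", "很", "干净"])
```

Character-based n-grams need no word segmenter, which is one practical reason to try them on Chinese text.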

Performance: WBU (i.e., bag of words)
SVM, NB, ME, and ANN using WBU features with different feature weighting schemes

Performance (WBU, WBB, CBB, CBT)
SVM, NB, ME, and ANN using WBU, WBB, CBB, and CBT features, each with the feature weighting scheme that gave the best performance

Some Conclusions of Experiment 1
• On average, NB outperforms all the other classifiers when using WBB and CBT features
• N-gram based features relax the conditional independence assumption of the Naive Bayes model
  • They capture more complete units of meaning
• People tend to use combinations of words to express positive and negative sentiment
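A toy example shows why bigram features can help Naive Bayes. The sketch below is a minimal multinomial Naive Bayes with add-one smoothing, written for illustration; the class name, the English training phrases, and the smoothing choice are assumptions, not the setup used in the experiments.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for feats, y in zip(docs, labels):
            self.feat_counts[y].update(feats)
        self.vocab = {f for c in self.feat_counts.values() for f in c}
        return self

    def scores(self, feats):
        """Log-probability score of each class for one document."""
        n = sum(self.class_counts.values())
        out = {}
        for y, cy in self.class_counts.items():
            total = sum(self.feat_counts[y].values()) + len(self.vocab)
            s = math.log(cy / n)
            for f in feats:
                if f in self.vocab:  # unseen features are ignored
                    s += math.log((self.feat_counts[y][f] + 1) / total)
            out[y] = s
        return out

    def predict(self, feats):
        s = self.scores(feats)
        return max(s, key=s.get)


labels = ["pos", "pos", "neg", "neg"]
# With unigrams, every word occurs equally often in both classes.
unigram_docs = [["very", "good"], ["not", "bad"], ["not", "good"], ["very", "bad"]]
# With bigrams, each word combination is tied to a single class.
bigram_docs = [["very_good"], ["not_bad"], ["not_good"], ["very_bad"]]

bigram_nb = NaiveBayes().fit(bigram_docs, labels)
```

In this toy data the unigram model scores both classes identically for "not good", because it treats "not" and "good" as independent evidence, while the bigram model sees the combination as a single feature.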

Experiment 2: suffix-tree-based features
• A suffix tree is a data structure that presents the suffixes of a given string in a way that allows for a particularly fast implementation of many important string operations.
[Figure: Suffix tree for the string BANANA. From http://en.wikipedia.org/wiki/Suffix_tree]
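The idea can be illustrated with an uncompressed suffix trie for BANANA. This is a naive sketch that takes quadratic time and space; a real suffix tree merges single-child chains into edges and can be built in linear time. The function names and the `$` terminator convention are assumptions for illustration.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text += terminator
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node, terminator="$"):
    """Count suffix ends (terminator children) below a node."""
    return sum(
        1 if ch == terminator else count_leaves(child, terminator)
        for ch, child in node.items()
    )


def count_occurrences(trie, pattern):
    """Number of times pattern occurs in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    # Every suffix passing through this node starts with the pattern.
    return count_leaves(node)


trie = build_suffix_trie("BANANA")
```

Once the structure is built, a query costs time proportional to the pattern length plus the size of the subtree visited, independent of where in the text the pattern occurs.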

Key-SubString-Group Concept
• The substrings in an equivalence group have exactly the same distribution over the corpus; therefore, a substring group = a single feature
• Input
  • a set of documents
  • the parameters
• Output
  • the key-substring-groups for each document
• Time complexity: O(n)
Zhang D, Lee WS. Extracting key-substring-group features for text classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06).
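The grouping idea can be shown with a brute-force sketch. It assumes that "identical distribution" means an identical set of occurrence positions (document index, start offset), and it enumerates every substring, so it is far slower than the O(n) suffix-tree method of Zhang and Lee and omits their key-group selection parameters. The function name is hypothetical.

```python
from collections import defaultdict


def substring_groups(docs):
    """Group substrings that occur at exactly the same (doc, start) positions."""
    positions = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                positions[text[i:j]].add((d, i))

    groups = defaultdict(set)
    for sub, pos in positions.items():
        # Substrings sharing one position set are interchangeable as features.
        groups[frozenset(pos)].add(sub)
    return groups


groups = substring_groups(["banana"])
```

For "banana", the substrings "an" and "ana" always start at the same two offsets, so they fall into one group and would count as a single feature.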

Performance (part)
A modified program that supports Chinese is available in TextMatrix v1.1.
The text classification package is available at http://nlp.csai.tsinghua.edu.cn/~lj/

Comparison with the n-gram features of Experiment 1
• Conclusion:
  • Suffix-tree-based features outperform n-gram features while using fewer features.
  • "String" features provide a better representation of documents for classification.
To be published soon in my master's thesis.

Statistical Machine Translation: hands-on experience
• GIZA++ (implementation of IBM Models 1-5)
• SRILM (language modeling toolkit)
• Moses (decoder, system, tuning by MERT)
• Running on Linux with scripts
• Execution example:
  echo "long time no see" | moses -config moses.ini
  BEST TRANSLATION: 好久不见

Thank You! Q&A
