
Personal Research in NLP --as a master of engineering student

Li Jun

Department of Computer Science and Technology,

Tsinghua University.


Outline

  • TextMatrix Project

  • Sentiment Classification

    • Experiment1

    • Experiment2

  • Machine Translation


TextMatrix (C++, cross-platform)

  • http://code.google.com/p/textmatrix/

  • Library and utilities for Text Mining.

  • Text preprocessing

  • Character- and word-based n-gram and suffix-tree-based feature extraction

  • Dimension reduction

  • Evaluation

  • Text processing API for machine learning

  • Developed in my spare time.
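
As an illustration of the character- and word-based n-gram features listed above, here is a minimal Python sketch. It is not TextMatrix code (TextMatrix is a C++ library); the function names `char_ngrams` and `word_ngrams` are hypothetical and chosen only for this example.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Character bigrams need no word segmentation, which is convenient for Chinese.
    print(char_ngrams("好久不见", 2))
    # Word bigrams assume the text is already tokenized.
    print(word_ngrams("long time no see".split(), 2))
```

Inputs shorter than `n` simply yield an empty list.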


Sentiment Classification

  • Sentiment Classification using machine learning techniques

    • based on the overall sentiment of a text

    • Easily transferred to new domains, given a training set.

    • Applications:

      • Split reviews into positive and negative sets

      • Monitor bloggers' mood trends


Corpus and Baseline Method

  • Corpus

    • Chinese reviews crawled from ctrip.com

    • Mostly in Chinese

    • Scored by customers

    • Training set: 12000

    • Test set: 4000

    • Label decided by rating

  • Baseline Method

    • Predict the label by comparing the numbers of sentiment words.

    • Micro-averaged F1: 0.7931; macro-averaged F1: 0.7573
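
The baseline can be sketched in a few lines of Python. This is only a guess at its shape: the slides do not give the sentiment lexicon or the tie-breaking rule, so the word lists below are hypothetical toy examples and ties are assumed to go to the positive class.

```python
# Toy lexicons: hypothetical examples, not the word lists used in the slides.
POSITIVE_WORDS = {"clean", "quiet", "friendly", "good"}
NEGATIVE_WORDS = {"dirty", "noisy", "rude", "bad"}


def baseline_predict(tokens, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Label a tokenized review by comparing counts of sentiment words."""
    pos_hits = sum(1 for tok in tokens if tok in positive)
    neg_hits = sum(1 for tok in tokens if tok in negative)
    # Assumption: a tie (including no sentiment words at all) counts as positive.
    return "positive" if pos_hits >= neg_hits else "negative"


if __name__ == "__main__":
    print(baseline_predict("the room was clean and quiet".split()))
    print(baseline_predict("noisy street and rude staff but good breakfast".split()))
```

A method like this needs no training data, which is why it makes a natural reference point for the learned classifiers.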

Available on my website: http://nlp.csai.tsinghua.edu.cn/~lj/
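
For reference, the two F1 averages reported for the baseline can be computed as in the sketch below. It assumes single-label classification and uses the standard definitions: micro-averaging pools the true-positive, false-positive, and false-negative counts over all classes, while macro-averaging takes the unweighted mean of the per-class F1 scores. The function names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equally long, non-empty label lists."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        per_class.append((tp, fp, fn))
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1_from_counts(*c) for c in per_class) / len(labels)
    # Micro: pool the counts first, so every document weighs the same.
    micro = f1_from_counts(
        sum(c[0] for c in per_class),
        sum(c[1] for c in per_class),
        sum(c[2] for c in per_class),
    )
    return micro, macro


if __name__ == "__main__":
    gold = ["pos", "pos", "pos", "neg"]
    pred = ["pos", "pos", "neg", "neg"]
    print(micro_macro_f1(gold, pred))
```

When every document has exactly one gold label and one predicted label, the micro-averaged F1 equals plain accuracy.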


Experiment1 Workflow

  • Tried different parameter combinations at all levels

  • More than 1,400 experiments in the published paper (even more in practice)

  • Using TextMatrix

Abbreviations:

  • WBU: word-based unigram

  • WBB: word-based bigram

  • CBB: character-based bigram

  • CBT: character-based trigram


Performance – WBU (i.e., bag of words)

SVM, NB, ME, ANN using WBU as features with different feature weights
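
The transcript does not say which feature weights were compared. As an assumption, the sketch below shows three common schemes (binary presence, raw term frequency, and TF-IDF with an unsmoothed `log(N / df)` factor) applied to bag-of-words documents; the function name `weight_documents` is hypothetical.

```python
import math
from collections import Counter


def weight_documents(docs, scheme="tfidf"):
    """Turn tokenized documents into {term: weight} dicts under one scheme.

    Schemes (assumed, not taken from the slides): "binary", "tf", "tfidf".
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "binary":
            weights = {t: 1.0 for t in tf}
        elif scheme == "tf":
            weights = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            weights = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        else:
            raise ValueError("unknown scheme: " + scheme)
        weighted.append(weights)
    return weighted


if __name__ == "__main__":
    docs = [["good", "good", "hotel"], ["bad", "hotel"]]
    for scheme in ("binary", "tf", "tfidf"):
        print(scheme, weight_documents(docs, scheme))
```

With this unsmoothed TF-IDF variant, a term that occurs in every document gets weight 0.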


Performance (WBU, WBB, CBB, CBT)

SVM, NB, ME, and ANN using WBU, WBB, CBB, and CBT as features, each with the feature weighting scheme that obtained the best performance.


Some Conclusions of Experiment1

  • On average, NB outperforms all the other classifiers when using WBB and CBT

    • N-gram-based features relax the conditional independence assumption of the Naive Bayes model

    • They capture more integral semantic content

  • People tend to use combinations of words to express positive and negative sentiment.
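
To make the point about word combinations concrete, here is a small multinomial Naive Bayes sketch with add-one smoothing, trained on word bigrams. It is a toy illustration, not the setup used in the experiments; the class, the helper, and the four training sentences are all made up for this example.

```python
import math
from collections import Counter, defaultdict


def bigrams(tokens):
    """Word bigrams, each joined with a space."""
    return [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]


class NaiveBayes:
    """Multinomial Naive Bayes over arbitrary string features."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for f in feats:
                if f in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    train = [
        ("not good at all", "negative"),
        ("very good service", "positive"),
        ("not bad at all", "positive"),
        ("very bad service", "negative"),
    ]
    docs = [bigrams(text.split()) for text, _ in train]
    labels = [label for _, label in train]
    model = NaiveBayes().fit(docs, labels)
    print(model.predict(bigrams("not good".split())))
    print(model.predict(bigrams("not bad".split())))
```

In this toy data both classes contain exactly the same unigrams, so a unigram model cannot separate them; only the bigrams "not good" and "not bad" carry the sentiment.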


Experiment2: Suffix-Tree-Based Features

  • A suffix tree is a data structure that represents the suffixes of a given string in a way that allows a particularly fast implementation of many important string operations.

Suffix tree for the string BANANA

From http://en.wikipedia.org/wiki/Suffix_tree
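
The idea can be tried out with a much simpler relative of the suffix tree: an uncompressed suffix trie, built naively in quadratic time. The sketch below is only that simplification (a real suffix tree merges single-child chains and can be built in linear time); the function names are hypothetical, and the input is assumed not to contain the terminator `$`.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie."""
    text += "$"  # terminator, assumed not to occur in the input
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_occurrences(trie, pattern):
    """Count occurrences of pattern: the number of leaves below its node."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    # Each suffix passing through this node ends in exactly one leaf.
    stack, leaves = [node], 0
    while stack:
        current = stack.pop()
        if not current:
            leaves += 1
        stack.extend(current.values())
    return leaves


if __name__ == "__main__":
    trie = build_suffix_trie("BANANA")
    for pattern in ("ANA", "NA", "BAN", "NAB"):
        print(pattern, count_occurrences(trie, pattern))
```

Every substring of the text is a prefix of some suffix, which is why a single walk from the root answers substring queries.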


Key-SubString-Group Concept

  • The substrings in an equivalence group have exactly the same distribution over the corpus; therefore,

    a substring group = a single feature

  • Input

    • a set of documents

    • the parameters

  • Output

    • the key-substring-groups for each document

  • Time Complexity: O(n)

Zhang D, Lee WS. Extracting key-substring-group features for text classification. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06).
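
The grouping idea can be illustrated with a naive sketch. This is not the linear-time suffix-tree algorithm of the cited paper: it enumerates all substrings in quadratic time and, as an assumption, treats two substrings as equivalent when they start at exactly the same positions in the corpus, which implies an identical distribution over the documents. The single `min_freq` threshold is a hypothetical stand-in for the method's parameters.

```python
from collections import defaultdict


def substring_groups(docs, min_freq=2):
    """Group substrings that occur at exactly the same (document, start) positions.

    Naive O(n^2) illustration; min_freq is a hypothetical frequency threshold.
    """
    positions = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            for end in range(start + 1, len(text) + 1):
                positions[text[start:end]].add((doc_id, start))
    groups = defaultdict(list)
    for substring, occurrences in positions.items():
        if len(occurrences) >= min_freq:
            groups[frozenset(occurrences)].append(substring)
    # Each group would be used as one feature; list its members shortest first.
    return [sorted(members, key=len) for members in groups.values()]


if __name__ == "__main__":
    for group in sorted(substring_groups(["BANANA"])):
        print(group)
```

For BANANA, "AN" and "ANA" always occur together, so they fall into one group and would contribute a single feature instead of two redundant ones.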


Performance (part)

A modified program that supports Chinese is available in TextMatrix v1.1

The text classification package is available at http://nlp.csai.tsinghua.edu.cn/~lj/


Compared with N-gram Features (Experiment1)

  • Conclusion:

  • Outperforms n-gram features while using fewer features.

  • “string” features provide a better representation of documents for classification.

To be published soon in my master's thesis


Statistical Machine Translation: hands-on experience

  • GIZA++ (implementation of IBM Models 1-5)

  • SRILM (language modeling toolkit)

  • Moses (decoder, system, tuning by MERT)

  • Running on Linux with scripts

  • Execution example:

    echo "long time no see" | moses -config moses.ini

    BEST TRANSLATION: 好久 不见
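
To give a feel for what the word-alignment step estimates, here is a toy EM trainer for IBM Model 1, the simplest of the models listed above. It is a teaching sketch under simplifying assumptions (no NULL word, no smoothing, a three-sentence made-up parallel corpus), not GIZA++ code and not a way to reproduce the Moses output shown above; `train_ibm_model1` is a hypothetical name.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate prob[(t, s)] = P(target word t | source word s) with EM.

    pairs: list of (source_tokens, target_tokens). Assumes no NULL word.
    """
    target_vocab = {t for _, target in pairs for t in target}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (t, s) alignments
        total = defaultdict(float)  # expected count of s being aligned at all
        for source, target in pairs:
            for t in target:
                # E-step: spread one unit of t over the source words.
                norm = sum(prob[(t, s)] for s in source)
                for s in source:
                    fraction = prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: renormalize the expected counts per source word.
        prob = defaultdict(float, {(t, s): c / total[s] for (t, s), c in count.items()})
    return prob


if __name__ == "__main__":
    corpus = [
        (["good", "hotel"], ["好", "酒店"]),
        (["good", "service"], ["好", "服务"]),
        (["bad", "service"], ["差", "服务"]),
    ]
    table = train_ibm_model1(corpus)
    for s in ("good", "hotel", "service", "bad"):
        best = max(("好", "酒店", "服务", "差"), key=lambda t: table[(t, s)])
        print(s, "->", best)
```

Although no sentence pair says which word translates which, the co-occurrence pattern across sentences is enough for EM to pull the probabilities toward the consistent pairings.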


