
Xuan-Hieu Phan Le-Minh Nguyen Susumu Horiguchi

Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. Xuan-Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi. GSIS, Tohoku University; GSIS, JAIST; GSIS, Tohoku University. WWW 2008. NLG Seminar, 2008/12/31. Reporter: Kai-Jie Ko.



Presentation Transcript


  1. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections Xuan-Hieu Phan Le-Minh Nguyen Susumu Horiguchi GSIS, Tohoku University GSIS, JAIST GSIS, Tohoku University WWW 2008 NLG Seminar 2008/12/31 Reporter: Kai-Jie Ko

  2. Many classification tasks that work with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy due to data sparseness. Motivation

  3. Employ search engines to expand and enrich the context of data Previous works to overcome data sparseness

  4. Employ search engines to expand and enrich the context of data • Time consuming! Previous works to overcome data sparseness

  5. To utilize online data repositories, such as Wikipedia or Open Directory Project, as external knowledge sources Previous works to overcome data sparseness

  6. To utilize online data repositories, such as Wikipedia or Open Directory Project, as external knowledge sources • These works used only the user-defined categories and concepts in those repositories, which are not general enough Previous works to overcome data sparseness

  7. General framework

  8. Must be large and rich enough to cover the words and concepts that are related to the classification problem. • Wikipedia & MEDLINE are chosen in this paper. (a) Choose a universal dataset

  9. Use topic-oriented keywords to crawl Wikipedia with a maximum hyperlink depth of 4 • 240 MB • 71,968 documents • 882,376 paragraphs • 60,649-word vocabulary • 30,492,305 words (a) Choose a universal dataset
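The crawling step above is in essence a bounded breadth-first traversal of the hyperlink graph. The snippet below is a minimal sketch of that idea, not the authors' crawler: `links` is a toy in-memory adjacency map standing in for real HTTP fetches of Wikipedia pages, and `crawl` is a hypothetical helper name.

```python
from collections import deque

def crawl(seed_pages, links, max_depth=4):
    """Breadth-first crawl of a hyperlink graph, following links at most
    max_depth hops away from the topic-oriented seed pages."""
    visited = set(seed_pages)
    frontier = deque((p, 0) for p in seed_pages)
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in links.get(page, ()):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return visited

# Toy link chain: A -> B -> C -> D -> E -> F (F is 5 hops from A).
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"]}
pages = crawl(["A"], graph, max_depth=4)  # F is excluded at depth 5
```

A real crawler would replace the adjacency-map lookup with an HTTP fetch plus link extraction, but the depth-bounded BFS skeleton stays the same.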

  10. Ohsumed: a test collection of medical journal abstracts to assist IR research • 156 MB • 233,442 abstracts (a) Choose a universal dataset

  11. (b) Doing topic analysis for the universal dataset

  12. Using GibbsLDA++, a C/C++ implementation of LDA using Gibbs sampling • The number of topics ranges from 10, 20, ... to 100, 150, and 200 • The hyperparameters alpha and beta were set to 0.5 and 0.1, respectively (b) Doing topic analysis for the universal dataset
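GibbsLDA++ itself is a C/C++ tool; the following is a minimal pure-Python sketch of the algorithm it implements, collapsed Gibbs sampling for LDA, using the slide's hyperparameters alpha = 0.5 and beta = 0.1. The tiny two-document corpus and K = 2 are invented for illustration; this is not the actual GibbsLDA++ code.

```python
import random

def lda_gibbs(docs, K, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: repeatedly resample each token's
    topic from its full conditional, then estimate doc-topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # tokens per topic
    z = []                             # topic assignment of every token
    for d, doc in enumerate(docs):     # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(K)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t, wi = z[d][n], widx[w]
                ndk[d][t] -= 1; nkw[t][wi] -= 1; nk[t] -= 1
                # full conditional p(z = k | everything else)
                probs = [(ndk[d][k] + alpha) * (nkw[k][wi] + beta)
                         / (nk[k] + V * beta) for k in range(K)]
                r = rng.random() * sum(probs)
                acc = 0.0
                for k, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        t = k
                        break
                z[d][n] = t
                ndk[d][t] += 1; nkw[t][wi] += 1; nk[t] += 1
    # smoothed per-document topic distributions theta
    return [[(ndk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
            for d, doc in enumerate(docs)]

docs = [["money", "bank", "loan", "money", "loan", "bank"],
        ["river", "bank", "water", "river", "water", "bank"]]
theta = lda_gibbs(docs, K=2)
```

On the real Wikipedia/MEDLINE corpora the same procedure runs over millions of tokens, which is why an optimized C/C++ implementation is used.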

  13. Hidden topics analysis for Wikipedia data

  14. Hidden topics analysis for the Ohsumed-MEDLINE data

  15. Words/terms in this dataset should be relevant to as many hidden topics as possible. (c) Building a moderate-size labeled training dataset

  16. To transform the original data into a set of topics (d) Doing topic inference for training and future data
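Topic inference for new training or future data can be sketched as "folding in": sample topic assignments for the new document while holding the topic-word distributions learned from the universal dataset fixed. In the sketch below, `phi` is a toy hand-made model and `infer_theta` a hypothetical helper, not the paper's implementation.

```python
import random

def infer_theta(doc, phi, widx, alpha=0.5, iters=100, seed=0):
    """Gibbs-style fold-in: resample the new document's token-topic
    assignments with the topic-word probabilities phi held fixed, then
    return the smoothed topic mixture theta for this document."""
    rng = random.Random(seed)
    K = len(phi)
    z = [rng.randrange(K) for _ in doc]
    ndk = [0] * K
    for t in z:
        ndk[t] += 1
    for _ in range(iters):
        for n, w in enumerate(doc):
            wi = widx.get(w)
            if wi is None:
                continue  # word outside the universal vocabulary
            t = z[n]
            ndk[t] -= 1
            probs = [(ndk[k] + alpha) * phi[k][wi] for k in range(K)]
            r = rng.random() * sum(probs)
            acc = 0.0
            for k, p in enumerate(probs):
                acc += p
                if r <= acc:
                    t = k
                    break
            z[n] = t
            ndk[t] += 1
    return [(ndk[k] + alpha) / (len(doc) + K * alpha) for k in range(K)]

# Toy "trained" model (assumed): topic 0 ~ finance, topic 1 ~ nature.
widx = {"money": 0, "loan": 1, "bank": 2, "river": 3, "water": 4}
phi = [[0.35, 0.35, 0.28, 0.01, 0.01],   # topic 0
       [0.01, 0.01, 0.28, 0.35, 0.35]]   # topic 1
theta = infer_theta(["money", "loan", "bank"], phi, widx)
```

The inferred `theta` (or the most prominent topics it implies) is what gets appended to the short snippet, turning a few sparse words into a richer representation.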

  17. Sample Google search snippets

  18. This shows the sparseness of web snippets: only a small fraction of words are shared by two or three different snippets Snippet word co-occurrence

  19. After doing inference and integration, the snippets are more semantically related Shared topics among snippets after inference

  20. Choose from different learning methods • Integrate hidden topics into the training, test, or future data according to the data representation of the chosen learning technique • Train the classifier on the integrated training data (e) Building the classifier
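As a stand-in illustration of step (e), the sketch below trains a simple multinomial Naive Bayes (not necessarily the learner chosen in the paper) on word features augmented with topic pseudo-features such as `topic=fin`; the feature names and the tiny dataset are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing over integrated
    features: plain words plus topic pseudo-features from inference.
    examples is a list of (feature_list, label) pairs."""
    labels = Counter(lbl for _, lbl in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, lbl in examples:
        feat_counts[lbl].update(feats)
        vocab.update(feats)
    V = len(vocab)

    def predict(feats):
        best, best_lp = None, -math.inf
        for lbl, n in labels.items():
            lp = math.log(n / len(examples))  # log prior
            total = sum(feat_counts[lbl].values())
            for f in feats:                   # log likelihood, smoothed
                lp += math.log((feat_counts[lbl][f] + alpha)
                               / (total + alpha * V))
            if lp > best_lp:
                best, best_lp = lbl, lp
        return best

    return predict

# Word features + hypothetical topic pseudo-features from step (d).
train = [
    (["stock", "market", "topic=fin"], "Business"),
    (["bank", "loan", "topic=fin"], "Business"),
    (["virus", "doctor", "topic=med"], "Health"),
    (["flu", "vaccine", "topic=med"], "Health"),
]
predict = train_nb(train)
```

The shared topic features let a test snippet with no word overlap against a training snippet still match it through their common topics, which is the point of the integration step.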

  21. Domain disambiguation for Web search results • To classify Google search snippets into different domains, such as Business, Computers, Health, etc. • Disease classification for medical abstracts • To classify each MEDLINE medical abstract into one of five disease categories, such as those related to neoplasms, the digestive system, etc. Evaluation

  22. Obtain Google snippets as training and test data; the search phrases used to collect the two datasets are mutually exclusive Domain disambiguation for Web search results

  23. The results of 5-fold cross-validation on the training data • Reduces the error by 19% on average Domain disambiguation for Web search results
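A 5-fold cross-validation split like the one behind this measurement can be sketched as a generic index-splitting helper (an illustration, not the authors' evaluation harness):

```python
def k_fold_splits(n, k=5):
    """Yield (train_indices, test_indices) pairs: each example lands in
    exactly one test fold, and trains the model in the other k-1 folds."""
    fold = [i % k for i in range(n)]  # assign examples to folds round-robin
    for f in range(k):
        test = [i for i in range(n) if fold[i] == f]
        train = [i for i in range(n) if fold[i] != f]
        yield train, test

splits = list(k_fold_splits(10, k=5))
```

Averaging the per-fold accuracies gives a more stable estimate than a single train/test split, which matters when the labeled training set is deliberately kept small.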

  24. Domain disambiguation for Web search results

  25. Domain disambiguation for Web search results

  26. Disease Classification for Medical Abstracts with MEDLINE Topics The proposed method requires only 4,500 training examples to reach the accuracy of the baseline, which uses 22,500 training examples!

  27. Advantages of the proposed framework: • A good method for classifying sparse and previously unseen data • Utilizes the large universal dataset • Expands the coverage of the classifier: topics coming from the external data cover many terms/words that do not exist in the training dataset • Easy to implement: only a small set of labeled training examples must be prepared to attain high accuracy Conclusion
