
CIKM 2008, Napa Valley, California, October 26-30, 2008





Semi-supervised Text Categorization by Active Search

Zenglin Xu¹, Rong Jin², Kaizhu Huang¹, Michael R. Lyu¹, and Irwin King¹

¹ Department of Computer Science and Engineering, The Chinese University of Hong Kong

{zlxu, kzhuang, lyu, [email protected]

² Department of Computer Science and Engineering, Michigan State University

[email protected]

1. Motivations

  • Given a small number of labeled documents, it is very challenging to build a reliable classifier.

  • Unlabeled data are helpful in automated text categorization.

  • Semi-supervised learning can take advantage of both the labeled documents and the unlabeled documents.

  • How to obtain unlabeled documents? We can collect them through search engines.

2. Contributions

  • A general framework for semi-supervised text categorization that collects the unlabeled documents via Web search engines.

  • A novel discriminative query generation method.

  • The categorization framework can significantly improve the classification accuracy.

3. Framework & Model

  • Framework:

      • Query generation, which generates the textual queries for document retrieval

      • Document retrieval, which retrieves the Web documents through the Web search engine

      • Semi-supervised text categorization, which utilizes both the labeled documents and the retrieved unlabeled Web documents

  • Model:

      • 1. Query generation: generate a query for every labeled document (document: (x, y); V_i: vocabulary of the i-th document; w: word weights; ξ: margin error)

      • 2. Text categorization models (D: labeled documents; U: retrieved unlabeled documents)

          • Auxiliary SVM (y* is the input)

          • Semi-supervised SVM (y* is an optimization variable)
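The first two framework steps (query generation, then document retrieval) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' formulation: the poster's margin-based query model is not reproduced here, so the word weights w are replaced by a simple stand-in (mean term frequency in the document's class minus mean term frequency in the other classes). The names `word_weights`, `generate_query`, `collect_unlabeled`, and `fake_search` are hypothetical, and `fake_search` stands in for a real Web search engine.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Doc = Tuple[List[str], str]  # (tokens of the document, class label)


def word_weights(labeled: List[Doc], target: str) -> Dict[str, float]:
    """Stand-in for the word weights w (an assumption, not the poster's model):
    mean term frequency in the target class minus that in all other classes."""
    pos = [Counter(toks) for toks, y in labeled if y == target]
    neg = [Counter(toks) for toks, y in labeled if y != target]
    vocab = set().union(*pos, *neg)

    def mean_tf(word: str, docs: List[Counter]) -> float:
        return sum(c[word] for c in docs) / len(docs) if docs else 0.0

    return {w: mean_tf(w, pos) - mean_tf(w, neg) for w in vocab}


def generate_query(doc: Doc, labeled: List[Doc], k: int = 3) -> List[str]:
    """Pick the k highest-weighted words from the document's own vocabulary V_i."""
    tokens, label = doc
    weights = word_weights(labeled, label)
    vocab_i = sorted(set(tokens))  # alphabetical order makes ties deterministic
    return sorted(vocab_i, key=lambda w: -weights.get(w, 0.0))[:k]


def collect_unlabeled(labeled: List[Doc],
                      search: Callable[[str], List[str]],
                      k: int = 3,
                      per_query: int = 100) -> List[str]:
    """Issue one query per labeled document and pool the de-duplicated results as U."""
    pool: List[str] = []
    seen = set()
    for doc in labeled:
        query = " ".join(generate_query(doc, labeled, k))
        for text in search(query)[:per_query]:
            if text not in seen:
                seen.add(text)
                pool.append(text)
    return pool


def fake_search(query: str) -> List[str]:
    """Hypothetical stub in place of a real search engine call."""
    return [f"{query} :: result {i}" for i in range(3)]


# Toy labeled set D (made-up data for illustration only).
toy = [
    (["hockey", "goal", "team", "season"], "sport"),
    (["team", "coach", "goal", "win"], "sport"),
    (["gpu", "driver", "team", "install"], "tech"),
]
print(generate_query(toy[0], toy, k=2))  # prints ['goal', 'hockey']
```

In this toy run the word "team" occurs in both classes, so its weight is zero and it is never chosen for a query; that is the sense in which the selection is discriminative. The pooled documents returned by `collect_unlabeled` would then play the role of U in the categorization models.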

4. Experiment results

  • Data Repositories: 20-newsgroup, Reuters-21578, Ohsumed

  • Training data: 5 labeled documents in each category

  • Each labeled document generates one query

  • Each query returns 100 unlabeled documents

  • Auxi-SVM: Auxiliary SVM (Optimization: QP)

  • Semi-SVM: Semi-supervised SVM (Optimization: CCCP)

  • Search engine: Google

  • Accuracy improvement over SVM:

      • Auxi-SVM: 26%

      • Semi-SVM: 34%
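The setup above also fixes an upper bound on the size of the unlabeled pool: 5 labeled documents per category, one query per document, and 100 results per query give at most 500 retrieved documents per category. A small sketch of that arithmetic follows; the function name `max_pool_size` is hypothetical, and the number of categories is left as a parameter because the poster does not state it.

```python
def max_pool_size(num_categories: int,
                  labeled_per_category: int = 5,
                  queries_per_document: int = 1,
                  results_per_query: int = 100) -> int:
    """Upper bound on retrieved unlabeled documents; the actual pool can be
    smaller when queries return overlapping or fewer results."""
    return (num_categories * labeled_per_category
            * queries_per_document * results_per_query)


print(max_pool_size(1))  # prints 500 (per-category bound)
```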
