
Term Filtering with Bounded Error

Zi Yang, Wei Li, Jie Tang, and Juanzi Li

Knowledge Engineering Group

Department of Computer Science and Technology

Tsinghua University, China

{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn

[email protected]

Presented at ICDM’2010, Sydney, Australia



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Introduction

  • Term-document correlation matrix

    • X: a co-occurrence matrix whose entry X_dv is the number of times term v occurs in document d (see the sketch below)

    • Applied in various text mining tasks

      • text classification [Dasgupta et al., 2007]

      • text clustering [Blei et al., 2003]

      • information retrieval [Wei and Croft, 2006]

    • Disadvantage

      • high computation cost and intensive memory access

How can we overcome the disadvantage?
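
As a concrete illustration of the data structure just described, here is a minimal Python sketch that builds a term-document co-occurrence matrix for two toy documents (the documents and variable names are invented for illustration, not taken from the paper):

    from collections import Counter

    # Build the term-document co-occurrence matrix X, where X[d][v] is the
    # number of times term v occurs in document d (toy documents).
    docs = ["term filtering with bounded error",
            "filtering terms with bounded information loss"]
    vocab = sorted({w for doc in docs for w in doc.split()})
    X = [[Counter(doc.split())[v] for v in vocab] for doc in docs]

    for doc, row in zip(docs, X):
        print(row, "<-", doc)
    # Each row has |V| entries, so memory and computation grow with the
    # vocabulary size -- the disadvantage the talk sets out to address.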



Introduction (cont'd)

  • Three general methods:

    • To improve the algorithm itself

    • High performance platforms

      • multicore processors and distributed machines [Chu et al., 2006]

    • Feature selection approaches

      • [Yi, 2003]

  • We follow this last line (feature selection)



Introduction (cont'd)

  • Basic assumption

    • Each term captures more or less information.

  • Our goal

    • A subspace of features (terms)

    • Minimal information loss.

  • Conventional solution:

    • Step 1: Measure the importance of each individual term, or its dependency on a group of terms.

      • Information gain or mutual information.

    • Step 2: Remove features with low importance scores or high redundancy scores (a sketch of this two-step recipe follows after this slide).

  • Limitations

    • How can information loss be generically defined for a single term, or for a set of terms, under an arbitrary metric?

    • How can the information loss of each individual document, or of a set of documents, be quantified?

  • Why should we consider the information loss on documents?
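
For concreteness, the indented Python below sketches the conventional two-step recipe referenced above, using document frequency as a stand-in for the importance score (information gain or mutual information would be scored analogously); the function name and the keep ratio are illustrative assumptions:

    import numpy as np

    def conventional_filter(X, keep_ratio=0.5):
        """Step 1: score every term; Step 2: drop the lowest-scoring terms.
        Here the score is simply document frequency, standing in for
        information gain or mutual information."""
        scores = (X > 0).sum(axis=1)              # document frequency per term
        k = max(1, int(keep_ratio * X.shape[0]))  # number of terms to keep
        keep = np.argsort(scores)[::-1][:k]
        return np.sort(keep)

    X = np.array([[2, 0, 1],    # rows: terms, columns: documents
                  [0, 0, 1],
                  [3, 2, 0]])
    print(conventional_filter(X, keep_ratio=0.7))   # -> [0 2]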



Introduction (cont'd)

  • Consider a simple example: two documents, Doc 1 = (A A B) and Doc 2 = (A A B)

  • Question: can A and B be substituted by a single superterm S without losing information?

    • Consider only the information loss for terms

      • YES! If the loss is defined by a state-of-the-art pseudometric such as cosine similarity, the document vectors of A and B are proportional, so no term-level information is lost.

    • Consider both terms and documents

      • NO! Both documents emphasize A more than B, and merging loses this distinction. It is information loss of documents!
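
The point of the example can be checked numerically. The sketch below (an illustration, assuming an average-occurrence merge) shows that the cosine distance between the term vectors of A and B is zero, while the within-document emphasis on A over B disappears after merging:

    import numpy as np

    #            Doc1  Doc2        (both documents are "A A B")
    X = np.array([[2,    2],       # term A
                  [1,    1]])      # term B

    def cosine_distance(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Term-level view: A and B have proportional document vectors,
    # so the cosine distance is 0 and merging costs nothing for terms.
    print(cosine_distance(X[0], X[1]))       # ~0.0

    # Document-level view: after an average-occurrence merge into S,
    # each document is represented by a single count ...
    S = X.mean(axis=0)
    print(X[:, 0], "->", S[0])               # Doc1: [2 1] -> 1.5
    # ... and the fact that Doc1 and Doc2 emphasize A over B is lost.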


  • What have we done?

    • Defined information loss for both terms and documents

    • Formulated the Term Filtering with Bounded Error (TFBE) problem

    • Developed efficient algorithms



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Problem Definition - Superterm

  • Want

    • less information loss

    • smaller term space

  • Superterm: a single representative term s that substitutes for a group of terms, where X'_ds is the number of occurrences of superterm s in document d

  • Group terms into superterms!
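
The next slide names winner-take-all and average-occurrence as ways of choosing the superterm's counts; the Python lines below sketch one plausible reading of each (the exact definitions here are assumptions for illustration):

    import numpy as np

    #            Doc1  Doc2
    X = np.array([[2,    2],    # term A
                  [1,    1]])   # term B  -- both merged into superterm S

    # Two plausible ways to set the superterm's per-document counts:
    winner_take_all    = X.max(axis=0)    # [2 2]     -- count of the dominant member
    average_occurrence = X.mean(axis=0)   # [1.5 1.5] -- mean count over the members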



Problem Definition – Information loss

After merging, each document has a transformed representation over the superterms

  • How is the information loss during term merging measured?

    • With user-specified distance measures

      • e.g., the Euclidean metric or cosine similarity

    • When a superterm substitutes a set of terms, we measure

      • the information loss of terms

      • the information loss of documents

The superterm's occurrence counts can be chosen with different methods: winner-take-all, average-occurrence, etc.

[Slide figure: Doc 1 and Doc 2, each containing (A A B), are transformed so that terms A and B are replaced by the single superterm S]


Problem Definition – TFBE

  • Term Filtering with Bounded Error (TFBE)

    • To minimize the number of superterms subject to the following constraints:

      (1) every term is assigned to a superterm by a mapping function φ from terms to superterms,

      (2) for every term, the information loss is bounded by a user-specified error ε_T,

      (3) for every document, the information loss is bounded by a user-specified error ε_D.
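
Putting the three constraints together, a possible LaTeX formalization of TFBE is given below; the symbols φ, ε_T, ε_D and the distance d(·,·) are reconstruction choices based on the slide's wording, not notation taken from the paper:

    \begin{align*}
    \min_{S,\;\phi}\ & |S| \\
    \text{s.t.}\quad & \phi : V \to S \ \text{assigns every term to exactly one superterm},\\
    & d\!\left(\mathbf{x}_v,\ \mathbf{x}'_{\phi(v)}\right) \le \epsilon_T \qquad \forall\, v \in V,\\
    & d\!\left(\mathbf{x}_d,\ \mathbf{x}'_d\right) \le \epsilon_D \qquad \forall\, d \in D.
    \end{align*}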



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Lossless Term Filtering

  • Special case: ε_T = 0 and ε_D = 0

    • NO information loss for any term or document

  • Theorem: the exact optimal solution of the problem is obtained by grouping terms that have the same vector representation.

  • Algorithm

    • Step 1: find "local" superterms within each document

      • terms with the same number of occurrences within that document

    • Step 2: add, split, or remove superterms in the global superterm set

  • Is this case applicable in practice?

    • YES! On the Baidu Baike dataset (1,531,215 documents), the vocabulary shrinks from 1,522,576 terms to 714,392 superterms, a reduction of more than 50%.
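
The theorem above suggests a direct (if naive) way to realize the lossless case: group terms whose document vectors are identical. The sketch below does exactly that by hashing each term's row; it illustrates the idea rather than the two-step, per-document algorithm described on the slide:

    from collections import defaultdict
    import numpy as np

    def lossless_superterms(X):
        """Group terms (rows of X) that share exactly the same document
        vector; by the theorem this causes no loss for terms or documents."""
        groups = defaultdict(list)
        for term_id, row in enumerate(X):
            groups[tuple(row)].append(term_id)
        return list(groups.values())

    X = np.array([[2, 0, 1],
                  [2, 0, 1],    # identical to term 0 -> same superterm
                  [0, 3, 0]])
    print(lossless_superterms(X))   # [[0, 1], [2]]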



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Lossy Term Filtering

  • General case: ε_T > 0 and ε_D > 0

    • NP-hard!

  • A greedy algorithm (a sketch follows after this slide)

    • Step 1: group terms subject to the term-level bound ε_T

      • Similar to the greedy set-cover algorithm (#iterations << |V|)

    • Step 2: verify the validity of the candidate groups against the document-level bound ε_D

  • Computational complexity

    • Parallelize the distance computations

  • Can we further reduce the number of candidate superterms?

    • Locality-sensitive hashing (LSH)
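
As a rough sketch of the greedy step (a set-cover-style heuristic under assumed details, not the paper's exact algorithm), the function below repeatedly picks the term that can absorb the most remaining terms within the term-level bound; the document-level bound ε_D would still have to be verified for each resulting group, as in Step 2:

    import numpy as np

    def greedy_superterms(X, eps_t):
        """Greedily group terms (rows of X) so that every member of a group
        lies within Euclidean distance eps_t of the group's chosen center."""
        remaining = set(range(X.shape[0]))
        groups = []
        while remaining:
            # pick the term that can absorb the most remaining terms
            best = max(remaining,
                       key=lambda t: sum(np.linalg.norm(X[t] - X[u]) <= eps_t
                                         for u in remaining))
            group = [u for u in remaining
                     if np.linalg.norm(X[best] - X[u]) <= eps_t]
            groups.append(group)
            remaining -= set(group)
        return groups

    X = np.array([[2.0, 2.0], [1.9, 2.1], [0.0, 5.0]])
    print(greedy_superterms(X, eps_t=0.5))   # e.g. [[0, 1], [2]]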


Efficient Candidate Superterm Generation

  • Basic idea: locality-sensitive hashing (LSH)

    • Hash function for the Euclidean distance [Datar et al., 2004]: h(x) = ⌊(a · x + b) / w⌋, where a is a projection vector drawn from a Gaussian distribution, b is a random offset, and w is the width of the buckets (assigned empirically)

    • With several randomized hash projections, vectors that are close to each other in the original space receive the same hash values

    • Now search for near neighbors within the buckets only!

    (Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
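
A minimal sketch of the bucketing step, using the Datar et al. (2004) hash for the Euclidean distance; the parameter values and the decision to key buckets on the full tuple of hash values are illustrative assumptions:

    import numpy as np

    def e2lsh_buckets(X, w=4.0, n_projections=6, seed=0):
        """Bucket term vectors with h(x) = floor((a.x + b) / w): a is a
        Gaussian projection vector, b is uniform in [0, w), and w is the
        bucket width. Vectors sharing all hash values land in the same
        bucket, so near-neighbor candidates are searched per bucket instead
        of over the whole vocabulary."""
        rng = np.random.default_rng(seed)
        n_terms, n_docs = X.shape
        A = rng.normal(size=(n_projections, n_docs))
        b = rng.uniform(0, w, size=n_projections)
        keys = np.floor((X @ A.T + b) / w).astype(int)
        buckets = {}
        for term_id, key in enumerate(map(tuple, keys)):
            buckets.setdefault(key, []).append(term_id)
        return buckets

    X = np.array([[2., 0., 1.],
                  [2., 0., 1.],
                  [9., 9., 0.]])
    print(e2lsh_buckets(X))   # identical rows always share a bucket; close rows usually do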



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Experiments

  • Settings

    • Datasets

      • Academic: ArnetMiner (10,768 papers and 8,212 terms)

      • 20-Newsgroups (18,774 postings and 61,188 terms)

    • Baselines

      • Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)

      • Supervised method (only on classification): Chi-statistic (CHIMAX)



Experiments (cont'd)

  • Filtering results under the Euclidean metric

  • Findings

    • As the error bounds increase, the term ratio decreases

    • With the same bound for terms, a larger bound for documents gives a lower term ratio

    • With the same bound for documents, a larger bound for terms gives a lower term ratio



Experiments (cont'd)

  • The information loss in terms of the resulting errors

[Plots: average errors and maximum errors]



Experiments (cont'd)

  • Efficiency of the parallel algorithm and LSH*

    * Optimal results obtained from different choices of the tuned parameter



Experiments (cont'd)

  • Applications

    • Clustering (Arnet)

    • Classification (20NG)

    • Document retrieval (Arnet)



Outline

  • Introduction

  • Problem Definition

  • Lossless Term Filtering

  • Lossy Term Filtering

  • Experiments

  • Conclusion



Conclusion

  • Formally define the problem and perform a theoretical investigation

  • Develop efficient algorithms for the problems

  • Validate the approach through an extensive set of experiments with multiple real-world text mining applications



Thank you!



References

  • Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.

  • Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y. and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.

  • Dasgupta, A., Drineas, P., Harb, B., Josifovski, V. and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD '07, pp. 230-239.

  • Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG '04, pp. 253-262.

  • Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.

  • Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03, pp. 43-50.

