The Power of Word Clusters for Text Classification
Noam Slonim and Naftali Tishby
Presented by: Yangzhe Xiao
Word-clusters vs words
• Reduced feature dimensionality.
• More robust.
• 18% increase in accuracy.
• Challenge: group similar words into word-clusters that preserve the information about document categories.
  - Approach: the Information Bottleneck (IB) method.
The IB method is based on the following idea: given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other variable is preserved as much as possible.
• Find clusters of the members of the set X, denoted here by $\tilde{X}$, such that the mutual information $I(\tilde{X};Y)$ is maximized under a constraint on the information extracted from X, $I(\tilde{X};X)$.
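The trade-off between the two information terms can be made concrete with a small sketch. It assumes a toy joint distribution over four words (X) and two categories (Y) and a hard assignment of words to clusters; the numbers and the helper names `mutual_information`, `cluster_joint`, and `entropy_bits` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits for a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)   # marginal p(a), shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, B)
    outer = pa @ pb                         # p(a) p(b), shape (A, B)
    mask = joint > 0                        # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / outer[mask])))

def cluster_joint(joint_xy, assignment, n_clusters):
    """Hard clustering: p(cluster, y) is the sum of p(x, y) over the cluster's words."""
    joint_xy = np.asarray(joint_xy, dtype=float)
    out = np.zeros((n_clusters, joint_xy.shape[1]))
    np.add.at(out, np.asarray(assignment), joint_xy)  # accumulate rows per cluster
    return out

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint p(x, y): words 0 and 1 favour category 0, words 2 and 3 favour category 1.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])

i_words = mutual_information(p_xy)                       # I(X;Y) using raw words

good = cluster_joint(p_xy, [0, 0, 1, 1], n_clusters=2)   # merges similar words
bad = cluster_joint(p_xy, [0, 1, 0, 1], n_clusters=2)    # merges dissimilar words
i_good = mutual_information(good)                        # I(X~;Y), good clustering
i_bad = mutual_information(bad)                          # I(X~;Y), bad clustering

# For a hard clustering, I(X~;X) equals the entropy of the cluster marginal.
compression_good = entropy_bits(good.sum(axis=1))

print(i_words, i_good, i_bad, compression_good)
```

In this toy case the clustering that merges words with identical p(y|x) keeps all of the information about Y while halving the number of features, whereas the other clustering discards it.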
The problem has an optimal formal solution that requires no assumption about the origin of the joint distribution p(x,y).
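For reference, the formal solution is commonly written in the IB literature as the following set of self-consistent equations. The notation here ($\tilde{x}$ for a cluster, $\beta$ for the trade-off parameter) follows the standard presentation and is an assumption about what the original slide displayed.

$$
p(\tilde{x} \mid x) = \frac{p(\tilde{x})}{Z(\beta, x)} \exp\!\Big(-\beta \, D_{KL}\big[\, p(y \mid x) \,\|\, p(y \mid \tilde{x}) \,\big]\Big)
$$

$$
p(y \mid \tilde{x}) = \frac{1}{p(\tilde{x})} \sum_{x} p(y \mid x)\, p(\tilde{x} \mid x)\, p(x),
\qquad
p(\tilde{x}) = \sum_{x} p(\tilde{x} \mid x)\, p(x)
$$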
$D_{KL}[p(y|x) \,\|\, p(y|\tilde{x})]$ is the Kullback-Leibler divergence between the conditional distributions p(y|x) and p(y|$\tilde{x}$), and Z(β,x) is a normalization factor. The single positive parameter β determines the softness of the classification.
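The role of β can be illustrated with a minimal sketch of the soft assignment step, p(x̃|x) ∝ p(x̃) exp(−β D_KL[p(y|x) ‖ p(y|x̃)]). The distributions below are made-up toy values, and the function names `kl_rows` and `soft_assignments` are hypothetical helpers, not part of the authors' implementation.

```python
import numpy as np

def kl_rows(p, q, eps=1e-12):
    """D_KL[p_i || q_j] in nats for every pair of rows; shape (len(p), len(q))."""
    p = np.asarray(p, dtype=float)[:, None, :]   # (n_words, 1, n_categories)
    q = np.asarray(q, dtype=float)[None, :, :]   # (1, n_clusters, n_categories)
    log_ratio = np.log((p + eps) / (q + eps))    # eps guards against log(0)
    return np.sum(np.where(p > 0, p * log_ratio, 0.0), axis=2)

def soft_assignments(p_y_given_x, p_y_given_c, p_c, beta):
    """p(c|x) proportional to p(c) * exp(-beta * D_KL[p(y|x) || p(y|c)])."""
    d = kl_rows(p_y_given_x, p_y_given_c)
    logits = np.log(np.asarray(p_c, dtype=float))[None, :] - beta * d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)      # division plays the role of Z(beta, x)

# Toy inputs (assumed for illustration): two words, two clusters, two categories.
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_y_given_c = np.array([[0.8, 0.2],
                        [0.3, 0.7]])
p_c = np.array([0.5, 0.5])

flat = soft_assignments(p_y_given_x, p_y_given_c, p_c, beta=0.0)    # ignores the KL term
soft = soft_assignments(p_y_given_x, p_y_given_c, p_c, beta=1.0)    # soft membership
hard = soft_assignments(p_y_given_x, p_y_given_c, p_c, beta=100.0)  # nearly hard membership

print(flat)
print(soft)
print(hard)
```

At β = 0 every word is assigned according to the cluster prior alone; as β grows, each word's membership concentrates on the cluster whose p(y|x̃) is closest in KL divergence, which approaches a hard partition.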
Normalized information curves for all 10 iterations in large and small sample sizes