
Sparse Representation


Presentation Transcript


  1. Sparse Representation
Shih-Hsiang Lin (林士翔)
References:
1. T. N. Sainath, et al., “Bayesian Compressive Sensing for Phonetic Classification,” ICASSP 2010.
2. T. N. Sainath, et al., “Sparse Representation Phone Identification Features for Speech Recognition,” IBM T.J. Watson Research Center, Tech. Rep., 2010.
3. T. N. Sainath, et al., “Sparse Representation Features for Speech Recognition,” INTERSPEECH 2010.
4. T. N. Sainath, et al., “Sparse Representations for Text Categorization,” INTERSPEECH 2010.
5. V. Goel, et al., “Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families,” INTERSPEECH 2010.
6. D. Kanevsky, et al., “An Analysis of Sparseness and Regularization in Exemplar-Based Methods for Speech Classification,” INTERSPEECH 2010.
7. A. Sethy, et al., “Data Selection for Language Modeling Using Sparse Representations,” INTERSPEECH 2010.

  2. Outline
• Introduction
• Related Work
• Supervised Summarizers
• Unsupervised Summarizers
• Risk Minimization Framework
• Experiments
• Conclusion and Future Work

  3. Introduction
• Sparse representation (SR) has become a popular technique for efficient representation and compression of signals
• A dictionary H is constructed consisting of possible examples of the signal, and β is a weight vector whose elements reflect the importance of the corresponding training samples
• The test signal is modeled as y = Hβ
• A sparseness condition is enforced on β, such that it selects a small number of examples from H to describe y
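A minimal sketch of this formulation, assuming a toy dictionary H whose columns are training exemplars and using scikit-learn's Lasso as a stand-in l1 solver (the referenced papers use solvers such as ABCS; the data here is synthetic):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # Toy dictionary: 40-dimensional features, 200 training exemplars as columns of H.
    H = rng.standard_normal((40, 200))
    y = H[:, 5] + 0.01 * rng.standard_normal(40)   # test vector close to exemplar 5

    # Solve y ~ H beta with an l1 penalty so that beta is sparse.
    beta = Lasso(alpha=0.05, max_iter=10000).fit(H, y).coef_

    print("non-zero weights:", np.count_nonzero(beta))
    print("strongest exemplar:", int(np.argmax(np.abs(beta))))   # expected: 5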

  4. Introduction (Cont.)
• One benefit of SR is that, for a given test example, it adaptively selects the relevant support vectors from the training set H
• SR has also shown success in face recognition over linear SVM and 1-NN methods
• SVMs select a sparse subset of relevant training examples (support vectors) and use these supports to characterize “all” examples in the test set
• kNNs characterize a test point by selecting the k points from the training set closest to the test vector, then voting on the class with the highest occurrence among these k samples

  5. Introduction (Cont.)
• SR techniques can be viewed as a kind of exemplar-based method
• Issues:
  • Why sparse representations?
  • What type of regularization?
  • Construction of H?
  • Choice of Dictionary?
  • Choice of sampling?

  6. Why sparse representations
[Figure: illustration of a test example represented by training exemplars drawn from classes C1 and C2]

  7. What type of regularization
• l1 norm
  • This constraint corresponds to a Laplacian prior on β (a Gaussian prior would instead give the l2 / ridge penalty)
  • LASSO, Bayesian Compressive Sensing (BCS)
• Impose a combination of an l1 and l2 constraint
  • Elastic Net
  • Cyclic Subgradient Projections (CSP)
• Impose a semi-Gaussian constraint
  • Approximate Bayesian Compressive Sensing (ABCS)
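As a rough illustration of how these choices differ in the sparsity they induce, the sketch below compares l2 (ridge), l1 (LASSO), and combined l1/l2 (elastic net) penalties on the same synthetic y and H; ABCS and CSP are not available in scikit-learn, so only these three stand-ins are shown:

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    rng = np.random.default_rng(1)
    H = rng.standard_normal((40, 200))            # dictionary of training exemplars
    y = H[:, :3] @ np.array([0.6, 0.3, 0.1])      # y built from three exemplars

    models = {
        "ridge (l2)": Ridge(alpha=1.0),
        "lasso (l1)": Lasso(alpha=0.05, max_iter=10000),
        "elastic net (l1 + l2)": ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000),
    }

    # The l2 penalty keeps almost every weight non-zero; the l1-based penalties zero most of beta.
    for name, model in models.items():
        beta = model.fit(H, y).coef_
        print(f"{name:24s} non-zero weights: {np.count_nonzero(np.abs(beta) > 1e-6)}")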

  8. What type of regularization (Cont.)
• As the size of H increases up to 1,000, the error rates of RR (ridge regression) and SR both decrease
  • showing the benefit of including multiple training examples when making a classification decision
• There is no difference in error between the RR and SR techniques
  • suggesting that regularization does not provide any extra benefit

  9. What type of regularization (Cont.)
• The plot shows that the coefficients for the RR method are the least sparse
• The LASSO technique has the sparsest values
[Figure: β coefficients for a randomly selected classification frame y in TIMIT and an H of size 200]

  10. What type of regularization (Cont.)
• Accuracy decreases when a high degree of sparseness is enforced
• Thus, it appears that using a combination of l1 and l2 constraints on β does not force unnecessary sparseness and offers the best performance

  11. Construction of H
• The traditional CS implementation represents y as a linear combination of samples in H
• Many pattern recognition algorithms have shown that better performance can be achieved by a nonlinear mapping of the feature set to a higher dimensional space
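The specific mappings studied in the papers are not spelled out in this transcript; purely as an assumed illustration, the sketch below applies an RBF-style expansion to the test vector and to each exemplar before building H and solving for β:

    import numpy as np
    from sklearn.linear_model import Lasso

    def rbf_expand(x, centers, gamma=0.5):
        """Map a feature vector to RBF similarities against a set of centers (assumed mapping)."""
        return np.exp(-gamma * ((centers - x) ** 2).sum(axis=1))

    rng = np.random.default_rng(2)
    exemplars = rng.standard_normal((200, 40))            # 200 training exemplars, 40-dim
    y_raw = exemplars[7] + 0.01 * rng.standard_normal(40)
    centers = exemplars[rng.choice(200, size=60, replace=False)]

    # Nonlinearly map y and every exemplar, then stack the mapped exemplars as columns of H.
    y = rbf_expand(y_raw, centers)
    H = np.stack([rbf_expand(h, centers) for h in exemplars], axis=1)

    beta = Lasso(alpha=0.01, max_iter=10000).fit(H, y).coef_
    print("strongest exemplar after mapping:", int(np.argmax(np.abs(beta))))   # expected: 7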

  12. Construction of H (Cont.)
• Performance for different feature mappings [comparison table not preserved in the transcript]

  13. SR formulation for classification
• The matrix H is constructed consisting of possible examples of the signal
• Each column h_i can represent features from different classes in the training set
• Given a test feature vector y, the goal of SR is to solve y = Hβ for β
• Each element of β in some sense characterizes how well the corresponding h_i represents the feature vector y
• We can make a classification decision for y by choosing the class whose entries of β have the maximum size

  14. SR formulation for classification (Cont.)
• Ideally, all non-zero entries of β should correspond to the entries in H with the same class as y
• However, due to noise and modeling errors, β might have non-zero values for more than one class
• We can compute the l2 norm over all entries within a specific class, and choose the class with the largest l2 norm support
• Let δ_i(β) be a vector whose entries are zero except for the entries of β corresponding to class i
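A short sketch of this decision rule, assuming a label array that records the class of each column of H; δ_i(β) keeps only the entries of β belonging to class i, and the predicted class is the one with the largest l2-norm support:

    import numpy as np

    def classify_by_support(beta, labels):
        """Return the class whose entries of beta carry the largest l2-norm support."""
        scores = {}
        for c in np.unique(labels):
            delta_c = np.where(labels == c, beta, 0.0)    # delta_i(beta): zero outside class c
            scores[c] = np.linalg.norm(delta_c)
        return max(scores, key=scores.get), scores

    # Toy example: six dictionary columns, three from class "AA" and three from "AE".
    labels = np.array(["AA", "AA", "AA", "AE", "AE", "AE"])
    beta = np.array([0.7, 0.2, 0.0, 0.1, 0.0, 0.05])
    print(classify_by_support(beta, labels))               # -> ('AA', {...})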

  15. Phone Identification Features
• Given a test feature vector y, we first find a sparse β as a solution of y = Hβ
• We then compute a phone identification feature vector from β (the formula is not preserved in the transcript)
• These vectors can then be used as input features for recognition
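The exact formula is missing above; as an assumption consistent with the previous slide, the sketch below aggregates the entries of β per phone class (class-wise l2 norm, normalized to sum to one) and treats the resulting vector as the phone identification feature:

    import numpy as np

    def phone_identification_features(beta, labels, phone_set):
        """One score per phone, aggregated from the sparse weights beta (assumed form)."""
        scores = np.array([np.linalg.norm(beta[labels == p]) for p in phone_set])
        total = scores.sum()
        return scores / total if total > 0 else scores

    phone_set = ["AA", "AE", "AW"]
    labels = np.array(["AA", "AA", "AE", "AE", "AW", "AW"])   # class of each column of H
    beta = np.array([0.8, 0.1, 0.05, 0.0, 0.0, 0.02])
    print(phone_identification_features(beta, labels, phone_set))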

  16. Phone Identification Features (cont.)

  17. Choice of Dictionary H
• Success of the sparse representation features depends heavily on a good choice of H
• Pooling all training data from all classes into H makes the number of columns of H very large (typically millions of frames), which makes solving for β intractable
• Therefore, we need a good strategy for selecting H from a large sample set
• Seeding H from nearest neighbors (a sketch follows below)
  • For each y, we find a neighborhood of the k closest points to y in the training set
  • These k neighbors become the columns of H
  • k is chosen to be large enough that β is sparse and that not all training examples are chosen from the same class
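A sketch of the nearest-neighbour seeding with scikit-learn, assuming the training frames and their labels are held in memory; the feature dimension, k, and the data are illustrative:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def seed_H_from_knn(y, train_feats, train_labels, k=200):
        """Seed the dictionary H with the k training frames closest to the test frame y."""
        nn = NearestNeighbors(n_neighbors=k).fit(train_feats)
        _, idx = nn.kneighbors(y.reshape(1, -1))
        idx = idx[0]
        return train_feats[idx].T, train_labels[idx]        # columns of H are the neighbours

    rng = np.random.default_rng(3)
    train_feats = rng.standard_normal((5000, 40))
    train_labels = rng.integers(0, 48, size=5000)           # e.g. 48 phone classes
    y = rng.standard_normal(40)
    H, H_labels = seed_H_from_knn(y, train_feats, train_labels, k=200)
    print(H.shape, H_labels.shape)                          # (40, 200) (200,)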

  18. Choice of Dictionary H (Cont.)
• This approach is computationally feasible for small vocabulary tasks, but not for large vocabulary tasks
• Using a trigram language model
  • Ideally only a small subset of Gaussians is evaluated at a given frame, and the training data belonging to this small subset can be used to seed H
  • For each test frame y, we decode the data using a trigram language model and find the best aligned Gaussian at each frame
  • For each such Gaussian, we compute the 4 other closest Gaussians
  • We seed H with the training data aligning to these top 5 Gaussians
• Using a unigram / no language model
  • increases the variability between the Gaussians used to seed H

  19. Choice of Dictionary H (Cont.)
• Enforcing unique phonemes
  • With the approaches above, all of these Gaussians might come from the same phoneme
  • Instead, find the 5 closest Gaussians relative to the best aligned one such that the phoneme identities of these Gaussians are unique (i.e. “AA”, “AE”, “AW”, etc.)
• Using Gaussian means
  • The above approaches seed H with actual examples from the training set, which is computationally expensive
  • We can instead seed H from Gaussian means
  • At each frame we use a trigram LM to find the best aligned Gaussian
  • We then find the 499 closest Gaussians to this top Gaussian, and use the means of these 500 Gaussians to seed H
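A sketch of the Gaussian-mean variant under simplifying assumptions: the acoustic model is reduced to an array of Gaussian means with one phone label per Gaussian, the best-aligned Gaussian index is taken as given (it would come from the trigram-LM decode), and closeness between Gaussians is measured by Euclidean distance between means:

    import numpy as np

    def seed_H_from_gaussian_means(best_idx, means, phones, n_closest=499,
                                   unique_phonemes=False):
        """Seed H with the means of the best-aligned Gaussian and its closest neighbours."""
        dists = np.linalg.norm(means - means[best_idx], axis=1)
        chosen, seen = [], set()
        for g in np.argsort(dists):                    # best_idx itself comes first
            if unique_phonemes and phones[g] in seen:
                continue                               # enforce unique phoneme identities
            chosen.append(g)
            seen.add(phones[g])
            if len(chosen) == n_closest + 1:
                break
        return means[chosen].T                         # columns of H are Gaussian means

    rng = np.random.default_rng(4)
    means = rng.standard_normal((10000, 40))           # toy acoustic-model means
    phones = rng.integers(0, 48, size=10000)           # phone label of each Gaussian
    print(seed_H_from_gaussian_means(123, means, phones).shape)   # (40, 500)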

  20. Choice of Dictionary H (Cont.)

  21. SR for Text Categorization
• Setup: 18,000 news documents in 20 classes, term-frequency (TF) features
• Using the maximum support as the metric is too hard a decision, since entries of β from other classes are often non-zero
• Making a softer decision by using the l2 norm of δ_i(β) offers higher accuracy
• Using the residual error offers the lowest accuracy
• When β is sparse, the residual ‖y − Hδ_i(β)‖ for the dominant class reduces to ‖y − Hβ‖, which is a very small number and might not offer good distinguishability from class residuals which are high
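A sketch comparing the three decision rules mentioned on this slide, namely the maximum support, the class-wise l2 norm, and the residual ‖y − Hδ_i(β)‖; the data and β values are synthetic and only illustrate how the rules are computed:

    import numpy as np

    def decision_rules(y, H, beta, labels):
        """Turn the sparse weights beta into a class decision in three different ways."""
        max_support, l2_support, residual = {}, {}, {}
        for c in np.unique(labels):
            delta_c = np.where(labels == c, beta, 0.0)
            max_support[c] = np.max(np.abs(delta_c))          # hard: single largest weight
            l2_support[c] = np.linalg.norm(delta_c)           # softer: class-wise l2 norm
            residual[c] = np.linalg.norm(y - H @ delta_c)     # reconstruction error per class
        return (max(max_support, key=max_support.get),
                max(l2_support, key=l2_support.get),
                min(residual, key=residual.get))

    rng = np.random.default_rng(5)
    H = rng.standard_normal((30, 8))
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    beta = np.array([0.5, 0.3, 0.0, 0.1, 0.05, 0.0, 0.0, 0.02])
    y = H @ beta
    print(decision_rules(y, H, beta, labels))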
