
Text mining


Presentation Transcript


  1. Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics www.crysys.hu

  2. Introduction • Generic model • Document preprocessing • Text mining methods

  3. Text Mining Tasks • Classification (supervised learning) • Binary classification • Single label (multi-class) classification • Multi-label classification • Multi-level (hierarchical) classification • Clustering (unsupervised learning) • Summarization • Extraction: only parts of the original text • Abstraction: introduces text that is not included in the original text

  4. Solutions • Classification • Decision tree • Neural network • Bayes network • Clustering • k-means
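A minimal sketch of the two solution families listed above, assuming scikit-learn and a toy bag-of-words corpus (the documents, labels, and parameters are illustrative, not from the slides): a decision-tree classifier for the supervised case and k-means for the unsupervised case.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

docs = ["cheap viagra offer", "meeting agenda attached",
        "limited time offer", "project status meeting"]
labels = [1, 0, 1, 0]                      # toy labels: 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs)  # bag-of-words document-term matrix

# supervised: decision-tree classification
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict(X))

# unsupervised: k-means clustering into two groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```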

  5. Document preprocessing • Goal: represent any text briefly, in a fixed number of parameters • Representation: vector space model

  6. Vector space model • The text is tokenized into words • The words are canonized to base words; we refer to base words as terms • A dictionary is built: the set of terms occurring in the document collection • The document is represented as a vector: the i-th element of the vector is the number of times the i-th term of the dictionary occurs in the document • The collection of documents is represented in the term-document matrix • Problem: the number of dimensions is too large • Solution: feature selection
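A small sketch of the vector space model described above, using only the Python standard library; lowercasing stands in for real canonization (stemming or lemmatization), which is an assumption of this example.

```python
from collections import Counter

docs = ["Text mining mines text data",
        "Mining text needs preprocessing"]

# tokenize and "canonize" (here: just lowercase; a real system would stem or lemmatize)
tokenized = [[w.lower() for w in d.split()] for d in docs]

# dictionary = ordered set of all terms seen in the collection
dictionary = sorted({t for doc in tokenized for t in doc})

# each document becomes a vector of term counts; stacked, these vectors form the
# term-document matrix (rows here are documents, i.e. the document-term orientation)
counts = [Counter(doc) for doc in tokenized]
term_document = [[c[t] for t in dictionary] for c in counts]

print(dictionary)
print(term_document)
```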

  7. Dimension Reduction • Feature Selection: find a subset of the original variables • Document Frequency Thresholding • Omit words whose document frequency is above a threshold, because such words are not discriminative • Omit words whose document frequency is below a threshold, because such words do not carry much information • Information gain based feature selection (information theory) • Chi-square based feature selection (statistics) • Feature Extraction: transform the data to fewer dimensions • Latent Semantic Indexing (LSI) • Principal Component Analysis (PCA) • Nonlinear methods
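A hedged sketch of two of the listed techniques using scikit-learn: document frequency thresholding (via the min_df/max_df parameters of CountVectorizer) and chi-square feature selection; the corpus, labels, and threshold values are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap offer now", "team meeting today",
        "offer expires now", "meeting notes today"]
labels = [1, 0, 1, 0]

# document-frequency thresholding: drop terms appearing in more than 50% of
# documents (not discriminative) or in fewer than 2 documents (little information)
vec = CountVectorizer(max_df=0.5, min_df=2)
X = vec.fit_transform(docs)

# chi-square feature selection: keep the k terms most dependent on the class label
selector = SelectKBest(chi2, k=2)
X_reduced = selector.fit_transform(X, labels)
print(vec.get_feature_names_out(), X_reduced.shape)
```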

  8. Latent Semantic Indexing (LSI) • SVD is applied to the term-document matrix • The features belonging to the k largest singular values represent the term-document matrix well; only these features are kept • LSI regards documents with many common words as being semantically near
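A short LSI sketch, assuming scikit-learn's TruncatedSVD is an acceptable stand-in for the SVD step; the toy documents are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["graph theory and trees", "trees and forests in graphs",
        "user response time", "system response time measurement"]

X = CountVectorizer().fit_transform(docs)    # document-term matrix

# keep the k components with the largest singular values (the "latent topics")
lsi = TruncatedSVD(n_components=2, random_state=0)
X_lsi = lsi.fit_transform(X)                 # each document in the 2-D latent space
print(X_lsi)
```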

  9. Principal Component Analysis (PCA) • Also called Karhunen-Loève transform (KLT) • A linear technique • Maps the data to a lower dimensional space in a way that the variance in the low-dimensional representation is maximized • The algorithm • The correlation matrix of the data is constructed • The eigenvectors and eigenvalues of the correlation matrix are calculated • The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
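The three algorithm steps above, sketched directly in NumPy (using the covariance matrix of the centred data, a common variant of the correlation matrix mentioned on the slide):

```python
import numpy as np

def pca(X, n_components):
    # 1. build the covariance matrix of the centred data
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    # 2. compute the eigenvectors and eigenvalues of that matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    # 3. project onto the eigenvectors belonging to the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

X = np.random.rand(100, 10)
print(pca(X, 2).shape)        # (100, 2)
```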

  10. Kernel PCA • A nonlinear method • PCA + kernel trick • Kernel trick (in general) • we map observations from a general set S into a higher-dimensional space V • we hope that a hard (nonlinear) classification problem in S reduces to a linear classification problem in V • the trick lets us avoid explicitly computing the mapping of observations from S to V • we use a learning algorithm that needs only the dot product operation in V • we use a mapping whose dot product in V can be computed by a kernel function K evaluated in S (the original space)
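A minimal Kernel PCA sketch with scikit-learn, assuming an RBF kernel plays the role of the kernel function K; the concentric-circles data is a standard illustration, not from the slides.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# toy nonlinear data: two concentric circles that linear PCA cannot separate
theta = np.random.rand(200) * 2 * np.pi
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]

# RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) computes the dot product in V
# without ever constructing the mapping into V explicitly
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```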

  11. Manifold learning techniques • they minimize a cost function that retains local properties of the data • methods • Locally Linear Embedding (LLE) • Hessian LLE • Laplacian Eigenmaps • Local tangent space alignment (LTSA) • Maximum Variance Unfolding (MVU)

  12–13. Locally Linear Embedding (LLE) [figure-only slides: LLE illustrations, no transcript text]
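Since slides 12–13 are figure-only, here is a hedged LLE sketch using scikit-learn's LocallyLinearEmbedding on the classic swiss-roll dataset (the dataset and parameter choices are assumptions for illustration):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3-D "swiss roll" manifold, the standard LLE demo dataset
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# LLE: reconstruct each point from its neighbours, then find a 2-D embedding
# that preserves those local reconstruction weights
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)
print(X_lle.shape)      # (1000, 2)
```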

  14. Maximum Variance Unfolding (MVU) • instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming • exactly preserves all pairwise distances between nearest neighbors • maximizes the distances between points that are not nearest neighbors
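A rough sketch of the MVU formulation described above as a semidefinite program, assuming cvxpy for the SDP and scikit-learn for the neighbour graph; this is an illustrative toy implementation that only scales to a few dozen points, not a production method.

```python
import numpy as np
import cvxpy as cp
from sklearn.neighbors import NearestNeighbors

def mvu(X, n_neighbors=4, n_components=2):
    """Maximum Variance Unfolding: learn the kernel matrix by semidefinite programming."""
    n = X.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X).kneighbors(X)

    K = cp.Variable((n, n), PSD=True)        # Gram matrix of the unfolded points
    cons = [cp.sum(K) == 0]                  # centre the embedding at the origin
    for i in range(n):
        for j, d in zip(idx[i, 1:], dist[i, 1:]):
            # exactly preserve squared distances between nearest neighbours:
            # ||y_i - y_j||^2 = K_ii + K_jj - 2 K_ij = d_ij^2
            cons.append(K[i, i] + K[j, j] - 2 * K[i, j] == d ** 2)

    # maximising trace(K) maximises the variance, pushing non-neighbours apart
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()

    # top eigenvectors of the learned kernel give the embedding coordinates
    w, V = np.linalg.eigh(K.value)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0))

# tiny demo: 30 points on a noisy circle in 3-D (SDP solvers only handle small n)
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30)
X = np.c_[np.cos(t), np.sin(t), 0.05 * rng.standard_normal(30)]
print(mvu(X).shape)      # (30, 2)
```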
