Novel representations and methods in text classification



Novel representations and methods in text classification

Manuel Montes, Hugo Jair Escalante

Instituto Nacional de Astrofísica, Óptica y Electrónica, Mexico. http://ccc.inaoep.mx/~mmontesg/

http://ccc.inaoep.mx/~hugojair/

{mmontesg, hugojair}@inaoep.mx

7th Russian Summer School in Information Retrieval, Kazan, Russia, September 2013



Novel representations and methods in text classification

Overview of this course



Text classification

  • Text classification is the assignment of free-text documents to one or more predefined categories based on their content

  • In this course we will review the basics of text classification (document preprocessing, representation, and classifier construction), with emphasis on novel or extended document representations



Day 1 (today): Introduction to text classification

  • We will provide a general introduction to the task of automatic text classification:

  • Text classification: manual vs automatic approach

  • Machine learning approach to text classification

  • BOW: Standard representation for documents

  • Main learning algorithms for text classification

  • Evaluation of text classification



Day 2: Concept-based representations

  • This session elaborates on concept-based representations of documents, that is, document representations that extend the standard representation with (latent) semantic information.

    • Distributional representations

    • Random indexing

    • Concise semantic analysis

    • Multimodal representations

    • Applications in short-text classification, author profiling and authorship attribution



Day 3: Modeling sequential and syntactic information

  • This session elaborates on different alternatives that extend the BOW approach by including sequential and syntactic information.

    • N-grams

    • Maximal frequent sequences

    • Pattern sequences

    • Locally weighted bag of words

    • Syntactic representations

    • Applications in authorship attribution



Day 4: Non-conventional classification methods

  • This session considers text classification techniques specially suited to work with low quality training sets.

    • Self-training

    • PU-learning

    • Consensus classification

    • Applications in short-text classification, cross-lingual text classification and opinion spam detection.



Day 5: Automatic construction of classification models

  • This session presents methods for the automated construction of classification models in the context of text classification.

    • Introduction to full model selection

    • Related works

    • PSMS

    • Applications in authorship attribution


Some relevant material


ccc.inaoep.mx/~mmontesg/

  • Notes of this course

    • teaching → Russir 2013

  • Notes of our course on text mining

    • teaching → text mining

  • Notes of our course on information retrieval

    • teaching → information retrieval



    Novel representations and methods in text classification

    Introduction to text classification



    Text classification

    • Text classification is the assignment of free-text documents to one or more predefined categories based on their content

    Categories/classes

    (e.g., sports, religion, economy)

    Documents (e.g., news articles)



    Manual classification

    • Very accurate when the job is done by experts

      • Classifying news into general categories is different from classifying biomedical papers into subcategories.

    • But difficult and expensive to scale

      • Classifying thousands of documents is different from classifying millions

    • Used by Yahoo!, Looksmart, about.com, ODP, Medline, etc.



    Hand-coded rule-based systems

    • Main approach in the 80s

    • Disadvantage → the knowledge acquisition bottleneck

      • too time consuming, too difficult, inconsistency issues

    [Diagram: experts and knowledge engineers derive hand-coded rules (Rule 1: if … then … else …; Rule N: if … then …) from labeled documents; the resulting classifier assigns a category to each new document]



    Example: filtering spam email

    • Rule-based classifier

    Hastie et al. The Elements of Statistical Learning, 2007, Springer.



    Machine learning approach (1)

    • A general inductive process builds a classifier by learning from a set of preclassified examples.

      • Determines the characteristics associated with each one of the topics.

    Ronen Feldman and James Sanger, The Text Mining Handbook



    Machine learning approach (2)

    [Diagram: experts produce labeled documents (the training set); a learning algorithm builds a classifier (rules, trees, probabilities, prototypes, etc.) that assigns a category to each new document]

    How to represent document’s information?



    Representation of documents

    • First step: transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm.

      • Generation of a document-attribute representation

    • The most commonly used document representation is the bag of words.

      • Documents are represented by the set of different words in all of the documents

      • Word order is not captured by this representation

      • There is no attempt to understand their content
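To make the idea concrete, here is a minimal Python sketch (the toy documents and variable names are illustrative, not part of the course material) that builds the vocabulary of a small collection and turns each document into a count vector:

```python
# Bag-of-words sketch: the vocabulary is the set of different words in the collection;
# each document becomes one count vector; word order and meaning are discarded.
docs = ["the game ended in a draw",
        "the church held a special service",
        "the team won the game"]

vocab = sorted({w for d in docs for w in d.split()})            # set of different words
vectors = [[d.split().count(w) for w in vocab] for d in docs]   # one vector per document

print(vocab)
print(vectors[0])
```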



    Representation of documents

    [Figure: document–term matrix — one row (vector) per document, one column per word in the vocabulary of the collection (the set of different words); entry w_ij is the weight indicating the contribution of word j in document i]

    Each different word is a feature!

    How to compute their weights?



    Preprocessing

    • Eliminate information about style, such as html or xml tags.

      • For some applications this information may be useful, for instance, to index only certain document sections.

    • Remove stop words

      • Functional words such as articles, prepositions and conjunctions are not useful (they have no meaning of their own).

    • Perform stemming or lemmatization

      • The goal is to reduce inflectional forms, and sometimes derivationally related forms.

    • am, are, is → be;  car, cars, car's → car

    Do we always have to do them?
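As a rough illustration of these steps, the sketch below uses a toy stop-word list and suffix rules of our own (not the course's resources; a real system would use a proper stemmer such as Porter's) to strip tags, remove stop words, and reduce simple inflectional forms:

```python
import re

STOP_WORDS = {"a", "an", "the", "in", "of", "and", "is", "are", "to"}   # tiny example list

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text.lower())                 # drop html/xml tags
    tokens = re.findall(r"[a-z']+", text)                        # simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]          # remove functional words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]       # very naive "stemming"

print(preprocess("<p>The cars are parked in the garage</p>"))    # ['car', 'park', 'garage']
```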



    Term weighting - two main ideas

    • The importance of a term increases proportionally to the number of times it appears in the document.

      • It helps to describe the document's content.

    • The general importance of a term decreases proportionally to its occurrences in the entire collection.

      • Common terms are not good to discriminate between different classes



    Term weighting – main approaches

    • Binary weights:

      • wi,j = 1 iff document di contains term tj , otherwise 0.

    • Term frequency (tf):

      • wi,j = (no. of occurrences of tj in di)

    • tf x idf weighting scheme:

      • wi,j = tf(tj, di) × idf(tj), where:

        • tf(tj, di) indicates the number of occurrences of tj in document di

        • idf(tj) = log[N / df(tj)], where df(tj) is the number of documents that contain the term tj.

    • Normalization?
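The tf × idf scheme above translates directly into code; a minimal sketch (toy data and names of our own) computing w_ij = tf(tj, di) × log[N / df(tj)] for every document:

```python
import math

docs = [["game", "game", "team"], ["church", "service"], ["team", "won", "game"]]
N = len(docs)
vocab = sorted({w for d in docs for w in d})
df = {t: sum(t in d for d in docs) for t in vocab}       # documents containing term t

def tfidf(doc):
    return [doc.count(t) * math.log(N / df[t]) for t in vocab]

for d in docs:
    print([round(w, 2) for w in tfidf(d)])               # one weighted vector per document
```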



    Extended document representations

    • BOW is simple and tends to produce good results, but it has important limitations

      • It captures neither word order nor semantic information

    • New representations attempt to handle these limitations. Some examples are:

      • Distributional term representations

      • Locally weighted bag of words

      • Bag of concepts

      • Concise semantic analysis

      • Latent semantic indexing

      • Topic modeling

    The topic of this course



    Dimensionality reduction

    • A central problem in text classification is the high dimensionality of the features space.

      • There is one dimension for each unique word found in the collection → this can reach hundreds of thousands

      • Processing is extremely costly in computational terms

      • Most of the words (features) are irrelevant to the categorization task

        How to select/extract relevant features?

        How to evaluate the relevance of the features?



    Two main approaches

    • Feature selection

      • Idea: removal of non-informative words according to corpus statistics

      • Output: subset of original features

    • Main techniques: document frequency, mutual information and information gain (a minimal sketch of the simplest one follows this list)

  • Re-parameterization

    • Idea: combine lower level features (words) into higher-level orthogonal dimensions

    • Output: a new set of features (not words)

    • Main techniques: word clustering and Latent semantic indexing (LSI)

  • Out of the scope of this course
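The sketch referred to above illustrates the simplest feature-selection criterion, document frequency: keep only the terms that occur in at least min_df documents (names and thresholds are illustrative):

```python
def select_by_df(docs, min_df=2):
    vocab = {w for d in docs for w in d}
    df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency per term
    return {t for t in vocab if df[t] >= min_df}         # output: subset of original features

docs = [["game", "team"], ["church", "service"], ["team", "won", "game"]]
print(select_by_df(docs))                                # {'game', 'team'}
```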



    Learning the classification model

    [Figure: documents plotted in a two-dimensional feature space (x1, x2), grouped into categories C1–C4]

    How to learn this?


    What is a classifier?

    • A classifier is a function that maps a document, represented as a vector over the d = |V| term features, to one of the predefined categories; it is learned from the M training documents. (The formal definitions on the original slide are images.)





    Classification algorithms

    • Popular classification algorithms for TC are:

      • K-Nearest Neighbors

        • Example-based approach

      • Centroid-based classification

        • Prototype-based approach

      • Support Vector Machines

        • Kernel-based approach

      • Naïve Bayes

        • Probabilistic approach



    KNN: K-nearest neighbors classifier

    Positive examples

    Negative examples

    1-nearest neighbor classifier



    KNN: K-nearest neighbors classifier

    Positive examples

    Negative examples

    3-nearest neighbors classifier



    KNN: K-nearest neighbors classifier

    Positive examples

    Negative examples

    5-nearest neighbors classifier



    KNN – the algorithm

    • Given a new document d:

      • Find the k most similar documents from the training set.

        • Common similarity measures are the cosine similarity and the Dice coefficient.

      • Assign the class to d by considering the classes of its k nearest neighbors

        • Majority voting scheme

        • Weighted-sum voting scheme
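The algorithm above can be sketched in a few lines of Python (cosine similarity and majority voting; the toy training vectors and names are illustrative):

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, training, k=3):
    """training: list of (document vector, class) pairs."""
    neighbors = sorted(training, key=lambda ex: cosine(d, ex[0]), reverse=True)[:k]
    return Counter(c for _, c in neighbors).most_common(1)[0][0]   # majority voting

training = [([1, 0, 2], "sports"), ([0, 3, 1], "religion"), ([2, 0, 1], "sports")]
print(knn_classify([1, 0, 1], training, k=3))                      # 'sports'
```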



    Common similarity measures

    • Dice coefficient

    • Cosine measure

      wki indicates the weight of word k in document i
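The formulas on this slide appear as images; in the slide's notation, commonly used formulations (the Dice coefficient in particular has several variants) are:

\[
\cos(d_i, d_j) = \frac{\sum_k w_{ki}\, w_{kj}}{\sqrt{\sum_k w_{ki}^2}\;\sqrt{\sum_k w_{kj}^2}}
\qquad
\mathrm{Dice}(d_i, d_j) = \frac{2 \sum_k w_{ki}\, w_{kj}}{\sum_k w_{ki}^2 + \sum_k w_{kj}^2}
\]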



    Selection of K

    How to select a good value for K?



    Decision surface of KNN

    http://clopinet.com/CLOP

    K=1



    Decision surface of KNN

    http://clopinet.com/CLOP

    K=2



    Decision surface of KNN

    http://clopinet.com/CLOP

    K=5



    Decision surface of KNN

    http://clopinet.com/CLOP

    K=10



    The weighted-sum voting scheme

    Other alternatives for computing the weights?
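The formula on this slide is an image; one common weighted-sum scheme scores each class by the summed similarity of the neighbors that belong to it:

\[
\mathrm{score}(d, c) = \sum_{d_k \in kNN(d),\ \mathrm{class}(d_k) = c} \mathrm{sim}(d, d_k)
\]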



    KNN - comments

    • One of the best-performing text classifiers.

    • It is robust in the sense of not requiring the categories to be linearly separable.

    • The major drawback is the computational effort during classification.

    • Another limitation is that its performance is primarily determined by the choice of k as well as the distance metric applied.



    Linear models

    • Classification of DNA micro-arrays

    [Figure: data points labeled Cancer / No Cancer in a two-dimensional space (x1, x2); unlabeled points are marked with “?”]



    Support vector machines (SVM)

    • A binary SVM classifier can be seen as a hyperplane in the feature space separating the points that represent the positive from negative instances.

      • SVMs select the hyperplane that maximizes the margin around it.

      • Hyperplanes are fully determined by a small subset of the training instances, called the support vectors.

    [Figure: maximum-margin separating hyperplane; the training instances lying on the margin are the support vectors]



    Support vector machines (SVM)

    When data are linearly separable we have the following optimization problem (shown on the original slide as an image):
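Filling this in with the standard hard-margin formulation (a reconstruction, since the slide's equations are not in the transcript):

\[
\min_{\mathbf{w}, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1,\qquad i = 1, \dots, M
\]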



    Non-linear SVMs (on the inputs)

    • What about classes whose training instances are not linearly separable?

      • The original input space can always be mapped to some higher-dimensional feature space where the training set is separable.

        • A kernel function is some function that corresponds to an inner product in some expanded feature space.



    Decision surface of SVMs

    http://clopinet.com/CLOP

    Linear support vector machine



    Decision surface of SVMs

    http://clopinet.com/CLOP

    Non-linear support vector machine



    SVM – discussion

    • The support vector machine (SVM) algorithm is very fast and effective for text classification problems.

      • Flexibility in choosing a similarity function

        • By means of a kernel function

      • Sparseness of solution when dealing with large data sets

        • Only support vectors are used to specify the separating hyperplane

      • Ability to handle large feature spaces

        • Complexity does not depend on the dimensionality of the feature space
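As a practical aside (not part of the original slides), a linear SVM text classifier of the kind discussed here can be trained in a few lines with scikit-learn, assuming that library is available; the corpus and labels below are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = ["the team won the game", "mass was held at the church",
         "stocks fell sharply today", "the striker scored twice"]
labels = ["sports", "religion", "economy", "sports"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())    # BOW with tf-idf weights + linear SVM
clf.fit(texts, labels)
print(clf.predict(["the priest gave a sermon"]))
```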



    Sec.13.2

    Naïve Bayes

    • It is the simplest probabilistic classifier used to classify documents

      • Based on the application of the Bayes theorem

    • Builds a generative model that approximates how data is produced

      • Uses prior probability of each category given no information about an item

      • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

    A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes. Multinomial Naive Bayes for Text Categorization Revisited. Australian Conference on Artificial Intelligence 2004: 488-499



    Naïve Bayes

    • Bayes' theorem, and why it holds (the formulas on the original slide appear as images):
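In standard notation (reconstructing the formulas shown as images on the slide):

\[
P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)}
\]

It holds because, by the product rule, \(P(c_j \mid d)\,P(d) = P(c_j, d) = P(d \mid c_j)\,P(c_j)\); dividing both sides by \(P(d)\) gives the theorem.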



    Sec.13.2

    Naïve Bayes

    • For a document d and a class cj

    [Figure: Naïve Bayes graphical model — the class node C with the term nodes t1, t2, …, t|V| as its children]

    • Assuming terms are independent of each other given the class (naïve assumption)

    • Assuming each document is equally probable



    Sec.13.2

    Bayes’ Rule for text classification

    • For a document d and a class cj



    Sec.13.2

    Bayes’ Rule for text classification

    • For a document d and a class cj

    • Estimation of probabilities

    Smoothing to avoid zero-values

    Prior probability of class cj

    Probability of occurrence of word ti in class cj
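Written out (the estimates on the slide are images; the add-one/Laplace smoothing shown here is one common choice):

\[
P(c_j) = \frac{\#\{\text{training documents of class } c_j\}}{\#\{\text{training documents}\}}
\qquad
P(t_i \mid c_j) = \frac{1 + n(t_i, c_j)}{|V| + \sum_{t \in V} n(t, c_j)}
\]

where \(n(t, c_j)\) is the number of occurrences of term \(t\) in the documents of class \(c_j\).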



    Naïve Bayes classifier

    • Assignment of the class:

    • Assignment using underflow prevention:

      • Multiplying lots of probabilities can result in floating-point underflow

      • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities
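A compact multinomial Naïve Bayes sketch in Python, using Laplace smoothing and summing log-probabilities exactly as described above (toy data and names are illustrative):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(d)
    loglik = {c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab)))
                  for w in vocab} for c in classes}                 # Laplace smoothing
    return prior, loglik

def classify_nb(doc, prior, loglik):
    # sum logs of probabilities instead of multiplying, to prevent underflow
    scores = {c: prior[c] + sum(loglik[c].get(w, 0.0) for w in doc) for c in prior}
    return max(scores, key=scores.get)

docs = [["game", "team", "won"], ["church", "service"], ["team", "game"]]
labels = ["sports", "religion", "sports"]
prior, loglik = train_nb(docs, labels)
print(classify_nb(["game", "won"], prior, loglik))                  # 'sports'
```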



    Comments on NB classifier

    • Very simple classifier which works very well on numerical and textual data

    • Very easy to implement and computationally cheap when compared to other classification algorithms.

    • One of its major limitations is that it performs very poorly when features are highly correlated.

    • Concerning text classification, it fails to consider the frequency of word occurrences in the feature vector.



    Evaluation of text classification

    • What to evaluate?

    • How to carry out this evaluation?

      • Which elements (information) are required?

    • How to know which is the best classifier for a given task?

      • Which things are important to perform a fair comparison?



    Evaluation – general ideas

    • Performance of classifiers is evaluated experimentally

    • Requires a document set labeled with categories.

      • Divided into two parts: training and test sets

      • Usually, the test set is the smaller of the two

    • A method to smooth out the variations in the corpus is the n-fold cross-validation.

      • The whole document collection is divided into n equal parts, and then the training-and-testing process is run n times, each time using a different part of the collection as the test set. Then the results for n folds are averaged.
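A minimal sketch of the n-fold procedure described above (the toy data and the trivial majority-class "classifier" are only there to make the example self-contained):

```python
from collections import Counter

def majority_train(train):                       # stand-in for a real learning algorithm
    majority = Counter(c for _, c in train).most_common(1)[0][0]
    return lambda doc: majority

def cross_validate(data, n=5):
    accuracies = []
    for i in range(n):
        test  = [ex for j, ex in enumerate(data) if j % n == i]   # i-th part as the test set
        train = [ex for j, ex in enumerate(data) if j % n != i]
        predict = majority_train(train)
        accuracies.append(sum(predict(d) == c for d, c in test) / len(test))
    return sum(accuracies) / n                   # average the results over the n folds

data = [("doc%d" % k, "sports" if k % 2 else "economy") for k in range(20)]
print(cross_validate(data, n=5))
```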


    Evaluation of text classification

    [Figure: document–term matrix of M documents by N = |V| terms, split row-wise into a training subset and a test subset]

    • The available data is divided into two subsets:

      • Training (m1)

        • used for the construction (learning) of the classifier

      • Test (m2)

        • Used for the evaluation of the classifier



    Performance metrics

    • Considering a binary problem

    • Recall for a category is defined as the percentage of correctly classified documents among all documents belonging to that category; precision is the percentage of correctly classified documents among all documents that were assigned to the category by the classifier.

      What happens if there are more than two classes?

                          Label YES    Label NO
      Classifier YES          a            b
      Classifier NO           c            d



    Micro and macro averages

    • Macroaveraging: Compute performance for each category, then average.

      • Gives equal weights to all categories

    • Microaveraging: Compute totals of a, b, c and d for all categories, and then compute performance measures.

      • Gives equal weights to all documents

        Is the selection of the averaging strategy important?

        What happens if we do very badly on the minority class?
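In code, with per-class counts a (documents correctly assigned to the class) and b (documents wrongly assigned to it) taken from the confusion matrix above, the two averaging strategies for precision look like this (illustrative numbers; note how the badly handled minority class drags the macro average down):

```python
def macro_precision(per_class):                  # per_class: {class: (a, b)}
    precisions = [a / (a + b) if a + b else 0.0 for a, b in per_class.values()]
    return sum(precisions) / len(precisions)     # equal weight to every category

def micro_precision(per_class):
    a_total = sum(a for a, _ in per_class.values())
    b_total = sum(b for _, b in per_class.values())
    return a_total / (a_total + b_total)         # equal weight to every document

counts = {"sports": (90, 10), "religion": (5, 5), "economy": (40, 10)}
print(macro_precision(counts), micro_precision(counts))   # ~0.73 vs ~0.84
```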



    References

    • G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289—1305, 2003

    • H. Liu, H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, CRC, 2008.

    • Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proc. of the 14th International Conference on Machine Learning, pp. 412—420, 1997.

    • D. Mladenic, M. Grobelnik. Feature Selection for Unbalanced Class Distribution and Naïve Bayes. Proc. of the 16th Conference on Machine Learning, pp. 258—267, 1999.

    • I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer, 2006.

    • I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, Vol. 3:1157—1182, 2003.

    • M. Lan, C. Tan, H. Low, S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. Proc. of WWW, pp. 1032—1033, 2005.


    Novel representations and methods in text classification

    Questions?

