Text classification day 35

Text classification Day 35

LING 681.02

Computational Linguistics

Harry Howard

Tulane University


Course organization

  • http://www.tulane.edu/~ling/NLP/

LING 681.02, Prof. Howard, Tulane University


Learning to classify text

NLPP §6


Classification

  • What is it?

  • Supervision

    • A classifier is supervised if it is built on training corpora containing the correct label for each input.

      • This usually means that the program can calculate an error when the predicted label does not match the correct label.

    • A classifier is unsupervised if it is built on training corpora that do not contain the correct label for each input.

      • There is no way to calculate an error.
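A minimal sketch of the error calculation that supervision makes possible: compare each predicted label with the correct one and report the fraction that differ. The function name `error_rate` and the sample labels are made up for illustration.

```python
def error_rate(predicted, gold):
    """Fraction of inputs whose predicted label differs from the correct label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one labelled input")
    mistakes = sum(1 for p, g in zip(predicted, gold) if p != g)
    return mistakes / len(gold)


# Hypothetical labels for four inputs; one prediction is wrong.
gold = ["female", "male", "male", "female"]
predicted = ["female", "male", "female", "female"]
print(error_rate(predicted, gold))
```

Without the `gold` list, which is exactly what an unsupervised setting lacks, this function has nothing to compare against.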



Diagram of supervised classification



Philosophical question

  • Does supervised classification work for the majority of stuff that you learned spontaneously as a child?

  • NO, life does not come neatly labelled.



Algorithm

  • Divide the corpus into three sets:

    • training set

    • test set

    • development (dev-test) set

  • Choose an initial set of features that will be used to classify the corpus.

    • The part of the program that looks for the features in the corpus is called a feature extractor.

  • Train the classifier on the training set.

  • Run it on the development set.

  • Refine the feature extractor based on the errors produced on the development set.

  • Run the improved classifier on the test set.
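The first step above can be sketched as follows. This is only one possible way to split a corpus; the function name `split_corpus`, the 10% fractions, the fixed seed, and the toy corpus are all assumptions made for illustration.

```python
import random


def split_corpus(labeled_items, test_fraction=0.1, devtest_fraction=0.1, seed=0):
    """Shuffle labelled items and split them into training, dev-test, and test sets."""
    items = list(labeled_items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * devtest_fraction)
    test_set = items[:n_test]
    devtest_set = items[n_test:n_test + n_dev]
    train_set = items[n_test + n_dev:]
    return train_set, devtest_set, test_set


# Hypothetical corpus: 100 (input, label) pairs.
corpus = [("item%d" % i, "even" if i % 2 == 0 else "odd") for i in range(100)]
train_set, devtest_set, test_set = split_corpus(corpus)
print(len(train_set), len(devtest_set), len(test_set))
```

The test set is held back until the very end so that the final score reflects inputs that played no part in refining the features.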



Choosing the right features

  • Use too few, and the data will be underfitted.

    • The classifier is too vague and makes too many mistakes.

  • Use too many, and the data will be overfitted.

    • The classifier is too specific and will not generalize to new examples.
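To make the contrast concrete, here is a sketch of two feature extractors for personal names, one very small and one very large. Both function names and the choice of letter-based features are illustrative assumptions, not a prescribed feature set.

```python
import string


def few_features(name):
    """A small feature set: only the final letter of the name."""
    return {"last_letter": name[-1].lower()}


def many_features(name):
    """A large feature set: first/last letter plus a count and a flag per letter."""
    lowered = name.lower()
    features = {"first_letter": lowered[0], "last_letter": lowered[-1]}
    for letter in string.ascii_lowercase:
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = letter in lowered
    return features


# Compare how many features each extractor produces for one name.
print(len(few_features("John")), len(many_features("John")))
```

With a small training set, the larger extractor gives the classifier many more chances to latch onto accidents of the training data.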



Example: gender id

  • What would the features be?

    • A female name is likely to end in a, e, or i.

    • A male name is likely to end in k, o, r, s, or t.

  • Explain how classification would work.

  • NLTK code pp. 223-4.
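The following is a pure-Python sketch of how such a classifier could work. It is not the NLTK code from the book: it swaps in a simple majority-vote rule per final letter, and the training names, the function names, and the "unknown" fallback are all hypothetical.

```python
from collections import Counter, defaultdict


def gender_features(name):
    """Feature extractor: represent a name by its final letter."""
    return {"last_letter": name[-1].lower()}


def train(labeled_names):
    """Count labels per final letter and remember the majority label for each."""
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        counts[gender_features(name)["last_letter"]][label] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def classify(model, name, default="unknown"):
    """Predict the majority label seen in training for the name's final letter."""
    return model.get(gender_features(name)["last_letter"], default)


# Hypothetical training data, chosen to match the letter patterns listed above.
training = [("Anna", "female"), ("Julie", "female"), ("Naomi", "female"),
            ("Maria", "female"), ("Mark", "male"), ("Pedro", "male"),
            ("Peter", "male"), ("James", "male"), ("Robert", "male")]
model = train(training)
print(classify(model, "Sophia"), classify(model, "Victor"))
```

A name whose final letter never occurred in training falls back to the default label, which is one visible cost of relying on a single feature.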



More examples

  • Classify movie reviews as positive or negative.

    • How?

  • Classify POS of words.

    • How?
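One possible answer to both "How?" questions is to pick features that a classifier can learn from: the presence of particular words for reviews, and word endings for POS. The sketch below shows only the feature extractors; the vocabulary, the suffix list, and the sample review are made-up assumptions.

```python
def document_features(document_words, vocabulary):
    """Movie reviews: one boolean feature per vocabulary word."""
    present = {w.lower() for w in document_words}
    return {"contains(%s)" % w: (w in present) for w in vocabulary}


def suffix_features(word, suffixes):
    """POS of words: one boolean feature per candidate suffix."""
    lowered = word.lower()
    return {"endswith(%s)" % s: lowered.endswith(s) for s in suffixes}


# Hypothetical vocabulary and suffix list; in practice these would
# typically be the most frequent ones found in the corpus.
vocabulary = ["great", "boring", "awful"]
suffixes = ["ing", "ed", "ly"]
review = "A great cast but a boring plot".split()
print(document_features(review, vocabulary))
print(suffix_features("running", suffixes))
```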



Beyond the word

  • Look at a word's context.

    • As we have seen, this is crucial to POS tagging.

  • Classify instant messages (IMs) by the dialogue acts that they instantiate.

    • What could be some such acts?

    • statement, emotion, yes-no question

    • How?

  • Recognizing textual entailment

    • … is the task of determining whether a given piece of text T entails another text called the "hypothesis".

    • How?
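As a sketch of what looking beyond the word can mean in practice, the first extractor below takes the whole sentence plus a position, so it can use the previous word as a feature; the second treats an instant message as a bag of words and punctuation marks. The function names, feature names, and the "<START>" marker are illustrative assumptions.

```python
import re


def pos_features(sentence, i):
    """Features for the word at position i, including the previous word."""
    word = sentence[i]
    return {
        "suffix(1)": word[-1:],
        "suffix(2)": word[-2:],
        "suffix(3)": word[-3:],
        "prev-word": sentence[i - 1] if i > 0 else "<START>",
    }


def dialogue_act_features(post):
    """Features for an instant message: which words and end marks it contains."""
    return {"contains(%s)" % tok: True
            for tok in re.findall(r"\w+|[?!.]", post.lower())}


sentence = ["They", "refuse", "to", "permit", "us"]
print(pos_features(sentence, 3))
print(dialogue_act_features("Are you there?"))
```

Here the preceding "to" is the kind of contextual cue that a word-only extractor could not see, and the question mark is a plausible cue for a yes-no question.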



RTE example

  • T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.

  • H: China is a member of SCO.
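One crude way to turn this pair into features is to count how many hypothesis words also occur in the text and how many do not. The sketch below is a word-overlap heuristic under simplifying assumptions (lowercasing, alphabetic tokens only), not a full entailment system; the function and feature names are made up for illustration.

```python
import re


def tokens(text):
    """Lowercase a text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rte_features(text, hypothesis):
    """Count hypothesis words that are, and are not, found in the text."""
    text_words, hyp_words = tokens(text), tokens(hypothesis)
    return {
        "word_overlap": len(hyp_words & text_words),
        "word_hyp_extra": len(hyp_words - text_words),
    }


T = ("Parviz Davudi was representing Iran at a meeting of the Shanghai "
     "Co-operation Organisation (SCO), the fledgling association that binds "
     "Russia, China and four former Soviet republics of central Asia "
     "together to fight terrorism.")
H = "China is a member of SCO."
print(rte_features(T, H))
```

Four of the six hypothesis words ("china", "a", "of", "sco") occur in T, while "is" and "member" do not. A classifier trained on many such pairs could learn how much weight to give each count.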



Next time

Finish NLPP §6

Go on to NLPP §7

Extracting info from text

