SIMS 290-2: Applied Natural Language Processing

Barbara Rosario

Sept 27, 2004


Today

  • Classification

    • Text categorization (and other applications)

  • Various issues regarding classification

    • Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification…

  • Introduce the steps necessary for a classification task

    • Define classes

    • Label text

    • Features

    • Training and evaluation of a classifier


Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                   Object     Categories
Tagging                   Word       POS
Sense Disambiguation      Word       The word's senses
Information retrieval     Document   Relevant/not relevant
Sentiment classification  Document   Positive/negative
Author identification     Document   Authors

From: Foundations of Statistical Natural Language Processing. Manning and Schütze


Author identification

  • They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together.

  • Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar.


Author identification

  • Jane Austen (1775-1817), Pride and Prejudice

  • Charles Dickens (1812-70), Bleak House


Author identification

  • Federalist papers

    • 77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym

    • The authorship of 12 papers was in dispute (the "disputed papers")

    • In 1964, Mosteller and Wallace* solved the problem

    • They identified 70 function words as good candidates for authorship analysis

    • Using statistical inference they concluded the author was Madison

Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
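A minimal sketch of the idea in Python (not Mosteller and Wallace's actual Bayesian inference; the tiny word list and helper names here are illustrative): represent each author by the relative frequencies of a few function words, then attribute a disputed text to the author with the nearest profile.

    # Illustrative subset; Mosteller and Wallace used 70 function words
    FUNCTION_WORDS = ["upon", "whilst", "there", "on", "while", "by"]

    def function_word_profile(text):
        """Relative frequency of each function word in the text."""
        tokens = text.lower().split()
        n = len(tokens) or 1
        return [tokens.count(w) / n for w in FUNCTION_WORDS]

    def closest_author(profiles, disputed_text):
        """profiles: {author: profile vector}; returns the author whose
        function-word profile is nearest (squared Euclidean distance)."""
        v = function_word_profile(disputed_text)
        def dist(p):
            return sum((a - b) ** 2 for a, b in zip(p, v))
        return min(profiles, key=lambda author: dist(profiles[author]))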




Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                  Object     Categories
Author identification    Document   Authors
Language identification  Document   Language

From: Foundations of Statistical Natural Language Processing. Manning and Schütze


Language identification

  • Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.

  • Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.

    Universal Declaration of Human Rights, UN, in 363 languages


Language identification

  • égaux

  • eguali

  • iguales

  • edistämään

  • Ü

  • ¿
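Cues like these suggest a simple recipe. One common approach (an assumption here, not stated on the slides) is character n-gram matching: build a profile of frequent character trigrams per language, then assign a new text to the language whose profile it overlaps most.

    from collections import Counter

    def char_trigrams(text):
        """Counts of character trigrams, with padding spaces at the edges."""
        text = " " + text.lower() + " "
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def identify(profiles, text):
        """profiles: {language: trigram Counter}; pick the best-overlapping language."""
        grams = char_trigrams(text)
        def overlap(profile):
            return sum(min(count, profile[g]) for g, count in grams.items())
        return max(profiles, key=lambda lang: overlap(profiles[lang]))

    # profiles would be built from sample text per language, e.g.
    # profiles = {"it": char_trigrams(italian_sample), "de": char_trigrams(german_sample)}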


Classification

Goal: Assign ‘objects’ from a universe to two or more classes or categories

Examples:

Problem                  Object     Categories
Author identification    Document   Authors
Language identification  Document   Language
Text categorization      Document   Topics

From: Foundations of Statistical Natural Language Processing. Manning and Schütze


Text categorization

  • Topic categorization: classify the document into semantic topics


Text categorization

  • http://news.google.com/

  • Reuters

    • A collection of 21,578 newswire documents

    • For research purposes: a standard text collection for comparing systems and algorithms

    • 135 valid topic categories
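For reference, today's NLTK ships a Reuters corpus reader; a sketch of inspecting it (this uses the modern nltk.corpus API, not the 2004 NLTK interface shown later in these slides, and NLTK's ApteMod distribution keeps 90 categories rather than 135):

    import nltk
    nltk.download("reuters")  # one-time corpus download

    from nltk.corpus import reuters

    print(len(reuters.fileids()))     # number of documents
    print(len(reuters.categories()))  # number of topic categories
    fid = reuters.fileids()[0]
    print(fid, reuters.categories(fid))  # topics assigned to one document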


Reuters

  • Top topics in Reuters


Reuters

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>


Text categorization: examples

  • Topic categorization

    • http://news.google.com/

    • Reuters.

  • Spam filtering

    • Determine if a mail message is spam (or not)

  • Customer service message classification


Classification vs. Clustering

  • Classification assumes labeled data: we know how many classes there are, and we have labeled examples for each class.

  • Classification is supervised

  • In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are

  • Clustering is unsupervised
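A modern illustration of the contrast (this uses scikit-learn, which postdates these slides; the points are made up):

    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = [[1.0, 1.2], [0.8, 1.0], [3.1, 2.9], [3.0, 3.2]]  # four points in 2-D
    y = ["Class1", "Class1", "Class2", "Class2"]          # known labels

    # Supervised: learn from labeled examples, then predict
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[0.9, 1.1]]))   # -> ['Class1']

    # Unsupervised: no labels, just look for natural groupings
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                  # arbitrary cluster ids, e.g. [0 0 1 1]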


[Figure slides: scatter plots of two point clouds, Class1 and Class2; first unlabeled (clustering), then labeled with a separating boundary (classification)]
Categories (Labels, Classes)

  • Labeling data

  • Two problems:

  • Decide the possible classes (which ones, how many)

    • Domain and application dependent

    • http://news.google.com

  • Label text

    • Difficult and time-consuming; annotators can be inconsistent


Reuters

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

Why not topic = policy?


Binary vs. multi-way classification

  • Binary classification: two classes

  • Multi-way classification: more than two classes

  • Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class (sketched below)
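A sketch of that one-vs-rest reduction in Python (train_binary and score stand in for any binary learner; they are hypothetical names, not a particular library's API):

    def train_one_vs_rest(train_binary, examples):
        """examples: list of (features, label); returns {label: binary model}."""
        models = {}
        for label in {lab for _, lab in examples}:
            # relabel the data: this class vs. everything else
            binary_data = [(x, lab == label) for x, lab in examples]
            models[label] = train_binary(binary_data)
        return models

    def predict(models, score, x):
        """Pick the class whose one-vs-rest model is most confident about x."""
        return max(models, key=lambda label: score(models[label], x))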


Flat vs. Hierarchical classification

  • Flat classification: the relations between the classes are left undetermined

  • Hierarchical classification: the classes form a hierarchy in which each node is a sub-class of its parent node


Single- vs. multi-category classification

  • In single-category text classification each text belongs to exactly one category

  • In multi-category text classification, each text can have zero or more categories


LabeledText class in NLTK

  • The LabeledText class:

  • >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."

  • >>> label = "sport"

  • >>> labeled_text = LabeledText(text, label)

  • >>> labeled_text.text()

  • "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."

  • >>> labeled_text.label()

  • "sport"


NLTK: The Classifier Interface

  • classify determines which label is most appropriate for a given text token and returns a labeled text token with that label.

  • labels returns the list of category labels used by the classifier.

  • >>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")

  • >>> my_classifier.classify(token)

    "The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment."/health

    >>> my_classifier.labels()

  • ("sport", "health", "world", …)


Features

  • >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."

  • >>> label = "sport"

  • >>> labeled_text = LabeledText(text, label)

  • Here the classifier takes the whole string as input

  • What’s the problem with that?

  • What are the features that could be useful for this example?


Feature terminology

  • Feature: An aspect of the text that is relevant to the task

  • Some typical features

    • Words present in text

    • Frequency of words

    • Capitalization

    • Are there named entities (NEs)?

    • WordNet

    • Others?


Feature terminology

  • Feature: An aspect of the text that is relevant to the task

  • Feature value: the realization of the feature in the text

    • Words present in text: Kerry, Schumacher, China…

    • Frequency of words: Kerry (10), Schumacher (1)…

    • Are there dates? Yes/no

    • Are there PERSONS? Yes/no

    • Are there ORGANIZATIONS? Yes/no

    • WordNet: Holonyms (China is part of Asia), Synonyms (China, People's Republic of China, mainland China)


Feature Types

  • Boolean (or Binary) Features

  • Features that generate boolean (binary) values.

  • Boolean features are the simplest and the most common type of feature.

    • f1(text) = 1 if text contains "Kerry", 0 otherwise

    • f2(text) = 1 if text contains a PERSON, 0 otherwise


Feature Types

  • Integer Features

  • Features that generate integer values.

  • Integer features can be used to give classifiers access to more precise information about the text.

    • f1(text) = Number of times the text contains "Kerry"

    • f2(text) = Number of times the text contains a PERSON
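The two feature types written as plain Python functions (a sketch; a real PERSON check would need a named-entity tagger, so a fixed name list stands in here):

    def contains_kerry(text):
        """Boolean feature: 1 if the text mentions "Kerry"."""
        return 1 if "Kerry" in text else 0

    def count_kerry(text):
        """Integer feature: how many times "Kerry" occurs."""
        return text.split().count("Kerry")

    def contains_person(text, names=("Kerry", "Schumacher")):
        """Boolean feature; the name list stands in for a real NE tagger."""
        return 1 if any(name in text for name in names) else 0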


Features in NLTK

  • Feature Detectors

    • Features can be defined using feature detector functions, which map LabeledTexts to values

    • Method: detect, which takes a labeled text and returns a feature value.

    • >>> def ball(ltext):

      return ("ball" in ltext.text())

    • >>> fdetector = FunctionFeatureDetector(ball)

    • >>> document1 = "John threw the ball over the fence".split()

    • >>> fdetector.detect(LabeledText(document1))

      1

    • >>> document2 = "Mary solved the equation".split()

    • >>> fdetector.detect(LabeledText(document2))

      0


Feature selection

  • How do we choose the “right” features?

  • Next lecture


Classification

  • Define classes

  • Label text

  • Extract Features

  • Choose a classifier

    • >>> my_classifier.classify(token)

    • The Naive Bayes Classifier

    • NN (perceptron)

    • SVM

    • …. (next Monday)

  • Train it (and test it)

  • Use it to classify new examples
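The same steps end to end, as a compact modern sketch (scikit-learn rather than the NLTK workflow of these slides; the example documents are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Define classes and label the texts
    texts = ["Schumacher wins the Grand Prix", "New treatment for heart disease",
             "Qualifying laps at the Shanghai circuit", "WHO report on disease prevention"]
    labels = ["sport", "health", "sport", "health"]

    vec = CountVectorizer()               # extract word-count features
    X = vec.fit_transform(texts)

    clf = MultinomialNB().fit(X, labels)  # choose and train a classifier

    # Use it to classify a new example
    print(clf.predict(vec.transform(["heart disease prevention"])))  # -> ['health']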


Training

  • (We'll see exactly what we mean by training when we talk about the algorithms)

  • Adaptation of the classifier to the data

  • Usually the classifier is defined by a set of parameters

  • Training is the procedure for finding a “good” set of parameters

  • Goodness is determined by an optimization criterion such as misclassification rate

  • Some classifiers are guaranteed to find the optimal set of parameters


(Linear) Classification

[Figure slides: Class1 and Class2 point clouds with candidate separating lines]

Linear classifier: g(x) = wx + w0; parameters: w, w0

Changing the parameters w, w0 moves the separating line

For each set of parameters w, w0, calculate the error

Choose the classifier with the lowest rate of misclassification
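A sketch of exactly this procedure in Python (random search over (w, w0) keeps the illustration short; real training algorithms search far more cleverly):

    import random

    def g(w, w0, x):
        """Linear classifier g(x) = w·x + w0."""
        return sum(wi * xi for wi, xi in zip(w, x)) + w0

    def error_rate(w, w0, data):
        """data: list of (x, y) with y = +1 for Class1, -1 for Class2."""
        wrong = sum(1 for x, y in data if (g(w, w0, x) > 0) != (y > 0))
        return wrong / len(data)

    def fit(data, trials=1000, seed=0):
        rng = random.Random(seed)
        best = None
        for _ in range(trials):
            w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            w0 = rng.uniform(-1, 1)
            err = error_rate(w, w0, data)   # for each parameter set, calculate error
            if best is None or err < best[0]:
                best = (err, w, w0)         # keep the lowest misclassification
        return best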


Testing, evaluation of the classifier

  • After choosing the parameters of the classifier (i.e., after training it), we need to test how well it does on a test set (data not included in the training set)

  • Calculate misclassification on the test set


Evaluating classifiers

  • Contingency table for the evaluation of a binary classifier (cell counts a, b, c, d):

                        actual GREEN   actual RED
    predicted GREEN          a              b
    predicted RED            c              d

  • Accuracy = (a+d)/(a+b+c+d)

  • Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)

  • Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
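The slide's formulas as code, applied to made-up counts:

    def evaluate(a, b, c, d):
        """a, b, c, d: the four contingency-table counts from the slide."""
        accuracy = (a + d) / (a + b + c + d)
        p_green, p_red = a / (a + b), d / (c + d)   # precision per class
        r_green, r_red = a / (a + c), d / (b + d)   # recall per class
        return accuracy, (p_green, r_green), (p_red, r_red)

    print(evaluate(a=40, b=10, c=5, d=45))
    # -> (0.85, (0.8, 0.888...), (0.9, 0.818...))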


Training size

  • The more the better! (usually)

  • Results for text classification*

[Figure slides: text classification results at increasing training-set sizes]

*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang


Training Size

  • Author identification

Authorship Attribution: A Comparison of Three Methods, Matthew Care


Next Time and Upcoming

  • Define classes

  • Label text

  • Features (Wednesday)

  • Classifiers (next week)

    • The Naive Bayes Classifier

    • NN (perceptron)

    • SVM

    • Decision trees

    • K nearest neighbor

    • Maximum Entropy models