INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Presentation Transcript


  1. INTRODUCTION TO ARTIFICIAL INTELLIGENCE. Massimo Poesio. LECTURE 14: Text categorization with Decision Trees and Naïve Bayes

  2. REMINDER: DECISION TREES • A DECISION TREE is a classifier in the form of a tree structure, where each node is either a: • Leaf node - indicates the value of the target attribute (class) of examples, or • Decision node - specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test. • A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.

  3. Decision Tree Example. Goal: learn when we can play Tennis and when we cannot
     Day  Outlook   Temp. Humidity  Wind    Play Tennis
     D1   Sunny     Hot   High      Weak    No
     D2   Sunny     Hot   High      Strong  No
     D3   Overcast  Hot   High      Weak    Yes
     D4   Rain      Mild  High      Weak    Yes
     D5   Rain      Cool  Normal    Weak    Yes
     D6   Rain      Cool  Normal    Strong  No
     D7   Overcast  Cool  Normal    Weak    Yes
     D8   Sunny     Mild  High      Weak    No
     D9   Sunny     Cool  Normal    Weak    Yes
     D10  Rain      Mild  Normal    Strong  Yes
     D11  Sunny     Mild  Normal    Strong  Yes
     D12  Overcast  Mild  High      Strong  Yes
     D13  Overcast  Hot   Normal    Weak    Yes
     D14  Rain      Mild  High      Strong  No

  4. Decision Tree for PlayTennis
     Outlook?
       Sunny    -> Humidity?  (High -> No, Normal -> Yes)
       Overcast -> Yes
       Rain     -> Wind?      (Strong -> No, Weak -> Yes)

  5. Decision Tree for PlayTennis
     • Each internal node tests an attribute
     • Each branch corresponds to an attribute value
     • Each leaf node assigns a classification
     (Partial tree shown on the slide: Outlook? with branches Sunny -> Humidity? (High -> No, Normal -> Yes), Overcast, Rain)

  6. Decision Tree for PlayTennis
     Outlook?
       Sunny    -> Humidity?  (High -> No, Normal -> Yes)
       Overcast -> Yes
       Rain     -> Wind?      (Strong -> No, Weak -> Yes)
     New instance: Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak, PlayTennis=?
     Following the tree (Sunny branch, then Humidity=High) gives the answer: No
     www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
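
To make the classification step concrete, here is a minimal sketch (the dictionary encoding and function name are illustrative choices, not from the slides) that hard-codes the PlayTennis tree above and walks it from the root to a leaf:

```python
# Minimal sketch: the PlayTennis decision tree above, hard-coded as nested dicts.
# Internal nodes are {attribute: {value: subtree}}; leaves are "Yes"/"No" strings.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No",   "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind":     {"Strong": "No", "Weak":   "Yes"}},
    }
}

def classify(tree, instance):
    """Walk the tree from the root until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                    # attribute tested at this node
        tree = tree[attribute][instance[attribute]]     # follow the branch for its value
    return tree

# The query instance from the slide: Sunny, Hot, High, Weak -> ?
instance = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(play_tennis_tree, instance))             # -> "No"
```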

  7. TEXT CLASSIFICATION WITH DT • As an example of an actual application of decision trees, we’ll consider the problem of TEXT CLASSIFICATION

  8. IS THIS SPAM? From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm =================================================

  9. TEXT CATEGORIZATION • Given: • A description of an instance, x ∈ X, where X is the instance language or instance space. • Issue: how to represent text documents. • A fixed set of categories: C = {c1, c2, …, cn} • Determine: • The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. • We want to know how to build categorization functions (“classifiers”).

  10. Document Classification
      Test document: “planning language proof intelligence”
      Class groups: (AI) (Programming) (HCI)
      Classes: ML, Planning, Semantics, Garb.Coll., Multimedia, GUI
      Training data, e.g.:
        ML:         learning intelligence algorithm reinforcement network ...
        Planning:   planning temporal reasoning plan language ...
        Semantics:  programming semantics language proof ...
        Garb.Coll.: garbage collection memory optimization region ...
      (Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)

  11. Text Categorization Examples Assign labels to each document or web-page: • Labels are most often topics such as Yahoo-categories e.g., “finance”, “sports”, “news>world>asia>business” • Labels may be genres e.g., “editorials”, “movie-reviews”, “news” • Labels may be opinion e.g., “like”, “hate”, “neutral” • Labels may be domain-specific binary e.g., “interesting-to-me” : “not-interesting-to-me”, e.g., “spam” : “not-spam”, e.g., “is a toner cartridge ad” : “isn’t”

  12. TEXT CATEGORIZATION WITH DT • Build a separate decision tree for each category • Use WORD COUNTS as features
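
As a hedged illustration of this setup (not code from the lecture), the sketch below uses scikit-learn, a tiny invented corpus, and a single “earn”-style category: documents are turned into word-count vectors and one binary decision tree is trained for that one category.

```python
# Sketch: one binary decision tree per category, with word counts as features.
# scikit-learn and the toy corpus are illustrative assumptions, not from the lecture.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_docs = [
    "profit rose as quarterly earnings beat forecasts",
    "net profit and dividend up for the quarter",
    "wheat and corn exports fell after the drought",
    "grain shipments delayed at the port",
]
is_earn = [1, 1, 0, 0]                      # binary labels for ONE category ("earn" vs. not)

vectorizer = CountVectorizer()              # document -> vector of word counts
X = vectorizer.fit_transform(train_docs)

tree = DecisionTreeClassifier(random_state=0)   # one tree for this single category
tree.fit(X, is_earn)

test = vectorizer.transform(["quarterly profit climbed again"])
print(tree.predict(test))                   # expected: [1], i.e. the "earn" category
```

For the full Reuters setting discussed next, this would simply be repeated once per category, giving 118 independent binary trees.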

  13. Reuters Data Set (21578 - ModApte split)
      • 9603 training, 3299 test articles; ave. 200 words
      • 118 categories
      • An article can be in more than one category
      • Learn 118 binary category distinctions
      • Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

  14. AN EXAMPLE OF REUTERS TEXT Foundations of Statistical Natural Language Processing, Manning and Schuetze

  15. Decision Tree for Reuter classification Foundations of Statistical Natural Language Processing, Manning and Schuetze

  16. OTHER LEARNING METHODS USED FOR TEXT CLASSIFICATION • Bayesian methods (Naïve Bayes) • Neural nets (e.g., perceptron) • Vector-space methods (k-NN, Rocchio, unsupervised) • SVMs

  17. BAYESIAN METHODS • Learning and classification methods based on probability theory. • Bayes theorem plays a critical role in probabilistic learning and classification. • Build a generative model that approximates how data is produced • Uses prior probability of each category given no information about an item. • Categorization produces a posterior probability distribution over the possible categories given a description of an item.

  18. Bayes’ Rule
      P(C | X) = P(X | C) P(C) / P(X)
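
A tiny worked example of the rule above; all the numbers (spam prior, word likelihoods) are invented purely for illustration.

```python
# Worked example of Bayes' Rule with made-up numbers:
# P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam = 0.2                    # prior probability of the spam class
p_free_given_spam = 0.5         # likelihood of seeing "free" in a spam mail
p_free_given_ham = 0.05         # likelihood of seeing "free" in a non-spam mail

# P("free") by total probability over the two classes
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

posterior = p_free_given_spam * p_spam / p_free
print(round(posterior, 3))      # -> 0.714
```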

  19. Maximum a posteriori Hypothesis
      hMAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h)
      (P(D) is the same for every hypothesis, so it can be dropped from the argmax)

  20. Naive Bayes Classifiers
      Task: Classify a new instance based on a tuple of attribute values (x1, x2, …, xn)
      cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn) = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)

  21. Naïve Bayes Classifier: Assumptions
      • P(cj): can be estimated from the frequency of classes in the training examples.
      • P(x1, x2, …, xn | cj): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
      • Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.
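
To see why estimating the full P(x1, …, xn | cj) is infeasible, a quick back-of-the-envelope count; the attribute and class sizes below are illustrative assumptions, not from the slides.

```python
# Back-of-the-envelope parameter count: full joint conditional vs. Naive Bayes.
# Illustrative assumptions: n = 30 attributes, k = 3 values each, 2 classes.
n, k, num_classes = 30, 3, 2

full_joint = num_classes * (k ** n)   # one entry per attribute tuple per class
naive_bayes = num_classes * n * k     # one entry per attribute value per class

print(f"full joint table : {full_joint:,} parameters")
print(f"naive Bayes      : {naive_bayes:,} parameters")
```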

  22. The Naïve Bayes Classifier
      (Bayes net on the slide: class Flu with children X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache)
      • Conditional Independence Assumption: features are independent of each other given the class:
        P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)

  23. Learning the Model
      (Bayes net on the slide: class C with children X1 … X6)
      • Common practice: maximum likelihood, i.e. simply use the frequencies in the data:
        P(cj) ≈ N(C = cj) / N
        P(xi | cj) ≈ N(Xi = xi, C = cj) / N(C = cj)
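
A minimal sketch of these maximum-likelihood estimates as frequency counts over a tiny invented symptom dataset (the data and variable names are mine, not from the slides):

```python
from collections import Counter

# Tiny invented dataset: (muscle_ache, class) pairs, one per patient.
data = [("yes", "flu"), ("yes", "flu"), ("no", "flu"),
        ("no", "no-flu"), ("no", "no-flu"), ("yes", "no-flu")]

class_counts = Counter(c for _, c in data)   # N(C = cj)
joint_counts = Counter(data)                 # N(Xi = xi, C = cj)
N = len(data)

# Maximum likelihood: just relative frequencies in the data.
p_flu = class_counts["flu"] / N                                        # P(C = flu)
p_ache_given_flu = joint_counts[("yes", "flu")] / class_counts["flu"]  # P(ache | flu)
print(p_flu, p_ache_given_flu)   # -> 0.5 0.666...
```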

  24. Problem with Max Likelihood
      (Same flu network: X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle ache)
      • What if we have seen no training cases where the patient had no flu but did have muscle aches? Then the estimate P(muscle ache | no flu) is 0.
      • Zero probabilities cannot be conditioned away, no matter the other evidence!

  25. Smoothing to Avoid Overfitting
      • Laplace (add-one) smoothing:
        P(xi,k | cj) ≈ (N(Xi = xi,k, C = cj) + 1) / (N(C = cj) + k), where k is the number of values of Xi
      • Somewhat more subtle version:
        P(xi,k | cj) ≈ (N(Xi = xi,k, C = cj) + m·pi,k) / (N(C = cj) + m), where pi,k is the overall fraction in the data where Xi = xi,k, and m is the extent of “smoothing”
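
A small sketch of the add-one (Laplace) estimate above as a function; the counts passed in correspond to the N(...) quantities, and the toy numbers echo the zero-count case from the previous slide.

```python
def laplace_estimate(count_xi_and_c, count_c, num_values):
    """Add-one smoothed estimate of P(Xi = xi | C = c):
    (N(Xi = xi, C = c) + 1) / (N(C = c) + k), where k = number of values of Xi."""
    return (count_xi_and_c + 1) / (count_c + num_values)

# The zero-count case: no "no-flu" patient with muscle aches among 3 "no-flu" cases.
# The unsmoothed ML estimate would be 0/3 = 0; smoothing keeps it small but non-zero.
print(laplace_estimate(count_xi_and_c=0, count_c=3, num_values=2))   # -> 0.2
```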

  26. Using Naive Bayes Classifiers to Classify Text: Basic method • Attributes are text positions, values are words. • Naive Bayes assumption is clearly violated. • Example? (the word at one position is hardly independent of the words at neighbouring positions) • Still too many possibilities • Assume that classification is independent of the positions of the words • Use same parameters for each position

  27. Text Classification Algorithms: Learning
      • From training corpus, extract Vocabulary
      • Calculate required P(cj) and P(xk | cj) terms
      • For each cj in C do
        • docsj ← subset of documents for which the target class is cj
        • P(cj) ← |docsj| / |total number of documents|
        • Textj ← single document containing all docsj
        • n ← total number of word positions in Textj
        • for each word xk in Vocabulary
          • nk ← number of occurrences of xk in Textj
          • P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)
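
A minimal, self-contained sketch of this learning procedure (toy corpus, function and variable names are mine, not from the slides), using the same add-one smoothed estimate (nk + 1) / (n + |Vocabulary|):

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Learn P(cj) and P(xk | cj) from a list of token lists and their classes."""
    vocabulary = {w for doc in docs for w in doc}
    classes = set(labels)
    priors, cond_prob = {}, defaultdict(dict)

    for cj in classes:
        docs_j = [doc for doc, lab in zip(docs, labels) if lab == cj]
        priors[cj] = len(docs_j) / len(docs)          # P(cj)
        text_j = [w for doc in docs_j for w in doc]   # Textj: all docs of class cj concatenated
        n = len(text_j)                               # total word positions in Textj
        counts = Counter(text_j)
        for xk in vocabulary:                         # nk = occurrences of xk in Textj
            cond_prob[cj][xk] = (counts[xk] + 1) / (n + len(vocabulary))
    return vocabulary, priors, cond_prob

# Toy training corpus (already tokenized), invented for illustration.
train_docs = [["free", "money", "now"], ["meeting", "agenda", "attached"],
              ["win", "money", "free"], ["project", "meeting", "notes"]]
train_labels = ["spam", "ham", "spam", "ham"]
vocabulary, priors, cond_prob = train_naive_bayes(train_docs, train_labels)
print(priors["spam"], round(cond_prob["spam"]["free"], 3))   # -> 0.5 0.2
```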

  28. Text Classification Algorithms: Classifying
      • positions ← all word positions in current document which contain tokens found in Vocabulary
      • Return cNB, where
        cNB = argmax_{cj ∈ C} P(cj) · ∏_{i ∈ positions} P(xi | cj)
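
Continuing the training sketch above, a classify step that returns cNB by multiplying P(cj) with P(xi | cj) over the positions whose tokens appear in the Vocabulary; the next slide explains why, in practice, this product is computed as a sum of logs.

```python
def classify_nb(vocabulary, priors, cond_prob, doc):
    """Return the class cNB maximizing P(cj) * product of P(xi | cj) over known tokens."""
    best_class, best_score = None, -1.0
    for cj, prior in priors.items():
        score = prior
        for xi in doc:
            if xi in vocabulary:          # only positions with tokens found in Vocabulary
                score *= cond_prob[cj][xi]
        if score > best_score:
            best_class, best_score = cj, score
    return best_class

# Using the model learned in the previous sketch:
print(classify_nb(vocabulary, priors, cond_prob, ["free", "money", "please"]))  # -> "spam"
```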

  29. Underflow Prevention • Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. • Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. • Class with highest final un-normalized log probability score is still the most probable.
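
A quick sketch of why the log trick matters: multiplying a few thousand small probabilities underflows to 0.0 in double-precision floating point, while summing their logs stays perfectly representable (the probability value and the count are arbitrary illustrations).

```python
import math

# 2000 word positions, each with a smallish conditional probability of 1e-3.
probs = [1e-3] * 2000

product = 1.0
for p in probs:
    product *= p                              # 1e-6000 is far below the float range
log_sum = sum(math.log(p) for p in probs)     # about -13815.5, no problem to represent

print(product)    # -> 0.0 (floating-point underflow)
print(log_sum)    # -> -13815.51...
```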

  30. Naïve Bayes Posterior Probabilities • Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate. • However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not. • Output probabilities are generally very close to 0 or 1.

  31. READINGS • Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002

  32. REMAINING LECTURES

  34. ACKNOWLEDGMENTS • Several slides come from Chris Manning & Hinrich Schuetze’s course on IR and text classification
