
SIMS 290-2: Applied Natural Language Processing


Presentation Transcript


  1. SIMS 290-2: Applied Natural Language Processing Preslav Nakov Sept 29, 2004

  2. Today • Feature selection • TF.IDF Term Weighting • Term Normalization

  3. Features for Text Categorization • Linguistic features • Words • lowercase? (should we convert to?) • normalized? (e.g. “texts” → “text”) • Phrases • Word-level n-grams • Character-level n-grams • Punctuation • Part of Speech • Non-linguistic features • document formatting • informative character sequences (e.g. &lt;)
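
To make the word- and character-level n-gram features above concrete, here is a minimal sketch (not from the slides) of extracting them from raw text; the tokenizer and example sentence are illustrative only.

    import re
    from collections import Counter

    def word_ngrams(text, n=2, lowercase=True):
        """Word-level n-grams as a frequency Counter."""
        if lowercase:
            text = text.lower()
        tokens = re.findall(r"\w+", text)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def char_ngrams(text, n=3):
        """Character-level n-grams as a frequency Counter."""
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    doc = "Texts about jaguars are texts about cars."
    print(word_ngrams(doc, n=2).most_common(3))
    print(char_ngrams(doc, n=3).most_common(3))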

  4. When Do We Need Feature Selection? • If the algorithm cannot handle all possible features • e.g. language identification for 100 languages using all words • text classification using n-grams • Good features can result in higher accuracy • But! Why feature selection? • What if we just keep all features? • Even the unreliable features can be helpful. • But we need to weight them: • In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection!

  5. Why Feature Selection? • Not all features are equally good! • Bad features: best to remove • Infrequent • unlikely to be met again • co-occurrence with a class can be due to chance • Too frequent • mostly function words • Uniform across all categories • Good features: should be kept • Co-occur with a particular category • Do not co-occur with other categories • The rest: good to keep

  6. Types Of Feature Selection? • Feature selection reduces the number of features • Usually: • Eliminating features • Weighting features • Normalizing features • Sometimes by transforming parameters • e.g. Latent Semantic Indexing using Singular Value Decomposition • Method may depend on problem type • For classification and filtering, may use information from example documents to guide selection

  7. Feature Selection • Task independent methods • Document Frequency (DF) • Term Strength (TS) • Task-dependent methods • Information Gain (IG) • Mutual Information (MI) • χ² statistic (CHI) Empirically compared by Yang & Pedersen (1997)

  8. Yang & Pedersen Experiments • Compared feature selection methods for text categorization • 5 feature selection methods: • DF, MI, CHI, (IG, TS) • Features were just words • 2 classifiers: • kNN: k-Nearest Neighbor (to be covered next week) • LLSF: Linear Least Squares Fit • 2 data collections: • Reuters-22173 • OHSUMED: subset of MEDLINE (1990 & 1991 used)

  9. Document Frequency (DF) DF: number of documents a term appears in • Based on Zipf’s Law • Remove the rare terms: (met 1-2 times) • Non-informative • Unreliable – can be just noise • Not influential in the final decision • Unlikely to appear in new documents • Plus • Easy to compute • Task independent: do not need to know the classes • Minus • Ad hoc criterion • Rare terms can be good discriminators (e.g., in IR) What about the frequent terms? What is a “rare” term?
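
A minimal sketch of DF-based selection, assuming already-tokenized documents; the function name, threshold, and toy corpus are illustrative, not from the slides.

    from collections import Counter

    def df_filter(tokenized_docs, min_df=2):
        """Keep only terms whose document frequency is at least min_df."""
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))            # count each term once per document
        return {term for term, count in df.items() if count >= min_df}

    docs = [["the", "jaguar", "runs"], ["the", "car"],
            ["the", "cat", "runs"], ["a", "jaguar", "car"]]
    print(df_filter(docs, min_df=2))       # {'the', 'jaguar', 'runs', 'car'} (order may vary)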

  10. Examples of Frequent Words: Most Frequent Words in Brown Corpus

  11. Stop Word Removal • Common words from a predefined list • Mostly from closed-class categories: • unlikely to have a new word added • include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles • But also some open-class words like numerals • Bad discriminators • uniformly spread across all classes • can be safely removed from the vocabulary • Is this always a good idea? (e.g. author identification)
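
A sketch of stop-word removal with a tiny hand-made list; a real system would use a larger predefined list (for example the one shipped with NLTK).

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

    def remove_stop_words(tokens):
        """Drop tokens that appear in the stop-word list (case-insensitive)."""
        return [t for t in tokens if t.lower() not in STOP_WORDS]

    print(remove_stop_words(["The", "jaguar", "is", "a", "cat"]))   # ['jaguar', 'cat']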

  12. χ² statistic (CHI) • χ² statistic (pronounced “kai square”) • The most commonly used method of comparing proportions. • Checks whether there is a relationship between being in one of two groups and a characteristic under study. • Example: Let us measure the dependency between a term t and a category c. • the groups would be: • 1) the documents from a category ci • 2) all other documents • the characteristic would be: • “document contains term t”

  13. χ² statistic (CHI) Is “jaguar” a good predictor for the “auto” class? Observed document counts: jaguar & auto = 2, jaguar & ¬auto = 3, ¬jaguar & auto = 500, ¬jaguar & ¬auto = 9500. We want to compare: • the observed distribution above; and • the null hypothesis: that jaguar and auto are independent

  14. χ² statistic (CHI) Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect? • We would have: Pr(j,a) = Pr(j) · Pr(a) • So, there would be: N · Pr(j,a), i.e. N · Pr(j) · Pr(a) • Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500 • Which is: N · (5/N) · (502/N) = 2510/N = 2510/10005 ≈ 0.25 (expected count fe vs. observed count fo)


  17. χ² statistic (CHI) χ² sums (fo − fe)² / fe over all entries of the table, where fo is the observed count and fe the expected count. Here the statistic is about 12.9, so the null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).

  18. χ² statistic (CHI) There is a simpler closed-form formula for χ²: χ²(t,c) = N · (AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A = # documents in c containing t, B = # documents not in c containing t, C = # documents in c without t, D = # documents not in c without t, and N = A + B + C + D
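
A worked check of the jaguar/auto example, computing χ² both as the sum of (fo − fe)²/fe over the 2×2 table and with the closed-form formula above; the counts A=2, B=3, C=500, D=9500 are taken from the slides.

    A, B, C, D = 2, 3, 500, 9500
    N = A + B + C + D

    # Expected counts under independence, in the order (t,c), (t,not c), (not t,c), (not t,not c)
    observed = [A, B, C, D]
    expected = [(A + B) * (A + C) / N, (A + B) * (B + D) / N,
                (C + D) * (A + C) / N, (C + D) * (B + D) / N]
    chi2_sum = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

    # Closed-form version of the same statistic
    chi2_closed = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

    print(round(chi2_sum, 2), round(chi2_closed, 2))   # ~12.85 each; the slide rounds to 12.9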

  19. χ² statistic (CHI) How to use χ² for multiple categories? Compute χ² for each category and then combine: • if we want a term to discriminate well across all categories, we take the expected value of χ²: χ²avg(t) = Σi Pr(ci) · χ²(t, ci) • if it is enough to discriminate well for a single category, we take the maximum: χ²max(t) = maxi χ²(t, ci)

  20. χ² statistic (CHI) • Plus • normalized and thus comparable across terms • χ²(t,c) is 0 when t and c are independent • can be compared to the χ² distribution with 1 degree of freedom • Minus • unreliable for low frequency terms • computationally expensive

  21. Information Gain • A measure of importance of the feature for predicting the presence of the class. • Defined as: • The number of “bits of information” gained by knowing the term is present or absent • Based on Information Theory • We won’t go into this in detail here. • Plus: • sound information theory justification • Minus: • computationally expensive

  22. Information Gain (IG) IG: number of bits of information gained by knowing whether the term is present or absent: IG(t) = H(c) − Pr(t) · H(c|t) − Pr(¬t) · H(c|¬t), where t is the term being scored, c is the class variable, H(c) is its entropy, and H(c|t), H(c|¬t) are the specific conditional entropies given the term's presence or absence
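
A sketch of the information-gain score for a single term and a binary class split, reusing the A, B, C, D counts from the χ² example; this is a simplification of the multi-class formula, and the function names are illustrative.

    from math import log2

    def entropy(probs):
        """Entropy in bits of a discrete distribution given as a list of probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)

    def information_gain(A, B, C, D):
        N = A + B + C + D
        p_t, p_not_t = (A + B) / N, (C + D) / N
        h_c = entropy([(A + C) / N, (B + D) / N])          # H(c)
        h_c_t = entropy([A / (A + B), B / (A + B)])        # H(c | t present)
        h_c_not_t = entropy([C / (C + D), D / (C + D)])    # H(c | t absent)
        return h_c - p_t * h_c_t - p_not_t * h_c_not_t

    print(information_gain(2, 3, 500, 9500))   # a small positive number of bits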

  23. Mutual Information (MI) • The probability of seeing x together with y vs. the probability of seeing x anywhere times the probability of seeing y anywhere: • I(x,y) = log( P(x,y) / (P(x) · P(y)) )

  24. Mutual Information (MI) I(t,c) = log( Pr(t,c) / (Pr(t) · Pr(c)) ), which with the A, B, C, D counts above is approximated as I(t,c) ≈ log( A · N / ((A+C)(A+B)) ) • rare terms are scored higher • approximation: does not use term absence
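
A sketch of the approximation above on the same contingency counts; the inflated score for this rare term illustrates why MI favors rare terms.

    from math import log2

    def mutual_information(A, B, C, D):
        N = A + B + C + D
        # I(t,c) = log( Pr(t,c) / (Pr(t) * Pr(c)) ) ~ log( A*N / ((A+C)*(A+B)) )
        return log2(A * N / ((A + C) * (A + B)))

    print(mutual_information(2, 3, 500, 9500))   # ~3 bits, despite only 2 co-occurrences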

  25. Using Mutual Information • Compute MI for each category and then combine • If we want to discriminate well across all categories, we take the expected value of MI: MIavg(t) = Σi Pr(ci) · I(t, ci) • To discriminate well for a single category, we take the maximum: MImax(t) = maxi I(t, ci)

  26. Mutual Information • Plus • I(t,c) is 0, when t and c are independent • Sound information-theoretic interpretation • Minus • Small numbers produce unreliable results • Computationally expensive • Does not use term absence

  27. Term Strength (TS); Mutual Information

  28. Comparison: DF, TS, IG, CHI, MI DF, IG and CHI are good and strongly correlated • thus using DF is good, cheap and task independent • can be used when IG and CHI are too expensive • MI is bad • favors rare terms (which are typically bad) • MI vs. IG

  29. Term Weighting • In the study just shown, terms were (mainly) treated as binary features • If a term occurred in a document, it was assigned 1 • Else 0 • Often it is useful to weight the selected features • Standard technique: tf.idf

  30. TF.IDF Term Weighting • TF: term frequency • definition: TF = tij • frequency of term i in document j • purpose: makes the frequent words for the document more important • IDF: inverted document frequency • definition: IDF = log(N/ni) • ni : number of documents containing term i • N : total number of documents • purpose: makes rare words across documents more important • TF.IDF • definition: tij · log(N/ni)
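
A minimal sketch following the slide's definitions (tf = raw count of term i in document j, idf = log(N/ni)); the toy corpus and function name are illustrative.

    from collections import Counter
    from math import log

    def tf_idf(tokenized_docs):
        N = len(tokenized_docs)
        df = Counter()                       # document frequency n_i per term
        for doc in tokenized_docs:
            df.update(set(doc))
        weights = []
        for doc in tokenized_docs:
            tf = Counter(doc)                # raw term frequency t_ij
            weights.append({t: tf[t] * log(N / df[t]) for t in tf})
        return weights

    docs = [["jaguar", "car", "car"], ["jaguar", "cat"], ["car", "engine"]]
    print(tf_idf(docs)[0])   # "car" (tf=2) gets twice the weight of "jaguar" (tf=1)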

  31. Term Normalization • Combine different words into a single representation • Stemming/morphological analysis • bought, buy, buys -> buy • General word categories • $23.45, 5.30 Yen -> MONEY • 1984, 10,000 -> DATE, NUM • PERSON • ORGANIZATION • (Covered in Information Extraction segment) • Generalize with lexical hierarchies • WordNet, MeSH • (Covered later in the semester)
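
A sketch of mapping tokens to general word classes with regular expressions; the class names follow the slide's examples, but the patterns are illustrative and far from a complete normalizer.

    import re

    RULES = [
        (re.compile(r"^\$\d[\d,]*(\.\d+)?$"), "MONEY"),   # $23.45
        (re.compile(r"^(19|20)\d\d$"), "DATE"),           # 1984
        (re.compile(r"^\d[\d,]*(\.\d+)?$"), "NUM"),       # 10,000
    ]

    def normalize(token):
        for pattern, word_class in RULES:
            if pattern.match(token):
                return word_class
        return token

    print([normalize(t) for t in ["$23.45", "1984", "10,000", "jaguar"]])
    # ['MONEY', 'DATE', 'NUM', 'jaguar']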

  32. Stemming & Lemmatization • Purpose: conflate morphological variants of a word to a single index term • Stemming: normalize to a pseudoword • e.g. “more” and “morals” become “mor” (Porter stemmer) • Lemmatization: convert to the root form • e.g. “more” and “morals” become “more” and “moral” • Plus: • vocabulary size reduction • data sparseness reduction • Minus: • loses important features (even to_lowercase() can be bad!) • questionable utility (maybe just “-s”, “-ing” and “-ed”?)
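
For comparison, a small example using NLTK's Porter stemmer and WordNet lemmatizer (assuming NLTK is installed and the 'wordnet' data package has been downloaded); the outputs in the comments are indicative.

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()          # requires nltk.download('wordnet')

    print(stemmer.stem("texts"), stemmer.stem("generalization"))                   # e.g. 'text', 'gener'
    print(lemmatizer.lemmatize("texts"), lemmatizer.lemmatize("bought", pos="v"))  # e.g. 'text', 'buy'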

  33. What Do People Do In Practice? • Feature selection • infrequent term removal • infrequent across the whole collection (i.e. DF) • met in a single document • most frequent term removal (i.e. stop words) • Normalization: • Stemming (often) • Word classes (sometimes) • Feature weighting: TF.IDF or IDF • Dimensionality reduction (occasionally)

  34. Summary • Feature selection • Task independent methods: DF, TS • Task dependent: IG, MI, χ² statistic • Term weighting • IDF • TF.IDF • Term normalization

  35. References • Feature Selection • Yang Y., Pedersen J. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pages 412-420. Morgan Kaufmann, 1997. • Term Weighting • Salton G., Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing and Management, v.24, n.5, pages 513-523, 1988. • Salton G. 1989. Automatic Text Processing. Chapter 9.
