Special topics on text mining [Part I: text classification]

Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

Representation and preprocessing of documents

Hugo Jair Escalante, Aurelio Lopez, Manuel Montes and Luis Villaseñor

Agenda
  • Recap: text classification
  • Representation of documents
  • Preprocessing
  • Feature selection
  • Instance selection
  • Discussion
  • Assignments
Text classification
  • Text classification is the assignment of free-text documents to one or more predefined categories based on their content

[Diagram: documents (e.g., news articles) are assigned to categories/classes (e.g., sports, religion, economy). A manual approach?]

Manual classification
  • Very accurate when the job is done by experts
    • Classifying news into general categories is different from classifying biomedical papers into subcategories.
  • But difficult and expensive to scale
    • Classifying thousands of documents is different from classifying millions
  • Used by Yahoo!, Looksmart, about.com, ODP, Medline, etc.

Ideas for building an automatic classification system?

How to define a classification function?

Hand-coded rule-based systems
  • Main approach in the 80s
  • Disadvantage: the knowledge acquisition bottleneck
    • too time-consuming, too difficult, inconsistency issues

[Diagram: knowledge engineers, with the help of domain experts, write rules (Rule 1: if … then … else; …; Rule N: if … then …) from labeled documents; the resulting classifier assigns a category to each new document.]

Example: filtering spam email
  • Rule-based classifier

[Figure: two example classifiers for spam filtering (Classifier 1, Classifier 2).]

Taken from Hastie et al. The Elements of Statistical Learning, 2007, Springer.
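As a rough illustration of such hand-coded rules, here is a minimal sketch of a rule-based spam filter. The rules, keywords, and decision logic below are invented for illustration; they are not taken from the slides or from Hastie et al.

```python
# A minimal, hypothetical hand-coded rule-based spam filter (illustration only;
# the rules below are invented, not taken from the slides).

def classify_email(text: str) -> str:
    text = text.lower()
    rules = [
        ("free" in text and "!!!" in text),       # Rule 1: aggressive promotion
        ("winner" in text or "lottery" in text),  # Rule 2: prize-scam vocabulary
        (text.count("$") >= 3),                   # Rule 3: repeated currency symbols
    ]
    # If any rule fires, label the message as spam; otherwise it is ham.
    return "spam" if any(rules) else "ham"

print(classify_email("You are a WINNER! Claim your free prize!!!"))  # -> spam
print(classify_email("Meeting moved to 3pm, see agenda attached."))  # -> ham
```

Writing and maintaining such rules by hand is exactly the knowledge acquisition bottleneck mentioned above.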

Machine learning approach (1)
  • A general inductive process builds a classifier by learning from a set of preclassified examples.
    • Determines the characteristics associated with each one of the topics.

Ronen Feldman and James Sanger, The Text Mining Handbook

Machine learning approach (2)

[Diagram: labeled documents (the training set) feed an inductive process that outputs a classifier (rules, trees, probabilities, prototypes, etc.); the classifier assigns a category to each new document. Open questions around the diagram: Do the annotators have to be experts? Which algorithm? How large does the training set have to be? How to represent documents?]

Machine learning approach (3)
  • Machine learning approach to TC: to develop automated methods able to classify documents with a certain degree of success

[Diagram: training documents (labeled) are fed to a learning machine (an algorithm), which produces a trained machine; the trained machine predicts the label of an unseen (test, query) document.]

Text classification
  • Machine learning approach to TC:

Assumption: a large enough training set of labeled documents is available

  • Recipe
  • Gather labeled documents
  • Construction of a classifier
    • Document representation
    • Preprocessing
    • Dimensionality reduction
    • Classification methods
  • Evaluation of a TC method

Later we will study methods that allow us to relax this assumption [semi-supervised and unsupervised learning]

Document representation
  • Represent the content of digital documents in a form that can be processed by a computer
Before representing documents: Preprocessing
  • Eliminate information about style, such as HTML or XML tags.
    • For some applications this information may be useful, e.g., to index only certain document sections.
  • Remove stop words
    • Functional words such as articles, prepositions, and conjunctions are not useful (they carry little meaning on their own).
  • Perform stemming or lemmatization
    • The goal is to reduce inflectional forms, and sometimes derivationally related forms.
    • am, are, is → be; car, cars, car's → car
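A minimal sketch of this preprocessing pipeline (tag removal, stop-word removal, crude suffix stripping). The stop-word list and suffix rules are toy stand-ins; a real system would use a full stop-word list and a proper stemmer or lemmatizer (e.g., the Porter stemmer).

```python
import re

# Toy preprocessing pipeline: strip markup, remove stop words, and apply a very
# naive suffix-stripping "stemmer". Illustrative only.

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

def preprocess(raw: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", raw)           # remove HTML/XML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stems = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):         # crude inflection stripping
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("<p>The cars are <b>racing</b> on the track</p>"))
# -> ['car', 'rac', 'track']
```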
Document representation
  • Transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm:
    • Codify/represent/transform documents into a vector representation
  • The most commonly used document representation is the bag of words (BOW) approach
    • Documents are represented by the set of words that they contain
    • Word order is not captured by this representation
    • There is no attempt at understanding their content
    • The vocabulary of all of the different words in all of the documents is the basis for the vector representation
Document representation

[Diagram: the document-term matrix. Columns correspond to the terms in the vocabulary (the basic units expressing a document's content); rows correspond to the documents in the corpus (one vector/row per document); entry w_ij is the weight indicating the contribution of word j in document i. V is the vocabulary of the collection, i.e., the set of all different words that occur in the corpus.]

Which words are good features?

How to select/extract them?

How to compute their weights?

(ML Conventions)

[Notation slide: X = {x_ij} is the m × n data matrix of m examples x_i (rows) and n features (columns); y = {y_j} is the vector of labels; a denotes a feature (column) and w a vector of feature weights.]

Slide taken from I. Guyon. Feature and Model Selection. Machine Learning Summer School, Ile de Re, France, 2008.

(A brief note on evaluation in TC)

[Diagram: the M × N document-term matrix (M documents, N = |V| terms) split row-wise into three blocks of m1, m2, and m3 documents.]

  • The available data is divided into three subsets:
    • Training (m1)
      • Used for the construction (learning) of the classifier
    • Validation (m2)
      • Used to optimize the parameters of the TC method
    • Test (m3)
      • Used for the evaluation of the classifier
Document representation
  • Simplest BOW-based representation: Each document is represented by a binary vector whose entries indicate the presence/absence of terms from the vocabulary (Boolean/binary weighting)
  • Exercise: obtain the BOW representation with Boolean weighting for a small set of example documents
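A minimal sketch of the Boolean BOW construction on a toy corpus (the example documents are invented, since the slide's examples are not reproduced here):

```python
# Boolean (binary) BOW: build the vocabulary from a small corpus and represent
# each document by a 0/1 vector indicating term presence/absence.

docs = [
    "the team won the match",
    "the economy grew this quarter",
    "the team signed a new player",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}

def boolean_bow(doc: str) -> list[int]:
    vec = [0] * len(vocab)
    for w in doc.split():
        if w in index:
            vec[index[w]] = 1          # presence/absence, not counts
    return vec

for d in docs:
    print(boolean_bow(d))
```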
Term weighting [extending the Boolean BOW]
  • Two main ideas:
    • The importance of a term increases proportionally to the number of times it appears in the document.
      • It helps to describe the document's content.
    • The general importance of a term decreases proportionally to its occurrences in the entire collection.
      • Common terms are not good to discriminate between different classes
  • Does the order of words matter?
Term weighting – main approaches
  • Binary weights:
    • w_ij = 1 iff document d_i contains term t_j, otherwise 0.
  • Term frequency (tf):
    • w_ij = (no. of occurrences of t_j in d_i)
  • tf x idf weighting scheme:
    • w_ij = tf(t_j, d_i) × idf(t_j), where:
      • tf(t_j, d_i) indicates the occurrences of t_j in document d_i
      • idf(t_j) = log[N / df(t_j)], where df(t_j) is the number of documents that contain the term t_j.
  • These methods do not use class information. Why?
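A small sketch of the tf and tf-idf weights defined above, on an invented toy corpus:

```python
import math
from collections import Counter

# tf-idf weighting as defined above: w_ij = tf(t_j, d_i) * log(N / df(t_j)).

docs = [
    "the team won the match".split(),
    "the economy grew this quarter".split(),
    "the team signed a new player".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc: list[str]) -> dict[str, float]:
    tf = Counter(doc)                                    # raw term frequency in the document
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))  # "the" gets weight 0: it appears in every document
```

Note that these weights are computed from the corpus alone; no class labels are involved, which is the point of the question above.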
Extended document representations
  • Document representations that capture information not considered by the BOW formulation
  • Examples:
    • Based on distributional term representations
    • Locally weighted bag of words
Distributional term representations (DTRs)
  • Distributional hypothesis: words that occur in the same contexts tend to have similar meanings
    • A word is known by the company it keeps!
    • Occurrence & co-occurrence
  • Distributional term representation: each term is represented by a vector of weights that indicate how often it occurs in documents or how often it co-occurs with other terms
  • DTRs for document representation: combine the DTR vectors of the terms that appear in the document
Distributional term representations (DTRs)
  • Documents are represented by the weighted sum of the DTRs for the terms appearing in the document

[Diagram: for the example caption "Accommodation Steinhausen - Exterior View Blumenau, Brazil, October 2004", the DTR vectors (contextual weights over terms) of "Accommodation", "Steinhausen", … are summed to obtain the DTR for the document.]

Distributional term representations (DTRs)

  • Document occurrence representation (DOR): The representation for a term is given by the documents it mostly occurs in.
  • Term co-occurrence representation (TCOR): The representation of a term is determined by the other terms from the vocabulary that mostly co-occur with it

A. Lavelli, F. Sebastiani, and R. Zanoli. Distributional Term Representations: An Experimental Comparison, Proceedings of the International Conference of Information and Knowledge Management, pp. 615—624, 2005, ACM Press.
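A simplified, unweighted sketch of a TCOR-style document representation on an invented mini-corpus. The actual DOR/TCOR schemes of Lavelli et al. use tf-idf-style weighting rather than the raw co-occurrence counts used here.

```python
from collections import Counter

# Simplified TCOR sketch: each term is represented by how often it co-occurs
# (within a document) with every other vocabulary term, and a document is
# represented by the (here unweighted) sum of the TCOR vectors of its terms.

docs = [
    "accommodation hotel brazil view".split(),
    "hotel booking brazil".split(),
    "exterior view accommodation".split(),
]
vocab = sorted({t for d in docs for t in d})

tcor = {t: Counter() for t in vocab}
for d in docs:
    for t in set(d):
        for u in set(d):
            if u != t:
                tcor[t][u] += 1          # t and u co-occur in this document

def doc_dtr(doc: list[str]) -> Counter:
    rep = Counter()
    for t in set(doc):
        rep.update(tcor[t])              # unweighted sum of the terms' TCOR vectors
    return rep

print(doc_dtr("hotel view".split()))
```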

Document representation (DOR & TCOR)
  • DOR-B:
    • Unweighted sum
  • DOR-TF-IDF:
    • Consider term-occurrence frequency and term usage across the collection
  • DOR-IF-IDF-W:
    • Weight terms from different modalities
LOWBOW: The locally weighted bag-of-words framework
  • Each document is represented by a set of local histograms computed across the whole document but smoothed by kernels and centered at different document locations
  • LOWBOW-based document representations can preserve sequential information in documents
  • G. Lebanon,Y. Mao,and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research. Vol. 8, pp. 2405—2441, 2007.
BOW approach
  • Indicates the (weighted) occurrence of terms in documents
LOWBOW framework
  • A set of histograms, each weighted according to selected positions in the document
BOW

[Figure: a news article by Benjamin Kang Lim ("China sent a senior official to attend a reception at the Ukraine embassy on Friday despite a diplomatic rift over a visit to Kiev by Taiwan's vice president Lien Chan. …") is mapped to a single BOW vector over the vocabulary (dimensions 1…V).]

LOWBOW

[Figure sequence: for the same article, locations are first identified in the document; then the contribution of terms is weighted according to Gaussians centered at each of those locations, producing one local histogram (over dimensions 1…V) per location.]

[Figure: local histograms (LHs) = position + frequency weighting. The document w1, w2, w3, …, wN is covered by smoothing kernels centered at locations μ1, μ2, …, μk; position weighting over 1…N combined with kernel smoothing yields one V-dimensional local histogram per kernel location.]
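To make the kernel-smoothed local histograms concrete, here is a minimal Python sketch. The kernel locations, bandwidth, and normalization below are arbitrary simplifications; the original LOWBOW framework uses properly renormalized kernels on [0, 1].

```python
import math

# Minimal LOWBOW-style sketch: one local histogram per kernel location,
# weighting each word position by a Gaussian centered at that location.

def lowbow(words: list[str], vocab: list[str], locations: list[float], sigma: float = 0.15):
    index = {w: j for j, w in enumerate(vocab)}
    n = len(words)
    histograms = []
    for mu in locations:                               # mu in [0, 1]
        hist = [0.0] * len(vocab)
        for pos, w in enumerate(words):
            x = pos / max(n - 1, 1)                    # normalized position in the document
            weight = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            hist[index[w]] += weight
        total = sum(hist) or 1.0
        histograms.append([v / total for v in hist])   # normalize each local histogram
    return histograms

words = "china sent a senior official to attend a reception at the ukraine embassy".split()
vocab = sorted(set(words))
for h in lowbow(words, vocab, locations=[0.0, 0.5, 1.0]):
    print([round(v, 2) for v in h])
```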

Assignment # 1
  • Read a paper on term weighting or document occurrence representations (it can be one from those available on the course page or another chosen by you)
  • Prepare a presentation of at most 10 minutes in which you describe the proposed/adopted approach *different from those seen in class*. The presentation must cover the following aspects:
    • Underlying and intuitive idea of the approach
    • Formal description
    • Benefits and limitations (compared to the schemes seen in class)
    • Your idea(s) to improve the presented approach
Suggested readings on term weighting and preprocessing
  • M. Lan, C. Tan, H. Low, S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. Proc. of WWW, pp. 1032—1033, 2005.
Text classification
  • Machine learning approach to TC:
  • Recipe
  • Gather labeled documents
  • Construction of a classifier
    • Document representation
    • Preprocessing
    • Dimensionality reduction
    • Classification methods
  • Evaluation of a TC method
Dimensionality issues
  • What do you think is the average size of the vocabulary in a small-scale text categorization problem (~1,000 – 10,000 documents)?
    • It depends on the domain and type of the corpus, although usual vocabulary sizes in text classification range from a few thousand to millions of terms
Dimensionality issues
  • A central problem in text classification is the high dimensionality of the feature space.
    • There is one dimension for each unique word found in the collection, which can reach hundreds of thousands
    • Processing is extremely costly in computational terms
    • Most of the words (features) are irrelevant to the categorization task

How to select/extract relevant features?

How to evaluate the relevancy of the features?

The curse of dimensionality
  • Dimensionality is a common issue in machine learning (in general)
  • The number of regions needed to cover the input space scales exponentially with the dimensionality of the problem
  • We need an exponential number of training examples to cover all of those regions

Image taken from: Samy Bengio and Yoshua Bengio, Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks, IEEE Transactions on Neural Networks, special issue on data mining and knowledge discovery, volume 11, number 3, pages 550-557, 2000.

Dimensionality reduction: Two main approaches
  • Feature selection
    • Idea: removal of non-informative words according to corpus statistics
    • Output: subset of the original features
    • Main techniques: document frequency, mutual information and information gain
  • Re-parameterization
    • Idea: combine lower-level features (words) into higher-level orthogonal dimensions
    • Output: a new set of features (not words)
    • Main techniques: word clustering and latent semantic indexing (LSI)
FS: Document frequency
  • The document frequency of a word is the number of documents in which it occurs.
  • This technique consists of removing words whose document frequency is below a specified threshold
  • The basic assumption is that rare words are either non-informative for category prediction or not influential in global performance.
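A minimal sketch of document-frequency thresholding on an invented corpus (the min_df threshold of 2 is an arbitrary choice):

```python
from collections import Counter

# Feature selection by document frequency: keep only terms whose document
# frequency meets a threshold.

docs = [
    {"team", "match", "goal"},
    {"team", "economy", "market"},
    {"market", "economy", "growth"},
]
min_df = 2

df = Counter(t for d in docs for t in d)            # document frequency per term
selected = {t for t, c in df.items() if c >= min_df}
print(selected)  # {'team', 'economy', 'market'}
```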
Zipf's law
  • The frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

G. Kirby. Zipf's Law. UK Journal of Naval Science, Vol. 10, No. 3, pp. 180–185, 1985.
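A quick way to inspect Zipf's law empirically is to check that rank × frequency stays roughly constant; in this sketch, corpus.txt is a placeholder for any plain-text corpus you have at hand.

```python
from collections import Counter

# Empirical check of Zipf's law: for the rank-frequency table of a corpus,
# rank * frequency should stay roughly constant across the top-ranked words.

text = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus file
ranked = Counter(text).most_common()
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(f"{rank:>2}  {word:<15} freq={freq:<6} rank*freq={rank * freq}")
```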

FS: Mutual information
  • Measures the mutual dependence of two variables
    • In TC, it measures the information that a word t and a class c share: how much knowing word t reduces our uncertainty about class c

The idea is to select words that are strongly related to one class

FS: Mutual information
  • Let:
    • A: # times t and c co-occur
    • B: # times t occurs without c
    • C: # times c occurs without t
    • N: # documents
  • Then: MI(t, c) = log [ P(t ∧ c) / (P(t) · P(c)) ] ≈ log [ (A · N) / ((A + C) · (A + B)) ]
  • To get the global MI for term t, average (or take the maximum) over classes: MI_avg(t) = Σ_c P(c) · MI(t, c)
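A direct translation of the approximation above into code, using the contingency counts A, B, C, and N; the counts in the example are invented.

```python
import math

# MI(t, c) ≈ log[(A * N) / ((A + C) * (A + B))], computed from contingency counts.

def mutual_information(A: int, B: int, C: int, N: int) -> float:
    """A: docs with t and c; B: docs with t but not c; C: docs with c but not t.
    Assumes A > 0 (the term and the class co-occur at least once)."""
    return math.log((A * N) / ((A + C) * (A + B)))

# Toy example: the term occurs in 40 of the 50 documents of class c,
# and in 10 of the remaining 150 documents.
print(mutual_information(A=40, B=10, C=10, N=200))  # > 0: t is informative for c
```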

FS: Information gain (1)
  • Information gain (IG) measures how well an attribute separates the training examples according to their target classification
    • Is the attribute a good classifier?
  • The idea is to select the set of attributes having the greatest IG values
    • Commonly, maintain attributes with IG > 0

How to measure the worth (IG) of an attribute?

FS: Information gain (2)
  • Information gain → entropy
  • Entropy characterizes the impurity of an arbitrary collection of examples.
    • It specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of the dataset (S).
  • Entropy: average information from a message that can take m values: H(S) = −Σ_{i=1..m} p_i log2(p_i)
  • For a binary problem: Entropy(S) = −p₊ log2(p₊) − p₋ log2(p₋)
    • Greatest uncertainty (p₊ = p₋ = 0.5): 1 bit is needed to encode the class
    • No uncertainty (always positive / always negative): no need to encode the class

FS: Information gain (3)
  • The IG of an attribute measures the expected reduction in entropy caused by partitioning the examples according to this attribute: IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
    • The greater the IG, the better the attribute for classification
    • An IG of (close to) 0 indicates that the attribute leaves us with essentially the same uncertainty as the original set
    • The maximum value is log C, where C is the number of classes.
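A small sketch of information gain for a binary presence/absence term, following the entropy definitions above; the toy documents and labels are invented.

```python
import math
from collections import Counter

# IG of a presence/absence term over a labeled corpus:
# IG = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), v in {term present, term absent}.

def entropy(labels: list[str]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term: str, docs: list[set[str]], labels: list[str]) -> float:
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (with_t, without_t) if part)
    return entropy(labels) - remainder

docs = [{"goal", "team"}, {"team", "coach"}, {"market", "stocks"}, {"economy", "market"}]
labels = ["sports", "sports", "economy", "economy"]
print(information_gain("team", docs, labels))     # 1.0: perfectly separates the classes
print(information_gain("economy", docs, labels))  # lower: present in only one economy doc
```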
Other FS methods for TC

G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289—1305, 2003


Feature selection: the ML perspective

For a problem with n features there are 2^n different subsets of features

  • Problem: find the subset of features that is most helpful for classification
    • Reduce the dimensionality of the data
    • Eliminate uninformative features
    • Find discriminative features

I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer 2006.


Feature selection: the ML perspective

Filters: Evaluate the importance of features using methods that are independent of the classification model

Wrappers: Evaluate the importance of subsets of features using the classification model (a search strategy is adopted)

Embedded: Take advantage of the nature of the classification model being considered

I. Guyon, et al. Feature Extraction: Foundations and Applications, Springer 2006.

Filters vs. Wrappers

[Diagram: a filter takes all features and selects one feature subset before the predictor is trained; a wrapper evaluates multiple feature subsets by training the predictor on each of them.]

  • Main goal: rank subsets of useful features.
  • Danger of over-fitting with intensive search!
Feature selection: the ML perspective
  • General diagram of a wrapper feature selection method

[Diagram: the original feature set enters a loop of Generation → Evaluation → Stopping criterion (no: generate another subset; yes: proceed) → Validation of the selected subset of features.]

Generation = select a candidate feature subset.

Evaluation = compute the relevancy value of the subset.

Stopping criterion = determine whether the subset is relevant.

Validation = verify subset validity.

From M. Dash and H. Liu. http://www.comp.nus.edu.sg/~wongszec/group10.ppt
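A compact sketch of the generation / evaluation / stopping loop above, as greedy forward selection wrapped around a classifier. It assumes scikit-learn is available; the synthetic data and the logistic regression model are arbitrary stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Wrapper sketch: greedy forward generation of feature subsets, evaluation by
# cross-validated accuracy, and a simple "no improvement" stopping criterion.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                      # toy data: 200 examples, 30 features
y = (X[:, 0] + X[:, 3] > 0).astype(int)             # only features 0 and 3 matter

def evaluate(subset):
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, subset], y, cv=5).mean()

selected, best = [], 0.0
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]  # generation
    if not candidates:
        break
    scored = [(evaluate(selected + [f]), f) for f in candidates]      # evaluation
    score, f = max(scored)
    if score <= best:                                                 # stopping criterion
        break
    selected, best = selected + [f], score
print(selected, round(best, 3))  # validation would re-check the subset on held-out data
```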

Trends on feature selection for text classification
  • How to combine the output of different feature selection methods?
  • Hybrid feature selection method (filter(s) + wrapper) for feature subset selection
Fusion of feature selection methods
  • Different feature selection methods choose different features
  • The combination of the outputs of different FS methods can result in a better feature selection process

R. Neumayer et al. Combination of feature selection methods for text categorization. Proc. of European Conference on Information Retrieval, pp. 763—766, 2011.

Y. Saeys et al. Robust feature selection using ensemble feature selection techniques. Proc. of ECML/PKDD, pp. 313—325, LNAI 5112, Springer, 2008.

Hybrids for feature selection
  • Wrapper-based feature selection often outperforms filter-based methods. However, wrappers are too computationally demanding. (Most) Filters are extremely efficient, although most of them are univariate
  • Idea: hybrid approach:
    • Use filters to generate a ranked list of features (possibly using several FS methods)
    • Use a wrapper to select the best subset of features using the information returned by the filter

M. A. Esseghir. Effective Wrapper-Filter hybridization through GRASP Schemata. JMLR Workshop and Conference Proceedings Vol. 10: Feature Selection in Data Mining, 10:45-54, 2010.

P. Bermejo, J. A. Gamez, J. M. Puerta. A GRASP Algorithm for Fast Hybrid Feature Subset Selection in High-dimensional Datasets. Pattern Recognition Letters, 2011.

J. Pacheco, S. Casado, L. Nunez. A Variable Selection Method based on Tabu Search for Logistic Regression. European Journal of Operations Research, Vol. 199:506—511, 2009.
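A hedged sketch of the hybrid idea: a chi-square filter ranks all terms, and a small wrapper then searches only over the top-ranked ones. It assumes scikit-learn is available (fetch_20newsgroups downloads the dataset on first use); the two categories, the top-50 cut-off, and the classifier are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hybrid filter + wrapper sketch: filter ranking first, wrapper search second.

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = CountVectorizer(min_df=5, stop_words="english").fit_transform(data.data)
y = data.target

scores, _ = chi2(X, y)                        # filter: rank terms by chi-square score
top = np.argsort(scores)[::-1][:50]           # keep only the 50 best-ranked terms

best_k, best_acc = None, 0.0
for k in (5, 10, 25, 50):                     # wrapper: evaluate nested subsets of the ranking
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, top[:k]], y, cv=3).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, round(best_acc, 3))
```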

More on feature selection and extraction…

Feature Extraction: Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book

Assignment # 2
  • Read a paper about feature selection/extraction for text classification (it can be one from those available on the course page or another chosen by you)
  • Prepare a presentation of at most 10 minutes in which you describe a feature selection approach *different from those seen in class*. The presentation must cover the following aspects:
    • Underlying and intuitive idea of the proposed approach
    • Formal description
    • Benefits and limitations (compared to the approaches seen in class)
    • Your idea(s) to improve the presented approach
Suggested readings on feature selection for text classification
  • G. Forman. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. JMLR, 3:1289—1305, 2003
  • H. Liu, H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, CRC, 2008.
  • Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proc. of the 14th International Conference on Machine Learning, pp. 412—420, 1997.
  • D. Mladenic, M. Grobelnik. Feature Selection for Unbalanced Class Distribution and Naïve Bayes. Proc. of the 16th Conference on Machine Learning, pp. 258—267, 1999.
  • I. Guyon, et al. Feature Extraction: Foundations and Applications. Springer, 2006.
  • I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, Vol. 3:1157—1182, 2003.