

An Introduction to Text Mining


Ravindra Jaju

Initiation/Introduction ...

What makes text stand apart from other kinds of data?

Classification

Clustering

Mining on “The Web”

What: Looking for information from usually large amounts of data

Mainly two kinds of activities – Descriptive and Predictive

Example of a descriptive activity – Clustering

Example of a predictive activity - Classification

<1, 1, 0, 0, 1, 0>

<0, 0, 1, 1, 0, 1>

It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.

Or, it could be two documents - “Java programming language” and “India beat Pakistan”

<550000, 155>

<750000, 115>

<120000, 165>

Data about people, <income, IQ> pairs!

Humans understand data in various forms

Text

Sales figures

Images

Computers understand only numbers

Most of the mining algorithms work only with numeric data

All data, hence, are represented as numbers so that they can lend themselves to the algorithms

Whether it is sales figures, crime rates, text, or images – one has to find a suitable way to transform data into numbers.

“Java Programming Language”

“India beat Pakistan”

OR

<1, 1, 0, 0, 1, 0>

<0, 0, 1, 1, 0, 1>

The transformation to 1's and 0's hides all the relationships between Java and Language, and between India and Pakistan, which humans can make out (How?)

As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss
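The word-to-index transformation described above can be sketched in a few lines. This is only an illustration: the vocabulary order is an assumption, chosen so that the output matches the vectors shown in the slides, and `to_binary_vector` is a hypothetical helper name.

```python
# Hypothetical vocabulary. The word order is an assumption, picked so that
# the resulting vectors match the ones shown in the slides.
VOCABULARY = ["java", "programming", "india", "beat", "language", "pakistan"]


def to_binary_vector(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


doc_java = to_binary_vector("Java Programming Language")
doc_india = to_binary_vector("India beat Pakistan")
print(doc_java)   # [1, 1, 0, 0, 1, 0]
print(doc_india)  # [0, 0, 1, 1, 0, 1]
```

Note that the vectors keep only word presence: word order, and any relationship between the words, is lost.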

One big challenge in this field today is to find a good data representation for input to the mining algorithms

Each word has a dictionary meaning, or meanings

Run – (1) the verb. (2) the noun, in cricket

Cricket – (1) The game. (2) The insect.

Each word is used in various “senses”

Tendulkar made 100 runs

Because of an injury, Tendulkar can not run and will need a runner between the wickets

Capturing the “meaning” of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!

Given a document, automatically finding out who the “he” in “He is the President” refers to is hard. And “president of?” Well ...

In general, it is hard to capture these features from a text document

One, it is difficult to extract this automatically

Two, even if we did it, it won't scale!

One simplification is to represent documents as a vector of words

We have already seen examples

Each document is represented as a vector, and each component of the vector represents some “quantity” related to a single word.

“Java Programming Language”

<1, 1, 0, 0, 1, 0, 0> (document A)

“India beat Pakistan”

<0, 0, 1, 1, 0, 1, 0> (document B)

“India beat Australia”

<0, 0, 1, 1, 0, 0, 1> (document C)

What vector operation can you think of to find two similar documents?

How about the dot product?

As we can easily verify, documents B and C will have a higher dot product than any other combination

The dot product or cosine between two vectors is a measure of similarity.
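As a quick check of the claim about documents A, B and C, the following sketch computes the dot product and the cosine for the three vectors given above. The helper names `dot` and `cosine` are chosen here for illustration.

```python
import math

# The three document vectors from the slides.
doc_a = [1, 1, 0, 0, 1, 0, 0]  # "Java Programming Language"
doc_b = [0, 0, 1, 1, 0, 1, 0]  # "India beat Pakistan"
doc_c = [0, 0, 1, 1, 0, 0, 1]  # "India beat Australia"


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


print(dot(doc_a, doc_b), dot(doc_a, doc_c), dot(doc_b, doc_c))  # 0 0 2
print(round(cosine(doc_b, doc_c), 3))  # 0.667
```

B and C share two words ("India" and "beat"), so they are the only pair with a non-zero score.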

Documents about related topics should have higher similarity

[Figure: a three-dimensional term space with axes Language, Java and Indonesia, meeting at the origin (0, 0, 0)]

How about distance measures?

Cosine similarity measure will not capture the inter-cluster distances!

Not all words are equally important

the, is, and, to, he, she, it (Why?)

Of course, these words could be important in certain contexts

We have the option of scaling the components of these words, or completely removing them from the corpus

In general, we prefer to remove the stopwords and scale the remaining words

Important words should be scaled upwards, and vice versa

One widely used scaling factor – TF-IDF

TF-IDF stands for the product of a word's Term Frequency and its Inverse Document Frequency.
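A minimal sketch of one common TF-IDF variant follows. The slides do not say which variant is meant, so the choices here are assumptions: raw counts for the term frequency and log(N / document frequency) for the inverse document frequency.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


docs = [
    "india beat pakistan".split(),
    "india beat australia".split(),
    "java programming language".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

In this toy corpus "pakistan" occurs in one document and "india" in two, so "pakistan" gets the larger weight. A word that occurs in every document would get weight zero, which is how stopwords are scaled down.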

Document/Term Clustering

Given a large set, group similar entities

Text Classification

Given a document, find which topic it talks about

Information Retrieval

Search engines

Information Extraction

Question Answering

Activity: Group together similar documents

Techniques used

Partitioning

Hierarchical

Agglomerative

Divisive

Grid based

Model based

Partitioning

Divide the input data into k partitions

K-means, K-medoids
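A minimal K-means sketch, assuming Euclidean distance and taking the first k points as the initial centroids (a simplification chosen here to keep the run deterministic; random initialisation is more usual). The 2-D points stand in for document vectors.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, max_iterations=100):
    """Return (assignment, centroids) after alternating assign/update steps."""
    centroids = [list(p) for p in points[:k]]  # assumed initialisation
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed partition, so we have converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


points = [[1.0, 1.0], [9.0, 9.0], [1.5, 2.0], [8.0, 8.0], [1.0, 0.5], [9.0, 8.5]]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```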

Hierarchical clustering

Agglomerative

Each data point is assumed to be a cluster representative

Keep merging similar clusters till we get a single cluster

Divisive

The opposite of agglomerative
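The agglomerative procedure can be sketched as below. The slides do not specify how the similarity of two clusters is measured, so single linkage (the distance between the closest pair of members) with Euclidean distance is assumed here.

```python
def euclidean(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def single_link(cluster_a, cluster_b, points):
    """Distance between the closest pair of members (assumed linkage)."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge order."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append(merged)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
merges = agglomerative(points)
print(merges)  # [[0, 1], [2, 3], [0, 1, 2, 3]]
```

The list of merges is the hierarchy read bottom-up; cutting it at any level gives a flat clustering.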

Idea

Frequent terms carry more information about the “cluster” they might belong to

Highly correlated frequent terms probably belong to the same cluster

D = {D1, …, Dn} – the set of documents

Each $D_j \subseteq T$, where $T$ is the set of all terms

Then candidate clusters are generated from F = {F1, … , Fk}, where each Fi is a set of all frequent terms which occur together.
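To make the idea concrete, here is a brute-force sketch that finds frequent term sets and, for each, the documents that contain all of its terms. The support threshold, the size limit and the function name are assumptions for illustration; a real implementation would prune the search (for example Apriori-style) rather than enumerate every combination.

```python
from itertools import combinations


def frequent_term_sets(documents, min_support, max_size=2):
    """documents: list of term sets.

    Returns {frozenset(terms): set of indices of documents containing all terms}
    for every term set of up to max_size terms found in at least min_support
    documents.
    """
    vocabulary = sorted(set().union(*documents))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            covered = {i for i, doc in enumerate(documents) if set(terms) <= doc}
            if len(covered) >= min_support:
                result[frozenset(terms)] = covered
    return result


docs = [
    {"india", "beat", "pakistan"},
    {"india", "beat", "australia"},
    {"java", "programming", "language"},
    {"java", "language", "tutorial"},
]
candidates = frequent_term_sets(docs, min_support=2)
for terms, covered in candidates.items():
    print(sorted(terms), sorted(covered))
```

Each frequent term set together with the documents it covers is one candidate cluster. Here {india, beat} covers documents 0 and 1, and {java, language} covers documents 2 and 3.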

The problem statement

Given a set of documents, each with a label called the class label for that document

Given a classifier which learns from the above data set

For a new, unseen document, the classifier should be able to “predict” with a high degree of accuracy the correct class to which the new document belongs

A tree

Each node represents some kind of an “evaluation” for an attribute of the data

Each edge, the decision taken

The evaluation at each node is some kind of an information gain measure

Reduction in entropy – more information gained

Entropy $E(x) = -\sum_i p_i \log_2(p_i)$

$p_i$ represents the probability that the data corresponds to sample i

Each edge represents a choice for the value of the attribute the node represents

Good for text mining. But doesn’t scale
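The entropy-based evaluation at a node can be sketched as follows. The attribute names (`has_java`, `has_the`) and the toy data are made up for illustration; information gain is computed as the entropy before the split minus the weighted entropy after it.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of the class distribution in labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on attribute.

    rows: list of dicts mapping attribute name to value.
    """
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical word-presence attributes for four documents.
rows = [
    {"has_java": 1, "has_the": 1},
    {"has_java": 1, "has_the": 0},
    {"has_java": 0, "has_the": 1},
    {"has_java": 0, "has_the": 0},
]
labels = ["programming", "programming", "sports", "sports"]
print(information_gain(rows, labels, "has_java"))  # 1.0
print(information_gain(rows, labels, "has_the"))   # 0.0
```

Splitting on `has_java` separates the two classes completely (one full bit gained), while splitting on `has_the` tells us nothing, so a tree builder would test `has_java` first.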

For a document-class data, we calculate the probabilities of occurrence of events

Bayes’ Theorem

P(c|d) = P(c) . P(d|c) / P(d)

Given a document d, the probability that it belongs to a class c is given by the above formula.

In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples

Probability of the document event d

P(d) = P(w1, …, wn) – wi are the words

The RHS is generally a headache. We have to consider the inter-dependence of each of the wj events

Naïve Bayes – Assume all the wj events are independent. The RHS expands to $\prod_j p(w_j)$

Most of the Bayesian text classifiers work with this simplification
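A small Naïve Bayes text classifier along these lines is sketched below. Several details are assumptions not stated in the slides: word probabilities are estimated per class from word counts, add-one smoothing avoids zero probabilities, words never seen in training are ignored, and P(d) is dropped because it is the same for every class and so does not affect which class scores highest.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect the counts from which the probabilities are estimated."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the class c maximising log P(c) + sum_j log P(w_j | c)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # prior P(c)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocabulary:
                continue  # unseen words carry no evidence (assumption)
            # Add-one smoothing (assumption) keeps every probability non-zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [
    "java programming language".split(),
    "java code compiler".split(),
    "india beat pakistan".split(),
    "india beat australia".split(),
]
train_labels = ["programming", "programming", "sports", "sports"]
model = train_naive_bayes(train_docs, train_labels)
print(classify("java compiler tutorial".split(), model))  # programming
print(classify("india beat java".split(), model))         # sports
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long documents.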

This is an intermediate approach

Not all words are independent

“If java and program occur together, then boost the probability value of class computer programming”

“If java and indonesia occur together, then the document is more likely about some-other-class”

Problem?

How do we come up with correlations like the above?

Support Vector Machines

Find the best discriminant plane between two classes

k Nearest Neighbour

Association Rule Mining

Neural Networks

Case-based reasoning

Problem setting

Labeling documents is a manual process

A lot more unlabeled documents are available as compared to labeled documents

Unlabeled documents contain information which could help in the classification activity

Train a classifier with the labeled documents

Say, a Naïve Bayes classifier

This classifier estimates the model parameters (the prior probabilities of the various events)

Now, classify the unlabeled documents.

Assuming the applied labels to be correct, re-estimate the model parameters

Repeat the above step till convergence
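The loop above can be sketched with a tiny Naïve Bayes model. This is a simplified illustration, not the method of the cited paper: it assigns each unlabeled document a single hard label per round (a probabilistic treatment would weight each document by its class probabilities), it uses add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(documents, labels):
    """Estimate the model parameters (as counts) from labeled documents."""
    class_counts, word_counts, vocabulary = Counter(labels), defaultdict(Counter), set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(tokens, model):
    """Most probable class under Naive Bayes with add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())

    def log_score(label):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in tokens:
            if word in vocabulary:
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        return score

    return max(sorted(class_counts), key=log_score)


def self_train(labeled_docs, labels, unlabeled_docs, max_rounds=20):
    """Train, label the unlabeled documents, re-estimate; stop when labels settle."""
    model = train(labeled_docs, labels)
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(doc, model) for doc in unlabeled_docs]
        if new_guesses == guesses:
            break  # the applied labels no longer change
        guesses = new_guesses
        model = train(labeled_docs + unlabeled_docs, labels + guesses)
    return model, guesses


labeled = ["java programming language".split(), "india beat pakistan".split()]
labels = ["programming", "sports"]
unlabeled = ["java compiler".split(), "india cricket".split()]
model, guesses = self_train(labeled, labels, unlabeled)
print(guesses)                      # ['programming', 'sports']
print(predict(["cricket"], model))  # sports
```

The word "cricket" never appears in a labeled document, yet the final model associates it with sports because it co-occurs with "india" in an unlabeled one. That is the information the unlabeled documents contribute.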

A useful technique for estimating hidden parameters

In the previous example, the class labels were missing from some documents

Consists of two steps

E-step: Set $z^{(k+1)} = E[z \mid D; \theta^{(k)}]$

M-step: Set $\theta^{(k+1)} = \arg\max_{\theta} P(\theta \mid D; z^{(k+1)})$

The above steps are repeated till convergence, and convergence does occur

Idea

Find a direction which maximizes the separation between classes.

Why?

Reduce “noise”, or rather

Enhance the differences between classes

The vector corresponding to this direction is Fisher's discriminant

Project the data-points onto this vector

For all data-points not separated by this vector, choose another such vector

Repeat till all data are now separable

Note, we are looking at a 2-class case. This easily extends to multiple classes
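For a feel of what such a direction looks like, here is a sketch of the classical two-class Fisher discriminant in two dimensions: the direction is the inverse of the within-class scatter matrix applied to the difference of the class means. This is the textbook construction, assumed here for illustration; it is not necessarily the exact procedure of the cited paper, and the data points are made up.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(c) / n for c in zip(*points)]


def scatter(points, center):
    """2x2 scatter matrix of points around center."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - center[0], p[1] - center[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class_a, class_b):
    """Direction w = S_w^{-1} (m_a - m_b) for two classes of 2-D points."""
    m_a, m_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(point, direction):
    """Scalar projection of point onto direction (unnormalised)."""
    return point[0] * direction[0] + point[1] * direction[1]


class_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 6.0]]
w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
print(min(proj_a) > max(proj_b))  # True: one threshold separates the classes
```

After projection each document is a single number, and in this toy case a single threshold on that number separates the two classes.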

Project all the document vectors into the space that has these discriminant vectors as its basis vectors

Now, induce a decision tree on this projected representation

The number of attributes is highly reduced

Since this representation nicely “separates” the data points (documents), accuracy increases

The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges

Apart from the text itself, this graph structure carries a lot of information about the “usefulness” of the “nodes”

For example

10 random, average people on the streets say Mr. T. Ache is a good dentist

5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist

Who would you choose?

HITS – Hypertext Induced Topic Selection

Nodes on the web can be categorized into two types – hubs and authorities

Authorities are nodes which one refers to for definitive information about a topic

Hubs point to authorities

HITS computes the hub and authority scores on a sub-universe of the web

How does one collect this ‘sub-universe’?

The basic steps

$A_u = \sum_v H_v$ for all v pointing to u

$H_u = \sum_v A_v$ for all v pointed to by u

Repeat the above till convergence

Nodes with high A scores are “relevant”

Relevant to what?

Can we use this for efficient retrieval for a query?
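The two update rules can be sketched as an iteration over a small link graph. The normalisation after each round is an addition assumed here so that the scores stay bounded; the graph and page names are made up.

```python
import math


def normalise(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {page: s / norm for page, s in scores.items()}


def hits(links, iterations=50):
    """links: {page: [pages it points to]}. Returns (authority, hub) score dicts."""
    pages = set(links) | {v for targets in links.values() for v in targets}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A_u = sum of H_v over all v pointing to u
        authority = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                authority[u] += hub[v]
        # H_u = sum of A_v over all v pointed to by u
        hub = {u: sum(authority[v] for v in links.get(u, [])) for u in pages}
        authority, hub = normalise(authority), normalise(hub)
    return authority, hub


# Hypothetical sub-universe: three hub pages and two authority pages.
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"], "hub3": ["auth1"]}
authority, hub = hits(links)
print(max(authority, key=authority.get))  # auth1
print(max(hub, key=hub.get))              # hub1
```

`auth1` is pointed to by all three hubs and gets the top authority score; `hub1` points to both authorities and gets the top hub score.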

Similar to HITS, but all pages have only one score – a Rank

$R(u) = c \sum_{v} R(v) / N_v$

The sum runs over the pages v linking to u, and $N_v$ is the number of links on page v. c is a scaling factor (< 1)

The higher the rank of pages linking to a page, the higher is its own rank!

To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as

$R'(u) = c \sum_{v} R'(v) / N_v + c\,E(u)$

E(u) is a set of some pages, and acts as a rank source (what kind of pages?)
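The modified formula can be iterated as in the sketch below. Two things are assumptions of this sketch rather than statements from the slides: E is taken to be uniform over all pages with a total weight of 0.15, and c is not fixed in advance but chosen each round so that the ranks sum to 1.

```python
def page_rank(links, source_weight=0.15, iterations=100):
    """links: {page: [pages it links to]}. Returns {page: rank}, ranks summing to 1."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    source = {p: source_weight / n for p in pages}  # E(u), assumed uniform
    for _ in range(iterations):
        new_rank = dict(source)  # start from the rank source E(u)
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)  # R'(v) / N_v
        total = sum(new_rank.values())
        # Dividing by the total plays the role of the scaling factor c.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical four-page web.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page `c` is linked to by three pages and ends up with the highest rank. Page `d` has no incoming links, so its only rank comes from the source E.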

- Using external dictionaries
- WordNet

- Using language specific techniques
- Computational linguistics
- Use grammar for judging the “sense” of a query in the “information retrieval” scenario

- Other interesting techniques
- Latent Semantic Indexing
- Finding the latent information in documents using Linear Algebra Techniques


- Some “purists” do not consider most of the current activities in the text mining field as real text mining
- For example, see Marti Hearst’s write-up at Untangling Text Data Mining

- One example that she mentions
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability

- The above was inferred from a set of documents, with some human help

- Data Mining – Concepts and Techniques, by Jiawei Han and Micheline Kamber
- Principles of Data Mining, by David J. Hand et al
- Text Classification from Labeled and Unlabeled Documents using EM, Kamal Nigam et al
- Fast and accurate text classification via multiple linear discriminant projections, S. Chakrabarti et al
- Frequent Term-Based Text Clustering, Florian Beil et al
- The PageRank Citation Ranking: Bringing Order to the Web, Lawrence Page and Sergey Brin
- Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others …