By **oshin**

## An Introduction to Text Mining


Outline of the presentation

Initiation/Introduction ...

What makes text stand apart from other kinds of data?

Classification

Clustering

Mining on “The Web”

Data Mining

What: Looking for information in (usually) large amounts of data

Mainly two kinds of activities – Descriptive and Predictive

Example of a descriptive activity – Clustering

Example of a predictive activity - Classification

What kind of data is this?

<1, 1, 0, 0, 1, 0>

<0, 0, 1, 1, 0, 1>

It could be two customers' baskets, containing (milk, bread, butter) and (shaving cream, razor, after-shave lotion) respectively.

Or, it could be two documents - “Java programming language” and “India beat Pakistan”

And what kind of data is this?

<550000, 155>

<750000, 115>

<120000, 165>

Data about people, <income, IQ> pairs!

Data representation

Humans understand data in various forms

Text

Sales figures

Images

Computers understand only numbers

Working with data

Most of the mining algorithms work only with numeric data

All data, hence, are represented as numbers so that they can lend themselves to the algorithms

Whether it is sales figures, crime rates, text, or images – one has to find a suitable way to transform data into numbers.

Text mining – Working with numbers

“Java Programming Language”

“India beat Pakistan”

OR

<1, 1, 0, 0, 1, 0>

<0, 0, 1, 1, 0, 1>

The transformation to 1's and 0's hides every relationship between Java and Language, or between India and Pakistan, that a human reader can make out (How?)

Text mining – Working with numbers (contd.)

As we have seen, data transformation (from text/word to some index number in this case) means that there is some information loss

One big challenge in this field today is to find a good data representation for input to the mining algorithms

Text Representation Issues

Each word has a dictionary meaning, or meanings

Run – (1) the verb. (2) the noun, in cricket

Cricket – (1) The game. (2) The insect.

Each word is used in various “senses”

Tendulkar made 100 runs

Because of an injury, Tendulkar can not run and will need a runner between the wickets

Capturing the “meaning” of sentences is an important issue as well. Grammar, parts of speech, time sense could be easy!

Automatically working out who “he” refers to in “He is the President”, given a document, is hard. And “President of what?” Well ...

Text Representation Issues (contd.)

In general, it is hard to capture these features from a text document

One, it is difficult to extract this automatically

Two, even if we did it, it won't scale!

One simplification is to represent documents as a vector of words

We have already seen examples

Each document is represented as a vector, and each component of the vector represents some “quantity” related to a single word.

The Document Vector

“Java Programming Language”

<1, 1, 0, 0, 1, 0, 0> (document A)

“India beat Pakistan”

<0, 0, 1, 1, 0, 1, 0> (document B)

“India beat Australia”

<0, 0, 1, 1, 0, 0, 1> (document C)

What vector operation can you think of to find two similar documents?

How about the dot product?

As we can easily verify, documents B and C will have a higher dot product than any other combination
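
The claim above can be checked with a minimal sketch. The vocabulary order in `VOCAB` is an assumption, chosen only so that the vectors match the ones on the slide; `to_vector` and `dot` are illustrative helper names.

```python
# Vocabulary order is an assumption chosen to reproduce the slide's vectors.
VOCAB = ["java", "programming", "india", "beat", "language", "pakistan", "australia"]

def to_vector(text):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in VOCAB]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

doc_a = to_vector("Java Programming Language")
doc_b = to_vector("India beat Pakistan")
doc_c = to_vector("India beat Australia")

for name, (u, v) in {"A.B": (doc_a, doc_b), "A.C": (doc_a, doc_c), "B.C": (doc_b, doc_c)}.items():
    print(name, dot(u, v))
```

Only B and C share words ("india" and "beat"), so theirs is the only non-zero dot product.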

More on document similarity

The dot product or cosine between two vectors is a measure of similarity.

Documents about related topics should have higher similarity

[Figure: document vectors drawn from the origin (0, 0, 0) in a space whose axes are the terms Language, Java and Indonesia]

Document Similarity (contd.)

How about distance measures?

Cosine similarity measure will not capture the inter-cluster distances!
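
A small sketch of why cosine similarity and a distance measure can disagree. The three vectors are hypothetical term-count vectors, not taken from the slides: cosine looks only at direction, so a short document and a longer one with the same word proportions look identical, while their Euclidean distance is clearly non-zero.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1, 1, 0]
long_doc = [3, 3, 0]    # same direction, three times the term counts
other_doc = [0, 0, 1]   # shares no terms with the other two

print(cosine(short_doc, long_doc), euclidean(short_doc, long_doc))
print(cosine(short_doc, other_doc), euclidean(short_doc, other_doc))
```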

Further refinements to the DV representation

Not all words are equally important

the, is, and, to, he, she, it (Why?)

Of course, these words could be important in certain contexts

We have the option of scaling the components of these words, or completely removing them from the corpus

In general, we prefer to remove the stopwords and scale the remaining words

Important words should be scaled upwards, and vice versa

One widely used scaling factor – TF-IDF

TF-IDF is the product of a word's Term Frequency and its Inverse Document Frequency.
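
A minimal sketch of stopword removal followed by TF-IDF scaling. Several TF-IDF variants exist; the one assumed here is the raw term count times log(N / document frequency). The third document in the toy corpus is made up for illustration.

```python
import math
from collections import Counter

STOPWORDS = {"the", "is", "and", "to", "he", "she", "it"}  # the list from the slide

def tokenize(text):
    """Lower-case, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    Assumed variant: raw term count times log(N / document frequency).
    """
    docs = [Counter(tokenize(text)) for text in corpus]
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in doc.items()}
        for doc in docs
    ]

corpus = ["India beat Pakistan", "India beat Australia", "Java is the programming language"]
weights = tf_idf(corpus)
print(weights[0])
```

In the first document, "pakistan" occurs in one document out of three and is scaled up, while "india" and "beat" occur in two and are scaled down.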

Text Mining – Moving Further

Document/Term Clustering

Given a large set, group similar entities

Text Classification

Given a document, find which topic it is about

Information Retrieval

Search engines

Information Extraction

Question Answering

Clustering (Descriptive Activity)

Activity: Group together similar documents

Techniques used

Partitioning

Hierarchical

Agglomerative

Divisive

Grid based

Model based

Clustering (contd.)

Partitioning

Divide the input data into k partitions

K-means, K-medoids

Hierarchical clustering

Agglomerative

Each data point is assumed to be a cluster representative

Keep merging similar clusters till we get a single cluster

Divisive

The opposite of agglomerative
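
As an illustration of the partitioning approach, here is a minimal k-means sketch (Lloyd's algorithm) run on the three document vectors seen earlier. Taking the first two documents as initial centroids is an assumption made to keep the run deterministic; real implementations usually pick the starting centroids at random.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, centroids, max_iterations=100):
    """Lloyd's algorithm; the initial centroids are supplied by the caller."""
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean(cluster) if cluster else centroids[i]
                   for i, cluster in enumerate(clusters)]
        if updated == centroids:   # nothing moved: converged
            break
        centroids = updated
    return centroids, clusters

docs = [
    [1, 1, 0, 0, 1, 0, 0],  # "Java Programming Language"
    [0, 0, 1, 1, 0, 1, 0],  # "India beat Pakistan"
    [0, 0, 1, 1, 0, 0, 1],  # "India beat Australia"
]
centroids, clusters = k_means(docs, centroids=[docs[0], docs[1]])
print(clusters)
```

With k = 2 the two cricket documents end up in one partition and the Java document in the other.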

“Frequent term-based text clustering”

Idea

Frequent terms carry more information about the “cluster” they might belong to

Highly correlated frequent terms probably belong to the same cluster

$D = \{D_1, \ldots, D_n\}$ – the set of documents

Each $D_j \subseteq T$, where $T$ is the set of all terms

Then candidate clusters are generated from $F = \{F_1, \ldots, F_k\}$, where each $F_i$ is a set of frequent terms which occur together.
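
A rough sketch of how the frequent term sets, and the documents each one covers, could be found. This brute-force enumeration is an assumption made for brevity and is not the algorithm from the paper; `min_support` and `max_size` are illustrative parameter names.

```python
from itertools import combinations

def frequent_term_sets(documents, min_support, max_size=2):
    """Return {term set: indices of the documents that contain every term in it}.

    A term set is frequent when at least min_support documents contain it.
    Brute-force enumeration, only suitable for tiny vocabularies.
    """
    terms = sorted(set().union(*documents))
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(terms, size):
            cover = [i for i, doc in enumerate(documents) if set(candidate) <= doc]
            if len(cover) >= min_support:
                frequent[frozenset(candidate)] = cover
    return frequent

# Each document D_j is represented as a set of terms (a subset of T).
documents = [
    {"java", "programming", "language"},
    {"india", "beat", "pakistan"},
    {"india", "beat", "australia"},
]
candidates = frequent_term_sets(documents, min_support=2)
for term_set, cover in candidates.items():
    print(sorted(term_set), cover)
```

Each frequent term set, together with the documents it covers, is a candidate cluster; here {india, beat} covers the two cricket documents.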

Classification

The problem statement

Given a set of documents, each with a label called the class label for that document

Given a classifier which learns from the above data set

For a new, unseen document, the classifier should be able to “predict” with a high degree of accuracy the correct class to which the new document belongs

Decision Tree Classifier

A tree

Each node represents some kind of an “evaluation” for an attribute of the data

Each edge, the decision taken

The evaluation at each node is some kind of an information gain measure

Reduction in entropy – more information gained

Entropy $E(x) = -\sum_i p_i \log_2 p_i$

$p_i$ represents the probability that the data belongs to class $i$

Each edge represents a choice for the value of the attribute the node represents

Good for text mining. But doesn’t scale
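
A minimal sketch of the entropy and information gain computation a decision tree node would perform. The four rows and their labels are hypothetical; each attribute records whether a word occurs in the document.

```python
import math
from collections import Counter

def entropy(labels):
    """E = -sum_i p_i * log2(p_i), with p_i the fraction of labels in class i."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 1 if the word occurs in the document, else 0.
rows = [
    {"java": 1, "beat": 0},
    {"java": 1, "beat": 1},
    {"java": 0, "beat": 1},
    {"java": 0, "beat": 0},
]
labels = ["programming", "programming", "sports", "sports"]

for attribute in ("java", "beat"):
    print(attribute, information_gain(rows, labels, attribute))
```

Splitting on "java" separates the classes completely (gain of one bit), while splitting on "beat" gains nothing, so the tree would test "java" first.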

Statistical (Bayesian) Classification

For a document-class data, we calculate the probabilities of occurrence of events

Bayes’ Theorem

$P(c \mid d) = \dfrac{P(c)\, P(d \mid c)}{P(d)}$

Given a document d, the probability that it belongs to a class c is given by the above formula.

In practice, the exact values of the probabilities of each event are unknown, and are estimated from the samples

Naïve Bayes Classification

Probability of the document event d

$P(d) = P(w_1, \ldots, w_n)$, where the $w_j$ are the words

The RHS is generally a headache. We have to consider the inter-dependence of each of the $w_j$ events

Naïve Bayes – Assume all the $w_j$ events are independent. The RHS expands to

$\prod_{j=1}^{n} P(w_j)$

Most of the Bayesian text classifiers work with this simplification
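
A minimal sketch of a Naïve Bayes text classifier under some assumptions. The independence assumption is applied to the class-conditional term P(d | c), so the score for a class is log P(c) plus the sum of log P(w_j | c); P(d) is the same for every class and is dropped. Add-one (Laplace) smoothing and the toy training set are assumptions, not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect the counts needed to estimate P(c) and P(w | c)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the class with the highest log P(c) + sum_j log P(w_j | c)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing (an assumption) avoids log(0).
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

train_docs = [["java", "programming", "language"],
              ["india", "beat", "pakistan"],
              ["india", "beat", "australia"]]
train_labels = ["programming", "sports", "sports"]
model = train_naive_bayes(train_docs, train_labels)

print(classify(["india", "beat", "england"], model))
print(classify(["java", "language"], model))
```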

Bayesian Belief Networks

This is an intermediate approach

Not all words are independent

“If java and program occur together, then boost the probability value of class computer programming”

“If java and indonesia occur together, then the document is more likely about some-other-class”

Problem?

How do we come up with correlations like the above?

Other classification techniques

Support Vector Machines

Find the best discriminant plane between two classes

k Nearest Neighbour

Association Rule Mining

Neural Networks

Case-based reasoning

An example – “Text Classification from labeled and unlabeled documents with Expectation Maximization”

Problem setting

Labeling documents is a manual process

A lot more unlabeled documents are available as compared to labeled documents

Unlabeled documents contain information which could help in the classification activity

An example (contd.)

Train a classifier with the labeled documents

Say, a Naïve Bayes classifier

This classifier estimates the model parameters (the prior probabilities of the various events)

Now, classify the unlabeled documents.

Assuming the applied labels to be correct, re-estimate the model parameters

Repeat the above step till convergence

Expectation Maximization

A useful technique for estimating hidden parameters

In the previous example, the class labels were missing from some documents

Consists of two steps

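E-step: Set $z^{(k+1)} = E[z \mid D; \theta^{(k)}]$

M-step: Set $\theta^{(k+1)} = \arg\max_{\theta} P(\theta \mid D; z^{(k+1)})$

Here $z$ stands for the hidden values (the missing class labels), $\theta$ for the model parameters, and $D$ for the data

The above steps are repeated till convergence, and convergence does occur, though possibly only to a local maximum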
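
The loop from the example can be sketched as follows. This is the hard-assignment variant described on the slide, where each guessed label is treated as correct; Nigam et al. instead weight each unlabeled document by its class probabilities. The Naïve Bayes details (add-one smoothing, alphabetical tie-breaking) and all the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def fit(documents, labels):
    """M-step: re-estimate class and word counts from the current labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {w for tokens in documents for w in tokens}
    return class_counts, word_counts, vocabulary

def predict(tokens, model):
    """E-step for one document: the most probable class under the current model."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())

    def log_posterior(label):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokens:
            if w in vocabulary:
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocabulary)))
        return score

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(class_counts), key=log_posterior)

def hard_em(labeled_docs, labels, unlabeled_docs, max_iterations=20):
    model = fit(labeled_docs, labels)       # train on the labeled documents only
    guesses = None
    for _ in range(max_iterations):
        new_guesses = [predict(doc, model) for doc in unlabeled_docs]
        if new_guesses == guesses:          # labels stopped changing: converged
            break
        guesses = new_guesses
        model = fit(labeled_docs + unlabeled_docs, labels + guesses)
    return model, guesses

labeled_docs = [["java", "programming"], ["india", "beat", "pakistan"]]
labels = ["programming", "sports"]
unlabeled_docs = [["java", "compiler"], ["programming", "compiler", "language"],
                  ["india", "cricket"], ["pakistan", "cricket", "australia"]]

model, guesses = hard_em(labeled_docs, labels, unlabeled_docs)
print(guesses)
print(predict(["cricket", "australia"], model))
```

The words "cricket" and "australia" never occur in a labeled document, so only the model re-estimated with the unlabeled documents has any evidence about them.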

Another example – “Fast and accurate Text Classification via Multiple Linear Discriminant Projections”

Contd.

Idea

Find a direction which maximizes the separation between classes.

Why?

Reduce “noise”, or rather

Enhance the differences between classes

The vector corresponding to this direction is Fisher's discriminant

Project the data-points onto this

For all data-points not separated by this vector, choose another direction

Contd.

Repeat till all data are now separable

Note, we are looking at a 2-class case. This easily extends to multiple classes

Project all the document vectors into the space that has these discriminant vectors as its basis vectors

Now, induce a decision tree on this projected representation

The number of attributes is highly reduced

Since this representation nicely “separates” the data points (documents), accuracy increases
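
A minimal sketch of a single Fisher discriminant direction for two classes. It uses the textbook closed form w = S_w^{-1} (m_a - m_b) on hypothetical 2-D points so that the matrix inverse can be written out by hand; the paper works in a high-dimensional term space and finds several such directions, which this sketch does not attempt.

```python
def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def scatter(points, centre):
    """2x2 within-class scatter matrix: sum of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """w = S_w^{-1} (m_a - m_b) for two classes of 2-D points."""
    m_a, m_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(point, w):
    """Scalar projection of a point onto the direction w."""
    return point[0] * w[0] + point[1] * w[1]

# Hypothetical 2-D data standing in for document vectors.
class_a = [[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]]
class_b = [[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]]

w = fisher_direction(class_a, class_b)
print(sorted(project(p, w) for p in class_a))
print(sorted(project(p, w) for p in class_b))
```

After projection every point is a single number, and on this toy data the two classes occupy disjoint intervals, so one threshold separates them.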

Web Text Mining

The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges

Apart from the text itself, this graph structure carries a lot of information about the “usefulness” of the “nodes”

For example

10 random, average people on the streets say Mr. T. Ache is a good dentist

5 reputed doctors, including dentists, recommend Mr. P. Killer as a better dentist

Who would you choose?

Kleinberg’s HITS

HITS – Hypertext Induced Topic Selection

Nodes on the web can be categorized into two types – hubs and authorities

Authorities are nodes which one refers to for definitive information about a topic

Hubs point to authorities

HITS computes the hub and authority scores on a sub-universe of the web

How does one collect this ‘sub-universe’?

HITS (contd.)

The basic steps

$A_u = \sum_{v \to u} H_v$, summed over all v pointing to u

$H_u = \sum_{u \to v} A_v$, summed over all v pointed to by u

Repeat the above till convergence

Nodes with high A scores are “relevant”

Relevant to what?

Can we use this for efficient retrieval for a query?
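
The basic steps can be sketched as below. The link graph is hypothetical, and rescaling the scores to unit length after each round is an assumption added so that the values stay bounded while the iteration settles.

```python
import math

def hits(links, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A_u = sum of H_v over all v pointing to u
        auth = {u: sum(hub[v] for v in nodes if u in links.get(v, ())) for u in nodes}
        # H_u = sum of A_v over all v pointed to by u
        hub = {u: sum(auth[v] for v in links.get(u, ())) for u in nodes}
        # Rescale to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical sub-universe: three hub pages pointing at two content pages.
links = {
    "hub1": ["page_a", "page_b"],
    "hub2": ["page_a", "page_b"],
    "hub3": ["page_a"],
}
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

"page_a" is pointed to by all three hubs and gets the top authority score; "hub1" and "hub2" point to both authorities and outrank "hub3" as hubs.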

PageRank

Similar to HITS, but all pages have only one score – a Rank

$R(u) = c \sum_{v \to u} R(v) / N_v$

The sum runs over the pages v linking to u, and $N_v$ is the number of links on page v. c is a scaling factor (< 1)

The higher the rank of pages linking to a page, the higher is its own rank!

To handle rank sinks (documents which do not link outside a set of pages), the formula is modified as

$R'(u) = c \sum_{v \to u} R'(v) / N_v + c\,E(u)$

E(u) is non-zero for some set of pages, which act as a rank source (what kind of pages?)
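
A minimal sketch of iterating the modified formula on a hypothetical four-page graph. Two choices are assumptions: E(u) is taken to be uniform over all pages, and the scaling factor c is applied by rescaling the ranks to sum to 1 after every round.

```python
def page_rank(links, source_weight=0.15, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {v for targets in links.values() for v in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    # E(u): a uniform rank source over all pages (an assumption for this sketch).
    source = {p: source_weight / len(pages) for p in pages}
    for _ in range(iterations):
        raw = dict(source)
        for v, targets in links.items():
            for u in targets:
                raw[u] += rank[v] / len(targets)   # R'(v) / N_v
        total = sum(raw.values())
        # The scaling factor c: rescale so the ranks sum to 1.
        rank = {p: r / total for p, r in raw.items()}
    return rank

links = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],   # c and d only link to each other: a rank sink
}
rank = page_rank(links)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Pages "a" and "b" have no incoming links, so all of their rank comes from the source term E; without it their rank would drain into the c/d sink and drop to zero.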

Some more topics which we haven’t touched

- Using external dictionaries
- WordNet
- Using language specific techniques
- Computational linguistics
- Use grammar for judging the “sense” of a query in the “information retrieval” scenario
- Other interesting techniques
- Latent Semantic Indexing
- Finding the latent information in documents using Linear Algebra Techniques

Some more comments

- Some “purists” do not consider most of the current activities in the text mining field as real text mining
- For example, see Marti Hearst’s write-up at Untangling Text Data Mining

Some more comments (contd.)

- One example that she mentions
- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability
- The above was inferred from a set of documents, with some human help

References

- Data Mining – Concepts and Techniques, by Jiawei Han and Micheline Kamber
- Principles of Data Mining, by David J. Hand et al
- Text Classification from Labeled and Unlabeled Documents using EM, by Kamal Nigam et al
- Fast and accurate text classification via multiple linear discriminant projections, by S. Chakrabarti et al
- Frequent Term-Based Text Clustering, by Florian Beil et al
- The PageRank Citation Ranking: Bringing Order to the Web, by Lawrence Page and Sergey Brin
- Untangling Text Data Mining, by Marti A. Hearst, http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
- And others …
