Data mining and machine learning
Presentation Transcript
Data Mining (and machine learning)

ROC curves

Rule Induction

Basics of Text Mining

David Corne, and Nick Taylor, Heriot-Watt University - [email protected]

These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html


Two classes is a common and special case

Medical applications: cancer, or not?

Computer Vision applications: landmine, or not?

Security applications: terrorist, or not?

Biotech applications: gene, or not?

… …

True Positive: these are ideal. E.g. we correctly detect cancer.

False Positive: to be minimised – causes a false alarm – it can be better to be safe than sorry, but can be very costly.

False Negative: also to be minimised – missing a landmine / cancer is very bad in many applications.

True Negative?: the remaining case – we correctly recognise a 'No'.


Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task

Sensitivity = TP/(TP+FN) – how many of the real 'Yes' cases are detected? How sensitive is the classifier to 'Yes' cases?

Specificity = TN/(FP+TN) – how many of the real 'No' cases are detected?


[Figure series: a scatter of YES / NO cases with a movable decision line. As the line moves, the trade-off changes: Sensitivity 100% / Specificity 25%; Sensitivity 93.8% / Specificity 50%; Sensitivity 81.3% / Specificity 83.3%; Sensitivity 56.3% / Specificity 100%.]


Sensitivity and Specificity: common measures of accuracy in this kind of 2-class task

Sensitivity = TP/(TP+FN) – how many of the real TRUE cases are detected? How sensitive is the classifier to TRUE cases? A highly sensitive test for cancer: if it says "NO", you can be sure it's "NO".

Specificity = TN/(TN+FP) – how sensitive is the classifier to the negative cases? A highly specific test for cancer: if it says "Y", you can be sure it's "Y".

With many trained classifiers, you can 'move the line' in this way. E.g. with Naive Bayes, we could use a threshold indicating how much higher the log likelihood for Y should be than for N.
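The two measures above are straightforward to compute from the four confusion-matrix counts. A minimal Python sketch (the counts in the example are made up for illustration):

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP/(TP+FN): fraction of real 'Yes' cases detected.
    Specificity = TN/(TN+FP): fraction of real 'No' cases detected."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical confusion counts: 15 TP, 3 FP, 1 FN, 9 TN
sens, spec = sensitivity_specificity(tp=15, fp=3, fn=1, tn=9)
# sens = 15/16 = 0.9375, spec = 9/12 = 0.75
```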


ROC curves
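The ROC plot itself is an image in the slides, but its construction follows directly from the definitions above: for each candidate threshold, predict 'Yes' when the classifier's score clears it, and record the true-positive rate (sensitivity) against the false-positive rate (1 − specificity). A minimal sketch, assuming the classifier outputs a numeric score per example:

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct threshold, predicting 'Yes'
    whenever score >= threshold."""
    pos = sum(labels)              # number of real 'Yes' cases
    neg = len(labels) - pos        # number of real 'No' cases
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        pts.append((fp / neg, tp / pos))
    return pts

roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
# -> [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

A classifier that ranks all the 'Yes' cases above all the 'No' cases, as here, traces the ideal curve through the top-left corner.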


Rule Induction

  • Rules are useful when you want to learn a clear / interpretable classifier, and are less worried about squeezing out as much accuracy as possible

  • There are a number of different ways to 'learn' rules or rulesets.

  • Before we go there, what is a rule / ruleset?


Rules

IF Condition … THEN Class Value is …


Rules are Rectangular

IF (X>0)&(X<5)&(Y>0.5)&(Y<5) THEN YES

[Figure: YES / NO points plotted on X (0–12) and Y (0–5) axes, with the rule drawn as a rectangle.]


Rules are Rectangular

IF (X>5)&(X<11)&(Y>4.5)&(Y<5.1) THEN NO

[Figure: the same scatter, with this rule drawn as a rectangle.]
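The two rectangle rules above translate directly into code; a minimal sketch:

```python
def rule_yes(x, y):
    """IF (X>0)&(X<5)&(Y>0.5)&(Y<5) THEN YES"""
    return 0 < x < 5 and 0.5 < y < 5

def rule_no(x, y):
    """IF (X>5)&(X<11)&(Y>4.5)&(Y<5.1) THEN NO"""
    return 5 < x < 11 and 4.5 < y < 5.1

rule_yes(2, 3)   # True: the point (2, 3) falls inside the YES rectangle
rule_yes(8, 3)   # False: outside the YES rectangle
```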


A Ruleset

IF Condition1 … THEN Class = A
IF Condition2 … THEN Class = A
IF Condition3 … THEN Class = B
IF Condition4 … THEN Class = C


What's wrong with this ruleset? (two things)

[Figure: the YES / NO scatter with a candidate set of rule rectangles.]


What about this ruleset?

[Figure: the same scatter with an alternative set of rule rectangles.]


Two ways to interpret a ruleset:

As a Decision List

IF Condition1 … THEN Class = A
ELSE IF Condition2 … THEN Class = A
ELSE IF Condition3 … THEN Class = B
ELSE IF Condition4 … THEN Class = C
ELSE … predict Majority Class


As an unordered set

IF Condition1 … THEN Class = A
IF Condition2 … THEN Class = A
IF Condition3 … THEN Class = B
IF Condition4 … THEN Class = C

Check each rule and gather votes for each class. If there is no winner, predict the majority class.
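Both readings can be sketched in a few lines. The `(condition, class)` pair representation and the toy rules below are my own illustration, not from the slides:

```python
from collections import Counter

# Toy rules as (condition, class) pairs -- illustrative only.
rules = [(lambda x: x > 0, "A"),
         (lambda x: x > 2, "A"),
         (lambda x: x > 1, "B")]

def as_decision_list(rules, x, majority="C"):
    """Decision-list reading: the first matching rule wins;
    fall through to the majority class."""
    for condition, label in rules:
        if condition(x):
            return label
    return majority

def as_unordered_set(rules, x, majority="C"):
    """Unordered reading: every matching rule casts a vote;
    no votes, or a tie, -> the majority class."""
    votes = Counter(label for condition, label in rules if condition(x))
    ranked = votes.most_common()
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return majority
    return ranked[0][0]

as_decision_list(rules, 3)    # 'A' (first rule matches)
as_unordered_set(rules, 3)    # 'A' (two votes for A, one for B)
as_unordered_set(rules, 1.5)  # 'C' (A and B tie 1-1, so majority class)
```

Note that the two readings can disagree: at x = 1.5 the decision list answers 'A' but the voting reading falls back to the majority class.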


Three broad ways to learn rulesets

1. Just build a decision tree with ID3 (or something else) – you can translate the tree into rules: each root-to-leaf path is one rule.


2. Use any good search/optimisation algorithm. Evolutionary (genetic) algorithms are the most common – you will do this in coursework 3. This means simply guessing a ruleset at random, and then trying mutations and variants, gradually improving them over time.


3. A number of 'old' AI algorithms exist that still work well, and/or can be engineered to work with an evolutionary algorithm. The basic idea is: iterated coverage.


[Figure sequence illustrating iterated coverage on a YES / NO scatter:]

Take each class in turn.

Pick a random member of that class in the training set.

Extend it as much as possible without including another class (repeated until the rule cannot grow further).

Next class – and so on…
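The 'extend as much as possible' step above can be sketched as greedy rectangle growth: starting from the seed point, push each side outwards by a fixed step, keeping an expansion only while no point of another class falls inside. This is a simplified illustration of the idea, not a specific published algorithm:

```python
def grow_rule(seed, other_class_points, step=1.0, max_rounds=20):
    """Grow a rectangle (x0, y0, x1, y1) around `seed`, one side at a time,
    keeping each expansion only if the rectangle stays 'pure'
    (contains no point of the other class)."""
    x0, y0, x1, y1 = seed[0], seed[1], seed[0], seed[1]

    def pure(r):
        return not any(r[0] <= px <= r[2] and r[1] <= py <= r[3]
                       for px, py in other_class_points)

    for _ in range(max_rounds):
        grew = False
        # Try moving each side outwards in turn: left, down, right, up.
        for dx0, dy0, dx1, dy1 in ((-step, 0, 0, 0), (0, -step, 0, 0),
                                   (0, 0, step, 0), (0, 0, 0, step)):
            candidate = (x0 + dx0, y0 + dy0, x1 + dx1, y1 + dy1)
            if pure(candidate):
                x0, y0, x1, y1 = candidate
                grew = True
        if not grew:
            break
    return x0, y0, x1, y1

# Seed at (2, 2); a single other-class point at (5, 3) blocks growth rightwards.
rect = grow_rule((2, 2), [(5, 3)], step=1.0, max_rounds=3)
```

A full iterated-coverage learner would wrap this in a loop: grow a rule, remove the training points it covers, and repeat per class.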


Text as Data: the basics

So, the most frequent words in a document carry the most useful information … ?

[Figure: word frequencies for documents about desktops, laptops and LED-TVs – which is which?]

Some motivations for text mining

  • Start your own company

    http://www.text-analytics.com/comp.html

  • Recommendations

    • "if you like that, you might also like these …"

    • On Amazon, or any general product sales site, this can be based on distances between (e.g.) 200-word summaries or the ToC of a book, or text that describes a product in a catalogue

  • Document classification

  • Coping with information overload

  • Sentiment analysis … hotel reviews / product reviews



A one-slide text-mining tutorial

Convert documents to numbers:

an article about politics → (0.1, 0.2, 0, 0.02 …)
another article about politics → (0.11, 0.3, 0, 0.01 …)
an essay about sport → (0.4, 0, 0.1, 0 …)

NOW you can do clustering, retrieving similar documents, supervised classification, etc.

Vectors based on word frequencies. One key issue is to choose the right set of words (or other features).


How did I get these vectors from these two `documents'?

Document 1:
<h1> Compilers</h1>
<p> The Guardian uses several compilers for its daily cryptic crosswords. One of the most frequently used is Araucaria, and one of the most difficult is Bunthorne.</p>

Document 2:
<h1> Compilers: lecture 1 </h1>
<p> This lecture will introduce the concept of lexical analysis, in which the source code is scanned to reveal the basic tokens it contains. For this, we will need the concept of regular expressions (r.e.s).</p>

Vectors: (26, 2, 2) and (35, 2, 0)


What about these two vectors?

(1, 1, 1, 0, 0, 0) and (0, 0, 0, 1, 1, 1)

(for the same two documents as before)

From this MASTER WORD LIST (ordered):

(Crossword, Cryptic, Difficult, Expression, Lexical, Token)

If a document contains `crossword', it gets a 1 in position 1 of the vector, otherwise 0. If it contains `lexical', it gets a 1 in position 5, otherwise 0, and so on.

How similar would the vectors be for two docs about crossword compilers?
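The master-list encoding above can be sketched in a few lines. This naive version matches exact lower-cased words, with no stemming ('crosswords' would need reducing to 'crossword' first):

```python
def binary_vector(text, master_list):
    """1 in position i if the i-th master-list term occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in master_list]

master = ["crossword", "cryptic", "difficult", "expression", "lexical", "token"]
binary_vector("one of the most difficult cryptic crossword compilers", master)
# -> [1, 1, 1, 0, 0, 0]
```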


Turning a document into a vector

We start with a template for the vector, which needs a master list of terms. A term can be a word, or a number, or anything that appears frequently in documents.

There are almost 200,000 words in English – it would take much too long to process document vectors of that length. Commonly, vectors are made from a small number (50–1000) of the most frequently-occurring words.

However, the master list usually does not include words from a stoplist, which contains words such as the, and, there, which, etc. … why?
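Building such a master list can be sketched as: count words across the collection, drop stoplist words, keep the top few hundred. The stoplist below is a tiny illustrative sample, not a real one:

```python
from collections import Counter

# Illustrative stoplist fragment -- real stoplists run to hundreds of words.
STOPLIST = {"the", "and", "there", "which", "a", "of", "is", "in", "to"}

def build_master_list(documents, size=500):
    """Most frequently-occurring words in the collection, minus stoplist words."""
    counts = Counter(word for doc in documents for word in doc.lower().split()
                     if word not in STOPLIST)
    return [word for word, _ in counts.most_common(size)]

build_master_list(["the cat sat", "the cat ran", "a dog ran"], size=2)
```

Stoplist words are dropped because they occur in virtually every document, so (as the next slides explain) they carry no discriminatory value.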


The TFIDF Encoding (Term Frequency x Inverse Document Frequency)

A term is a word, or some other frequently occurring item.

Given some term i, and a document j, the term count is the number of times that term i occurs in document j.

Given a collection of k terms and a set D of documents, the term frequency is:

tf(i, j) = (frequency of this word in this doc) / (total number of words in this doc)

… considering only the terms of interest, this is the proportion of document j that is made up from term i.


Term frequency is a measure of the importance of this term in this document.

Inverse document frequency (which we see next) is a measure of the discriminatory value of the term in the collection of documents we are looking at. It is a measure of the rarity of this word in this document collection.

E.g. high term frequency for "apple" means that apple is an important word in a specific document. But high document frequency (low inverse document frequency) for "apple", given a particular set of documents, means that apple does not carry much useful information, since it is in all of the documents.

Inverse document frequency of term i is:

idf(i) = log( number of documents in the master collection / number of those documents that contain the term )


TFIDF encoding of a document

So, given:

- a background collection of documents (e.g. 100,000 random web pages, all the articles we can find about cancer, 100 student essays submitted as coursework …)

- a specific ordered list (possibly large) of terms

we can encode any document as a vector of TFIDF numbers, where the ith entry in the vector for document j is: tf(i, j) × idf(i)


Turning a document into a vector

Suppose our Master List is: (banana, cat, dog, fish, read)

Suppose document 1 contains only: "Bananas are grown in hot countries, and cats like bananas."

And suppose the background frequencies of these words in a large random collection of documents is (0.2, 0.1, 0.05, 0.05, 0.2).

The document 1 vector entry for word w is: freqindoc(w) × log( 1 / freq_in_bg(w) )

This is just a rephrasing of TFIDF, where freqindoc(w) is the frequency of w in document 1, and freq_in_bg(w) is the `background' frequency in our reference set of documents.


Turning a document into a vector

Master list: (banana, cat, dog, fish, read)
Background frequencies: (0.2, 0.1, 0.05, 0.05, 0.2)
Document 1: "Bananas are grown in hot countries, and cats like bananas."

Frequencies are proportions. The background frequency of banana is 0.2, meaning that 20% of documents in general contain `banana', or bananas, etc. (note that read includes reads, reading, reader, etc. …)

The frequency of banana in document 1 is also 0.2 – why?

The TFIDF encoding of this document is: (0.464, 0.332, 0, 0, 0)

Suppose another document has exactly the same vector – will it be the same document?
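This worked example can be reproduced with a short script. The slides don't state the logarithm base; base 2 is an assumption here, chosen because it yields exactly 0.464 and 0.332. The tokens below are pre-stemmed by hand ('Bananas' → 'banana', 'cats' → 'cat'):

```python
import math

def tfidf_vector(tokens, master_list, bg_freq):
    """Entry i = (freq of term i in this doc) * log2(1 / background freq of term i).
    Base-2 log is an assumption that reproduces the slide's numbers."""
    n = len(tokens)
    return [(tokens.count(term) / n) * math.log2(1.0 / bg)
            for term, bg in zip(master_list, bg_freq)]

# Document 1, tokenised and stemmed by hand: 10 words, 'banana' twice, 'cat' once.
tokens = ["banana", "are", "grown", "in", "hot",
          "countries", "and", "cat", "like", "banana"]
vec = tfidf_vector(tokens, ["banana", "cat", "dog", "fish", "read"],
                   [0.2, 0.1, 0.05, 0.05, 0.2])
# rounded: [0.464, 0.332, 0.0, 0.0, 0.0] -- matching the slide
```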


Vector representation of documents underpins:

Many areas of automated document analysis, such as: automated classification of documents

Clustering and organising document collections

Building maps of the web, and of different web communities

Understanding the interactions between different scientific communities, which in turn will lead to helping with automated WWW-based scientific discovery.


Example / recent work of my PhD student Hamouda Chantar


Three datasets / classification / main issue: Feature Selection


Hamouda's work

Focus on automated classification of an article (e.g. Finance, Economics, Sport, Culture, …)

Emphasis on Feature Selection – which words or other features should constitute the vectors, to enable accurate classification?


Example categories: this is the Akhbar-Alkhaleej dataset


We look at 3 pre-classified datasets

Akhbar-Alkhaleej: 5690 Arabic news documents gathered evenly from the online newspaper "Akhbar-Alkhaleej"

Alwatan: 20,291 Arabic news documents gathered from the online newspaper "Alwatan"

Al-Jazeera-News: 1500 documents from the Al-Jazeera news site.


is.gd/arabdata


We look at 3 classification methods (when evaluating feature subsets on the test set)

C4.5: well-known decision tree classifier; we use Weka's implementation, "J48"

Naive Bayes: it's Naive, and it's Bayes

SVM: with a linear kernel


Results: Alwatan dataset

[Results omitted]

Results on Al Jazeera dataset

[Results omitted]



tara