## INFO 422 Knowledge-Based Systems


- Objectives, Prerequisite and Content
- Brief Introduction to Lectures
- Discussion and Conclusion

Objectives

This course provides:

- fundamental techniques of knowledge discovery and data mining (KDD)
- issues in KDD practical use and tools
- case-studies of KDD application

Prerequisite

No special prerequisites, but the following are expected:

- experience of computer use
- basic knowledge of databases, statistics, and mathematics
- programming skills

Content

- Overview of KDD
- Mining association rules
- Mining action rules
- Decision tree induction
- Distributed knowledge systems and distributed query answering
- Cluster analysis


Brief introduction to lectures

Overview of KDD

1. What is KDD and Why?

2. The KDD Process

3. KDD Applications

4. Data Mining Methods

5. Challenges for KDD

KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers.

Then we run data mining algorithms.

What is the knowledge? How do we represent and use it?

We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.

Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.

Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.

Knowledge can be considered data at a high level of abstraction and generalization.

Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes

...

10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS

12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS

...

Legend: numerical attribute, categorical attribute, missing values, class labels
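The distinction between numerical attributes, categorical attributes, and missing values can be sketched in a few lines of Python. This is only an illustration: the shortened record is made up from values that appear in the rows above, and treating `?` and empty fields as missing is an assumption (the slides do not say how missing values are encoded). The helper name `field_kind` is hypothetical.

```python
def field_kind(raw: str) -> str:
    """Label one raw field as 'missing', 'numerical', or 'categorical'."""
    value = raw.strip()
    # Assumption: "?" and empty fields mark missing values.
    if value in ("", "?"):
        return "missing"
    try:
        float(value)
        return "numerical"
    except ValueError:
        return "categorical"

# Hypothetical shortened record, assembled from values seen in the rows above.
record = "16, M, SUBACUTE, 38, ?, ,negative, VIRUS"
kinds = [field_kind(field) for field in record.split(",")]
print(kinds)
```

In a real data set the attribute types would normally come from a schema rather than from guessing per field; the sketch only shows why the three kinds need different handling.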

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15

THEN Prediction = VIRUS [87.5%]

[confidence, predictive accuracy]
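To make the bracketed figure concrete, here is a minimal sketch of how the confidence of such a rule could be measured: among the records that satisfy the IF part, the fraction whose class matches the THEN part. The four records are invented toy data, not Dr. Tsumoto's, and the function name `rule_confidence` is hypothetical.

```python
def rule_confidence(records, condition, predicted_class, class_key="Prediction"):
    """Fraction of records covered by the condition that have the predicted class."""
    covered = [r for r in records if condition(r)]
    if not covered:
        return 0.0
    correct = sum(1 for r in covered if r[class_key] == predicted_class)
    return correct / len(covered)

def antecedent(r):
    # IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
    return (r["cell_poly"] <= 220 and r["Risk"] == "n"
            and r["Loc_dat"] == "+" and r["Nausea"] > 15)

# Invented toy records, for illustration only.
records = [
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Prediction": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "Prediction": "VIRUS"},
    {"cell_poly": 150, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "Prediction": "BACTERIA"},
    {"cell_poly": 500, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "Prediction": "BACTERIA"},
]

confidence = rule_confidence(records, antecedent, "VIRUS")
print(f"confidence = {confidence:.1%}")
```

Three of the toy records satisfy the condition and two of them are labelled VIRUS, so the rule's confidence on this toy set is two thirds.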

How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.

People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.

[Figure labels: "?", knowledge base, inference engine]

Raw data is rarely of direct benefit.

Its true value depends on the ability to extract information useful for decision support.

Tradition: via knowledge engineers

Impractical Manual Data Analysis

New trend: via automatic programs

Benefits of Knowledge Discovery

[Figure labels: Value, Volume, EDP, MIS, DSS, Generate, Disseminate, Rapid Response]

EDP: Electronic Data Processing

MIS: Management Information Systems

DSS: Decision Support Systems

2. The KDD Process

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Piatetsky-Shapiro, Smyth (1996)

- valid: justified patterns/models
- novel: previously unknown
- useful: can be used
- understandable: by human and machine

The Knowledge Discovery Process

1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use

Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.

KDD is inherently interactive and iterative.

[Figure: activities within the five steps of the process]

- Data organized by function
- Create/select target database
- Data warehousing
- Select sampling technique and sample data
- Supply missing values
- Eliminate noisy data
- Normalize values
- Transform values
- Create derived attributes
- Find important attributes & value ranges
- Select DM task(s)
- Select DM method(s)
- Extract knowledge
- Test knowledge
- Refine knowledge
- Query & report generation
- Aggregation & sequences
- Advanced methods
- Transform to different representation
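Three of the preprocessing activities named above (supplying missing values, normalizing values, creating derived attributes) can be sketched as follows. The slides do not prescribe any particular technique, so mean imputation and min-max scaling are assumptions chosen for simplicity. The temperature values and the 38.0 threshold are likewise made up for illustration.

```python
from statistics import mean

def impute_mean(values):
    """Supply missing values (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up body temperatures with one missing entry.
temps = [37.0, None, 39.0, 38.0]

filled = impute_mean(temps)
normalized = min_max_normalize(filled)
# Derived attribute: a boolean flag computed from an existing attribute.
has_fever = [t >= 38.0 for t in filled]

print(filled, normalized, has_fever)
```

Which imputation or scaling method is appropriate depends on the data and on the data mining method chosen later in the process.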

Main Contributing Areas of KDD

- Statistics: infer info from data (deduction & induction, mainly numeric data)
- Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
- Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

3. KDD Applications

Business information

- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.

Manufacturing information

- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.

Scientific information

- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.

Personal information

KDD: Opportunity and Challenges

- Competitive pressure
- Data rich, knowledge poor (the resource)
- Data mining technology mature
- Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area

KDD workshops: since 1989.

International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997.

ECML’04/PKDD’04 (in Pisa, Italy)

Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …

About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.

JAPAN: FGCS Project (logic programming and reasoning).

“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

4. Data Mining Methods

- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Dependency modeling: finding a model which describes significant dependencies between variables.
- Regression: mapping a data item to a real-valued prediction variable.
- Deviation and change detection: discovering the most significant changes in the data.
- Summarization: finding a compact description for a subset of data.
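As one concrete instance of the regression task, the sketch below fits a straight line to a handful of points and uses it to map a new data item to a real-valued prediction. Ordinary least squares is used here as an assumption; the slides do not name a specific regression method, and the data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # real-valued prediction for a new item
print(slope, intercept, prediction)
```

Classification differs only in the kind of output: it maps a data item to one of several predefined classes instead of to a real number.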

“What factors determine cancerous cells?”

[Figure: examples (cancerous cell data) are fed to a data mining algorithm, here a classification algorithm, which produces general patterns]

Classification algorithms:

- Rule Induction
- Decision tree
- Neural Network

Classification: Rule Induction

“What factors determine a cell is cancerous?”

If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
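The two rules above can be applied to new cells with a very small rule-based classifier. This sketch covers only how induced rules are used; it does not show how a rule induction algorithm finds them. The attribute names in lowercase, the `RULES` structure, and the first-match policy are illustrative assumptions.

```python
# Each rule: (conditions that must all hold, predicted class, certainty).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "Healthy Cell", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "Cancerous Cell", 0.87),
]

def classify(cell, rules=RULES):
    """Return (class, certainty) of the first rule whose conditions all match."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell

print(classify({"color": "dark", "tails": 2, "nuclei": 2}))
print(classify({"color": "dark", "tails": 1, "nuclei": 1}))
```

A cell that matches no rule is left unclassified, which is one reason rule sets are usually paired with a default class in practice.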

Classification: Decision Trees

[Figure: decision tree that tests Color (dark / light), then #nuclei (1 / 2), then #tails (1 / 2), with leaves labelled healthy or cancerous]
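A decision tree of this shape can be stored as nested nodes and evaluated by walking from the root to a leaf. The exact assignment of leaves to branches is not recoverable from the transcript, so the tree below is one plausible reading and should be treated as an assumption. It was chosen to agree with the two rules shown under rule induction.

```python
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
# Assumed layout: the original figure's branch-to-leaf mapping is not recoverable.
TREE = ("color", {
    "dark": ("nuclei", {
        1: "cancerous",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
    "light": ("nuclei", {
        1: "healthy",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
})

def classify(tree, cell):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[cell[attribute]]
    return node

print(classify(TREE, {"color": "dark", "nuclei": 2, "tails": 2}))
```

Each root-to-leaf path reads as one IF-THEN rule, which is why decision trees and rule sets are often interchangeable representations.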

Classification: Neural Networks

“What factors determine a cell is cancerous?”

[Figure: neural network with inputs such as Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]
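The forward pass of such a network can be sketched with plain Python. To keep it short, this is a single-layer network with a softmax output, which is a simplification (the slide does not show the architecture). The weights and biases are invented rather than learned, and the three binary inputs encode only the attribute values named in the figure.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, weights, biases):
    """One weighted sum per output unit, followed by softmax."""
    scores = [sum(w * x for w, x in zip(row, inputs)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Inputs: [color is dark, #nuclei is 1, #tails is 2], each 0.0 or 1.0.
# Invented weights: row 0 scores "Healthy", row 1 scores "Cancerous".
WEIGHTS = [[-1.0, 0.5, -1.5],
           [1.0, -0.5, 1.5]]
BIASES = [0.5, -0.5]

probs = forward([1.0, 1.0, 1.0], WEIGHTS, BIASES)
print(dict(zip(["Healthy", "Cancerous"], probs)))
```

Unlike the rules and the tree, the learned weights do not read as an explanation of which factors matter, which bears on the "understandable" requirement in the KDD definition.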
