Integrated Data Mining Systems

Integrated Data Mining Systems, by Wei-Min Shen (Information Sciences Institute, University of Southern California).


Integrated Data Mining Systems

Wei-Min Shen

Information Sciences Institute

University of Southern California

UCLA Data Mining Short Course (3)

Outline
  • Objectives for Integrated System
  • System Architecture
  • Necessary Capabilities
  • Representation Languages
  • Actual System Descriptions

Objectives for Integrated KDD Systems
  • Carry out the entire KDD process
    • Data selection
    • Data preprocessing
    • Data transformation
    • Data mining
    • Interpretation and evaluation
  • Coherently integrate complementary techniques
  • Amplify human capabilities (e.g., let people see a lot of data at once)
  • Allow humans to control the KDD process

System Architecture
  • Necessary elements
    • Access to existing data sets or databases
    • Representation and storage of knowledge
    • Basic data mining techniques
      • Deduction
      • Induction
      • Visualization
      • Use of human guidance

Deduction
  • A rigid inference procedure from the general to the specific
    • “All computers have a CPU” and “X is a computer” imply “X has a CPU”
  • Seek evidence for a general hypothesis
    • “Maybe all computers have a CPU”
    • “Check how many computers in my database have a CPU”
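
The two uses of deduction above can be sketched in a few lines of Python. This is only an illustration: the records and field names (`kind`, `has_cpu`) are hypothetical and do not come from any system described in the slides.

```python
# Hypothetical inventory records; the field names are illustrative only.
machines = [
    {"name": "a", "kind": "computer", "has_cpu": True},
    {"name": "b", "kind": "computer", "has_cpu": True},
    {"name": "c", "kind": "printer", "has_cpu": False},
    {"name": "d", "kind": "computer", "has_cpu": True},
]


def deduce_has_cpu(record):
    """Rigid deduction: all computers have a CPU, so any computer has one."""
    return record["kind"] == "computer"


def evidence_for_hypothesis(records):
    """Count how many computers in the data actually have a CPU."""
    computers = [r for r in records if r["kind"] == "computer"]
    supporting = sum(1 for r in computers if r["has_cpu"])
    return supporting, len(computers)


supporting, total = evidence_for_hypothesis(machines)
print(f"{supporting} of {total} computers have a CPU")
```

The first function applies a rule that is already trusted; the second gathers evidence for a rule that is still a hypothesis.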

Induction
  • A “not so rigid” inference procedure from the specific to the general
    • “I drove yesterday,” “you drove yesterday,” “he drove yesterday,” ...
    • therefore “everyone drove yesterday”
  • Seek general patterns in data
  • There are many popular induction methods
    • Decision trees, rules and lists, neural networks (NN), inductive logic programming (ILP), ...

Visualization
  • Allow humans to see very large amounts of data in one visual field
  • Provide clues for abstractions by humans

The Use of Human Guidance
  • The need for human guidance
    • Large amount of data
    • Large search space for possible patterns
    • Machines do not yet have human intuition
  • How to encode human knowledge into data mining process?

Representation
  • Languages for data access and manipulation
    • SQL, Datalog, LDL++, Cobol, C++, …
  • Languages for representing knowledge
    • Prolog, LDL++, Loom, …
  • Prefer languages that serve multiple purposes

Examples of Integrated Systems
  • IBM’s Intelligent Miner, Advanced Scout
  • Recon
  • DBMiner
  • DataCrystal
  • many more

Advanced Scout
  • A system that helps NBA coaches to find and use patterns hidden in historical game data
  • Example patterns
    • “Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots”
  • Widely used by many NBA teams, and coaches say that “it is written with coach in mind”

Recon
  • Inputs: Relational databases
  • Outputs: Rule-based models
  • Integrate induction, deduction, visualization

Recon Architecture

[Figure: Recon architecture, showing the Graphical User Interface, Command Module, Rule Induction, Deductive Database, Visualization, Knowledge Repository, Target DB, Recon Server, and External DB]

Recon Visualization
  • Obtain a global view of a data set
    • a view of tables and columns
  • Notice important phenomena that hold on subsets of the data
    • Clusters
    • Trends
    • Correlations

Recon Deductive Database
  • Define concepts
    • high-growth:
      • earnings-per-share-growth>50% and dividend-growth>50%
  • Allow new concepts to be defined on the existing ones
  • Effect: prepare subsets of data for further analysis
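
As a rough illustration of concept definition, the sketch below expresses high-growth as a Python predicate and builds a second concept on top of it. The records, field names, and the derived concept `high_growth_high_dividend` (with its 80% threshold) are hypothetical; Recon's own concept language is not shown in the slides.

```python
# Hypothetical stock records; growth figures are fractions (0.60 means 60%).
stocks = [
    {"ticker": "AAA", "eps_growth": 0.60, "dividend_growth": 0.55},
    {"ticker": "BBB", "eps_growth": 0.70, "dividend_growth": 0.10},
    {"ticker": "CCC", "eps_growth": 0.52, "dividend_growth": 0.90},
]


def high_growth(stock):
    """The concept from the slide: both growth figures exceed 50%."""
    return stock["eps_growth"] > 0.50 and stock["dividend_growth"] > 0.50


def high_growth_high_dividend(stock):
    """A new concept defined on an existing one (threshold is illustrative)."""
    return high_growth(stock) and stock["dividend_growth"] > 0.80


def select(concept, records):
    """Prepare the subset of records that satisfies a concept."""
    return [r["ticker"] for r in records if concept(r)]


print(select(high_growth, stocks))
print(select(high_growth_high_dividend, stocks))
```

Each concept acts as a reusable filter, which is how concept definitions end up preparing subsets of data for later analysis.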

Recon Rule Induction
  • The user defines target concepts
  • Recon learns a set of rules for the target concepts
  • It has heuristics for modifying existing rules
  • Example:
    • If a stock is high-growth at time t, then its return on investment two quarters later will be greater than 20%

DBMiner Architecture

[Figure: DBMiner architecture, showing the Graphical User Interface, SQL Server, Discovery Module, Concept Hierarchy, and Database]

DBMiner Functionalities
  • Inputs: Databases and Concept Hierarchy
  • Outputs:
    • Characteristic rules (hypothesis => evidence)
    • Discriminant rules (evidence => hypothesis)
    • Multi-level association rules

UCLA Data Mining Short Course (3)

dbminer key idea
DBMiner Key Idea
  • Attribute-Oriented Induction
    • Organize values of each attribute into a hierarchy of concepts
    • Perform rule induction at certain “prime” level in the hierarchies
  • learn rules at a
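
A minimal sketch of the generalization step in attribute-oriented induction follows, assuming a concept hierarchy stored as a child-to-parent mapping. The hierarchy, the rows, and the function names are hypothetical and are not taken from DBMiner itself.

```python
from collections import Counter

# Hypothetical concept hierarchy for a "city" attribute: value -> parent concept.
CITY_HIERARCHY = {
    "Los Angeles": "California",
    "San Diego": "California",
    "Seattle": "Washington",
    "California": "West Coast",
    "Washington": "West Coast",
}


def generalize(value, hierarchy, levels):
    """Climb the hierarchy the given number of levels (stop at the top)."""
    for _ in range(levels):
        value = hierarchy.get(value, value)
    return value


def attribute_oriented_induction(rows, attribute, hierarchy, levels):
    """Replace each value by its ancestor, then merge identical tuples."""
    counts = Counter()
    for row in rows:
        lifted = dict(row)
        lifted[attribute] = generalize(row[attribute], hierarchy, levels)
        counts[tuple(sorted(lifted.items()))] += 1
    return counts


rows = [
    {"city": "Los Angeles", "major": "CS"},
    {"city": "San Diego", "major": "CS"},
    {"city": "Seattle", "major": "Math"},
]
for generalized, count in attribute_oriented_induction(
        rows, "city", CITY_HIERARCHY, levels=1).items():
    print(dict(generalized), count)
```

Raising `levels` merges more tuples; choosing the level at which to stop corresponds to picking the level in the hierarchy at which rules are induced.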

DataCrystal (KnowledgeMiner)
  • A common-representation language
    • “Metapatterns”
  • An integrated, efficient search engine
    • “The Discovery Loop”

Metapatterns
  • Specifications for type and form of pattern
  • An example of metapattern

P(X,Y) & Q(Y,Z) => R(X,Z)

  • Examples of discovered patterns

citizen(X,Y) & officialLanguage(Y,Z) => speaks(X,Z) [0.98]

parent(X,Y) & ancestor(Y,Z) => ancestor(X,Z) [0.99]

  • Other Metapatterns

Ingredients(X, a, b) & Property(X,Y) => Cluster(Y)

connects(C,D) & Feature(C,X) & Feature(D,Y) => eql(X,Y)
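
To make instantiation concrete, the sketch below binds the transitive metapattern P(X,Y) & Q(Y,Z) => R(X,Z) to three small relations and measures how often the right-hand side holds. The relations are invented for illustration, and the score is a plain fraction, which is an assumption here rather than the strength measure used by the system.

```python
def instantiate(p, q, r):
    """Evaluate P(X,Y) & Q(Y,Z) => R(X,Z) on relations given as sets of pairs.

    Returns (lhs_count, rhs_count, fraction of LHS bindings where R holds).
    """
    # Join P and Q on the shared variable Y, keeping the (X, Z) bindings.
    lhs = {(x, z) for (x, y1) in p for (y2, z) in q if y1 == y2}
    rhs = lhs & r
    fraction = len(rhs) / len(lhs) if lhs else 0.0
    return len(lhs), len(rhs), fraction


# Hypothetical facts, not taken from any real knowledge base.
citizen = {("ann", "france"), ("bob", "france"), ("eve", "japan")}
official_language = {("france", "french"), ("japan", "japanese")}
speaks = {("ann", "french"), ("eve", "japanese")}

lhs_count, rhs_count, fraction = instantiate(citizen, official_language, speaks)
print(lhs_count, rhs_count, round(fraction, 2))
```

Swapping in other relations for P, Q, and R reuses the same search, which is the appeal of stating the pattern form once as a metapattern.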

The Discovery Loop

[Figure: the discovery loop, with components KnowledgeBase (discovered patterns, e.g., citizen(X,Y) & officialLanguage(Y,Z) => speaks(X,Z)), Metapattern Generator (metapatterns, e.g., P(X,Y) & Q(Y,Z) => R(X,Z)), Inductive Actions (computeStrength, supervised learning, clustering, case-based reasoning, regression analysis, visualization), Deductive DB (queries, data), and DBs]

DataCrystal Applications
  • Discover common-sense regularities from a large knowledge base (MCC)
    • goodStudent(X,Y), taughtBy(Y,Z) => likedBy(X,Z) [0.99]
  • Find circuit patterns from a telecommunication database (Bellcore)
    • connect(X,’cab’,Y,’ept’), endLoc(X,U), loc(Y,V) => eql(U,V) [0.98]
  • Build prediction models from a chemical research database (Eastman Chemical)
    • percentage(X,’g306’,Y), density(X,W) => F35(Y,W)
  • Construct fault-detection rules from a semiconductor manufacturing control database (Motorola)
    • receipt(W,2), p41(W,Y), time(W,179) => allowedVariance(0.9,3.4)

Metapattern Generation
  • Metapatterns are hard to design
    • A time-consuming interactive process
  • Challenges
    • No pre-labeled examples
    • No pre-specified concepts
    • Mostly relational concepts
    • Unsupervised Learning of relational patterns
  • So we need to generate metapatterns automatically

The Algorithm
  • Inputs: schema, value ranges, thresholds, and domain knowledge (optional)
  • Outputs: relational patterns
  • Three main steps
    • Step 1. Find connections among tables
      • relational patterns can only be found among connected tables
    • Step 2. Generate transitive metapatterns
      • transitive patterns constitute a very interesting subset of relational patterns (implication, inheritance, transfer through, functional dependency)
    • Step 3. Generate other metapatterns based on previous metapatterns

Step 1. Find connections
  • Identify columns that are significantly connected
    • two columns are significantly connected if they have the same type and their ranges overlap significantly
    • domain knowledge can be used here for
      • eliminating unnecessary connections (e.g., length, width)
      • establishing syntactically different connections (e.g., color, frequency)
  • Construct the significant connection table (SCT)
    • a reference name is created for each connected pair
    • the reference names and the table names are used as rows and columns of the SCT

An Abstract DB Example

Schema and value ranges

T1: C11 char(2), C12 integer[1-9], C13 float[0.1-0.9]

T2: C21 integer[11-19], C22 float[0.1-0.9], C23 char(3)

T3: C31 integer[11-19], C32 char(2)

T4: C41 char(3), C42 float[0.0-0.1], C43 integer[1-9]


Abstract DB Data Tables

[Figure: data tables T1, T2, T3, and T4]

DB Example (continued)

Significant Connection Table

      T1    T2    T3    T4
X1    C13   C22   -     -
X2    C11   -     C32   -
X3    C12   -     -     C43
X4    -     C21   C31   -
X5    -     C23   -     C41
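
The connections in this table can be derived from the schema of the abstract example with a short sketch. Several details are assumptions: overlap is measured as the shared range divided by the shorter of the two ranges, the 0.5 threshold is arbitrary, and character columns (which have no range) are connected whenever their types match exactly.

```python
from itertools import combinations

# Schema of the abstract example: (table, column, type, value range or None).
SCHEMA = [
    ("T1", "C11", "char(2)", None),
    ("T1", "C12", "integer", (1, 9)),
    ("T1", "C13", "float", (0.1, 0.9)),
    ("T2", "C21", "integer", (11, 19)),
    ("T2", "C22", "float", (0.1, 0.9)),
    ("T2", "C23", "char(3)", None),
    ("T3", "C31", "integer", (11, 19)),
    ("T3", "C32", "char(2)", None),
    ("T4", "C41", "char(3)", None),
    ("T4", "C42", "float", (0.0, 0.1)),
    ("T4", "C43", "integer", (1, 9)),
]


def overlap_ratio(a, b):
    """Shared length of two ranges relative to the shorter one (assumption)."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return max(0.0, hi - lo) / shorter if shorter > 0 else 0.0


def significant_connections(schema, threshold=0.5):
    """Pair up columns of different tables with equal type and enough overlap."""
    pairs = []
    for (t1, c1, ty1, r1), (t2, c2, ty2, r2) in combinations(schema, 2):
        if t1 == t2 or ty1 != ty2:
            continue
        # Rangeless (char) columns connect on type alone: an assumption.
        if r1 is None or r2 is None or overlap_ratio(r1, r2) >= threshold:
            pairs.append(((t1, c1), (t2, c2)))
    return pairs


for number, (left, right) in enumerate(significant_connections(SCHEMA), 1):
    print(f"pair {number}: {left[0]}.{left[1]} <-> {right[0]}.{right[1]}")
```

Under these assumptions the sketch finds the same five column pairs as the table, though not in the same order, so the printed pair numbers do not match the reference names X1 to X5.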

Step 2: Generate Metapatterns
  • Convert SCT to a graph G
  • Find all predicate cycles in G
  • Generate the complete set of transitive metapatterns
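
One possible reading of these steps is sketched below: reference names become graph nodes, each table contributes an edge between every two references it takes part in, and predicate cycles are the simple cycles of that graph. This interpretation is an assumption, as is the filter that keeps only cycles spanning at least two tables; the sketch does not try to reproduce any further pruning the original algorithm may apply.

```python
from itertools import combinations

# Significant connection table of the abstract example:
# reference name -> {table: column}.
SCT = {
    "X1": {"T1": "C13", "T2": "C22"},
    "X2": {"T1": "C11", "T3": "C32"},
    "X3": {"T1": "C12", "T4": "C43"},
    "X4": {"T2": "C21", "T3": "C31"},
    "X5": {"T2": "C23", "T4": "C41"},
}


def build_edges(sct):
    """Return labelled edges (table, ref_a, ref_b), one per pair of
    references that the same table takes part in."""
    refs_by_table = {}
    for ref, columns in sct.items():
        for table in columns:
            refs_by_table.setdefault(table, []).append(ref)
    return [(table, a, b)
            for table, refs in sorted(refs_by_table.items())
            for a, b in combinations(sorted(refs), 2)]


def simple_cycles(edges):
    """Enumerate simple cycles as lists of (table, from_ref, to_ref) steps."""
    adjacency = {}
    for table, a, b in edges:
        adjacency.setdefault(a, []).append((b, table))
        adjacency.setdefault(b, []).append((a, table))
    found = {}

    def extend(start, node, path, visited):
        for nxt, table in adjacency[node]:
            step = (table, node, nxt)
            if nxt == start:
                cycle = path + [step]
                # Identify a cycle by its edge set, so rotations and
                # reversals collapse and no edge is walked twice.
                key = frozenset((t, frozenset((a, b))) for t, a, b in cycle)
                if len(cycle) >= 2 and len(key) == len(cycle):
                    found.setdefault(key, cycle)
            elif nxt not in visited:
                extend(start, nxt, path + [step], visited | {nxt})

    for start in sorted(adjacency):
        extend(start, start, [], {start})
    return list(found.values())


edges = build_edges(SCT)
cycles = simple_cycles(edges)
# Assumed filter: keep cycles whose predicates come from at least two tables.
mixed = [c for c in cycles if len({table for table, _, _ in c}) >= 2]
for cycle in sorted(mixed, key=len):
    print(" ".join(f"({t} {a} {b})" for t, a, b in cycle))
```

Because cycles are keyed by their edge sets, parallel edges between the same two references (two different tables linking the same pair) also yield valid two-predicate cycles.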

DB Example (continued)

A graph G constructed from the SCT

[Figure: graph G with nodes (T1,X1), (T2,X1), (T1,X2), (T3,X2), (T4,X3), (T1,X3), (T2,X4), (T3,X4), (T4,X5), (T2,X5)]

DB Example (continued)

All predicate cycles found in G

(T2 X1 X4) (T3 X4 X2) (T1 X2 X1)

(T2 X1 X5) (T4 X5 X3) (T1 X3 X1)

(T2 X5 X1) (T1 X1 X2) (T3 X2 X4) (T2 X4 X5)

(T2 X4 X5) (T4 X5 X3) (T1 X3 X1) (T2 X1 X4)

(T1 X3 X1) (T2 X1 X4) (T3 X4 X2) (T1 X2 X3)

(T3 X2 X4) (T2 X4 X5) (T4 X5 X3) (T1 X3 X2)

(T1 X2 X3) (T4 X3 X5) (T2 X5 X1) (T1 X1 X2)

(T2 X1 X4) (T3 X4 X2) (T1 X2 X3) (T4 X3 X5) (T2 X5 X1)

(T1 X1 X2) (T3 X2 X4) (T2 X4 X5) (T4 X5 X3) (T1 X3 X1)

DB Example (continued)
  • The complete set of metapatterns

P1(Y1,Y2) & Q1(Y2,Y3) => R1(Y1,Y3)

P2(Y1,Y2) & Q2(Y2,Y3) & W2(Y3,Y4) => R2(Y1,Y4)

P3(Y1,Y2) & Q3(Y2,Y3) & W3(Y3,Y4) & V3(Y4,Y5) => R3(Y1,Y5)

Pattern Evaluation
  • Evaluate each instantiated pattern p of metapattern P by
    • Computing two values:
      • strength: ps = prob(R | L,U,I) = (|R|+1) / (|L| + 2)
      • base: pb = sqrt( (1- ps) ps / N )
    • Comparing with specified thresholds s and b:

if pb < b
    then if (ps > s) or (ps < (1 - s))
        then accept p
        else mark p as plausible
    else discard p
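
The evaluation rule can be written out as a small Python sketch. The formulas and the decision procedure follow the slide; the reading of |L| and |R| as instance counts for the left-hand and right-hand sides, and of N as a sample size supplied by the caller, are assumptions, since the slide does not define them.

```python
import math


def strength(r_count, l_count):
    """ps = (|R| + 1) / (|L| + 2), as given on the slide."""
    return (r_count + 1) / (l_count + 2)


def base(ps, n):
    """pb = sqrt((1 - ps) * ps / N); N is assumed to be a sample size."""
    return math.sqrt((1 - ps) * ps / n)


def evaluate(ps, pb, s, b):
    """Apply the accept / plausible / discard rule from the slide."""
    if pb < b:
        if ps > s or ps < 1 - s:
            return "accept"
        return "plausible"
    return "discard"


# Hypothetical counts: 8 left-hand-side instances, all satisfying the right.
ps = strength(r_count=8, l_count=8)
pb = base(ps, n=8)
print(round(ps, 2), round(pb, 2), evaluate(ps, pb, s=0.8, b=0.5))
```

Note that a pattern with very low strength is also accepted, because ps < 1 - s signals a strong negative regularity.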

Examples of Evaluation

when s=0.8, and b=0.5

accept

(T2 X4 X1) (T3 X4 X2) (T1 X2 X3) => (T1 X3 X1) [0.8, 0.15]

(T1 X2 X1) (T3 X4 X2) (T2 X4 X5) (T4 X5 X3) => (T1 X3 X1) [0.9, 0.11]

plausible

(T1 X2 X3) (T4 X5 X3) (T2 X4 X5) => (T3 X4 X2) [0.5, 0.14]

discard

(T3 X4 X2) (T2 X4 X5) (T4 X5 X3) => (T1 X2 X3) [0.4, 0.9]

Step 3. Propose More Metapatterns
  • For each metapattern P that has many plausible patterns, do
    • Select a (meta)constraint C and append it to the left hand side of P
      • C must connect to at least one predicate in P
      • C is a built-in predicate (e.g., =)
      • C is suggested by the domain knowledge
  • An Example

P1(Y1,Y2) & Q1(Y2,Y3) & S1(Y2,O) => R1(Y1,Y3)


A Small Network Example

[Figure: a small network with nodes numbered 0 through 8]

Network Data Tables

[Tables: Can-reach and Linked-to]

Network Example (continued)

Schema and Value Ranges

CAN-REACH: A1 integer[0-8] A2 integer[0-8]

LINKED-TO: B1 integer[0-8] B2 integer[0-8]

Significant Connection Table

      CAN-REACH   LINKED-TO
X1    A1          B1
X2    A2          B1
X3    A2          B2

Network Example (continued)

The SCT Graph

[Figure: SCT graph with nodes (CR,X1), (LT,X1), (CR,X2), (LT,X2), (CR,X3), (LT,X3), where CR is CAN-REACH and LT is LINKED-TO]

Network Example (continued)

All Predicate Cycles

(LINKED-TO X1 X3) (CAN-REACH X1 X3)

(LINKED-TO X3 X1) (CAN-REACH X1 X2) (LINKED-TO X2 X3)

(CAN-REACH X1 X2) (LINKED-TO X2 X3) (CAN-REACH X3 X1)

Evaluate against DB

(LINKED-TO X1 X3) => (CAN-REACH X1 X3) [1.0, 10]

(CAN-REACH X1 X2) (LINKED-TO X2 X3) => (CAN-REACH X1 X3) [1.0, 11]

(CAN-REACH X1 X2) (CAN-REACH X1 X3) => (LINKED-TO X2 X3) [0.1, 89]

(CAN-REACH X1 X3) (LINKED-TO X2 X3) => (CAN-REACH X1 X2) [0.4, 31]

(CAN-REACH X1 X3) => (LINKED-TO X1 X3) [0.5, 19]
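
The sketch below scores three of these rules on a tiny hypothetical network. The four-node chain is invented for illustration and is not the network from the slides, so its numbers will not match the ones above. CAN-REACH is assumed to be the transitive closure of LINKED-TO, and the score is assumed to be the strength formula (|R| + 1) / (|L| + 2).

```python
def transitive_closure(links):
    """CAN-REACH, assumed here to be the transitive closure of LINKED-TO."""
    reach = set(links)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in links:
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach


def laplace_strength(lhs_pairs, rhs_relation):
    """Strength (|R| + 1) / (|L| + 2) over the bindings of the left-hand side."""
    hits = sum(1 for pair in lhs_pairs if pair in rhs_relation)
    return (hits + 1) / (len(lhs_pairs) + 2)


linked_to = {(0, 1), (1, 2), (2, 3)}  # hypothetical chain 0 -> 1 -> 2 -> 3
can_reach = transitive_closure(linked_to)

# (LINKED-TO X1 X3) => (CAN-REACH X1 X3)
rule1 = laplace_strength(linked_to, can_reach)

# (CAN-REACH X1 X2) (LINKED-TO X2 X3) => (CAN-REACH X1 X3)
joined = {(a, d) for (a, b) in can_reach for (c, d) in linked_to if b == c}
rule2 = laplace_strength(joined, can_reach)

# (CAN-REACH X1 X3) => (LINKED-TO X1 X3)
rule3 = laplace_strength(can_reach, linked_to)

print(rule1, rule2, rule3)
```

With so few instances the correction terms pull every score toward 0.5, so even a rule with no counterexamples stays below 1.0. The reverse rule scores lower because only half of the reachable pairs are direct links.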

Characteristics of Metapattern Generation
  • Unsupervised learning of relational (transitivity) patterns
    • with no pre-specified concepts
    • with no pre-labeled examples
    • that have probabilistic significance
    • directly from databases
