Data mining is the process of automatically extracting valid, novel, potentially useful and ultimately comprehensible in PowerPoint Presentation
PowerPoint Slideshow about 'Data mining is the process of automatically extracting valid, novel, potentially useful and ultimately comprehensible in' - Albert_Lan


Presentation Transcript
slide2

Data mining is the process of automatically extracting valid, novel, potentially useful and ultimately comprehensible information from very large databases

slide3
Direct Kernel Methods

The Data Mining Process

Diagram labels: database, select, selected data, preprocess & transform, transformed data, make model, interpretation & rule formulation, data prospecting and surveying

slide4

How is Data Mining Different?

  • Emphasis on large data sets
    • - Not all data fit in memory (necessarily)
    • - Outlier detection, rare events, errors, missing data, minority classes
    • - Scaling of computation time with data size is an issue
    • - Large data sets: i.e., a large number of records and/or a large number of attributes
    • - Fusion of databases
  • Emphasis on finding interesting, novel, non-obvious information
    • - It is not necessarily known what exactly one is looking for
    • - Models can be highly nonlinear
    • - Information nuggets can be valuable
  • Different methods
    • - Statistics
    • - Association rules & Pattern recognition
    • - AI
    • - Computational intelligence (neural nets, genetic algorithms, fuzzy logic)
    • - Support vector machines and kernel-based methods
    • - Visualization (SOM, pharmaplots)
  • Emphasis on explaining and feedback
  • Interdisciplinary nature of data mining
slide5

Data Mining Challenges

  • Large data sets
  • - Data sets can be rich in the number of records
  • - Data sets can be rich in the number of attributes
  • Data preprocessing and feature definition
  • - Data representation
  • - Attribute/Feature selection
  • - Transforms and scaling
  • Scientific data mining
  • - Classification, multiple classes, regression
  • - Continuous and binary attributes
  • - Large datasets
  • - Nonlinear Problems
  • Erroneous data, outliers, novelty, and rare events
  • - Erroneous data
  • - Outliers
  • - Rare events
  • - Novelty detection
  • Smart visualization techniques
  • Feature Selection & Rule formulation
slide6

Diagram labels, top to bottom: WISDOM, UNDERSTANDING, KNOWLEDGE, INFORMATION, DATA

slide7

A Brief History of Data Mining:

Pascal → Bayes → Fisher → Werbos → Vapnik

  • A brief history of statistics and statistical learning theory:
  • - From the calculus of chance to the calculus of probabilities (Pascal → Bayes)
  • - From probabilities to statistics (Bayes → Fisher)
  • - From statistics to machine learning (Fisher & Tukey → Werbos → Vapnik)
  • The meaning of “Data Mining” changed over time:
  • - Pre 1993: “Data mining is the art of torturing the data into a confession”
  • - Post 1993: “Data mining is the art of charming the data into a confession”
  • From AI expert systems → data-driven expert systems:
  • - Pre 1990: The experts speak (AI Systems)
  • - Post 1995: Attempts to let the data speak for themselves
  • - 2000+: The data speak …
  • From the supermarket scanner to the human genome
  • - Pre 1998: Database marketing and marketing driven applications
  • - Post 1998: The emergence of scientific data mining
  • From theory to application
slide8

Data Mining Applications and Operations

Diagram labels (application areas): Database, Marketing, Finance, Health Insurance, Medicine, Bioinformatics, Manufacturing, “Homeland Security”, WWW Agents, Text Retrieval, BioDefense

  • Data Preparation
  • - Missing data
  • - Data cleansing
  • - Visualization
  • - Data transformation
  • Clustering/Classification
  • Statistics
  • Factor analysis/Feature selection
  • Associations
  • Regression models
  • Data driven expert systems
  • Meta-Visualization/Interpretation
slide9

Direct Kernel Methods for Data Mining: Outline

  • Classical (linear) regression analysis and the learning paradox
  • Resolving the learning paradox by
  • - Resolving the rank deficiency (e.g., PCA)
  • - Regularization (e.g., Ridge Regression)
  • Linear and nonlinear kernels
  • Direct kernel methods for nonlinear regression
  • - Direct Kernel Principal Component Analysis → DK-PCA
  • - (Direct) Kernel Ridge Regression → Least Squares SVM (LS-SVM)
  • - Direct Kernel Partial Least Squares → Partial Least-Squares SVM
  • - Direct Kernel Self-Organizing Maps → DK-SOM
  • Feature selection, memory requirements, hyperparameter selection
  • Examples:
  • - Nonlinear toy examples (DK-PCA Haykin’s Spiral, LS-SVM for Cherkassky data)
  • - K-PLS for Time series data
  • - K-PLS for QSAR drug design
  • - LS-SVM Nerve agent classification with electronic nose
  • - K-PLS with feature selection on microarray gene expression data (leukemia)
  • - Direct Kernel SOM and DK-PLS for Magnetocardiogram data
  • - Direct Kernel SOM for substance identification from spectrograms
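
The first two outline items can be illustrated with a small sketch. When a data set has more attributes than records, the matrix XᵀX used in classical least squares is rank deficient and the normal equations cannot be solved; ridge regression adds a penalty λ to the diagonal, which makes the system solvable. The toy data, the value λ = 0.1, and every function name below are illustrative assumptions and are not taken from the presentation.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b, tol=1e-12):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            raise ValueError("matrix is singular (rank deficient)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge_weights(X, y, lam):
    """Weights w minimizing ||Xw - y||^2 + lam * ||w||^2."""
    Xt = transpose(X)
    gram = matmul(Xt, X)              # X^T X, attributes x attributes
    for i in range(len(gram)):
        gram[i][i] += lam             # the ridge term
    rhs = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(gram, rhs)

# Two records, three attributes: X^T X is 3x3 but has rank 2.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

try:
    ridge_weights(X, y, lam=0.0)      # plain least squares
except ValueError as err:
    print("lambda = 0:", err)

w = ridge_weights(X, y, lam=0.1)
print("lambda = 0.1:", [round(v, 4) for v in w])
```

The hand-written solver keeps the sketch dependency-free; in practice a linear algebra library would normally be used instead.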
slide11

Review: What is in a Kernel?

  • A kernel can be considered as a (nonlinear) data transformation
  • - Many different choices for the kernel are possible
  • - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
  • The RBF or Gaussian kernel is a symmetric matrix
  • - Entries reflect nonlinear similarities amongst data descriptions
  • - As defined by:
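
A common textbook form of the Gaussian kernel entry for two data vectors x_i and x_j is k_ij = exp(-||x_i - x_j||² / (2σ²)). The symbol σ for the kernel width and the factor 2σ² are assumptions here; other conventions exist. A minimal sketch of building the kernel matrix under that assumption (the function name is hypothetical):

```python
import math

def rbf_kernel_matrix(data, sigma):
    """Return K with K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    `sigma` is the assumed kernel width parameter.
    """
    n = len(data)
    two_sigma_sq = 2.0 * sigma ** 2
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
            K[i][j] = K[j][i] = math.exp(-sq_dist / two_sigma_sq)  # symmetric
    return K

points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
K = rbf_kernel_matrix(points, sigma=5.0)
for row in K:
    print([round(v, 4) for v in row])
```

Each entry lies between 0 and 1, the diagonal is 1 because every point is identical to itself, and the matrix is symmetric, which matches the description in the bullets above.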
slide12

Docking Ligands is a Nonlinear Problem

DDASSL

Drug Design and Semi-Supervised Learning

slide13

Histograms

PIP (Local Ionization Potential)

Wavelet Coefficients

Electron Density-Derived TAE-Wavelet Descriptors

  • Surface properties are encoded on the 0.002 e/au³ electron density surface

Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18 (2), p. 182-197

  • Histograms or wavelet encodings of surface properties give Breneman’s TAE property descriptors
  • 10x16 wavelet descriptors
slide14
  • Binding affinities to human serum albumin (HSA): log K’hsa
  • Gonzalo Colmenarejo, GlaxoSmithKline
  • J. Med. Chem. 2001, 44, 4370-4378
  • 95 molecules, 250-1500+ descriptors
  • 84 training, 10 testing (1 left out)
  • 551 Wavelet + PEST + MOE descriptors
  • Widely different compounds
  • Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)
slide15

Validation Model: 100x leave 10% out validations
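
One way to read "100x leave 10% out" is: repeat 100 times, each time holding out a random 10% of the training records, fitting on the rest, and scoring the held-out records. The sketch below follows that reading. The straight-line model is only a stand-in for whatever model is being validated, and the synthetic data, seed, and function names are assumptions.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (stand-in model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def repeated_holdout_rmse(xs, ys, repeats=100, holdout=0.1, seed=0):
    """Return one held-out RMSE per repetition."""
    rng = random.Random(seed)
    n = len(xs)
    n_out = max(1, round(holdout * n))
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                       # fresh random split each time
        test_idx, train_idx = idx[:n_out], idx[n_out:]
        a, b = fit_line([xs[i] for i in train_idx],
                        [ys[i] for i in train_idx])
        sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(sq_err) / n_out))
    return scores

# Synthetic, noise-free data with 84 records (the training-set size
# quoted on the previous slide); the values themselves are made up.
xs = [float(i) for i in range(84)]
ys = [2.0 * x + 1.0 for x in xs]
scores = repeated_holdout_rmse(xs, ys)
print(len(scores), "validation runs, mean RMSE =", sum(scores) / len(scores))
```

Averaging over many random splits gives a more stable error estimate than a single split, which matters for a data set with fewer than 100 molecules.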

slide16

Feature Selection (data strip mining)

PLS, K-PLS, SVM, ANN

Fuzzy Expert System Rules

GA or Sensitivity Analysis to select descriptors
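
Sensitivity analysis can be sketched as follows, under the assumption that it means holding all descriptors at their mean values, sweeping one descriptor at a time over its observed range, and recording how much the model output changes. Descriptors with the smallest output range are candidates for removal. The presentation does not spell out its exact procedure, so the scheme, the toy model, and the function names below are all hypothetical.

```python
def sensitivities(model, data, steps=5):
    """Output range of `model` when one feature is swept and the rest sit at their means."""
    n_feat = len(data[0])
    means = [sum(row[j] for row in data) / len(data) for j in range(n_feat)]
    result = []
    for j in range(n_feat):
        lo = min(row[j] for row in data)
        hi = max(row[j] for row in data)
        outputs = []
        for s in range(steps):
            probe = means[:]                          # all features at their mean
            probe[j] = lo + (hi - lo) * s / (steps - 1)  # except feature j
            outputs.append(model(probe))
        result.append(max(outputs) - min(outputs))
    return result

def rank_features(sens):
    """Feature indices ordered from most to least sensitive."""
    return sorted(range(len(sens)), key=lambda j: -sens[j])

# Toy model: depends strongly on feature 0, weakly on feature 2,
# and not at all on feature 1.
toy_model = lambda x: 3.0 * x[0] + 0.1 * x[2]
toy_data = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

sens = sensitivities(toy_model, toy_data)
print("ranking (most sensitive first):", rank_features(sens))
```

In an iterative scheme the least sensitive descriptors would be dropped, the model retrained, and the analysis repeated until the desired number of descriptors remains.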

slide17

K-PLS Pharmaplots

511 features

32 features

slide18

Microarray Gene Expression Data for Detecting Leukemia

  • 38 samples for training
  • 36 samples for testing
  • Challenge: select ~10 out of 6000 genes
  • Sensitivity analysis used for feature selection

(with Kristin Bennett)

slide21

with Wunmi Osadik and Walker Land (Binghamton University)

Acknowledgement: NSF

slide22

Magnetocardiography at CardioMag Imaging Inc.

slide23

Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid).

Right Upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right Lower: T3-T4 sub-cycle in one MCG signal trace

slide24

Magneto-cardiogram Data

with Karsten Sternickel (Cardiomag Inc.) and Boleslaw Szymanski (Rensselaer)

Acknowledgement: NSF SBIR phase I project

slide25

SVMLib

Linear PCA

SVMLib

Direct Kernel PLS

slide26

Direct Kernel PLS with 3 Latent Variables

slide27

Direct Kernel

with Robert Bress and Thanakorn Naenna

slide28

WORK IN PROGRESS

GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA

TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA

TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT

GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG

CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG

GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA

CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC

ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC

ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG

TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA

TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA

CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA

CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA

CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA

CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA

CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA

TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA

CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA

CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA

CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT

ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT

TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA

CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT

DDASSL

Drug Design and Semi-Supervised Learning

slide29

Santa Fe Time Series Prediction Competition

  • 1994 Santa Fe Institute Competition: 1000 points of chaotic laser data; predict the next 100 points
  • Competition is described in Time Series Prediction: Forecasting the Future and Understanding the Past, A. S. Weigend & N. A. Gershenfeld, eds., Addison-Wesley, 1994
  • Method: - K-PLS with σ = 3 and 24 latent variables
  • - Used records with 40 past data points for training to predict the next point
  • - Predictions bootstrap on each other for the 100 real test data points
  • Entry “would have won” the competition
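
The data layout described above can be sketched as follows: each training record pairs 40 consecutive past values with the value that follows, and the 100-step forecast is produced by feeding each prediction back in as an input for the next step (one reading of "bootstrap on each other"). K-PLS itself is not implemented here; `predict_next` is a hypothetical stand-in for any trained one-step model, and the synthetic series is made up.

```python
def make_windows(series, lag=40):
    """Pair each block of `lag` consecutive values with the value that follows."""
    n = len(series) - lag
    inputs = [series[i:i + lag] for i in range(n)]
    targets = [series[i + lag] for i in range(n)]
    return inputs, targets

def iterate_forecast(predict_next, history, horizon=100, lag=40):
    """Forecast `horizon` steps, feeding each prediction back as an input."""
    buffer = list(history[-lag:])
    forecasts = []
    for _ in range(horizon):
        nxt = predict_next(buffer)
        forecasts.append(nxt)
        buffer = buffer[1:] + [nxt]   # slide the window over the new prediction
    return forecasts

# Synthetic stand-in series with 1000 points (not the laser data).
series = [float(i) for i in range(1000)]
inputs, targets = make_windows(series, lag=40)
print(len(inputs), "training records of length", len(inputs[0]))

# Stand-in one-step model: linear extrapolation from the last two values.
extrapolate = lambda window: 2.0 * window[-1] - window[-2]
forecast = iterate_forecast(extrapolate, series, horizon=100, lag=40)
print("first and last forecast:", forecast[0], forecast[-1])
```

Because each forecast depends on earlier forecasts, errors can compound over the 100 steps, which is what makes this kind of multi-step prediction on chaotic data difficult.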
slide30

www.drugmining.com

Kristin Bennett and Mark Embrechts