Data Mining: How to make islands of knowledge emerging out of oceans of data

Data Mining:How to make islands ofknowledge emerging out of oceans of data Hugues Bersini IRIDIA - ULB

PLAN • Rapid intro to data warehouse • data mining: • two super techniques of data mining incomprehensible: Understand and predict Lazy for time series prediction Bagfs for classification

The Data Miner Steps • Data Warehousing • Data Preparation • Cleaning + Homogeneisation • Transformation - Composition • Reduction • For time series: time adjustment • Data Modelling : What researchers are mainly interested in.

Data Warehouse

Re-organization of data • Subject oriented • integrated • transversals • with history • non volatile • from production data ---> to decision-based data

Data Mining Uunderstand and predict

Modelling the data: only if structure and regularities in the dataData mining IS NOT OLAP WHY ?? To predict new data To understand the data

The main techniques of data-mining • Clustering • Outlier detection • Association analysis • Forecasting • Classification

Data Mining: to understand and/or to predict discovering structure in data discovering I/O relationship in data

Nothing new under the sun • New methods extending old ones in the domain of non-linear (NN) and symbolic (decision tree) • Exponential explosion of data • Extracting from huge data base More sensitive than ever

Exploit Decisions Data store Data Store Data Store Data Store • Data volume doubles every 18 months • world-wide • Problem • How to extract relevant knowledge for • our decisions from such amounts of data? • Solutions • Throw it away before using it (most popular) • Query it (Query and OLAP tools) • Summarize it: extract essence from the bulk • according to targeted decision (Data Mining) CEDITI September 2, 1998 • 3

Discovering structure in data • When in a space with a metric • Hierarchical clustering • K-Means • NN clustering - Kohonen’s map • In space without any metric but a cost function: • Grouping Genetic Algorithms ....

Clustering and outlier

Market Basket Analysis: Association analysis Quantity bought

Calcul of Improvement

Discovering I/O relationship in data ? y x(t) ?? x t classification time series prediction O = the class I = (x,y) O = x(t+1) I = x(t) understanding I/O relationship Predicting which O for new I

Le CV d’IRIDIA en data mining • Reconnaissance de défauts vitreux chez Glaverbel • Prediction de fluctuations boursières avec MasterFood et dieteren • Reconnaissance d’incidents et prédiction de charge électrique avec Tractebel • Analyse des retards aériens avec Eurocontrôle • Modélisation de Processus Industriel avec Honeywell, FAFER et Siemens • Moteur de recherche Internet convivial avec la Region Wallonne • Classification de pixels pour les images de satelittes

Financial prediction Task: predict the future trends of the financial series. Goal: automatic trading system to anticipate the fluctuations of the market.

Economic variables Task: predict how many cars will be matriculated next year. Goal: support the marketing campaign of a car dealer.

Modeling of industrial plants Rolling steel mill Task: predict the flow stress of the steel plate as a function of the chemical and physical properties of the material. Goal: cope with different types of metals, reduce the production time and improve final quality.

Control Waste water treatment plant Task: model the dynamics of the plant on the basis of accessible information. Goal: control the level of water pollutants.

Environmental problems Algae summer blooming Task: predicting the biological state (e.g. density of algae communities) as a function of chemicals. Goal: make automatic the analysis of the state of the river by monitoring chemical concentrations.

In the medical domain • automatic diagnosis of cancer • detection of respiratory problems • electrocardiogram analysis • help to paraplegic

APPLICATION DU DATA MINING DANS LE DOMAINE DU CANCER: Application à l'aide au diagnostic et au pronosticen pathologie tumorale. En collaboration avec le Laboratoire d'Histopathologie (R. Kiss), Faculté de Médecine, U.L.B.

critères histologiques: - perte de différenciation - invasion critères cytologiques: - taille des noyaux - mitoses - plages d’hyperchromatisme bilan clinique patient tumeur chirurgie DIAGNOSTIC (pathologistes) traitement adjuvant faible, modéré, élevé Amélioration du diagnostic Adéquation du traitement Augmentation de la survie

Exemple:Tumeurs primitives cérébrales (adultes): GLIOMES

“Objectivation” d’éléments diagnostiques quantification de critères (cytologiques et histologiques) microscopie assistée par ordinateur traitement des données Extraction d’informations diagnostiques et/ou prognostiques fiables et reproductibles

500 à 1000 noyaux • par tumeurs. • 30 variables tumorales: • moyenne • déviation standard

Application to teledetection

Bagfs

On internet • The Hyperprisme project • Text Mining • Automatic profiling of users • Key words: positif, negatif,… • Automatic grouping of users on the basis of their profiles • See Web

Different approaches Data Model Non readable Accuracy of prediction Non comprehensible Comprehensible SVM Local Global

Understanding and Predicting Building Models A model needs data to exist but, once it exists, it can exist without the data. Structure To fit the data Model Parameters Linear, NN, Fuzzy, ID3, Wavelet, Fourier, Polynomes,...

From data to prediction RAW DATA TRAINING DATA PREPROCESSING MODEL LEARNING PREDICTION

Supervised learning input PHENOMENON output error OBSERVATIONS MODEL prediction • Finite amount of noisy observations. • No a priori knowledge of the phenomenon.

Model learning MODEL GENERATION PARAMETRIC IDENTIFICATION MODEL VALIDATION STRUCTURAL IDENTIFICATION MODEL SELECTION

The Practice of Modelling Accurate Simple Robust Understandable good for decision Data + Optimisation Methods THE MODEL Physical Knowledge Engineering Models Rules of Thumb Linguistic Rules

Comprehensible models • Decision trees • Qualitative attributes • Force the attributes to be treated separately • classification surfaces parallel to the axes • good for comprehension because they select and separate the variables

Decision trees • Very used in practice. One of the favorite data mining methods • Work with noisy data (statistical approaches) can learn logical model out of data expressed by and/or rules • ID3, C4.5 ---> Quinlan • Favoring little trees --> simple models

At every stage the most discriminant attribute • The tree is being constructed top-down adding a new attribute at each level • The choice of the attribute is based on a statistical criteria called : “the information gain” • Entropie = -pouilog2poui - pnonlog2pnon • Entropie = 0 if Poui/non = 1 • Entropie = 1 if Poui/non = 1/2

Information gain • S = set of instances, A set of attributes and v set of values of attributes A • Gain (S,A) = Entropie(S)-Sv|Sv|/|S|*Entropie(Sv) • the best A is the one that maximises the Gain • The algorithm runs in a recursive way • The same mechanism is reapplied at each level

Mais !!!! Remboursement d’emprunt Is a good client if (x - y)>30000 . Salaire mensuel 30000

Other comprehensible models • Fuzzy logic • Realize an I/O mapping with linguistic rules • If I eat “a lot” then I take weight “a lot”

Exemple trivial Linéaire, optimal automatique, simple Y X

Le flou

Data Mining: How to make islands of knowledge emerging out of oceans of data

Data Mining: How to make islands of knowledge emerging out of oceans of data

Presentation Transcript

Motivation: Why data mining What is data mining Data Mining: On what kind of data Data mining functionality Classificati

A Taste of Data Mining

Introduction to Data Mining; Overview of Web Mining

The Process of Data Mining

Data Mining The Art and Science of Obtaining Knowledge from Data

Data Mining of Very Large Data

DATA MINING Extracting Knowledge From Data

Overview of Data Mining

Mining Complex Types of Data

DATA, DATA , DATA: What to Make of It And How To Use It

Mining of Massive Datasets: Knowledge discovery from data

Overview of Data Mining

Vital Concepts of Data Mining

Use data mining services and get most out of your data

Advantages of Data mining in Data science

Benefits of Data Mining

How to Make the Most Out of Your Customer Data

Application of Data Mining to Physics Summary Data

DATA, DATA , DATA: What to Make of It And How To Use It

How to make sense out of unstructured data?

Data Mining The Art and Science of Obtaining Knowledge from Data

Overview of Data Mining