INTRODUCTION TO SYMBOLIC DATA ANALYSIS

E. Diday, CEREMADE, Paris–Dauphine University
TUTORIAL: 13 June 2014, Activity Center, Academia Sinica, Taipei, Taiwan

OUTLINE
PART 1: BUILDING SYMBOLIC DATA FROM STANDARD OR COMPLEX DATA
PART 2: SYMBOLIC DATA ANALYSIS. Is Symbolic Data Analysis a new paradigm?
PART 3: OPEN DIRECTIONS OF RESEARCH
PART 4: SDA SOFTWARE: SODAS, SYR and R
PART 5: INDUSTRIAL APPLICATIONS

PART 1: BUILDING SYMBOLIC DATA FROM STANDARD OR COMPLEX DATA

What is a standard data table?
• It is a set of individuals (i.e. observations) described by a set of numerical variables (such as age, weight, …) or categorical variables (such as nationality, club name, …).

What are complex data?
Any data that cannot be considered as a “standard observations × standard variables” data table.
Example: the individuals are towers of nuclear power plants, described by three tables:
Table 1) Observations: cracks. Variables: crack description.
Table 2) Observations: corrosions. Variables: corrosion description.
Table 3) Observations: vertices of a grid. Variables: gap depression from the ground.

Why consider classes of individuals as new individuals?
Example:
• If we wish to know what makes a player win, we are interested in a standard data table where the individuals are the players (in rows), described (in columns) by their standard characteristic variables.
• If we now wish to know what makes a team win, we are interested in a data table where the teams (in rows) are described by characteristic variables of the teams that take into account the variability of the players inside each team.
• The teams can now be considered as new, higher-level individuals described by symbolic variables that take into account the variability of the individuals inside each class.

From standard data tables to symbolic data tables
[Figure: a standard data table describing football players (individuals), with a number (age) or a category (nationality) in each cell, is turned into a symbolic data table describing teams (i.e. classes of individuals C1, …, Ci, …, Ck, with variables X’1, …, X’j), with a symbolic value in each cell (e.g. the bar chart of ages of the Messi team). Columns shown: age bar chart, weight interval, nationalities bar chart. Some columns are contingency tables.]

SYMBOLIC DATA EXPRESS VARIABILITY INSIDE CLASSES OF INDIVIDUALS
Here the variation (of weight, nationality, …) concerns the players of each team. Therefore each cell can contain a number, an interval, a sequence of categorical values, a sequence of weighted values such as a bar chart, a distribution, …
THESE NEW KINDS OF VARIABLES ARE CALLED « SYMBOLIC » BECAUSE THEY ARE NOT PURELY NUMERICAL: THEY EXPRESS THE INTERNAL VARIATION INSIDE EACH CLASS.

What is the current shortcoming that produced the SDA paradigm?
The shortcoming is that in current practice:
• Only the “individual” kind of observation is considered.
• Therefore these individual observations are described only by standard numerical and categorical variables.

The SDA paradigm shift
It is the transition:
• from “individual observations” described by standard variables with numerical or categorical values
• to “classes of individuals” (considered as “higher-level observations”)
• described by “symbolic variables” with “symbolic values” (intervals, probability distributions, sets of categories or numbers, random variables, …)
• that take into account the variability inside the classes.
• “Symbolic values” cannot be treated as numbers.

Building symbolic data needs three steps
• First step: we have a standard data table, Table 1, where individuals are described by numerical or categorical random variables Yj.
• Second step: we have a Table 2 where classes of individuals are described by variables Y’j whose values are random variables Yij.
• Third step: we have a symbolic data table, Table 3, where the random variables Yij are represented by:
• probability distributions, histograms, bar charts, percentiles, …
• intervals: [min, max], interquartile interval, etc.
• sets of numbers or categories
• functions such as time series.
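
The three steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the player rows, the age bin edges and the helper names (`interval`, `bar_chart`, `histogram`, `build_symbolic_table`) are all hypothetical and do not come from SODAS, RSDA or SYR.

```python
from collections import Counter, defaultdict

# Step 1: a standard data table (illustrative, made-up rows).
PLAYERS = [
    {"team": "A", "age": 24, "weight": 72.0, "nationality": "Spain"},
    {"team": "A", "age": 31, "weight": 80.0, "nationality": "Argentina"},
    {"team": "A", "age": 27, "weight": 70.0, "nationality": "Spain"},
    {"team": "A", "age": 22, "weight": 75.0, "nationality": "Brazil"},
    {"team": "B", "age": 29, "weight": 85.0, "nationality": "France"},
    {"team": "B", "age": 33, "weight": 78.0, "nationality": "France"},
    {"team": "B", "age": 19, "weight": 69.0, "nationality": "Germany"},
]

# Assumed age bins: [18, 25), [25, 30), [30, 35].
AGE_EDGES = [18, 25, 30, 35]


def interval(values):
    """Interval-valued cell: (min, max) of the class members."""
    return (min(values), max(values))


def bar_chart(values):
    """Bar-chart-valued cell: relative frequency of each category."""
    counts = Counter(values)
    return {category: n / len(values) for category, n in counts.items()}


def histogram(values, edges):
    """Histogram-valued cell: relative frequency of each numeric bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            # The last bin is closed on the right so the top edge is counted.
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    return [c / len(values) for c in counts]


def build_symbolic_table(rows, class_key="team"):
    # Step 2: group the individuals by class.
    groups = defaultdict(list)
    for row in rows:
        groups[row[class_key]].append(row)
    # Step 3: one symbolic value per (class, variable) cell.
    table = {}
    for name, members in groups.items():
        table[name] = {
            "age": histogram([m["age"] for m in members], AGE_EDGES),
            "weight": interval([m["weight"] for m in members]),
            "nationality": bar_chart([m["nationality"] for m in members]),
        }
    return table


symbolic_table = build_symbolic_table(PLAYERS)
for team, description in symbolic_table.items():
    print(team, description)
```

Each team ends up as one row whose cells hold a histogram, an interval and a bar chart, so the variability of its players is kept rather than averaged away.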

VARIABLES
• Standard variable values: numerical (income, profit, …), categorical (countries, stock-exchange places, …).
• Symbolic variable values: interval, bar chart, histogram, etc.

Ten examples of Symbolic variables

What kind of questions and how are they structured?

How to build symbolic data from standard or complex data?
• How should the numerical, ordinal and nominal ground variables be categorized in order to obtain the symbolic histogram or bar chart variables for each class?
• First: find the discretisation that discriminates these classes as well as possible.
• Second, or simultaneously: maximize the correlation between the bins.

SOME ADVANTAGES OF SYMBOLIC DATA:
• Work at the needed level of generality without losing variability.
• Reduce huge data sets, whether simple or complex.
• Reduce the number of observations and the number of variables.
• Reduce missing data.
• Extract simplified knowledge and decisions from complex data.
• Address confidentiality (classes are not confidential in the way individuals are).
• Facilitate the interpretation of results: decision trees, factorial analysis, new kinds of graphics.
• Extend data mining and statistics to new kinds of data, with many industrial applications.

PART 2: SYMBOLIC DATA ANALYSIS

SYMBOLIC DATA ANALYSIS TOOLS HAVE BEEN DEVELOPED
• Graphical visualisation of symbolic data
• Correlation, mean, mean square, distribution of a symbolic variable
• Dissimilarities between symbolic descriptions
• Clustering of symbolic descriptions
• S-Kohonen mappings
• S-Decision trees
• S-Principal component analysis
• S-Discriminant factorial analysis
• S-Regression
• Etc.

From standard observations to classes, the correlation is not the same!
[Figure: observations in the plane (Y1, Y2), with the four class centers marked by x.]
• The observations are uniformly distributed in the circle: there is no correlation between Y1 and Y2 for the initial observations.
• A correlation appears between the two variables for the centers of a given partition into 4 classes.
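
A small simulation can make this effect concrete. It is a sketch under assumptions: the slide does not say which partition is used, so the four classes here are hypothetical strips cut at the quartiles of Y1 + Y2, and the sample size and seed are arbitrary.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sample_disk(n, rng):
    """Draw n points uniformly in the unit disk by rejection sampling."""
    points = []
    while len(points) < n:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            points.append((x, y))
    return points


rng = random.Random(0)
points = sample_disk(20000, rng)

# Assumed partition: 4 classes of equal size, ordered by the value of Y1 + Y2.
ordered = sorted(points, key=lambda p: p[0] + p[1])
size = len(ordered) // 4
classes = [ordered[k * size:(k + 1) * size] for k in range(4)]

# Class centers: mean of Y1 and mean of Y2 inside each class.
centers = [
    (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
    for c in classes
]

corr_individuals = pearson([p[0] for p in points], [p[1] for p in points])
corr_centers = pearson([c[0] for c in centers], [c[1] for c in centers])
print("individuals:", corr_individuals)
print("class centers:", corr_centers)
```

With this particular partition the four centers lie close to the diagonal Y1 = Y2, so their correlation is close to 1 while the individual observations stay uncorrelated. A partition into the four quadrants would not show the effect, which is why the choice of classes matters.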

WHY CAN SYMBOLIC DATA NOT BE REDUCED TO A CLASSICAL STANDARD DATA TABLE?
[Figure: a symbolic data table and its transformation into classical data.]
Concern: the initial variables are lost and the variation is lost!

Divisive clustering or decision tree
[Figure: symbolic analysis (variable: Weight) compared with classical analysis (variable: Max Weight).]

PCA and NETWORK OF BAR CHART DATA of 30 clusters of the Fisher Iris data*
Any symbolic variable (set of bin variables) can be projected. Here: the species variable.
* SYROKKO Company afonso@syrokko.com

The contributions of the symbolic variables lie inside the smallest hypercube containing the correlation sphere of the bins.

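Numerical versus symbolic space of representation
Symbolic data table: classes C1, …, Ci, …, Ck described by the interval variables Y1 and Y2, with (Y1(Ci), Y2(Ci)) = ([a1i, b1i], [a2i, b2i]).
[Figure, left: numerical representation of interval variables. Each class Ci is a point with coordinates (a1i, b1i, a2i, b2i) on the axes a1, b1, a2, b2.]
[Figure, right: bi-plot of interval variables. Each class Ci is the rectangle [a1i, b1i] × [a2i, b2i] in the plane (Y1, Y2).]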

Bi-plot of histogram variables
[Figure: classes C1, …, Ci, …, Ck in the plane (Y1, Y2); label: Copula.]
• The joint probability can be inferred by a copula model.

PART 3: OPEN DIRECTIONS OF RESEARCH
• Models of models.
• Laws of parameters of laws.
• Laws of vectors of laws.
• Copulas needed.
• Four general convergence theorems.
• Optimisation in unsupervised learning (hierarchical and pyramidal clustering).

From the lower level of individual observations to the higher level of classes: higher-level models are needed
[Figure: Table 1 contains a number in each cell (the age of Messi); Table 2 describes teams C1, …, Ci, …, Ck by variables X’1, …, X’j and contains a symbolic value in each cell (the ages of the Messi team).]
• Xj is a standard numerical random variable.
• X’j is a random variable with histogram values.
• Question: if the law of Xj is given, what is the law of X’j? (Dirichlet models are useful.)

Why use copula models in Symbolic Data Analysis?
• f(i, j, j’) is the joint probability of the variables j and j’ for the individual i.
• In case of independence, we have f(i, j, j’) = f(i, j) · f(i, j’).
• If there is no independence: f(i, j, j’) = Copula(f(i, j), f(i, j’)).
• Aim of the copula model in SDA: find the copula that minimises the difference with the joint probability,
• in order to avoid restricting to independence hypotheses and to reduce the cost of computing f(i, j, j’).
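
For reference, the standard result behind this use of copulas is Sklar's theorem. It is stated here for cumulative distribution functions F rather than in the slide's f notation; that change of notation, and the symbols F, C, x, y, u, v, are assumptions of this sketch.

```latex
% Sklar's theorem: a joint distribution function is a copula C
% applied to its marginal distribution functions.
F_{j,j'}(x, y) = C\bigl(F_j(x),\, F_{j'}(y)\bigr)

% Independence is the special case of the product copula:
C_{\mathrm{indep}}(u, v) = u\,v
\quad\Longrightarrow\quad
F_{j,j'}(x, y) = F_j(x)\,F_{j'}(y)
```

Read this way, choosing a copula for a class amounts to choosing how its marginal descriptions are glued into a joint one, with independence as one option among others.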

FOUR THEOREMS TO BE PROVED FOR ANY METHOD EXTENDED TO SYMBOLIC DATA
M(n, k) is supposed to be an SDA method where k is the number of classes obtained on n initial individuals.
THEOREM 1: If the k classes are fixed and n tends towards infinity, then M(n, k) converges towards a stable position.
THEOREM 2: If k increases until there is a single individual per class, then M(n, k) converges towards a standard method.
THEOREM 3: If k and n increase simultaneously towards infinity, then M(n, k) converges towards a stable position.
THEOREM 4: If the k laws associated with the k classes are considered as a sample of a law of laws, then M(n, k) applied to this sample converges to M(n, k) applied to this law.
Examples:
Theorem 1: this was proved in Diday, Emilion (CRAS, Choquet 1998) for Galois lattices: as the size of the population increases, the classes (described by vectors of distributions) organise themselves into a Galois lattice that converges. Emilion (CRAS, 2002) also gives a theorem for mixtures of laws of laws, using martingales and a Dirichlet model.
Theorem 2: for example, the classical PCA MO is a special case of the PCA denoted M(n, k) built on vectors of intervals.
Theorem 3: this is the setting of data that arrive sequentially (“data stream” type) and of one-pass algorithms (see e.g. Diday, Murty (2005)).
Theorem 4: in the case of a hierarchical or pyramidal clustering in 2D, 3D, etc., convergence means that the main levels and their structure stabilise. In the case of a PCA, convergence means that the factorial axes stabilise.

Optimisation in clustering
d is the given dissimilarity. Each class is described by symbolic data.
• Hierarchies: ultrametric dissimilarity U, W = |d - U|.
• Pyramids: Robinsonian dissimilarity R, W = |d - R|.
• 3D spatial pyramids: Yadidean dissimilarity Y, W = |d - Y|.
[Figure: a hierarchy and a pyramid on the points x1, …, x5, and a 3D spatial pyramid.]
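
For the hierarchy case, W = |d - U| can be evaluated on a toy example. The sketch below makes several assumptions: the 4 × 4 dissimilarity matrix is made up, the norm is read as a sum of absolute differences over pairs, and U is taken to be the subdominant ultrametric (the one associated with single-linkage clustering). That U is only one candidate; it is not claimed to minimise W.

```python
def subdominant_ultrametric(d):
    """Largest ultrametric below d, via minimax paths (Floyd-Warshall style)."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Going through k costs the larger of the two steps.
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


def is_ultrametric(u, tol=1e-12):
    """Check u(i, j) <= max(u(i, k), u(k, j)) for every triple."""
    n = len(u)
    return all(
        u[i][j] <= max(u[i][k], u[k][j]) + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


def fit_criterion(d, u):
    """W = |d - U|, read here as the sum of absolute differences over pairs."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))


# Made-up symmetric dissimilarity matrix on 4 objects.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]

U = subdominant_ultrametric(D)
W = fit_criterion(D, U)
print("U =", U)
print("d is ultrametric:", is_ultrametric(D))
print("U is ultrametric:", is_ultrametric(U))
print("W =", W)
```

The same `fit_criterion` could in principle be reused with a Robinsonian or Yadidean dissimilarity in place of U; constructing those is not attempted here.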

PART 4: SDA SOFTWARE: SODAS, RSDA, SYR

Software
To build symbolic data from standard or complex data and to analyze symbolic data, different software packages exist today.
• SODAS: free academic package, though registration is required and a code is needed for installation, http://www.info.fundp.ac.be/asso/sodaslink.htm
• Many databases of symbolic data can be found at http://www.ceremade.dauphine.fr/SODAS/
• RSDA: free academic packages are available on CRAN: oldemar.rodriguez@gmail.com
• SYR: professional package, see: afonso@syrokko.com

SODAS SOFTWARE
[Figure panels: Kohonen map of concepts; factorial analysis: PCA of interval-valued variables; superposition of two stars associated with two classes of the pyramid; decision tree on histogram-valued or interval-valued variables; clustering pyramid.]
The objective of SCLUST is the clustering of symbolic objects (SOs) by a dynamic algorithm based on symbolic data tables. The aim is to build a partition of SOs into a predefined number of classes. Each class has a prototype in the form of an SO. The optimality criterion is based on the sum of proximities between the individuals and the prototypes of the clusters.

FROM DATABASE TO SYMBOLIC DATA IN SODAS
[Figure: a query on a relational database (individuals, classes, description of individuals) produces a symbolic data table: rows are classes, columns are symbolic variables giving the class descriptions, and cells contain symbolic data.]

SYR SOFTWARE
Produce a symbolic data table from complex data.
Manage symbolic data tables: sort rows and columns by discriminant power.
Analyse symbolic data tables: SPCA, Sclustering, …
Produce networks, rules and decision trees.

SYR: SYMBOLIC DATA TABLE MANAGEMENT
• Sorting rows by the min or max of intervals, or by the frequencies of a bar chart, is possible.
• Sorting variables by discriminant power of the concepts is also possible.
* SYROKKO Company eliezer@syrokko.com

PART 5: INDUSTRIAL APPLICATIONS

Time series data table: anomaly detection on a bridge
LCPC (Laboratoire Central des Ponts et Chaussées) and SNCF data.
[Table layout: rows are trains; columns are Sensor 1, Sensor 2, Sensor 3, …, Sensor N.]
Each row represents a train crossing the bridge at a given temperature; each cell contains up to 800,000 values. Each cell is transformed into a HISTOGRAM from a PROJECTION or from WAVELETS.

HIERARCHICAL DATA*
Symbolic procedure: from the numerical description of pigs to the symbolic description of farms.
• Initial data: description of pig respiratory diseases, 19 variables, 125 farms × 30 animals.
• Numerical and categorical variables are transformed into bar charts of the frequencies based on 30 animals, or into interval-valued variables.
• Median score (continuous variables); animal frequencies (categorical variables).
• Result: description of pig respiratory diseases, 64 variables, 125 farms.
* C. Fablet, S. Bougeard (AFSSA)

Step 1: Symbolic Description of Farms* * SYROKKO Company afonso@syrokko.com

Nuclear Power Plant: Find Correlations Between 3 Standard Data Tables of Different Observation Units and Different Variables

NUCLEAR POWER PLANT
[Figure: nuclear thermal power station; inspection machine; cracks; cartography of the tower by a grid.]
Inspection problem: FIND CORRELATIONS BETWEEN 3 CLASSICAL DATA TABLES OF DIFFERENT UNITS AND VARIABLES:
Table 1) Observations: cracks. Variables: crack description.
Table 2) Observations: vertices of a grid. Variables: gap deviation at different periods compared to the initial model position.
Table 3) Observations: vertices of a grid. Variables: gap depression from the ground.
These tables are transformed into ONE symbolic data table where the classes are the towers. SDA can be applied to this new table.

FROM COMPLEX DATA TO SYMBOLIC DATA

Towers on PCA first axes
• PCA on chosen symbolic variables.
• Visualisation of three clusters.
• Interval and bar chart variables can be seen.
• A network of the strongest links can be represented.
NETSYR results (SYR software)

Symbolic variables projection inside the hypercube of the correlation sphere

Text mining of telephone calls in order to discover “themes” without using semantics
INITIAL DATA: 2 814 446 rows giving the correspondence between documents and words. Each calling session is called a document.
After lemmatisation we start with a table of:
• 31454 documents
• 2258 words

First steps: building overlapping clusters of documents and words with CLUSTSYR
• 2 814 446 rows: correspondence between documents and words (31454 documents × 2258 words).
• 70 × 2258: 70 overlapping clusters of documents described by the tf-idf of 2258 words.
• 2258 × 70: 2258 words described by their tf-idf on the 70 clusters of documents.
• 80 × 70: 80 overlapping clusters of words described by their tf-idf in the 70 clusters of documents.
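
As an illustration of describing clusters of documents by the tf-idf of their words, here is a minimal sketch. The tf-idf variant (relative term frequency times the logarithm of the inverse cluster frequency), the three toy clusters and the function name `tf_idf` are assumptions; the slides do not state which weighting CLUSTSYR uses.

```python
import math
from collections import Counter

# Made-up clusters of documents, each given as the list of its lemmatised words.
CLUSTERS = {
    "c1": ["failure", "line", "failure", "budget"],
    "c2": ["budget", "invoice", "budget"],
    "c3": ["address", "move", "budget"],
}


def tf_idf(clusters):
    """Return {cluster: {word: tf-idf}} for one common tf-idf variant."""
    n_clusters = len(clusters)
    counts = {name: Counter(words) for name, words in clusters.items()}

    # Number of clusters in which each word appears at least once.
    cluster_frequency = Counter()
    for word_counts in counts.values():
        cluster_frequency.update(word_counts.keys())

    table = {}
    for name, word_counts in counts.items():
        total = sum(word_counts.values())
        table[name] = {
            word: (k / total) * math.log(n_clusters / cluster_frequency[word])
            for word, k in word_counts.items()
        }
    return table


weights = tf_idf(CLUSTERS)
for name, row in weights.items():
    print(name, row)
```

A word that occurs in every cluster, such as "budget" in this toy data, gets weight 0 and so does not help to tell the clusters apart, while "failure" characterises the first cluster.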

Next step: STATSYR
Each cluster of documents is described by the 80 clusters of words, called “themes”.
[Figure: themes, classes of documents, and the words in each theme.]

GRAPHICAL REPRESENTATION by NETSYR from SYR software
• Graphical representation of themes and document classes by pie charts, with their bar chart descriptions.
• Overlapping clusters.
• Social network based on dissimilarities.
• Annotation of themes and document classes.
• Moving, zooming, …
We finally obtain a clear representation of the main themes, their classes and their links: “failures”, “budget”, “addresses”, “vacation”, etc.

A Survey on Security
A sample of people from three regions (Vex, Val, Plai) answered three questions:
• Gender: M or W.
• Security: priority to Fight Against Unemployment (FAU), Juvenile Delinquency (JD) or Drug addiction (D).
• Death penalty: Yes or No.
Gender, Security and D. Penalty are « bar chart valued variables »; M, W, FAU, JD, … are « bins ».
