# Tutorial 1 - PowerPoint PPT Presentation


Tutorial 1. General Introduction to SDA. Yin-Jing Tien ( 田銀錦 ) Institute of Statistical Science Academia Sinica gary@stat.sinica.edu.tw June 13, 2014. Symbolic data Analysis (SDA) ( Diday 1987). Text: Billard and Diday (2006):


#### Presentation Transcript

Tutorial 1

General Introduction to SDA

Yin-Jing Tien (田銀錦)

Institute of Statistical Science, Academia Sinica

gary@stat.sinica.edu.tw

June 13, 2014

Symbolic Data Analysis (SDA)

(Diday 1987)

Text:

Billard and Diday (2006):

Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.

Diday, E., Noirhomme-Fraiture, M. (2008):

Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.

Symbolic data

(Diday 1987)

• Classical data: individuals, each variable takes a single value

• Single player

• age = 25, eye color = blue

• Symbolic data: symbolic units (concepts, i.e. groups)

• Team

• interval: age range = [20, 36]

• multiple values: eye color = {blue, brown, black}
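The step from classical to symbolic data can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical `Player` record and a hypothetical `summarize_team` helper (neither comes from the slides): the ages of the individual players collapse into an interval and their eye colors into a set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    """One classical observation: every variable holds a single value."""
    age: int
    eye_color: str


def summarize_team(players):
    """Aggregate classical records into one symbolic description of the team."""
    ages = [p.age for p in players]
    return {
        "age": (min(ages), max(ages)),                # interval-valued variable
        "eye_color": {p.eye_color for p in players},  # multi-valued variable
    }


# Hypothetical players, chosen so the team matches the example on the slide.
team = [Player(25, "blue"), Player(20, "brown"), Player(36, "black"), Player(31, "brown")]
symbolic = summarize_team(team)
print(symbolic["age"])  # (20, 36)
```

The team is now a single symbolic unit: its age is the interval [20, 36] and its eye color is the set {blue, brown, black}.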

Symbolic data analysis When?

• When we are interested in the higher-level units (concepts: groups/classes).

• When the initial data are composed of symbolic data tables

• When the data set is big

Symbolic data types

Symbolic data types (quantitative)

A multi-valued symbolic random variable Y is one that takes one or more values

{12, 23, 20}

An interval-valued symbolic random variable Y is one that takes values in an interval

[17, 25]

Modal multi-valued (each value is followed by its weight)

{0.5, 3/8; 1.5, 4/8; 2, 1/8}

Modal interval-valued (histogram; each subinterval is followed by its weight)

{[12, 40), 1/7; [40, 60), 2/7; [60, 80], 4/7}

Symbolic data types (qualitative)

A multi-valued symbolic random variable Y is one that takes one or more values

E.g., bird colors, Y = color

Modal multi-valued (each category is followed by its weight)

{single, 3/8; married, 5/8}
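A modal multi-valued description is simply a set of categories with relative frequencies. The sketch below shows one way to build it from raw individual records; the helper name `modal_multi_valued` and the eight hypothetical individuals are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def modal_multi_valued(values):
    """Map each observed category to its relative frequency within the group."""
    counts = Counter(values)
    n = len(values)
    return {category: Fraction(count, n) for category, count in counts.items()}


# Hypothetical group of 8 individuals: 3 single, 5 married.
statuses = ["single"] * 3 + ["married"] * 5
modal = modal_multi_valued(statuses)
print(modal["single"], modal["married"])  # 3/8 5/8
```

The weights always sum to 1, which is a quick sanity check for any modal description.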

Basic Descriptive Statistics: Interval Value

Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.

Sample mean of $Z_i$

Sample variance of $Z_i$
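For reference, the expressions usually given for interval-valued data (as in Billard and Diday, 2006) can be written as below. They assume that values are uniformly distributed within each interval, and they use the notation $I_{ci} = [a_{ci}, b_{ci}]$ defined above. This is a sketch of the standard formulas, to be checked against the text.

```latex
\bar{Z}_i = \frac{1}{2k} \sum_{c=1}^{k} \left( a_{ci} + b_{ci} \right),
\qquad
S_i^2 = \frac{1}{3k} \sum_{c=1}^{k} \left( a_{ci}^2 + a_{ci} b_{ci} + b_{ci}^2 \right)
        - \frac{1}{4k^2} \left[ \sum_{c=1}^{k} \left( a_{ci} + b_{ci} \right) \right]^2
```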

Basic Descriptive Statistics: Interval Value

The sample variance can be rewritten as

Total Variation = Within Variation + Between Variation

Within Variation

Between Variation
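The decomposition can be checked numerically. The sketch below assumes uniformly distributed values within each interval, so that the within variation of one interval is $(b - a)^2 / 12$ and the between variation is the variance of the interval midpoints; these explicit forms, the function name `interval_variation`, and the three intervals are assumptions made for illustration.

```python
def interval_variation(intervals):
    """Return (mean, total, within, between) for a list of (a, b) intervals."""
    k = len(intervals)
    mean = sum(a + b for a, b in intervals) / (2 * k)
    # Total variation: sample variance under within-interval uniformity.
    total = sum(a * a + a * b + b * b for a, b in intervals) / (3 * k) - mean ** 2
    # Within variation: spread inside each interval, (b - a)^2 / 12 per concept.
    within = sum((b - a) ** 2 for a, b in intervals) / (12 * k)
    # Between variation: spread of the interval midpoints around the overall mean.
    between = sum(((a + b) / 2 - mean) ** 2 for a, b in intervals) / k
    return mean, total, within, between


# Hypothetical intervals for three concepts (not taken from the slides).
intervals = [(20, 36), (17, 25), (12, 40)]
mean, total, within, between = interval_variation(intervals)
print(mean, total, within + between)
```

For any input, `total` and `within + between` agree up to floating-point rounding, which is the identity stated on the slide.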

Similarity between variables (interval-valued data) (Billard and Diday (2006))

The empirical covariance function between $Z_i$ and $Z_j$ is

The empirical correlation coefficient between $Z_i$ and $Z_j$ is

where

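Distance between concepts

Definition 7.6: The Cartesian join A ⊕ B between two sets A and B is their componentwise union, $A \oplus B = (A_1 \oplus B_1, \ldots, A_p \oplus B_p)$.

Definition 7.7: The Cartesian meet A ⊗ B between two sets A and B is their componentwise intersection, $A \otimes B = (A_1 \otimes B_1, \ldots, A_p \otimes B_p)$.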
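The two definitions can be sketched in Python. The code assumes that a multi-valued component is a Python set, that an interval component is an `(a, b)` tuple, that the join of two intervals is the smallest interval covering both, and that an empty meet is represented by `None`. All function names and the example descriptions are hypothetical.

```python
def join_component(a, b):
    """Union of two sets, or the smallest interval covering two intervals."""
    if isinstance(a, (set, frozenset)):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet_component(a, b):
    """Intersection of two sets or intervals; None if the intervals are disjoint."""
    if isinstance(a, (set, frozenset)):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def cartesian_join(A, B):
    """Apply the join component by component."""
    return tuple(join_component(a, b) for a, b in zip(A, B))


def cartesian_meet(A, B):
    """Apply the meet component by component."""
    return tuple(meet_component(a, b) for a, b in zip(A, B))


# Hypothetical descriptions: (eye color, age range).
A = ({"blue", "brown"}, (20, 36))
B = ({"brown"}, (30, 40))
joined = cartesian_join(A, B)
met = cartesian_meet(A, B)
print(joined[1], met[1])  # (20, 40) (30, 36)
```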

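Distance between concepts

Distance between concepts (multi-valued)

The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)

For the variable $Y_j$, with $\xi_j(\omega)$ the set of values of $Y_j$ for concept $\omega$:

$D_{1j}(\omega_1, \omega_2) = \big|\,|\xi_j(\omega_1)| - |\xi_j(\omega_2)|\,\big| \,/\, |\xi_j(\omega_1) \cup \xi_j(\omega_2)|$ (relative sizes)

$D_{2j}(\omega_1, \omega_2) = \big(|\xi_j(\omega_1)| + |\xi_j(\omega_2)| - 2\,|\xi_j(\omega_1) \cap \xi_j(\omega_2)|\big) \,/\, |\xi_j(\omega_1) \cup \xi_j(\omega_2)|$ (relative content)

$D(\omega_1, \omega_2) = \sum_{j=1}^{p} \left[ D_{1j}(\omega_1, \omega_2) + D_{2j}(\omega_1, \omega_2) \right]$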

Distance between concepts (multi-valued)

Example: Color and Habitat of Birds (Table 7.2)

Y1 = Color, Y2 = Habitat, p = 2

For Y1: D11(ω1, ω2) = |2 − 1|/2 = 1/2

D21(ω1, ω2) = (2 + 1 − 2·1)/2 = 1/2

For Y2: D12(ω1, ω2) = |2 − 1|/2 = 1/2

D22(ω1, ω2) = (2 + 1 − 2·1)/2 = 1/2

The Gowda-Diday dissimilarity

D(ω1, ω2) = (1/2 + 1/2) + (1/2 + 1/2) = 2

Normalized (adjusted for scale), the weights are 3 for Y1 and 2 for Y2

D(ω1, ω2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6
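The arithmetic of this example can be reproduced with exact fractions. The category names below are hypothetical stand-ins, chosen only so that the set sizes match the example (2 values for ω1, 1 value for ω2, 1 value in common); they are not the entries of Table 7.2. The weights 3 and 2 are the ones used in the normalized formula above.

```python
from fractions import Fraction


def gowda_diday_multi(x, y):
    """Gowda-Diday components for one multi-valued variable: sizes plus content."""
    union = len(x | y)
    relative_size = Fraction(abs(len(x) - len(y)), union)
    relative_content = Fraction(len(x) + len(y) - 2 * len(x & y), union)
    return relative_size + relative_content


# Hypothetical values with the same cardinalities as the example.
omega1 = {"color": {"red", "black"}, "habitat": {"urban", "rural"}}
omega2 = {"color": {"black"}, "habitat": {"rural"}}
weights = {"color": 3, "habitat": 2}

per_variable = {j: gowda_diday_multi(omega1[j], omega2[j]) for j in omega1}
unnormalized = sum(per_variable.values())
normalized = sum(d / weights[j] for j, d in per_variable.items())
print(unnormalized, normalized)  # 2 5/6
```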

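Distance between concepts (multi-valued)

The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)

$\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \left( 2\,|A_j \otimes B_j| - |A_j| - |B_j| \right)$, where $A_j$ and $B_j$ are the values of $Y_j$ for $\omega_1$ and $\omega_2$

For Y1: ϕ1(ω1, ω2) = 2 − 1 + γ(2·1 − 2 − 1) = 1 − γ

For Y2: ϕ2(ω1, ω2) = 2 − 1 + γ(2·1 − 2 − 1) = 1 − γ

Taking γ = 0.5, ϕ1(ω1, ω2) = ϕ2(ω1, ω2) = 0.5

Unweighted Minkowski distance

$D_q(\omega_1, \omega_2) = \left( 0.5^q + 0.5^q \right)^{1/q}$

Weighted Minkowski distance

$D_q(\omega_1, \omega_2) = \left( (0.5/3)^q + (0.5/2)^q \right)^{1/q}$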
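A short Python sketch of the same computation follows. The sets are hypothetical stand-ins with the cardinalities used in the example, and the `scales` argument mirrors the way the weighted formula above divides each ϕj by 3 or 2 before raising it to the power q; both the function names and that parameter are assumptions of this sketch.

```python
def ichino_yaguchi_sets(a, b, gamma=0.5):
    """Ichino-Yaguchi dissimilarity for one multi-valued variable."""
    join, meet = a | b, a & b
    return len(join) - len(meet) + gamma * (2 * len(meet) - len(a) - len(b))


def minkowski(phis, q, scales=None):
    """Minkowski distance of order q; each phi is divided by its scale first."""
    if scales is None:
        scales = [1.0] * len(phis)
    return sum((phi / s) ** q for phi, s in zip(phis, scales)) ** (1.0 / q)


# Hypothetical values with the same cardinalities as the example.
phi1 = ichino_yaguchi_sets({"red", "black"}, {"black"})
phi2 = ichino_yaguchi_sets({"urban", "rural"}, {"rural"})

city_block = minkowski([phi1, phi2], q=1)
euclidean = minkowski([phi1, phi2], q=2)
weighted_city_block = minkowski([phi1, phi2], q=1, scales=[3, 2])
print(phi1, phi2, city_block)  # 0.5 0.5 1.0
```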

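Distance between concepts (interval-valued)

Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.

The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991)

$D(\omega_1, \omega_2) = \sum_{j=1}^{p} D_j(\omega_1, \omega_2)$, where for the variable $Y_j$, with values $[a_{1j}, b_{1j}]$ for $\omega_1$ and $[a_{2j}, b_{2j}]$ for $\omega_2$,

$D_j(\omega_1, \omega_2) = D_{1j} + D_{2j} + D_{3j}$

$D_{1j} = \big|\,|b_{1j} - a_{1j}| - |b_{2j} - a_{2j}|\,\big| \,/\, k_j$ (relative length)

$D_{2j} = \left( |b_{1j} - a_{1j}| + |b_{2j} - a_{2j}| - 2 I_j \right) / k_j$ (relative content)

$D_{3j} = |a_{1j} - a_{2j}| \,/\, |Y_j|$ (relative position)

$k_j = |\max(b_{1j}, b_{2j}) - \min(a_{1j}, a_{2j})|$ is the length of the entire distance spanned by ω1 and ω2

$I_j = |\max(a_{1j}, a_{2j}) - \min(b_{1j}, b_{2j})|$ if the intervals overlap (the length of the intersection), and $I_j = 0$ otherwise

$|Y_j|$ is the total length covered by the observed values of $Y_j$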

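Distance between concepts (interval-valued)

The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994)

$\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \left( 2\,|A_j \otimes B_j| - |A_j| - |B_j| \right)$

$A_j \oplus B_j = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$

$A_j \otimes B_j = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})]$ (empty if no intersection)

$|A_j| = b_{1j} - a_{1j}$ is the length of the interval $A_j = [a_{1j}, b_{1j}]$

The generalized Minkowski distance of order q ≥ 1 between two interval-valued observations ξ(ω1) and ξ(ω2) is

$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$

where ϕj(ω1, ω2) is the Ichino-Yaguchi distance and $w_j^*$ is a weight function associated with variable Yj.

When q = 1: city block distance

When q = 2: Euclidean distance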

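Distance between concepts (interval-valued)

The Hausdorff distance (Chavent and Lechevallier, 2002)

$\phi_j(\omega_1, \omega_2) = \max\left( |a_{1j} - a_{2j}|,\; |b_{1j} - b_{2j}| \right)$

$d(\omega_1, \omega_2) = \sum_{j=1}^{p} \phi_j(\omega_1, \omega_2)$

The Euclidean Hausdorff distance

$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) \right]^2 \right)^{1/2}$

where ϕj(ω1, ω2) is the Hausdorff distance

The normalized Euclidean Hausdorff distance

$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / H_j \right]^2 \right)^{1/2}$

where $H_j^2 = \frac{1}{2k^2} \sum_{c=1}^{k} \sum_{c'=1}^{k} \left[ \phi_j(\omega_c, \omega_{c'}) \right]^2$

The span-normalized Euclidean Hausdorff distance

$d(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} \left[ \phi_j(\omega_1, \omega_2) / |Y_j| \right]^2 \right)^{1/2}$

where the span $|Y_j| = \max_c b_{cj} - \min_c a_{cj}$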

Distance between concepts (interval-valued)

Example: take only the first 3 observations of the veterinary data

Gowda-Diday dissimilarity

D(ω1, ω2) =

(Y1) relative position term: |120 − 158|/65

(Y2)
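The full Gowda-Diday computation for this example can be sketched as follows. The interval endpoints are assumed values, chosen to agree with the numbers that survive in the slides (for instance |120 − 158|/65 and the spans 65 and 237.8); treat them as illustrative rather than as a verified copy of the veterinary table. The function name is hypothetical, and the code assumes the two intervals do not both collapse to the same single point.

```python
def gowda_diday_interval(x, y, span):
    """Sum of relative length, content and position terms for one variable."""
    (a1, b1), (a2, b2) = x, y
    k = abs(max(b1, b2) - min(a1, a2))           # length spanned by both intervals
    overlap = max(a1, a2) <= min(b1, b2)
    inter = abs(max(a1, a2) - min(b1, b2)) if overlap else 0.0
    relative_length = abs((b1 - a1) - (b2 - a2)) / k
    relative_content = ((b1 - a1) + (b2 - a2) - 2 * inter) / k
    relative_position = abs(a1 - a2) / span
    return relative_length + relative_content + relative_position


# Assumed endpoints (Y1, Y2) for the first three observations.
observations = [
    [(120.0, 180.0), (222.2, 354.0)],
    [(158.0, 160.0), (322.0, 355.0)],
    [(175.0, 185.0), (117.2, 152.0)],
]

# Span of each variable: largest upper bound minus smallest lower bound.
spans = [
    max(obs[j][1] for obs in observations) - min(obs[j][0] for obs in observations)
    for j in range(2)
]

D = sum(
    gowda_diday_interval(observations[0][j], observations[1][j], spans[j])
    for j in range(2)
)
print(spans, round(D, 4))
```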

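Distance between concepts (interval-valued)

The Ichino-Yaguchi dissimilarity

$\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \left( 2\,|A_j \otimes B_j| - |A_j| - |B_j| \right)$

$A_j \oplus B_j = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$

$A_j \otimes B_j = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})]$ (empty if no intersection)

ϕ1(ω1, ω2) = |180 − 120| − 2 + γ(2·2 − 60 − 2) = 58 + γ(−58)

ϕ2(ω1, ω2) = |355 − 222.2| − 32 + γ(−100.8) = 100.8 + γ(−100.8)

The generalized Minkowski distance

$d_q(\omega_1, \omega_2) = \left( \sum_{j=1}^{p} w_j^* \left[ \phi_j(\omega_1, \omega_2) \right]^q \right)^{1/q}$

When q = 1: city block distance

When q = 2: Euclidean distance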
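The same example can be run in Python. The endpoints below are assumed values consistent with the surviving numbers (58 and 100.8 for the γ-free parts); γ = 0.5 and unit weights in the Minkowski distance are further assumptions of this sketch, as are the function names.

```python
def ichino_yaguchi_interval(x, y, gamma=0.5):
    """Ichino-Yaguchi dissimilarity for one interval-valued variable."""
    (a1, b1), (a2, b2) = x, y
    join_len = max(b1, b2) - min(a1, a2)
    meet_len = max(0.0, min(b1, b2) - max(a1, a2))   # 0 when the meet is empty
    return join_len - meet_len + gamma * (2 * meet_len - (b1 - a1) - (b2 - a2))


def minkowski(phis, q):
    """Generalized Minkowski distance of order q with all weights set to 1."""
    return sum(phi ** q for phi in phis) ** (1.0 / q)


# Assumed endpoints (Y1, Y2) for the first two observations.
omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]

phis = [ichino_yaguchi_interval(x, y) for x, y in zip(omega1, omega2)]
city_block = minkowski(phis, q=1)
euclidean = minkowski(phis, q=2)
print([round(p, 2) for p in phis], round(city_block, 2), round(euclidean, 2))
```

With γ = 0.5 each ϕj is half of its γ-free part, that is, roughly 29 and 50.4.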

Distance between concepts (interval-valued)

The Hausdorff distance

ϕ1(ω1, ω2) = 38

ϕ2(ω1, ω2) = 99.8

d(ω1, ω2) = 38 + 99.8 = 137.8

The Euclidean Hausdorff distance

$d(\omega_1, \omega_2) = \left( 38^2 + 99.8^2 \right)^{1/2} \approx 106.79$

The normalized Euclidean Hausdorff distance

]288.78

d(ω1, ω2)

The span-normalized Euclidean Hausdorff distance

$|Y_1|$ = 185 − 120 = 65

$|Y_2|$ = 355 − 117.2 = 237.8

$d(\omega_1, \omega_2) = \left( (38/65)^2 + (99.8/237.8)^2 \right)^{1/2} \approx 0.72$
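The Hausdorff-based distances of this example can be sketched in Python. The endpoints are assumed values consistent with the numbers above (38, 99.8, 65 and 237.8), and the function name is hypothetical. The normalized variant that uses H_j is left out because it depends on all pairwise distances.

```python
import math


def hausdorff(x, y):
    """Hausdorff distance between two intervals given as (a, b) tuples."""
    (a1, b1), (a2, b2) = x, y
    return max(abs(a1 - a2), abs(b1 - b2))


# Assumed endpoints (Y1, Y2) for the first three observations.
observations = [
    [(120.0, 180.0), (222.2, 354.0)],
    [(158.0, 160.0), (322.0, 355.0)],
    [(175.0, 185.0), (117.2, 152.0)],
]

# Span of each variable: largest upper bound minus smallest lower bound.
spans = [
    max(obs[j][1] for obs in observations) - min(obs[j][0] for obs in observations)
    for j in range(2)
]

phis = [hausdorff(observations[0][j], observations[1][j]) for j in range(2)]
hausdorff_sum = sum(phis)
euclidean = math.sqrt(sum(phi ** 2 for phi in phis))
span_normalized = math.sqrt(sum((phi / s) ** 2 for phi, s in zip(phis, spans)))
print(round(hausdorff_sum, 1), round(euclidean, 2), round(span_normalized, 2))
```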

Distance between concepts (groups) of interval-valued data

Comparison of between-concept distance measures

Interval-valued symbolic data analysis

• Books (Bock and Diday (2000); Billard and Diday (2003, 2006); Diday and Noirhomme-Fraiture (2008))

• PCA (Chouakria, Cazes, and Diday (2000); Palumbo and Lauro (2003); Gioia and Lauro (2006); Hamada, Minami, and Mizuta (2008))

• Clustering analysis (Brito (2002); Souza and de Carvalho (2004); Chavent et al. (2006); Bock (2008))

• Discriminant analysis (Lauro, Verde, and Palumbo (2000); Duarte Silva and Brito (2006))

• MDS (Groenen et al. (2006); Minami and Mizuta (2008))

• Regression (Billard and Diday (2000); de Carvalho et al. (2004))

Visualization Tools for Symbolic Data (Analysis)

Symbolic Data Analysis Software

• SODAS (2003)

Free, from two European consortia

• SYR (2008)

More professional, from the SYROKKO company

www.syrokko.com