Cognitive data analysis

Nikolay Zagoruiko

Institute of Mathematics of the Siberian Division of the Russian Academy of Sciences, Pr. Koptyug 4, 630090 Novosibirsk, Russia

zag@math.nsc.ru

slide2

Areas of interest

Data Analysis, Pattern Recognition, Empirical Prediction, Discovery of Regularities, Data Mining, Machine Learning, Knowledge Discovery, Intelligent Data Analysis

Cognitive Calculations

Human-centered approach:

The person as an object: the study of human cognitive mechanisms.

Solving new strategic tasks is impossible without a rapid increase in the intellectual level of the tools for observation, analysis and control.

The person as a subject: the user of the results of the analysis.

The complexity of these tools and the nature of the results they produce make the results hard to understand. Under these conditions the person is effectively excluded from the man-machine control system.

Specific features of DM tasks:

Large volumes of data

Attributes of mixed types

Number of attributes >> number of objects

Presence of noise and missing values

No information about distributions and dependences

The abundance of methods results from the absence of a uniform approach to solving tasks of different types.

What can we learn from the person?

1. A person understands the results if the classes are separated by perpendicular planes

[Figure: three panels with axes x and y; further labels: X', Y', a, X = 0.8Y - 3]

2. A person understands the results if the classes are described by standards

[Figure: three panels with axes x and y; points marked *; further labels: X', Y']

The unique human ability to recognize hard-to-distinguish patterns is based on the ability to select informative features.

slide10

Does a person switch from one basis to another when solving different classification tasks? Most likely, people use some universal psycho-physiological function.

Our hypothesis:

The basic function used by a person in classification, recognition, feature selection, etc. is a measure of similarity.

slide12

Similarity is not an absolute but a relative category.

Is object b close to a, or is it distant?

[Figure: objects a and b shown alone, and then with a third object c added]

We should know the answer to the question:

In competition with what?
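The question can be made concrete with the function the deck gives later, F = (R2 - R1)/(R2 + R1). The following is a minimal Python sketch, assuming R1 is the distance from the object to its reference, R2 is the distance from the object to the competitor, and the distance is Euclidean. The function name `fris` and all coordinates are hypothetical and chosen only for illustration.

```python
import math


def fris(z, a, c):
    """Similarity of z to a in competition with c: (R2 - R1) / (R2 + R1).

    Assumption: R1 = dist(z, a), R2 = dist(z, c), Euclidean distance.
    """
    r1 = math.dist(z, a)
    r2 = math.dist(z, c)
    if r1 + r2 == 0:
        return 0.0  # z coincides with both points; treated as neutral (assumption)
    return (r2 - r1) / (r2 + r1)


# Hypothetical points on a line: the distance between a and b never changes.
a, b = (0.0, 0.0), (2.0, 0.0)
far_c = (10.0, 0.0)   # competitor far from b
near_c = (2.5, 0.0)   # competitor close to b

print(fris(b, a, far_c))   # positive: b counts as close to a
print(fris(b, a, near_c))  # negative: b counts as distant from a
```

The value lies between -1 and +1, and the same pair a, b changes from similar to dissimilar purely because the competitor moved.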

Compactness

All pattern recognition methods are based on the hypothesis of compactness (Braverman E.M., 1962)

The patterns are compact if

- the number of boundary points is small in comparison with their total number;

- compact patterns are separated from each other by not too elaborate borders.

slide17

Compactness

Similarity between objects of one pattern should be maximal

Similarity between objects of different patterns should be minimal

slide18

Compactness

Defensive capacity:

Compact patterns should satisfy the condition of maximal similarity between objects of the same pattern.

Compactness

Tolerance:

Compact patterns should satisfy the condition of maximal difference between these objects and the objects of other patterns.

Criteria

Informativeness by Fisher for normal distribution

Compactness has the same meaning and can be used as a criterion of informativeness that is invariant to the law of distribution and to the relation between N and M
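As a point of comparison, here is a minimal Python sketch of a Fisher-type informativeness score for one feature and two classes. The exact form on the slide is not given in the text, so the common textbook form (m1 - m2)^2 / (var1 + var2) is assumed here; the function name and the sample values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(x1, x2):
    """Two-class Fisher criterion for one feature (assumed form):
    (m1 - m2)^2 / (var1 + var2), using population variances."""
    m1, m2 = mean(x1), mean(x2)
    v1, v2 = pvariance(x1), pvariance(x2)
    if v1 + v2 == 0:
        # No spread inside either class: perfectly separable or identical.
        return float("inf") if m1 != m2 else 0.0
    return (m1 - m2) ** 2 / (v1 + v2)


# Hypothetical values of one feature in two classes.
separated = fisher_score([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
overlapping = fisher_score([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(separated, overlapping)
```

A score of this kind depends only on means and variances, which is why it suits normally distributed data; a similarity-based criterion makes no such assumption.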

Feature selection

Initial set of features Xo: 1, 2, 3, …, j, …, N

Engine: GRAD

Variant of subset X: <1, 2, …, n>

Criterion: FRiS-compactness (Bad / Good)

Algorithm GRAD

GRAD

It is based on a combination of two greedy algorithms: forward and backward search.

At the forward stage the Addition algorithm is used (J.L. Barabash, 1963).

At the backward stage the Deletion algorithm is used (Merill T. and Green O.M., 1963).

GRAD

Algorithm AdDel

To reduce the influence of accumulated errors, a relaxation method is applied.

n1 - number of the most informative attributes added to the subsystem (Add),

n2<n1 - number of the least informative attributes eliminated from the subsystem (Del).

AdDel relaxation method: n steps forward, n/2 steps back

Algorithm AdDel. Reliability (R) of recognition for different dimensions of the feature space.

R(AdDel) > R(DelAd) > R(Ad) > R(Del)
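The add-then-delete scheme can be sketched in Python as follows. This is an illustrative reading of the slide, not the author's implementation: the function name `addel`, the `score` callable and the toy criterion are hypothetical, and any quality criterion that should be maximized can be plugged in.

```python
def addel(features, score, n1=2, n2=1, target_size=None):
    """Greedy search: n1 forward (Add) steps, then n2 < n1 backward (Del) steps.

    `score(subset)` is a hypothetical quality criterion to maximize.
    Returns the best subset seen during the search and its score.
    """
    if not 0 <= n2 < n1:
        raise ValueError("need 0 <= n2 < n1")
    limit = len(features) if target_size is None else min(target_size, len(features))
    selected, best, best_score = [], [], float("-inf")

    def remember():
        nonlocal best, best_score
        s = score(selected)
        if s > best_score:
            best, best_score = list(selected), s

    while len(selected) < limit:
        for _ in range(n1):  # Add: take the attribute that helps most
            if len(selected) >= limit:
                break
            rest = [f for f in features if f not in selected]
            selected.append(max(rest, key=lambda f: score(selected + [f])))
            remember()
        if len(selected) >= limit:
            break
        for _ in range(n2):  # Del: drop the attribute whose removal hurts least
            if len(selected) <= 1:
                break
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            selected.remove(drop)
            remember()
    return best, best_score


# Toy criterion (hypothetical): attributes 1 and 4 are useful, the rest cost 0.1 each.
useful = {1, 4}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & useful) - 0.1 * len(chosen - useful)


best, best_score = addel(list(range(6)), toy_score, n1=2, n2=1)
print(sorted(best), best_score)  # [1, 4] 2.0
```

Because each cycle adds more attributes than it removes, the subset grows steadily, while the backward steps give the search a chance to undo an early greedy mistake.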

GRAD

Algorithm GRAD
  • AdDel can work with groups of attributes (granules) of different capacity m = 1, 2, 3, …

The granules can be formed by exhaustive search.

  • But: the problem of combinatorial explosion!

Solution: rely on the individual informativeness of the attributes.

This allows granulating only the most informative part of the attributes.

[Figure: dependence of the frequency f of hits in an informative subsystem on the serial number L of an attribute in the ranking by individual informativeness]

GRAD

Algorithm GRAD (GRanulated AdDel)

1. Independent testing of N attributes

Selection of the m1 << N best (m1 granules of power 1)

2. Forming combinations

Selection of the m2 best, where m2 is much smaller than the number of combinations (m2 granules of power 2)

3. Forming combinations

Selection of the m3 best, where m3 is much smaller than the number of combinations (m3 granules of power 3)

M = <m1, m2, m3> - set of secondary attributes (granules)

AdDel selects the m* << |M| best granules, which include n* << N attributes
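The granulation steps above can be sketched in Python. This is a hedged illustration under one assumption the slide does not spell out: pairs and triples are formed only from the m1 individually best attributes. The name `build_granules`, the `score` callable and the toy weights are hypothetical.

```python
from itertools import combinations


def build_granules(features, score, m1, m2, m3):
    """Return granules of power 1, 2 and 3, each group ranked by `score`.

    Assumption: combinations are formed only from the m1 individually best
    attributes, which avoids exhaustive search over all N attributes.
    """
    def top(candidates, m):
        return sorted(candidates, key=score, reverse=True)[:m]

    singles = top([(f,) for f in features], m1)
    pool = [g[0] for g in singles]
    pairs = top(combinations(pool, 2), m2)
    triples = top(combinations(pool, 3), m3)
    return singles + pairs + triples


# Toy criterion (hypothetical): individual weights plus a bonus when
# attributes 1 and 3 appear together.
weights = {0: 0.1, 1: 0.9, 2: 0.2, 3: 0.8, 4: 0.3}


def toy_score(granule):
    bonus = 1.0 if set(granule) >= {1, 3} else 0.0
    return sum(weights[f] for f in granule) + bonus


granules = build_granules(range(5), toy_score, m1=3, m2=2, m3=1)
print(granules)
```

The resulting list plays the role of the set M of secondary attributes; a search such as AdDel would then pick the best granules from it instead of from the original N attributes.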

Criteria

Comparison of the criteria (CV - FRiS)

[Figure: order of attributes by informativeness for the two criteria; panels labelled C = 0.661 and C = 0.883, each with a noise region]

N = 100, M = 2*100, mt = 2*35, mC = 2*65 + noise

Some real tasks

Task                                                K    M        N

Medicine:
  Diagnostics of type II diabetes                   3    43       5520
  Diagnostics of prostate cancer                    4    322      17153
  Recognition of type of leukemia                   2    38       7129
  Microarray data                                   2    1000     500000
  9 genetic tables                                  2    50-150   2000-12000

Physics:
  Complex analysis of spectra                       7    20-400   1024

Commerce:
  Forecasting of book sales (Data Mining Cup 2009)  -    4812     1862

Recognition of two types of Leukemia - ALL and AML

               Total   ALL   AML
Training set   38      27    11     N = 7129
Control set    34      20    14

I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002, 46(1-3), pp. 389-422.

slide36

Pentium T=3 hours

Pentium T=15 sec

In the first 27 subspaces P = 34/34

FRiS      Decision rules                       P
0.72656   537/1, 1833/1, 2641/2, 4049/2        34
0.71373   1454/1, 2641/1, 4049/1               34
0.71208   2641/1, 3264/1, 4049/1               34
0.71077   435/1, 2641/2, 4049/2, 6800/1        34
0.70993   2266/1, 2641/2, 4049/2               34
0.70973   2266/1, 2641/2, 2724/1, 4049/2       34
0.70711   2266/1, 2641/2, 3264/1, 4049/2       34
0.70574   2641/2, 3264/1, 4049/2, 4446/1       34
0.70532   435/1, 2641/2, 2895/1, 4049/2        34
0.70243   2641/2, 2724/1, 3862/1, 4049/2       34

Training set 38, test set 34

Ng     Vsuc   Vext    Vmed   Tsuc   Text    Tmed   P
7129   0.95   0.01    0.42   0.85   -0.05   0.42   29
4096   0.82   -0.67   0.30   0.71   -0.77   0.34   24
2048   0.97   0.00    0.51   0.85   -0.21   0.41   29
1024   1.00   0.41    0.66   0.94   -0.02   0.47   32
512    0.97   0.20    0.79   0.88   0.01    0.51   30
256    1.00   0.59    0.79   0.94   0.07    0.62   32
128    1.00   0.56    0.80   0.97   -0.03   0.46   33
64     1.00   0.45    0.76   0.94   0.11    0.51   32
32     1.00   0.45    0.65   0.97   0.00    0.39   33
16     1.00   0.25    0.66   1.00   0.03    0.38   34
8      1.00   0.21    0.66   1.00   0.05    0.49   34
4      0.97   0.01    0.49   0.91   -0.08   0.45   31
2      0.97   -0.02   0.42   0.88   -0.23   0.44   30
1      0.92   -0.19   0.45   0.79   -0.27   0.23   27

Name of gene      Weight
2641/1, 4049/1    33
2641/1            32

Zagoruiko N., Borisova I., Dyubanov V., Kutnenko O.

I.Guyon, J.Weston, S.Barnhill, V.Vapnik

Comparison with 10 methods
  • Jeffery I., Higgins D., Culhane A. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data.
  • http://www.biomedcentral.com/1471-2105/7/359

9 tasks on microarray data. 10 methods of feature selection.

Independent attributes. Selection of the first n (best).

Criterion: minimum of errors on CV, 10 times by 50%.

Decision rules:

Support Vector Machine (SVM), Between Group Analysis (BGA),

Naive Bayes Classification (NBC), K-Nearest Neighbors (KNN).
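The evaluation criterion can be sketched in Python, reading "10 times by 50%" as ten random splits of the data into equal training and test halves; this reading is an assumption. A plain 1-nearest-neighbour rule stands in for the KNN decision rule, and the function names and sample data are hypothetical.

```python
import math
import random


def one_nn_predict(train_x, train_y, x):
    """Label of the nearest training object (1-NN, Euclidean distance)."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[nearest]


def repeated_half_split_error(xs, ys, repeats=10, seed=0):
    """Mean test error over `repeats` random 50/50 train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    errors = []
    for _ in range(repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        train, test = indices[:half], indices[half:]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        wrong = sum(one_nn_predict(train_x, train_y, xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)


# Hypothetical data: two well-separated groups of five objects each.
xs = [(float(i), 0.0) for i in range(5)] + [(100.0 + i, 0.0) for i in range(5)]
ys = [0] * 5 + [1] * 5
print(repeated_half_split_error(xs, ys))
```

A feature-selection method would be judged by the smallest such error reached over the candidate numbers n of selected attributes.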

Methods of selection

Methods                                                         Results
Significance analysis of microarrays (SAM)                      42
Analysis of variance (ANOVA)                                    43
Empirical Bayes t-statistic                                     32
Template matching                                               38
maxT                                                            37
Between group analysis (BGA)                                    43
Area under the receiver operating characteristic curve (ROC)    37
Welch t-statistic                                               39
Fold change                                                     47
Rank products                                                   42
FRiS-GRAD                                                       12

Empirical Bayes t-statistic: for medium-sized sets of objects

Area under a ROC curve: for small noise and large sets

Rank products: for large noise and small sets

Results of comparison

Task       N0      m1/m2    max of 4   GRAD
ALL1       12625   95/33    100.0      100.0
ALL2       12625   24/101   78.2       80.8
ALL3       12625   65/35    59.1       73.8
ALL4       12625   26/67    82.1       83.9
Prostate   12625   50/53    90.2       93.1
Myeloma    12625   36/137   82.9       81.4
ALL/AML    7129    47/25    95.9       100.0
DLBCL      7129    58/19    94.3       93.5
Colon      2000    22/40    88.6       89.5

Average                     85.7       88.4
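The averages in the table can be checked with a few lines of Python. The numbers are copied from the table; reading "max of 4" as the best of the four decision rules is an assumption, and the variable names are hypothetical.

```python
# Per-task results copied from the table: (max of 4, GRAD).
results = {
    "ALL1": (100.0, 100.0),
    "ALL2": (78.2, 80.8),
    "ALL3": (59.1, 73.8),
    "ALL4": (82.1, 83.9),
    "Prostate": (90.2, 93.1),
    "Myeloma": (82.9, 81.4),
    "ALL/AML": (95.9, 100.0),
    "DLBCL": (94.3, 93.5),
    "Colon": (88.6, 89.5),
}

best_of_4 = [pair[0] for pair in results.values()]
grad = [pair[1] for pair in results.values()]

avg_best = round(sum(best_of_4) / len(best_of_4), 1)
avg_grad = round(sum(grad) / len(grad), 1)
grad_wins = sum(g > b for b, g in results.values())

print(avg_best, avg_grad, grad_wins)  # 85.7 88.4 6
```

GRAD is ahead on six of the nine tasks, ties on ALL1, and is behind on Myeloma and DLBCL.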

Unsettled problems
  • Censoring of training set
  • Recognition with boundary
  • Stolp+corridor (FRiS+LDR)
  • Imputation
  • Associations
  • Unification of tasks of different types (UC+X)
  • Optimization of algorithms
  • Implementation of the software system (OTEX 2)
  • Applications (medicine, genetics,…)
  • …..
Conclusion

FRiS-function:

1. Provides an effective measure of similarity, informativeness and compactness

2. Provides unification of methods

3. Provides high quality of decisions

Publications:

http://math.nsc.ru/~wwwzag

Thank you!
  • Questions, please?
Stolp

Decision rules: choosing standards (stolps)
  • A stolp is an object which

protects its own objects

and does not attack another's objects

Defensive capacity:

Similarity of own objects to the stolp should be maximal

(a minimum of missed targets)

Tolerance:

Similarity of another's objects to the stolp should be minimal

(a minimum of false alarms)

Stolp

Algorithm FRiS-Stolp

Compact patterns should satisfy two conditions:

F(j,i)|b = (R2 - R1)/(R2 + R1)

Defensive capacity:

Maximal similarity of own objects to stolp i

Tolerance:

Maximal difference of other patterns' objects from stolp i
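One possible reading of the two conditions is sketched below in Python. It is an assumption-laden illustration, not the author's algorithm: R1 is taken as the distance to the own reference and R2 as the distance to the competitor, defensive capacity and tolerance are averaged with equal weight, distances are Euclidean, and the names `fris` and `choose_stolp` and the sample points are hypothetical.

```python
import math


def fris(r1, r2):
    """FRiS value from the distance to the own reference (r1) and to the competitor (r2)."""
    return (r2 - r1) / (r2 + r1) if r1 + r2 > 0 else 0.0


def choose_stolp(own, others):
    """Pick the object of `own` with the best mean of defensive capacity and tolerance.

    Requires at least two objects in each of `own` and `others`.
    """
    def nearest(p, pts):
        return min(math.dist(p, q) for q in pts)

    best, best_score = None, float("-inf")
    for i, cand in enumerate(own):
        mates = [p for k, p in enumerate(own) if k != i]
        # Defensive capacity: own objects should be closer to the candidate
        # than to their nearest competitor.
        defence = sum(
            fris(math.dist(p, cand), nearest(p, others)) for p in mates
        ) / len(mates)
        # Tolerance: foreign objects should be closer to their own nearest
        # neighbour than to the candidate.
        tolerance = sum(
            fris(nearest(q, [o for k, o in enumerate(others) if k != j]),
                 math.dist(q, cand))
            for j, q in enumerate(others)
        ) / len(others)
        score = (defence + tolerance) / 2
        if score > best_score:
            best, best_score = cand, score
    return best, best_score


# Hypothetical one-dimensional example embedded in the plane.
own = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
others = [(10.0, 0.0), (11.0, 0.0)]
stolp, value = choose_stolp(own, others)
print(stolp, round(value, 3))
```

In this toy example the middle object is chosen: it is closest on average to its own class while staying far enough from the other class.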


slide59

Decision rules

Algorithm FRiS-Stolp

Comparison of FRiS-Class with other taxonomy algorithms

[Figure: axis K]