Data Mining on NIJ data


Sangjik Lee

Presentation Transcript

Unstructured Data Mining

Text → Keyword Extraction → Structured Data Base → Data Mining

Image → Feature Extraction → Structured Data Base → Data Mining

Handwritten CEDAR Letter



Document Level Features

  • 1. Entropy

  • 2. Gray-level threshold

  • 3. Number of black pixels

  • 4. Stroke width

  • 5. Number of interior contours

  • 6. Number of exterior contours

  • 7. Number of vertical slope components

  • 8. Number of horizontal slope components

  • 9. Number of negative slope components

  • 10. Number of positive slope components

  • 11. Slant

  • 12. Height

Feature groups: measure of pen pressure, measure of writing movement, measure of stroke formation, slant, word proportion


$$\tan^{-1}\left(\frac{S_y(i,j)}{S_x(i,j)}\right)$$
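The expression above is the arctangent of the ratio Sy(i,j) / Sx(i,j). A minimal sketch of how such a per-pixel direction could be computed, assuming Sx and Sy are 3x3 Sobel operator responses (the slide does not define them, so the kernels and the helper names below are assumptions):

```python
import math

# Assumed 3x3 Sobel kernels; the slide does not define Sx and Sy.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def kernel_response(image, kernel, i, j):
    """Correlate a 3x3 kernel with the image neighbourhood centred at (i, j)."""
    return sum(
        kernel[di + 1][dj + 1] * image[i + di][j + dj]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
    )

def gradient_direction(image, i, j):
    """Return the direction tan^-1(Sy / Sx) in radians at an interior pixel."""
    sx = kernel_response(image, SOBEL_X, i, j)
    sy = kernel_response(image, SOBEL_Y, i, j)
    # atan2 handles Sx == 0 and keeps the quadrant, unlike a plain division.
    return math.atan2(sy, sx)

# Hypothetical test images: intensity steps from 0 to 1.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]       # changes left to right
horizontal_edge = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]  # changes top to bottom

angle_v = gradient_direction(vertical_edge, 1, 1)
angle_h = gradient_direction(horizontal_edge, 1, 1)
```

Using `atan2` rather than dividing first is a design choice of this sketch, made to avoid a division by zero where the horizontal response vanishes.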

Character Level Features



Gradient :00000000001100000000110000111000000011100000001100000011000100000000110000000000000111001100011111000011110000000010

01010000010001110011111001111100000100000100000000000000000000

01000001001000 (192)

Structure :000000000000000000001100001110001000010000100000010000

000000000100101000000000011000010100110000110000000000000100100

011001100000000000000110010100000000000001100000000000000000000

000000010000(192)

Concavity :11110110100111110110011000000110111101101001100100000

110000011100000000000000000000000000000000000000000111111100000

000000000000 (128)


Writer and Feature Data

Writer data                                                 | Feature data (normalized)
Gen | Age                     | Han | Edu | Ethn      | Sch | dark blob hole slant width skew ht
M F | <14 <24 <44 <64 <84 >85 | L R | H C | H W B A O | U F | int  int  int  real  int   real int
0 1 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .95  .49  .70  .71   .50   .10  .30
0 1 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .94  .49  .75  .70   .50   .11  .30
0 1 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .94  .49  .67  .74   .50   .10  .30
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 1 0 | .93  .72  .33  .47   .50   .21  .28
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 1 0 | .93  .74  .33  .48   .50   .22  .26
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 1 0 | .93  .79  .36  .54   .50   .18  .27
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .92  .30  .61  .66   .60   .11  .35
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .94  .42  .72  .66   .60   .11  .32
1 0 |  0   0   1   0   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .94  .40  .75  .67   .60   .12  .34
1 0 |  0   0   0   1   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .96  .30  .60  .59   .50   .10  .21
1 0 |  0   0   0   1   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .95  .32  .60  .59   .50   .09  .22
1 0 |  0   0   0   1   0   0  | 0 1 | 0 1 | 0 0 0 1 0 | 0 1 | .95  .30  .66  .60   .50   .10  .21
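The writer data appears to be a 0/1 vector with one group of columns per attribute. A minimal sketch of decoding such a vector, assuming the 19 binary values line up with the header labels in order (Gen 2, Age 6, Han 2, Edu 2, Ethn 5, Sch 2) and that each group is one-hot; both are assumptions read off the slide, and `decode_writer` is a hypothetical helper:

```python
# Column layout as read from the slide's header rows (2+6+2+2+5+2 = 19 columns).
# The alignment of values to labels is an assumption.
WRITER_FIELDS = [
    ("Gen", ["M", "F"]),
    ("Age", ["<14", "<24", "<44", "<64", "<84", ">85"]),
    ("Han", ["L", "R"]),
    ("Edu", ["H", "C"]),
    ("Ethn", ["H", "W", "B", "A", "O"]),
    ("Sch", ["U", "F"]),
]

def decode_writer(bits):
    """Map a 19-element 0/1 vector to {field: label}, assuming one-hot groups."""
    expected = sum(len(labels) for _, labels in WRITER_FIELDS)
    if len(bits) != expected:
        raise ValueError(f"expected {expected} values, got {len(bits)}")
    decoded, start = {}, 0
    for name, labels in WRITER_FIELDS:
        group = bits[start:start + len(labels)]
        if sum(group) != 1:
            raise ValueError(f"{name} is not one-hot: {group}")
        decoded[name] = labels[group.index(1)]
        start += len(labels)
    return decoded

# First writer row from the slide.
first_row = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
writer = decode_writer(first_row)
```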


Instances of the Data (normalized)

Feature document level data (12 features)

Entropy  dark  pixel  blob  hole  hslope  nslope  pslope  vslope  slant  width  ht
real     int   int    int   int   int     int     int     int     real   int    int
.95      .49   .70    .71   .50   .10     .51     .92     .13     .47    .32    .21
.94      .49   .75    .70   .50   .11     .53     .84     .26     .54    .35    .18
.94      .49   .67    .74   .50   .10     .45     .85     .23     .48    .32    .22
.93      .72   .33    .47   .50   .21     .28     .30     .66     .60    .42    .10
.93      .74   .33    .48   .50   .22     .26     .30     .60     .59    .45    .10
.93      .79   .36    .54   .50   .18     .27     .32     .60     .59    .52    .09
.92      .30   .61    .66   .60   .11     .35     .49     .70     .71    .57    .10
.94      .42   .72    .66   .60   .11     .32     .49     .67     .74    .53    .10
.94      .40   .75    .67   .60   .12     .34     .49     .75     .70    .54    .11
.96      .30   .60    .59   .50   .10     .21     .30     .66     .60    .36    .10
.95      .32   .60    .59   .50   .09     .22     .30     .60     .59    .39    .10
.95      .30   .66    .60   .50   .10     .21     .32     .60     .59    .34    .09


Data Mining on sub-group

White female

White male

Black female

Black male


Gen Age Han Edu Ethn Sch

M F <14 <24 <44 <64 <84 >85 L R H C H W B A O U F

Data Mining on sub-group (Cont.)

  • Subgroup analysis can yield useful information to mine.

  • 1-constraint subgroups

      • {Male : Female},

      • {White : Black : Hispanic}, etc.

  • 2-constraint subgroups

      • {Male-white : Female-white}, etc.

  • 3-constraint subgroups

      • {Male-white-25~45 : Female-white-25~45}, etc.

There is a combinatorially large number of subgroups.
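To make the growth concrete, here is a rough count of how many distinct subgroups exist for each number of constraints. It assumes the attribute value counts suggested by the header row above (Gen 2, Age 6, Han 2, Edu 2, Ethn 5, Sch 2); those counts and the function name are assumptions, not figures stated on the slides.

```python
from itertools import combinations
from math import prod

# Number of values per writer attribute, as read off the header row
# (an assumption about how the columns are grouped).
LEVELS = {"Gen": 2, "Age": 6, "Han": 2, "Edu": 2, "Ethn": 5, "Sch": 2}

def count_subgroups(levels, k):
    """Count distinct subgroups defined by fixing exactly k attributes."""
    return sum(
        prod(levels[a] for a in attrs)
        for attrs in combinations(levels, k)
    )

counts = {k: count_subgroups(LEVELS, k) for k in range(1, len(LEVELS) + 1)}
total = sum(counts.values())
```

Under these assumed value counts there are 19 one-constraint subgroups, 142 two-constraint subgroups, and 3401 subgroups in total.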


subgroups

Gender (G), Age (A), Handedness (H), Ethnicity (E), eDucation (D), Schooling (S)

W: If |W| < support, reject

Constraints

1: G A H E D S

2: GA GH GE GD GS AH AE AD AS HE HD HS ED ES DS ……

3: GAH GAE GAD GAS GHE GHD GHS GED GES GDS AHE ...

...

GAHEDS
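The rejection rule "If |W| < support, reject" can be sketched as a walk over attribute combinations, level by level. The writer records, attribute values, and the function name below are hypothetical; only the support test itself comes from the slide.

```python
from itertools import combinations

def frequent_subgroups(writers, attributes, support, max_constraints):
    """Return {constraint tuple: member count} for subgroups with |W| >= support."""
    kept = {}
    for k in range(1, max_constraints + 1):
        for attrs in combinations(attributes, k):
            # Count how many writers share each combination of values.
            counts = {}
            for w in writers:
                key = tuple((a, w[a]) for a in attrs)
                counts[key] = counts.get(key, 0) + 1
            for key, n in counts.items():
                if n >= support:  # "If |W| < support, reject"
                    kept[key] = n
    return kept

# Hypothetical writer records using the slide's attribute letters.
writers = [
    {"G": "M", "A": "25~44", "E": "white"},
    {"G": "M", "A": "25~44", "E": "white"},
    {"G": "F", "A": "25~44", "E": "white"},
    {"G": "F", "A": "12~24", "E": "black"},
    {"G": "M", "A": "12~24", "E": "black"},
]
result = frequent_subgroups(writers, ["G", "A", "E"], support=3, max_constraints=2)
```

This sketch checks every combination up to `max_constraints`. Since adding a constraint can only shrink a subgroup, a fuller version could skip any combination that extends an already rejected subgroup.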


Database

Writer data

Raw feature data

Normalized feature data

Color Scale: 0.0 ~ 1.0


Feature Database (White and Black)

[Figure: feature database displayed by gender (Female, Male), ethnicity (white, black), and age group (12~24, 25~44, 45~64, >= 65)]


What to do

  • 1. Feature Selection

  • A process that chooses an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda)

  • Since there is a limited number of writers in each sub-group, a reduced subset of features is needed.

  • To improve performance (speed of learning, predictive accuracy, or simplicity of rules)

  • To visualize the data for model selection

  • To reduce dimensionality and remove noise


Feature Selection

Example of feature selection

[Figure: feature-pair panels labelled 1-3, 7-9, 7-11, 9-11; groups Feature 1-2 ~ 2-3, Feature 6-10 ~ 8-12, Feature 9-10 ~ 11-12]

  • Knowing that some features are highly correlated with others can help remove redundant features
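One way to look for redundant features is to list the feature pairs whose Pearson correlation exceeds a threshold. The sketch below is only an illustration: the feature columns, the 0.9 threshold, and the function names are hypothetical, and the slides do not say which correlation measure was used.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def correlated_pairs(columns, threshold):
    """List (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    pairs = []
    for (na, xa), (nb, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            pairs.append((na, nb, round(r, 3)))
    return pairs

# Hypothetical feature columns; f2 is exactly 2 * f1, so one of them is redundant.
columns = {
    "f1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "f2": [0.2, 0.4, 0.6, 0.8, 1.0],
    "f3": [0.5, 0.1, 0.4, 0.2, 0.3],
}
redundant = correlated_pairs(columns, threshold=0.9)
```

For each reported pair, one of the two features is a candidate for removal.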


What to do

  • 2. Visualization of the trend (if any) of writer sub-groups

  • A useful tool for quickly obtaining an overall structural view of sub-group trends

  • Seeing is believing!


Implementation of Subgroup Analysis on NIJ Data

Task: Which writer subgroup is more distinguishable than others (if any)?

Writer Data

Find a subgroup that has enough support

Feature Data

Data Preparation

Subgroup Classifier


Results of Subgroup Classification

Procedure for writer subgroup analysis

  • Find a subgroup that has enough support

  • Choose ‘the other’ (complement) group

  • Make data sets (4) for the Artificial Neural Network

  • Train the ANN and get the results from two test sets

  • Limits

  • 3 categories are used (gender, ethnicity and age)

  • up to 2 constraints are considered

  • only document-level features are used
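The data preparation step can be sketched as labelling each writer's feature vector by subgroup membership and splitting the result. The slides do not say what the four data sets are, so this sketch produces only one training and one test split; the records, field names, and function name are hypothetical.

```python
import random

def make_subgroup_dataset(writers, features, constraints, support,
                          test_fraction=0.5, seed=0):
    """Label vectors 1 (subgroup) or 0 (complement), then split into train/test.

    Returns None when the subgroup has fewer than `support` members.
    """
    labels = [
        int(all(w[k] == v for k, v in constraints.items()))
        for w in writers
    ]
    if sum(labels) < support:
        return None  # subgroup rejected: not enough support
    rows = list(zip(features, labels))
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# Hypothetical writers and two-feature vectors.
writers = [
    {"gender": "M", "ethnicity": "white"},
    {"gender": "F", "ethnicity": "white"},
    {"gender": "M", "ethnicity": "black"},
    {"gender": "M", "ethnicity": "white"},
]
features = [[0.95, 0.49], [0.94, 0.42], [0.93, 0.72], [0.96, 0.30]]
target = {"gender": "M", "ethnicity": "white"}

split = make_subgroup_dataset(writers, features, target, support=2)
rejected = make_subgroup_dataset(writers, features, target, support=3)
```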


Subgroup Classifier

[Figure: handwritten document ("This is a test. This is a sample writing for document 1 written by an author a.") → Feature extraction → feature space representation of the handwritten document (dark, blob, hole, slant, height) → Artificial neural network (11-6-1) → Which group is the writer in?]
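The notation 11-6-1 reads as 11 inputs, 6 hidden units, and 1 output. A minimal sketch of the forward pass of such a network follows; the sigmoid activation, the random untrained weights, and the input vector are assumptions, and training (for example by backpropagation) is omitted.

```python
import math
import random

def init_network(sizes, seed=0):
    """Create one weight matrix per layer; each unit has n_in weights plus a bias."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(network, x):
    """Propagate an input vector through every layer and return the output list."""
    activation = list(x)
    for layer in network:
        # unit[-1] is the bias; zip stops after the n_in input weights.
        activation = [
            sigmoid(unit[-1] + sum(w * a for w, a in zip(unit, activation)))
            for unit in layer
        ]
    return activation

network = init_network([11, 6, 1])   # 11 inputs, 6 hidden units, 1 output
sample = [0.5] * 11                  # hypothetical normalized feature vector
output = forward(network, sample)
```

With a single sigmoid output, a value above 0.5 could be read as "in the subgroup" and a value below as "in the complement", though the slides do not state the decision rule.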



They’re distinguishable, but why...

  • Need to explain why they’re distinguishable

  • ANN does a good job, but cannot clearly explain its output

  • 12 features are too many to explain and visualize

  • Only 2 (or 3) dimensions are visualizable

  • Question: Does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually?

  • Reference: Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda


Feature Extraction

  • A common characteristic of feature extraction methods is that they all produce new features y based on the original features x.

  • After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision tree building, can be conveniently used.

  • Feature extraction started as early as the 1960s and 1970s, as the problem of finding the intrinsic dimensionality of a data set: the minimum number of independent features required to generate the instances


Visualization Perspective

  • High-dimensional data cannot be analyzed visually

  • It is often necessary to reduce its dimensionality in order to visualize the data

  • The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix (R) computed from the data


Visualization Perspective

  • The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated.

  • This property of uncorrelated features is derived from a theorem stating that if the eigenvalues of a matrix are distinct, then the associated eigenvectors are linearly independent; for a symmetric matrix such as R they are also mutually orthogonal

  • For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R


Applied to the NIJ data

1. Normalize each feature’s values into the range [0,1]

2. Obtain the correlation matrix for the 12 original features

3. Find the eigenvalues of the correlation matrix

4. Select the two largest eigenvalues

5. Output the eigenvectors associated with the chosen eigenvalues. Here we obtain a 12 × 2 transformation matrix M

6. Transform the normalized data Dold into data Dnew of extracted features as follows:

Dnew = Dold M

The resulting data is 2-dimensional, with the original class label attached to each instance
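The six steps can be sketched with NumPy as follows. This is a minimal illustration that follows the steps as listed, not the presenter's implementation; the random input data and the function name are hypothetical, and constant feature columns would need to be dropped first because their correlation is undefined.

```python
import numpy as np

def kl_transform(data, n_components=2):
    """Project data onto the leading eigenvectors of its correlation matrix."""
    data = np.asarray(data, dtype=float)
    # Step 1: min-max normalize each feature (column) into [0, 1].
    lo, hi = data.min(axis=0), data.max(axis=0)
    d_old = (data - lo) / (hi - lo)
    # Step 2: correlation matrix of the features (columns are variables).
    corr = np.corrcoef(d_old, rowvar=False)
    # Step 3: eigen-decomposition; eigh suits symmetric matrices and
    # returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Steps 4-5: keep the eigenvectors of the largest eigenvalues as matrix M.
    order = np.argsort(eigvals)[::-1][:n_components]
    m = eigvecs[:, order]
    # Step 6: Dnew = Dold M.
    return d_old @ m, m, eigvals[order]

# Hypothetical data: 30 instances with 12 features each.
rng = np.random.default_rng(0)
data = rng.random((30, 12))
d_new, m, top_eigvals = kl_transform(data)
```

Here `d_new` has one 2-dimensional row per instance, so class labels can be attached row by row for plotting. Because the steps use the correlation matrix on min-max normalized data, the two new features are exactly uncorrelated only if the features are also scaled to unit variance.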



Applied to the NIJ data

Sample Iris data (the original is 4-dimensional)

