
exhaustkd

Searching exhaustively for an optimal setting of Timbl’s -k and -d parameters


Overview

  • Introduction:

    • Timbl’s k & distance weighting

  • Idea:

    • Read knn sets and distances from Timbl’s output

  • Implementation:

    • exhaust.py & cvexhaust.py

  • Examples:

    • diminutive & Prosit data

  • Discussion:

    • limitations & improvements


Introduction

Knn classification without distance weighting

[Figure: a query instance "?" among training instances of classes X and Y, with the nearest-neighbor sets at k=1, k=2, and k=3; without weighting, every neighbor counts equally in the vote.]

Introduction (cont.)

Knn classification with distance weighting

[Figure: the same query instance "?" and neighbor sets at k=1, k=2, and k=3, now with closer neighbors carrying more weight in the vote.]

Introduction (cont.)

  • Distance weighting methods (see the sketch after this list):

    • Z (no weighting)

    • ID (inverse distance)

    • IL (inverse linear)

    • EDa (exponential decay with alpha a)
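
A minimal Python sketch of what these weightings amount to; the epsilon guard for ID and the exact IL normalisation are my assumptions, so check the Timbl manual for the authoritative definitions:

import math

EPSILON = 1e-6  # assumed guard against division by zero in ID

def weight(d, method="Z", alpha=1.0, d1=0.0, dk=1.0):
    """Weight of a neighbor at distance d; d1 and dk are the distances
    of the 1st and k-th nearest neighbors (needed by IL)."""
    if method == "Z":                # no weighting: plain majority vote
        return 1.0
    if method == "ID":               # inverse distance
        return 1.0 / (d + EPSILON)
    if method == "IL":               # inverse linear between d1 and dk
        return (dk - d) / (dk - d1) if dk != d1 else 1.0
    if method.startswith("ED"):      # exponential decay with constant alpha
        return math.exp(-alpha * d)
    raise ValueError("unknown weighting method: %r" % method)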


Idea

  • Knn classification is actually a two-step process:

    • Determine the nearest neighbor sets (= those instances with similar features, optionally using feature weighting and MVDM)

    • Determine the majority class among all nearest neighbors up to the k-th distance, optionally using distance weighting

  • We can do step 2 without repeating step 1! (a sketch follows below)
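
A minimal sketch of step 2 in isolation, reusing the weight() sketch above; the neighbor list is assumed to hold one (distance, class distribution) pair per distance rank, sorted by increasing distance, which is exactly what Timbl's +v output (next slides) provides:

from collections import defaultdict

def classify(neighbors, k, method="Z", alpha=1.0):
    """Weighted majority vote over the first k distance ranks of a
    cached neighbor list; step 1 (finding neighbors) is not redone."""
    ranks = neighbors[:k]
    d1, dk = ranks[0][0], ranks[-1][0]
    votes = defaultdict(float)
    for dist, distribution in ranks:
        w = weight(dist, method, alpha, d1, dk)
        for cls, count in distribution.items():
            votes[cls] += w * count
    return max(votes, key=votes.get)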


Idea (cont.)

  • The +v option allows you to write the knn sets and their distances to the output file

    • +vn := write nearest neighbors

    • +vdi := write distance (of the instance to be classified, and of its knn sets)

    • +vdb := write class distribution (of the instance to be classified, and of its knn sets)

  • Example:

    • Timbl -f dimin.train -t dimin.test -k3 +vn+di+db



=,=,=,=,+,k,u,=,-,bl,u,m,E,E { E 4.00000, P 3.00000 } 0.0000000000000

# k=1, 1 Neighbor(s) at distance: 0.00000

# =,=,=,=,+,k,u,=,-,bl,u,m,{ P 1 }

# k=2, 1 Neighbor(s) at distance: 0.0594251

# =,=,=,=,+,m,K,=,-,bl,u,m,{ E 1 }

# k=3, 5 Neighbor(s) at distance: 0.103409

# =,=,=,=,+,m,O,z,-,bl,u,m,{ E 1, P 1 }

# =,=,=,=,+,st,K,l,-,bl,u,m,{ E 1 }

# =,=,=,=,+,m,y,r,-,bl,u,m,{ E 1, P 1 }

+,m,I,=,-,d,A,G,-,d,},t,J,J { J 8.00000 } 0.28274085738293

# k=1, 1 Neighbor(s) at distance: 0.282741

# -,v,@,r,+,v,A,l,-,p,},t,{ J 1 }

# k=2, 6 Neighbor(s) at distance: 0.311890

# =,=,=,=,=,=,=,=,+,k,},t,{ J 1 }

# =,=,=,=,=,=,=,=,+,p,},t,{ J 1 }

# =,=,=,=,=,=,=,=,+,xr,},t,{ J 1 }

# =,=,=,=,=,=,=,=,+,l,},t,{ J 1 }

# =,=,=,=,=,=,=,=,+,h,},t,{ J 1 }

# =,=,=,=,=,=,=,=,+,fr,},t,{ J 1 }

# k=3, 1 Neighbor(s) at distance: 0.325529

# +,m,K,=,-,d,@,=,-,pr,a,t,{ J 1 }


Idea (cont.)

  • From this output, you can

    • read the knn members and their distances

    • repeat classification for smaller k’s and other distance weightings

    • without calculating the knn sets and their distances again (a parsing sketch follows below)

  • Hence

    • classification is potentially much faster

    • exhaustively trying all combinations of k and distance weighting becomes feasible
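
A sketch of that parsing step; the regular expressions are my reading of the output format shown above, not Timbl's documented grammar, and the instance lines themselves (which carry the gold class) are skipped for brevity:

import re

NN_HEADER = re.compile(r"# k=(\d+), \d+ Neighbor\(s\) at distance: ([\d.]+)")
NN_MEMBER = re.compile(r"# .*\{ (.+) \}")

def read_knn_sets(lines):
    """Yield one neighbor list [(distance, class distribution), ...] per
    test instance from Timbl output produced with +vn+di+db."""
    neighbors = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        header = NN_HEADER.match(line)
        if header:                      # a new distance rank starts
            neighbors.append((float(header.group(2)), {}))
        elif line.startswith("#"):      # a neighbor within the current rank
            member = NN_MEMBER.match(line)
            if member:
                distribution = neighbors[-1][1]
                for part in member.group(1).split(","):
                    cls, count = part.split()
                    distribution[cls] = distribution.get(cls, 0) + float(count)
        elif neighbors:                 # the next instance line: flush
            yield neighbors
            neighbors = []
    if neighbors:
        yield neighbors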


Implementation

  • Python scripts

    • exhaustkd

    • cvexhaustkd

  • Requirements:

    • Python 2.1 or later

    • Expenv libraries

  • Input

    • List of classes

    • Timbl output file(s) produced with +vn+di+db and a high k

    • Some option settings

  • Output

    • Tables with performance measures (accuracy, recall, precision, F-score) for all combinations of k and d (computed as sketched below)
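
As a sketch, such measures fall out of a confusion matrix like this; the per-class precision/recall/F correspond to the pC/rC/fC patterns below, while how exhaustkd averages the combined P/R/F is not stated in this talk:

def scores(confusion, classes, beta=1.0):
    """confusion[gold][predicted] -> count; returns overall accuracy
    and per-class (precision, recall, F-beta)."""
    total = sum(sum(row.values()) for row in confusion.values())
    accuracy = sum(confusion[c].get(c, 0) for c in classes) / float(total)
    per_class = {}
    for c in classes:
        tp = confusion.get(c, {}).get(c, 0)
        predicted = sum(confusion.get(g, {}).get(c, 0) for g in classes)
        gold = sum(confusion.get(c, {}).values())
        p = tp / float(predicted) if predicted else 0.0
        r = tp / float(gold) if gold else 0.0
        f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
        per_class[c] = (p, r, f)
    return accuracy, per_class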


$ exhaustkd -h

usage:

exhaustkd [options] CLASSES FILE

exhaustkd [options] CLASSES <FILE

purpose:

Timbl's +vn+di+db option causes it to add the nearest neighbors and

their distances to its output. This output can then be passed on to

exhaustkd to perform an exhaustive classification over a range of k's

and distance weighting metrics. It will always try Z, ID, and IL.

Optionally, various settings of ED can be tried.

exhaustkd tabulates the performance for all settings. Which evaluation

metrics are reported depends on the PATTERN of the -o option.

A PATTERN is a comma-separated list of one or more of the

following symbols:

A = accuracy

K = kappa

P = combined precision

R = combined recall

F = combined f-score

pC = precision on class C

rC = recall on class C

fC = F-score on class C

args:

CLASSES classes as a comma separated list

FILE classifier output file


options:

--version show program's version number and exit

-h, --help show this help message and exit

-aFLOAT1,FLOAT2,...,FLOATn, --alphas=FLOAT1,FLOAT2,...,FLOATn

values to try as the alpha constant in the exponential

decay metric (default is none)

-bFLOAT, --beta=FLOAT

beta in F score calculation (default is 1.0)

-dSTRING, --delimiter=STRING

column delimiter (default is ' ')

-f, --full-output output all available evaluation metrics for every

setting

-kINT, --max-k=INT the maximum number of nearest neighbors to try (default

is 1)

-nINT, --n-best=INT the number of settings reported in the n-best list

(default is 10)

-oPATTERN, --output=PATTERN

output pattern (default is 'A,P,R,F')

-rINT, --random-seed=INT

seed for random generator (default is current system

time)

-t{once|continue|random}, --tie-resolution={once|continue|random}

tie resolution by increasing k once (default), by

increasing k continuously, or by choosing randomly

-%, --percent output in percentages


examples:

exhaustkd -k5 -a1.0,2.0 X,Y,Z output_file

perform an exhaustive classification into classes X,Y,Z

with k from 1 to 5, and distance metrics Z, ID, IL, ED1.0 and ED2.0

exhaustkd -% -opX,rX,fX X,Y,Z <output_file

output precision, recall, and F-score percentages on class X

exhaustkd -k10 -tcontinue -oA X,Y,Z output_file

output accuracy when using continuous tie resolution up to k=10


Example: diminutive

  • Commands

    • Timbl -f dimin.train -t dimin.test -o out -k5 +vn+db+di

    • exhaustkd -d, -k5 -a1,5 -oA -% P,T,J,E,K out



Accuracy (%)

k   Z      ID     IL     ED1.0  ED5.0
1   96.74  96.74  96.74  96.74  96.74
2   97.37  96.74  96.63  96.42  96.63
3   96.42  96.42  97.05  96.53  96.95
4   95.05  95.68  96.42  95.68  96.32
5   95.37  95.26  96.42  95.47  95.79

Rank  Score  k  d
1     97.37  1  Z
2     97.05  2  IL
3     96.95  2  ED5.0
4     96.74  1  ID
5     96.74  0  Z
6     96.74  0  IL
7     96.74  0  ID


Example: Prosit breaks

  • Commands:

    • For each of the 10 folds, run Timbl with -k31 -o out0??

    • cvexhaustkd -a1,5 -k30 -% -oA,P,R,F,pB,rB,fB B,- out0?? >exhaustive-report



Accuracy (%), mean and standard deviation over the 10 folds

k    Z           ID          IL          ED1.0       ED5.0
1    94.83 0.35  94.83 0.35  94.83 0.35  94.83 0.35  94.83 0.35
2    96.00 0.44  94.83 0.35  94.83 0.35  94.83 0.35  94.83 0.35
3    96.00 0.44  96.00 0.44  95.89 0.48  96.01 0.44  95.99 0.43
4    96.28 0.41  96.02 0.47  95.99 0.47  96.02 0.46  96.02 0.47
5    96.28 0.41  96.29 0.41  96.22 0.49  96.28 0.41  96.28 0.40
6    96.37 0.44  96.28 0.41  96.24 0.41  96.27 0.42  96.27 0.41
7    96.37 0.44  96.38 0.44  96.36 0.46  96.37 0.44  96.36 0.45
8    96.41 0.45  96.42 0.47  96.37 0.44  96.41 0.47  96.41 0.47
9    96.41 0.45  96.42 0.44  96.42 0.46  96.41 0.44  96.43 0.43
10   96.40 0.42  96.45 0.43  96.45 0.45  96.45 0.43  96.45 0.43
11   96.40 0.42  96.41 0.41  96.46 0.46  96.40 0.42  96.43 0.43
12   96.43 0.41  96.44 0.44  96.46 0.47  96.43 0.45  96.45 0.44
13   96.43 0.41  96.43 0.41  96.47 0.44  96.42 0.41  96.44 0.42
14   96.44 0.47  96.46 0.45  96.47 0.46  96.46 0.45  96.48 0.46
15   96.44 0.47  96.44 0.47  96.47 0.47  96.44 0.47  96.46 0.46
16   96.41 0.49  96.45 0.47  96.45 0.46  96.45 0.47  96.47 0.47
17   96.41 0.49  96.41 0.48  96.46 0.47  96.40 0.49  96.45 0.47
18   96.40 0.48  96.45 0.47  96.47 0.47  96.44 0.47  96.46 0.47
19   96.40 0.48  96.40 0.47  96.47 0.48  96.39 0.48  96.44 0.45
20   96.37 0.49  96.41 0.46  96.48 0.48  96.41 0.47  96.43 0.46
21   96.37 0.49  96.37 0.48  96.48 0.49  96.37 0.49  96.40 0.47
22   96.38 0.49  96.39 0.46  96.48 0.47  96.40 0.47  96.41 0.46
23   96.38 0.49  96.38 0.49  96.48 0.49  96.38 0.49  96.41 0.48
24   96.39 0.51  96.41 0.48  96.48 0.50  96.41 0.48  96.43 0.47
25   96.39 0.51  96.39 0.50  96.47 0.49  96.39 0.51  96.42 0.49
26   96.40 0.48  96.43 0.50  96.49 0.49  96.43 0.49  96.45 0.49
27   96.39 0.48  96.40 0.48  96.48 0.49  96.40 0.48  96.43 0.49
28   96.40 0.46  96.41 0.48  96.48 0.49  96.41 0.48  96.41 0.47
29   96.40 0.46  96.40 0.46  96.48 0.49  96.40 0.46  96.43 0.47
30   96.39 0.49  96.43 0.46  96.48 0.49  96.42 0.47  96.43 0.47


Discussion: Time

  • Normal time:

    • An average 10-fold CV Timbl experiment on the Prosit breaks requires about 30 hours (min. 20 to max. 50 hours)

    • Here we have k x distance weighting = 30 x 5 = 150 CV experiments

    • Thus, this would normally require about 150 x 30 = 4500 hours ≈ 188 days

  • Time with exhaustkd:

    • A single 10-fold CV experiment with dumping of the knn sets requires about 30 hours

    • Running cvexhaustkd takes about 3 minutes (!)

    • Therefore, we have reduced the required time by a factor of 150

    • (BTW, the “seconds taken” reported by Timbl are a little off :-)


Discussion: Memory & Space

  • Memory:

    • Exhaustkd works locally, reading one instance and its nn's at a time from file, classifying it, and adding the result to a confusion matrix

    • Consumes very little memory (2-5MB)

  • Disk Space:

    • Writing the knn sets to the output file can take a lot of space

    • E.g. upsampled Prosit break data with k=31 requires about 1.8GB


Discussion: Limitations

  • Obviously, the k of exhaustkd can never be larger than the real k (= the k of the original Timbl experiment)

  • Actually, the k of exhaustkd must be one less than the real k

  • Reason: tie resolution

    • In case of a tie, k is increased by one

  • Also, output of exhaustkd may differ slightly from Timbl’s output

  • Reason: tie resolution

    • If a tie is still unresolved after increasing k, Timbl resorts to a random choice

    • The exact random behaviour cannot be reproduced by exhaustkd (see the sketch below)
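
A sketch of how such tie resolution can work on top of the cached neighbor sets; the three strategies mirror exhaustkd's -t option, and weight() is the sketch from the introduction slide:

import random
from collections import defaultdict

def vote(neighbors, k, method="Z", alpha=1.0):
    """Class -> accumulated (weighted) vote over the first k ranks."""
    ranks = neighbors[:k]
    d1, dk = ranks[0][0], ranks[-1][0]
    votes = defaultdict(float)
    for dist, distribution in ranks:
        w = weight(dist, method, alpha, d1, dk)
        for cls, count in distribution.items():
            votes[cls] += w * count
    return votes

def classify_with_ties(neighbors, k, method="Z", alpha=1.0, strategy="once"):
    """Resolve tied votes by widening k once ('once'), repeatedly
    ('continue'), or by an immediate random choice ('random')."""
    widenings = 0
    while True:
        votes = vote(neighbors, k, method, alpha)
        best = max(votes.values())
        tied = [c for c, v in votes.items() if v == best]
        if len(tied) == 1:
            return tied[0]
        may_widen = k < len(neighbors) and (
            strategy == "continue" or (strategy == "once" and widenings == 0))
        if not may_widen:
            return random.choice(tied)  # last resort, as Timbl does
        k += 1
        widenings += 1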


Discussion: Limitations (cont.)

  • Timbl output (accuracy and #ties):

k   Z            ID          IL          ED1.0
1   96.74 4/5    96.74 4/5   96.63 3/5   96.74 4/5
2   97.37 10/12  96.74 0/1   96.74 4/5   96.42 0/1
3   96.42 6/8    96.42 0/1   96.74 0/1   96.53 1/1
4   95.05 3/5    95.68 0/0   96.74 1/1   95.68 0/0
5   95.37 5/5    95.26 0/0   96.42 0/0   95.47 0/0

  • Exhaustkd output (average accuracy and SD):

k   Z           ID          IL          ED1.0
1   96.78 0.05  96.74 0.00  96.74 0.00  96.74 0.00
2   97.25 0.09  96.74 0.00  96.63 0.00  96.42 0.00
3   96.46 0.05  96.42 0.00  97.05 0.00  96.53 0.00
4   95.05 0.00  95.68 0.00  96.42 0.00  95.68 0.00
5   95.37 0.00  95.26 0.00  96.42 0.00  95.47 0.00


Discussion: Limitations (cont.)

  • Limit on number of nn’s:

    • Currently, if the number of nn’s exceeds 500, Timbl will only write the first 500

    • Because you don’t want to dump the whole instance base (!)

    • However, it would be nice if this were an option


Discussion: Plans

  • Exhaustkd can be faster:

    • Code not really profiled yet

    • Code can be (partly) compiled to C

  • Exhaustkd can be combined with methods that optimise the feature weighting (-w) and feature metric (-m) options

    • Paramsearch/Iterative Deepening

  • Experiment with exhaustkd’s 3 options for tie resolution:

    • Random

    • Increase k once

    • Increase k continuously until the tie is resolved

  • Wild plans:

    • Can exhaustkd be a part of Timbl?

