exhaustkd

Searching exhaustively for an optimal setting of Timbl’s -k and -d parameters

Overview
  • Introduction:
    • Timbl’s k & distance weighting
  • Idea:
    • Read knn sets and distances from Timbl’s output
  • Implementation:
    • exhaust.py & cvexhaust.py
  • Examples:
    • diminutive & Prosit data
  • Discussion:
    • limitations & improvements
Introduction

Knn classification without distance weighting

[Diagram: a query instance “?” among X and Y training instances, with rings marking the k=1, k=2, and k=3 neighbor sets]

Introduction (cont.)

Knn classification with distance weighting

[Diagram: the same query instance “?” among X and Y training instances, now with the nearer neighbors inside the k=1, k=2, and k=3 rings carrying larger weights]

Introduction (cont.)
  • Distance weighting methods:
    • Z (no weighting)
    • ID (inverse distance)
    • IL (inverse linear)
    • EDa (exponential decay with alpha a); see the sketch below
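For concreteness, here is a minimal Python sketch of these four weighting schemes, using their textbook definitions (a small epsilon guards the inverse-distance division; inverse linear follows Dudani, with d1 and dk the distances of the nearest and k-th nearest neighbor). TiMBL's own implementation may differ in such details.

import math

def weight(d, method, d1=None, dk=None, eps=1e-10):
    # Distance weight of a neighbor at distance d under the given method.
    if method == "Z":                 # Z: no weighting
        return 1.0
    if method == "ID":                # ID: inverse distance
        return 1.0 / (d + eps)
    if method == "IL":                # IL: inverse linear (Dudani)
        return 1.0 if dk == d1 else (dk - d) / (dk - d1)
    if method.startswith("ED"):       # EDa: exponential decay, e.g. "ED1.0"
        return math.exp(-float(method[2:]) * d)
    raise ValueError("unknown weighting method: %s" % method)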
Idea
  • Knn classification is actually a two-step process:
    • Determine the nearest neighbor sets (= those instances with similar features, optionally using feature weighting and MVDM)
    • Determine the majority class within all nearest neighbors at maximally the k-th distance, optionally using distance weighting
  • We can do step 2 without repeating step 1! (See the voting sketch below.)
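The following is a minimal sketch of step 2, assuming the neighbor sets have already been read from Timbl's output: one (distance, class distribution) pair per distance rank. In Timbl, k counts distance ranks, so a single rank may hold several neighbors, as the dump below shows. It reuses the weight() sketch above.

from collections import defaultdict

def weighted_scores(nn_sets, k, method):
    # Per-class vote mass over the k nearest distance ranks.
    # nn_sets is a list of (distance, {class: count}) pairs, one per
    # distance rank, sorted by distance.
    d1, dk = nn_sets[0][0], nn_sets[k - 1][0]
    scores = defaultdict(float)
    for dist, classes in nn_sets[:k]:
        w = weight(dist, method, d1=d1, dk=dk)
        for cls, count in classes.items():
            scores[cls] += w * count
    return scores

def vote(nn_sets, k, method):
    # Majority class under the chosen distance weighting.
    scores = weighted_scores(nn_sets, k, method)
    return max(scores, key=scores.get)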
Idea (cont.)
  • The +v option allows you to write the knn sets and their distances to the output file
    • +vn := write nearest neighbors
    • +vdi := write distance (of the instance to be classified, and of its knn sets)
    • +vdb := write class distribution (of the instance to be classified, and of its knn sets)
  • Example:
    • Timbl -f dimin.train -t dimin.test -k3 +vn+di+db

=,=,=,=,+,k,u,=,-,bl,u,m,E,E { E 4.00000, P 3.00000 } 0.0000000000000
# k=1, 1 Neighbor(s) at distance: 0.00000
# =,=,=,=,+,k,u,=,-,bl,u,m,{ P 1 }
# k=2, 1 Neighbor(s) at distance: 0.0594251
# =,=,=,=,+,m,K,=,-,bl,u,m,{ E 1 }
# k=3, 5 Neighbor(s) at distance: 0.103409
# =,=,=,=,+,m,O,z,-,bl,u,m,{ E 1, P 1 }
# =,=,=,=,+,st,K,l,-,bl,u,m,{ E 1 }
# =,=,=,=,+,m,y,r,-,bl,u,m,{ E 1, P 1 }
+,m,I,=,-,d,A,G,-,d,},t,J,J { J 8.00000 } 0.28274085738293
# k=1, 1 Neighbor(s) at distance: 0.282741
# -,v,@,r,+,v,A,l,-,p,},t,{ J 1 }
# k=2, 6 Neighbor(s) at distance: 0.311890
# =,=,=,=,=,=,=,=,+,k,},t,{ J 1 }
# =,=,=,=,=,=,=,=,+,p,},t,{ J 1 }
# =,=,=,=,=,=,=,=,+,xr,},t,{ J 1 }
# =,=,=,=,=,=,=,=,+,l,},t,{ J 1 }
# =,=,=,=,=,=,=,=,+,h,},t,{ J 1 }
# =,=,=,=,=,=,=,=,+,fr,},t,{ J 1 }
# k=3, 1 Neighbor(s) at distance: 0.325529
# +,m,K,=,-,d,@,=,-,pr,a,t,{ J 1 }
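As an illustration, a rough parser for this dump could look as follows. This is a hypothetical sketch that assumes the format is exactly as shown (one header line per test instance, “# k=...” rank lines, then one “# ...” line per neighbor); exhaust.py's actual parsing is likely more robust.

import re
from collections import defaultdict

RANK_RE = re.compile(r'# k=(\d+), \d+ Neighbor\(s\) at distance: ([\d.eE+-]+)')
DIST_RE = re.compile(r'\{ (.+) \}')

def read_instances(lines):
    # Yield (true_class, nn_sets) per test instance; nn_sets is a list of
    # (distance, {class: count}) pairs, one per distance rank.
    nn_sets, true_class = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = RANK_RE.match(line)
        if m:                              # a new distance rank starts
            nn_sets.append((float(m.group(2)), defaultdict(int)))
        elif line.startswith('#'):         # a neighbor at the current rank
            for part in DIST_RE.search(line).group(1).split(','):
                cls, n = part.split()
                nn_sets[-1][1][cls] += int(n)
        else:                              # header line of the next instance
            if nn_sets:
                yield true_class, nn_sets
            nn_sets = []
            fields = line.split('{')[0].strip().split(',')
            true_class = fields[-2]        # ...,true_class,predicted_class
    if nn_sets:
        yield true_class, nn_sets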

Idea (cont.)
  • From this output, you can
    • read the knn members and their distances
    • repeat classification for smaller k’s and other distance weightings
    • without calculating the knn sets and their distances again
  • Hence
    • classification is potentially much faster
    • exhaustively trying all combinations of k and distance weighting becomes feasible (see the sweep sketch below)
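Putting the sketches together, a hypothetical sweep over all (k, weighting) combinations needs only one pass over the dump. Names such as sweep() are illustrative, not exhaustkd's actual API.

def sweep(lines, max_k, methods=("Z", "ID", "IL", "ED1.0")):
    # Accuracy for every (k, method) combination, reusing read_instances()
    # and vote() from the sketches above; the dump is parsed only once.
    instances = list(read_instances(lines))
    results = {}
    for method in methods:
        for k in range(1, max_k + 1):
            correct = sum(vote(nn, min(k, len(nn)), method) == true_cls
                          for true_cls, nn in instances)
            results[(k, method)] = correct / float(len(instances))
    return results

For example, results = sweep(open("out"), max_k=5) would tabulate 5 x 4 = 20 settings from a single Timbl run.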
Implementation
  • Python scripts
    • exhaustkd
    • cvexhaustkd
  • Requirements:
    • Python 2.1 or later
    • Expenv libraries
  • Input
    • List of classes
    • Timbl output file(s) produced with +vn+di+db and a high k
    • Some option settings
  • Output
    • Tables with performance measures (accuracy, recall, precision, F-score) for all combinations of k and d (a metrics sketch follows below)
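For reference, the reported measures can be computed from a per-class confusion matrix along the following lines. This is a generic sketch (conf[true][predicted] holds counts), not exhaustkd's actual code.

def metrics(conf, cls, beta=1.0):
    # conf[true][predicted] -> count; returns overall accuracy plus
    # precision, recall and F-score for the given class.
    total = sum(sum(row.values()) for row in conf.values())
    hits = sum(conf[c].get(c, 0) for c in conf)
    tp = conf.get(cls, {}).get(cls, 0)
    predicted = sum(conf[t].get(cls, 0) for t in conf)   # column sum
    actual = sum(conf.get(cls, {}).values())             # row sum
    precision = tp / float(predicted) if predicted else 0.0
    recall = tp / float(actual) if actual else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if precision + recall else 0.0)
    return hits / float(total), precision, recall, f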

$ exhaustkd -h
usage:
exhaustkd [options] CLASSES FILE
exhaustkd [options] CLASSES <FILE

purpose:
Timbl's +vn+di+db option causes it to add the nearest neighbors and
their distances to its output. This output can then be passed on to
exhaustkd to perform an exhaustive classification over a range of k's
and distance weighting metrics. It will always try Z, ID, and IL.
Optionally, various settings of ED can be tried.

exhaustkd tabulates the performance for all settings. Which evaluation
metrics are reported depends on the PATTERN of the -o option.
A PATTERN is a comma-separated list of one or more of the
following symbols:

A  = accuracy
K  = kappa
P  = combined precision
R  = combined recall
F  = combined f-score
pC = precision on class C
rC = recall on class C
fC = F-score on class C

args:
CLASSES  classes as a comma-separated list
FILE     classifier output file


options:
--version                 show program's version number and exit
-h, --help                show this help message and exit
-aFLOAT1,FLOAT2,...,FLOATn, --alphas=FLOAT1,FLOAT2,...,FLOATn
                          values to try as the alpha constant in the
                          exponential decay metric (default is none)
-bFLOAT, --beta=FLOAT     beta in F-score calculation (default is 1.0)
-dSTRING, --delimiter=STRING
                          column delimiter (default is ' ')
-f, --full-output         output all available evaluation metrics for
                          every setting
-kINT, --max-k=INT        the maximum number of nearest neighbors to try
                          (default is 1)
-nINT, --n-best=INT       the number of settings reported in the n-best
                          list (default is 10)
-oPATTERN, --output=PATTERN
                          output pattern (default is 'A,P,R,F')
-rINT, --random-seed=INT  seed for random generator (default is current
                          system time)
-t{once|continue|random}, --tie-resolution={once|continue|random}
                          tie resolution by increasing k once (default), by
                          increasing k continuously, or by choosing randomly
-%, --percent             output in percentages


examples:
exhaustkd -k5 -a1.0,2.0 X,Y,Z output_file
    perform an exhaustive classification into classes X,Y,Z
    with k from 1 to 5, and distance metrics Z, ID, IL, ED1.0 and ED2.0
exhaustkd -% -opX,rX,fX X,Y,Z <output_file
    output precision, recall, and F-score percentages on class X
exhaustkd -k10 -tcontinue -oA X,Y,Z output_file
    output accuracy when using continuous tie resolution up to k=10

Example: diminutive
  • Commands
    • Timbl -f dimin.train -t dimin.test -o out -k5 +vn+db+di
    • exhaustkd -d, -k5 -a1,5 -oA -% P,T,J,E,K out

Accuracy (%)

k   Z       ID      IL      ED1.0   ED5.0
1   96.74   96.74   96.74   96.74   96.74
2   97.37   96.74   96.63   96.42   96.63
3   96.42   96.42   97.05   96.53   96.95
4   95.05   95.68   96.42   95.68   96.32
5   95.37   95.26   96.42   95.47   95.79

Rank:   Score:   k:   d:
1       97.37    1    Z
2       97.05    2    IL
3       96.95    2    ED5.0
4       96.74    1    ID
5       96.74    0    Z
6       96.74    0    IL
7       96.74    0    ID

Example: Prosit breaks
  • Commands:
    • For each of the 10 folds, a Timbl run with -k31 -o out0??
    • cvexhaustkd -a1,5 -k30 -% -oA,P,R,F,pB,rB,fB B,- out0?? >exhaustive-report

Accuracy (%), mean and SD over the 10 folds

k    Z            ID           IL           ED1.0        ED5.0
1    94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35
2    96.00 0.44   94.83 0.35   94.83 0.35   94.83 0.35   94.83 0.35
3    96.00 0.44   96.00 0.44   95.89 0.48   96.01 0.44   95.99 0.43
4    96.28 0.41   96.02 0.47   95.99 0.47   96.02 0.46   96.02 0.47
5    96.28 0.41   96.29 0.41   96.22 0.49   96.28 0.41   96.28 0.40
6    96.37 0.44   96.28 0.41   96.24 0.41   96.27 0.42   96.27 0.41
7    96.37 0.44   96.38 0.44   96.36 0.46   96.37 0.44   96.36 0.45
8    96.41 0.45   96.42 0.47   96.37 0.44   96.41 0.47   96.41 0.47
9    96.41 0.45   96.42 0.44   96.42 0.46   96.41 0.44   96.43 0.43
10   96.40 0.42   96.45 0.43   96.45 0.45   96.45 0.43   96.45 0.43
11   96.40 0.42   96.41 0.41   96.46 0.46   96.40 0.42   96.43 0.43
12   96.43 0.41   96.44 0.44   96.46 0.47   96.43 0.45   96.45 0.44
13   96.43 0.41   96.43 0.41   96.47 0.44   96.42 0.41   96.44 0.42
14   96.44 0.47   96.46 0.45   96.47 0.46   96.46 0.45   96.48 0.46
15   96.44 0.47   96.44 0.47   96.47 0.47   96.44 0.47   96.46 0.46
16   96.41 0.49   96.45 0.47   96.45 0.46   96.45 0.47   96.47 0.47
17   96.41 0.49   96.41 0.48   96.46 0.47   96.40 0.49   96.45 0.47
18   96.40 0.48   96.45 0.47   96.47 0.47   96.44 0.47   96.46 0.47
19   96.40 0.48   96.40 0.47   96.47 0.48   96.39 0.48   96.44 0.45
20   96.37 0.49   96.41 0.46   96.48 0.48   96.41 0.47   96.43 0.46
21   96.37 0.49   96.37 0.48   96.48 0.49   96.37 0.49   96.40 0.47
22   96.38 0.49   96.39 0.46   96.48 0.47   96.40 0.47   96.41 0.46
23   96.38 0.49   96.38 0.49   96.48 0.49   96.38 0.49   96.41 0.48
24   96.39 0.51   96.41 0.48   96.48 0.50   96.41 0.48   96.43 0.47
25   96.39 0.51   96.39 0.50   96.47 0.49   96.39 0.51   96.42 0.49
26   96.40 0.48   96.43 0.50   96.49 0.49   96.43 0.49   96.45 0.49
27   96.39 0.48   96.40 0.48   96.48 0.49   96.40 0.48   96.43 0.49
28   96.40 0.46   96.41 0.48   96.48 0.49   96.41 0.48   96.41 0.47
29   96.40 0.46   96.40 0.46   96.48 0.49   96.40 0.46   96.43 0.47
30   96.39 0.49   96.43 0.46   96.48 0.49   96.42 0.47   96.43 0.47

Discussion: Time
  • Normal time:
    • An average 10-fold CV Timbl experiment on the Prosit breaks requires about 30 hours (min. 20 to max. 50 hours)
    • Here we have k x distance weighting = 30 x 5 = 150 CV experiments
    • Thus, this would normally require about 150 x 30 = 4500 hours, i.e. roughly 188 days
  • Time with exhaustkd:
    • A single 10-fold CV experiment with dumping of the knn sets requires about 30 hours
    • Running cvexhaustkd takes about 3 minutes (!)
    • Therefore, we have reduced the required time by a factor of 150
    • (BTW the “seconds taken” reported by Timbl are a little off :-)
Discussion: Memory & Space
  • Memory:
    • Exhaustkd works locally, reading an instance and its nn’s from file, classifying, and adding the result to a confusion matrix
    • Consumes very little memory (2-5MB)
  • Disk Space:
    • Writing the knn's to the output can take a lot of space
    • E.g. the upsampled Prosit break data with k=31 require about 1.8GB
Discussion: limitations
  • Obviously, the k of exhaustkd can never be larger than the real k (= the k of the original Timbl experiment)
  • Actually, the k of exhaustkd must be one less than the real k
  • Reason: tie resolution
    • In case of a tie, k is increased by one
  • Also, output of exhaustkd may differ slightly from Timbl’s output
  • Reason: tie resolution
    • If a tie is still unresolved after increasing k, Timbl resorts to a random choice
    • The exact random behaviour cannot be reproduced by exhaustkd (see the tie-resolution sketch below)
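To make the three strategies concrete, here is a hypothetical sketch of exhaustkd's -t{once|continue|random} behaviour, built on the weighted_scores() sketch above; it is an illustration, not exhaustkd's actual code.

import random

def classify(nn_sets, k, method, tie="once"):
    # Resolve ties by widening k once (default), continuously, or randomly.
    scores = weighted_scores(nn_sets, k, method)
    while True:
        best = max(scores.values())
        winners = [c for c, s in scores.items() if s == best]
        if len(winners) == 1:
            return winners[0]
        if tie == "random" or k >= len(nn_sets):
            return random.choice(winners)   # unresolvable: choose randomly
        k += 1                              # widen the neighborhood
        scores = weighted_scores(nn_sets, k, method)
        if tie == "once":
            tie = "random"                  # only one widening step allowed

This also shows why exhaustkd's k must stay one below the real k (resolving a tie at k may require the neighbors at rank k+1), and why Timbl's random choices cannot be reproduced exactly.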
Discussion: limitations (cont.)
  • Timbl output (accuracy and #ties):

k   Z            ID           IL           ED1.0
1   96.74 4/5    96.74 4/5    96.63 3/5    96.74 4/5
2   97.37 10/12  96.74 0/1    96.74 4/5    96.42 0/1
3   96.42 6/8    96.42 0/1    96.74 0/1    96.53 1/1
4   95.05 3/5    95.68 0/0    96.74 1/1    95.68 0/0
5   95.37 5/5    95.26 0/0    96.42 0/0    95.47 0/0

  • Exhaustkd output (average accuracy and SD):

k   Z            ID           IL           ED1.0
1   96.78 0.05   96.74 0.00   96.74 0.00   96.74 0.00
2   97.25 0.09   96.74 0.00   96.63 0.00   96.42 0.00
3   96.46 0.05   96.42 0.00   97.05 0.00   96.53 0.00
4   95.05 0.00   95.68 0.00   96.42 0.00   95.68 0.00
5   95.37 0.00   95.26 0.00   96.42 0.00   95.47 0.00

Discussion: Limitations (cont.)
  • Limit on number of nn’s:
    • Currently, if the number of nn’s exceeds 500, Timbl will only write the first 500
    • Because you don’t want to dump the whole instance base (!)
    • However, it would be nice if this were an option
Discussion: Plans
  • Exhaustkd can be faster:
    • Code not really profiled yet
    • Code can be (partly) compiled to C
  • Exhaustkd can be combined with methods that optimise feature weighting (-w) and feature metric (-m) options
    • Paramsearch/Iterative Deepening
  • Experiment with exhaustkd’s 3 options for tie resolution:
    • Random
    • Increase k once
    • Increase k continuously until the tie is resolved
  • Wild plans:
    • Can exhaustkd be a part of Timbl?