Fast Methods for Kernel-based Text Analysis

Taku Kudo (工藤 拓)
Yuji Matsumoto (松本 裕治)

NAIST (Nara Institute of Science and Technology)

41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan


Background

  • Kernel methods (e.g., SVMs) have become popular

  • They can incorporate prior knowledge independently of the machine learning algorithm, by supplying a task-dependent kernel (a generalized dot product)

  • They achieve high accuracy


Problem

  • Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because of their inefficiency at test time

  • Some kernel-based parsers process a sentence in 2-3 seconds


Goals

  • Build fast but still accurate kernel-based text analyzers

  • Make them applicable to a wider range of NL applications


Outline

  • Polynomial Kernel of degree d

  • Fast Methods for the Polynomial Kernel

    • PKI

    • PKE

  • Experiments

  • Conclusions and Future Work




Kernel Methods

Training data: {(X_1, y_1), ..., (X_L, y_L)}

Decision function: f(X) = Σ_{j=1}^{L} α_j K(X_j, X)

No need to represent examples as explicit feature vectors

Complexity of testing is O(L · |X|)
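To make the testing cost concrete, here is a minimal sketch of kernel-based classification over set-valued examples, using for concreteness the polynomial kernel introduced later in the talk. The names (poly_kernel, classify, svs, alphas) are illustrative, not from the paper, and the bias term is omitted.

```python
def poly_kernel(x, x_prime, d=3):
    """Polynomial kernel of degree d over sets: K(X, X') = (|X ∩ X'| + 1)^d."""
    return (len(x & x_prime) + 1) ** d

def classify(x, svs, alphas, d=3):
    """f(X) = sum_j alpha_j * K(X_j, X).
    Every one of the L support vectors is touched, and each kernel
    evaluation costs O(|X|), hence O(L * |X|) per test example."""
    return sum(a * poly_kernel(sv, x, d) for a, sv in zip(alphas, svs))
```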


Kernels for Sets (1/3)

Focus on the special case where examples are represented as sets.

Instances in NLP are usually represented as sets (e.g., bag-of-words).

Feature set: F = {f_1, f_2, ..., f_N}

Training data: each example X_j is a subset of F

Kernels for Sets (2/3)

  • Simple definition: K(X, X') = |X ∩ X'| (the number of shared features)

  • Combinations (subsets) of features can also be counted:

    • 2nd order: shared feature pairs, |P_2(X ∩ X')|

    • 3rd order: shared feature triples, |P_3(X ∩ X')|
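Counting shared size-r combinations needs no explicit enumeration: it is a binomial coefficient of the intersection size. A small sketch (the function name is illustrative):

```python
from math import comb

def shared_combinations(x, x_prime, r):
    """|P_r(X ∩ X')| = C(|X ∩ X'|, r): the number of size-r feature
    combinations contained in both X and X'."""
    return comb(len(x & x_prime), r)

# X = {a,b,c} and X' = {a,c,e} share {a, c}, i.e., exactly one pair:
print(shared_combinations({'a', 'b', 'c'}, {'a', 'c', 'e'}, 2))  # 1
```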


Kernels for Sets (3/3)

Task: dependent (+1) or independent (-1)?

  I    ate   a    cake
  PRP  VBD   DT   NN
       head       modifier

Basic features:

  X = { Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN }

After heuristic selection (combined features added):

  X = { Head-word: ate, Head-POS: VBD, Modifier-word: cake, Modifier-POS: NN,
        Head-POS/Modifier-POS: VBD/NN, Head-word/Modifier-POS: ate/NN }

Subsets (combinations) of basic features are critical to improving overall accuracy in many NL tasks.

Previous approaches select combinations heuristically.


Polynomial Kernel of degree d

Implicit form:

  K(X, X') = (|X ∩ X'| + 1)^d

Explicit form:

  K(X, X') = Σ_{r=0}^{d} c_d(r) · |P_r(X ∩ X')|

where P_r(X) is the set of all subsets of X with exactly r elements, and c_d(r) is the prior weight given to the subsets of size r (the subset weight).


Example (Cubic Kernel, d = 3)

Implicit form:

  K(X, X') = (|X ∩ X'| + 1)^3

Explicit form:

  K(X, X') = 1 + 7·|P_1(X ∩ X')| + 12·|P_2(X ∩ X')| + 6·|P_3(X ∩ X')|

Subsets of size up to 3 are used as new features.
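The equivalence of the two forms is easy to check numerically. The subset weights c3(0)=1, c3(1)=7, c3(2)=12, c3(3)=6 are those given later in the deck, and |P_r(X ∩ X')| depends only on k = |X ∩ X'|:

```python
from math import comb

# For any intersection size k, |P_r(X ∩ X')| = C(k, r), so the explicit
# form of the cubic kernel must reproduce (k + 1)^3 exactly.
for k in range(20):
    implicit = (k + 1) ** 3
    explicit = 1 + 7 * comb(k, 1) + 12 * comb(k, 2) + 6 * comb(k, 3)
    assert implicit == explicit
```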


Outline

  • Polynomial Kernel of degree d

  • Fast Methods for the Polynomial Kernel

    • PKI

    • PKE

  • Experiments

  • Conclusions and Future Work


Toy Example

Feature set: F = {a, b, c, d, e}

Training examples (support vectors, L = 3):

  j | α_j  | X_j
  1 |  1   | {a, b, c}
  2 |  0.5 | {a, b, d}
  3 | -2   | {b, c, d}

Kernel: K(X, X') = (|X ∩ X'| + 1)^3

Test example: X = {a, c, e}


PKB (Baseline)

K(X, X') = (|X ∩ X'| + 1)^3

  j | α_j  | X_j        | K(X_j, X)
  1 |  1   | {a, b, c}  | (2+1)^3 = 27
  2 |  0.5 | {a, b, d}  | (1+1)^3 = 8
  3 | -2   | {b, c, d}  | (1+1)^3 = 8

Test example: X = {a, c, e}

f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15

Complexity is always O(L · |X|)
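Running the toy numbers through the baseline is a one-liner; this self-contained sketch reproduces f(X) = 15:

```python
svs    = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'b', 'c', 'd'}]
alphas = [1.0, 0.5, -2.0]
x = {'a', 'c', 'e'}

# PKB evaluates the cubic kernel against every support vector:
f = sum(a * (len(sv & x) + 1) ** 3 for a, sv in zip(alphas, svs))
print(f)   # 1*27 + 0.5*8 - 2*8 = 15.0
```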


PKI (Inverted Representation)

K(X, X') = (|X ∩ X'| + 1)^3

Inverted index (feature → SV ids):

  a → {1, 2}
  b → {1, 2, 3}
  c → {1, 3}
  d → {2, 3}

B = average size of the inverted lists

Support vectors:

  j | α_j  | X_j
  1 |  1   | {a, b, c}
  2 |  0.5 | {a, b, d}
  3 | -2   | {b, c, d}

Test example: X = {a, c, e}

f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15

Average complexity is O(B · |X| + L)

Efficient if the feature space is sparse

Suitable for many NL tasks
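A sketch of PKI on the same toy data (helper names are illustrative): the inverted index restricts kernel evaluations to support vectors sharing at least one feature with X; SVs with an empty intersection still contribute α_j · (0 + 1)^d = α_j each, handled in a single pass over the remaining ids.

```python
from collections import defaultdict

def build_inverted_index(svs):
    """Feature -> ids of the support vectors that contain it."""
    index = defaultdict(list)
    for j, sv in enumerate(svs):
        for feat in sv:
            index[feat].append(j)
    return index

def classify_pki(x, svs, alphas, index, d=3):
    """Average complexity O(B*|X| + L), B = average inverted-list size."""
    overlap = defaultdict(int)                 # j -> |X_j ∩ X|
    for feat in x:
        for j in index.get(feat, ()):
            overlap[j] += 1
    score = sum(alphas[j] * (k + 1) ** d for j, k in overlap.items())
    # SVs never touched have |X_j ∩ X| = 0 and contribute alpha_j each.
    score += sum(alphas[j] for j in range(len(svs)) if j not in overlap)
    return score

svs    = [{'a', 'b', 'c'}, {'a', 'b', 'd'}, {'b', 'c', 'd'}]
alphas = [1.0, 0.5, -2.0]
index = build_inverted_index(svs)
print(classify_pki({'a', 'c', 'e'}, svs, alphas, index))   # 15.0
```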


PKE (Expanded Representation)

  • Convert the classifier into linear form by calculating the vector w in advance

  • φ_d projects X into its subset space: f(X) = w · φ_d(X)


PKE (Expanded Representation)

K(X, X') = (|X ∩ X'| + 1)^3, with subset weights c3(0)=1, c3(1)=7, c3(2)=12, c3(3)=6

Support vectors:

  j | α_j  | X_j
  1 |  1   | {a, b, c}
  2 |  0.5 | {a, b, d}
  3 | -2   | {b, c, d}

W (expansion table), with w(s) = c3(|s|) · Σ_j α_j · I(s ⊆ X_j):

  s        | c3(|s|) | w(s)
  φ        |  1      |  -0.5
  {a}      |  7      |  10.5
  {b}      |  7      |  -3.5
  {c}      |  7      |  -7
  {d}      |  7      | -10.5
  {a,b}    | 12      |  18
  {a,c}    | 12      |  12
  {a,d}    | 12      |   6
  {b,c}    | 12      | -12
  {b,d}    | 12      | -18
  {c,d}    | 12      | -24
  {a,b,c}  |  6      |   6
  {a,b,d}  |  6      |   3
  {a,c,d}  |  6      |   0
  {b,c,d}  |  6      | -12

Example: w({b,d}) = 12 · (0.5 - 2) = -18

Test example: X = {a, c, e}, whose subsets are {φ, {a}, {c}, {e}, {a,c}, {a,e}, {c,e}, {a,c,e}}

f(X) = w(φ) + w({a}) + w({c}) + w({a,c}) = -0.5 + 10.5 - 7 + 12 = 15

Complexity is O(|X|^d), independent of the number of SVs (L)

Efficient if the number of SVs is large


PKE in Practice

  • It is hard to calculate the expansion table exactly

  • Instead, use an approximated expansion table

  • Subsets with small |w| can be removed, since |w| represents the subset's contribution to the final classification

  • Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation


Subset Mining Problem

Transaction database:

  id | set
  1  | {a, c, d}
  2  | {a, b, c}
  3  | {a, b, d}
  4  | {b, c, e}

Extract all subsets that occur in no fewer than σ sets of the transaction database.

Results (σ = 2):

  {a}:3  {b}:3  {c}:3  {d}:2
  {a,b}:2  {b,c}:2  {a,c}:2  {a,d}:2

With no size constraint the problem is NP-hard, but efficient algorithms have been proposed (e.g., Apriori, PrefixSpan).
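An Apriori-style sketch of the mining step (illustrative, not the paper's implementation): a subset can reach the support threshold only if all of its proper subsets do, so candidates are grown level by level and pruned early.

```python
from itertools import combinations

def frequent_subsets(transactions, sigma):
    """All subsets occurring in at least `sigma` transactions."""
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {s: sum(1 for t in transactions if s <= t) for s in level}
        kept = {s: n for s, n in counts.items() if n >= sigma}
        frequent.update(kept)
        # Join frequent size-k sets into size-(k+1) candidates and prune
        # any candidate with an infrequent sub-subset (Apriori property).
        cands = {a | b for a, b in combinations(list(kept), 2)
                 if len(a | b) == len(a) + 1}
        level = [s for s in cands if all(s - {i} in kept for i in s)]
    return frequent

db = [{'a','c','d'}, {'a','b','c'}, {'a','b','d'}, {'b','c','e'}]
for s, n in frequent_subsets(db, 2).items():
    print(sorted(s), n)   # {a}:3 {b}:3 {c}:3 {d}:2 {a,b}:2 {a,c}:2 ...
```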


Feature Selection as Mining

Support vectors:

  j | α_j  | X_j
  1 |  1   | {a, b, c}
  2 |  0.5 | {a, b, d}
  3 | -2   | {b, c, d}

Exhaustive generation and testing of all subsets → impractical!

Instead, generate the table directly with subset mining: keep only the subsets s with |w(s)| ≥ σ.

Full expansion table w(s):

  φ: -0.5,  {a}: 10.5,  {b}: -3.5,  {c}: -7,  {d}: -10.5,
  {a,b}: 18,  {a,c}: 12,  {a,d}: 6,  {b,c}: -12,  {b,d}: -18,  {c,d}: -24,
  {a,b,c}: 6,  {a,b,d}: 3,  {a,c,d}: 0,  {b,c,d}: -12

Approximated table (σ = 10):

  {a}: 10.5,  {d}: -10.5,  {a,b}: 18,  {a,c}: 12,
  {b,c}: -12,  {b,d}: -18,  {c,d}: -24,  {b,c,d}: -12

  • The approximated table can be built efficiently

  • σ controls the rate of approximation
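The effect of σ is just thresholding on |w(s)|. The sketch below filters the exhaustive table for clarity, whereas the method in the paper builds the pruned table directly with subset mining, so the full table is never materialized:

```python
def approximate_table(w, sigma=10.0):
    """Keep only subsets whose |w(s)| reaches the threshold sigma;
    larger sigma gives a smaller and faster but coarser classifier."""
    return {s: v for s, v in w.items() if abs(v) >= sigma}

# With the toy expansion table from the PKE sketch and sigma = 10, the
# survivors are {a}, {d}, {a,b}, {a,c}, {b,c}, {b,d}, {c,d}, {b,c,d}.
w_approx = approximate_table(w, sigma=10.0)
```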


Outline

  • Polynomial Kernel of degree d

  • Fast Methods for the Polynomial Kernel

    • PKI

    • PKE

  • Experiments

  • Conclusions and Future Work


Experimental Settings

  • Three NL tasks

    • English Base-NP Chunking (EBC)

    • Japanese Word Segmentation (JWS)

    • Japanese Dependency Parsing (JDP)

  • Kernel Settings

    • Quadratic kernel is applied to EBC

    • Cubic kernel is applied to JWS and JDP


Results (English Base-NP Chunking)


Results (Japanese Word Segmentation)


Results (Japanese Dependency Parsing)


Results

  • 2- to 12-fold speed-up with PKI

  • 30- to 300-fold speed-up with PKE

  • Accuracy is preserved when an appropriate σ is set


Comparison with Related Work

  • XQK [Isozaki et al. 02]

    • Same concept as PKE

    • Designed only for the Quadratic Kernel

    • Exhaustively creates the expansion table

  • PKE

    • Designed for general Polynomial Kernels

    • Uses subset mining algorithms to create the expansion table


Conclusions

  • Proposed two fast methods for the polynomial kernel of degree d:

    • PKI (inverted)

    • PKE (expanded)

  • 2- to 12-fold speed-up with PKI, 30- to 300-fold speed-up with PKE

  • Accuracy is preserved


Future Work

  • Examine the effectiveness on general machine learning datasets

  • Apply PKE to other convolution kernels

    • Tree Kernel [Collins 00]

      • Dot product between trees

      • The feature space is all subtrees

      • Apply a sub-tree mining algorithm [Zaki 02]


English Base-NP Chunking

Extract non-overlapping noun phrases from text:

[NP He ] reckons [NP the current account deficit ] will narrow to [NP only # 1.8 billion ] in [NP September ] .

  • BIO representation (cast as a tagging task)

    • B: beginning of a chunk

    • I: inside (non-initial part of) a chunk

    • O: outside any chunk

  • Pairwise method for the 3-class problem

  • Training: WSJ sections 15-18, test: WSJ section 20 (the standard split)


Japanese Word Segmentation

Sentence: 太郎は花子に本を読ませた ("Taro made Hanako read a book")

A candidate word boundary lies between each pair of adjacent characters:

  太 ↑ 郎 ↑ は ↑ 花 ↑ 子 ↑ に ↑ 本 ↑ を ↑ 読 ↑ ま ↑ せ ↑ た

y = +1 if there is a word boundary between characters c_i and c_{i+1}, y = -1 otherwise

  • Characters are distinguished by their relative position to the candidate boundary

  • The character types of Japanese are also used as features

  • Training: KUC 01-08, Test: KUC 09


Japanese Dependency Parsing

  私は    ケーキを    食べる
  I-top   cake-acc.   eat
  ("I eat a cake")

  • Identify the correct dependency relations between two bunsetsu (base phrases)

  • Linguistic features related to the modifier and the head (word, POS, POS-subcategory, inflections, punctuation, etc.)

  • Binary classification (+1: dependent, -1: independent)

  • Cascaded Chunking Model [Kudo et al. 02]

  • Training: KUC 01-08, Test: KUC 09


Kernel Methods (1/2)

Suppose a learning task with training examples {(X_1, y_1), ..., (X_L, y_L)}:

  f(X) = Σ_{j=1}^{L} α_j φ(X_j) · φ(X) = Σ_{j=1}^{L} α_j K(X_j, X)

  • X: the example to be classified

  • X_j: the training examples

  • α_j: the weights of the examples

  • φ: a function mapping examples to another vectorial space


PKE (Expanded Representation)

If we calculate in advance, for all subsets s with |s| ≤ d,

  w(s) = c_d(|s|) · Σ_{j=1}^{L} α_j · I(s ⊆ X_j)

(where I(·) is the indicator function), then classification reduces to summing the precomputed weights of the subsets of X:

  f(X) = Σ_{s ⊆ X, |s| ≤ d} w(s)


TRIE Representation

Approximated expansion table:

  s       | w(s)
  {a}     |  10.5
  {d}     | -10.5
  {a,b}   |  18
  {a,c}   |  12
  {b,c}   | -12
  {b,d}   | -18
  {c,d}   | -24
  {b,c,d} | -12

The table is stored in a TRIE: from the root, each edge is labeled with a feature, each path spells a subset in a fixed feature order, and the weight w(s) is stored at the node where the path for s ends (e.g., the path b → c → d holds w({b,c,d}) = -12).

  • Redundant structures are compressed (common prefixes are shared)

  • Classification can be done by simply traversing the TRIE
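A minimal nested-dict sketch of the TRIE (illustrative, not the paper's data structure): each path of sorted features leads to a node, the special key None stores that subset's weight, and classification sums every weight reachable with the sorted features of X.

```python
def build_trie(table):
    """`table` maps sorted feature tuples (as produced by the PKE
    sketch) to their weights w(s)."""
    root = {}
    for subset, weight in table.items():
        node = root
        for feat in subset:               # features assumed pre-sorted
            node = node.setdefault(feat, {})
        node[None] = weight               # weight lives at the end node
    return root

def trie_score(node, feats, start=0):
    """Sum w(s) over all stored subsets s ⊆ X by walking the TRIE."""
    total = node.get(None, 0.0)
    for i in range(start, len(feats)):
        child = node.get(feats[i])
        if child is not None:
            total += trie_score(child, feats, i + 1)
    return total

# Usage with the pruned toy table from the previous sketches:
# trie = build_trie(w_approx)
# print(trie_score(trie, sorted({'a', 'c', 'e'})))
```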

