
Dynamic Edit Distance Table under a General Weighted Cost Function

Dynamic Edit Distance Table under a General Weighted Cost Function. Heikki Hyyrö (University of Tampere, Finland), Kazuyuki Narisawa (Kyushu University, Japan), and Shunsuke Inenaga (Kyushu University, Japan).


Presentation Transcript


  1. Dynamic Edit Distance Table under a General Weighted Cost Function • Heikki Hyyrö (University of Tampere, Finland) • Kazuyuki Narisawa (Kyushu University, Japan) • Shunsuke Inenaga (Kyushu University, Japan)

  2. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  3. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  4. Edit Distance • minimum total cost d for transforming string x[1:n] to y[1:m] • Example: x = prague, y = passage • Ins. = Del. = Sub. = 1 • Edit Distance = Sub. + Ins. + Ins. + Del. = 1+1+1+1 = 4

  5. Dynamic Programming
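The table on this slide was an image and is not in the transcript. As a minimal sketch, the standard dynamic programming recurrence for a weighted edit distance table can be written as follows. The function name `edit_distance_table` and the parameter names `ins`, `dele`, `sub` are illustrative choices, not taken from the slides.

```python
def edit_distance_table(x, y, ins=1, dele=1, sub=1):
    """Return the (n+1) x (m+1) table D, where D[i][j] is the minimum
    cost of transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: delete every character of x[:i].
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    # First row: insert every character of y[:j].
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal step: free on a match, otherwise a substitution.
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


D = edit_distance_table("prague", "passage")
print(D[-1][-1])  # 4, matching the example on slide 4
```

Filling the whole table takes O(nm) time, which is the cost the later slides attribute to the naïve method.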

  6. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  7. Right Increment/Decrement • Right I/D of Edit Distance • input : D of strings A and B • output : D’ of strings A and B’ (B = B’a: decrement, or Ba = B’: increment) • easy to compute • insert or delete right column of D → D’ : O(m)
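A minimal sketch of why the right-hand case is easy: appending a character to B only adds one column, and that column depends only on the previous last column. The helper names (`edit_table`, `right_increment`, `right_decrement`) are hypothetical, and `edit_table` is included only as a from-scratch reference for checking.

```python
def edit_table(A, B, ins=1, dele=1, sub=1):
    """Reference table: D[i][j] = cost of transforming A[:i] into B[:j]."""
    D = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            diag = D[i - 1][j - 1] + (0 if A[i - 1] == B[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def right_increment(D, A, a, ins=1, dele=1, sub=1):
    """B' = Ba: append one column to D in place."""
    for i, row in enumerate(D):
        if i == 0:
            row.append(row[-1] + ins)
            continue
        # Row i-1 already holds its new entry, so [-2] is its old last column.
        diag = D[i - 1][-2] + (0 if A[i - 1] == a else sub)
        row.append(min(diag, D[i - 1][-1] + dele, row[-1] + ins))


def right_decrement(D):
    """B = B'a: drop the last column of D in place."""
    for row in D:
        row.pop()


D = edit_table("prague", "passag")
right_increment(D, "prague", "e")
print(D == edit_table("prague", "passage"))  # True
```

No existing entry is rewritten, which is the property that fails for the left-hand case on the next slide.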

  8. Left Increment/Decrement • Left I/D of ED • input : D of strings A and B • output : D’ of strings A and B’ (B = aB’: decrement, or aB = B’: increment) • difficult to compute • values on the left side affect the values on the right side

  9. Contribution • Propose an efficient algorithm for the Left I/D problem with any nonnegative integer costs • Left I/D problem • input : ED table D of strings A and B • output : ED table D’ of strings A and B’ • B = aB’ (decrement) • B’ = aB (increment) • costs of operations are nonnegative integers

  10. Applications • Cyclic String Comparison [Landau et al. 1998] • Computing Approximate Periods [Schmidt 1998] • Edit distance for sliding window • String Kernel based on Edit Distance • a kernel is a mapping to a high-dimensional feature space • used in Support Vector Machines (classifier)

  11. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  12. Related Work • naïve method • compute D’ from scratch • O(nm) time • Kim & Park algorithm [2004] • Each operation has cost 1 • Compute difference representation DR of table D • Using Change Table Ch • O(n+m) time

  13. Definition • Left Increment/Decrement Problem • input : DR table of strings A and B • output : DR’ table of strings A and B’ • B = aB’ (decrement) • B’ = aB (increment) • Each cost (Ins., Del., Sub.) is a nonnegative integer • Kim & Park algorithm : each cost is 1

  14. Difference Representation • under minus upper • right minus left
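A minimal sketch of the difference representation, assuming the two figure labels mean the following: the component written `.U` on later slides stores an entry minus its upper neighbour, and `.L` stores an entry minus its left neighbour. That mapping of names to labels is an assumption, as are the function names.

```python
def difference_representation(D):
    """Split table D into two difference tables.

    U[i][j] = D[i][j] - D[i-1][j]   (lower entry minus upper entry)
    L[i][j] = D[i][j] - D[i][j-1]   (right entry minus left entry)
    Entries with no upper or left neighbour are None.
    """
    n, m = len(D) - 1, len(D[0]) - 1
    U = [[None] * (m + 1) for _ in range(n + 1)]
    L = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                U[i][j] = D[i][j] - D[i - 1][j]
            if j > 0:
                L[i][j] = D[i][j] - D[i][j - 1]
    return U, L


def corner_value(U, L):
    """Recover D[n][m] (with D[0][0] = 0) by summing differences along
    the top row and then down the last column: n + m additions."""
    n, m = len(U) - 1, len(U[0]) - 1
    return sum(L[0][j] for j in range(1, m + 1)) + sum(
        U[i][m] for i in range(1, n + 1)
    )


# Unit-cost table for A = "ab", B = "b".
D = [[0, 1], [1, 1], [2, 1]]
U, L = difference_representation(D)
print(corner_value(U, L))  # 1
```

Walking a path of n + m differences to read one distance is consistent with the O(n+m) lookup cost listed for the DR-based method in the closing comparison.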

  15. DR’ – DR • We need not update all cells

  16. Change Table • Ch[i, j] = D’[i, j] – D[i, j] • cost = 1 • values in Ch : –1, 0, 1 • Ch is separated into three areas
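A minimal sketch of the change table for a left increment B’ = aB, obtained here by recomputing D’ from scratch (so it illustrates the definition, not the Kim & Park algorithm). One assumption is made about indexing: the column of D for prefix B[:j] is compared with the column of D’ for prefix aB[:j], that is, column j + 1 of D’. The function names are hypothetical.

```python
def edit_table(A, B, ins=1, dele=1, sub=1):
    """Reference table: D[i][j] = cost of transforming A[:i] into B[:j]."""
    D = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            diag = D[i - 1][j - 1] + (0 if A[i - 1] == B[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def change_table(A, B, a, ins=1, dele=1, sub=1):
    """Ch[i][j] = D'[i][j+1] - D[i][j] for the left increment B' = aB."""
    D = edit_table(A, B, ins, dele, sub)
    D2 = edit_table(A, a + B, ins, dele, sub)
    return [
        [D2[i][j + 1] - D[i][j] for j in range(len(B) + 1)]
        for i in range(len(A) + 1)
    ]


Ch = change_table("prague", "assage", "p")
for row in Ch:
    print(" ".join(f"{v:2d}" for v in row))
```

With unit costs every printed value is –1, 0, or 1, and the top row is all 1 because prepending a character adds exactly one insertion there.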

  17. Affected Entries • entries where DR’[i, j] ≠ DR[i, j] • they must be updated • affected entries are along the borders of three areas in Ch
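The link between affected entries and borders in Ch follows from the definitions: the change in a vertical difference at (i, j) equals Ch[i, j] – Ch[i–1, j], and likewise horizontally, so a difference changes exactly where Ch changes value between neighbours. A minimal sketch that checks this by brute force, assuming the same column alignment as in the definition of Ch for B’ = aB (column j of D against column j + 1 of D’); all function names are hypothetical.

```python
def edit_table(A, B, ins=1, dele=1, sub=1):
    """Reference table: D[i][j] = cost of transforming A[:i] into B[:j]."""
    D = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            diag = D[i - 1][j - 1] + (0 if A[i - 1] == B[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def affected_entries(A, B, a, ins=1, dele=1, sub=1):
    """Positions (i, j) of D whose vertical or horizontal difference
    changes after the left increment B' = aB."""
    D = edit_table(A, B, ins, dele, sub)
    D2 = edit_table(A, a + B, ins, dele, sub)
    affected = set()
    for i in range(len(A) + 1):
        for j in range(len(B) + 1):
            # Vertical difference (entry minus upper neighbour).
            if i > 0 and D2[i][j + 1] - D2[i - 1][j + 1] != D[i][j] - D[i - 1][j]:
                affected.add((i, j))
            # Horizontal difference (entry minus left neighbour).
            if j > 0 and D2[i][j + 1] - D2[i][j] != D[i][j] - D[i][j - 1]:
                affected.add((i, j))
    return affected


A, B, a = "prague", "assage", "p"
hits = affected_entries(A, B, a)
print(len(hits), "of", (len(A) + 1) * (len(B) + 1), "entries affected")
```

This brute-force version still costs O(nm); it only shows which entries an efficient algorithm would need to touch.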

  18. Sketch of Kim & Park Algorithm • Update affected entries • scan borders in Ch, computing Ch and DR’ • Time Complexity : O(n+m)

  19. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  20. General Costs • Ch can be separated into more than three areas • the number of areas depends on the costs • the values are not limited to –1, 0, 1 • Kim & Park algorithm • is specialized to the three-area case • cannot be applied with general costs • Example: Ins. = 2, Del. = 2, Sub. = 1
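A minimal sketch of the effect of general costs on Ch, using the example costs from this slide. It collects the distinct values of Ch for a left increment B’ = aB by brute-force recomputation. The column alignment (column j of D against column j + 1 of D’) and the function names are assumptions made for illustration.

```python
def edit_table(A, B, ins=1, dele=1, sub=1):
    """Reference table: D[i][j] = cost of transforming A[:i] into B[:j]."""
    D = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, len(B) + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            diag = D[i - 1][j - 1] + (0 if A[i - 1] == B[j - 1] else sub)
            D[i][j] = min(diag, D[i - 1][j] + dele, D[i][j - 1] + ins)
    return D


def change_values(A, B, a, ins, dele, sub):
    """Distinct values of Ch[i][j] = D'[i][j+1] - D[i][j] for B' = aB."""
    D = edit_table(A, B, ins, dele, sub)
    D2 = edit_table(A, a + B, ins, dele, sub)
    return {
        D2[i][j + 1] - D[i][j]
        for i in range(len(A) + 1)
        for j in range(len(B) + 1)
    }


print(sorted(change_values("prague", "assage", "p", ins=1, dele=1, sub=1)))
print(sorted(change_values("prague", "assage", "p", ins=2, dele=2, sub=1)))
```

Prepending one character can raise an entry by at most the insertion cost and lower it by at most the deletion cost, so the values lie between –Del. and +Ins.; with the costs above that range is –2 to 2 rather than –1 to 1.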

  21. Our Algorithm • Update only affected entries • without Ch • compute only DR’.U and DR’.L • Time complexity : O(min{c(n+m), nm}) • c is the maximum cost • (figure panels: DR’.L – DR.L, DR’.U – DR.U, D’ – D)

  22. Affected Entry • DR’[i, j] ≠ DR[i, j] • Kim & Park Algorithm • computes DR’ and Ch to find the affected entries • Our Algorithm • finds the affected entries using only the DR table • uses the following lemma

  23. Comparison of pseudocode

  24. Comparison of behaviors • our algorithm • Kim & Park algorithm

  25. Contents • Edit Distance • Left Increment/Decrement Edit Distance Problem • Related Work • Our Algorithm • Experiments • Summary

  26. Experiments • strings A[1:m] and B[1:m] • Total time of computing representations of edit distance between A and B[j:m] for j = m, m–1, …, 1 • left incremental computation • Machine Specifications • CentOS Linux • Xeon 3.0 GHz • 16 GB memory
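A minimal sketch of the measured workload for the naïve baseline only: recompute from scratch for every suffix B[j:m], j = m, m–1, …, 1, and time the total. It is not the authors' benchmark code. The string length, alphabet, and seed are arbitrary illustrative values, and for brevity only the distance is kept instead of the full table.

```python
import random
import time


def edit_distance(A, B, ins=1, dele=1, sub=1):
    """Two-row dynamic programming; returns only the final distance."""
    prev = [j * ins for j in range(len(B) + 1)]
    for i in range(1, len(A) + 1):
        cur = [prev[0] + dele]
        for j in range(1, len(B) + 1):
            diag = prev[j - 1] + (0 if A[i - 1] == B[j - 1] else sub)
            cur.append(min(diag, prev[j] + dele, cur[j - 1] + ins))
        prev = cur
    return prev[-1]


def naive_left_incremental(A, B, **costs):
    """Distances between A and each suffix of B, shortest suffix first,
    each recomputed from scratch."""
    return [edit_distance(A, B[j:], **costs) for j in range(len(B) - 1, -1, -1)]


rng = random.Random(0)  # fixed seed so runs are repeatable
m = 60                  # illustrative length, far smaller than in the slides
A = "".join(rng.choice("abcd") for _ in range(m))
B = "".join(rng.choice("abcd") for _ in range(m))

start = time.perf_counter()
dists = naive_left_incremental(A, B, ins=137, dele=116, sub=242)
elapsed = time.perf_counter() - start
print(f"{len(dists)} left increments, {elapsed:.4f} s in total")
```

Each increment costs O(nm) here, so the total grows cubically with the string length, which is the behaviour an incremental algorithm is meant to avoid.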

  27. Experiment 1 • Time comparison with naïve algorithm • costs • chosen randomly • Insertion = 137, Deletion = 116, Substitution = 242 • Random data • alphabet size 2, 3, …, 52 • string length 100, 200, …, 5000

  28. Result 1

  29. Result 1

  30. Experiment 2 • Time comparison with Kim & Park algorithm • costs • Insertion = Deletion = Substitution = 1 • Random data • alphabet size 2, 3, …, 52 • string length 100, 200, …, 5000

  31. Result 2

  32. Result 2

  33. Experiment 3 • Time comparison with naïve algorithm • Corpus • English (Reuters news) • costs • Insertion = 137, Deletion = 116, Substitution = 242 • string length : 1000, 2000, 3000, 4000, 5000 • Protein data (Canterbury corpus: E.coli) • costs proposed in [Kurtz 1996] • string length : 1000, 2000, 3000, 4000, 5000

  34. Result 3 • English News • Protein Data

  35. Summary • Algorithm for Left I/D problem • nonnegative integer costs • O(min{c(n+m), nm}) • c is the maximum cost • experimentally fast

  36. Related Work • naïve method • compute D’ from scratch • O(nm) time • Kim & Park algorithm [2004] • Each operation has cost 1 • Compute difference representation DR → DR’ • Using Change Table Ch • O(n+m) time • Comparison (naïve vs. Kim & Park) • initial table : D vs. DR, Ch, both O(nm) • update : D → D’ in O(nm) vs. DR, Ch → DR’, Ch in O(n+m) • reading the Edit Distance : O(1) vs. O(n+m)
