Subsequent String Kernel by Han Cheng Liang

Subsequent String Kernel by Han Cheng Liang. Advanced Machine Learning, Prof. Tony Jebara. The Subsequent String Kernel (SSK) function measures how similar two strings are by how many subsequences they share. The subsequences do not have to be contiguous.

Presentation Transcript


  1. Subsequent String Kernel, by Han Cheng Liang • Advanced Machine Learning • Prof. Tony Jebara

  2. Subsequent String Kernel (SSK) • SSK function • Measures how similar two strings are by how many subsequences they share. • The subsequences do not have to be contiguous: • s = science is organized knowledge • t = wisdom is organized life • subsequence = “sie”

  3. SSK Continued • s = science is organized knowledge • t = wisdom is organized life • subsequence = “sie” = u • But the farther apart the first and last characters of the subsequence are, the more it is penalized. A decay factor is defined to apply this penalty.
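The penalty can be made concrete with a small sketch. It assumes the usual convention from Lodhi et al. (2002), where the decay factor is a number between 0 and 1 and each occurrence of u is weighted by the decay factor raised to the span it covers in the string. The name `subsequence_weight` and the parameter `lam` are illustrative, not from the slides.

```python
from itertools import combinations

def subsequence_weight(s, u, lam):
    """Sum lam**span over every way of embedding u in s as a subsequence.

    span = (index of last matched character) - (index of first) + 1,
    so a contiguous match is penalized least. Assumes u is non-empty.
    """
    if not u:
        raise ValueError("u must be non-empty")
    total = 0.0
    # Try every increasing tuple of len(u) positions in s.
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            span = idx[-1] - idx[0] + 1
            total += lam ** span
    return total

lam = 0.5  # assumed decay factor, chosen only for illustration
print(subsequence_weight("cat", "ca", lam))  # contiguous match, span 2
print(subsequence_weight("cat", "ct", lam))  # one gap, span 3
```

The same call works on the slide's strings, for example `subsequence_weight("science is organized knowledge", "sie", lam)`. Enumerating every index tuple is only practical for short subsequences.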

  4. SSK Formally Defined • Alphabet set: • Set of all subsequences with length n, from alphabet set : • string • string • u is a subsequence of s, if there’s a set of indices i, such that • length of the subsequence u:
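For reference, the bullets above correspond to the following notation. This is a sketch in the standard notation of Lodhi et al. (2002), on which the SSK is based; the symbols are assumed rather than taken from the slide, whose formulas are not in the transcript.

```latex
\begin{align*}
&\Sigma \ \text{is a finite alphabet}, \qquad
 \Sigma^n \ \text{is the set of all strings of length } n \text{ over } \Sigma \\
&s = s_1 s_2 \cdots s_{|s|}, \qquad t = t_1 t_2 \cdots t_{|t|} \\
&u = s[\mathbf{i}] \ \text{for indices } \mathbf{i} = (i_1, \dots, i_{|u|}),\
 1 \le i_1 < \cdots < i_{|u|} \le |s|,\ \text{meaning } u_j = s_{i_j} \\
&l(\mathbf{i}) = i_{|u|} - i_1 + 1 \quad \text{(length of the subsequence in } s\text{)}
\end{align*}
```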

  5. SSK Formally Defined • run time:
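A direct way to read the definition is as an inner product of explicit feature vectors: one coordinate per length-n string u, holding the sum of decay weights over all occurrences of u. The sketch below assumes that standard form (Lodhi et al., 2002); `feature_map`, `ssk_naive`, and `lam` are illustrative names. It enumerates every index tuple, so the cost grows combinatorially with n, which is what motivates the dynamic programming approach.

```python
from collections import defaultdict
from itertools import combinations

def feature_map(s, n, lam):
    """Map each length-n subsequence u of s to the sum of lam**span
    over all index tuples that spell u. Assumes n >= 1."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def ssk_naive(s, t, n, lam):
    """Unnormalized kernel: inner product of the two feature maps."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)

lam = 0.5  # assumed decay factor
# "cat" and "car" share only the subsequence "ca" (span 2 in both),
# so the kernel is lam**2 * lam**2 = lam**4.
print(ssk_naive("cat", "car", 2, lam))
```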

  6. Improve Performance Using DP • Can improve the runtime to • Define: • 3 Basic Cases: • Recursive Step:

  7. DP Continued • Define: • Two Cases:
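The recursion on these two slides can be sketched as follows. Since the slide formulas are not in the transcript, the code assumes the standard dynamic program from Lodhi et al. (2002): an auxiliary kernel K' with base cases, and a second auxiliary K'' with two cases (last characters match or not), giving a runtime proportional to n·|s|·|t|. The names `ssk_dp`, `ssk_normalized`, and `lam` are illustrative.

```python
import math

def ssk_dp(s, t, n, lam):
    """Unnormalized subsequence kernel of order n via dynamic programming."""
    S, T = len(s), len(t)
    if n < 1 or min(S, T) < n:
        return 0.0  # base case: strings too short to contain any u
    # kp[i][a][b] holds K'_i(s[:a], t[:b]); K'_0 is 1 everywhere.
    kp = [[[0.0] * (T + 1) for _ in range(S + 1)] for _ in range(n)]
    for a in range(S + 1):
        for b in range(T + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, S + 1):      # K'_i is 0 when a prefix is shorter than i
            kpp = 0.0                  # K''_i(s[:a], t[:b]), updated as b grows
            for b in range(i, T + 1):
                if s[a - 1] == t[b - 1]:   # case 1: last characters match
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:                      # case 2: no match, just decay
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Final step: close each subsequence on a matching pair of characters.
    total = 0.0
    for a in range(1, S + 1):
        for b in range(1, T + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[n - 1][a - 1][b - 1]
    return total

def ssk_normalized(s, t, n, lam):
    """Kernel scaled so that a string compared with itself scores 1."""
    denom = math.sqrt(ssk_dp(s, s, n, lam) * ssk_dp(t, t, n, lam))
    return ssk_dp(s, t, n, lam) / denom if denom > 0 else 0.0

print(ssk_dp("cat", "car", 2, 0.5))
print(ssk_normalized("cat", "car", 2, 0.5))
```

The code only ever tests symbols for equality, so it runs unchanged on lists of tokens as well as on strings.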

  8. Experiments Performed • SSK vs. NGK (n-gram kernel) vs. WK (word kernel) • Varying sequence lengths and decay factors • Combining kernels of different lengths: has potential • Combining SSK and NGK: no good • Combining SSK with different decay factors: no good

  9. Subsequent Word Kernel • Instead of individual letters and the space character, the alphabet set contains whole English words. • The alphabet set is much larger, but with the DP technique the runtime stays the same.
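A minimal sketch of the word-level variant, under two assumptions that are not from the slides: documents are lowercased and split on whitespace, and the kernel is the standard Lodhi et al. (2002) dynamic program applied to word sequences. Because the recursion only compares symbols for equality, alphabet size never enters the loops. `seq_kernel` and `word_kernel` are illustrative names.

```python
def seq_kernel(x, y, n, lam):
    """Order-n subsequence kernel over two sequences of comparable symbols."""
    X, Y = len(x), len(y)
    if n < 1 or min(X, Y) < n:
        return 0.0
    prev = [[1.0] * (Y + 1) for _ in range(X + 1)]   # K'_0 is 1 everywhere
    for i in range(1, n):
        cur = [[0.0] * (Y + 1) for _ in range(X + 1)]
        for a in range(i, X + 1):
            kpp = 0.0                                 # running K''_i along y
            for b in range(i, Y + 1):
                if x[a - 1] == y[b - 1]:
                    kpp = lam * (kpp + lam * prev[a - 1][b - 1])
                else:
                    kpp = lam * kpp
                cur[a][b] = lam * cur[a - 1][b] + kpp
        prev = cur
    return sum(lam * lam * prev[a - 1][b - 1]
               for a in range(1, X + 1)
               for b in range(1, Y + 1)
               if x[a - 1] == y[b - 1])

def word_kernel(doc1, doc2, n, lam):
    """Treat each whitespace-separated word as one symbol of the alphabet."""
    return seq_kernel(doc1.lower().split(), doc2.lower().split(), n, lam)

s = "science is organized knowledge"
t = "wisdom is organized life"
# The only shared two-word subsequence is ("is", "organized"),
# contiguous in both, so the value is lam**2 * lam**2.
print(word_kernel(s, t, 2, 0.5))
```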

  10. Experiments • Data: Yahoo! News. News articles from AP, Reuters, etc. • Four categories: business, politics, entertainment, sports • 60 articles in each category; 50 used for training and 10 for testing • Comparable performance to SSK (n=3): accuracy for both was around 90%. Outperformed SSK in some categories and underperformed in others. • Combining SSK and the word subsequence kernel did not yield improvements.

  11. Kernel Estimation • SSK: used the most frequent contiguous subsequences found in some data set • Me: used the most frequently used English words • Results: • top 2000: bad • top 3000: bad • top 4000: 80% accuracy

  12. Future Work • Kernels with different lengths • Upper/lower bounds for the kernel estimation.
