Similarity Metric for Strings and Graphs Dr. David Dailey david.dailey@sru.edu Dr. Beverly Gocal beverly.gocal@sru.edu Dr. Deborah Whitfield deborah.whitfield@sru.edu
Outline • Introduction • Graph distance • String Distance • Definitions • Examples • Implementation • Theoretical Results • String Space Examples
Problem Framework • Distance • may be defined for any structure • Overlap of the substructures of two structures • Strings • Graphs • Algebraic structures • Semi-groups • Trees • Web site and web page similarity
Background • Past 15 years • Over 20 papers on graph similarity • Several more on string similarity • Semi-Group • Let T = (S, A) together with the concatenation operation, where A consists of the set of axioms • ∀ x, y ∈ S, xy ∈ S • ∀ x, y, z ∈ S, x(yz) = (xy)z
Graph and String • Graph: Let T = (S, A) together with a relation ~ where A consists of the set of axioms • ∀ x, y ∈ S, x ~ y ⇒ y ~ x • ∀ x ∈ S, ¬(x ~ x) • String: Let T = (S, A) together with an associative operation (expressed by concatenation). • Then let Sⁿ be defined recursively by • S¹ = S and • Sⁿ = S × Sⁿ⁻¹ and • S* be defined as the infinite union of ordered tuples: S¹ ∪ S² ∪ … ∪ Sⁿ ∪ …
Approaches • Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other • Largest shared substructure • Smallest superstructure • All of these approaches are relative
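For comparison with the substructure-based approach, here is a minimal sketch of the Levenshtein distance using the standard two-row dynamic program. The function name `levenshtein` is illustrative and does not come from the slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("aabc", "abcd"))  # 2: drop one 'a', append 'd'
```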
Exhaustive Substructure Vector Space (ESVS) • Enumerate all substructures within T and U, giving the sets T* and U* • Take the union of those two sets: Z = T* ∪ U* • This gives a |Z|-dimensional vector space • For each z ∈ Z, let z(T) be the number of occurrences of structure z as a substructure of T • Calculate the Minkowski distance d(T,U) between the two count vectors
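The ESVS steps can be sketched for strings as follows. This sketch assumes that the substructures of a string are its non-empty contiguous substrings and that the Minkowski order defaults to p = 1 (sum of absolute differences). The names `substring_counts` and `esvs_distance` and the parameter `p` are illustrative, not taken from the authors' implementation.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def esvs_distance(t: str, u: str, p: float = 1) -> float:
    """Minkowski distance of order p between the substring-count vectors."""
    ct, cu = substring_counts(t), substring_counts(u)
    z = set(ct) | set(cu)  # Z = T* ∪ U*, one dimension per substring
    # A Counter returns 0 for a substring that does not occur.
    total = sum(abs(ct[k] - cu[k]) ** p for k in z)
    return total ** (1 / p)


print(esvs_distance("aabc", "abcd"))  # 8.0
```

With p = 1 the distance is a plain sum of count differences; passing p = 2 gives the Euclidean distance over the same vectors.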
String Distance Example • Alphabet S = {a, b, c}, α = abaac and β = cbaac • α* = {a, b, c, ab, ba, aa, ac, aba, baa, aac, abaa, baac, abaac} • β* = {a, b, c, cb, ba, aa, ac, cba, baa, aac, cbaa, baac, cbaac} • Z = α* ∪ β* = {a, b, c, ab, cb, ba, aa, ac, cba, aba, baa, aac, cbaa, abaa, baac, cbaac, abaac} • Equal frequency: I = {b, ba, aa, ac, baa, aac, baac} • Different frequency: D = {a, c} (a occurs 3 times in α and 2 times in β; c occurs once in α and twice in β) • Unique to one string: O = {ab, cb, cba, aba, cbaa, abaa, cbaac, abaac} • |I| = 7, |D| = 2, and |O| = 8
String Distance Example • |I| = 7, |D| = 2, and |O| = 8 • |I| + |D| + |O| = |Z| = 17 • Contribution of O is |O| = 8, since each unique substring occurs exactly once • Contribution of I is 0: these substrings appear equally often • Contribution of D is 2 in this case: a and c each differ by 1 • d(α, β) = contribution(I) + contribution(D) + contribution(O) = 0 + 2 + 8 = 10
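The partition into I, D, and O can be checked mechanically. The sketch below assumes contiguous substrings and a sum of absolute count differences; the function name `partition` and the variable names are illustrative. Note that c occurs once in abaac and twice in cbaac, so the sketch places it in D alongside a.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def partition(alpha: str, beta: str):
    """Split Z into I (equal counts), D (shared, unequal counts),
    and O (present in only one of the two strings)."""
    ca, cb = substring_counts(alpha), substring_counts(beta)
    shared = set(ca) & set(cb)
    equal = {z for z in shared if ca[z] == cb[z]}
    different = shared - equal
    only_one = set(ca) ^ set(cb)
    return equal, different, only_one, ca, cb


I, D, O, ca, cb = partition("abaac", "cbaac")
contribution_D = sum(abs(ca[z] - cb[z]) for z in D)
contribution_O = sum(ca[z] + cb[z] for z in O)  # one of the two counts is 0
distance = contribution_D + contribution_O      # I contributes 0

print(sorted(D))                 # ['a', 'c']
print(len(I), len(D), len(O))    # 7 2 8
print(distance)                  # 10
```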
Examples • A = aabc, B = abcd • Substring occurrences of A: S = {a, a, aa, aab, aabc, ab, abc, b, bc, c} • Substring occurrences of B: T = {a, ab, abc, abcd, b, bc, bcd, c, cd, d} • Counts for S and T • S: a:2 aa:1 aab:1 aabc:1 ab:1 abc:1 b:1 bc:1 c:1 • T: a:1 ab:1 abc:1 abcd:1 b:1 bc:1 bcd:1 c:1 cd:1 d:1 • Differences: a:1 aa:1 aab:1 aabc:1 ab:0 abc:0 abcd:1 b:0 bc:0 bcd:1 c:0 cd:1 d:1 • Distance (aabc, abcd) = 8
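The per-substring differences for this example can be tabulated with a short sketch. It assumes contiguous substrings; `count_differences` is an illustrative name, not part of the authors' tool.

```python
from collections import Counter


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def count_differences(a: str, b: str) -> dict:
    """Absolute difference in occurrence count for every substring of a or b."""
    ca, cb = substring_counts(a), substring_counts(b)
    return {z: abs(ca[z] - cb[z]) for z in sorted(set(ca) | set(cb))}


diffs = count_differences("aabc", "abcd")
nonzero = {z: v for z, v in diffs.items() if v}

print(nonzero)             # the eight substrings whose counts differ
print(sum(diffs.values())) # 8
```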
Examples • Too tedious by hand • http://srufaculty.sru.edu/david.dailey/javascript/StringDistances.html • Distance (aabc, abcd) = 8
Theoretical Results • Conjecture: if |α| = |β| = n and α and β share no substrings (i.e., |I ∪ D| = 0), then d(α, β) = n(n+1) • Lemma: if α = aⁿ then d(α, αα) = n² + n(n+1)/2 • Conjecture: if |α| = |β| = n, then d(α, αα) = d(α, αβ) = d(β, αβ) = d(β, ββ) = n² + n(n+1)/2
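These statements can be spot-checked on small cases. The sketch below is a finite check, not a proof, and it assumes contiguous substrings with a sum of absolute count differences. The helper names, the two-letter alphabets, and the bound n ≤ 3 are arbitrary choices made for illustration.

```python
from collections import Counter
from itertools import product


def substring_counts(s: str) -> Counter:
    """Count every occurrence of every non-empty contiguous substring of s."""
    return Counter(s[i:j]
                   for i in range(len(s))
                   for j in range(i + 1, len(s) + 1))


def d(t: str, u: str) -> int:
    """Sum of absolute differences between substring counts."""
    ct, cu = substring_counts(t), substring_counts(u)
    return sum(abs(ct[z] - cu[z]) for z in set(ct) | set(cu))


def words(alphabet: str, n: int) -> list:
    """All strings of length n over the given alphabet."""
    return ["".join(w) for w in product(alphabet, repeat=n)]


def expected(n: int) -> int:
    return n * n + n * (n + 1) // 2


# Lemma: alpha = a^n gives d(alpha, alpha alpha) = n^2 + n(n+1)/2.
lemma_ok = all(d("a" * n, "a" * (2 * n)) == expected(n) for n in range(1, 8))

# Conjecture: strings of equal length with no shared substrings
# (here: disjoint alphabets) are at distance n(n+1).
disjoint_ok = all(d(a, b) == n * (n + 1)
                  for n in range(1, 4)
                  for a in words("ab", n) for b in words("cd", n))

# Conjecture: the four concatenation distances all equal n^2 + n(n+1)/2.
concat_ok = all(d(a, a + a) == d(a, a + b) == d(b, a + b) == d(b, b + b) == expected(n)
                for n in range(1, 4)
                for a in words("ab", n) for b in words("ab", n))

print(lemma_ok, disjoint_ok, concat_ok)
```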
Explorations of String Space • Pretty pics
Conclusion • Exhaustive substructure vector space • Calculate distance • Interesting observations used to study structure similarity based on size