
Substring Statistics


Presentation Transcript


  1. Substring Statistics Kyoji Umemura Kenneth Church

  2. Goal: Words → Substrings (Anything you can do with words, we can do with substrings) • Sound Bite: • Haven’t achieved this goal • But we can do more with substrings than you might have thought • Review: • Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus (Yamamoto & Church) • Trigrams → Million-grams • Tutorial: Make substring statistics look easy • Previous treatments are (a bit) inaccessible • Generalization: • Document Frequency (df) → dfk (adaptation) • Applications: • Word Breaking (Japanese & Chinese) • Term Extraction for Information Retrieval

  3. The Chance of Two Noriegas is Closer to p/2 than p²: Implications for Language Modeling, Information Retrieval and Gzip • Standard indep models (Binomial, Multinomial, Poisson): • Chance of 1st Noriega is p • Chance of 2nd is also p • Repetition is very common • Ngrams/words (and their variant forms) appear in bursts • Noriega appears several times in a doc, or not at all • Adaptation & contagious probability distributions • Discourse structure (e.g., text cohesion, given/new): • 1st Noriega in a document is marked (more surprising) • 2nd is unmarked (less surprising) • Empirically, we find the first Noriega is surprising (p ≈ 6/1000) • But the chance of two is not surprising (closer to p/2 than p²) • Finding a rare word like Noriega is like lightning • We might not expect lightning to strike twice in a doc • But it happens all the time, especially for good keywords • Documents ≠ Random Bags of Words • Motivation & Background: Unigrams → Substrings (ngrams)

  4. Three Applications & Independence Assumptions: No Quantity Discounts • Compression: Huffman Coding • |encoding(s)| = ceil(−log2 Pr(s)) • Two Noriegas consume twice as much space as one • |encoding(s s)| = |encoding(s)| + |encoding(s)| • No quantity discount • Indep is the worst case: any dependencies → less H (space) • Information Retrieval • Score(query, doc) = Σ over terms in doc of tf(term, doc) · idf(term) • idf(term): inverse doc freq: −log2 Pr(term) = −log2 df(term)/D • tf(term, doc): number of instances of term in doc • Two Noriegas are twice as surprising as one (2·idf vs. idf) • No quantity discount: any dependencies → less surprise • Speech Recognition, OCR, Spelling Correction • I → Noisy Channel → O • Pr(I) Pr(O|I) • Pr(I) = Pr(w1, w2 … wn) ≈ Π over k of Pr(wk | wk−2, wk−1) • Log tf smoothing
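A minimal sketch of the tf·idf score above (my own illustration, not from the slides): under the independence assumption, every additional occurrence of a term is charged a full idf = −log2(df/D), i.e., no quantity discount.

    #include <math.h>

    /* Sketch: score one document against a query under independence.
       tf[] holds the query terms' frequencies in this document, df[] their
       document frequencies in a collection of D documents (assumed inputs). */
    double score(const double *tf, const double *df, int nterms, double D) {
        double s = 0.0;
        for (int t = 0; t < nterms; t++)
            if (tf[t] > 0)
                s += tf[t] * -log2(df[t] / D);   /* idf = -log2(df/D) */
        return s;
    }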

  5. Interestingness Metrics: Deviations from Independence • Poisson (and other indep assumptions) • Not bad for meaningless random strings • Deviations from Poisson are clues for hidden variables • Meaning, content, genre, topic, author, etc. • Analogous to mutual information (Hanks) • Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)

  6. If we had a good description of the distribution: Pr(k) • Then we could compute any summary statistic • Moments: mean, var • Entropy: • Adaptation: Pr(k≥2|k ≥1)
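A small sketch of the point above (my own, assuming Pr(k) is given as an array pr[0..kmax]): once the distribution is in hand, the summary statistics on the slide all follow directly; the entropy formula, elided on the slide, is filled in with the standard definition.

    #include <math.h>

    /* Compute mean, variance, entropy, and adaptation Pr(k>=2 | k>=1)
       from a given distribution pr[k], k = 0..kmax (an assumed input). */
    void summarize(const double *pr, int kmax,
                   double *mean, double *var, double *entropy, double *adapt) {
        double m = 0, m2 = 0, h = 0;
        for (int k = 0; k <= kmax; k++) {
            m  += k * pr[k];
            m2 += (double)k * k * pr[k];
            if (pr[k] > 0) h -= pr[k] * log2(pr[k]);   /* H = -sum p log2 p */
        }
        *mean = m;
        *var = m2 - m * m;
        *entropy = h;
        *adapt = (1.0 - pr[0] - pr[1]) / (1.0 - pr[0]);  /* Pr(k>=2)/Pr(k>=1) */
    }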

  7. Poisson Mixtures: More Poissons → Better Fit (Interpretation: Each Poisson is conditional on hidden variables: meaning, content, genre, topic, author, etc.)

  8. Adaptation: Three Approaches • Cache-based adaptation • Parametric Models • Poisson, Two Poisson, Mixtures (neg binomial) • Non-parametric • Pr(+adapt1) ≡ Pr(test|hist) • Pr(+adapt2) ≡ Pr(k≥2 | k≥1)

  9. Positive & Negative Adaptation • Adaptation: • How do probabilities change as we read a doc? • Intuition: If a word w has been seen recently • +adapt: prob of w (and its friends) goes way up • −adapt: prob of many other words goes down a little • Pr(+adapt) >> Pr(prior) > Pr(−adapt)

  10. Adaptation: Method 1 • Split each document into two equal pieces: • Hist: 1st half of doc • Test: 2nd half of doc • Task: • Given hist • Predict test • Compute contingency table for each word

  11. Adaptation: Method 1 • Notation • D = a+b+c+d (library) • df = a+b+c (doc freq) • Prior: • +adapt • −adapt
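The Prior, +adapt, and −adapt estimates appear on this slide only as a figure. The sketch below fills them in with what I take to be the standard contingency-table estimates from Church's adaptation work; this labeling of the cells is an assumption, not a transcription of the slide: a = docs where the word appears in both hist and test, b = hist only, c = test only, d = neither.

    /* Sketch (assumed cell labeling, not from the slide):
       a = hist & test, b = hist only, c = test only, d = neither. */
    struct adapt { double prior, pos, neg; };

    struct adapt estimate(double a, double b, double c, double d) {
        struct adapt r;
        double D = a + b + c + d;          /* library size             */
        r.prior = (a + c) / D;             /* Pr(word in test)         */
        r.pos   = a / (a + b);             /* Pr(test | word in hist)  */
        r.neg   = c / (c + d);             /* Pr(test | word not in hist) */
        return r;
    }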

  12. Priming, Neighborhoods and Query Expansion • Priming: doctor/nurse • Doctor in hist → Pr(Nurse in test) ↑ • Find docs near hist (IR sense) • Neighborhood ≡ set of words in docs near hist (query expansion) • Partition vocabulary into three sets: • Hist: Word in hist • Near: Word in neighborhood − hist • Other: None of the above • Prior: • +adapt • Near • Other

  13. Adaptation: Hist >> Near >> Prior • Magnitude is huge • p/2 >> p² • Two Noriegas are not much more surprising than one • Huge quantity discounts • Shape: Given/new • 1st mention: marked • Surprising (low prob) • Depends on freq • 2nd: unmarked • Less surprising • Independent of freq • Priming: • “a little bit” marked

  14. Adaptation is Lexical • Lexical: adaptation is • Stronger for good keywords (Kennedy) • Than random strings, function words (except), etc. • Content ≠ low frequency

  15. Adaptation: Method 2 • Pr(+adapt2) • dfk(w) ≡ number of documents that • mention word w • at least k times • df1(w) ≡ standard def of document freq (df)

  16. Pr(+adapt1) ≈ Pr(+adapt2), within factors of 2-3 (as opposed to 10-1000) • 3rd mention • Priming

  17. Adaptation helps more than it hurts • Hist is a great clue: examples of big winners (boilerplate) • Lists of major cities and their temperatures • Lists of major currencies and their prices • Lists of commodities and their prices • Lists of senators and how they voted • Hist is misleading: examples of big losers • Summary articles • Articles that were garbled in transmission

  18. Recent Work (with Kyoji Umemura) • Applications: Japanese Morphology (text → words) • Standard methods: dictionary-based • Challenge: OOV (out of vocabulary) • Good keywords (OOV) adapt more than meaningless fragments • Poisson model: not bad for meaningless random strings • Adaptation (deviations from Poisson): great clues for hidden variables • OOV, good keywords, technical terminology, meaning, content, genre, author, etc. • Extend dictionary method to also look for substrings that adapt a lot • Practical procedure for counting dfk(s) for all substrings s in a large corpus (trigrams → million-grams) • Suffix array: standard method for computing freq and loc for all s • Yamamoto & Church (2001): count df for all s in large corpus • df (and many other ngram stats) for million-grams • Although there are too many substrings s to work with (n²) • They can be grouped into a manageable number of equiv classes (n) • Where all substrings in a class share the same stats • Umemura (unpublished): generalize method for dfk • Adaptation for million-grams • Today’s Talk

  19. Adaptation Conclusions • Large magnitude (p/2 >> p²) → big quantity discounts • Distinctive shape • 1st mention depends on freq • 2nd does not • Priming: between 1st mention and 2nd • Lexical: • Independence assumptions aren’t bad for meaningless random strings, function words, common first names, etc. • More adaptation for content words (good keywords, OOV)

  20. Substring Statistics Outline • Goal: Substring Statistics • Words → Substrings (Ngrams) • Anything we do with words, we should be able to do with substrings (ngrams)… • Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof • Suffix Arrays & LCP • Classes: One str → All strs • N² substrings → N classes • Class(<i,j>) = { str | str starts every suffix in interval and no others } • Compute stats over classes • DFS Traversal of Class Tree • Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdf(k+1) • Neighbors • Cross Validation • Joint Document Freq • Sketches • Apps • We are here • Sound Bite

  21. Text Suffix Arrays: Freq & loc of all ngrams • Input: text, an array of N tokens • Null terminated • Tokens: words, bytes, Asian chars • Output: s, an array of N ints, sorted “lexicographically” • s[i] denotes a semi-infinite string • Text starting at position s[i] and continuing to end • s[i] ≡ substr(text, s[i]) ≡ text+s[i] • Simple Practical Procedure:

    /* Initialize */
    for(i=0; i<N; i++) s[i] = i;

    /* Sort "lexicographically" */
    qsort(s, N, sizeof(*s), sufcmp);

    int sufcmp(int *a, int *b) {
        return strcmp(text + *a, text + *b);
    }
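A self-contained version of the simple procedure, as a sketch (the example corpus is the "to_be_or_not_to_be" string used later in the slides; the wrapper and casts are mine, since qsort's comparator takes const void *):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *text;            /* null-terminated corpus */

    /* Compare two suffixes, identified by their starting positions. */
    static int sufcmp(const void *a, const void *b) {
        return strcmp(text + *(const int *)a, text + *(const int *)b);
    }

    int main(void) {
        text = "to_be_or_not_to_be";
        int N = strlen(text);
        int *s = malloc(N * sizeof(*s));

        for (int i = 0; i < N; i++) s[i] = i;    /* initialize */
        qsort(s, N, sizeof(*s), sufcmp);         /* sort lexicographically */

        for (int i = 0; i < N; i++)              /* print the suffix array */
            printf("%2d %s\n", s[i], text + s[i]);
        free(s);
        return 0;
    }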

  22. Frequency & Location of All Ngrams (Unigrams, Bigrams, Trigrams & Million-grams) • Sufconc(pattern) outputs a concordance • Two binary searches → <i, j> • i = first suffix in suffix array that starts with pattern • j = last suffix in suffix array that starts with pattern • Freq(<i, j>) = j − i + 1 • Output j − i + 1 concordance lines, one for each suffix in <i, j> (columns: s[i], doc(s[i]), context)

    ./sufconc -l 10 -r 40 /cygdrive/d/temp2/AP/AP8912 'Manuel Noriega' | head
    17913368 5441: osed Gen. ^ Manuel Noriega\nin Panama _ their wives
    13789741 4193: apprehend ^ Manuel Noriega\n The situation in Pana
    3966027 1218: nian Gen. ^ Manuel Noriega a\n$300,000 present, and
    4938894 1503: nian Gen. ^ Manuel Noriega and\nothers laundered $50
    16718522 5098: ed ruler\n ^ Manuel Noriega continue to evade U.S. fo
    18568442 5635: to force ^ Manuel Noriega from\npower.\n Further
    14794912 4497: oust Gen. ^ Manuel Noriega from power, the zoo's dir
    14434223 4380: that Gen. ^ Manuel Noriega had been killed.\n Mary
    14237714 4321: .''\n `` ^ Manuel Noriega had explicitly declared w
    19901786 6061: nian Gen. ^ Manuel Noriega in the hands of a special
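A sketch of the two binary searches (my own code, not the sufconc source; it assumes the suffix array s[] and corpus text from the previous slide): find the first and last suffix that starts with the pattern, so that freq = j − i + 1.

    #include <string.h>

    /* Compare pattern against the first strlen(pattern) bytes of a suffix. */
    static int prefixcmp(const char *pattern, const char *text, int suf) {
        return strncmp(pattern, text + suf, strlen(pattern));
    }

    /* Set <i,j> to the interval of suffixes starting with pattern;
       return the frequency j - i + 1, or 0 if the pattern does not occur. */
    int find_interval(const char *pattern, const char *text,
                      const int *s, int N, int *i, int *j) {
        int lo, hi;

        /* first suffix that is >= pattern on its leading bytes */
        lo = 0; hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (prefixcmp(pattern, text, s[mid]) > 0) lo = mid + 1;
            else hi = mid;
        }
        *i = lo;

        /* first suffix strictly beyond the pattern's range */
        lo = *i; hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (prefixcmp(pattern, text, s[mid]) >= 0) lo = mid + 1;
            else hi = mid;
        }
        *j = lo - 1;

        return (*i < N && prefixcmp(pattern, text, s[*i]) == 0) ? (*j - *i + 1) : 0;
    }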

  23. Suffix Arrays: Computational Complexity • Bottom Line: O(N log N) Time & O(N) Space • Simple Practical Procedure: • Initialize: for(i=0; i<N; i++) s[i] = i; • Sort “lexicographically”: qsort(s, N, sizeof(*s), sufcmp); where int sufcmp(int *a, int *b) { return strcmp(text + *a, text + *b); } • You might think this takes O(N log N) time • But unfortunately, sufcmp is not O(1) • so the sort is O(N² log N) • Fortunately, there is an O(N log N) alg • See http://www.cs.dartmouth.edu/~doug/ for excellent tutorial • But in practice, the simple procedure is often just as good • (if not slightly better)

  24. Same as before, but tokens are now bytes rather than words

  25. Distribution of LCPs (1989 AP News) • Peak ≈ 10 bytes (roughly word bigrams) • Long tail (boilerplate & duplicate docs)

  26. Substring Statistics Outline • Goal: Substring Statistics • Words → Substrings (Ngrams) • Anything we do with words, we should be able to do with substrings (ngrams)… • Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof • Suffix Arrays & LCP • Classes: One str → All strs • N² substrings → N classes • Class(<i,j>) = { str | str starts every suffix in interval and no others } • Compute stats over classes • Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdf(k+1) • Neighbors • Depth-First Traversal of Class Tree • Cross Validation • Joint Document Freq • Sketches • Apps • We are here

  27. Distributional Equivalence • sufconc • Frequency and location for one substring • Impressive: trigrams → million-grams • Challenge: One substring → All substrings • Too many substrings: N² • Solution: group substrings into equiv classes • N² substrings → N classes • str1 = str2 iff • Every suffix that starts with str1 also starts with str2 (and vice versa) • Example: “to be or not to be” • “to” = “to be” • Class(<i,j>) = { str | str starts every suffix in interval and no others } • Compute stats over N classes • Rather than over N² substrings

  28. Grouping Substrings into Classes • Interval on Suffix Array: <i, j> • Class(<i, j>) is a set of substrings that • Start every suffix within the interval • And no suffixes outside the interval • Examples: • Class(<6,7>) = {“b”, “be”} • Class(<17,18>) = {“to”, “to_”, “to_b”, “to_be”} • Classes form an equivalence relation R • str1 R str2 ↔ str1 & str2 in same class • Interpretation: distributional equivalence • “to” R “to be” → “to” and “to be” appear in exactly the same places in corpus • R partitions the set of all substrings: • Every substring appears in one and only one class • R is reflexive, symmetric and transitive

  29. Although there are too many substrings to work with (≈N²), they can be grouped into a manageable number of classes

  30. 171 Substrings  8 Non-Trivial Classes • Corpus: • N = 18 • to_be_or_not_to_be • Substrings: • Theory: • N ∙(N+1)/2 = 171 • 150 observed • 135 with freq=1 (yellow) • 15 with freq>1 (green) • Classes: • Theory: • 2N = 36 (N trivial + N non-trivial) • 21 observed • 13 trivial (yellow) • 8 non-trivial (green)

  31. Motivation for Grouping Substrings into Equivalence Classes • Computational Issues: • N is more manageable than N² • Statistics can be computed over classes • Because all substrings in a class have the same stats • for many popular stats: freq, df, dfk, joint df & contingency tables, and combinations thereof • Examples: (corpus = “to_be_or_not_to_be”) • Class(<6,7>) = {“b”, “be”} • freq(“b”) = freq(“be”) = 7−6+1 = 2 • Class(<17,18>) = {“to”, “to_”, “to_b”, “to_be”} • freq(“to”) = freq(“to_”) = freq(“to_b”) = freq(“to_be”) = 18−17+1 = 2 • Class(<11, 14>) = {“o”} • freq(“o”) = 14−11+1 = 4

  32. Substring Statistics Outline • Goal: Substring Statistics • Words → Substrings (Ngrams) • Anything we do with words, we should be able to do with substrings (ngrams)… • Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof • Suffix Arrays & LCP • Classes: One str → All strs • N² substrings → N classes • Class(<i,j>) = { str | str starts every suffix in interval and no others } • Compute stats over classes • DFS Traversal of Class Tree • Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdf(k+1) • Neighbors • Cross Validation • Joint Document Freq • Sketches • Apps • We are here

  33. Class Tree: Nesting of Valid Intervals

  34. LBL = Longest Bounding LCP = 0; SIL = Shortest Interior LCP = 1 • Class(<i, j>) = { str | str starts every suffix within interval, and no others } = { substr(text + s[i], 1, k) } where LBL < k ≤ SIL • LCP[i] = Longest Common Prefix of s[i] & s[i+1] • SIL(<i, j>) = Shortest Interior LCP = MIN over i ≤ k < j of LCP[k] • LBL(<i, j>) = Longest Bounding LCP = max(LCP[i], LCP[j]) • Class(<1, 5>) = { substr(“_be”, 1, k) } for 0 < k ≤ 1
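The LCP array itself is not constructed anywhere in the slides; the sketch below (my own, a direct O(N · average-LCP) computation rather than Kasai-style linear time) builds LCP[i] as defined above, and SIL for a non-trivial interval then follows.

    /* LCP[i] = length of the longest common prefix of suffixes s[i] and s[i+1]. */
    void compute_lcp(const char *text, const int *s, int N, int *LCP) {
        for (int i = 0; i + 1 < N; i++) {
            const char *a = text + s[i], *b = text + s[i + 1];
            int k = 0;
            while (a[k] && a[k] == b[k]) k++;
            LCP[i] = k;
        }
        LCP[N - 1] = 0;                  /* no suffix follows the last one */
    }

    /* SIL(<i,j>) = min of the interior LCPs; assumes i < j (non-trivial). */
    int SIL(const int *LCP, int i, int j) {
        int m = LCP[i];
        for (int k = i; k < j; k++)
            if (LCP[k] < m) m = LCP[k];
        return m;
    }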

  35. Enumerating Classes • A class is uniquely determined by • Two endpoints: <i, j>, or • SIL & witness: i ≤ w ≤ j • To enumerate classes: • Enumerate 0 ≤ w < N & LCP[w] • Remove duplicate classes • Two witnesses and their LCPs might specify the same class • Alternatively, depth-first traversal of class tree • Output: <i, j> and SIL(<i, j>)

    struct stackframe { int i, j, SIL; } *stack;
    int sp = 0;                          /* stack pointer */
    stack[sp].i = 0; stack[sp].SIL = -1;
    for (w = 0; w < N; w++) {
        if (LCP[w] > stack[sp].SIL) {    /* push */
            sp++;
            stack[sp].i = w;
            stack[sp].SIL = LCP[w];
        }
        while (LCP[w] < stack[sp].SIL) { /* pop */
            stack[sp].j = w;
            output(&stack[sp]);
            if (LCP[w] <= stack[sp-1].SIL) sp--;
            else stack[sp].SIL = LCP[w];
        }
    }

  36. Depth-first traversal • Outputs <i, j> and SIL(<i, j>) • Sorted first by j (increasing order) and then by i (decreasing order) • [Figure: class tree with the non-trivial classes numbered, annotated with the same stack-based enumeration code as slide 35]

  37. Find Class • Input: pattern (a substring such as “Norieg”) • Output: <i, j>, LBL, SIL, stats, Class(<i, j>) • Method: • Two binary searches into suffix array to find first (i) and last (j) suffix starting with input pattern • Third binary search into classes to find class and associated (pre-computed) stats • Computed from i and j (first two binary searches): • LBL(<i, j>) = max(LCP[i], LCP[j]) • Computed from class (third binary search): • SIL • dfk: # of documents that contain input pattern at least k times • Class(<i, j>) = { substr(text + s[i], 1, k) } for LBL < k ≤ SIL • Takes advantage of ordering on classes (sorted first by j and then by i)

  38. Substring Statistics Outline • Goal: Substring Statistics • Words → Substrings (Ngrams) • Anything we do with words, we should be able to do with substrings (ngrams)… • Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof • Suffix Arrays & LCP • Classes: One str → All strs • N² substrings → N classes • Class(<i,j>) = { str | str starts every suffix in interval and no others } • Compute stats over classes • DFS Traversal of Class Tree • Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdf(k+1) • Neighbors • Cross Validation • Joint Document Freq • Sketches • Apps • We are here

  39. Corpus (3 docs): • Hi_Ho_Hi_Ho • Hi_Ho • Hi • (Need a few docs to talk about df)

  40. Cumulative Document Frequency (cdf) • Document Frequency (df) • Number of documents that mention str at least once • dfk ≡ number of documents that mention str at least k times • Adaptation = Pr(k≥2 | k≥1) = df2 / df1 • Cumulative Document Frequency (cdf): cdfk ≡ cumulative doc freq = Σ over i≥k of dfi • Can recover freq & dfk from cdfk: • freq = cdf1 = j − i + 1 • dfk = cdfk − cdf(k+1) • A (simple but slow) method for computing cdfk: • cdf1(<i, j>) = Σ over i≤w≤j of 1 • cdf2(<i, j>) = Σ over i≤w≤j of [neighbor[w] ≥ i] • cdfk(<i, j>) = Σ over i≤w≤j of [neighbor^(k−1)[w] ≥ i]
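A quick worked check (my own) with the three-document corpus from slide 39: the substring “Hi” occurs twice in “Hi_Ho_Hi_Ho”, once in “Hi_Ho”, and once in “Hi”, so df1 = 3, df2 = 1, and dfk = 0 for k ≥ 3. Then cdf1 = df1 + df2 = 4 = freq(“Hi”), cdf2 = 1, cdf3 = 0, and the recovery formulas give back df1 = cdf1 − cdf2 = 3 and df2 = cdf2 − cdf3 = 1. Adaptation = df2/df1 = 1/3.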

  41. Neighbors • doc(s) → 1:D • (using binary search) • Neighbors[s2] = s1 • where doc(s1) = doc(s2) = d • and s1 and s2 are adjacent: there is no suffix s3 such that doc(s3) = d and s1 < s3 < s2 • Neighbors[s2] = NA if s2 is the first suffix in its doc: there is no suffix s1 such that doc(s1) = doc(s2) and s1 < s2 • Neighbor^k[s] = Neighbor^(k−1)[Neighbor[s]], for k > 1 • Neighbor^0[s] = s (identity)
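A sketch of how the neighbors array might be built (my own code; it assumes docstart[0..D] holds the starting offset of each document plus a sentinel docstart[D] = N, which is not spelled out in the slides): doc() maps a corpus position to its document by binary search, and one pass over the suffix array records, for each suffix, the previous suffix-array position belonging to the same document.

    #include <stdlib.h>

    /* doc(p): document id (0..D-1) for corpus position p, by binary search
       over docstart[], where docstart[d] <= p < docstart[d+1]. */
    int doc(const int *docstart, int D, int p) {
        int lo = 0, hi = D;
        while (lo + 1 < hi) {
            int mid = (lo + hi) / 2;
            if (docstart[mid] <= p) lo = mid; else hi = mid;
        }
        return lo;
    }

    /* neighbor[w] = largest w' < w with doc(s[w']) == doc(s[w]), or -1 (NA). */
    void compute_neighbors(const int *s, int N, const int *docstart, int D,
                           int *neighbor) {
        int *last = malloc(D * sizeof(*last));
        for (int d = 0; d < D; d++) last[d] = -1;
        for (int w = 0; w < N; w++) {
            int d = doc(docstart, D, s[w]);
            neighbor[w] = last[d];       /* previous same-doc suffix, if any */
            last[d] = w;
        }
        free(last);
    }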

  42. Simple (but slow) code for cdfk • cdfk(<i, j>) = Σ over i≤w≤j of [neighbor^(k−1)[w] ≥ i] • Neighbor^k[s] = Neighbor^(k−1)[Neighbor[s]], for k > 1

    struct class { int start, end, SIL; };

    /* returns neighbor^k(suf), or -1 if NA */
    int kth_neighbor(int suf, int k) {
        if (suf >= 0 && k >= 1)
            return kth_neighbor(neighbors[suf], k-1);
        else
            return suf;
    }

    struct class c;
    while (fread(&c, sizeof(c), 1, stdin)) {
        int cdfk = 0;
        for (w = c.start; w <= c.end; w++)
            if (kth_neighbor(w, K-1) >= c.start)
                cdfk++;
        putw(cdfk, out);                 /* report */
    }

  43. Same as before (but folded into Depth-First Search) • cdfk(<i, j>) = Σ over i≤w≤j of [neighbor^(k−1)[w] ≥ i] • Neighbor^k[s] = Neighbor^(k−1)[Neighbor[s]], for k > 1

    struct stackframe { int start, SIL, cdfk; } *stack;

    /* returns neighbor^k(suf), or -1 if NA */
    int kth_neighbor(int suf, int k) {
        if (suf >= 0 && k >= 1)
            return kth_neighbor(neighbors[suf], k-1);
        else
            return suf;
    }

    for (w = 0; w < N; w++) {
        if (LCP[w] > stack[sp].SIL) {
            sp++;
            stack[sp].start = w;
            stack[sp].SIL = LCP[w];
            stack[sp].cdfk = 0;
        }
        for (sp1 = 0; sp1 <= sp; sp1++) {
            if (kth_neighbor(w, K-1) >= stack[sp1].start)
                stack[sp1].cdfk++;
        }
        while (LCP[w] < stack[sp].SIL) {
            putw(stack[sp].cdfk, out);   /* report */
            if (LCP[w] <= stack[sp-1].SIL) sp--;
            else stack[sp].SIL = LCP[w];
        }
    }

  44. Results • cdfk ≡ cum df = Σ over i≥k of dfi • freq = cdf1 = j − i + 1 • dfk = cdfk − cdf(k+1) • cdf1(<i, j>) = Σ over i≤w≤j of 1 = j − i + 1 • cdf2(<i, j>) = Σ over i≤w≤j of [neighbor[w] ≥ i] • cdfk(<i, j>) = Σ over i≤w≤j of [neighbor^(k−1)[w] ≥ i] • dfk ≥ df(k+1) and cdfk ≥ cdf(k+1)

  45. Monotonicity • dfk ≥ df(k+1) • cdfk ≥ cdf(k+1) • cdfk[mother] ≥ Σ over d in daughters of cdfk[d] • Opportunity for speedup: Propagate counts up class tree

  46. Faster O(N · max(k, log max(LCP))) code for cdfk

    struct stackframe { int start, SIL, cdfk; } *stack;

    /* returns neighbor^k(suffix), or -1 if NA */
    int kth_neighbor(int suffix, int k) {
        int i, result = suffix;
        for (i = 0; i < k && result >= 0; i++)
            result = neighbors[result];
        return result;
    }

    /* return the deepest stack frame whose start is at or before suffix */
    /* binary search works because stack frames are sorted by start */
    int find(int suffix) {
        int low = 0;
        int high = sp;
        while (low + 1 < high) {
            int mid = (low + high) / 2;
            if (stack[mid].start <= suffix) low = mid;
            else high = mid;
        }
        if (stack[high].start <= suffix) return high;
        if (stack[low].start <= suffix) return low;
        fatal("can't get here");
    }

    for (w = 0; w < N; w++) {
        if (LCP[w] > stack[sp].SIL) {
            sp++;
            stack[sp].start = w;
            stack[sp].SIL = LCP[w];
            stack[sp].cdfk = 0;
        }
        int prev = kth_neighbor(w, K-1);
        if (prev >= 0)
            stack[find(prev)].cdfk++;
        while (LCP[w] < stack[sp].SIL) {
            putw(stack[sp].cdfk, out);                /* report */
            if (LCP[w] <= stack[sp-1].SIL) {
                stack[sp-1].cdfk += stack[sp].cdfk;   /* propagate up */
                sp--;
            } else stack[sp].SIL = LCP[w];
        }
    }

• Costs: N (main loop) · max(k (kth_neighbor), log max(LCP) (find)) • Propagate counts up class tree
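Design note (my reading of the slide): instead of incrementing every enclosing stack frame for each suffix (the inner loop on slide 43), the faster version charges each suffix once, to the innermost frame that contains its (k−1)-th neighbor, and the counts are added into the parent frame when a class is popped. Each of the N suffixes then costs one O(k) neighbor walk plus one O(log max(LCP)) binary search over the stack, which is where the O(N · max(k, log max(LCP))) bound comes from.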
