
Course on Data Mining (581550-4)


Presentation Transcript


  1. Course on Data Mining (581550-4) [timeline figure: the course topics Intro/Ass. Rules, Clustering, Episodes, KDD Process, Text Mining and Appl./Summary laid out over the dates 24./26.10, 30.10, 7.11, 14.11, 21.11 and 28.11, ending with the Home Exam]

  2. Course on Data Mining (581550-4) • Today's subject: Text Mining, with a focus on maximal frequent phrases, also called maximal frequent sequences (MaxFreq) • Next week's program: Lecture: Clustering, Classification, Similarity; Exercise: Text Mining; Seminar: Text Mining (Today: 07.11.2001)

  3. Text Mining Background What is Text Mining? MaxFreq Sequences MaxFreq Algorithms MaxFreq Experiments

  4. Text Databases and Information Retrieval • Text databases (document databases) • Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc. • Information retrieval (IR) • Information is organized into (a large number of) documents • Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

  5. Basic Measures for Text Retrieval • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses): precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| • Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved: recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}| [Venn diagram: the sets Relevant and Retrieved overlapping in Relevant & Retrieved, inside the set of all documents]
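
As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes both measures from sets of document identifiers; the function name and the example sets are ours:

```python
def precision_recall(relevant, retrieved):
    """Precision and recall from sets of document identifiers."""
    hits = relevant & retrieved                      # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; 5 documents are relevant in all.
print(precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 9}))  # (0.75, 0.6)
```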

  6. Keyword/Similarity-Based Retrieval • A document is represented by a string, which can be identified by a set of keywords • Find similar documents based on a set of common keywords • The answer should be ranked by degree of relevance, based on the nearness of the keywords, the relative frequency of the keywords, etc. • In the following, some basic techniques related to preprocessing and retrieval are briefly described

  7. Keyword/Similarity-Based Retrieval • Basic techniques (1): Remove irrelevant words with a stop list • A set of words that are deemed "irrelevant", even though they may appear frequently • E.g., a, the, of, for, with, etc. • Stop lists may vary when the document set varies • Basic techniques (2): Reduce words to their basic forms with word stemming • Several words are small syntactic variants of each other, since they share a common word stem (basic form) • E.g., drug, drugs, drugged (see the sketch below)
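
A minimal Python sketch of these two techniques; the stop list and the suffix-stripping rule below are toy stand-ins of ours, while real systems use curated stop lists and proper morphological analysis (see the tagged examples on the later slides):

```python
# Toy stop list and stemmer, for illustration only.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "are", "is"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for real stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = [w.lower().strip(".,") for w in text.split()]
    return [naive_stem(w) for w in words if w not in STOP_LIST]

print(preprocess("Documents are an interesting application field for data mining techniques."))
# ['document', 'interest', 'application', 'field', 'data', 'mining', 'technique']
```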

  8. Keyword/Similarity-Based Retrieval • Basic techniques (3): Calculate occurrences of terms into a term frequency table • Each entry freq_table(i, j) = number of occurrences of the word ti in document dj (or just "0" or "1" for absence/presence) • Basic techniques (4): Similarity metrics: measure the closeness of a document to a query (a set of keywords) • Cosine distance, based on the cosine of the angle between the term-frequency vectors: sim(d, q) = (d · q) / (|d| |q|) • Relative term occurrences • This is all nice to know, but where is the text mining, and how does it relate to all this?
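
The following sketch (ours, for illustration) builds term-frequency vectors and computes the cosine similarity between a document and a keyword query:

```python
import math
from collections import Counter

def cosine_similarity(doc_words, query_words):
    """Cosine of the angle between the term-frequency vectors d and q."""
    d, q = Counter(doc_words), Counter(query_words)
    dot = sum(d[t] * q[t] for t in set(d) & set(q))
    norm_d = math.sqrt(sum(f * f for f in d.values()))
    norm_q = math.sqrt(sum(f * f for f in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = ["data", "mining", "application", "data"]
query = ["data", "mining"]
print(round(cosine_similarity(doc, query), 3))  # 0.866
```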

  9. Text Mining Background What is Text Mining? MaxFreq Sequences MaxFreq Algorithms MaxFreq Experiments

  10. What is Text Mining? • Data mining in text: find something useful and surprising from a text collection • Text mining vs. information retrieval is like data mining vs. database queries

  11. Different Views on Text • For example, we might have the following text: "Documents are an interesting application field for data mining techniques." • Remember the market basket data? The text can then be considered as a shopping transaction, i.e., a row in the database, and the words occurring in the text as the items bought:

Transaction ID  Items Bought        Document ID  Words occurring
100             A, B, C             100          an, application, ...
200             A, C                200          ...

  12. Different Views on Text • Recall the event sequence from episode rules: [figure: the events D, C, A, B, D, A, B, C on a time axis from 0 to 90] • Now we can consider the text as a sequence of words: [figure: the words documents, are, an, interesting, application, field, for, data, mining, techniques placed on time slots 0 to 11]

  13. Text Preprocessing • So, suppose that we have the following example text: "Documents are an interesting application field for data mining techniques." • To this text, we might apply the following preprocessing operations: 1. Find the basic forms of the words (stemming) 2. Use a stop list to remove uninteresting words 3. Select, e.g., only the wanted word classes (e.g., nouns)

  14. Text Preprocessing • Words with their positions: (Documents, 1) (are, 2) (an, 3) (interesting, 4) (application, 5) (field, 6) (for, 7) (data, 8) (mining, 9) (techniques, 10) (., 11) • After morphological analysis: (document_N_PL, 1) (be_V_PRES_PL, 2) (an_DET, 3) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (for_PP, 7) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) (STOP, 11) • Morphological information: N = noun, PL = plural, V = verb, PRES = present tense, DET = determiner, A = adjective, POS = positive, SG = singular, PP = preposition

  15. Text Preprocessing • Before stop-word removal: (document_N_PL, 1) (be_V_PRES_PL, 2) (an_DET, 3) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (for_PP, 7) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) (STOP, 11) • After stop-word removal: (document_N_PL, 1) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) • Morphological information: as on the previous slide

  16. Text Preprocessing • Before word-class selection: (document_N_PL, 1) (interesting_A_POS, 4) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) • After selecting only nouns: (document_N_PL, 1) (application_N_SG, 5) (field_N_SG, 6) (data_N_SG, 8) (mining_N_SG, 9) (technique_N_PL, 10) • Morphological information: as on the previous slides

  17. Text Preprocessing [figure: the remaining words document, application, field, data, mining, technique on time slots 0 to 11] • Now we have a preprocessed sequence of words • We might also just throw away the stop words etc. and put the words into consecutive "time slots" (1, 2, 3, ...) • Preprocessing can be applied to transaction-based text data in a similar fashion

  18. Types of Text Mining • Keyword (or term) based association analysis • Automatic document classification • Similarity detection • Cluster documents by a common author • Cluster documents containing information from a common source • Sequence analysis: predicting a recurring event, discovering trends • Anomaly detection: find information that violates usual patterns

  19. Term-Based Assoc. Analysis • Collect sets of keywords or terms that occur frequently together, and then find the association relationships among them • First preprocess the text data by parsing, stemming, removing stop words, etc. • Then invoke association mining algorithms • Consider each document as a transaction • View the set of keywords/terms in the document as the set of items in the transaction

  20. Term-Based Assoc. Analysis • For example, we might find frequent sets such as: 2%: application, field; 5%: data, mining • ...and association rules like: application => field (2%, 52%); data => mining (5%, 75%) • These kinds of frequent sets and rules might help in expanding user queries, or describe the documents better than simple keywords do (a sketch follows below) • Sometimes it would be nice to discover new descriptive phrases directly from the actual text. What then?
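
A minimal sketch of this idea, restricted to two-keyword rules for brevity; a full association miner such as Apriori would handle larger keyword sets. The document collection and the thresholds below are made up:

```python
from collections import Counter
from itertools import combinations

def keyword_pair_rules(documents, min_support, min_confidence):
    """Print rules A => B between keywords; each document is a set of keywords."""
    n = len(documents)
    item_count, pair_count = Counter(), Counter()
    for doc in documents:
        item_count.update(doc)
        pair_count.update(combinations(sorted(doc), 2))
    for (a, b), c in pair_count.items():
        if c / n < min_support:
            continue                      # pair not frequent enough
        for head, body in ((a, b), (b, a)):
            conf = c / item_count[head]
            if conf >= min_confidence:
                print(f"{head} => {body} ({c / n:.0%}, {conf:.0%})")

docs = [{"data", "mining", "application"}, {"data", "mining"}, {"data", "field"}]
keyword_pair_rules(docs, min_support=0.5, min_confidence=0.7)
# mining => data (67%, 100%)
```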

  21. Term-Based Episode Analysis • Now we want to find words/terms that occur frequently close to each other in the actual text • Take the preprocessed sequential text data, and then find relationships among the words/terms by invoking episode mining algorithms (WINEPI or MINEPI) • For example, we might find frequent episodes such as: data, mining, knowledge, discovery • ...and MINEPI-style episode rules like: data, mining => knowledge, discovery [4] [8] (2%, 81%)

  22. Problems • Quite often, it would be interesting to find very long descriptive phrases for describing the documents... • ...but the discovery of long descriptive phrases can be tedious, especially when all the shorter phrases have to be created in order to get the longest ones • One answer: maximal frequent sequences, or maximal frequent phrases (note: we use the concepts "sequence" and "phrase" interchangeably)

  23. Text Mining Background What is Text Mining? MaxFreq Sequences MaxFreq Algorithms MaxFreq Experiments

  24. Frequent Word Sequences • Assume: S is a set of documents, and each document consists of a sequence of words • A phrase is a sequence of words • A sequence p occurs in a document d if all the words of p occur in d, in the same order as in p • A sequence p is frequent in S if p occurs in at least σ documents of S, where σ is a given frequency threshold • A maximal gap n can be given: the original locations of any two consecutive words of a sequence may have at most n words between them (see the sketch below)
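
A small Python sketch of this occurrence test (our own formulation): it checks whether a phrase occurs in a document of (word, position) pairs in the right order and within the maximal gap, backtracking over the possible anchor positions, since a greedy match could miss a valid occurrence. Document 1 of the next slide is used as the example:

```python
def occurs(phrase, doc, max_gap, start=0, prev_pos=None):
    """True if `phrase` (list of words) occurs in `doc` (list of (word, position)
    pairs) in order, with at most `max_gap` words between consecutive matches."""
    if not phrase:
        return True
    for i in range(start, len(doc)):
        word, pos = doc[i]
        if prev_pos is not None and pos - prev_pos - 1 > max_gap:
            break  # positions only grow: no later word can satisfy the gap
        if word == phrase[0] and occurs(phrase[1:], doc, max_gap, i + 1, pos):
            return True
    return False

doc1 = [("retaliation", 78), ("against", 79), ("foreign", 80), ("countries", 81),
        ("for", 82), ("unfair", 83), ("foreign", 84), ("trade", 85), ("practices", 86)]
print(occurs(["unfair", "trade", "practices"], doc1, max_gap=2))  # True
print(occurs(["retaliation", "unfair"], doc1, max_gap=2))         # False (4 words apart)
```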

  25. Frequent Word Sequences
1: (The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)
2: (He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)
3: (Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain,414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)

  26. Frequent Word Sequences • Examples from the previous slide: • The phrase (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents, in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120) • The phrase (unfair, practices) occurs in all three documents, namely in the locations (83, 86), (118, 120), and (420, 421) • Note that we count at most one occurrence of a sequence per document!

  27. Maximal Frequent Sequences • Maximal frequent sequence: • A sequence p is a maximal frequent (sub)sequence in S if there does not exist any other sequence p' in S such that p is a subsequence of p' and p' is frequent in S • In short, a maximal frequent sequence is a sequence of words that • appears frequently in the document collection, and • is not included in another, longer frequent sequence (a sketch follows below)
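
A sketch of this maximality filter (helper names are ours), using the "dow jones industrial average" example from a later slide:

```python
def is_subsequence(p, q):
    """True if phrase p occurs within phrase q with the order preserved."""
    it = iter(q)
    return all(word in it for word in p)   # `in` consumes the iterator

def maximal_only(frequent):
    """Drop every frequent phrase contained in a longer frequent phrase."""
    return [p for p in frequent
            if not any(p != q and is_subsequence(p, q) for q in frequent)]

frequent = [("dow", "jones"), ("industrial", "average"),
            ("dow", "jones", "industrial", "average")]
print(maximal_only(frequent))  # [('dow', 'jones', 'industrial', 'average')]
```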

  28. Maximal Frequent Sequences • Usually it makes sense to concentrate on the maximal frequent sequences, or maximal frequent phrases • Subsequences and subphrases usually do not have a meaning of their own • However, sometimes subsequences or subphrases may also be interesting, if they are much more frequent than the maximal sequences containing them

  29. A Maximal Seq. with Subseq.s • Example (a maximal sequence and its subsequences):
dow jones industrial average
dow jones; dow industrial; dow average; jones industrial; jones average; industrial average
dow jones industrial; dow jones average; jones industrial average

  30. Examples of Meaningful Subseqs • Interesting subsequences can be distinguished by the characteristic that they are more frequent than the maximal sequences • A subsequence may have occurrences of its OWN in the text • A subsequence may be shared by MANY maximal sequences • A TOO FREQUENT subsequence might NOT be interesting

  31. Examples of Meaningful Subseqs • Maximal sequences: prime minister Lionel Jospin; prime minister Paavo Lipponen • Subsequences: prime minister; Lionel Jospin; Paavo Lipponen

  32. Text Mining Background What is Text Mining? MaxFreq Sequences MaxFreq Algorithms MaxFreq Experiments

  33. Discovery of Frequent Sequences • The frequency of a sequence cannot be decided locally: all the instances in the collection have to be counted • However: already a document of 20 words contains over one million (2^20) subsequences • Only a small fraction of the sequences are frequent • Many sequences have only very few occurrences

  34. Naïve Discovery Approach • Basic idea: the "standard" bottom-up approach • Collect all the pairs from the documents, count them, and select the frequent ones • Build sequences of length p+1 from the frequent sequences of length p • Select the sequences that are frequent • Iterate • Finally: select the maximal sequences (by checking, for each phrase, whether it is contained in some other phrase); a sketch follows below
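
A runnable sketch of the naïve levelwise approach, under two simplifications of ours: documents are plain word lists and the gap constraint is ignored:

```python
def is_subsequence(p, words):
    it = iter(words)
    return all(w in it for w in p)

def frequency(p, docs):
    return sum(1 for d in docs if is_subsequence(p, d))

def naive_maximal(docs, sigma):
    """Bottom-up: frequent pairs first, then (p+1)-sequences from p-sequences."""
    vocab = sorted({w for d in docs for w in d})
    level = [(a, b) for a in vocab for b in vocab
             if frequency((a, b), docs) >= sigma]
    frequent = list(level)
    while level:
        # extend every frequent p-sequence by one word on the right
        level = [p + (w,) for p in level for w in vocab
                 if frequency(p + (w,), docs) >= sigma]
        frequent += level
    # finally, keep only the maximal sequences
    return [p for p in frequent
            if not any(p != q and is_subsequence(p, q) for q in frequent)]

docs = [["b", "c", "d", "e"], ["b", "c", "d", "k"], ["b", "c", "d", "e"]]
print(naive_maximal(docs, sigma=2))  # [('b', 'c', 'd', 'e')]
```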

  35. Problems in the Naïve Approach • Problem: frequent sequences in text can be long • In our experiments: the longest phrase had 22 words (Reuters-21578 newswire data, 19,000 documents, frequency threshold 15, max gap 2) • Processing all the subphrases of all lengths is not possible • The straightforward bottom-up approach does not work • Restricting the length would produce a large number of slightly differing subphrases of any phrase that is longer than the threshold

  36. Combining Bottom-Up and Greedy Approaches: MaxFreq • Initial phase: first, the frequent pairs are collected • Discovery phase: longer sequences are constructed from shorter sequences (k-grams) as in the bottom-up approach • Expansion step: maximal sequences are discovered directly, starting from a k-gram that is not a subsequence of any known maximal sequence

  37. Combining Bottom-Up and Greedy Approaches: MaxFreq • Each maximal sequence has at least one unique subsequence that distinguishes it from the other maximal sequences. A maximal sequence is discovered, at the latest, on level k, where k is the length of its shortest unique subsequence. • Pruning step: grams that cannot be used to construct any new maximal sequences are pruned away after each level, before the length of the grams is increased • Let's take a closer look at these phases and steps!

  38. Algorithm: Initial Phase
Input: a set of documents S, a frequency threshold, and a maximal gap
Output: a gram set Grams2 containing the frequent pairs
For all documents d ∈ S
  collect all the ordered pairs of words (A, B) within d such that A and B occur in this order (wrt the maximal gap)
Grams2 := all the ordered pairs that are frequent in the set S (wrt the frequency threshold)
Return Grams2
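
Our Python rendering of this pseudocode, using the first two documents of the next slide (doc_count tracks, for each pair, the set of documents it occurs in):

```python
from collections import defaultdict

def initial_phase(docs, sigma, max_gap):
    """Collect ordered word pairs (A, B) that occur, at most max_gap words apart,
    in at least sigma documents. docs: lists of (word, position) pairs."""
    doc_count = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for i, (a, pa) in enumerate(doc):
            for b, pb in doc[i + 1:]:
                if pb - pa - 1 > max_gap:
                    break  # positions grow: later words are even farther away
                doc_count[(a, b)].add(doc_id)
    return {pair: len(ids) for pair, ids in doc_count.items() if len(ids) >= sigma}

docs = [
    [("A", 11), ("B", 12), ("C", 13), ("D", 14), ("E", 15)],
    [("P", 21), ("B", 22), ("C", 23), ("D", 24), ("K", 25)],
]
print(initial_phase(docs, sigma=2, max_gap=2))
# {('B', 'C'): 2, ('B', 'D'): 2, ('C', 'D'): 2}
```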

  39. Algorithm: Initial Phase
Document 1: (A,11) (B,12) (C,13) (D,14) (E,15)
Document 2: (P,21) (B,22) (C,23) (D,24) (K,25)
Document 3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
Document 4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
Document 5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
Document 6: (R,61) (H,62) (K,63) (L,64) (M,65)

  40. Algorithm: Initial Phase • The following pairs of words are found (with max gap = 2). E.g., AB occurs in doc 1 ([11-12]) and in doc 3 ([31-32]), while AE is infrequent (in [11-15] the 3 intervening words exceed the max gap).

AB 2   BE 3   CK 3   EL 1   HM 1   PC 3
AC 2   BH 1   CL 1   EM 1   KE 1   PD 2
AD 1   BK 2   CN 1   EN 1   KL 2   PK 1
AH 1   CD 4   DE 2   HD 1   KM 2   RH 1
BC 5   CE 3   DK 2   HK 2   LM 2   RK 1
BD 4   CH 1   DN 1   HL 1   PB 3   RL 1

  41. Algorithm: Discovery Phase
Input: a gram set Grams2 containing the frequent pairs (A, B)
Output: the set Max of maximal frequent phrases
k := 2; Max := ∅
While Gramsk is not empty
  For all grams g ∈ Gramsk
    If the gram g is not a subphrase of some m ∈ Max
      If the gram g is frequent
        max := Expand(g)
        Max := Max ∪ {max}
        If max = g
          Remove {g} from Gramsk
      Else
        Remove {g} from Gramsk
  Prune(Gramsk)
  Join the grams of Gramsk to form Gramsk+1
  k := k + 1
Return Max

  42. Algorithm: Expansion Step
Input: a phrase p
Output: a maximal frequent phrase p' such that p is a subphrase of p'
Repeat
  Let l be the length of the sequence p
  Find a sequence p' such that the length of p' is l+1 and p is a subsequence of p'
  If p' is frequent
    p := p'
Until there exists no frequent p'
Return p
Note! All the possibilities to expand have to be checked: tail, front, and middle!
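
A sketch of the expansion step (ours; for brevity, frequency is counted by plain in-order subsequence matching without the gap constraint). Trying every insertion point covers the front, middle, and tail expansions mentioned in the note:

```python
def is_subsequence(p, words):
    it = iter(words)
    return all(w in it for w in p)

def frequency(p, docs):
    return sum(1 for d in docs if is_subsequence(p, d))

def expand(p, docs, vocab, sigma):
    """Grow phrase p greedily into a maximal frequent phrase."""
    grown = True
    while grown:
        grown = False
        for i in range(len(p) + 1):          # insertion point: front, middle, tail
            for w in vocab:
                candidate = p[:i] + (w,) + p[i:]
                if frequency(candidate, docs) >= sigma:
                    p, grown = candidate, True
                    break
            if grown:
                break                         # rescan from the grown phrase
    return p

docs = [("p", "b", "c", "d"), ("p", "b", "c", "e"), ("b", "c", "d")]
print(expand(("b",), docs, vocab=["b", "c", "d", "e", "p"], sigma=2))
# ('p', 'b', 'c'): expanded at the front (p) and at the tail (c)
```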

  43. Algorithm: Expansion Step
1: (A,11) (B,12) (C,13) (D,14) (E,15)
2: (P,21) (B,22) (C,23) (D,24) (K,25)
3: (A,31) (B,32) (C,33) (H,34) (D,35) (K,36)
4: (P,41) (B,42) (C,43) (D,44) (E,45) (N,46)
5: (P,51) (B,52) (C,53) (K,54) (E,55) (L,56) (M,57)
6: (R,61) (H,62) (K,63) (L,64) (M,65)
Frequent pairs: AB AC BC BD BE BK CD CE CK DE DK HK KL KM LM PB PC PD
Expansions: AB => ABC => ABCD (ABCDE and ABCDK are not frequent), BE => BCE => BCDE

  44. Example • Maximal frequent sequences after the first expansion step:
AB => ABC => ABCD
BE => BCE => BCDE
BK => BDK => BCDK
KL => KLM
PD => PBD => PBCD
HK (already maximal as such)

  45. Example • 3-grams after join:
ABC ABD ABE ABK ACD ACE
ACK BCD BCE BCK BDE BDK
CDE CDK PBC PBD PBE PBK
PCD PCE PCK PDE PDK BKL
BKM CKL CKM DKL DKM KLM
(On the slide, italics + underlining mark the grams already contained in a found maximal phrase.)
• New maximal frequent sequences: PBE => PBCE, PBK => PBCK

  46. Example • 3-grams after the second expansion step:
ABC ABD ACD BCD
BCE BCK BDE BDK
CDE CDK PBC PBD
PBE PBK PCD PCE
PCK
• 4-grams after join:
ABCD ABCE ABCK ABDE
ABDK ACDE ACDK BCDE
BCDK PBCD PBCE PBCK
PBDE PBDK PCDE PCDK

  47. Algorithm: Pruning Step • After the expansion step, every gram is a subsequence of some maximal sequence • Any maximal sequence m not yet found has to contain grams from two or more other maximal sequences, or grams from one maximal sequence m' in a different order than in m' • For each gram g: check whether g can join grams of maximal sequences in a new way => extract the sequences that are frequent and not yet included in any maximal sequence; mark the grams involved • Remove the grams that are not marked

  48. Pruning After the 1st Exp. Step • The gram BC occurs in the maximal sequences ABCD, BCDE, BCDK, PBCD • Prefixes: A, P • Suffixes: D, DE, DK • Check the strings ABCDE, ABCDK, PBCDE, PBCDK • Is there a subsequence that is frequent and not included in any maximal sequence?
ABCDE:
- ABC: ABCD (maximal), ABCE (not frequent)
- BCD: BCDE (maximal), ABCD (known)
- BCE: ABCE (known)

  49. Pruning After the 1st Exp. Step
PBCDE:
- PBC: PBCD (maximal), PBCE (frequent, not in any maximal sequence)
- BCD: BCDE (maximal), PBCD (known)
- BCE: PBCE (known)
PBCDK:
- PBC: PBCD (maximal), PBCK (frequent, not in any maximal sequence)
...
Marked: PB, BC, CE, CK. All the other grams are removed.
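
The candidate generation of this example can be sketched as follows (function names are ours): collect the prefixes and suffixes that surround a gram in the known maximal sequences, combine them crosswise, and keep the combinations that are not already covered by a maximal sequence:

```python
def is_subseq(p, q):
    it = iter(q)
    return all(w in it for w in p)

def pruning_candidates(gram, maximal):
    """Strings 'prefix + gram + suffix' to check, as in the BC example above."""
    prefixes, suffixes = set(), set()
    for m in maximal:
        for i in range(len(m) - len(gram) + 1):
            if m[i:i + len(gram)] == gram:
                prefixes.add(m[:i])
                suffixes.add(m[i + len(gram):])
    combos = {p + gram + s for p in prefixes for s in suffixes}
    # drop candidates already contained in a known maximal sequence
    return {c for c in combos if not any(is_subseq(c, m) for m in maximal)}

maximal = [tuple("ABCD"), tuple("BCDE"), tuple("BCDK"), tuple("PBCD")]
print(sorted("".join(c) for c in pruning_candidates(tuple("BC"), maximal)))
# ['ABCDE', 'ABCDK', 'PBCDE', 'PBCDK']
```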

  50. Algorithm: Implementation • Data structures: • A table mapping each pair to its exact occurrences in the text • A table mapping each prefix to the grams that have this prefix • A table mapping each suffix to the grams that have this suffix • A table mapping each pair to the indexes of the maximal sequences within which it is a subsequence • An array of maximal sequences • Document identifiers are attached to the grams and the occurrences (a sketch follows below)
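
A minimal dictionary-based rendering of these tables (ours; the original implementation may organize them differently):

```python
from collections import defaultdict

class GramIndex:
    """Dictionary-based sketch of the tables listed above."""
    def __init__(self):
        self.occurrences = defaultdict(list)  # pair -> [(doc_id, pos_a, pos_b), ...]
        self.by_prefix = defaultdict(set)     # first word -> grams starting with it
        self.by_suffix = defaultdict(set)     # last word -> grams ending with it
        self.in_maximal = defaultdict(set)    # pair -> indexes into self.maximal
        self.maximal = []                     # array of maximal sequences

    def add_pair(self, pair, doc_id, pos_a, pos_b):
        self.occurrences[pair].append((doc_id, pos_a, pos_b))
        self.by_prefix[pair[0]].add(pair)
        self.by_suffix[pair[-1]].add(pair)

index = GramIndex()
index.add_pair(("unfair", "practices"), doc_id=1, pos_a=83, pos_b=86)
print(index.by_prefix["unfair"])  # {('unfair', 'practices')}
```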
