
Efficient Algorithms for Mining Semi-structured Data

Efficient Algorithms for Mining Semi-structured Data. Joint work with Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Setsuo Arikawa (Kyushu Univ.). Outline. Efficient Text Data Mining Fast and Robust Text Mining Algorithm (ALT'98, ISSAC'98, DS'98)


Presentation Transcript


  1. Efficient Algorithms for Mining Semi-structured Data Joint work with Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Setsuo Arikawa (Kyushu Univ.)

  2. Outline
     Efficient Text Data Mining
     • Fast and Robust Text Mining Algorithm (ALT'98, ISSAC'98, DS'98)
     • Efficient Text Index for Data Mining (CPM'01, CPM'02)
     • Text Mining on External Storage (PAKDD'00)
     • Applications
       • Interactive document browsing
       • Keyword discovery from the Web
     Towards Semi-structured Data Mining
     • Efficient Frequent Tree Miner (SDM'02, PKDD'02)
     • Mining Semi-structured Data Streams (ICDM'02)
     Information Extraction from Web (GI'00, FLAIRS'01)
     Conclusion

  3. Semi-structured Data
     • Large amounts of semi-structured data are available on networks
       • XML data [W3C 00], Web/HTML pages
     • There is demand for mining useful information from large semi-structured databases (Semi-structured Data Mining)
     Example XML document (components: tag, attribute, text):
     <people> <person age="25" id="608"> <name>John</name> <email>john@abc.com</email> </person> <person id="609"> <name>Mary</name> <tel>555-4567</tel> </person> </people>
     [Figure: the same document drawn as a labeled tree with nodes people, person, name, email, tel, @age, @id and #text]

  4. Goal of this study
     • An efficient algorithm for finding frequent substructures in semi-structured data
       • Labeled ordered trees (as graphs; not a set of paths)
       • Frequent pattern discovery [Agrawal 1994]
       • Efficient for long patterns [Bayardo 1997]
     [Figure: example pattern tree with node labels P, B, FONT, @color, @face, #text, blue, Times; Minsup = 5(%)]

  5. Model of Semi-structured Data
     • Labeled ordered trees
       • Each node has a label, which corresponds to:
         • Markup tag
         • Attribute & value
         • Text string
       • The children of a node are ordered from left to right by the sibling relation
       • Each node can have an unbounded number of children (unranked)
     • Labeled (unordered) trees
     • Labeled graphs
     Example XML document:
     <people> <person age="25" id="608"> <name>John</name> <email>john@abc.com</email> </person> <person id="609"> <name>Mary</name> <tel>555-4567</tel> </person> </people>
     [Figure: the same document drawn as a labeled tree with nodes people, person, name, email, tel, @age, @id and #text]
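The labeled ordered tree model on slide 5 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `to_labeled_tree` and `show` are hypothetical, attributes are assumed to become `@name` nodes with the value as a single child, and text strings are assumed to become leaf labels (the convention that slide 24 appears to use).

```python
import xml.etree.ElementTree as ET

XML = """<people>
  <person age="25" id="608">
    <name>John</name>
    <email>john@abc.com</email>
  </person>
  <person id="609">
    <name>Mary</name>
    <tel>555-4567</tel>
  </person>
</people>"""


def to_labeled_tree(elem):
    """Return (label, children) for an element; children keep document order."""
    children = []
    # Assumption: an attribute becomes an '@name' node whose only child is its value.
    for name, value in elem.attrib.items():
        children.append(("@" + name, [(value, [])]))
    # Assumption: non-blank text becomes a leaf labeled with the text itself.
    if elem.text and elem.text.strip():
        children.append((elem.text.strip(), []))
    for child in elem:
        children.append(to_labeled_tree(child))
        # Text that follows a child element (mixed content) stays in order.
        if child.tail and child.tail.strip():
            children.append((child.tail.strip(), []))
    return (elem.tag, children)


def show(node, depth=0):
    """Return the tree as indented lines, one label per line."""
    label, children = node
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(show(child, depth + 1))
    return lines


tree = to_labeled_tree(ET.fromstring(XML))
print("\n".join(show(tree)))
```

Because children are stored in a Python list, the left-to-right sibling order and the unranked property (any number of children) come for free.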

  6. What is Semi-structured Data Mining?
     Finding characteristic subgraphs (patterns) in a given set of labeled trees or graphs
     • Characteristic pattern
       • Frequent pattern: a pattern occurring in many graphs
       • Optimized pattern: a pattern distinguishing two different sets of graphs
     [Figure: example pattern tree with node labels P, B, FONT, @color, @face, #text, blue, Times; Minsup = 5(%)]

  7. History of Semi-structured Data Mining
     [Timeline from ~1995 to 2002]
     • Finding subgraphs by the MDL principle: Subdue [Holder et al. (KDD’94)]
     • Finding frequent paths [Wang and Liu (KDD’97)]
     • Finding semi-structured schema [Nestorov, Abiteboul et al. (SIGMOD’98)]
     • Finding frequent subgraphs: AGM [Inokuchi et al. (PKDD’00)]
     • Finding frequent subgraphs: FSG [Kuramochi et al. (ICDM’01)]
     • Finding frequent ordered trees: FREQT [Asai et al. (SDM’02)], Treeminer [Zaki (KDD’02)]
     • Finding frequent subgraphs: [Vanetik, Gudes, et al. (ICDM’02)], gSpan [Yan and Han (ICDM’02)]

  8. Efficient Algorithms for Discovering Frequent Labeled Ordered Trees
     FREQT [Asai et al. (SDM’02, PKDD’02)]
     • Efficient enumeration of labeled ordered trees using the rightmost expansion technique.
     • Incremental updating of rightmost leaf occurrences.
     TreeminerV [Zaki (SIGKDD’02)]
     • The enumeration technique is the same as ours.
     • The counting method is different from ours.
     • Developed independently of our work.

  9. Tree Matching
     Pattern tree T matches a data tree D (T occurs in D) if there is a matching function f from T into D:
     • f is 1-to-1.
     • f preserves the parent-child relation.
     • f preserves the (indirect) sibling relation.
     • f preserves labels.
     [Figure: pattern tree T and data tree D; two regions of D are marked P1 and P2]

  10. The occurrences of a pattern
      • A root occurrence of T: the node to which the root of T maps by a matching function
      • The root count of T: the number of distinct root occurrences of T in D
      Root occurrence list: OccD(T) = {2, 8}
      [Figure: pattern tree T and data tree D with nodes numbered 1 to 11; two regions of D are marked P1 and P2]
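The matching conditions and the root occurrence list can be checked with a small brute-force sketch. It is not FREQT's counting method, only a direct test of the definition. Two things are assumptions here: the data tree D is rebuilt from the (depth, label) stream printed on slide 23, and the pattern T is read from the figure residue as a node A with children B and C. The names `build`, `matches` and `root_occurrences` are hypothetical.

```python
# Data tree D as a preorder list of (depth, label); node i is the i-th pair (1-based).
STREAM = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
          (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]


def build(stream):
    """Return (labels, children) dicts keyed by 1-based preorder node id."""
    labels, children, path = {}, {}, []
    for i, (depth, label) in enumerate(stream, start=1):
        labels[i], children[i] = label, []
        del path[depth:]                 # keep only ancestors at depths 0..depth-1
        if path:
            children[path[-1]].append(i)
        path.append(i)
    return labels, children


def matches(pat, v, labels, children):
    """True if pattern (label, kids) can be mapped with its root on data node v."""
    label, kids = pat
    if labels[v] != label:
        return False
    pos = 0                              # index of the next unused child of v
    for kid in kids:
        # Greedy left-to-right scan: keeps sibling order and never reuses a child.
        while pos < len(children[v]) and not matches(kid, children[v][pos], labels, children):
            pos += 1
        if pos == len(children[v]):
            return False
        pos += 1
    return True


def root_occurrences(pat, labels, children):
    """All data nodes that some matching function maps the pattern root to."""
    return [v for v in sorted(labels) if matches(pat, v, labels, children)]


labels, children = build(STREAM)
T = ("A", [("B", []), ("C", [])])        # assumed pattern: A with children B, C
print(root_occurrences(T, labels, children))   # [2, 8]
```

Under these assumptions the result agrees with the root occurrence list {2, 8} on the slide. Node 4 is also labeled A but has no children, so it is not a root occurrence.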

  11. Algorithm FREQT
      • Stage 1: Compute F1.
      • Stage k: Compute Fk from Fk-1 (k = 2, 3, …).
        • Compute k-patterns by the rightmost expansion (Ck from Fk-1).
        • Update their rightmost-leaf occurrences.
        • Select the frequent k-patterns in Ck (Fk from Ck).
      Fk: the set of all frequent k-patterns. Ck: the candidate set for Fk. (k-patterns = patterns of size k.)

  12. Rightmost Expansion
      • (d, l)-expansion T of tree S
        • The tree T obtained by attaching a new node k to the rightmost branch of S.
        • k is the rightmost leaf of T.
        • (d, l): depth and label of k
      [Figure: the rightmost expansion of a (k-1)-pattern S; the new node k with label l is attached at depth d on the rightmost branch]
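A minimal sketch of rightmost expansion, assuming a pattern is stored as its preorder list of (depth, label) pairs with the root at depth 0. With that encoding, attaching a node to the rightmost branch is just appending one pair whose depth lies between 1 and the depth of the current rightmost leaf plus 1. The function names are hypothetical.

```python
def rightmost_expansions(pattern, alphabet):
    """All (d, l)-expansions of `pattern`, a preorder list of (depth, label) pairs."""
    if not pattern:
        # The empty tree can only be expanded by a root at depth 0.
        return [[(0, label)] for label in alphabet]
    last_depth = pattern[-1][0]          # depth of the current rightmost leaf
    return [pattern + [(d, label)]
            for d in range(1, last_depth + 2)
            for label in alphabet]


def enumerate_trees(max_size, alphabet):
    """Enumerate every labeled ordered tree with 1..max_size nodes exactly once."""
    level, out = [[]], []                # start from the empty tree
    for _ in range(max_size):
        level = [t for s in level for t in rightmost_expansions(s, alphabet)]
        out.extend(level)
    return out


trees = enumerate_trees(3, "AB")
for size in (1, 2, 3):
    print(size, sum(1 for t in trees if len(t) == size))
```

Each tree has exactly one predecessor (drop its last pair), so no tree is generated twice. For two labels the counts per size are 2, 4 and 16, which equals the number of ordered tree shapes (1, 1, 2) times the number of labelings.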

  13. Ordered tree enumeration tree [Asai et al., SDM’02; Zaki, SIGKDD’02]
      A generalization of the set enumeration tree [Bayardo 97] to ordered trees
      • The root is the empty tree ⊥.
      • Each node is an ordered tree, and has its (d, l)-expansions as its children.
      [Figure: enumeration tree over the labels A and B, rooted at ⊥, with edges labeled by expansions such as (0,A), (0,B), (1,A), (1,B), (2,A), (2,B)]

  14. Incremental Computation of the Rightmost-Leaf Occurrences
      For the (p, A)-expansion of a (k-1)-pattern T:
      • Scan the list of old rightmost-leaf occurrences.
      • For each old occurrence x:
        • Go upward to the (p-1)th parent h of x.
        • Starting from h, scan its proper younger siblings.
        • Add the siblings with label A to the new rightmost-occurrence list.
      [Figure: pattern T with its new node k, and data tree D showing an old occurrence x, its ancestor h, and the proper younger siblings of h]
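The update step on slide 14 can be sketched as follows. Note that p here counts upward from the old rightmost leaf, unlike the depth d used on slide 12. The slide only describes p ≥ 1; treating p = 0 as "scan the children of x" follows my recollection of the FREQT paper and should be read as an assumption, as should the use of a `seen` set for duplicate removal. The data tree is again rebuilt from the stream on slide 23, and `index_tree` and `update_rmo` are hypothetical names.

```python
STREAM = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
          (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]


def index_tree(stream):
    """Return (labels, parent, children) keyed by 1-based preorder node id."""
    labels, parent, children, path = {}, {}, {}, []
    for i, (depth, label) in enumerate(stream, start=1):
        labels[i], children[i] = label, []
        del path[depth:]
        parent[i] = path[-1] if path else None
        if path:
            children[path[-1]].append(i)
        path.append(i)
    return labels, parent, children


def update_rmo(rmo, p, label, labels, parent, children):
    """Rightmost-leaf occurrences of the (p, label)-expansion, from the old list."""
    new, seen = [], set()
    for x in rmo:
        if p == 0:
            # Assumption: the new node hangs below the old rightmost leaf.
            candidates = children[x]
        else:
            h = x
            for _ in range(p - 1):       # go up to the (p-1)th parent of x
                h = parent[h]
            if parent[h] is None:
                candidates = []          # the root has no siblings
            else:
                siblings = children[parent[h]]
                candidates = siblings[siblings.index(h) + 1:]   # proper younger siblings
        for y in candidates:
            if labels[y] == label and y not in seen:
                seen.add(y)
                new.append(y)
    return new


labels, parent, children = index_tree(STREAM)
rmo_a = [v for v in sorted(labels) if labels[v] == "A"]            # pattern A
rmo_ab = update_rmo(rmo_a, 0, "B", labels, parent, children)       # pattern A(B)
rmo_abc = update_rmo(rmo_ab, 1, "C", labels, parent, children)     # pattern A(B, C)
print(rmo_ab, rmo_abc, sorted({parent[y] for y in rmo_abc}))
```

Starting from the single-node pattern A, two expansions give the pattern A(B, C) without ever rescanning the whole data tree; the parents of its rightmost-leaf occurrences are nodes 2 and 8, consistent with OccD(T) = {2, 8} on slide 10 under the same reading of T.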

  15. Performance Study of FREQT
      • Dataset: citeseers
      • Minimum support: s = 3.0(%), fixed
      • Data size increased from 0.3MB to 5.6MB
      [Plot: runtime (sec) vs. # of nodes; labeled point: 178,285 nodes, 1.39 sec]

  16. Algorithm Comparison
      [Plot: run time (sec) vs. minimum support (%)]
      • At s = 2(%): 37.1 (sec), 3.29 (sec), 1.15 (sec)
      • Annotations: 3 times faster (by pruning), 10 times faster (by DD)

  17. Experiment: FREQT
      Frequent Substructure Discovery from the Web
      <a href="_"> <font color="#6F6F6F"> #text_1 </font> </a> <p> #text_2 <b> #text_3 <!-- CITE--> <font color="green"> #text_4 </font> #text_5 </b> #text_6 <br /><br /> <font color="#999999"> #text_7 <i> #text_8 </i> #text_9 </font> </p>
      • Effective for schema discovery
        • DataGuide [Widom, Garcia-Molina et al. (VLDB’97)]

  18. Optimized Pattern Discovery
      Algorithm OPTT [Abe et al. (PKDD’02)]
      • Finds patterns that are
        • frequent in positive data, and
        • infrequent in negative data.
      • Applicable to classification of trees and graphs
      [Figure: a pattern P splits the positive data and the negative data into matched and unmatched parts]

  19. Experiment: OPTT
      Optimized Pattern Discovery from XML Data
      • Pos Data: Action movie ×15
      • Neg Data: Family movie ×15
      Examples of optimized patterns (Algorithm OPTT):
      • #Occ: Pos 10, Neg 0: <movie> <certification> <certif> sweden:15 </certif> </certification> </movie>
      • #Occ: Pos 1, Neg 12: <movie> <title /> <genre> animation </genre> </movie>
      • Effective for classification of semi-structured data

  20. Optimized Rule/Pattern Discovery
      [Figure: a pattern splits the population N (sample S) into population N1 (S1) and population N0 (S0); the impurity function ψ is applied to M1/N1 and M0/N0]
      Evaluation function for pattern T:
      G_{S,ψ}(T) = (N1/N) ψ(M1/N1) + (N0/N) ψ(M0/N0)
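As a worked instance of the evaluation function, the sketch below assumes that N1 and N0 are the numbers of documents matched and not matched by the pattern, that M1 and M0 are the numbers of positive documents on each side, and that the impurity function is the Gini index 2p(1-p). The slide does not name a specific impurity function, so Gini is only one possible choice, and `split_impurity` is a hypothetical name. The numbers are loosely modeled on the first pattern of slide 19 (15 positive and 15 negative documents, 10 positive and 0 negative matches).

```python
def gini(p):
    """Gini index for a binary class distribution with positive ratio p."""
    return 2.0 * p * (1.0 - p)


def split_impurity(n1, m1, n0, m0, psi=gini):
    """(N1/N) * psi(M1/N1) + (N0/N) * psi(M0/N0); lower means a purer split."""
    n = n1 + n0
    total = 0.0
    for size, positives in ((n1, m1), (n0, m0)):
        if size:                         # an empty side contributes nothing
            total += (size / n) * psi(positives / size)
    return total


# Matched side: 10 documents, all positive.
# Unmatched side: 20 documents, of which 5 are positive.
score = split_impurity(n1=10, m1=10, n0=20, m0=5)
baseline = split_impurity(n1=30, m1=15, n0=0, m0=0)   # no split at all
print(round(score, 4), round(baseline, 4))
```

With these numbers the matched side is pure, so only the unmatched side contributes: (20/30) · 2 · 0.25 · 0.75 = 0.25, compared with 0.5 for the unsplit population.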

  21. Theoretical results
      • Theorem: The algorithm OPTT solves the maximum agreement problem for labeled ordered trees in average time O(k^k b^k N).
        (Note: A straightforward algorithm has superlinear time complexity when the number of labels grows with N.)
      • Theorem: If the maximum size k of subwords is unbounded, then for any ε > 0 there exists no polynomial-time (770/767 - ε)-approximation algorithm for the maximum agreement problem for labeled ordered trees of unbounded size on an unbounded label alphabet if P ≠ NP.
      Proc. SIAM Data Mining 02 (2001), and Proc. PKDD'02 (2002)
      Algorithm details

  22. Mining Semi-structured Data Streams
      • Emerging applications on the Internet
        • E.g. network monitoring, web management, e-commerce
      • Not a static collection but a transient data stream
        • Unbounded, rapid, continuous, time-varying
      • Traditional data mining methods cannot be directly applied.
      Mining algorithm for semi-structured data streams: StreamT [Asai et al. (ICDM’02)]
      SAX event stream:
      <moviedb><movie><title>Godfather</title><year>1972</year><directed_by><person><name>Francis Ford Coppola </name> <birth_name> Francis Ford Coppola </birth_name> <date_of_birth> <day> 7 April </day> <year> 1939 </year> <locate> Detroit, Michigan, USA </locate> </date_of_birth> <mini_biography> He was born in 1939 in Detroit, USA, but he grew up in a New York </mini_biography> <sometimes_credited> Thomas Colchart </sometimes_credited> <sometimes_credited> Francis Coppola </sometimes_credited> <filmography> <Producer> <title> Assassination Tango (2002) </title> <title>Pumpkin (2002) </title><title>No Such Thing (2001)</title> <title>Another Day (2001) (TV) </title> <title> Jeepers Creepers (2001)</title> <title>CQ (2001) </title> <title> Sleepy Hollow (1999)</title> <title> Goosed (1999/I) </title> <title>Third Miracle, The (1999) </title> <title>Virgin Suicides, The (1999) </title> <title>Florentine, The (1999) </title> <title>Lanai-Loa (1998) </title> <title> "First Wave" (1998) </title> <title> Moby Dick (1998) (TV) </title> <title> Outrage (1998) (TV) </title> <title> Buddy (1997) </title> ……

  23. Semi-structured Data Stream
      • (v1, v2, …, vi, …) ∈ (N × L)^∞
      • vi = (di, li): depth and label of node i
      (depth, label) pair representation: the semi-structured data stream w.r.t. the data tree D is
      (0,R), (1,A), (2,B), (2,A), (2,C), (3,B), (1,C), (2,A), (3,B), (3,C), (2,B)
      [Figure: data tree D rooted at R with nodes numbered 1 to 11]
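The (depth, label) pair representation is the preorder traversal of the data tree with each node's depth attached. A minimal sketch, assuming the tree is given as nested (label, children) tuples; the shape of `D` below is reconstructed from the stream on the slide, and `to_stream` is a hypothetical name.

```python
def to_stream(node, depth=0):
    """Yield (depth, label) pairs of a (label, children) tree in preorder."""
    label, children = node
    yield (depth, label)
    for child in children:
        yield from to_stream(child, depth + 1)


# Data tree D, reconstructed from the stream shown on the slide.
D = ("R", [
    ("A", [("B", []), ("A", []), ("C", [("B", [])])]),
    ("C", [("A", [("B", []), ("C", [])]), ("B", [])]),
])

stream = list(to_stream(D))
print(stream)
```

Since `to_stream` is a generator, the pairs can be consumed one at a time, which is the access pattern an online algorithm has to live with.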

  24. Example
      Each (depth, label) pair in a stream corresponds to an open parenthesis in the XML data.
      XML data:
      <moviedb> <movie> <title> Godfather </title> <year> 1972 </year> <directed_by> <person> <name> Francis Ford Coppola </name> <birth_name> Francis Ford Coppola </birth_name> <date_of_birth> <day> 7 April </day> <year> 1939 </year> . . .
      (depth, label) pair representation:
      (0, moviedb), (1, movie), (2, title), (3, "Godfather"), (2, year), (3, "1972"), (2, directed_by), (3, person), (4, name), (5, "Francis Ford Coppola"), (4, birth_name), (5, "Francis Ford Coppola"), (4, date_of_birth), (5, day), (6, "7 April"), (5, year), (6, "1939"), . . .
      [Figure: tree with nodes moviedb, movie, title, year, directed_by, person, name, birth_name, date_of_birth]

  25. Offline vs. Online
      • FREQT (offline): horizontal scan (level-wise search)
      • StreamT (online): vertical scan
      [Figure: for each algorithm, a grid of data positions 1, 2, …, i, …, n against pattern sizes 1, 2, …, k]

  26. Related Work: Online Data Mining
      • Brin et al. [SIGMOD’97]
        • Dynamic Itemset Counting.
        • Mining association rules using fewer scans.
      • Hidber [SIGMOD’99]
        • Carma: online mining of association rules from transaction data streams.
      • Manku & Motwani [VLDB’02]
        • Approximate online mining algorithms for frequent items and itemsets from data streams.
        • New candidate management policy with a provable space complexity.

  27. Main Result
      • StreamT: an online algorithm for finding frequent labeled ordered trees from semi-structured data streams
      • Techniques
        • Plain sweeping technique
        • Adaptive candidate management
      • Extensions to various online models
      • Theoretical and Empirical Analysis

  28. Sweep branch SB
      • The unique path from the root to the current node vi
      • The algorithm sweeps the sweep branch SB rightwards
        • Records the occurrences of the candidate patterns on SB
        • Uses root and bottom occurrences
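Maintaining the sweep branch from a (depth, label) stream needs only a stack: when a pair of depth d arrives, everything at depth d or deeper is no longer an ancestor and is popped. The sketch below shows just this bookkeeping, not StreamT's recording of pattern occurrences; `sweep_branches` is a hypothetical name and the input is the stream from slide 23.

```python
def sweep_branches(stream):
    """Yield the sweep branch (labels from the root to the current node) per pair."""
    branch = []
    for depth, label in stream:
        del branch[depth:]               # pop nodes that are no longer ancestors
        branch.append(label)
        yield list(branch)


STREAM = [(0, "R"), (1, "A"), (2, "B"), (2, "A"), (2, "C"), (3, "B"),
          (1, "C"), (2, "A"), (3, "B"), (3, "C"), (2, "B")]

branches = list(sweep_branches(STREAM))
for i, branch in enumerate(branches, start=1):
    print(i, "/".join(branch))
```

The branch never holds more nodes than the depth of the tree plus one, so the memory needed for this part is independent of the length of the stream.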

  29. Tree sweeping technique # Occurrences 4

  30. Tree sweeping technique # Occurrences 0 sweep branch

  31. Tree sweeping technique # Occurrences 0

  32. Tree sweeping technique # Occurrences 1

  33. Tree sweeping technique # Occurrences 2

  34. Tree sweeping technique # Occurrences 3

  35. Tree sweeping technique # Occurrences 4

  36. Tree sweeping technique # Occurrences 4

  37. Tree sweeping technique # Occurrences 5

  38. Tree sweeping technique # Occurrences 6

  39. Tree sweeping technique # Occurrences 7

  40. Bottom occurrences of a pattern
      Pattern T ∈ C
      • Property: the intersection of an embedding of a pattern and the SB forms a chain of nodes.
      • We record only the pair of the root and the bottom occurrences of a pattern T, instead of the whole occurrence of T.
      [Figure: the left-half tree Di and the sweep branch SBi; the intersection of SB and the tree runs from the root occurrence vR to the bottom occurrence vB]

  41. Sweep Branch Stack B
      • Property: The d-th bucket of the sweep branch stack B keeps all candidate patterns that have a bottom occurrence at depth d on the current sweep branch.
      [Figure: the i-th sweep branch stack B beside the i-th branch SB(i), labeled "Time (i-1)"; a (k-1)-pattern S sits in the bucket for depth d]

  42. How to maintain the SB-stack: Summary
      • The next node has depth d.
      • Classify the candidates by their bottom depth (0, …, d-1, d, …).
      [Figure: the SB-stack at time i-1 and at time i, with Cases 1 to 4 marked and the actions REMOVE, UNCHANGE, CHANGE, UNCHANGE and RIGHTMOST-EXPANSION]

  43. Online candidate management policy [Hidber’99]
      Observation (Monotonicity): A pattern is frequent only if its predecessor is frequent.
      • Initially, insert all patterns of size 1 into C.
      • Predecessor of T is frequent => insert T into C
      • Predecessor of T is infrequent => delete T from C
      [Figure: the set F of frequent patterns of stage i and the set C of candidate patterns of stage i]

  44. Various online models
      • Basic model [Hidber 99]
        • Unsuitable for tracking rapid trend changes
      • Sliding window model [Mannila et al. 95]
        • Window size w: the window covers times i-w+1 to i
      • Forgetting model [Yamanishi et al. 00]
        • Forgetting factor γ: the data at time j has weight γ^(i-j) at the current time i
      [Figure: one time axis per model, from past to now (time i)]
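The three online models differ only in how past observations are weighted. The sketch below compares them on a toy 0/1 sequence in which a pattern occurs at every step of the first half and never afterwards. It is an illustration only: the function names are hypothetical, and normalizing the forgetting model by the sum of the weights is an assumption, since the slide gives only the weight γ^(i-j).

```python
def basic_freq(hits):
    """Basic model: every past observation counts equally."""
    return sum(hits) / len(hits)


def window_freq(hits, w):
    """Sliding window model: only the last w observations count."""
    recent = hits[-w:]
    return sum(recent) / len(recent)


def forgetting_freq(hits, gamma):
    """Forgetting model: the observation at time j has weight gamma**(i - j)."""
    num = den = 0.0
    for h in hits:                       # oldest first; each step ages older terms
        num = gamma * num + h
        den = gamma * den + 1.0          # assumed normalization: sum of weights
    return num / den


# The pattern occurs at each of the first 8 steps, then disappears for 8 steps.
hits = [1] * 8 + [0] * 8

print(basic_freq(hits))
print(window_freq(hits, w=4))
print(forgetting_freq(hits, gamma=0.5))
```

On this input the basic model still reports a frequency of 0.5 after the pattern has vanished, which illustrates why it is unsuitable for tracking rapid trend changes, while the window and forgetting estimates are at or near zero.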

  45. Online Scalability
      • Minimum support: s = 1.0(%)
      [Plot: runtime (sec) vs. # of nodes]

  46. [Figure annotations: mem = 575MB; mem = 64MB; 307,286; 455,161; 1,646 = 149,521 + 307,286 - 455,161]

  47. Experiments: Performance and effectiveness of forgetting
      • Data size: 130MB
      • # of nodes: 3,185,138
      • # of labels: 72
      [Plots: "Performance" and "Effectiveness of forgetting", each with the series Basic and Forgetting; axis marks: 1,348 (sec), 3,200,000 (nodes)]
