A Short Introduction to Sequential Data Mining
Koji IWANUMA, Hidetomo NABESHIMA
University of Yamanashi
The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence
Two Main Frameworks of Sequential Mining
• Sequential pattern mining for multiple data sequences
• Sequential pattern mining for a single data sequence
What Is Sequential Pattern Mining?
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
• Given a set of sequences, find the complete set of frequent subsequences.
• A sequence: <(ef)(ab)(df)cb>
• A sequence database [table from the slide not preserved]
• An element may contain a set of items. Items within an element are unordered and are listed alphabetically.
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
• Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern.
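The containment and support notions above can be sketched in a few lines of Python. This is only an illustration: the helper names (`parse`, `is_subsequence`, `support`) are hypothetical, and the four-sequence database is an assumed example, since the table on the slide was not preserved. Only the first and third sequences appear in the slide text.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]   # one element = an unordered set of items
Seq = List[Element]        # a sequence = an ordered list of elements


def parse(text: str) -> Seq:
    """Parse compact notation such as 'a(abc)(ac)d(cf)' into a list of itemsets."""
    text = text.replace(" ", "")
    elements: Seq = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """Greedy matching: each pattern element must be a subset of some
    sequence element, and the matched elements must keep their order."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(pattern: Seq, database: Sequence[Seq]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


# Assumed example database (not recovered from the slide).
database = [parse(s) for s in
            ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))  # True
print(support(parse("(ab)c"), database))                           # 2
```

With min_sup = 2, a support of 2 makes <(ab)c> a sequential pattern in this assumed database.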
Challenges on Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases.
• A mining algorithm should
  • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  • be highly efficient and scalable, involving only a small number of database scans
  • be able to incorporate various kinds of user-specific constraints
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
Sequential Pattern Mining Algorithms for Multiple Data Sequences
• Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96)
• Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01)
• Vertical format-based mining: SPADE (Zaki @ Machine Learning’00)
• Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02)
• Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03)
J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji
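To give a feel for the pattern-growth idea behind PrefixSpan, here is a minimal sketch under a strong simplifying assumption: every element is a single item, so each sequence is written as a plain string. The function name `prefixspan` and the toy database are hypothetical. The published algorithm also handles itemset elements and uses more economical projection techniques, none of which is shown here.

```python
from collections import Counter
from typing import Dict, List


def prefixspan(database: List[str], min_sup: int) -> Dict[str, int]:
    """Return every frequent subsequence with its support.
    Simplification: each element of a sequence is a single item (one character)."""
    results: Dict[str, int] = {}

    def grow(prefix: str, projected: List[str]) -> None:
        # Count each item at most once per projected suffix.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue  # infrequent extensions are never grown further
            pattern = prefix + item
            results[pattern] = count
            # Projected database: what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, new_projected)

    grow("", database)
    return results


# Toy database, assumed for illustration only.
print(prefixspan(["abc", "acb", "ab"], min_sup=2))
# {'a': 3, 'ab': 3, 'ac': 2, 'b': 3, 'c': 2}
```

Each recursive call scans only the projected suffixes, which is what lets pattern-growth methods avoid generating and testing candidates against the whole database.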
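Mining Sequential Patterns from a Very Long Single Sequence
[Figure: a series of daily newspaper articles forms a single long sequence < ... typhoon ... (flood, landslide) ... typhoon ... (flood, landslide) ... >, from which the pattern <typhoon(flood,landslide)> is extracted.]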
Sequential Pattern Mining Algorithms for a Single Data Sequence
• Discovery of frequent episodes in event sequences, based on a sliding window system [Mannila 1998]:
  • The frequency measure is anti-monotonic, but it has a problem: a single occurrence can be counted more than once.
• Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]:
  • No anti-monotonic frequency measure has been investigated.
• On-line approximation algorithms for mining frequent items, not frequent subsequences:
  • Lossy counting algorithm [Manku and Motwani, VLDB’02]
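The duplicate-counting problem of window-based frequency can be seen in a small sketch. It assumes one event per time step and counts only windows that lie fully inside the sequence, which is a simplification of the original window definition. The names `occurs_in` and `window_frequency` are hypothetical.

```python
from typing import Sequence


def occurs_in(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if the events of a serial episode appear in the window in order
    (not necessarily contiguously)."""
    it = iter(window)
    return all(event in it for event in episode)


def window_frequency(episode: Sequence[str], events: Sequence[str], width: int) -> int:
    """Number of sliding windows of the given width that contain the episode.
    Assumption: only windows fully inside the sequence are counted."""
    return sum(
        occurs_in(episode, events[start:start + width])
        for start in range(len(events) - width + 1)
    )


events = ["x", "a", "b", "y", "z"]
# The episode a -> b occurs exactly once, yet two windows contain it:
# (x, a, b) and (a, b, y).
print(window_frequency(["a", "b"], events, width=3))  # 2
```

The count depends on how many windows overlap an occurrence, so one occurrence contributes more than once.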
Research in Our Laboratory
• Sequential data mining from a very large single data sequence.
• Main target: sequential textual data, especially newspaper article corpora
  • Objective: to generate a robust and useful large-scale corpus of event sequences.
  • Application 1: topic tracking/detection in information retrieval.
  • Application 2: automated content tracking on the Web.
  • Application 3: semi-automatic creation of scenarios/stories.
• Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.
Technical Topics (1/2)
• A new framework for extracting frequent subsequences from a single long data sequence, in IEEE Inter. Conf. on Data Mining 2005 (ICDM2005):
  • A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting.
  • A fast on-line algorithm for some limited cases.
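For reference, the Apriori (anti-monotonic) property can be written as follows. The symbols are notation assumed here, not taken from the ICDM 2005 paper: freq is any frequency measure, and α ⊑ β means that α is a subsequence of β.

```latex
\alpha \sqsubseteq \beta \;\Longrightarrow\; \mathit{freq}(\alpha) \ge \mathit{freq}(\beta)
```

Consequently, if a sequence is infrequent, none of its supersequences can be frequent, so they can be pruned without being counted.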
Technical Topics (2/2)
Ongoing and future work
• On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from the system output
• Methods for finding meta-structures embedded in the huge number of frequent sequences generated by a system
  • A method using compression based on context-free grammar inference/learning
• A faster extraction algorithm based on a method for simultaneously searching multiple strings over compressed data
References:
• Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj