
A Short Introduction to Sequential Data Mining

A Short Introduction to Sequential Data Mining. Koji IWANUMA Hidetomo NABESHIMA University of Yamanashi The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence. Two Main Frameworks of Sequential Mining.


Presentation Transcript


  1. A Short Introduction to Sequential Data Mining • Koji IWANUMA, Hidetomo NABESHIMA (University of Yamanashi) • The First Franco-Japanese Symposium on Knowledge Discovery in System Biology, September 17, Aix-en-Provence

  2. Two Main Frameworks of Sequential Mining • Sequential pattern mining for multiple data sequences • Sequential pattern mining for a single data sequence

  3. What Is Sequential Pattern Mining? • Given a set of sequences, find the complete set of frequent subsequences. • A sequence: <(ef)(ab)(df)cb>. • A sequence database: [table not reproduced in this transcript]. • An element may contain a set of items; items within an element are unordered and are listed alphabetically. • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>. • Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern. (Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji)
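
The subsequence and support definitions on this slide can be sketched in Python as follows. This is only a minimal illustration: the helper names (`parse`, `is_subsequence`, `support`) are hypothetical, and the four-sequence database is an assumed example, since the slide's own database table is not reproduced in this transcript.

```python
from typing import FrozenSet, List, Sequence

Element = FrozenSet[str]   # one element = an unordered set of items
Seq = List[Element]        # a sequence = an ordered list of elements


def parse(text: str) -> Seq:
    """Parse the slide notation, e.g. 'a(abc)(ac)d(cf)', into a list of item sets."""
    text = text.replace(" ", "")
    seq: Seq = []
    i = 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """Greedy left-to-right match: each pattern element must be contained
    in a distinct element of the sequence, in the same order."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(pattern: Seq, database: Sequence[Seq]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


# Assumed example database (not taken from the transcript).
database = [parse(s) for s in (
    "a(abc)(ac)d(cf)",
    "(ad)c(bc)(ae)",
    "(ef)(ab)(df)cb",
    "eg(af)cbc",
)]

print(is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)")))
print(support(parse("(ab)c"), database))
```

With this assumed database, the pattern <(ab)c> is contained in two of the four sequences, so it meets min_sup = 2.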

  4. Challenges in Sequential Pattern Mining • A huge number of possible sequential patterns are hidden in databases. • A mining algorithm should: • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold; • be highly efficient and scalable, involving only a small number of database scans; • be able to incorporate various kinds of user-specific constraints. (Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji)

  5. Sequential Pattern Mining Algorithms for Multiple Data Sequences • Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT’96) • Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD’00; Pei et al. @ ICDE’01) • Vertical format-based mining: SPADE (Zaki @ Machine Learning’00) • Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB’99; Pei, Han, Wang @ CIKM’02) • Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM’03) (Source: J. Han and M. Kamber. Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji)
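
To give a feel for the pattern-growth idea behind PrefixSpan, here is a deliberately simplified Python sketch. It assumes every element is a single item (no item sets) and copies projected suffixes instead of using pseudo-projection, so it should not be read as the published algorithm; the function name `prefixspan` and the toy database are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def prefixspan(database: List[List[str]], min_sup: int) -> Dict[Tuple[str, ...], int]:
    """Return every frequent sequential pattern together with its support.

    Simplification: each sequence is a list of single items.
    """
    patterns: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each item at most once per projected suffix.
        counts: Dict[str, int] = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue  # infrequent extension: prune this branch
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Projection: keep only what follows the first occurrence of `item`.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(new_prefix, new_projected)

    grow((), database)
    return patterns


# Toy database, assumed for illustration only.
toy_db = [list("abcb"), list("acb"), list("bca")]
result = prefixspan(toy_db, min_sup=2)
for pattern, sup in sorted(result.items()):
    print("".join(pattern), sup)
```

Each recursive call scans only the projected suffixes of the current prefix, which is the reason pattern-growth methods avoid the repeated candidate generation of Apriori-style approaches such as GSP.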

  6. Mining Sequential Patterns from a Very Long Single Sequence • Input: a series of daily newspaper articles, viewed as one long sequence in which "typhoon" is repeatedly followed by "flood, landslide". • Extracted pattern: <typhoon(flood,landslide)>

  7. Sequential Pattern Mining Algorithms for a Single Data Sequence • Discovery of frequent episodes in event sequences, based on a sliding window [Mannila 1998]: the frequency measure is anti-monotonic, but it counts a single occurrence more than once (duplicate counting). • Asynchronous periodic pattern mining [Yang et al. 2000, Huang 2004]: no anti-monotonic frequency measure has been investigated. • On-line approximation algorithms for mining frequent items, not frequent subsequences: the lossy counting algorithm [Manku and Motwani, VLDB’02].
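
The duplicate-counting problem of window-based frequency can be illustrated with a small Python sketch. It assumes a simplified setting in the spirit of the sliding-window approach: one event per time step, serial episodes only, and every window that overlaps the sequence is counted. The function names and the event list are hypothetical.

```python
from typing import Sequence


def occurs_in(episode: Sequence[str], window: Sequence[str]) -> bool:
    """True if `episode` appears in `window` as an ordered,
    not necessarily contiguous, subsequence."""
    it = iter(window)
    return all(event in it for event in episode)


def window_frequency(episode: Sequence[str], events: Sequence[str], width: int) -> int:
    """Number of windows of the given width that contain the episode.

    Windows may hang over either end of the sequence, so every event
    is covered by exactly `width` windows.
    """
    n = len(events)
    count = 0
    for start in range(-(width - 1), n):
        window = events[max(start, 0):min(start + width, n)]
        if occurs_in(episode, window):
            count += 1
    return count


# Assumed toy sequence: "typhoon" followed by "flood" occurs exactly once.
events = ["typhoon", "flood", "rain", "sun", "sun", "sun"]
episode = ("typhoon", "flood")

print(window_frequency(episode, events, width=3))
print(window_frequency(episode, events, width=4))
print(window_frequency(("typhoon",), events, width=3))
```

Although the episode occurs only once in the toy sequence, its window frequency is greater than one and grows with the window width, which is the duplicate counting mentioned above. The sub-episode ("typhoon",) is never less frequent than the full episode, which reflects the anti-monotonic property.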

  8. Research in Our Laboratory • Sequential data mining from a very large single data sequence. • Main target: sequential textual data, especially newspaper article corpora. • Objective: to generate a robust and useful large-scale corpus of event sequences. • Application 1: topic tracking/detection in information retrieval. • Application 2: automated content tracking on the Web. • Application 3: semi-automatic scenario/story creation. • Ordinary temporal data analysis: various log data in computer systems, genetic information, etc.

  9. Technical Topics (1/2) • A new framework for extracting frequent subsequences from a single long data sequence, presented at the IEEE International Conference on Data Mining 2005 (ICDM 2005): • A new rational frequency measure, which satisfies the Apriori (anti-monotonic) property and has no duplicate counting. • A fast on-line algorithm for a limited case.

  10. Technical Topics (2/2): Ongoing and Future Work • On-line rational filters, based on confidence criteria and/or information gain, for eliminating redundant, valueless sequences from the system output. • Methods for finding meta-structures embedded in the huge number of frequent sequences generated by the system: • A method using compression based on context-free grammar inference/learning. • A faster extraction algorithm based on a method for simultaneously searching for multiple strings over compressed data.

  11. References: • Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). www.cs.uiuc.edu/~hanj

  12. Thanks for your attention!!
