Panagiotis Papapetrou, George Kollios, Stan Sclaroff, Dimitrios Gunopulos

Discovering Frequent Arrangements of Temporal Intervals Panagiotis Papapetrou, George Kollios, Stan Sclaroff, Dimitrios Gunopulos Department of Computer Science Boston University University of California, Riverside

Introduction and Motivation • Sequential pattern mining has received particular attention in the last decade: • Database of sequences: ordered lists of instantaneous events. • Extract frequent sequential patterns. • In many applications events occur over time intervals. • Extracting frequent arrangements of these temporally correlated labeled intervals may lead to useful observations. • So far, algorithms concentrate on the case where events occur instantaneously. • The idea of mining temporal patterns of interval-based events introduced in [1]. • However, the extracted patterns are restricted to certain forms. 1. P. Kam and A. W. Fu. “Discovering temporal patterns of Interval-based Events”. In Proc. of the DaWak, pages 317–326, London, UK, 2000. Springer-Verlag.

Applications (1/4)Linguistics • ASL Database • Collections of utterances. • Utterance: • Associates a segment of video with a detailed transcription. • Contains a number of ASL gestural and grammatical fields each one occurring over a time interval. (WH-Word) (WH-Word) (Eye-brow Lower) (Rapid head-shake) (WH-Question) time

Applications (2/4)Linguistics (An example) > Who drove the car, who?

Applications (3/4)Networks B A Router 1 Router 2 IPs IPs C D (A, B) (D, C) (D, B) time

Applications (4/4)Biology Human Gene (Nucleodite A) (Nucleodite C) (Nucleodite G) Position in the Gene

Main Contributions • Formal definition of the problem of mining frequent temporal arrangements of intervals in an interval database. • Development of two efficient mining algorithms that use breadth first and depth search techniques in an enumeration tree of temporal arrangements. • Extensive experimental evaluation and comparison with a standard sequential pattern mining method both on real and synthetic datasets.

Outline • Preliminaries • Problem Formulation • Proposed Algorithms • BFS-based • DFS-based • Hybrid-DFS • Experimental Evaluation • Related Work • Conclusions • Future Work

Preliminaries (1/6) • There can be many types of relations between two event intervals2. • We consider five of them: 2. J. F. Allen and G. Ferguson. “Actions and events in interval temporal logic”. Technical Report 521, The University of Rochester, July 1994”.

Preliminaries (2/6) • Let S = {E1, E2, …, Em} be an ordered set of event intervals, called event interval sequence, or e-sequence. • Each Eiis a triple (ei, tistart, tiend) • ei: an event label. • tistart:: the event start time. • tiend:: the event end time. Note: S is ordered by tistart. • k-e-sequence: an e-sequence of size k. • e-sequence databaseD: a set of e-sequences.

Preliminaries (3/6) • Example of a 5-e-sequence: S= { (A,1,7), (B,3,19), (D,4,30), (C,7,15), C,23,42) }

Preliminaries (4/6) • k-Arrangement: a set of k temporally correlated events in an e-sequence, denoted as A = {E , R}, where: • E : the set of labels of the event intervals in the arrangement. • R : the set of temporal relations between the events in E. • where is the temporal relation between Eiand Ej.

Preliminaries (5/6) • Given an e-sequence S and an arrangement A = {E , R}: • S contains A, if all the events in E appear in S, with the relations defined in R. • Given an e-sequence database D and a minimum support threshold min_sup: • An arrangement A is frequent, if it is contained in at least min_sup e-sequences (i.e. records) of D.

Preliminaries (6/6) • Example of an arrangement A, contained in an e-sequence S: S= {(A,1,7), (B,3,19), (D,4,30), (C,7,15), C,23,42)}

Problem Formulation Our Goal: Find the complete set of frequent arrangements given: • A e-sequence database D. • A minimum support threshold min_sup.

Apply a sequential pattern mining algorithm? • Consider start and end points of an interval as two instantaneous events. • Convert each e-sequence into a regular sequence. • Apply an efficient sequential pattern mining algorithm + post-processing. • Basic drawbacks: • k-arrangement = sequence of 2k events. May produce 22k patterns. Can we reduce it to 2k? • Extracted patterns will carry lots of redundant information. {Astart, Bstart, Aend, Bend}, but also: {Astart, Bstart},…

Frequent Arrangement Mining Algorithms • Use a logical Tree-like structure to enumerate the arrangements3. • Traverse the Tree using: • BFS • DFS • Hybrid DFS • BFS for the first two levels. • DFS for the rest of the mining process. 3. R. J. Bayardo. “Efficiently mining long patterns from databases”. In Proc. of ACM SIGMOD, pages 85–93, 1998.

The Arrangement Enumeration Tree Let LEVEL 1 LEVEL 2 Intermediate LEVEL 3 Intermediate

BFS-based Approach (1/2) • Traverses Tree in BFS order. • 2 database scans. • On each step k: • Build candidate k-arrangements based on (k-1)-arrangements. • Find 2-relations by scanning the second level of the Tree. • Determine frequency: min_sup threshold must be satisfied. • If a node is not frequent, do not expand sub-tree (Apriori Principle)4. • Stop at step k, where no frequent arrangements are found. 4. R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In Proc. of VLDB, pages 487-499, 1994.

BFS-based Approach (2/2)An Example

DFS-based Approach • Candidate generation in DFS order. • Leads to frequent large arrangements faster. • Skips expansions of nodes that are definitely going to lead to frequent arrangements. • DFS is inappropriate: • For each node we would have to scan the database multiple times to detect the 2-relations among the items in the node. • Hybrid-DFS • Generates the first two levels of the Tree using BFS, then uses DFS. • Eliminates multiple database scans, 2-relations are available.

Experimental Setup (1/4)Real Datasets • SignStream Database • Created by the National Center for Sign Language and Gesture Resources at Boston University. • Collection of 884 utterances. • Some types of event labels: • Grammatical or syntactic structures: • WH-Question. • Negation. • Yes/No Question. • Gestural Fields: • Head-shake. • Eye-brow raise/lower.

Experimental Setup (2/4)Real Datasets • Network Data • Sampled from flow data. • Two routers with high communication rate: • ATLA: router in Atlanta. • LOSA: router in LA. • Monitored communication for 10 days, between 200 IPs. • An e-sequence is a set of IP connections for every 15 minutes: • An event label is the two IPs (source-destination). • The interval corresponds to the duration of this communication. • Size of dataset: 960 e-sequences.

Experimental Setup (3/4)Synthetic Datasets • Generated considering the following factors: • Number of e-sequences in the Database. • Average e-sequence size. • Number of distinct items. • Density of frequent patterns.

Experimental Setup (4/4)Algorithms • Compared: • BFS. • Hybrid-DFS. • SPAM5, modified as follows: • Considered the start and end points of each interval as two instantaneous events. • Post-processed the extracted sequential patterns to convert them into arrangements. 5. J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using a bitmap representation. In Proc. of ACM SIGKDD, pages 429–435, 2002.

Performance Analysis • BFS outperforms SPAM in large database sizes and small supports. • Hybrid-DFS outperforms both SPAM and BFS. • In low supports Hybrid-DFS is twice as fast as BFS.

Sample Results (1/4) SignStream Database

Sample Results (2/4) SignStream Database • Negations: • YES/NO Questions:

Sample Results (3/4) SignStream Database • WH-questions: For more detailed results visit the following web page: http://cs-people.bu.edu/panagpap/Research/asl_mining.htm

Sample Results (4/4)Network Dataset

Related Work • Problem of mining frequent itemsets and association rules was first introduced by R. Agrawal and R. Srikant, 1994. • An extension to episodes (i.e. combinations of events with a partially specified order) was proposed in [H. Mannila et. al. 1995]. • Some efficient sequential pattern mining algorithms have been proposed in [M. Zaki. et. al. 2001] and [J. Ayres et. al. 2002]. • Closed sequential pattern mining algorithms presented in [X. Yan. et. al. 2003] and [J.Wang et. al. 2004]. • Interval-based events introduced in [P. Kam et. Al. 2000], however restricted to certain forms + apriori-based.

Conclusions • The problem of mining frequent arrangements of temporal intervals has been formally defined. • Two efficient methods for solving the problem have been discussed. • Both methods use an arrangement enumeration tree to discover the set of frequent arrangements. • The DFS-based approach further improves performance over BFS: • Longer arrangements are reached faster. • The need to examine smaller subsets of these arrangements is eliminated.

Future Work • Push additional constraints into the mining process: • Gap constraints. • Regular expression constraints. • Mine top-k frequent arrangements. • Mine frequent closed arrangements. • Apply other interestingness measures.

EXTRA SLIDES

Applications • American Sign Language databases • Extract frequent correlations between grammatical and syntactic structures + manual and gestural fields. • Network Monitoring: • Analyze packet and router logs. • Detect temporal relations of events occurring over time periods. • Patterns can be used for prediction and intrusion detection. • Biology: • Find frequent overlapping regions of nucleotides in the human gene.

Apply a closed sequential pattern mining algorithm*? • Noise again… {Astart, Bstart, Aend, Bend}: 2/3 But also: {Astart, Aend, Bend}: 3/3 *. J.Wang and J. Han. “Bide: Efficient mining of frequent closed sequences”. In Proc. of IEEE ICDE, pages 79–90, 2004.

The ISIdList Structure (1/2) • An ISIdList is defined for every arrangement generated throughout the mining process. • The ISIdList for an arrangement A = { , R} in an e-sequence database D, has the following structure: • Head: Arrangement representation using and R. • A record for each e-sequence in the database that supports A. • Each record is of type (id, intv-List), where: • id is the id of the e-sequence in D. • intv-List: • set of intervals where A occurs in the e-sequence A (for | | ≤ 2). • set of pointers to records of ISIdLists of the second level (for | | > 2).

The ISIdList Structure (2/2) (Example) • Let D consist of a set or e-sequences of event intervals with labels A, B, C. • The set of frequent 1 arrangements is {A, B, C}, with the following ISIdLists:

BFS-based Approach At each Step k: • Use Tree to generate candidate arrangements: • Build N(k) from N(k-1). • Construct IMk. For every 2-relation, point to the second level of the Tree. • Check support. If it satisfies min_sup, then add to Fk. • Continue with the rest of the Tree in a BFS order. • If a node is found not to be frequent, do not expand its sub-tree (Apriori Principle)3. • Stop at step k, where Fk = empty. 3. R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In Proc. of VLDB, pages 487-499, 1994.

BFS-based Approach (1/4) • D: an input e-sequence database. • F: the complete set of frequent arrangements. • Fk: the complete set of frequent k-arrangements. • Ck: the current set of candidate k-arrangements. • min_sup: the minimum support threshold. • ISIdList (A): the ISIdList of arrangement A.

BFS-based Approach (2/4) • BFS: STEP 1: Find F1 • Use Tree to generate C1 • Build N(1). • For each ni1 in N(1): • Build ISIdList (Ai), where Ai is the arrangement that corresponds to ni1. • If the number of records in ISIdList (Ai) is at least min_sup, then A is inserted into F1.

BFS-based Approach (3/4) • BFS: STEP k: Find Fk • Use Tree to generate Ck • Build N(k) from N(k-1). • Construct IMk. • For each node in IMk: • Build ISIdList. • If the number of records in the ISIdList is at least min_sup, insert arrangement into F1. • Continue with the rest of the Tree in a BFS order.

BFS-based Approach (4/4) • Continue with the rest of the Tree in a BFS order. • If a node is found not to be frequent, do not expand its sub-tree (Apriori Principle)1. • Stop at step k, where Fk = empty. 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In proc. of VLDB, pages 487-499, 1994.

Hybrid DFS-based Approach • DFS is inappropriate: • For each node we would have to scan the database multiple times to detect the 2-relations among the items in the node. • Though in BFS these relations are already available. • Generate the first two levels of the Tree using BFS. • Then use DFS. • Eliminates multiple database scans, since now the 2-relations are available.

BFS-based Approach (1/2) Creating a 2-arrangement (Example)

BFS-based Approach (2/2)Creating a 3-arrangement (Example)

Experimental SetupReal Datasets • Dataset 1: • Utterances of WH-Questions. • Size: 73 e-sequences. • # of labels: 400. • Dataset 2: • SignStream Database. • Size: 884 e-sequences. • # of labels: 400.

Related Work (1/2) • Problem of frequent itemset mining first introduced in: • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In proc. of VLDB, pages 487-499, 1994. • An extension to episodes (i.e. combinations of events with a partially specified order) was proposed in: • H. Mannila, H. Toivonen, and A. Verkamo. Discovering Frequent episodes in sequences. In Proc. of ACM SIGKDD, pages 210–215, 1995. • The Itemset Enumeration Tree was described in: • R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. of ACM SIGMOD, pages 85–93, 1998. • Some efficient sequential pattern mining algorithms have been proposed in: • M. Zaki. Spade: An efficient algorithm for mining sequences. Machine Learning, 40:31–60, 2001. • J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using a bitmap representation. In Proc. of ACM SIGKDD, pages 429–435, 2002.

Related Work (2/2) • Closed frequent itemset mining algorithms: • J. Pei, J. Han, and R.Mao. Closet: An efficient algorithm form ining frequent closed itemsets. In Proc. of DMKD, pages 11–20, 2000. • M. Zaki and C. Hsiao. Charm: An efficient algorithm for closed itemset mining. In Proc. of SIAM, pages 457–473, 2002. • Closed sequential pattern mining: • X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in large databases. In Proc. of SDM, 2003. • J.Wang and J. Han. Bide: Efficient mining of frequent closed sequences. In Proc. of IEEE ICDE, pages 79–90, 2004. • Mining association rules in temporal and spatio-temporal databases: • T. Abraham and J. F. Roddick. Incremental meta-mining from large temporal data sets. In ER ’98: Proceedings of the Workshops on Data Warehousing and Data Mining, pages 41–54, 1999. • X. Chen and I. Petrounias. Mining temporal features in association rules. In Proc. of PKDD, pages 295–300, London, UK, 1999. Springer-Verlag. • I. Tsoukatos and D. Gunopulos. Efficient mining of spatiotemporal patterns. In Proc. of the SSTD, pages 425–442, 2001. • Discovering temporal patterns of Interval-based Events: • P. Kam and A. W. Fu. Discovering temporal patterns of Interval-based Events. In Proc. of the DaWak, pages 317–326, London, UK, 2000. Springer-Verlag.

Panagiotis Papapetrou, George Kollios, Stan Sclaroff, Dimitrios Gunopulos

Panagiotis Papapetrou, George Kollios, Stan Sclaroff, Dimitrios Gunopulos

Presentation Transcript

Stan Getz

Answering Top-k Queries Using Views By Gautam Das,Dimitrios Gunopulos , Nick Koudas , Dimitris Tsirogiannis

DAMI: Introduction to Data Mining Panagiotis Papapetrou

Dimitrios Hatzinakos, Ph.D. , P.Eng

Odysseas Papapetrou 18 April 2011

Stan Ahalt

Dimitrios Gkatzoflias dgkatzof@auth.gr

Dimitrios P. Bogdanos

Dimitrios Stimoniaris

EMINEM STAN

George Pallis Athena Vakali Konstantinos Stamos Antonis Sidiropoulos Dimitrios Katsaros

Dimitrios Gkatzoflias dgkatzof@auth.gr

Stan Smits

Panagiotis Papapetrou*, Gary Benson ** and George Kollios * *Department of Computer Science

stan fomin

Dimitrios P. Bogdanos

Dimitrios Stimoniaris

Dimitrios Vlachopoulos Malmo, 13.10.2014