
OLAP on Sequence Data

Published in SIGMOD 2008, Vancouver, Canada. Authors: Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung. Presenter: Chun Kit Chui (Kit), The University of Hong Kong, ckchui@cs.hku.hk.


Presentation Transcript


  1. OLAP on Sequence Data. Published in SIGMOD 2008, Vancouver, Canada. Authors: Eric Lo, Ben Kao, Wai-Shing Ho, Sau-Dan Lee, Chun Kit Chui and David W. Cheung. Presenter: Chun Kit Chui (Kit), The University of Hong Kong, ckchui@cs.hku.hk

  2. OLAP on Sequence Data • Problem Motivation • Sequence Data Cube and Cuboids • New OLAP operations • System architecture • Experimental evaluations • Future work

  3. OLAP on Sequence Data • Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. • Examples: stock market data (U.S. OIL FUND ETF, MEXCO ENERGY CORP) and Web server access logs.

  4. Sequence Data: Web server access logs (Web retailer selling sportswear products) • The product dimension is associated with a concept hierarchy in which the finest level of abstraction is product ID, followed by product type and brand.

  5. Sequence Data: Web server access logs • From the access logs we can trace back the browsing sequences of all members. • Browsing sequence of member 688: Nike shoes, Adidas shoes, Nike shoes.

  6. Sequence Data: Web server access logs • Manager: “I would like to know the number of members that did comparison shopping and their distributions over all product web page to product web page pairs within 2008 Quarter 1.”

  7. Sequence Data: pattern template • The query refers to a particular kind of pattern in the browsing sequences. • The comparison shopping semantics can be expressed by the pattern template <X, Y, X>.

  8. Sequence Data: instantiated pattern • <Nike shoes, Adidas shoes, Nike shoes> is one of the instantiations of the pattern template <X, Y, X>. • Since the browsing sequence of member 688 contains the pattern, the sequence contributes a count of 1 to the corresponding cell.
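The idea of instantiating a pattern template can be sketched in a few lines of Python. This is only an illustration, not the system from the paper: the function names `bind_template` and `contains_template` are hypothetical, matching is assumed to be over contiguous windows (substring semantics), and different symbols are assumed to be allowed to bind to the same value.

```python
def bind_template(template, window):
    """Return {symbol: value} if `window` instantiates `template`, else None."""
    if len(template) != len(window):
        return None
    binding = {}
    for symbol, value in zip(template, window):
        # The same symbol must map to the same value everywhere it appears.
        if binding.setdefault(symbol, value) != value:
            return None
    return binding


def contains_template(template, sequence):
    """True if any contiguous window of `sequence` instantiates `template`."""
    k = len(template)
    return any(
        bind_template(template, sequence[i:i + k]) is not None
        for i in range(len(sequence) - k + 1)
    )


# Browsing sequence of member 688 from the slides.
browsing = ["Nike shoes", "Adidas shoes", "Nike shoes"]
binding = bind_template(("X", "Y", "X"), browsing)
print(binding)
print(contains_template(("X", "Y", "X"), browsing))
```

Here the binding plays the role of the cell coordinates: X and Y are the two product pages being compared.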

  9. Sequence Data • The aggregated number of members is counted and a tabulated view of the sequence data should be returned.

  10. Sequence OLAP system • Support “pattern based” grouping and aggregation. • The manager's query is sent to the Sequence OLAP system; the aggregated number of members is counted and a tabulated view of the sequence data is returned as the result.

  11. Sequence OLAP system: follow-up query • Support “pattern based” grouping and aggregation. • Obtain query results in real time (OLAP feature). • Manager: “So many members did comparison shopping between Nike shoes and Adidas shoes. I would like to further investigate whether those members would browse one more product and, if so, what that product is.” • The new query can be expressed by appending a pattern symbol “Z” to form a new pattern template <X,Y,X,Z>. The result shows the statistics of one more browsing step after the comparison shopping between Nike shoes and Adidas shoes.

  12. Sequence OLAP system: follow-up query • The manager finds out that the Adidas T-shirts page is the most popular page among the members who did comparison shopping between the Nike shoes and Adidas shoes pages.

  13. Sequence OLAP system • Provide OLAP operations to ease sequence analysis. • Manager: “The comparison shopping patterns displayed at the “product type” abstraction level are too detailed. I would like to view some higher-level statistics.” • A simple “roll up” operation on the pattern template transforms the summary statistics from the “product type” abstraction level to the “brand” abstraction level, e.g. Nike shoes, Nike T-shirts, Nike basketballs and Nike socks all roll up to Nike.
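A rough sketch of what rolling up the pattern dimension means: each item in a browsing sequence is replaced by its ancestor in the concept hierarchy before patterns are matched. The mapping `BRAND_OF` and the helper `roll_up_sequence` are made up for illustration; the actual hierarchy and implementation are not given in the slides.

```python
# Hypothetical concept hierarchy: product type -> brand.
BRAND_OF = {
    "Nike shoes": "Nike",
    "Nike T-shirts": "Nike",
    "Adidas shoes": "Adidas",
    "Adidas T-shirts": "Adidas",
}


def roll_up_sequence(sequence, hierarchy):
    """Replace every item by its parent in the concept hierarchy."""
    return [hierarchy[item] for item in sequence]


# At the product-type level this sequence does not match <X, Y, X>
# (first and last items differ), but at the brand level it does.
browsing = ["Nike shoes", "Adidas shoes", "Nike T-shirts"]
rolled = roll_up_sequence(browsing, BRAND_OF)
print(rolled)
```

The example also hints at why brand-level counts have to be recomputed from the sequences: a sequence can match at the coarser level without matching at the finer one.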

  14. Sequence OLAP Research Objective • To design and implement an OLAP system that is able to • support “pattern based” grouping and aggregation. • obtain query results in real-time. • Especially optimized for interactive/iterative queries. • provide OLAP operations to ease explorative analysis of sequence data.

  15. Smart card RFID Logs • Radio-frequency identification (RFID) is an automatic identification method that relies on storing and remotely retrieving data using devices called RFID tags. • Smart card systems in public transit • Octopus card in Hong Kong, Orca card in Seattle (2009), etc. • Electronic money • The travel history of passengers is logged in a database. • This generates a massive amount of sequence data.

  16. Smart card RFID Logs (smart card, card reader, event database) • Payment can be done easily by waving the card over the card reader. • The travel history of passengers is logged in the event database, which generates a massive amount of sequence data.

  17. Event Database • Marketing Manager: “The number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4.”

  18. Sequence OLAP system • Support “pattern based” grouping and aggregation. • Obtain query results in real time. • Provide OLAP operations to ease explorative analysis. • Query (Marketing Manager): the number of round-trip passengers and their distributions over all origin-destination station pairs within 2008 Quarter 4. • Result: round-trip statistics (station level), computed from the Event Database.

  19. Sequence Data Cuboid A logical view of sequence data at a particular degree of summarization.

  20. Preliminary • Sequence Cuboid (S-Cuboid) • a logical view of sequence data at a particular degree of summarization. • Sequences can be characterized by • the attribute values of the events in the sequence (e.g. time, spending, product type) • the subsequence/substring patterns they possess (e.g. <X,Y,X>, <X,Y,Y,X>).

  21. Phase 1. Sequence Formation • Event Selection: an event selection step selects a set of relevant records and attributes from the Event Database.

  22. Phase 1. Sequence Formation • Sequence Formation: a sequence formation step forms sequences from the event dataset. • Sequences can be formed per day and for each individual user (User: Individual, Time: Day). This gives a number of daily travel sequences for each user, e.g. S1 is Kit’s trip on Monday.

  23. Phase 1. Sequence Formation • Sequences can also be formed according to the time dimension at the abstraction level of year, per individual user (User: Individual, Time: Year), e.g. Kit’s trips in 2008 instead of Kit’s trip on Monday.
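The sequence formation step described above can be sketched as a group-by over time-ordered events. The event layout `(user, timestamp, station)`, the function name `form_sequences` and the sample data are assumptions made for this illustration only.

```python
from collections import defaultdict
from datetime import datetime


def form_sequences(events, time_level):
    """Group time-ordered events into one sequence per (user, time bucket).

    `time_level` maps a timestamp to the chosen abstraction level,
    e.g. the calendar day or the year.
    """
    sequences = defaultdict(list)
    for user, timestamp, station in sorted(events, key=lambda e: e[1]):
        sequences[(user, time_level(timestamp))].append(station)
    return dict(sequences)


# Made-up event records: (user, timestamp, station).
events = [
    ("Kit", datetime(2008, 10, 6, 8, 0), "Shatin"),
    ("Kit", datetime(2008, 10, 6, 8, 40), "Central"),
    ("Kit", datetime(2008, 10, 7, 9, 0), "Shatin"),
    ("Ben", datetime(2008, 10, 6, 7, 30), "Central"),
]

daily = form_sequences(events, lambda ts: ts.date())   # User: Individual, Time: Day
yearly = form_sequences(events, lambda ts: ts.year)    # User: Individual, Time: Year
print(len(daily), len(yearly))
```

Changing only the `time_level` function switches between the daily and yearly sequences discussed in the two slides.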

  24. Phase 2. S-Cuboid construction • Figure: sequences laid out by User: Individual and Time: Day, e.g. Kit: S1, S100, S388, S1020; Ben: S2, S90, S124, S9230; Shing: S3, S23, S242, S2453; a further row holds S4, S29, S129, S2529. S1 is Kit’s trip on Monday.

  25. Phase 2. S-Cuboid construction • Sequence Grouping: a sequence grouping step groups the sequences that share the same dimension values into a sequence group. E.g. travel sequences are grouped according to their fare groups (User: fare-group, Time: day), so the sequences of individual users such as Kit, Ben and Shing fall into groups such as the Regular group.

  26. Phase 2. S-Cuboid construction • Pattern Grouping (pattern template <X,Y,Y,X>, with X and Y on Location: station): the pattern grouping step further groups the sequences according to the “patterns” they possess.

  27. Phase 2. S-Cuboid construction • Each cell represents an instantiated pattern, e.g. <Shatin, Central, Central, Shatin>. • We assign a sequence to a cell if that sequence contains the instantiated pattern (S1 and S3 in the figure).

  28. Phase 2. S-Cuboid construction • Aggregated Value: finally, an aggregation function is applied to the sequences in each cuboid cell, e.g. the cell containing S1 and S3 has Count: 2.

  29. Phase 2. S-Cuboid construction • The result of sequence grouping, pattern grouping and aggregation is a 4D S-Cuboid.

  30. Phase 2. S-Cuboid construction • The 4D S-Cuboid has two pattern dimensions (X and Y, both Location: station) and two global dimensions (User: fare-group and Time: day).
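The S-Cuboid construction phase walked through above (sequence grouping, pattern grouping, aggregation) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: patterns are matched as contiguous windows, the aggregate is COUNT, each sequence adds at most 1 to a cell, and the names `instantiated_patterns` and `build_s_cuboid` as well as the station "Kowloon" are invented for the example.

```python
from collections import Counter


def instantiated_patterns(template, items):
    """Distinct instantiated patterns of `template` found in `items`."""
    k = len(template)
    found = set()
    for i in range(len(items) - k + 1):
        window = tuple(items[i:i + k])
        binding = {}
        # A window matches if every symbol binds to one consistent value.
        if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
            found.add(window)
    return found


def build_s_cuboid(sequences, template):
    """COUNT of sequences per (global dimension values, instantiated pattern)."""
    cells = Counter()
    for global_dims, items in sequences:
        for pattern in instantiated_patterns(template, items):
            cells[(global_dims, pattern)] += 1
    return cells


# Each sequence carries its global dimension values (fare group, day).
sequences = [
    (("Regular", "Monday"), ["Shatin", "Central", "Central", "Shatin"]),
    (("Regular", "Monday"), ["Kowloon", "Shatin", "Central", "Central", "Shatin"]),
    (("Regular", "Monday"), ["Shatin", "Central"]),
]
cuboid = build_s_cuboid(sequences, ("X", "Y", "Y", "X"))
print(cuboid)
```

The cell key combines the two global dimensions with the pattern dimensions X and Y, which is why the resulting cuboid is four-dimensional.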

  31. Sequence Cuboid query language • The query has three parts: Sequence Formation, Sequence Grouping and Pattern Grouping. • This query specifies the construction of the S-Cuboid that answers the round-trip query in the running example: the number of round-trip passengers and their distributions over all origin-destination station pairs within 2007 Quarter 4. The result is a 4D S-Cuboid.

  32. Sequence Cuboid query language • Sequence Formation: form individual daily travel sequences. • Sequence Grouping: we specify the global dimensions in the sequence grouping step; group the sequences with the same fare-group within the same day. • Pattern Grouping: group the sequences according to the pattern template <X,Y,Y,X>, where X and Y refer to the location dimension at the station abstraction level.

  33. Sequence Cuboid query language • What exactly is a round-trip pattern? The predicates further increase the expressive power of pattern matching in the query language.

  34. Sequence Cuboid query language • The query specifies Sequence Formation, Sequence Grouping (global dimensions) and Pattern Grouping (pattern template, pattern dimensions). • The cell restriction defines how to deal with situations in which a data sequence contains multiple occurrences of a cell’s pattern, e.g. Kit: <Shatin, Central, Central, Shatin, Shatin, Central, Central, Shatin>. • E.g. a sequence contributes a count of 1 whenever we can find one match of the pattern in the sequence.

  35. Sequence Cuboid query language • Any change to the cuboid specification transforms the S-Cuboid into another one. E.g. changing the pattern template to (X,Y,Y,X,Z) generates another S-Cuboid.
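The cell restriction mentioned in the query language slides can be made concrete with Kit's sequence. The sketch below contrasts counting every occurrence with the "one count per sequence" rule; the helper names are hypothetical and matching is again assumed to be over contiguous windows.

```python
def count_occurrences(pattern, sequence):
    """Number of contiguous windows of `sequence` equal to `pattern`."""
    k = len(pattern)
    return sum(
        1
        for i in range(len(sequence) - k + 1)
        if tuple(sequence[i:i + k]) == tuple(pattern)
    )


def cell_contribution(pattern, sequence):
    """Under the restriction, a sequence adds at most 1 to the cell."""
    return 1 if count_occurrences(pattern, sequence) > 0 else 0


kit = ["Shatin", "Central", "Central", "Shatin",
       "Shatin", "Central", "Central", "Shatin"]
round_trip = ("Shatin", "Central", "Central", "Shatin")

print(count_occurrences(round_trip, kit))
print(cell_contribution(round_trip, kit))
```

Kit's sequence contains the round-trip pattern twice, yet it contributes only once to the cell under this restriction.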

  36. Properties of S-Cuboids • Exponential number of S-cuboids • The length of the pattern template is infinite • Pattern Template (X,Y,Y,X,A,B,…) • Non-summarizable Recall that changing the pattern template essentially changes the cuboid specification and thus generates a new cuboid.

  37. Properties of S-Cuboids • In traditional OLAP systems, data are summarizable, i.e. summaries at a finer abstraction level can be used to construct the summary at a higher abstraction level. • Figure (Traditional OLAP): # Sales per day, Mon to Sun (finer summaries), add up to # Sales for the whole week (coarser summary, 7 in the figure). Summarizable!

  38. Properties of S-Cuboids • Infinite number of S-Cuboids • The number of pattern dimensions is infinite • Pattern template (X,Y,Y,X,A,B,…) • Non-summarizable • Figure (Sequence OLAP): the S-Cuboid with pattern template <X,Y,Z> (finer aggregates over the sequence database) has #Sequences = 1 for <A, B, A> and #Sequences = 1 for <A, B, B>; the coarser summary for <A, B> is unknown (?).

  39. Properties of S-Cuboids • Can we compute the S-Cuboid with pattern <X,Y> (coarser summary) from the S-Cuboid with pattern <X,Y,Z> (finer summary) without looking at the sequence database?

  40. Properties of S-Cuboids • The problem is that we don’t know whether the counts in these two patterns (<A, B, A> and <A, B, B>) are generated from the same sequence or from two different sequences.

  41. Properties of S-Cuboids • Traditional OLAP: summarizable. Sequence OLAP: non-summarizable.

  42. Properties of S-Cuboids • Non-summarizable • Coarser aggregates cannot be computed solely from the corresponding finer aggregates.
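The non-summarizability argument can be checked on a toy example: two different sequence databases that yield exactly the same finer S-Cuboid for <X,Y,Z> but different coarser S-Cuboids for <X,Y>. The databases and the helper `cuboid` are constructed for this illustration (contiguous matching, COUNT, one count per sequence per cell); they do not come from the slides.

```python
from collections import Counter


def cuboid(database, template):
    """COUNT of sequences per instantiated pattern (one count per sequence)."""
    k = len(template)
    cells = Counter()
    for seq in database:
        found = set()
        for i in range(len(seq) - k + 1):
            window = tuple(seq[i:i + k])
            binding = {}
            if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
                found.add(window)
        cells.update(found)  # each distinct pattern counts once per sequence
    return cells


db1 = ["ABABB"]              # one sequence containing all three patterns
db2 = ["ABA", "ABB", "BAB"]  # three separate sequences

fine1, fine2 = cuboid(db1, "XYZ"), cuboid(db2, "XYZ")
coarse1, coarse2 = cuboid(db1, "XY"), cuboid(db2, "XY")

print(fine1 == fine2)                              # identical finer cuboids
print(coarse1[("A", "B")], coarse2[("A", "B")])    # different coarser counts
```

Because the finer cuboids are identical while the coarser ones differ, no function of the finer aggregates alone can produce the coarser aggregates; the sequences themselves are needed.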

  43. Properties of S-Cuboids • Exponential number of S-cuboids • The length of the pattern template is infinite • Pattern Template (X,Y,Y,X,A,B,…) • Full materialization is impossible! • Non-summarizable • Coarser aggregates cannot be computed solely from the corresponding finer aggregates. • Partial materialization is infeasible!

  44. Properties of S-Cuboids • Research direction • Precompute some other auxiliary data structures so that queries can be computed online using the pre-built data structures

  45. S-OLAP Specific Operations Assist explorative analysis of the sequence data

  46. S-OLAP specific operations • Navigate between cuboids with ease • Traditional OLAP operations for Global Dimensions • SLICE, DICE, ROLL-UP, DRILL-DOWN, etc. • New S-OLAP operations for Pattern Dimensions / Pattern Template • APPEND(X): (X,Y,Y) → (X,Y,Y,X) • DE-TAIL: (X,Y,Y,X) → (X,Y,Y) • PREPEND(Z): (X,Y,Y,X) → (Z,X,Y,Y,X) • DE-HEAD: (Q,Y,Y,X) → (Y,Y,X) • PATTERN-ROLL-UP(X): (X,Y,Y,X) → (X,Y,Y,X), with X moved to a coarser abstraction level • PATTERN-DRILL-DOWN(X): (X,Y,Y,X) → (x,Y,Y,x), with X moved to a finer abstraction level
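At the level of the pattern template, the four structural operations listed above amount to simple edits of a symbol tuple. The sketch below models only that template manipulation; the function names are hypothetical, and recomputing the resulting S-Cuboid (as well as the roll-up and drill-down operations, which change abstraction levels) is outside its scope.

```python
def append(template, symbol):
    """APPEND: add a pattern symbol at the end of the template."""
    return template + (symbol,)


def de_tail(template):
    """DE-TAIL: drop the last pattern symbol."""
    return template[:-1]


def prepend(template, symbol):
    """PREPEND: add a pattern symbol at the front of the template."""
    return (symbol,) + template


def de_head(template):
    """DE-HEAD: drop the first pattern symbol."""
    return template[1:]


round_trip = ("X", "Y", "Y", "X")
print(append(("X", "Y", "Y"), "X"))
print(de_tail(round_trip))
print(prepend(round_trip, "Z"))
print(de_head(("Q", "Y", "Y", "X")))
```

Each call returns a new template, mirroring the point that any change to the pattern template specifies a different S-Cuboid.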

  47. Sequence OLAP, pattern template <X,Y> • Query: “Tell me the summary statistics of the single-trip travel patterns of passengers among different rail lines, please.” • CUBOID by SUBSTRING(X,Y) WITH X as location at “Rail Lines”, Y as location at “Rail Lines” LEFT-MAXIMALITY (x1, y1) WITH x1.action = “in” AND y1.action = “out”

  48. Sequence OLAP, pattern template <X,Y> • Result: S-Cuboid 1 (10 * 10 cells)

  49. Sequence OLAP, pattern template <X,Y> • S-Cuboid 1 (10 * 10 cells) • Follow-up query: “More detailed statistics of passengers traveling from the Tsuen Wan Line to each of the Island Line stations, please.”

  50. Sequence OLAP, pattern template <X,Y> • Instead of specifying a new S-Cuboid construction query, a SLICE plus a P-DRILL-DOWN(Y) is done, which transforms S-Cuboid 1 (10 * 10 cells) into S-Cuboid 2 (1 * 14 cells).
