Data Mining on Streams. Christos Faloutsos CMU. Dr. Deepay Chakrabarti (Yahoo) Dr. Spiros Papadimitriou (IBM) Prof. ByoungKee Yi (Pohang U.). Prof. Dimitris Gunopulos (UCR) Dr. Mengzhi Wang (Google). Thanks. For more info:. 3h tutorial, at:
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Dr. Spiros Papadimitriou (IBM)
Prof. ByoungKee Yi (Pohang U.)
Prof. Dimitris Gunopulos (UCR)
Dr. Mengzhi Wang (Google)
Thanks(c) C. Faloutsos, 2006
3h tutorial, at:
http://www.cs.cmu.edu/~christos/TALKS/EDBT04tut/faloutsosedbt04.pdf
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
x1 , x2 , … , xt , …
(y1, y2, … , yt, …
… )
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Goal: given a signal (e.g.., #packets over time)
Find: patterns, periodicities, and/or compress
count
lynx caught per year
(packets per day;
temperature per day)
year
(c) C. Faloutsos, 2006
Given xt, xt1, …, forecast xt+1
90
80
70
60
Number of packets sent
??
50
40
30
20
10
0
1
3
5
7
9
11
Time Tick
(c) C. Faloutsos, 2006
E.g.., Find a 3tick pattern, similar to the last one
90
80
70
60
Number of packets sent
??
50
40
30
20
10
0
1
3
5
7
9
11
Time Tick
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Patterns, rules, forecasting and similarity indexing are closely related:
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
E.g.., ‘find stocks similar to MSFT’
Seq. scanning: too slow
How to accelerate the search?
[Faloutsos96]
(c) C. Faloutsos, 2006
Solution: Quickanddirty' filter:
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Even on otherthansequence data:
(c) C. Faloutsos, 2006
Q: How do Spatial Access Methods (SAMs) work?
A: they group nearby points (or regions) together, on nearby disk pages, and answer spatial queries quickly (‘range queries’, ‘nearest neighbor’ queries etc)
For example:
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
very good for compressing real signals
more details on DFT/DCT/DWT: later
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Similarity search in time sequences
1) establish/choose distance (Euclidean, timewarping,…)
2) extract features (SVD, DWT, MDS), and use a SAM (Rtree/variant) or a Metric Tree (Mtree)
2’)for high intrinsic dimensionalities, consider sequential scan (it might win…)
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Goal: given a signal (eg., packets over time)
Find: patterns and/or compress
count
lynx caught per year
(packets per day;
automobiles per hour)
year
(c) C. Faloutsos, 2006
value
Ampl.
time
Freq.
(c) C. Faloutsos, 2006
value
time
(c) C. Faloutsos, 2006
time
Wavelets  DWT(c) C. Faloutsos, 2006
time
Wavelets  DWTfreq
time
(c) C. Faloutsos, 2006
Time
domain
DWT
SWFT
DFT
freq
time
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
x0 x1 x2 x3 x4 x5 x6 x7
s1,0
.......
level 1
d1,1
s1,1
d1,0
+

(c) C. Faloutsos, 2006
x0 x1 x2 x3 x4 x5 x6 x7
s2,0
level 2
d2,0
s1,0
.......
d1,1
s1,1
d1,0
+

(c) C. Faloutsos, 2006
etc ...
x0 x1 x2 x3 x4 x5 x6 x7
s2,0
d2,0
s1,0
.......
d1,1
s1,1
d1,0
+

(c) C. Faloutsos, 2006
Q: map each coefficient
on the timefreq. plane
f
x0 x1 x2 x3 x4 x5 x6 x7
s2,0
d2,0
t
s1,0
.......
d1,1
s1,1
d1,0
+

(c) C. Faloutsos, 2006
Q: map each coefficient
on the timefreq. plane
f
x0 x1 x2 x3 x4 x5 x6 x7
s2,0
d2,0
t
s1,0
.......
d1,1
s1,1
d1,0
+

(c) C. Faloutsos, 2006
# expects a file with numbers
# and prints the dwt transform
# The number of timeticks should be a power of 2
# USAGE
# haar.pl <fname>
my @vals=();
my @smooth; # the smooth component of the signal
my @diff; # the highfreq. component
# collect the values into the array @val
while(<>){
@vals = ( @vals , split );
}
my $len = scalar(@vals);
my $half = int($len/2);
while($half >= 1 ){
for(my $i=0; $i< $half; $i++){
$diff [$i] = ($vals[2*$i]  $vals[2*$i + 1] )/ sqrt(2);
print "\t", $diff[$i];
$smooth [$i] = ($vals[2*$i] + $vals[2*$i + 1] )/ sqrt(2);
}
print "\n";
@vals = @smooth;
$half = int($half/2);
}
print "\t", $vals[0], "\n" ; # the final, smooth component
Haar wavelets  code(c) C. Faloutsos, 2006
Observation1:
‘+’ can be some weighted addition
‘’ is the corresponding weighted difference (‘Quadrature mirror filters’)
Observation2: unlike DFT/DCT,
there are *many* wavelet bases: Haar, Daubechies4, Daubechies6, Coifman, Morlet, Gabor, ...
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
"Prediction is very difficult, especially about the future."  Nils Bohr
http://www.hfac.uh.edu/MediaFutures/thoughts.html
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
90
80
70
60
Number of packets sent
??
50
40
30
20
10
0
1
3
5
7
9
11
Time Tick
(c) C. Faloutsos, 2006
MANUALLY:
remove trends spot periodicities
7 days
time
time
(c) C. Faloutsos, 2006
80
70
??
60
50
40
30
20
10
0
1
3
5
7
9
11
Time Tick
Problem#2: Forecastxt
as a linear function of the past: xt2, xt2, …,
(up to a window of w)
Formally:
(c) C. Faloutsos, 2006
85
‘lagplot’
80
75
70
65
Number of packets sent (t)
60
55
50
45
40
15
25
35
45
Number of packets sent (t1)
(c) C. Faloutsos, 2006
xt
xt1
xt2
(c) C. Faloutsos, 2006
with NO human intervention
on a semiinfinite stream
How to choose ‘w’?(c) C. Faloutsos, 2006
idea: do AR on each wavelet level
in detail:
Answer:(c) C. Faloutsos, 2006
Wl,t1
Wl,t
Wl’,t’2
Wl’,t’1
AWSOM  ideaWl,t l,1Wl,t1l,2Wl,t2 …
Wl’,t’ l’,1Wl’,t’1l’,2Wl’,t’2 …
Wl’,t’
(c) C. Faloutsos, 2006
(incremental)
(incremental; RLS)
(singlepass)
(c) C. Faloutsos, 2006
AWSOM
AR
Seasonal AR
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
Space:OlgN + mk2 OlgN
Time:Ok2 O1
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
http://www.postech.ac.kr/~bkyi/
or christos@cs.cmu.edu
(clone of Splus)
http://cran.rproject.org/
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
http://warsteiner.db.cs.cmu.edu/
http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp
IPto
t=0
IPfrom
(c) C. Faloutsos, 2006
http://warsteiner.db.cs.cmu.edu/
http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp
t=2
t=1
t=0
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
(c) C. Faloutsos, 2006
For code, papers, questions etc:
christos <at> cs.cmu.edu
www.cs.cmu.edu/~christos
(c) C. Faloutsos, 2006