What is the WHT anyway, and why are there so many ways to compute it? Jeremy Johnson. 1, 2, 6, 24, 112, 568, 3032, 16768, … Walsh–Hadamard Transform: y = WHT_N x, N = 2^n. WHT Algorithms: factor WHT_N into a product of sparse structured matrices and compute y = (M_1 M_2 ⋯ M_t) x.
y_t = M_t x
…
y_2 = M_2 y_3
y = M_1 y_2
WHT_2 ⊗ WHT_2 = (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)

WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)

WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2))
[Displayed matrices: the three factors of the WHT_8 factorization written out as sparse 8×8 matrices, each with two ±1 entries per row; the ± signs and bracket layout did not survive extraction.]
Iterative Algorithm: WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)
General factorization: for n = n_1 + ⋯ + n_t,

WHT_{2^n} = ∏_{i=1}^{t} (I_{2^{n_1 + ⋯ + n_{i−1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1} + ⋯ + n_t}})

WHT Implementation (N = N_1 ⋯ N_t with N_i = 2^{n_i}; x_{b,S} denotes the subvector of x with base b and stride S):

R = N; S = 1;
for i = t, …, 1
  R = R / N_i;
  for j = 0, …, R − 1
    for k = 0, …, S − 1
      x_{j·N_i·S + k, S} = WHT_{N_i} · x_{j·N_i·S + k, S};
  S = S * N_i;
162 = 1 0 1 0 0 0 1 0 (binary); 1 + 2 + 4 + 2 = 9 [residue of a worked example; the accompanying diagram was lost in extraction]
T_n = Θ(αⁿ / n^{3/2}), where α = 4 + √8 ≈ 6.8284 (all split trees)
T_n = Θ(5ⁿ / n^{3/2}) (restricted to binary split trees)
WHT algorithms
W(n) = small[n] | split[W(n_1), …, W(n_t)]
Theorem. Let W_N be a WHT algorithm of size N. Then the number of floating point operations (flops) used by W_N is N lg(N).
Proof. By induction.
A_l(n) = number of calls to a base case of size l
.file "s_1.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl apply_small1
.type apply_small1,@function
apply_small1:
	movl 8(%esp),%edx     //load stride S into EDX
	movl 12(%esp),%eax    //load x array's base address into EAX
	fldl (%eax)           //st(0)=x[0]
	fldl (%eax,%edx,8)    //st(0)=x[S], st(1)=x[0]
	fld %st(1)            //st(0)=x[0], st(1)=x[S], st(2)=x[0]
	fadd %st(1),%st       //st(0)=x[0]+x[S]
	fxch %st(2)           //st(0)=x[0], st(2)=x[0]+x[S]
	fsubp %st,%st(1)      //st(0)=x[0]-x[S], st(1)=x[0]+x[S] (gas reverses fsubp operands)
	fxch %st(1)           //st(0)=x[0]+x[S], st(1)=x[0]-x[S]
	fstpl (%eax)          //store x[0]=x[0]+x[S]
	fstpl (%eax,%edx,8)   //store x[S]=x[0]-x[S]
	ret
Instruction counts for the unrolled base cases: 12, 34, 106; 27; 18, 18, 20 [the Greek symbol names and subscripts of these constants were lost in extraction].
Cost(T_n) = min over n_1 + ⋯ + n_t = n of Cost(split[T_{n_1}, …, T_{n_t}]),
where T_n is the optimal tree of size n.
This depends on the assumption that Cost depends only on the size of a tree, not on where it is located (true for IC, but false for runtime).
For IC, the optimal tree is iterative with appropriate leaves.
For runtime, DP is a good heuristic (used with binary trees).
UltraSPARC (optimal trees by size n = lg N):
n = 1–6: [1], [2], [3], [4], [5], [6]
n = 7:  [[3],[4]]
n = 8:  [[4],[4]]
n = 9:  [[4],[5]]
n = 10: [[5],[5]]
n = 11: [[5],[6]]
n = 12: [[4],[[4],[4]]]
n = 13: [[[4],[5]],[4]]
n = 14: [[4],[[5],[5]]]
n = 15: [[[5],[5]],[5]]
n = 16: [[[5],[5]],[6]]
n = 17: [[4],[[[4],[5]],[4]]]
n = 18: [[4],[[4],[[5],[5]]]]
n = 19: [[4],[[[5],[5]],[5]]]
n = 20: [[5],[[[5],[5]],[5]]]
Pentium (optimal trees by size n = lg N):
n = 1–7: [1], [2], [3], [4], [5], [6], [7]
n = 8:  [[4],[4]]
n = 9:  [[5],[4]]
n = 10: [[5],[5]]
n = 11: [[5],[6]]
n = 12: [[2],[[5],[5]]]
n = 13: [[2],[[5],[6]]]
n = 14: [[2],[[2],[[5],[5]]]]
n = 15: [[2],[[2],[[5],[6]]]]
n = 16: [[2],[[2],[[2],[[5],[5]]]]]
n = 17: [[2],[[2],[[2],[[5],[6]]]]]
n = 18: [[2],[[2],[[2],[[2],[[5],[5]]]]]]
n = 19: [[2],[[2],[[2],[[2],[[5],[6]]]]]]
n = 20: [[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]