Parallel and Distributed Computing Lab, first-semester master's student 송지숙

On Computing the Data Cube. Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996.

Contents • Introduction • PipeSort Algorithm • PipeHash Algorithm • Comparing PipeSort and PipeHash • Conclusion

Optimization(1/2) • Smallest-parent • Compute a group-by from the smallest of the previously computed group-bys • Cache-results • To reduce disk I/O, compute other group-bys from a group-by whose result is still held in memory • Amortize-scans • Reduce disk reads by computing as many group-bys as possible together
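The smallest-parent idea can be sketched as follows. This is a minimal illustration, not code from the report: the helper names `smallest_parent` and `aggregate`, the SUM measure, and the sample values are all assumptions. Each computed group-by is held as a dictionary from key tuples to aggregated values, and its size is simply its number of entries.

```python
from collections import defaultdict

def aggregate(rows, parent_attrs, child_attrs):
    """Roll an already-computed group-by up to a child group-by (SUM measure)."""
    idx = [parent_attrs.index(a) for a in child_attrs]
    out = defaultdict(int)
    for key, total in rows.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)

def smallest_parent(child_attrs, computed):
    """Pick the smallest computed group-by that contains every child attribute."""
    candidates = [attrs for attrs in computed if set(child_attrs) <= set(attrs)]
    return min(candidates, key=lambda attrs: len(computed[attrs]))

# Hypothetical results for group-bys AB and AC (values invented for illustration).
computed = {
    ("A", "B"): {("a1", "b1"): 3, ("a1", "b2"): 4, ("a2", "b1"): 5},
    ("A", "C"): {("a1", "c1"): 7, ("a2", "c1"): 5},
}

parent = smallest_parent(("A",), computed)       # AC has fewer rows than AB
group_by_a = aggregate(computed[parent], parent, ("A",))
```

Group-by A could be computed from either AB or AC; reading the smaller parent touches fewer rows and gives the same result.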

Optimization(2/2) • Share-sorts • Specific to sort-based algorithms • Share the sorting cost across multiple group-bys • Share-partitions • Specific to hash-based algorithms • When the hash table is too large for memory, partition the data so that each partition fits in memory and aggregate each partition separately • Share the partitioning cost across multiple group-bys
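A small sketch of share-partitions, under stated assumptions: rows are `(a, b, measure)` tuples, the measure is summed, and the function names and the partition count are hypothetical. The data is partitioned once on attribute A, and that single partitioning pass is reused by both group-bys that contain A (AB and A).

```python
from collections import defaultdict

def partition_on(rows, attr_index, num_partitions):
    """Split rows into partitions by hashing one attribute."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[attr_index]) % num_partitions].append(row)
    return parts

def group_bys_sharing_partitions(rows, num_partitions=2):
    """Compute group-bys AB and A from one partitioning of the data on A."""
    result_ab, result_a = {}, {}
    for part in partition_on(rows, 0, num_partitions):
        # Only the hash tables of one partition are alive at a time.
        ab, a = defaultdict(int), defaultdict(int)
        for va, vb, measure in part:
            ab[(va, vb)] += measure
            a[(va,)] += measure
        # Partitions are disjoint on A, so partial results never collide.
        result_ab.update(ab)
        result_a.update(a)
    return result_ab, result_a

rows = [("a1", "b1", 1), ("a1", "b2", 2), ("a2", "b1", 3), ("a1", "b1", 4)]
ab, a = group_bys_sharing_partitions(rows)
```

Because every row with the same A value lands in the same partition, each partition can be aggregated independently and the results simply concatenated.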

Sort-based methods • PipeSort algorithm • Combines the share-sorts and smallest-parent optimizations: because the two can conflict, the group-bys are planned globally to obtain the minimum total cost. • Also includes the cache-results and amortize-scans optimizations: executing multiple group-bys in a pipelined fashion reduces the disk scan cost.
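The pipelined execution can be illustrated with a simplified sketch. It assumes the input is already sorted on the attribute order (for example A, B, C), that the measure is summed, and that the function name `pipeline_group_bys` is hypothetical. Every prefix of the sort order (ABC, AB, A, all) is computed in the same single scan, and each prefix keeps only one running total in memory.

```python
def pipeline_group_bys(sorted_rows, num_attrs):
    """Compute the group-by of every prefix of the sort order in one scan.

    sorted_rows: tuples (attr_1, ..., attr_n, measure), sorted on the attributes.
    Returns a list where results[k] is the group-by on the first k attributes.
    """
    results = [dict() for _ in range(num_attrs + 1)]
    current = [None] * (num_attrs + 1)   # key currently being aggregated per prefix
    totals = [0] * (num_attrs + 1)       # one running total per prefix
    for *attrs, measure in sorted_rows:
        for k in range(num_attrs + 1):
            key = tuple(attrs[:k])
            if current[k] is not None and key != current[k]:
                # The prefix value changed, so the previous group is complete.
                results[k][current[k]] = totals[k]
                totals[k] = 0
            current[k] = key
            totals[k] += measure
    for k in range(num_attrs + 1):       # flush the last open group of each prefix
        if current[k] is not None:
            results[k][current[k]] = totals[k]
    return results

rows = sorted([
    ("a1", "b1", "c1", 1), ("a1", "b1", "c2", 2),
    ("a1", "b2", "c1", 3), ("a2", "b1", "c1", 4),
])
results = pipeline_group_bys(rows, 3)
```

Here `results[3]` is group-by ABC, `results[2]` is AB, `results[1]` is A, and `results[0]` is the grand total, all produced from one pass over the sorted data.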

[Figure: Share-sorts and smallest-parent. Search lattice over attributes A, B, C, D, from all (level 0) to ABCD (level 4).]

[Figure: Cache-results and amortize-scans. Search lattice over attributes A, B, C, D, from all (level 0) to ABCD (level 4).]

Algorithm PipeSort(1/2) • Input • search lattice - vertex : a group-by of the cube - edge : i is connected to j when j can be generated from i. j has one attribute fewer than i, and i is called the parent of j. - cost : S is the cost of computing j from i when i is not sorted; A is the cost of computing j from i when i is sorted • Output • subgraph of the search lattice - each group-by is associated with a sort order of its attributes and is connected to the single parent used to compute it.
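The search lattice described above can be built in a few lines. This is a sketch with a hypothetical function name `build_lattice`; vertices are represented as frozensets of attribute names, and costs are left out.

```python
from itertools import combinations

def build_lattice(attrs):
    """Return (levels, edges) of the search lattice over the given attributes.

    levels[k] holds every group-by with k attributes.
    Each edge is (parent, child), where the child drops one parent attribute.
    """
    levels = [
        [frozenset(c) for c in combinations(attrs, k)]
        for k in range(len(attrs) + 1)
    ]
    edges = [
        (parent, parent - {a})
        for level in levels[1:]
        for parent in level
        for a in sorted(parent)
    ]
    return levels, edges

levels, edges = build_lattice("ABCD")
```

For four attributes this gives 16 group-bys across levels 0 to 4 and 32 parent-to-child edges.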

Algorithm PipeSort(2/2) [Figure: Minimum cost matching between level 2 (AB, AC, BC) and level 1 (A, B, C). Each level-2 group-by carries two costs, A() and S(): AB 2 and 10, AC 5 and 12, BC 13 and 20.]
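The matching step between two adjacent levels can be sketched with a brute-force search. Several things here are assumptions rather than content of the slide: the cost values are read from the figure as A() = 2, 5, 13 and S() = 10, 12, 20 for AB, AC, BC; each parent is assumed to feed at most one child in its existing sort order (cost A) and any other child only after a re-sort (cost S); and exhaustive enumeration stands in for a proper weighted bipartite matching algorithm, which is only reasonable for a tiny example like this one.

```python
from itertools import permutations

def min_cost_matching(children, parents, a_cost, s_cost):
    """Assign every child to a parent slot with minimum total cost (brute force)."""
    slots = []
    for p in parents:
        slots.append((p, "A", a_cost[p]))          # one slot in the current sort order
        slots.extend((p, "S", s_cost[p]) for _ in range(len(p) - 1))  # re-sort slots
    best_cost, best_plan = None, None
    for chosen in permutations(slots, len(children)):
        # A child can only be computed from a parent that contains its attributes.
        if all(set(c) <= set(p) for c, (p, _, _) in zip(children, chosen)):
            cost = sum(w for _, _, w in chosen)
            if best_cost is None or cost < best_cost:
                best_cost, best_plan = cost, dict(zip(children, chosen))
    return best_cost, best_plan

a_cost = {"AB": 2, "AC": 5, "BC": 13}
s_cost = {"AB": 10, "AC": 12, "BC": 20}
cost, plan = min_cost_matching(["A", "B", "C"], ["AB", "AC", "BC"], a_cost, s_cost)
```

With these numbers the cheapest plans reuse AB for two children (one in sort order, one after a re-sort) and take the third child from AC, which is cheaper than giving each child its own parent.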

[Figure: Minimum cost sort plan. Four-attribute lattice from the raw data down to all, with A() and S() costs on each group-by, distinguishing pipeline edges from sort edges.]

Hash-based methods • PipeHash algorithm • Combines the cache-results and amortize-scans optimizations: requires careful memory allocation among multiple hash tables • Also includes the smallest-parent optimization • Includes the share-partitions optimization: when the aggregated data is too large for its hash tables to fit in memory, the data is partitioned on one or more attributes. The data partitioning cost is shared among all group-bys that contain the partitioning attribute.

[Figure: Cache-results and amortize-scans for the hash-based method. Search lattice over attributes A, B, C, D, from all (level 0) to ABCD (level 4).]

Algorithm PipeHash • Input • search lattice • First step • For each group-by, choose the parent group-by with the smallest estimated total size. The result is a minimum spanning tree (MST). • Next step • There is usually not enough memory to compute all the group-bys in the MST together. • Decide which group-bys to compute together, when memory is handed over from one hash table to another, and which attribute to partition the data on. • To favor the cache-results and amortize-scans optimizations, choose the largest possible subtree of the MST.
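The first step can be sketched as below. The size estimator is an assumption made for illustration (product of distinct-value counts, capped at the number of tuples), as are the function names and the sample counts; the slides do not say how sizes are estimated.

```python
from itertools import combinations
from math import prod

def estimate_size(attrs, distinct, num_tuples):
    """Hypothetical estimate: product of distinct counts, capped at the tuple count."""
    return min(num_tuples, prod(distinct[a] for a in attrs))

def smallest_parent_tree(distinct, num_tuples):
    """Map every group-by (except the full one) to its smallest estimated parent."""
    attrs = sorted(distinct)
    tree = {}
    for k in range(len(attrs)):
        for child in combinations(attrs, k):
            parents = [
                tuple(sorted(set(child) | {a})) for a in attrs if a not in child
            ]
            tree[child] = min(
                parents, key=lambda p: estimate_size(p, distinct, num_tuples)
            )
    return tree

# Invented distinct-value counts for three attributes and 50,000 tuples.
distinct = {"A": 10, "B": 100, "C": 1000}
tree = smallest_parent_tree(distinct, num_tuples=50000)
```

The full group-by ABC is not a key of `tree`, since its only parent is the raw data. Every other group-by points at exactly one parent, so the result is a tree over the lattice.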

[Figure: Minimum spanning tree over the search lattice of attributes A, B, C, D, rooted at the raw data.]

[Figure: The minimum spanning tree split into a first subtree partitioned on A and the remaining subtrees.]

Comparing PipeSort and PipeHash(1/5) • Datasets • Performance results • Both algorithms are faster than the naive methods. • The performance of PipeHash is very close to the lower bound for hash-based algorithms. • PipeHash is inferior to the PipeSort algorithm.

Comparing PipeSort and PipeHash(2/5)

Comparing PipeSort and PipeHash(3/5) • When each group-by greatly reduces the number of tuples in its result, the hash-based method is expected to outperform the sort-based method. • Synthetic datasets • number of tuples, T • number of grouping attributes, N • ratio among the number of distinct values of each attribute, d1:d2:…:dN • ratio of T to the total number of possible attribute value combinations, p - used to vary the degree of data sparsity
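The parameter p can be made concrete with a short sketch. The generator below is an assumption for illustration only: it draws each attribute uniformly at random, which the slides do not specify, and the function names are hypothetical.

```python
import random
from math import prod

def sparsity_ratio(num_tuples, distinct_counts):
    """p = T divided by the number of possible attribute value combinations."""
    return num_tuples / prod(distinct_counts)

def synthetic_rows(num_tuples, distinct_counts, seed=0):
    """Draw T tuples, choosing each attribute uniformly from its distinct values."""
    rng = random.Random(seed)
    return [
        tuple(rng.randrange(d) for d in distinct_counts)
        for _ in range(num_tuples)
    ]

distinct_counts = [10, 10, 100]          # N = 3 attributes, ratio 1:1:10
rows = synthetic_rows(1000, distinct_counts)
p = sparsity_ratio(len(rows), distinct_counts)
```

A small p means few tuples relative to the possible combinations, that is, sparse data; raising T while holding the distinct counts fixed makes the data denser.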

Comparing PipeSort and PipeHash(4/5) Effect of sparseness on relative performance of the hash and sort-based algorithms for a 5 attribute synthetic dataset.

Comparing PipeSort and PipeHash(5/5) • Results • x-axis denotes decreasing levels of sparsity. • y-axis denotes the ratio between the total running time of algorithms PipeHash and PipeSort. • As the data becomes less sparse, the hash-based method performs better than the sort-based method. • Sparsity is therefore a predictor of the relative performance of the PipeHash and PipeSort algorithms.

Conclusion • Presented five optimizations: smallest-parent, cache-results, amortize-scans, share-sorts and share-partitions • The PipeHash and PipeSort algorithms combine them so as to reduce the total cost. • PipeHash does better on low-sparsity data, whereas PipeSort does better on high-sparsity data.
