Fast Parallel Algorithms for Universal Lossless Source Coding

Dror Baron

CSL & ECE Department, UIUC

[email protected]

Ph.D. Defense – February 18, 2003

Overview
  • Motivation, applications, and goals
  • Background:
    • Source models
    • Lossless source coding + universality
    • Semi-predictive methods
  • An O(N) semi-predictive universal encoder
  • Two-part codes
    • Rigorous analysis of their compression quality
    • Application to parallel compression of Bernoulli sequences
  • Parallel semi-predictive (PSP) coding
    • Achieving a work-efficient algorithm
    • Theoretical results
  • Summary
Motivation
  • Lossless compression: text files, facsimiles, software executables, medical, financial, etc.
  • What do we want in a compression algorithm?
    • Universality: adaptive to a large class of sources
    • Good compression quality
    • Speed: low computational complexity
    • Simple implementation
    • Low memory use
    • Sequential vs. offline
Why Parallel Compression?
  • Some applications require high data rates:
    • Compressed pages in virtual memory
    • Remote archiving + fast communication links
    • Real-time compression in storage systems
    • Power reduction for interconnects on a circuit board
  • Serial compression is limited by the clock rate
Room for Improvement and Goals
  • Previous Art:
    • Serial universal source coding methods have reached the bounds on compression quality [Willems 1998, Rissanen 1999]
    • Parallel source coding algorithms have high complexity and/or poor compression quality
      • Naïve parallelization compresses poorly
      • Parallel dictionary compression [Franaszek et al. 1996]
      • Parallel context tree weighting [Stassen & Tjalkens 2001, Willems 2000]
  • Research Goals: “good” parallel compression algorithm
    • Work-efficient: O(N/B) time with B computational units
    • Compression quality: as good as best serial methods (almost!)
Main Contributions
  • BWT-MDL (O(N) universal encoder):
    • An O(N) algorithm that achieves Rissanen’s redundancy bounds on best achievable compression
    • Combines efficient prefix tree construction with semi-predictive approach to universal coding
  • Fast Suffix Sorting (not in this talk):
    • Core algorithm is very simple (can be implemented in VLSI)
    • Worst-case complexity O(N log^0.5(N))
    • Competitive with other suffix sorting methods in practice
  • Two-Part Codes:
    • Rigorous analysis of their compression quality
    • Application to distributed/parallel compression
    • Optimal two-part codes
  • Parallel Compression Algorithm (not in this talk):
    • Work-efficient O(N/B) algorithm
    • Compression loss is roughly B log(N/B) bits
Source Models
  • Binary alphabet X = {0,1}, sequence x ∈ X^N
  • Bernoulli Model:
    • i.i.d. model
    • p(x_i = 1) = θ
  • Order-K Markov Model:
    • Previous K symbols called context
    • Context-dependent conditional probability for next symbol
    • More flexible than Bernoulli
    • Exponentially many states
Context Tree Sources

[Figure: context tree with root, internal node 1, and leaf states 0, 01, and 11; conditional probabilities P(x_{n+1}=1|0) = 0.8, P(x_{n+1}=1|01) = 0.4, P(x_{n+1}=1|11) = 0.9]

  • More flexible than Bernoulli
  • More compact than Markov
  • Particularly good for text
  • Works for M-ary alphabet
  • State = context + conditional probabilities
  • Example: N = 11, x = 01011111111
Review of Lossless Source Coding
  • Stationary ergodic sources
  • Entropy rate H = lim_{N→∞} H(x)/N
  • Asymptotically, H is the lowest attainable per-symbol rate
  • Arithmetic coding:
    • Probability assignment p(x)
    • Coding length l(x) = −log(p(x)) + O(1)
    • Can achieve entropy + O(1) bits (see the sketch below)
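As a small illustration (mine, not from the slides), the ideal coding length −log2 p(x) under a known Bernoulli parameter θ can be computed directly; an arithmetic coder realizes this length to within O(1) bits:

    import math

    def ideal_codelength(x, theta):
        """Ideal coding length -log2 p(x), in bits, for an i.i.d.
        Bernoulli(theta) model of the binary sequence x."""
        n1 = sum(x)
        n0 = len(x) - n1
        return -n1 * math.log2(theta) - n0 * math.log2(1 - theta)

    # N = 1000 symbols with theta = 0.2 cost about N*H(0.2), roughly 722 bits
    print(ideal_codelength([1, 0, 0, 0, 0] * 200, 0.2))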
Universal Source Coding
  • Source statistics are unknown
    • Need probability assignment p(x)
    • Need to estimate source model
    • Need to describe estimated source (explicitly or implicitly)
  • Redundancy: excess coding length above entropy

ρ(x) = l(x) − N·H

Redundancy Bounds
  • Rissanen’s bound (K unknown parameters):

E[ρ(x)] > (K/2)·[log(N) + O(1)]

  • Worst-case redundancy for Bernoulli sequences (K=1): ρ(x*) = max_{x∈X^N} ρ(x) ≥ 0.5·log(N/2)
  • Asymptotically, ρ(x)/N → 0
  • In practice, e.g., text, the number of parameters scales almost linearly with N
  • Low redundancy is still essential (the bound is evaluated numerically below)
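To put the bound in numbers (my sketch, not from the slides), the leading term (K/2)·log(N) is easy to evaluate:

    import math

    def rissanen_leading_term(K, N):
        """Leading term (K/2)*log2(N) of Rissanen's redundancy lower bound, in bits."""
        return 0.5 * K * math.log2(N)

    # One unknown Bernoulli parameter (K=1) over N = 2^20 symbols: about 10 bits
    print(rissanen_leading_term(1, 2 ** 20))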
Semi-Predictive Approach

[Figure: block diagram; Phase I: an MDL estimator maps x to the tree source structure S*; Phase II: an encoder maps x, given S*, to the output y]

  • Semi-predictive methods describe x in two phases:
    • Phase I: find a “good” tree source structure S and describe it using codelength l_S
    • Phase II: encode x using S with probability assignment p_S(x)
  • Phase I: estimate minimum description length (MDL) tree source model S* = arg min_S {l_S − log(p_S(x))}
Semi-Predictive Approach - Phase II

[Figure: Phase II pipeline; each symbol x_i is routed to its generating state s of S*, assigned a conditional probability p(x_i|s), and arithmetic encoded into y]

  • Sequential encoding of x given S*
    • Determine which state s of S* generated symbol x_i
    • Assign x_i a conditional probability p(x_i|s)
    • Arithmetic encoding
  • p(x_i|s) can be based on previously processed portion of x, quantized probability estimates, etc. (one common choice is sketched below)
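For concreteness, a standard sequential probability assignment in context-tree methods is the Krichevsky–Trofimov (KT) estimator; this sketch illustrates the kind of rule Phase II can use, not necessarily the exact estimator of the thesis:

    def kt_probability(n0, n1, bit):
        """KT estimate of p(x_i = bit | state s), given the counts n0 and n1
        of zeros and ones previously seen in state s."""
        if bit == 1:
            return (n1 + 0.5) / (n0 + n1 + 1.0)
        return (n0 + 0.5) / (n0 + n1 + 1.0)

After each symbol is encoded, the matching count in its state is incremented, so the estimate adapts as Phase II scans x.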
Context Trees

[Figure: context tree for x; nodes labeled root, node 0, node 01, node 11; leaves correspond to unique contexts, and the sentinel heads its own branch]

  • We will provide an O(N) semi-predictive algorithm by estimating S* using context trees
  • Context trees arrange x in a tree
  • Each node corresponds to the sequence of appended arc labels on the path to the root
  • Internal nodes correspond to repeating contexts in x
  • Leaves correspond to unique contexts
  • Sentinel symbol x_0 = $ makes sure all symbols have different contexts
Context Tree Pruning (“To prune or not to prune…”)

[Figure: the MDL structure for state s is either s itself, or the MDL structures for 0s and 1s plus additional structure (e.g., 00s and 10s)]
  • The MDL structure for state s yields the shortest description for symbols generated by s
  • When processing state s:
    • Estimate MDL structures for states 0s and 1s
    • Decide whether to keep 0s and 1s or prune them into state s
    • Base decision on coding lengths
Phase I with Atomic Context Trees

[Figure: atomic context tree; every arc carries a single-symbol label, 0 or 1]

  • Atomic context tree:
    • Arc labels are atomic (single symbol)
    • Internal nodes are not necessarily branching
    • Has up to O(N^2) nodes
  • The coding length minimization of Phase I processes each node of the context tree [Nohre94]
  • With atomic context trees, the worst-case complexity is at least O(N^2) ☹
Compact Context Trees

[Figure: compact context tree; arcs carry multi-symbol labels such as 111]

  • Compact context tree:
    • Arc labels not necessarily atomic
    • Internal nodes are branching
    • O(N) nodes
    • Compact representation of the same tree
  • Depth-first traversal of the compact context tree provides O(N) complexity ☺
  • Theorem: Phase I of BWT-MDL requires O(N) operations performed with O(log(N)) bits of precision
Phase II of BWT-MDL

[Figure: same Phase II pipeline as before; each symbol x_i is routed to its generating state s, assigned p(x_i|s), and arithmetic encoded into y]
  • We determine the generator state using a novel algorithm based on properties of the Burrows-Wheeler transform (BWT)
  • Theorem: The BWT-MDL encoder requires O(N) operations performed with O(log(N)) bits of precision
  • Theorem [Willems et al. 2000]: redundancy w.r.t. any tree source S is at most |S|·[0.5·log(N) + O(1)] bits
Distributed/Parallel Compression of Bernoulli Sequences

[Figure: a splitter partitions x into blocks x(1),…,x(B); encoder j assigns probabilities p(x_i(j)) and arithmetic encodes x(j) into y(j)]

  • Splitter partitions x into B blocks x(1),…,x(B)
  • Encoder j ∈ {1,…,B} compresses x(j); it assigns probabilities p(x_i(j)=1) = θ and p(x_i(j)=0) = 1 − θ
  • The total probability assigned to x is identical to that in a serial compression system
  • This structure assumes that θ is known; our goal is to provide a universal parallel compression algorithm for Bernoulli sequences
Two-Part Codes

[Figure: two-part encoder; determine θ_ML(x), quantize it to bin index k ∈ {1,…,K} with representation level r_k, then encode x into y]

  • Two-part codes use a semi-predictive approach to describe Bernoulli sequences:
    • First part of code:
      • Determine the maximum likelihood (ML) parameter estimate θ_ML(x) = n_1/(n_0 + n_1)
      • Quantize θ_ML(x) to r_k, one of K representation levels
      • Describe the bin index k with log(K) bits
    • Second part of code encodes x using r_k
  • In distributed systems:
    • Sequential compressors require O(N) internal communications
    • Two-part codes need only communicate {n_0(j), n_1(j)}, j ∈ {1,…,B}
    • Requires O(B·log(K)) internal communications
Jeffreys Two-Part Code

[Figure: quantizer bins on [0,1] with edges b_0, b_1, …, b_K and representation levels r_1, …, r_K]

  • Quantize θ_ML(x)

Bin edges b_k = sin^2(πk/(2K))

Representation levels r_k = sin^2(π(2k−1)/(4K))

  • Use K ≈ 1.772·N^0.5 bins (a code sketch follows)
  • Source description:
    • log(K) bits for describing the bin index k
    • Need −n_1·log(θ_ML(x)) − n_0·log(1 − θ_ML(x)) for encoding x
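The quantizer above is simple to compute; here is a minimal sketch (mine, following the formulas on this slide) that builds the bin edges and representation levels and quantizes an ML estimate:

    import bisect
    import math

    def jeffreys_quantizer(N):
        """Bin edges b_k = sin^2(pi*k/(2K)) and representation levels
        r_k = sin^2(pi*(2k-1)/(4K)) with K ~ 1.772*sqrt(N) bins."""
        K = math.ceil(1.772 * math.sqrt(N))
        edges = [math.sin(math.pi * k / (2 * K)) ** 2 for k in range(K + 1)]
        levels = [math.sin(math.pi * (2 * k - 1) / (4 * K)) ** 2
                  for k in range(1, K + 1)]
        return K, edges, levels

    def quantize(theta_ml, edges):
        """Bin index k in {1,...,K} whose cell contains theta_ml."""
        K = len(edges) - 1
        return min(K, max(1, bisect.bisect_right(edges, theta_ml)))

Note how the sin^2 spacing makes the bins narrow near 0 and 1, where the entropy function is steep and quantization errors are costly.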
Redundancy of Jeffreys Code for Bernoulli Sequences
  • Redundancy:
    • log(K) bits for describing k
    • N·D(θ_ML(x) || r_k) bits for encoding x using the imprecise model
  • D(a||b) is the Kullback–Leibler divergence
  • In bin k, l(x) = −θ_ML(x)·log(r_k) − [1 − θ_ML(x)]·log(1 − r_k)
  • l(θ_ML(x)) is a poly-line
  • Redundancy = log(K) + l(θ_ML(x)) − N·H(θ_ML(x)) ≤ log(K) + L∞
  • Use quantizers that have small L∞ distance between the entropy function and the induced poly-line fit (the computation is sketched below)
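A direct transcription of this redundancy bookkeeping (my sketch; the divergence D is per symbol, in bits):

    import math

    def kl_divergence(a, b):
        """Kullback-Leibler divergence D(a||b) between Bernoulli(a) and
        Bernoulli(b), in bits per symbol."""
        def term(p, q):
            return 0.0 if p == 0.0 else p * math.log2(p / q)
        return term(a, b) + term(1.0 - a, 1.0 - b)

    def two_part_redundancy(N, theta_ml, r_k, K):
        """log(K) bits for the bin index plus N*D(theta_ML || r_k) bits of
        loss from encoding x with the quantized parameter r_k."""
        return math.log2(K) + N * kl_divergence(theta_ml, r_k)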
Redundancy Properties
  • For x s.t. θ_ML(x) is quantized to r_k, the worst-case redundancy is

log(K) + N·max{D(b_k||r_k), D(b_{k−1}||r_k)}

  • D(b_k||r_k) and D(b_{k−1}||r_k)
    • Largest in initial or end bins
    • Similar in the middle bins
    • Difference reduced over a wider range of k for larger N (larger K)
  • Can construct a near-optimal quantizer by modifying the initial and end bins of the Jeffreys quantizer
Redundancy Results
  • Theorem: The worst-case redundancy of the Jeffreys code is 1.221+O(1/N) bits above Rissanen’s bound
  • Theorem: The worst-case redundancy of the optimal two-part code is 1.047+O(1/N) bits above Rissanen’s bound
Parallel Universal Compression for Bernoulli Sequences

[Figure: B parallel encoders determine the block symbol counts n_0(j), n_1(j); the coordinating unit computes θ_ML(x) = [Σ_j n_1(j)] / [Σ_j n_0(j) + Σ_j n_1(j)], quantizes it, and broadcasts the bin index k; each encoder then compresses x(j) into y(j) using r_k]

  • Phase I:
    • Parallel units (PUs) compute symbol counts for the B blocks
    • Coordinating unit (CU) computes and quantizes the ML parameter estimate θ_ML(x) and describes k
  • Phase II: B PUs encode the B blocks based on r_k (see the sketch below)
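A minimal sketch of Phase I (mine; the quantizer is passed in as a callable, e.g., one built from the hypothetical jeffreys_quantizer/quantize helpers sketched earlier):

    def parallel_bernoulli_phase1(blocks, quantizer):
        """Phase I: each PU reports its block's symbol counts; the CU sums
        them and quantizes the global ML estimate. `quantizer` maps
        (theta_ml, N) to a bin index k and representation level r_k."""
        counts = [(b.count(0), b.count(1)) for b in blocks]  # one PU per block
        n0 = sum(c0 for c0, _ in counts)
        n1 = sum(c1 for _, c1 in counts)
        theta_ml = n1 / (n0 + n1)
        return quantizer(theta_ml, n0 + n1)  # (k, r_k); k costs log(K) bits

In Phase II, each PU runs an arithmetic coder on its own block with the fixed probabilities p(1) = r_k and p(0) = 1 − r_k, so the product of assigned probabilities matches the serial system, as noted earlier.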
Why do we need Parallel Semi-Predictive Coding?

[Figure: naïve parallelization; a splitter partitions x into blocks x(1),…,x(B), which B independent compressors encode into y(1),…,y(B)]

  • Naïve parallelization:
    • Partition x into B blocks
    • Compress blocks independently
    • The redundancy for a length-N/B block is O(log(N/B))
    • Total redundancy is O(B·log(N/B))
  • Rissanen’s bound is O(log(N))
  • The redundancy with naïve parallelization is excessive! For example, with N = 2^20 and B = 64, B·log(N/B) = 64·14 = 896 bits, versus log(N) = 20 bits
Parallel Semi-Predictive (PSP) Concept

[Figure: PSP block diagram; Phase I: B statistics accumulators send symbol counts for x(1),…,x(B) to the coordinating unit, which computes S*; Phase II: B compressors encode the blocks into y(1),…,y(B) using S*]

  • Phase I:
    • B parallel units (PUs) accumulate statistics (symbol counts) on the B blocks
    • Coordinating unit (CU) computes the MDL tree source estimate S*
  • Phase II: B PUs compress the B blocks based on S*
Source Description in PSP

[Figure: the CU describes the S* structure and the bin indices {k_s}, s ∈ S*; parallel unit b determines the state s of each symbol x_i(b), assigns p(x_i(b)|s) = 1{x_i(b)=1}·r_{k_s} + 1{x_i(b)=0}·(1 − r_{k_s}), and arithmetic encodes into y(b)]

  • Phase I: the CU describes the structure of S* and the quantized ML parameter estimates {k_s}, s ∈ S*
  • Phase II: each of the B PUs compresses block x(b) just like Phase II of the (serial) semi-predictive approach; the per-symbol probability assignment is sketched below
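The probability assignment from the figure, written out (a one-line sketch of the slide's formula):

    def psp_probability(bit, r_ks):
        """PSP Phase II assignment: p(x=1|s) = r_ks, p(x=0|s) = 1 - r_ks,
        where r_ks is the quantized parameter of the generating state s."""
        return r_ks if bit == 1 else 1.0 - r_ks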
Complexity of Phase I

[Figure: full atomic context tree of depth D_max with 2^{D_max} = O(N/B) leaves]

  • Phase I processes each node of the context tree [Nohre94]
  • The CU processes the states of a full atomic context tree of depth D_max, where D_max ≈ log(N/B)
  • Processing a node:
    • Internal node: requires O(1) time
    • Leaf: the CU adds up block symbol counts to compute each symbol count, i.e., n_s^α = Σ_b n_s^α(b), where α ∈ {0,1}
  • The CU processes a leaf node in O(B) time
  • With O(N/B) leaves, the aggregate complexity is O(N), which is excessive
Phase I in O(N/B) Time
  • We want to compute n_s^α = Σ_b n_s^α(b) faster
  • An adder tree incurs O(log(B)) delay for adding up B block symbol counts
  • Pipelining enables us to generate a result every O(1) time
  • O(N/B) nodes, each requiring O(1) time (see the sketch below)
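A functional sketch of the adder tree (mine; pipelining is a hardware notion, so this only shows the O(log B)-depth reduction structure that the pipeline streams leaf counts through):

    def adder_tree_sum(values):
        """Sum B values by pairwise addition: O(log B) levels; in a pipelined
        implementation each level is a register stage, so once the pipeline
        fills, one summed result emerges per O(1) time step."""
        while len(values) > 1:
            if len(values) % 2:            # pad odd-length levels
                values = values + [0]
            values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        return values[0]

    # Example: sum the per-block counts of symbol 1 for one leaf state s
    print(adder_tree_sum([3, 0, 5, 2, 1, 4, 0, 7]))  # 22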
Phase II in O(N/B) Time

[Figure: parallel unit b; each symbol x_i(b) is routed to its generating state s via S* and {k_s}, assigned p(x_i(b)|s), and arithmetic encoded into y(b)]

  • The challenging part in Phase II is determining s:
    • Define the context index for a length-D_max context s preceding x_i(b) as the binary number that represents s
    • The length-2^{D_max} generator table g satisfies g_j = s ∈ S* if s is a suffix of the context whose context index is j
    • We can construct g in O(N/B) time (far from trivial!)
  • Compute context indices for all symbols of x(b) and determine the generating states via the generator table g (a naive version is sketched below)
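For intuition only, here is a naive construction of the generator table (my sketch; the thesis builds g in O(N/B) time, which is far harder). It assumes S* is a complete tree source, so every length-D_max context has exactly one state as a suffix; contexts and states are written as binary strings:

    def build_generator_table(states, d_max):
        """g[j] = the state of S* that is a suffix of the length-d_max
        context whose binary representation (context index) is j."""
        g = []
        for j in range(2 ** d_max):
            context = format(j, "0{}b".format(d_max))
            g.append(next(s for s in states if context.endswith(s)))
        return g

    # Example tree source with states {0, 01, 11} (as on the earlier slide)
    print(build_generator_table(["0", "01", "11"], 2))  # ['0', '01', '0', '11']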
Decoder

[Figure: the compressed bus y is demultiplexed into y(1),…,y(B); S* and {k_s}, s ∈ S*, are reconstructed; B decoding units recover x(1),…,x(B)]

  • An input bus is demultiplexed to multiple units
  • The MDL source and quantized ML parameters are reconstructed
  • The B compressed blocks y(1),…,y(B) are decompressed on B decoding units
Theoretical Results
  • Theorem: With computations performed with 2·log(N) bits of precision defined as O(1) time:
    • Phase I of PSP approximates the MDL coding length within O(1) of the true optimum
    • The PSP algorithm requires O(N/B) time
  • Theorem: The PSP algorithm uses a total of O(N) words of memory = a total of O(N·log(N)) bits
  • Theorem: The pointwise redundancy of PSP w.r.t. S* is ρ(x) < B·[log(N/B) + O(1)] + |S*|·[log(N)/2 + O(1)], where the first term is the parallelization overhead
More…
  • Results have been extended to |X|-ary alphabet
  • Future research can concentrate on:
    • Processing broader classes of tree sources
    • Problems in statistical inference
      • Universal classification
      • Channel decoding
      • Prediction
    • Characterizing the design space for parallel compression algorithms
Generic Phase I
  • Recursive coding length minimization over the context tree; a runnable restatement of the slide’s pseudocode, where length(n0, n1) is the coding length assigned to a state with symbol counts n0 and n1 (left abstract here, as on the slide):

    def mdl_prune(node, length):
        """Return (MDL coding length, n0, n1) for the subtree rooted at the
        state `node`, pruning children wherever that shortens the description."""
        if not node.children:                      # s is a leaf
            n0, n1 = node.counts                   # symbol appearances in s
            return length(n0, n1), n0, n1
        # s is an internal node: recursively process states 0s and 1s
        mdl_0s, a0, a1 = mdl_prune(node.children["0"], length)
        mdl_1s, b0, b1 = mdl_prune(node.children["1"], length)
        n0, n1 = a0 + b0, a1 + b1                  # n_s0 = n_0s0 + n_1s0, etc.
        mdl_s = length(n0, n1)
        if mdl_s > mdl_0s + mdl_1s:                # keep 0s and 1s
            return mdl_0s + mdl_1s, n0, n1
        node.children = {}                         # prune 0s and 1s into s
        return mdl_s, n0, n1