
Inverted File Compression

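Inverted File Compression, from Managing Gigabytes. Course: Information Retrieval. Lecturer: Kwon Hyuk-chul (권혁철), Pusan National University (부산대학교). An inverted file entry has the form $\langle t; f_t; [d_1, d_2, \ldots, d_{f_t}]\rangle$, where $t$ is the term, $f_t$ is the number of documents containing it, and $d_k$ is a document number with $d_k < d_{k+1}$.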


Presentation Transcript


  1. Inverted File Compression, in Managing Gigabytes • Course: Information Retrieval • Lecturer: Kwon Hyuk-chul (권혁철), Pusan National University (부산대학교)

  2. Inverted File Compression • Inverted file entry: $\langle t; f_t; [d_1, d_2, \ldots, d_{f_t}]\rangle$ • $t$: term, $f_t$: number of documents containing $t$ • $d_k$: document number, where $d_k < d_{k+1}$ • Document numbers are stored as gaps: <elephant; 8; [3, 5, 20, 21, 23, 76, 77, 78]> => <elephant; 8; [3, 2, 15, 1, 2, 53, 1, 1]> • gap $= d_{k+1} - d_k$ • Two classes of compression methods: global methods vs. local methods
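The gap transformation on this slide can be sketched in a few lines of Python. The function names `to_gaps` and `from_gaps` are illustrative and not from the slides; the first gap is taken relative to document number 0, which matches the elephant example.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Convert a strictly increasing list of document numbers into gaps."""
    gaps = []
    prev = 0  # assumption: the first gap is measured from document number 0
    for d in doc_ids:
        if d <= prev:
            raise ValueError("document numbers must be strictly increasing")
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Recover the document numbers as running sums of the gaps."""
    return list(accumulate(gaps))


print(to_gaps([3, 5, 20, 21, 23, 76, 77, 78]))
```

The gaps are smaller on average than the document numbers, which is what lets the codes on the following slides spend fewer bits per pointer.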

  3. Summary of coding methods

  4. Unary code • Simple method (binary code): fixed-length representation of a positive integer, $\lceil \log N \rceil$ bits • Unary code: a gap $x$ is represented by $x - 1$ one-bits followed by a single zero-bit • $l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$ • e.g. $x = 9$ => 111111110
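As a minimal sketch of the unary code described on this slide, bits are represented here as a Python string of "0" and "1" characters for readability; a real index would pack them into bytes. The function names are illustrative.

```python
def unary_encode(x):
    """Unary code: x - 1 one-bits followed by a single zero-bit."""
    if x < 1:
        raise ValueError("unary code is defined for positive integers only")
    return "1" * (x - 1) + "0"


def unary_decode(bits):
    """Decode the first unary codeword in a bit string."""
    # The value is the position of the first zero-bit, counted from 1.
    return bits.index("0") + 1


print(unary_encode(9))
```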

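  5. γ code • a unary code of $1 + \lfloor \log x \rfloor$ bits followed by a binary code of $\lfloor \log x \rfloor$ bits for $x - 2^{\lfloor \log x \rfloor}$ • $l_x = 1 + \lfloor \log x \rfloor + \lfloor \log x \rfloor$, $\Pr[x] \approx 1/(2x^2)$ • e.g. $x = 9$: $\lfloor \log x \rfloor = 3$, $x - 2^{\lfloor \log x \rfloor} = 1$ => 1110001 • $V = \langle 1, 2, 4, 8, 16, \ldots \rangle$ or $V = \langle 1, 2, 2, 4, 4, 4, 8, \ldots \rangle$ or …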
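The γ code construction above can be sketched as follows, again using a string of "0"/"1" characters as a stand-in for a bit stream. The function name `gamma_encode` is illustrative.

```python
def gamma_encode(x):
    """Gamma code: unary code of 1 + floor(log2 x), then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    n = x.bit_length() - 1            # floor(log2 x)
    prefix = "1" * n + "0"            # unary code of n + 1
    # Binary code of x - 2**n in exactly n bits (empty when n == 0).
    suffix = format(x - (1 << n), f"0{n}b") if n else ""
    return prefix + suffix


print(gamma_encode(9))
```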

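  6. δ code • similar in form to the γ code • the $1 + \lfloor \log x \rfloor$-bit unary prefix is replaced by the γ code of $1 + \lfloor \log x \rfloor$, followed by the same $\lfloor \log x \rfloor$-bit binary code for $x - 2^{\lfloor \log x \rfloor}$ • $l_x = 1 + 2\lfloor \log(1 + \lfloor \log x \rfloor) \rfloor + \lfloor \log x \rfloor$, $\Pr[x] \approx 1/(2x(\log x)^2)$ • e.g. $x = 9$ => 11000001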
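A minimal sketch of the δ code, which differs from the γ code only in how the length prefix is written. The γ encoder is repeated so the block runs on its own; function names are illustrative.

```python
def gamma_encode(x):
    """Gamma code: unary length prefix followed by the low-order binary bits."""
    n = x.bit_length() - 1            # floor(log2 x)
    suffix = format(x - (1 << n), f"0{n}b") if n else ""
    return "1" * n + "0" + suffix


def delta_encode(x):
    """Delta code: gamma code of 1 + floor(log2 x), then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("delta code is defined for positive integers only")
    n = x.bit_length() - 1            # floor(log2 x)
    suffix = format(x - (1 << n), f"0{n}b") if n else ""
    return gamma_encode(n + 1) + suffix


print(delta_encode(9))
```

For $x = 9$ the prefix is the γ code of 4 (11000) and the suffix is 001, giving the 8-bit codeword from the slide.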

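  7. Global Bernoulli model • $\Pr[x] = (1-p)^{x-1}p$, where $p$ is the probability that a term occurs in a document • Golomb code: a unary code of $q + 1$ bits followed by a binary code of $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits • $q = \lfloor (x - 1) / b \rfloor$, $r = x - qb - 1$ • $b^A = \lceil \log(2 - p) / -\log(1 - p) \rceil \approx 0.69\,(N n / f)$ • e.g. $b = 3$: $r = 0$ (0), 1 (10), 2 (11); $b = 6$: $r = 0$ (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111) • for $x = 9$ with $b = 3$: $q = 2$, $r = 2$, so the code is 11011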
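The Golomb code on this slide can be sketched as below. The remainder is written with a truncated binary code, which is an assumption on my part, chosen because it reproduces the remainder codewords listed for $b = 3$ and $b = 6$. Function names are illustrative.

```python
def golomb_encode(x, b):
    """Golomb code: unary code of q + 1, then the remainder r in truncated binary."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive integers")
    q, r = divmod(x - 1, b)           # q = floor((x-1)/b), r = x - q*b - 1
    prefix = "1" * q + "0"            # unary code of q + 1
    k = (b - 1).bit_length()          # ceil(log2 b)
    if k == 0:                        # b == 1: no remainder bits, plain unary
        return prefix
    cutoff = (1 << k) - b             # number of remainders that get k - 1 bits
    if r < cutoff:
        return prefix + format(r, f"0{k - 1}b")
    return prefix + format(r + cutoff, f"0{k}b")


print(golomb_encode(9, 3))
```

With $b = 1$ the remainder part vanishes and the code degenerates to plain unary.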

  8. Global “observed frequency” model • Based on the observed frequency of each gap size • Uses arithmetic or Huffman coding • In theory: a better compression method • In practice: only slightly better than the γ and δ codes

  9. Local Bernoulli model • The frequency of term $t$, $f_t$, is known • so a Bernoulli model can be applied to each inverted file entry individually • Very common words are coded with $b = 1$ • which is tantamount to a bitvector • so the inverted file can never be worse than a bitvector • The parameter $f_t$ must be stored • so that $b$ can be derived during decoding
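A sketch of how a per-term Golomb parameter might be derived from the stored $f_t$, using the formula $b = \lceil \log(2 - p) / -\log(1 - p) \rceil$ from the global Bernoulli slide. Taking $p = f_t / N$ for a collection of $N$ documents is an assumption here (the slide does not spell it out), and the function names are illustrative.

```python
import math


def golomb_parameter(p):
    """b = ceil(log(2 - p) / -log(1 - p)) for an occurrence probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p == 1:
        return 1                      # every document contains the term
    return max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))


def local_parameter(f_t, num_docs):
    """Per-term parameter, assuming p = f_t / N (an assumption of this sketch)."""
    return golomb_parameter(f_t / num_docs)


print(local_parameter(8, 100))    # a rare term gets a larger b
print(local_parameter(50, 100))   # a very common term gets b = 1
```

Because the decoder can recompute the same value from $f_t$ and $N$, $b$ itself never has to be written to the index.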

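  10. Skewed Bernoulli model • Figure: word positions in the Bible for (a) bridegroom; (b) Jezebel; (c) twelfth • Vector for the Bernoulli model: $V_G = \langle b, b, b, \ldots \rangle$ • Vector for the skewed Bernoulli model: $V_T = \langle b, 2b, 4b, \ldots, 2^i b, \ldots \rangle$ • slightly worse than the Golomb code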
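One way to read the vectors on these slides is as bucket sizes: a gap is coded as the unary index of the bucket it falls in, followed by its offset inside that bucket. The sketch below follows that reading. The offset coding (truncated binary) and all function names are assumptions of this sketch; they were chosen because they reproduce the γ and Golomb codewords for $x = 9$ given on the earlier slides.

```python
from itertools import count


def truncated_binary(r, size):
    """Code r in [0, size) using floor(log2 size) or ceil(log2 size) bits."""
    k = (size - 1).bit_length()       # ceil(log2 size)
    if k == 0:                        # a bucket of size 1 needs no offset bits
        return ""
    cutoff = (1 << k) - size
    if r < cutoff:
        return format(r, f"0{k - 1}b")
    return format(r + cutoff, f"0{k}b")


def vector_encode(x, bucket_sizes):
    """Unary bucket index, then the offset of x within that bucket."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    offset, prefix = x - 1, ""
    for size in bucket_sizes:
        if offset < size:
            return prefix + "0" + truncated_binary(offset, size)
        offset -= size
        prefix += "1"
    raise ValueError("bucket vector exhausted before reaching x")


def gamma_vector():
    return (1 << i for i in count())      # <1, 2, 4, 8, ...>


def golomb_vector(b):
    return (b for _ in count())           # <b, b, b, ...>


def skewed_vector(b):
    return (b << i for i in count())      # <b, 2b, 4b, ...>


print(vector_encode(9, gamma_vector()))
print(vector_encode(9, golomb_vector(3)))
print(vector_encode(9, skewed_vector(3)))
```

The growing buckets of the skewed vector keep small gaps cheap while preventing the unary prefix from becoming very long for large gaps.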

  11. Local hyperbolic model • $\Pr[x] = \frac{1}{x\,(\log_e(m+1) + 0.5772)}$, $x = 1, 2, \ldots, m$ • $m$ is the largest gap • Better compression performance • but more complex to implement • requires the use of arithmetic coding
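As a rough numerical check of the hyperbolic model, the sketch below computes the gap probabilities for a given largest gap $m$ and the ideal code length $-\log_2 \Pr[x]$ that an arithmetic coder would approach. The function names are illustrative; the constant 0.5772 is the one from the slide.

```python
import math

EULER_CONSTANT = 0.5772  # the constant used on the slide


def hyperbolic_probabilities(m):
    """Pr[x] = 1 / (x * (ln(m + 1) + 0.5772)) for x = 1..m."""
    norm = math.log(m + 1) + EULER_CONSTANT
    return [1.0 / (x * norm) for x in range(1, m + 1)]


def ideal_code_length(x, m):
    """Ideal length in bits, -log2 Pr[x], for a gap x under the model."""
    return -math.log2(hyperbolic_probabilities(m)[x - 1])


probs = hyperbolic_probabilities(1000)
print(sum(probs))                  # close to 1, since the normaliser approximates the harmonic sum
print(ideal_code_length(9, 1000))
```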

  12. Local “observed frequency” model • The ultimate in local modeling • batched frequency • requires more memory • best compression method

  13. Performance of Index Compression Methods (bits per pointer)
      Method                 Bible   GNUbib   Comact    TREC
      Global methods
        Unary                  264      920      490    1719
        Binary               15.00    16.00    18.00   20.00
        Bernoulli             9.67    11.65    10.58   12.61
        γ                     6.55     5.69     4.48    6.43
        δ                     6.26     5.08     4.36    6.19
        Observed frequency    5.92     4.83     4.21    5.83
      Local methods
        Bernoulli             6.13     6.17     5.40    5.73
        Hyperbolic            5.77     5.17     4.65    5.74
        Skewed Bernoulli      5.68     4.71     4.24    5.28
        Batched frequency     5.61     4.65     4.03    5.27

  14. Compression of bitmaps • Bitmaps are compressed with hierarchical bitvector compression • Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits
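The slide names the technique without giving details, so the following is only a sketch of one common form of hierarchical bitvector compression, matching the three figure panels: the bitvector is cut into fixed-size blocks, each block is summarised by one bit in the level above (1 if the block contains any 1), all-zero blocks are dropped, and the levels are written out top-down. The block size of 4 and the function name are assumptions.

```python
def hierarchical_compress(bits, block=4):
    """Flatten a hierarchical bitvector into a string, top level first."""
    levels = []
    current = list(bits)
    while len(current) > block:
        # Pad with zeros so the level splits evenly into blocks.
        current += [0] * (-len(current) % block)
        blocks = [current[i:i + block] for i in range(0, len(current), block)]
        # Keep only the blocks that contain at least one 1.
        levels.append([blk for blk in blocks if any(blk)])
        # The next level up has one summary bit per block.
        current = [1 if any(blk) else 0 for blk in blocks]
    levels.append([current])          # the top level is stored in full
    levels.reverse()
    return "".join(str(b) for level in levels for blk in level for b in blk)


sparse = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(hierarchical_compress(sparse))
```

A decoder would need the original length and the block size to rebuild the vector; those are not stored in this sketch.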
