Processing of large document collections


Processing of large document collections. Fall 2002, Part 3. Text compression. Despite a continuous increase in storage and transmission capacities, more and more effort has been put into using compression to increase the amount of data that can be handled

Presentation Transcript

Text compression

  • Despite a continuous increase in storage and transmission capacities, more and more effort has been put into using compression to increase the amount of data that can be handled

  • no matter how much storage space or transmission bandwidth is available, someone always finds ways to fill it with

Text compression

  • Efficient storage and representation of information is an old problem (before the computer era)

    • Morse code: uses shorter representations for common characters

    • Braille code for the blind: includes contractions, which represent common words with 2 or 3 characters

Text compression

  • On a computer: changing the representation of a file so that it takes less space to store or less time to transmit

    • original file can be reconstructed exactly from the compressed representation

  • different from data compression in general

    • text compression has to be lossless

    • compare with sound and images, where small changes and noise are tolerated

Text compression methods

  • Huffman coding (in the 50’s)

    • compressing English: 5 bits/character

  • Ziv-Lempel compression (in the 70’s)

    • 4 bits/character

  • arithmetic coding

    • 2 bits/char (more processing power needed)

  • prediction by partial matching (80’s)

Text compression methods

  • Since the 80’s, compression rates have stayed about the same

  • improvements have been made in processor and memory utilization during compression

  • also: the amount of compression may increase when more memory (for compression and decompression) is available

Text compression methods

  • Most text compression methods can be placed in one of two classes:

    • symbolwise methods

    • dictionary methods

Symbolwise methods

  • Work by estimating the probabilities of symbols (often characters)

    • coding one symbol at a time

    • using shorter codewords for the most likely symbols (in the same way as Morse code does)

Symbolwise methods

  • variations differ mainly in how they estimate probabilities for symbols

    • the more accurate these estimates are, the greater the compression that can be achieved

    • to obtain good compression, the probability estimate is usually based on the context in which a symbol occurs

Dictionary methods

  • compress by replacing words and other fragments of text with an index to an entry in a ”dictionary”

  • compression is achieved if the index is stored in fewer bits than the string it replaces

Symbolwise methods

  • Modeling

    • estimating probabilities

    • there does not appear to be any single ”best” method

  • Coding

    • converting the probabilities into a bitstream for transmission

    • well understood, can be performed effectively

Models

  • Compression methods obtain high compression by forming good models of the data that is to be coded

  • the function of a model is to predict symbols

    • e.g. during the encoding of a text, the ”prediction” for the next symbol might include a probability of 2% for the letter ’u’, based on its relative frequency in a sample of text

Models

  • The set of all possible symbols is called the alphabet

  • the probability distribution provides an estimated probability for each symbol in the alphabet

Encoding, decoding

  • the model provides the probability distribution to the encoder, which uses it to encode the symbol that actually occurs

  • the decoder uses an identical model together with the output of the encoder to find out what the encoded symbol was

Information content of a symbol

  • The number of bits in which a symbol s should be coded is called the information content I(s) of the symbol

  • the information content I(s) is directly related to the symbol’s predicted probability P(s), by the function

    • I(s) = -log2 P(s) bits (the logarithm is taken to base 2)

Information content of a symbol

  • The average amount of information per symbol over the whole alphabet is known as the entropy of the probability distribution, denoted by H:

    • H = Σ P(s) · I(s) = -Σ P(s) · log2 P(s), where the sums run over all symbols s in the alphabet

Information content of a symbol

  • Provided that the symbols appear independently and with the assumed probabilities, H is a lower bound on compression, measured in bits per symbol, that can be achieved by any coding method
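
As a small illustration of the entropy bound, the following Python sketch computes H for two made-up distributions over a four-symbol alphabet. The function name `entropy` and both distributions are assumptions chosen for the example, not part of the original slides.

```python
import math

def entropy(probabilities):
    """Entropy in bits per symbol of a {symbol: probability} mapping."""
    # Symbols with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

# Hypothetical distributions over the alphabet {a, b, c, d}.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}

print(entropy(uniform))  # 2.0 bits per symbol
print(entropy(skewed))   # less than 2.0: a skewed distribution is more compressible
```

The uniform case needs the full 2 bits per symbol, while the skewed distribution has a lower entropy, which is the room that a symbolwise coder can exploit.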

Information content of a symbol

  • If the probability of symbol ’u’ is estimated to be 2%, the corresponding information content is 5.6 bits

  • if ’u’ happens to be the next symbol that is to be coded, it should be transmitted in 5.6 bits

Information content of a symbol

  • predictions can usually be improved by taking account of the previous symbol

  • if a ’q’ has just occurred, the probability of ’u’ may jump to 95 %, based on how often ’q’ is followed by ’u’ in a sample of text

  • information content of ’u’ in this case is 0.074 bits
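
The two figures above can be checked with a few lines of Python. This is only a sketch; the function name `information_content` is an assumption made for the example.

```python
import math

def information_content(probability):
    """Information content in bits of a symbol with the given probability."""
    return -math.log2(probability)

print(round(information_content(0.02), 2))  # 5.64 (the 'u' with no context)
print(round(information_content(0.95), 3))  # 0.074 (the 'u' after a 'q')
```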

Information content of a symbol

  • Models that take a few immediately preceding symbols into account to make a prediction are called finite-context models of order m

    • m is the number of previous symbols used to make a prediction

Static models

  • There are many ways to estimate the probabilities in a model

  • we could use static modelling:

    • always use the same probabilities for symbols, regardless of what text is being coded

    • the compression system may perform poorly if the text it receives differs from the text the model was built for

      • e.g. a model for English with a file of numbers

Semi-static models

  • One solution is to generate a model specifically for each file that is to be compressed

  • an initial pass is made through the file to estimate symbol probabilities, and these are transmitted to the decoder before transmitting the encoded symbols

  • this is called semi-static modelling

Semi-static models

  • Semi-static modelling has the advantage that the model is invariably better suited to the input than a static one, but the penalty paid is

    • having to transmit the model first,

    • as well as the preliminary pass over the data to accumulate symbol probabilities
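
The preliminary pass of a semi-static scheme might look like the following sketch, which builds a zero-order model from the file itself. The function name `semi_static_model` and the sample string are assumptions; how the model is then serialized and sent to the decoder is not shown.

```python
from collections import Counter

def semi_static_model(text):
    """First pass: estimate each symbol's probability from the text itself."""
    counts = Counter(text)
    total = len(text)
    return {symbol: n / total for symbol, n in counts.items()}

# Hypothetical input; a real system would read the file to be compressed.
model = semi_static_model("abracadabra")
print(model["a"])  # 5 of the 11 characters are 'a'
```

The resulting table is what has to be transmitted before the encoded symbols, which is the overhead mentioned above.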

Adaptive models

  • An adaptive model begins with a bland probability distribution and gradually alters it as more symbols are encountered

  • as an example, assume a zero-order model, i.e., no context is used to predict the next symbol

Adaptive models

  • Assume that an encoder has already encoded a long text and has come to the sentence fragment: It migh

  • now the probability that the next character is ’t’ is estimated to be 49,983/768,078 = 6.5 %, since in the previous text, 49,983 characters of the total of 768,078 characters were ’t’

Adaptive models

  • Using the same system, ’e’ has the probability 9.4 % and ’x’ has probability 0.11 %

  • the model provides this estimated probability distribution to an encoder

  • the decoder is able to generate the same model, because it has decoded the same preceding text and therefore holds the same probability estimates as the encoder
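
A minimal sketch of an adaptive zero-order model follows. The class name `AdaptiveOrder0Model` is hypothetical, and starting every symbol with a count of 1 is an assumption made here so that no symbol ever has probability zero; the slides only say that the initial distribution is "bland".

```python
from collections import Counter

class AdaptiveOrder0Model:
    """Zero-order adaptive model: probabilities are running relative frequencies."""

    def __init__(self, alphabet):
        # Assumption: every symbol starts with count 1 (a flat distribution).
        self.counts = Counter({symbol: 1 for symbol in set(alphabet)})
        self.total = len(self.counts)

    def probability(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Called after a symbol has been encoded (or decoded).
        self.counts[symbol] += 1
        self.total += 1

# Encoder and decoder each keep their own copy of the model.
encoder_model = AdaptiveOrder0Model("abc")
decoder_model = AdaptiveOrder0Model("abc")
for symbol in "abacab":
    # Both sides see the same history, so their estimates always agree.
    assert encoder_model.probability(symbol) == decoder_model.probability(symbol)
    encoder_model.update(symbol)
    decoder_model.update(symbol)

print(encoder_model.probability("a"))
```

Because both models are updated only with symbols that have already been transmitted, no probability table needs to be sent.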

Adaptive models

  • For a higher-order model, such as a first-order model, the probability is estimated by how often that character has occurred in the current context

  • in the zero-order example above, the symbol ’t’ occurred in the context: It migh, but the model made no use of the characters of the phrase

Adaptive models

  • A first-order model would use the final ’h’ as a context with which to condition the probability estimates

  • the letter ’h’ has occurred 37,525 times in the prior text, and 1,133 of these times it was followed by a ’t’

  • the probability of ’t’ occurring after an ’h’ can be estimated to be 1,133/37,525=3.02 %

Adaptive models

  • For ’t’, a prediction of 3.02% is actually worse than in the zero-order model because ’t’ is rare in this context (’e’ follows ’h’ much more often)

  • a second-order model would use the relative frequency with which the context ’gh’ is followed by ’t’, which is the case in 64.6% of its occurrences
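
The effect of the model order can be sketched by counting which symbols follow each context in a sample text. The function names `context_counts` and `estimate` and the short sample string are assumptions for illustration; the sample is far too small to reproduce the percentages quoted above.

```python
from collections import Counter, defaultdict

def context_counts(text, order):
    """Count which symbol follows each context of `order` preceding symbols."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def estimate(counts, context, symbol):
    """Relative frequency of `symbol` after `context` (0.0 for an unseen context)."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[symbol] / sum(seen.values())

sample = "the night light might fight"  # hypothetical prior text
order1 = context_counts(sample, 1)
order2 = context_counts(sample, 2)

print(estimate(order1, "h", "t"))   # 0.8: 'h' is followed by 't' in 4 of 5 cases
print(estimate(order2, "gh", "t"))  # 1.0: 'gh' is always followed by 't' here
```

A longer context usually sharpens the prediction, but it also needs more prior text before the counts become reliable.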

Adaptive models

  • Good: robust, reliable, flexible

  • Bad: not suitable for random access to compressed files

    • a text can be decoded only from the beginning: the model used for coding a particular part of the text is determined from all the preceding text

    • -> not suitable for full-text retrieval

Coding

  • Coding is the task of determining the output representation of a symbol, based on a probability distribution supplied by a model

  • general idea: the coder should output short codewords for likely symbols and long codewords for rare ones

  • symbolwise methods depend heavily on a good coder to achieve compression

Huffman coding

  • A phrase is coded by replacing each of its symbols with the codeword given by a table

  • Huffman coding generates codewords for a set of symbols, given some probability distribution for the symbols

  • this type of code is called a prefix-free code

    • no codeword is the prefix of another symbol’s codeword

Huffman coding

  • The codewords can be stored in a tree (a decoding tree)

  • Huffman’s algorithm works by constructing the decoding tree from the bottom up

Huffman coding

  • Algorithm

    • create for each symbol a leaf node containing the symbol and its probability

    • two nodes with the smallest probabilities become siblings under a new parent node, which is given a probability equal to the sum of its two children’s probabilities

    • the combining operation is repeated until there is only one node without a parent

    • the two branches from every nonleaf node are then labeled 0 and 1
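
The steps above can be sketched in Python with a priority queue. The function name `huffman_codes`, the tuple-based tree representation, and the example probabilities are assumptions made for this illustration; the sketch also assumes a non-empty distribution.

```python
import heapq
from itertools import count

def huffman_codes(probabilities):
    """Build a prefix-free code from a non-empty {symbol: probability} mapping."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    # One leaf node per symbol, stored with its probability.
    heap = [(p, next(tiebreak), ("leaf", s)) for s, p in probabilities.items()]
    heapq.heapify(heap)
    # Repeatedly make the two least probable nodes siblings under a new parent.
    while len(heap) > 1:
        p1, _, node1 = heapq.heappop(heap)
        p2, _, node2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), ("node", node1, node2)))

    codes = {}

    def label(node, prefix):
        if node[0] == "leaf":
            codes[node[1]] = prefix or "0"  # a one-symbol alphabet still gets one bit
        else:
            label(node[1], prefix + "0")  # left branch is labeled 0
            label(node[2], prefix + "1")  # right branch is labeled 1

    label(heap[0][2], "")
    return codes

# Hypothetical distribution; the most likely symbol gets the shortest codeword.
codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(codes)
```

With these probabilities the codeword lengths come out as 1, 2, 3 and 3 bits, which matches the information content of each symbol exactly.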

Huffman coding

  • Huffman coding is generally fast for both encoding and decoding, provided that the probability distribution is static

    • adaptive Huffman coding is possible, but it either needs a lot of memory or is slow

  • coupled with a word-based model (rather than a character-based model), it gives good compression

Dictionary models

  • Dictionary-based compression methods use the principle of replacing substrings in a text with a codeword that identifies that substring in a dictionary

  • dictionary contains a list of substrings and a codeword for each substring

  • often fixed codewords used

    • reasonable compression is obtained even if coding is simple

Dictionary models

  • The simplest dictionary compression methods use small dictionaries

  • for instance, digram coding

    • selected pairs of letters are replaced with codewords

    • a dictionary for the ASCII character set might contain the 128 ASCII characters, as well as 128 common letter pairs

Dictionary models

  • Digram coding…

    • the output codewords are eight bits each

    • the presence of the full ASCII character set ensures that any (ASCII) input can be represented

    • at best, every pair of characters is replaced with a codeword, reducing the input from 7 bits/character to 4 bits/character

    • at worst, each 7 bit character will be expanded to 8 bits
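
A minimal sketch of digram coding, assuming that codewords 0 to 127 stand for the ASCII characters themselves and codewords 128 to 255 index a list of letter pairs. The function names and the four-entry pair list are hypothetical; a real dictionary would hold up to 128 pairs chosen from typical text.

```python
def digram_encode(text, digrams):
    """Encode 7-bit ASCII text into one byte per character or per known pair."""
    index = {pair: 128 + i for i, pair in enumerate(digrams)}
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in index:
            out.append(index[pair])   # one byte covers two characters
            i += 2
        else:
            out.append(ord(text[i]))  # assumes ord(...) < 128 (plain ASCII)
            i += 1
    return bytes(out)

def digram_decode(data, digrams):
    """Invert digram_encode using the same pair list."""
    return "".join(digrams[b - 128] if b >= 128 else chr(b) for b in data)

pairs = ["th", "e ", "in", "he"]  # hypothetical dictionary entries
encoded = digram_encode("the thin hen", pairs)
print(len(encoded))                   # 7 bytes for 12 characters
print(digram_decode(encoded, pairs))  # the thin hen
```

The encoder here simply takes a pair whenever the next two characters form one, which is a greedy choice and not necessarily the best parsing.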

Dictionary models

  • Natural extension:

    • put even larger entries in the dictionary, e.g. common words like ’and’, ’the’,… or common components of words like ’pre’, ’tion’…

  • a predefined set of dictionary phrases makes the compression domain-dependent

    • or very short phrases have to be used -> good compression is not achieved

Dictionary models

  • One way to avoid the problem of the dictionary being unsuitable for the text at hand is to use a semi-static dictionary scheme

    • construct a new dictionary for every text that is to be compressed

    • overhead of transmitting or storing the dictionary is significant

    • decision of which phrases should be included is a difficult problem

Dictionary models

  • Solution: use an adaptive dictionary scheme

  • Ziv-Lempel coders (LZ77 and LZ78)

  • a substring of text is replaced with a pointer to where it has occurred previously

  • dictionary: all the text prior to the current position

  • codewords: pointers

Dictionary models

  • Ziv-Lempel…

    • the prior text makes a very good dictionary since it is usually in the same style and language as upcoming text

    • the dictionary is transmitted implicitly at no extra cost, because the decoder has access to all previously encoded text


  • Key benefits:

    • relatively easy to implement

    • decoding can be performed extremely quickly using only a small amount of memory

  • suitable when the resources required for decoding must be minimized, like when data is distributed or broadcast from a central source to a number of small computers


  • The output of an encoder consists of a sequence of triples, e.g. <3,2,b>

    • the first component of a triple indicates how far back to look in the previous (decoded) text to find the next phrase

    • the second component records how long the phrase is

    • the third component gives the next character from the input


  • The components 1 and 2 constitute a pointer back into the text

  • the component 3 is actually necessary only when the character to be coded does not occur anywhere in the previous input
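
Decoding a sequence of such triples can be sketched in a few lines of Python. The function name `lz77_decode` and the sample triples are assumptions made for this illustration, not output from any particular compressor.

```python
def lz77_decode(triples):
    """Rebuild the text from (distance back, phrase length, next character) triples."""
    out = []
    for distance, length, char in triples:
        start = len(out) - distance
        for k in range(length):
            # Copy one character at a time so that a phrase may overlap
            # the text it is currently producing.
            out.append(out[start + k])
        out.append(char)
    return "".join(out)

# Hypothetical encoder output.
triples = [(0, 0, "a"), (0, 0, "b"), (2, 1, "a"), (3, 2, "b"), (5, 3, "b"), (1, 10, "a")]
print(lz77_decode(triples))
```

The last triple shows why characters are copied one by one: it looks back a single character but copies ten, so it keeps re-reading characters it has just written.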


  • Encoding

    • for the text from the current point ahead:

      • search for the longest match in the previous text

      • output a triple that records the position and length of the match

      • the search for a match may return a length of zero, in which case the position of the match is not relevant

    • search can be accelerated by indexing the prior text with a suitable data structure
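
The encoding loop can be sketched as follows, using a brute-force search over the prior text instead of an index. The function name `lz77_encode` and the parameters `window` and `max_length`, including their default values, are assumptions made for this illustration.

```python
def lz77_encode(text, window=4096, max_length=16):
    """Greedy LZ77-style encoder producing (distance, length, next char) triples."""
    triples = []
    pos = 0
    while pos < len(text):
        best_distance, best_length = 0, 0
        # Always leave one character over for the third component.
        limit = min(max_length, len(text) - pos - 1)
        # Brute-force search of the window for the longest match.
        for start in range(max(0, pos - window), pos):
            length = 0
            while length < limit and text[start + length] == text[pos + length]:
                length += 1
            if length > best_length:
                best_distance, best_length = pos - start, length
        # A length of zero means no match; the distance is then irrelevant.
        triples.append((best_distance, best_length, text[pos + best_length]))
        pos += best_length + 1
    return triples

print(lz77_encode("abab"))  # [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'b')]
print(lz77_encode("aaaa"))  # [(0, 0, 'a'), (1, 2, 'a')]
```

The search here is quadratic in the window size, which is why practical encoders index the prior text; the sketch trades speed for brevity.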


  • there are limits on how far back a pointer can refer and on the maximum size of the string referred to

  • e.g. for English text, a window of a few thousand characters

  • the length of the phrase is limited to e.g. a maximum of 16 characters

  • otherwise too much space is wasted without benefit


  • The decoding program is very simple, so it can be included with the data at very little cost

  • in fact, the compressed data can be stored as part of the decoder program, which makes the data self-expanding

  • common way to distribute files