
Algorithms and Data Structures


Presentation Transcript


  1. Algorithms and Data Structures Lecture 11

  2. Agenda: • Text Processing: string matching • Text Processing: packing

  3. Text Processing: • Text processing is a kind of data processing (as text is a kind of data) • A number of general data processing algorithms can be demonstrated on the basis of text processing algorithms • String matching -> sequence matching • Text packing -> data packing • etc.

  4. Text: String matching • Most text editors, such as MS Word and Notepad, perform operations like substring search and replace • For relatively short texts the efficiency of these algorithms hardly matters • Efficiency becomes important when we have to deal with huge data sources (text or otherwise) • Let's consider several string (or, more generally, subsequence) matching algorithms

  5. Text: String matching – basic notions • S = {a,…,z, A,…,Z, 0,…,9} – the set of symbols (letters) • T[0…n-1] – the source text, an array of n symbols from S • P[0…m-1] – the pattern, an array of m symbols from S, m<=n • Pattern P occurs with shift s in text T if 0<=s<=n-m and T[s…s+m-1] = P[0…m-1] • If P occurs with shift s in T, s is a valid shift; otherwise s is an invalid shift • The task of string matching: find all valid shifts s

  6. Text: String matching – basic notions

  7. Text: String matching – simple algorithm • The idea: • Iterate through all possible values of the shift s (from 0 to n-m) • Verify whether P occurs with shift s in T
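
As an illustration, here is a minimal Python sketch of the simple algorithm above (the function name and the test string are ours, not from the slides):

    def naive_match(T, P):
        """Return all valid shifts of pattern P in text T (brute force)."""
        n, m = len(T), len(P)
        shifts = []
        for s in range(n - m + 1):      # try every possible shift
            if T[s:s + m] == P:         # O(m) comparison in the worst case
                shifts.append(s)
        return shifts

    # naive_match("abcabcab", "abc") -> [0, 3]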

  8. Text: String matching – simple algorithm

  9. Text: String matching – simple algorithm • In the worst case a string comparison is O(m), where m is the length of the compared strings • The loop over all possible values of s executes n-m+1 times • So in the worst case the algorithm is Θ((n-m+1)m)

  10. Text: String matching – simple algorithm

  11. Text: String matching – simple algorithm

  12. Text: Rabin-Karp algorithm • Can be applied to various sequences, such as strings or numbers • Each element of a sequence is treated as a digit • Let's consider sequences of letters (i.e. strings) • Each character can be encoded by its ASCII value (from 0 to 255), giving a base-256 numeral system • "abc" -> "97 98 99" • Each string can then be encoded by a single number calculated from the ASCII codes of its letters • "abc" -> 97*256^2 + 98*256 + 99 = 6382179
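
A small sketch of this encoding (the function name is ours; the slide only gives the formula):

    def string_to_number(s):
        """Interpret a string as a base-256 number built from its ASCII codes."""
        value = 0
        for ch in s:
            value = value * 256 + ord(ch)   # shift by one "digit", add the next code
        return value

    # string_to_number("abc") == 97*256**2 + 98*256 + 99 == 6382179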

  13. Text: Rabin-Karp algorithm • If two strings match, their numbers are equal • Pattern number p: the number representing the pattern string P[0…m-1] • Slice number t_k: the number representing the source text slice T[s_k…s_k+m-1], where s_k is a possible shift, 0<=k<=n-m • The total number of slices is n-m+1 • The ideas behind the Rabin-Karp algorithm are: • - Calculate the corresponding numbers of the pattern and the slices • - Compare the "pattern number" against all the "slice numbers"; if they are equal, a valid shift has been found

  14. Text: Rabin-Karp algorithm

  15. Text: Rabin-Karp algorithm • Calculating the "pattern number" p of an m-length pattern consumes O(m): p = P[0]*256^(m-1) + P[1]*256^(m-2) + … + P[m-1] • Calculating a "slice number" t_k consumes O(m) too: t_k = T[s_k]*256^(m-1) + T[s_k+1]*256^(m-2) + … + T[s_k+m-1] • In total: O(m) + O(m)*(n-m+1) – not a good result

  16. Text: Rabin-Karp algorithm • Fortunately t_{k+1} can be calculated on the basis of t_k, which consumes constant time O(1): t_{k+1} = (t_k – T[s_k]*256^(m-1))*256 + T[s_k+m] • h = 256^(m-1) is a constant used in each t_k calculation • Now we have to calculate p and t_0, which both take O(m), and all the t_k for k=1..n-m, each of which takes O(1) • In total: 2*O(m) + (n-m)*O(1) = O(n+m)
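
A sketch of this rolling recalculation (it reuses string_to_number from above; all names are illustrative):

    def all_slice_numbers(T, m):
        """Numbers t_0 .. t_{n-m} of all m-length slices of T, via rolling updates."""
        n = len(T)
        h = 256 ** (m - 1)                  # weight of the leading "digit"
        t = string_to_number(T[0:m])        # t_0, costs O(m)
        numbers = [t]
        for k in range(n - m):              # each update costs O(1)
            t = (t - ord(T[k]) * h) * 256 + ord(T[k + m])
            numbers.append(t)
        return numbers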

  17. Text: Rabin-Karp algorithm

  18. Text: Rabin-Karp algorithm • The "pattern" and "slice" numbers may be quite large and may exceed a machine "word" • To avoid this problem they are calculated modulo some q • The equality a mod q = b mod q does not mean that a = b, so there is a chance of a "conflict" • To resolve a conflict when the values of p and t_k match, we compare the pattern and the slice character by character • If the comparison succeeds (pattern and slice match) we have found a valid shift; otherwise the equality of the numbers was a "spurious hit"

  19. Text: Rabin-Karp algorithm • q is usually the greatest prime number such that q*256 (for our "string" version of the algorithm) still fits into a machine WORD (on Windows we can use "int") • In general q does not have to be prime, but a non-prime q may increase the number of possible "conflicts" • In the worst case the Rabin-Karp algorithm consumes Θ((n-m+1)m), but on average it shows quite good performance
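
Putting the pieces together, here is a minimal Rabin-Karp sketch with arithmetic modulo a prime q. The particular prime is our choice for illustration (999983, the largest prime below 10^6, so q*256 fits comfortably into a 32-bit word); any prime with that property would do:

    def rabin_karp(T, P, q=999983):
        """Return all valid shifts of P in T; spurious hits are filtered out."""
        n, m = len(T), len(P)
        if m > n:
            return []
        h = pow(256, m - 1, q)              # 256^(m-1) mod q
        p = t = 0
        for i in range(m):                  # compute p and t_0, O(m) each
            p = (p * 256 + ord(P[i])) % q
            t = (t * 256 + ord(T[i])) % q
        shifts = []
        for s in range(n - m + 1):
            if p == t and T[s:s + m] == P:  # recheck to rule out a spurious hit
                shifts.append(s)
            if s < n - m:                   # roll the hash to the next slice
                t = ((t - ord(T[s]) * h) * 256 + ord(T[s + m])) % q
        return shifts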

  20. Text: Rabin-Karp algorithm

  21. Text: Rabin-Karp algorithm

  22. Text: Rabin-Karp algorithm

  23. Text: packing • There are a number of packers available on the market (e.g. zip, WinZip, RAR, arj and others) • All of them have the same purpose: to reduce the size of files • There are distinct ideas behind the algorithms: • Well-known packers like WinZip target any file; they apply the same packing algorithm to files of all types and demonstrate quite good results, but not always (e.g. mp3 files) • Some packers are designed for specific data, e.g. video, audio and image files; the results of such packing are the well-known file types: video – MPEG, audio – MP3, image – JPEG • Let's consider some ideas that can be used for text packing

  24. Text: packing • Every character is usually encoded by some number of bits • E.g. the ASCII coding table: each character is represented by 8 bits, so a text of n letters consumes n*8 bits of memory • Most texts include only a part of the 256 ASCII symbols; often the number of distinct symbols is limited, which means the characters actually used can be encoded with fewer than 8 bits • E.g. given a 320-symbol text that is a combination of only 13 distinct characters, it is clear that each letter can be encoded with 4 bits (2^4 = 16 >= 13) • By designing a special coding table (specific to the given text) we can reduce the amount of memory consumed (ASCII: 8*320 = 2560 bits, ours: 4*320 = 1280 bits)
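
A tiny worked check of the slide's arithmetic (purely illustrative):

    import math
    bits = math.ceil(math.log2(13))     # 13 distinct symbols -> 4 bits each
    print(8 * 320, bits * 320)          # ASCII: 2560 bits, ours: 1280 bits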

  25. Text: packing • In order to decode the packed text we will need "our" coding table, so it must be stored along with the packed data • Another idea behind text packing is to code distinct symbols with codes of different lengths (the ASCII table uses a fixed code length – 8 bits) • Some symbols are very common in texts, others are rare • The idea is to code common symbols with fewer bits than rare symbols

  26. Text: packing • As can be seen, both ideas assume that the source text is analyzed before packing • Let's combine both ideas into a packing algorithm • First we walk through the text and construct a coding table that includes only the symbols actually used • Second, we count the frequencies of symbol occurrences in the text, in order to code the most common symbols with fewer bits • When the desired statistics are ready we can pack the text; the coding table will be needed for unpacking and therefore must be stored somewhere

  27. Text: packing-designing our code table • Gathering statistics: for each ASCII symbol we find out whether it occurs in the text or not; if it is present, we count how many times it occurs • Statistics analysis: each occurring symbol is assigned a new code; the code depends on how many times the symbol occurs in the text – frequent symbols are coded with fewer bits than rare ones • Finally we have the set of symbols (with their frequencies) that occur in the text
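
The "gathering statistics" step is a plain frequency count; a minimal sketch (the sample text is our own):

    from collections import Counter
    freq = Counter("this is a sample text")   # symbol -> number of occurrences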

  28. Text: packing-designing our code table

  29. Text: packing-designing our code table • A Huffman tree is very convenient for such tasks • A Huffman tree is a binary tree whose leaves contain the ASCII characters that occur in the text • Each node of the tree has a frequency attribute – it indicates how many times the symbol occurs in the text • The edge to a left child is denoted by 0, the edge to a right child by 1 • So the path from the root to a leaf uniquely identifies a symbol; the path is a sequence s of 0s and 1s, e.g. s=010001 • The sequence s is the binary representation of the symbol's code in "our" code table

  30. Text: packing-designing our code table

  31. Text: packing-designing our code table • Building the Huffman tree • Input: the set of ASCII symbols that occur in the text, with their frequencies; let's denote it as S • Output: the Huffman tree with assigned codes • 1. Remove the two elements with the smallest frequency from S • 2. Create a node n and attach the two removed elements as its children (left and right); the frequency of n is the sum of its children's frequencies • 3. Add n to S • 4. If there is more than one element in S, continue from step 1
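
A minimal Python sketch of this loop, using a heap to extract the two lowest-frequency elements (all names are ours; it assumes at least two distinct symbols):

    import heapq

    def build_huffman(freq):
        """Build the tree (nested tuples) and code table from {symbol: count}."""
        # i is a unique tiebreaker so the heap never compares node payloads
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:                    # steps 1-4 of the slide
            f1, _, left = heapq.heappop(heap)   # two smallest frequencies
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, i, (left, right)))  # new node n
            i += 1
        tree = heap[0][2]
        codes = {}
        def walk(node, path):                   # left edge = "0", right edge = "1"
            if isinstance(node, tuple):
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path              # leaf: path is the symbol's code
        walk(tree, "")
        return tree, codes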

  32. Text: packing algorithm • All the required information is gathered and the Huffman tree is built; now we can proceed with packing • 1. Read a symbol of the text • 2. Find it in the Huffman tree • 3. Going upwards through the tree, calculate its binary code • 4. Store the code to the output • 5. If the end of the text has not been reached, continue from step 1
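
With the code table derived from the tree (as in the sketch above), the packing loop reduces to a lookup per symbol; walking the tree upwards, as the slide describes, yields the same code. A sketch:

    def pack(text, codes):
        """Steps 1-5: emit the code of each symbol as a string of '0'/'1' bits."""
        return "".join(codes[ch] for ch in text)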

  33. Text: unpacking algorithm • 1. Start from the root of the Huffman tree • 2. Read a bit of the packed data and move downwards; the direction depends on the value of the bit: if 0, go left, otherwise go right • 3. If a leaf is reached, read its ASCII value, place it in the output buffer and return to the root • 4. If the end of the data has not been reached, continue from step 2
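
A sketch of this walk over the nested-tuple tree built above (names are ours):

    def unpack(bits, tree):
        """Walk from the root; 0 goes left, 1 goes right; emit a symbol at each leaf."""
        out, node = [], tree
        for b in bits:                           # step 2: one bit, one move
            node = node[0] if b == "0" else node[1]
            if not isinstance(node, tuple):      # step 3: a leaf is reached
                out.append(node)
                node = tree                      # back to the root
        return "".join(out)

    # Round trip: tree, codes = build_huffman(freq)
    #             unpack(pack(text, codes), tree) == text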

  34.–43. Text: packing-sample (a worked packing example over ten slides; the figures are not preserved in the transcript)

  44. Q & A
