
Algorithms and Data Structures


Presentation Transcript


  1. Algorithms and Data Structures Lecture 11

  2. Agenda: • Text Processing: string matching • Text Processing: packing

  3. Text Processing: • Text processing is a kind of data processing (as text is a kind of data) • A number of general data processing algorithms can be demonstrated on the basis of text processing algorithms • String matching -> sequence matching • Text packing -> data packing • etc.

  4. Text: String matching • Most text editors, such as MS Word and Notepad, perform operations like substring search and replace • For relatively short texts the efficiency of these algorithms hardly matters • Efficiency becomes important when we have to deal with huge data sources (text or otherwise) • Let's consider several string (or, more generally, subsequence) matching algorithms

  5. Text: String matching – basic notions • S = {a,…,z, A,…,Z, 0,…,9} – the set of symbols (letters) • T[0…n-1] – the source text, an array of n symbols from S • P[0…m-1] – the pattern, an array of m symbols from S, m<=n • Pattern P occurs with shift s in text T if 0<=s<=n-m and T[s…s+m-1] = P[0…m-1] • If P occurs with shift s in T, s is a valid shift; otherwise s is an invalid shift • The task of string matching: find all valid shifts s

  6. Text: String matching – basic notions

  7. Text: String matching – simple algorithm • The idea: • Iterate through all possible values of the shift s (from 0 to n-m) • Verify whether P occurs with shift s in T
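
As an illustration, here is a minimal Python sketch of the simple algorithm above (the function name and the test string are ours, not from the slides):

    def naive_match(T, P):
        """Return all valid shifts of pattern P in text T (brute force)."""
        n, m = len(T), len(P)
        shifts = []
        for s in range(n - m + 1):      # try every possible shift
            if T[s:s + m] == P:         # O(m) comparison in the worst case
                shifts.append(s)
        return shifts

    # naive_match("abcabcab", "abc") -> [0, 3]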

  8. Text: String matching – simple algorithm

  9. Text: String matching – simple algorithm • In the worst case a string comparison is O(m), where m is the length of the compared strings • The loop over all possible values of s executes n-m+1 times • So in the worst case the algorithm is Θ((n-m+1)m)

  10. Text: String matching – simple algorithm

  11. Text: String matching – simple algorithm

  12. Text: Rabin-Karp algorithm • Can be applied to various sequences, such as strings or numbers • Each element of a sequence is treated as a digit • Let's consider sequences of letters (i.e. strings) • Each character can be encoded by its ASCII value (from 0 to 255), giving a base-256 numeral system • "abc" -> "97 98 99" • Each string can then be encoded by a single number calculated from the ASCII codes of its letters • "abc" -> 97*256^2 + 98*256 + 99 = 6382179
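
A small sketch of this encoding (the function name is ours; the slide only gives the formula):

    def string_to_number(s):
        """Interpret a string as a base-256 number built from its ASCII codes."""
        value = 0
        for ch in s:
            value = value * 256 + ord(ch)   # shift by one "digit", add the next code
        return value

    # string_to_number("abc") == 97*256**2 + 98*256 + 99 == 6382179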

  13. Text: Rabin-Karp algorithm • If two strings match, their numbers are equal • Pattern number p: the number representing the pattern string P[0…m-1] • Slice number t_k: the number representing the source text slice T[s_k…s_k+m-1], where s_k is a possible shift, 0<=k<=n-m • The total number of slices is n-m+1 • The ideas behind the Rabin-Karp algorithm are: • - Calculate the corresponding numbers of the pattern and the slices • - Compare the "pattern number" against all the "slice numbers"; if they are equal, a valid shift has been found

  14. Text: Rabin-Karp algorithm

  15. Text: Rabin-Karp algorithm • Calculating the "pattern number" p of an m-length pattern consumes O(m): p = P[0]*256^(m-1) + P[1]*256^(m-2) + … + P[m-1] • Calculating a "slice number" t_k consumes O(m) too: t_k = T[s_k]*256^(m-1) + T[s_k+1]*256^(m-2) + … + T[s_k+m-1] • In total: O(m) + O(m)*(n-m+1) – not a good result

  16. Text: Rabin-Karp algorithm • Fortunately t_{k+1} can be calculated on the basis of t_k, which consumes constant time O(1): t_{k+1} = (t_k – T[s_k]*256^(m-1))*256 + T[s_k+m] • h = 256^(m-1) is a constant used in each t_k calculation • Now we have to calculate p and t_0, which both take O(m), and all the t_k for k=1..n-m, each of which takes O(1) • In total: 2*O(m) + (n-m)*O(1) = O(n+m)
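
A sketch of this rolling recalculation (it reuses string_to_number from above; all names are illustrative):

    def all_slice_numbers(T, m):
        """Numbers t_0 .. t_{n-m} of all m-length slices of T, via rolling updates."""
        n = len(T)
        h = 256 ** (m - 1)                  # weight of the leading "digit"
        t = string_to_number(T[0:m])        # t_0, costs O(m)
        numbers = [t]
        for k in range(n - m):              # each update costs O(1)
            t = (t - ord(T[k]) * h) * 256 + ord(T[k + m])
            numbers.append(t)
        return numbers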

  17. Text: Rabin-Karp algorithm

  18. Text: Rabin-Karp algorithm • The "pattern" and "slice" numbers may be quite large and may exceed a machine "word" • To avoid this problem they are calculated modulo some q • The equality a mod q = b mod q does not mean that a = b, so there is a chance of a "conflict" • To resolve a conflict when the values of p and t_k match, we compare the pattern and the slice character by character • If the comparison succeeds (pattern and slice match) we have found a valid shift; otherwise the equality of the numbers was a "spurious hit"

  19. Text: Rabin-Karp algorithm • q is usually the greatest prime number such that q*256 (for our "string" version of the algorithm) still fits into a machine WORD (on Windows we can use "int") • In general q does not have to be prime, but a non-prime q may increase the number of possible "conflicts" • In the worst case the Rabin-Karp algorithm consumes Θ((n-m+1)m), but on average it shows quite good performance
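
Putting the pieces together, here is a minimal Rabin-Karp sketch with arithmetic modulo a prime q. The particular prime is our choice for illustration (999983, the largest prime below 10^6, so q*256 fits comfortably into a 32-bit word); any prime with that property would do:

    def rabin_karp(T, P, q=999983):
        """Return all valid shifts of P in T; spurious hits are filtered out."""
        n, m = len(T), len(P)
        if m > n:
            return []
        h = pow(256, m - 1, q)              # 256^(m-1) mod q
        p = t = 0
        for i in range(m):                  # compute p and t_0, O(m) each
            p = (p * 256 + ord(P[i])) % q
            t = (t * 256 + ord(T[i])) % q
        shifts = []
        for s in range(n - m + 1):
            if p == t and T[s:s + m] == P:  # recheck to rule out a spurious hit
                shifts.append(s)
            if s < n - m:                   # roll the hash to the next slice
                t = ((t - ord(T[s]) * h) * 256 + ord(T[s + m])) % q
        return shifts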

  20. Text: Rabin-Karp algorithm

  21. Text: Rabin-Karp algorithm

  22. Text: Rabin-Karp algorithm

  23. Text: packing • There are a number of packers available on the market (e.g. zip, WinZip, RAR, arj and others) • All of them have the same purpose: to reduce the size of files • There are distinct ideas behind the algorithms: • Well-known packers like WinZip target any file; they apply the same packing algorithm to files of all types and demonstrate quite good results, but not always (e.g. mp3 files) • Some packers are designed for specific data, e.g. video, audio and image files; the results of such packing are the well-known file types: video – MPEG, audio – MP3, image – JPEG • Let's consider some ideas that can be used for text packing

  24. Text: packing • Every character is usually encoded by some number of bits • E.g. the ASCII coding table: each character is represented by 8 bits, so a text of n letters consumes n*8 bits of memory • Most texts include only a part of the 256 ASCII symbols; often the number of distinct symbols is limited, which means the characters actually used can be encoded with fewer than 8 bits • E.g. given a 320-symbol text that is a combination of only 13 distinct characters, it is clear that each letter can be encoded with 4 bits (2^4 = 16 >= 13) • By designing a special coding table (specific to the given text) we can reduce the amount of memory consumed (ASCII: 8*320 = 2560 bits, ours: 4*320 = 1280 bits)
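
A tiny worked check of the slide's arithmetic (purely illustrative):

    import math
    bits = math.ceil(math.log2(13))     # 13 distinct symbols -> 4 bits each
    print(8 * 320, bits * 320)          # ASCII: 2560 bits, ours: 1280 bits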

  25. Text: packing • In order to decode the packed text we will need "our" coding table, so it must be stored along with the packed data • Another idea behind text packing is to code distinct symbols with codes of different lengths (the ASCII table uses a fixed code length – 8 bits) • Some symbols are very common in texts, others are rare • The idea is to code common symbols with fewer bits than rare symbols

  26. Text: packing • As can be seen, both ideas assume that the source text is analyzed before packing • Let's combine both ideas into a packing algorithm • First we walk through the text and construct a coding table that includes only the symbols actually used • Second, we count the frequencies of symbol occurrences in the text, in order to code the most common symbols with fewer bits • When the desired statistics are ready we can pack the text; the coding table will be needed for unpacking and therefore must be stored somewhere

  27. Text: packing-designing our code table • Gathering statistics: for each ASCII symbol we find out whether it occurs in the text or not; if it is present, we count how many times it occurs • Statistics analysis: each occurring symbol is assigned a new code; the code depends on how many times the symbol occurs in the text – frequent symbols are coded with fewer bits than rare ones • Finally we have the set of symbols (with their frequencies) that occur in the text
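
The "gathering statistics" step is a plain frequency count; a minimal sketch (the sample text is our own):

    from collections import Counter
    freq = Counter("this is a sample text")   # symbol -> number of occurrences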

  28. Text: packing-designing our code table

  29. Text: packing-designing our code table • A Huffman tree is very convenient for such tasks • A Huffman tree is a binary tree whose leaves contain the ASCII characters that occur in the text • Each node of the tree has a frequency attribute – it indicates how many times the symbol occurs in the text • The edge to a left child is denoted by 0, the edge to a right child by 1 • So the path from the root to a leaf uniquely identifies a symbol; the path is a sequence s of 0s and 1s, e.g. s=010001 • The sequence s is the binary representation of the symbol's code in "our" code table

  30. Text: packing-designing our code table

  31. Text: packing-designing our code table • Building the Huffman tree • Input: the set of ASCII symbols that occur in the text, with their frequencies; let's denote it as S • Output: the Huffman tree with assigned codes • 1. Remove the two elements with the smallest frequency from S • 2. Create a node n and attach the two removed elements as its children (left and right); the frequency of n is the sum of its children's frequencies • 3. Add n to S • 4. If there is more than one element in S, continue from step 1
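
A minimal Python sketch of this loop, using a heap to extract the two lowest-frequency elements (all names are ours; it assumes at least two distinct symbols):

    import heapq

    def build_huffman(freq):
        """Build the tree (nested tuples) and code table from {symbol: count}."""
        # i is a unique tiebreaker so the heap never compares node payloads
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:                    # steps 1-4 of the slide
            f1, _, left = heapq.heappop(heap)   # two smallest frequencies
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, i, (left, right)))  # new node n
            i += 1
        tree = heap[0][2]
        codes = {}
        def walk(node, path):                   # left edge = "0", right edge = "1"
            if isinstance(node, tuple):
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path              # leaf: path is the symbol's code
        walk(tree, "")
        return tree, codes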

  32. Text: packing algorithm • All the required information is gathered and the Huffman tree is built; now we can proceed with packing • 1. Read a symbol of the text • 2. Find it in the Huffman tree • 3. Going upwards through the tree, calculate its binary code • 4. Store the code to the output • 5. If the end of the text has not been reached, continue from step 1
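
With the code table derived from the tree (as in the sketch above), the packing loop reduces to a lookup per symbol; walking the tree upwards, as the slide describes, yields the same code. A sketch:

    def pack(text, codes):
        """Steps 1-5: emit the code of each symbol as a string of '0'/'1' bits."""
        return "".join(codes[ch] for ch in text)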

  33. Text: unpacking algorithm • 1. Start from the root of the Huffman tree • 2. Read a bit of the packed data and move downwards; the direction depends on the value of the bit: if 0, go left, otherwise go right • 3. If a leaf is reached, read its ASCII value, place it in the output buffer and return to the root • 4. If the end of the data has not been reached, continue from step 2
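
A sketch of this walk over the nested-tuple tree built above (names are ours):

    def unpack(bits, tree):
        """Walk from the root; 0 goes left, 1 goes right; emit a symbol at each leaf."""
        out, node = [], tree
        for b in bits:                           # step 2: one bit, one move
            node = node[0] if b == "0" else node[1]
            if not isinstance(node, tuple):      # step 3: a leaf is reached
                out.append(node)
                node = tree                      # back to the root
        return "".join(out)

    # Round trip: tree, codes = build_huffman(freq)
    #             unpack(pack(text, codes), tree) == text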

  34.–43. Text: packing-sample (a worked packing example over ten slides; the figures are not preserved in the transcript)

  44. Q & A
