
The Statistical Nature of English with Implications for Data Compression


Presentation Transcript


  1. The Statistical Nature of English with Implications for Data Compression Joshua Blackburn Communications Theory Honors April 28, 2006

  2. Statistical Structure in Language • Letters • E: 0.1024 • W: 0.0142 • Digrams • TH: 0.0254 • AO: 0.0001 • Not simply the product of the individual letter probabilities (see the worked comparison below) • T: 0.0835 • H: 0.0442 • A: 0.0640 • O: 0.0621 • Trigrams • THE: 0.0167 • QMZ: 0
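
  If letters occurred independently, each digram probability would be the product of its letter probabilities. For TH that product is 0.0835 × 0.0442 ≈ 0.0037, far below the observed 0.0254; for AO it is 0.0640 × 0.0621 ≈ 0.0040, well above the observed 0.0001. Digram statistics therefore capture structure that single-letter frequencies miss.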

  3. Types of Languages • Natural Spoken Language • English: The cat chased the dog up the hill. • German: Die Katze jagte den Hund herauf den Hügel. • Spanish: El gato persiguió el perro encima de la colina. • Polish: Ten kot goniący ten pies w górze ten pagórek. • Czech: Člen určitý kočka cizelovat člen určitý být v patách autobus člen určitý vrch. • Pulse Code Modulated Samples of Continuous Process • Non-Return to Zero • Return to Zero • Mathematical Cases • Sequence of symbols with defined probabilities.

  4. Mathematical Cases • Zeroth Order • All symbols equiprobable. • BDCBCECCCADCBDDAAECEEAABBDAEECACEEBAEECBC • First Order • Accounts for letter probabilities. • P{A, B, C, D, E}={0.4, 0.1, 0.2, 0.2, 0.1} • AAACDCBDCEAADADACEDAEADCABEDADDCECAAAAAD • Second Order • Accounts for transition probabilities. • ABBABABABABABABBBABBBB BABABABABABBBACACABB
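
  A first-order sequence like the one above can be generated by inverse-CDF sampling of the letter probabilities. The following is a minimal MATLAB sketch, not code from the presentation; the variable names are illustrative.

  % First-order approximation: sample symbols A-E independently
  % according to P{A, B, C, D, E} = {0.4, 0.1, 0.2, 0.2, 0.1}.
  symbols = 'ABCDE';
  p       = [0.4 0.1 0.2 0.2 0.1];
  cdf     = cumsum(p) / sum(p);             % cumulative distribution; last entry is exactly 1
  out     = blanks(40);
  for k = 1:40
      u      = rand();                      % uniform draw on (0,1)
      out(k) = symbols(find(u <= cdf, 1));  % first bin the draw falls into
  end
  disp(out)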

  5. Mathematical Cases • Third Order • Uses transition probabilities from the previous two symbols. • Word Analysis • Basic unit of analysis can be words instead of symbols. • Any order of analysis can be used. • Typical First Order Analysis: DAB EE A BEBE DEED DEB ADEE ADEE EE DEB BEBE BEBE BEBE ADEE BED DEED DEED CEED ADEE A DEED DEED BEBE CABED BEBE BED
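
  A word-level first-order approximation is built the same way, except the basic units are whole words drawn according to their relative frequencies. A hypothetical sketch follows; the word list and counts are invented for illustration.

  % First-order word approximation: draw whole words independently
  % according to their relative frequencies.
  words  = {'DEED', 'BEBE', 'ADEE', 'A', 'DEB', 'EE', 'BED', 'DAB'};
  counts = [5 5 4 2 2 2 2 1];              % invented frequency counts
  cdf    = cumsum(counts) / sum(counts);
  for k = 1:25
      w = words{find(rand() <= cdf, 1)};   % inverse-CDF draw of a word
      fprintf('%s ', w);
  end
  fprintf('\n');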

  6. Entropy • Definition: average information per symbol, measured in bits/symbol • Calculation • Average codeword length of a translation (code mapping): • Letters equiprobable • Mapping 1: 2 bits/symbol • Mapping 2: 2.25 bits/symbol • P{A, B, C, D}={0.4, 0.3, 0.2, 0.1} • Mapping 1: 2 bits/symbol • Mapping 2: 1.9 bits/symbol • Intrinsic entropy of the source: • Letters equiprobable: 2 bits/symbol • P{A, B, C, D}={0.4, 0.3, 0.2, 0.1}: 1.846 bits/symbol
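
  The intrinsic figures follow from the entropy formula H = -Σ p(i) log2 p(i). The Mapping 2 averages are consistent with a variable-length code using codeword lengths 1, 2, 3, 3 (for example A→0, B→10, C→110, D→111, an assumed assignment since the slide does not list the codewords): 0.4·1 + 0.3·2 + 0.2·3 + 0.1·3 = 1.9 bits/symbol. A quick MATLAB check of the intrinsic entropy:

  % Entropy of the skewed source P{A, B, C, D} = {0.4, 0.3, 0.2, 0.1}
  p = [0.4 0.3 0.2 0.1];
  H = -sum(p .* log2(p))     % about 1.846 bits/symbol
  % For four equiprobable letters the same formula gives exactly 2 bits/symbol.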

  7. Matlab Implementation • Analyzed the English Language • cleanstring() • Read a standard text file. • Converted uppercase to lowercase. • Removed punctuation. • 60 lines of code. • createpmf() • Mapped the 26 letters and the space character to integers 1 to 27. • Used the mappings of the current letter and previous two as indices into a 27x27x27 frequency table. • Incremented the proper location as each letter was read. • 94 lines of code. • createmarginals() • Created a CDF of each letter conditioned upon the two previous letters. • 66 lines of code.
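
  A compressed sketch of the cleaning and counting steps might look like the following. This is hypothetical code, far shorter than the routines described above, and 'corpus.txt' is a placeholder file name.

  % Clean the text down to lowercase letters and spaces, then map
  % 'a'-'z' to 1-26 and space to 27 and tally trigram frequencies.
  text = lower(fileread('corpus.txt'));     % placeholder input file
  text = regexprep(text, '[^a-z ]', '');    % drop punctuation, digits, etc.
  idx  = double(text) - double('a') + 1;
  idx(text == ' ') = 27;

  freq = zeros(27, 27, 27);                 % indices: two letters back, one letter back, current letter
  for k = 3:length(idx)
      freq(idx(k-2), idx(k-1), idx(k)) = freq(idx(k-2), idx(k-1), idx(k)) + 1;
  end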

  8. Matlab Implementation • Created English Approximation • createEnglish() • Randomly generates a stream of letters according to the probability model for the desired order. • Created the lower order marginal CDFs. • Zeroth Order: randint() • Higher Orders • Used the appropriate CDF to map the uniform rand() output onto the desired nonuniform probability model. • 152 lines of code.
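
  A sketch of the higher-order generation step is shown below. It is hypothetical, not the presentation's createEnglish(); it reuses the freq table from the previous sketch and falls back to a uniform draw for contexts never seen in the corpus.

  % Generate text by sampling each letter from its distribution
  % conditioned on the two previous letters (third-order model).
  alphabet = ['a':'z' ' '];
  n        = 200;
  seq      = zeros(1, n);
  seq(1:2) = [20 8];                                    % arbitrary seed: 't', 'h'
  for k = 3:n
      counts = squeeze(freq(seq(k-2), seq(k-1), :));    % counts of the next letter
      if sum(counts) == 0
          counts = ones(27, 1);                         % unseen context: uniform fallback
      end
      cdf    = cumsum(counts) / sum(counts);
      seq(k) = find(rand() <= cdf, 1);                  % inverse-CDF draw
  end
  disp(alphabet(seq))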

  9. Matlab Implementation • Results • Zeroth Order • qvytekzylybjvadffqhfaumzmlwofswaskwntliffsioeskxxq • Equal presence of the rare letters (q, v, w, x, z, etc.) • First Order • o tew dtgsm eshnmtet ik thy g laftnae iearuac uot • Increased spaces and vowels • Second Order • tivecurm al aris ch at gero inanhah b s tallest hat • Divided into syllables • Third Order • hour goes of ind procaughtiven torst wit mink ing • Pronounceable text

  10. Matlab Implementation • Calculated the Entropy • marginalpmf() • Normalized the frequency tables to sum to one. • 56 lines of code. • entropy() • Calculated the joint entropy for each order of approximation. • First (single letters): 4.0936 bits • Second (letter pairs): 7.4486 bits • Third (letter triples): 10.113 bits • 23 lines of code. • Results (joint entropy divided by block length) • First Order: 4.0936 bits/letter • Second Order: 3.7243 bits/letter • Third Order: 3.371 bits/letter
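
  The per-letter figures are the joint entropies divided by the block length (7.4486 / 2 = 3.7243 and 10.113 / 3 = 3.371). A minimal sketch of that calculation, assuming freq is the trigram table from the earlier sketch rather than the author's routines:

  % Normalize the trigram table to a joint PMF and compute its entropy;
  % dividing by the block length gives bits per letter.
  pmf      = freq / sum(freq(:));
  nz       = pmf(pmf > 0);                  % skip zero-probability trigrams
  H_joint  = -sum(nz .* log2(nz));          % joint entropy in bits per trigram
  H_letter = H_joint / 3;                   % third-order estimate in bits per letter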
