
Using Variable to Variable Compression Codes to Characterize Gene Sequence


marly

Presentation Transcript


  1. Using Variable to Variable Compression Codes to Characterize Gene Sequence

  2. We haven’t done this yet. • Any suggestions or ideas are welcome – that’s why I’m up here today. • This is an idea that we think might work, not a project that we’ve already done; so I’ll say up front that we don’t have any evidence yet that this is going to work.

  3. The Big Picture • Variable to Variable length codes are a type of compression scheme that makes use of Huffman encoding. • I’ll explain exactly how they work in a minute. • They compress different types of information with varying degrees of effectiveness.

  4. How do we use this in informatics? • These codes offer us a great deal of flexibility in how we encode the information. • It is possible to come up with a large number of coding schemes automatically using a genetic algorithm or the like. • From this large number of codes, it is possible to select a few that are very good at compressing certain types of gene sequence (e.g., exons, introns, promoter regions).

  5. Why Compress Sequence? • We don’t care about compressing gene sequence! • We do care about how well the compression algorithms work on a given type of sequence. • Our hypothesis is that a code that is very effective at compressing some types of sequence will be less effective at compressing other types.

  6. Not Just Information Content • Effectiveness of a compression scheme does suggest information content of sequence • That is one property of the sequence, but not the only thing we are looking at • Nature of these codes seems to tell us something about what subsequences are present in the sequence in addition to information content (I’ll come back to this point.)

  7. Some Background • Entropy • Huffman codes over fixed-length symbols • Variable to Variable codes

  8. Entropy • Conceptually, this is a measure of how much information is in the message. • Gives an ideal maximum for how well a lossless compression scheme can work. • There is an equation which defines this quantitatively. Exact definition not relevant for this discussion. • Redundancy is the average code length actually achieved minus the entropy. It is a measure of how well a compression scheme works.
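
For reference, the equation the slide skips is Shannon's entropy, H = -sum(p * log2(p)) over the symbol probabilities. The snippet below is a minimal sketch of entropy and redundancy as defined on this slide; the function names and the four-symbol source are made up for illustration and are not from the talk.

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def redundancy(probs, code_lengths):
    """Average code word length minus entropy, in bits per symbol."""
    avg_len = sum(p * n for p, n in zip(probs, code_lengths))
    return avg_len - entropy(probs)

# Hypothetical four-symbol source and a prefix code with word lengths 1, 2, 3, 3.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

h = entropy(probs)
r = redundancy(probs, lengths)
print(h, r)
```

For this particular source the code lengths match the probabilities exactly, so the redundancy comes out as zero; any other code for the same source would give a positive value.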

  9. Huffman Codes • Encode a message made up of discrete symbols into a coded message made up of variable length code words • Compression is achieved by encoding the most common symbols with shorter code words

  10. Example • Message made up of a string of A’s and B’s • BAAABAAAAABAABABAABABAABBB • Split into pairs, there are four symbols here: AA, AB, BA, BB • Probabilities (out of 13 pairs): AA - 0.308, BA - 0.385, AB - 0.231, BB - 0.077

  11. Our Code is Now: • AA - 10, BA - 0 • AB - 110, BB - 111 • Message: BAAABAAAAABAABABAABABAABBB • Encoded: 0100101001101101000110111
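
The pair example above can be checked with a short script. This is a sketch, not the presenter's code: `huffman_lengths` is a hypothetical helper that returns only the code word lengths a Huffman construction would assign, and `slide_code` is the code word table copied from the slide.

```python
import heapq
from collections import Counter

def huffman_lengths(weights):
    """Return {symbol: code word length} for a Huffman code over `weights`."""
    # Heap entries are (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

message = "BAAABAAAAABAABABAABABAABBB"
pairs = [message[i:i + 2] for i in range(0, len(message), 2)]
counts = Counter(pairs)            # how often each pair occurs
lengths = huffman_lengths(counts)  # code word length per pair

# Code word table as given on the slide.
slide_code = {"AA": "10", "BA": "0", "AB": "110", "BB": "111"}
encoded = "".join(slide_code[p] for p in pairs)

print(dict(counts))
print(lengths)
print(encoded, len(encoded))
```

The lengths computed this way agree with the slide's table (1 bit for BA, 2 for AA, 3 each for AB and BB), and the 26-letter message encodes to 25 bits.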

  12. Variable to Variable Code • Huffman process described above is used • Difference is that instead of coding for equal length symbols, we code for variable length symbols

  13. Example • Using the example above, instead of coding for AA, AB, BA and BB we could use: AAA, B, AB, AAB

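  14. Symbol statistics (L = symbol length; P = number of times the symbol occurs in the parsed message, divided by the 26 letters of the message)

      Symbol   L   P      P*L
      AAA      3   .077   .231
      B        1   .154   .154
      AB       2   .077   .154
      AAB      3   .154   .462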

  15. Code Words:

      Symbol   Code word
      AAA      10
      B        110
      AB       111
      AAB      0

  16. BAAABAAAAABAABABAABABAABBB • Parsed into symbols: B AAA B AAA AAB AAB AB AAB AB AAB B B • Encoded with the code words above: 11010110100011101110110110
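
The variable to variable example can be reproduced the same way. The sketch below assumes the symbol set is complete and prefix-free, so at every position exactly one symbol matches; `parse` and `decode` are hypothetical helper names, and `code` is the table from the Code Words slide.

```python
def parse(message, symbols):
    """Split `message` into symbols from a complete, prefix-free set."""
    out, i = [], 0
    max_len = max(len(s) for s in symbols)
    while i < len(message):
        for n in range(1, max_len + 1):
            if message[i:i + n] in symbols:
                out.append(message[i:i + n])
                i += n
                break
        else:
            # No symbol matches here (e.g. the message ends mid-symbol).
            raise ValueError(f"cannot parse tail {message[i:]!r}")
    return out

def decode(bits, code):
    """Invert a prefix-free code by reading bits until a code word matches."""
    inverse = {word: sym for sym, word in code.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)

# Code word table as given on the Code Words slide.
code = {"AAA": "10", "B": "110", "AB": "111", "AAB": "0"}
message = "BAAABAAAAABAABABAABABAABBB"

symbols = parse(message, set(code))
bits = "".join(code[s] for s in symbols)

print(symbols)
print(len(bits))
print(decode(bits, code) == message)
```

With this symbol set the message splits into 12 symbols and encodes to 26 bits, and decoding the bit string recovers the original message.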

  17. Variable to variable code gives many possible sets of symbols with which to code, even in the binary case (i.e., with only A and B in our alphabet). Four such sets, one per column:

      Set 1   Set 2   Set 3   Set 4
      AAA     AAAA    AAAAA   AAAAAA
      B       B       B       B
      AB      AB      AB      AB
      AAB     AAB     AAB     AAB
              AAAB    AAAB    AAAB
                      AAAAB   AAAAB
                              AAAAAB

  18. If we are encoding a simple binary source (only A and B possible), the effectiveness of each of these compression schemes depends on the distribution of symbols in the source (i.e., if the probability of any given symbol being an ‘A’ is .15, the compression performance will be different than if the probability of an ‘A’ is .35).

  19. Also, each symbol set described above has a different probability distribution for which it works best
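
One way to see this dependence is to compute, for each of the symbol sets listed above, the expected number of code bits per source letter. The sketch below assumes an independent, identically distributed binary source and a Huffman code built from the resulting symbol probabilities; both assumptions, and all function names, are illustrative rather than taken from the talk.

```python
import heapq

def symbol_set(n):
    """Set with a run of n A's plus 'B', 'AB', 'AAB', ... (k A's then B, k < n)."""
    return ["A" * n] + ["A" * k + "B" for k in range(n)]

def huffman_lengths(weights):
    """Return {symbol: code word length} for a Huffman code over `weights`."""
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def bits_per_letter(n, p_a):
    """Expected code bits per source letter when each letter is 'A' with probability p_a."""
    probs = {s: p_a ** s.count("A") * (1 - p_a) ** s.count("B")
             for s in symbol_set(n)}
    lengths = huffman_lengths(probs)
    code_bits = sum(probs[s] * lengths[s] for s in probs)  # bits per parsed symbol
    letters = sum(probs[s] * len(s) for s in probs)        # letters per parsed symbol
    return code_bits / letters

for p_a in (0.15, 0.35, 0.9):
    print(p_a, [round(bits_per_letter(n, p_a), 3) for n in (3, 4, 5, 6)])
```

Because every set here contains a long run of A's, these particular sets should do better as the probability of 'A' grows; sets built around other patterns would favour other distributions.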

  20. Back to Genomics • The genetic alphabet is not binary; it is base 4 (A, C, G, T). • We have even more flexibility for creating sets of code words in base 4 than in base 2. • It stands to reason that a given symbol set might exhibit a range of effectiveness on different types of sequence (this is what we intend to test).

  21. A genetic algorithm would be used to test large numbers of these symbol sets. • Redundancy would be our fitness function. • Hope is that code schemes that are highly optimized to compress one type of sequence (exons, for example) would be less effective at compressing other sequence (non-coding sequence, for example).
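
A rough sketch of what such a search might look like follows. Everything in it is an assumption made for illustration: the search is a mutation-only hill climb standing in for a full genetic algorithm, the only mutation is splitting one symbol into its four one-base extensions, and fitness is compressed bits per base rather than redundancy, since the true entropy of a real sequence is not known.

```python
import heapq
import random
from collections import Counter

BASES = "ACGT"

def huffman_lengths(weights):
    """Return {symbol: code word length} for a Huffman code over `weights`."""
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def parse(seq, symbols):
    """Split seq into symbols; a tail that matches no symbol is dropped."""
    out, i, max_len = [], 0, max(map(len, symbols))
    while i < len(seq):
        for n in range(1, max_len + 1):
            if seq[i:i + n] in symbols:
                out.append(seq[i:i + n])
                i += n
                break
        else:
            break  # incomplete symbol at the end of the sequence
    return out

def bits_per_base(seq, symbols):
    """Fitness stand-in: Huffman-coded bits per base of seq (lower is better)."""
    parsed = parse(seq, symbols)
    counts = Counter({s: 0 for s in symbols})  # unused symbols still get a code word
    counts.update(parsed)
    lengths = huffman_lengths(counts)
    return sum(lengths[s] for s in parsed) / sum(len(s) for s in parsed)

def mutate(symbols, rng):
    """Split one symbol into its four extensions; the set stays complete and prefix-free."""
    victim = rng.choice(sorted(symbols))
    return (symbols - {victim}) | {victim + b for b in BASES}

def evolve(seq, generations=30, seed=0):
    """Keep a mutated symbol set only when it compresses seq better."""
    rng = random.Random(seed)
    best = frozenset(BASES)
    best_fit = bits_per_base(seq, best)
    for _ in range(generations):
        child = frozenset(mutate(best, rng))
        fit = bits_per_base(seq, child)
        if fit < best_fit:
            best, best_fit = child, fit
    return best, best_fit

best, fit = evolve("ACGT" * 50)  # toy training sequence, not real data
print(sorted(best), fit)
```

In a real experiment the toy sequence would be replaced by training sequence of one type (exons, say), and the resulting symbol sets would then be scored on other types of sequence.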

  22. This would be useful because the genome could then be “scanned” with the compression algorithm to find regions that it is more or less effective on. • This data could be used as a component in predicting certain types of sequence (coding vs. non-coding for example)
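
The scanning step could be sketched as a sliding window that reports compressed bits per base at each position. The code word lengths, window size, step, and test sequence below are all hypothetical, and single-base symbols are used only to keep the example short; a real scan would use a variable-length symbol set selected as described above.

```python
def parse(seq, symbols):
    """Split seq into symbols; a tail that matches no symbol is dropped."""
    out, i, max_len = [], 0, max(map(len, symbols))
    while i < len(seq):
        for n in range(1, max_len + 1):
            if seq[i:i + n] in symbols:
                out.append(seq[i:i + n])
                i += n
                break
        else:
            break  # incomplete symbol at the end of the window
    return out

def scan(seq, code_lengths, window=12, step=6):
    """Yield (window start, bits per base) using one fixed code for every window."""
    symbols = set(code_lengths)
    for start in range(0, len(seq) - window + 1, step):
        parsed = parse(seq[start:start + window], symbols)
        bases = sum(len(s) for s in parsed)
        bits = sum(code_lengths[s] for s in parsed)
        yield start, bits / bases

# Hypothetical code word lengths, as if the code had been tuned on AT-rich sequence.
code_lengths = {"A": 1, "T": 2, "C": 3, "G": 3}
seq = "ATATAATTATAT" + "GCGCGGCCGCGC"  # toy sequence: AT-rich half, then GC-rich half

profile = list(scan(seq, code_lengths))
for start, score in profile:
    print(start, round(score, 3))
```

Windows where the score is low are the ones the code compresses well; the jump in score between the two halves of the toy sequence is the kind of signal that would feed into a predictor.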
