An analysis of “Using sequence compression to speed up probabilistic profile matching”

An analysis, by Cory Tobin, of the paper by Valerio Freschi and Alessandro Bogliolo.


Presentation Transcript


  1. An analysis of “Using sequence compression to speed up probabilistic profile matching” by Valerio Freschi and Alessandro Bogliolo • Cory Tobin

  2. Probabilistic Profiles • Ex: Hydrophobicity sliding-window program • Scores for all characters within the window are added together and assigned to the character in question • Characters in the window farther from the character in question are weighted less • Ex: A T T C G G C T C [figure: weighted window over the sequence; the weights appear garbled in the transcript as “52.521”]
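The weighted sliding window described on this slide can be sketched roughly as follows. This is a minimal illustration, not the program the slide refers to: the per-character scores in `TOY_SCORE`, the weights in `TOY_WEIGHTS`, and the function name `window_scores` are all hypothetical, chosen only so the arithmetic is easy to follow.

```python
def window_scores(seq, char_score, weights):
    """Score each position from its neighbourhood.

    weights[d] is the weight applied to a character d positions away
    from the character being scored (d = 0 is the character itself).
    """
    reach = len(weights) - 1
    out = []
    for i in range(len(seq)):
        total = 0.0
        for d in range(-reach, reach + 1):
            j = i + d
            if 0 <= j < len(seq):  # skip window positions past either end
                total += weights[abs(d)] * char_score[seq[j]]
        out.append(total)
    return out


# Hypothetical values, for illustration only.
TOY_SCORE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}
TOY_WEIGHTS = [1.0, 0.5]  # centre counts fully, direct neighbours count half

scores = window_scores("ATTCGGCTC", TOY_SCORE, TOY_WEIGHTS)
```

With these toy values the second position (T, flanked by A and T) scores 4.0 + 0.5 × (1.0 + 4.0) = 6.5. A real hydrophobicity program would use an amino-acid scale and its own weighting scheme.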

  3. Overview • Time complexity of the brute-force method is O(NP) • N = length of sequence, P = length of profile • Looking for a more efficient way to score sequences with probabilistic profiles • The authors devised algorithms that work directly on compressed sequences • Use run-length encoding and LZ78 compression • Decompressing sequences prior to scoring is not necessary • Test the algorithms on real sequences

  4. Run-Length Encoding • Lossless compression method • Sequential repeats are saved as a single character and an integer representing the number of repeats • Only works well when there are lots of repetitive characters • Better compression ratio with nucleotides than with amino acids • Ex: A T T T G C G C A A A A A T A T T C T C T C T G T G G A A A A A A C G → A (T,3) G C G C (A,5) T A (T,2) C T C T C T G T (G,2) (A,6) C G
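The run-length example on this slide can be reproduced with a short sketch. The function names (`rle_encode`, `rle_decode`, `rle_format`) are made up for this illustration; the notation printed by `rle_format` simply mimics the slide, with single characters left bare.

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each run of identical characters into (char, run_length)."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def rle_decode(runs):
    """Expand (char, run_length) pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)


def rle_format(runs):
    """Render runs in the slide's notation: bare char, or (char,count)."""
    return " ".join(ch if n == 1 else f"({ch},{n})" for ch, n in runs)


# The sequence from the slide, with the spaces removed.
example = (
    "A T T T G C G C A A A A A T A T T C T C T C T G T G G A A A A A A C G"
).replace(" ", "")
runs = rle_encode(example)
formatted = rle_format(runs)
```

Here 35 characters collapse into 22 runs, which shows how little run-length encoding gains when most runs have length 1.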

  5. LZ78 Compression • Lossless compression method • Lempel-Ziv, 1978 • Stores the data in a tree structure • Uses repeated patterns rather than sequentially repeated characters • Better compression ratio than run-length • Compression algorithm is more complex than run-length • Ex: A T A A T C A A C T C G [figure: LZ78 tree for the example sequence]
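As a rough sketch of how LZ78 builds its tree, the parser below splits the input into phrases, each one being a previously seen phrase extended by one new character. This is the textbook LZ78 parse, written for illustration; the function name `lz78_parse` and the dictionary-based trie are choices made here, not details taken from the slides.

```python
def lz78_parse(seq):
    """Return (pairs, phrases) for the LZ78 parse of seq.

    pairs   -- list of (index of parent phrase, new character); index 0
               stands for the empty phrase, i.e. the root of the tree
    phrases -- the phrases themselves, in the order they were created
    """
    trie = {}        # (parent phrase index, next char) -> phrase index
    phrases = [""]   # phrases[0] is the empty phrase (tree root)
    pairs = []
    node = 0
    for ch in seq:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt  # still inside a known phrase, keep extending
        else:
            trie[(node, ch)] = len(phrases)
            phrases.append(phrases[node] + ch)
            pairs.append((node, ch))
            node = 0    # start the next phrase from the root
    if node:            # input ended in the middle of a known phrase
        pairs.append((node, ""))
    return pairs, phrases[1:]


pairs, phrases = lz78_parse("ATAATCAACTCG")
```

For the 12-character example sequence the parse yields six phrases: A, T, AA, TC, AAC, TCG. Each phrase is one node in the tree, hanging off the node of its prefix.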

  6. Brute-Force Scoring Algorithm • Ex: A T A A T C A A C T C G • 36 steps
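A brute-force scorer can be sketched as a double loop over window positions and profile columns. The three-column profile below is hypothetical (the transcript does not give the profile used on the slide), and it uses small integer scores rather than probabilities so the results are easy to check by hand.

```python
def brute_force_scores(seq, profile):
    """Score every full window of seq against the profile.

    profile is a list of columns; each column maps a character to its
    score at that profile position.
    """
    p = len(profile)
    scores = []
    for start in range(len(seq) - p + 1):
        total = 0
        for col in range(p):  # one lookup per (window, column) pair
            total += profile[col][seq[start + col]]
        scores.append(total)
    return scores


# Hypothetical profile that favours the motif "ATA"; illustration only.
TOY_PROFILE = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 0, "G": -1, "T": 2},
    {"A": 2, "C": 0, "G": -1, "T": -1},
]

scores = brute_force_scores("ATAATCAACTCG", TOY_PROFILE)
```

The inner loop runs once per profile column for every window, which is where the O(NP) cost comes from. This sketch only scores windows that fit entirely inside the sequence; how the slide counts windows at the ends is not stated.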

  7. Run-Length Scoring Algorithm • Ex: A T A A T C A A C T C G • 30 steps
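The transcript does not show how the run-length scoring algorithm works. One plausible way to score directly on run-length-encoded input, offered here as an assumption and not as the authors' method, is to precompute per-character prefix sums over the profile columns, so that each run overlapping a window contributes in a single subtraction. The names `run_length_scores` and `TOY_PROFILE` are hypothetical.

```python
def run_length_scores(runs, profile):
    """Score all full windows given runs as (char, length) pairs.

    Assumption: this is one possible compressed-domain scheme, not
    necessarily the one in the paper.
    """
    p = len(profile)
    # prefix[c][k] = sum of profile[0..k-1][c], for every character c
    prefix = {}
    for c in profile[0]:
        acc = [0]
        for col in profile:
            acc.append(acc[-1] + col[c])
        prefix[c] = acc

    # Turn run lengths into half-open spans (char, lo, hi) over positions.
    spans, pos = [], 0
    for ch, length in runs:
        spans.append((ch, pos, pos + length))
        pos += length
    n = pos

    scores = []
    first = 0  # index of the first run overlapping the current window
    for start in range(n - p + 1):
        end = start + p
        while spans[first][2] <= start:
            first += 1
        total, r = 0, first
        while r < len(spans) and spans[r][1] < end:
            ch, lo, hi = spans[r]
            a = max(lo, start) - start  # first profile column in this run
            b = min(hi, end) - start    # one past the last column in it
            total += prefix[ch][b] - prefix[ch][a]
            r += 1
        scores.append(total)
    return scores


# Hypothetical profile, illustration only.
TOY_PROFILE = [
    {"A": 2, "C": -1, "G": -1, "T": 0},
    {"A": -1, "C": 0, "G": -1, "T": 2},
    {"A": 2, "C": 0, "G": -1, "T": -1},
]

# "AAAATTTT" as runs: a window touches at most two runs here.
scores = run_length_scores([("A", 4), ("T", 4)], TOY_PROFILE)
```

The work per window is proportional to the number of runs it overlaps, so long runs save lookups and runs of length 1 save nothing.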

  8. LZ78 Scoring Algorithm • Ex: A T A A T C A A C T C G • 21 steps

  9. Complexities • Brute force: O(N P) • Run-length: O((N / l_avg) P), where l_avg is the average run length • LZ78: O((N / log N) P)
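To get a feel for these bounds, the sketch below plugs the 12-character example sequence from the scoring slides into the first two expressions. The profile length `P = 3` is an assumption (the transcript never states it), and the base-2 logarithm in the LZ78 estimate is likewise a choice made here; big-O bounds hide constant factors, so these are estimates, not exact step counts.

```python
import math
from itertools import groupby

seq = "ATAATCAACTCG"
P = 3  # assumed profile length; not stated in the transcript

N = len(seq)
num_runs = sum(1 for _ in groupby(seq))  # number of maximal runs
l_avg = N / num_runs                     # average run length

brute_force_steps = N * P
# (N / l_avg) * P, using the fact that N / l_avg is just the run count.
run_length_steps = num_runs * P
# Asymptotic estimate only; log base 2 is an assumption.
lz78_estimate = N / math.log2(N) * P
```

Under the `P = 3` assumption the first two values come out as 36 and 30, which is consistent with the step counts shown on the brute-force and run-length slides. The LZ78 expression is an asymptotic bound and is not expected to reproduce the 21 steps on the LZ78 slide.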

  10. Implications of Complexities • Complexities depend on the compression ratios of the sequences • If the compression ratio is 1:1 there is no benefit to using the non-brute-force algorithms • For sequences of equal length, a higher compression ratio yields a lower complexity for the compressed-sequence algorithms

  11. 64 Dollar Question How do these algorithms stack up against real sequences?

  12. Methods • Randomly pick human DNA and protein sequences of varying lengths • Calculate the compression ratio using brute force, run-length, and LZ78 methods • Compression ratio: characters in original sequence vs. characters in compressed sequence • Run the algorithms on those sequences

  13. Results • Run-length did not provide much advantage over brute force • LZ78 provided a great advantage over both brute force and run-length • Longer sequences yield better LZ78 performance compared to brute force • Both run-length and LZ78 have lower complexities, and therefore better performance, on DNA sequences than on protein sequences

  14. Pros and Cons • Less time is needed to perform probabilistic profile matching • Databases such as GenBank do not store their sequences in LZ78 or Run-length format • One would need to retrieve the sequence, compress it, then run the algorithm • This is probably worse than just using brute force on an uncompressed sequence

  15. End
