Identifying File Similarity in Large Data Sets by Modulo File Length

Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie. Department of Computer Science, Jinan University, Guangzhou, 510632, P. R. China.

Presentation Transcript


  1. Identifying File Similarity in Large Data Sets by Modulo File Length • Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie • Department of Computer Science, Jinan University, Guangzhou, 510632, P. R. China

  2. Agenda • Motivation • Challenges • Simhash • Traditional sampling algorithm (TSA) • Position-aware similarity algorithm (PAS) • Evaluation • Conclusion

  3. Motivation • The explosive growth of data • IDC: 4 ZB of data in 2014, 50% growth compared with 2012 • Data is becoming more and more complex • 5V: Volume, Variety, Velocity, Value, and Veracity • Structured, semi-structured, and unstructured data • File similarity detection is very important to data management • Cluster similar data in data mining • Find similar files in backup • Employ similarity to enhance the cache hierarchy in clouds • …

  4. Challenges • Reducing the computing overhead of similarity detection • Similarity detection is both an I/O-bound and a CPU-bound task • It requires a lot of memory and frequent disk accesses • The computing overhead increases as the data sets grow • Reducing the time of similarity detection • Some applications require real-time response and high throughput • Achieving both efficiency and accuracy

  5. Simhash • Fingerprints of similar files differ in a small number of bit positions • Determine the similarity of files by working out their Hamming distance
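
The two Simhash bullets can be illustrated with a minimal Python sketch. It assumes a 64-bit fingerprint, unweighted string tokens, and MD5 as the per-token hash; the slides do not state the configuration the authors used, and the names `simhash` and `hamming_distance` are illustrative.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the slides do not state one


def simhash(tokens):
    """Return a 64-bit simhash fingerprint for an iterable of string tokens."""
    counts = [0] * BITS
    for token in tokens:
        # Hash each token to a 64-bit integer (first 8 bytes of its MD5 digest).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two files would then be treated as similar when the Hamming distance between their fingerprints falls below a chosen threshold; that threshold is a tuning parameter and is not given in the transcript.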

  6. Traditional sampling algorithm (TSA) • TSA is described in Algorithm 1 in pseudo-code • It transforms the similarity detection problem into a set intersection problem

  7. Traditional sampling algorithm (TSA) • TSA is simple and has a fixed overhead, but it is very sensitive to file modifications • We chose n = 6, Lenc = 1KB.
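
Algorithm 1 is not included in this transcript, so the following is only a rough Python sketch of a fixed-position sampling scheme in the spirit of TSA. It assumes the n chunks are taken at evenly spaced offsets from the start of the file, fingerprinted with MD5, and compared by set intersection; the function names and the whole-file fallback for small files are hypothetical.

```python
import hashlib


def tsa_signature(data, n=6, chunk_len=1024):
    """Hash n chunks of chunk_len bytes taken at evenly spaced offsets (n >= 2)."""
    if len(data) < n * chunk_len:
        # Assumption: files too small to sample are fingerprinted whole.
        return {hashlib.md5(data).hexdigest()}
    step = (len(data) - chunk_len) // (n - 1)
    return {
        hashlib.md5(data[i * step : i * step + chunk_len]).hexdigest()
        for i in range(n)
    }


def sampled_similarity(sig_a, sig_b):
    """Similarity as the share of sampled fingerprints that both files have."""
    return len(sig_a & sig_b) / max(len(sig_a), len(sig_b))
```

With this layout, inserting a single byte at the front of a file shifts the content under every sampled offset, so none of the fingerprints match any more. That is the sensitivity to modifications noted on slide 7.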

  8. Position-aware similarity algorithm (PAS)

  9. Position-aware similarity algorithm (PAS) • PAS is described in Algorithm 2 in pseudo-code

  10. Position-aware similarity algorithm (PAS) • We chose T = 28KB, n = 6, Lenc = 1KB
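
Algorithm 2 is not included in this transcript either. The sketch below is therefore not the authors' PAS; it is one plausible reading of "position-aware sampling by modulo file length", under two loudly stated assumptions: the sampling step is derived from the file length rounded down to a multiple of T, and half of the chunks are located from the head of the file while the other half are located from the tail. All names are hypothetical.

```python
import hashlib


def pas_offsets(size, n=6, chunk_len=1024, t=28 * 1024):
    """Return (head_offsets, tail_offsets); expects size >= n * chunk_len, n >= 2."""
    # Assumption: rounding the length down to a multiple of T keeps the step
    # unchanged when the file grows or shrinks by a few bytes.
    modulated = size - size % t
    if modulated < n * chunk_len:
        modulated = size  # assumed fallback for files shorter than T
    step = (modulated - chunk_len) // (n - 1)
    head = [i * step for i in range(n // 2)]
    # Tail offsets are measured back from the end of the file.
    tail = [size - chunk_len - i * step for i in range(n - n // 2)]
    return head, tail


def pas_signature(data, n=6, chunk_len=1024, t=28 * 1024):
    """Fingerprint the chunks found at the head-relative and tail-relative offsets."""
    if len(data) < n * chunk_len:
        return {hashlib.md5(data).hexdigest()}
    head, tail = pas_offsets(len(data), n, chunk_len, t)
    return {hashlib.md5(data[o : o + chunk_len]).hexdigest() for o in head + tail}


def sampled_similarity(sig_a, sig_b):
    """Similarity as the share of sampled fingerprints that both files have."""
    return len(sig_a & sig_b) / max(len(sig_a), len(sig_b))
```

Under these assumptions, a one-byte insertion at the front of a 64KB file changes the head samples but leaves the tail samples intact, so half of the fingerprints still match instead of none.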

  11. Evaluation environment • Ubuntu operating system (kernel version 2.6.32) • VirtualBox virtual machine software (4.3.8.r92456) • 1GB memory • 2.0 GHz Intel(R) Pentium(R) CPU

  12. Data set • Data set D1 • 11.5GB, 2756 files • Table 1 summarizes the profile of D1 • Figure 4 shows the distribution of file sizes • Data set D2 • 14 txt files • 128MB [Table 1. The profile of data set D1] [Figure 4. The file size distribution of data set D1]

  13. The principle of parameter selection • Compare the detection probability of PAS against the actual portion of matching chunks. • Split file A into chunks, then map each chunk to a fingerprint with the MD5 hash function to obtain the fingerprint set Finger(A). The actual portion of matching chunk fingerprints of file A and file B is described as follows: [equation on slide]
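
The formula itself is not reproduced in this transcript. As a hedged illustration, the Python sketch below builds Finger(A) from fixed-size chunks and MD5, and measures the matching portion as the ratio of shared fingerprints to all distinct fingerprints (a Jaccard ratio). Both the ratio and the default chunk length of 32 bytes are assumptions, and the function names are illustrative.

```python
import hashlib


def chunk_fingerprints(data, chunk_len=32):
    """Split data into fixed-size chunks and return the set of MD5 fingerprints."""
    return {
        hashlib.md5(data[i : i + chunk_len]).digest()
        for i in range(0, len(data), chunk_len)
    }


def matching_portion(a, b, chunk_len=32):
    """Assumed measure: |Finger(A) & Finger(B)| / |Finger(A) | Finger(B)|."""
    fa = chunk_fingerprints(a, chunk_len)
    fb = chunk_fingerprints(b, chunk_len)
    if not fa and not fb:
        return 1.0  # two empty files are treated as identical
    return len(fa & fb) / len(fa | fb)
```

This exhaustive comparison reads every byte of both files, which is why it serves as the reference value against which a sampled estimate can be judged.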

  14. Sampling position impact factor T • The impact of T on the detection probability (Lenc = 32 bytes, N = 8, T = 2KB, 8KB, 32KB, 128KB, 512KB) • We take T = 512KB.

  15. PAS and Simhash parameters configuration

  16. PAS compared to Simhash: Time overhead • The file sizes are 2MB, 5MB, and 10MB, respectively • Data set D1

  17. PAS compared to Simhash: CPU and memory usage • CPU and memory utilization of PAS and Simhash with data set D1

  18. PAS compared to Simhash: Precision and recall
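
The results for this slide are not in the transcript. For reference, the two metrics follow their standard definitions, sketched here in Python over sets of file pairs; the function name and the pair representation are illustrative.

```python
def precision_recall(detected, relevant):
    """Return (precision, recall) for detected vs. truly similar file pairs."""
    true_positives = len(detected & relevant)
    # Precision: share of detected pairs that are truly similar.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of truly similar pairs that were detected.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if an algorithm reports three similar pairs, two of which are among four truly similar pairs, precision is 2/3 and recall is 1/2.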

  19. Conclusion • PAS is very effective in detecting file similarity • The time overhead, CPU usage, and memory usage of PAS are much lower than those of Simhash • We believe that the PAS algorithm is applicable.

  20. Thanks!
