
Copy or Not


Presentation Transcript


  1. Copy or Not Dawei (David) Shi

  2. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  3. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  4. Introduction • A web-based document comparator • Calculate accurate similarity between 2 documents

  5. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  6. Algorithm • Preprocessing • Vector space • Similarity calculation

  7. Preprocessing

  8. Preprocessing • Stemming • Porter Stemming Algorithm • E.g. word pairs reduced to a common stem: • cat / cats • meet / meeting • agree / agreed • correct / correctness

  9. Vector Space • Build dictionary 1 • word -> frequency • Sort the keys of dictionary 1 • Build dictionary 2 • key -> (index, count) • Build binary vectors • index -> occurrence
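
The vector-space steps on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names `build_vocabulary` and `binary_vector` are hypothetical, and building one shared dictionary over both documents is an assumption (the slide does not say whether the dictionary is per document or shared).

```python
from collections import Counter


def build_vocabulary(tokens_a, tokens_b):
    """Return word -> (index, count) over both documents (assumed shared)."""
    # Dictionary 1: word -> frequency.
    freq = Counter(tokens_a) + Counter(tokens_b)
    # Dictionary 2: sort the keys, then map each key -> (index, count).
    return {word: (i, freq[word]) for i, word in enumerate(sorted(freq))}


def binary_vector(tokens, vocab):
    """Return a 0/1 vector: position index is 1 if the word occurs."""
    vec = [0] * len(vocab)
    for word in set(tokens):
        if word in vocab:
            vec[vocab[word][0]] = 1
    return vec


doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat on the log".split()

vocab = build_vocabulary(doc_a, doc_b)
v1 = binary_vector(doc_a, vocab)
v2 = binary_vector(doc_b, vocab)
```

Sorting the keys gives every word a stable index, so both documents map onto vectors of the same length and the same word always lands in the same position.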

  10. Similarity Calculation • Vectors v1 and v2 • Similarity = (v1 · v2) / (norm(v1) * norm(v2)), i.e. the cosine of the angle between v1 and v2
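
The similarity formula on this slide is the standard cosine similarity. A minimal pure-Python sketch is shown here, assuming both vectors have the same length; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are illustrative assumptions, not taken from the presentation.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: an empty document is treated as sharing nothing.
        return 0.0
    return dot / (norm1 * norm2)


# Two binary vectors that share 3 of their 5 set positions.
similarity = cosine_similarity([1, 0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 1, 1, 1])
```

For binary vectors the dot product counts the shared words, so the score ranges from 0 (no words in common) to 1 (identical word sets).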

  11. Performance • Algorithms coded in Python • Dynamic typing • Not good at numerical operations • Solution: numpy

  12. Numpy • A Python extension module • Written mostly in C • Defines numerical array and matrix types and basic operations on them

  13. Numpy vs Python • Python code • a = range(10000000) • b = range(10000000) • c = [] • for i in range(len(a)): • c.append(a[i] + b[i]) • Takes up to 10 seconds on a multi-GHz processor

  14. Numpy vs Python • Numpy code • import numpy as np • a = np.arange(10000000) • b = np.arange(10000000) • c = a + b • Almost instant

  15. Numpy Usage • Vector dot product • Vector normalization • Vector zero filling
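
The three Numpy operations listed on this slide can be combined roughly as follows. This is a hedged sketch, not the project's code: the helper names `to_vector` and `normalize` and the sample indices are hypothetical, while `np.zeros`, `np.linalg.norm`, and `np.dot` are standard Numpy functions.

```python
import numpy as np


def to_vector(indices, size):
    """Zero filling: start from all zeros, then mark the given indices."""
    vec = np.zeros(size)
    vec[np.asarray(indices, dtype=int)] = 1.0
    return vec


def normalize(vec):
    """Vector normalization: scale to unit length (zero vector unchanged)."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm


v1 = to_vector([0, 3, 4, 5, 6], 7)
v2 = to_vector([1, 2, 4, 5, 6], 7)

# Vector dot product of the two unit vectors gives the cosine similarity.
similarity = float(np.dot(normalize(v1), normalize(v2)))
```

Normalizing first means the dot product alone yields the similarity, and all three steps run in compiled code rather than in a Python loop.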

  16. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  17. Framework • Django • The web framework for perfectionists with deadlines

  18. Libraries • Python • Numpy • Porter Stemming • jQuery

  19. Hosting • Alwaysdata • Django 1.3 • Python 2.6

  20. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  21. Future Work • Support file uploading and comparison • Add HTML5 features

  22. Copy Or Not • Introduction • Algorithm • Framework • Future work • Demo

  23. Demo • http://imds.alwaysdata.net

  24. Thank you!
