
Thumbnail Summarization Techniques For Web Archives

The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum*, Stanford University Libraries, Stanford CA, USA, aalsum@stanford.edu. Michael L. Nelson, Old Dominion University


Presentation Transcript


  1. The 36th European Conference on Information Retrieval (ECIR 2014), Amsterdam, Netherlands, 2014. Thumbnail Summarization Techniques For Web Archives. Ahmed AlSum*, Stanford University Libraries, Stanford CA, USA, aalsum@stanford.edu. Michael L. Nelson, Old Dominion University, Norfolk VA, USA, mln@cs.odu.edu. *Ahmed AlSum did this work while he was a PhD student at Old Dominion University. ECIR 2014 Amsterdam, Netherlands

  2. What is a Web Archive? http://www.cs.odu.edu

  3. Thumbnails in Web Archives: Internet Archive, UK Web Archive

  4. Memento Terminology
     • Original Resource (URI-R, R): http://www.amazon.com
     • Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com
     • TimeMap (URI-T, TM)

  5. Thumbnail Creation Challenges
     • Scalability in time: the Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.
     • Scalability in space: IA will need 355 TB to store one thumbnail per memento.
     • Page quality

  6. Thumbnail Usage Challenges
     • This is a partial view of the first 700 thumbnails out of 10,500 available mementos for www.apple.com

  7. From 10,500 Mementos to 69 Thumbnails.

  8. How many thumbnails do we need? www.unfi.com on the live Web

  9. How many thumbnails do we need? www.unfi.com on the live Web

  10. 40 Thumbnails are good.

  11. Methodology

  12. Visual Similarity and Text Similarity (figure labels: Similar, Different, HTML Text)

  13. Correlation between Visual Similarity and Text Similarity
     • Text Similarity
        • SimHash
        • DOM Tree
        • Embedded resources
        • Memento Datetime (capture time)
     • Visual Similarity

  14. Text Similarity: SimHash
     • Compute 64-bit SimHash fingerprints with k = 4 for two pages
        • Full HTML text ✔
        • The main content from the web page
        • All the text
        • Templates including the text
        • The template excluding the text
     • Calculate the differences using Hamming distance
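
The SimHash comparison on slide 14 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the word-level tokenization and the MD5-based token hash are assumptions chosen for brevity, and the slide's `k = 4` parameter is not modeled because the slide does not define it.

```python
import hashlib
import re


def simhash64(text):
    """Return a 64-bit SimHash fingerprint of the word tokens in text.

    Assumption: tokens are lowercase word runs, each hashed with the
    first 8 bytes of MD5. Any stable 64-bit hash would serve.
    """
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    page_1 = "<html><body><h1>Welcome</h1><p>Spring sale</p></body></html>"
    page_2 = "<html><body><h1>Welcome</h1><p>Summer sale</p></body></html>"
    print(hamming_distance(simhash64(page_1), simhash64(page_2)))
```

Identical inputs always yield a distance of 0, and the distance can never exceed 64.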

  15. Text Similarity: DOM Tree
     • Transform each web page into a DOM tree
     • Calculate the difference using Levenshtein distance
     • Levenshtein distance: the number of insert, update, and delete operations needed to turn one into the other.
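
A rough sketch of the DOM comparison on slide 15 follows. It simplifies heavily: each page is flattened into the sequence of its start tags in document order, and the classic Levenshtein distance is computed over those sequences. The flattening step and the helper names are assumptions for illustration. A true tree edit distance, which the cited Pawlik 2011 work concerns, is not implemented here.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Flatten a page into its list of start-tag names (an assumption)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Minimum number of inserts, updates, and deletes turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete
                               current[j - 1] + 1,       # insert
                               previous[j - 1] + cost))  # update
        previous = current
    return previous[-1]


if __name__ == "__main__":
    old = tag_sequence("<div><p>a</p></div>")
    new = tag_sequence("<div><p>a</p><img src='x.png'></div>")
    print(levenshtein(old, new))
```

Here the two hypothetical pages differ by one inserted `img` element, so the distance is 1.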

  16. Text Similarity: Embedded resources
     • Extract the embedded resources for each page
     • Calculate the total number of new resources that have been added and the resources that have been removed.
     • For example, the difference between M1 and M2:
        • Addition of 5 resources (2 JavaScript files and 3 images)
        • Removal of 2 resources (1 JavaScript file and 1 image).
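
The embedded-resource difference on slide 16 amounts to two set differences. The sketch below assumes that "embedded resources" means the `src` of `img` and `script` elements and the `href` of `link` elements; the slide does not list the element types, and the file names are hypothetical.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects URLs of embedded resources (assumed element types)."""

    SOURCES = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.SOURCES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.add(value)


def embedded_resources(html):
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources


def resource_difference(old, new):
    """Return (number added, number removed) going from old to new."""
    return len(new - old), len(old - new)


if __name__ == "__main__":
    # Hypothetical resource sets mirroring the M1/M2 example on the slide.
    m1 = {"keep.css", "old.js", "old.png"}
    m2 = {"keep.css", "n1.js", "n2.js", "i1.png", "i2.png", "i3.png"}
    added, removed = resource_difference(m1, m2)
    print(added, removed, added + removed)
```

With these sets the result is 5 additions and 2 removals, matching the counts in the slide's example.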

  17. Text Similarity: Memento datetime
     • Calculate the difference, in seconds, between the record capture times of the two pages.
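
The capture-time difference on slide 17 can be illustrated with the URI-M shown on slide 4, whose path carries a 14-digit timestamp. The sketch assumes Wayback-style URI-Ms of the form `/web/YYYYMMDDhhmmss/` and treats the timestamp as UTC. The second URI-M below is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: the capture time is the 14-digit run after "/web/".
WAYBACK_TIMESTAMP = re.compile(r"/web/(\d{14})")


def capture_time(uri_m):
    """Extract the capture time encoded in a Wayback-style URI-M."""
    match = WAYBACK_TIMESTAMP.search(uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp found in " + uri_m)
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def datetime_difference_seconds(uri_m_a, uri_m_b):
    """Absolute difference between two capture times, in seconds."""
    delta = capture_time(uri_m_a) - capture_time(uri_m_b)
    return abs(delta.total_seconds())


if __name__ == "__main__":
    m1 = "http://web.archive.org/web/20110411070244/http://amazon.com"
    m2 = "http://web.archive.org/web/20110412070244/http://amazon.com"
    print(datetime_difference_seconds(m1, m2))
```

The two example captures are exactly one day apart, so the difference is 86400 seconds.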

  18. Visual Similarity
     • Measurement: the number of different pixels between two thumbnails
     • To compare two thumbnails:
        • Resize them into different dimensions: 64x64, 128x128, 256x256, and 600x600.
        • Calculate the Manhattan distance and Zero distance between each pair
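
The pixel comparison on slide 18 can be sketched on plain grids of grayscale values. Two assumptions are made here: the thumbnails are already resized to the same dimensions (resizing would need an image library and is omitted), and "Zero distance" is read as the count of pixel positions whose values differ, following the slide's "number of different pixels". The slide does not define the term, so that reading may not match the authors' definition.

```python
def _check_same_shape(a, b):
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("thumbnails must have identical dimensions")


def manhattan_distance(a, b):
    """Sum of absolute per-pixel differences between two pixel grids."""
    _check_same_shape(a, b)
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def differing_pixels(a, b):
    """Count of positions where the pixel values differ (assumed reading
    of the slide's "Zero distance")."""
    _check_same_shape(a, b)
    return sum(1
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)
               if pa != pb)


if __name__ == "__main__":
    # Two hypothetical 2x2 grayscale thumbnails.
    thumb_1 = [[0, 10], [20, 30]]
    thumb_2 = [[0, 15], [20, 0]]
    print(manhattan_distance(thumb_1, thumb_2))  # 5 + 30 = 35
    print(differing_pixels(thumb_1, thumb_2))    # 2 positions differ
```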

  19. Correlation between Visual Similarity and Text Similarity (figure panels: SimHash, DOM tree, Embedded resources, Memento Datetime). References: SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

  20. Selection algorithms

  21. Threshold Grouping

  22. Threshold Grouping
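
The transcript preserves only the title of the threshold grouping slides, so the following is one plausible reading and not the authors' algorithm. It assumes the mementos are in capture order, that each is represented by a SimHash fingerprint, and that a new group starts whenever the Hamming distance to the current group's first member exceeds a threshold. The threshold value and the choice of the first member as the group's thumbnail are both hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def threshold_grouping(fingerprints, threshold):
    """Return indices of the mementos that open each group.

    fingerprints: SimHash values in capture order (assumed input).
    threshold: maximum Hamming distance allowed within a group
               (hypothetical parameter).
    """
    selected = []
    representative = None
    for index, fingerprint in enumerate(fingerprints):
        if (representative is None
                or hamming_distance(fingerprint, representative) > threshold):
            # Far enough from the current group: start a new one here.
            selected.append(index)
            representative = fingerprint
    return selected


if __name__ == "__main__":
    timeline = [0b0000, 0b0001, 0b1111, 0b1110, 0b0000]
    print(threshold_grouping(timeline, threshold=2))
```

In this toy timeline the page changes substantially at positions 2 and 4, so three mementos (indices 0, 2, and 4) would be picked for thumbnails.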

  23. Clustering technique
     • Input:
        • TimeMap with n mementos
        • A set of features. For example, F = {SimHash, Memento-Datetime}
     • Task:
        • Cluster n mementos into K clusters.
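
The clustering task on slide 23 can be illustrated with a small K-medoids loop. This is a simplified sketch, not the procedure from the deck: it uses SimHash fingerprints as the only feature, Hamming distance as the dissimilarity, and evenly spaced initial medoids, all of which are assumptions. The distance function is a parameter, so a combined SimHash and Memento-Datetime measure could be passed in instead.

```python
def hamming_distance(a, b):
    return bin(a ^ b).count("1")


def assign(items, medoids, distance):
    """Map each medoid index to the indices of the items nearest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(items)):
        if i in clusters:
            clusters[i].append(i)  # a medoid always stays in its own cluster
        else:
            nearest = min(medoids, key=lambda m: distance(items[i], items[m]))
            clusters[nearest].append(i)
    return clusters


def total_distance(items, candidate, members, distance):
    return sum(distance(items[candidate], items[o]) for o in members)


def k_medoids(items, k, distance, max_iter=100):
    """Cluster items into k groups; returns {medoid index: member indices}."""
    n = len(items)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(items)")
    medoids = [i * n // k for i in range(k)]  # assumed initialization
    for _ in range(max_iter):
        clusters = assign(items, medoids, distance)
        updated = []
        for members in clusters.values():
            # New medoid: the member closest to all others in its cluster.
            best = min(members,
                       key=lambda c: total_distance(items, c, members, distance))
            updated.append(best)
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(items, medoids, distance)


if __name__ == "__main__":
    fingerprints = [0b00000000, 0b00000001, 0b00000011,
                    0b11110000, 0b11110001, 0b11110011]
    print(k_medoids(fingerprints, 2, hamming_distance))
```

Each medoid is an actual memento, so the K medoid indices could serve directly as the mementos to render as thumbnails.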

  24. Clustering technique (figure panels: SimHash and Datetime Features; SimHash Feature). Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.

  25. Time Normalization

  26. Selection Algorithms Comparison

  27. Generalization outside the Web Archive
     • Get k thumbnails from a website that has n pages

  28. Conclusions
     • We explored the similarity between the text and the visual appearance of the web page.
     • We found that SimHash and Levenshtein distance have the highest correlation.
     • We presented three algorithms to select k thumbnails from n mementos per TimeMap.
     Contact: aalsum@stanford.edu, @aalsum
