
Presentation Transcript


  1. Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina) / A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)

  2. Authors

  3. Authors

  4. What are they talking about? — Identifying replicated content
  • Cho et al.: a bottom-up approach
    • Content-based analysis
    • Computing similarity measures
    • Improved crawling
    • Reducing clutter in search-engine results
  • Bharat et al.: a top-down approach
    • Page attributes: URL, IP address, connectivity

  5. Pros and cons — Top down
  • Needs only the URLs of pages, not the pages themselves
  • Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection

  6. Pros and cons — Bottom up
  • Might discover mirrors even under renaming of paths
  • Works even on collections too small for the top-down approach
  • Pages that change between crawling intervals might create problems

  7. Finding Replicated Web Collections — Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

  8. Example: man printf — the same manual page is replicated across many hosts

  9. Why is identifying replicated content important?
  • The crawler’s task becomes easier
  • Improved search-engine results and ranking
  • Improved archiving

  10. Why is replicated-content identification difficult? — Update frequency (figure: www.original.com mirrored by dup-1.com and dup-2.com, each with unknown update times)

  11. Why is replicated-content identification difficult? — Mirrors with partial coverage (figure: dup-1.com and dup-2.com each mirror only part of www.original.com)

  12. Why is replicated-content identification difficult? — Different formats (figure: the same content on www.original.com, dup-1.com, and dup-2.com in different formats)

  13. Why is replicated-content identification difficult? — Partial crawls (figure: only parts of www.original.com and duplicate.com are crawled)

  14. Similarity of Collections — Web Graph

  15. Similarity of Collections — Collection-Induced Subgraph
  • Collection size = 4
  • Assumption: the location of the hyperlinks within the pages is immaterial

  16. Similarity of Collections — Identical Collection (figure: www.original.com and its identical copy dup-1.com)

  17. Similarity of Collections — Similar Collection
  • Human view: the collections are close copies of each other
  • Goal: automatic identification over large numbers of web pages
  • Option: textual overlap

  18. Similarity of Collections — Similar Collection
  • Each chunk c of a page is hashed to a 32-bit fingerprint (figure: example fingerprint bit strings)

  19. Similarity of Collections — Similar Collection (figure: the fingerprint lists of two texts, compared side by side)

  20. Similarity of Collections — Similar Collection
  • Compare the fingerprint lists of two pages: if X out of Y fingerprints match and X > T (a threshold), the two pages are similar
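The chunk-and-fingerprint comparison above can be sketched as follows. `zlib.crc32` stands in for the paper's 32-bit fingerprint function, and the chunk size and threshold values are illustrative, not the paper's:

```python
import zlib

def fingerprints(text, chunk_lines=4):
    """Split a page into fixed-size line chunks and hash each to a 32-bit value."""
    lines = text.splitlines()
    chunks = ["\n".join(lines[i:i + chunk_lines])
              for i in range(0, len(lines), chunk_lines)]
    return {zlib.crc32(c.encode()) for c in chunks}

def similar(page_a, page_b, threshold=2):
    """Two pages are similar if more than `threshold` fingerprints match."""
    matches = len(fingerprints(page_a) & fingerprints(page_b))
    return matches > threshold
```

Hashing fixed-size chunks rather than whole documents is what lets near-copies (with small local edits) still share most of their fingerprints.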

  21. Similarity of Collections — Transitive Similarity
  • If P ≈ P′ and P′ ≈ P″, then P ≈ P″
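The transitive rule can be realized with a union-find pass over pairwise similarity results; this is a generic sketch, not the paper's code:

```python
def similarity_classes(pages, similar):
    """Group pages into equivalence classes under transitive similarity:
    if P ~ P' and P' ~ P'', all three land in one class."""
    parent = {p: p for p in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union every directly-similar pair; transitivity falls out for free.
    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            if similar(p, q):
                parent[find(p)] = find(q)

    classes = {}
    for p in pages:
        classes.setdefault(find(p), []).append(p)
    return list(classes.values())
```

For example, if only the pairs (a, b) and (b, c) test as similar, a, b, and c still end up in the same class.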

  22. Similarity of link structure • One-to-one • Collection Sizes

  23. Similarity of link structure • Link Similarity • Break Points

  24. Clusters
  • Cluster = a set of equi-sized collections; cluster cardinality = number of collections
  • Identical cluster: Ci ≡ Cj for all i, j (pairwise identical)
  • Similar cluster: Ci ≈ Cj for all i, j (pairwise similarity)

  25. Computing similar clusters — cluster cardinality = 2, collection size = 5

  26. Computing similar clusters — cluster cardinality = 3, collection size = 3

  27. Cluster growing algorithm
  • Identify trivial clusters

  28. Cluster growing algorithm — Growth strategy
  • Merge Ri and Rj only when si,j = di,j = |Ri| = |Rj| (figure: example with si,j = di,j = |Ri| = |Rj| = 3)
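A minimal check of the growth condition, under the assumption that si,j counts pages of Ri that have a similar counterpart in Rj and di,j counts the distinct pages of Rj matched (the paper defines both quantities precisely):

```python
def can_merge(ri, rj, similar):
    """Growth condition from the slide: merge only when
    s_ij == d_ij == |Ri| == |Rj|, i.e. the pages of Ri and Rj
    match up one-to-one under the similarity test."""
    if len(ri) != len(rj):
        return False
    # s_ij: pages in Ri with at least one similar page in Rj
    s = sum(any(similar(p, q) for q in rj) for p in ri)
    # d_ij: distinct pages in Rj similar to some page in Ri
    d = len({q for q in rj for p in ri if similar(p, q)})
    return s == d == len(ri) == len(rj)
```

Requiring all four quantities to coincide prevents merging when a collection only partially mirrors another, which is exactly the case slide 31 flags as problematic.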

  29. Cluster growing algorithm

  30. Quality of similarity measure
  • Sample:
    • 25 replicated collections selected as targets
    • 5–10 mirrors from each target
    • 35,000 pages from the targets + 15,000 random pages
  • Results:
    • 180 non-trivial collections found
    • 149 collections grouped into the 25 clusters
    • 180 − 149 = 31 problem collections, due to partial mirrors

  31. Quality of similarity measure — Partial mirrors

  32. Quality of similarity measure — Extended clusters
  • Changed growth strategy: merge when si,j = |Ri| ≥ di,j = |Rj|
  • Changed results: 23 more collections clustered; only 8 problem collections remain
  • Success rate: 172 out of 180

  33. Improved crawling
  • Data set: 25 million web pages from US domains
  • Chunking strategies — compute a fingerprint for:
    • the entire document
    • every four lines of text (threshold = 15)
    • every two lines of text (threshold = 25)

  34. Improved crawling

  35. Improved result presentation
  • Problems:
    • Multiple pages from the same collection appear in the results
    • Links to several replicated copies of the same content
  • Solution:
    • Suppress and group duplicate results
    • Add “Replica” and “Collection” links to the results

  36. A Comparison of Techniques to Find Mirrored Hosts on the WWW — Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger

  37. Concepts
  • A and B are mirrors if, for every document in A, there is a highly similar document in B with the same path
  • Only entire web sites are considered; partial mirrors are ignored

  38. Methodology
  • IP-address based: identical or similar IP addresses
  • URL-string based:
    • Term-vector matching on URL strings
    • Host-name matching
    • Full-path matching
    • Prefix matching
    • Positional word-bigram matching
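The URL-string features above can be sketched roughly as follows; the tokenization, and the use of host-name terms for the positional bigrams, are assumptions rather than the paper's exact definitions:

```python
import re

def url_features(host, paths):
    """Extract simple URL-string features for mirror detection:
    host-name terms, path prefixes, and positional word bigrams."""
    # Terms: lowercase alphanumeric runs in the host name
    terms = [t for t in re.split(r"[^a-z0-9]+", host.lower()) if t]
    # Prefixes: each path with its last segment stripped
    prefixes = {p.rsplit("/", 1)[0] for p in paths if "/" in p}
    # Positional word bigrams: adjacent terms tagged with their position
    bigrams = {(i, terms[i], terms[i + 1]) for i in range(len(terms) - 1)}
    return terms, prefixes, bigrams
```

Two hosts whose feature sets overlap strongly (e.g. shared bigrams like `(1, "example", "com")` or many common path prefixes) become mirror candidates without fetching any page content.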

  39. Methodology (continued)
  • URL-string and connectivity based: URL-string features plus outlinks
  • Host-connectivity based: two hosts are mirror candidates if they link to similar sets of hosts
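One way to realize the host-connectivity test is Jaccard overlap of the two hosts' outlink sets; the choice of Jaccard (and any cutoff applied to it) is an assumption here, not the paper's exact measure:

```python
def connectivity_similarity(outlinks_a, outlinks_b):
    """Host-connectivity heuristic: hosts that link to a very similar
    set of other hosts are mirror candidates. Returns Jaccard overlap
    of the two outlink host sets, in [0, 1]."""
    a, b = set(outlinks_a), set(outlinks_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Like the URL-string methods, this needs no page content at all, only the link structure already gathered during crawling.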

  40. Terminology
  Precision at rank K = (correct host pairs within rank K) / (total host pairs within rank K)
  Recall at rank K = (correct host pairs within rank K) / (total correct host pairs found within K by all algorithms combined)
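These definitions translate directly into code; the pooled ground truth (correct pairs found by any of the algorithms) is passed in as `total_correct`:

```python
def precision_recall_at_k(ranked_pairs, correct_pairs, k, total_correct):
    """Precision@K and recall@K as defined on the slide:
    precision = correct pairs within rank K / pairs examined within K;
    recall    = correct pairs within rank K / pooled correct pairs."""
    top_k = ranked_pairs[:k]
    hits = sum(1 for pair in top_k if pair in correct_pairs)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / total_correct if total_correct else 0.0
    return precision, recall
```

Pooling the correct pairs from all algorithms gives a *relative* recall: no single algorithm is penalized for mirrors that none of them found.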

  41. Results

  42. Conclusions from the results
  • IP4 and prefix matching were the best single algorithms, but limited in recall
  • Best approach: a combination of all methods

  43. Discussion
  • In which situations should link-based vs. content-based analysis be used for duplicate detection?
  • What are the methods to improve content-based analysis?
  • How can we merge both methods, and what improvements would result?
