Authors

Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)
A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)

What are they talking about?
• Identifying replicated content
• Cho et al., a bottom-up approach
  • Using content-based analysis
  • Computing similarity measures
  • Improved crawling
  • Reducing clutter from search engine results
• Bharat et al., a top-down approach
  • Using page attributes
  • URL, IP address, connectivity

Pros and cons – Top down
• Needs only the URLs of pages, not the pages themselves
• Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection

Pros and cons – Bottom up
• Might discover mirrors
  • even under renaming of paths
  • too small for the top-down approach
• Pages that change between crawling intervals might create problems

Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina Finding replicated web collections

Why Is Identifying Replicated Content Important?
• Crawler’s task becomes easier
• Improved search engine results
  • Ranking
• Improved archiving

Why Is Replicated Content Identification Difficult?
• Update frequency (figure: www.original.com, dup-1.com, dup-2.com)
• Mirror partial coverage (figure: www.original.com, dup-1.com, dup-2.com)
• Different formats (figure: www.original.com, dup-1.com, dup-2.com)
• Partial crawls (figure: www.original.com, duplicate.com)

Similarity of Collections – Web Graph

Similarity of Collections – Collection Induced Subgraph
• Collection size = 4
• Assumption: the location of the hyperlinks in the pages is immaterial
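A minimal sketch of a collection-induced subgraph, assuming the web graph is held as a dictionary that maps each page to the set of pages it links to. The page names and the `induced_subgraph` helper are hypothetical and only illustrate the idea of keeping the links that stay inside the collection.

```python
def induced_subgraph(web_graph, collection):
    """Keep only the collection's pages and the links between them."""
    members = set(collection)
    return {
        page: {target for target in web_graph.get(page, set()) if target in members}
        for page in members
    }


# Hypothetical toy web graph: page -> pages it links to.
web_graph = {
    "a": {"b", "c", "x"},
    "b": {"d"},
    "c": {"d", "y"},
    "d": set(),
    "x": {"a"},
}

# A collection of size 4; links to "x" and "y" leave the collection and are dropped.
sub = induced_subgraph(web_graph, ["a", "b", "c", "d"])
```

Only the link structure is kept, which matches the assumption that the position of a hyperlink within a page does not matter.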

Similarity of Collections – Identical Collection (figure: www.original.com, dup-1.com)

Similarity of Collections – Similar Collection
• Close copies of each other – human view
• Automatic identification, over large web pages
• Textual overlap option

Similarity of Collections – Similar Collection
(Figure: each chunk c of a page's text is reduced to a 32-bit fingerprint, and the fingerprint lists of two texts are compared.)
• X out of Y fingerprints match
• If X > T (threshold) ⇒ the two pages are similar
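The fingerprint comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `zlib.crc32` stands in for whatever 32-bit fingerprint function was actually used, chunks are taken as groups of non-empty lines, and the helper names are hypothetical.

```python
import zlib


def fingerprints(text, lines_per_chunk=1):
    """Split text into chunks of non-empty lines and fingerprint each chunk."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    # crc32 is an assumed stand-in for a 32-bit fingerprint.
    return {zlib.crc32(chunk.encode("utf-8")) for chunk in chunks}


def shared_chunks(text_a, text_b, lines_per_chunk=1):
    """X: the number of fingerprints the two texts have in common."""
    return len(fingerprints(text_a, lines_per_chunk) & fingerprints(text_b, lines_per_chunk))


def pages_similar(text_a, text_b, threshold, lines_per_chunk=1):
    """Two pages count as similar when X exceeds the threshold T."""
    return shared_chunks(text_a, text_b, lines_per_chunk) > threshold


page_a = "intro line\nsecond line\nthird line\nfooter A"
page_b = "intro line\nsecond line\nthird line\nfooter B"
x = shared_chunks(page_a, page_b)
```

With these toy pages, three of the four chunks match, so the pages are similar for T = 2 but not for T = 3.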

Similarity of Collections – Transitive Similarity
P ≈ P′ and P′ ≈ P″ ⇒ P ≈ P″
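If similarity is treated as transitive, pages fall into groups of mutually similar pages. One possible way to compute those groups from a list of similar pairs is a union-find pass; this is an assumed technique for illustration, and the `similarity_classes` helper is hypothetical.

```python
def similarity_classes(pages, similar_pairs):
    """Group pages so that any chain of similar pairs ends up in one class."""
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the class representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    classes = {}
    for p in pages:
        classes.setdefault(find(p), set()).add(p)
    return list(classes.values())


# P ~ P' and P' ~ P'' place all three pages in one class; Q stays alone.
classes = similarity_classes(["P", "P'", "P''", "Q"], [("P", "P'"), ("P'", "P''")])
```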

Similarity of link structure
• One-to-one
• Collection sizes
• Link similarity
• Break points

Clusters
• Cluster = equi-sized collections
• Cluster cardinality = number of collections
• Identical cluster: Ci ≡ Cj, ∀ i, j
• Similar cluster: Ci ≅ Cj, ∀ i, j (pairwise similarity)

Computing similar clusters
• Figure 1: cluster cardinality = 2, collection size = 5
• Figure 2: cluster cardinality = 3, collection size = 3

Cluster growing algorithm
• Identify trivial clusters
• Growth strategy: si,j = di,j = |Ri| = |Rj|
  (figure example: si,j = 3, di,j = 3, |Ri| = 3, |Rj| = 3)
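The growth condition can be checked with a few lines of code. The slide does not define si,j and di,j, so this sketch assumes that si,j counts the pages in Ri that link to at least one page in Rj, and that di,j counts the pages in Rj that are linked from at least one page in Ri. The helper names and page identifiers are hypothetical.

```python
def link_counts(R_i, R_j, links):
    """Return (s, d) for two page sets under the assumed definitions."""
    # s: pages in R_i with at least one link into R_j.
    s = sum(1 for p in R_i if links.get(p, set()) & R_j)
    # d: pages in R_j reached by at least one link from R_i.
    d = len({q for p in R_i for q in links.get(p, set()) if q in R_j})
    return s, d


def can_grow(R_i, R_j, links):
    """Growth condition from the slide: s = d = |R_i| = |R_j|."""
    s, d = link_counts(R_i, R_j, links)
    return s == d == len(R_i) == len(R_j)


R_i = {"a1", "a2", "a3"}
R_j = {"b1", "b2", "b3"}

# One-to-one links: every page in R_i links to a distinct page in R_j.
one_to_one = {"a1": {"b1"}, "a2": {"b2"}, "a3": {"b3"}}

# Two pages in R_i point at the same page in R_j, so d drops to 2.
collapsed = {"a1": {"b1"}, "a2": {"b2"}, "a3": {"b2"}}
```

With `one_to_one` the counts are s = d = 3, matching the figure example; with `collapsed` the condition fails.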

Quality of similarity measure
• Sample
  • Select 25 replicated collections (targets)
  • 5-10 mirrors from each target
  • 35000 pages from targets + 15000 random pages
• Results
  • 180 non-trivial collections
  • 149 collections -> 25 clusters
  • 180 – 149 = 31 problem collections
  • Due to partial mirrors

Quality of similarity measure – Partial Mirrors

Quality of similarity measure – Extended Clusters
• Change of growth strategy: si,j = |Ri| ≥ di,j = |Rj|
• Change of results
  • 23 more clusters identified
  • Only 8 problem collections
  • Success rate of 172 out of 180

Improved crawling
• Data set
  • 25 million web pages, from domains within the US
• Chunking strategies. Fingerprint for:
  • entire document
  • every four lines (threshold = 15)
  • every two lines of text (threshold = 25)

Improved result presentation
• Problems
  • Multiple pages from the same collection
  • Links to several replicated contents
• Solution
  • Suppressing and grouping results
  • “Replica” link and “Collection” link in results

Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger A Comparison of Techniques to Find Mirrored Hosts on the WWW

Concepts
• A and B are mirrors if
  • for every document in A
  • there is a highly similar document in B
  • with the same path
• Considering only entire web sites
  • Partial mirrors ignored

Methodology
• IP address based
  • Identical or similar IP addresses
• URL string based
  • Term vector matching on URL strings
  • Host name matching
  • Full path matching
  • Prefix matching
  • Positional word bigram matching
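The IP address based approach can be illustrated by grouping hosts on their addresses. This sketch assumes that "similar" means sharing the first three octets, which the slide does not state. The `candidate_pairs` helper is hypothetical, and the addresses come from reserved documentation ranges.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(host_ips, octets=4):
    """Pair up hosts whose IPv4 addresses agree on the first `octets` octets."""
    groups = defaultdict(list)
    for host, ip in host_ips.items():
        key = ".".join(ip.split(".")[:octets])
        groups[key].append(host)

    pairs = set()
    for hosts in groups.values():
        # Every pair of hosts inside one group is a mirror candidate.
        for a, b in combinations(sorted(hosts), 2):
            pairs.add((a, b))
    return pairs


host_ips = {
    "www.original.com": "192.0.2.10",
    "dup-1.com": "192.0.2.10",
    "dup-2.com": "192.0.2.77",
    "other.org": "198.51.100.5",
}

identical = candidate_pairs(host_ips, octets=4)  # identical addresses only
similar = candidate_pairs(host_ips, octets=3)    # same first three octets
```

Loosening the match from four octets to three adds dup-2.com to the candidate group, which trades precision for recall.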

Methodology
• URL string and connectivity based
  • URL string based + outlinks
• Host connectivity based
  • Two hosts are mirrors if they link to similar sets of hosts
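A rough way to express "link to similar sets of hosts" is set overlap. The sketch below uses Jaccard similarity, which is a swapped-in measure chosen for brevity and not necessarily the one used in the paper; the host names and the `outlink_similarity` helper are hypothetical.

```python
def outlink_similarity(hosts_a, hosts_b):
    """Jaccard overlap between the sets of hosts that two hosts link to."""
    a, b = set(hosts_a), set(hosts_b)
    if not a and not b:
        return 0.0  # no outlinks on either side: no evidence of mirroring
    return len(a & b) / len(a | b)


# Hypothetical outlink sets for two candidate mirrors.
outlinks_original = {"x.example", "y.example", "z.example"}
outlinks_duplicate = {"x.example", "y.example", "w.example"}

score = outlink_similarity(outlinks_original, outlinks_duplicate)
```

Host pairs could then be ranked by this score, with the highest-scoring pairs proposed as mirrors.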

Terminology
• Precision at rank K = (correct host pairs within K) / (total host pairs within K)
• Recall at rank K = (correct host pairs within K) / (total host pairs within K, from all algos)
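Both measures can be computed from ranked lists of host pairs. The recall denominator is ambiguous on the slide, so this sketch assumes it is the number of distinct correct pairs found within rank K by any of the algorithms. The function names and host pairs are hypothetical.

```python
def precision_at_k(ranked_pairs, correct, k):
    """Fraction of the top-K pairs that are correct mirror pairs."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    return sum(1 for pair in top if pair in correct) / len(top)


def recall_at_k(ranked_pairs, all_rankings, correct, k):
    """Correct pairs in this top-K, relative to correct pairs in any algorithm's top-K."""
    # Assumed denominator: distinct correct pairs within rank K across all algorithms.
    pool = {pair for ranking in all_rankings for pair in ranking[:k] if pair in correct}
    if not pool:
        return 0.0
    found = {pair for pair in ranked_pairs[:k] if pair in correct}
    return len(found) / len(pool)


correct = {("h1", "h2"), ("h3", "h4"), ("h7", "h8")}
algo_a = [("h1", "h2"), ("h5", "h6"), ("h3", "h4")]
algo_b = [("h1", "h2"), ("h7", "h8"), ("h3", "h4")]

p_a = precision_at_k(algo_a, correct, k=2)
r_a = recall_at_k(algo_a, [algo_a, algo_b], correct, k=2)
r_b = recall_at_k(algo_b, [algo_a, algo_b], correct, k=2)
```

Here algorithm A finds one of the two correct pairs known at rank 2, while algorithm B finds both.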

Results

Conclusion from results
• IP4 and prefix were the best single algorithms
  • Limited in recall
• Best approach – a combination of all

Discussion
• In which situations can link-based and content-based analysis be used for duplicate detection?
• What are the methods to improve content-based analysis?
• How can we merge both methods? What will be the improvements?
