
Do Not Crawl In The DUST: Different URLs Similar Text




  1. Do Not Crawl In The DUST: Different URLs Similar Text Uri Schonfeld, Department of Electrical Engineering, Technion. Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar

  2. Talk Outline • Problem statement and motivation • Related work • Our contribution • The DustBuster algorithm • Experimental results • Concluding remarks

  3. Even the WWW Gets Dusty • DUST – Different URLs Similar Text • Examples: • Standard canonization: • “http://domain.name/index.html” → “http://domain.name” • Domain names and virtual hosts: • “http://news.google.com” → “http://google.com/news” • Aliases and symbolic links: • “http://domain.name/~shuri” → “http://domain.name/people/shuri” • Parameters with little effect on content: • Print=1 • URL transformations: • “http://domain.name/story_” → “http://domain.name/story?id=”

  4. DUST Rules! • DUST rule: transforms one URL to another • Example: “index.html” → “” • Valid DUST rule: r is a valid DUST rule w.r.t. site S if for every URL u ∈ S: • r(u) is a valid URL • r(u) and u have “similar” contents • Why similar and not identical? • Comments, news, text ads, counters
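An alias DUST rule of this kind is just a substring substitution. A minimal sketch of applying one (the function name is hypothetical, not from the talk):

```python
def apply_rule(url, alpha, beta):
    """Apply the alias DUST rule alpha -> beta to url, if applicable."""
    if alpha not in url:
        return None  # the rule is not applicable to this URL
    return url.replace(alpha, beta, 1)  # substitute the first occurrence

# "index.html" -> "" maps the longer alias onto the canonical URL
print(apply_rule("http://domain.name/index.html", "index.html", ""))
```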

  5. DUST is Bad • Expensive to crawl • Access the same document via multiple URLs • Forces us to shingle • An expensive technique used to discover similar documents • Ranking algorithms suffer • References to a document split among its aliases • Multiple identical results • The same document is returned several times in the search results • Any algorithm based on URLs suffers

  6. We Want To • Given: a list of URLs from a site S • Crawl log • Web server log • Want: to find valid DUST rules w.r.t. S • As many as possible • Including site-specific ones • Minimize number of fetches • Applications: • Site-specific canonization • More efficient crawling

  7. How do we Fight DUST Today? (1) Standard Canonization • Domain name aliases • Standard extensions • Default file names: index.html, default.htm • File path canonizations: “dirname/../” → “”, “//” → “/” • Escape sequences: “%7E” → “~”
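As a sketch of what such standard, site-independent canonization looks like in code (the rule set here is illustrative and far from complete):

```python
import re

def canonize(url):
    """Apply a few standard, site-independent canonization rules."""
    url = url.replace("%7E", "~")          # unescape common escape sequences
    url = re.sub(r"(?<!:)//+", "/", url)   # collapse "//" in the path (not after "http:")
    url = re.sub(r"[^/]+/\.\./", "", url)  # resolve "dirname/../"
    url = re.sub(r"/(index\.html|default\.htm)$", "/", url)  # default file names
    return url
```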

  8. Standard Canonization is not Enough • Site-specific DUST: • “story_” → “story?id=” • “news.google.com” → “google.com/news” • “labs” → “laboratories” • This DUST is harder to find

  9. How do we Fight DUST Today? (2) Shingles • Shingles are document sketches [Broder, Glassman, Manasse 97] • Used to compare documents for similarity • Pr(shingles are equal) ≈ document similarity • Compare documents by comparing shingles • Calculating a shingle: • Take all m-word sequences • Hash them with hᵢ • Choose the minimum • That's your shingle
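The min-hash shingling described above can be sketched as follows. Salted MD5 stands in for the independent hash functions hᵢ; m and the number of hash functions are illustrative choices, not the paper's:

```python
import hashlib

def shingle(text, m=4, num_hashes=8):
    """Sketch of a document: for each hash function h_i, take the minimum
    h_i(s) over all m-word sequences s of the document."""
    words = text.split()
    seqs = [" ".join(words[i:i + m]) for i in range(len(words) - m + 1)] or [text]
    sketch = []
    for i in range(num_hashes):
        # simulate independent hash functions by salting with i
        sketch.append(min(
            int(hashlib.md5(f"{i}|{s}".encode()).hexdigest(), 16) for s in seqs))
    return tuple(sketch)

def similarity(a, b):
    """Estimated resemblance: fraction of matching sketch coordinates."""
    sa, sb = shingle(a), shingle(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Identical documents get similarity 1.0; unrelated documents almost surely share no sketch coordinates.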

  10. Shingles are Not Perfect • Shingles are expensive: • Require a fetch • Parsing • Hashing • Shingles do not find rules • Therefore, not applicable to new pages

  11. More Related Work • Mirror detection [Bharat,Broder 99], [Bharat,Broder,Dean,Henzinger 00], [Cho,Shivakumar,Garcia-Molina 00], [Liang 01] • Identifying plagiarized documents [Hoad,Zobel 03] • Finding near-replicas [Shivakumar,Garcia-Molina 98], [Di Iorio,Diligenti,Gori,Maggini,Pucci 03] • Copy detection [Brin,Davis,Garcia-Molina 95], [Garcia-Molina,Gravano,Shivakumar 96], [Shivakumar,Garcia-Molina 96]

  12. Our Contributions • An algorithm that: • Finds site-specific valid DUST rules • Requires a minimal number of fetches • Convincing results in experiments • Benefits to crawling

  13. Types of DUST • Alias DUST: simple substring substitutions • “story_1259” → “story?id=1259” • “news.google.com” → “google.com/news” • “/index.html” → “” • Parameter DUST: • Standard URL structure: protocol://domain.name/path/name?para=val&pa=va • Some parameters do not affect content: • Can be removed • Can be changed to a default value
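The two kinds of parameter DUST rules above (removing a parameter, or forcing it to a default value) can be sketched with Python's standard library; the function names and example parameters are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_param(url, name):
    """Remove a content-irrelevant parameter such as Print=1."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != name]
    return urlunsplit(parts._replace(query=urlencode(query)))

def default_param(url, name, default):
    """Force a parameter to a fixed default value."""
    parts = urlsplit(url)
    query = [(k, default if k == name else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))
```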

  14. Our Basic Framework • Input: URL list • Detect likely DUST rules (no fetch required) • Eliminate redundant rules (no fetch required) • Validate DUST rules using samples: • Eliminate DUST rules that are “wrong” • Further eliminate duplicate DUST rules

  15. How to detect likely DUST rules? • Large support principle: Likely DUST rules have lots of “evidence” supporting them • Small buckets principle: Ignore evidence that supports many different rules

  16. Large Support Principle • A pair of URLs (u,v) is an instance of rule r if r(u) = v • Support(r) = all instances (u,v) of r • Large Support Principle: the support of a valid DUST rule is large

  17. Rule Support: An Equivalent View • α: a string • Ex: α = “story_” • u: a URL that contains α as a substring • Ex: u = “http://www.sitename.com/story_2659” • Envelope of α in u: • A pair of strings (p,s) • p: prefix of u preceding α • s: suffix of u succeeding α • Example: p = “http://www.sitename.com/”, s = “2659” • E(α): all envelopes of α in URLs that appear in the input URL list

  18. Envelopes Example

  19. Rule Support: An Equivalent View • α → β: an alias DUST rule • Ex: α = “story_”, β = “story?id=” • Lemma: |Support(α → β)| = |E(α) ∩ E(β)| • Proof: • bucket(p,s) = { α | (p,s) ∈ E(α) } • Observation: (u,v) is an instance of α → β if and only if u = pαs and v = pβs for some (p,s) • Hence, (u,v) is an instance of α → β iff (p,s) ∈ E(α) ∩ E(β)
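The lemma can be checked directly in code: compute E(α) and E(β) as sets of (prefix, suffix) pairs and intersect them. A sketch (overlapping occurrences are enumerated with `find`; the URL list is illustrative):

```python
def envelopes(alpha, urls):
    """E(alpha): all (prefix, suffix) pairs around occurrences of alpha."""
    env = set()
    for u in urls:
        start = 0
        while (i := u.find(alpha, start)) != -1:
            env.add((u[:i], u[i + len(alpha):]))
            start = i + 1
    return env

def support(alpha, beta, urls):
    """Support(alpha -> beta) = E(alpha) ∩ E(beta), by the lemma."""
    return envelopes(alpha, urls) & envelopes(beta, urls)
```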

  20. Large Buckets • Often there is a large set of substrings that are interchangeable within a given URL while not being DUST: • page=1,page=2,… • lecture-1.pdf, lecture-2.pdf • This gives rise to large buckets:

  21. Small Buckets Principle • Big buckets: • Arise from a popular prefix–suffix pair • Often do not contain similar content • Are expensive to process • Small Buckets Principle: most of the support of valid alias DUST rules is likely to belong to small buckets

  22. Algorithm – Detecting Likely DUST Rules (no fetch here!) • Scan the log and form buckets • Ignore big buckets • For each small bucket: • For every two substrings α, β in the bucket: • Print (α, β) • Sort by (α, β) • For every pair (α, β): • Count its occurrences • If count > threshold, print α → β
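A minimal, fetch-free sketch of this detection pass. The thresholds and the substring-length bound are illustrative, and the efficiency tricks of the actual algorithm (sorting, size matching, tokenization) are omitted:

```python
from collections import defaultdict, Counter

def likely_rules(urls, max_len=12, max_bucket=6, min_support=2):
    """Detect likely alias DUST rules from a URL list, with no fetches."""
    buckets = defaultdict(set)  # (prefix, suffix) -> substrings seen there
    for u in urls:
        for i in range(len(u) + 1):
            for j in range(i, min(i + max_len, len(u)) + 1):
                buckets[(u[:i], u[j:])].add(u[i:j])
    counts = Counter()
    for subs in buckets.values():
        if len(subs) > max_bucket:      # small-buckets principle: skip big buckets
            continue
        for a in subs:
            for b in subs:
                if a != b:
                    counts[(a, b)] += 1  # one instance supporting rule a -> b
    return {rule for rule, c in counts.items() if c >= min_support}
```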

  23. Size and Comments • Consider only instances of rules whose sizes “match” • Use ranges of sizes • Running time: O(L log L) • Process only short substrings • Tokenize URLs

  24. Our Basic Framework • Input: URL list • Detect likely DUST rules (no fetch required) • Eliminate redundant rules (no fetch required) • Validate DUST rules using samples: • Eliminate DUST rules that are “wrong” • Further eliminate duplicate DUST rules

  25. Eliminating Redundant Rules (no fetch here!) • Rule φ refines rule ψ if Support(φ) ⊆ Support(ψ) • Example: “/vlsi/” → “/labs/vlsi/” refines “/vlsi” → “/labs/vlsi” • Lemma: a substitution rule α' → β' refines rule α → β if and only if there exists an envelope (γ,δ) such that α' = γ◦α◦δ and β' = γ◦β◦δ • The lemma helps us identify refinements easily • If φ refines ψ and their supports match, remove ψ
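The lemma gives a purely syntactic refinement test. A sketch (argument order: does the first rule refine the second?):

```python
def refines(alpha2, beta2, alpha1, beta1):
    """Does alpha2 -> beta2 refine alpha1 -> beta1?  By the lemma, look for
    an envelope (gamma, delta) with alpha2 = gamma+alpha1+delta and
    beta2 = gamma+beta1+delta."""
    for i in range(len(alpha2) - len(alpha1) + 1):
        if alpha2[i:i + len(alpha1)] == alpha1:
            gamma, delta = alpha2[:i], alpha2[i + len(alpha1):]
            if beta2 == gamma + beta1 + delta:
                return True
    return False
```

On the slide's example, "/vlsi/" → "/labs/vlsi/" refines "/vlsi" → "/labs/vlsi" via the envelope ("", "/").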

  26. Validating Likely Rules • For each likely rule r, in both directions: • Find sample URLs from the list to which r is applicable • For each URL u in the sample: • v = r(u) • Fetch u and v • Check whether content(u) is similar to content(v) • If the fraction of similar pairs > threshold: • Declare rule r valid
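The validation loop can be sketched as follows. Fetching and shingle comparison are injected as functions, since they depend on the site; the names, sample size, and threshold are illustrative:

```python
import random

def validate_rule(rule, urls, fetch, similar, sample_size=100, threshold=0.95):
    """Validate a likely DUST rule on a random sample of applicable URLs.

    rule: maps a URL to its transformed version, or None if not applicable
    fetch: returns page content, or None on failure
    similar: compares two contents (e.g. via shingles)
    """
    applicable = [u for u in urls if rule(u) is not None]
    sample = random.sample(applicable, min(sample_size, len(applicable)))
    positive = negative = 0
    for u in sample:
        cu, cv = fetch(u), fetch(rule(u))
        if cu is None:
            continue                     # u itself is broken; skip the pair
        if cv is None or not similar(cu, cv):
            negative += 1                # a pair that refutes the rule
        else:
            positive += 1
        if negative > (1 - threshold) * sample_size:
            return False                 # early stop: the rule cannot pass
    return positive > 0 and negative <= (1 - threshold) * sample_size
```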

  27. Comments About Validation • Assumption: • If a rule passes the validation threshold on a sample of 100 URLs, it will do the same on any larger sample • Why isn't the threshold 100%? • A 95%-valid rule may still be worth keeping • Dynamic pages change often

  28. Experimental Setup • We experimented on the logs of two web sites: • A dynamic forum • An academic site • Rules were detected from a log of about 20,000 unique URLs • On each site we used four logs from different time periods

  29. Precision at k

  30. Precision vs. Validation

  31. Recall • How much of the DUST do we find? • What other duplicates are there? • Soft errors • True copies: • Last semester's course • All authors of a paper • Frames • Image galleries

  32. DUST Distribution • In a crawl we examined, 18% of the crawl was reduced • Duplicate breakdown: 47.1% DUST, 25.7% images, 17.9% exact copies, 7.6% soft errors, 1.8% misc

  33. Conclusions • DustBuster is an efficient algorithm that: • Finds DUST rules • Can reduce a crawl • Can benefit ranking algorithms

  34. THE END

  35. Things to fix • = => --> • all rules with “” • Fix drawing urls crossing alpha not all p and all s

  36. So Far, Non-Directional • Prefer shrinking rules • Prefer lexicographically decreasing rules • Check those directions first

  37. Parametric DUST • Parameter name and possible values • Possible rules: • Remove a parameter • Substitute one value with another • Substitute all values with a single value • Rules are validated the same way alias rules are • We will not discuss these further

  38. False Rules • Unfortunately we see a lot of “wrong” rules • E.g., substitute “1” with “2” • Just wrong: • One domain name for another running similar software • False rule examples: • /YoninaEldar/ ≠ /DavidMalah/ • /labs/vlsi/oldsite ≠ /labs/vlsi • -2. ≠ -3.

  39. Filtering Out False Rules • Getting rid of the big buckets • Using the size field: • False DUST rules: • May give valid URLs • But content is not similar • Size is probably different • Size ranges are used • Tokenization helps

  40. DustBuster – Cleaning Up the Rules • Go over the rule list with a window • If: • Rule a refines rule b • Their support sizes are close • Then keep only rule a

  41. DustBuster – Validation • Validation per rule: • Get sample URLs • URLs to which the rule can be applied • Apply the rule: URL → transformed URL • Fetch the contents • Compare them using shingles

  42. DustBuster – Validation • Stop fetching when: • #failures > 100 * (1 - threshold) • A page that doesn't exist is not similar to anything else • Why use a threshold < 100%? • Shingles are not perfect • Dynamic pages may change quickly

  43. Detect Alias DUST – Take 2 • Tokenize, of course • Form buckets • Ignore big buckets • Count support only if sizes match • Don't count long substrings • Results are cleaner

  44. Eliminate Redundancies (no fetch here!)

   1: EliminateRedundancies(pairs_list R)
   2: for i = 1 to |R| do
   3:   if (already eliminated R[i]) continue
   4:   to_eliminate_current := false
        /* Go over a window */
   5:   for j = 1 to min(MW, |R| - i) do
          /* Support not close? Stop checking */
   6:     if (R[i].size - R[i+j].size > max(MRD*R[i].size, MAD)) break
          /* a refines b? remove b */
   7:     if (R[i] refines R[i+j])
   8:       eliminate R[i+j]
   9:     else if (R[i+j] refines R[i]) then
  10:       to_eliminate_current := true
  11:       break
  12:   if (to_eliminate_current)
  13:     eliminate R[i]
  14: return R

  45. Validate a Single Rule

   1: ValidateRule(R, L)
   2: positive := 0
   3: negative := 0
      /* Stop when you are sure you either succeeded or failed */
   4: while (positive < (1 - ε)N AND negative < εN) do
   5:   u := a random URL from L to which R is applicable
   6:   v := outcome of applying R to u
   7:   fetch u and v
   8:   if (fetch u failed) continue
        /* Something went wrong, negative sample */
   9:   if (fetch v failed) OR (shingling(u) ≠ shingling(v))
  10:     negative := negative + 1
        /* Another positive sample */
  11:   else
  12:     positive := positive + 1
  13: if (negative ≥ εN)
  14:   return FALSE
  15: return TRUE

  46. Validate Rules

   1: Validate(rules_list R, test_log L)
   2: create list of rules LR
   3: for i = 1 to |R| do
        /* Go over rules that survived = valid rules */
   4:   for j = 1 to i - 1 do
   5:     if (R[j] was not eliminated AND R[i] refines R[j])
   6:       eliminate R[i] from the list
   7:       break
   8:   if (R[i] was eliminated)
   9:     continue
        /* Test one direction */
  10:   if (ValidateRule(R[i].alpha → R[i].beta, L))
  11:     add R[i].alpha → R[i].beta to LR
        /* Test the other direction only if the first direction failed */
  12:   else if (ValidateRule(R[i].beta → R[i].alpha, L))
  13:     add R[i].beta → R[i].alpha to LR
  14:   else
  15:     eliminate R[i] from the list
  16: return LR
