
Learning URL Patterns for Webpage De-duplication. Authors: Hema Swetha Koppula… WSDM 2010 Reporter: Jing Chiu Email: D9815013@mail.ntust.edu.tw. Outlines. Introduction Duplicate URLs Problem Definition Related Works Algorithms URL Preprocessing Rule Generation Evaluation Conclusions.


Learning URL Patterns for Webpage De-duplication

Authors: Hema Swetha Koppula…

WSDM 2010

Reporter: Jing Chiu

Email: D9815013@mail.ntust.edu.tw

Data Mining & Machine Learning Lab

Outline
  • Introduction
    • Duplicate URLs
    • Problem Definition
  • Related Works
  • Algorithms
    • URL Preprocessing
    • Rule Generation
  • Evaluation
  • Conclusions

Introduction
  • Duplicate URLs
  • Problem Definition

Duplicate URLs
  • Making URLs search engine friendly
    • http://en.wikipedia.org/wiki/Casino_Royale
    • http://en.wikipedia.org/?title=Casino_Royale
  • Session-id or cookie information present in URLs
    • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
    • http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
  • Irrelevant or superfluous components in URLs
    • http://www.amazon.com/Lord-Rings/dp/B000634DCW
    • http://www.amazon.com/dp/B000634DCW
  • Webmasters construct URL representations with custom delimiters
    • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
    • http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2

Problem Definition
  • Given a set of duplicate clusters and their corresponding URLs
    • Learning Rules from URL strings which can identify duplicates
    • Utilizing learned Rules for normalizing unseen duplicate URLs into a unique normalized URL
  • Applications such as crawlers can apply these generalized Rules on a given URL to generate a normalized URL

Related Works
  • Do not crawl in the DUST: different URLs with similar text
    • Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
    • Conference: International Conference on World Wide Web (WWW), 2007
    • DUST algorithm
      • Discovering substring substitution rules to transform URLs of similar content to one canonical URL
      • Rules are learned from URLs obtained from previous crawl logs or web server logs with a confidence measure

Related Works (cont.)
  • De-duping URLs via rewrite rules
    • Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
    • Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    • Considering a broader set of rule types which subsume the DUST rules
      • DUST rules
      • session-id rules
      • irrelevant path components
      • Complicated rewrites
    • Algorithm learns rules from a cluster of URLs with similar page content
      • such a cluster is referred to as a duplicate cluster or a dup cluster

Algorithms
  • URL Preprocessing
    • Basic Tokenization
    • Deep Tokenization
  • Rule Generation
    • Pair-wise Rule Generation
    • Rule Generalization

URL Preprocessing
  • Basic Tokenization
    • Using the standard delimiters specified in RFC 1738
    • Extracted Tokens:
      • Protocol
      • Hostname
      • Path components
      • Query-args
  • Deep Tokenization
    • Using an unsupervised technique to learn custom URL encodings used by webmasters
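
Basic tokenization can be sketched with Python's standard `urllib.parse` module. This is only an illustration of splitting a URL into the four token types listed above; the function name `basic_tokenize` and the dictionary layout are assumptions, not the authors' implementation, and deep tokenization is not attempted here.

```python
from urllib.parse import urlsplit, parse_qsl

def basic_tokenize(url):
    """Split a URL into protocol, hostname, path components and query args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        # Drop the empty strings produced by leading or repeated slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Keep arguments in their original order as (name, value) pairs.
        "query": parse_qsl(parts.query, keep_blank_values=True),
    }

# Session-id example URL taken from the "Duplicate URLs" slide.
tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
print(tokens)
```

With tokens separated this way, a session-id argument such as `sid` becomes an isolated query-arg that a learned rule could later ignore or drop.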

URL Preprocessing (cont.)

Rule Generation
  • Definitions
    • URL
    • Rule
  • Example
    • u1: http://360.yahoo.com/friends-lttU7d6kIuGq
      • u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
    • u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
      • u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
    • Rule
      • Context (C):
        • c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
      • Transformation (T):
        • t(k(3.3,1.1)) = lttU7d6kIuGq.
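
The example rule can be read as "if every context key matches, overwrite the keys named in the transformation". A minimal sketch of that reading follows; storing keys as plain strings and the helper names `matches` and `apply_rule` are assumptions made for illustration only.

```python
def matches(context, url):
    """A rule fires only if every context key has the required value in the URL."""
    return all(url.get(key) == value for key, value in context.items())

def apply_rule(context, transformation, url):
    """Return the rewritten key-value map, or the URL unchanged if the context fails."""
    if not matches(context, url):
        return url
    rewritten = dict(url)
    rewritten.update(transformation)
    return rewritten

# u2 from the slide, written as a key -> value map.
u2 = {
    "k(1,3)": "http",
    "k(2,2)": "360.yahoo.com",
    "k(3.1,1.3)": "friends",
    "k(3.2,1.2)": "-",
    "k(3.3,1.1)": "nMfcaJRPUSMQ",
}
context = dict(u2)  # the context of the example rule equals u2
transformation = {"k(3.3,1.1)": "lttU7d6kIuGq"}

u1 = apply_rule(context, transformation, u2)
print(u1["k(3.3,1.1)"])
```

Applied to u2, the rule yields the key-value map of u1, so both URLs normalize to the same target.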

Rule Generation (cont.)
  • Pair-wise Rule Generation
    • Target Selection
    • Source Selection
  • Rule Generalization
    • Pair 1:
      • http://www.imdb.com/title/tt0810900/photogallery
      • http://www.imdb.com/title/tt0810900/mediaindex
    • Pair 2:
      • http://www.imdb.com/title/tt0053198/photogallery
      • http://www.imdb.com/title/tt0053198/mediaindex
    • Rule 1:
      • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
    • Rule 2:
      • c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
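
Rule 1 and Rule 2 differ only in the title id, so one generalized rule could cover both. The slides do not spell out the generalization procedure, so the sketch below uses a deliberately simple stand-in: when two rules share the same transformation, any context value on which they disagree is replaced by the wildcard `*`. The names `generalize` and `imdb_rule` are hypothetical.

```python
WILDCARD = "*"

def generalize(rule_a, rule_b):
    """Merge two rules with identical transformations by wildcarding differing context values."""
    context_a, transform_a = rule_a
    context_b, transform_b = rule_b
    if transform_a != transform_b or context_a.keys() != context_b.keys():
        return None  # not mergeable under this simplified scheme
    context = {
        key: value if context_b[key] == value else WILDCARD
        for key, value in context_a.items()
    }
    return context, transform_a

def imdb_rule(title_id):
    """Build Rule 1 / Rule 2 from the slide for a given title id."""
    context = {
        "k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
        "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id, "k(5,1)": "photogallery",
    }
    return context, {"k(5,1)": "mediaindex"}

general = generalize(imdb_rule("0810900"), imdb_rule("0053198"))
print(general)
```

The merged context keeps `photogallery` fixed and accepts any title id, which is the kind of rule that can normalize URLs not seen during learning.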

Evaluation
  • Dataset
  • Rule Numbers after each step

Evaluation (cont.)
  • Small dataset

Evaluation (cont.)
  • Small dataset

Evaluation (cont.)
  • Large dataset

Evaluation (cont.)
  • Large dataset

Conclusion
  • Presented a set of scalable and robust techniques for de-duplication of URLs
    • Basic and deep tokenization
    • Rule generation and generalization
  • Easily adaptable to the MapReduce paradigm
  • Evaluated effectiveness on both small and large datasets

Thanks for your attention
  • Questions?

Algorithm 1

Algorithm 2

Algorithm 3

Algorithm 4

Algorithm 5

Definitions of URL
  • URL: A URL u is defined as a function
    • u : K → V ∪ {⊥}
    • K: keys
      • k(x.i,y.j)
      • x, y represent the position index from the start and end of the URL
      • i,j represent the deep token index
    • V: Values
    • A key not present in the URL is denoted by ⊥
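
The key scheme k(x.i,y.j) can be made concrete with a short sketch. It assumes the URL has already been tokenized into a list of tokens, each given as a list of its deep tokens; the function name `build_keys` and the use of `None` for ⊥ are illustrative choices, not part of the original definition.

```python
def build_keys(tokens):
    """Map position-indexed keys to values.

    `tokens` is a list of tokens; each token is a list of its deep tokens.
    """
    url = {}
    n = len(tokens)
    for x, deep in enumerate(tokens, start=1):
        y = n - x + 1  # position counted from the end of the URL
        m = len(deep)
        for i, value in enumerate(deep, start=1):
            j = m - i + 1  # deep token index counted from the end
            key = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            url[key] = value
    return url

# http://www.imdb.com/title/tt0810900/photogallery, with "tt0810900" deep-tokenized.
u = build_keys(
    [["http"], ["www.imdb.com"], ["title"], ["tt", "0810900"], ["photogallery"]]
)
print(u)
print(u.get("k(6,0)"))  # a key not present in the URL; None plays the role of ⊥
```

Indexing each token from both ends lets a rule refer to, say, "the last path component" regardless of how long the URL is.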

Definitions of Rule
  • RULE: A Rule r is defined as a function
    • r : C → T
    • C: context
      • C : K → V ∪ {∗}
    • T: transformation
      • T : K → V ∪ {⊥,K’}
        • K’ = K ∪ ValueConversions
        • ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
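
The three kinds of transformation target (a literal value from V, the deletion marker ⊥, and a value conversion over a key) can be illustrated as follows. This is a sketch under assumptions: `DELETE`, `lowercase` and `transform` are hypothetical names, only the Lowercase conversion is shown, and the sample key-value map is invented for the example.

```python
DELETE = None  # stands in for ⊥: the key is removed from the normalized URL

def lowercase(key):
    """Build a transformation target that copies `key` from the source URL in lowercase."""
    return lambda url: url[key].lower()

def transform(url, transformation):
    """Apply a transformation T to a URL given as a key -> value map."""
    result = dict(url)
    for key, target in transformation.items():
        if target is DELETE:
            result.pop(key, None)
        elif callable(target):
            result[key] = target(url)  # value conversion computed from the source URL
        else:
            result[key] = target  # literal value from V
    return result

# Invented example: mixed-case hostname plus a session-id token to drop.
url = {"k(1,3)": "http", "k(2,2)": "WWW.Example.COM", "k(3,1)": "sid=123"}
normalized = transform(url, {"k(2,2)": lowercase("k(2,2)"), "k(3,1)": DELETE})
print(normalized)
```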

Rule Coverage

MapReduce
