Factors affecting website reconstruction from the web infrastructure

Factors Affecting Website Reconstruction from the Web Infrastructure. Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 20, 2007. Outline. Web-repository crawling with Warrick


Factors Affecting Website Reconstruction from the Web Infrastructure

Frank McCown, Norou Diawara, and Michael L. Nelson

Old Dominion University, Computer Science Department, Norfolk, Virginia, USA

JCDL 2007, Vancouver, BC, June 20, 2007


Outline

  • Web-repository crawling with Warrick

  • How successful is a reconstruction?

  • Reconstruction experiment

  • Significant findings


Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg

Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg

Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg


Crawling the Crawlers


Cached Image


Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

Figure labels: canonical, MSN version, Yahoo version, Google version


  • McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

  • McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

  • McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

  • McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/


Measuring the Difference

Apply a recovery vector to each resource: (rc, rm, ra) = (changed, missing, added)

Compute the difference vector for the website


Some Difference Vectors

D = (changed, missing, added)

(0,0,0) – Perfect recovery

(1,0,0) – All resources are recovered but changed

(0,1,0) – All resources are lost

(0,0,1) – All recovered resources are at new URIs
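
A minimal sketch of how such a vector might be computed, assuming each website is represented as a mapping from URI to a content fingerprint. The function name and the normalization (changed and missing divided by the number of original resources, added divided by the number of recovered resources) are assumptions for illustration and are not stated on the slides. Under this counting, a site whose resources all reappear at new URIs would also register as fully missing, so the sketch does not attempt to reproduce the (0,0,1) example exactly.

```python
from typing import Dict, Tuple


def difference_vector(original: Dict[str, str],
                      recovered: Dict[str, str]) -> Tuple[float, float, float]:
    """Return (changed, missing, added) for a reconstructed website.

    Both arguments map URI -> content fingerprint (for example, a hash).
    The normalization used here is an assumption, not the authors' definition.
    """
    orig_uris = set(original)
    rec_uris = set(recovered)

    # Resources found at the same URI but with different content.
    changed = sum(1 for uri in orig_uris & rec_uris
                  if original[uri] != recovered[uri])
    # Resources in the original site that were not recovered.
    missing = len(orig_uris - rec_uris)
    # Recovered resources at URIs the original site did not have.
    added = len(rec_uris - orig_uris)

    d_c = changed / len(original) if original else 0.0
    d_m = missing / len(original) if original else 0.0
    d_a = added / len(recovered) if recovered else 0.0
    return (d_c, d_m, d_a)
```

For instance, comparing a site against an identical copy yields (0.0, 0.0, 0.0), and comparing it against an empty reconstruction yields (0.0, 1.0, 0.0).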


How Much Change is a Bad Thing?

Figure labels: Lost, Recovered




Assigning Penalties

Penalty adjustment: (Pc, Pm, Pa)

Apply to each resource, or to the difference vector
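
One plausible reading of the penalty adjustment is a weighted sum of the three components. The sketch below assumes that reading; the function names and the choice of a weighted sum are hypothetical and are not confirmed by the slides.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, float, float]


def penalize(vector: Vector, penalties: Vector) -> float:
    """Weight (changed, missing, added) by (Pc, Pm, Pa) and sum the result.

    Assumption: the penalty adjustment acts as a simple weighted sum.
    """
    return sum(p * v for p, v in zip(penalties, vector))


def mean_resource_penalty(recovery_vectors: Sequence[Vector],
                          penalties: Vector) -> float:
    """Apply the penalties to each resource's recovery vector and average."""
    if not recovery_vectors:
        return 0.0
    total = sum(penalize(v, penalties) for v in recovery_vectors)
    return total / len(recovery_vectors)
```

With penalties of (0, 1, 0), only missing resources are counted, so the penalized value of a difference vector reduces to its missing component.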


Defining Success

success = 1 – dm, where dm is the missing component of the difference vector

Equivalent to the percent of recovered resources

Scale: 0 (less successful) to 1 (more successful)
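
The equivalence with the fraction of recovered resources can be checked with a short sketch. It assumes that the missing component is the number of missing resources divided by the number of original resources; the function name and parameters are illustrative.

```python
def success(total_original: int, missing: int) -> float:
    """Return 1 - d_m, assuming d_m = missing / total_original."""
    if total_original <= 0:
        raise ValueError("total_original must be positive")
    if not 0 <= missing <= total_original:
        raise ValueError("missing must be between 0 and total_original")
    d_m = missing / total_original
    return 1 - d_m
```

Under that assumption, a site with 4 original resources and 1 missing has a success of 0.75, which matches 3 of 4 resources recovered.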


Reconstruction Experiment

  • 300 websites chosen randomly from Open Directory Project (dmoz.org)

  • Crawled and reconstructed each website every week for 14 weeks

  • Examined change rates, age, decay, growth, recoverability


Success of website recovery each week

*On average, we recovered 61% of a website in any given week.


Recovery of Textual Resources


Recovery by TLD


Birth and Decay


Recovery of HTML Resources


Recovery by Age


Statistics for Repositories


Which Factors Are Significant?

  • External backlinks

  • Internal backlinks

  • Google’s PageRank

  • Hops from root page

  • Path depth

  • MIME type

  • Query string params

  • Age

  • Resource birth rate

  • TLD

  • Website size

  • Size of resources


Mild Correlations

  • Hops and

    • website size (0.428)

    • path depth (0.388)

  • Age and # of query params (-0.318)

  • External links and

    • PageRank (0.339)

    • Website size (0.301)

    • Hops (0.320)
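
The slides do not say which correlation coefficient was used. As an illustration only, the sketch below computes Pearson's r between two factors; the choice of Pearson's r and the function name are assumptions, and no values from the study are reproduced.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

A value near 0.3 or 0.4, as in the figures listed above, indicates a weak to moderate linear relationship, while values near 1 or -1 indicate a strong one.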


Regression Analysis

  • No surprises: all variables are significant, but overall model only explains about half of the observations

  • Three most significant variables: PageRank, hops and age (R-squared = 0.1496)


Conclusions

  • Most of the sampled websites were relatively stable

    • One third of the websites never lost a single resource

    • Half of the websites never added any new resources

  • The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)

  • How to improve recovery from WI? Improve PageRank, decrease number of hops to resources, create stable URLs


Thank You

Sorry, Dad… You lost me in the first two minutes.

Frank McCown

[email protected]

http://www.cs.odu.edu/~fmccown/


Injecting Server Components into Crawlable Pages

Erasure codes

HTML pages

Recover at least m blocks
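
The slide does not name a specific erasure code. To illustrate the "recover at least m blocks" property, the sketch below uses the simplest possible scheme, a single XOR parity block, so that any m of the m + 1 blocks are enough to rebuild the data. This is a swapped-in teaching example, not the scheme the authors used, and all function names are hypothetical.

```python
from functools import reduce
from typing import Dict, List


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, m: int) -> List[bytes]:
    """Split data into m equal-size blocks and append one XOR parity block."""
    block_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(block_len * m, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(m)]
    parity = reduce(xor_blocks, blocks)
    return blocks + [parity]


def decode(recovered: Dict[int, bytes], m: int, length: int) -> bytes:
    """Rebuild the data from any m of the m + 1 blocks (index m is parity)."""
    if len(recovered) < m:
        raise ValueError("need at least m blocks")
    missing = [i for i in range(m) if i not in recovered]
    if missing:
        # Exactly one data block is absent, so the parity block is present:
        # XOR of all recovered blocks equals the missing data block.
        rebuilt = reduce(xor_blocks, recovered.values())
        recovered = {**recovered, missing[0]: rebuilt}
    return b"".join(recovered[i] for i in range(m))[:length]
```

In the setting of the slide, each block could be embedded in a separate HTML page, so that losing one page to the repositories would still leave the server component recoverable. Practical schemes tolerate more than one lost block.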


Web Server

Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.)

Web Infrastructure

Recoverable

config

Perl script

Dynamic page

Database

Not Recoverable

