
Thinking Differently About Web Page Preservation

Michael L. Nelson, Frank McCown, Joan A. Smith

Old Dominion University

Norfolk VA

{mln,fmccown,jsmit}@cs.odu.edu

Library of Congress

Brown Bag Seminar

June 29, 2006

Research supported in part by NSF, the Library of Congress, and the Andrew W. Mellon Foundation

Background
  • “We can’t save everything!”
    • if not “everything”, then how much?
    • what does “save” mean?
“Women and Children First”

HMS Birkenhead, wrecked off Danger Point, Cape Colony, 1852

  • 638 passengers
  • 193 survivors, including all 7 women & 13 children

image from: http://www.btinternet.com/~palmiped/Birkenhead.htm

Slide 4

We should probably save a copy of this…

Slide 5

Or maybe we don’t have to… the Wikipedia link is in the top 10, so we’re ok, right?

Slide 6

Surely we’re saving copies of this…

Slide 7

2 copies in the UK, 2 Dublin Core records. That’s probably good enough…

Slide 8

What about the things that we know we don’t need to keep? You DO support recycling, right?

Lessons Learned from the AIHT

(Boring stuff: D-Lib Magazine, December 2005)

Preservation metadata is like a David Hockney Polaroid collage: each image is both true and incomplete, and while the result is not faithful, it does capture the “essence”.

images from: http://facweb.cs.depaul.edu/sgrais/collage.htm

Preservation: Fortress Model

Five Easy Steps for Preservation:

  • Get a lot of $
  • Buy a lot of disks, machines, tapes, etc.
  • Hire an army of staff
  • Load a small amount of data
  • “Look upon my archive ye Mighty, and despair!”

image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

Alternate Models of Preservation
  • Lazy Preservation
    • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
    • Wait for it to disappear first, then recover a “good enough” version
  • Shared Infrastructure Preservation
    • Push your content to sites that might preserve it
  • Web Server Enhanced Preservation
    • Use Apache modules to create archival-ready resources

image from: http://www.proex.ufes.br/arsm/knots_interlaced.htm

Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Cost of Preservation

[Figure: preservation approaches (filesystem backups, Furl/Spurl, browser cache, InfoMonitor, LOCKSS, Hanzo:web, iPROXY, TTApache, web archives, SE caches) plotted by publisher’s cost (time, equipment, knowledge), from high to low, against coverage of the Web, with client-view and server-view approaches distinguished.]
Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Research Questions
  • How much digital preservation of websites is afforded by lazy preservation?
    • Can we reconstruct entire websites from the WI?
    • What factors contribute to the success of website reconstruction?
    • Can we predict how much of a lost website can be recovered?
    • How can the WI be utilized to provide preservation of server-side components?
Prior Work
  • Is website reconstruction from WI feasible?
    • Web repositories: Google, MSN, Yahoo, Internet Archive (G, M, Y, IA)
    • Web-repository crawler: Warrick
    • Reconstructed 24 websites
  • How long do search engines keep cached content after it is removed?
Timeline of SE Resource Acquisition and Release

A resource passes through four states, defined by the time it is cached by a search engine (t_ca), removed from the web server (t_r), and removed from the SE cache (t_cr):

Vulnerable resource – not yet cached (t_ca is not defined)

Replicated resource – available on both the web server and the SE cache (t_ca < current time < t_r)

Endangered resource – removed from the web server but still cached (t_ca < current time < t_cr)

Unrecoverable resource – missing from both the web server and the cache (t_ca < t_cr < current time)

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites, D-Lib Magazine, 12(2), February 2006.

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.
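To make the four states concrete, here is a minimal Python sketch (not part of the original experiment code) that classifies a resource from the three timestamps, assuming each of t_ca, t_r, and t_cr is None until the corresponding event has happened:

```python
from datetime import datetime

def resource_state(t_ca, t_r, t_cr, now=None):
    """Classify a resource by its cache/removal timeline (sketch only).

    t_ca -- time the resource was cached by the search engine (None if never)
    t_r  -- time the resource was removed from the web server (None if still live)
    t_cr -- time the cached copy was purged from the SE cache (None if still cached)
    Assumes the slide's ordering: removal from the server precedes cache purging.
    """
    now = now or datetime.utcnow()
    if t_ca is None:
        return "vulnerable"      # on the server, but no cached copy exists yet
    if t_cr is not None and t_cr < now:
        return "unrecoverable"   # gone from both the server and the cache
    if t_r is not None and t_r < now:
        return "endangered"      # removed from the server, only the cache remains
    return "replicated"          # live on the server and backed by a cached copy

# example: cached June 1, removed from the server June 10, cache still holds it
print(resource_state(datetime(2006, 6, 1), datetime(2006, 6, 10), None))  # endangered
```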

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

[Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions of it.]

Web Repository Characteristics

  • C – canonical version is stored
  • M – modified version is stored (modified images are thumbnails, all others are HTML conversions)
  • ~R – indexed but not retrievable
  • ~S – indexed but not stored

SE Caching Experiment
  • Create html, pdf, and images
  • Place files on 4 web servers
  • Remove files on regular schedule
  • Examine web server logs to determine when each page is crawled and by whom
  • Query each search engine daily using unique identifier to see if they have cached the page or image

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
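The daily cache check can be scripted in a few lines. The sketch below is a crude illustration only: the search endpoints and the presence test are placeholders, not the query interfaces actually used in the 2005–2006 experiment.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoints -- the real experiment queried Google, MSN, and Yahoo
# through whatever query interface each engine offered at the time.
ENGINES = {
    "google": "https://www.google.com/search?q={q}",
    # "msn": "...", "yahoo": "..."   (placeholders)
}

def is_cached(engine, unique_id):
    """Return True if the engine's result page mentions the unique identifier
    planted in the test page (a crude presence check, not a real cache API)."""
    url = ENGINES[engine].format(q=urllib.parse.quote(unique_id))
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return unique_id in html

# run once a day (e.g. from cron) and log the result per engine and per test page
```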

Reconstructing a Website

[Figure: Warrick’s interaction with a web repository – starting from an original URL, Warrick queries the repository, follows the results page to the cached URL, retrieves the cached resource, and writes the retrieved resource to the file system.]

  • Pull resources from all web repositories
  • Strip off extra header and footer html
  • Store most recently cached version or canonical version
  • Parse html for links to other resources
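A toy version of that loop might look like the sketch below. The real Warrick crawler is considerably more involved (per-repository query interfaces, request limits, canonical-vs-cached selection), and the three callables passed in here are hypothetical stand-ins for that repository-specific code.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def reconstruct(start_url, fetch_from_repositories, strip_repository_chrome, save):
    """Breadth-first reconstruction of a lost site from web-repository copies.

    fetch_from_repositories(url) -> list of (repo_name, cache_timestamp, content)
    strip_repository_chrome(html) -> html without the cache banner header/footer
    save(url, content)            -> write the recovered resource to the filesystem
    """
    site = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    while queue:
        url = queue.popleft()
        copies = fetch_from_repositories(url)        # Google, MSN, Yahoo, IA, ...
        if not copies:
            continue                                  # resource not recoverable
        # prefer the most recently cached copy (a canonical copy, if any, could win instead)
        _, _, content = max(copies, key=lambda c: c[1])
        if isinstance(content, str):
            content = strip_repository_chrome(content)
            # parse the HTML for further links belonging to the same site
            for href in re.findall(r'href=["\'](.*?)["\']', content, re.I):
                link = urljoin(url, href)
                if urlparse(link).netloc == site and link not in seen:
                    seen.add(link)
                    queue.append(link)
        save(url, content)
```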
How Much Did We Reconstruct?

[Figure: the “lost” web site and the reconstructed web site as node-and-link diagrams (resources A–G). In the reconstruction, B and C come back changed (B′, C′), the link to D is missing and points instead to an old resource G, and F can’t be found.]

Reconstruction Diagram

[Figure: reconstruction breakdown for an example site – identical 50%, changed 33%, missing 17%, added 20%.]

Websites to Reconstruct
  • Reconstruct 24 sites in 3 categories:

    1. small (1–150 resources)
    2. medium (150–499 resources)
    3. large (500+ resources)

  • Use Wget to download current website
  • Use Warrick to reconstruct
  • Calculate reconstruction vector
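The reconstruction vector can be computed by diffing the Wget download against the Warrick output. The sketch below shows one plausible formulation (identical/changed/missing as fractions of the original site, added as a fraction of the reconstruction); the exact definition in the technical report may differ.

```python
import hashlib

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def reconstruction_vector(original, reconstructed):
    """original, reconstructed: dicts mapping relative URL -> local file path.

    Returns (identical, changed, missing, added) as fractions; identical, changed,
    and missing are relative to the original site, added to the reconstructed one.
    """
    orig_urls, recon_urls = set(original), set(reconstructed)
    common = orig_urls & recon_urls
    identical = sum(1 for u in common
                    if file_hash(original[u]) == file_hash(reconstructed[u]))
    changed = len(common) - identical
    missing = len(orig_urls - recon_urls)
    added = len(recon_urls - orig_urls)
    n, m = len(orig_urls) or 1, len(recon_urls) or 1
    return identical / n, changed / n, missing / n, added / m
```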
Results

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster, Technical Report, arXiv cs.IR/0512069, 2005.

Warrick Milestones
  • www2006.org – first lost website reconstructed (Nov 2005)
  • DCkickball.org – first website someone else reconstructed without our help (late Jan 2006)
  • www.iclnet.org – first website we reconstructed for someone else (mid Mar 2006)
  • Internet Archive officially “blesses” Warrick (mid Mar 2006) [1]

[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html

Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus
Proposed Work
  • How lazy can we afford to be?
    • Find factors influencing success of website reconstruction from the WI
    • Perform search engine cache characterization
  • Inject server-side components into WI for complete website reconstruction
  • Improving the Warrick crawler
    • Evaluate different crawling policies
      • Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006.
    • Development of web-repository API for inclusion in Warrick
Factors Influencing Website Recoverability from the WI
  • Previous study did not find statistically significant relationship between recoverability and website size or PageRank
  • Methodology
    • Sample large number of websites - dmoz.org
    • Perform several reconstructions over time using same policy
    • Download sites several times over time to capture change rates
Evaluation
  • Use statistical analysis to test for the following factors:
    • Size
    • Makeup
    • Path depth
    • PageRank
    • Change rate
  • Create a predictive model – how much of my lost website do I expect to get back?
Recovery of Web Server Components
  • Recovering the client-side representation is not enough to reconstruct a dynamically-produced website
  • How can we inject the server-side functionality into the WI?
  • Web repositories like HTML
    • Canonical versions stored by all web repos
    • Text-based
    • Comments can be inserted without changing appearance of page
  • Injection: Use erasure codes to break a server file into chunks and insert the chunks into HTML comments of different pages
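As a rough illustration of the injection idea only: the sketch below splits a server-side file into labeled base64 chunks and wraps them in HTML comments. It omits the erasure coding the proposal actually calls for (where any r of the n chunks suffice to rebuild the file), and the chunk-marker format is invented for the example.

```python
import base64
import textwrap

def make_chunks(path, n):
    """Split a server-side file into n base64 chunks.  (The proposed approach uses
    an erasure code so that any r of the n chunks suffice; plain splitting shown
    here for illustration only.)"""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    size = -(-len(data) // n)            # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]

def as_html_comment(filename, index, total, chunk):
    """Wrap one chunk in an HTML comment that can be appended to a generated page
    without changing how the page renders."""
    body = "\n".join(textwrap.wrap(chunk, 76))
    return f"<!-- warrick-chunk file={filename} part={index}/{total}\n{body}\n-->"

chunks = make_chunks("index.php", 10)    # hypothetical server-side script
comments = [as_html_comment("index.php", i + 1, len(chunks), c)
            for i, c in enumerate(chunks)]
```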
Evaluation
  • Find the most efficient values for n and r (chunks created/recovered)
  • Security
    • Develop simple mechanism for selecting files that can be injected into the WI
    • Address encryption issues
  • Reconstruct an EPrints website with a few hundred resources
SE Cache Characterization
  • Web characterization is an active field
  • Search engine caches have never been characterized
  • Methodology
    • Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
    • Download cached version and live version from the Web
    • Examine HTTP headers and page content
    • Test for overlap with Internet Archive
    • Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
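The cached-versus-live comparison step might look like the sketch below, which assumes the cached-copy URL for each engine has already been scraped from its results page (those URL formats are engine-specific and not shown) and compares a few headers plus a simple content digest.

```python
import hashlib
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "cache-char/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return dict(resp.headers), resp.read()

def compare(live_url, cached_url):
    """Compare the live page with its cached copy: headers of interest plus an
    MD5 digest (real analysis would normalise the cache banner away first)."""
    live_hdrs, live_body = fetch(live_url)
    cache_hdrs, cache_body = fetch(cached_url)
    return {
        "live_last_modified": live_hdrs.get("Last-Modified"),
        "live_content_type": live_hdrs.get("Content-Type"),
        "cached_content_type": cache_hdrs.get("Content-Type"),
        "same_content": hashlib.md5(live_body).hexdigest()
                        == hashlib.md5(cache_body).hexdigest(),
    }
```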
Summary: Lazy Preservation

When this work is completed, we will have…

  • demonstrated and evaluated the lazy preservation technique
  • provided a reference implementation
  • characterized SE caching behavior
  • provided a layer of abstraction on top of SE behavior (API)
  • explored how much we store in the WI (server-side vs. client-side representations)

Web Server Enhanced Preservation: “How much preservation do I get if I do just a little bit?”

Joan A. Smith

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
WWW and DL: Separate Worlds

[Figure: from 1994 to today, the WWW (“Crawlapalooza”) and digital libraries (“Harvester Home Companion”) have evolved as separate worlds.]

The problem is not that the WWW doesn’t work; it clearly does.

The problem is that our (preservation) expectations have been lowered.

Data Providers / Repositories and Service Providers / Harvesters

“A repository is a network accessible server that can process the 6 OAI-PMH requests… A repository is managed by a data provider to expose metadata to harvesters.”

“A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.”

Aggregators
  • aggregators allow for:
    • scalability for OAI-PMH
    • load balancing
    • community building
    • discovery

data providers (repositories) → aggregator → service providers (harvesters)

OAI-PMH Data Model

[Figure: a resource is represented by an item, whose OAI-PMH identifier is the entry point to all records pertaining to the resource; each record (identifier + metadataPrefix + datestamp) carries metadata pertaining to the resource, e.g. Dublin Core or MARCXML, and items can be grouped into OAI-PMH sets.]

OAI-PMH Used by Google & AcademicLive (MSN)
  • Why support OAI-PMH?
    • These guys are in business (i.e., for profit)
    • How does OAI-PMH help their bottom line?
    • By improving the search and analysis process
Resource Harvesting with OAI-PMH

[Figure: the same data model used for resource harvesting: an item’s records range from simple Dublin Core metadata, to more expressive MARCXML, to highly expressive complex object formats (MPEG-21 DIDL, METS); the OAI-PMH identifier remains the entry point to all records pertaining to the resource.]

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
Two Problems

The representation problem: machine-readable formats and human-readable formats have different requirements.

The counting problem: there is no way to determine the list of valid URLs at a web site.
mod_oai solution
  • Integrate OAI-PMH functionality into the web server itself…
  • mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
    • written in C
    • respects values in .htaccess, httpd.conf
  • compile mod_oai on http://www.foo.edu/
  • baseURL is now http://www.foo.edu/modoai
    • Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

The human-readable web site, prepped for machine-friendly harvesting: “Give me a list of all resources, with Dublin Core metadata, dating from 9/15/2004 through today, that are MIME type video/MPEG.”
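For illustration, the harvester’s side of that example request could be scripted as below. The verb, parameter names, and OAI-PMH namespace are standard; the base URL is the hypothetical www.foo.edu deployment from the slide.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE = "http://www.foo.edu/modoai"        # hypothetical mod_oai baseURL from the slide

params = {
    "verb": "ListIdentifiers",
    "metadataPrefix": "oai_dc",
    "from": "2004-09-15",
    "set": "mime:video:mpeg",             # MIME-type based set exposed by mod_oai
}
url = BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url, timeout=30) as resp:
    tree = ET.fromstring(resp.read())

# each <header> carries the OAI-PMH identifier (here, the resource's URL) and datestamp
for header in tree.iter("{http://www.openarchives.org/OAI/2.0/}header"):
    ident = header.findtext("oai:identifier", namespaces=OAI_NS)
    stamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
    print(stamp, ident)
```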

A Crawler’s View of the Web Site

[Figure: the web root as a tree. Crawled pages are only part of it; other pages are not crawled because they are protected, generated on the fly (e.g. by CGI), excluded by robots.txt or a robots META tag, unadvertised and unlinked, reachable only from a remote web site, or too deep.]

Apache’s View of the Web Site

[Figure: the same web root from Apache’s perspective – resources that require authentication, resources generated on the fly (e.g. CGI), resources tagged “no robots”, and resources that are unknown or not visible to the server.]

The Problem: Defining The “Whole Site”
  • For a given server, there is a set of URLs, U, and a set of files, F
    • Apache maps U → F
    • mod_oai maps F → U
  • Neither mapping is 1-1 nor onto
    • We can easily check whether a single u maps to some f in F, but given F we cannot (easily) generate U
  • Short-term issues:
    • dynamic files
      • exporting unprocessed server-side files would be a security hole
    • IndexIgnore
      • httpd will “hide” valid URLs
    • File permissions
      • httpd will advertise files it cannot read
  • Long-term issues
    • Alias, Location
      • files can be covered up by the httpd
    • UserDir
      • interactions between the httpd and the filesystem
A Webmaster’s Omniscient View

[Figure: the web root as the webmaster sees it, backed by httpd and MySQL and containing files such as Data1, User.abc, Fred.foo, file1, /dir/wwx, Foo.html – with dynamic, authenticated, “no robots”, orphaned, deep, and unknown/not-visible resources all accounted for.]

HTTP GET versus OAI-PMH GetRecord

[Figure: against the same Apache web server running mod_oai, “GET /headlines.html HTTP/1.1” returns the human-readable page, while “GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl” returns a machine-readable complex object that packages the resource with JHOVE metadata, an MD5 checksum, ls output, and other forensic information.]

OAI-PMH Data Model in mod_oai

[Figure: the data model applied to a web resource, e.g. http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf. The resource is an item whose OAI-PMH identifier (its URL) is the entry point to all records pertaining to it; the records carry HTTP header metadata, Dublin Core metadata, and an MPEG-21 DIDL, and sets are based on MIME type.]

Complex Objects That Tell A Story

Like a Russian nesting doll (first came Lenin, then came Stalin…), the complex object wraps layer after layer around the resource: DC metadata, Jhove metadata, checksum, provenance.

http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL:

<didl>
  <metadata source="jhove">...</metadata>
  <metadata source="file">...</metadata>
  <metadata source="essence">...</metadata>
  <metadata source="grep">...</metadata>
  ...
  <resource mimeType="application/pdf"
            identifier="http://foo.edu/bar.pdf"
            encoding="base64">
    SADLFJSALDJF...SLDKFJASLDJ
  </resource>
</didl>

  • Resource and metadata packaged together as a complex digital object represented via an XML wrapper
  • Uniform solution for simple & compound objects
  • Unambiguous expression of the locator of the datastream
  • Disambiguation between locators & identifiers
  • OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
  • OAI-PMH semantics apply: “about” containers, set membership
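A harvester that receives such a record can unwrap it with a few lines of XML handling. The sketch below follows the simplified <didl> example above (not the full MPEG-21 DIDL schema) and decodes the base64 datastream back to the original bytes:

```python
import base64
import xml.etree.ElementTree as ET

def unwrap_didl(didl_xml):
    """Return ({metadata source: xml text}, resource bytes, mime type) for a
    simplified <didl> document like the one sketched above."""
    root = ET.fromstring(didl_xml)
    metadata = {m.get("source"): (m.text or "") for m in root.findall("metadata")}
    res = root.find("resource")
    payload = base64.b64decode(res.text.strip()) if res is not None else b""
    mime = res.get("mimeType") if res is not None else None
    return metadata, payload, mime
```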
Resource Discovery: ListIdentifiers

HARVESTER:

  • issues a ListIdentifiers,
  • finds URLs of updated resources
  • does HTTP GETs for the updated resources only
  • can get URLs of resources with specified MIME types
Preservation: ListRecords

HARVESTER:

  • issues a ListRecords,
  • Gets updates as MPEG-21 DIDL documents (HTTP headers, resource By Value or By Reference)
  • can get resources with specified MIME types
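A sketch of that harvesting loop, including the resumptionToken paging that OAI-PMH uses for large result sets (the base URL and set name follow the earlier hypothetical example):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(base_url, **params):
    """Yield every <record> element, following resumptionTokens page by page."""
    params.setdefault("verb", "ListRecords")
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=60) as resp:
            page = ET.fromstring(resp.read())
        for record in page.iter(OAI + "record"):
            yield record
        token = page.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return
        # subsequent requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# e.g. harvest updated MPEG video resources as MPEG-21 DIDL complex objects:
# for rec in list_records("http://www.foo.edu/modoai", metadataPrefix="oai_didl",
#                         set="mime:video:mpeg", **{"from": "2004-09-15"}):
#     ...
```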
What does this mean?
  • For an entire web site, we can:
    • serialize everything as an XML stream
    • extract it using off-the-shelf OAI-PMH harvesters
    • efficiently discover updates & additions
  • For each URL, we can:
    • create “preservation ready” version with configurable {descriptive|technical|structural} metadata
      • e.g., Jhove output, datestamps, signatures, provenance, automatically generated summary, etc.

[Figure: the workflow – harvest the resource, extract metadata (Jhove and other pertinent info, lexical signatures, summaries, an index, translations…), and wrap it all together in an XML stream, ready for the future.]

Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus
Research Contributions

Thesis Question: How well can Apache support web page preservation?

Goal: To make web resources “preservation ready”

  • Support refreshing (“how many URLs at this site?”): the counting problem
  • Support migration (“what is this object?”): the representation problem

How: Using OAI-PMH resource harvesting

  • Aggregate forensic metadata
    • Automate extraction
  • Encapsulate into an object
    • XML stream of information
  • Maximize preservation opportunity
    • Bring DL technology into the realm of WWW
Experimentation & Evaluation
  • Research solutions to the counting problem
    • Different tools yield different results
    • Google sitemap ≠ Apache file list ≠ robot-crawled pages
    • Combine approaches for one automated, full URL listing (see the sketch after this list)
      • Apache logs are detailed history of site activity
      • Compare user page requests with crawlers’ requests
      • Compare crawled pages with actual site tree
  • Continue research on the representation problem
    • Integrate utilities into mod_oai (Jhove, etc.)
    • Automate metadata extraction & encapsulation
  • Serialize and reconstitute
    • complete back-up of site & reconstitution through XML stream
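As a sketch of the kind of comparison proposed above (the item on combining approaches): diff the URL sets seen by different tools. This assumes a Common/Combined Log Format access log and a standard sitemap.xml; the file names are placeholders.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_access_log(path):
    """Request paths seen in a Common/Combined Log Format access log."""
    pat = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')
    paths = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = pat.search(line)
            if m:
                paths.add(m.group(1).split("?")[0])
    return paths

def urls_from_sitemap(path):
    """Path components of the <loc> entries in a sitemap.xml file."""
    return {urlparse(loc.text.strip()).path
            for loc in ET.parse(path).iter(SITEMAP_NS + "loc") if loc.text}

logged = urls_from_access_log("access.log")   # placeholder file names
mapped = urls_from_sitemap("sitemap.xml")
print("requested but not in the sitemap:", sorted(logged - mapped)[:20])
print("in the sitemap but never requested:", sorted(mapped - logged)[:20])
```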
Summary: Web Server Enhanced Preservation
  • Better web harvesting can be achieved through:
    • OAI-PMH: structured access to updates
    • Complex object formats: modeled representation of digital objects
  • Address 2 key problems:
    • Preservation (ListRecords) – The Representation Problem
    • Web crawling (ListIdentifiers) – The Counting Problem
  • mod_oai: reference implementation
    • Better performance than wget & crawlers
    • not a replacement for DSpace, Fedora, eprints.org, etc.
  • More info:
    • http://www.modoai.org/
    • http://whiskey.cs.odu.edu/

Automatic harvesting of web resources rich in metadata packaged for the future

Today: manual

Tomorrow: automatic!


Summary

Michael L. Nelson

Summary
  • Digital preservation is not hard, it’s just big.
    • Save the women and children first, of course, but there is room for many more…
  • Using the by-product of SE and WI, we can get a good amount of preservation for free
    • prediction: Google et al. will eventually see preservation as a business opportunity
  • Increasing the role of the web server will solve most of the digital preservation problems
    • complex objects + OAI-PMH = digital preservation solution
Slide 77

“As you know, you preserve the files you have. They’re not the files you might want or wish to have at a later time.”

“If you think about it, you can have all the metadata in the world on a file and a file can be blown up.”

image from: http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html