
NSDL Persistent Archive Service


Presentation Transcript


  1. NSDL Persistent Archive Service Charles Cowart SDSC Summer Institute August 25th, 2004

  2. What is the Archive Service? • A system of OAI harvesting, web crawling, and data management tools used to copy, or ‘back up’, data hosted on the web. • An interface through which the NSDL and other communities can access that data over the web, as though it were the original.

  3. Why is it needed? • Many research projects have made valuable data available to the scientific and educational community on the World Wide Web. • Educators may base part of their curriculum on data that could change or be lost should a website be taken offline. • The Persistent Archive Service (PAS) offers a history of snapshots that can be relied upon should the original change or disappear.

  4. How does it work? • The NSDL makes available a ‘card catalog’ of valuable websites via the web and an XML-based protocol called OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting). • Each website is represented as an OAI record, which looks something like this:

  5. Anatomy of an OAI Record, part 1
    <record>
      <header>
        <identifier>oai:arXiv.org:cs/0112017</identifier>
        <datestamp>2002-02-28</datestamp>
        <setSpec>math</setSpec>
      </header>
      <metadata>
        <oai_dc:dc ...misc schema references...>

  6. Anatomy of an OAI Record, part 2
          <dc:title>Using Structural Metadata</dc:title>
          <dc:creator>Dushay, Naomi</dc:creator>
          <dc:subject>Digital Libraries</dc:subject>
          <dc:description>blahblah</dc:description>
          <dc:identifier>http://arXiv.org/abs/cs/0112017</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
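To make the protocol concrete, here is a minimal sketch of how a harvester might fetch and pick apart such records. It is illustrative only: the endpoint URL is a placeholder, the namespace URIs are the standard OAI-PMH and Dublin Core ones, and a production harvester would also page through resumption tokens. Python is used here and for the sketches that follow.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder endpoint; the real NSDL base URL is not shown on the slides.
    BASE_URL = "http://example.org/oai"

    NS = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    def list_records(set_spec=None):
        """Issue one OAI-PMH ListRecords request and return the raw XML."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")

    def extract(response_xml):
        """Yield (OAI identifier, dc:identifier URL) pairs from the response."""
        root = ET.fromstring(response_xml)
        for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
            oai_id = record.find("oai:header/oai:identifier", NS).text
            yield oai_id, record.find(".//dc:identifier", NS).text

    for oai_id, url in extract(list_records(set_spec="math")):
        print(oai_id, url)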

  7. How does it work? • A program called a Harvester collects the data from the NSDL by connecting to a web server in much the same way a web browser does. • The Harvester then extracts the links (URLs) from each record, filtering out the ‘bad ones’ as well as URLs for sites we are not allowed to crawl.
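The filtering step might look something like the sketch below. The rules shown, a scheme check and a hypothetical blocklist of hosts that deny crawlers, are stand-ins; the slides do not spell out the actual criteria.

    from urllib.parse import urlparse

    # Hypothetical blocklist; the real service would consult robots.txt and policy.
    DISALLOWED_HOSTS = {"private.example.edu"}

    def filter_urls(urls):
        """Keep only well-formed http/https URLs on hosts we may crawl."""
        kept = []
        for url in urls:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https"):
                continue                 # drop mailto:, ftp:, malformed links
            if parts.hostname in DISALLOWED_HOSTS:
                continue                 # drop sites we may not crawl
            kept.append(url)
        return kept

    print(filter_urls(["http://arXiv.org/abs/cs/0112017",
                       "mailto:someone@example.org",
                       "http://private.example.edu/data"]))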

  8. The Crawlmaster • The Harvester passes the extracted URLs to another program called the Crawlmaster. • The Crawlmaster’s job is to oversee the execution of n web crawlers at a time. • The Crawlmaster ensures a constant flow of crawling activity; as each job completes, it issues a new one.
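That ‘constant flow’ is a classic worker-pool pattern: keep n jobs in flight and submit a new one whenever a crawler finishes. A minimal sketch, with a stub standing in for the real crawler process:

    import itertools
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def crawl(url):
        """Stub standing in for a real crawler process."""
        return url, "ok"

    def crawlmaster(urls, n=4):
        """Keep up to n crawl jobs in flight; issue a new one as each completes."""
        urls = iter(urls)
        with ThreadPoolExecutor(max_workers=n) as pool:
            pending = {pool.submit(crawl, u) for u in itertools.islice(urls, n)}
            while pending:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    print(future.result())          # record the finished crawl
                # Refill the pool with one new job per completed job.
                pending |= {pool.submit(crawl, u)
                            for u in itertools.islice(urls, len(done))}

    crawlmaster(["http://a.example", "http://b.example", "http://c.example"], n=2)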

  9. The Crawlers • Crawlers are processes responsible for collecting and transforming the data made available through an initial URL. • Each crawler captures the data referenced by its initial URL; if that data is HTML, it also collects the material referenced by that HTML, as well as subsequent HTML pages. • The crawler uses a set of heuristics to gauge how much to collect and how much to leave alone. • Once the crawling is complete, each HTML page’s links are rewritten to point to our copied material rather than the original.
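A depth cap and a same-host rule are the simplest heuristics of this kind; the sketch below uses both. The slide does not describe the crawler's actual heuristics, so treat these as placeholders.

    import re
    import urllib.request
    from urllib.parse import urljoin, urlparse

    HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

    def crawl(url, depth=2, seen=None):
        """Fetch url; if it is HTML, follow its links up to `depth` more levels,
        staying on the starting host (one illustrative heuristic)."""
        seen = set() if seen is None else seen
        if depth < 0 or url in seen:
            return seen
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    return seen                      # non-HTML: capture and stop
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            return seen
        host = urlparse(url).hostname
        for link in HREF_RE.findall(html):
            target = urljoin(url, link)
            if urlparse(target).hostname == host:    # same-host rule
                crawl(target, depth - 1, seen)
        return seen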

  10. The Web Page, Recoded
    Inside a collected web page, this:
    <A HREF="http://www.priweb.org/ed/earthtrips/Edisto/edfossil.html">Look at additional fossils from Edisto Beach</A>
    becomes this:
    <A HREF="http://srb.npaci.edu/cgi-bin/nsdl.cgi?uid=/2004-01-07T01:59:16Z/24B150FF509302235AEAAF4894557D1D/edfossil.html">Look at additional fossils from Edisto Beach</A>
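Mechanically, that rewriting is a substitution over each captured page, driven by a table that maps original URLs to archive identifiers (the uid shown on the slide: a timestamp, a hash, and the file name). A sketch, with a hypothetical one-entry table:

    import re

    PORTAL = "http://srb.npaci.edu/cgi-bin/nsdl.cgi?uid="

    # Hypothetical mapping from original URL to its archive identifier.
    ARCHIVE_UID = {
        "http://www.priweb.org/ed/earthtrips/Edisto/edfossil.html":
            "/2004-01-07T01:59:16Z/24B150FF509302235AEAAF4894557D1D/edfossil.html",
    }

    def rewrite_links(html):
        """Point each archived HREF at the portal copy instead of the original."""
        def repl(match):
            uid = ARCHIVE_UID.get(match.group(2))
            if uid is None:
                return match.group(0)               # leave unknown links alone
            return match.group(1) + PORTAL + uid + match.group(3)
        return re.sub(r'(href=")([^"]+)(")', repl, html, flags=re.IGNORECASE)

    page = '<A HREF="http://www.priweb.org/ed/earthtrips/Edisto/edfossil.html">fossils</A>'
    print(rewrite_links(page))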

  11. The Storage Resource Broker • Once a crawl is complete, the material is stored in SDSC’s Storage Resource Broker. • The Storage Resource Broker allows us to store, manage, and easily retrieve very large collections of files. • Each collection in the SRB is associated with the original URL and OAI record ID issued by the NSDL and obtained by the Harvester.
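How a finished crawl gets loaded into the SRB is not detailed on the slide. As one possible shape, the sketch below shells out to the SRB Scommands client; the command names, flags, and collection layout here are assumptions rather than the documented PAS tooling, and the URL/record-ID association is kept in a simple side table.

    import subprocess

    def store_crawl(local_dir, oai_id, url):
        """Load a finished crawl into an SRB collection named after its OAI record ID."""
        # Hypothetical collection layout; assumes an SRB session (Sinit) exists.
        collection = "/nsdl/archive/" + oai_id.replace(":", "_")
        subprocess.run(["Smkdir", collection], check=True)
        subprocess.run(["Sput", "-r", local_dir, collection], check=True)
        # Keep the (record ID, original URL, collection) association somewhere
        # queryable; a flat side table is the simplest stand-in.
        with open("collection_index.tsv", "a") as index:
            index.write("\t".join([oai_id, url, collection]) + "\n")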

  12. Accessing the Archive Service • The archived collections are served up through an HTTP portal to the SRB. • The HTTP portal is implemented as a web CGI interface which takes as a parameter the OAI record ID issued by the NSDL. • The portal then returns either the initial page of the latest archive directly, or XML data referencing that record’s entire archive history.
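In outline, the portal is a small CGI program: read the record ID from the query string, then either redirect to the newest snapshot or emit the archive history as XML. A hypothetical sketch; the parameter names, lookup table, and XML shape are placeholders, not the actual portal's interface:

    #!/usr/bin/env python
    import os
    from urllib.parse import parse_qs

    # Hypothetical index from OAI record ID to archived snapshots, oldest first.
    SNAPSHOTS = {
        "oai:arXiv.org:cs/0112017": [
            "/2003-12-01T00:00:00Z/0000000000000000/index.html",
            "/2004-01-07T01:59:16Z/24B150FF509302235AEAAF4894557D1D/index.html",
        ],
    }

    def respond(query_string):
        params = parse_qs(query_string)
        record_id = params.get("id", [""])[0]
        history = SNAPSHOTS.get(record_id, [])
        if params.get("history"):
            # XML describing every archived snapshot of the record.
            items = "".join("<snapshot>%s</snapshot>" % s for s in history)
            return "Content-Type: text/xml\n\n<archive>%s</archive>" % items
        if history:
            # Redirect to the initial page of the latest snapshot.
            return "Location: /cgi-bin/nsdl.cgi?uid=%s\n" % history[-1]
        return "Status: 404 Not Found\n\nNo archive for %s" % record_id

    print(respond(os.environ.get("QUERY_STRING", "")))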

  13. Some Statistics • As of July 15th, 2004, the NSDL Persistent Archive contained: • 3,008 Gigabytes of data • In 21,420,181 files • Representing ~91,000 distinct URLs. • Spanning eight months, from December 2003 to July 2004. • That span represents the archive’s complete history to date.
