Persistent Archive for the NSDL

Presentation Transcript


  1. Persistent Archive for the NSDL
     Reagan W. Moore, Charlie Cowart
     University of California, San Diego
     San Diego Supercomputer Center
     (moore, charliec)@sdsc.edu
     http://www.npaci.edu/DICE/

  2. Persistent Archive Team
     • Reagan Moore
     • Sheau Yen Chen
     • Charles Cowart
     • George Kremenek
     • Erdem Kulrul
     • Richard Marciano
     • Arcot Rajasekar
     • Michael Wan

  3. Status
     • Architecture design
     • Choice of web crawler
     • Demonstration
     • Proof of concept

  4. Architecture
     • Built on existing tools
     • Retrieve metadata
       • OAI metadata harvester
     • Retrieve digital entities
       • Web crawler
     • Organize and archive digital entities
       • Data grid
     • Provide access
       • OAI and HTTP interfaces

  5. OAI Interfaces
     • OAI service provider interface
       • Used Tom Kalt’s (U Mass) OAI harvester classes
       • Initiate connection
       • Retrieve metadata as XML
       • Parse XML into objects
     • OAI data provider interface
       • Custom CGI interface to SRB/MCAT written in C
       • Parses OAI2 requests and generates SRB client calls
       • Transforms from SRB objects to XML
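The harvesting steps on this slide (retrieve metadata as XML, parse XML into objects) can be sketched roughly as follows. This is a minimal Python illustration, not the harvester classes or the C CGI interface the slides describe. The `Record` class, the `parse_list_records` function, and the sample response are hypothetical; only the OAI-PMH 2.0 and Dublin Core namespace URIs are standard. The network request is left out so the sketch runs offline.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Standard OAI-PMH 2.0 and Dublin Core namespace URIs.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


@dataclass
class Record:
    """Hypothetical in-memory object for one harvested record."""
    identifier: str
    datestamp: str
    dc: dict = field(default_factory=dict)  # DC element name -> list of values


def parse_list_records(xml_text):
    """Parse an OAI-PMH ListRecords response into Record objects."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        record = Record(
            identifier=header.findtext("oai:identifier", default="", namespaces=NS),
            datestamp=header.findtext("oai:datestamp", default="", namespaces=NS),
        )
        # Collect every Dublin Core element under the oai_dc container.
        for elem in rec.iterfind(".//oai_dc:dc/*", NS):
            name = elem.tag.split("}", 1)[-1]  # drop the "{namespace}" part
            record.dc.setdefault(name, []).append((elem.text or "").strip())
        records.append(record)
    return records


# Hypothetical response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2002-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:identifier>http://example.org/item-1.html</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

records = parse_list_records(SAMPLE_RESPONSE)
```

The `dc:identifier` values collected this way are one plausible source for the list of URLs handed to the crawler.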

  6. Web Crawler
     • HTML crawler choice
       • Wget (GNU)
       • WebBase (Stanford)
       • HTML/XML translator (SDSC)
     • Capabilities
       • Parallelized for performance
       • Recursively crawl web site
       • Build link graph structure
       • Translation of links to logical name space
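To make "recursively crawl web site" and "build link graph structure" concrete, here is a small Python sketch using only the standard library. It is not the behavior of Wget, WebBase, or the SDSC translator. The `fetch` callback, the `build_link_graph` function, and the sample pages are hypothetical, and the crawl is sequential (a work queue instead of literal recursion, with no parallelism).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href and src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def build_link_graph(start_url, fetch):
    """Crawl one site from start_url; return {page URL: [absolute link targets]}.

    `fetch` is a caller-supplied function (hypothetical) that returns the
    HTML text for a URL, or None if the URL cannot be retrieved.
    """
    site = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            continue  # already visited
        html = fetch(url)
        if html is None:
            graph[url] = []
            continue
        parser = LinkExtractor()
        parser.feed(html)
        targets = []
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url  # resolve, drop #fragment
            if absolute not in targets:
                targets.append(absolute)
            if urlparse(absolute).netloc == site:
                queue.append(absolute)  # follow only links within the same site
        graph[url] = targets
    return graph


# Hypothetical two-page site held in memory so the sketch runs offline.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="http://other.example.com/x">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <img src="img/p.gif">',
}
graph = build_link_graph("http://example.org/", PAGES.get)
```

External targets are recorded as edges in the graph but are never fetched, so the crawl stays within the starting site.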

  7. Data Grid
     • Organize retrieved digital entities
       • Snapshot based (time)
       • Support for compound documents
       • Conversion of all internal URL links to SRB URL links, and associated SRB logical name space for digital entities
     • Manage storage of digital entities
       • Store on disk / archive at SDSC, could be replicated to any other site
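The link conversion described on this slide can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the collection root `/archive/nsdl`, the `srb://srb.example.org` link prefix, and the snapshot-date directory layout are placeholders and do not reflect SRB's actual URL syntax or the team's naming scheme. Query strings are ignored for brevity.

```python
import re
from urllib.parse import urljoin, urlparse

ARCHIVE_ROOT = "/archive/nsdl"           # hypothetical collection root
LINK_PREFIX = "srb://srb.example.org"    # hypothetical stand-in for an SRB URL prefix


def logical_name(url, snapshot):
    """Map an original URL to a path in a snapshot-organized logical name space."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # give directory-style URLs a file name
    return f"{ARCHIVE_ROOT}/{snapshot}/{parts.netloc}{path}"


# Matches double-quoted href/src attributes; enough for a sketch, not a full HTML parser.
LINK_ATTR = re.compile(r'(href|src)="([^"]*)"', re.IGNORECASE)


def rewrite_links(html, page_url, snapshot, archived):
    """Point links to archived entities at the archive; leave other links unchanged."""

    def replace(match):
        attr, target = match.group(1), match.group(2)
        absolute = urljoin(page_url, target)
        if absolute in archived:
            return f'{attr}="{LINK_PREFIX}{logical_name(absolute, snapshot)}"'
        return match.group(0)  # external or unarchived link: preserve as-is

    return LINK_ATTR.sub(replace, html)


archived = {"http://example.org/a.html"}
page = '<a href="a.html">A</a> <a href="http://other.example.com/x">X</a>'
rewritten = rewrite_links(page, "http://example.org/", "2002-06-01", archived)
```

Keying the path on the snapshot date means a later crawl of the same URL lands in a different collection, which is one simple way to keep time-based snapshots side by side.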

  8. Implementation
     • URL list generation from “harvesting of NSDL repository”
     • Crawl and retrieve digital entities into a “buffer area”
     • Archive into snapshot organized collections
     • Flags / time stamps for changed data for OAI based retrieval
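One way the "flags / time stamps for changed data" step could work is sketched below in Python. The slides do not say how changes are detected; comparing content digests between snapshots is an assumption, and the `Entry` class and both functions are hypothetical. The date filter mirrors the inclusive `from` / `until` selection used in OAI-PMH selective harvesting.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical catalog row for one archived digital entity."""
    url: str
    digest: str
    changed: str  # ISO 8601 date (YYYY-MM-DD) of the last detected change


def update_catalog(catalog, url, content, snapshot_date):
    """Record content for url; return True if the entity is new or has changed."""
    digest = hashlib.sha256(content).hexdigest()
    entry = catalog.get(url)
    if entry is not None and entry.digest == digest:
        return False  # unchanged: keep the earlier change date
    catalog[url] = Entry(url, digest, snapshot_date)
    return True


def changed_between(catalog, from_date=None, until_date=None):
    """Return URLs whose change date lies in [from_date, until_date], inclusive."""
    selected = []
    for entry in catalog.values():
        # ISO dates in the same format compare correctly as strings.
        if from_date is not None and entry.changed < from_date:
            continue
        if until_date is not None and entry.changed > until_date:
            continue
        selected.append(entry.url)
    return sorted(selected)


catalog = {}
first_a = update_catalog(catalog, "http://example.org/a", b"v1", "2002-06-01")
first_b = update_catalog(catalog, "http://example.org/b", b"v1", "2002-06-01")
second_a = update_catalog(catalog, "http://example.org/a", b"v1", "2002-07-01")
second_b = update_catalog(catalog, "http://example.org/b", b"v2", "2002-07-01")
recent = changed_between(catalog, from_date="2002-06-15")
```

With a catalog like this, an OAI data provider only needs to return the entries selected by `changed_between` for an incremental harvest.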

  9. Demonstration
     • Register digital entity by original URL
     • Store DC metadata
     • Crawl based on text file of desired URLs
     • Tested on LoC American Memory collection
     • Currently crawl two levels
     • Manages CGI redirection
     • Organize compound documents
     • Add SRB links for redirection
     • Preserve external web links
     • Display results using INQ interface to SRB
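The "crawl based on text file of desired URLs" and "currently crawl two levels" items suggest a depth-limited crawl, which the following Python sketch illustrates. The file format (one URL per line, `#` for comments), the reading of "two levels" as two links away from a seed, and the `links_of` callback are all assumptions, and the sample link table is invented.

```python
from collections import deque


def read_url_list(text):
    """Parse a URL list: one URL per line, skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def crawl_levels(seeds, links_of, max_depth=2):
    """Breadth-first crawl that stops max_depth links away from any seed.

    `links_of` is a caller-supplied function (hypothetical) returning the
    URLs a page links to. Returns {url: depth at which it was first reached}.
    """
    depth = {}
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, level = queue.popleft()
        if url in depth:
            continue  # breadth-first order means the first visit is the shallowest
        depth[url] = level
        if level < max_depth:
            for target in links_of(url):
                if target not in depth:
                    queue.append((target, level + 1))
    return depth


# Hypothetical link table standing in for real page retrieval.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
    "http://example.org/b": ["http://example.org/a"],
    "http://example.org/c": ["http://example.org/d"],
}
seeds = read_url_list("# seed URLs\nhttp://example.org/\n\n")
depths = crawl_levels(seeds, lambda url: LINKS.get(url, []))
```

Here `http://example.org/d` is never visited because it sits three links from the seed.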

  10. [Diagram: SDSC Storage Resource Broker & Meta-data Catalog architecture; layout lost in extraction. Labels recoverable from the figure:]
     • Application
     • Access APIs: C, C++, Libraries; Unix Shell; Linux I/O; DLL / Python; Java, NT Browsers; OAI; GridFTP
     • Common APIs
     • Consistency Management / Authorization-Authentication
     • Prime Server
     • Logical Name Space; Latency Management; Data Transport; Metadata Transport
     • Catalog Abstraction; Storage Abstraction
     • Databases: DB2, Oracle, SQLServer
     • Archives: HPSS, ADSM, UniTree, DMF
     • File Systems: Unix, NT, Mac OSX
     • Databases: DB2, Oracle, Sybase
     • Servers: HRM

  11. General Information
     • http://www.npaci.edu/DICE
