
Smart Objects and Dumb (but Open!) Archives

Michael L. Nelson, NASA Langley Research Center & University of North Carolina, mln@ils.unc.edu, http://www.ils.unc.edu/~mln/. Cornell University CS 502 – Computing Methods for DLs, Guest Lecture, April 20, 2001.



  1. Smart Objects and Dumb (but Open!) Archives Michael L. Nelson NASA Langley Research Center & University of North Carolina mln@ils.unc.edu http://www.ils.unc.edu/~mln/ Cornell University CS 502 – Computing Methods for DLs Guest Lecture April 20, 2001

  2. Outline • History / problem statement / motivation • Buckets: smart objects • Bucket implementation • Smart objects, dumb archives (SODA) • Open Archives Initiative (OAI) • Bucket Communication Space (BCS) • Future work • Conclusions

  3. NASA Scientific and Technical Information • Formal publications cover a decreasing percentage of NASA’s STI output • most DLs focus only on formal publications • Informal STI is maintained only by a network of collegial distribution • aging and shrinking workforce weakens this network • Customers want much more than formal publication • rather than stretch the meaning of “report” or “document”, define a new object for DL transactions

  4. NASA LaRC Publications 1991-1999

  5. STI Observations • Media formats are instantiations of a more general class of information • Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors • “Separate but equal” DLs considered harmful • customer should not have to re-integrate what should never have been de-integrated... • institutional knowledge being lost because we don’t have a publishing vector established

  6. Information Lost Over Time

  7. Pyramid of Scientific and Technical Information (STI) Information is created in a variety of formats. Formal publications, the focus of most DL projects, are supported by a pyramid of informal information.

  8. The Tyranny of the Archive (Content is King) • The information content is more important than the systems used for its storage, management and retrieval • Objects should not be “locked” in specific DLs or archives

  9. Buckets • Aggregation + intelligence = buckets • metadata + data + methods = buckets • Object-oriented, intelligent agent archival entities • A collection of all information about a project: • manuscripts • software • data • images • video • etc. • Customizable, heterogeneous • buckets can “learn”, “talk”, and “coordinate” • buckets control terms and conditions, display, etc. -- not the archive that holds them

  10. Design Goals • Aggregation • DLs should be shielded from the transient nature of file formats • Prevent information hemorrhaging by archiving all data types • Intelligence • Aggregation (above) implies code, why stop at passive objects? Make objects smart... • Bucket-bucket & bucket-tool intelligence

  11. Design Goals • Self-Sufficiency • Maximum autonomy & survivability: fully self-sufficient buckets • Option to internally store all needed materials • Mobility • Why should an information object be stuck in one place? • Mobility for replication, workflow, data collection

  12. Design Goals • Heterogeneity • One size does not fit all... • Different buckets for different applications, sites, disciplines, etc. • Archive Independence • Focus is on information, not yet another DL “system” • does not require an archive to function • “Work with everything; break nothing”

  13. Bucket Architecture A Typical NASA DL Bucket -- Other Bucket Types Possible!

  14. A Sample Bucket 4 packages: - report (4 elements) - appendix (2 elements) - contact information (2 elements) - translation (1 element)

  15. Another Sample Bucket 2 packages: - pre-print (2 elements) - pointer to SFX reference linking service for published and pre-print versions (2 elements) • this bucket display is for the Universal Preprint Service https://ups.cs.odu.edu/

  16. Heterogeneous Buckets • Buckets are envisioned to be locally modifiable and extensible • There is a default set of public methods defined for buckets • additional methods can be locally defined • Buckets can “learn” new methods • new “default” methods, or locally defined extensions • override default methods
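
The idea of default methods that can be extended or overridden locally can be pictured as a method table that the bucket consults on every message. The following Python sketch is an illustration only, not the Perl implementation: `display` and `list_methods` are method names from the sample bucket messages, while the `Bucket` class and the `learn` and `invoke` names are hypothetical.

```python
class Bucket:
    """Toy bucket with a replaceable method table (illustrative only)."""

    def __init__(self):
        # Default public methods; the names come from the sample bucket messages.
        self._methods = {
            "display": lambda **args: "default display",
            "list_methods": lambda **args: sorted(self._methods),
        }

    def learn(self, name, func):
        """Add a new method or override a default one (hypothetical name)."""
        self._methods[name] = func

    def invoke(self, name, **args):
        """Dispatch a message such as ?method=display to its handler."""
        if name not in self._methods:
            raise ValueError(f"bucket does not implement {name!r}")
        return self._methods[name](**args)


bucket = Bucket()
print(bucket.invoke("display"))
# A site overrides the default display and teaches the bucket a new method.
bucket.learn("display", lambda **args: "site-specific display")
bucket.learn("checksum", lambda **args: "not implemented yet")
print(bucket.invoke("display"))
print(bucket.invoke("list_methods"))
```

Keeping the table inside the object, rather than in the archive, is what lets two buckets in the same archive answer the same message differently.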

  17. Bucket Messages • Sample bucket messages: • http://home.larc.nasa.gov/~mln/bucket/ and http://home.larc.nasa.gov/~mln/bucket/?method=display: invokes the default display method • http://home.larc.nasa.gov/~mln/bucket/?method=metadata: returns the metadata for the bucket • http://home.larc.nasa.gov/~mln/bucket/?method=display&pkg_name=report&element_name=tr1253.pdf: displays a single element • http://home.larc.nasa.gov/~mln/bucket/?method=list_methods: lists all the methods that this bucket implements
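
Bucket messages are ordinary HTTP GET requests whose query string names a method and its arguments. A minimal Python sketch of composing them is shown here; the helper name `bucket_message` is hypothetical, and only the `method`, `pkg_name`, and `element_name` parameters come from the slide.

```python
from urllib.parse import urlencode


def bucket_message(bucket_url, method=None, **args):
    """Compose a bucket message URL (hypothetical helper, not bucket code)."""
    params = {}
    if method is not None:
        params["method"] = method
    params.update(args)
    if not params:
        # No method given: the bare bucket URL invokes the default display.
        return bucket_url
    return bucket_url + "?" + urlencode(params)


BUCKET = "http://home.larc.nasa.gov/~mln/bucket/"
print(bucket_message(BUCKET))
print(bucket_message(BUCKET, "metadata"))
print(bucket_message(BUCKET, "display",
                     pkg_name="report", element_name="tr1253.pdf"))
```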

  18. Bucket Methods (BUCKET DEMO) • most methods take various arguments; see Appendix B in dissertation http://home.larc.nasa.gov/~mln/phd/ • supersedes Table 1 in NASA TM 1998 208419

  19. Bucket Metadata • Due to Dienst heritage, uses RFC-1807 format • this is likely to change in the future • Metadata defines the content and appearance of the bucket • bibliographic and control information • But can store any format of metadata • bucket does not need to “understand” all formats • special purpose, legacy or obscure formats • COSATI, MARC • http://foo.edu/bucket-27/?method=metadata&format=cosati

  20. Current Implementation • File system semantics: • 1 bucket = 1 directory • 1 package = 1 directory in bucket • 1 element = 1 file in package directory • index.cgi is the bucket “lid” • http dependency for access • index.cgi written in Perl 5.0 • Methods should not change when the implementation changes • still use http as transport protocol • Oracle, Lotus Notes implementations being developed • Java, PHP, Tcl, etc. implementations possible too
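
The file system semantics (1 bucket = 1 directory, 1 package = 1 subdirectory, 1 element = 1 file) mean a bucket's contents can be enumerated with plain directory walks. The Python sketch below illustrates that mapping; it is not the Perl `index.cgi`, and the demo element name `metadata.bib` is a made-up placeholder.

```python
import tempfile
from pathlib import Path


def list_bucket(bucket_dir):
    """Map each package (subdirectory) to its sorted element file names."""
    bucket = Path(bucket_dir)
    return {
        pkg.name: sorted(f.name for f in pkg.iterdir() if f.is_file())
        for pkg in sorted(bucket.iterdir())
        if pkg.is_dir()  # index.cgi and other top-level files are not packages
    }


def make_demo_bucket(root):
    """Build a tiny throwaway bucket on disk for the demonstration."""
    bucket = Path(root) / "bucket"
    (bucket / "report.pkg").mkdir(parents=True)
    (bucket / "report.pkg" / "tr1253.pdf").write_bytes(b"")
    (bucket / "_md.pkg").mkdir()
    (bucket / "_md.pkg" / "metadata.bib").write_text("")  # hypothetical name
    (bucket / "index.cgi").write_text("#!/usr/bin/perl\n")  # the bucket "lid"
    return bucket


with tempfile.TemporaryDirectory() as tmp:
    contents = list_bucket(make_demo_bucket(tmp))
    print(contents)
```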

  21. Bucket Structure • Bucket: index.cgi • Default bucket packages: _method.pkg (source files for methods), _http.pkg (http dependency files), _tc.pkg (terms and conditions), _log.pkg (logs), _md.pkg (metadata), _state.pkg (bucket state) • Sample bucket payload: report.pkg, appendix.pkg, software.pkg, testdata.pkg

  22. Systems Tested

  23. SODA: Smart Objects, Dumb Archives • Objects are more important than the archive that holds them • The object should be the authority on its contents, not an archive • We envision a general shift of intelligence from archives to the objects themselves • DL protocols should find, index, and search -- not know about file formats, policy, terms and conditions, etc.

  24. Presentation Responsibility Shifts From Dienst to Buckets

  25. SODA • Current DLs have tight integration between the data object, the archive it is in, and the interface used to access it • 1-1 model between DL and archive • By decoupling these functions, we can separate their development and maintenance • N-M model between DLs and archives

  26. SODA • Students and Educators, Library Users, Researchers, Corporate Developers, ... • DLSs Building From Archives and Buckets: NASA DLS, Avionics DLS, NCSTRL, ... • Archives Managing Buckets: NASA Archive, ACM Archive, CoRR, ... • All Known Buckets (in archives and out)

  27. “Dumb Archive” • Archives should be little more than set managers • Several possible archive candidates • LDAP, Dienst, Guildford Protocol, others • Our implementation: a “modified” bucket, DA: • it has all of the regular bucket methods, plus: • da_list - list all buckets in the archive • da_put - put a bucket in an archive • da_delete - delete a bucket from an archive • da_info - archive-level metadata • da_get - redirect to this bucket all operations modulo appropriate T&C
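
A "dumb archive" in this sense is little more than a set of bucket references plus some archive-level metadata. The sketch below mirrors the five `da_*` methods in Python under simplifying assumptions: holdings live in an in-memory dictionary, `da_get` returns the redirect target instead of issuing an HTTP redirect, and terms-and-conditions checks are omitted. The class name and argument names are hypothetical.

```python
class DumbArchive:
    """Minimal set manager for buckets (illustrative, in-memory only)."""

    def __init__(self, info):
        self._info = dict(info)   # archive-level metadata
        self._holdings = {}       # bucket id -> bucket URL

    def da_put(self, bucket_id, url):
        """Put a bucket in the archive."""
        self._holdings[bucket_id] = url

    def da_delete(self, bucket_id):
        """Delete a bucket from the archive (no error if absent)."""
        self._holdings.pop(bucket_id, None)

    def da_list(self):
        """List all buckets in the archive."""
        return sorted(self._holdings)

    def da_info(self):
        """Return archive-level metadata."""
        return dict(self._info)

    def da_get(self, bucket_id, method="display"):
        """Return the bucket URL a client would be redirected to."""
        return f"{self._holdings[bucket_id]}?method={method}"


archive = DumbArchive({"name": "demo archive"})
archive.da_put("bucket-27", "http://foo.edu/bucket-27/")
print(archive.da_list())
print(archive.da_get("bucket-27", "metadata"))
```

The archive never inspects the bucket's contents; every request beyond set management is handed to the bucket itself.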

  28. DA Structure • Bucket: index.cgi • Default bucket packages: _method.pkg (source files for methods), _http.pkg (http dependency files), _tc.pkg (terms and conditions), _log.pkg (logs), _md.pkg (metadata), _state.pkg (bucket state) • No bucket payload • holdings.pkg package for DA (DA data structures) • does not use packages/elements • scalability concerns • uses GDBM/NDBM files (hashes) • 1 hash per argument to da_put

  29. OAI as a “Dumb Archive” • Originally used a separate protocol & implementation for the “dumb archive” • Now using the metadata harvesting protocol defined by the Open Archives Initiative (OAI) • OAI evolved from the Universal Preprint Service (UPS) • http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html • http://ups.cs.odu.edu/ • http://www.openarchives.org/ • OAI does not require smart objects, but does create a “dumb archive” layer

  30. OAI Bucket Structure • Bucket: index.cgi • Default bucket packages: _method.pkg (source files for methods), _http.pkg (http dependency files), _tc.pkg (terms and conditions), _log.pkg (logs), _md.pkg (metadata), _state.pkg (bucket state) • oai: the oai.pl element is a support library that defines access for the specific DL • bucket payload is the DL-specific support library • in addition to the ~30 bucket methods, each OAI verb is implemented as a separate method

  31. Intelligence • Shift of responsibility into the data objects opens up an entire new class of applications: data objects as intelligent agents • Premise: instead of having the data objects do nothing while they patiently wait to be accessed, have them do something useful while waiting ...

  32. Bucket Communication Space • Provides a well-known, shared memory model for buckets to communicate • communications model: Linda (JavaSpaces) • Applications: • Bucket matching • the same author (separated by publisher, time) • different authors (finding similar works) • Metadata scrubbing • Format translation (metadata, images, documents) • Bucket messaging • including broadcast & multicast
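
In a Linda-style model, participants communicate by writing tuples into a shared space and by reading or removing tuples that match a template. The Python sketch below shows that pattern in its simplest non-blocking form, using the JavaSpaces operation names `write`, `read`, and `take`. It is not the BCS implementation, and the tuple layout (request type, bucket id, target format) is an assumption made for the example.

```python
class TupleSpace:
    """Tiny non-blocking tuple space; None in a template is a wildcard."""

    def __init__(self):
        self._tuples = []

    def write(self, tup):
        """Place a tuple in the shared space."""
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup)
        )

    def read(self, template):
        """Return the first matching tuple without removing it, else None."""
        for tup in self._tuples:
            if self._matches(template, tup):
                return tup
        return None

    def take(self, template):
        """Remove and return the first matching tuple, else None."""
        tup = self.read(template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


space = TupleSpace()
# A bucket posts a request; any converter watching the space may claim it.
space.write(("convert_metadata", "bucket-27", "cosati"))
request = space.take(("convert_metadata", None, None))
print(request)
print(space.read(("convert_metadata", None, None)))
```

Because writers and readers only share the space, a bucket does not need to know which service, or which other bucket, will answer its request.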

  33. BCS Structure • Bucket: index.cgi • Default bucket packages: _method.pkg (source files for methods), _http.pkg (http dependency files), _tc.pkg (terms and conditions), _log.pkg (logs), _md.pkg (metadata), _state.pkg (bucket state) • No bucket payload • bcs.pkg package for BCS (BCS data structures, conversion programs) • uses GDBM/NDBM files (hashes) • for registr • included programs • mdt (metadata conversion) • Image Alchemy (image conversion)

  34. BCS Methods • bcs_list, bcs_register, bcs_unregister • set management • bcs_convert_image • wrapper for Image Alchemy program • no bucket hooks in 1.6 • bcs_convert_metadata • wrapper for “mdt” program • bucket hooks in 1.6

  35. BCS Methods (BCS DEMO) • bcs_message • “search”, “search/replace”, “search/mesg” functionality • bcs_similarity • all x all comparison • n x all comparison (n=1 .. all) • adjustable threshold for “similarity”

  36. Similarity Results from UPS • NACA - 3036 documents • UPS Math - 3831 documents • for 6867 documents, ran for 42 hours (561k comparisons / hour) • used default value of 0.85 for similarity • NACA - 159 similar documents • UPS Math - 35 similar documents • No similarity between NACA & UPS Math • Optimizations: • clustering of collection • distributed computation of similarity matrix
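
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the all x all run compared each unordered pair of documents once; the slide does not state this, but the resulting rate agrees with the quoted 561k comparisons per hour.

```python
# Collection sizes and run time as reported on the slide.
naca_docs = 3036
ups_math_docs = 3831
hours = 42

total_docs = naca_docs + ups_math_docs

# Assumption: every unordered pair of documents is compared exactly once.
pairs = total_docs * (total_docs - 1) // 2

rate_per_hour = pairs / hours

print(f"documents:   {total_docs}")
print(f"comparisons: {pairs}")
print(f"per hour:    {rate_per_hour / 1000:.0f}k")
```

The quadratic growth in the number of pairs is why the listed optimizations (clustering the collection, distributing the similarity matrix) matter for larger collections.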

  37. Future Work • Alternate implementations for buckets • Java, Oracle, Python, Tcl… • Alternate API access • CORBA, SOAP • New functionality for buckets • Standard packages / elements for: revisions, citations, checksums

  38. Future Work • Security, authentication, T&C • investigate X.509, Kerberos, MD5 • formalize ACLs • Specialized buckets • discipline- or data-specific buckets • computational buckets • software reuse, RPC-like support • Reduce the centralization of the BCS • cf. Berkeley’s xFS – serverless file system • http://now.cs.berkeley.edu/Xfs/xfs.html • Passive -> Active objects • e.g., LANL’s Active Recommendation Project • http://www.c3.lanl.gov/~rocha/lww/

  39. Impact • SODA • significant immediate interoperability benefits • frees the object from the tyranny of the archive • Bucket aggregation: evolutionary concept • benefit begins immediately, continues indefinitely • no more information hemorrhaging • Bucket intelligence: revolutionary concept • benefit is mid- to long-term • full impact unknown; a flexible framework will allow others to innovate • make archived objects active, not passive

  40. if bucket software doesn’t work out, we’ll market products with Phil’s likeness thanks to Rod Waid for Phil http://dlib.cs.odu.edu/

  41. Emergency backup slides...

  42. Why Digital Libraries? digital library = collection of information both digitized and organized -- M. Lesk, 1997 • “Why not just use the WWW?” • WWW by itself has low archival & management characteristics • “Why not use an RDBMS?” • In the same way that a card catalog is not a TL, an RDBMS is not a DL; it is a candidate technology for use in DLs • DL is the union of the content and services defined on the content

  43. Digital Libraries? • Ultimately, the product of a research institution is information • information objects (generally publications) are frequently the only tangible measure of research output (compressing an entire body of literature): • Traditional libraries (TLs) are expensive, and less and less information is being archived by fewer and fewer TLs

  44. TLs vs. DLs • DLs clearly better than TLs at: • Dissemination, storing information variety • However, TL objects are more survivable • Who will archive the research information? • the publishers? • the institutions? • the authors? • Will the average DL object still be accessible in 10 years?

  45. Cosine Correlation With Frequency Term Weighting

$$\mathrm{similarity}(d_j, d_k) = \frac{\sum_{i=1}^{n} \left( td_{ij} \times td_{ik} \right)}{\sqrt{\sum_{i=1}^{n} td_{ij}^{2} \times \sum_{i=1}^{n} td_{ik}^{2}}}$$

where $td_{ij}$ = the ith term in the vector for document j, $td_{ik}$ = the ith term in the vector for document k, and $n$ = the number of unique terms in the data set. Adapted from Harman (1992), originally from Salton & Lesk (1968)
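
A minimal Python sketch of cosine correlation with raw term-frequency weights follows. It assumes whitespace tokenization and lowercasing, which are simplifications chosen for the example rather than details of `bcs_similarity`; the 0.85 threshold is the default value quoted in the similarity results.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(doc_j, doc_k):
    """Cosine correlation between two documents using term frequencies."""
    tj, tk = term_vector(doc_j), term_vector(doc_k)
    # Only terms present in both documents contribute to the numerator.
    dot = sum(tj[t] * tk[t] for t in tj.keys() & tk.keys())
    norm = math.sqrt(
        sum(v * v for v in tj.values()) * sum(v * v for v in tk.values())
    )
    return dot / norm if norm else 0.0


THRESHOLD = 0.85  # default similarity value quoted in the slides

a = "wind tunnel tests of wing flutter"
b = "wing flutter tests in the wind tunnel"
score = cosine_similarity(a, b)
print(f"{score:.3f}", "similar" if score >= THRESHOLD else "not similar")
```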

  46. Similarity Matrix
