Stanford Digital Libraries Technologies Projects

Stanford Digital Libraries Technologies Projects Pratik Dave, Raghu Akkapeddi. For CPSC 689DL, Fall ’02, Texas A&M University

interLib Joint effort of U.C. Berkeley, U.C. Santa Barbara, and Stanford University. Testbed developed by SDSC (San Diego Supercomputer Center). Demonstrated on CDL (California Digital Library). • Berkeley - tools and technologies to support highly improved models of the "scholarly information life cycle." Our goal is to facilitate the move from the current centralized, discrete publishing model to a distributed, continuous, and self-publishing model, while still preserving the best aspects of the current model, such as peer review. • Santa Barbara - The Alexandria Digital Earth Prototype (ADEPT) aims to use the digital earth metaphor for organizing, using, and presenting information at all levels of spatial and temporal resolution. Its activities: creating geospatial information and meta-information collections; building operational services for (1) discovering heterogeneous, distributed collections, (2) organizing these resources into iscapes (information landscapes) tailored for specific applications, and (3) collaborative use and visualization of iscapes; applying and evaluating ADEPT services in undergraduate learning; and developing scalable, efficient, and secure systems.

Stanford Component: Goals Develop technologies to overcome barriers to effective DLs. • “An important part of the project's vision is that digital libraries will not just be collections of information repositories. Rather, they will include aspects of communication among patrons and between patrons and human library staff.” • “Design and implement the infrastructure and services needed for collaboratively creating, disseminating, sharing, and managing information in a DL context.” • “Main thrust of project is technology creation, evaluation, and deployment.”

People • Hector Garcia-Molina - chair of CS department, distributed objects • Terry Winograd - HCI and usability • Dan Boneh - security • Andreas Paepcke - interoperability

Barriers to effective DLs 1. Heterogeneity of information and services 2. Lack of powerful filtering mechanisms that let users find truly valuable information 3. Insufficient availability of interfaces and tools that effectively operate on portable devices 4. Lack of a solid economic infrastructure that encourages providers to make information available and gives users privacy guarantees

Retrieving Information • SDLIP • PowerBrowsing • Query Translation • Value Filtering • WebBase

Simple Digital Library Interoperability Protocol (SDLIP)

SDLIP = InfoBus architecture Our basic approach is to use distributed objects to allow integrated access to heterogeneous services across networks. We call this system the InfoBus. We use CORBA to provide communication between remote processes. In particular, we use Xerox PARC's ILU, a free implementation of a CORBA superset; MICO, a free CORBA implementation under the GNU license; and Visigenic, a commercial provider. We use Java, C++, and the interpreted, object-oriented language Python for our development work. Clients use SDLIP to request that searches be performed over information sources. The result documents are returned synchronously, or they are streamed from service to client as they become available.

SDLIP Core • Synchronous access • Client sends request + tokens: Server Set ID, Client Request ID • Parking Meter state model • Delegation

SDLIP Async • Delivery interface in client • Result Cache locally • Result Cache distributed • Delegation

PowerBrowsing • Site Search/Keyword Completion • Accordion Summarization • Text Summarization • Form Entry

Site Search/Keyword Completion As a way to address bandwidth and battery life limitations, we provide local site search facilities for all sites. We incrementally index Web sites in real time as the PDA user visits them. These indexes have narrow scope at first, and improve as the user dwells on the site, or as more users visit the site over time. We address the keyword input problem by providing site specific keyword completion, and indications of keyword selectivity within sites.
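
A minimal sketch of how site-specific keyword completion with a selectivity indication could work. The class `SiteIndex` and its methods are hypothetical names chosen for illustration, and the selectivity measure used here (the number of indexed pages containing a keyword) is an assumption, not the project's documented metric.

```python
from collections import defaultdict


class SiteIndex:
    """Hypothetical per-site index that grows as pages are visited."""

    def __init__(self):
        self.pages_by_keyword = defaultdict(set)  # keyword -> set of page URLs
        self.pages = set()

    def add_page(self, url, text):
        """Index one visited page (incremental: call once per visit)."""
        self.pages.add(url)
        for word in text.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.pages_by_keyword[word].add(url)

    def complete(self, prefix):
        """Return (keyword, matching_page_count) pairs for a typed prefix.

        Keywords found on fewer pages come first, since they narrow the
        search most (assumed selectivity measure).
        """
        prefix = prefix.lower()
        matches = [
            (kw, len(urls))
            for kw, urls in self.pages_by_keyword.items()
            if kw.startswith(prefix)
        ]
        return sorted(matches, key=lambda m: (m[1], m[0]))

    def search(self, keyword):
        """Return the indexed pages of this site that contain the keyword."""
        return sorted(self.pages_by_keyword.get(keyword.lower(), set()))
```

Calling `add_page` on every visit widens the index over time, which mirrors the narrow-at-first scope described on the slide; after only a few pen strokes, `complete` can offer whole keywords together with a hint of how many pages each one matches.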

Accordion Summarization We concentrate on end-game browsing, where the user is close to or on the target page. A Web page is first represented as a short summary. The user can then drill down to discover relevant parts of the page. If desired, keywords can be highlighted and exposed automatically.

Text Summarization Each Web page is broken into text units that can each be hidden, partially displayed, made fully visible, or summarized. Several methods accomplish summarization by different means: one extracts significant keywords from the text units; another attempts to find each text unit's most significant sentence to act as a summary for the unit. We found that the combination of keywords and single-sentence summaries provides significant improvements in access times and number of pen actions, as compared to other schemes.
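
The two summarization ideas on this slide, keywords plus a single most significant sentence, can be sketched roughly as follows. The scoring is an assumption made for illustration (raw word frequency with a small hand-picked stopword list); the slide does not say how the project actually ranks keywords or sentences, and `summarize_unit` is a hypothetical name.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "that", "it", "as", "by", "with"}


def _content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def keywords(text, k=3):
    """Return the k most frequent non-stopwords (ties broken alphabetically)."""
    counts = Counter(_content_words(text))
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in ranked[:k]]


def best_sentence(text):
    """Return the sentence whose distinct words are most frequent in the unit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(_content_words(text))

    def score(sentence):
        return sum(counts[w] for w in set(_content_words(sentence)))

    return max(sentences, key=score)


def summarize_unit(text, k=3):
    """Combine both views of one text unit, as the slide's best scheme does."""
    return {"keywords": keywords(text, k), "sentence": best_sentence(text)}
```

This simple score favors longer sentences, so a real implementation would probably normalize by sentence length; the sketch only shows how the two summaries can be derived from the same text unit.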

Form Entry • The form input widgets are not shown until the user is ready to fill them in. At that point, only one widget is shown at a time. The form is summarized on the screen by displaying just the text labels that prompt the user for each widget's information.

Query Translation • Deals with the problem of translating Boolean queries into the different native query languages supported by various search services, to make distributed search possible and shield users from the details of each query language.
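
As a rough illustration of the idea, the sketch below renders one Boolean query tree in two made-up native syntaxes. The dialect names and operator spellings in `DIALECTS` are hypothetical, and the sketch assumes every target service supports all three operators; handling a service that lacks an operator is the harder part of the real problem and is not shown.

```python
# Hypothetical target dialects: how each one spells the Boolean operators.
DIALECTS = {
    "keyword": {"and": " AND ", "or": " OR ", "not": "NOT "},
    "symbol": {"and": " & ", "or": " | ", "not": "!"},
}


def translate(query, dialect):
    """Render a query tree as a string in the given dialect.

    A query is either a term (str) or a tuple: ("and", q1, q2, ...),
    ("or", q1, q2, ...), or ("not", q).
    """
    ops = DIALECTS[dialect]
    if isinstance(query, str):
        # Quote multi-word terms so they stay together as a phrase.
        return f'"{query}"' if " " in query else query
    op, *args = query
    parts = [
        translate(arg, dialect) if isinstance(arg, str)
        else f"({translate(arg, dialect)})"
        for arg in args
    ]
    if op == "not":
        return ops["not"] + parts[0]
    return ops[op].join(parts)


query = ("and", "digital library", ("or", "corba", "sdlip"), ("not", "java"))
print(translate(query, "keyword"))
print(translate(query, "symbol"))
```

The user writes the query once, in the tree form, and each search service receives it in its own syntax.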

Value Filtering • The project is developing searching and filtering techniques that rely, in addition to textual similarity, on other information value metrics. These metrics may be opinion based, for example, did other colleagues we trust find a document useful, or has this document been reviewed by some editorial board? The metrics may also be access-pattern based, e.g., has this video been retrieved by many users? The metrics may be context-based. For example, is the information coming from a trustworthy source, do we know the author, or are the Web pages that point to this document related to our search? • Along similar lines, the Stanford Value Filtering project plans a service that allows users to annotate Web pages, without needing to physically modify those pages. The annotations might be reminders users leave for themselves, or they might be directed at colleagues who are known to be scanning the same information space. The annotations themselves can be useful value information, as are the collected access paths.

WebBase • The Stanford WebBase project is investigating various issues in crawling, storage, indexing, and querying of large collections of Web pages. The project builds on the previous Google activity that was part of the DLI1 initiative. The DLI2 WebBase project aims to build the necessary infrastructure to facilitate the development and testing of new algorithms for clustering, searching, mining, and classification of Web content.

Interpreting Information • WebClustering

WebClustering • Clustering refers to the grouping of pages into categories, in a fashion similar to Yahoo or the Open Directory. • We are currently investigating techniques to efficiently cluster the entire web. Traditional IR approaches are not appropriate in the context of the web, due to both the enormous size and hyperlinked nature of the web. We plan to use recently developed techniques that allow for similarity searches in high dimensional spaces (for instance, http://theory.stanford.edu/~indyk/vldb99.ps) to allow for offline clustering of the web. Even with the newer techniques, the resource requirements will be large, especially as precision requirements are raised. Supercomputing resources will be a valuable asset in performing clustering and other mining operations on the contents of the web. Such resources will allow us to explore and evaluate more of the available clustering options as we develop the most effective techniques.
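
One family of techniques for approximate similarity search in high dimensional spaces is locality-sensitive hashing. The toy sketch below uses bit sampling over binary feature vectors; whether the project used this exact scheme is an assumption, and the function names, parameters, and page vectors are all invented for illustration.

```python
import random
from collections import defaultdict


def build_tables(vectors, num_tables=4, bits_per_key=6, seed=0):
    """Hash each binary vector into one bucket per table.

    Each table samples `bits_per_key` random positions; vectors that agree
    on all sampled positions land in the same bucket.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    tables = []
    for _ in range(num_tables):
        positions = rng.sample(range(dim), bits_per_key)
        buckets = defaultdict(set)
        for name, vec in vectors.items():
            buckets[tuple(vec[p] for p in positions)].add(name)
        tables.append((positions, buckets))
    return tables


def candidates(tables, query):
    """Collect every stored vector sharing a bucket with the query."""
    found = set()
    for positions, buckets in tables:
        key = tuple(query[p] for p in positions)
        found |= buckets.get(key, set())
    return found


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def nearest(tables, vectors, query):
    """Return the closest candidate by Hamming distance, or None."""
    cands = candidates(tables, query)
    if not cands:
        return None
    return min(sorted(cands), key=lambda name: hamming(vectors[name], query))
```

Only vectors that collide with the query in some table are compared exactly, which is what makes the approach attractive for collections far too large for all-pairs comparison. Raising `num_tables` improves recall at the cost of memory, in line with the slide's remark that resource requirements grow with precision requirements.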

Managing Information • Archival Repositories • InterBib

Archival Repositories The goal of this project is to design and implement a modern, scalable digital library repository (DLR). Under our architecture, a DLR is formed by a collection of independent but collaborating sites.

Signatures as Object Handles Each object in a DLR has a handle used to identify and retrieve it. Handles are internal to the DLR and are not used by end users to identify documents. Given an object, we define its handle to be a (large) signature computed exclusively from its contents, using a checksum or a Cyclic Redundancy Check (CRC). If the contents are smaller than the size of the signature, the object (at creation time) is "padded" with a random string to make its size larger than the size of a signature. • Each site can generate objects and handles without consulting other sites; sites only need to agree on the signature function, not on software versions, character sets, etc. • The handle can be reconstructed from the object itself. • Copies at different sites will have the same handle. • Different objects will have different handles.
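
A minimal sketch of content-derived handles. The slide names a checksum or CRC as the signature function; this sketch swaps in SHA-256 from Python's standard library as a stand-in, and the names `SIGNATURE_BYTES`, `pad_if_small`, and `handle_for` are hypothetical.

```python
import hashlib
import os

SIGNATURE_BYTES = 32  # size of a SHA-256 digest (assumed signature function)


def pad_if_small(contents: bytes) -> bytes:
    """At creation time, pad short objects so they exceed the signature size."""
    if len(contents) < SIGNATURE_BYTES:
        missing = SIGNATURE_BYTES + 1 - len(contents)
        contents += os.urandom(missing)  # random padding string
    return contents


def handle_for(contents: bytes) -> str:
    """Compute the handle exclusively from the object's contents."""
    return hashlib.sha256(contents).hexdigest()


# Two sites holding the same bytes derive the same handle independently.
stored = pad_if_small(b"A short note")
site_a_handle = handle_for(stored)
site_b_handle = handle_for(bytes(stored))
print(site_a_handle == site_b_handle)
```

Padding happens once, when the object is created; afterwards the padded bytes are the object, so any site can recompute the handle from its copy without coordination.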

No Deletions Because of our handle scheme, objects cannot be updated in place. That is, if the contents of an object are modified, it automatically becomes a new object, with a different handle. Another fundamental rule in our architecture is that objects are never (voluntarily) deleted. Allowing deletions is dangerous when sites are managed independently; in particular, it makes it hard to distinguish between a deleted object and one that was corrupted ("morphed" into another) and needs to be restored.
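
The consequences of these two rules can be sketched with a toy single-site store. `AppendOnlyStore` is a hypothetical class, and SHA-256 stands in for the signature function as an assumption. Because nothing is ever deleted, a handle that is missing or no longer matches its contents can only mean damage that needs repair.

```python
import hashlib


class AppendOnlyStore:
    """Hypothetical single-site object store: put and get, no update, no delete."""

    def __init__(self):
        self._objects = {}  # handle -> contents

    def put(self, contents: bytes) -> str:
        handle = hashlib.sha256(contents).hexdigest()
        self._objects.setdefault(handle, contents)  # storing a copy again is a no-op
        return handle

    def get(self, handle: str) -> bytes:
        return self._objects[handle]

    def is_corrupted(self, handle: str) -> bool:
        """Contents that no longer match their handle have morphed."""
        return hashlib.sha256(self._objects[handle]).hexdigest() != handle

    def needs_restore(self, expected_handles):
        """Handles a peer expects this site to hold that are missing or corrupted."""
        return [
            h for h in expected_handles
            if h not in self._objects or self.is_corrupted(h)
        ]


store = AppendOnlyStore()
v1 = store.put(b"report, first version")
v2 = store.put(b"report, second version")  # an edit yields a new object
print(v1 != v2)
```

If deletions were allowed, `needs_restore` could not tell whether an absent handle was removed on purpose or lost, which is the ambiguity the slide warns about.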

Layered Architecture Since each DLR site may be implemented differently, it is important to have site interfaces that are well defined and as simple as possible. • Object Store Layer • Identity Layer - provides access to objects via handles - provides basic facilities for reporting changes to its objects • Complex Objects Layer - manages collections of related objects • Reliability Layer - coordinates replication of objects to multiple stores for long-term archiving • Upper Layers - protect IP - enforce security - charge customers under various revenue models

Layered Architectures Diagram

Awareness Everywhere Awareness services (standing orders, subscriptions, alerts) are important in digital libraries. They are also important for our reliability and indexing layers: if one site is backing up another, it must be aware of new objects or corrupted objects to take appropriate action. In our architecture, awareness services are an integral part of every layer.

Disposable Auxiliary Structures Layers typically maintain auxiliary structures for improving performance. In our architecture these structures are designed to be disposable, so they can be reconstructed from the underlying digital objects.
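
A small sketch of what "disposable" can mean in practice: the index below holds no information that cannot be recomputed from the stored objects, so it can be thrown away and rebuilt at any time. `KeywordIndex` and the sample handles are hypothetical.

```python
class KeywordIndex:
    """Hypothetical auxiliary structure: keyword -> set of object handles."""

    def __init__(self):
        self.postings = {}

    @classmethod
    def rebuild(cls, objects):
        """Reconstruct the whole index from the underlying objects alone."""
        index = cls()
        for handle, contents in objects.items():
            text = contents.decode("utf-8", errors="ignore").lower()
            for word in text.split():
                index.postings.setdefault(word, set()).add(handle)
        return index

    def lookup(self, word):
        return self.postings.get(word.lower(), set())


# The objects are the only durable state (handle -> contents).
objects = {"h1": b"digital library", "h2": b"Digital wallet"}
index = KeywordIndex.rebuild(objects)
index = None                           # lose or discard the index...
index = KeywordIndex.rebuild(objects)  # ...and recover it from the objects
print(sorted(index.lookup("digital")))
```

Keeping the index derivable means only the digital objects themselves need long-term archival guarantees.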

InterBib Three facilities: • conversion of bibliographies among different formats • processing of documents to include bibliographies • collaborative accumulation of bibliographies that can be searched

Converting Bibliographies An online form accepts BibTeX or Refer, converts to HTML and MIF (FrameMaker), and converts back and forth. Retains good ones for the InterBib server.

Generating Bibliographies RTF, HTML, or FrameMaker MIF files. Generally, citations in your documents need to be of the form '[garc89, ullm93]', but you can use any keys you want, as long as they match the ones in your BibTeX file. In Refer, you can use the field '%L' to specify a key. If no key is specified in Refer, InterBib will construct an all lower-case key from the first four letters of the first author and the publication year. Apostrophes are left out of the key. You can also change the characters you use to delimit citations in your document.
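
The default key rule and the citation format can be sketched as follows. Two details are assumptions inferred from the example keys 'garc89' and 'ullm93' rather than stated on the slide: the letters come from the first author's surname, and the year is shortened to two digits. The function names are hypothetical and are not InterBib's API.

```python
import re


def default_key(first_author_surname: str, year: int) -> str:
    """Build an all lower-case key: four letters of the author plus the year.

    Apostrophes are dropped before the letters are taken (assumed order).
    """
    letters = first_author_surname.replace("'", "").lower()
    return f"{letters[:4]}{year % 100:02d}"


def find_citations(text, open_delim="[", close_delim="]"):
    """Extract citation keys; the delimiters are configurable, as on the slide."""
    pattern = re.escape(open_delim) + r"(.*?)" + re.escape(close_delim)
    keys = []
    for group in re.findall(pattern, text):
        keys.extend(k.strip() for k in group.split(",") if k.strip())
    return keys


print(default_key("Garcia-Molina", 1989))
print(find_citations("Distributed objects [garc89, ullm93] are used."))
```

A document processor along these lines would look up each extracted key in the bibliography file and replace the bracketed citation with the formatted reference.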

Sharing Bibliographies Users can search the shared bibliographies for relevant entries.

Sharing Information • DietORB • Digital Wallets

DietORB A highly minimized CORBA for handheld devices. We developed a CORBA ORB for the Palm Pilot PDA. The ORB currently only allows the PDA to call out to full-sized services. This project is associated with MICO, a free, GNU-licensed CORBA implementation.

Digital Wallets A digital wallet is a software component that allows a user to make an electronic payment with a financial instrument (such as a credit card or a digital coin), and hides the low-level details of executing the payment protocol that is used to make the payment.

Extensible The wallet should accommodate all of the user's different payment instruments and interoperate with multiple payment protocols. Vendors should be able to develop electronic coupons that offer discounts on products without requiring that users install a new wallet to hold these coupons and make payments with them.

Client-Driven Vendors should not be capable of invoking the client's digital wallet to do anything that the end-user may resent or consider an annoyance.

Symmetric Vendors and banks run software analogous to wallets, which manages their end of the financial operations. Since the functionality is so similar, it makes sense to re-use, whenever possible, the same infrastructure and interfaces within wallets, vendors, and banks.

Generalized Interfaces should be similar regardless of the type of device or computer on which the wallet, bank, or vendor application is running. A digital wallet running on an "alternative" device, such as a personal digital assistant (PDA) or a smart card, for example, has substantial functionality in common with a digital wallet built as an extension to a web browser. Thus, a digital wallet in these two environments should re-use the same instrument and protocol management interfaces.
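
The wallet properties above (extensible, client-driven, and sharing the same instrument and protocol management interfaces everywhere) can be illustrated with a small sketch. Every class and method name here is hypothetical and the payment "protocol" is a toy; it is not the project's wallet API.

```python
from dataclasses import dataclass


@dataclass
class Instrument:
    kind: str        # e.g. "credit-card", "digital-coin", "coupon"
    identifier: str


class PaymentProtocol:
    """Base class; each protocol hides its own low-level message exchange."""
    name = "abstract"
    accepts = ()  # instrument kinds this protocol can pay with

    def pay(self, instrument, payee, amount_cents):
        raise NotImplementedError


class ToyCardProtocol(PaymentProtocol):
    name = "toy-card"
    accepts = ("credit-card",)

    def pay(self, instrument, payee, amount_cents):
        return f"{self.name}: {amount_cents} cents to {payee} via {instrument.identifier}"


class Wallet:
    """Holds instruments and protocols; both can be added without a new wallet."""

    def __init__(self):
        self.instruments = []
        self.protocols = []

    def add_instrument(self, instrument):
        self.instruments.append(instrument)

    def add_protocol(self, protocol):
        self.protocols.append(protocol)

    def pay(self, payee, amount_cents, user_approved):
        # Client-driven: nothing happens without the user's explicit approval.
        if not user_approved:
            raise PermissionError("payment needs explicit user approval")
        for protocol in self.protocols:
            for instrument in self.instruments:
                if instrument.kind in protocol.accepts:
                    return protocol.pay(instrument, payee, amount_cents)
        raise LookupError("no installed protocol accepts any held instrument")


wallet = Wallet()
wallet.add_instrument(Instrument("credit-card", "card-1"))
wallet.add_protocol(ToyCardProtocol())
print(wallet.pay("bookshop", 500, user_approved=True))
```

A vendor's coupon scheme would arrive as one more `Instrument` kind plus a `PaymentProtocol` subclass, and the same two interfaces could be reused by vendor or bank software, or on a PDA.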
