Mixed content, mixed metadata: Information discovery in the NSDL


The National Science Digital Library
The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).
http://nsdl.org/

Basic Assumptions
Mixed content: Very large digital libraries will have mixed content from many sources, with large variations in formats, structure, packaging, access permissions, etc.
Mixed metadata: The metadata about the items in a very large digital library will vary greatly in extent, standards, and quality.

Mixed Content
Examples: NSDL-funded collections at Cornell
• Atlas: Data sets of earthquakes, volcanoes, etc.
• Reuleaux: Digitized kinematics models from the nineteenth century.
• Laboratory of Ornithology: Sound recordings, images, and videos of birds and other animals.
• Nuprl: Logic-based tools to support programming and to implement formal computational mathematics.

Mixed Metadata: the Chimera of Standardization
Technical reasons
• Characteristics of formats and genres
• Differing user needs
Social and cultural reasons
• Economic factors
• Installed base
Conclusion: There will not be a single metadata standard for the items in the NSDL.

NSDL: The Spectrum of Interoperability

Level        Agreements                              Example
Federation   Strict use of standards                 AACR, MARC, Z 39.50
             (syntax, semantic, and business)
Harvesting   Digital libraries expose metadata;      Open Archives metadata harvesting
             simple protocol and registry
Gathering    Digital libraries do not cooperate;     Web crawlers and search engines
             services must seek out information

NSDL: The Spectrum of Interoperability
Chronology
The first phase of the NSDL has concentrated on harvesting Dublin Core metadata using the Open Archives Initiative protocol for metadata harvesting.
Current expansions include:
(a) A wider range of metadata standards, e.g., LOM, ONIX
(b) Automatic indexing of web sites recommended by users
(c) Links to the SCORM federation of structured learning objects

The NSDL Repository
The repository is a resource for service providers. It holds information about every collection and item known to the NSDL.
[Figure labels: Services, NSDL Repository, Users, Collections]

NSDL Search Service: First Phase
[Figure: architecture diagram. Labels: NSDL Repository, Collections, Search and Discovery Service, Lucene, and three Portals; arrows labeled "harvest" and "crawl".]

NSDL Search Service: First Phase
Approach
• Collections map metadata to Dublin Core and make it available via the Open Archives protocol.
• The search service augments Dublin Core metadata with indexing of full text where available.
• The user interface returns snippets derived from the metadata, with links to full content and to metadata.
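
The harvesting step in this approach can be sketched as follows. This is a minimal illustration, assuming a collection exposes unqualified Dublin Core through the standard OAI-PMH ListRecords verb. The base URL and the sample record are hypothetical; only the verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications. It does not show how the NSDL itself implemented harvesting.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query

def parse_records(xml_text):
    """Return one dict per record, mapping DC element names to lists of values."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        fields = {}
        for dc in rec.iterfind(".//oai_dc:dc/*", NS):
            name = dc.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
            fields.setdefault(name, []).append((dc.text or "").strip())
        records.append(fields)
    return records

# Hypothetical response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Earthquake data set</dc:title>
          <dc:subject>Seismology</dc:subject>
          <dc:subject>Geophysics</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE))
```

Repeated elements such as dc:subject are kept as lists because Dublin Core allows every element to occur any number of times.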

NSDL Search Service: First Phase
The first phase search service is useful, but has weaknesses:
(a) Ranking by similarity to query is not sufficient.
(b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) The browsing environment is limited.
(e) Most users begin their search with a Web search engine (e.g., Google).
What are the methods for improving information discovery as the system grows in size and the mixture of content increases?

Effective Information Discovery: Before Digital Information
Searching
• Resources separated into categories of related materials.
• Each category organized, indexed, and searched separately.
• Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.
• Search engines used Boolean operators and fielded searching.
• Query languages and search interfaces assumed a trained user.
• Resources were physical items.

Effective Information Discovery: With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval: Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).
Full text indexing with ranked retrieval: Can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
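
The contrast between the two retrieval styles named above can be made concrete with a toy inverted index. This is a sketch only: the three documents are hypothetical, and the tf-idf weighting is one common textbook choice, not a method attributed to the NSDL or to TREC participants.

```python
import math
from collections import Counter, defaultdict

# Hypothetical three-document collection, for illustration only.
DOCS = {
    "d1": "volcano eruption data volcano",
    "d2": "bird song recordings",
    "d3": "earthquake and volcano data sets",
}

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def boolean_and(index, terms):
    """Boolean retrieval: the unranked set of documents containing every term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def ranked(index, terms, n_docs):
    """Ranked retrieval: sum of tf * idf over the query terms, best first."""
    scores = defaultdict(float)
    for t in terms:
        postings = index.get(t, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

INDEX = build_index(DOCS)
print(boolean_and(INDEX, ["volcano", "bird"]))      # no document has both terms
print(ranked(INDEX, ["volcano", "bird"], len(DOCS)))  # partial matches still ranked
```

Boolean AND returns nothing when no document satisfies every term, while ranked retrieval still orders the partial matches.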

Information Discovery in a Messy World: Cross-Domain Metadata
Dublin Core
"... indexes [such as Lycos] are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross-disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval." [Weibel 1995]

Information Discovery in a Messy World: Web Search Engines
Web search engines have adapted to a very large scale. Other techniques, such as cross-domain metadata and federated searching, have failed to scale up.
• What new concepts and techniques have enabled this adaptation?
• What can we learn that is applicable to other information discovery tasks?
• How is NSDL making use of this understanding?

Information Discovery in a Messy World
Building blocks
• Brute force computation
• The expertise of users (the human in the loop)
Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information

Brute Force Computing
Few people really understand Moore's Law
• Computing power doubles every 18 months
• Increases about 100 times in 10 years
• Increases about 10,000 times in 20 years
Simple algorithms plus immense computing power may outperform skilled humans
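
The growth figures on this slide follow directly from compounding. A short check, assuming only the 18-month doubling period stated above (the function name is illustrative):

```python
def growth_factor(years, doubling_months=18):
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

for years in (10, 20):
    print(years, "years:", round(growth_factor(years)))
```

Ten years holds 120 / 18, or about 6.7, doublings, which gives a factor of roughly 100. Twenty years holds twice as many doublings, which squares that factor to roughly 10,000.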

The Expertise of Users: The Human in the Loop
[Figure labels: Search index, Return hits, Browse content, Return objects]

Understanding How and Why Users Seek Information
Homogeneous content
• All documents are assumed equal
• Criterion is relevance (binary measure)
• Goal is to find all relevant documents (high recall)
• Hits ranked in order of similarity to query
Mixed content
• Some documents are more important than others
• Goal is to find the most useful documents on a topic and then browse
• Hits ranked in an order that combines importance and similarity to query
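
One simple way to combine importance with similarity to the query is a weighted blend. The sketch below is an assumption made for illustration: the linear formula, the weight, and the sample hits are hypothetical and are not taken from the NSDL search service.

```python
def combined_score(similarity, importance, weight=0.5):
    """Blend query similarity with query-independent importance (both in [0, 1])."""
    return weight * similarity + (1 - weight) * importance

def rank(hits, weight=0.5):
    """Sort (id, similarity, importance) tuples by blended score, best first."""
    return sorted(hits, key=lambda h: -combined_score(h[1], h[2], weight))

# Hypothetical hits: (identifier, similarity to query, importance).
HITS = [
    ("research-article", 0.9, 0.2),
    ("popular-tutorial", 0.7, 0.9),
    ("stray-page", 0.8, 0.1),
]

print([h[0] for h in rank(HITS, weight=1.0)])  # similarity only
print([h[0] for h in rank(HITS, weight=0.5)])  # importance counts equally
```

With weight=1.0 the ordering is the pure similarity ranking of the homogeneous case. Lowering the weight lets a widely used item overtake a closer textual match.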

Research Topics from the NSDL
How can users indicate preferences?
• They do not want to see research articles. [Machine learning methods can identify collections of research articles.]
• Their students have a specific mathematical background. [Usage data can identify items of similar academic level.]
Detailed metadata requirements will not be accepted!

Relationship and Contextual Information
Methods for capturing context
• Analysis of citations and links (e.g., PageRank)
• Mining usage logs (e.g., customers who buy the same product)
• Reviews (e.g., reputation management)
• Structural relationships (e.g., domain names)
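
Link analysis, the first method listed, can be illustrated with a small power-iteration version of PageRank. This is a textbook-style sketch on a hypothetical four-page graph. The damping factor of 0.85 and the even redistribution of rank from pages without outgoing links are conventional choices assumed here, not details given in the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's damped rank evenly among its link targets.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its damped rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(LINKS)
print(sorted(RANKS, key=RANKS.get, reverse=True))
```

Page "c" ranks highest because three pages link to it, and "a" outranks "b" and "d" because its one incoming link comes from a highly ranked page.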

Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined.
"The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task." [Wactlar, 2000]
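
Combining clues when some are missing can be sketched as a weighted mean over whichever evidence an item actually has. The source names, weights, and scores below are hypothetical, and the weighted mean is an assumed combination rule chosen for simplicity, not the method of the cited research.

```python
def fuse(evidence, weights):
    """Weighted mean over the evidence sources that are present for an item."""
    present = {k: v for k, v in evidence.items() if v is not None and k in weights}
    total = sum(weights[k] for k in present)
    if total == 0:
        return 0.0  # no usable evidence at all
    return sum(weights[k] * v for k, v in present.items()) / total

# Hypothetical weights for three kinds of evidence, each scored in [0, 1].
WEIGHTS = {"metadata": 2.0, "full_text": 1.0, "links": 1.0}

# An item with metadata and link evidence but no indexable full text.
print(fuse({"metadata": 0.8, "full_text": None, "links": 0.4}, WEIGHTS))
# An item known only through its full text.
print(fuse({"full_text": 0.5}, WEIGHTS))
```

Normalizing by the weights of the sources present keeps an item with sparse metadata from being penalized merely because some kinds of evidence are absent.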

The Expertise of Users: Examples
[Figure labels: Search index, Return hits, Browse content, Return objects]

NSDL Search Service: Second Phase Developments
Metadata
• Accept any metadata that is available, in a range of formats
• System for reviews and annotations, with reputation management
Search system
• Multimodal retrieval and ranking
• Dynamic generation of snippets by search engine
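
Dynamic snippet generation can be sketched as taking a window of text around the first query term found, so that the snippet shows why the item matched. This is an illustrative simplification with a hypothetical function name and sample text. It matches substrings rather than whole words and does not reflect how the NSDL search engine builds snippets.

```python
import re

def snippet(text, query_terms, width=60):
    """Return a window of `text` around the first query-term match, or None."""
    if not query_terms:
        return None
    pattern = re.compile("|".join(re.escape(t) for t in query_terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    start = max(0, match.start() - width // 2)
    end = min(len(text), match.end() + width // 2)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

# Hypothetical full text of an item whose metadata does not mention volcanoes.
TEXT = ("The Atlas collection holds data sets of earthquakes and "
        "volcanoes from around the world.")

print(snippet(TEXT, ["volcanoes"], width=20))
print(snippet(TEXT, ["penguin"]))
```

Because the window is cut from the matched text rather than from a fixed metadata field, a hit that came from the full text still shows the matching term.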

NSDL Search Service: Second Phase Developments (cont.)
Usability and human factors
• Wider range of browsing tools (e.g., collection visualization)
• Filters by education level and education quality, where known
Web compatibility
• Expose records for Web crawlers to index
• Browser bookmarklet to add NSDL information to Web pages

Further Reading
Caroline R. Arms and William Y. Arms. Mixed content, mixed metadata: information discovery in a messy world. In Metadata in Practice, editors Diane Hillmann and Elaine Westbrooks, ALA Editions (forthcoming 2004).
Carl Lagoze, et al. Core Services in the Architecture of the National Digital Library for Science Education (NSDL). Joint Conference on Digital Libraries, July 2002. http://arxiv.org/abs/cs.DL/0201025

Acknowledgements and Disclaimer
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education. The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research, Columbia University, and Cornell University. The NSDL Search Service has been developed in partnership with a team at the University of Massachusetts, Amherst.
The ideas discussed in this talk do not represent the official views of the NSF or of the Core Integration team. This work is funded in part by the NSF, grant number 0227648.
