The INFOMINE project

Gordon Paynter, Infomine Lead Programmer, and the Infomine team: Steve Mitchell, Margaret Mooney, Julie Mason et al., University of California, Riverside

The Infomine Project • Introduction to Infomine • The core Infomine system • Automation: finding and describing resources • Collaboration: the Fiat Lux portals • Conclusions

Introduction to Infomine • Infomine is a virtual library • Infomine's goal is to provide organised access to the Internet in the same way that we do for printed works • Library catalogs focus on books and periodicals • Infomine focuses on web sites (mostly, now) • There are many differences between books and web sites

Books vs. Web sites • Books: • Easily-defined, physical objects • Static • Permanent • LC: 119 million items • Web sites: • What is a “web site” anyway? • Continually changing • Frequently disappear • Google: 2 billion pages

Books vs. Web sites • Books: • Limited number of publishers • Existing, coordinated cataloging effort • Text not usually electronically available • Web sites: • Anyone can publish • Few indexers: Infomine, LII, IPL, BUBL, MEL, Scout; all are post-hoc • Can be downloaded and processed

Simplifying the problem • Editorial standards: • Only select the best Web sites • Automated assistance: • Collection building • Automated and semi-automated resource description • Catalog maintenance • Wide collaboration • More contributors • Less redundant effort

The core Infomine system • Infomine for patrons • Behind the scenes: Infomine for content builders • Open source inputs: what the community gives us • Open source outputs: what we're distributing

Infomine core: behind the scenes

Infomine core: open source inputs • The Linux operating system • Debian GNU/Linux • Infrastructure: • The Apache webserver • MySQL and Berkeley DB databases • Programming tools: • The GNU Compiler (gcc) and libraries, emacs • Common libraries

Infomine core: open source outputs • The Infomine general-purpose library • http://infomine.ucr.edu/iVia/ • The full libInfomine library • Available in August (as documentation completed) • The full Infomine source • Available Fall 2002

Automation: finding and describing resources • Discovering new resources • The Infomine record builder • Extracting useful metadata • Automatically classifying records • Open source inputs • Open source outputs

Discovering new resources • The semi-automatic focused web crawler • You suggest a topic or search term • The crawler searches for web pages and clusters them • You identify useful clusters of documents (optional) • The crawler reports the top 20 hubs and authorities • You choose from the list of URLs • The automatic record builder helps generate metadata • The fully-automatic focused web crawler • Coming soon!
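The "hubs and authorities" ranking above is commonly computed with the HITS algorithm. The slides do not say which variant the Infomine crawler uses, so the following is only a minimal sketch under that assumption. The link graph and page names are hypothetical.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

def top(scores, n=20):
    """Return the n highest-scoring pages (the slides report the top 20)."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical crawl result: two link lists pointing at three resources.
links = {
    "guide": ["a", "b", "c"],
    "list": ["a", "b"],
    "a": ["b"],
}
hub, auth = hits(links)
best_hub = top(hub, 1)
best_authority = top(auth, 1)
```

In this toy graph the page with the most outgoing links to well-cited pages ranks as the best hub, and the most linked-to page ranks as the best authority.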

The Infomine record builder • Input: a URL or list of URLs • From the focused crawler • From the record builder interface • The record builder creates a new record • Fully-automatic operation • The builder creates new records on its own • Semi-automatic operation • The builder interacts with you at each stage • Output: new records in the pending database

New research: LCSH assignment • Dr. Steve Jones, of the University of Waikato • Aim: assign LCSH based on document content • Use training data to build a model • Training data: documents with keyphrases and LCSH • Model: based on keyphrase and LCSH co-occurrence • Use model to assign LCSH to new documents • Extract keyphrases with Kea • Similarity measures identify the best LCSH
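The slide gives only the outline of the co-occurrence model, so the sketch below fills in the details with assumptions: plain co-occurrence counts, and a score that sums each keyphrase's share of votes for a heading. The actual similarity measures are not specified in the slides. Kea is not called here; keyphrases are supplied as ready-made lists, and the training data is invented for illustration.

```python
from collections import Counter, defaultdict

def train(docs):
    """docs is an iterable of (keyphrases, lcsh_headings) pairs."""
    cooc = defaultdict(Counter)  # keyphrase -> counts of co-occurring LCSH
    for keyphrases, headings in docs:
        for kp in keyphrases:
            cooc[kp].update(headings)
    return cooc

def assign(cooc, keyphrases, n=3):
    """Rank LCSH for a new document from its extracted keyphrases."""
    scores = Counter()
    for kp in keyphrases:
        counts = cooc.get(kp)
        if not counts:
            continue  # keyphrase never seen in training
        total = sum(counts.values())
        for heading, count in counts.items():
            scores[heading] += count / total
    return [heading for heading, _ in scores.most_common(n)]

# Hypothetical training documents with keyphrases and LCSH.
training = [
    (["bark beetle", "pine"], ["Forest insects", "Bark beetles"]),
    (["bark beetle", "forest pest"], ["Forest insects"]),
    (["rapeseed", "plant breeding"], ["Brassica", "Plant breeding"]),
]
model = train(training)
suggested = assign(model, ["bark beetle", "forest pest"], n=1)
```

A document whose keyphrases were never seen in training receives no headings at all, which is one reason a real system would need a fallback.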

New research: LCSH assignment • forest insects • forest insects; bark beetles; borers (insects); tobacco hornworm; scolytidae; greenhouse whitefly; agriculture in literature; mountain pine beetle

New research: LCSH assignment • BRASSICA • CROPS • PLANT BREEDING • cruciferae; Buriats; brassica; phytophagous insects; plants, effect of metals on; blood groups in animals; rapeseed; hybridization, vegetable

New research: LCSH assignment • CLIMATOLOGY • ENVIRONMENTAL SCIENCES • POLLUTION • atmospheric chemistry; meteorology; continentality (meteorology); chemical oceanography; multidimensional chromatography; turbulent diffusion (meteorology); aerosols; precipitation scavenging

New research: LCC assignment • Dr. Eibe Frank, of the University of Waikato • Aim: assign LCC based on a set of LCSH • Infomine has LCSH but no LCC • Use with LCSH classifier for new documents • Use training data to build a model • Training data: documents with LCSH and LCC • Model: LCC-hierarchy of Support Vector Machines • Use model to assign LCC to new documents
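A hierarchy of classifiers can be applied top-down: at each level, the classifiers for the child classes compete and the winner becomes the next node. The sketch below shows only that descent. It does not use Support Vector Machines; a simple heading-overlap count stands in for each SVM's decision value, and the tiny hierarchy and its vocabularies are hypothetical.

```python
class Node:
    def __init__(self, label, vocabulary=(), children=()):
        self.label = label
        self.vocabulary = set(vocabulary)  # LCSH seen under this class (toy data)
        self.children = list(children)

    def score(self, headings):
        # Stand-in for an SVM decision value: count of overlapping headings.
        return len(self.vocabulary & set(headings))

def classify(root, headings):
    """Descend from the root, taking the best-scoring child at each level."""
    path = []
    node = root
    while node.children:
        best = max(node.children, key=lambda child: child.score(headings))
        if best.score(headings) == 0:
            break  # no child is supported, so stop at the current level
        path.append(best.label)
        node = best
    return path

# Hypothetical two-level fragment of a classification hierarchy.
root = Node("LCC", children=[
    Node("Science", {"Mathematics", "Algebra", "Optics"}, [
        Node("Mathematics", {"Mathematics", "Algebra"}),
        Node("Physics", {"Optics"}),
    ]),
    Node("Agriculture", {"Forest insects", "Crops"}),
])
algebra_path = classify(root, ["Algebra"])
unknown_path = classify(root, ["A heading never seen before"])
```

Stopping early when no child is supported yields a shorter, more general class rather than a forced guess.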

New research: LCC assignment • Performance (preliminary) • Absolute accuracy around 58% (pleasing) • Also: 4% are too specific, 3% too general • Top-level accuracy around 80% • What to do if we encounter completely new LCSH? • QA1-43: Science > Mathematics > General
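The categories reported above (exact, too specific, too general, top-level) can be computed by comparing paths through the classification hierarchy. The sketch below is one possible reading: it assumes "too specific" means the prediction lies below the true class, "too general" means it stops above it, and "top-level" means the first level matches. The sample predictions are invented and are not the project's data.

```python
from collections import Counter

def compare(predicted, actual):
    """Both arguments are paths from the top of the hierarchy downward."""
    if predicted == actual:
        return "exact"
    if predicted[:len(actual)] == actual:
        return "too specific"  # prediction lies below the true class
    if actual[:len(predicted)] == predicted:
        return "too general"   # prediction stops above the true class
    return "wrong"

def evaluate(pairs):
    """pairs is a non-empty list of (predicted_path, actual_path) tuples."""
    n = len(pairs)
    outcomes = Counter(compare(p, a) for p, a in pairs)
    report = {outcome: count / n for outcome, count in outcomes.items()}
    # Top-level accuracy: the first level of the hierarchy is correct.
    report["top-level"] = sum(p[:1] == a[:1] for p, a in pairs) / n
    return report

# Hypothetical predictions, written as paths such as
# Science > Mathematics > General.
maths = ("Science", "Mathematics")
pairs = [
    (maths + ("General",), maths + ("General",)),
    (maths + ("General",), maths),
    (("Science",), maths),
    (("Science", "Physics"), maths),
    (("Agriculture",), maths),
]
report = evaluate(pairs)
```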

Automation: open source inputs • General and C++ tools • Linux, Apache, gcc, flex, curl, etc • Java tools • The Java MARC Events (James) toolkit • The Waikato Environment for Knowledge Analysis (WEKA) machine learning toolkit • The Kea keyphrase extraction program

Automation: open source outputs • LCSHtoLCC: LCC assignment • http://infomine.ucr.edu/iVia/ • KPtoLCSH: LCSH assignment • Available August • PhraseRate: keyphrase extractor • http://infomine.ucr.edu/iVia/ • Artur's Automatic Annotator • Available in Fall 2002 (with Infomine)

Collaboration: the Fiat Lux Portals • Fiat Lux • Advantages of collaboration • MyI: Research guides and pathfinders • Themes: co-branding for collaborators • Open standards, protocols and source code • Challenges of collaboration

Fiat Lux • Established at ALA Midwinter 2002 • Prominent, librarian-built, public portals: • BUBL, Infomine, IPL, lii.org, MEL, VRL • Goal: resource sharing through collaboration • Fiat Lux represents: • 170 librarians • 100,000 records • 30 million searches/year

Advantages of collaboration • Greater sustainability and scalability • Reduced redundant effort • Shared cataloging effort • More resources cataloged • Everyone gets a bigger (better) dataset • Shared systems development • Scalability of systems • Preserving institutional identity

Themes: co-branding through iVia • Co-branding for institutional cooperators • Many data views can be “themed” • The data is the same • The appearance is altered • http://infomine.ucr.edu/cgi-bin/canned_search?query=tree&theme=wfu
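The example URL above carries the search in a `query` parameter and the appearance in a `theme` parameter. A small sketch of building such URLs follows; the base URL and both parameter names come from the slide, `wfu` is the only theme value the slide shows, and the helper function name is hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://infomine.ucr.edu/cgi-bin/canned_search"

def themed_search_url(query, theme):
    """Build a canned-search URL for one query under one theme."""
    return BASE + "?" + urlencode({"query": query, "theme": theme})

url = themed_search_url("tree", "wfu")

# Parsing the URL back shows the two independent pieces: what is
# searched for, and which appearance is applied to the results.
params = parse_qs(urlsplit(url).query)
```

Changing only the `theme` value leaves the query untouched, which matches the slide's point that the data is the same and only the appearance is altered.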

MyI: custom collections • Create research guides / pathfinders • Create a “MyI category” • Add records to categories in the record editor • “Batch add” to your category • Create searches for your records • Examples: • CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM... • UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD • UDM-edu459

Challenges of collaboration • Investigating lii.org integration: • Granularity of metadata • Different editorial processes • Collection focus and audience level • Scholarly vs. K-12 vs. public library • How do you merge duplicate records? • LCSH, keywords: easy to combine • Annotation: not sure yet • These are editorial rather than technical issues
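The merging question above can be made concrete with a small sketch. It assumes duplicates are detected by an identical URL, and the record field names are hypothetical. LCSH and keywords are combined by set union, while the annotations are kept side by side for an editor, since the slide leaves that policy open.

```python
def union(first, second):
    """Order-preserving, case-insensitive union of two lists of strings."""
    seen, merged = set(), []
    for item in list(first) + list(second):
        key = item.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged

def merge_records(a, b):
    """Merge two records that describe the same resource."""
    if a["url"] != b["url"]:
        raise ValueError("records describe different resources")
    return {
        "url": a["url"],
        "lcsh": union(a["lcsh"], b["lcsh"]),
        "keywords": union(a["keywords"], b["keywords"]),
        # No automatic choice between annotations: keep both for review.
        "annotations": [a["annotation"], b["annotation"]],
    }

# Two hypothetical records for the same site from different portals.
record_a = {
    "url": "http://example.org/",
    "lcsh": ["Forest insects"],
    "keywords": ["beetles", "Forestry"],
    "annotation": "A guide to forest pests.",
}
record_b = {
    "url": "http://example.org/",
    "lcsh": ["Forest insects", "Bark beetles"],
    "keywords": ["forestry", "pests"],
    "annotation": "Covers bark beetles.",
}
merged = merge_records(record_a, record_b)
```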
