
The INFOMINE project

Gordon Paynter, Infomine Lead Programmer, and the Infomine team (Steve Mitchell, Margaret Mooney, Julie Mason et al.) at the University of California, Riverside.



  1. The INFOMINE project • Gordon Paynter, Infomine Lead Programmer, and the Infomine team: Steve Mitchell, Margaret Mooney, Julie Mason et al., at the University of California, Riverside

  2. The Infomine Project • Introduction to Infomine • The core Infomine system • Automation: finding and describing resources • Collaboration: the Fiat Lux portals • Conclusions

  3. Introduction to Infomine • Infomine is a virtual library • Infomine's goal is to provide organised access to the Internet in the same way that we do for printed works • Library catalogs focus on books and periodicals • Infomine focuses on web sites (mostly, now) • There are many differences between books and web sites

  4. Books vs. Web sites • Books: • Easily-defined, physical objects • Static • Permanent • LC: 119 million items • Web sites: • What is a “web site” anyway? • Continually changing • Frequently disappear • Google: 2 billion pages

  5. Books vs. Web sites • Books: • Limited number of publishers • Existing, coordinated cataloging effort • Text not usually electronically available • Web sites: • Anyone can publish • Few indexers: Infomine, LII, IPL, BUBL, MEL, Scout; all are post-hoc • Can be downloaded and processed

  6. Simplifying the problem • Editorial standards: • Only select the best Web sites • Automated assistance: • Collection building • Automated and semi-automated resource description • Catalog maintenance • Wide collaboration • More contributors • Less redundant effort

  7. The core Infomine system • Infomine for patrons • Behind the scenes: Infomine for content builders • Open source inputs: what the community gives us • Open source outputs: what we're distributing

  8. Infomine core: behind the scenes

  9. Infomine core: open source inputs • The Linux operating system • Debian GNU/Linux • Infrastructure: • The Apache webserver • MySQL and Berkeley DB databases • Programming tools: • The GNU Compiler (gcc) and libraries, emacs • Common libraries

  10. Infomine core: open source outputs • The Infomine general-purpose library • http://infomine.ucr.edu/iVia/ • The full libInfomine library • Available in August (once documentation is completed) • The full Infomine source • Available Fall 2002

  11. Automation: finding and describing resources • Discovering new resources • The Infomine record builder • Extracting useful metadata • Automatically classifying records • Open source inputs • Open source outputs

  12. Discovering new resources • The semi-automatic focused web crawler • You suggest a topic or search term • The crawler searches for web pages and clusters them • You identify useful clusters of documents (optional) • The crawler reports the top 20 hubs and authorities • You choose from the list of URLs • The automatic record builder helps generate metadata • The fully-automatic focused web crawler • Coming soon!
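The "hubs and authorities" reported by the crawler suggest a link-analysis ranking in the style of Kleinberg's HITS algorithm. The slide does not say how Infomine computes these scores, so the following is only a minimal sketch under that assumption. The function names `hits` and `top_n` and the toy link graph format are hypothetical, not part of the Infomine code.

```python
from math import sqrt


def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping a page to the list of pages it links to
    (an assumed input format, not Infomine's).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


def top_n(scores, n=20):
    """Return the n highest-scoring pages (ties broken alphabetically)."""
    return sorted(scores, key=lambda p: (-scores[p], p))[:n]
```

Calling `top_n(hub)` and `top_n(auth)` on the result would give the kind of "top 20" lists a human editor could then choose URLs from.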

  13. The Infomine record builder • Input: a URL or list of URLs • From the focused crawler • From the record builder interface • The record builder creates a new record • Fully-automatic operation • The builder creates new records on its own • Semi-automatic operation • The builder interacts with you at each stage • Output: new records in the pending database

  14. New research: LCSH assignment • Dr. Steve Jones, of the University of Waikato • Aim: assign LCSH based on document content • Use training data to build a model • Training data: documents with keyphrases and LCSH • Model: based on keyphrase and LCSH co-occurrence • Use model to assign LCSH to new documents • Extract keyphrases with Kea • Similarity measures identify the best LCSH
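The co-occurrence idea on this slide can be illustrated with a small sketch. It assumes keyphrases have already been extracted (Kea is not invoked here), and the scoring rule, a sum of each keyphrase's relative co-occurrence frequency with a heading, is a stand-in chosen for illustration; the slide does not specify the actual similarity measures. The names `build_model` and `assign_lcsh` are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(training_docs):
    """Count how often each keyphrase co-occurs with each heading.

    training_docs: iterable of (keyphrases, lcsh_headings) pairs.
    """
    cooc = defaultdict(Counter)  # keyphrase -> Counter of headings
    for keyphrases, headings in training_docs:
        for kp in set(keyphrases):
            for heading in set(headings):
                cooc[kp][heading] += 1
    return cooc


def assign_lcsh(model, keyphrases, top_k=3):
    """Rank headings for a new document from its keyphrases.

    Assumed scoring: each known keyphrase votes for the headings it was
    seen with, weighted by relative co-occurrence frequency.
    """
    scores = Counter()
    for kp in set(keyphrases):
        counts = model.get(kp)
        if not counts:
            continue  # keyphrase never seen in training data
        total = sum(counts.values())
        for heading, count in counts.items():
            scores[heading] += count / total
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [heading for heading, _ in ranked[:top_k]]
```

A document whose keyphrases never appeared in the training data gets no headings at all under this scheme, which is one reason a real system would need a fallback.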

  15. New research: LCSH assignment • forest insects • Terms shown on the slide: forest insects; bark beetles; borers (insects); tobacco hornworm; scolytidae; greenhouse whitefly; agriculture in literature; mountain pine beetle

  16. New research: LCSH assignment • BRASSICA • CROPS • PLANT BREEDING • Terms shown on the slide: cruciferae; Buriats; brassica; phytophagous insects; plants, effect of metals on; blood groups in animals; rapeseed; hybridization, vegetable

  17. New research: LCSH assignment • CLIMATOLOGY • ENVIRONMENTAL SCIENCES • POLLUTION • Terms shown on the slide: atmospheric chemistry; meteorology; continentality (meteorology); chemical oceanography; multidimensional chromatography; turbulent diffusion (meteorology); aerosols; precipitation scavenging

  18. New research: LCC assignment • Dr. Eibe Frank, of the University of Waikato • Aim: assign LCC based on a set of LCSH • Infomine has LCSH but no LCC • Use with LCSH classifier for new documents • Use training data to build a model • Training data: documents with LCSH and LCC • Model: LCC-hierarchy of Support Vector Machines • Use model to assign LCC to new documents

  19. New research: LCC assignment • Performance (preliminary) • Absolute accuracy around 58% (pleasing) • Also: 4% are too specific, 3% too general • Top-level accuracy around 80% • What to do if we encounter completely new LCSH? • QA1-43: Science > Mathematics > General
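The four performance figures on this slide can be read as measurements over paths in the LCC hierarchy, such as Science > Mathematics > General. The slide does not define the categories, so this sketch assumes one plausible reading: "too specific" means the prediction extends the true path, "too general" means it stops short of it, and "top-level" means the first label matches. The function name `evaluate` and the tuple representation are hypothetical.

```python
def evaluate(pairs):
    """Summarise hierarchical predictions as fractions of all documents.

    pairs: list of (predicted_path, true_path), each a tuple of labels
    from the top of the hierarchy down,
    e.g. ("Science", "Mathematics", "General").
    """
    counts = {"exact": 0, "too_specific": 0, "too_general": 0, "top_level": 0}
    for pred, true in pairs:
        if pred == true:
            counts["exact"] += 1
        elif len(pred) > len(true) and pred[:len(true)] == true:
            # Prediction goes deeper than the true class.
            counts["too_specific"] += 1
        elif len(pred) < len(true) and true[:len(pred)] == pred:
            # Prediction stops at an ancestor of the true class.
            counts["too_general"] += 1
        if pred and true and pred[0] == true[0]:
            counts["top_level"] += 1
    n = len(pairs)
    return {key: (value / n if n else 0.0) for key, value in counts.items()}
```

Under this reading the "too specific" and "too general" cases are near misses that still land on the correct branch, which is why they are worth reporting separately from outright errors.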

  20. Automation: open source inputs • General and C++ tools • Linux, Apache, gcc, flex, curl, etc. • Java tools • The Java MARC Events (James) toolkit • The Waikato Environment for Knowledge Analysis (WEKA) machine learning toolkit • The Kea keyphrase extraction program

  21. Automation: open source outputs • LCSHtoLCC: LCC assignment • http://infomine.ucr.edu/iVia/ • KPtoLCSH: LCSH assignment • Available August • PhraseRate: keyphrase extractor • http://infomine.ucr.edu/iVia/ • Artur's Automatic Annotator • Available in Fall 2002 (with Infomine)

  22. Collaboration: the Fiat Lux Portals • Fiat Lux • Advantages of collaboration • MyI: Research guides and pathfinders • Themes: co-branding for collaborators • Open standards, protocols and source code • Challenges of collaboration

  23. Fiat Lux • Established at ALA Midwinter 2002 • Prominent, librarian-built, public portals: • BUBL, Infomine, IPL, lii.org, MEL, VRL • Goal: resource sharing through collaboration • Fiat Lux represents: • 170 librarians • 100,000 records • 30 million searches/year

  24. Advantages of collaboration • Greater sustainability and scalability • Reduced redundant effort • Shared cataloging effort • More resources cataloged • Everyone gets a bigger (better) dataset • Shared systems development • Scalability of systems • Preserving institutional identity

  25. Themes: co-branding through iVia • Co-branding for institutional cooperators • Many data views can be “themed” • The data is the same • The appearance is altered • http://infomine.ucr.edu/cgi-bin/canned_search?query=tree&theme=wfu
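The example URL shows that a theme is selected with a `theme` query parameter alongside the `query` parameter. A small sketch of building such URLs follows; the base URL and the two parameter names come from the slide, while the helper name `themed_search_url` is hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address taken from the example URL on the slide.
BASE = "http://infomine.ucr.edu/cgi-bin/canned_search"


def themed_search_url(query, theme=None):
    """Build a canned-search URL, optionally with a collaborator's theme."""
    params = {"query": query}
    if theme:
        params["theme"] = theme
    # urlencode escapes spaces and other reserved characters.
    return BASE + "?" + urlencode(params)


print(themed_search_url("tree", "wfu"))
```

The same query with a different `theme` value would return the same data with a different appearance, which is the point of the slide.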

  26. MyI: custom collections • Create research guides / pathfinders • Create a “MyI category” • Add records to categories in the record editor • “Batch add” to your category • Create searches for your records • Examples: • CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM... • UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD • UDM-edu459

  27. Challenges of collaboration • Investigating lii.org integration: • Granularity of metadata • Different editorial processes • Collection focus and audience level • Scholarly vs. K-12 vs. public library • How do you merge duplicate records? • LCSH, keywords: easy to combine • Annotation: not sure yet • These are editorial issues rather than technical ones
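To illustrate why LCSH and keywords are "easy to combine" while annotations are not, here is a minimal sketch of merging two duplicate records. The record layout (a dict with `url`, `lcsh`, `keywords` and `annotation` fields) is an assumption made for this example and is not the Infomine or lii.org schema. Term lists are merged as a de-duplicated union; the annotations are simply kept side by side for an editor to resolve.

```python
def merge_terms(*term_lists):
    """Union of term lists, preserving order, ignoring case and whitespace."""
    seen, merged = set(), []
    for terms in term_lists:
        for term in terms:
            key = term.strip().casefold()
            if key and key not in seen:
                seen.add(key)
                merged.append(term.strip())
    return merged


def merge_records(a, b):
    """Merge two records describing the same URL (hypothetical layout)."""
    if a["url"] != b["url"]:
        raise ValueError("records describe different URLs")
    return {
        "url": a["url"],
        "lcsh": merge_terms(a.get("lcsh", []), b.get("lcsh", [])),
        "keywords": merge_terms(a.get("keywords", []), b.get("keywords", [])),
        # No automatic choice is made between annotations: both are kept
        # so that an editor can decide.
        "annotations": [x for x in (a.get("annotation"), b.get("annotation")) if x],
    }
```

The unresolved `annotations` list reflects the slide's point that the hard part is an editorial decision, not a technical one.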
