
A Mini Experiment

Win Shih, John Pardavila, Krishna Rayavaram. University at Albany, SUNY. LiSUG Conference, October 12, 2009.


Presentation Transcript


  1. A Mini Experiment Win Shih, John Pardavila, Krishna Rayavaram University at Albany, SUNY LiSUG Conference, October 12, 2009 2009 LiSUG Conference

  2. Project Overview • Scope • Resources • Time

  3. Content Acquisition • Crawls 220 file types • File system crawling • Direct connection to databases • Content feed API

  4. Query Processing • Google Algorithm • KeyMatch • Self-learning spell checker • Suggested Queries • Language support • Google Stemming

  5. Results Display • Google Standard • Templates/Wizard • Output in XML
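As a rough illustration of what "Output in XML" makes possible, the sketch below parses a result list with Python's standard library. The element names (`GSP`, `RES`, `R`, `U`, `T`, `S`) follow the appliance's XML result format as best recalled and should be checked against the Mini's own documentation; the sample document, host name, and file names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample response; element names are an assumption about the
# appliance's XML format, and the URLs are hypothetical.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>harvey milk</Q>
  <RES SN="1" EN="2">
    <R N="1">
      <U>http://mini.example.edu/asp/1978_11_28.pdf</U>
      <T>Albany Student Press</T>
      <S>... Harvey Milk ...</S>
    </R>
    <R N="2">
      <U>http://mini.example.edu/asp/1978_12_01.pdf</U>
      <T>Albany Student Press</T>
      <S>... Milk ...</S>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return one dict per <R> element: rank, URL, title, and snippet."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "rank": int(r.get("N", "0")),
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results


if __name__ == "__main__":
    for hit in parse_results(SAMPLE_XML):
        print(hit["rank"], hit["url"])
```

A front end built this way would do its own rendering instead of relying on the appliance's XSLT templates.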

  6. Sample Mini Libraries • Denver Public Library (http://denverlibrary.org/) • University of Colorado Health Sciences Library (http://hsclibrary.uchsc.edu/) • New York State Archives (http://www.archives.nysed.gov/aindex.shtml) • Combined Arms Research Library (http://cgsc.leavenworth.army.mil/carl/)

  7. Albany Student Press • 1919-2009: Celebrating 90 years of service • Size: 2,288 PDF files • Coverage: 1916 - 1985

  8. Google Mini Features • Crawl URLs • Collections • Front Ends: Key Matching, Related Queries, Result Biasing • Status and Reports: Search Reports, Logs, and Events • Server Administration: Networks, Accounts, Notifications, SSH, more

  9. Technologies Used • XML, XSLT, XSL for the interface. • You do not need a coder to generate results: wizard vs. coding. • The library's front ends pointing to the Google Mini are developed using combinations of PHP, JavaScript, XHTML, and CSS in the Drupal content management system.

  10. Mini ASP Search • Demonstrating the Search • Strings (Harvey Milk) • By Date (Specific Date, Month & Year)

  11. Lessons Learned • Quality of OCR Scan • Incorrect character recognition reduces the accuracy of search results. • Non-OCRed documents: Google Mini cannot index image-only PDFs. • Metadata – Search Engine Optimization • Metadata is a good mechanism for improving the visibility of a posted web page in search engine results. • It can enhance the search ranking and results of PDF files. • None of the PDFs contained metadata, so metadata was added to the Title, Subject, and Keywords attributes.

  12. Lessons Learned • File naming convention • Google Mini does index file names. • ASP files are named in this format: yyyy_mm_dd • Granularity • PDF files at the issue level, rather than the article level, are not granular enough, and this affects the search experience. • In a keyword search, the search terms can appear in several articles within the same issue, yet the issue appears as only one entry in the Google result listing. Patrons have to use the Adobe Reader search function to locate each occurrence of the search term.
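The yyyy_mm_dd naming convention makes each file name machine-readable, which is what allows searching by specific date or by month and year. A minimal sketch of reading an issue date back out of a file name follows; the function name and the sample file names are hypothetical, and it assumes the name consists of the date alone plus an extension.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Matches a file stem that is exactly yyyy_mm_dd (assumed convention).
_ISSUE_PATTERN = re.compile(r"^(\d{4})_(\d{2})_(\d{2})$")


def issue_date(filename):
    """Return the issue date encoded in a yyyy_mm_dd file name, or None."""
    stem = PurePosixPath(filename).stem  # drop directory and extension
    match = _ISSUE_PATTERN.match(stem)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits in the right shape but not a real calendar date.
        return None


if __name__ == "__main__":
    print(issue_date("1978_11_28.pdf"))
    print(issue_date("scan001.pdf"))
```

Files that do not follow the convention come back as `None`, which is a simple way to find names that need fixing before a crawl.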

  13. Lessons Learned • Clustering • Automatic filtering of results • Add parameter “filter=0” to turn it off • Proxyreload • Shows the updated XSL stylesheet immediately rather than waiting out the 15-minute interval set by the XSLT server. • Add parameter “proxyreload=1” • Image Quality • Some of the scanned images are quite light, which might affect the quality of the OCR and also makes them difficult to read.
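The two parameters above are appended to the search request URL. The sketch below assembles such a URL; only `filter` and `proxyreload` come from the slides, while the host name, the `/search` path, and the other parameter names (`q`, `site`, `client`, `proxystylesheet`, `output`) and their default values are assumptions based on the appliance's usual search protocol and should be verified against the Mini's documentation.

```python
from urllib.parse import urlencode


def build_search_url(host, query, *, collection="default_collection",
                     frontend="default_frontend", show_all=False,
                     reload_stylesheet=False):
    """Build a search URL for a (hypothetical) Mini host.

    show_all adds filter=0; reload_stylesheet adds proxyreload=1.
    """
    params = {
        "q": query,
        "site": collection,           # assumed: collection to search
        "client": frontend,           # assumed: front end to use
        "proxystylesheet": frontend,  # assumed: stylesheet for rendering
        "output": "xml_no_dtd",       # assumed: XML output format
    }
    if show_all:
        params["filter"] = "0"
    if reload_stylesheet:
        params["proxyreload"] = "1"
    return f"http://{host}/search?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("mini.example.edu", "Harvey Milk",
                           show_all=True, reload_stylesheet=True))
```

Keeping `proxyreload=1` behind a flag reflects that it is mainly useful while editing the stylesheet, not for everyday patron searches.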

  14. Lessons Learned • Ranking of Results: Most of the time, the expected result is the first link on the results page, provided the string is indexed properly. • Word Spacing: If the text in an issue has more than one space between words, the Mini doesn’t seem to index it or show accurate results.
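One possible way to work around the word-spacing problem, assuming the OCR text can be extracted and cleaned before it is indexed, is to collapse runs of spaces. The slides do not say this was done; the function below is only a sketch of the idea, and its name is hypothetical.

```python
import re


def collapse_spaces(text):
    """Collapse runs of spaces and tabs to one space, line by line."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of spaces/tabs with a single space, then trim.
        cleaned_lines.append(re.sub(r"[ \t]+", " ", line).strip())
    return "\n".join(cleaned_lines)


if __name__ == "__main__":
    print(collapse_spaces("Harvey   Milk  spoke"))
```

Line breaks are preserved so that column and paragraph boundaries in the extracted text are not merged.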

  15. Future Plans • Rescan the whole collection • Explore other products, including open source discovery tools • Raise funds for expansion and sustainability of the project • Continue the collaboration • ‘Beta’ test other digitized collections • Incorporate user feedback
