
Heuristic Approach for Automatic Metadata Capture of E-books/Journals



Presentation Transcript


  1. Heuristic Approach for Automatic Metadata Capture of E-books/Journals • ARD Prasad • DRTC, Indian Statistical Institute, Bangalore

  2. Agenda • Earlier Experiment with printed books • Present Experiment with E-Books & E-Journals

  3. Heuristics for Printed Books • Heuristics for the ... • Title page • Verso of the title page

  4. Methodology for Printed Books • Scan the title page • OCR the image • Generate the output in HTML • Apply Heuristics to HTML pages • Identify the bibliographic elements
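The step from OCR output to heuristics can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes the OCR tool writes font sizes as inline `font-size:NNpt` styles, and the names `LineCollector` and `extract_lines` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: the OCR tool records sizes as inline CSS such as "font-size:24pt".
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*pt", re.I)
# Tags that never have a closing tag, so they must not affect the size stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class LineCollector(HTMLParser):
    """Collect (text, font_size_pt) pairs in page order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.size_stack = [0.0]  # inherited font size; 0.0 means unknown
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE_RE.search(style)
        # A tag without its own size inherits the size of its parent.
        size = float(match.group(1)) if match else self.size_stack[-1]
        self.size_stack.append(size)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.size_stack) > 1:
            self.size_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.lines.append((text, self.size_stack[-1]))


def extract_lines(html):
    """Return the text runs of an OCR-generated HTML page with their font sizes."""
    parser = LineCollector()
    parser.feed(html)
    parser.close()
    return parser.lines
```

The resulting list keeps page order and font size together, which are the two signals the title heuristics rely on.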

  5. Heuristics for Verso of the Title Page • Identify date, edition, etc. • See whether prenatal cataloging (Cataloguing-in-Publication data) is available • Identify the bibliographic elements in the prenatal catalog entry • Cross-check them against the identifications from the title page • Resolve any conflicts
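A rough sketch of the date, edition, and conflict-check steps is given below. The slides do not state the actual rules, so the regular expressions and the function names `parse_verso` and `find_conflicts` are assumptions made for illustration; how a conflict is finally resolved is left open, as in the source.

```python
import re

# Assumption: publication years are four-digit numbers from 1500 to 2099.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# Assumption: editions look like "2nd ed.", "Second edition", "Revised edition".
EDITION_RE = re.compile(
    r"\b(\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth|revised)\s+ed(?:ition|\.)",
    re.I,
)


def parse_verso(text):
    """Return the years and the edition statement found in verso text."""
    years = sorted({int(y) for y in YEAR_RE.findall(text)})
    edition = EDITION_RE.search(text)
    return {
        "years": years,
        "edition": edition.group(1).lower() if edition else None,
    }


def _normalise(value):
    # Compare values without regard to case or spacing.
    return " ".join(str(value).split()).lower()


def find_conflicts(title_page, verso):
    """Return the fields present in both records whose values disagree."""
    conflicts = {}
    for field in title_page.keys() & verso.keys():
        if _normalise(title_page[field]) != _normalise(verso[field]):
            conflicts[field] = (title_page[field], verso[field])
    return conflicts
```

Fields reported by `find_conflicts` would then go to whatever resolution rule is chosen, for example preferring one source over the other.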

  6. Generating Bibliographic Records • Once the bibliographic elements are identified, generate bibliographic records in: • ISO 2709 • Dublin Core
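Once elements are identified, a Dublin Core record can be serialised with the Python standard library, as in the sketch below. The function name `to_dublin_core`, the `metadata` wrapper element, and the sample values are assumptions for illustration; only the Dublin Core namespace and element names (`title`, `creator`, `publisher`, `date`) come from the standard itself.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def to_dublin_core(elements):
    """Serialise identified elements as a simple Dublin Core XML record.

    `elements` maps a Dublin Core element name to a string or list of strings.
    The <metadata> wrapper is an arbitrary choice made for this sketch.
    """
    root = ET.Element("metadata")
    for name, values in elements.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, not data from the experiment.
record = to_dublin_core({
    "title": "Digital Libraries",
    "creator": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "date": "2005",
})
```

Repeated elements such as `creator` are passed as a list, so each author gets its own `dc:creator` element.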

  7. Sample Heuristics for Identifying Title • Order of the bibliographic elements • Titles are found in the upper or upper-middle portion of the title page. • The title appears first on the title page (75.15 per cent). (In a few cases the author or series occupies the first position.) • Fonts used in the title field are the largest fonts (94.99 per cent) compared with the size of fonts in other fields.

  8. • If the title and sub-title occur on the same line, they are separated by “:” (colon) or “-” (hyphen). • A title need not consist only of alphabetic characters. The title string may contain numerals and punctuation marks such as commas and hyphens. • Titles usually contain terms such as “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
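Two of the title heuristics above, largest font and the colon or hyphen separator, can be sketched as follows. This is an illustration under stated assumptions rather than the presenter's code: `pick_title` and `split_title` are hypothetical names, ties on font size are broken by page order, and a hyphen only counts as a separator when it has whitespace on both sides, so that words like "E-Books" stay intact.

```python
import re


def pick_title(lines):
    """Pick the title from (text, font_size) pairs given in page order.

    Largest font wins; on a tie the earliest line wins, reflecting the
    observation that the title usually appears first on the title page.
    """
    if not lines:
        return None
    best = max(range(len(lines)), key=lambda i: (lines[i][1], -i))
    return lines[best][0]


# A colon, or a hyphen/en dash surrounded by whitespace (assumption).
SEPARATOR_RE = re.compile(r"\s*:\s*|\s+[-\u2013]\s+")


def split_title(text):
    """Split a line into (title, subtitle) at the first separator, if any."""
    parts = SEPARATOR_RE.split(text, maxsplit=1)
    title = parts[0].strip()
    subtitle = parts[1].strip() if len(parts) > 1 else None
    return title, subtitle
```

For example, `split_title("Digital Libraries: An Introduction")` gives `("Digital Libraries", "An Introduction")`, while a line without a separator comes back with `None` as the subtitle.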

  9. Heuristics for Other Elements • Subtitles • Edition • Volume • Authors/Contributors • Publisher • Place • Year • Series

  10. Present Experiment • E-Books (from sites like amazon.com) • E-Journals (non-OAI-compliant)

  11. Methodology • Template-based identification • Heuristic-based identification

  12. Disadvantages of Template-Based Approach • Templates must be created for every new site • A site may change its appearance, so more than one template may be needed for each site or journal

  13. Methodology • Study a few sites to develop heuristics • Use a web crawler to probe the site • Identify the files containing documents (filter out irrelevant files) • Apply heuristics to the files containing e-documents • Generate Dublin Core records
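The filtering step can be sketched as a simple check on the URLs a crawler returns. The slides do not say how irrelevant files were recognised, so the extension list and the names `is_candidate_document` and `filter_documents` below are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Assumption: e-documents are served as PDF, HTML, or PostScript files.
DOCUMENT_EXTENSIONS = {".pdf", ".html", ".htm", ".ps"}


def is_candidate_document(url):
    """Return True if the URL path ends in an extension that may hold a document."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in DOCUMENT_EXTENSIONS


def filter_documents(urls):
    """Keep only the crawled URLs worth passing to the metadata heuristics."""
    return [url for url in urls if is_candidate_document(url)]


# Hypothetical crawl result.
crawled = [
    "http://example.org/vol1/article1.pdf",
    "http://example.org/style.css",
    "http://example.org/vol1/article2.HTML?view=full",
    "http://example.org/logo.gif",
]
documents = filter_documents(crawled)
```

Filtering on the file extension is only a first pass; a fuller version might also inspect the content type or the page itself before applying the heuristics.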

  14. Welcome to the International Conference on Semantic Web & Digital Libraries • 21st – 23rd February, 2007 • Indian Statistical Institute, Bangalore • Thank You
