NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion. JATS-CON, April 2, 2014 OSA – The Optical Society & DCL – Data Conversion Laboratory, Inc. scholarly publisher with 19 current and legacy journals, 300+ conference proceedings.

Presentation Transcript


  1. NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion JATS-CON, April 2, 2014 OSA – The Optical Society & DCL – Data Conversion Laboratory, Inc.

  2. scholarly publisher with 19 current and legacy journals, 300+ conference proceedings

  3. OSA Governance: Build more-flexible products and services! How? • Break 1917-2012 content into “well-polished” atomic pieces following an industry standard • Develop infrastructure to manage and enrich content, to build new products and services in an agile fashion • Budget allocated for five-year strategic plan

  4. Some evidence of success: with content converted to NLM XML, we have developed • Enhanced article: Interactive HTML • Derivative products: ImageBank • Business Intelligence: New insights into author, topic, funding, and other trends

  5. Citation data

  6. Equation data

  7. Legacy content (750,000 journal pages): We expected this . . .

  8. This . . . not so much. Journal as: • Comic book • School yearbook

  9. 1. Most confusing: Articles skipping pages, sometimes in two directions

  10. 2. Most shocking: legacy PDF not matching legacy print • Print vs. legacy PDF for the same article

  11. 3. Most pervasive: nonscientific content tacked onto research articles • These are not the authors

  12. Project specifications: two extremes 1. Hand the project over to the trusted vendor and be done with it 2. Spend up to a year doing heavy content analysis and spec creation

  13. Data Conversion Laboratory • We convert content from any format to any format • Expertise with JATS and most industry-standard DTDs and schemas • Established in 1981; a pioneer in the data conversion industry • Over a billion pages converted • Expertise in complex conversion projects: STM publishing, eBooks, technical documents, educational publishing, and library digitization • Projects range from one book to entire libraries and legacy collections • Infrastructure for large-scale projects, with automated tracking, quality assurance, and customer reporting for every item • Industries include publishing, technical societies, aerospace, government, defense, health sciences, libraries, and universities • Publishes DCLNews, a monthly newsletter devoted to XML and electronic publishing topics, going to 7,000 subscribers

  14. Thoughts on Managing a Large Legacy Conversion Effort • Phased Approach • Flexibility and Collaboration • Keep It Simple • Keep Monitoring Quality

  15. 1) Phased Approach Why? • Varied sources (PDF, XML, SGML) • Content that changed over time • Very large input corpus going back to 1917 • Allow for the quick, phased release of new OSA products Strategy for OSA materials: • Focus on one source type at a time but keep the big picture in mind • Convert newest material first • Review and decide on conversion nuances as they came up

  16. Source Material Challenges • XML: OSA Proprietary DTD, NLM v2.3 DTD • PDF: PDF Normal, PDF Image • SGML: Multiple DTDs

  17. 2) Build Flexibility and Collaboration into the Conversion Process • Develop an overall specification, with allowance for change as new scenarios are uncovered • Software development sprints to incorporate changes • Close collaboration with OSA to manage new situations affecting completed work and work in process

  18. Tools Used to Retain Flexibility • Client-Vendor collaboration for decision making • Hub and Spoke processing • Handling of conversion anomalies • Quality assurance reviews • Learning databanks

  19. 3) There’s a Lot of Detail – Keep It Simple • Fitting structures into the existing JATS tagging structure • CALS to HTML table conversion • MathML line break retention • Cross-reference ranges • Rendering limitations • Unexpected content scenarios

  20. Cross-Reference Ranges • Bibliographic • Figure
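
To make the range problem concrete: a citation printed as "3–6" names only its endpoints, yet every reference in between should stay linkable after conversion. The following is a minimal Python sketch of one way to handle that, not a description of what OSA or DCL actually implemented. The function names, the "b" id prefix, and the choice to list all targets in a single rid attribute are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

def expand_ref_range(first: int, last: int, prefix: str = "b") -> List[str]:
    """Return every target id covered by a range, endpoints included."""
    if last < first:
        raise ValueError("range end precedes range start")
    return [f"{prefix}{n}" for n in range(first, last + 1)]

def range_xref(first: int, last: int, ref_type: str = "bibr",
               prefix: str = "b") -> ET.Element:
    """Build one <xref> whose rid lists all targets; the visible text stays a range."""
    rid = " ".join(expand_ref_range(first, last, prefix))
    xref = ET.Element("xref", {"ref-type": ref_type, "rid": rid})
    xref.text = f"{first}\u2013{last}"  # en dash between the endpoints
    return xref

# Bibliographic range (hypothetical ids b3..b6)
print(ET.tostring(range_xref(3, 6), encoding="unicode"))
# Figure range (hypothetical ids f1..f2)
print(ET.tostring(range_xref(1, 2, ref_type="fig", prefix="f"), encoding="unicode"))
```

The same helper covers both bibliographic and figure ranges by changing ref_type and the id prefix; a real conversion would take the prefix from the ids actually assigned in the article.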

  21. Rendering Limitations • No CSS support for table character alignment [Figure: PDF rendering vs. HTML rendering]
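
As an illustration of what character alignment means for a table column, the sketch below pads cell values with spaces so that the alignment character (a decimal point here) falls in the same position in every row. This is only one possible workaround and assumes a monospaced display. The slides do not say how OSA and DCL dealt with the limitation, and the function name align_on_char is hypothetical.

```python
def align_on_char(values, char="."):
    """Pad strings with spaces so the first `char` lines up in every row."""
    split = [v.partition(char) for v in values]          # (before, sep, after)
    left = max(len(before) for before, _, _ in split)
    right = max(len(sep) + len(after) for _, sep, after in split)
    return [before.rjust(left) + (sep + after).ljust(right)
            for before, sep, after in split]

for row in align_on_char(["1.5", "12.25", "300", "0.125"]):
    print(repr(row))
```

Values without the alignment character, such as "300", are right-aligned against the column where the character would sit.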

  22. Unexpected Content Scenarios • Missing text - Printed page problems

  23. Unexpected Content Scenarios (cont.) • Jumping pages

  24. Unexpected Content Scenarios (cont.) • Special characters with no corresponding Unicode

  25. Unexpected Content Scenarios (cont.) • Non-standard Structure: <body><boxed-text><sec><title>Optical Activities in Industry</title><p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p><p><inline-graphic xlink:href="ao-8-4-792-i001"/></p></sec></boxed-text>

  26. Unexpected Content Scenarios (cont.) • White space filler

  27. 4) Keep Checking Quality – Don’t Get Too Far Ahead • Visual review • OSA Schematron • Reporting stylesheets • OCR and hyphenation spellchecker software • QA software • Learning databanks

  28. Visual Review • Correct entities are used • Math displays correctly • Table alignment is accurate • Images correspond to the source

  29. OSA Schematron • The Schematron includes over 300 checks • Warning:ALERT[LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report] • Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report] • ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
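
The logic behind a rule such as count(article-title) > 1 can be restated outside Schematron. The Python sketch below is a loose equivalent of that single assertion, written only to show what the check tests; it is not the OSA Schematron itself. The sample XML, the function name, and the set of citation element names are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented sample: ref b14 carries two article titles, ref b15 carries one.
SAMPLE = """<article><back><ref-list>
<ref id="b14"><mixed-citation publication-type="journal">
<article-title>First title</article-title>
<article-title>Second title</article-title>
</mixed-citation></ref>
<ref id="b15"><mixed-citation publication-type="journal">
<article-title>Only title</article-title>
</mixed-citation></ref>
</ref-list></back></article>"""

# Citation containers in NLM 2.3 (<citation>) and JATS (<mixed-/element-citation>)
CITATION_TAGS = ("citation", "mixed-citation", "element-citation")

def refs_with_multiple_titles(root):
    """Yield the id of each <ref> whose citation has more than one <article-title>."""
    for ref in root.iter("ref"):
        for child in ref:
            if child.tag in CITATION_TAGS and len(child.findall("article-title")) > 1:
                yield ref.get("id")

root = ET.fromstring(SAMPLE)
for ref_id in refs_with_multiple_titles(root):
    print(f"ERROR: ref '{ref_id}': journal citation contains more than one article-title")
```

In practice such rules stay in Schematron, where the XPath test, the message, and the severity sit side by side; the Python form is useful mainly for explaining a rule to non-XML colleagues.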

  30. DCL QA Software • Highlight any discrepancies between the specifications and the tagging • Identify suspicious start of a paragraph • Flag missing external files associated with the XML • Find missing cross references to specified structures such as Tables and Figures
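
The last check on this slide, finding figures and tables that nothing in the text points to, can be sketched in a few lines of Python. This illustrates the idea only and makes no claim about how DCL's QA software works. The function name uncited_objects and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET

def uncited_objects(root, tags=("fig", "table-wrap")):
    """Return ids of figures/tables that no <xref> in the document points to."""
    cited = set()
    for xref in root.iter("xref"):
        # rid may hold several space-separated ids
        cited.update((xref.get("rid") or "").split())
    return [el.get("id")
            for tag in tags
            for el in root.iter(tag)
            if el.get("id") not in cited]

# Invented sample: f2 is never referenced from the text.
SAMPLE = """<article><body>
<p>See <xref ref-type="fig" rid="f1">Fig. 1</xref> and
<xref ref-type="table" rid="t1">Table 1</xref>.</p>
<fig id="f1"/><fig id="f2"/><table-wrap id="t1"/>
</body></article>"""

print(uncited_objects(ET.fromstring(SAMPLE)))
```

A flagged id is a prompt for review rather than proof of an error, since some legacy articles contain figures that were never cited in print either.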

  31. Hyphenation Spellchecker

  32. Reporting Stylesheets • Provide easier review of metadata components for a set of articles

  33. OCR Tools • Modified versions of fonts, designed to help distinguish similar-looking characters (“O” vs “0”, “Z” vs “2”, “1” vs “l”), used in the proofreading phase
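
The modified fonts help a human proofreader see these confusions. A software pass can complement that by listing tokens worth a second look. The Python sketch below is a different technique from the one on the slide and is not described in the presentation: it flags any token that mixes letters and digits and contains one of the look-alike characters. The function name and the flagging rule are assumptions.

```python
import re

# Look-alike characters from the pairs named on the slide: O/0, Z/2, 1/l
CONFUSABLE = set("O0Z21l")

def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a look-alike character."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_alpha and has_digit and any(c in CONFUSABLE for c in token):
            hits.append(token)
    return hits

print(suspicious_tokens("The 1ens has a focal length of 2OO mm at 632 nm"))
```

The rule is deliberately broad, so legitimate strings such as chemical formulas will also be flagged; the output is a candidate list for the proofreader, not an automatic correction.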

  34. Learning Databanks • Ongoing updates made based on feedback and newly determined rules and structures • Conversion software • QA software • Schematron • Spellchecker and hyphenation software • Editorial guidelines • Image creation

  35. Conclusions OSA has nearly completed a large backfile conversion project in close coordination with DCL. The project, which is based around NLM markup, has allowed OSA to enhance its publishing platform, build derivative products, and significantly improve its ability to gather business intelligence from a deep journal backfile. We offer the following lessons learned: • With large content projects, plan ahead but prepare to work in an agile fashion • The content owner should stay engaged throughout the project to align real-time decisions with business aims • Owner–vendor collaboration—when the right partners are involved—improves morale, attention to detail, and decision-making

  36. Scott Dineen, Sr. Director Publishing Production & Technol., The Optical Society, sdinee@osa.org • Devorah Ashlem, Senior Project Manager, Data Conversion Laboratory, dashlem@dclab.com
