Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:
Unrivaled Historical Information Meets Modern Technology

M. Brändle (ETH Zürich), V. Eigner-Pitto (InfoChem GmbH)


Historical Importance of Chemisches Zentralblatt

  • Chemisches Zentralblatt (1830-1969)

    • First and oldest abstracts journal in chemistry

    • Covers chemical literature from 1830 to 1969

    • Describes the "birth" of chemistry as a science (vs. alchemy)

  • Chemical Abstracts (1907 onward; timeline marker at 1840)

    • Largest and only abstracts source in chemistry

    • Currently >31 million papers and patents

    • Content for 1840-1906 added retrospectively

  • Gmelin Handbook (1817 onward; timeline marker at 1772)

  • Beilstein Handbook (1881 onward; timeline marker at 1771)


Chemisches Zentralblatt: Content

  • Covers 140 years of chemistry

  • About 3.6 million abstracts

    • journal articles

    • patents

  • 900,000 pages (115,000 for the period 1830-1906)

    • 700,000 pages with abstracts

    • 200,000 pages of indexes ("Register")

      • Author (from 1830)

      • Subject

        • alphabetic (from 1830)

        • systematic (from 1863)

      • Patent (from 1897)

      • Formula (from 1925)

      • General indexes (from 1883)


History of Chemisches Zentralblatt: Rise

  • 1830: "Pharmaceutisches Central-Blatt": 403 abstracts, 544 pages, 10 journals; weekly after 8 months

  • 1850: Title changes to "Chemisch-Pharmaceutisches Central-Blatt"

  • 1856: "Chemisches Central-Blatt"

  • 1864: Introduction of a systematic table of contents → classification of chemistry

  • 1879: First patent abstracts in "kleinen Mittheilungen"

  • 1883: 1st edition of General Index

  • 1884: In-text images

  • 1888: 273 journals excerpted


History of Chemisches Zentralblatt: Prosperity

  • 1897: Ownership passes to Deutsche Chemische Gesellschaft for DM 15,000. Introduction of patent index.

  • 1901: Editorial office moves from Leipzig to Berlin.

  • 1919: Takes over abstracts from Angew. Chem. Split into a scientific part (I/III) and a technical part (II/IV).

  • 1921: Begins to cover foreign patents.

  • 1924: CZ is reunified into one journal of abstracts.

  • 1925: Introduction of formula index.

  • 1929: Centennial: Richard Willstätter emphasizes "timeliness, exactness, completeness" as attributes and quality requirements of CZ.

[Timeline marker between 1901 and 1919: CA]


History of Chemisches Zentralblatt: Decline

  • 1940-1945: WW II: difficulties in collecting information. 1944 bombing of the editorial office.

  • 1947-1949: Double production of CZ in East and West Germany.

  • 1950: Reunification of CZ under East and West German organisations.

  • 1954: Attempt to fill the gap with supplement volumes.

  • 1961: The Berlin Wall does not hinder production.

  • 1967: Introduction of SRD (Schnellreferatedienst, quick abstract service) for organic chemistry.

  • 1969: The GDR office declares itself unable to afford production of the SRD and of the journal. CZ ceases publication. SRD continued as "Chemischer Informationsdienst" (ChemInform).

[Diagram labels: "Pages", "Editorial Office East Berlin", "Editorial Office West Berlin", "CA"]


Chemisches Zentralblatt vs. CA: Quantity

[Charts: abstracts and pages, Chemisches Zentralblatt vs. CA; annotations: WW I, WW II, CA format change]


Chemisches Zentralblatt vs. CA: Quality

  • Many textbooks on chemical literature rate Chemisches Zentralblatt above CA in quality for the pre-WW II period

    • H. Skolnik, The literature matrix of chemistry, 1982: "outstanding A/I service"

    • R.E. Maizell, How to find chemical information, 3rd ed. 1998, citing E.J. Crane: "[..] has value because of [..] good abstracts"

    • M. Mücke, Die chemische Literatur, 1982: "Although CA was numerically [..] superior to Chemisches Zentralblatt, the reverse was true regarding the quality of the abstracts." (translated from German)

    • R.T. Bottle, J.F. Rowland, Information Sources in Chemistry, 4th ed. 1993: "Before WW II, many chemists regarded CZ as superior in coverage to CA; its abstracts were longer and more informative [...]"

  • A.S.K. Atsu, Comparative coverage of chemical abstracting services in the period 1906-1940, M.Sc. thesis, City University, London (1976)


Example: Hans Fischer, Georg Stangler, Synthese des Mesoporphyrins, Mesohämins und über die Konstitution des Hämins, Justus Liebigs Ann. Chem. 459 (1927), 53-98.


Chemisches Zentralblatt: Digitalization

  • Relevant for documentation of prior art

  • Continuous and growing demand for the information

  • FIZ Chemie Berlin has scanned the whole work and offers a full-text searchable database for the web, as well as the dataset for integration into intranets

  • ETH Zurich has bought the digitalized raw material (PDFs with OCRed text in the background) from FIZ and is creating a database offering full-text search

    • 900,000 PDF pages, 1.3 TB

    • Raw text content incl. search index: about 10 GB

  • CAS has performed automatic translation (German → English) of the 1897-1907 volumes and included them in CAplus


Reasons for buying digitalized Chem. Zentralblatt

www.infochembio.ethz.ch/en/holdings.html


  • Space

    • Loss of compact shelving space in basement (432 m → 194 m, -55%)

    • Disposal of printed Beilstein, CA, Chem. Zentralblatt

  • Access

    • e-books, e-journals, end-user databases at the chemist's workbench

    • Chemists are accustomed to electronic sources; print and microfilm are cumbersome

  • Restoration costs due to deterioration of acid-containing paper

    • 17 K€/t for deacidification: Chem. Zentralblatt, 1.6 t → 27 K€

    • Digitalization and operation costs are much higher (10x), but can be shared

  • Ease of use : Search / Browse / Print
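
The two figures in the list above can be re-derived from the numbers quoted on the slide. The following is only an arithmetic check; the variable names are illustrative and not taken from the presentation.

```python
# Figures quoted on the slide; only the arithmetic is reproduced here.
shelf_before_m = 432          # compact shelving before
shelf_after_m = 194           # compact shelving after
deacid_cost_keur_per_t = 17   # deacidification cost per tonne, in K€
cz_mass_t = 1.6               # mass of the printed Chem. Zentralblatt, in tonnes

shelf_change_pct = (shelf_after_m - shelf_before_m) / shelf_before_m * 100
deacid_cost_keur = deacid_cost_keur_per_t * cz_mass_t

print(f"Shelving change: {shelf_change_pct:.0f}%")          # -55%
print(f"Deacidification cost: {deacid_cost_keur:.1f} K€")   # 27.2 K€
```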


Quality of Obtained Raw Data

  • Errors upon conversion

  • Visual inspection of pages: Cover Flow / Quick Look technology


Quality of Raw Data Observed: Page Errors

  • File errors (conversion)

    • Unreadable directories (missing content)

    • Defective PDF files (missing content)

  • Errors during scanning (visual inspection)

    • Duplicate pages (shifting page index)

    • Missing pages (shifting page index, missing content)

    • Issues scanned in wrong order (minor)

    • Two pages on one (shifting page index)

    • Wrong volume (missing content)
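
The presentation describes finding these errors by visual inspection. As a complementary sketch, duplicate and missing pages could also be flagged automatically, assuming the printed page number of each scan is known (for example from OCR of the page header). The function name and the input format are hypothetical.

```python
def check_page_sequence(printed_pages):
    """Return (duplicates, missing) for printed page numbers given in scan order."""
    duplicates = []
    missing = []
    previous = None
    for page in printed_pages:
        if previous is not None:
            if page == previous:
                duplicates.append(page)          # same page scanned twice
            elif page > previous + 1:
                missing.extend(range(previous + 1, page))  # gap in the sequence
        previous = page
    return duplicates, missing


# Hypothetical scan order with one duplicate and one two-page gap.
scan = [101, 102, 102, 103, 106, 107]
duplicates, missing = check_page_sequence(scan)
print(duplicates)  # [102]
print(missing)     # [104, 105]
```

This sketch does not handle issues scanned in the wrong order or two pages on one scan; those would need additional rules.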


Quality of Raw Data Observed: OCR

  • ETH works with OCR from FIZ Chemie

    • page → word index, 346 million "words"

    • 8.8% with only 1 character

      • slightly expanded fonts, e.g. for author names, sum formulas

      • Abbreviations (journal names, Zentralblatt = C), numbers

      • element names in structure formulas
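
A statistic such as the share of one-character "words" can be computed directly from a page → word index. The sketch below assumes the index is available as a mapping from page identifier to a list of OCR tokens; that structure and the sample data are assumptions, not the ETH format.

```python
from collections import Counter


def single_char_share(page_words):
    """page_words maps page id -> list of OCR tokens; returns the share of 1-character tokens."""
    lengths = Counter(len(word) for words in page_words.values() for word in words)
    total = sum(lengths.values())
    return lengths[1] / total if total else 0.0


# Hypothetical two-page sample: 3 of 8 tokens have a single character.
sample = {
    "p1": ["C", "1904", "II", "Natrium"],
    "p2": ["H", "O", "Säure", "Chlorür"],
}
print(f"{single_char_share(sample):.1%}")  # 37.5%
```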


Planned Tasks ETH Zürich

  • Add navigation structure; provide database search and browsing for ETH members (Q4/09)

  • Mining and Markup (Q1/10)

    • Bibliographic references

    • Authors

    • General Subject Headings

  • Reference linking to journal articles and patents (Q1/10)


Chemisches Zentralblatt: Conclusion

  • Covers chemical literature from 1830 to 1969

  • Very good abstract quality

    • Better quality (length, details) than CA for pre-WW II period 1907-1940

  • Also contains important patent information

  • Invaluable information in indexes (e.g. synonyms of ancient chemical names)

  • Only comprehensive abstract journal on the market up to 1907

    • More comprehensive than CA for 19th century literature

    • Complements Beilstein and Gmelin handbooks for 19th century literature


Importance of Chemisches Zentralblatt: Example

Org. Lett., 2006, 8 (19), pp 4279–4281

The authors retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139)

Chemisches Zentralblatt, 1904, 2, 1145


InfoChem Motivation

  • Text search in Chemisches Zentralblatt:

    • Abstracts in German language

    • High number of old German chemical names

  • Chemists think in structures!!!

  • Language-independent structure search would help ALL scientists to access this historical source and to use the relevant information it contains

  • Required technology for structure search projects

  • Optimized German-English dictionaries

  • 30 million SPRESI names


Overview of Approach and Applied Technology

[Workflow diagram. Labels: .tiff documents; OCR; PDF documents (text under image); SPRESI dictionaries; NER, N2S; ICANNOTATOR; manual abstraction of sample set for evaluation; comparison (quantitative); database; combined search on federated search system (ICFEDSEARCH); link to original literature]


Challenges OCR (1)

[Figure: examples from 1830, 1870, 1910, 1930 and 1969]


Challenges OCR (2)

  • Poor quality of the original source: dirty (blotted, stained) pages

  • Print showing through from the back of the page


Challenges OCR (3)

  • Tables: extremely small fonts, beginning and end of columns not recognizable


Challenges OCR (4)

  • Ambiguous old fonts (h=b; c=e; ligatures)

  • Spaced text

Specific rules, large German dictionaries and extensive training are applied to correct systematic mistakes of the standard OCR process
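
As an illustration of how a rule plus a dictionary can correct a systematic confusion such as h=b or c=e, the sketch below tries all swaps of confusable letters and accepts the first variant found in a word list. It is a minimal example under stated assumptions, not the procedure used in the project; `correct_token`, `CONFUSABLE` and `MAX_AMBIGUOUS` are hypothetical names.

```python
from itertools import product

# Confusable pairs named on the slide (h=b, c=e); ligatures are not modelled here.
CONFUSABLE = {"h": "hb", "b": "bh", "c": "ce", "e": "ec"}
MAX_AMBIGUOUS = 12  # hypothetical cap: the number of candidates grows as 2**n


def correct_token(token, dictionary):
    """Return a dictionary word reachable by swapping confusable letters, else the token."""
    if token in dictionary:
        return token
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    if sum(len(option) > 1 for option in options) > MAX_AMBIGUOUS:
        return token  # too many combinations to enumerate
    for candidate in map("".join, product(*options)):
        if candidate in dictionary:
            return candidate
    return token


# Hypothetical word list; a real one would be a large German dictionary.
dictionary = {"Schwefel", "Natrium"}
print(correct_token("Sebwefel", dictionary))  # Schwefel
```

Tokens without a dictionary match are returned unchanged, so the rule never invents a word that is absent from the list.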


Challenges Annotation (1)

  • Names lack position, valence or stoichiometric information

    • Pimarsäure (pimaric acid) → is it the R or L form?

    • Platinchlorid (platinum chloride) → in which oxidation state: II, III, IV?

  • Chemical names that indicate a chemical class

    • Nitrolsäure (nitrolic acid)

    • Lactonsäure (lactonic acid) → any of several acids with a lactone ring bearing the carboxylic group

  • Mixed compounds

    • Eunole → Naphthole + Eucalyptusöl (naphthols + eucalyptus oil)

    • Pikrotoxin → Pikrotoxinin + Pikrotin (picrotoxinin + picrotin)

NO solution: correct structure information is not available in the original source


Challenges Annotation (2)

  • Obsolete German language

    • Schwefelsaures Natrium, Chlorür, Bromür

  • Historical names

    • Pelopeum → Columbium → Niobium

  • Different spellings for the same name:

    • Dibrom… = Bibrom…

    • Ätzkali = Aetzkali
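
Spelling variants and historical names of this kind can be folded onto a common lookup key before a dictionary search. The sketch below is one possible normalization under stated assumptions: the synonym table holds only the names from the slide, the list of stems after "Bi..." is a hypothetical short sample, and "Bibrombenzol" is an illustrative name rather than one taken from the source.

```python
import re

# Historical element names from the slide, mapped to the modern name.
HISTORICAL_SYNONYMS = {"pelopeum": "niobium", "columbium": "niobium"}
# Fold umlauts to digraphs so that "Ätzkali" and "Aetzkali" share one key.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
# Hypothetical, deliberately short list of stems where "Bi..." means "Di...".
BI_PREFIX = re.compile(r"^bi(?=brom|chlor|jod)")


def normalize_name(name):
    """Return a lowercase lookup key that is identical for known spelling variants."""
    key = name.strip().lower().translate(UMLAUTS)
    key = BI_PREFIX.sub("di", key)
    return HISTORICAL_SYNONYMS.get(key, key)


print(normalize_name("Ätzkali") == normalize_name("Aetzkali"))            # True
print(normalize_name("Bibrombenzol") == normalize_name("Dibrombenzol"))   # True
print(normalize_name("Columbium"))                                        # niobium
```

Folding umlauts to digraphs (rather than the reverse) avoids rewriting unrelated letter pairs such as the "ue" in "blaue".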


Solutions in Annotation Process

  • Correction of German-specific grammar

  • Translation into English of chemical names that were not yet available

  • Research in old sources:

    • Beilstein

    • Brockhaus Encyclopedia

    • German-English dictionaries of chemistry

    • Meyers Encyclopedia

    • Pierer Encyclopedia

    • References to very old books, journals, articles

      • "Naturwissenschaftliche Exzerpte und Notizen Mitte 1877 bis Anfang 1883" by Karl Marx


Results Annotation Chemisches Zentralblatt

  • 120,000 pages covering time period 1830-1907

  • 2.4 million chemical names with associated structure

    • 98,000 unique names

    • 47,000 unique structures

Quantitative comparison with manually abstracted sample set

  • Recall 51%

  • Precision 87%
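
Recall and precision follow the usual definitions: recall is the fraction of manually abstracted names that the automatic annotation found, and precision is the fraction of automatically annotated names that are correct. The slides do not state how annotations were matched against the sample set, so the sketch below assumes exact matching of (page, name) pairs; the data are invented for illustration.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for two collections of annotations."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical sample: 4 manually abstracted names, 3 automatic annotations, 2 in common.
reference = {("p1", "Ätzkali"), ("p1", "Pikrotoxin"), ("p2", "Niobium"), ("p2", "Platinchlorid")}
predicted = {("p1", "Ätzkali"), ("p2", "Niobium"), ("p2", "Natrium")}

precision, recall = precision_recall(predicted, reference)
print(f"Precision {precision:.0%}, recall {recall:.0%}")  # Precision 67%, recall 50%
```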


Federated Search Prototype


Summary

  • Described the history, content and present-day importance of Chemisches Zentralblatt

  • Illustrated how the challenges of the OCR and annotation processes have been solved

  • The period 1830-1907 contains 98,000 unique names and 47,000 unique structures

    • Quantitative comparison shows over 50% recall and nearly 90% precision

    • The generated structure-searchable Chemisches Zentralblatt database is integrated into ICFEDSEARCH


Outlook


Acknowledgements

  • Prof. Dr. Deplanque, Mr. Heineke and FIZ Chemie Team Berlin

  • Ms. Langanke

  • InfoChem Team

  • Chemistry Biology Pharmacy Information Center (ETH Zürich)

Thank you!

ETH Zürich: www.infochembio.ethz.ch

InfoChem GmbH: www.infochem.de, www.spresi.com
