Extracting Math from PostScript Documents

Extracting Math from PostScript Documents. Michael Yang, Univ. Calif., Irvine; Richard Fateman, Univ. Calif., Berkeley. ISSAC-2004

Why Extract Math from Documents? • Current and recent publications in scholarly mathematics journals are not adequately indexed. • Imagine a query: “Find papers that involve this differential equation:” $x^2 y'' + x y' + (x^2 - m^2) y = 0$ • Or “Is there a common name for this equation?” [Ans: yes, Bessel’s]

Why Extract Math from Documents? • Find papers that may be relevant to a formula or a proof of a related theorem. • Find out if a discovery is actually novel or a rediscovery of a previous result. • Even: Is this formula true?

How can we search, anyway? • Search in integral tables using hashing and flexible pattern matching. • Example: TILU (Fateman, Einwohner) • The general problem looks like a huge challenge of unification with simplifications of analytic functions. Is $a=f(b)$ the same as $f^{-1}(a)=b$?
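
As a rough illustration of why hashing alone is not enough, here is a minimal Python sketch of a lookup key that ignores the choice of parameter names. It is not TILU's actual scheme; the function name `canonical_key` and the list of reserved function names are assumptions made for this example.

```python
import re

def canonical_key(formula, keep=("sin", "cos", "exp", "log")):
    """Rename identifiers to v0, v1, ... in order of first appearance.

    Hypothetical helper: names listed in `keep` are treated as known
    functions and left alone; everything else is a renamable parameter.
    """
    names = {}

    def rename(match):
        name = match.group(0)
        if name in keep:
            return name
        # setdefault stores the new name only the first time we see `name`
        return names.setdefault(name, f"v{len(names)}")

    return re.sub(r"[A-Za-z_]\w*", rename, formula.replace(" ", ""))

# Two formulas that differ only in parameter names get the same key,
# so they would land in the same hash bucket.
key_a = canonical_key("a*sin(x)+b")
key_b = canonical_key("p * sin(t) + q")
# A mere rearrangement already defeats this key.
key_c = canonical_key("sin(x)*a + b")
```

Here `key_a` and `key_b` are equal while `key_c` differs, which is the gap that flexible pattern matching and simplification have to close.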

These are obviously hard questions • But we are much better off if we can start with a few decades of the most recent math papers and their formulas to search. • Prerequisite: encoding of formulas with semantic markup, the point of this paper.

Why start with PostScript or PDF? • We have many papers, including math journals, online, some of them free, with essentially all markup removed, stored for printing as PS or PDF. • Automating the insertion of markup, even if only partly successful, enables further work on indexing and searching for math.

Is this easier or harder than OCR? • It should be easier, because all the characters are known as error-free glyphs. • OCR tends to make erroneous symbol identifications if there is inadequate word-based context. • For example o0O°º, 1lI|!i , Illinois (!), -_= • Well-known sources of PS provide stereotypes for the font/glyph/location mapping. • But it could be harder if the PostScript is truly obscure (PS is Turing equivalent, after all)

An Example • From a paper by Cyril Banderier et al., “Random Maps, Coalescing Saddles, Singularity Analysis, and Airy Phenomena,” Random Structures and Algorithms 19(3-4), 194-246 (2001), only slightly edited by inserting newlines. [explain origin] ....0.002 0.0025 200 400 600 800 1000 k Figure 3. Left: The standard Airy distribution. Right: Observed frequencies of core sizes k 2 [20; 1000] in 50,000 random maps of size 2,000, showing the bimodal character of the distribution. variety of integral or power series representations including (see [1, 45]) 1) Ai(z) 1 2 Z 1 1 e i(zt t 3 =3) dt = 1 3 2=3 1 X n=0 3 1=3 z n ( n 1) 3) n sin 2(n 1) 3 : Equipped with this de nition, we present the main character of the paper, a probability distribution closely related to the Airy function. De nition 1. The standard ....

What is this really? In this particular case, extraction of the document image shows two formulas in the middle of the quoted passage.

How could we encode this image? Recognize the characters on the page as equivalent to a TeX expression, for example: $${\mbox{Ai}}(z) = {1\over{2 \pi}}\int _{-\infty}^{+\infty} e^{i(zt+t^3/3)}dt$$ $$~~= {1 \over {\pi 3^{2/3}}}\sum_{n=0}^\infty (3^{1/3}z)^n {{\Gamma((n+1)/3)} \over {n!}} \sin {{2(n+1)\pi}\over 3}.$$ or some alternative in MathML or OpenMath. What are the barriers to getting to this point?

Detecting Math in the first place • Look for changes in font, italics, font size, and baseline. • Consider the density of text (formulas are low density). • Notice the presence of special characters unusual in text: = is common in math but not in text (also +, -, parentheses).
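
The heuristics above can be sketched as a small scoring function. This is only an illustration, not the authors' detector: the `Glyph` record, the font names, and the thresholds are all assumptions, and the text-density cue is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    text: str        # the character drawn
    font: str        # font name reported by the interpreter
    size: float      # font size in points
    baseline: float  # y coordinate of the baseline

MATH_SYMBOLS = set("=+-()")

def math_signals(line, body_font, body_size):
    """Count how many heuristic cues fire for one line of glyphs (0 to 3)."""
    n = len(line)
    if n == 0:
        return 0
    font_changes = sum(1 for g in line
                       if g.font != body_font or g.size != body_size)
    baselines = {round(g.baseline, 1) for g in line}
    symbols = sum(1 for g in line if g.text in MATH_SYMBOLS)

    signals = 0
    if font_changes / n > 0.2:   # assumed threshold: many non-body glyphs
        signals += 1
    if len(baselines) > 1:       # raised or lowered characters
        signals += 1
    if symbols / n > 0.1:        # assumed threshold: math-like punctuation
        signals += 1
    return signals

# Hypothetical input: plain text versus a fragment like "Ai(z)=" plus a raised 3.
text_line = [Glyph(c, "CMR10", 10, 100.0) for c in "Equipped"]
math_line = [
    Glyph("A", "CMR10", 10, 100.0), Glyph("i", "CMR10", 10, 100.0),
    Glyph("(", "CMR10", 10, 100.0), Glyph("z", "CMMI10", 10, 100.0),
    Glyph(")", "CMR10", 10, 100.0), Glyph("=", "CMR10", 10, 100.0),
    Glyph("3", "CMR7", 7, 104.0),
]
text_score = math_signals(text_line, "CMR10", 10)
math_score = math_signals(math_line, "CMR10", 10)
```

A line where two or three cues fire would be passed on as a formula candidate; a line where none fire is ordinary text.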

Implementation • Run PostScript through a modified Ghostscript (a PS interpreter) to output a text file of information suitable for geometric/math processing. • Run this file through previously developed OCR-based technology (in Lisp) that uses bounding boxes, contents, positions, … to create a geometric 2-D “relative position” tree. Process further to identify semantic relationships if possible, and output a hierarchical tree representation of math formulas. • Convert this to TeX (could equally well be MathML).
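
To make the “relative position” step concrete, here is a minimal Python sketch that classifies one bounding box relative to another. It is not the authors' Lisp code; the `Box` record, the relation names, and the 25% offset threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    x0: float  # left
    y0: float  # bottom (y grows upward, as in PostScript user space)
    x1: float  # right
    y1: float  # top

def relation(base, other):
    """Classify where `other` sits relative to `base`."""
    height = base.y1 - base.y0
    base_mid = (base.y0 + base.y1) / 2
    other_mid = (other.y0 + other.y1) / 2
    if other.x0 >= base.x1:
        # Entirely to the right: compare vertical centers.
        if other_mid > base_mid + 0.25 * height:   # assumed threshold
            return "superscript"
        if other_mid < base_mid - 0.25 * height:   # assumed threshold
            return "subscript"
        return "right"
    # Horizontally overlapping: stacked layouts such as limits or fractions.
    if other.y0 >= base.y1:
        return "above"
    if other.y1 <= base.y0:
        return "below"
    return "overlap"

# Hypothetical boxes: a base "e", a raised "i", a lowered "n", and a following "d".
e = Box("e", 0, 0, 5, 5)
rel_sup = relation(e, Box("i", 5, 3, 7, 6))
rel_sub = relation(e, Box("n", 5, -2, 7, 1))
rel_right = relation(e, Box("d", 6, 0, 11, 5))
```

Applying such a test pairwise, left to right, gives the edges of a position tree that a later pass can read as exponent, index, or juxtaposition.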

Possible Future Work • Better font tools • Look at more producers of PS (not just TeX and dvips), e.g. Acrobat Distiller. • Run some tests (NEC) to see if we can extract sufficient formulas to add to the indexing information. • Examine the issue of “formula similarity” e.g. parameter substitution, simplification, rearrangement. (relatively easy in the context of integration because there is a designated variable of integration.)

Conclusions • It’s possible to automatically revisit previously typeset documents and invent plausible versions of TeX source code for some, perhaps much, of published TeX. • This adds a link to a chain which may eventually lead to more widespread semantic encoding of math for indexing and retrieval. • Given the difficulties, a better route for the future is to have authors or editors use semantic markup for “born digital” mathematical documents. Publishers should encourage this kind of work, although standards are currently disappointing.

Another paper, not included • Submitted to ISSAC-2004 • Author: R. Fateman

Rational Function Computing with Poles and Residues • Here’s the idea: consider 2 forms for the same rational expression.

Which form is better? • Generality of representation • Complexity (Cost) of operations • Arithmetic (+, *, /) • Integration, derivatives, limits, series, … • Numerical evaluation • Display for human viewing

Keep constant numerators over (powers of) linear denominators (+ polynomial) • Works for encoding arbitrary rational functions (over complex numbers) in one variable. • Plausibly requires high-precision floats if you start with a ratio of polynomials where the roots of the denominator cannot be expressed as exact rational numbers.
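
Two small worked cases, chosen here for illustration and not taken from the slides, show the representation. The first has rational poles and a polynomial part; the second has irrational poles, which is where approximate arithmetic enters.

```latex
% Rational poles at x = 1 and x = -1, plus a polynomial part x:
$$ {x^3 \over x^2-1} = x + {1/2 \over x-1} + {1/2 \over x+1} $$

% Irrational poles at x = \pm\sqrt{2}; the residues are 1/(2x) evaluated at each pole:
$$ {1 \over x^2-2} = {1/(2\sqrt{2}) \over x-\sqrt{2}} - {1/(2\sqrt{2}) \over x+\sqrt{2}} $$
```

In the second case neither the poles nor the residues are rational numbers, so a program storing this form must either carry algebraic numbers or fall back on floats of sufficient precision.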

PRO: Once you have this representation • Addition of rational functions is essentially free compared to the standard representation, since no polynomial GCD is required. • a/b + c/d is already simplified except for sorting and the possibility that b=d • Multiplication of rational functions is also inexpensive, again with no GCD needed.
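
A minimal Python sketch of why these operations are cheap, under strong simplifying assumptions: simple poles only, no polynomial part, and exact rational poles. It is not the implementation benchmarked in the talk; the dictionary layout and the function names `add`, `mul`, and `evaluate` are made up for this example.

```python
from fractions import Fraction

# A function is a dict {pole: residue}, meaning the sum of residue/(x - pole).

def add(f, g):
    """Add two pole/residue forms: merge terms, combining equal poles."""
    out = dict(f)
    for pole, res in g.items():
        total = out.get(pole, 0) + res
        if total == 0:
            out.pop(pole, None)   # the terms cancelled
        else:
            out[pole] = total
    return out

def mul(f, g):
    """Multiply two forms whose poles are all distinct.

    Uses c1/(x-a) * c2/(x-b) = k/(x-a) - k/(x-b) with k = c1*c2/(a-b).
    """
    out = {}
    for a, c1 in f.items():
        for b, c2 in g.items():
            if a == b:
                raise ValueError("shared pole needs a (x-a)^2 term; not in this sketch")
            k = c1 * c2 / (a - b)
            out = add(out, {a: k, b: -k})
    return out

def evaluate(f, x):
    """Numerically evaluate the form at x (x must not be a pole)."""
    return sum(res / (x - pole) for pole, res in f.items())

f = {Fraction(1): Fraction(1, 2)}    # (1/2)/(x - 1)
g = {Fraction(-1): Fraction(1, 2)}   # (1/2)/(x + 1)
s = add(f, g)                        # equals x/(x^2 - 1)
p = mul(f, g)                        # equals (1/4)/(x^2 - 1)
s_at_3 = evaluate(s, Fraction(3))
p_at_3 = evaluate(p, Fraction(3))
```

Neither operation calls a polynomial GCD: addition is a dictionary merge, and multiplication of distinct simple poles is one division per pair of terms. Repeated poles need higher-power terms, which this sketch deliberately leaves out.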

CON: Do you want to use this representation? • Division is not fast, so it is more appropriate if division is infrequent. • If the input is not already in residue/pole form, or if you have to do division, finding zeros introduces approximations [maybe for the first time in a problem]. • Output forms may look longer.

Examples • Ordinary addition: orders of magnitude faster, e.g. 45,000 times faster. • Ordinary multiplication: maybe 2× faster. • What about mixtures of + and * together? What important algorithms are there? • Sparse determinant calculation.

A determinant benchmark • Consider matrices with entries of this form: • Determinant of an 8×8 matrix in Macsyma 2.4 on a 2.6 GHz Pentium 4 computer. • Using Gaussian elimination: 112 sec • Using minor expansion: 109 sec • Using residues/poles (75% in bignum arithmetic): 41 sec • Using residues/poles and double-floats: 1.6 sec

Conclusions • No surprise that avoiding GCDs is a winner. • Using approximate calculations can provide huge speedups. Do we really need exact computation everywhere we provide it? • We have a potential application for high-precision zero-finding, as well as non-overflowing software floats (GMP, ARPREC)
