Extracting Math from PostScript Documents




  1. Extracting Math from PostScript Documents
     Michael Yang, Univ. Calif., Irvine
     Richard Fateman, Univ. Calif., Berkeley
     ISSAC-2004

  2. Why Extract Math from Documents?
     • The current and recent past publications of scholarly journals in mathematics are not adequately indexed.
     • Imagine a query: “Find papers that involve this differential equation: $x^2 y'' + x y' + (x^2 - m^2) y = 0$”
     • Or: “Is there a common name for this equation?” [Ans: yes, Bessel’s]

  3. Why Extract Math from Documents?
     • Find papers that may be relevant to a formula or a proof of a related theorem.
     • Find out if a discovery is actually novel or a rediscovery of a previous result.
     • Even: Is this formula true?

  4. How can we search, anyway?
     • Search in integral tables using hashing and flexible pattern matching.
     • Example: TILU (Fateman, Einwohner)
     • The general problem looks like a huge challenge of unification with simplifications of analytic functions. Is $a = f(b)$ the same as $f^{-1}(a) = b$?

  5. These are obviously hard questions
     • But we are much better off if we can start with a few decades of the most recent math papers and their formulas to search.
     • Prerequisite: encoding of formulas with semantic markup, the point of this paper.

  6. Why start with PostScript or PDF?
     • We have many papers online, including math journals, some of them free, with essentially all markup removed, stored for printing as PS or PDF.
     • Automating the insertion of markup, even if only partly successful, can enable further work on indexing and searching for math.

  7. Is this easier or harder than OCR?
     • It should be easier, because all the characters are known as error-free glyphs.
     • OCR tends to make erroneous symbol identifications if there is inadequate word-based context.
     • For example o0O°º, 1lI|!i , Illinois (!), -_=
     • Well-known sources of PS provide stereotypes for the font/glyph/location mapping.
     • But it could be harder if the PostScript is truly obscure (PS is Turing equivalent, after all).

  8. An Example
     From a paper by Cyril Banderier et al., “Random Maps, Coalescing Saddles, Singularity Analysis, and Airy Phenomena,” Random Structures and Algorithms 19 (3-4), 194-246 (2001), only slightly edited by inserting newlines. [explain origin]
     ....0.002 0.0025 200 400 600 800 1000 k Figure 3. Left: The standard Airy distribution. Right: Observed frequencies of core sizes k 2 [20; 1000] in 50,000 random maps of size 2,000, showing the bimodal character of the distribution. variety of integral or power series representations including (see [1, 45]) 1) Ai(z) 1 2 Z 1 1 e i(zt t 3 =3) dt = 1 3 2=3 1 X n=0 3 1=3 z n ( n 1) 3) n sin 2(n 1) 3 : Equipped with this de nition, we present the main character of the paper, a probability distribution closely related to the Airy function. De nition 1. The standard ....

  9. What is this really?
     In this particular case, extraction of the document image shows two formulas in the middle of the quoted passage:

  10. How could we encode this image?
      Recognize the characters on the page as equivalent to a TeX expression, for example:
      $$\mbox{Ai}(z) = {1\over{2 \pi}}\int _{-\infty}^{+\infty} e^{i(zt+t^3/3)}dt$$
      $$~~= {1 \over {\pi 3^{2/3}}}\sum_{n=0}^\infty (3^{1/3}z)^n {{\Gamma((n+1)/3)} \over {n!}} \sin {{2(n+1)\pi}\over 3}.$$
      or some alternative in MathML or OpenMath. What are the barriers to getting to this point?

  11. Detecting Math in the first place
      • Look for changes in font, italics, font size, and baseline.
      • Consider the density of text (formulas are low density).
      • Notice the presence of special characters unusual in text: = is common in math, but not in text (also +, -, parens).
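
The special-character cue from the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the character set `MATH_CHARS`, the helper names `math_score` and `looks_like_math`, and the 0.25 threshold are hypothetical and not taken from the authors' system, and the font, baseline, and density cues are left out.

```python
# Characters that are common in formulas but rare in running prose.
# This set is an illustrative assumption, not the authors' list.
MATH_CHARS = set("=+-()<>^_/|")


def math_score(line: str) -> float:
    """Return the fraction of non-space characters that look mathematical."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if c in MATH_CHARS or c.isdigit())
    return hits / len(chars)


def looks_like_math(line: str, threshold: float = 0.25) -> bool:
    """Flag a line as probable math when its score reaches the (assumed) threshold."""
    return math_score(line) >= threshold
```

A character count alone will misfire on prose such as citation lists or hyphenated text, which is presumably why the slide lists it alongside font and baseline cues rather than on its own.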

  12. Implementation
      • Run PostScript through a modified Ghostscript (PS interpreter) to output text file information suitable for geometric/math processing.
      • Run this file through previously developed OCR-based technology (in Lisp) for using bounding boxes, contents, positions, … to create a geometric 2-D “relative position” tree. Process further to identify semantic relationships if possible and output a hierarchical tree representation of math formulas.
      • Convert this to TeX (could be MathML equally well).
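
One ingredient of the geometric step, deciding whether a glyph sits above, below, or on the current baseline, might look like the following Python sketch. It is not the authors' Lisp code, which the slides do not show. The `Glyph` record, the `relation` and `to_tex` helpers, and the 0.15 tolerance are hypothetical, and only single-glyph scripts attached to the preceding on-line glyph are handled.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float         # left edge of the bounding box
    baseline: float  # baseline height; larger means higher on the page
    size: float      # font size in points


def relation(base: Glyph, cur: Glyph, tol: float = 0.15) -> str:
    """Classify cur relative to base as 'super', 'sub', or 'inline'."""
    shift = (cur.baseline - base.baseline) / base.size
    if cur.size < base.size and shift > tol:
        return "super"
    if cur.size < base.size and shift < -tol:
        return "sub"
    return "inline"


def to_tex(glyphs: list[Glyph]) -> str:
    """Emit a TeX string for glyphs read left to right (assumed simplification)."""
    out = []
    base = None
    for g in sorted(glyphs, key=lambda g: g.x):
        rel = "inline" if base is None else relation(base, g)
        if rel == "super":
            out.append("^{" + g.char + "}")
        elif rel == "sub":
            out.append("_{" + g.char + "}")
        else:
            out.append(g.char)
            base = g  # only on-line glyphs become the new reference
    return "".join(out)
```

For instance, an `x` at size 10 followed by a smaller, raised `2` would come out as `x^{2}`. Fractions, integral limits, and nested scripts need a real two-dimensional tree rather than this left-to-right pass.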

  13. Possible Future Work
      • Better font tools.
      • Look at more producers of PS (not just TeX and dvips), e.g. Acrobat Distiller.
      • Run some tests (NEC) to see if we can extract sufficient formulas to add to the indexing information.
      • Examine the issue of “formula similarity”, e.g. parameter substitution, simplification, rearrangement. (Relatively easy in the context of integration because there is a designated variable of integration.)
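
As a small illustration of the parameter-substitution part of “formula similarity”, the sketch below renames symbols in order of first appearance, so that two formulas differing only in variable names yield the same key for hashing. The nested-tuple expression format and the function name `canonical` are assumptions made for this example. Simplification and rearrangement are not addressed.

```python
def canonical(expr, names=None):
    """Rename symbols to v0, v1, ... in order of first appearance.

    expr is a nested tuple (operator, arg1, arg2, ...) whose leaves are
    symbol names (str) or numbers. This format is an assumed toy encoding.
    """
    if names is None:
        names = {}
    if isinstance(expr, str):
        if expr not in names:
            names[expr] = f"v{len(names)}"
        return names[expr]
    if isinstance(expr, tuple):
        op, *args = expr
        return (op, *(canonical(a, names) for a in args))
    return expr  # numbers are left unchanged


# a*x + b and p*t + q differ only by renaming, so their keys agree.
key1 = canonical(("+", ("*", "a", "x"), "b"))
key2 = canonical(("+", ("*", "p", "t"), "q"))
```

Because the result is a tuple of hashable items, it can be used directly as a dictionary key in a lookup table.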

  14. Conclusions
      • It’s possible to automatically revisit previously typeset documents and invent plausible versions of TeX source code for some, perhaps much, of published TeX.
      • This adds a link to a chain which may eventually lead to more widespread semantic encoding of math for indexing and retrieval.
      • Given the difficulties, a better route for the future is to have authors or editors use semantic mark-up for “born digital” mathematical documents. Publishers should encourage this kind of work, although standards are currently disappointing.

  15. Another paper, not included
      • Submitted to ISSAC-2004
      • Author: R. Fateman

  16. Rational Function Computing with Poles and Residues
      • Here’s the idea: consider 2 forms for the same rational expression.

  17. Which form is better?
      • Generality of representation
      • Complexity (cost) of operations
      • Arithmetic (+, *, /)
      • Integration, derivatives, limits, series, …
      • Numerical evaluation
      • Display for human viewing

  18. Keep constant numerators over (powers of) linear denominators (+ polynomial)
      • Works for encoding arbitrary rational functions (over complex numbers) in one variable.
      • Plausibly requires high-precision floats if you start with a ratio of polynomials where the roots of the denominator cannot be expressed as exact rational numbers.
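
Written out, the representation described on this slide is the partial-fraction form below. The symbols $p$, $a_i$, $n_i$ and $c_{i,k}$ are notation chosen here for illustration, and the two worked examples are not taken from the slides.

$$f(x) = p(x) + \sum_{i=1}^{m} \sum_{k=1}^{n_i} {c_{i,k} \over (x - a_i)^k}$$

With rational poles the form is exact:

$${2x \over x^2 - 1} = {1 \over x - 1} + {1 \over x + 1}$$

With irrational poles, the poles and residues generally have to be approximated numerically, which is where high-precision floats come in:

$${1 \over x^2 - 2} = {1 \over 2\sqrt{2}} \left( {1 \over x - \sqrt{2}} - {1 \over x + \sqrt{2}} \right)$$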

  19. PRO: Once you have this representation
      • Addition of rational functions is essentially free compared to the standard representation, since no polynomial GCD is required.
      • a/b + c/d is already simplified except for sorting and the possibility that b=d.
      • Multiplication of rational functions is also inexpensive; again, no GCD is needed.
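
To make the cost argument concrete, here is a minimal Python sketch restricted to simple (first-order) poles with no polynomial part. The `{pole: residue}` dictionary layout and the function names `add`, `mul`, and `evaluate` are assumptions for this example, not the paper's data structure. Multiplication uses the identity $1/((x-p)(x-q)) = (1/(p-q))\,(1/(x-p) - 1/(x-q))$ for $p \ne q$.

```python
from collections import defaultdict


def add(f, g):
    """Add two functions stored as {pole: residue}; equal poles are merged."""
    out = defaultdict(complex)
    for terms in (f, g):
        for pole, res in terms.items():
            out[pole] += res
    return {p: r for p, r in out.items() if r != 0}


def mul(f, g):
    """Multiply two simple-pole functions whose poles are all distinct."""
    out = defaultdict(complex)
    for p, r in f.items():
        for q, s in g.items():
            if p == q:
                # A shared pole would create a second-order term,
                # which this simplified sketch does not represent.
                raise ValueError("repeated pole needs higher-order terms")
            c = r * s / (p - q)
            out[p] += c
            out[q] -= c
    return {p: r for p, r in out.items() if r != 0}


def evaluate(f, x):
    """Numerically evaluate the sum of residue / (x - pole)."""
    return sum(r / (x - p) for p, r in f.items())
```

Neither operation calls a polynomial GCD: addition is a dictionary merge, and multiplication is one division per pair of terms. With floating-point poles, the exact equality test `p == q` would need to become a tolerance check.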

  20. CON: Do you want to use this representation?
      • Division is not fast, so it is more appropriate if division is infrequent.
      • If the input is not already in residue/pole form, or if you have to do division, finding zeros introduces approximations [maybe for the first time in a problem].
      • Output forms may look longer.

  21. Examples
      • Ordinary addition: orders of magnitude faster, e.g. 45,000 times faster.
      • Ordinary multiplication: maybe 2× faster.
      • What about mixtures of + and * together? What important algorithms are there?
      • Sparse determinant calculation.

  22. A determinant benchmark
      • Consider matrices with entries of this form:
      • Determinant of an 8×8 matrix in Macsyma 2.4, on a 2.6 GHz Pentium 4 computer.
      • Using Gaussian Elimination: 112 sec
      • Using Minor Expansion: 109 sec
      • Using Residues/Poles (75% in bignum arithmetic): 41 sec
      • Using Residues/Poles and double-floats: 1.6 sec

  23. Conclusions
      • No surprise that avoiding GCDs is a winner.
      • Using approximate calculations can provide huge speedups. Do we really need exact computation everywhere we provide it?
      • We have a potential application for high-precision zero-finding, as well as non-overflowing software floats (GMP, ARPREC).
