[poppler] How to normalize MathematicalPi text?

Jeroen Ooms jeroen at berkeley.edu
Wed Mar 13 12:54:26 UTC 2019


A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:

When extracting results from scientific PDFs, math symbols sometimes
cannot be extracted because they are encoded with a custom font
called Mathematical-Pi [1]. An example of such a paper is [2]. When we
extract text via poppler::page::text(), all of the = < > α β characters
come out as seemingly random characters from Mathematical-Pi rather
than the expected Unicode symbols. Unfortunately these characters are
critical for interpreting the results, so we cannot ignore them.

I was wondering if someone has experience with normalizing text set in
custom fonts into proper Unicode?

I think what would be needed is a table that maps the Mathematical-Pi
characters to their proper Unicode values. Then we would need some
hook in poppler::page::text() that replaces the contents of text boxes
using the Mathematical-Pi font with the corresponding UTF-8 text.
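
To make the idea concrete, here is a minimal sketch in Python of the
table lookup, kept separate from poppler itself. It assumes the text
has already been split into (text, font name) pairs by some other
means; the function names and the table entries are hypothetical
placeholders for illustration, not the verified Mathematical-Pi
encoding, and a real solution would probably need one table per
variant of the font.

```python
from typing import Dict, Iterable, Tuple

# Placeholder table: these entries are ASSUMED for illustration only.
# The real mapping has to be built by inspecting the font itself.
MATHPI_TO_UNICODE: Dict[str, str] = {
    "5": "=",
    ",": "<",
    ".": ">",
    "a": "\u03b1",  # Greek small letter alpha
    "b": "\u03b2",  # Greek small letter beta
}


def is_mathpi(font_name: str) -> bool:
    """Return True if the font name looks like a Mathematical-Pi font.

    Hyphens and spaces are ignored, so a subset-prefixed name such as
    'ABCDEF+MathematicalPi-One' also matches.
    """
    normalized = font_name.lower().replace("-", "").replace(" ", "")
    return "mathematicalpi" in normalized


def normalize_spans(
    spans: Iterable[Tuple[str, str]],
    table: Dict[str, str] = MATHPI_TO_UNICODE,
) -> str:
    """Join (text, font_name) spans, remapping only the Mathematical-Pi ones.

    Characters missing from the table are passed through unchanged.
    """
    parts = []
    for text, font_name in spans:
        if is_mathpi(font_name):
            text = "".join(table.get(ch, ch) for ch in text)
        parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    spans = [
        ("p ", "Times-Roman"),
        (",", "ABCDEF+MathematicalPi-One"),
        (" .05", "Times-Roman"),
    ]
    print(normalize_spans(spans))
```

The lookup is keyed on the font name so that ordinary commas, periods
and digits in the body text are left alone; only spans set in the
custom font are remapped.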


 [1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
 [2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf

