[poppler] How to normalize MathematicalPi text?

Ross Moore ross.moore at mq.edu.au
Wed Mar 13 21:48:08 UTC 2019


Hi Jeroen,

On 13 Mar 2019, at 11:54 pm, Jeroen Ooms <jeroen at berkeley.edu> wrote:

A researcher who is using the R bindings to analyze large numbers of
scientific papers has asked me for advice on the following:

When extracting results from scientific PDFs, math symbols sometimes
cannot be extracted because they are encoded with a custom font
called Mathematical-Pi [1].

Those PDFs are not constructed correctly.
Although there is a /ToUnicode CMap, it maps every character to <FFFD>,
which is U+FFFD REPLACEMENT CHARACTER, the “unknown character” placeholder (see the 2 images below).
So the CMap is of no use at all, so far as Copy/Paste or Accessibility are concerned.
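
When processing many papers, one rough way to spot pages affected by this kind of broken /ToUnicode CMap is to count how often U+FFFD shows up in the extracted text. The following is only a sketch in Python; the function names and the 1% threshold are assumptions made for illustration, not part of poppler or its R bindings.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT for ch in visible) / len(visible)


def looks_unmapped(text: str, threshold: float = 0.01) -> bool:
    """Flag text whose share of U+FFFD reaches the (assumed) threshold."""
    return replacement_ratio(text) >= threshold
```

A page flagged this way still needs manual inspection; the check only says that some characters had no usable Unicode mapping, not which symbols were intended.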


[2 inline images]

An example of such a paper is [2]. When we
extract text via poppler::page::text(), all of the = < > α β characters
come out as seemingly random characters from Mathematical-Pi rather than
the expected Unicode symbols. Unfortunately these characters are critical
for interpreting the results, so we cannot ignore this.

That paper was produced in 2004.
It has no /ToUnicode entry at all for the Universal-GreekwithMathPi font
(see 3rd image).
So again, there is no hope of getting the correct characters by Copy/Paste.
The PDF Creator is listed as XPP. I have no idea what program this is.
Maybe 15 years later it does a better job?

Back in 2004, Accessibility in scientific publications was not the kind of issue that it is becoming today.

[inline image]

I was wondering if someone has experience with normalizing text in
custom fonts into proper Unicode?

Yes, I do, but only when producing PDFs with TeX-based software.
I can construct the requisite CMap resources and include them in documents produced using LaTeX.
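
To give an idea of what such a CMap resource contains: its core is a list of `bfchar` entries, each pairing a character code in the font with a Unicode value written as UTF-16BE hex. The Python sketch below only renders that section from a table. It is not the TeX-based workflow mentioned above, the function name is made up, single-byte codes are assumed, and the codes in the usage note are hypothetical rather than the real Mathematical-Pi encoding.

```python
from typing import Mapping


def to_unicode_bfchar(mapping: Mapping[int, str]) -> str:
    """Render {character code: Unicode string} as ToUnicode bfchar blocks.

    Assumes single-byte character codes (0x00-0xFF). Blocks are split
    at 100 entries, following the usual CMap convention.
    """
    items = sorted(mapping.items())
    lines = []
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            if not 0 <= code <= 0xFF:
                raise ValueError(f"code out of single-byte range: {code}")
            # Destination strings are UTF-16BE, so characters outside
            # the BMP come out as surrogate pairs.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    return "\n".join(lines)
```

For example, `to_unicode_bfchar({0x61: "α", 0x62: "β"})` produces a two-entry block mapping `<61>` to `<03B1>` and `<62>` to `<03B2>`. A complete CMap stream also needs the surrounding header, codespace range and trailer, which are left out here.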

PDF graphics made with R have all kinds of issues regarding fonts, font embedding and color spaces.
I’d appreciate you (or your colleague) sending me (off list) some example PDFs produced by R.
I’ll tinker with them, to see if I can make them more Prepress/Accessibility friendly;
e.g., suitable for documents satisfying the PDF/A and/or PDF/UA standards.



I think what would be needed is a table that maps the
Mathematical-Pi characters to their proper Unicode values. Then we
would need some hook in poppler::page::text() that replaces the contents
of text boxes set in the Mathematical-Pi font with the corresponding
UTF-8 text.
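
The table-plus-hook idea can be sketched in Python as follows. This assumes the extraction layer can report a font name for each text run, which is an assumption here and not something poppler::page::text() provides by itself. The table entries are placeholders for illustration only; the real Mathematical-Pi encoding would have to be worked out by inspecting the font.

```python
from typing import Iterable, Mapping, Tuple

# PLACEHOLDER entries for illustration only; not the real encoding.
EXAMPLE_TABLE = {"5": "=", "a": "\u03b1", "b": "\u03b2"}


def is_mathpi_font(font_name: str) -> bool:
    """Heuristic match on the font name (the name patterns are assumptions)."""
    name = font_name.lower().replace("-", "")
    return "mathematicalpi" in name or "mathpi" in name


def normalize_runs(
    runs: Iterable[Tuple[str, str]],
    table: Mapping[str, str] = EXAMPLE_TABLE,
) -> str:
    """Join (font_name, text) runs, translating only the Math-Pi runs."""
    trans = str.maketrans(dict(table))
    parts = []
    for font_name, text in runs:
        if is_mathpi_font(font_name):
            text = text.translate(trans)
        parts.append(text)
    return "".join(parts)
```

With the placeholder table, the runs `[("Times-Roman", "P "), ("MathematicalPi-One", "5"), ("Times-Roman", " 0.05")]` come out as `P = 0.05`. The "5" inside "0.05" is left alone because its run is not in the Math-Pi font, which is why the replacement has to be keyed on the font rather than applied to the whole page text.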


[1] https://files.acrobat.com/a/preview/b445ea2f-fcbb-44af-a798-fc854d8dd9b5
[2] https://github.com/ropensci/pdftools/files/2961444/Ames2004.pdf

Hope this helps.

Ross


Dr Ross Moore
Department of Mathematics and Statistics
12 Wally’s Walk, Level 7, Room 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955  |  F: +61 2 9850 8114
M: +61 407 288 255  |  E: ross.moore at mq.edu.au
http://www.maths.mq.edu.au

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2019-03-14 at 8.23.59 am.png
Type: image/png
Size: 307406 bytes
Desc: Screen Shot 2019-03-14 at 8.23.59 am.png
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20190313/784834b5/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2019-03-14 at 8.23.31 am.png
Type: image/png
Size: 331898 bytes
Desc: Screen Shot 2019-03-14 at 8.23.31 am.png
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20190313/784834b5/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2019-03-14 at 8.34.44 am.png
Type: image/png
Size: 530038 bytes
Desc: Screen Shot 2019-03-14 at 8.34.44 am.png
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20190313/784834b5/attachment-0006.png>