[poppler] pdftops font subset question

Leonard Rosenthol lrosenth at adobe.com
Wed Jul 1 14:43:12 UTC 2020


Subsetting a font has *ZERO* impact on the ability to extract text from a PDF, provided that you follow the rules for text extraction in ISO 32000 (the PDF standard) rather than doing something like "grep".

Leonard

From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of William Bader <williambader at hotmail.com>
Date: Wednesday, July 1, 2020 at 2:18 AM
To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject: [poppler] pdftops font subset question

Is there any way to prevent pdftops from subsetting fonts? I want to be able to convert the ps back to a PDF and still be able to extract text with pdftotext.

I have a large single page PDF. When I drag to copy text in atril or okular or run pdftotext, it finds the text.
pdffonts shows about 40 fonts. They are all similar:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HelveticaNeueLTStd-Roman--Identity-H CID Type 0C       Identity-H       yes no  yes    214  0
HelveticaNeueLTStd-BdIt--Identity-H  CID Type 0C       Identity-H       yes no  yes    236  0
...
HelveticaLTStd-Bold--Identity-H      CID Type 0C       Identity-H       yes no  yes     70  0
Berkeley-Bold--Identity-H            CID Type 0C       Identity-H       yes no  yes     60  0

pdfinfo shows
ModDate:        Fri Jun 26 21:27:37 2020 WEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      702 x 1296 pts
Page rot:       0
File size:      13501736 bytes
Optimized:      no
PDF version:    1.6

When I run the PDF through pdftops, it subsets the fonts. When I then convert the PS back into a PDF with ghostscript ps2pdf, the text displays, but copying it or running pdftotext does not work.
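
For anyone who wants to reproduce the round trip, here is a minimal Python sketch of the three steps. It assumes pdftops, ps2pdf, and pdftotext are on PATH and uses only their basic input/output arguments. The intermediate file names and the function names are placeholders invented for illustration.

```python
import subprocess
from pathlib import Path


def roundtrip_commands(pdf):
    """Return the three commands of the PDF -> PS -> PDF -> text round trip.

    The intermediate file names are placeholders derived from the input name.
    """
    src = Path(pdf)
    ps = src.with_suffix(".ps")
    back = src.with_name(src.stem + "-roundtrip.pdf")
    return [
        ["pdftops", str(src), str(ps)],    # poppler: PDF -> PostScript
        ["ps2pdf", str(ps), str(back)],    # ghostscript: PostScript -> PDF
        ["pdftotext", str(back), "-"],     # poppler: extracted text to stdout
    ]


def run_roundtrip(pdf):
    """Run the round trip and return whatever text pdftotext extracts."""
    *convert, extract = roundtrip_commands(pdf)
    for cmd in convert:
        subprocess.run(cmd, check=True)
    result = subprocess.run(extract, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling run_roundtrip("x.pdf") and comparing the result with the output of pdftotext on the original file would show whether the text survived the conversion.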

The end of the generated ps is
%%+ font BHQHNF+MinionPro-Regular
%%+ font BHQHNG+Berkeley-Book
%%+ font BHQHNH+HelveticaLTStd-Bold
%%+ font BHQHNI+Berkeley-Bold
%%EOF
so it looks like pdftops is subsetting the fonts.
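
The six-uppercase-letter prefix followed by "+" is the usual naming convention for subset fonts. A small Python sketch of checking the DSC trailer for that tag is below. The function names are made up for this example, and the tag only shows that the naming convention was applied, not which glyphs were kept.

```python
import re

# A subset font name is a tag of six uppercase letters, a plus sign,
# and then the original PostScript font name.
SUBSET_NAME = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None if the name has no subset tag."""
    m = SUBSET_NAME.match(font_name)
    return (m.group(1), m.group(2)) if m else (None, font_name)


def dsc_font_names(ps_lines):
    """Yield font names from DSC continuation lines of the form '%%+ font NAME'."""
    for line in ps_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "%%+" and parts[1] == "font":
            yield parts[2]


trailer = [
    "%%+ font BHQHNF+MinionPro-Regular",
    "%%+ font BHQHNG+Berkeley-Book",
    "%%+ font BHQHNH+HelveticaLTStd-Bold",
    "%%+ font BHQHNI+Berkeley-Bold",
    "%%EOF",
]
for name in dsc_font_names(trailer):
    print(split_subset_tag(name))
```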

"grep Berkeley-Bold", for example, shows
%%BeginResource: font BHQHNI+Berkeley-Bold
/CIDFontName /BHQHNI+Berkeley-Bold def
/F60_0 /BHQHNI+Berkeley-Bold 0 pdfMakeFont16L3
%%+ font BHQHNI+Berkeley-Bold

"grep -A 1 ' Tc$' x.ps | grep '(' | head" also appears to show that the fonts have been subset:
(\000\025\000\014)
(\000\015\000\024)
(\000\001\000*)
(\000\002\000\003\000\012)
(\000\006\000\015)
(\000\014\000\017\000\005\000\007)
(\000\033\000\031)
(\000\013\000"\000"\000\026\000\022)
(\000\012\000\004)
(\000\024\000\023\000\017\000\001)
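
To make those strings easier to read, here is a rough Python sketch that decodes a PostScript string literal and groups the bytes into 16-bit codes. Reading them as big-endian two-byte codes is an assumption, based on the fonts being 16-bit (Identity-H in the original PDF). The helper names are invented for this example, and backslash-newline continuations are not handled.

```python
def ps_string_bytes(literal):
    """Decode a PostScript string literal such as '(\\000\\025)' to bytes.

    Handles octal escapes (one to three digits), the single-letter escapes,
    and plain characters. Line continuations are not handled in this sketch.
    """
    if not (literal.startswith("(") and literal.endswith(")")):
        raise ValueError("not a parenthesized PostScript string")
    body = literal[1:-1]
    simple = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ord(ch))          # plain character, e.g. '*' or '"'
            continue
        if i >= len(body):
            break                        # stray trailing backslash
        j = i
        while j < len(body) and j < i + 3 and body[j] in "01234567":
            j += 1
        if j > i:                        # octal escape such as \025
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:                            # \n, \\, \(, \) and similar
            out.append(simple.get(body[i], ord(body[i])))
            i += 1
    return bytes(out)


def codes16(data):
    """Group bytes into big-endian 16-bit codes (assumes two bytes per code)."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    return [int.from_bytes(data[k:k + 2], "big") for k in range(0, len(data), 2)]


for sample in (r'(\000\025\000\014)', r'(\000\001\000*)'):
    print(sample, codes16(ps_string_bytes(sample)))
```

The resulting numbers are font-specific codes, not Unicode. Turning them back into text needs additional mapping information that this sketch does not attempt to recover.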

In testing, I also noticed that some pdftops options like -level3 generate ps files that crash ghostscript, but for now I think that is a ghostscript issue. https://bugs.ghostscript.com/show_bug.cgi?id=702526

The ghostscript bug report has a copy of the PDF.

I can post this as a poppler bug report, but I wanted to check first that I didn't miss a pdftops option or that there wasn't an internal flag that I could expose as an option in pdftops.

William
