[poppler] pdftops font subset question

Wed Jul 1 16:17:17 UTC 2020

Thanks for the information.

> provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.

I am using poppler (directly with pdftotext and indirectly with atril and okular).

The original PDF has sequences like the one below for each font.
Those Unicode CMaps can't be carried over into PostScript, so do I permanently lose useful text extraction when I convert this PDF to PostScript with pdftops?

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <</Registry (Berkeley-Black-OV-BHQHLP) /Ordering (CIDUCS) /Supplement 0 >> def
/CMapName /Berkeley-Black-OV-BHQHLP def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
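
To make concrete what the CMap above contributes to text extraction, here is a minimal Python sketch that reads its bfchar entries into a lookup table. The function name parse_bfchar is hypothetical, the sketch handles only bfchar sections (not bfrange), and it assumes the destination values are UTF-16BE, as in the two entries quoted above.

```python
import re

# The bfchar section of the CMap quoted above, copied verbatim.
CMAP_TEXT = """
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry.

    Hypothetical helper for illustration; bfrange sections are ignored.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Assumption: destination values are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


mapping = parse_bfchar(CMAP_TEXT)
print(mapping)  # {1: 'f', 2: 'o'}
```

With this table, character code 0001 extracts as "f" and 0002 as "o". Without such a table, an extractor has only the raw codes to work with.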

William

________________________________
From: Leonard Rosenthol <lrosenth at adobe.com>
Sent: Wednesday, July 1, 2020 10:43 AM
To: William Bader <williambader at hotmail.com>; poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
Subject: Re: [poppler] pdftops font subset question

Subsetting of a font has *ZERO* impact on the ability to extract text from a PDF…provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.

Leonard

From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of William Bader <williambader at hotmail.com>
Date: Wednesday, July 1, 2020 at 2:18 AM
To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject: [poppler] pdftops font subset question

Is there any way to prevent pdftops from subsetting fonts? I want to be able to convert the ps back to a PDF and still be able to extract text with pdftotext.

I have a large single page PDF. When I drag to copy text in atril or okular or run pdftotext, it finds the text.

pdffonts shows about 40 fonts. They are all similar:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HelveticaNeueLTStd-Roman--Identity-H CID Type 0C       Identity-H       yes no  yes    214  0
HelveticaNeueLTStd-BdIt--Identity-H  CID Type 0C       Identity-H       yes no  yes    236  0
...
HelveticaLTStd-Bold--Identity-H      CID Type 0C       Identity-H       yes no  yes     70  0
Berkeley-Bold--Identity-H            CID Type 0C       Identity-H       yes no  yes     60  0

pdfinfo shows

ModDate:        Fri Jun 26 21:27:37 2020 WEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      702 x 1296 pts
Page rot:       0
File size:      13501736 bytes
Optimized:      no
PDF version:    1.6

When I run the PDF through pdftops, it subsets the fonts, and when I convert the result back into a PDF with Ghostscript's ps2pdf, the text is displayed, but copying it or running pdftotext does not work.
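
For anyone who wants to reproduce the round trip described above, here is a minimal Python sketch of the three steps. The file names input.pdf and roundtrip.pdf and both function names are placeholders of mine; x.ps is the intermediate file name used in the grep commands in this message. Only the basic "input output" form of each tool is used, with no extra options.

```python
import shlex
import subprocess


def round_trip_commands(src_pdf, ps_path="x.ps", out_pdf="roundtrip.pdf"):
    """Build the argv lists for PDF -> PostScript -> PDF -> text.

    The default output names are placeholders chosen for this sketch.
    """
    return [
        ["pdftops", src_pdf, ps_path],   # poppler: PDF to PostScript
        ["ps2pdf", ps_path, out_pdf],    # Ghostscript: PostScript back to PDF
        ["pdftotext", out_pdf, "-"],     # poppler: extract text to stdout
    ]


def run_round_trip(src_pdf):
    """Run the three steps in order, stopping at the first failure."""
    for argv in round_trip_commands(src_pdf):
        subprocess.run(argv, check=True)


# Print the commands instead of running them, so the sketch has no side effects.
for argv in round_trip_commands("input.pdf"):
    print(shlex.join(argv))
```

Calling run_round_trip("input.pdf") would execute the steps, provided pdftops, ps2pdf, and pdftotext are on the PATH.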

The end of the generated ps is

%%+ font BHQHNF+MinionPro-Regular
%%+ font BHQHNG+Berkeley-Book
%%+ font BHQHNH+HelveticaLTStd-Bold
%%+ font BHQHNI+Berkeley-Bold
%%EOF

so it looks like pdftops is subsetting the fonts.

"grep Berkeley-Bold", for example, shows

%%BeginResource: font BHQHNI+Berkeley-Bold
/CIDFontName /BHQHNI+Berkeley-Bold def
/F60_0 /BHQHNI+Berkeley-Bold 0 pdfMakeFont16L3
%%+ font BHQHNI+Berkeley-Bold

"grep -A 1 ' Tc$' x.ps | grep '(' | head" also appears to show that the fonts have been subsetted.

(\000\025\000\014)
(\000\015\000\024)
(\000\001\000*)
(\000\002\000\003\000\012)
(\000\006\000\015)
(\000\014\000\017\000\005\000\007)
(\000\033\000\031)
(\000\013\000"\000"\000\026\000\022)
(\000\012\000\004)
(\000\024\000\023\000\017\000\001)
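
To read the strings above as numbers, here is a minimal Python sketch that decodes a PostScript literal string into 16-bit codes. It assumes each code is two bytes, big-endian, which is my reading of the Identity-H encoding and the pdfMakeFont16L3 call rather than something I have confirmed. The function name is hypothetical, and backslash-newline continuations and malformed input are not handled.

```python
def ps_string_to_codes(ps_literal):
    """Decode a PostScript literal string like (\\000\\025) into 16-bit codes.

    Hypothetical helper: assumes two bytes per code, big-endian.
    """
    body = ps_literal.strip()[1:-1]  # drop the enclosing parentheses
    named = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))  # ordinary character stands for its own byte
            i += 1
            continue
        i += 1  # skip the backslash
        if body[i] in "01234567":
            # Octal escape: one to three octal digits.
            j = i
            while j < len(body) and j - i < 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            # Named escape (\n, \t, ...) or an escaped literal such as \( or \\.
            out.append(named.get(body[i], ord(body[i])))
            i += 1
    # Pair the bytes into big-endian 16-bit codes; a trailing odd byte is dropped.
    return [(out[k] << 8) | out[k + 1] for k in range(0, len(out) - 1, 2)]


print(ps_string_to_codes(r"(\000\025\000\014)"))  # [21, 12]
print(ps_string_to_codes(r"(\000\001\000*)"))     # [1, 42]
```

The results are small integers rather than character values, so they only become readable text through a mapping from code to Unicode.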

In testing, I also noticed that some pdftops options like -level3 generate ps files that crash ghostscript, but for now I think that is a ghostscript issue. https://bugs.ghostscript.com/show_bug.cgi?id=702526

The ghostscript bug report has a copy of the PDF.

I can post this as a poppler bug report, but I wanted to check first that I didn't miss a pdftops option or that there wasn't an internal flag that I could expose as an option in pdftops.

William
