[poppler] pdftops font subset question

Wed Jul 1 22:39:16 UTC 2020

Thanks again for the information.

>CMaps and CID Fonts predate PDF and were introduced first in Postscript as described in Adobe Technote 5014,

The PDF that is giving me problems has CID Type 0C fonts with the Identity-H encoding.
When I open the PDF in an editor, I can find objects like the one at the end of this message.
It looks like pdftops isn't passing them through to the PostScript output.

>I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to Postscript and run that Postscript through Acrobat Distiller, I get a fully searchable PDF.

I only have Linux, so I don't think I have a way to run Acrobat. Would it be possible to take the PDF that I posted to https://bugs.ghostscript.com/show_bug.cgi?id=702526 and add the PS generated by Acrobat and the PDF generated by Distiller?
I looked at the Adobe document that you linked and a few others that I already had, and they seemed to be about external CMap files.
I would like to see an example of a ToUnicode CMap embedded in a PostScript file.
I am hoping that a working PostScript file, combined with the documentation that you linked and what I can see by editing the PDF, will be enough to find a way to get pdftops to generate it.

Regards, William

A section of the original PDF follows. I think that the CMapType 2 object is the ToUnicode map. poppler must understand it, or else pdftotext wouldn't work.
I am hoping that it is something that poppler's PSOutputDev::setupEmbeddedCIDType0Font() can generate.
Reference: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf

281 0 obj
<</Filter/FlateDecode/Length 322>>stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R281 def
1 begincodespacerange
<0000><ffff>
endcodespacerange
30 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
...
<001f><001f><007a>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

endstream
endobj
212 0 obj
<</BaseFont/MPJWBI+HelveticaNeueLTStd-BdIt/ToUnicode 281 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[213 0 R]/Subtype/Type0>>
endobj
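
For anyone who wants to check what the bfrange entries in object 281 encode, here is a minimal Python sketch. It is not poppler code. The names parse_bfranges and sample are made up for illustration, and it assumes that the FlateDecode stream has already been decompressed to text. It handles only the simple <lo><hi><dst> form shown above, not the array form of bfrange or bfchar sections.

```python
import re

# Matches one simple bfrange entry: <lo><hi><dst>, all hex strings.
_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfranges(cmap_text):
    """Return {code: unicode_string} for simple <lo><hi><dst> bfrange entries.

    Assumption: cmap_text is the already-decompressed ToUnicode CMap.
    The array form (<lo><hi>[<a><b>...]) is not supported by this sketch.
    """
    mapping = {}
    # Only look between beginbfrange and endbfrange, so that the
    # codespacerange pairs are never mistaken for mappings.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in _ENTRY.findall(section):
            start, end = int(lo, 16), int(hi, 16)
            # The destination is UTF-16BE; it may hold more than one character.
            base = bytes.fromhex(dst).decode("utf-16-be")
            for offset, code in enumerate(range(start, end + 1)):
                # Successive codes in a range increment the last character.
                mapping[code] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping


# The three entries visible in the excerpt above (the elided ones are omitted).
sample = """
1 begincodespacerange
<0000><ffff>
endcodespacerange
3 beginbfrange
<0001><0001><0043>
<0002><0002><0048>
<001f><001f><007a>
endbfrange
"""

if __name__ == "__main__":
    for code, text in sorted(parse_bfranges(sample).items()):
        print(f"{code:04x} -> {text!r}")
```

With Identity-H encoding the 2-byte codes in the content stream are used directly as CIDs, so a table like this is what carries the text meaning of each glyph.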

________________________________
From: Leonard Rosenthol <lrosenth at adobe.com>
Sent: Wednesday, July 1, 2020 2:48 PM
To: William Bader <williambader at hotmail.com>; poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
Subject: Re: [poppler] pdftops font subset question

> Those Unicode CMaps can't be passed in postscript, so do I permanently lose useful text extraction when I convert this PDF to postscript with pdftops?

Of course they can! CMaps and CID Fonts predate PDF and were introduced first in PostScript, as described in Adobe Technote 5014: https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf

I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to PostScript and run that PostScript through Acrobat Distiller, I get a fully searchable PDF.

Now, whether pdftops will output them, I don't know. Nor do I know whether Ghostscript, upon encountering them, will correctly restore the font encoding.

Leonard
