[poppler] pdftops font subset question
William Bader
williambader at hotmail.com
Wed Jul 1 16:17:17 UTC 2020
Thanks for the information.
>provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.
I am using poppler (directly with pdftotext and indirectly with atril and okular).
The original PDF has sequences like the one below for each font.
Those Unicode CMaps can't be passed through PostScript, so do I permanently lose useful text extraction when I convert this PDF to PostScript with pdftops?
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <</Registry (Berkeley-Black-OV-BHQHLP) /Ordering (CIDUCS) /Supplement 0 >> def
/CMapName /Berkeley-Black-OV-BHQHLP def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
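To illustrate what a text extractor gets from a CMap like the one above, here is a minimal Python sketch that reads the bfchar entries and maps character codes to Unicode. It only handles beginbfchar/endbfchar blocks (not bfrange), and the names parse_bfchar and decode_codes are my own for illustration, not anything from poppler.

```python
import re

# The bfchar section copied from the CMap quoted above.
CMAP_TEXT = """\
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from bfchar blocks only."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is UTF-16BE encoded, per the ToUnicode CMap convention.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_codes(codes, mapping):
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)

if __name__ == "__main__":
    table = parse_bfchar(CMAP_TEXT)
    print(table)
    print(decode_codes([1, 2, 2], table))
```

Without such a table, codes 0001 and 0002 carry no meaning outside the font itself, which is why I am asking whether the mapping survives the conversion.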
William
________________________________
From: Leonard Rosenthol <lrosenth at adobe.com>
Sent: Wednesday, July 1, 2020 10:43 AM
To: William Bader <williambader at hotmail.com>; poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
Subject: Re: [poppler] pdftops font subset question
Subsetting of a font has *ZERO* impact on the ability to extract text from a PDF…provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.
Leonard
From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of William Bader <williambader at hotmail.com>
Date: Wednesday, July 1, 2020 at 2:18 AM
To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject: [poppler] pdftops font subset question
Is there any way to prevent pdftops from subsetting fonts? I want to be able to convert the ps back to a PDF and still be able to extract text with pdftotext.
I have a large single page PDF. When I drag to copy text in atril or okular or run pdftotext, it finds the text.
pdffonts shows about 40 fonts. They are all similar:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
HelveticaNeueLTStd-Roman--Identity-H CID Type 0C       Identity-H       yes no  yes    214  0
HelveticaNeueLTStd-BdIt--Identity-H  CID Type 0C       Identity-H       yes no  yes    236  0
...
HelveticaLTStd-Bold--Identity-H      CID Type 0C       Identity-H       yes no  yes     70  0
Berkeley-Bold--Identity-H            CID Type 0C       Identity-H       yes no  yes     60  0
pdfinfo shows
ModDate: Fri Jun 26 21:27:37 2020 WEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 702 x 1296 pts
Page rot: 0
File size: 13501736 bytes
Optimized: no
PDF version: 1.6
When I run the PDF through pdftops, it subsets the fonts, and when I then convert the result back into a PDF with Ghostscript's ps2pdf, the text displays, but copying it or running pdftotext does not work.
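For anyone who wants to reproduce the round trip, the steps can be sketched in Python as below. This is only a sketch: it assumes pdftops, ps2pdf and pdftotext are on the PATH, uses default options for each, and the intermediate file names (roundtrip.ps, roundtrip.pdf) and function names are hypothetical.

```python
import subprocess
from pathlib import Path

def round_trip_commands(pdf, workdir):
    """Build the three command lines: PDF -> PS -> PDF -> extracted text."""
    workdir = Path(workdir)
    ps_file = workdir / "roundtrip.ps"    # hypothetical intermediate name
    pdf_file = workdir / "roundtrip.pdf"  # hypothetical intermediate name
    return [
        ["pdftops", str(pdf), str(ps_file)],
        ["ps2pdf", str(ps_file), str(pdf_file)],
        ["pdftotext", str(pdf_file), "-"],  # "-" sends the text to stdout
    ]

def text_after_round_trip(pdf, workdir):
    """Run the conversions and return whatever pdftotext extracts at the end."""
    *conversions, extract = round_trip_commands(pdf, workdir)
    for command in conversions:
        subprocess.run(command, check=True)
    result = subprocess.run(extract, check=True, capture_output=True)
    return result.stdout.decode("utf-8", "replace")
```

Comparing the returned text with the output of pdftotext on the original PDF shows whether the extractable text survived.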
The end of the generated ps is
%%+ font BHQHNF+MinionPro-Regular
%%+ font BHQHNG+Berkeley-Book
%%+ font BHQHNH+HelveticaLTStd-Bold
%%+ font BHQHNI+Berkeley-Bold
%%EOF
so it looks like pdftops is subsetting the fonts.
"grep Berkeley-Bold", for example, shows
%%BeginResource: font BHQHNI+Berkeley-Bold
/CIDFontName /BHQHNI+Berkeley-Bold def
/F60_0 /BHQHNI+Berkeley-Bold 0 pdfMakeFont16L3
%%+ font BHQHNI+Berkeley-Bold
"grep -A 1 ' Tc$' x.ps | grep '(' | head" also appears to show that the fonts have been subset.
(\000\025\000\014)
(\000\015\000\024)
(\000\001\000*)
(\000\002\000\003\000\012)
(\000\006\000\015)
(\000\014\000\017\000\005\000\007)
(\000\033\000\031)
(\000\013\000"\000"\000\026\000\022)
(\000\012\000\004)
(\000\024\000\023\000\017\000\001)
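To make the strings above easier to read, here is a small Python sketch that unescapes a PostScript string literal and splits it into 2-byte codes. Treating the bytes as big-endian 16-bit codes is my assumption, based on the original fonts using Identity-H; the function names are my own, and the input is assumed to be ASCII as in the grep output.

```python
def ps_string_bytes(literal):
    """Unescape a PostScript string literal such as r'(\\000\\025)' into bytes."""
    body = literal[1:-1]  # drop the enclosing parentheses
    simple = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12}
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))  # ordinary character, e.g. '*' or '"'
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        ch = body[i]
        if ch in "01234567":
            # Octal escape: one to three octal digits.
            j = i
            while j < len(body) and j < i + 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif ch in simple:
            out.append(simple[ch])
            i += 1
        elif ch == "\n":
            i += 1  # backslash-newline is a line continuation
        else:
            out.append(ord(ch))  # \\, \( and \) stand for the character itself
            i += 1
    return bytes(out)

def two_byte_codes(literal):
    """Split the unescaped bytes into big-endian 16-bit character codes."""
    data = ps_string_bytes(literal)
    return [int.from_bytes(data[k:k + 2], "big") for k in range(0, len(data) - 1, 2)]

if __name__ == "__main__":
    print(two_byte_codes(r"(\000\025\000\014)"))
    print(two_byte_codes(r"(\000\001\000*)"))
```

The codes that come out are small consecutive numbers rather than anything resembling Unicode values, so they are only meaningful together with a code-to-Unicode table for each font.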
In testing, I also noticed that some pdftops options like -level3 generate ps files that crash ghostscript, but for now I think that is a ghostscript issue. https://bugs.ghostscript.com/show_bug.cgi?id=702526
The ghostscript bug report has a copy of the PDF.
I can post this as a poppler bug report, but I wanted to check first that I didn't miss a pdftops option or that there wasn't an internal flag that I could expose as an option in pdftops.
William