<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
p.xmsonormal, li.xmsonormal, div.xmsonormal
{mso-style-name:x_msonormal;
margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal">><span style="font-size:12.0pt;color:black"> Those Unicode CMaps can't be passed in postscript, so do I permanently lose useful text extraction when I convert this PDF to postscript with pdftops?<o:p></o:p></span></p>
<p class="MsoNormal">><o:p> </o:p></p>
<p class="MsoNormal">Of course they can! CMaps and CID fonts predate PDF and were first introduced in PostScript, as described in Adobe Technote 5014,
<a href="https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf">
https://www.adobe.com/content/dam/acom/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf</a><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I can tell you that if I export a PDF using CIDFonts from Adobe Acrobat to PostScript and run that PostScript through Acrobat Distiller, I get a fully searchable PDF.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Whether pdftops will output them, I don’t know. Nor do I know whether Ghostscript, upon encountering them, will correctly restore the font encoding.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Leonard<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">William Bader &lt;williambader@hotmail.com&gt;<br>
<b>Date: </b>Wednesday, July 1, 2020 at 12:17 PM<br>
<b>To: </b>Leonard Rosenthol &lt;lrosenth@adobe.com&gt;, "poppler@lists.freedesktop.org" &lt;poppler@lists.freedesktop.org&gt;<br>
<b>Subject: </b>Re: [poppler] pdftops font subset question<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Thanks for the information.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">>provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">I am using poppler (directly with pdftotext and indirectly with atril and okular).<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">The original PDF has sequences like the one below for each font.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">Those Unicode CMaps can't be passed in postscript, so do I permanently lose useful text extraction when I convert this PDF to postscript with pdftops?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">/CIDInit /ProcSet findresource begin<o:p></o:p></span></p>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">12 dict begin<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">begincmap<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">/CIDSystemInfo &lt;&lt;/Registry (Berkeley-Black-OV-BHQHLP) /Ordering (CIDUCS) /Supplement 0 &gt;&gt; def<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">/CMapName /Berkeley-Black-OV-BHQHLP def<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">/CMapType 2 def<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">1 begincodespacerange<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">&lt;0000&gt; &lt;FFFF&gt;<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">endcodespacerange<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">2 beginbfchar<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">&lt;0001&gt; &lt;0066&gt;<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">&lt;0002&gt; &lt;006F&gt;<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">endbfchar<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">endcmap CMapName currentdict /CMap defineresource pop end end<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
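<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">To make concrete what a ToUnicode CMap like the one above carries, here is a minimal Python sketch that reads the bfchar entries and maps character codes to Unicode text. It assumes only bfchar blocks with UTF-16BE destination values (bfrange is not handled), and the names parse_bfchar and to_text are illustrative, not part of poppler or any other library.</span></p>
<pre>
```python
import re

# Sample input: the two bfchar entries from the CMap quoted above.
CMAP_TEXT = """
2 beginbfchar
<0001> <0066>
<0002> <006F>
endbfchar
"""

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping: dict[int, str] = {}
    for block in _BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in _PAIR.findall(block):
            # Assumption: the destination is UTF-16BE, as in ToUnicode CMaps.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def to_text(codes: list[int], mapping: dict[int, str]) -> str:
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    table = parse_bfchar(CMAP_TEXT)
    print(table)
    print(to_text([1, 2, 2], table))
```
</pre>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">With the two entries shown, codes 1 and 2 map to "f" and "o". A code with no entry cannot be turned back into text by this table, so for Identity-H fonts like these, extraction generally depends on the table surviving the conversion.</span></p>
</div>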
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black">William<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;color:black"><o:p> </o:p></span></p>
</div>
<div>
<div class="MsoNormal" align="center" style="text-align:center">
<hr size="0" width="100%" align="center">
</div>
<div id="divRplyFwdMsg">
<p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> Leonard Rosenthol &lt;lrosenth@adobe.com&gt;<br>
<b>Sent:</b> Wednesday, July 1, 2020 10:43 AM<br>
<b>To:</b> William Bader &lt;williambader@hotmail.com&gt;; poppler@lists.freedesktop.org &lt;poppler@lists.freedesktop.org&gt;<br>
<b>Subject:</b> Re: [poppler] pdftops font subset question</span> <o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="xmsonormal">Subsetting of a font has *<b>ZERO</b>* impact on the ability to extract text from a PDF…provided that you are following the rules for text extraction in ISO 32000 (PDF standard) vs. doing something like “grep”.<o:p></o:p></p>
<p class="xmsonormal"> <o:p></o:p></p>
<p class="xmsonormal">Leonard<o:p></o:p></p>
<p class="xmsonormal"> <o:p></o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="xmsonormal"><b><span style="font-size:12.0pt;color:black">From: </span>
</b><span style="font-size:12.0pt;color:black">poppler &lt;poppler-bounces@lists.freedesktop.org&gt; on behalf of William Bader &lt;williambader@hotmail.com&gt;<br>
<b>Date: </b>Wednesday, July 1, 2020 at 2:18 AM<br>
<b>To: </b>"poppler@lists.freedesktop.org" &lt;poppler@lists.freedesktop.org&gt;<br>
<b>Subject: </b>[poppler] pdftops font subset question</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"> <o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">Is there any way to prevent pdftops from subsetting fonts? I want to be able to convert the ps back to a PDF and still be able to extract text with pdftotext.</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">I have a large single page PDF. When I drag to copy text in atril or okular or run pdftotext, it finds the text.</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">pdffonts shows about 40 fonts. They are all similar:</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">name type encoding emb sub uni object ID</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">------------------------------------ ----------------- ---------------- --- --- --- ---------</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">HelveticaNeueLTStd-Roman--Identity-H CID Type 0C Identity-H yes no yes 214 0</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">HelveticaNeueLTStd-BdIt--Identity-H CID Type 0C Identity-H yes no yes 236 0</span><o:p></o:p></p>
</div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">...</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">HelveticaLTStd-Bold--Identity-H CID Type 0C Identity-H yes no yes 70 0</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Berkeley-Bold--Identity-H CID Type 0C Identity-H yes no yes 60 0</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">pdfinfo shows</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">ModDate: Fri Jun 26 21:27:37 2020 WEST</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Tagged: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">UserProperties: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Suspects: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Form: none</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">JavaScript: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Pages: 1</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Encrypted: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Page size: 702 x 1296 pts</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Page rot: 0</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">File size: 13501736 bytes</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">Optimized: no</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;font-family:"Courier New";color:black">PDF version: 1.6</span><o:p></o:p></p>
</div>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">When I run the PDF through pdftops, it subsets the fonts, and then when I convert it back into a PDF with ghostscript ps2pdf, the text shows, but copying it or running pdftotext does not work.</span><o:p></o:p></p>
</div>
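<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">The round trip described above can be scripted roughly as follows. This is only a sketch: it assumes pdftops, ps2pdf and pdftotext are on the PATH and are called with plain input and output file arguments, the file names and the helper names round_trip_commands and text_survives are made up for illustration, and comparing whitespace-separated words is a deliberately crude test that may be too strict if the word order changes.</span></p>
<pre>
```python
import subprocess
from pathlib import Path


def round_trip_commands(src_pdf: str, workdir: str) -> list[list[str]]:
    """Build the command lines for PDF -> PS -> PDF plus text extraction."""
    work = Path(workdir)
    ps_file = str(work / "roundtrip.ps")
    pdf_file = str(work / "roundtrip.pdf")
    return [
        ["pdftotext", src_pdf, str(work / "before.txt")],  # text before
        ["pdftops", src_pdf, ps_file],                     # PDF -> PostScript
        ["ps2pdf", ps_file, pdf_file],                     # PostScript -> PDF
        ["pdftotext", pdf_file, str(work / "after.txt")],  # text after
    ]


def text_survives(src_pdf: str, workdir: str) -> bool:
    """Run the round trip and report whether the extracted words match."""
    for argv in round_trip_commands(src_pdf, workdir):
        subprocess.run(argv, check=True)
    work = Path(workdir)
    before = (work / "before.txt").read_text(errors="replace").split()
    after = (work / "after.txt").read_text(errors="replace").split()
    return before == after
```
</pre>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">Keeping the command list separate from the code that runs it makes it easy to print the commands or add options to one step without touching the comparison.</span></p>
</div>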
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">The end of the generated ps is</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%+ font BHQHNF+MinionPro-Regular</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%+ font BHQHNG+Berkeley-Book</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%+ font BHQHNH+HelveticaLTStd-Bold</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%+ font BHQHNI+Berkeley-Bold</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%EOF</span><o:p></o:p></p>
</div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">so it looks like pdftops is subsetting the fonts.</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">"grep Berkeley-Bold", for example, shows</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%BeginResource: font BHQHNI+Berkeley-Bold</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">/CIDFontName /BHQHNI+Berkeley-Bold def</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">/F60_0 /BHQHNI+Berkeley-Bold 0 pdfMakeFont16L3</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">%%+ font BHQHNI+Berkeley-Bold</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">"grep -A 1 ' Tc$' x.ps | grep '(' | head" also appears to show that the fonts have been subsetted.</span><o:p></o:p></p>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\025\000\014)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\015\000\024)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\001\000*)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\002\000\003\000\012)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\006\000\015)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\014\000\017\000\005\000\007)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\033\000\031)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\013\000"\000"\000\026\000\022)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\012\000\004)</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">(\000\024\000\023\000\017\000\001)</span><o:p></o:p></p>
</div>
</div>
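<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">For reference, the strings above can be decoded into numbers with a short Python sketch. It assumes each string holds big-endian two-byte character codes, which is what one would expect for fonts used with Identity-H, and it ignores nested parentheses and line continuations. The function names are illustrative only.</span></p>
<pre>
```python
import re

# An octal escape (\ddd), any other escaped character, or a literal character.
_TOKEN = re.compile(r"\\([0-7]{1,3})|\\(.)|(.)", re.DOTALL)
_NAMED = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12}


def ps_string_bytes(literal: str) -> bytes:
    """Decode a parenthesised PostScript string literal to raw bytes."""
    body = literal.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("expected a parenthesised PostScript string")
    out = bytearray()
    for octal, escaped, plain in _TOKEN.findall(body[1:-1]):
        if octal:
            out.append(int(octal, 8) & 0xFF)
        elif escaped:
            out.append(_NAMED.get(escaped, ord(escaped)))
        else:
            out.append(ord(plain))
    return bytes(out)


def two_byte_codes(literal: str) -> list[int]:
    """Split the decoded bytes into big-endian 16-bit character codes."""
    data = ps_string_bytes(literal)
    if len(data) % 2:
        raise ValueError("string length is not a multiple of two bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    print(two_byte_codes(r"(\000\025\000\014)"))
```
</pre>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">The first string decodes to the codes 21 and 12. These are small font-specific numbers, not ASCII or Unicode values, so they only become readable text if something maps them back to Unicode.</span></p>
</div>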
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">In testing, I also noticed that some pdftops options like -level3 generate ps files that crash ghostscript, but for now I think that is a ghostscript issue. <a href="https://bugs.ghostscript.com/show_bug.cgi?id=702526">https://bugs.ghostscript.com/show_bug.cgi?id=702526</a></span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">The ghostscript bug report has a copy of the PDF.</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">I can post this as a poppler bug report, but I wanted to check first that I didn't miss a pdftops option or that there wasn't an internal flag that I could expose as an option in pdftops.</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black">William</span><o:p></o:p></p>
</div>
<div>
<p class="xmsonormal"><span style="font-size:12.0pt;color:black"> </span><o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
</body>
</html>