<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - rendering pdf and pdftotext give different results"
href="https://bugs.freedesktop.org/show_bug.cgi?id=104085">104085</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>rendering pdf and pdftotext give different results
</td>
</tr>
<tr>
<th>Product</th>
<td>poppler
</td>
</tr>
<tr>
<th>Version</th>
<td>unspecified
</td>
</tr>
<tr>
<th>Hardware</th>
<td>x86-64 (AMD64)
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux (All)
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>medium
</td>
</tr>
<tr>
<th>Component</th>
<td>utils
</td>
</tr>
<tr>
<th>Assignee</th>
<td>poppler-bugs@lists.freedesktop.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>galtgendo@o2.pl
</td>
</tr></table>
<p>
<div>
<pre>...well, of course they do: one renders the PDF graphically, the other just
tries to extract the text...
However, the issue is this: I've stumbled upon a PDF file that is displayed
correctly, but pdftotext dumps strings that would look like typos, except that
the "typo" is always the same character.
So I looked into the file's content:
699 0 obj
<<
/BaseEncoding /WinAnsiEncoding
/Differences [
1
/zdot
/aogonek
/eogonek
/sacute
/cacute
/Sacute
/nacute
/Zdot
/zacute
/Zacute
]
/Type /Encoding
>>
endobj
700 0 obj
<<
/Ascent 625
/CapHeight 625
/Descent -177
/Flags 4
/FontBBox [
5
-177
638
877
]
/FontFile2 712 0 R
/FontName /RDZRPI+TimesNewRoman
/ItalicAngle 0
/MissingWidth 777
/StemV 95
/Type /FontDescriptor
>>
endobj
701 0 obj
<<
/Length 702 0 R
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R706 def
1 begincodespacerange
<00><ff>
endcodespacerange
10 beginbfrange
<01><01><015c>
<02><02><0105>
<03><03><0119>
<04><04><015b>
<05><05><0107>
<06><06><015a>
<07><07><0144>
<08><08><015b>
<09><09><017a>
<0a><0a><0179>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj
Well, that's just one of a few such sets. The point is that, for example,
'/zdot' should map to '017c' rather than '015c'; at least, changing it to that
gives proper results in pdftotext, and the PDF file modified that way still
displays correctly.
So, is there a step that pdftotext is skipping that it could perform to get
the proper result, or is this something that only works during on-screen
rendering?</pre>
</div>
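<div>
<pre>The mismatch described above can be checked mechanically. The following is a
minimal sketch, not poppler code: it compares the glyph names from the
/Differences array of object 699 with the bfrange entries of object 701.
The helper name find_mismatches and the table GLYPH_TO_UNICODE are
hypothetical; the table is a hand-copied excerpt of the Adobe Glyph List that
covers only the ten names quoted in this report.

```python
# Hypothetical cross-check, not part of poppler: compare what the glyph names
# in /Differences imply with what the ToUnicode CMap actually says.

# Excerpt of the Adobe Glyph List (assumption: only the names used here).
GLYPH_TO_UNICODE = {
    "zdot": 0x017C, "aogonek": 0x0105, "eogonek": 0x0119,
    "sacute": 0x015B, "cacute": 0x0107, "Sacute": 0x015A,
    "nacute": 0x0144, "Zdot": 0x017B, "zacute": 0x017A,
    "Zacute": 0x0179,
}

# /Differences from object 699: first code, then consecutive glyph names.
DIFFERENCES_START = 1
DIFFERENCES_NAMES = [
    "zdot", "aogonek", "eogonek", "sacute", "cacute",
    "Sacute", "nacute", "Zdot", "zacute", "Zacute",
]

# bfrange entries from object 701, written as {character code: Unicode value}.
TO_UNICODE = {
    0x01: 0x015C, 0x02: 0x0105, 0x03: 0x0119, 0x04: 0x015B, 0x05: 0x0107,
    0x06: 0x015A, 0x07: 0x0144, 0x08: 0x015B, 0x09: 0x017A, 0x0A: 0x0179,
}


def find_mismatches(start, names, to_unicode, glyph_map=GLYPH_TO_UNICODE):
    """Return (code, name, expected, actual) for every disagreeing code."""
    mismatches = []
    for code, name in enumerate(names, start):
        expected = glyph_map.get(name)   # value implied by the glyph name
        actual = to_unicode.get(code)    # value stored in the ToUnicode CMap
        if expected is None or actual is None:
            continue                     # nothing to compare against
        if expected != actual:
            mismatches.append((code, name, expected, actual))
    return mismatches


if __name__ == "__main__":
    found = find_mismatches(DIFFERENCES_START, DIFFERENCES_NAMES, TO_UNICODE)
    for code, name, expected, actual in found:
        print(f"code {code:02x} /{name}: glyph name implies U+{expected:04X}, "
              f"ToUnicode gives U+{actual:04X}")
```

On the data quoted in this report, the check flags codes 01 (/zdot) and
08 (/Zdot); the other eight entries agree. Whether pdftotext could or should
prefer the glyph name in such a case is the open question of this report.</pre>
</div>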
</p>
</body>
</html>