[Poppler-bugs] [Bug 104085] New: rendering pdf and pdftotext give different results

Mon Dec 4 19:56:25 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=104085

            Bug ID: 104085
           Summary: rendering pdf and pdftotext give different results
           Product: poppler
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: utils
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: galtgendo at o2.pl

...well, of course they do - one renders the pdf graphically, the other just
tries to extract the text...

However, the issue is this: I've stumbled upon a pdf file that's displayed
correctly, but pdftotext was dumping strings that would have looked like typos,
except that the "typo" was always the same char.

So, I've looked into the content.

699 0 obj
<<
  /BaseEncoding /WinAnsiEncoding
  /Differences [
    1
    /zdot
    /aogonek
    /eogonek
    /sacute
    /cacute
    /Sacute
    /nacute
    /Zdot
    /zacute
    /Zacute
  ]
  /Type /Encoding
>>
endobj

700 0 obj
<<
  /Ascent 625
  /CapHeight 625
  /Descent -177
  /Flags 4
  /FontBBox [
    5
    -177
    638
    877
  ]
  /FontFile2 712 0 R
  /FontName /RDZRPI+TimesNewRoman
  /ItalicAngle 0
  /MissingWidth 777
  /StemV 95
  /Type /FontDescriptor
>>
endobj

701 0 obj
<<
  /Length 702 0 R
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R706 def
1 begincodespacerange
<00><ff>
endcodespacerange
10 beginbfrange
<01><01><015c>
<02><02><0105>
<03><03><0119>
<04><04><015b>
<05><05><0107>
<06><06><015a>
<07><07><0144>
<08><08><015b>
<09><09><017a>
<0a><0a><0179>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj

Well, that's just one of a few such sets. The point is that - for example -
'/zdot' (code <01>, mapped to <015c> above) should be '017c', or at least
changing it to that gives proper results in pdftotext. The pdf file modified
that way still displays correctly.
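
A rough way to cross-check the two objects quoted above is to compare, code by code, what the /Differences glyph names suggest against what the ToUnicode bfrange entries say. The sketch below is only an illustration: the data is copied by hand from objects 699 and 701, the EXPECTED table is a hand-written assumption (the usual Unicode code points for these Polish letters), and nothing here is taken from poppler itself.

```python
# Glyph names from the /Differences array of object 699; the array starts at code 1.
FIRST_CODE = 1
DIFFERENCES = [
    "zdot", "aogonek", "eogonek", "sacute", "cacute",
    "Sacute", "nacute", "Zdot", "zacute", "Zacute",
]

# Code -> code point pairs copied from the bfrange section of object 701.
TO_UNICODE = {
    0x01: 0x015C, 0x02: 0x0105, 0x03: 0x0119, 0x04: 0x015B, 0x05: 0x0107,
    0x06: 0x015A, 0x07: 0x0144, 0x08: 0x015B, 0x09: 0x017A, 0x0A: 0x0179,
}

# Hand-written table (an assumption of this sketch, not read from the file):
# the Unicode code point each glyph name normally stands for.
EXPECTED = {
    "zdot": 0x017C, "aogonek": 0x0105, "eogonek": 0x0119, "sacute": 0x015B,
    "cacute": 0x0107, "Sacute": 0x015A, "nacute": 0x0144, "Zdot": 0x017B,
    "zacute": 0x017A, "Zacute": 0x0179,
}


def find_mismatches():
    """Return (code, glyph name, mapped, expected) for every disagreement."""
    mismatches = []
    for offset, name in enumerate(DIFFERENCES):
        code = FIRST_CODE + offset
        mapped = TO_UNICODE[code]
        wanted = EXPECTED[name]
        if mapped != wanted:
            mismatches.append((code, name, mapped, wanted))
    return mismatches


mismatches = find_mismatches()
for code, name, mapped, wanted in mismatches:
    print(f"<{code:02x}> /{name}: ToUnicode gives {mapped:04x}, "
          f"glyph name suggests {wanted:04x}")
```

With the data above, this flags codes <01> (/zdot) and <08> (/Zdot): in both cases the CMap has 015x where the glyph name suggests 017x, and <08> ends up on the same code point as /sacute.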

So, is there a step that pdftotext is skipping that it could be doing to get
the proper result, or is it something that only works during on-screen
rendering?
