[Poppler-bugs] [Bug 99824] New: pdftotext breaks sentence in middle of sentence when text overflow the box, whereas pdftohtml captures the full sentence.

bugzilla-daemon at freedesktop.org
Wed Feb 15 12:19:08 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=99824

            Bug ID: 99824
           Summary: pdftotext breaks sentence in middle of sentence when
                    text overflow the box, whereas pdftohtml captures the
                    full sentence.
           Product: poppler
           Version: unspecified
          Hardware: Other
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: utils
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: gauravarora.daiict at gmail.com

Created attachment 129623
  --> https://bugs.freedesktop.org/attachment.cgi?id=129623&action=edit
sample pdf which is facing this issue

While analyzing a specific set of files, we noticed that the lines generated by
pdftohtml and pdftotext differ where text overflows the boundary of its
containing box.

With pdftohtml, the line is captured normally, with the full text of that line
in a single text element. With pdftotext, the line is broken in the middle of a
word and the rest of the line is emitted as a separate line.

Explanation with example below:

Line as it appears in pdftohtml output:

<text top="412" left="79" width="1021" height="17" font="0">To JOSEPH E. BLUTH
for research and development in the field of electronic photography and
transfer of video tape to motion picture film. [Laboratory]</text>



Line as it appears in pdftotext output:

To JOSEPH E. BLUTH for research and development in the field of electronic
photography and transfer of video tape to mo

.
.
.
.
.
sfer of video tape to motion picture film. [Laboratory]


Line as it appears in the PDF file:


http://i67.tinypic.com/i6w66e.png


Even though the PDF file itself doesn't display this line correctly, pdftohtml
is able to extract the full line, so pdftotext should be able to handle this
case and extract the full line as well.
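
A minimal sketch of how this mismatch could be detected automatically, assuming
the pdftohtml output is the XML form shown above and that each visual line
should appear whole on a single line of pdftotext output. The helper names
(normalize, text_elements, broken_lines) are hypothetical and are not part of
poppler.

```python
import re
import xml.etree.ElementTree as ET
from typing import List


def normalize(s: str) -> str:
    """Collapse every whitespace run (including newlines) to one space."""
    return re.sub(r"\s+", " ", s).strip()


def text_elements(xml_data: str) -> List[str]:
    """Return the normalized content of every <text> element in the XML."""
    root = ET.fromstring(xml_data)
    # iter() also yields the root itself, so a bare <text> fragment works.
    return [normalize("".join(el.itertext())) for el in root.iter("text")]


def broken_lines(xml_data: str, plain_text: str) -> List[str]:
    """List <text> contents that no single pdftotext line contains whole."""
    lines = [normalize(line) for line in plain_text.splitlines()]
    return [
        text
        for text in text_elements(xml_data)
        if text and not any(text in line for line in lines)
    ]
```

Applied to the example above, the <text> element would be reported as broken,
because neither of the two pdftotext fragments contains the full sentence.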

It seems odd for the line to be broken like this. I have attached a sample PDF
file that shows this bug. I have tested the file with poppler-0.51.0:

/poppler/tmp/poppler-0.51.0$ /usr/local/bin/pdftotext -v
pdftotext version 0.51.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
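
For anyone reproducing this, a small sketch of how both tools could be run on
the attachment and their output captured for comparison. It assumes pdftotext
and pdftohtml are on PATH; the file name passed in is a placeholder for the
attached sample, and build_commands and run_both are hypothetical helper names.

```python
import subprocess
from typing import List, Tuple


def build_commands(pdf_path: str) -> Tuple[List[str], List[str]]:
    """Return argv lists for both tools, each writing to standard output."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    to_text = ["pdftotext", pdf_path, "-"]
    # -xml selects XML output, -i ignores images, -stdout prints the result.
    to_xml = ["pdftohtml", "-xml", "-i", "-stdout", pdf_path]
    return to_text, to_xml


def run_both(pdf_path: str) -> Tuple[str, str]:
    """Run both tools and return (plain_text, xml) as decoded strings."""
    outputs = []
    for argv in build_commands(pdf_path):
        done = subprocess.run(argv, check=True, capture_output=True)
        outputs.append(done.stdout.decode("utf-8", errors="replace"))
    return outputs[0], outputs[1]
```

Calling run_both("sample.pdf") would return the two outputs as strings, which
can then be compared line by line.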
