[Poppler-bugs] [Bug 103798] libpoppler cannot recreate pdftotext output, because physical_layout is not handled correctly

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Nov 26 16:46:38 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=103798

--- Comment #10 from dummydummy at gmx.fr ---

The https line is from https://poppler.freedesktop.org/
"
   Poppler is developed using git. To clone the repository use the following
command:

   git clone https://anongit.freedesktop.org/git/poppler/poppler.git
"
but it does not work with Debian 9 "Stretch" (stable).

Thank you for the address which works.

The command 
           git diff
seems to display the diff on the screen. I just redirected it into a file
with 
           git diff >git-diff.txt
If there is a better way, please let me know (I have never used git before).

The proposed modification ensures that the function
         ustring page::text(const rectf &r, text_layout_enum layout_mode) const
(in file .../gcc/poppler-page.cpp),
when called with physical_layout as layout_mode, correctly creates a
TextOutputDev with its second parameter set to true.
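
As a rough illustration of the intended mapping from layout_mode to the flag,
here is a self-contained sketch. It is NOT the Poppler code: DevFlags and
make_flags are names made up for this example, and the enumerator name
raw_order_layout is an assumption.

```cpp
// Hypothetical stand-ins for this sketch; not the Poppler declarations.
enum text_layout_enum { physical_layout, raw_order_layout };

// The two booleans that a TextOutputDev-like object would receive.
struct DevFlags {
    bool phys_layout;  // the "second parameter" mentioned above
    bool raw_order;
};

// Derive both flags from the requested layout mode, so that a request for
// physical_layout is no longer silently dropped.
inline DevFlags make_flags(text_layout_enum layout_mode) {
    DevFlags f;
    f.phys_layout = (layout_mode == physical_layout);
    f.raw_order = (layout_mode == raw_order_layout);
    return f;
}
```

With this mapping, make_flags(physical_layout) sets phys_layout and nothing
else, which is the behaviour the modification is meant to achieve.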


HOWEVER, even with this change, it is NOT possible to reproduce the output of
pdftotext using libpoppler. The layout is different: for example, there are no
blank lines, whereas pdftotext adds them when needed.

The reason is that pdftotext creates a TextOutputDev with a filename (the
output file name in fact) as first parameter and page::text(...) instead
creates a TextOutputDev with a NULL as first parameter (as there is no output
filename).

When this TextOutputDev is subsequently passed into doc->displayPage(...), the
PDF page is apparently parsed into "fragments". When a filename was provided,
the text is simultaneously written into the output file (respecting the
physical_layout when required).
With the libpoppler function, only the parsing occurs (as there is no output
filename). To obtain the text in physical layout, a different function
(TextOutputDev::getText(...)) is subsequently called to assemble the
fragments parsed by doc->displayPage(...). However, it assembles them
differently from the way doc->displayPage(...) writes them to the file. This
is why libpoppler cannot re-create the output of pdftotext.
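
The effect can be pictured with a toy model. This is NOT Poppler code: Line,
dump_with_gaps and join_without_gaps are invented for illustration, and the
rule used for blank lines (a vertical gap larger than one line height) is only
an assumption about how such a decision might be made.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: a "line" is some text plus a vertical position.
struct Line {
    std::string text;
    int y;  // top of the line, in arbitrary units growing downwards
};

// Path A (stands in for writing to the output file during layout): emits an
// empty line whenever the gap to the previous line exceeds one line height.
std::string dump_with_gaps(const std::vector<Line>& lines, int line_height) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (i > 0 && lines[i].y - lines[i - 1].y > line_height) {
            out += "\n";  // blank line standing in for the vertical gap
        }
        out += lines[i].text + "\n";
    }
    return out;
}

// Path B (stands in for assembling the text afterwards): joins the same
// lines but ignores the vertical gaps.
std::string join_without_gaps(const std::vector<Line>& lines) {
    std::string out;
    for (const Line& l : lines) {
        out += l.text + "\n";
    }
    return out;
}
```

Given the same parsed lines, the two paths return different strings as soon as
one gap exceeds the line height, which is the kind of divergence described
above.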

To summarise: the poppler code currently has two different functions for
providing text in physical_layout (one in doc->displayPage(...) and a
different, inferior one in TextOutputDev::getText(...)). Is this really
necessary?
