[poppler] How do I extract text per column?

Sun Nov 11 23:37:51 UTC 2018

I have a PDF where each page is split into 2 columns. The text starts in
the 1st column (on the left), and when that column is filled it continues in
the 2nd column (on the right). Once the 2nd column is filled, the text
continues in the 1st column of the next page.
Sometimes there is also some text at the top of a page that sits
outside/above the columns and spans both of them.

I tried to get the text using Poppler::Page::text() with an empty rect, but
the result is oddly formatted, with text from the two columns mixed together.
How do I fix this? How do I get a list of rects for the text areas? How do
I identify the order in which the text is supposed to flow?

Or generally how am I supposed to extract text in a meaningful order?
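
To make the ordering question concrete, here is a minimal sketch of one
possible heuristic, not Poppler's own algorithm. It assumes each word is
available with a bounding box (in the Qt binding these could presumably be
taken from Poppler::Page::textList()), that the page has exactly two columns
of equal width, and that the y position where the columns begin is known.
The Word struct, readingOrder(), columnTop and lineTol are hypothetical names
introduced only for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical stand-in for a word and its bounding box.
// Coordinates assume the origin is at the top-left and y grows downward.
struct Word {
    std::string text;
    double x, y, w, h;
};

// Returns the words in an assumed reading order: text above the columns
// first, then the left column top to bottom, then the right column.
// columnTop is the y position where the columns start (an assumption the
// caller has to supply); lineTol groups words with nearly equal y into
// the same line.
std::vector<Word> readingOrder(const std::vector<Word>& words,
                               double pageWidth, double columnTop,
                               double lineTol = 2.0)
{
    std::vector<Word> header, left, right;
    const double mid = pageWidth / 2.0;

    for (const Word& w : words) {
        if (w.y + w.h <= columnTop)
            header.push_back(w);          // entirely above the columns
        else if (w.x + w.w / 2.0 < mid)
            left.push_back(w);            // box center in the left half
        else
            right.push_back(w);           // box center in the right half
    }

    // Sort by quantized line position first, then left to right.
    auto byLineThenX = [lineTol](const Word& a, const Word& b) {
        const long la = std::lround(a.y / lineTol);
        const long lb = std::lround(b.y / lineTol);
        if (la != lb)
            return la < lb;
        return a.x < b.x;
    };
    std::sort(header.begin(), header.end(), byLineThenX);
    std::sort(left.begin(), left.end(), byLineThenX);
    std::sort(right.begin(), right.end(), byLineThenX);

    std::vector<Word> ordered;
    ordered.insert(ordered.end(), header.begin(), header.end());
    ordered.insert(ordered.end(), left.begin(), left.end());
    ordered.insert(ordered.end(), right.begin(), right.end());
    return ordered;
}
```

This only works for a fixed, known layout. Detecting the column boundaries
and the header height automatically is the harder part of the problem, and
the sketch does not attempt it.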

Furthermore, at least on the first page, the extracted text seems to begin
with some text that doesn't appear in a PDF viewer.

pdftotext seems to output the text in a mostly correct order, correctly
formatted and correctly separated section by section. It even removes the
hyphens from words that are hyphenated at the end of a line!
I tried to look at its source to figure out how it does this, but it seems
to use undocumented internal APIs rather than the APIs exposed by the
documented bindings.
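
For reference, the hyphen removal described above can be approximated with
a naive rule once the lines are in the right order. This is only a sketch
of the general idea, not what pdftotext actually does internally; the
function name joinLines() and the "next line starts with a lowercase letter"
rule are assumptions made for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Joins lines of text into one string. If a line ends with '-' and the
// next line starts with a lowercase letter, the hyphen is dropped and the
// two halves are glued together without a space; otherwise lines are
// joined with a single space.
std::string joinLines(const std::vector<std::string>& lines)
{
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        std::string line = lines[i];
        bool glueToNext = false;

        const bool hasNext = i + 1 < lines.size() && !lines[i + 1].empty();
        if (line.size() > 1 && line.back() == '-' && hasNext &&
            std::islower(static_cast<unsigned char>(lines[i + 1][0]))) {
            line.pop_back();      // drop the end-of-line hyphen
            glueToNext = true;
        }

        out += line;
        if (!glueToNext && i + 1 < lines.size())
            out += ' ';
    }
    return out;
}
```

The rule is deliberately simple: a genuine compound such as "well-known"
that happens to break at its hyphen would be merged into "wellknown", so a
real implementation needs something smarter.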