[poppler] Recent changes in whitespace rendering with physical_layout
Jeroen Ooms
jeroen at berkeley.edu
Mon May 3 13:22:19 UTC 2021
I maintain R bindings called pdftools, mostly used for extracting text
from scientific documents. The bindings wrap the C++ API; in
particular, we convert PDF to text using poppler::page::text() with
physical_layout.
Recently users have started to report changes in behaviour with newer
versions of poppler, in particular with respect to whitespace. For
example, all pages now end with a '\f' character, which was not the
case before. On Windows, linebreaks are now emitted as '\r\n'
instead of just '\n' as before (we use mingw-w64 compilers). In
addition, some documents that contained a single linebreak in e.g.
poppler 0.73 now have 4 or 5 linebreaks in the same place with the
latest poppler.
I had a look at the changelog but I couldn't find any mention of this.
Are these expected changes? The new behaviour is causing some existing
pipelines to break, where people were using e.g. line offsets to
extract fragments of the text.
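
For illustration, the first two differences (the trailing '\f' and the
'\r\n' linebreaks) could probably be normalised on the caller's side
with something like the sketch below. The helper name
normalize_page_text is hypothetical and not part of poppler or
pdftools; it assumes the page text has already been copied into a
std::string. It does not address the extra linebreaks, since those
cannot be told apart from intentional blank lines after the fact.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a poppler or pdftools function): undo two of
// the whitespace differences described above on an already extracted
// page string.
std::string normalize_page_text(const std::string &text) {
    std::string out;
    out.reserve(text.size());

    // Turn every "\r\n" pair into a plain "\n"; a lone '\r' is kept.
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
            continue;  // skip the '\r', the following '\n' is copied next
        }
        out.push_back(text[i]);
    }

    // Drop a single trailing form feed, if present.
    if (!out.empty() && out.back() == '\f') {
        out.pop_back();
    }
    return out;
}
```

A caller would apply this to each page's text before computing line
offsets, which should give the same line count on Windows and
elsewhere, but only for documents not affected by the extra
linebreaks.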