[poppler] tweaking pdfto[html|xml] to avoid spaces within words + which spellcheck to use ...
Albretch Mueller
lbrtchx at gmail.com
Tue Jun 2 11:53:17 UTC 2020
Which option should be used to avoid results such as this one:
<a href="...#183">Per cep tual Re sponse .</a></text>
Or, which spellcheckers do you use in tandem with pdftohtml to
correct such spaces within words (and, ideally, spellcheck those
lines)?
The cause appears to lie either in the pdf file itself or in the text
extraction algorithm (based on phonemes?), because the starting and
ending characters of the words/meaningful sequences of characters are
never split off.
The LibreOffice spellcheck doesn't correct all such spaces
splitting words, and the same spaces also appear if you go:
okular > export as > text.
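
To make concrete the kind of post-processing I have in mind, here is a
minimal sketch in Python, not anything poppler provides. It greedily
merges runs of adjacent fragments whenever their concatenation is a
known word. The function name rejoin_fragments, the max_parts limit and
the tiny vocabulary set are all hypothetical; a real run would need a
proper word list.

```python
def rejoin_fragments(text, vocabulary, max_parts=4):
    """Merge runs of adjacent tokens whose concatenation is a known word.

    vocabulary is a set of lowercase words (hypothetical here; in practice
    it would be loaded from a real word list).
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        merged_len = 0
        # Try the longest run first so "Per cep tual" is joined as a whole.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate.lower() in vocabulary:
                out.append(candidate)
                merged_len = n
                break
        if merged_len:
            i += merged_len
        else:
            # No known word starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Illustrative vocabulary only (an assumption, not a real dictionary).
vocabulary = {"perceptual", "response"}
print(rejoin_fragments("Per cep tual Re sponse .", vocabulary))
```

A dictionary-driven merge like this can also join fragments that were
meant to stay separate (for example "to get her" becoming "together"),
so it is at best a partial workaround and not a fix for the extraction
itself.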
lbrtchx