[poppler] tweaking pdfto[html|xml] to avoid spaces within words + which spellcheck to use ...

Albretch Mueller lbrtchx at gmail.com
Tue Jun 2 11:53:17 UTC 2020


 Which option should be used to avoid results such as:

 <a href="...#183">Per cep tual  Re sponse .</a></text>

 Or, which spellcheckers do you use in tandem with pdftohtml to
correct such spaces within words (and, optimally, spellcheck those
lines)?
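
 To make the second question concrete, here is a minimal sketch of the
kind of post-processing I have in mind. It is not something pdftohtml or
any spellchecker provides: the function name rejoin_fragments, the
max_parts limit and the tiny vocabulary set are all made up for
illustration, and a real run would need a proper word list. It greedily
merges adjacent fragments whenever their concatenation is a known word.

```python
def rejoin_fragments(text, vocabulary, max_parts=4):
    """Merge runs of adjacent tokens whose concatenation is a known word.

    `vocabulary` is a set of lowercase words (hypothetical; supply your own).
    `max_parts` caps how many fragments may be glued into a single word.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        # Try the longest run of fragments first, down to a pair.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate.lower() in vocabulary:
                merged = (candidate, n)
                break
        if merged:
            out.append(merged[0])
            i += merged[1]
        else:
            # No merge found: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# Illustrative vocabulary only; a real word list would be much larger.
vocabulary = {"perceptual", "response"}
print(rejoin_fragments("Per cep tual  Re sponse .", vocabulary))
```

 The obvious weakness is false merges: with a full dictionary, a
legitimate sequence such as "to get her" would be glued into "together",
so this would probably still need a spellchecker or a manual pass
afterwards.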

 It appears to be caused by something either within the PDF file or in
the text extraction algorithm (based on phonemes?), because the starting
and ending characters of the words/meaningful sequences of characters
are never split.

 The LibreOffice spellchecker doesn't correct all of these spaces
splitting words. The same spaces also appear if you go: okular >
export as > text.

 lbrtchx

