[poppler] Small issue with extracting prices from PDF

Jean-Sebastien Vachon vachonjs at gmail.com
Wed Dec 12 21:26:12 UTC 2018


Hi all,

I just started using the pdftotext Python module to extract text from PDFs
and it really does look good, so thanks for your hard work.

The only issue I am having right now is regarding the extraction of pricing
information such as within a menu. A lot of restaurants won't use a dot to
separate dollars and cents but will rely on a slightly smaller font size
for the cents. As a result, an item listed at 4.00$ comes out as 400.

Is there any way to detect such changes in font size/color and treat the
differently styled runs as separate words?

I am not sure whether it would be better to support this on the Python side
or directly within poppler.
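
To illustrate the behaviour I have in mind, here is a rough Python sketch.
It assumes the font size of each character is available as (character, size)
pairs. That input format, the function name split_on_font_change, and the
tolerance parameter are all hypothetical and are not part of pdftotext or
poppler; this only shows the splitting logic, not how to obtain the sizes.

```python
def split_on_font_change(glyphs, tolerance=0.5):
    """Group (char, font_size) pairs into words.

    A new word starts at whitespace, or whenever the font size differs
    from the previous character's size by more than `tolerance` points.
    The input format is an assumption made for illustration only.
    """
    words = []
    current = ""
    prev_size = None
    for char, size in glyphs:
        if char.isspace():
            # Whitespace always ends the current word.
            if current:
                words.append(current)
            current = ""
            prev_size = None
            continue
        if current and prev_size is not None and abs(size - prev_size) > tolerance:
            # Font size changed inside a run of characters: split here.
            words.append(current)
            current = ""
        current += char
        prev_size = size
    if current:
        words.append(current)
    return words


# Made-up data: "4" in a 12 pt font, "00" in a smaller 8 pt font, "$" at 12 pt.
glyphs = [("4", 12.0), ("0", 8.0), ("0", 8.0), ("$", 12.0)]
print(split_on_font_change(glyphs))
```

With something like this, the dollars and cents would come out as separate
tokens, and the caller could decide how to join them.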

Thanks

