[poppler] Small issue with extracting prices from PDF
Jean-Sebastien Vachon
vachonjs at gmail.com
Wed Dec 12 21:26:12 UTC 2018
Hi all,
I just started using the pdftotext Python module to extract text from PDFs,
and it really does look good, so thanks for your hard work.
The only issue I am having right now concerns extracting pricing
information, such as from a menu. A lot of restaurants don't use a dot to
separate dollars and cents but rely on a slightly smaller font size
for the cents. As a result, an item listed at 4.00$ comes out as 400.
Is there any way to detect such changes in font size/color and treat them
as separate words?
I am not sure whether it would be better to support this on the Python side
or directly within poppler.
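To illustrate the behavior I am after, here is a rough sketch. It assumes
the extractor could hand back each character together with its font size,
which as far as I can tell the pdftotext module does not do today. The
function name, the (character, size) input format and the 0.85 ratio are
all made up for the example.

```python
from typing import List, Tuple

# Hypothetical input: (character, font size in points) pairs for one line.
# This per-character size information is an assumption for illustration.
Glyph = Tuple[str, float]


def split_price_on_font_change(glyphs: List[Glyph], ratio: float = 0.85) -> str:
    """Insert '.' where a digit is clearly smaller than the digit before it."""
    out = []
    prev = None  # previous (character, size) pair, if any
    for ch, size in glyphs:
        # A digit noticeably smaller than the preceding digit is treated
        # as the start of the cents part.
        if (prev is not None and prev[0].isdigit() and ch.isdigit()
                and size < prev[1] * ratio):
            out.append(".")
        out.append(ch)
        prev = (ch, size)
    return "".join(out)


# A menu price set as a large "4" followed by a smaller "00" and a "$".
menu_item = [("4", 12.0), ("0", 8.0), ("0", 8.0), ("$", 12.0)]
print(split_price_on_font_change(menu_item))  # prints 4.00$
```

The same idea could insert a space instead of a dot if treating the two
parts as separate words is preferable.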
Thanks