[poppler] pdftotext font information

obsidian . obsidian9993 at gmail.com
Tue May 8 16:04:10 UTC 2018


Hi Suzuki,

Have you noticed any problems while using the patched poppler-dump utility?



On Tue, May 8, 2018 at 2:25 AM, obsidian . <obsidian9993 at gmail.com> wrote:

> Thanks Suzuki.
>
> I was looking for something more tried, tested and "stable".
> I'm kind of surprised there's no other way to output char level
> information.
>
> On Sat, May 5, 2018 at 9:41 AM, Adam Reichold <adam.reichold at t-online.de>
> wrote:
>
>> Hello again,
>>
>> so I obviously forgot the attachment... |:-\ Sorry for that.
>>
>> Regards,
>> Adam
>>
>> On 05.05.2018 at 08:16, Adam Reichold wrote:
>> > Hello mpsuzuki,
>> >
>> > attached is a version of your patch with some inline comments.
>> >
>> > Generally speaking, I would say that some well-defined format like JSON
>> > or YAML would be preferable to the ad-hoc encoding?
>> >
>> > Best regards,
>> > Adam
>> >
>> > On 03.05.2018 at 13:50, suzuki toshiya wrote:
>> >> The current poppler-dump (a testing tool for cpp-frontend) has no option to
>> >> demonstrate the per-character bbox feature.
>> >> The attached patch adds an option to demonstrate it. (I'm not saying "this is
>> >> ready to use, please use it"; I want to understand your request and whether
>> >> existing features could cover some part of it.)
>> >>
>> >> The patched poppler-dump can work like this:
>> >>
>> >> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>> >> Page 1/1:
>> >> ---
>> >> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>> >>         [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>> >>         [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>> >>         [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>> >>         [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>> >>         [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>> >>         [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>> >> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>> >>         [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>> >>         [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>> >>         [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>> >>         [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>> >>         [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>> >>         [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>> >>         [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>> >> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>> >>         [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>> >>         [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>> >> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>> >>         [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>> >>         [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>> >>         [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>> >>         [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>> >> ...
>> >>
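Since JSON was suggested as an alternative to the ad-hoc encoding, here is a minimal Python sketch that reads output shaped like the sample above and re-emits it as JSON. It assumes the line format is exactly what the sample shows (word lines unindented, glyph lines indented). The names `LINE_RE` and `parse_glyph_list` are made up for illustration and are not part of poppler or the patch.

```python
import json
import re

# One line of the sample output, e.g.
#   [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
#           [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
# Assumption: this layout is inferred from the sample only.
LINE_RE = re.compile(
    r"^(?P<indent>\s*)\[(?P<label>.*)\] @ "
    r"\( x=(?P<x>\S+) y=(?P<y>\S+) w=(?P<w>\S+) h=(?P<h>\S+) \)\s*$"
)


def parse_glyph_list(text):
    """Group indented glyph lines under the preceding word line."""
    words = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skips "Page 1/1:", "---" and anything unrecognized
        box = {k: float(m.group(k)) for k in ("x", "y", "w", "h")}
        if m.group("indent"):
            # Indented line: a glyph box belonging to the latest word.
            if words:
                words[-1]["glyphs"].append(box)
        else:
            words.append({"text": m.group("label"), "bbox": box, "glyphs": []})
    return words


SAMPLE = """\
Page 1/1:
---
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
        [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
        [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
"""

words = parse_glyph_list(SAMPLE)
print(json.dumps(words, indent=2))
```

A word that itself contains brackets or an `@` sequence could confuse this pattern, which is one practical argument for emitting a well-defined format directly.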
>> >> Regards,
>> >> mpsuzuki
>> >>
>> >> suzuki toshiya wrote:
>> >>> Dear obsidian,
>> >>>
>> >>> Too many posts about similar issues :-)
>> >>> I'm not sure whether the poppler maintainers are interested in enhancing
>> >>> pdftotext, but recently Jeroen and I have been working on cpp-frontend to
>> >>> add similar features.
>> >>>
>> >>> In the latest version of poppler, cpp-frontend can retrieve the list of
>> >>> words with their bounding boxes, and it can retrieve the bounding box of
>> >>> each glyph in a word.
>> >>>
>> >>> --
>> >>>
>> >>> I also proposed a patch to retrieve the font family and point size:
>> >>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>> >>>
>> >>> It may be waiting for the maintainers' review. The discussion and results
>> >>> can be found here:
>> >>> https://github.com/ropensci/pdftools/issues/29
>> >>>
>> >>> --
>> >>>
>> >>>> - style, i.e. none, bold, italic
>> >>>
>> >>> If the document producer has a bold font and uses it in the document, such as
>> >>> Helvetica-Bold, it can be identified by the family name.
>> >>> But if the document producer has no bold font and lets the word processor
>> >>> synthesize an emboldened font, it is difficult for the PDF renderer to
>> >>> recognize it as emboldened, because the emboldening is done by drawing the
>> >>> same glyph again with a subtle shift.
>> >>> Simple PDF renderers cannot distinguish "normal font, but layered" from
>> >>> "emboldened font".
>> >>>
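To make the "same glyph with subtle shifting" point concrete, here is a rough Python sketch of a heuristic that flags glyphs drawn twice at nearly the same position. Everything in it is an assumption for illustration: the function name `synthetic_bold_pairs`, the `(char, x, y)` input shape, and the 0.5 point threshold do not come from poppler. It also assumes access to every glyph draw, which a text extractor may not expose if it merges such duplicates.

```python
def synthetic_bold_pairs(glyphs, max_shift=0.5):
    """Return index pairs (i, j) of identical glyphs drawn almost on top
    of each other.

    glyphs    -- list of (char, x, y) tuples, one per glyph draw
    max_shift -- hypothetical threshold in PDF points; not a poppler value
    """
    pairs = []
    for i, (c1, x1, y1) in enumerate(glyphs):
        for j in range(i + 1, len(glyphs)):
            c2, x2, y2 = glyphs[j]
            dx, dy = abs(x2 - x1), abs(y2 - y1)
            # Same character, moved a little but not by zero: a candidate
            # for emboldening by layering.
            if c1 == c2 and max(dx, dy) > 0 and dx <= max_shift and dy <= max_shift:
                pairs.append((i, j))
    return pairs


# "If" drawn once, using the glyph origins from the sample output.
normal = [("I", 72.0, 112.428), ("f", 75.996, 112.428)]
# The same word drawn a second time, shifted right by about 0.3 pt.
layered = normal + [("I", 72.3, 112.428), ("f", 76.296, 112.428)]

print(synthetic_bold_pairs(normal))
print(synthetic_bold_pairs(layered))
```

A heuristic like this cannot tell emboldening apart from other layering effects such as drop shadows, so at best it gives a hint rather than a style attribute.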
>> >>> Regards,
>> >>> mpsuzuki
>> >>>
>> >>> obsidian . wrote:
>> >>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>> >>>>
>> >>>> Here's a sample line from the output:
>> >>>>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
>> >>>>
>> >>>> Is there a way to get font information for every word like:
>> >>>> - font family, e.g. Verdana
>> >>>> - style, i.e. none, bold, italic
>> >>>> - size, e.g. font size 9
>> >>>>
>> >>>> I'm using pdftotext version 0.55.0 on Windows.
>> >>>>
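For reference, the `<word>` line quoted above carries only geometry and text. A small Python sketch that reads it with the standard library follows; the helper name `word_box` is hypothetical, and the snippet parses just the single sample element rather than a full `pdftotext -bbox` output file.

```python
import xml.etree.ElementTree as ET

# The sample line from the question, as one XML element.
SAMPLE = (
    '<word xMin="359.852025" yMin="462.548936" '
    'xMax="365.689478" yMax="467.681498">foo</word>'
)


def word_box(element):
    """Convert a <word> element into text plus x/y/width/height."""
    x_min, y_min, x_max, y_max = (
        float(element.get(k)) for k in ("xMin", "yMin", "xMax", "yMax")
    )
    return {
        "text": element.text,
        "x": x_min,
        "y": y_min,
        "w": x_max - x_min,
        "h": y_max - y_min,
    }


box = word_box(ET.fromstring(SAMPLE))
print(box)
```

The element has no attribute for font family, style, or size, so that information cannot be recovered from this output alone.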
>> >>>>
>> >>>
>> >>> _______________________________________________
>> >>> poppler mailing list
>> >>> poppler at lists.freedesktop.org
>> >>> https://lists.freedesktop.org/mailman/listinfo/poppler
>> >>>
>> >
>> >
>>
>>
>