[poppler] pdftotext font information
suzuki toshiya
mpsuzuki at hiroshima-u.ac.jp
Tue May 8 16:21:50 UTC 2018
Sorry, I will not have sufficient time to work on poppler until the end of
this month.
My patch for poppler-dump was just a proof of concept, and I have not used it
much yet.
Have you run into any problems with it?
obsidian . wrote:
> Hi Suzuki,
>
> have you noticed any problems while using the patched poppler-dump utility?
>
>
>
> On Tue, May 8, 2018 at 2:25 AM, obsidian . <obsidian9993 at gmail.com> wrote:
> Thanks Suzuki.
>
> I was looking for something more tried, tested and "stable".
> I'm kind of surprised there's no other way to output char level information.
>
> On Sat, May 5, 2018 at 9:41 AM, Adam Reichold <adam.reichold at t-online.de> wrote:
> Hello again,
>
> so I obviously forgot the attachment... |:-\ Sorry for that.
>
> Regards,
> Adam
>
> On 05.05.2018 at 08:16, Adam Reichold wrote:
>> Hello mpsuzuki,
>>
>> attached is a version of your patch with some inline comments.
>>
>> Generally speaking, I would say that some well-defined format like JSON
>> or YAML would be preferable to the ad-hoc encoding?
>>
>> Best regards,
>> Adam
>>
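To make the JSON suggestion concrete, here is a minimal sketch of how one word and its glyph boxes could be encoded. The key names ("page", "words", "text", "bbox", "glyphs") are hypothetical and are not taken from the patch; the numbers are copied from the sample output quoted further down in this thread.

```python
import json

# Hypothetical record layout; the key names are illustrative only and
# do not come from the poppler-dump patch.
def word_to_dict(text, bbox, glyph_bboxes):
    keys = ("x", "y", "w", "h")
    return {
        "text": text,
        "bbox": dict(zip(keys, bbox)),
        "glyphs": [dict(zip(keys, g)) for g in glyph_bboxes],
    }

# Values for the word "If" as printed by the patched poppler-dump.
word = word_to_dict(
    "If",
    (72, 112.428, 7.992, 10.8),
    [(72, 112.428, 3.996, 10.8), (75.996, 112.428, 3.996, 10.8)],
)

encoded = json.dumps({"page": 1, "words": [word]}, indent=2)
print(encoded)
```

A consumer could then read the result with any JSON parser instead of matching the bracketed text format by hand.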
>> On 03.05.2018 at 13:50, suzuki toshiya wrote:
>>> The current poppler-dump (a testing tool for the cpp-frontend) has no option to
>>> demonstrate the per-character bbox feature.
>>> The attached patch adds such an option. (I'm not saying "this is ready to use,
>>> please use it"; I want to understand your request and whether existing
>>> features could cover part of it.)
>>>
>>> The patched poppler-dump can work like this:
>>>
>>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>>> Page 1/1:
>>> ---
>>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>>> [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>>> [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>>> [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>>> [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>>> [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>>> [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>>> [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>>> [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>>> [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>>> [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>>> [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>>> [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>>> [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>>> [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>>> [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>>> [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>>> [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>>> [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>>> [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>>> ...
>>>
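The output above can be read back into structured records with a few lines of Python. This is only a sketch against the sample as shown here: it assumes that every word line is followed by one glyph line per character, labelled 0, 1, 2, and so on, which may not hold for ligatures or other scripts. The function name `parse_glyph_list` is made up for this example.

```python
import re

# Matches lines such as "[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )".
LINE_RE = re.compile(
    r"\[(?P<label>.*)\] @ \( x=(?P<x>\S+) y=(?P<y>\S+) w=(?P<w>\S+) h=(?P<h>\S+) \)"
)

def parse_glyph_list(text):
    """Group glyph lines under the word line that precedes them."""
    words = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip "Page 1/1:", "---", "..." and blank lines
        box = tuple(float(m[k]) for k in ("x", "y", "w", "h"))
        label = m["label"]
        current = words[-1] if words else None
        # Assumption: a glyph line carries the next expected index of the
        # current word, and a word has as many glyphs as characters.
        if (current is not None
                and len(current["glyphs"]) < len(current["text"])
                and label == str(len(current["glyphs"]))):
            current["glyphs"].append(box)
        else:
            words.append({"text": label, "bbox": box, "glyphs": []})
    return words

sample = """\
Page 1/1:
---
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
[0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
[1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
[this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
[0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
[1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
[2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
[3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
"""

words = parse_glyph_list(sample)
for w in words:
    print(w["text"], w["bbox"], len(w["glyphs"]))
```

Counting expected glyph lines, rather than treating every numeric label as a glyph, keeps a word such as "0" from being mistaken for a glyph index in most cases.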
>>> Regards,
>>> mpsuzuki
>>>
>>> suzuki toshiya wrote:
>>>> Dear obsidian,
>>>>
>>>> Too many posts about similar issues :-)
>>>> I'm not sure whether the poppler maintainers are interested in enhancing
>>>> pdftotext,
>>>> but recently Jeroen and I have been working on the cpp-frontend to add similar features.
>>>>
>>>> In the latest version of poppler,
>>>> the cpp-frontend can retrieve the list of words with their bounding boxes,
>>>> and it can also retrieve the bounding box of each glyph in a word.
>>>>
>>>> --
>>>>
>>>> Also, I proposed a patch to retrieve the font family and point size:
>>>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>>>>
>>>> It may still be waiting for the maintainers' review. The discussion and results
>>>> can be found here:
>>>> https://github.com/ropensci/pdftools/issues/29
>>>>
>>>> --
>>>>
>>>>> - style, i.e. none, bold, italic
>>>> If the document producer has a bold font and uses it in the document, such as
>>>> Helvetica-Bold,
>>>> it can be identified from the family name.
>>>> But if the document producer has no bold font and lets the word processor
>>>> software synthesize an emboldened font,
>>>> it would be difficult for the PDF renderer to recognize it as bold,
>>>> because the emboldening is done by drawing the same glyph again with a subtle shift.
>>>> Simple PDF renderers would be unable to distinguish "normal font but layered"
>>>> from "emboldened font".
>>>>
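As a rough illustration of the "same glyph with a subtle shift" case, the sketch below flags pairs of identical characters whose boxes nearly coincide. It is a heuristic under assumptions, not something poppler does: whether an extractor reports both copies of an overprinted glyph at all is not guaranteed, the function name and the 0.2 threshold are made up, and the shifted second "I" in the sample data is invented for the example.

```python
def find_overprinted_glyphs(glyphs, max_shift_ratio=0.2):
    """Return index pairs (i, j) of identical characters drawn almost on top of each other.

    Each glyph is a tuple (char, x, y, w, h). Two glyphs count as a pair when
    they carry the same character and their origins differ by at most
    max_shift_ratio times the narrower glyph width on both axes.
    """
    pairs = []
    for i, (ch_a, xa, ya, wa, _ha) in enumerate(glyphs):
        for j in range(i + 1, len(glyphs)):
            ch_b, xb, yb, wb, _hb = glyphs[j]
            if ch_a != ch_b:
                continue
            limit = max_shift_ratio * min(wa, wb)
            if abs(xa - xb) <= limit and abs(ya - yb) <= limit:
                pairs.append((i, j))
    return pairs

# Hypothetical data: the second "I" is an invented copy shifted by 0.3 units.
glyphs = [
    ("I", 72.0, 112.428, 3.996, 10.8),
    ("I", 72.3, 112.428, 3.996, 10.8),
    ("f", 75.996, 112.428, 3.996, 10.8),
]

pairs = find_overprinted_glyphs(glyphs)
print(pairs)
```

Doubled letters in ordinary text (for example "ll") are not flagged, because the second letter is shifted by roughly a full glyph width rather than a fraction of one.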
>>>> Regards,
>>>> mpsuzuki
>>>>
>>>> obsidian . wrote:
>>>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>>>>>
>>>>> Here's a sample line from the output:
>>>>> <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
>>>>>
>>>>> Is there a way to get font information for every word like:
>>>>> - font family, e.g. Verdana
>>>>> - style, i.e. none, bold, italic
>>>>> - size, e.g. font size 9
>>>>>
>>>>> I'm using pdftotext version 0.55.0 on Windows.
>>>>>
>>>>>
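For reference, the sample `<word>` line from the question can be read with Python's standard XML parser. This sketch only shows what the sample line carries: the text and its position. The line has no font attributes, which is exactly the gap the question is about. The helper name `parse_word` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample line quoted in the question.
line = ('<word xMin="359.852025" yMin="462.548936" '
        'xMax="365.689478" yMax="467.681498">foo</word>')

def parse_word(element_text):
    """Convert one <word> element into text plus an x/y/w/h box."""
    elem = ET.fromstring(element_text)
    x_min, y_min, x_max, y_max = (
        float(elem.get(k)) for k in ("xMin", "yMin", "xMax", "yMax")
    )
    return {
        "text": elem.text,
        "x": x_min,
        "y": y_min,
        "w": x_max - x_min,
        "h": y_max - y_min,
    }

word = parse_word(line)
print(word)
```

When parsing a whole output file rather than a single line, the enclosing document may declare an XML namespace, in which case element lookups need the namespace-qualified tag name.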
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler