[poppler] pdftotext font information

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Tue May 8 16:33:23 UTC 2018


Dear Adam,

I don't have sufficient time right now to draft something, but

> Generally speaking, I would say that some well-defined format like JSON
> or YAML would be preferable to the ad-hoc encoding?

I agree. I had no special reason to use an ad-hoc format.

Between JSON and YAML I would prefer JSON, but considering that
pdftotext already has XML output, XML output might be expected too?
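
For illustration only, here is a minimal sketch of what a JSON encoding
of the same data could look like. It reads the ad-hoc output of the
patched poppler-dump (the sample quoted below) and re-emits it as JSON.
The function name parse_glyph_list and the key names "text", "bbox" and
"glyphs" are assumptions made for this sketch; they are not part of the
patch or of any poppler API.

```python
import json
import re

# Matches lines such as "[Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )".
# Indented lines describe a glyph; unindented lines describe a word.
LINE_RE = re.compile(
    r"^(?P<indent>\s*)\[(?P<label>.*)\] @ \( "
    r"x=(?P<x>\S+) y=(?P<y>\S+) w=(?P<w>\S+) h=(?P<h>\S+) \)$"
)

# A short excerpt of the ad-hoc output, without the mail quoting.
SAMPLE = """\
Page 1/1:
---
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
        [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
        [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
"""


def parse_glyph_list(dump):
    """Convert the ad-hoc word/glyph listing into a list of dicts.

    The key names are hypothetical; they only illustrate one possible
    JSON layout.
    """
    words = []
    for line in dump.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip "Page 1/1:", "---", "..." and blank lines
        bbox = {k: float(m.group(k)) for k in ("x", "y", "w", "h")}
        if m.group("indent"):
            # Indented line: a glyph belonging to the most recent word.
            if words:
                words[-1]["glyphs"].append(bbox)
        else:
            words.append({"text": m.group("label"), "bbox": bbox, "glyphs": []})
    return words


if __name__ == "__main__":
    print(json.dumps(parse_glyph_list(SAMPLE), indent=2))
```

The parser has to rely on indentation to tell a word apart from a
glyph index, which is one reason a well-defined format is easier for
consumers than the ad-hoc one.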

Regards,
mpsuzuki


Adam Reichold wrote:
> Hello mpsuzuki,
> 
> attached is a version of your patch with some inline comments.
> 
> Generally speaking, I would say that some well-defined format like JSON
> or YAML would be preferable to the ad-hoc encoding?
> 
> Best regards,
> Adam
> 
> Am 03.05.2018 um 13:50 schrieb suzuki toshiya:
>> The current poppler-dump (a testing tool for the cpp frontend) has no option to
>> demonstrate the per-character bbox feature.
>> The attached patch adds an option to demonstrate it (I'm not saying "this is ready
>> to use, please use it"; I want to understand your request and whether existing
>> features could cover some part of it).
>>
>> The patched poppler-dump can work like this:
>>
>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>> Page 1/1:
>> ---
>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>>         [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>>         [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>>         [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>>         [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>>         [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>>         [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>>         [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>>         [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>>         [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>>         [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>>         [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>>         [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>>         [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>>         [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>>         [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>>         [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>>         [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>>         [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>>         [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>> ...
>>
>> Regards,
>> mpsuzuki
>>
>> suzuki toshiya wrote:
>>> Dear obsidian,
>>>
>>> Too many posts about similar issues :-)
>>> I'm not sure whether the poppler maintainers are interested in enhancing
>>> pdftotext,
>>> but recently Jeroen and I have been working on the cpp frontend to add similar features.
>>>
>>> In the latest version of poppler,
>>> the cpp frontend can retrieve the list of words with their bounding boxes,
>>> and it can also retrieve the bounding box of each glyph in a word.
>>>
>>> --
>>>
>>> I also proposed a patch to retrieve the font family and point size:
>>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>>>
>>> It may still be waiting for the maintainers' review. The discussion and results
>>> can be found here:
>>> https://github.com/ropensci/pdftools/issues/29
>>>
>>> --
>>>
>>>> - style, i.e. none, bold, italic
>>> If the document producer has a bold font, such as Helvetica-Bold, and used it
>>> in the document,
>>> it can be identified from the family name.
>>> But if the document producer has no bold font and lets the word processor
>>> synthesize the emboldened glyphs,
>>> it is difficult for the PDF renderer to recognize them as bold,
>>> because the emboldening is done by drawing the same glyph again with a subtle shift.
>>> Simple PDF renderers would be unable to distinguish "normal font but layered"
>>> from "emboldened font".
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> obsidian . wrote:
>>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>>>>
>>>> Here's a sample line from the output:
>>>>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
>>>>
>>>> Is there a way to get font information for every word like:
>>>> - font family, e.g. Verdana
>>>> - style, i.e. none, bold, italic
>>>> - size, e.g. font size 9
>>>>
>>>> I'm using pdftotext version 0.55.0 on Windows.
>>>>
>>>>
>>> _______________________________________________
>>> poppler mailing list
>>> poppler at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>>
> 


