[poppler] pdftotext font information

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu May 3 10:32:06 UTC 2018


Dear obsidian,

Too many posts about similar issues :-)
I'm not sure whether the poppler maintainers are interested in enhancing
pdftotext,
but Jeroen and I have recently been working on the cpp frontend to add similar
features.

In the latest version of poppler,
the cpp frontend can retrieve the list of words with their bounding boxes,
and it can also retrieve the bounding box of each glyph in a word.
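
For comparison, each word record has the same shape as a line printed by
"pdftotext -bbox" (see the sample line in the quoted message below).
Here is a minimal Python sketch of reading such a line, assuming one
<word> element per line. parse_word is a hypothetical helper written for
illustration; it is not part of poppler.

```python
import xml.etree.ElementTree as ET


def parse_word(line):
    """Parse one <word ...>text</word> line into (text, bbox).

    bbox is the tuple (xMin, yMin, xMax, yMax) as floats.
    Hypothetical helper for illustration, not a poppler API.
    """
    elem = ET.fromstring(line.strip())
    bbox = tuple(float(elem.get(key)) for key in ("xMin", "yMin", "xMax", "yMax"))
    return elem.text, bbox


# The sample line from the quoted message.
sample = (
    '<word xMin="359.852025" yMin="462.548936" '
    'xMax="365.689478" yMax="467.681498">foo</word>'
)
text, bbox = parse_word(sample)
```

The cpp frontend exposes the same kind of data directly, so no text parsing
is needed there.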

--

I also proposed a patch to retrieve the font family and point size:
https://lists.freedesktop.org/archives/poppler/2018-April/013035.html

It is probably still waiting for the maintainers' review. The discussion and
results can be found here:
https://github.com/ropensci/pdftools/issues/29

--

> - style, i.e. none, bold, italic

If the document producer had a bold font and used it in the document, such as
Helvetica-Bold,
the style can be found from the family name.
But if the document producer had no bold font and let the word processor
software synthesize the bold face,
it is difficult for a PDF renderer to recognize the text as bold,
because the emboldening is done by drawing the same glyph again with a subtle
shift.
A simple PDF renderer cannot distinguish "a normal font, drawn in layers"
from "a bold font".
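
The two cases can be sketched in Python as follows. This is only a rough
heuristic under my own assumptions: the Glyph record, both function names,
and the 0.5 point threshold are made up for illustration and are not part of
poppler. The first function guesses the style from the font name; the second
flags a word as "probably synthesized bold" when every glyph has a twin with
the same character at almost the same position.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: a character and its bounding box."""
    char: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def style_from_font_name(name):
    """Guess (bold, italic) from a font name such as 'Helvetica-Bold'."""
    # Embedded subset fonts often carry a prefix like 'ABCDEF+'; drop it.
    base = name.split("+", 1)[-1].lower()
    bold = "bold" in base
    italic = "italic" in base or "oblique" in base
    return bold, italic


def looks_synthetic_bold(glyphs, max_shift=0.5):
    """Return True if every glyph is overdrawn by a slightly shifted copy.

    max_shift is an assumed threshold in the same units as the boxes.
    """
    if not glyphs:
        return False

    def has_twin(glyph):
        # A twin is a different glyph object with the same character whose
        # box origin lies within max_shift on both axes.
        return any(
            other is not glyph
            and other.char == glyph.char
            and abs(other.x_min - glyph.x_min) <= max_shift
            and abs(other.y_min - glyph.y_min) <= max_shift
            for other in glyphs
        )

    return all(has_twin(glyph) for glyph in glyphs)
```

A real document may shift the copies by more than the threshold, or overdraw
only part of a word, so the result should be treated as a hint and not as a
reliable answer.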

Regards,
mpsuzuki

obsidian . wrote:
> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
> 
> Here's a sample line from the output:
>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
> 
> Is there a way to get font information for every word like:
> - font family, e.g. Verdana
> - style, i.e. none, bold, italic
> - size, e.g. font size 9
> 
> I'm using pdftotext version 0.55.0 on Windows.
> 
> 


