[poppler] pdftotext font information

obsidian . obsidian9993 at gmail.com
Tue May 8 16:57:39 UTC 2018


Hi Suzuki,

No, I was just wondering because you said:
> I'm not saying "this is ready to use, please use"

I tried it and it seems to be working OK. XML output would be much better,
though, since that's what "pdftotext -bbox" produces.
Is there a quick/easy way to achieve the same with XML output?
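
As a stopgap, the text output could in principle be post-processed into XML.
Below is a minimal Python sketch, assuming the dump format stays exactly as in
the sample quoted further down (unindented "[word] @ ( x= y= w= h= )" lines
followed by indented "[index] @ ( ... )" glyph lines). The function name
glyph_dump_to_xml, the <page>/<word>/<char> element names and the "text"
attribute are made up for illustration; only the xMin/yMin/xMax/yMax attribute
names are borrowed from the pdftotext -bbox output. It also assumes one glyph
per character of the word, which will not hold for ligatures.

```python
import re
import xml.etree.ElementTree as ET

# Matches both word lines and indented per-glyph lines of the dump format.
LINE_RE = re.compile(
    r"^(?P<indent>\s*)\[(?P<label>.*)\] @ "
    r"\( x=(?P<x>\S+) y=(?P<y>\S+) w=(?P<w>\S+) h=(?P<h>\S+) \)\s*$"
)


def _set_bbox(elem, match):
    """Store the box as xMin/yMin/xMax/yMax, converting from x/y/w/h."""
    x, y, w, h = (float(match.group(k)) for k in "xywh")
    elem.set("xMin", f"{x:.6f}")
    elem.set("yMin", f"{y:.6f}")
    elem.set("xMax", f"{x + w:.6f}")
    elem.set("yMax", f"{y + h:.6f}")


def glyph_dump_to_xml(dump_text):
    """Convert the text of a glyph-list dump into an XML string (hypothetical schema)."""
    root = ET.Element("page")
    word = None
    word_text = ""
    for line in dump_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skips "Page 1/1:", "---", "..." and blank lines
        label = match.group("label")
        if not match.group("indent"):
            # Unindented line: start of a new word.
            word = ET.SubElement(root, "word")
            word_text = label
            word.set("text", word_text)
            _set_bbox(word, match)
        elif word is not None and label.isdigit():
            # Indented line: bounding box of one glyph of the current word.
            char = ET.SubElement(word, "char")
            _set_bbox(char, match)
            index = int(label)
            if index < len(word_text):
                char.text = word_text[index]  # assumes one glyph per character
    return ET.tostring(root, encoding="unicode")
```

Usage would be along the lines of feeding the captured output of the patched
tool to glyph_dump_to_xml() and writing the returned string to a file. A
proper solution would emit XML from the C++ side instead of parsing text.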


On Tue, May 8, 2018 at 7:21 PM, suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> wrote:

> Sorry, I will not have sufficient time to work on poppler until the end of
> this month.
> My patch for poppler-dump was just a proof of concept, and I have not used
> it much yet.
> Have you run into any problems?
>
> obsidian . wrote:
> > Hi Suzuki,
> >
> > have you noticed any problems while using the patched poppler-dump utility?
> >
> >
> >
> > On Tue, May 8, 2018 at 2:25 AM, obsidian . <obsidian9993 at gmail.com> wrote:
> > Thanks Suzuki.
> >
> > I was looking for something more tried, tested and "stable".
> > I'm kind of surprised there's no other way to output character-level information.
> >
> > On Sat, May 5, 2018 at 9:41 AM, Adam Reichold <adam.reichold at t-online.de> wrote:
> > Hello again,
> >
> > so I obviously forgot the attachment... |:-\ Sorry for that.
> >
> > Regards,
> > Adam
> >
> > Am 05.05.2018 um 08:16 schrieb Adam Reichold:
> >> Hello mpsuzuki,
> >>
> >> attached is a version of your patch with some inline comments.
> >>
> >> Generally speaking, I would say that some well-defined format like JSON
> >> or YAML would be preferable to the ad-hoc encoding?
> >>
> >> Best regards,
> >> Adam
> >>
> >> Am 03.05.2018 um 13:50 schrieb suzuki toshiya:
> >>> The current poppler-dump (a testing tool for the cpp-frontend) has no option to
> >>> demonstrate the per-character bbox feature.
> >>> The attached patch adds an option to demonstrate it (I'm not saying "this is ready
> >>> to use, please use"; I want to understand your request and whether existing
> >>> features could cover some part of it).
> >>>
> >>> The patched poppler-dump can work like this:
> >>>
> >>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
> >>> Page 1/1:
> >>> ---
> >>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
> >>>         [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
> >>>         [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
> >>>         [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
> >>>         [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
> >>>         [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
> >>>         [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
> >>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
> >>>         [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
> >>>         [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
> >>>         [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
> >>>         [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
> >>>         [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
> >>>         [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
> >>>         [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
> >>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
> >>>         [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
> >>>         [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
> >>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
> >>>         [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
> >>>         [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
> >>>         [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
> >>>         [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
> >>> ...
> >>>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> suzuki toshiya wrote:
> >>>> Dear obsidian,
> >>>>
> >>>> Too many posts about similar issues :-)
> >>>> I'm not sure whether the poppler maintainers are interested in enhancing
> >>>> pdftotext,
> >>>> but recently Jeroen and I have been working on the cpp-frontend to add similar features.
> >>>>
> >>>> In the latest version of poppler,
> >>>> the cpp-frontend can retrieve the list of words with their bounding boxes,
> >>>> and it can retrieve the bounding box of each glyph in a word.
> >>>>
> >>>> --
> >>>>
> >>>> also I proposed a patch to retrieve the font family and point size:
> >>>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
> >>>>
> >>>> It may still be waiting for the maintainers' review. The discussion and results
> >>>> can be found here:
> >>>> https://github.com/ropensci/pdftools/issues/29
> >>>>
> >>>> --
> >>>>
> >>>>> - style, i.e. none, bold, italic
> >>>> If the document producer has a bold font and uses it in the document, such as
> >>>> Helvetica-Bold,
> >>>> it can be identified from the family name.
> >>>> But if the document producer has no bold font and lets the word processor
> >>>> software synthesize emboldened glyphs,
> >>>> it would be difficult for the PDF renderer to recognize them as a bold font,
> >>>> because the emboldening is done by drawing the same glyph again with a subtle shift.
> >>>> Simple PDF renderers would be unable to distinguish "normal font but layered"
> >>>> from "emboldened font".
> >>>>
> >>>> Regards,
> >>>> mpsuzuki
> >>>>
> >>>> obsidian . wrote:
> >>>>> I'm using "pdftotext -bbox file.pdf" to convert a PDF file into HTML.
> >>>>>
> >>>>> Here's a sample line from the output:
> >>>>>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
> >>>>>
> >>>>> Is there a way to get font information for every word like:
> >>>>> - font family, e.g. Verdana
> >>>>> - style, i.e. none, bold, italic
> >>>>> - size, e.g. font size 9
> >>>>>
> >>>>> I'm using pdftotext version 0.55.0 on Windows.
> >>>>>
> >>>>>
> >>>> _______________________________________________
> >>>> poppler mailing list
> >>>> poppler at lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/poppler