<div dir="ltr">Hi Suzuki,<div><br></div><div>No, I was wondering because you said:</div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">> I'm not saying "this is ready </span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">to use, please use"</span><br></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">I tried it and it seems to be working OK. 
XML output would be much better, though, since that's what pdftotext outputs.</span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Is there a quick/easy way to achieve the same with XML output? </span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, May 8, 2018 at 7:21 PM, suzuki toshiya <span dir="ltr"><<a href="mailto:mpsuzuki@hiroshima-u.ac.jp" target="_blank">mpsuzuki@hiroshima-u.ac.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Sorry, I will not have sufficient time to work on poppler until the end of<br>
this month.<br>
My patch for poppler-dump was just a proof of concept, and I have not used it<br>
much yet.<br>
Have you run into any problems?<br>
<span class=""><br>
obsidian . wrote:<br>
> Hi Suzuki,<br>
> <br>
> have you noticed any problems while using the patched poppler-dump utility?<br>
> <br>
> <br>
> <br>
</span><span class="">> On Tue, May 8, 2018 at 2:25 AM, obsidian . <<a href="mailto:obsidian9993@gmail.com">obsidian9993@gmail.com</a><<wbr>mailto:<a href="mailto:obsidian9993@gmail.com">obsidian9993@gmail.com</a>><wbr>> wrote:<br>
> Thanks Suzuki.<br>
> <br>
> I was looking for something more tried, tested and "stable".<br>
> I'm kind of surprised there's no other way to output char level information.<br>
> <br>
</span><div><div class="h5">> On Sat, May 5, 2018 at 9:41 AM, Adam Reichold <<a href="mailto:adam.reichold@t-online.de">adam.reichold@t-online.de</a><<wbr>mailto:<a href="mailto:adam.reichold@t-online.de">adam.reichold@t-online.<wbr>de</a>>> wrote:<br>
> Hello again,<br>
> <br>
> so I obviously forgot the attachment... |:-\ Sorry for that.<br>
> <br>
> Regards,<br>
> Adam<br>
> <br>
> Am 05.05.2018 um 08:16 schrieb Adam Reichold:<br>
>> Hello mpsuzuki,<br>
>><br>
>> attached is a version of your patch with some inline comments.<br>
>><br>
>> Generally speaking, I would say that some well-defined format like JSON<br>
>> or YAML would be preferable to the ad-hoc encoding?<br>
>><br>
>> Best regards,<br>
>> Adam<br>
>><br>
>> Am 03.05.2018 um 13:50 schrieb suzuki toshiya:<br>
>>> The current poppler-dump (a testing tool for the cpp frontend) has no option to<br>
>>> demonstrate the per-character bbox feature.<br>
>>> The attached patch adds an option to demonstrate it (I'm not saying "this is ready<br>
>>> to use, please use"; I want to understand your request and whether existing<br>
>>> features could cover some part of it).<br>
>>><br>
>>> The patched poppler-dump can work like this:<br>
>>><br>
>>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf<br>
>>> Page 1/1:<br>
>>> ---<br>
>>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )<br>
>>>         [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )<br>
>>>         [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )<br>
>>>         [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )<br>
>>>         [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )<br>
>>>         [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )<br>
>>>         [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )<br>
>>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )<br>
>>>         [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )<br>
>>>         [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )<br>
>>>         [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )<br>
>>>         [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )<br>
>>>         [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )<br>
>>>         [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )<br>
>>>         [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )<br>
>>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )<br>
>>>         [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )<br>
>>>         [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )<br>
>>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )<br>
>>>         [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )<br>
>>>         [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )<br>
>>>         [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )<br>
>>>         [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )<br>
>>> ...<br>
>>><br>
>>> Regards,<br>
>>> mpsuzuki<br>
>>><br>
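A rough sketch of how the glyph list printed above could be turned into XML, assuming the patched poppler-dump output keeps exactly the format shown in this thread (word lines unindented, glyph lines indented).<br>
The function name glyph_list_to_xml, the element names page/word/glyph and the attribute names are made up for illustration and only loosely follow the pdftotext -bbox word element. It is also an assumption that glyph index i corresponds to character i of the word.<br>

```python
import re
import xml.etree.ElementTree as ET

# One line of the glyph list: optional indent, [label], then the box.
# Groups: 1=indent, 2=label, 3=x, 4=y, 5=w, 6=h
LINE_RE = re.compile(
    r"^(\s*)\[(.*)\] @ \( x=(\S+) y=(\S+) w=(\S+) h=(\S+) \)\s*$"
)


def _set_bbox(elem, match):
    """Store the box as xMin/yMin/xMax/yMax, formatted with six decimals."""
    x, y, w, h = (float(match.group(i)) for i in range(3, 7))
    elem.set("xMin", f"{x:.6f}")
    elem.set("yMin", f"{y:.6f}")
    elem.set("xMax", f"{x + w:.6f}")
    elem.set("yMax", f"{y + h:.6f}")


def glyph_list_to_xml(dump_text):
    """Convert glyph-list text into an XML string (hypothetical schema)."""
    root = ET.Element("page")
    word = None
    for line in dump_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skips "Page 1/1:", "---", "..." and anything unrecognised
        indent, label = match.group(1), match.group(2)
        if not indent:
            # Unindented line: a new word, the label is the word text.
            word = ET.SubElement(root, "word", {"text": label})
            _set_bbox(word, match)
        elif word is not None and label.isdecimal():
            # Indented line: a glyph of the current word, the label is its index.
            glyph = ET.SubElement(word, "glyph", {"index": label})
            text = word.get("text")
            idx = int(label)
            if idx < len(text):
                glyph.set("char", text[idx])  # assumes glyph i == character i
            _set_bbox(glyph, match)
    return ET.tostring(root, encoding="unicode")
```

Feeding the text printed by poppler-dump --show-glyph-list to this function would give one word element per word with nested glyph elements; xMax and yMax are computed as x+w and y+h.<br>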
>>> suzuki toshiya wrote:<br>
>>>> Dear obsidian,<br>
>>>><br>
>>>> Too many posts about similar issues :-)<br>
>>>> I'm not sure whether the poppler maintainers are interested in enhancing<br>
>>>> pdftotext,<br>
>>>> but recently Jeroen and I have been working on the cpp frontend to add similar features.<br>
>>>><br>
>>>> In the latest version of poppler,<br>
>>>> the cpp frontend has a feature to retrieve the list of words with their bounding boxes,<br>
>>>> and it can retrieve the bounding box for each glyph in a word.<br>
>>>><br>
>>>> --<br>
>>>><br>
>>>> also I proposed a patch to retrieve the font family and point size:<br>
>>>> <a href="https://lists.freedesktop.org/archives/poppler/2018-April/013035.html" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>archives/poppler/2018-April/<wbr>013035.html</a><br>
>>>><br>
>>>> It may still be waiting for the maintainers' review. The discussion and results can be<br>
>>>> found here:<br>
>>>> <a href="https://github.com/ropensci/pdftools/issues/29" rel="noreferrer" target="_blank">https://github.com/ropensci/<wbr>pdftools/issues/29</a><br>
>>>><br>
>>>> --<br>
>>>><br>
>>>>> - style, i.e. none, bold, italic<br>
>>>> If the document producer has a bold font and uses it in the document, such as<br>
>>>> Helvetica-Bold,<br>
>>>> it can be identified by the family name.<br>
>>>> But if the document producer has no bold font and lets the word processor<br>
>>>> software synthesize an emboldened font,<br>
>>>> it would be difficult for the PDF renderer to recognize it as emboldened,<br>
>>>> because the emboldening is done by drawing the same glyph again with a subtle shift.<br>
>>>> Simple PDF renderers would be unable to distinguish "normal font but layered"<br>
>>>> from "emboldened font".<br>
>>>><br>
>>>> Regards,<br>
>>>> mpsuzuki<br>
>>>><br>
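The "same glyph drawn again with a subtle shift" pattern described above could be looked for with a heuristic along these lines.<br>
This is only a sketch: Glyph, find_layered_glyphs and the 0.1 threshold are invented for illustration, the input is assumed to be raw glyph placements in drawing order, and nothing here is a poppler API or a claim about what poppler reports.<br>

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """A glyph placement: the character and its box (hypothetical input type)."""
    char: str
    x: float
    y: float
    w: float
    h: float


def find_layered_glyphs(glyphs: List[Glyph], max_shift_ratio: float = 0.1) -> List[int]:
    """Return indexes of glyphs that repeat the previous glyph almost in place."""
    hits = []
    for i in range(1, len(glyphs)):
        prev, cur = glyphs[i - 1], glyphs[i]
        if cur.char != prev.char:
            continue
        # Allowed shift, relative to the narrower of the two boxes.
        limit = max_shift_ratio * min(prev.w, cur.w)
        if abs(cur.x - prev.x) <= limit and abs(cur.y - prev.y) <= limit:
            hits.append(i)
    return hits
```

A genuinely doubled letter such as "ll" is shifted by roughly a full glyph width, so it falls outside the threshold; only near-coincident repeats are flagged. The check compares adjacent glyphs only, so layers drawn in separate passes would need a spatial lookup instead.<br>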
>>>> obsidian . wrote:<br>
>>>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.<br>
>>>>><br>
>>>>> Here's a sample line from the output:<br>
>>>>>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word><br>
>>>>><br>
>>>>> Is there a way to get font information for every word like:<br>
>>>>> - font family, e.g. Verdana<br>
>>>>> - style, i.e. none, bold, italic<br>
>>>>> - size, e.g. font size 9<br>
>>>>><br>
>>>>> I'm using pdftotext version 0.55.0 on Windows.<br>
>>>>><br>
>>>>><br>
>>>> ______________________________<wbr>_________________<br>
>>>> poppler mailing list<br>
</div></div>>>>> <a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a><<wbr>mailto:<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.<wbr>freedesktop.org</a>><br>
<span class="">>>>> <a href="https://lists.freedesktop.org/mailman/listinfo/poppler" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/poppler</a><br>
>><br>
>><br>
>> ______________________________<wbr>_________________<br>
>> poppler mailing list<br>
</span>>> <a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a><<wbr>mailto:<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.<wbr>freedesktop.org</a>><br>
<span class="">>> <a href="https://lists.freedesktop.org/mailman/listinfo/poppler" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/poppler</a><br>
>><br>
> <br>
> ______________________________<wbr>_________________<br>
> poppler mailing list<br>
</span>> <a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a><<wbr>mailto:<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.<wbr>freedesktop.org</a>><br>
> <a href="https://lists.freedesktop.org/mailman/listinfo/poppler" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/poppler</a><br>
> <br>
> <br>
> <br>
> <br>
> <br>
<br>
</blockquote></div><br></div>