[poppler] How to recognize the Japan Font.

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Wed May 29 06:46:19 UTC 2019


To other subscribers: sorry, the main point of the discussion may be drifting away from poppler...

Dear Zhong,

I think the OCR in open-source software is insufficient to recognize CJK text
(more precisely, the engine itself could be good, but nobody has trained it
sufficiently and no pre-trained engine is widely distributed). Have you tried
Google Drive's OCR feature?

If somebody says something like "in such a case, the TrueType glyph outline data
should be perfectly preserved in this document, and the fonts used in such PDFs
are often ones widely used on Microsoft Windows, so there should be a method
to identify the characters by their outlines, without an infinite font database!",
I would agree and appreciate it, but I don't have anything in "ready-to-use"
status.

Regards,
mpsuzuki
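
For readers who want to try the OCR route discussed above, here is a minimal
sketch in Python. It assumes the pdftoppm tool from poppler-utils and the
tesseract command-line program are installed, and that Japanese language data
("jpn") is available to tesseract. The function names and the "page1" output
prefix are made up for illustration; nobody in this thread posted this script.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf, out_prefix, dpi=300, lang="jpn"):
    """Return the argv lists for rasterizing page 1 and running OCR on it."""
    # pdftoppm renders PDF pages to images; with -f 1 -l 1 -singlefile it
    # writes exactly <out_prefix>.png for the first page.
    render = ["pdftoppm", "-r", str(dpi), "-png", "-f", "1", "-l", "1",
              "-singlefile", pdf, out_prefix]
    # "tesseract <image> <outputbase> -l <lang>" writes <outputbase>.txt.
    ocr = ["tesseract", f"{out_prefix}.png", out_prefix, "-l", lang]
    return render, ocr


def ocr_first_page(pdf, out_prefix="page1"):
    """Rasterize the first page of `pdf` and return the OCR text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    for cmd in build_commands(pdf, out_prefix):
        subprocess.run(cmd, check=True)
    return Path(f"{out_prefix}.txt").read_text(encoding="utf-8")
```

A higher resolution (the dpi argument) may help with small print, but as noted
above the recognition quality for CJK text is not guaranteed.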

On 2019/05/29 15:28, Zhong, Steven wrote:
> Hi Suzuki-san,
> 
> Thanks for your help. I have tried recognizing it with tesseract (OCR), but sometimes the OCR result is not good. So I think it would be better if the text could be extracted directly. Thanks again.
> 
> -----Original Message-----
> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
> Sent: 29 May 2019 14:24
> To: Zhong, Steven <Steven.Zhong at fil.com>; 'poppler at lists.freedesktop.org' <poppler at lists.freedesktop.org>
> Cc: Leonard Rosenthol <lrosenth at adobe.com>
> Subject: Re: [poppler] How to recognize the Japan Font.
> 
> Dear Zhong,
> 
> Oooooh, I apologize for giving a wrong comment. I've confirmed that this PDF cannot be searched even when I open it with Adobe products (this does not mean data protection; I guess it is caused by a poor PDF-generation workflow). At present I cannot suggest an easy method to extract the text from this PDF - maybe OCR is the easiest?
> 
> Regards,
> mpsuzuki
> 
> On 2019/05/29 15:10, Zhong, Steven wrote:
>> Hi Suzuki-san,
>>
>> We have installed the latest poppler and poppler-data, but the result is the same.
>>
>> By the way, we can't copy the content correctly from the PDF on Windows 10 with Ctrl+C and Ctrl+V either. Thanks.
>>
>> root at 08a02db0d267:/home/vcap/app/pop/bin# ./pdfinfo -v
>> pdfinfo version 0.77.0
>> Copyright 2005-2019 The Poppler Developers - http://poppler.freedesktop.org
>> Copyright 1996-2011 Glyph & Cog, LLC
>> root at 08a02db0d267:/home/vcap/app/pop/bin#
>>
>>
>>
>> head: cannot open '10' for reading: No such file or directory
>> ==> sss <==
>> 㻝㻛㻥
>>
>>>>
>> タᐃ᪥䠖㻞㻜㻝㻡ᖺ㻝㻞᭶㻣᪥
>> ಙクᮇ㛫䠖㻞㻜㻝㻡ᖺ㻝㻞᭶㻣᪥䛛䜙㻞㻜㻟㻝ᖺ㻥᭶㻞㻡᪥䜎䛷
>> Ỵ⟬᪥䠖ཎ๎䛸䛧䛶ẖᖺ㻥᭶㻞㻡᪥䠄ఇᴗ᪥䛾ሙྜ䛿⩣Ⴀᴗ᪥䠅
>> 䈜ᙜヱᐇ⦼䛿㐣ཤ䛾䜒䛾䛷䛒䜚䚸ᑗ᮶䛾㐠⏝ᡂᯝ➼䜢ಖド䛩䜛䜒䛾䛷䛿䛒䜚䜎䛫䜣䚹
>>
>> 䕔ᇶ‽౯㢠䞉⣧㈨⏘⥲㢠䛾᥎⛣
>> root at 08a02db0d267:/home/vcap/app/pop/bin#
>>
>>
>> -----Original Message-----
>> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
>> Sent: 29 May 2019 12:25
>> To: 'poppler at lists.freedesktop.org' <poppler at lists.freedesktop.org>
>> Cc: Leonard Rosenthol <lrosenth at adobe.com>; Zhong, Steven
>> <Steven.Zhong at fil.com>
>> Subject: Re: [poppler] How to recognize the Japan Font.
>>
>> Hi Zhong,
>>
>> As Leonard pointed out, the fonts are embedded in the document. I have 3 comments.
>>
>> * Maybe you should install the poppler-data package, which includes the mapping tables from Adobe CID (please google or baidu it to understand what it is) to character encodings.
>> * But your poppler 0.62.0 might be too old to find a matching poppler-data package.
>> * I suggest upgrading poppler and installing poppler-data.
>>
>> Regards,
>> mpsuzuki
>>
>> On 2019/05/29 13:01, Leonard Rosenthol wrote:
>>> The font is embedded in the PDF – but that is only for the purposes of rendering.
>>>
>>> Leonard
>>>
>>> From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of
>>> "Zhong, Steven" <Steven.Zhong at fil.com>
>>> Date: Wednesday, May 29, 2019 at 11:58 AM
>>> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
>>> Subject: [poppler] How to recognize the Japan Font.
>>>
>>> Hi All,
>>>
>>> I want to convert the PDF at the following link:
>>> https://www.fidelity.jp/static/pdf/fund/5111893-FD30BA/Reports/Monthly/FD30BA-MF-201904.pdf
>>>
>>> But I can't read it correctly. I find the font is
>>> MS-PGothic-90ms-RKSJ-H and the encoding is Identity-H.
>>>
>>> The result of converting to text is shown below. I guess a font is missing. How can I install the font so that the text is read correctly? Many thanks.
>>> ᅜෆ⥲⏕⏘䠄㻳㻰㻼䠅ᡂ㛗⋡䛜๓ᅄ༙ᮇ䛸ྠỈ‽䛻䛺䜚䚸୰ᅜᬒẼ䛻ᗏධ䜜ឤ䛜ฟጞ䜑䛯䛣䛸䜒㈙䛔Ᏻᚰឤ䛻䛴䛺
>>>
>>>
>>> vcap at e0779423-b47e-499c-4c1b-4ecd:~/app/pop/bin$ ./pdfinfo -v
>>> pdfinfo version 0.62.0
>>> Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
>>> Copyright 1996-2011 Glyph & Cog, LLC
>>>
>>> My poppler is 0.62.0
>>> vcap at e0779423-b47e-499c-4c1b-4ecd:~/app/pop/bin$ ./pdfinfo -listenc
>>> Available encodings are:
>>> ASCII7
>>> Big5
>>> Big5ascii
>>> EUC-CN
>>> EUC-JP
>>> GBK
>>> ISO-2022-CN
>>> ISO-2022-JP
>>> ISO-2022-KR
>>> ISO-8859-6
>>> ISO-8859-7
>>> ISO-8859-8
>>> ISO-8859-9
>>> KOI8-R
>>> Latin1
>>> Latin2
>>> Shift-JIS
>>> Symbol
>>> TIS-620
>>> UTF-16
>>> UTF-8
>>> Windows-1255
>>> ZapfDingbats
>>>
>>
> 
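
The extracted text quoted in this thread is made up mostly of characters
such as 㻞 or 䛾, which appear to come from rarely used blocks rather than from
the kana and common kanji that ordinary Japanese text uses. A rough way to
flag such output automatically is to measure how many characters fall in the
expected blocks. This is only a heuristic sketch; the function names, the
block list, and the 0.6 threshold are assumptions chosen for illustration,
not anything poppler provides.

```python
def expected_ratio(text):
    """Fraction of non-space characters in blocks typical of Japanese text."""
    expected = (
        (0x0020, 0x007E),  # ASCII
        (0x3000, 0x30FF),  # CJK punctuation, hiragana, katakana
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0xFF00, 0xFFEF),  # fullwidth and halfwidth forms
    )
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # nothing to judge, so do not flag it
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in expected) for c in chars)
    return hits / len(chars)


def looks_garbled(text, threshold=0.6):
    """True when too few characters come from the expected blocks."""
    return expected_ratio(text) < threshold
```

Text that fails this check probably needs OCR or a repaired character
mapping. Text that passes it is not guaranteed to be correct.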


