[poppler] How to recognize the Japan Font.

Zhong, Steven Steven.Zhong at fil.com
Wed May 29 06:28:09 UTC 2019


Hi Suzuki-san,

Thanks for your help.  I have already tried recognizing it with Tesseract (OCR), but the OCR results are sometimes poor, so I think it would be better if the text could be extracted directly.  Thanks again.
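
For reference, the OCR route mentioned above could be scripted roughly as follows. This is only a minimal sketch, assuming the pdftoppm (poppler-utils) and tesseract command-line tools are on PATH and that the Japanese language data (`jpn`) is installed. The function names, the 300 dpi default and the working-directory layout are illustrative assumptions, not the actual setup used here.

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=300):
    """Command that renders each page of `pdf` to <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, out_base, lang="jpn"):
    """Command that OCRs `image` and writes the result to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def page_number(image):
    """Page number taken from a file name such as page-03.png."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, workdir, dpi=300, lang="jpn"):
    """Render every page, OCR it, and return the concatenated text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"  # hypothetical naming scheme
    subprocess.run(pdftoppm_cmd(pdf, prefix, dpi), check=True)

    texts = []
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        out_base = image.with_suffix("")
        subprocess.run(tesseract_cmd(image, out_base, lang), check=True)
        texts.append(Path(f"{out_base}.txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

A call such as `ocr_pdf("FD30BA-MF-201904.pdf", "ocr_work")` would render and OCR each page in turn. Raising `dpi` may help with small glyphs, at the cost of speed.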

-----Original Message-----
From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> 
Sent: May 29, 2019 14:24
To: Zhong, Steven <Steven.Zhong at fil.com>; 'poppler at lists.freedesktop.org' <poppler at lists.freedesktop.org>
Cc: Leonard Rosenthol <lrosenth at adobe.com>
Subject: Re: [poppler] How to recognize the Japan Font.

Dear Zhong,

Oooooh, I apologize that I gave a wrong comment. I've confirmed that this PDF cannot be searched even when I open it in Adobe products (this does not mean data protection; I guess it is caused by a poor PDF-generation workflow). At present I cannot suggest an easy method to extract the text from this PDF - maybe OCR is the easiest?

Regards,
mpsuzuki
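
Since some PDFs extract cleanly while others (like this one) yield mis-mapped characters, a script could check the extracted text before deciding to fall back to OCR. The following is only a rough heuristic sketch: it assumes that ordinary Japanese prose contains a noticeable share of hiragana, and the 5% threshold and the function names are arbitrary illustrative choices, not anything provided by poppler.

```python
def hiragana_ratio(text):
    """Share of non-whitespace characters that fall in the Hiragana block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u3040" <= c <= "\u309f")
    return hits / len(chars)


def looks_like_japanese(text, threshold=0.05):
    """Heuristic only: True if enough of the text is hiragana.

    The threshold is an assumption and may need tuning for documents
    that consist mostly of numbers, kanji or Latin text.
    """
    return hiragana_ratio(text) >= threshold
```

Text extracted with `pdftotext -enc UTF-8` could be passed to `looks_like_japanese`, and pages that fail the check could then be sent to OCR. Garbled output of the kind quoted further down in this thread would be expected to score low, because it appears to contain no hiragana at all.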

On 2019/05/29 15:10, Zhong, Steven wrote:
> Hi Suzuki-san,
> 
> We have installed the latest poppler and poppler-data, but the result is the same.
> 
> By the way, we can't copy the content correctly from the PDF on Windows 10 with Ctrl+C / Ctrl+V.  Thanks
> 
> root at 08a02db0d267:/home/vcap/app/pop/bin# ./pdfinfo -v
> pdfinfo version 0.77.0
> Copyright 2005-2019 The Poppler Developers - http://poppler.freedesktop.org
> Copyright 1996-2011 Glyph & Cog, LLC
> root at 08a02db0d267:/home/vcap/app/pop/bin#
> 
> 
> 
> head: cannot open '10' for reading: No such file or directory
> ==> sss <==
> 㻝㻛㻥
> 
>> 
> タᐃ᪥䠖㻞㻜㻝㻡ᖺ㻝㻞᭶㻣᪥
> ಙクᮇ㛫䠖㻞㻜㻝㻡ᖺ㻝㻞᭶㻣᪥䛛䜙㻞㻜㻟㻝ᖺ㻥᭶㻞㻡᪥䜎䛷
> Ỵ⟬᪥䠖ཎ๎䛸䛧䛶ẖᖺ㻥᭶㻞㻡᪥䠄ఇᴗ᪥䛾ሙྜ䛿⩣Ⴀᴗ᪥䠅
> 䈜ᙜヱᐇ⦼䛿㐣ཤ䛾䜒䛾䛷䛒䜚䚸ᑗ᮶䛾㐠⏝ᡂᯝ➼䜢ಖド䛩䜛䜒䛾䛷䛿䛒䜚䜎䛫䜣䚹
> 
> 䕔ᇶ‽౯㢠䞉⣧㈨⏘⥲㢠䛾᥎⛣
> root at 08a02db0d267:/home/vcap/app/pop/bin#
> 
> 
> -----Original Message-----
> From: suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>
> Sent: May 29, 2019 12:25
> To: 'poppler at lists.freedesktop.org' <poppler at lists.freedesktop.org>
> Cc: Leonard Rosenthol <lrosenth at adobe.com>; Zhong, Steven 
> <Steven.Zhong at fil.com>
> Subject: Re: [poppler] How to recognize the Japan Font.
> 
> Hi Zhong,
> 
> As Leonard pointed out, the fonts are embedded in the document. I have three comments.
> 
> * Maybe you should install the poppler-data package, which includes the mapping tables from Adobe CID (please google or baidu to understand what it is) to character encodings.
> * But your poppler 0.62.0 might be too old to find a matching poppler-data package.
> * I suggest upgrading poppler and installing poppler-data.
> 
> Regards,
> mpsuzuki
> 
> On 2019/05/29 13:01, Leonard Rosenthol wrote:
>> The font is embedded in the PDF – but that is only for the purposes of rendering.
>>
>> Leonard
>>
>> From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of 
>> "Zhong, Steven" <Steven.Zhong at fil.com>
>> Date: Wednesday, May 29, 2019 at 11:58 AM
>> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
>> Subject: [poppler] How to recognize the Japan Font.
>>
>> Hi All,
>>
>> I want to convert the PDF at the following link:
>> https://www.fidelity.jp/static/pdf/fund/5111893-FD30BA/Reports/Monthly/FD30BA-MF-201904.pdf
>>
>> But I can't read it correctly. I found that the font is
>> MS-PGothic-90ms-RKSJ-H and the encoding is Identity-H.
>>
>> Converted to text, it looks like the line below.  I guess a font is missing.  How can I install the font so that the text is read correctly?  Many thanks.
>> ᅜෆ⥲⏕⏘䠄㻳㻰㻼䠅ᡂ㛗⋡䛜๓ᅄ༙ᮇ䛸ྠỈ‽䛻䛺䜚䚸୰ᅜᬒẼ䛻ᗏධ䜜ឤ䛜ฟጞ䜑䛯䛣䛸䜒㈙䛔Ᏻᚰឤ䛻䛴䛺
>>
>>
>> vcap at e0779423-b47e-499c-4c1b-4ecd:~/app/pop/bin$ ./pdfinfo -v
>> pdfinfo version 0.62.0
>> Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
>> Copyright 1996-2011 Glyph & Cog, LLC
>>
>> My poppler is 0.62.0
>> vcap at e0779423-b47e-499c-4c1b-4ecd:~/app/pop/bin$ ./pdfinfo -listenc
>> Available encodings are:
>> ASCII7
>> Big5
>> Big5ascii
>> EUC-CN
>> EUC-JP
>> GBK
>> ISO-2022-CN
>> ISO-2022-JP
>> ISO-2022-KR
>> ISO-8859-6
>> ISO-8859-7
>> ISO-8859-8
>> ISO-8859-9
>> KOI8-R
>> Latin1
>> Latin2
>> Shift-JIS
>> Symbol
>> TIS-620
>> UTF-16
>> UTF-8
>> Windows-1255
>> ZapfDingbats
>>
> 


