[poppler] Encoding of font names
suzuki toshiya
mpsuzuki at hiroshima-u.ac.jp
Mon Aug 29 10:51:52 PDT 2011
Hi,
I appreciate your interest in, and effort on, non-Unicode font names!
Albert Astals Cid wrote:
> Today I've been working on trying to fix the names reported by pdffonts for
> non latin1 fonts, I have not got anything very clear while reading the spec,
> but I understood that the BaseFont string is encoded using the /Encoding
> encoding. This has worked fine for some files but not for all like one that
> says
> /BaseFont /#CB#CE#CC#E5
> /Encoding /UniGB-UCS2-H
> If i try to map that to Unicode i get nothing. And Adobe Reader properly maps
> that to 宋体
Although I have not tested this comprehensively yet, I guess that
Adobe's implementation has some heuristic workaround for
font names encoded by legacy localization mechanisms.
0xCB 0xCE 0xCC 0xE5 is the GB-2312 encoding of 宋体.
# you can check as:
# perl -le '{printf("%c%c%c%c\n", 0xCB, 0xCE, 0xCC, 0xE5);}' | iconv -f gbk -t utf-8
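The same check can be sketched in Python. This is only an illustration: the helper name unescape_pdf_name is hypothetical (it is not a poppler function), and it assumes the name uses the #xx escape syntax shown in the quoted BaseFont entry.

```python
import re


def unescape_pdf_name(name: str) -> bytes:
    """Turn a PDF name such as '/#CB#CE#CC#E5' into its raw bytes."""
    body = name[1:] if name.startswith("/") else name
    # Literal characters become one byte each (Latin-1), then every
    # #xx escape is replaced by the single byte it denotes.
    raw = body.encode("latin-1")
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )


raw = unescape_pdf_name("/#CB#CE#CC#E5")
print(raw.decode("gb2312"))  # prints 宋体
```

Decoding the same bytes with "gbk" gives the same result here, since GBK is a superset of GB-2312.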
I guess Adobe's implementation proceeds as follows:
1) check whether the font name is in hexadecimal syntax "/#xx#xx#xx..."
2) if its encoding is one of the predefined CJK CMaps,
   try to decode the font name according to the character collection:
   Adobe-CNS1 -> Big5
   Adobe-GB1 -> GB-2312 (or GBK)
   Adobe-Japan1 or Adobe-Japan2 -> Shift_JIS (or Windows-31J)
   Adobe-Korea1 -> Wansung
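The two steps above could be sketched roughly as follows. This is a guess at the heuristic, not Adobe's or poppler's actual code. The function guess_font_name and both tables are hypothetical, the CMap table lists only a few of the predefined CMaps, and "Wansung" is assumed here to correspond to Python's "euc_kr" codec.

```python
import re
from typing import Optional

# Illustrative subset of predefined CJK CMaps and their character collections.
CMAP_TO_COLLECTION = {
    "UniCNS-UCS2-H": "Adobe-CNS1",
    "UniGB-UCS2-H": "Adobe-GB1",
    "UniJIS-UCS2-H": "Adobe-Japan1",
    "UniKS-UCS2-H": "Adobe-Korea1",
}

# Legacy encodings to try per collection, as Python codec names.
# "euc_kr" standing in for Wansung is an assumption.
COLLECTION_TO_CODECS = {
    "Adobe-CNS1": ["big5"],
    "Adobe-GB1": ["gb2312", "gbk"],
    "Adobe-Japan1": ["shift_jis", "cp932"],
    "Adobe-Japan2": ["shift_jis", "cp932"],
    "Adobe-Korea1": ["euc_kr"],
}


def guess_font_name(base_font: str, encoding_name: str) -> Optional[str]:
    """Return a Unicode font name, or None if the heuristic does not apply."""
    body = base_font[1:] if base_font.startswith("/") else base_font
    # Step 1: only names written with #xx hexadecimal escapes are candidates.
    if not re.search(r"#[0-9A-Fa-f]{2}", body):
        return None
    raw = re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        body.encode("latin-1"),
    )
    # Step 2: pick candidate legacy encodings from the predefined CMap.
    collection = CMAP_TO_COLLECTION.get(encoding_name)
    for codec in COLLECTION_TO_CODECS.get(collection, []):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    return None


print(guess_font_name("/#CB#CE#CC#E5", "UniGB-UCS2-H"))  # prints 宋体
```

Returning None for anything outside the table leaves the caller free to fall back to the name as stored in the file.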
Fortunately, the core parts of these legacy localizations are
almost the same in MS Windows and Mac OS, so the range of possible
legacy encodings is not so wide.
> Any idea what is the proper manipulation one has to do over BaseFont to get
> the Unicode value?
I think that if we can require iconv for users who are interested
in non-Unicode or non-ASCII font names, the conversion is not so
difficult.
One of my concerns is that I don't know how to handle non-CJK
(or CJK-but-not-predefined) localized font names, such as
Adobe-Vietnam1.
Is this an urgent issue? If not, I will try to write a workaround
for it.
Regards,
mpsuzuki