[poppler] Encoding of font names

Mon Aug 29 10:51:52 PDT 2011

Hi,

I appreciate your interest in and effort on non-Unicode font names!

Albert Astals Cid wrote:
> Today I've been working on trying to fix the names reported by pdffonts for 
> non latin1 fonts, I have not got anything very clear while reading the spec, 
> but I understood that the BaseFont string is encoded using the /Encoding 
> encoding. This has worked fine for some files but not for all like one that 
> says
> /BaseFont /#CB#CE#CC#E5
> /Encoding /UniGB-UCS2-H
> If i try to map that to Unicode i get nothing. And Adobe Reader properly maps 
> that to 宋体

Although I have not tested this comprehensively yet, I guess that
Adobe's implementation has a heuristic workaround for
font names encoded by legacy localization mechanisms.

0xCB 0xCE 0xCC 0xE5 is the GB-2312 encoding of 宋体.

# you can check this with:
# perl -le '{printf("%c%c%c%c\n", 0xCB, 0xCE, 0xCC, 0xE5);}' | iconv -f gbk -t utf-8
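
The same check can be sketched in Python, starting from the name as it
is written in the PDF. This is only an illustration: the helper name
unescape_pdf_name is made up here, and it assumes the name contains
nothing outside Latin-1 apart from the #xx escapes.

```python
import re

def unescape_pdf_name(name: str) -> bytes:
    """Turn a PDF name such as '/#CB#CE#CC#E5' into its raw bytes.

    Hypothetical helper: it only expands '#xx' escapes and assumes
    every other character in the name fits in Latin-1.
    """
    body = name[1:] if name.startswith("/") else name
    raw = body.encode("latin-1")
    # Replace each '#xx' escape by the single byte it denotes.
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )

raw = unescape_pdf_name("/#CB#CE#CC#E5")
print(raw.hex())  # cbcecce5

# Decoding the bytes as GBK is expected to give the name quoted above.
decoded = raw.decode("gbk")
assert decoded == "\u5b8b\u4f53"  # U+5B8B U+4F53, i.e. 宋体
```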

I guess that Adobe's implementation proceeds as follows:

1) check whether the font name uses the hexadecimal syntax "/#xx#xx#xx..."
2) if its encoding is one of the predefined CJK CMaps,
   try to decode the font name according to the character collection:
   Adobe-CNS1 -> Big5
   Adobe-GB1 -> GB-2312 (or GBK)
   Adobe-Japan1 or Adobe-Japan2 -> Shift_JIS (or Windows-31J)
   Adobe-Korea1 -> Wansung
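
The two steps above could be sketched as follows. This is a guess at
the heuristic, not Adobe's or poppler's actual code. The function and
table names are hypothetical, the CMap table lists only a few of the
predefined CMaps, Python's "euc_kr" codec stands in for Wansung, and
names that are only partly written as #xx escapes are not handled.

```python
import re

# Character collection -> legacy codecs to try, in order.
# Codec names are Python's; "euc_kr" for Wansung is an assumption.
LEGACY_CODECS = {
    "Adobe-CNS1": ["big5"],
    "Adobe-GB1": ["gb2312", "gbk"],
    "Adobe-Japan1": ["shift_jis", "cp932"],
    "Adobe-Japan2": ["shift_jis", "cp932"],
    "Adobe-Korea1": ["euc_kr"],
}

# Illustrative, incomplete lookup from predefined CMap name to collection.
CMAP_COLLECTION = {
    "UniGB-UCS2-H": "Adobe-GB1",
    "UniCNS-UCS2-H": "Adobe-CNS1",
    "UniJIS-UCS2-H": "Adobe-Japan1",
    "UniKS-UCS2-H": "Adobe-Korea1",
}

def guess_font_name(base_font, encoding):
    """Return a Unicode guess for a '#xx'-escaped BaseFont, or None."""
    body = base_font[1:] if base_font.startswith("/") else base_font
    # Step 1: only names written entirely as #xx escapes are candidates.
    if not re.fullmatch(r"(?:#[0-9A-Fa-f]{2})+", body):
        return None
    raw = bytes.fromhex(body.replace("#", ""))
    # Step 2: choose legacy codecs from the CMap's character collection.
    collection = CMAP_COLLECTION.get(encoding)
    for codec in LEGACY_CODECS.get(collection, []):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue  # try the next candidate codec
    return None

name = guess_font_name("/#CB#CE#CC#E5", "UniGB-UCS2-H")
assert name == "\u5b8b\u4f53"  # U+5B8B U+4F53, i.e. 宋体
```

Returning None when nothing matches lets the caller fall back to the
raw name, as it would for a font whose encoding is not a predefined
CJK CMap.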

Fortunately, the core parts of these legacy localizations are
almost the same in MS Windows and Mac OS, so the range of possible
legacy encodings is not so wide.

> Any idea what is the proper manipulation one has to do over BaseFont to get 
> the Unicode value?

I think that if we can require iconv for the users who are interested
in non-Unicode or non-ASCII font names, the conversion is not so
difficult.

One of my concerns is that I don't know how to handle non-CJK
(or CJK-but-not-predefined) localized font names, such as
Adobe-Vietnam1, etc.

Is this an urgent issue? If not, I will try to write a workaround
for it.

Regards,
mpsuzuki