[poppler] problem displaying pdf contains Chinese character

Sun Apr 5 15:23:47 PDT 2009

>>>>> "Yuanle" == Yuanle Song <sylecn at gmail.com> writes:

>> :; fc-list 宋体 family familylang file
>> :; fc-list 黑体 family familylang file
>> :; fc-list 仿宋_GB2312 family familylang file
>> :; fc-list 楷体_GB2312 family familylang file

Yuanle> As I said I already have them, according to fc-list.

I was hoping to see the full family and familylang output, especially
for the latter two.

In any case, I hit upon something poppler can do to circumvent this
issue.  If poppler sets the lang parameter when searching for a font,
the resulting font set is more likely to have glyphs which will work.

Eg, in this case poppler should ask for the equivalent of:

:; fc-match NAME:lang=zh-cn

or at least for the equivalent of:

:; fc-match NAME:lang=zh

choosing the lang it specifies based on the characters the pdf wants
to use it for.
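
To make the idea concrete, here is a rough Python sketch of building such
a pattern and handing it to fc-match.  It is only an illustration, not
what poppler does: guess_lang(), fc_pattern() and fc_match() are names
made up for this example, the "any CJK ideograph means zh" rule is a
crude assumption (those code points are shared with Japanese), and the
set of characters escaped in the family name reflects my understanding
of fontconfig's pattern syntax.

```python
import re
import subprocess


def guess_lang(text, default=None):
    """Hypothetical heuristic: return "zh" if text has a CJK ideograph."""
    for ch in text:
        # CJK Unified Ideographs block; also used by Japanese, so this
        # is only a rough guess, not a real language detector.
        if "\u4e00" <= ch <= "\u9fff":
            return "zh"
    return default


def fc_pattern(family, lang=None):
    """Build a fontconfig pattern such as NAME:lang=zh-cn."""
    # Assumption: backslash, hyphen, colon and comma are the characters
    # that need a backslash escape inside a family name.
    escaped = re.sub(r"([\\\-:,])", r"\\\1", family)
    return f"{escaped}:lang={lang}" if lang else escaped


def fc_match(family, lang=None):
    """Run fc-match (must be installed) and return its one-line answer."""
    result = subprocess.run(
        ["fc-match", fc_pattern(family, lang)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Comparing fc_match("仿宋_GB2312") with fc_match("仿宋_GB2312", "zh-cn")
on the affected machine should show whether the lang hint changes which
font fontconfig picks.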

Yuanle> Please tell me if there is other ways I can get more debug
Yuanle> output or log.

You can try uncompressing the pdf with something like podofo¹ or pdftk².

Then, looking in that file in a pager such as less(1) (or in a text
editor), search for the objects specified by pdffonts.  Eg, for:

,----
| 仿宋_GB2312   TrueType   no  no  no  1600  0
`----

Look in the uncompressed version for the regex /^1600 0 obj/.  Everything
from that line to the next line matching /^endobj/ should be the /Font
object for 仿宋_GB2312.  It would be useful to see those objects for each
of the four fonts listed in your page1to10 report.

(To be explicit, the first digit-string in the regex is the object
column from the pdffonts output and the second digit-string is the
ID column.)
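
If paging through the file by hand is tedious, the same search can be
scripted.  The following is a minimal Python sketch, assuming the pdf
has already been uncompressed and that each object header starts at the
beginning of a newline-terminated line; extract_object() is a name
invented for this example.

```python
import re
from typing import Optional


def extract_object(pdf_bytes: bytes, obj_num: int, gen: int = 0) -> Optional[bytes]:
    """Return everything from 'N G obj' through the next 'endobj' line.

    obj_num and gen are the two numbers pdffonts prints in its
    object and ID columns.  Returns None if no such object is found.
    """
    pattern = re.compile(
        rb"^%d %d obj\b.*?^endobj\b" % (obj_num, gen),
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(pdf_bytes)
    return match.group(0) if match else None


# A tiny stand-in for an uncompressed pdf, for illustration only.
SAMPLE = b"""4 0 obj
<<
/Type /Font
/FontDescriptor 9 0 R
>>
endobj
9 0 obj
<<
/Type /FontDescriptor
/Flags 4
>>
endobj
"""
```

With a real file, something like
extract_object(open("new.pdf", "rb").read(), 1600, 0) should return the
/Font object for 仿宋_GB2312.  The file is read as bytes because even an
uncompressed pdf can contain binary data.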

The object may include a /FontDescriptor entry.  If the contents of any
interesting entries are of the form:

/string [0-9]+ [0-9]+ R

then that is a reference to another object.  You'll also want to look at
those referenced objects.
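
Spotting those references can also be automated.  Here is a small
Python sketch, assuming the simple "/Name N G R" form shown above;
find_references() is a hypothetical helper, and references inside
arrays (eg, "[ 3 0 R 5 0 R ]") are deliberately not handled.

```python
import re
from typing import Dict, Tuple

# Matches "/Name <object number> <generation> R".
REF = re.compile(rb"/([A-Za-z0-9]+)\s+(\d+)\s+(\d+)\s+R\b")


def find_references(obj_body: bytes) -> Dict[str, Tuple[int, int]]:
    """Map each entry name to the (object, generation) it refers to."""
    return {
        m.group(1).decode("ascii"): (int(m.group(2)), int(m.group(3)))
        for m in REF.finditer(obj_body)
    }


# Illustrative /Font object; entries with plain values are ignored.
FONT_OBJ = b"""4 0 obj
<<
/Type /Font
/Encoding 6 0 R
/FirstChar 49
/FontDescriptor 9 0 R
/LastChar 122
/Subtype /Type1
/Widths 7 0 R
>>
endobj
"""
```

Calling find_references(FONT_OBJ) gives the object numbers to look up
next for /Encoding, /FontDescriptor and /Widths.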

As an example, I'm looking at a file which has this object:

,----
| 4 0 obj
| <<
| /Type /Font
| /BaseFont /OZPPOK+LMRoman12-Regular
| /Encoding 6 0 R
| /FirstChar 49
| /FontDescriptor 9 0 R
| /LastChar 122
| /Subtype /Type1
| /Widths 7 0 R
| >>
| endobj
`----

That shows that I need to look at object 9 0 for the /FontDescriptor,
which looks like:

,----
| 9 0 obj
| <<
| /Type /FontDescriptor
| /Ascent 689
| /CapHeight 689
| /CharSet (/a/e/o/one/z)
| /Descent -194
| /Flags 4
| /FontBBox [ -422 -280 1394 1127 ]
| /FontFile 8 0 R
| /FontName /OZPPOK+LMRoman12-Regular
| /ItalicAngle 0
| /StemV 65
| /XHeight 431
| >>
| endobj
`----

The contents of the /Font objects and their children would be useful.

Is the pdf something you can post somewhere?

-JiMC

1] PoDoFo is at:

   http://podofo.sourceforge.net/
   http://sourceforge.net/projects/podofo/

   The command line to uncompress a pdf is:

   :; podofouncompress original.pdf new.pdf

2] PdfTk is at:

   http://www.pdfhacks.com/pdftk

   Its cli is:

   :; pdftk original.pdf output new.pdf uncompress

Both are probably packaged by your distribution.

-- 
James Cloos <cloos at jhcloos.com>         OpenPGP: 1024D/ED7DAEA6