[poppler] PDF files with embedded Chinese fonts

Sun Feb 8 12:16:18 PST 2009

Hi Adamson, and Albert.

On 09/02/2009, at 3:14 AM, Adamson H wrote:

> Yes, I have poppler-data 0.2.0-2. Please take a look at these two
> screenshots http://launchpadlibrarian.net/21701829/Screenshot-evince.png
> and http://launchpadlibrarian.net/21701834/Screenshot-foxit.png for

I see broken characters too, using Apple's Preview to read dell440.pdf,
but not with Adobe Reader v9.x on MacOS X.

(see attached images, of a portion of your PDF).

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dell440-adobe.png
Type: image/png
Size: 31174 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090209/0f8b3858/attachment-0002.png 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dell440-preview.png
Type: image/png
Size: 31986 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090209/0f8b3858/attachment-0003.png 
-------------- next part --------------

This suggests that it is a problem within the fonts themselves,
allowing different possible interpretations by font-rendering
software.

Alternatively, Adobe is using some information that other PDF
renderers, or text-extraction tools, are not using.
To back up this statement, suppose you select the last of
the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.

Adobe gives:

   ???????????

Now use      pdftotext -raw dell440.pdf
and find the appropriate portion; you'll get:

??????????????????????? 
?????????????????????

in which each ideograph is repeated 4 times over.
This repetition seems to be a common way to get bold-face,
rather than using a separate font.
Copy/Paste from Apple's preview gives the same result as  pdftotext .

Perhaps it is this multiple overstriking that causes the bad display?
If so, how does Adobe know how to get it correct?
What extra information is Adobe using?
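
One way to test the overstriking idea on extracted text is to collapse
the repeats and compare against what Adobe gives. The following is only
a rough sketch: the function name is made up for illustration, and it
assumes the copies of each glyph come out consecutively (AAAA BBBB ...),
which the extraction order does not guarantee.

```python
from itertools import groupby


def collapse_overstrikes(text: str, strikes: int) -> str:
    """Collapse runs whose length is a multiple of `strikes`.

    Assumption: the extractor emits the copies of an overstruck glyph
    one after another; interleaved copies are not handled.
    """
    if strikes < 1:
        raise ValueError("strikes must be >= 1")
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        # Shrink only runs that divide evenly by the strike count, so
        # ordinary doubled characters are left untouched.
        keep = n // strikes if n % strikes == 0 else n
        out.append(ch * keep)
    return "".join(out)
```

With strikes=4 a run of four identical ideographs becomes one. Note that
a run of four spaces would be collapsed as well, so this is a diagnostic
aid rather than a repair.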

I have another example of this kind of thing:

    http://www.maths.mq.edu.au/~ross/poppler/Big5/

Big5-actual.pdf   170k      --- has /ActualText tagging
Big5-actual.txt   97 bytes
Big5-alt.pdf      169k      --- has /Alt tagging
Big5-alt.txt      434 bytes
Big5-notags.pdf   157k      ---  no special tagging
Big5-notags.txt   432 bytes

The corresponding .txt files were obtained using  pdftotext -raw
with Poppler version as follows:

[GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
pdftotext version 0.10.3
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
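
For anyone wanting to repeat the size comparison, a minimal sketch
follows. The helper names are hypothetical; it assumes pdftotext is on
the PATH and that "-" as the output file sends the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path: str) -> list:
    """Build the command line used for raw-order extraction."""
    # "-raw" keeps content-stream order; "-" writes the text to stdout.
    return ["pdftotext", "-raw", pdf_path, "-"]


def extracted_size(pdf_path: str) -> int:
    """Return the number of bytes that pdftotext -raw produces."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return len(result.stdout)
```

Calling extracted_size() on each of the three Big5-*.pdf files gives
numbers to set against the .txt sizes listed above; they may differ
with other Poppler versions.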

It is clear just from the file-size of  Big5-actual.txt  that
Poppler isn't extracting the /ActualText  in this case.
Also, if you look at the contents of  Big5-notags.txt  you'll
see the same kind of "multiple-striking" to get the bold effect.

With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
is meant to be mapped to a single Unicode character.
But Poppler has no support for /Alt tagging, which is why
Big5-alt.txt  is practically the same size as  Big5-notags.txt .

With these three PDFs, Adobe Reader cannot extract
the Chinese characters from   Big5-notags.pdf
whereas it can do so from   Big5-actual.pdf  and  Big5-alt.pdf
due to the extra tagging.

Apple's Preview and Poppler, on the other hand, can identify
the characters (presumably from information in the fonts or
their encoding arrays --- a CMap is not applicable).
But both extract three copies when the multiple striking occurs,
so are not dealing with the /Alt or /ActualText tags.
Furthermore, Poppler gives nothing for the ideographs
marked with /ActualText tagging.

I've been looking at this kind of thing for some time now,
with tagging and Chinese/Korean/Japanese documents (produced
using pdfTeX) and the result of Text-extraction using different
tools.
   It seems that no-one gets it right all the time, which makes
it really hard to prepare a bug-report --- which software is
the one which is buggy, when all appear to neglect available
information, or process it incorrectly in different ways?

For certain I can say that Poppler has (at least) two bugs:

  1.  /ActualText  doesn't work properly for the content in these
                    Big5-*.pdf  documents;

  2.  /Alt  isn't even recognised by Poppler;
        (there is no coding to support it in either
           TextOutputDev.cc   or   Gfx.cc )
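
For reference, the kind of tagging at issue in both points can be
sketched as a marked-content sequence. This is an illustration only,
with a made-up helper name, and is not necessarily how the Big5-*.pdf
files place their tags; it assumes the replacement text is written
inline as a UTF-16BE hexadecimal string with a leading byte-order mark.

```python
def actual_text_span(replacement: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span carrying /ActualText."""
    # PDF text strings may be UTF-16BE, marked by the bytes FE FF at the
    # start; here the whole string is written in hexadecimal form.
    hex_text = "FEFF" + replacement.encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{content_ops}\nEMC"
```

The operators that strike the glyph several times go between BDC and
EMC, and a text extractor honouring the tag should emit the replacement
string once in their place. An /Alt version would use the same shape
with the /Alt key.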

Could the PDF property streams in my PDFs be malformed in some way?
Yes, I've looked at that, and have tried different ways to place
the tagging in them --- these made no difference whatsoever to
the result of text-extraction with the different software tools
that I've tried.

> comparison. I don't have any problems viewing other Chinese PDF files on
> my system. I use pdffonts to check fonts used by the file, and I have
> simsun.ttf installed.
>
> Adamson H

Hope this helps someone identify the problems,
and where/how to fix them.

Cheers,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------