[poppler] PDF files with embedded Chinese fonts

Sun Feb 8 20:36:28 PST 2009

Hi, I opened these Chinese PDF files with Oo3 and found that most of the
broken characters showed up correctly, though still not as well as in
Foxit Reader or Adobe Reader.
https://bugs.launchpad.net/ubuntu/+source/evince/+bug/159220 is related
to this bug.

Adamson H

-------- Original Message --------
Subject: Re: [poppler] PDF files with embedded Chinese fonts
From: Ross Moore <ross at ics.mq.edu.au>
To: Adamson H <adamson at polycastle.3322.org>
Date: 02/09/2009 04:16 AM
> Hi Adamson, and Albert.
>
> On 09/02/2009, at 3:14 AM, Adamson H wrote:
>
>> Yes, I have poppler-data 0.2.0-2. Please take a look at these two
>> screenshots http://launchpadlibrarian.net/21701829/Screenshot-evince.png
>> and http://launchpadlibrarian.net/21701834/Screenshot-foxit.png for
>
> I see broken characters too, using Apple's Preview to read dell440.pdf ,
> but not with Adobe Reader v9.x on MacOS X.
>
> (see attached images, of a portion of your PDF).
>
> This suggests that it is a problem within the fonts themselves,
> allowing different possible interpretations by font-rendering
> software.
>
> Alternatively, Adobe is using some information that other PDF
> renderers, or text-extraction tools, are not using.
> To back up this statement, suppose you select the last of
> the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.
>
> Adobe gives:
>
> 以电子邮件发送您的订单
>
>
> Now use pdftotext -raw dell440.pdf
> and find the appropriate portion; you'll get:
>
> 以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单
>
> in which each ideograph is repeated 4 times over.
> This repetition seems to be a common way to get bold-face,
> rather than using a separate font.
> Copy/Paste from Apple's Preview gives the same result as pdftotext .
>
> Perhaps it is this multiple overstriking that causes the bad display?
> If so, how does Adobe know how to get it correct?
> What extra information is Adobe using?
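
As a stopgap on the text-extraction side, the repeated output quoted
above could be collapsed after the fact. This is only a minimal sketch in
Python and not something Poppler does: the name collapse_overstrike and
the fixed repeat count are assumptions made for illustration, and the
function would also collapse text that legitimately repeats in the same
pattern.

```python
import re

def collapse_overstrike(text, copies=4):
    """Collapse each unit repeated `copies` times in a row to one copy.

    Hypothetical helper: assumes the bold effect was produced by striking
    the same glyph run `copies` times, as in the pdftotext output above.
    """
    if copies < 2:
        return text
    # (.+?) lazily picks the shortest unit; \1{n} requires n more copies of it.
    pattern = re.compile(r"(.+?)\1{%d}" % (copies - 1))
    return pattern.sub(r"\1", text)

raw = "以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单"
print(collapse_overstrike(raw))
```

For the string above this prints the same text that Adobe gives. It does
not explain how Adobe decides which strikes belong together; it only
undoes a known, uniform repetition.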
>
>
> I have another example of this kind of thing:
>
> http://www.maths.mq.edu.au/~ross/poppler/Big5/
>
> Big5-actual.pdf 170k --- has /ActualText tagging
> Big5-actual.txt 97 bytes
> Big5-alt.pdf 169k --- has /Alt tagging
> Big5-alt.txt 434 bytes
> Big5-notags.pdf 157k --- no special tagging
> Big5-notags.txt 432 bytes
>
> The corresponding .txt files were obtained using pdftotext -raw
> with Poppler version as follows:
>
> [GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
> pdftotext version 0.10.3
> Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
> Copyright 1996-2004 Glyph & Cog, LLC
>
>
> It is clear just from the file-size of Big5-actual.txt that
> Poppler isn't extracting the /ActualText in this case.
> Also, if you look at the contents of Big5-notags.txt you'll
> see the same kind of "multiple-striking" to get the bold effect.
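
To repeat the size comparison described above, something like the
following could be used. It is a rough sketch: the helper name
raw_text_size is made up, it assumes the pdftotext binary is on the PATH
and that the three PDFs listed above have been downloaded into the
current directory. It relies only on the -raw option already used in
this thread and on "-" as the output name to write to stdout.

```python
import shutil
import subprocess
from pathlib import Path

def raw_text_size(pdf_path):
    """Return the byte length of `pdftotext -raw` output, or None.

    None means pdftotext is not installed or the file does not exist.
    """
    if shutil.which("pdftotext") is None or not Path(pdf_path).is_file():
        return None
    result = subprocess.run(
        ["pdftotext", "-raw", str(pdf_path), "-"],  # "-" sends text to stdout
        capture_output=True,
        check=True,
    )
    return len(result.stdout)

# File names taken from the listing above.
for name in ("Big5-actual.pdf", "Big5-alt.pdf", "Big5-notags.pdf"):
    print(name, raw_text_size(name))
```

A much smaller number for one file than for the others is the symptom
described here: text is missing from the extraction rather than being
mapped differently.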
>
> With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
> is meant to be mapped to a single Unicode character.
> But Poppler has no support for /Alt tagging, which is why
> Big5-alt.txt is practically the same size as Big5-notags.txt .
>
>
> With these three PDFs, Adobe Reader cannot extract
> the Chinese characters from Big5-notags.pdf
> whereas it can do so from Big5-actual.pdf and Big5-alt.pdf
> due to the extra tagging.
>
> Apple's Preview and Poppler, on the other hand, can identify
> the characters (presumably from information in the fonts or
> their encoding arrays --- a CMap is not applicable).
> But both extract three copies when the multiple striking occurs,
> so are not dealing with the /Alt or /ActualText tags.
> Furthermore, Poppler gives nothing for the ideographs
> marked with /ActualText tagging.
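
For readers who have not seen this tagging, the general shape of a
marked-content sequence carrying /ActualText or /Alt is sketched below.
This follows the form given in the PDF Reference (a /Span tag with a
property list, bracketed by BDC and EMC). How the Big5-*.pdf files were
actually tagged is not shown in this thread, so the function names, the
glyph code and the offsets are all placeholders.

```python
def utf16_hex(text):
    """PDF hex string: UTF-16BE with a leading byte-order mark (FEFF)."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"

def tagged_span(key, replacement, content_ops):
    """Wrap content-stream operators in a marked-content sequence.

    key is "ActualText" or "Alt"; content_ops stands in for the
    overstruck text-showing operators that should map to `replacement`.
    """
    if key not in ("ActualText", "Alt"):
        raise ValueError("unsupported key: " + key)
    return "/Span << /%s %s >> BDC\n%s\nEMC" % (
        key, utf16_hex(replacement), content_ops)

# Placeholder operators: the glyph code <0102> and the offsets are made up.
ops = "<0102> Tj\n0.3 0 Td <0102> Tj\n0.3 0 Td <0102> Tj"
print(tagged_span("ActualText", "订", ops))
```

A text extractor that honours the tag would emit the replacement string
once for everything between BDC and EMC, instead of one character per
strike.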
>
>
> I've been looking at this kind of thing for some time now,
> with tagging and Chinese/Korean/Japanese documents (produced
> using pdfTeX) and the result of Text-extraction using different
> tools.
> It seems that no-one gets it right all the time, which makes
> it really hard to prepare a bug-report --- which software is
> the one which is buggy, when all appear to neglect available
> information, or process it incorrectly in different ways?
>
> For certain I can say that Poppler has (at least) two bugs:
>
> 1. /ActualText doesn't work properly for the content in these
> Big5-*.pdf documents;
>
> 2. /Alt isn't even recognised by Poppler
> (there is no code to support it in either
> TextOutputDev.cc or Gfx.cc).
>
> Could the PDF property streams in my PDFs be malformed in some way?
> Yes, I've looked at that, and have tried different ways to place
> the tagging in them --- these made no difference whatsoever to
> the result of text-extraction with the different software tools
> that I've tried.
>
>
>> comparison. I don't have any problems viewing other Chinese PDF files on
>> my system. I use pdffonts to check fonts used by the file, and I have
>> simsun.ttf installed.
>>
>> Adamson H
>
>
> Hope this helps someone identify the problems,
> and where/how to fix them.
>
> Cheers,
>
> Ross
>
>
> ------------------------------------------------------------------------
> Ross Moore ross at maths.mq.edu.au
> Mathematics Department office: E7A-419
> Macquarie University tel: +61 (0)2 9850 8955
> Sydney, Australia 2109 fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------