[poppler] PDF files with embedded Chinese fonts
leonardr at pdfsages.com
Mon Feb 9 11:11:53 PST 2009
The main problem with your files is that the producer is doing a S**TTY job
of font encoding...I don't know what type of font(s) you are starting with,
but the final PDF is produced with dynamically produced, quite poor, Type 1
fonts. In addition, there are _NO_ ToUnicode tables (ISO 32000-1, 9.6.1)
which is what prevents the proper extraction of the content by a tool that
follows the requirements of section 9.10 of ISO 32000-1.
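
A quick way to see which fonts lack a ToUnicode table is the "uni" column that Poppler's pdffonts prints. The following is only a rough sketch: the helper name fonts_without_tounicode and the SAMPLE text are made up for illustration (this is not real output from your files), and it assumes font names contain no spaces. It reads each row from the right, so it should not matter whether your pdffonts version also prints an encoding column.

```python
# Hypothetical helper: list fonts that pdffonts reports without ToUnicode.
# SAMPLE is invented illustration data, not output from dell440.pdf.
SAMPLE = """\
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
AAAAAA+FontA                         Type 1            yes yes no       12  0
BBBBBB+FontB                         CID TrueType      yes yes yes      15  0
"""


def fonts_without_tounicode(pdffonts_output: str) -> list[str]:
    """Return the names of fonts whose 'uni' column reads 'no'."""
    missing = []
    for line in pdffonts_output.splitlines():
        tokens = line.split()
        if len(tokens) < 6:
            continue  # blank or malformed row
        # Columns read from the right: emb, sub, uni, object number, generation.
        # The header and dashed separator never have 'no' here, so they are skipped.
        if tokens[-3] == "no":
            missing.append(tokens[0])  # assumes the name has no spaces
    return missing


if __name__ == "__main__":
    print(fonts_without_tounicode(SAMPLE))
```

To try it on a real file, feed it the text printed by running pdffonts on that file.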
Adobe Acrobat/Reader is able to properly extract the text from the
ActualText version, since the "ActualText" is provided in the tagged content
as defined in section 14.9.4 of ISO 32000-1. However, if you read the
preceding section (14.9.3) which describes the Alt tag, you will see that
it is not what you think it is - which is why that doesn't work.
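
To make the difference concrete, here is a toy model of how an extractor might treat the two entries. It is a sketch under my reading of 14.9.3 and 14.9.4, not how Acrobat or Poppler is actually implemented: the Span class and extract_text function are hypothetical names, and real marked-content handling involves far more than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One marked-content sequence (hypothetical model, not a Poppler type)."""
    glyph_text: str                    # what the font encoding alone yields
    actual_text: Optional[str] = None  # /ActualText: replacement text
    alt: Optional[str] = None          # /Alt: alternate description


def extract_text(spans: List[Span]) -> str:
    """Prefer /ActualText over the glyph text; never substitute /Alt."""
    parts = []
    for span in spans:
        if span.actual_text is not None:
            # ActualText replaces the whole enclosed run of glyphs.
            parts.append(span.actual_text)
        else:
            # Alt is a description of the content, not its text, so it is
            # ignored here and the glyph text is used as-is.
            parts.append(span.glyph_text)
    return "".join(parts)
```

In this model a run of four overstruck glyphs tagged with /ActualText comes out as one character, while the same run tagged only with /Alt comes out unchanged.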
It is unfortunate that only Adobe's tools correctly support Tagged PDF and
use those features to provide richer semantic extraction of PDF content. I
would love to see someone add such support to Poppler...
PDF Standards Architect
On Sun, Feb 8, 2009 at 3:16 PM, Ross Moore <ross at ics.mq.edu.au> wrote:
> Hi Adamson, and Albert.
> On 09/02/2009, at 3:14 AM, Adamson H wrote:
>> Yes, I have poppler-data 0.2.0-2. Please take a look at these two
>> screenshots http://launchpadlibrarian.net/21701829/Screenshot-evince.png
>> and http://launchpadlibrarian.net/21701834/Screenshot-foxit.png for
> I see broken characters too, using Apple's Preview to read dell440.pdf ,
> but not with Adobe Reader v9.x on MacOS X.
> (see attached images, of a portion of your PDF).
> This suggests that it is a problem within the fonts themselves,
> allowing different possible interpretations by font-rendering
> Alternatively, Adobe is using some information that other PDF
> renderers, or text-extraction tools, are not using.
> To back up this statement, suppose you select the last of
> the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.
> Adobe gives:
> Now use pdftotext -raw dell440.pdf
> and find the appropriate portion; you'll get:
> in which each ideograph is repeated 4 times over.
> This repetition seems to be a common way to get bold-face,
> rather than using a separate font.
> Copy/Paste from Apple's preview gives the same result as pdftotext .
> Perhaps it is this multiple overstriking that causes the bad display?
> If so, how does Adobe know how to get it correct?
> What extra information is Adobe using?
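
As a stopgap for text that has already been extracted, the repetition could be collapsed after the fact. This is a minimal sketch only: collapse_overstrike is a hypothetical helper, not part of Poppler, and it assumes each bold glyph was struck exactly "times" times. It will also wrongly shorten any genuine run of that many identical characters, so treat it as a heuristic.

```python
from itertools import groupby


def collapse_overstrike(text: str, times: int = 4) -> str:
    """Collapse runs of identical characters produced by overstriking.

    A run whose length is a multiple of `times` is assumed to be
    overstruck and is reduced by that factor; other runs are kept.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        if not ch.isspace() and n % times == 0:
            out.append(ch * (n // times))  # e.g. 4 copies -> 1, 8 copies -> 2
        else:
            out.append(ch * n)             # leave whitespace and odd runs alone
    return "".join(out)
```

For the Big5-*.pdf files described further down, where each glyph is struck three times, the call would use times=3.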
> I have another example of this kind of thing:
> Big5-actual.pdf 170k --- has /ActualText tagging
> Big5-actual.txt 97 bytes
> Big5-alt.pdf 169k --- has /Alt tagging
> Big5-alt.txt 434 bytes
> Big5-notags.pdf 157k --- no special tagging
> Big5-notags.txt 432 bytes
> The corresponding .txt files were obtained using pdftotext -raw
> with Poppler version as follows:
> [GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
> pdftotext version 0.10.3
> Copyright 2005-2009 The Poppler Developers - http://
> Copyright 1996-2004 Glyph & Cog, LLC
> It is clear just from the file-size of Big5-actual.txt that
> Poppler isn't extracting the /ActualText in this case.
> Also, if you look at the contents of Big5-notags.txt you'll
> see the same kind of "multiple-striking" to get the bold effect.
> With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
> is meant to be mapped to a single Unicode character.
> But Poppler has no support for /Alt tagging, which is why
> Big5-alt.txt is practically the same size as Big5-notags.txt .
> With these three PDFs, Adobe Reader cannot extract
> the Chinese characters from Big5-notags.pdf
> whereas it can do so from Big5-actual.pdf and Big5-alt.pdf
> due to the extra tagging.
> Apple's Preview and Poppler, on the other hand, can identify
> the characters (presumably from information in the fonts or
> their encoding arrays --- a CMap is not applicable).
> But both extract three copies when the multiple striking occurs,
> so are not dealing with the /Alt or /ActualText tags.
> Furthermore, Poppler gives nothing for the ideographs
> marked with /ActualText tagging.
> I've been looking at this kind of thing for some time now,
> with tagging and Chinese/Korean/Japanese documents (produced
> using pdfTeX) and the result of Text-extraction using different
> It seems that no-one gets it right all the time, which makes
> it really hard to prepare a bug-report --- which software is
> the one which is buggy, when all appear to neglect available
> information, or process it incorrectly in different ways?
> For certain I can say that Poppler has (at least) two bugs:
> 1. /ActualText doesn't work properly for the content in these
> Big5-*.pdf documents;
> 2. /Alt isn't even recognised by Poppler;
> (there is no coding to support it in either
> TextOutputDev.cc or Gfx.cc )
> Could the PDF property streams in my PDFs be malformed in some way?
> Yes, I've looked at that, and have tried different ways to place
> the tagging in them --- these made no difference whatsoever to
> the result of text-extraction with the different software tools
> that I've tried.
>> comparison. I don't have any problems viewing other Chinese PDF files on
>> my system. I use pdffonts to check fonts used by the file, and I have
>> simsun.ttf installed.
>> Adamson H
> Hope this helps someone identify the problems,
> and where/how to fix them.
> Ross Moore ross at maths.mq.edu.au
> Mathematics Department office: E7A-419
> Macquarie University tel: +61 (0)2 9850 8955
> Sydney, Australia 2109 fax: +61 (0)2 9850 8114