[poppler] PDF files with embedded Chinese fonts
Ross Moore
ross at ics.mq.edu.au
Sun Feb 8 12:16:18 PST 2009
Hi Adamson, and Albert.
On 09/02/2009, at 3:14 AM, Adamson H wrote:
> Yes, I have poppler-data 0.2.0-2. Please take a look at these two
> screenshots http://launchpadlibrarian.net/21701829/Screenshot-evince.png
> and http://launchpadlibrarian.net/21701834/Screenshot-foxit.png for
I see broken characters too, using Apple's Preview to read
dell440.pdf, but not with Adobe Reader v9.x on Mac OS X.
(see attached images, of a portion of your PDF).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dell440-adobe.png
Type: image/png
Size: 31174 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090209/0f8b3858/attachment-0002.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dell440-preview.png
Type: image/png
Size: 31986 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090209/0f8b3858/attachment-0003.png
-------------- next part --------------
This suggests that it is a problem within the fonts themselves,
allowing different possible interpretations by font-rendering
software.
Alternatively, Adobe is using some information that other PDF
renderers, or text-extraction tools, are not using.
To back up this statement, suppose you select the last of
the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.
Adobe gives:
???????????
Now use pdftotext -raw dell440.pdf
and find the appropriate portion; you'll get:
???????????????????????
?????????????????????
in which each ideograph is repeated 4 times over.
This repetition seems to be a common way to get bold-face,
rather than using a separate font.
Copy/Paste from Apple's Preview gives the same result as pdftotext.
Perhaps it is this multiple overstriking that causes the bad display?
If so, how does Adobe know how to get it correct?
What extra information is Adobe using?
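The repetition described above can be checked mechanically. The following is a minimal Python sketch, not anything that Poppler or Adobe Reader is known to do: it collapses runs of identical characters on the assumption that the overstruck copies end up adjacent in the extracted text. The function name collapse_overstrikes and its strikes parameter are hypothetical, introduced only for illustration.

```python
from itertools import groupby


def collapse_overstrikes(text: str, strikes: int = 4) -> str:
    """Collapse runs produced by overstriking a glyph `strikes` times.

    Assumption: the repeated copies are adjacent in the extracted text.
    A run whose length is a multiple of `strikes` is divided by `strikes`,
    so a genuinely doubled character (8 copies when strikes=4) survives
    as 2. Whitespace and all other runs are left untouched.
    """
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        if not ch.isspace() and n % strikes == 0:
            out.append(ch * (n // strikes))
        else:
            out.append(ch * n)
    return "".join(out)


if __name__ == "__main__":
    print(collapse_overstrikes("AAAABBBB"))  # prints "AB"
```

If the collapsed pdftotext output matches what Adobe Reader copies, that would support the idea that the overstriking, rather than the fonts, accounts for the difference in the extracted text.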
I have another example of this kind of thing:
http://www.maths.mq.edu.au/~ross/poppler/Big5/
Big5-actual.pdf 170k --- has /ActualText tagging
Big5-actual.txt 97 bytes
Big5-alt.pdf 169k --- has /Alt tagging
Big5-alt.txt 434 bytes
Big5-notags.pdf 157k --- no special tagging
Big5-notags.txt 432 bytes
The corresponding .txt files were obtained using pdftotext -raw
with Poppler version as follows:
[GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
pdftotext version 0.10.3
Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
It is clear just from the file-size of Big5-actual.txt that
Poppler isn't extracting the /ActualText in this case.
Also, if you look at the contents of Big5-notags.txt you'll
see the same kind of "multiple-striking" to get the bold effect.
With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
is meant to be mapped to a single Unicode character.
But Poppler has no support for /Alt tagging, which is why
Big5-alt.txt is practically the same size as Big5-notags.txt.
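The size comparison above can be reproduced with a short script. This is only a sketch: it assumes pdftotext is on the PATH and that the three Big5-*.pdf files are in the current directory. The helper names pdftotext_raw and extracted_sizes are hypothetical.

```python
import subprocess
from typing import Callable, Dict, Iterable


def pdftotext_raw(pdf_path: str) -> bytes:
    """Run `pdftotext -raw` and return the extracted text as bytes.

    "-raw" keeps the text in content-stream order; "-" as the output
    file name sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-raw", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def extracted_sizes(
    pdf_paths: Iterable[str],
    extract: Callable[[str], bytes] = pdftotext_raw,
) -> Dict[str, int]:
    """Map each PDF path to the byte count of its extracted text."""
    return {path: len(extract(path)) for path in pdf_paths}


if __name__ == "__main__":
    files = ["Big5-actual.pdf", "Big5-alt.pdf", "Big5-notags.pdf"]
    for name, size in extracted_sizes(files).items():
        print(f"{name}: {size} bytes")
```

A much smaller byte count for one file than for the others, as with Big5-actual.txt here, is a quick hint that some text was dropped rather than replaced.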
With these three PDFs, Adobe Reader cannot extract
the Chinese characters from Big5-notags.pdf
whereas it can do so from Big5-actual.pdf and Big5-alt.pdf
due to the extra tagging.
Apple's Preview and Poppler, on the other hand, can identify
the characters (presumably from information in the fonts or
their encoding arrays --- a CMap is not applicable).
But both extract three copies when the multiple striking occurs,
so are not dealing with the /Alt or /ActualText tags.
Furthermore, Poppler gives nothing for the ideographs
marked with /ActualText tagging.
I've been looking at this kind of thing for some time now:
tagging in Chinese/Korean/Japanese documents (produced
using pdfTeX) and the results of text extraction using different
tools.
It seems that no-one gets it right all the time, which makes
it really hard to prepare a bug-report --- which software is
the one which is buggy, when all appear to neglect available
information, or process it incorrectly in different ways?
For certain I can say that Poppler has (at least) two bugs:
1. /ActualText doesn't work properly for the content in these
Big5-*.pdf documents;
2. /Alt isn't even recognised by Poppler
(there is no code to support it in either
TextOutputDev.cc or Gfx.cc).
Could the PDF property streams in my PDFs be malformed in some way?
Yes, I've looked at that, and have tried different ways to place
the tagging in them --- these made no difference whatsoever to
the result of text-extraction with the different software tools
that I've tried.
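One way to confirm that the tagging is physically present in a file, independently of any viewer, is to count the /ActualText and /Alt keys in the PDF bytes. The sketch below is a rough check under stated assumptions, not a PDF parser: it assumes streams are either uncompressed or Flate-compressed with no other filters, and that they are not stored inside object streams or encrypted. The name count_tags is hypothetical.

```python
import re
import zlib
from typing import Dict

# Matches the body between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# The lookahead avoids counting longer names such as /Alternate.
TAG_RES = {
    "ActualText": re.compile(rb"/ActualText(?![A-Za-z0-9])"),
    "Alt": re.compile(rb"/Alt(?![A-Za-z0-9])"),
}


def inflate_or_raw(body: bytes) -> bytes:
    """Return the Flate-decoded stream body, or the body itself on failure."""
    try:
        # decompressobj tolerates the end-of-line bytes after the data.
        return zlib.decompressobj().decompress(body)
    except zlib.error:
        return body


def count_tags(pdf_bytes: bytes) -> Dict[str, int]:
    """Count /ActualText and /Alt keys outside and inside streams."""
    chunks = [STREAM_RE.sub(b"", pdf_bytes)]
    chunks += [inflate_or_raw(m.group(1)) for m in STREAM_RE.finditer(pdf_bytes)]
    return {
        name: sum(len(rx.findall(chunk)) for chunk in chunks)
        for name, rx in TAG_RES.items()
    }


if __name__ == "__main__":
    with open("Big5-actual.pdf", "rb") as fh:
        print(count_tags(fh.read()))
```

Non-zero counts only show that the keys exist in the file. They say nothing about whether the surrounding property lists are well-formed, which still has to be checked against the PDF specification.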
> comparison. I don't have any problems viewing other Chinese PDF files on
> my system. I use pdffonts to check fonts used by the file, and I have
> simsun.ttf installed.
>
> Adamson H
Hope this helps someone identify the problems,
and where/how to fix them.
Cheers,
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------