[poppler] PDF files with embedded Chinese fonts

Mon Feb 9 11:11:53 PST 2009

The main problem with your files is that the producer is doing a S**TTY job
of font encoding. I don't know what type of font(s) you are starting with,
but the final PDF contains dynamically generated, quite poor, Type 1
fonts.  In addition, there are _NO_ ToUnicode tables (ISO 32000-1, 9.6.1),
which is what prevents proper extraction of the content by a tool that
follows the requirements of section 9.10 of ISO 32000-1.
Adobe Acrobat/Reader is able to properly extract the text from the
ActualText version, since the "ActualText" is provided in the tagged content
as defined in section 14.9.4 of ISO 32000-1.  However, if you read the
preceding section (14.9.3), which describes the Alt tag, you will see that
it is not what you think it is, which is why that doesn't work.
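
As a rough first check for the keys discussed above, the following Python
sketch counts occurrences of /ToUnicode, /ActualText and /Alt in the raw
bytes of a PDF. The function name count_pdf_keys is made up for this
illustration. It is a naive byte scan, not a PDF parser: keys that sit
inside compressed streams or object streams are invisible to it, so a
count of zero does not prove that a key is absent.

```python
import re

# PDF name keys of interest (ISO 32000-1: ToUnicode, ActualText, Alt).
KEYS = (b"/ToUnicode", b"/ActualText", b"/Alt")


def count_pdf_keys(data: bytes) -> dict:
    """Count literal occurrences of each key in raw PDF bytes.

    Assumption: the keys are written literally and are not hidden
    inside compressed (e.g. Flate-encoded) streams.
    """
    counts = {}
    for key in KEYS:
        # The lookahead rejects longer names such as /Alternate.
        pattern = re.compile(re.escape(key) + rb"(?![A-Za-z0-9_.\-])")
        counts[key.decode("ascii")] = len(pattern.findall(data))
    return counts


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, count_pdf_keys(handle.read()))
```

Run it with one or more PDF paths as arguments. A file whose fonts carry
no ToUnicode entry in uncompressed form will show a count of 0 for that
key, but only a real parser can confirm the result.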

It is unfortunate that only Adobe's tools correctly support Tagged PDF and
use those features to provide richer semantic extraction of PDF content.  I
would love to see someone add such support to Poppler...

Leonard Rosenthol
PDF Standards Architect
Adobe Systems

On Sun, Feb 8, 2009 at 3:16 PM, Ross Moore <ross at ics.mq.edu.au> wrote:

> Hi Adamson, and Albert.
>
> On 09/02/2009, at 3:14 AM, Adamson H wrote:
>
>> Yes, I have poppler-data 0.2.0-2. Please take a look at these two
>> screenshots http://launchpadlibrarian.net/21701829/Screenshot-evince.png
>> and http://launchpadlibrarian.net/21701834/Screenshot-foxit.png for
>>
>
> I see broken characters too, using Apple's Preview to read  dell440.pdf ,
> but not with Adobe Reader v9.x on MacOS X.
>
> (see attached images, of a portion of your PDF).
>
> This suggests that it is a problem within the fonts themselves,
> allowing different possible interpretations by font-rendering
> software.
>
> Alternatively, Adobe is using some information that other PDF
> renderers, or text-extraction tools, are not using.
> To back up this statement, suppose you select the last of
> the blue dot-points from dell440.pdf and Copy/Paste to UTF8 text.
>
> Adobe gives:
>
>  以电子邮件发送您的订单
>
>
> Now use      pdftotext -raw dell440.pdf
> and find the appropriate portion; you'll get:
>
> 以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单
>
> in which each ideograph is repeated 4 times over.
> This repetition seems to be a common way to get bold-face,
> rather than using a separate font.
> Copy/Paste from Apple's Preview gives the same result as  pdftotext .
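
The repetition described above can be undone after extraction when the
number of strikes is known. The Python sketch below collapses any unit
that occurs a fixed number of times in a row into one copy. The function
name collapse_overstrikes is made up for this illustration, and this is
a post-processing heuristic, not what Adobe or Poppler does. It will also
collapse text that is legitimately repeated that many times.

```python
import re


def collapse_overstrikes(text: str, copies: int) -> str:
    """Collapse each unit repeated `copies` times in a row to one copy.

    A unit may be a single character or a short run of characters, since
    some extractors emit a whole run several times (e.g. a two-character
    pair struck four times).
    """
    if copies < 2:
        return text
    # (.+?) is the shortest candidate unit; \1{n} requires n more copies.
    pattern = re.compile(r"(.+?)\1{%d}" % (copies - 1))
    return pattern.sub(r"\1", text)


# The pdftotext output quoted above, with every unit struck 4 times.
raw = "以以以以电电电电子子子子邮邮邮邮件件件件发发发发送送送送您您您您的的的的订单订单订单订单"
print(collapse_overstrikes(raw, 4))  # prints 以电子邮件发送您的订单
```

The result matches the text that Adobe Reader gives for the same
selection. The strike count has to be supplied by the caller; it is 4 in
dell440.pdf.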
>
> Perhaps it is this multiple overstriking that causes the bad display?
> If so, how does Adobe know how to get it correct?
> What extra information is Adobe using?
>
>
> I have another example of this kind of thing:
>
>   http://www.maths.mq.edu.au/~ross/poppler/Big5/
>
> Big5-actual.pdf   170k      --- has /ActualText tagging
> Big5-actual.txt   97 bytes
> Big5-alt.pdf      169k      --- has /Alt tagging
> Big5-alt.txt      434 bytes
> Big5-notags.pdf   157k      ---  no special tagging
> Big5-notags.txt   432 bytes
>
> The corresponding .txt files were obtained using  pdftotext -raw
> with Poppler version as follows:
>
> [GlenMorangie:~/PDFTeX/test-PDFs] rossmoor% pdftotext --help
> pdftotext version 0.10.3
> Copyright 2005-2009 The Poppler Developers - http://poppler.freedesktop.org
> Copyright 1996-2004 Glyph & Cog, LLC
>
>
> It is clear just from the file-size of  Big5-actual.txt  that
> Poppler isn't extracting the /ActualText  in this case.
> Also, if you look at the contents of  Big5-notags.txt  you'll
> see the same kind of "multiple-striking" to get the bold effect.
>
> With Big5-alt.pdf (and Big5-actual.pdf) this triple-striking
> is meant to be mapped to a single Unicode character.
> But Poppler has no support for /Alt tagging, which is why
> Big5-alt.txt  is practically the same size as  Big5-notags.txt .
>
>
> With these three PDFs, Adobe Reader cannot extract
> the Chinese characters from   Big5-notags.pdf
> whereas it can do so from   Big5-actual.pdf  and  Big5-alt.pdf
> due to the extra tagging.
>
> Apple's Preview and Poppler, on the other hand, can identify
> the characters (presumably from information in the fonts or
> their encoding arrays --- a CMap is not applicable).
> But both extract three copies when the multiple striking occurs,
> so are not dealing with the /Alt or /ActualText tags.
> Furthermore, Poppler gives nothing for the ideographs
> marked with /ActualText tagging.
>
>
> I've been looking at this kind of thing for some time now,
> with tagging and Chinese/Korean/Japanese documents (produced
> using pdfTeX) and the result of Text-extraction using different
> tools.
> It seems that no one gets it right all the time, which makes
> it really hard to prepare a bug-report --- which software is
> the one which is buggy, when all appear to neglect available
> information, or process it incorrectly in different ways?
>
> For certain I can say that Poppler has (at least) two bugs:
>
>  1.  /ActualText  doesn't work properly for the content in these
>                   Big5-*.pdf  documents;
>
>  2.  /Alt  isn't even recognised by Poppler;
>       (there is no coding to support it in either
>          TextOutputDev.cc   or   Gfx.cc )
>
> Could the PDF property streams in my PDFs be malformed in some way?
> Yes, I've looked at that, and have tried different ways to place
> the tagging in them --- these made no difference whatsoever to
> the result of text-extraction with the different software tools
> that I've tried.
>
>
>> comparison. I don't have any problems viewing other Chinese PDF files on
>> my system. I use pdffonts to check fonts used by the file, and I have
>> simsun.ttf installed.
>>
>> Adamson H
>>
>
>
> Hope this helps someone identify the problems,
> and where/how to fix them.
>
> Cheers,
>
>        Ross
>
>
> ------------------------------------------------------------------------
> Ross Moore                                       ross at maths.mq.edu.au
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
>
>
>
>
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler
>
>