[poppler] PDF files with embedded Chinese fonts

Ross Moore ross at ics.mq.edu.au
Mon Feb 9 15:53:20 PST 2009

Hello Leonard,

On 10/02/2009, at 6:11 AM, Leonard Rosenthol wrote:

> The main problem with your files is that the producer is doing a  
> S**TTY job of font encoding...I don't know what type of font(s) you  
> are starting with, but the final PDF is produced with dynamically  
> produced, quite poor, Type 1 fonts.  In addition, there are _NO_  
> ToUnicode tables (ISO 32000-1, 9.6.1)  which is what prevents the  
> proper extraction of the content by a tool that follows the  
> requirements of section 9.10 of ISO 32000-1.

The issue that I'm mostly interested in is not the quality
(or lack thereof) of the fonts, but the support for /ActualText
tagging. I'll make some relevant comments on the producer below.

> Adobe Acrobat/Reader is able to properly extract the text from the  
> ActualText version, since the "ActualText" is provided in the  
> tagged content as defined in section 14.9.4 of ISO 32000-1.   
> However, if you read the preceding section (14.9.3) which  
> describes the Alt tag, you will see that it is not what you think  
> it is - which is why that doesn't work.

Of course /ActualText is the appropriate thing to use with my examples,
but the syntax of using /Alt is the same, so I produced those variants
also, just to see what PDF extraction tools would do with them.
As far as Poppler is concerned, I could find no support whatsoever
for /Alt tagging within its codebase. If this sparks a debate on
just how /Alt should be supported, then so much the better.

These sections of the ISO 32000-1 document (PDF v1.7) are very similar
to the relevant parts of the "PDF Reference" for PDF v1.6.
Both documents provide essentially the same example:
v1.6:  /Span << /Lang (en-us) /Alt (six-point star) >> BDC (✡) Tj EMC

v1.7:  /Span << /Lang (en-us) /Alt (six-point star) >> BDC (A) Tj EMC

In the latter, presumably the font is intended to have a CMap that  
maps "A" to the appropriate dingbat.

What content should be extracted?
Surely that depends upon the nature of the extractor; e.g.
  - a screen-reader would say "six-point star",
  - copying to another sophisticated typesetting program would  
probably want the dingbat,
  - archiving into a database might want both representations.
Almost never would you want the letter 'A' to be extracted,
yet that is what some tools might well give.
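
To make these three cases concrete, here is a minimal Python sketch.
Everything in it is hypothetical: the Span class, the extract() function
and the mode names are illustrations only, not Poppler or Acrobat API,
and the preference order (/ActualText, then the glyph's own mapping,
then the raw string) is my assumption rather than anything the spec
mandates. The dingbat is assumed to be U+2721.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One marked-content sequence (hypothetical model, not a real API)."""
    shown: str                         # raw string operand of Tj, e.g. "A"
    glyph_text: Optional[str] = None   # what the font maps it to, if known
    alt: Optional[str] = None          # value of /Alt, if present
    actual_text: Optional[str] = None  # value of /ActualText, if present


def extract(span: Span, mode: str) -> str:
    """Return the text a given kind of extractor might want (assumed policy)."""
    # Assumed order of preference for the "real" text of the span.
    base = span.actual_text or span.glyph_text or span.shown
    if mode == "screen-reader":
        # Read the description aloud when there is one.
        return span.alt or base
    if mode == "copy":
        # Hand the character itself to the receiving program.
        return base
    if mode == "archive":
        # Keep both representations.
        return f"{base} [{span.alt}]" if span.alt else base
    raise ValueError(f"unknown mode: {mode}")


# The v1.7 example: "A" is shown, the font draws a six-point star.
star = Span(shown="A", glyph_text="\u2721", alt="six-point star")
```

With no font mapping and no tagging support, extract(Span(shown="A"),
"copy") falls back to the letter "A", which is exactly the unwanted
result described above.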

> It is unfortunate that only Adobe's tools correctly support Tagged  
> PDF and use those features to provide richer semantic extraction of  
> PDF content.

The producer of my examples is pdfTeX, with experimental
modifications for producing tagged PDF.
We (myself and others) are attempting to develop appropriate
tagging for scientific and multi-lingual documents, both for
accessibility and document structure and content --- including
math formulae.

Supporting /ActualText and /Alt is meant to be the easy part;
but even there it is difficult to advance when there is
inconsistency in what PDF browsers do with these.

>  I would love to see someone add such support to Poppler...

Absolutely. And with pdfTeX also a producer of tagged PDF,
this may help to make it more attractive for people to get
involved with developing such support.

Now as for the fonts in my examples, these are what you get
by default when using LaTeX's CJK package. They are produced
using the "virtual font" mechanism, whereby a single character
(a Chinese ideograph, say) is built using several pieces drawn
from maybe one, two or more other fonts. Even if CMaps were
provided for the construction pieces, this could not be applied
to the ideograph as a whole --- hence the applicability of an
/ActualText  replacement string. This situation certainly meets
the criterion of being "content that does translate into text
but that is represented in a nonstandard way."

There are a great many existing documents (tens of thousands?)
that have used  CJK.sty , with more being created all the time
(e.g. by Chinese, Japanese and Korean mathematicians and scientists).
I see adding /ActualText replacements as a means to enable
faithful extraction of their content, and translation of the
non-standard representation into a UTF-8 or UTF-16 version.
This should be particularly useful in the preparation of
multilingual journals and conference publications, say, where
authors provide their information using whatever encodings
and techniques they are most comfortable with.
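
As an illustration of what such a replacement looks like in the content
stream, here is a small Python sketch. It relies on one fact from the
PDF spec: a text string may be written as UTF-16BE preceded by the byte
order mark FE FF, and may be given as a hexadecimal string. The function
names are hypothetical, and the show operator passed in ("<2D> Tj") is
only a placeholder for whatever the producer actually emits to draw the
pieces of the ideograph.

```python
def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: BOM FE FF followed by UTF-16BE."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def tagged_span(text: str, show_ops: str) -> str:
    """Wrap placeholder show operators in a /Span carrying /ActualText."""
    return f"/Span << /ActualText {actual_text_hex(text)} >> BDC {show_ops} EMC"


# U+4E2D, drawn by some (placeholder) operators.
example = tagged_span("\u4e2d", "<2D> Tj")
```

Here example is the string
"/Span << /ActualText <FEFF4E2D> >> BDC <2D> Tj EMC".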

Further concerning the fonts, both the v1.6 and v1.7 PDF specs
indicate that having a  /ToUnicode  CMap is "optional".
The fonts in my  Big5/ examples instead have a /CharSet in the
/FontDescriptor, and /Encoding arrays, where characters are
named such as:

/CharSet (/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)

/Encoding 256 array
0 1 255 {1 index exch /.notdef put} for
dup 0 /uni4E00 put
dup 10 /uni4E0A put
dup 13 /uni4E0D put
dup 45 /uni4E2D put
dup 134 /uni4E86 put
dup 155 /uni4E9B put
dup 229 /uni4EE5 put
readonly def

It is clear which Unicode points are intended,
and Poppler must be using this information, since it
can extract the text when there is no extra tagging.
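
For what it is worth, recovering the code points from those glyph names
is mechanical. The following Python sketch shows one way to do it; it
is not a description of what Poppler actually does internally, which I
have not checked. It handles only the simple "uni" plus four hex digits
form used in my fonts, and the names ENCODING_PS, PUT_RE and
encoding_to_unicode are made up for the illustration.

```python
import re
from typing import Dict

# The /Encoding entries quoted above, as they appear in the font program.
ENCODING_PS = """\
dup 0 /uni4E00 put
dup 10 /uni4E0A put
dup 13 /uni4E0D put
dup 45 /uni4E2D put
dup 134 /uni4E86 put
dup 155 /uni4E9B put
dup 229 /uni4EE5 put
"""

# Matches "dup <code> /uniXXXX put" and captures the code and the hex digits.
PUT_RE = re.compile(r"dup\s+(\d+)\s+/uni([0-9A-Fa-f]{4})\s+put")


def encoding_to_unicode(ps: str) -> Dict[int, str]:
    """Map each character code to the Unicode character its glyph name implies."""
    return {int(code): chr(int(hexval, 16)) for code, hexval in PUT_RE.findall(ps)}


table = encoding_to_unicode(ENCODING_PS)
```

Under these assumptions table[45] is U+4E2D, and the seven entries
correspond one-to-one to the names listed in /CharSet.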

> Leonard Rosenthol
> PDF Standards Architect
> Adobe Systems

Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
