[poppler] PDF files with embedded Chinese fonts

Ross Moore ross at ics.mq.edu.au
Tue Feb 10 16:25:45 PST 2009


Hi Leonard,

On 10/02/2009, at 12:20 PM, Leonard Rosenthol wrote:

> It is unfortunate that only Adobe's tools correctly support Tagged  
> PDF and use those features to provide richer semantic extraction of  
> PDF content.
>
>> The producer of my examples is pdfTeX, with experimental
>> modifications for producing tagged PDF.
>
> Excellent - looking forward to seeing it in production...

That's very encouraging. Thank you.

>
>
>> We (myself and others) are attempting to develop appropriate
>> tagging for scientific and multi-lingual documents, both for
>> accessibility and document structure and content --- including
>> math formulae.
>
> Then you should probably take a look at the proposal from the
> PDF/UA committee that was accepted for inclusion in ISO 32000-2.  It is
> a complete mapping of MathML tags into PDF tag structure.  That is
> how tagging of math should/will be done.

Then this will be a vital document to have.
Is there a place to download this freely?
I've applied for a registration here:  http://pdf.editme.com/PDFUA .

>
> I'd also like to see something similar done for other scientific  
> grammars, such as ChemML...

Sure.

>
>> Supporting /ActualText and /Alt is meant to be the easy part;
>> but even there it is difficult to advance when there is
>> inconsistency in what PDF browsers do with these.
>
> Unfortunately true :(.

>> Now as for the fonts in my examples, these are what you get
>> by default when using LaTeX's CJK package. They are produced
>> using the "virtual font" mechanism, whereby a single character
>> (Chinese ideograph, say) is built using several pieces drawn
>> from maybe one, two or more other fonts.
>
>
> YUCKO!

You can say that now; but when these methods were developed,
as much as 10 years ago, this was the best approach available.
It is still used because it worked so well.

>>
>> Even if CMaps were
>> provided for the construction pieces, this could not be applied
>> to the ideograph as a whole --- hence the applicability of an
>> /ActualText  replacement string. This situation certainly meets
>> the criterion of being "content that does translate into text
>> but that is represented in a nonstandard way."
>
>
> Yes, that's exactly what ActualText is for - providing the real  
> text (hopefully in Unicode) for something that is represented via  
> some other graphical method or custom glyphs.


> One of my favorite examples is to use the symbol that Prince tried  
> to use for his name and having the ActualText be "the artist  
> formerly known as Prince".  Other good uses would be for providing  
> simple forms of equations, chemical formulas and the like.

The PDF Reference describes /ActualText as being a character-level thing,
not for whole words or phrases --- that is what /Alt was to be for.
Has this view changed now?

>> There are many many (tens of thousands ??) of existing documents
>> that have used  CJK.sty , with more being created all the time
>> (e.g. by Chinese/Japanese/Korean mathematicians and scientists).
>
>
> Doesn't make it right...

It's not a question of right or wrong.
This is the way people work in those fields. Their time is
best spent doing what they are expert at, not keeping up
to date with the technologies of how their work is published,
and learning new methods as that technology changes.

It is our job to ensure that their work, new or old,
remains accessible for decades to come...


>> I see adding /ActualText replacements as a means to enable
>> faithful extraction of their content, and translation of the
>> non-standard representation into a UTF8 or UTF16 version.
>
>
> That would be one way.

  ... by techniques such as this.

> The other, as noted, is to include a ToUnicode CMap - as that is  
> supported by every PDF parser that I am aware of...

This isn't possible in the CJK case using virtual fonts,
since there isn't a 1-1 mapping between glyph and the
fully-constructed character.

It *is* possible for the individual glyphs themselves,
but this isn't the best information for extraction.

>
>> Further concerning the fonts, both the v1.6 and v1.7 PDF specs
>> indicate that having a  /ToUnicode  CMap is "optional".
>
> Yes, but it's a VERY GOOD idea since it solves the problem you are  
> trying to address in a simpler (and more global) way.

Yes I agree it is a very good idea, and I've created /ToUnicode
CMap resources that map TeX's math symbols to the correct code-points.
This is an important 1st step for making math more accessible within
PDFs generated using TeX.

In those cases the "glyph to character relation" is typically
  1-1,  or "1 to many" for some special symbols.
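
As an illustration of the 1-1 and "1 to many" cases, here is a minimal
Python sketch that formats bfchar entries of the kind found in a
/ToUnicode CMap. It is not taken from pdfTeX or from the resources
mentioned above; the function name and the character codes are
hypothetical, and single-byte character codes are assumed.

```python
def bfchar_entries(mapping):
    """Format {char_code: unicode_text} as a ToUnicode 'bfchar' block.

    Assumes single-byte character codes and a small mapping that fits
    in one block; a real CMap also needs its header and codespace range.
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for code, text in sorted(mapping.items()):
        # The destination is UTF-16BE written as hex digits; several
        # code points simply concatenate (the "1 to many" case).
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Hypothetical codes: one 1-1 entry and one "1 to many" entry.
example = {0x50: "\u2211", 0x0C: "fi"}
print(bfchar_entries(example))
```

Each entry maps one glyph code to one or more code points; nothing in
this form lets several glyph codes jointly stand for one character.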

However, with the virtual CJK fonts described above, the glyph
to character relation is "many to 1"; that is, multiple glyphs
are needed to construct a single character (ideograph).

Thus the /ToUnicode resource is not as useful as one would like,
and it would simply be repeating information that is encoded
already in the glyph names. The /ActualText seems to be more
suited to handling this situation.
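
A rough sketch of what such a replacement looks like in a content
stream, written as a small Python helper. The helper name and the
glyph-drawing operators in the example are hypothetical; only the
/Span ... BDC ... EMC wrapping with an /ActualText entry is the
mechanism under discussion.

```python
def actual_text_span(text, glyph_ops):
    """Wrap content-stream operators in a /Span carrying /ActualText.

    `glyph_ops` is whatever draws the pieces of the ideograph; it is
    passed through unchanged.
    """
    # UTF-16BE with a leading byte-order mark, written as a PDF hex string.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{glyph_ops}\nEMC"


# Hypothetical: two pieces from two different fonts build one ideograph.
pieces = "/F1 10 Tf (\\001) Tj /F2 10 Tf (\\002) Tj"
print(actual_text_span("\u4e2d", pieces))
```

The "many to 1" relation is expressed by the span as a whole: however
many glyphs sit between BDC and EMC, an extractor that honours
/ActualText should report the single replacement string.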


>
>
>> The fonts in my  Big5/ examples instead have a /CharSet in the
>> /FontDescriptor, and /Encoding arrays, where characters are
>> named such as:
>>
>> /CharSet (/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)
>
> By what standard naming convention did you come up with /uniXXXX?

Not my work -- but an Adobe "standard" has indeed been followed.
In my   http://www.maths.mq.edu.au/~ross/poppler/KS/
examples, the fonts say:

%!PS-AdobeFont-1.0: Umj10 001.001
%%CreationDate: 22.05.98 at 16:38
%%VMusage: 1024 57716
% Generated by Fontographer 4.1
% AGL compliant glyph names added by script hlatex2agl.pl 2005-Jul-27.
% Copyright \(c\) Un, 1998. All rights reserved.
% ADL: 800 200 0


By AGL-compliant, presumably the author means following the rules given at:

     http://www.adobe.com/devnet/opentype/archives/glyph.html

viz.  section 2, step 3, 3rd bullet-point:

* otherwise, if the component is of the form "uni" (U+0075 U+006E
U+0069) followed by a sequence of uppercase hexadecimal digits (0 .. 9,
A .. F, i.e. U+0030 .. U+0039, U+0041 .. U+0046), the length of that
sequence is a multiple of four, and each group of four digits
represents a number in the set {0x0000 .. 0xD7FF, 0xE000 .. 0xFFFF},
then interpret each such number as a Unicode scalar value and map the
component to the string made of those scalar values. Note that the
range and digit length restrictions mean that the "uni" prefix can be
used only with Unicode values from the Basic Multilingual Plane (BMP).

    (some examples and guidelines are given further down.)
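
The rule quoted above is mechanical enough to sketch in a few lines of
Python. This is only an illustration of that one bullet-point, not of
the full glyph-name algorithm, and the function name is hypothetical.

```python
import re

# "uni" followed by one or more groups of four uppercase hex digits.
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")


def decode_uni_component(component):
    """Map a 'uniXXXX...' glyph-name component to its Unicode string.

    Returns None when the component does not satisfy the quoted rule.
    """
    m = _UNI.fullmatch(component)
    if m is None:
        return None
    digits = m.group(1)
    values = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
    # Only 0x0000..0xD7FF and 0xE000..0xFFFF are allowed (no surrogates).
    if any(0xD800 <= v <= 0xDFFF for v in values):
        return None
    return "".join(chr(v) for v in values)


# The /CharSet string quoted earlier in this message.
charset = "/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5"
names = [n for n in charset.split("/") if n]
print([decode_uni_component(n) for n in names])
```

So an extractor that applies this rule can recover the code-point of
each glyph from its name alone, without any /ToUnicode resource.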


> The names in CharSet have to be from the Adobe Glyph List (AGL).

Precisely.
However the versions that I've seen of this, with names listed explicitly,
are not fully comprehensive with regard to CJK fonts.
e.g.
(2008)  http://www.adobe.com/devnet/opentype/archives/aglfn.txt
(2002)  http://partners.adobe.com/public/developer/en/opentype/glyphlist.txt
(2006)  http://partners.adobe.com/public/developer/en/opentype/aglfn13.txt

Is there a later version that assigns names to CJK ideographs?
Otherwise the naming scheme used is what should be expected.


> This "uniXXXX" form is not standard and thus unsupported by all  
> products

   Adobe Reader ?

> - though I do know that some products have adopted it as a  
> "shortcut" to doing a correct ToUnicode table.

The /Encoding as a set of glyph names is associated entirely
with the font Resource itself, whereas the /ToUnicode  Resource
is associated with how the font is included within the document.
The latter thus involves an extra layer of management --- a layer
which wasn't needed when the only requirement was to get a document
(PostScript or PDF) that printed well.

TeX software does not yet have the ability to marry  /ToUnicode
resources to fonts which are included in a document by virtue
of the virtual font technique. This is an aspect that I can have
a look at; but as I said above, it isn't as appropriate as you
might think it would be, and the information that this would
provide is actually already available.


>  Leonard


All the best,

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------




