[poppler] PDF files with embedded Chinese fonts

Mon Feb 9 17:20:00 PST 2009

On Mon, Feb 9, 2009 at 6:53 PM, Ross Moore <ross at ics.mq.edu.au> wrote:

> These sections of the ISO 32000-1 document (PDF v1.7) are very similar
> to the relevant parts of the "PDF Reference" for PDF v1.6.
>
> What content should be extracted?
> Surely that depends upon the nature of the extractor; e.g.
>  - a screen-reader would say "six-point star",
>  - copying to another sophisticated typesetting program would probably want
> the dingbat,
>  - archiving into a database might want both representations.
> Almost never would you want the letter 'A' to be extracted,
> yet that is what some tools might well give.
>

Alt is used ONLY for the purposes of screen readers (or potentially a
tooltip), just as it is for HTML & web browsers.  It would NEVER be used for
text extraction.

>
>> It is unfortunate that only Adobe's tools correctly support Tagged PDF and
>> use those features to provide richer semantic extraction of PDF content.
>>
>
> The producer of my examples is pdfTeX, with experimental
> modifications for producing tagged PDF.

Excellent - looking forward to seeing it in production...

> We (myself and others) are attempting to develop appropriate
> tagging for scientific and multi-lingual documents, both for
> accessibility and document structure and content --- including
> math formulae.
>

Then you should probably take a look at the proposal from the PDF/UA
committee that was accepted for inclusion in ISO 32000-2.  It is a complete
mapping of MathML tags into PDF tag structure.  That is how tagging of math
should/will be done.

I'd also like to see something similar done for other scientific grammars,
such as ChemML...

> Supporting /ActualText and /Alt is meant to be the easy part;
> but even there it is difficult to advance when there is
> inconsistency in what PDF browsers do with these.
>

Unfortunately true :(.

> Now as for the fonts in my examples, these are what you get
> by default when using LaTeX's CJK package. They are produced
> using the "virtual font" mechanism, whereby a single character
> (Chinese ideograph, say) is built using several pieces drawn
> from maybe one, two or more other fonts.

YUCKO!

> Even if CMaps were
> provided for the construction pieces, this could not be applied
> to the ideograph as a whole --- hence the applicability of an
> /ActualText  replacement string. This situation certainly meets
> the criterion of being "content that does translate into text
> but that is represented in a nonstandard way."
>

Yes, that's exactly what ActualText is for - providing the real text
(hopefully in Unicode) for something that is represented via some other
graphical method or custom glyphs.  One of my favorite examples is to use
the symbol that Prince tried to use for his name and having the ActualText
be "the artist formerly known as Prince".  Other good uses would be for
providing simple forms of equations, chemical formulas and the like.
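
As a minimal sketch of what that looks like in a content stream, assuming
the usual /Span marked-content form and a Unicode text string written as
UTF-16BE hex with a leading FEFF byte order mark (the helper name
actual_text_span and the "(A) Tj" operator in the usage note are
hypothetical, for illustration only):

```python
def actual_text_span(text: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span marked-content sequence
    that carries the replacement text in /ActualText."""
    # Unicode PDF text strings are UTF-16BE, prefixed with the BOM FE FF.
    hex_string = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    # BDC opens the marked-content sequence, EMC closes it.
    return f"/Span <</ActualText <{hex_string}>>> BDC\n{content_ops}\nEMC"
```

For example, actual_text_span("\u4e00", "(A) Tj") wraps whatever operators
draw the glyph pieces and attaches U+4E00 as the text to extract.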

> There are many many (tens of thousands ??) of existing documents
> that have used  CJK.sty , with more being created all the time
> (e.g. by Chinese/Japanese/Korean mathematicians and scientists).

Doesn't make it right...

> I see adding /ActualText replacements as a means to enable
> faithful extraction of their content, and translation of the
> non-standard representation into a UTF8 or UTF16 version.

That would be one way.   The other, as noted, is to include a ToUnicode CMap
- as that is supported by every PDF parser that I am aware of...
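
A rough sketch of generating such a CMap, assuming a simple font with
one-byte character codes and following the general layout of the ToUnicode
example in the PDF Reference (the function name to_unicode_cmap and the
code-to-text mapping passed in are hypothetical):

```python
def to_unicode_cmap(mapping: dict[int, str]) -> str:
    """Build the body of a ToUnicode CMap stream for one-byte codes.

    `mapping` takes a character code (0..255) to the text it stands for.
    """
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # Emit the mappings in bfchar sections of at most 100 entries each.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination strings are UTF-16BE, written as hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)
```

The result would go into a stream object referenced by the font
dictionary's /ToUnicode entry.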

> Further concerning the fonts, both the v1.6 and v1.7 PDF specs
> indicate that having a  /ToUnicode  CMap is "optional".
>

Yes, but it's a VERY GOOD idea since it solves the problem you are trying to
address in a simpler (and more global) way.

> The fonts in my  Big5/ examples instead have a /CharSet in the
> /FontDescriptor, and /Encoding arrays, where characters are
> named such as:
>
> /CharSet (/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)
>

By what standard naming convention did you come up with /uniXXXX?  The
names in CharSet have to be from the Adobe Glyph List (AGL).  This "uniXXXX"
form is not standard and thus not supported by all products - though I do
know that some products have adopted it as a "shortcut" to doing a correct
ToUnicode table.
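
For tools that do take the shortcut, the decoding amounts to something like
this sketch, which assumes every name is "uni" followed by exactly four
uppercase hex digits, as in the quoted /CharSet string (the function name
charset_to_unicode is hypothetical, and anything not matching that pattern
is simply skipped):

```python
import re


def charset_to_unicode(charset: str) -> dict[str, str]:
    """Map uniXXXX glyph names found in a /CharSet string to characters."""
    result = {}
    # Drop the enclosing parentheses, then split on the name delimiter.
    for name in charset.strip("()").split("/"):
        match = re.fullmatch(r"uni([0-9A-F]{4})", name)
        if match:
            result[name] = chr(int(match.group(1), 16))
    return result


names = charset_to_unicode(
    "(/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)"
)
```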

Leonard