<div class="gmail_quote">On Mon, Feb 9, 2009 at 6:53 PM, Ross Moore <span dir="ltr">&lt;<a href="mailto:ross@ics.mq.edu.au">ross@ics.mq.edu.au</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
These sections of the ISO 32000-1 document (PDF v1.7) are very similar<br>
to the relevant parts of the "PDF Reference" for PDF v1.6.<br><br>
What content should be extracted?<br>
Surely that depends upon the nature of the extractor; e.g.<br>
- a screen-reader would say "six-point star",<br>
- copying to another sophisticated typesetting program would probably want the dingbat,<br>
- archiving into a database might want both representations.<br>
Almost never would you want the letter 'A' to be extracted,<br>
yet that is what some tools might well give.<div class="Ih2E3d"></div></blockquote><div><br></div><div>Alt is used ONLY for the purposes of screen readers (or potentially a tooltip), just as it is for HTML &amp; web browsers. It would NEVER be used for text extraction.</div>
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
It is unfortunate that only Adobe's tools correctly support Tagged PDF and use those features to provide richer semantic extraction of PDF content.<br>
</blockquote>
<br></div>
The producer of my examples is pdfTeX, with experimental<br>
modifications for producing tagged PDF.</blockquote><div><br></div><div>Excellent - looking forward to seeing it in production...</div><div> <br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
We (myself and others) are attempting to develop appropriate<br>
tagging for scientific and multi-lingual documents, both for<br>
accessibility and document structure and content --- including<br>
math formulae.<br></blockquote><div><br></div><div>Then you should probably take a look at the proposal from the PDF/UA committee that was accepted for inclusion in ISO 32000-2. It is a complete mapping of MathML tags into PDF tag structure. That is how tagging of math should/will be done.</div>
<div><br></div><div>I'd also like to see something similar done for other scientific grammars, such as ChemML...</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Supporting /ActualText and /Alt is meant to be the easy part;<br>
but even there it is difficult to advance when there is<br>
inconsistency in what PDF browsers do with these.<div class="Ih2E3d"></div></blockquote><div><br></div><div>Unfortunately true :(.</div><div> </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="Ih2E3d">Now as for the fonts in my examples, these are what you get<br></div>
by default when using LaTeX's CJK package. They are produced<br>
using the "virtual font" mechanism, whereby a single character<br>
(Chinese ideograph, say) is built using several pieces drawn<br>
from maybe one, two or more other fonts. </blockquote><div><br></div><div>YUCKO!</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Even if CMaps were<br>
provided for the construction pieces, this could not be applied<br>
to the ideograph as a whole --- hence the applicability of an<br>
/ActualText replacement string. This situation certainly meets<br>
the criterion of being "content that does translate into text<br>
but that is represented in a nonstandard way."<br></blockquote><div><br></div><div>Yes, that's exactly what ActualText is for - providing the real text (hopefully in Unicode) for something that is represented via some other graphical method or custom glyphs. One of my favorite examples is taking the symbol that Prince tried to use for his name and setting its ActualText to "the artist formerly known as Prince". Other good uses would be for providing simple forms of equations, chemical formulas and the like.</div>
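<div><br></div><div>A minimal sketch of how such a replacement could be written into a page content stream, assuming the ActualText value is stored as a UTF-16BE hex string with a leading byte-order mark and attached through a /Span marked-content sequence. The helper name actual_text_span and the drawing operators passed to it are hypothetical placeholders, not output from pdfTeX or any other tool:</div>

```python
def actual_text_span(text: str, content_ops: str) -> str:
    """Wrap drawing operators in a /Span carrying an /ActualText replacement.

    `text` is the replacement string an extractor should see;
    `content_ops` is whatever content-stream code draws the glyph pieces
    (a placeholder here, supplied by the caller).
    """
    # PDF text strings may be UTF-16BE prefixed with the byte-order mark FE FF;
    # writing them as a hex string avoids escaping problems.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{content_ops}\nEMC"


if __name__ == "__main__":
    # Hypothetical operators drawing two pieces of one ideograph (U+4E2D).
    pieces = "BT /F1 10 Tf (ab) Tj ET"
    print(actual_text_span("\u4e2d", pieces))
```

<div>Whether a given viewer honours the replacement on copy or extraction is a separate question, as the thread notes.</div>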
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">There are many (tens of thousands?) existing documents<br>
that have used CJK.sty, with more being created all the time<br>
(e.g. by Chinese/Japanese/Korean mathematicians and scientists).</blockquote><div><br></div><div>Doesn't make it right...</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
I see adding /ActualText replacements as a means to enable<br>
faithful extraction of their content, and translation of the<br>
non-standard representation into a UTF-8 or UTF-16 version.</blockquote><div><br></div><div>That would be one way. The other, as noted, is to include a ToUnicode CMap - as that is supported by every PDF parser that I am aware of...</div>
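<div><br></div><div>For illustration, a rough sketch of generating the text of a ToUnicode CMap, assuming a simple font with one-byte character codes where each code stands for one complete character (which is not the case for the pieced-together virtual-font glyphs discussed above). The function name to_unicode_cmap and the sample code assignments are hypothetical:</div>

```python
from typing import Dict


def to_unicode_cmap(mapping: Dict[int, str]) -> str:
    """Build ToUnicode CMap text from {character code: Unicode string}."""
    items = sorted(mapping.items())
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",  # assumption: one-byte codes only
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries, so emit chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            if not 0 <= code <= 0xFF:
                raise ValueError(f"code {code} is outside the one-byte range")
            # Destination strings are UTF-16BE, written as hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical assignment of codes 1 and 2 to two ideographs.
    print(to_unicode_cmap({0x01: "\u4e00", 0x02: "\u4e0a"}))
```

<div>The resulting text would be embedded as a stream and referenced from the font dictionary's /ToUnicode entry.</div>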
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Further concerning the fonts, both the v1.6 and v1.7 PDF specs<br>
indicate that having a /ToUnicode CMap is "optional".<br></blockquote><div><br></div><div>Yes, but it's a VERY GOOD idea since it solves the problem you are trying to address in a simpler (and more global) way.</div>
<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
The fonts in my Big5/ examples instead have a /CharSet in the<br>
/FontDescriptor, and /Encoding arrays, where characters are<br>
named such as:<br>
<br>
/CharSet (/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)<br></blockquote><div><br></div><div>By what standard naming convention did you come up with /uniXXXX? The names in CharSet have to be from the Adobe Glyph List (AGL). This "uniXXXX" form is not standard and thus is not supported by every product - though I do know that some products have adopted it as a "shortcut" to doing a correct ToUnicode table.</div>
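<div><br></div><div>To make the "shortcut" concrete, here is a small sketch of how a tool that chooses to honour such names might decode them, assuming every name in the /CharSet string has the form "uni" followed by one or more groups of four uppercase hex digits. The function name charset_to_text is hypothetical, and nothing here implies that any particular product behaves this way:</div>

```python
import re

# "uni" followed by one or more 4-digit uppercase hex groups.
UNI_NAME = re.compile(r"uni((?:[0-9A-F]{4})+)")


def charset_to_text(charset: str) -> str:
    """Decode a /CharSet string of /uniXXXX glyph names into characters."""
    chars = []
    # Drop the enclosing parentheses, then split on the name delimiter "/".
    for name in charset.strip().strip("()").split("/"):
        if not name:
            continue  # empty piece before the first "/"
        match = UNI_NAME.fullmatch(name)
        if match is None:
            raise ValueError(f"not a uniXXXX glyph name: {name!r}")
        digits = match.group(1)
        for i in range(0, len(digits), 4):
            chars.append(chr(int(digits[i:i + 4], 16)))
    return "".join(chars)


if __name__ == "__main__":
    sample = "(/uni4E00/uni4E0A/uni4E0D/uni4E2D/uni4E86/uni4E9B/uni4EE5)"
    print(charset_to_text(sample))
```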
<div><br></div><div><br></div><div>Leonard </div></div>