<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; "><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN><B>From: </B><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN> <A href="mailto:leonardr@pdfsages.com">leonardr@pdfsages.com</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN><B>Subject: </B><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN><B>Re: [poppler] On PDF Text Extraction</B></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN><B>Date: </B><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>September 26, 2007 8:29:19 AM EDT</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN><B>To: </B><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN> <A href="mailto:behdad@behdad.org">behdad@behdad.org</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">[Some comments - inline]</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:</DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Before I started research that led to this thread, I wrote some</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">stuff about this, which I now see does not work. Specifically,</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">ActualText is not supported in poppler (and possibly other</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">extractors) at all, so that cannot be part of a portable</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">solution.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>However, use of ActualText is a good idea for a variety of other reasons and is being recommended as part of PDF/A-2's new "Unicode compliance level" for those cases where ToUnicode doesn't suffice.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> - It's crucial for the above algorithm to work that a ToUnicode</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> entry mapping a glyph to an empty string works. That is, a</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> glyph that maps to zero Unicode characters.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>I will verify this, but I am pretty sure that this is invalid. You MUST have at least one character on the right side of the mapping.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> - Every font may need to have an "empty" glyph, that is most</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> useful if zero-width. This is to be able to include things</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> like U+200C ZERO WIDTH NON-JOINER in the extracted text.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>I don't understand this. What is the point of having "empty text or glyphs" in the PDF? It's simply not necessary.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>And if you insist on doing this, do NOT use .notdef</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> The main problem is that PDF doesn't have an easy way to</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> convey bidi information. There is a ReversedText property</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> but it belongs to Tagged PDF part of the spec which is far</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> from supported.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>There is also the WritingMode tag, which is important not just for RTL but also vertical text. This is an EXTREMELY important tag used when dealing with mixed direction text - or a page consisting of various "block level" elements with varying writing modes.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>You could also use the Lang entry, but that's not really about bidi...</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> - A problem about using composite fonts is that when you find</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> out that you need a composite font (that is, more than 255</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> glyphs of the font should be subsetted), it's too late to</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> switch, since you have already output PDF code the previous</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> glyphs as single-byte codepoints. So one ends up using</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> composite fonts unconditionally (exception is, if the</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> original font has less than 256 glyphs, there's no point in</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> using composite fonts at all.). This slightly wastes space</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> as each codepoint will be encoded as four hex bytes instead</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> of two.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>You could also do some pre-processing of the text, prior to rendering, to determine the complete glyph/code-point complement necessary and then make decisions.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> However, one can use a UTF-8 scheme such that the</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> first 128 glyphs are accessed using one byte and so on. This</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> way the PDF generator can use a simple font for subsets with</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> less than 128 glyphs. However, Adrian is telling me that</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> Poppler only supports UCS-2 Identity CMap mapping for CID</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> font codepoint encoding. So this may not be feasible.</DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>The only way to do this would be to actually use TWO separate font subsets - one that was single byte (for code points < 128) and one that was double-byte This is because the CMap, ToUnicode tables, etc all expect (according to the PDF Reference) that you only use a single encoding for the entire font - so it's either 1 byte or two.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> - Shall we use standard encodings if all the used glyphs in a</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> subset are in a well-supported standard encoding? May be</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> worth the slight optimization. Also may make generated</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> PS/PDF more readable for the case of simple ASCII text.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>I would definitely do this! Makes for smaller PDFs for "Roman-only" documents.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> - Also occurred to me that in PDF almost all objects can come</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> after hey are referenced. Does this mean we can write out</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> pages as we go and avoid writing to a temp file that we</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> currently do?</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>Of course!! Most PDF generators do this - no temp files required.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> -Some cairo API may be added to allow TaggedPDF</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "> marking from higher level. Something like:</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>cairo_pdf_marked_content_sequence_t</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>cairo_pdf_surface_begin/end_marked_content()</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>Cairo may wish to support this for other reasons, since marked content is used in PDF for a variety of features including optional content (aka Layers), object properties and more.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR><BLOCKQUOTE type="cite"><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Anyway, wow, 400 lines. Thanks for reading this far. I'm going</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">to do a presentation on PDF text extraction at the Linux</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Foundation OpenPrinting Summit next week in Montréal, mostly based on</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">this mail:</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><BR></BLOCKQUOTE><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><SPAN class="Apple-tab-span" style="white-space:pre">        </SPAN>If you'd like it reviewed - feel free to send me a copy...</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 12px/normal Helvetica; min-height: 14px; "><BR></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Leonard Rosenthol</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">PDF Standards Evangelist</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Adobe Systems</DIV><BR></BODY></HTML>