[poppler] On PDF Text Extraction

Wed Sep 26 05:29:44 PDT 2007

	From: 	  leonardr at pdfsages.com
	Subject: 	Re: [poppler] On PDF Text Extraction
	Date: 	September 26, 2007 8:29:19 AM EDT
	To: 	  behdad at behdad.org

[Some comments - inline]

On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:

> Before I started research that led to this thread, I wrote some
> stuff about this, which I now see does not work.  Specifically,
> ActualText is not supported in poppler (and possibly other
> extractors) at all, so that cannot be part of a portable
> solution.
>

	However, use of ActualText is a good idea for a variety of other  
reasons and is being recommended as part of PDF/A-2's new "Unicode  
compliance level" for those cases where ToUnicode doesn't suffice.
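
	For illustration, ActualText is attached to a marked-content
sequence in the page content stream.  The following is a minimal
sketch in Python, not poppler or cairo code; the function name
actual_text_span is hypothetical.  It wraps existing text-showing
operators in a /Span BDC ... EMC pair and writes the replacement text
as a UTF-16BE hex string with a leading byte-order mark.

```python
def actual_text_span(show_ops: str, replacement: str) -> str:
    """Wrap text-showing operators in a /Span marked-content sequence.

    `show_ops` is content-stream text such as "(x) Tj"; `replacement`
    is the Unicode text an extractor should report instead.
    """
    # PDF text strings may be UTF-16BE prefixed with the BOM FE FF;
    # a hex string avoids having to escape parentheses and backslashes.
    hex_text = ("\ufeff" + replacement).encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{show_ops}\nEMC"


if __name__ == "__main__":
    # A single ligature glyph (code 0x03 here, by assumption) standing for "fi".
    print(actual_text_span("<03> Tj", "fi"))
```

	An extractor that honours ActualText would report "fi" for the
span regardless of what the font's ToUnicode table says.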

>   - It's crucial for the above algorithm to work that a ToUnicode
>     entry mapping a glyph to an empty string works.  That is, a
>     glyph that maps to zero Unicode characters.
>

	I will verify this, but I am pretty sure that this is invalid.  You  
MUST have at least one character on the right side of the mapping.
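
	As a sketch of what that implies for a generator, the Python
below emits a bfchar section of a ToUnicode CMap and refuses empty
right-hand sides.  It assumes the stricter reading above (still to be
verified), uses a hypothetical function name, and ignores the limit of
100 entries per bfchar section.

```python
def bfchar_section(mapping: dict, code_bytes: int = 2) -> str:
    """Build one beginbfchar/endbfchar section from {char code: text}."""
    lines = [f"{len(mapping)} beginbfchar"]
    for code, text in sorted(mapping.items()):
        if not text:
            # Assumption: a mapping to zero Unicode characters is invalid.
            raise ValueError(f"code {code:#x} maps to an empty string")
        src = f"{code:0{code_bytes * 2}X}"               # source character code
        dst = text.encode("utf-16-be").hex().upper()      # UTF-16BE, no BOM
        lines.append(f"<{src}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


if __name__ == "__main__":
    # Code 3 is a ligature glyph that should extract as the two letters "fi".
    print(bfchar_section({3: "fi", 4: "A"}))
```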

>   - Every font may need to have an "empty" glyph, that is most
>     useful if zero-width.  This is to be able to include things
>     like U+200C ZERO WIDTH NON-JOINER in the extracted text.
>

	I don't understand this.  What is the point of having "empty text or  
glyphs" in the PDF?  It's simply not necessary.

	And if you insist on doing this, do NOT use .notdef

>     The main problem is that PDF doesn't have an easy way to
>     convey bidi information.  There is a ReversedText property
>     but it belongs to Tagged PDF part of the spec which is far
>     from supported.
>

	There is also the WritingMode tag, which is important not just for  
RTL but also vertical text.  This is an EXTREMELY important tag used  
when dealing with mixed direction text - or a page consisting of  
various "block level" elements with varying writing modes.

	You could also use the Lang entry, but that's not really about bidi...

>   - A problem about using composite fonts is that when you find
>     out that you need a composite font (that is, more than 255
>     glyphs of the font should be subsetted), it's too late to
>     switch, since you have already output PDF code for the previous
>     glyphs as single-byte codepoints.  So one ends up using
>     composite fonts unconditionally (exception is, if the
>     original font has less than 256 glyphs, there's no point in
>     using composite fonts at all.).  This slightly wastes space
>     as each codepoint will be encoded as four hex bytes instead
>     of two.
>

	You could also do some pre-processing of the text, prior to  
rendering, to determine the complete glyph/code-point complement  
necessary and then make decisions.
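
	A minimal sketch of such a pre-pass, in Python.  The function
name and the shape of the input (an iterable of glyph-ID runs for one
font) are assumptions for illustration; the 255-glyph threshold is
taken from the quoted text above.

```python
def choose_font_kind(glyph_runs, limit=255):
    """Collect the glyphs a document uses from one font, then pick a font type.

    Returns ("simple" or "composite", set of glyph IDs used).
    """
    used = set()
    for run in glyph_runs:
        used.update(run)
    # Single-byte codes suffice as long as the subset fits the limit.
    kind = "simple" if len(used) <= limit else "composite"
    return kind, used


if __name__ == "__main__":
    kind, used = choose_font_kind([[36, 72, 79], [79, 82]])
    print(kind, sorted(used))
```

	The decision is made before any content stream is written, so
no already-emitted single-byte codes need to be revisited.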

>    However, one can use a UTF-8 scheme such that the
>     first 128 glyphs are accessed using one byte and so on.  This
>     way the PDF generator can use a simple font for subsets with
>     less than 128 glyphs.  However, Adrian is telling me that
>     Poppler only supports UCS-2 Identity CMap mapping for CID
>     font codepoint encoding.  So this may not be feasible.
>

	The only way to do this would be to actually use TWO separate font  
subsets - one that was single byte (for code points < 128) and one  
that was double-byte.  This is because the CMap, ToUnicode tables, etc  
all expect (according to the PDF Reference) that you only use a  
single encoding for the entire font - so it's either 1 byte or two.
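
	The two-subset approach could be sketched as follows (Python).
The font names F1 and F2, the function name, and the policy of giving
the first 128 distinct glyphs to the single-byte subset are all
assumptions for illustration.  Codes start at 1 so that code 0 stays
free for .notdef.

```python
def split_into_subsets(glyph_ids, single_limit=128):
    """Assign each glyph to a 1-byte subset (F1) or a 2-byte subset (F2).

    Returns a list of (font name, hex string) runs; a new run starts
    whenever the font has to be switched.
    """
    single, double = {}, {}          # glyph ID -> code within each subset
    runs = []                        # [font name, [hex codes]]
    for gid in glyph_ids:
        if gid in single:
            font, code = "F1", single[gid]
        elif gid in double:
            font, code = "F2", double[gid]
        elif len(single) < single_limit:
            font, code = "F1", len(single) + 1
            single[gid] = code
        else:
            font, code = "F2", len(double) + 1
            double[gid] = code
        width = 2 if font == "F1" else 4     # hex digits per code
        hexcode = f"{code:0{width}X}"
        if runs and runs[-1][0] == font:
            runs[-1][1].append(hexcode)
        else:
            runs.append([font, [hexcode]])
    return [(font, "<" + "".join(codes) + ">") for font, codes in runs]


if __name__ == "__main__":
    # Tiny limit so the switch to the two-byte subset is visible.
    for font, hexstr in split_into_subsets([10, 11, 12, 10, 12], single_limit=2):
        print(f"/{font} 12 Tf {hexstr} Tj")
```

	Each subset keeps a single code width, at the cost of a font
switch (Tf) whenever the text moves between them.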

>   - Shall we use standard encodings if all the used glyphs in a
>     subset are in a well-supported standard encoding?  May be
>     worth the slight optimization.  Also may make generated
>     PS/PDF more readable for the case of simple ASCII text.
>
>

	I would definitely do this!   Makes for smaller PDFs for
"Roman-only" documents.

>   - Also occurred to me that in PDF almost all objects can come
>     after they are referenced.  Does this mean we can write out
>     pages as we go and avoid writing to a temp file that we
>     currently do?
>
>
	Of course!!   Most PDF generators do this - no temp files required.
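
	A minimal sketch of the idea in Python (the class and method
names are hypothetical, and this is not how cairo does it): object
numbers are reserved up front, objects are written in whatever order
they become available, and the byte offsets recorded along the way
are emitted in the cross-reference table at the end.

```python
import io


class StreamingPdfWriter:
    """Write PDF objects in any order; the xref table is emitted last."""

    def __init__(self, out):
        self.out = out
        self.pos = 0                 # bytes written so far
        self.offsets = {}            # object number -> byte offset
        self.next_num = 1
        self._write(b"%PDF-1.4\n")

    def _write(self, data):
        self.out.write(data)
        self.pos += len(data)

    def reserve(self):
        """Hand out an object number that may be referenced before it is written."""
        num = self.next_num
        self.next_num += 1
        return num

    def write_object(self, num, body):
        self.offsets[num] = self.pos
        self._write(b"%d 0 obj\n" % num + body + b"\nendobj\n")

    def finish(self, root_num):
        # Every reserved number must have been written by now (else KeyError).
        xref_pos = self.pos
        size = self.next_num
        self._write(b"xref\n0 %d\n" % size)
        self._write(b"0000000000 65535 f \n")
        for num in range(1, size):
            self._write(b"%010d 00000 n \n" % self.offsets[num])
        self._write(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num))
        self._write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)


if __name__ == "__main__":
    buf = io.BytesIO()
    w = StreamingPdfWriter(buf)
    catalog, pages, page = w.reserve(), w.reserve(), w.reserve()
    # The page is written first and refers to a parent that does not exist yet.
    w.write_object(page, b"<< /Type /Page /Parent %d 0 R /MediaBox [0 0 612 792] >>" % pages)
    w.write_object(pages, b"<< /Type /Pages /Kids [%d 0 R] /Count 1 >>" % page)
    w.write_object(catalog, b"<< /Type /Catalog /Pages %d 0 R >>" % pages)
    w.finish(catalog)
    print(buf.getvalue().decode("latin-1"))
```

	Only the offsets table is held in memory; page content can go
straight to the output as it is produced.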

>   - Some cairo API may be added to allow TaggedPDF
>     marking from higher level.  Something like:
>
> 	cairo_pdf_marked_content_sequence_t
> 	cairo_pdf_surface_begin/end_marked_content()
>
>

	Cairo may wish to support this for other reasons, since marked  
content is used in PDF for a variety of features including optional  
content (aka Layers), object properties and more.

> Anyway, wow, 400 lines.  Thanks for reading this far.  I'm going
> to do a presentation on PDF text extraction at the Linux
> Foundation OpenPrinting Summit next week in Montréal, mostly based on
> this mail:
>
>
	If you'd like it reviewed - feel free to send me a copy...

Leonard Rosenthol
PDF Standards Evangelist
Adobe Systems
