[poppler] On PDF Text Extraction
Leonard Rosenthol
leonardr at pdfsages.com
Wed Sep 26 05:29:44 PDT 2007
From: leonardr at pdfsages.com
Subject: Re: [poppler] On PDF Text Extraction
Date: September 26, 2007 8:29:19 AM EDT
To: behdad at behdad.org
[Some comments - inline]
On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:
> Before I started research that led to this thread, I wrote some
> stuff about this, which I now see does not work. Specifically,
> ActualText is not supported in poppler (and possibly other
> extractors) at all, so that cannot be part of a portable
> solution.
>
However, use of ActualText is a good idea for a variety of other
reasons and is being recommended as part of PDF/A-2's new "Unicode
compliance level" for those cases where ToUnicode doesn't suffice.
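For illustration, here is a rough Python sketch of what an ActualText wrapper could look like in a content stream. The function name and the sample operators are made up for this example; the only parts taken from the PDF Reference are the /Span marked-content sequence (BDC ... EMC) and the UTF-16BE text string with a leading FE FF byte-order mark.

```python
def actual_text_span(text: str, show_ops: str) -> str:
    """Wrap content-stream operators in a /Span sequence carrying ActualText.

    `show_ops` is whatever text-showing operators the generator already
    emits (hypothetical input here); `text` is the Unicode replacement.
    """
    # PDF text strings may be UTF-16BE with a leading byte-order mark,
    # written here as a hex string so no escaping is needed.
    hex_text = ("\ufeff" + text).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_text}>>> BDC\n{show_ops}\nEMC"


if __name__ == "__main__":
    # A single ligature glyph that should extract as the two letters "fi".
    print(actual_text_span("fi", "<0012> Tj"))
```

An extractor that honours ActualText would return "fi" for the enclosed glyph, whatever its ToUnicode entry says.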
> - It's crucial for the above algorithm to work that a ToUnicode
> entry mapping a glyph to an empty string works. That is, a
> glyph that maps to zero Unicode characters.
>
I will verify this, but I am pretty sure that this is invalid. You
MUST have at least one character on the right side of the mapping.
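As a sketch of how a generator might guard against this, assuming my reading above is right (it still needs verifying), the hypothetical helper below writes one beginbfchar block of a ToUnicode CMap and refuses any entry whose right side is empty. It assumes two-byte source codes and does not split blocks at the 100-entry limit.

```python
def bfchar_block(mapping: dict[int, str]) -> str:
    """Emit one beginbfchar/endbfchar block for a ToUnicode CMap.

    `mapping` goes from a two-byte character code to the Unicode text
    it should extract as. Empty targets are rejected (an assumption
    about validity, not a confirmed rule).
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for code, text in sorted(mapping.items()):
        if not text:
            raise ValueError(f"code {code:#06x} maps to an empty string")
        # Destination strings are UTF-16BE, written as hex.
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


if __name__ == "__main__":
    print(bfchar_block({0x0003: "fi", 0x0004: "A"}))
```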
> - Every font may need to have an "empty" glyph, that is most
> useful if zero-width. This is to be able to include things
> like U+200C ZERO WIDTH NON-JOINER in the extracted text.
>
I don't understand this. What is the point of having "empty text or
glyphs" in the PDF? It's simply not necessary.
And if you insist on doing this, do NOT use .notdef.
> The main problem is that PDF doesn't have an easy way to
> convey bidi information. There is a ReversedText property
> but it belongs to Tagged PDF part of the spec which is far
> from supported.
>
There is also the WritingMode tag, which is important not just for
RTL but also vertical text. This is an EXTREMELY important tag used
when dealing with mixed direction text - or a page consisting of
various "block level" elements with varying writing modes.
You could also use the Lang entry, but that's not really about bidi...
> - A problem about using composite fonts is that when you find
> out that you need a composite font (that is, more than 255
> glyphs of the font should be subsetted), it's too late to
> switch, since you have already output PDF code for the previous
> glyphs as single-byte codepoints. So one ends up using
> composite fonts unconditionally (the exception is, if the
> original font has fewer than 256 glyphs, there's no point in
> using composite fonts at all). This slightly wastes space
> as each codepoint will be encoded as four hex digits instead
> of two.
>
You could also do some pre-processing of the text, prior to
rendering, to determine the complete glyph/code-point complement
necessary and then make decisions.
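A minimal Python sketch of such a pre-pass follows. Everything in it is an assumption made for illustration: the function name, the `glyph_for` callback (which stands in for the real character-to-glyph lookup) and the 255-glyph threshold, which is simply the figure from the quoted text.

```python
def plan_font_encoding(text_runs, glyph_for=ord, simple_limit=255):
    """Scan all text before rendering and pick a font type.

    Returns ("simple" or "composite", set of glyph ids used).
    `glyph_for` defaults to ord() purely as a placeholder for a real
    cmap lookup or shaping step.
    """
    used = set()
    for run in text_runs:
        for ch in run:
            used.add(glyph_for(ch))
    kind = "simple" if len(used) <= simple_limit else "composite"
    return kind, used


if __name__ == "__main__":
    kind, used = plan_font_encoding(["Hello", "World"])
    print(kind, len(used))
```

The cost is one extra pass over the text before any content stream is written, which is what lets the decision be made before the first glyph is output.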
> However, one can use a UTF-8 scheme such that the
> first 128 glyphs are accessed using one byte and so on. This
> way the PDF generator can use a simple font for subsets with
> less than 128 glyphs. However, Adrian is telling me that
> Poppler only supports UCS-2 Identity CMap mapping for CID
> font codepoint encoding. So this may not be feasible.
>
The only way to do this would be to actually use TWO separate font
subsets - one that was single-byte (for code points < 128) and one
that was double-byte. This is because the CMap, ToUnicode tables, etc.
all expect (according to the PDF Reference) that you only use a
single encoding for the entire font - so it's either one byte or two.
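To make the two-subset idea concrete, here is a rough Python sketch that splits a string into runs for a single-byte subset and a double-byte subset. The resource names F1 and F2 are hypothetical, and the Unicode value is used directly as the character code (BMP only), where a real generator would use the codes assigned in each subset.

```python
from itertools import groupby


def split_runs(text: str):
    """Split text into (font_name, hex_codes) runs.

    Characters below 128 go to the single-byte subset "F1"; everything
    else goes to the double-byte subset "F2". Both names are made up.
    """
    runs = []
    for is_single, chars in groupby(text, key=lambda ch: ord(ch) < 128):
        codes = [ord(ch) for ch in chars]
        if is_single:
            runs.append(("F1", "".join(f"{c:02X}" for c in codes)))
        else:
            runs.append(("F2", "".join(f"{c:04X}" for c in codes)))
    return runs


if __name__ == "__main__":
    for font, codes in split_runs("ab\u0645\u0646c"):
        # Each run needs its own font selection before the show operator.
        print(f"/{font} 12 Tf <{codes}> Tj")
```

The trade-off is an extra Tf switch at every boundary between the two subsets, so this only saves space when the text is mostly one kind or the other.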
> - Shall we use standard encodings if all the used glyphs in a
> subset are in a well-supported standard encoding? May be
> worth the slight optimization. Also may make generated
> PS/PDF more readable for the case of simple ASCII text.
>
>
I would definitely do this! Makes for smaller PDFs for "Roman-only"
documents.
> - Also occurred to me that in PDF almost all objects can come
> after they are referenced. Does this mean we can write out
> pages as we go and avoid writing to a temp file as we
> currently do?
>
>
Of course!! Most PDF generators do this - no temp files required.
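A bare-bones Python sketch of the approach: reserve object numbers up front, write each page as soon as it is ready, and emit the page tree and cross-reference table last. The class and method names are invented for this example and have nothing to do with cairo's actual code; only the xref and trailer layout follow the PDF Reference.

```python
import io


class StreamingPdfWriter:
    """Writes objects in any order, tracking byte offsets for the xref."""

    def __init__(self, out):
        self.out = out
        self.offsets = {}  # object number -> byte offset in the file
        self.next_num = 1
        self.pos = 0
        self._write(b"%PDF-1.4\n")

    def _write(self, data: bytes):
        self.out.write(data)
        self.pos += len(data)

    def reserve(self) -> int:
        """Hand out an object number that may be referenced before it is written."""
        num = self.next_num
        self.next_num += 1
        return num

    def write_object(self, num: int, body: bytes):
        self.offsets[num] = self.pos
        self._write(b"%d 0 obj\n" % num + body + b"\nendobj\n")

    def finish(self, root_num: int):
        """Write the xref table and trailer; every reserved object must exist by now."""
        xref_pos = self.pos
        size = self.next_num
        self._write(b"xref\n0 %d\n" % size)
        self._write(b"0000000000 65535 f \n")
        for num in range(1, size):
            self._write(b"%010d 00000 n \n" % self.offsets[num])
        self._write(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num))
        self._write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)


def build_demo():
    """Two empty pages, each written before the page tree that owns them."""
    buf = io.BytesIO()
    w = StreamingPdfWriter(buf)
    catalog, pages = w.reserve(), w.reserve()
    kids = []
    for _ in range(2):
        page = w.reserve()
        # /Parent points at an object that has not been written yet.
        w.write_object(page, b"<< /Type /Page /Parent %d 0 R /MediaBox [0 0 612 792] >>" % pages)
        kids.append(page)
    kid_refs = b" ".join(b"%d 0 R" % k for k in kids)
    w.write_object(pages, b"<< /Type /Pages /Count %d /Kids [%s] >>" % (len(kids), kid_refs))
    w.write_object(catalog, b"<< /Type /Catalog /Pages %d 0 R >>" % pages)
    w.finish(catalog)
    return buf.getvalue(), w


if __name__ == "__main__":
    data, _ = build_demo()
    print(data.decode("ascii"))
```

Only the offsets dictionary has to stay in memory until the end; the page data itself goes straight to the output.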
> - Some cairo API may be added to allow TaggedPDF
> marking from higher level. Something like:
>
> cairo_pdf_marked_content_sequence_t
> cairo_pdf_surface_begin/end_marked_content()
>
>
Cairo may wish to support this for other reasons, since marked
content is used in PDF for a variety of features including optional
content (aka Layers), object properties and more.
> Anyway, wow, 400 lines. Thanks for reading this far. I'm going
> to do a presentation on PDF text extraction at the Linux
> Foundation OpenPrinting Summit next week in Montréal, mostly based on
> this mail:
>
>
If you'd like it reviewed - feel free to send me a copy...
Leonard Rosenthol
PDF Standards Evangelist
Adobe Systems