[poppler] On PDF Text Extraction

Fri Oct 19 21:36:53 PDT 2007

On Wed, 2007-09-26 at 08:29 -0400, Leonard Rosenthol wrote:

> [Some comments - inline]

Thanks Leonard for the message.  Comments inline.

> On Sep 18, 2007, at 7:08 PM, Behdad Esfahbod wrote:
> 
> > Before I started research that led to this thread, I wrote some
> > stuff about this, which I now see does not work.  Specifically,
> > ActualText is not supported in poppler (and possibly other
> > extractors) at all, so that cannot be part of a portable
> > solution.

> 
> However, use of ActualText is a good idea for a variety of other
> reasons and is being recommended as part of PDF/A-2's new "Unicode
> compliance level" for those cases where ToUnicode doesn't suffice.

Sure, ActualText is useful.  It just solves different problems.

> >   - It's crucial for the above algorithm to work that a ToUnicode
> >     entry mapping a glyph to an empty string works.  That is, a
> >     glyph that maps to zero Unicode characters.

> 
> I will verify this, but I am pretty sure that this is invalid.  You
> MUST have at least one character on the right side of the mapping.

Ok.

> >   - Every font may need to have an "empty" glyph, that is most
> >     useful if zero-width.  This is to be able to include things
> >     like U+200C ZERO WIDTH NON-JOINER in the extracted text.
> > 
> 
> 
> I don't understand this.  What is the point of having "empty text or
> glyphs" in the PDF?  It's simply not necessary.

Empty text was an implementation detail of my approach, but an empty
glyph is useful to represent things like a tab character or U+200C ZERO
WIDTH NON-JOINER, for example.
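
To make the empty-glyph idea concrete, here is a rough Python sketch of
how a generator might emit ToUnicode bfchar entries, including one for a
zero-width glyph that stands in for U+200C.  The function name and the
character codes are made up for illustration; this is not cairo or
poppler code.  It rejects empty destinations, following Leonard's reading
above that the right side needs at least one character.

```python
def bfchar_section(mapping):
    """Build a ToUnicode 'bfchar' section as text.

    mapping: dict from a 2-byte character code (int) to the Unicode
    string that code should extract as.  Hypothetical helper; a real
    generator would also split long sections into several blocks.
    """
    lines = [f"{len(mapping)} beginbfchar"]
    for code, text in sorted(mapping.items()):
        if not text:
            # Assumption: an empty right-hand side is invalid.
            raise ValueError("destination needs at least one character")
        # Destinations are written as UTF-16BE in hex.
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


section = bfchar_section({
    0x0001: "A",
    0x0002: "\u200c",  # zero-width glyph carrying ZERO WIDTH NON-JOINER
    0x0003: "\t",      # empty glyph carrying a tab
})
print(section)
```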

> And if you insist on doing this, do NOT use .notdef

Definitely not.

> >     The main problem is that PDF doesn't have an easy way to
> >     convey bidi information.  There is a ReversedText property
> >     but it belongs to Tagged PDF part of the spec which is far
> >     from supported.
> > 
> 
> 
> There is also the WritingMode tag, which is important not just for RTL
> but also vertical text.  This is an EXTREMELY important tag used when
> dealing with mixed direction text - or a page consisting of various
> "block level" elements with varying writing modes.

I'm not sure how useful that is.  For example, a single paragraph may
have both left-to-right and right-to-left lines, so such a paragraph
fits neither LrTb nor RlTb.  But then again, I've not read that
part of the spec very closely.

> You could also use the Lang entry, but that's not really about bidi...
>
> >   - A problem about using composite fonts is that when you find
> >     out that you need a composite font (that is, more than 255
> >     glyphs of the font should be subsetted), it's too late to
> >     switch, since you have already output PDF code for the previous
> >     glyphs as single-byte codepoints.  So one ends up using
> >     composite fonts unconditionally (the exception is, if the
> >     original font has fewer than 256 glyphs, there's no point in
> >     using composite fonts at all).  This slightly wastes space
> >     as each codepoint will be encoded as four hex digits instead
> >     of two.
> > 
> 
> 
> You could also do some pre-processing of the text, prior to rendering,
> to determine the complete glyph/code-point complement necessary and
> then make decisions.
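
A rough Python sketch of such a pre-pass, under my own assumptions: glyph
runs are plain lists of glyph ids, subset index 0 is left for .notdef, and
the function names are invented for the example.  It collects the full
glyph complement first and only then picks one-byte or two-byte codes, so
the hex strings are two digits per glyph when the subset fits in a simple
font and four otherwise.

```python
def plan_subset(runs):
    """Pre-pass over all glyph runs: give each distinct glyph a subset index.

    Indices start at 1 on the assumption that 0 stays reserved for .notdef.
    """
    used = sorted({gid for run in runs for gid in run})
    return {gid: index for index, gid in enumerate(used, start=1)}


def hex_strings(runs):
    """Encode every run as a PDF hex string using the narrowest code width."""
    subset = plan_subset(runs)
    # Up to 255 glyphs (codes 1..255) fit in one byte, i.e. two hex digits.
    digits = 2 if len(subset) <= 255 else 4
    return [
        "<" + "".join(f"{subset[gid]:0{digits}X}" for gid in run) + ">"
        for run in runs
    ]


small = hex_strings([[72, 105], [105, 33]])
large = hex_strings([list(range(1000, 1300))])
print(small)
print(len(large[0]))
```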
>
> >     However, one can use a UTF-8 scheme such that the
> >     first 128 glyphs are accessed using one byte and so on.  This
> >     way the PDF generator can use a simple font for subsets with
> >     less than 128 glyphs.  However, Adrian is telling me that
> >     Poppler only supports UCS-2 Identity CMap mapping for CID
> >     font codepoint encoding.  So this may not be feasible.
> > 
> 
> 
> The only way to do this would be to actually use TWO separate font
> subsets - one that was single byte (for code points < 128) and one
> that was double-byte.  This is because the CMap, ToUnicode tables, etc.
> all expect (according to the PDF Reference) that you only use a single
> encoding for the entire font - so it's either 1 byte or two.

A single "encoding" doesn't mean either 1 byte or two, as far as I
understand.  You can define a custom encoding that maps bytes < 128
to codepoints 0..127 and, for other bytes, consumes two or more bytes.
Actually this is exactly how the sample CMap in Adobe tech report 5014
works.  Quoting from Section 5.2:

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: procset CIDInit
%%IncludeResource: procset CIDInit
%%BeginResource: CMap 83pv-RKSJ-H
%%Title: (83pv-RKSJ-H Adobe Japan1 0)
%%Version: 1

...

4 begincodespacerange
  <00>   <80>
  <8140> <9ffc>
  <a0>   <df>
  <e040> <fbfc>
endcodespacerange

1 beginnotdefrange
<00> <1f> 1
endnotdefrange

100 begincidrange
<20> <7e> 1
<8140> <817e> 633
<8180> <81ac> 696
<81b8> <81bf> 741
<81c8> <81ce> 749

...

endcidrange
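
To show how a reader would consume mixed one-byte and two-byte codes with
the codespace ranges quoted above, here is a small Python sketch.  The
function name is mine and it only splits a byte string into codes; it does
not map them to CIDs.  A code matches a range when each of its bytes lies
between the corresponding low and high bytes of that range.

```python
# The four codespace ranges from the quoted sample CMap.
CODESPACE = [
    (bytes.fromhex("00"), bytes.fromhex("80")),
    (bytes.fromhex("8140"), bytes.fromhex("9ffc")),
    (bytes.fromhex("a0"), bytes.fromhex("df")),
    (bytes.fromhex("e040"), bytes.fromhex("fbfc")),
]


def split_codes(data, codespace=CODESPACE):
    """Split a byte string into character codes of one or two bytes."""
    codes = []
    pos = 0
    while pos < len(data):
        for low, high in codespace:
            chunk = data[pos:pos + len(low)]
            # Every byte must fall inside the range's byte-wise bounds.
            if len(chunk) == len(low) and all(
                lo <= byte <= hi for lo, byte, hi in zip(low, chunk, high)
            ):
                codes.append(chunk)
                pos += len(chunk)
                break
        else:
            raise ValueError(
                f"byte {data[pos]:#04x} at offset {pos} matches no codespace range"
            )
    return codes


codes = split_codes(bytes.fromhex("41 8140 b1 e040 20"))
print([code.hex() for code in codes])
```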

> >   - Shall we use standard encodings if all the used glyphs in a
> >     subset are in a well-supported standard encoding?  May be
> >     worth the slight optimization.  Also may make generated
> >     PS/PDF more readable for the case of simple ASCII text.
> 
> I would definitely do this!   Makes for smaller PDFs for "Roman-only"
> documents.

Yeah, makes sense.
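
A minimal sketch of that check in Python, assuming each glyph in the
subset corresponds to exactly one character.  The function name and labels
are made up, and Python's "cp1252" codec is only used as a stand-in for
PDF's WinAnsiEncoding, which it approximates but does not exactly equal.

```python
def choose_encoding(text):
    """Pick the cheapest font encoding that covers every character in text.

    Returns a label only.  "cp1252" approximates WinAnsiEncoding here,
    which is an assumption of this sketch, not a statement about the spec.
    """
    for label, codec in (("ascii", "ascii"), ("winansi-like", "cp1252")):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # some character is not covered; try the next option
        return label
    # Nothing standard covers the subset, so fall back to a composite font.
    return "composite"


print(choose_encoding("Hello, world"))
print(choose_encoding("caf\u00e9"))
print(choose_encoding("\u0633\u0644\u0627\u0645"))
```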

> >   - It also occurred to me that in PDF almost all objects can come
> >     after they are referenced.  Does this mean we can write out
> >     pages as we go and avoid writing to a temp file as we
> >     currently do?

> Of course!!   Most PDF generators do this - no temp files required.

Cool.
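
For the record, a toy Python sketch of what that looks like.  The class
and method names are invented and this is not how cairo does it: the
writer reserves an object number for the page tree up front, streams each
page as soon as it is ready with a /Parent reference to the not yet
written page tree, and emits the page tree, catalog and xref table at the
end.

```python
import io


class StreamingPdf:
    """Write PDF objects in order, remembering each one's byte offset."""

    def __init__(self, out):
        self.out = out
        self.offsets = {}   # object number -> byte offset in the output
        self.next_num = 1
        self.pos = 0
        self._write(b"%PDF-1.4\n")

    def _write(self, data):
        self.out.write(data)
        self.pos += len(data)

    def reserve(self):
        """Hand out an object number now; the object may be written later."""
        num = self.next_num
        self.next_num += 1
        return num

    def write_object(self, num, body):
        self.offsets[num] = self.pos
        self._write(b"%d 0 obj\n%s\nendobj\n" % (num, body))

    def finish(self, root):
        xref_pos = self.pos
        count = self.next_num
        self._write(b"xref\n0 %d\n" % count)
        self._write(b"0000000000 65535 f \n")
        for num in range(1, count):
            self._write(b"%010d 00000 n \n" % self.offsets[num])
        self._write(b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (count, root))
        self._write(b"startxref\n%d\n%%%%EOF\n" % xref_pos)


buf = io.BytesIO()
pdf = StreamingPdf(buf)
pages_num = pdf.reserve()   # referenced by every page before it exists
page_nums = []
for _ in range(3):
    num = pdf.reserve()
    pdf.write_object(
        num,
        b"<< /Type /Page /Parent %d 0 R /MediaBox [0 0 612 792] >>" % pages_num,
    )
    page_nums.append(num)
kids = b" ".join(b"%d 0 R" % n for n in page_nums)
pdf.write_object(
    pages_num, b"<< /Type /Pages /Kids [%s] /Count %d >>" % (kids, len(page_nums))
)
catalog = pdf.reserve()
pdf.write_object(catalog, b"<< /Type /Catalog /Pages %d 0 R >>" % pages_num)
pdf.finish(catalog)
data = buf.getvalue()
```

Nothing is buffered beyond the offset table, which is the whole point: the
only state that has to survive until the end is one integer per object.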

> >   - Some cairo API may be added to allow Tagged PDF
> >     marking from a higher level.  Something like:
> > 
> > 
> > cairo_pdf_marked_content_sequence_t
> > cairo_pdf_surface_begin/end_marked_content()
> > 
> > 
> > 
> 
> 
> Cairo may wish to support this for other reasons, since marked content
> is used in PDF for a variety of features including optional content
> (aka Layers), object properties and more.

Right.  As I said, the main problem with it is that we don't have any
concrete use cases right now, so we can't know if any proposed API is
right or wrong.

> > Anyway, wow, 400 lines.  Thanks for reading this far.  I'm going
> > to do a presentation on PDF text extraction at the Linux
> > Foundation OpenPrinting Summit next week in Montréal, mostly based
> > on this mail:
> > 
> > 
> > 
> If you'd like it reviewed - feel free to send me a copy...

Thanks for the offer.  I ended up putting very few words in the slides.
They can be found here:

http://www.linux-foundation.org/images/8/80/Textextraction_slides_small.pdf

> Leonard Rosenthol
> PDF Standards Evangelist
> Adobe Systems

-- 
behdad
http://behdad.org/

"Those who would give up Essential Liberty to purchase a little
 Temporary Safety, deserve neither Liberty nor Safety."
        -- Benjamin Franklin, 1759