[poppler] Actual text and 0.6 branch

Sat Feb 14 16:59:28 PST 2009

Hi Jonathan,

I think there is a serious problem in Poppler.
However, it's possible that something else is wrong
on my Mac.
Before reporting it as a bug, would you please confirm
(under Mac OS, or Linux, whatever you have)
that what I say below isn't just local to my system.

Cheers & thanks,

	Ross

Hi Albert,

This revisits a thread from December 2007, where you
reported adding a patch to support /ActualText.
See also:
   [Poppler-bugs] [Bug 13573] Poppler does not support ActualText

I'm now creating PDFs with /ActualText strings for CJK ideographs.
These strings are given in big-endian UTF-16 format.
Using  pdftotext  to extract the text, what I find is that:

  a)  some, but not all, UTF-16 byte-pairs produce an extractable
      character.

  b)  Whenever the *first* byte of the pair is in the upper range
       128--255 then the whole character is omitted.

     For example, with the PDF string:  (˛ˇt»»tt»)
     the text extracted using Adobe Reader is  瓈존瓈
     but Poppler produces  珈珈 , which exhibits two errors.

   Firstly, ...

     the portion '»t' (the middle character, 존 ) has been extracted
     as '', the empty string, between the Chinese ideographs.

     In alternative representations, this is:
      (<FE><FF>t<C8><C8>tt<C8>)  producing  <E7><8F><88><E7><8F><88> ,
     where  t<C8> , representing  't»' , extracts to
       <E7><8F><88> , which is  珈 .

   Secondly, ...

  c) There is an error in the translation of UTF-16 characters
     into UTF-8. For example, the above  t<C8>  should actually
     convert to the UTF-8 bytes  <E7><93><88> , which is  瓈 ,
     as done by Adobe and other software.

     The <E7><8F><88> is what correctly comes from  s<C8> ;
     the high-order byte is off by one ( s  is 0x73 where  t
     is 0x74).
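
The expected values in (b) and (c) can be checked independently of
Poppler. What follows is only a minimal sketch in Python, assuming the
string bytes are exactly those in the hex form above. The helper name
decode_actualtext is made up for illustration; it is not a Poppler
function and says nothing about how Poppler does the conversion.

```python
# Bytes of the example PDF string: BOM FE FF, then three UTF-16BE code units.
raw = bytes([0xFE, 0xFF, 0x74, 0xC8, 0xC8, 0x74, 0x74, 0xC8])


def decode_actualtext(data: bytes) -> str:
    """Hypothetical helper: decode a big-endian UTF-16 string that starts with a BOM."""
    if data[:2] != b"\xfe\xff":
        raise ValueError("no big-endian UTF-16 byte-order mark")
    return data[2:].decode("utf-16-be")


text = decode_actualtext(raw)
code_points = [f"U+{ord(ch):04X}" for ch in text]
utf8_hex = text.encode("utf-8").hex()

# What the string should give: three characters, U+74C8 U+C874 U+74C8.
print(text, code_points, utf8_hex)

# For comparison, the UTF-8 bytes of U+73C8 (from s<C8>), the character
# that Poppler is reported to produce instead of U+74C8.
wrong_hex = "\u73c8".encode("utf-8").hex()
print(wrong_hex)
```

Here utf8_hex comes out as e79388 eca1b4 e79388 (without the spaces),
and wrong_hex as e78f88, which matches the byte values quoted above.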

Further comments.

  d)  Little-endian UTF-16 strings are not supported at all.
      There's no code to swap the byte order within an extracted
      string.

      Instead the byte-order mark in  (ˇ˛»tt»»t) isn't recognised,
      so Poppler extracts just the letters 'ttt'.

  e)  Octal codes can be used, contrary to a question that I raised
      in bug report 20013.
      There my testing was with codes which produced first bytes
      within the upper range, so the difficulties were the same
      as in b) above.
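
To make (d) and (e) concrete, here is a rough sketch in Python of the
behaviour I would expect: expand the octal codes, then choose the byte
order from the byte-order mark. Both function names are hypothetical
and nothing here is taken from Poppler's source. The fallback to
Latin-1 for strings without a BOM is only a stand-in assumption for
PDFDocEncoding, and only \ddd escapes are handled, not the other
backslash escapes.

```python
import re


def unescape_octal(literal: bytes) -> bytes:
    # Hypothetical helper: expand backslash + 1-3 octal digits into one byte.
    # Other escapes such as \n or \( are deliberately left alone in this sketch.
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        literal,
    )


def decode_text_string(data: bytes) -> str:
    # Hypothetical helper: pick the UTF-16 byte order from the BOM.
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xff\xfe"):
        return data[2:].decode("utf-16-le")
    # Assumption: Latin-1 as a rough stand-in for PDFDocEncoding.
    return data.decode("latin-1")


# The two example strings, written with octal codes for the upper-range bytes.
big_octal = rb"\376\377t\310\310tt\310"
little_octal = rb"\377\376\310tt\310\310t"

big_text = decode_text_string(unescape_octal(big_octal))
little_text = decode_text_string(unescape_octal(little_octal))
print(big_text, little_text)
```

With this, both byte orders give the same three characters,
U+74C8 U+C874 U+74C8.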

-------------- next part --------------
A non-text attachment was scrubbed...
Name: KS-fakeactual.pdf
Type: application/pdf
Size: 70927 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090215/d0798b9b/attachment-0001.pdf 
-------------- next part --------------

The attached PDF has been 'faked' so that each of the Korean
ideographs is tagged with an /ActualText entry containing the
example string described above.

BTW, the example PDF
   http://www.unicode.org/udhr/d/udhr_san.pdf
used to test the /ActualText support involved only characters
in the range  Ux0A..  so that problem (b) with higher-range
characters did not occur; nor does (c) occur for this range.

Hope this helps.

	Ross

------------------------------------------------------------------------
Ross Moore                                       ross at maths.mq.edu.au
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------