[Poppler-bugs] [Bug 20013] pdftotext doesn't support /Alt nor /ActualText with octal content

Tue Feb 17 10:59:14 PST 2009

http://bugs.freedesktop.org/show_bug.cgi?id=20013

Ross Moore <ross at maths.mq.edu.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #22702|application/pdf             |text/plain
          mime type|                            |
  Attachment #22702|0                           |1
           is patch|                            |

--- Comment #1 from Ross Moore <ross at maths.mq.edu.au>  2009-02-17 10:59:13 PST ---
(From update of attachment 22702)
This bug is due to improper extraction of the text in the /ActualText entry.
Here is a better description of the effects observed.

I'm now creating PDFs with /ActualText strings for CJK ideographs.
These strings are given in big-endian UTF-16 format.
Using  pdftotext  to extract the text, what I find is that:

 a)  some, but not all, UTF-16 byte-pairs produce an extractable
     character.

 b)  Whenever the *first* byte of the pair is in the upper range
      128--255 then the whole character is omitted.

    For example, with the PDF string:  (˛ˇt»»tt»)
    the text extracted using Adobe Reader is  瓈존瓈
    but Poppler produces  珈珈 , which exhibits two errors.

  Firstly, ...

    the portion '»t' has been extracted as '', the empty string,
    between the Chinese ideographs.

    In alternative representations, this is:
     (<FE><FF>t<C8><C8>tt<C8>)  producing  <E7><8F><88><E7><8F><88> ,
    where  t<C8>,  representing  't»',  extracts to
      <E7><8F><88>  which is  珈 .

  Secondly, ...

 c) There is an error in the translation of UTF-16 characters
    into UTF-8. For example,  the above  t<C8>  should actually
    convert in UTF-8 to   <E7><93><88>   which is  瓈 ,
    as done by Adobe and other software.

    The <E7><8F><88> is what correctly comes from  s<C8> ;
    the high-order byte comes out too low by 1 (<73> instead of <74>).
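
The byte values quoted above can be checked independently of any PDF tool.
The following is a minimal C++ sketch, not Poppler code: the helper names
decodeUtf16BE and encodeUtf8Bmp are made up for illustration, the bytes are
held as unsigned values, and surrogate pairs are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a big-endian UTF-16 byte string into 16-bit code units.
// A leading FE FF byte-order mark is skipped if present.
// (Hypothetical helper for illustration; not part of Poppler.)
std::vector<uint16_t> decodeUtf16BE(const std::vector<uint8_t> &bytes) {
    std::vector<uint16_t> units;
    std::size_t start = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
        start = 2;
    for (std::size_t i = start; i + 1 < bytes.size(); i += 2)
        units.push_back(static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1]));
    return units;
}

// Encode one code point in the range U+0000..U+FFFF as UTF-8.
// Surrogates are not treated specially in this sketch.
std::string encodeUtf8Bmp(uint16_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Fed the bytes FE FF 74 C8 C8 74 74 C8, decodeUtf16BE gives the code units
74C8, C874, 74C8. encodeUtf8Bmp(0x74C8) gives E7 93 88, while
encodeUtf8Bmp(0x73C8) gives E7 8F 88, in agreement with the values above.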

Further comment.

  d)  octal codes can be used, contrary to a question that I raised
     in bug report 20013 .
     There my testing was with codes which produced 1st bytes
     within the upper range, so the difficulties were the same
     as in b) above.

 e)  the example PDF  http://www.unicode.org/udhr/d/udhr_san.pdf
      used to test the /ActualText support involved only characters
      in the range  Ux0A..  so that the problem (b) with higher range
      characters did not occur; and nor does (c) for this range.
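
Regarding (d): in a PDF literal string an octal escape is a backslash
followed by one to three octal digits, so the example string could equally
be written as (\376\377t\310\310tt\310). The following is a minimal C++
sketch of expanding such escapes. The function name expandOctalEscapes is
hypothetical, this is not how Poppler's lexer is written, and other escapes
such as \n or \( are deliberately left unhandled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Expand \ddd octal escapes (one to three octal digits) in the body of a
// PDF literal string. All other bytes are copied through unchanged.
// (Hypothetical helper for illustration only.)
std::vector<uint8_t> expandOctalEscapes(const std::string &body) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        const bool isOctalEscape = body[i] == '\\' && i + 1 < body.size() &&
                                   body[i + 1] >= '0' && body[i + 1] <= '7';
        if (isOctalEscape) {
            unsigned value = 0;
            int digits = 0;
            // Consume at most three octal digits after the backslash.
            while (digits < 3 && i + 1 < body.size() &&
                   body[i + 1] >= '0' && body[i + 1] <= '7') {
                value = value * 8 + static_cast<unsigned>(body[i + 1] - '0');
                ++i;
                ++digits;
            }
            // Keep only the low 8 bits of the accumulated value.
            out.push_back(static_cast<uint8_t>(value & 0xFF));
        } else {
            out.push_back(static_cast<uint8_t>(body[i]));
        }
    }
    return out;
}
```

With the input \376\377t\310 this gives the four bytes FE FF 74 C8, i.e. the
same bytes as the raw form, so either notation meets the same extraction
problem once the first byte of a pair is in the upper range.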

Here's a patch that fixes the problem.
The new line of code is based upon similar methods used in
     poppler/Outline.cc .

*** TextOutputDev-prev.cc       Wed Feb 18 04:59:28 2009
--- TextOutputDev.cc    Wed Feb 18 05:42:22 2009
*************** void TextOutputDev::endMarkedContent(Gfx
*** 4657,4663 ****
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];

        text->addChar(state,
                    actualText_x, actualText_y,
--- 4657,4663 ----
        length = length/2 - 1;
        uni = new Unicode[length];
        for (i = 0 ; i < length; i++)
!       uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);

        text->addChar(state,
                    actualText_x, actualText_y,
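
One reading of why the added masks help is sketched below, under two
assumptions that are not stated above: that the string bytes are obtained as
plain char, and that char is signed on the platform in question. A byte of
0x80 or more is then sign-extended to a negative int before the two halves
are combined. The function names combineUnmasked and combineMasked are
hypothetical, and multiplication by 256 stands in for the original <<8 so
that the sketch does not shift a negative value.

```cpp
#include <cstdint>

// Combination as in the old line, with the bytes seen as signed char:
// a byte >= 0x80 arrives as a negative value.
// (Hypothetical helper; hi * 256 replaces hi << 8 to avoid shifting a
// negative int.)
uint32_t combineUnmasked(signed char hi, signed char lo) {
    return static_cast<uint32_t>(hi * 256 + lo);
}

// Combination as in the new line: each byte is masked to 0..255 first,
// so sign extension cannot leak into the result.
uint32_t combineMasked(signed char hi, signed char lo) {
    return static_cast<uint32_t>(((hi & 0xff) << 8) | (lo & 0xff));
}
```

For the pair 74 C8 the unmasked form gives 0x73C8 (0x7400 minus 56), which
matches the off-by-one high byte seen in (c). For the pair C8 74 it gives
0xFFFFC874, which is not a valid Unicode value and would be consistent with
the character being dropped as in (b). The masked form gives 0x74C8 and
0xC874 respectively.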
