[poppler] Actual text and 0.6 branch

Albert Astals Cid aacid at kde.org
Thu Feb 26 14:00:00 PST 2009


Ross, I'm not sure I understand this mail at all. You talk about the 0.6
branch, which is OLD.

Is it related to http://bugs.freedesktop.org/show_bug.cgi?id=20013 ?

Albert

On Sunday, 15 February 2009, Ross Moore wrote:
> Hi Jonathan,
>
> I think there is a serious problem in Poppler.
> However, it's possible that maybe there is something
> else wrong on my Mac.
> Before reporting it as a bug, would you please confirm
> (under Mac OS, or Linux, whatever you have)
> that what I say below isn't just local to my system.
>
> Cheers & thanks,
>
> 	Ross
>
>
> Hi Albert,
>
> This revisits a thread from December 2007, where you
> report adding a patch to support /ActualText .
> See also:
>    [Poppler-bugs] [Bug 13573] Poppler does not support ActualText
>
>
> I'm now creating PDFs with /ActualText strings for CJK ideographs.
> These strings are given in big-endian UTF-16 format.
> Using  pdftotext  to extract the text, what I find is that:
>
>   a)  some, but not all, UTF-16 byte-pairs produce an extractable
>       character.
>
>   b)  Whenever the *first* byte of the pair is in the upper range
>        128--255 then the whole character is omitted.
>
>      For example, with the PDF string:  (˛ˇt»»tt»)
>      the text extracted using Adobe Reader is  瓈존瓈
>      but Poppler produces  珈珈 , which exhibits two errors.
>
>    Firstly, ...
>
>      the portion '»t' has been extracted as '', the empty string,
>      between the Chinese ideographs.
>
>      In alternative representations, this is:
>       (<FE><FF>t<C8><C8>tt<C8>)  producing  <E7><8F><88><E7><8F><88> ,
>      where  t<C8> representing  't»'  extracts to
>        <E7><8F><88>  which is  珈 .
>
>    Secondly, ...
>
>   c) There is an error in the translation of UTF-16 characters
>      into UTF-8. For example,  the above  t<C8>  should actually
>      convert in UTF-8 to   <E7><93><88>   which is  瓈 ,
>      as done by Adobe and other software.
>
>      The <E7><8F><88> is what correctly comes from  s<C8> ;
>      the top-order byte is being mistranslated by -1.
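
For reference, the expected values in b) and c) can be checked outside
Poppler. The following is a minimal sketch in Python, not Poppler code: it
takes the bytes quoted above, strips the FE FF byte-order mark, and
re-encodes as UTF-8. The function name decode_utf16be_with_bom is made up
for this illustration.

```python
# Bytes of the PDF string from the report: BOM FE FF, then t<C8> <C8>t t<C8>.
RAW = bytes([0xFE, 0xFF, 0x74, 0xC8, 0xC8, 0x74, 0x74, 0xC8])


def decode_utf16be_with_bom(data: bytes) -> str:
    """Decode a big-endian UTF-16 string that starts with the FE FF mark.

    Hypothetical helper for this illustration only.
    """
    if data[:2] != b"\xfe\xff":
        raise ValueError("missing big-endian byte-order mark")
    return data[2:].decode("utf-16-be")


text = decode_utf16be_with_bom(RAW)
utf8 = text.encode("utf-8")

# Code points of the decoded characters and the UTF-8 bytes, in hex.
print([hex(ord(ch)) for ch in text])
print(utf8.hex())

# The character Poppler is reported to emit is U+73C8, i.e. s<C8>.
print("\u73c8".encode("utf-8").hex())
```

The three code points come out as U+74C8, U+C874, U+74C8, and U+74C8 encodes
to E7 93 88 as stated in c), while U+73C8 encodes to E7 8F 88.
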
>
>
> Further comments.
>
>   d)  little-endian UTF-16 strings are not supported at all.
>       There's no coding to swap the byte order within an extracted
>       string.
>
>       Instead the byte-order mark in  (ˇ˛»tt»»t) isn't recognised,
>       so Poppler extracts just the letters 'ttt'.
>
>
>   e)  octal codes can be used, contrary to a question that I raised
>       in bug report 20013.
>       There my testing was with codes which produced 1st bytes
>       within the upper range, so the difficulties were the same
>       as in b) above.
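
To make point d) concrete, here is a small Python sketch of byte-order-mark
handling for both orders. It is an illustration under assumptions, not a
description of what Poppler does: the function names are invented, Latin-1
stands in for PDFDocEncoding in the fallback branch, and the "ASCII only"
function merely reproduces the reported 'ttt' output. As far as I know the
PDF reference only defines the big-endian form with FE FF, so accepting
FF FE would be a leniency rather than a requirement.

```python
# Big-endian and little-endian forms of the same three characters.
BE = bytes([0xFE, 0xFF, 0x74, 0xC8, 0xC8, 0x74, 0x74, 0xC8])
LE = bytes([0xFF, 0xFE, 0xC8, 0x74, 0x74, 0xC8, 0xC8, 0x74])


def decode_actual_text(data: bytes) -> str:
    """Pick a decoder from the byte-order mark (hypothetical helper)."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xff\xfe"):
        # Lenient branch: swap the byte order instead of giving up.
        return data[2:].decode("utf-16-le")
    # Assumption: Latin-1 as a rough stand-in for PDFDocEncoding.
    return data.decode("latin-1")


def ascii_only(data: bytes) -> str:
    """Keep only bytes below 128, mimicking the reported symptom."""
    return "".join(chr(b) for b in data if b < 0x80)


print(decode_actual_text(BE) == decode_actual_text(LE))
print(ascii_only(LE))
```

Both strings decode to the same three characters once the mark is honoured,
while dropping every byte in the range 128--255 leaves just 'ttt'.
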



