[poppler] Actual text and 0.6 branch
Albert Astals Cid
aacid at kde.org
Thu Feb 26 14:00:00 PST 2009
Ross, I'm not sure I understand this mail at all; you talk about the 0.6 branch,
which is OLD.
Is it related to http://bugs.freedesktop.org/show_bug.cgi?id=20013 ?
Albert
On Sunday, 15 February 2009, Ross Moore wrote:
> Hi Jonathan,
>
> I think there is a serious problem in Poppler.
> However, it's possible that there is something
> else wrong on my Mac.
> Before reporting it as a bug, would you please confirm
> (under Mac OS, or Linux, whatever you have)
> that what I say below isn't just local to my system.
>
> Cheers & thanks,
>
> Ross
>
>
> Hi Albert,
>
> This revisits a thread from December 2007, where you
> report adding a patch to support /ActualText .
> See also:
> [Poppler-bugs] [Bug 13573] Poppler does not support ActualText
>
>
> I'm now creating PDFs with /ActualText strings for CJK ideographs.
> These strings are given in big-endian UTF-16 format.
> Using pdftotext to extract the text, what I find is that:
>
> a) some, but not all, UTF-16 byte-pairs produce an extractable
> character.
>
> b) Whenever the *first* byte of the pair is in the upper range
> 128--255 then the whole character is omitted.
>
> For example, with the PDF string: (˛ˇt»»tt»)
> the text extracted using Adobe Reader is 瓈존瓈
> but Poppler produces 珈珈 , which exhibits two errors.
>
> Firstly, ...
>
> the portion '»t' has been extracted as '', the empty string,
> between the Chinese ideographs.
>
> In alternative representations, this is:
> (<FE><FF>t<C8><C8>tt<C8>) producing <E7><8F><88><E7><8F><88> ,
> where t<C8> representing 't»' extracts to
> <E7><8F><88> which is 珈 .
>
> Secondly, ...
>
> c) There is an error in the translation of UTF-16 characters
> into UTF-8. For example, the above t<C8> should actually
> convert in UTF-8 to <E7><93><88> which is 瓈 ,
> as done by Adobe and other software.
>
> The <E7><8F><88> is what correctly comes from s<C8> ;
> the high-order byte is being mistranslated by -1 (t, 0x74, is read as s, 0x73).
>
>
> Further comments.
>
> d) little-endian UTF-16 strings are not supported at all.
> There's no code to swap the byte order within an extracted
> string.
>
> Instead the byte-order mark in (ˇ˛»tt»»t) isn't recognised,
> so Poppler extracts just the letters 'ttt'.
>
>
> e) octal codes can be used, contrary to a question that I raised
> in bug report 20013 .
> There my testing was with codes which produced 1st bytes
> within the upper range, so the difficulties were the same
> as in b) above.
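
For anyone wanting to reproduce the expected results quoted above without a PDF viewer, here is a minimal Python sketch. It is not Poppler's code; the function name decode_pdf_text_string is hypothetical, accepting a little-endian byte-order mark is an assumption made only to mirror point d), and Latin-1 is used as a simplified stand-in for PDFDocEncoding when no byte-order mark is present.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the bytes of an /ActualText string (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE (assumption, see point d)
        return raw[2:].decode("utf-16-le")
    # No byte-order mark: Latin-1 here is only a stand-in for PDFDocEncoding.
    return raw.decode("latin-1")


# The two example strings from the quoted message, written as raw bytes.
big_endian = bytes.fromhex("FEFF74C8C87474C8")     # (<FE><FF>t<C8><C8>tt<C8>)
little_endian = bytes.fromhex("FFFEC87474C8C874")  # same text, bytes swapped

for raw in (big_endian, little_endian):
    text = decode_pdf_text_string(raw)
    # Show the code points and the UTF-8 bytes a text extractor should emit.
    print([hex(ord(ch)) for ch in text], text.encode("utf-8").hex())
```

Both strings should decode to the three code points U+74C8, U+C874, U+74C8, and U+74C8 should come out in UTF-8 as <E7><93><88>, matching what the quoted message reports for Adobe Reader.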