[poppler] Actual text and 0.6 branch
Ross Moore
ross at ics.mq.edu.au
Sat Feb 14 16:59:28 PST 2009
Hi Jonathan,
I think there is a serious problem in Poppler.
However, it's possible that something else is wrong
on my Mac.
Before reporting it as a bug, would you please confirm
(under Mac OS, or Linux, whatever you have)
that what I say below isn't just local to my system.
Cheers & thanks,
Ross
Hi Albert,
This revisits a thread from December 2007, where you
reported adding a patch to support /ActualText.
See also:
[Poppler-bugs] [Bug 13573] Poppler does not support ActualText
I'm now creating PDFs with /ActualText strings for CJK ideographs.
These strings are given in big-endian UTF-16 format.
Using pdftotext to extract the text, what I find is that:
a) some, but not all, UTF-16 byte-pairs produce an extractable
character.
b) Whenever the *first* byte of the pair is in the upper range
128--255 then the whole character is omitted.
For example, with the PDF string: (˛ˇt»»tt»)
the text extracted using Adobe Reader is 瓈존瓈
but Poppler produces 珈珈, which exhibits two errors.
Firstly, ...
the portion '»t' has been extracted as '', the empty string,
between the Chinese ideographs.
In alternative representations, this is:
(<FE><FF>t<C8><C8>tt<C8>) producing <E7><8F><88><E7><8F><88> ,
where t<C8> representing 't»' extracts to
<E7><8F><88> which is 珈 .
Secondly, ...
c) There is an error in the translation of UTF-16 characters
into UTF-8. For example, the above t<C8> should actually
convert in UTF-8 to <E7><93><88> which is 瓈 ,
as done by Adobe and other software.
The <E7><8F><88> is what correctly comes from s<C8> ;
that is, the high-order byte 't' (0x74) is being treated as
's' (0x73), one less than it should be.
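
The expected conversion can be checked independently of any PDF software. What follows is a minimal Python sketch, not Poppler code, and the variable names are only illustrative. It decodes the example bytes as big-endian UTF-16 and prints the UTF-8 bytes of each character, then does the same for the s<C8> pair for comparison.

```python
# Bytes of the example string: BOM FE FF, then three UTF-16BE code units.
raw = bytes.fromhex("FEFF74C8C87474C8")

# Drop the byte-order mark and decode the rest as big-endian UTF-16.
text = raw[2:].decode("utf-16-be")

for ch in text:
    utf8 = ch.encode("utf-8").hex().upper()
    print(f"U+{ord(ch):04X} {ch} -> {utf8}")

# The pair with 's' (0x73) as its high byte, for comparison with (c).
s_c8 = bytes.fromhex("73C8").decode("utf-16-be")
print(f"U+{ord(s_c8):04X} {s_c8} -> {s_c8.encode('utf-8').hex().upper()}")
```

The t<C8> pair comes out as E7 93 88 and the s<C8> pair as E7 8F 88, matching the values quoted above.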
Further comments.
d) Little-endian UTF-16 strings are not supported at all.
There's no code to swap the byte order within an extracted
string.
Instead the byte-order mark in (ˇ˛»tt»»t) isn't recognised,
so Poppler extracts just the letters 'ttt'.
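
To illustrate the kind of byte-order handling meant in d), here is a minimal Python sketch. It is not Poppler's code: decode_text_string is a hypothetical helper, and the Latin-1 fallback is only an assumed stand-in for whatever non-Unicode encoding applies when no byte-order mark is present.

```python
def decode_text_string(raw: bytes) -> str:
    """Hypothetical helper: choose the byte order from the byte-order mark."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(b"\xff\xfe"):
        return raw[2:].decode("utf-16-le")
    # Assumption: Latin-1 stands in for the non-Unicode fallback encoding.
    return raw.decode("latin-1")

big = bytes.fromhex("FEFF74C8C87474C8")     # the big-endian example string
little = bytes.fromhex("FFFEC87474C8C874")  # the same text, little-endian

print(decode_text_string(big) == decode_text_string(little))
```

With both marks recognised, the two example strings decode to the same three characters.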
e) Octal codes can be used, contrary to a question that I raised
in bug report 20013.
My testing there used codes which produced first bytes
within the upper range, so the difficulties were the same
as in b) above.
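
As an illustration of e), the example string can be written with octal escapes for the bytes in the upper range. The following Python sketch is a simplification under stated assumptions: unescape_octal is a hypothetical helper that handles only \ddd escapes and ignores the other backslash sequences a PDF literal string may contain.

```python
import re

def unescape_octal(literal: bytes) -> bytes:
    """Replace \\ddd octal escapes (1 to 3 digits) with the byte they denote."""
    return re.sub(
        rb"\\([0-7]{1,3})",
        lambda m: bytes([int(m.group(1), 8) & 0xFF]),
        literal,
    )

# The example string with FE, FF and C8 written as \376, \377 and \310.
escaped = rb"\376\377t\310\310tt\310"
raw = unescape_octal(escaped)

print(raw.hex().upper())
print(raw[2:].decode("utf-16-be"))
```

The unescaped bytes are the same as in the example under b), so the first byte of each pair is what matters, not how it was written.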
-------------- next part --------------
A non-text attachment was scrubbed...
Name: KS-fakeactual.pdf
Type: application/pdf
Size: 70927 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20090215/d0798b9b/attachment-0001.pdf
-------------- next part --------------
The attached PDF has been 'faked' so that each of the Korean
ideographs is tagged with an /ActualText entry containing the
example string described above.
BTW, the example PDF
http://www.unicode.org/udhr/d/udhr_san.pdf
used to test the /ActualText support involved only characters
in the range Ux0A.. so that the problem (b) with higher range
characters did not occur; nor does (c) occur for this range.
Hope this helps.
Ross
------------------------------------------------------------------------
Ross Moore ross at maths.mq.edu.au
Mathematics Department office: E7A-419
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia 2109 fax: +61 (0)2 9850 8114
------------------------------------------------------------------------