[Poppler-bugs] [Bug 20013] pdftotext doesn't support /Alt nor / ActualText with octal content
bugzilla-daemon at freedesktop.org
Tue Feb 17 10:59:14 PST 2009
http://bugs.freedesktop.org/show_bug.cgi?id=20013
Ross Moore <ross at maths.mq.edu.au> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #22702|application/pdf             |text/plain
          mime type|                            |
  Attachment #22702|0                           |1
           is patch|                            |
--- Comment #1 from Ross Moore <ross at maths.mq.edu.au> 2009-02-17 10:59:13 PST ---
(From update of attachment 22702)
This bug is due to improper extraction of the text in the /ActualText entry.
Here is a better description of the effects observed.
I'm now creating PDFs with /ActualText strings for CJK ideographs.
These strings are given in big-endian UTF-16 format.
Using pdftotext to extract the text, what I find is that:
a) some, but not all, UTF-16 byte-pairs produce an extractable
character.
b) Whenever the *first* byte of the pair is in the upper range
128--255 then the whole character is omitted.
For example, with the PDF string: (˛ˇt»»tt»)
the text extracted using Adobe Reader is 瓈존瓈
but Poppler produces 珈珈 , which exhibits two errors.
Firstly, ...
the portion '»t' has been extracted as '', the empty string,
between the Chinese ideographs.
In alternative representations, this is:
(<FE><FF>t<C8><C8>tt<C8>) producing <E7><8F><88><E7><8F><88> ,
where t<C8> representing 't»' extracts to
<E7><8F><88> which is 珈 .
Secondly, ...
c) There is an error in the translation of UTF-16 characters
into UTF-8. For example, the above t<C8> should actually
convert in UTF-8 to <E7><93><88> which is 瓈 ,
as done by Adobe and other software.
The <E7><8F><88> is the correct UTF-8 for s<C8> instead;
that is, the high-order byte comes out one too low (s rather than t).
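Both symptoms, (b) and (c), are consistent with the string bytes being read as signed values before they are combined. The following is a minimal sketch of that arithmetic, not the Poppler code itself: the helper names combineOld and combineNew are hypothetical, Unicode is assumed to be a 32-bit unsigned type, and signed char is used explicitly so the result does not depend on whether plain char is signed on a given platform. In combineOld, multiplication by 256 stands in for the original <<8 to avoid shifting a negative value.

```cpp
typedef unsigned int Unicode;  // assumption: 32-bit unsigned code point type

// Mirrors the old expression (hi<<8) + lo when the bytes are signed.
// A byte >= 0x80 becomes negative after integer promotion.
Unicode combineOld(signed char hi, signed char lo) {
    return static_cast<Unicode>(hi * 256 + lo);
}

// Mirrors the patched expression: each byte is masked to 0..255
// before the two are combined.
Unicode combineNew(signed char hi, signed char lo) {
    return static_cast<Unicode>(((hi & 0xff) << 8) | (lo & 0xff));
}

// With the byte pair t<C8> (0x74 0xC8):
//   combineOld gives 0x7400 - 0x38 = 0x73C8, which is 珈 rather than 瓈
//   combineNew gives 0x74C8
// With the byte pair <C8>t (0xC8 0x74):
//   combineOld gives a negative sum, 0xFFFFC874 as a 32-bit unsigned value,
//   which is outside the Unicode range
//   combineNew gives 0xC874
```

A value such as 0xFFFFC874 is not a valid code point, which would presumably explain why the whole character is omitted in case (b).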
Further comment.
d) octal codes can be used, contrary to a question that I raised
in bug report 20013 .
There my testing was with codes which produced 1st bytes
within the upper range, so the difficulties were the same
as in b) above.
e) the example PDF http://www.unicode.org/udhr/d/udhr_san.pdf
used to test the /ActualText support involved only characters
in the range Ux0A.. so that problem (b) with higher-range
characters did not occur, and neither does (c) for this range.
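To illustrate point (d): in a PDF literal string, a backslash followed by one to three octal digits denotes a single byte, so the test string above can equally be written as (\376\377t\310\310tt\310). The sketch below expands only such octal escapes; the function name expandOctalEscapes is hypothetical, and it is not a complete PDF string parser (other escapes such as \n or \( are passed through untouched).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: expand \ddd escapes (1 to 3 octal digits) in the
// body of a PDF literal string. All other bytes are copied unchanged.
std::string expandOctalEscapes(const std::string &in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool isOctalEscape = in[i] == '\\' && i + 1 < in.size() &&
                             in[i + 1] >= '0' && in[i + 1] <= '7';
        if (!isOctalEscape) {
            out += in[i++];
            continue;
        }
        unsigned value = 0;
        std::size_t j = i + 1;
        // Consume at most three octal digits.
        for (int n = 0; n < 3 && j < in.size() &&
                        in[j] >= '0' && in[j] <= '7'; ++n, ++j) {
            value = value * 8 + static_cast<unsigned>(in[j] - '0');
        }
        out += static_cast<char>(value & 0xff);  // keep the low 8 bits
        i = j;
    }
    return out;
}
```

Since the octal and the raw forms yield the same bytes, any difference in extraction comes from how those bytes are later combined, not from the escape notation.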
Here's a patch that fixes the problem.
The new line of code is based upon similar methods used in
poppler/Outline.cc .
*** TextOutputDev-prev.cc Wed Feb 18 04:59:28 2009
--- TextOutputDev.cc Wed Feb 18 05:42:22 2009
*************** void TextOutputDev::endMarkedContent(Gfx
*** 4657,4663 ****
length = length/2 - 1;
uni = new Unicode[length];
for (i = 0 ; i < length; i++)
! uni[i] = (uniString[2 + i*2]<<8) + uniString[2 + i*2+1];
text->addChar(state,
actualText_x, actualText_y,
--- 4657,4663 ----
length = length/2 - 1;
uni = new Unicode[length];
for (i = 0 ; i < length; i++)
! uni[i] = ((uniString[2 + i*2] & 0xff)<<8)|(uniString[3 + i*2] & 0xff);
text->addChar(state,
actualText_x, actualText_y,
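As a standalone check of the patched expression, here is a small sketch that decodes a big-endian UTF-16 string with a leading <FE><FF> marker and re-encodes each unit as UTF-8. It is an illustration under assumptions, not the Poppler implementation: decodeActualText and toUtf8 are hypothetical names, Unicode is assumed to be an unsigned int, and surrogate pairs are not handled (the loop in the patch does not combine them either).

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef unsigned int Unicode;  // assumption: unsigned code point type

// Hypothetical helper: decode big-endian UTF-16 that starts with FE FF.
// Returns an empty vector if the marker is missing.
std::vector<Unicode> decodeActualText(const std::string &s) {
    std::vector<Unicode> out;
    if (s.size() < 2 || (s[0] & 0xff) != 0xfe || (s[1] & 0xff) != 0xff)
        return out;
    for (std::size_t i = 2; i + 1 < s.size(); i += 2) {
        // Same masking as in the patch: each byte is reduced to 0..255.
        out.push_back(static_cast<Unicode>(((s[i] & 0xff) << 8) |
                                           (s[i + 1] & 0xff)));
    }
    return out;
}

// Hypothetical helper: UTF-8 encoding for code points below 0x10000.
std::string toUtf8(Unicode u) {
    std::string r;
    if (u < 0x80) {
        r += static_cast<char>(u);
    } else if (u < 0x800) {
        r += static_cast<char>(0xC0 | (u >> 6));
        r += static_cast<char>(0x80 | (u & 0x3F));
    } else {
        r += static_cast<char>(0xE0 | (u >> 12));
        r += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        r += static_cast<char>(0x80 | (u & 0x3F));
    }
    return r;
}
```

Applied to the bytes <FE><FF>t<C8><C8>tt<C8>, decodeActualText yields the three units 0x74C8, 0xC874, 0x74C8, and toUtf8(0x74C8) yields <E7><93><88>, matching the Adobe Reader result quoted above.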