[poppler] [PATCH] Update to PDFDocEncoding Table

Michael Vrable mvrable at cs.ucsd.edu
Wed Feb 13 21:59:25 PST 2008


Carlos (I believe) pointed me at a document with a form-editing bug, at 
http://bugzilla.gnome.org/show_bug.cgi?id=365807.  The text in the 
upper-right corner is actually a multi-line form field.  If you click on 
that text, only the first line is made available for editing.  However, 
editing the field to include additional lines still works.

The problem has to do with the conversion of strings from PDFDocEncoding 
to Unicode.  The lookup table for the conversion does not know what to 
do with a carriage return, and so maps it to U+0000.  When passed up to 
evince for editing, the null character ends the string early, at the 
first line break.  Editing still works because the value of the field is 
initially stored in PDFDocEncoding, but once we edit it, we store the 
result back as a Unicode string, which bypasses the lookup table.
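
As a rough illustration of the truncation (not code from poppler or 
evince), the sketch below converts a string byte by byte and then 
measures it the way a NUL-terminated consumer would.  The names 
oldLookup and visibleLength are hypothetical, and the pass-through for 
every byte other than carriage return is an assumption that only holds 
for plain ASCII input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the old table: carriage return has no entry,
// so it comes back as 0 (U+0000). Every other byte is passed through,
// which is only valid for plain ASCII input.
static char oldLookup(unsigned char byte) {
    return byte == '\r' ? '\0' : static_cast<char>(byte);
}

// Length that a NUL-terminated consumer sees after conversion.
std::size_t visibleLength(const std::string &pdfDocText) {
    std::string converted;
    for (unsigned char b : pdfDocText) {
        converted.push_back(oldLookup(b));
    }
    // The converted buffer still holds every character...
    assert(converted.size() == pdfDocText.size());
    // ...but anything that treats it as a C string stops at the first NUL.
    return std::strlen(converted.c_str());
}
```

For a field value such as "line one\rline two", visibleLength reports 
only the length of "line one", which matches the symptom in the bug 
report.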

The fix: add carriage return and a few other characters to the 
PDFDocEncoding table.  Map them to the corresponding Unicode characters 
(same numeric value).  In this patch, I'm only adding mappings for 
whitespace characters, not all control characters.  I contemplated 
adding mappings for all control characters, but a complete job is not 
possible because PDFDocEncoding already assigns glyphs to some bytes 
below 0x20.
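
A minimal sketch of what such a table change could look like for the 
bytes below 0x20 follows.  It is not the patch itself: the function name 
buildLowTable and the Unicode alias are hypothetical, and the exact set 
of whitespace bytes (tab, line feed, form feed, carriage return) is an 
assumption, since only carriage return is named above.  The 0x18 entry 
(breve, U+02D8) is included as one example of a byte in this range that 
PDFDocEncoding uses for a glyph.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Unicode = char32_t;  // hypothetical alias for a code point

// Build the entries for bytes 0x00..0x1F of a PDFDocEncoding-to-Unicode
// table. Zero stands for "no mapping", as in the old table.
std::array<Unicode, 32> buildLowTable() {
    std::array<Unicode, 32> table{};  // value-initialized: all zero

    // Assumed whitespace set: tab, line feed, form feed, carriage return.
    // Each maps to the Unicode character with the same numeric value.
    constexpr unsigned char whitespace[] = {0x09, 0x0A, 0x0C, 0x0D};
    for (unsigned char ws : whitespace) {
        assert(ws < table.size());
        table[ws] = ws;
    }

    // Example of a low byte that is a glyph, not a control character:
    // 0x18 is the breve accent, U+02D8.
    table[0x18] = 0x02D8;
    return table;
}
```

The other control bytes are deliberately left unmapped here, mirroring 
the decision to add whitespace only.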

While making this change, I also updated the table so that any unknown 
characters are now mapped to U+FFFD (conventionally used to represent a 
character that couldn't be converted) instead of U+0000.  This should 
prevent an unknown character in a PDFDocEncoding string from being 
turned into a null in the future.
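
The fallback can be sketched as below.  This is an illustration under 
stated assumptions, not the patched table: pdfDocToUnicode and 
convertPdfDoc are hypothetical names, and the sketch only knows three 
whitespace bytes and printable ASCII, so it treats every other byte as 
unknown even though the real table maps many of them.

```cpp
#include <cassert>
#include <string>

constexpr char32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Hypothetical lookup covering a small subset of PDFDocEncoding.
char32_t pdfDocToUnicode(unsigned char byte) {
    // Whitespace maps to the same numeric value (assumed subset).
    if (byte == 0x09 || byte == 0x0A || byte == 0x0D) return byte;
    // Printable ASCII maps to itself.
    if (byte >= 0x20 && byte <= 0x7E) return byte;
    // Anything the sketch does not know becomes U+FFFD, never U+0000.
    return kReplacement;
}

// Convert a whole string, one code point per input byte.
std::u32string convertPdfDoc(const std::string &in) {
    std::u32string out;
    for (unsigned char b : in) {
        out.push_back(pdfDocToUnicode(b));
    }
    assert(out.size() == in.size());
    return out;
}
```

With this shape, a byte without an entry shows up as a visible 
replacement character in the converted text, and the converted string 
never contains an embedded null that could cut it short.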

--Michael Vrable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdfdocencoding.patch
Type: text/x-diff
Size: 3216 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080213/ee3fdb91/attachment.patch 