[poppler] [PATCH and RFC] Bugfixes, Improved Forms Support for Unicode

Sat Feb 2 21:27:53 PST 2008

The root cause of https://bugs.freedesktop.org/show_bug.cgi?id=12808 is 
that the code for rendering form fields in poppler didn't properly deal 
with input strings provided in UTF-16: the string was treated as an 
8-bit string, and the byte-order-mark at the front was included in the 
length calculation.
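
To illustrate the length problem (this is a standalone sketch, not code 
from the patches): a PDF text string is UTF-16BE when it starts with the 
bytes 0xFE 0xFF, so the length in characters has to skip that marker and 
count two bytes per code unit.  The helper names below are made up for 
the example and operate on a plain std::string rather than poppler's own 
string class.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not poppler's API): true if the string starts
// with the UTF-16BE byte-order mark 0xFE 0xFF.
bool hasUnicodeMarker(const std::string &s) {
    return s.size() >= 2 &&
           static_cast<unsigned char>(s[0]) == 0xFE &&
           static_cast<unsigned char>(s[1]) == 0xFF;
}

// Hypothetical helper: number of characters in a form field value.
// For UTF-16 input this counts 16-bit code units and excludes the BOM;
// for 8-bit (PDFDocEncoding) input it is simply the byte count.
std::size_t textLength(const std::string &s) {
    if (hasUnicodeMarker(s))
        return (s.size() - 2) / 2;
    return s.size();
}
```

Treating the six bytes FE FF 00 41 00 42 as an 8-bit string gives a 
length of 6; the function above gives 2, which is what a comb field or 
alignment calculation needs.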

I started off trying to create a simple fix for this problem, but ended 
up significantly rewriting the code for displaying form fields to fix 
other problems that I found along the way, and eventually added 
near-full support for Unicode input.

Since these changes are large, I don't expect this patch to go in right 
away.  But please, provide feedback.  My work is based on git commit 
6f11ef660540.

There are two patches.  The first, character-encoding-fixes.patch, is a 
couple of fairly trivial fixes that I came across while working on the 
larger patch.  It can go in at any time if it looks good.

The second patch, unicode-forms-support.patch, is the main part of the 
work and the patch I'd like comments on.  Most new functionality is in 
the new Annot::layoutText function.  It performs a few steps:
   - Converts input in PDFDocEncoding or UTF-16 to the font's encoding
   - Computes the width of the text on the page
   - Optionally breaks the text at the specified width, for multi-line
     form fields
All of this ended up in the same function since finding break-points for 
lines is easiest to do on the input encoding, where spaces and newlines 
are easier to recognize than in whatever encoding the font uses, but the 
width of text is easiest to compute when re-encoding the text string.
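
A much-simplified sketch of the idea, for discussion only: it is not the 
Annot::layoutText from the patch.  It scans the input characters, 
remembers the last space as a break candidate, and accumulates the width 
as it goes.  The LineLayout struct, the layoutLine name and the charWidth 
callback (standing in for "re-encode the character and look up its width 
in the font") are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

struct LineLayout {
    std::size_t charsConsumed; // input characters used, including the break
    double width;              // width of the visible text on this line
};

// Hypothetical sketch: lay out one line of `text` starting at `start`.
// A maxWidth <= 0 means "do not wrap" (single-line field).
LineLayout layoutLine(const std::u16string &text, std::size_t start,
                      double maxWidth,
                      const std::function<double(char16_t)> &charWidth) {
    assert(start <= text.size());
    const std::size_t npos = std::u16string::npos;
    double width = 0.0;
    std::size_t lastBreak = npos; // index of the last space seen
    double widthAtBreak = 0.0;    // line width just before that space
    std::size_t i = start;

    for (; i < text.size(); ++i) {
        char16_t c = text[i];
        if (c == u'\n')
            return {i - start + 1, width}; // consume the newline itself
        if (c == u' ') {
            lastBreak = i;
            widthAtBreak = width;
        }
        double w = charWidth(c);
        // Always place at least one character so the caller makes progress.
        if (maxWidth > 0 && width + w > maxWidth && i > start) {
            if (lastBreak != npos)
                return {lastBreak - start + 1, widthAtBreak};
            return {i - start, width}; // no space seen: break mid-word
        }
        width += w;
    }
    return {i - start, width};
}
```

A caller would invoke this repeatedly, advancing start by charsConsumed 
each time, and use the returned width for left, center or right 
alignment of that line.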

The main missing element for full Unicode handling is the writing out of 
text for CID-keyed fonts.  There is currently support for taking Unicode 
characters as input and finding the appropriate character code in the 
font to show each one.  However, there isn't yet code for writing out 
the correct sequence of bytes to show that character (doing so should be 
trivial for an identity CMap, but isn't added quite yet).
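
For the identity case, a rough sketch of what the missing piece could 
look like (not part of the patch, and the function name is invented): 
with a two-byte identity CMap such as Identity-H, each character code is 
written as two bytes, high byte first.  Emitting the string operand in 
hex form avoids having to escape parentheses and backslashes.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: encode 16-bit character codes as a PDF hex
// string operand, assuming the font uses a two-byte identity CMap.
std::string encodeIdentityHex(const std::vector<std::uint16_t> &codes) {
    std::string out = "<";
    char buf[5]; // four hex digits plus the terminating NUL
    for (std::uint16_t code : codes) {
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(code));
        out += buf;
    }
    out += ">";
    return out;
}
```

The result would be followed by the Tj operator in the appearance 
stream.  Other CMaps would need a real code-to-bytes mapping, which this 
sketch does not attempt.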

Also missing: support for Unicode text outside the BMP, using surrogate 
pairs.
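
Decoding surrogate pairs is standard UTF-16 and could be added along 
these lines.  This is a sketch only: nextCodePoint is a made-up name, 
and passing unpaired surrogates through unchanged is just one possible 
policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the code point starting at index i and
// advance i past it (by one or two 16-bit code units).
char32_t nextCodePoint(const std::u16string &s, std::size_t &i) {
    assert(i < s.size());
    char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;
            return 0x10000 +
                   ((static_cast<char32_t>(hi) - 0xD800) << 10) +
                   (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    // BMP character, or an unpaired surrogate passed through as-is.
    return hi;
}
```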

I've done some limited testing with these patches (in evince), and they 
definitely work better for me than before.  However, I don't currently 
have PDFs for testing many features, so pointers to any good test forms 
are appreciated!

Features tested:
   - Accented characters; typographic characters such as bullets, quotes
   - Left, center, right alignment of single-line fields
   - Checkboxes work as before
   - Single-line comb fields still work
Not tested:
   - Multi-line fields (my test form doesn't have them)
   - Form fields with composite fonts (no test forms; code still needs a
     tiny bit of work)

--Michael Vrable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: character-encoding-fixes.patch
Type: text/x-diff
Size: 1416 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080202/077be27b/attachment-0002.patch 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode-forms-support.patch
Type: text/x-diff
Size: 22376 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080202/077be27b/attachment-0003.patch 