[poppler] [PATCH and RFC] Bugfixes, Improved Forms Support for Unicode
Michael Vrable
mvrable at cs.ucsd.edu
Sat Feb 2 21:27:53 PST 2008
The root cause of https://bugs.freedesktop.org/show_bug.cgi?id=12808 is
that the code for rendering form fields in poppler didn't properly deal
with input strings provided in UTF-16: the string was treated as an
8-bit string, and the byte-order-mark at the front was included in the
length calculation.
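To illustrate the failure mode, here is a minimal sketch in Python (chosen for brevity; it is not the poppler C++ code). The helper name decode_form_string is hypothetical, and Latin-1 stands in for PDFDocEncoding, which is only an approximation since the two differ for some byte values.

```python
def decode_form_string(raw: bytes) -> str:
    """Decode a PDF text string (hypothetical helper, not poppler's API)."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE: skip the 2-byte byte-order mark, then decode 16-bit units.
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this illustration.
    return raw.decode("latin-1")


raw = b"\xfe\xff\x00H\x00i"            # "Hi" as UTF-16BE with a BOM
naive_length = len(raw)                 # counts bytes, BOM included
real_length = len(decode_form_string(raw))
```

Treating the value as an 8-bit string gives a length of 6 here, while the field actually holds 2 characters.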
I started off trying to create a simple fix for this problem, but ended
up significantly rewriting the code for displaying form fields to fix
other problems that I found, and eventually added near-full support for
Unicode inputs.
Since these changes are large, I don't expect this patch to go in right
away. But please, provide feedback. My work is based on git commit
6f11ef660540.
There are two patches. The first, character-encoding-fixes.patch, is a
couple of fairly trivial fixes that I came across while working on the
larger patch. It can go in at any time if it looks good.
The second patch, unicode-forms-support.patch, is the main part of the
work and the patch I'd like comments on. Most new functionality is in
the new Annot::layoutText function. It performs a few steps:
- Converts input in PDFDocEncoding or UTF-16 to the font's encoding
- Computes the width of the text on the page
- Optionally breaks the text at the specified width, for multi-line
form fields
All of this ended up in the same function since finding break-points for
lines is easiest to do on the input encoding, where spaces and newlines
are easier to recognize than in whatever encoding the font uses, but the
width of text is easiest to compute when re-encoding the text string.
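The interleaving of the three steps can be sketched roughly as follows. This is a Python illustration of the idea, not the code in the patch: layout_line and the encode callback are hypothetical names, and the sketch assumes the input has already been decoded to a string and that each input character maps to exactly one font code.

```python
from typing import Callable, List, Optional, Tuple


def layout_line(text: str, start: int,
                encode: Callable[[str], Tuple[int, float]],
                max_width: Optional[float] = None
                ) -> Tuple[List[int], float, int]:
    """Lay out one line starting at text[start].

    encode(ch) returns (font_code, glyph_width) for one input character.
    Returns (font_codes, line_width, index_where_next_line_starts).
    """
    codes: List[int] = []
    width = 0.0
    last_space = -1        # input index of the latest space on this line
    width_at_space = 0.0   # line width just before that space
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\n":     # hard break: consume the newline
            return codes, width, i + 1
        code, w = encode(ch)
        if max_width is not None and width + w > max_width and i > start:
            if ch == " ":  # overflow on a space: break here and drop it
                return codes, width, i + 1
            if last_space >= 0:  # back up to the last space and drop it
                return (codes[:last_space - start], width_at_space,
                        last_space + 1)
            return codes, width, i  # no space on the line: break mid-word
        if ch == " ":
            last_space, width_at_space = i, width
        codes.append(code)
        width += w
        i += 1
    return codes, width, i
```

Break points are found by looking at the input characters, while the width is accumulated from whatever the font lookup returns, which is why both happen in a single pass. Calling it repeatedly, with start set to the returned index each time, yields the lines of a multi-line field.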
The main missing element for full Unicode handling is the writing out of
text for CID-keyed fonts. There is currently support for taking
Unicode characters as input and finding the appropriate character code
in the font to show it. However, there isn't code for writing out the
correct sequence of bytes to show that character (doing so should be
trivial for an identity CMap, but isn't added quite yet).
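For the identity CMap case, the missing step might look something like the sketch below (Python, for illustration only). It assumes a two-byte identity CMap such as Identity-H, where each character code is written as two big-endian bytes. The function name is hypothetical, and emitting a hex string with the Tj operator is just one way to write the bytes.

```python
from typing import Iterable


def show_text_identity(codes: Iterable[int]) -> str:
    """Build a content-stream fragment showing 2-byte character codes.

    Assumes a two-byte identity CMap; codes above 0xFFFF raise OverflowError.
    """
    data = b"".join(code.to_bytes(2, "big") for code in codes)
    return "<" + data.hex().upper() + "> Tj"
```

For example, show_text_identity([0x0041, 0x1234]) produces the fragment <00411234> Tj.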
Also missing: support for Unicode text outside the BMP, using surrogate
pairs.
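Handling surrogate pairs would mean combining two 16-bit units into one code point before the font lookup. A minimal Python sketch of that step, with a hypothetical function name and with unpaired surrogates simply passed through unchanged (a real implementation may want to reject or replace them):

```python
from typing import List


def combine_surrogates(units: List[int]) -> List[int]:
    """Turn UTF-16 code units into Unicode code points."""
    out: List[int] = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one code point
            # outside the BMP.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP character (or an unpaired surrogate)
            i += 1
    return out
```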
I've done some limited testing with these patches (in evince), and it
definitely works better for me than before. However, I don't currently
have PDFs for testing many features, so pointers to any good test forms
are appreciated!
Features tested:
- Accented characters; typographic characters such as bullets, quotes
- Left, center, right alignment of single-line fields
- Checkboxes work as before
- Single-line comb fields still work
Not tested:
- Multi-line fields (my test form doesn't have them)
- Form fields with composite fonts (no test forms; code still needs a
tiny bit of work)
--Michael Vrable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: character-encoding-fixes.patch
Type: text/x-diff
Size: 1416 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080202/077be27b/attachment-0002.patch
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode-forms-support.patch
Type: text/x-diff
Size: 22376 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20080202/077be27b/attachment-0003.patch