[poppler] reconsideration about iconv() in cpp frontend

Sun Apr 15 03:08:03 UTC 2018

Hi,

Now I'm reworking with cpp-frontend encoding issue, and found that:
previous patch fixes the issue for text drawn in a PDF, but does
not fix the broken metadata if it is coded by PDFDocEncoding
(sample PDF is attached).

During the rework, I'm becoming questionable about the usage of
iconv() in cpp frontend, by following reasons.

1) currently, the usage of iconv() in cpp frontend is only the
conversion of Unicode: UTF-8 <-> UTF-16. This is completely
mathematical conversion, using iconv() could be overkill.

in fact, poppler has its own conversion routines (see UTF.h),
which are independent from iconv().

2) iconv() requires the allocated buffer, but does not help
the calculation of the required buffer size.

because this difficulty, current usages of iconv() have ugly
fallbacks for the cases that prepared buffer is too short,
like this:

    size_t ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char,
&str_data, &str_len_left);
    if ((ir == (size_t)-1) && (errno == E2BIG)) {
        const size_t delta = str_data - &str[0];
        str_len_left += str.size();
        str.resize(str.size() * 2);
        str_data = &str[delta];
        ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char, &str_data,
&str_len_left);
        if (ir == (size_t)-1) {
            return byte_array();
        }
    }
    str.resize(str.size() - str_len_left);
    return str;

maybe due to this difficulty, glib or Qt frontends do not use
iconv() directly, they use better utility functions in these
frameworks.

the functions in UTF.h have the functions to calculate the
required buffer size from given UTF-8 or UTF-16 data.

3) iconv() receives the buffer filled by, or to be filled by
UTF-16 data. currently, casting by reinterpret_cast<char *>()
is used to interchange between ustring (the array of unsigned
short) and a byte buffer, but it is not theoretically portable.
On the systems whose short is greater than 16-bit, or its
alignment is greater than 16-bit, the casted byte buffer would
have unexpected data from MSB bits or padding, and it could
make iconv() confused. using uint16_t (as the functions in UTF.h)
would be safer.

To do in a theoretically portable way, we should allocate the
byte buffer for UTF-16 data, and copy to or copy from it,
instead of the casting. This would be uneasy for the maintenance.

==

How could I resolve this issue? There might be a few candidates.

A) use the functions in UTF.h, and change ustring class
to fit them.
- pros: easiest maintainability, minimum duplication.
- cons: API changed, and the memory management by gmem.h
would be introduced in cpp frontend, it would be less
consistent in cpp frontend internal.

B) rewrite the functions in UTF.h by template programming
and have the interfaces fitting to current ustring().
- pros: minimum duplication, good consistency, API kept.
- cons: maintainability would be decreased slightly.

C) write diversions of the functions in UTF.h in cpp frontend.
- pros: acceptable maintainability, good consistency, API kept.
- cons: duplication of almost same code into 2 places.

I hope to hear the comments from experts.

Regards,
mpsuzuki
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PDFDocEncoding.pdf
Type: application/pdf
Size: 13619 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180415/f294f088/attachment.pdf>