[poppler] reconsideration about iconv() in cpp frontend

Adam Reichold adam.reichold at t-online.de
Sun Apr 15 09:15:56 UTC 2018


Hello,

Am 15.04.2018 um 05:08 schrieb suzuki toshiya:
> Hi,
> 
> Now I'm reworking with cpp-frontend encoding issue, and found that:
> previous patch fixes the issue for text drawn in a PDF, but does
> not fix the broken metadata if it is coded by PDFDocEncoding
> (sample PDF is attached).
> 
> During the rework, I'm becoming questionable about the usage of
> iconv() in cpp frontend, by following reasons.
> 
> 1) currently, the usage of iconv() in cpp frontend is only the
> conversion of Unicode: UTF-8 <-> UTF-16. This is completely
> mathematical conversion, using iconv() could be overkill.
> 
> in fact, poppler has its own conversion routines (see UTF.h),
> which are independent from iconv().
> 
> 2) iconv() requires the allocated buffer, but does not help
> the calculation of the required buffer size.
> 
> because this difficulty, current usages of iconv() have ugly
> fallbacks for the cases that prepared buffer is too short,
> like this:
> 
>     size_t ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char,
> &str_data, &str_len_left);
>     if ((ir == (size_t)-1) && (errno == E2BIG)) {
>         const size_t delta = str_data - &str[0];
>         str_len_left += str.size();
>         str.resize(str.size() * 2);
>         str_data = &str[delta];
>         ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char, &str_data,
> &str_len_left);
>         if (ir == (size_t)-1) {
>             return byte_array();
>         }
>     }
>     str.resize(str.size() - str_len_left);
>     return str;
> 
> maybe due to this difficulty, glib or Qt frontends do not use
> iconv() directly, they use better utility functions in these
> frameworks.
> 
> the functions in UTF.h have the functions to calculate the
> required buffer size from given UTF-8 or UTF-16 data.
> 
> 3) iconv() receives the buffer filled by, or to be filled by
> UTF-16 data. currently, casting by reinterpret_cast<char *>()
> is used to interchange between ustring (the array of unsigned
> short) and a byte buffer, but it is not theoretically portable.
> On the systems whose short is greater than 16-bit, or its
> alignment is greater than 16-bit, the casted byte buffer would
> have unexpected data from MSB bits or padding, and it could
> make iconv() confused. using uint16_t (as the functions in UTF.h)
> would be safer.
> 
> To do in a theoretically portable way, we should allocate the
> byte buffer for UTF-16 data, and copy to or copy from it,
> instead of the casting. This would be uneasy for the maintenance.
> 
> ==
> 
> How could I resolve this issue? There might be a few candidates.
> 
> A) use the functions in UTF.h, and change ustring class
> to fit them.
> - pros: easiest maintainability, minimum duplication.
> - cons: API changed, and the memory management by gmem.h
> would be introduced in cpp frontend, it would be less
> consistent in cpp frontend internal.
> 
> B) rewrite the functions in UTF.h by template programming
> and have the interfaces fitting to current ustring().
> - pros: minimum duplication, good consistency, API kept.
> - cons: maintainability would be decreased slightly.
> 
> C) write diversions of the functions in UTF.h in cpp frontend.
> - pros: acceptable maintainability, good consistency, API kept.
> - cons: duplication of almost same code into 2 places.

I think what is best here is hard to decide without seeing actual code
that at least starts in one of these directions.

Concerning ustring, I think an API change is not possible except for
changing unsigned short to uint16_t since it already was broken where
sizeof(unsigned short) != 2.

Best regards,
Adam

> I hope to hear the comments from experts.
> 
> Regards,
> mpsuzuki
> 
> 
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180415/4649388e/attachment-0001.sig>


More information about the poppler mailing list