[poppler] BOM and endianness of poppler::ustring class

Sun Jul 24 13:49:30 PDT 2011

Pino, can I expect you to handle this, or do I need to step in?

Albert

On Friday, 10 June 2011, suzuki toshiya wrote:
> Hi poppler-cpp maintainers,
> 
> During an offline discussion with Sun Ho Park, I found some
> points in poppler-cpp that are difficult to understand, concerning
> the poppler::ustring class. Before drafting any patches, let me
> ask whether the behaviors I found are by design or not.
> 
> 1) Should a poppler::ustring object always have a BOM, sometimes have one, or never have one?
> 
> Consider the implementation of poppler::ustring::from_latin1():
> 
> ustring ustring::from_latin1(const std::string &str)
> {
>     const size_type l = str.size();
>     if (!l) {
>         return ustring();
>     }
>     const char *c = str.data();
>     ustring ret(l, 0);
>     for (size_type i = 0; i < l; ++i) {
>         ret[i] = *c++;
>     }
>     return ret;
> }
> 
> I think no BOM is inserted.
> 
> On the other hand, poppler::ustring::from_utf8() uses iconv() to
> convert to UTF-16, like this:
> 
> ustring ustring::from_utf8(const char *str, int len)
> {
>     if (len <= 0) {
>         len = std::strlen(str);
>         if (len <= 0) {
>             return ustring();
>         }
>     }
> 
>     MiniIconv ic("UTF-16", "UTF-8");
>     if (!ic.is_valid()) {
>         return ustring();
>     }
> 
>     ustring ret(len * 2, 0);
>     char *ret_data = reinterpret_cast<char *>(&ret[0]);
>     char *str_data = const_cast<char *>(str);
>     size_t str_len_char = len;
>     size_t ret_len_left = ret.size();
>     size_t ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
>     if ((ir == (size_t)-1) && (errno == E2BIG)) {
>         const size_t delta = ret_data - reinterpret_cast<char *>(&ret[0]);
>         ret_len_left += ret.size();
>         ret.resize(ret.size() * 2);
>         ret_data = reinterpret_cast<char *>(&ret[delta]);
>         ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
>         if (ir == (size_t)-1) {
>             return ustring();
>         }
>     }
>     if (ret_len_left >= 0) {
>         ret.resize(ret.size() - ret_len_left);
>     }
> 
>     return ret;
> }
> 
> The encoding conversion is done by iconv(), but the output encoding is
> specified simply as "UTF-16". Because that name does not specify the
> byte order, some iconv() implementations insert a BOM at the beginning
> of the output. So some ustring objects can have a BOM while others
> have none. Is this ambiguity an intended feature?
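
To illustrate the BOM question above, here is a minimal sketch of how a leading BOM could be detected and dropped from a buffer of UTF-16 code units. The names Bom, detect_bom() and strip_bom() are hypothetical and are not part of poppler-cpp; the sketch only assumes that the code units are held as unsigned short, as the message describes.

```cpp
#include <vector>

// Hypothetical helpers, not part of poppler-cpp.
// Classify the first code unit of a UTF-16 buffer.
enum class Bom { None, Native, Swapped };

Bom detect_bom(const std::vector<unsigned short> &units)
{
    if (units.empty()) {
        return Bom::None;
    }
    if (units[0] == 0xFEFF) {
        return Bom::Native;   // BOM read in the expected byte order
    }
    if (units[0] == 0xFFFE) {
        return Bom::Swapped;  // BOM read with its two bytes reversed
    }
    return Bom::None;
}

// Return a copy of the buffer without a leading BOM, if there was one.
std::vector<unsigned short> strip_bom(std::vector<unsigned short> units)
{
    if (detect_bom(units) != Bom::None) {
        units.erase(units.begin());
    }
    return units;
}
```

A Swapped result would also hint that the remaining code units were produced in the byte order opposite to the one the caller expects.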
> 
> 2) Is poppler::ustring designed to be endian-independent?
> 
> The native unit of the poppler::ustring class is an unsigned short.
> As seen in the iconv() invocation in poppler::ustring::from_utf8(),
> the byte order of the content in ret_data[] depends on the
> implementation of iconv(), because the buffer is treated as an array
> of char after the reinterpret_cast. Thus the content of a
> poppler::ustring can depend on both iconv() and the endianness of
> the architecture.
> 
> On my GNU/Linux system running on i386 with the iconv in GNU libc,
> the iconv( "UTF-16", "UTF-8" ) conversion produces a little-endian
> UTF-16 byte sequence. It is written to an array of char and then
> read as an array of unsigned short (stored in little-endian order),
> so the Unicode values are correct.
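
The dependence on byte order can be shown without iconv() at all. The following is a small sketch, with hypothetical helper names, that assembles one UTF-16 code unit from the same two bytes under each byte order. Reading the buffer through a cast to unsigned short behaves like one of these two functions, depending on the host.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.
// Interpret two consecutive bytes as a little-endian UTF-16 code unit.
std::uint16_t unit_from_le(unsigned char b0, unsigned char b1)
{
    return static_cast<std::uint16_t>(b0 | (b1 << 8));
}

// Interpret the same two bytes as a big-endian UTF-16 code unit.
std::uint16_t unit_from_be(unsigned char b0, unsigned char b1)
{
    return static_cast<std::uint16_t>((b0 << 8) | b1);
}
```

For the bytes 0x41 0x00, unit_from_le() gives 0x0041 (the letter 'A'), while unit_from_be() gives 0x4100, which is a different character.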
> 
> But if the byte order that iconv() uses for UTF-16 differs from
> that of the native architecture, the conversion yields invalid
> values. For example, with a manually installed GNU libiconv 1.8,
> the default UTF-16 is big endian. I'm not sure whether there is a
> widely accepted standard that requires iconv()'s UTF-16 to
> match the endianness of the system architecture,
> so I think checking for consistency is safer and would improve
> the portability of poppler-cpp. (In any case, poppler-cpp is
> disabled on systems without a native iconv().)
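
One possible way to make the conversion consistent, sketched here under assumptions, is to request an explicit byte order that matches the host instead of the plain "UTF-16" name. The functions host_is_little_endian() and native_utf16_name() are hypothetical. The encoding names "UTF-16LE" and "UTF-16BE" are, as far as I know, accepted by both GNU libc and GNU libiconv, but that should be verified for any other iconv() implementation.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: detect the host byte order at run time by
// looking at the first byte of a known 16-bit value.
bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x02;
}

// Hypothetical helper: name of the UTF-16 variant whose byte order
// matches the host, intended as the target encoding for iconv_open().
const char *native_utf16_name()
{
    return host_is_little_endian() ? "UTF-16LE" : "UTF-16BE";
}
```

In from_utf8(), the returned name would take the place of the "UTF-16" argument passed to MiniIconv. With an explicit byte order, the implementations I am aware of also do not emit a BOM, which would address question 1 as well.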
> 
> Also, it would be possible to take care of unusual systems
> whose "short" is not 16 bits wide.
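
The width of "short" can be checked at compile time. This is only a sketch: short_is_16_bit() and the code_unit alias are hypothetical names, and whether poppler-cpp should fail the build or switch to a fixed-width type on such systems is a separate design decision.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical check: true when unsigned short has exactly 16 value bits.
constexpr bool short_is_16_bit()
{
    return std::numeric_limits<unsigned short>::digits == 16;
}

// Hypothetical alternative: a fixed-width code unit type, which exists
// only on platforms that provide an exact 16-bit unsigned integer.
using code_unit = std::uint16_t;

static_assert(std::numeric_limits<code_unit>::digits == 16,
              "UTF-16 code units must be exactly 16 bits wide");
```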
> 
> Regards,
> mpsuzuki