[poppler] BOM and endianness of poppler::ustring class
Albert Astals Cid
aacid at kde.org
Sun Jul 24 13:49:30 PDT 2011
Pino, can I expect you to handle this, or do I need to step in?
Albert
On Friday, 10 June 2011, suzuki toshiya wrote:
> Hi poppler-cpp maintainers,
>
> During an offline discussion with Sun Ho Park, I found a few
> points in poppler-cpp that are hard to understand, concerning the
> poppler::ustring class. Before drafting any patches, I would like
> to ask whether the behaviors I found are by design or not.
>
> 1) Should a poppler::ustring object always have a BOM, sometimes have one, or never have one?
>
> Here is the implementation of poppler::ustring::from_latin1():
>
> ustring ustring::from_latin1(const std::string &str)
> {
>     const size_type l = str.size();
>     if (!l) {
>         return ustring();
>     }
>     const char *c = str.data();
>     ustring ret(l, 0);
>     for (size_type i = 0; i < l; ++i) {
>         ret[i] = *c++;
>     }
>     return ret;
> }
>
> I think no BOM is inserted.
>
> On the other hand, poppler::ustring::from_utf8() uses iconv()
> to convert to UTF-16, like this:
>
> ustring ustring::from_utf8(const char *str, int len)
> {
>     if (len <= 0) {
>         len = std::strlen(str);
>         if (len <= 0) {
>             return ustring();
>         }
>     }
>
>     MiniIconv ic("UTF-16", "UTF-8");
>     if (!ic.is_valid()) {
>         return ustring();
>     }
>
>     ustring ret(len * 2, 0);
>     char *ret_data = reinterpret_cast<char *>(&ret[0]);
>     char *str_data = const_cast<char *>(str);
>     size_t str_len_char = len;
>     size_t ret_len_left = ret.size();
>     size_t ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
>     if ((ir == (size_t)-1) && (errno == E2BIG)) {
>         const size_t delta = ret_data - reinterpret_cast<char *>(&ret[0]);
>         ret_len_left += ret.size();
>         ret.resize(ret.size() * 2);
>         ret_data = reinterpret_cast<char *>(&ret[delta]);
>         ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
>         if (ir == (size_t)-1) {
>             return ustring();
>         }
>     }
>     if (ret_len_left >= 0) {
>         ret.resize(ret.size() - ret_len_left);
>     }
>
>     return ret;
> }
>
> The encoding conversion is done by iconv(), but the output encoding is
> specified simply as "UTF-16". Because that name does not specify the
> byte order, some iconv() implementations insert a BOM at the beginning.
> So some ustring objects can have a BOM while others have none. Is this
> ambiguity by design?
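
To illustrate the BOM question, here is a minimal sketch of how a buffer of UTF-16 code units could be normalized after conversion. The helper name strip_bom is hypothetical and is not part of poppler-cpp; it assumes the units are held in a std::vector of 16-bit values rather than in a ustring.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of poppler-cpp.
// Drops a leading BOM. If the BOM reads as 0xFFFE, the data was written in
// the opposite byte order, so every remaining unit is byte-swapped.
std::vector<std::uint16_t> strip_bom(std::vector<std::uint16_t> units)
{
    if (units.empty()) {
        return units;
    }
    const std::uint16_t first = units.front();
    if (first == 0xFEFF) {
        // BOM in native byte order: just remove it.
        units.erase(units.begin());
    } else if (first == 0xFFFE) {
        // BOM in swapped byte order: remove it and swap the rest.
        units.erase(units.begin());
        for (std::uint16_t &u : units) {
            u = static_cast<std::uint16_t>((u << 8) | (u >> 8));
        }
    }
    // No BOM: the units are returned unchanged.
    return units;
}
```

A buffer without a BOM is left untouched, so this only removes the ambiguity when a BOM is actually present.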
>
> 2) Is poppler::ustring designed to be endian-independent?
>
> The native unit of the poppler::ustring class is an unsigned short.
> As seen in the iconv() invocation in poppler::ustring::from_utf8(),
> the byte order of the content in ret_data[] depends on the
> implementation of iconv(), because the buffer is written as an array
> of char after reinterpret_cast. Thus, the content of a poppler::ustring
> can depend on both iconv() and the endianness of the architecture.
>
> On my GNU/Linux system running on i386 with the iconv in GNU libc,
> the iconv( "UTF-16", "UTF-8" ) conversion produces a little-endian
> UTF-16 byte sequence. Written to an array of char and then read as
> an array of unsigned short (stored in little-endian order), it
> yields the correct Unicode values.
>
> But if the byte order iconv() uses for UTF-16 differs from that of
> the native architecture, the conversion produces invalid values.
> For example, if I use a manually installed GNU libiconv 1.8,
> its default UTF-16 is big endian. I'm not sure whether there is a
> widely accepted standard that requires UTF-16 in iconv() to
> match the endianness of the system architecture,
> so I think checking for consistency is safer and will improve
> the portability of poppler-cpp. (Anyway, poppler-cpp is
> disabled on systems without a native iconv().)
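
One possible form of the consistency check is sketched below: detect the host byte order at run time and pick an encoding name with an explicit byte order. Both helper names are hypothetical. The names "UTF-16LE" and "UTF-16BE" are accepted by GNU libc iconv and GNU libiconv, but whether every native iconv() supports them is an assumption that would need checking.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when the host stores the low byte first.
bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x02;
}

// Hypothetical helper: an iconv encoding name whose byte order matches
// the host, for use in place of the plain "UTF-16" name.
const char *native_utf16_name()
{
    return host_is_little_endian() ? "UTF-16LE" : "UTF-16BE";
}
```

Passing the returned name as the target encoding would make the byte order explicit, so the result would no longer depend on which default a given iconv() implementation chooses for plain "UTF-16".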
>
> It may also be worth taking care of unusual systems
> whose "short" is not 16 bits wide.
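
The width assumption could be made explicit with a compile-time check along these lines. This is only a sketch: it assumes a C++11 compiler, and short_bits is a hypothetical helper that is not part of poppler-cpp.

```cpp
#include <climits>

// Fails to compile on a system where unsigned short is not 16 bits wide.
static_assert(sizeof(unsigned short) * CHAR_BIT == 16,
              "unsigned short is not 16 bits wide");

// Hypothetical helper: the width of unsigned short in bits.
constexpr unsigned short_bits()
{
    return sizeof(unsigned short) * CHAR_BIT;
}
```

With such a check, a build on an unusual system stops with a clear message instead of producing strings with the wrong unit size.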
>
> Regards,
> mpsuzuki