[poppler] BOM and endianness of poppler::ustring class

Fri Jun 10 01:13:16 PDT 2011

Hi poppler-cpp maintainers,

During an offline discussion with Sun Ho Park, I found some
points in poppler-cpp that are difficult to understand, all
concerning the poppler::ustring class. Before drafting any
patches, please let me ask whether the behaviors described
below are by design or not.

1) Should a poppler::ustring object always have a BOM, sometimes have one, or never have one?

The implementation of poppler::ustring::from_latin1() looks
like this:

ustring ustring::from_latin1(const std::string &str)
{
    const size_type l = str.size();
    if (!l) {
        return ustring();
    }
    const char *c = str.data();
    ustring ret(l, 0);
    for (size_type i = 0; i < l; ++i) {
        ret[i] = *c++;
    }
    return ret;
}

As far as I can see, no BOM is inserted here.

On the other hand, poppler::ustring::from_utf8() uses iconv()
for the code conversion to UTF-16, like this:

ustring ustring::from_utf8(const char *str, int len)
{
    if (len <= 0) {
        len = std::strlen(str);
        if (len <= 0) {
            return ustring();
        }
    }

    MiniIconv ic("UTF-16", "UTF-8");
    if (!ic.is_valid()) {
        return ustring();
    }

    ustring ret(len * 2, 0);
    char *ret_data = reinterpret_cast<char *>(&ret[0]);
    char *str_data = const_cast<char *>(str);
    size_t str_len_char = len;
    size_t ret_len_left = ret.size();
    size_t ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
    if ((ir == (size_t)-1) && (errno == E2BIG)) {
        const size_t delta = ret_data - reinterpret_cast<char *>(&ret[0]);
        ret_len_left += ret.size();
        ret.resize(ret.size() * 2);
        ret_data = reinterpret_cast<char *>(&ret[delta]);
        ir = iconv(ic, (ICONV_CONST char **)&str_data, &str_len_char, &ret_data, &ret_len_left);
        if (ir == (size_t)-1) {
            return ustring();
        }
    }
    if (ret_len_left >= 0) {
        ret.resize(ret.size() - ret_len_left);
    }

    return ret;
}

The encoding conversion is done by iconv(), but the output encoding is
specified simply as "UTF-16". Because this name does not specify the
endianness, some iconv() implementations insert a BOM at the beginning
of the output. So some ustring objects can have a BOM while others have
none. Is this ambiguity a designed feature?
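
To make the BOM question concrete, here is a minimal sketch of how
a caller could cope with a string that may or may not start with a
BOM. It is not poppler-cpp code: detect_bom(), normalize_utf16() and
the Bom enum are hypothetical names, std::u16string stands in for
poppler::ustring, and a C++11 compiler is assumed.

```cpp
#include <string>

// Hypothetical helpers for illustration only; not part of poppler-cpp.
enum class Bom { None, Native, Swapped };

// Classify the first code unit: U+FEFF read in host byte order is a BOM,
// 0xFFFE is the same BOM read with the opposite byte order.
inline Bom detect_bom(const std::u16string &s)
{
    if (s.empty()) {
        return Bom::None;
    }
    if (s[0] == 0xFEFF) {
        return Bom::Native;
    }
    if (s[0] == 0xFFFE) {
        return Bom::Swapped;
    }
    return Bom::None;
}

// Drop a leading BOM; if it was byte-swapped, swap every remaining unit
// so the result is in host byte order. Strings without a BOM are returned
// unchanged, because their byte order cannot be determined here.
inline std::u16string normalize_utf16(std::u16string s)
{
    const Bom bom = detect_bom(s);
    if (bom == Bom::None) {
        return s;
    }
    s.erase(0, 1);
    if (bom == Bom::Swapped) {
        for (char16_t &u : s) {
            u = static_cast<char16_t>(((u << 8) | (u >> 8)) & 0xFFFF);
        }
    }
    return s;
}
```

Note that a string without a BOM gives no hint about its byte
order, so this kind of cleanup only helps when the BOM is present.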

2) Is poppler::ustring designed to be endian-independent?

The native unit of the poppler::ustring class is an unsigned short.
As seen in the iconv() invocation in poppler::ustring::from_utf8(),
the byte order of the content written to ret_data[] depends on the
implementation of iconv(), because the buffer is handed to iconv()
as an array of char after reinterpret_cast. Thus the content of a
poppler::ustring can depend on both iconv() and the endianness of
the architecture.
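
The byte-order dependence can be shown without iconv() at all. The
following is only a sketch: units_from_bytes() is a hypothetical
helper (not in poppler-cpp) that builds 16-bit units from a byte
buffer with an explicitly stated byte order, again using
std::u16string in place of poppler::ustring.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of poppler-cpp.
// Combines each pair of bytes into one 16-bit unit using the byte order
// the caller states, so the result does not depend on the host CPU.
// A trailing odd byte, if any, is ignored.
inline std::u16string units_from_bytes(const unsigned char *bytes,
                                       std::size_t n_bytes,
                                       bool big_endian)
{
    std::u16string out;
    out.reserve(n_bytes / 2);
    for (std::size_t i = 0; i + 1 < n_bytes; i += 2) {
        const unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        const unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

The same two bytes 0x00 0x41 give U+0041 when read as big endian
and U+4100 when read as little endian, which is the kind of wrong
value a mismatched reinterpret_cast would expose.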

On my GNU/Linux system running on i386 with the iconv in GNU libc,
the iconv( "UTF-16", "UTF-8" ) conversion produces a little-endian
UTF-16 byte sequence. When it is written to an array of char and
then read as an array of unsigned short (stored in little-endian
order), the Unicode values are correct.

But if the byte order that iconv() uses for UTF-16 differs from
that of the native architecture, the conversion produces invalid
values. For example, with a manually installed GNU libiconv 1.8,
the default UTF-16 is big endian. I'm not sure whether there is a
widely accepted standard that requires iconv()'s UTF-16 to match
the endianness of the system architecture, so I think checking
the consistency is safer and would improve the portability of
poppler-cpp. (In any case, poppler-cpp is disabled on systems
without a native iconv().)
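
One possible way to get this consistency, given only as a sketch, is
to ask iconv() for an encoding name with explicit endianness instead
of plain "UTF-16". The helpers host_is_little_endian() and
native_utf16_name() below are hypothetical, not existing poppler-cpp
functions. I am assuming that the iconv() in use accepts the names
"UTF-16LE" and "UTF-16BE" (GNU libc and GNU libiconv do) and that it
emits no BOM for them; other implementations should be checked.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not part of poppler-cpp.

// Inspect the first byte of a 16-bit value in memory: on a little-endian
// host the low-order byte comes first.
inline bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0001;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x01;
}

// Encoding name whose byte order matches the host, intended to be passed
// as the target encoding in place of the ambiguous "UTF-16".
inline const char *native_utf16_name()
{
    return host_is_little_endian() ? "UTF-16LE" : "UTF-16BE";
}
```

With this, the MiniIconv object in from_utf8() could be created
with native_utf16_name() as the target encoding, so the bytes
written by iconv() would already be in the order that the unsigned
short units expect.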

Also, it would be possible to take care of unusual systems
whose "short" is not 16 bits wide.
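
For the width of "short", a compile-time check would be enough to
catch such systems early. This is just a sketch assuming a C++11
compiler; is_exact_16_bit_unit() is a hypothetical name, not an
existing poppler-cpp function.

```cpp
#include <climits>
#include <limits>

// Hypothetical check for illustration only; not part of poppler-cpp.
// True only for unsigned integer types that occupy exactly 16 bits of
// storage and use all 16 of them as value bits.
template <typename T>
constexpr bool is_exact_16_bit_unit()
{
    return std::numeric_limits<T>::is_integer
        && !std::numeric_limits<T>::is_signed
        && std::numeric_limits<T>::digits == 16
        && sizeof(T) * CHAR_BIT == 16;
}

// Refuse to compile on a system where the assumption does not hold.
static_assert(is_exact_16_bit_unit<unsigned short>(),
              "the string unit type must be exactly 16 bits wide");
```

On such a system the build would stop with a clear message instead
of silently producing strings with a wrong unit size.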

Regards,
mpsuzuki