[poppler] reconsideration about iconv() in cpp frontend
Adam Reichold
adam.reichold at t-online.de
Sat May 5 06:22:39 UTC 2018
Hello mpsuzuki,
Am 26.04.2018 um 19:03 schrieb suzuki toshiya:
> Dear Adam,
>
> Thank you for review & comment, and sorry for lated response.
>
> Adam Reichold wrote:
>> I like it. Some minor remarks:
>>
>> * Is this the only user of iconv within Poppler? If so, we could remove
>> it from the build system as wel.
>
> OK.
>
>> * In "ustring::to_utf8":
>> - "utf16_end" is only used to compute the size once, maybe just use
>> "size()" to call "utf16ToUtf8"
>
> It is slightly different, it is used to compute the length without BOM.
> Also the first UTF-16 character to be passed to utf16ToUtf8() is changed
> to skip the BOM. To eliminate it, the code would be like this. This
> is better?
>
> byte_array ustring::to_utf8() const
> {
> if (!size()) {
> return byte_array();
> }
if (empty()) would be more direct.
> const uint16_t* utf16_buff = data();
> const bool hasBOM = (utf16_buff[0] == 0xFEFF);
Why not modify the pointer once?
if (utf16_buff[0] == 0xFEFF) {
++uft16_buff;
}
> const int utf8_len = utf16CountUtf8Bytes(utf16_buff + (hasBOM ? 1 : 0));
>
> byte_array ret(utf8_len + 1, '\0');
> utf8_len = utf16ToUtf8(utf16_buff + (hasBOM ? 1 : 0),
> reinterpret_cast<char *>(ret.data()),
> utf8_len + 1,
> size() - (hasBOM ? 1 : 0));
> ret.resize(utf8_len);
>
> return ret;
> }
>
>> - "utf16ToUtf8" returns how many bytes were written, so no need to
>> call "std::strlen" again
>
> OK.
>
>> * In "ustring::to_latin1":
>> - for readability, I'd suggest either iterating both "me" and "ret"
>> using a parallel index "i" or via pointer arithmetic, but not mixing it
>
> how about this?
> std::string ustring::to_latin1() const
> {
> if (!size()) {
> return std::string();
> }
>
> const bool hasBOM = (*data() == 0xFEFF);
> std::string ret(size() + (hasBOM ? 0 : 1), '\0');
> iterator itl_utf16 = begin() + (hasBOM ? 1 : 0);
> iterator itl_ret = ret.begin();
>
> for (; itl_utf16 < end(); itl_utf16++, itl_latin1++) {
> *itl_ret = (char)*itl_utf16;
> }
Again, I would suggest just modifing a copy of data() to skip the first
element if it is a BOM. Also iterators are idiomatically checked using
!=, i.e. itl_utf16 != end().
Best regards,
Adam.
> return ret;
> }
>
>
>> * In "has_bom_utf8":
>> - The spacing of the first "if" is a bit off
>> - Instead of "if(foo) { return true; } return false;", just do "return
>> foo;"
>
> OK.
>
>> * In "ustring::from_utf8":
>> - Three times "skip BOM" is probably a bit on the noisy side ;-)
>
> OK.
>
>> - Why no additional call to "ret.resize" as for "to_utf8"? Or why the
>> additional call there?
>
> ret.resize() added.
>
>> Just for my understanding, this also contains another important fix that
>> is unrelated to the iconv usage?
>>
>> - u = data[i] & 0xff;
>> + u = pdfDocEncoding[ *data & 0xff ];
>
> Ah, yes. I had once explained in the beginning of this thread
> for reworking, but slipped to mention it for the patch.
>
>> Maybe the changes to "detail::unicode_GooString_to_ustring" could be a
>> separate commit?
>
> OK, it is separated to fix-cpp-frontend-endian-issue_20180427-0130-b.
> Others are summarized in fix-cpp-frontend-endian-issue_20180427-0130-a.
>
> Regards,
> mpsuzuki
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180505/cc3ea990/attachment-0001.sig>
More information about the poppler
mailing list