[poppler] reconsideration about iconv() in cpp frontend

Sat May 5 06:22:39 UTC 2018

Hello mpsuzuki,

Am 26.04.2018 um 19:03 schrieb suzuki toshiya:
> Dear Adam,
> 
> Thank you for review & comment, and sorry for lated response.
> 
> Adam Reichold wrote:
>> I like it. Some minor remarks:
>>
>> * Is this the only user of iconv within Poppler? If so, we could remove
>> it from the build system as wel.
> 
> OK.
> 
>> * In "ustring::to_utf8":
>>   - "utf16_end" is only used to compute the size once, maybe just use
>> "size()" to call "utf16ToUtf8"
> 
> It is slightly different, it is used to compute the length without BOM.
> Also the first UTF-16 character to be passed to utf16ToUtf8() is changed
> to skip the BOM. To eliminate it, the code would be like this. This
> is better?
> 
> byte_array ustring::to_utf8() const
> {
>     if (!size()) {
>         return byte_array();
>     }

if (empty()) would be more direct.

>     const uint16_t* utf16_buff = data();
>     const bool hasBOM = (utf16_buff[0] == 0xFEFF);

Why not modify the pointer once?

if (utf16_buff[0] == 0xFEFF) {
  ++uft16_buff;
}

>     const int utf8_len = utf16CountUtf8Bytes(utf16_buff + (hasBOM ? 1 : 0));
> 
>     byte_array ret(utf8_len + 1, '\0');
>     utf8_len = utf16ToUtf8(utf16_buff + (hasBOM ? 1 : 0),
>                         reinterpret_cast<char *>(ret.data()),
>                         utf8_len + 1,
>                         size() - (hasBOM ? 1 : 0));
>     ret.resize(utf8_len);
> 
>     return ret;
> }
> 
>>   - "utf16ToUtf8" returns how many bytes were written, so no need to
>> call "std::strlen" again
> 
> OK.
> 
>> * In "ustring::to_latin1":
>>   - for readability, I'd suggest either iterating both "me" and "ret"
>> using a parallel index "i" or via pointer arithmetic, but not mixing it
> 
> how about this?
> std::string ustring::to_latin1() const
> {
>     if (!size()) {
>         return std::string();
>     }
> 
>     const bool hasBOM = (*data() == 0xFEFF);
>     std::string ret(size() + (hasBOM ? 0 : 1), '\0');
>     iterator itl_utf16 = begin() + (hasBOM ? 1 : 0);
>     iterator itl_ret = ret.begin();
> 
>     for (; itl_utf16 < end(); itl_utf16++, itl_latin1++) {
>         *itl_ret = (char)*itl_utf16;
>     }

Again, I would suggest just modifing a copy of data() to skip the first
element if it is a BOM. Also iterators are idiomatically checked using
!=, i.e. itl_utf16 != end().

Best regards,
Adam.

>     return ret;
> }
> 
> 
>> * In "has_bom_utf8":
>>   - The spacing of the first "if" is a bit off
>>   - Instead of "if(foo) { return true; } return false;", just do "return
>> foo;"
> 
> OK.
> 
>> * In "ustring::from_utf8":
>>   - Three times "skip BOM" is probably a bit on the noisy side ;-)
> 
> OK.
> 
>>   - Why no additional call to "ret.resize" as for "to_utf8"? Or why the
>> additional call there?
> 
> ret.resize() added.
> 
>> Just for my understanding, this also contains another important fix that
>> is unrelated to the iconv usage?
>>
>> -            u = data[i] & 0xff;
>> +            u = pdfDocEncoding[ *data & 0xff ];
> 
> Ah, yes. I had once explained in the beginning of this thread
> for reworking, but slipped to mention it for the patch.
> 
>> Maybe the changes to "detail::unicode_GooString_to_ustring" could be a
>> separate commit?
> 
> OK, it is separated to fix-cpp-frontend-endian-issue_20180427-0130-b.
> Others are summarized in fix-cpp-frontend-endian-issue_20180427-0130-a.
> 
> Regards,
> mpsuzuki
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180505/cc3ea990/attachment-0001.sig>