[poppler] reconsideration about iconv() in cpp frontend

Mon Apr 16 17:11:01 UTC 2018

Hello again,

I really think that poppler::ustring defined as

std::basic_string<unsigned short>

is already broken where sizeof(unsigned short) != 2 and hence should be
redefined to

std::basic_string<std::uint16_t>

without considering that an API break. This then also allows to just use
these buffers acquired using ustring::data to be used with the
conversion functions from UTF.h.

Best regards, Adam.

Am 16.04.2018 um 18:44 schrieb suzuki toshiya:
> Adam Reichold wrote:
>>> A) use the functions in UTF.h, and change ustring class
>>> to fit them.
> 
>>> B) rewrite the functions in UTF.h by template programming
>>> and have the interfaces fitting to current ustring().
> 
>>> C) write diversions of the functions in UTF.h in cpp frontend.
>>
>> I think what is best here is hard to decide without seeing actual code
>> that at least starts in one of these directions.
> 
> Indeed. In fact, yesterday I was trying to write a sample
> patch for B) direction, but I got stuck. Because current UTF.h
> is written in C-style (*p++ = c style copy/concatenation).
> Thus, if I try to pass std::basic_string or std::vector
> object, I have to make yet another template (and different
> implementations) for the concatenation.
> 
> Following is current my proposal, which still use reinterpret_cast<>(),
> but if it is not good (e.g. unsigned short != uint16_t), it allocates
> the buffer and free it internally. By replacing iconv() by UTF.h
> functions, I keep from the cast between unsigned short (with unknown
> byte order) array and char array.
> 
> Please let me explain...
> 
>> --- a/cpp/poppler-global.cpp
>> +++ b/cpp/poppler-global.cpp
>> @@ -26,6 +26,7 @@
>>  #include "poppler-private.h"
>>  
>>  #include "DateInfo.h"
>> +#include "UTF.h"
>>  
>>  #include <algorithm>
> 
> Include "UTF.h" to use UTF-8 <-> UTF-16 helpers in there.
> 
>> @@ -34,31 +35,8 @@
>>  #include <ios>
>>  #include <iostream>
>>  
>> -#include <iconv.h>
>> -
>>  #include "config.h"
>>  
>> -namespace
>> -{
>> -
>> -struct MiniIconv
>> -{
>> -    MiniIconv(const char *to_code, const char *from_code)
>> -        : i_(iconv_open(to_code, from_code))
>> -    {}
>> -    ~MiniIconv()
>> -    { if (is_valid()) iconv_close(i_); }
>> -    MiniIconv(const MiniIconv &) = delete;
>> -    MiniIconv& operator=(const MiniIconv &) = delete;
>> -    bool is_valid() const
>> -    { return i_ != (iconv_t)-1; }
>> -    operator iconv_t() const
>> -    { return i_; }
>> -    iconv_t i_;
>> -};
>> -
>> -}
>> -
> 
> Remove iconv() related functions.
> 
>> @@ -226,28 +204,31 @@ byte_array ustring::to_utf8() const
>>          return byte_array();
>>      }
>>  
>> -    MiniIconv ic("UTF-8", "UTF-16");
>> -    if (!ic.is_valid()) {
>> -        return byte_array();
> 
> Almost replacement of ustring::to_utf8(). See below.
> 
>> +#if 0xFFFFU == USHRT_MAX
>> +    const uint16_t* utf16_buff = reinterpret_cast<const uint16_t *>(data());
>> +#else
> 
> above is the popular case that uint16_t == unsigned int.
> simple casting is enough to get uint16_t array from
> ustring object.
> 
>> +    const uint16_t* utf16_begin = new uint16_t[utf16_len];
>> +    const uint16_t* utf16_buff = utf16_begin;
>> +    const unsigned short *me = data();
>> +   
>> +    for (int i = 0; i < size(); i++) {
>> +        utf16_buff[i] = (uint16_t)me[i];
>>      }
> 
> [snip the removal of existing code]
> 
>> +#endif
> 
> above is the irregular case that uint16_t != unsigned int.
> allocate temporal array of uint16_t and copy the contents.
> for skipping of BOM in the beginning, the heaf of allocated
> buffer is recorded in utf16_begin, but the UTF-16 data to
> be converted starts from utf16_buff.
> 
>> +    const uint16_t* utf16_end = utf16_buff + size();
>> +    if (utf16_buff[0] == 0xFEFF) {
>> +        utf16_buff ++;
>>      }
> 
> in abovem utf16_buff is shifted if the beginning of the
> buffer is BOM.
> 
>> -    str.resize(str.size() - str_len_left);
>> -    return str;
>> +
>> +    const int utf8_len = utf16CountUtf8Bytes(utf16_buff);
>> +    byte_array ret(utf8_len + 1, '\0');
> 
> count the required buffer size by utf16CountUtf8Bytes in UTF.h,
> and allocate the byte_array to be returned. To assue the null
> termination, the size is incremented and filled by '\0'.
> 
>> +    utf16ToUtf8(utf16_buff, reinterpret_cast<char *>(ret.data()), utf8_len + 1, utf16_end - utf16_buff);
>> +    ret.resize(std::strlen(ret.data()));
> 
> convert by utf16ToUtf8(), and shrink to fit converted result.
> 
>> +
>> +#if 0xFFFFU != USHRT_MAX
>> +    delete utf16_begin;
>> +#endif
>> +    return ret;
>>  }
> 
> if irregular case, the temporal buffer should be freed.
> 
>> @@ -256,15 +237,29 @@ std::string ustring::to_latin1() const
>>          return std::string();
>>      }
>>  
>> -    const size_type mylength = size();
>> -    std::string ret(mylength, '\0');
>>      const value_type *me = data();
>> -    for (size_type i = 0; i < mylength; ++i) {
>> +    const value_type *me_end = me + size();
>> +    if (*me == 0xFEFF) {
>> +        me ++;
>> +    }
>> +    std::string ret(me_end - me, '\0');
>> +    for (size_type i = 0; me < me_end; ++i) {
>>          ret[i] = (char)*me++;
>>      }
>>      return ret;
>>  }
> 
> original implementation of ustring::to_latin() does not
> care for BOM. now BOM is checked and skipped if there is.
> 
>> +static bool has_bom_utf8(const char *c, int len)
>> +{
>> +    if ( 3 > len ) {
>> +        return false;
>> +    }
>> +    if (c[0] == 0xEF && c[1] == 0xBB && c[2] == 0xBF) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>>  ustring ustring::from_utf8(const char *str, int len)
>>  {
>>      if (len <= 0) {
> 
> even if ustring::from_utf8 receives UTF-8 string with BOM,
> created ustring object should not include BOM. has_bom_utf8()
> is an utility function for it.
> 
>> @@ -273,31 +268,27 @@ ustring ustring::from_utf8(const char *str, int len)
>>              return ustring();
>>          }
>>      }
>> -
>> -    MiniIconv ic("UTF-16", "UTF-8");
>> -    if (!ic.is_valid()) {
>> -        return ustring();
> 
> remove iconv() related function.
> 
>> +    int utf16_count = utf8CountUtf16CodeUnits(str);
>> +    if (has_bom_utf8(str, len)) {
>> +        str += 3;       // skip BOM
>> +        len -= 3;       // skip BOM
>> +        utf16_count --; // skip BOM
>>      }
> 
> if UTF-8 starts with BOM, it is skiped, the length of
> UTF-8 buffer is decremented (3 octets), the length of
> UTF-16 buffer is decremented (16bit).
> 
>> +    ustring ret(utf16_count + 1, 0);
> 
> allocate ustring object to be returned. to assure the
> U+0000l termination, the size is incremented.
> 
>> +#if 0xFFFFU == USHRT_MAX
>> +    const uint16_t* utf16_buff = reinterpret_cast<const uint16_t *>(ret.data());
> 
> this is for popular case where uint16_t == unsigned short.
> simple cast is enough to make the array of uint16_t.
> 
>> +#else
>> +    const uint16_t* utf16_buff = new uint16_t[utf16_count + 1](0);
>> +#endif
> 
> this is for irregular case. allocate temporal buffer to store
> uint16_t data.
> 
>> +    utf8ToUtf16(str, (uint16_t*)utf16_buff, utf16_count + 1, len);
> 
> fill utf16_buff by converted UTF-16 data.
> 
>> +#if 0xFFFFU != USHRT_MAX
>> +    for (int i = 0; i < utf16_count; i ++) {
>> +        ret[i] = (unsigned short)utf16_buff[i];
>>      }
> 
> in irregular case, the content should be copied to the object
> to be returned.
> 
>> +    delete utf16_buff;
>> +#endif
> 
> the buffer for irregular case is freed here.
> 
>>      return ret;
>>  }
>>  
>> --- a/cpp/poppler-page.cpp
>> +++ b/cpp/poppler-page.cpp
>> @@ -355,7 +365,7 @@ std::vector<text_box> page::text_list() const
>>              TextWord *word = word_list->get(i);
>>  
>>              std::unique_ptr<GooString> gooWord{word->getText()};
>> -            ustring ustr = detail::unicode_GooString_to_ustring(gooWord.get());
>> +            ustring ustr = ustring::from_utf8(gooWord->getCString());
> 
> here, gooWord is the object created by TextOutputDev,
> so it is always UTF-8. it should be passed to ustring::from_utf8().
> 
>> --- a/cpp/poppler-private.cpp
>> +++ b/cpp/poppler-private.cpp
>> @@ -23,6 +23,7 @@
>>  #include "poppler-private.h"
>>  
>>  #include "GooString.h"
>> +#include "PDFDocEncoding.h"
>>  #include "Page.h"
>>  
>>  #include <ctime>
> 
> the strings for PDF metadata is UTF-16BE with BOM or PDFDocEncoding.
> to use PDFDocEncoding helper, include PDFDocEncoding.h
> 
>> @@ -58,16 +59,16 @@ rectf detail::pdfrectangle_to_rectf(const PDFRectangle &pdfrect)
>>  
>>  ustring detail::unicode_GooString_to_ustring(const GooString *str)
>>  {
>> -    const char *data = str->getCString();
>>      const int len = str->getLength();
>> +    const char *data = str->getCString();
>> +    const char *data_end = data + len;
> 
> input data may or may not have BOM, so the length of UTF-8 string
> to be converted may or may not be len. to clarify the end of buffer,
> data_end is calculated, instead of the index "i" in the existing
> implementation.
> 
>> -    int i = 0;
>>      bool is_unicode = false;
>>      if ((data[0] & 0xff) == 0xfe && (len > 1 && (data[1] & 0xff) == 0xff)) {
>>          is_unicode = true;
>> -        i = 2;
>> +        data += 2; // skip BOM
>>      }
> 
> instead of the index i, the beginning of UTF-16BE buffer is shifted
> to skip BOM.
> 
>> -    ustring::size_type ret_len = len - i;
>> +    ustring::size_type ret_len = data_end - data;
>>      if (is_unicode) {
>>          ret_len >>= 1;
>>      }
> 
> ditto.
> 
>> @@ -75,15 +76,15 @@ ustring detail::unicode_GooString_to_ustring(const GooString *str)
>>      size_t ret_index = 0;
>>      ustring::value_type u;
>>      if (is_unicode) {
>> -        while (i < len) {
>> -            u = ((data[i] & 0xff) << 8) | (data[i + 1] & 0xff);
>> -            i += 2;
>> +        while (data < data_end) {
>> +            u = ((*data & 0xff) << 8) | (*(data + 1) & 0xff);
>> +            data += 2;
>>              ret[ret_index++] = u;
>>          }
>>      } else {
>> -        while (i < len) {
>> -            u = data[i] & 0xff;
>> -            ++i;
>> +        while (data < data_end) {
>> +            u = pdfDocEncoding[ *data & 0xff ];
>> +            data ++;
>>              ret[ret_index++] = u;
>>          }
>>      }
> 
> use *data instead of data[i].
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180416/6ca852b9/attachment.sig>