[poppler] reconsideration about iconv() in cpp frontend

Mon Apr 16 16:44:20 UTC 2018

Adam Reichold wrote:
>> A) use the functions in UTF.h, and change ustring class
>> to fit them.

>> B) rewrite the functions in UTF.h by template programming
>> and have the interfaces fitting to current ustring().

>> C) write diversions of the functions in UTF.h in cpp frontend.
> 
> I think what is best here is hard to decide without seeing actual code
> that at least starts in one of these directions.

Indeed. In fact, yesterday I was trying to write a sample
patch for B) direction, but I got stuck. Because current UTF.h
is written in C-style (*p++ = c style copy/concatenation).
Thus, if I try to pass std::basic_string or std::vector
object, I have to make yet another template (and different
implementations) for the concatenation.

Following is current my proposal, which still use reinterpret_cast<>(),
but if it is not good (e.g. unsigned short != uint16_t), it allocates
the buffer and free it internally. By replacing iconv() by UTF.h
functions, I keep from the cast between unsigned short (with unknown
byte order) array and char array.

Please let me explain...

> --- a/cpp/poppler-global.cpp
> +++ b/cpp/poppler-global.cpp
> @@ -26,6 +26,7 @@
>  #include "poppler-private.h"
>  
>  #include "DateInfo.h"
> +#include "UTF.h"
>  
>  #include <algorithm>

Include "UTF.h" to use UTF-8 <-> UTF-16 helpers in there.

> @@ -34,31 +35,8 @@
>  #include <ios>
>  #include <iostream>
>  
> -#include <iconv.h>
> -
>  #include "config.h"
>  
> -namespace
> -{
> -
> -struct MiniIconv
> -{
> -    MiniIconv(const char *to_code, const char *from_code)
> -        : i_(iconv_open(to_code, from_code))
> -    {}
> -    ~MiniIconv()
> -    { if (is_valid()) iconv_close(i_); }
> -    MiniIconv(const MiniIconv &) = delete;
> -    MiniIconv& operator=(const MiniIconv &) = delete;
> -    bool is_valid() const
> -    { return i_ != (iconv_t)-1; }
> -    operator iconv_t() const
> -    { return i_; }
> -    iconv_t i_;
> -};
> -
> -}
> -

Remove iconv() related functions.

> @@ -226,28 +204,31 @@ byte_array ustring::to_utf8() const
>          return byte_array();
>      }
>  
> -    MiniIconv ic("UTF-8", "UTF-16");
> -    if (!ic.is_valid()) {
> -        return byte_array();

Almost replacement of ustring::to_utf8(). See below.

> +#if 0xFFFFU == USHRT_MAX
> +    const uint16_t* utf16_buff = reinterpret_cast<const uint16_t *>(data());
> +#else

above is the popular case that uint16_t == unsigned int.
simple casting is enough to get uint16_t array from
ustring object.

> +    const uint16_t* utf16_begin = new uint16_t[utf16_len];
> +    const uint16_t* utf16_buff = utf16_begin;
> +    const unsigned short *me = data();
> +   
> +    for (int i = 0; i < size(); i++) {
> +        utf16_buff[i] = (uint16_t)me[i];
>      }

[snip the removal of existing code]

> +#endif

above is the irregular case that uint16_t != unsigned int.
allocate temporal array of uint16_t and copy the contents.
for skipping of BOM in the beginning, the heaf of allocated
buffer is recorded in utf16_begin, but the UTF-16 data to
be converted starts from utf16_buff.

> +    const uint16_t* utf16_end = utf16_buff + size();
> +    if (utf16_buff[0] == 0xFEFF) {
> +        utf16_buff ++;
>      }

in abovem utf16_buff is shifted if the beginning of the
buffer is BOM.

> -    str.resize(str.size() - str_len_left);
> -    return str;
> +
> +    const int utf8_len = utf16CountUtf8Bytes(utf16_buff);
> +    byte_array ret(utf8_len + 1, '\0');

count the required buffer size by utf16CountUtf8Bytes in UTF.h,
and allocate the byte_array to be returned. To assue the null
termination, the size is incremented and filled by '\0'.

> +    utf16ToUtf8(utf16_buff, reinterpret_cast<char *>(ret.data()), utf8_len + 1, utf16_end - utf16_buff);
> +    ret.resize(std::strlen(ret.data()));

convert by utf16ToUtf8(), and shrink to fit converted result.

> +
> +#if 0xFFFFU != USHRT_MAX
> +    delete utf16_begin;
> +#endif
> +    return ret;
>  }

if irregular case, the temporal buffer should be freed.

> @@ -256,15 +237,29 @@ std::string ustring::to_latin1() const
>          return std::string();
>      }
>  
> -    const size_type mylength = size();
> -    std::string ret(mylength, '\0');
>      const value_type *me = data();
> -    for (size_type i = 0; i < mylength; ++i) {
> +    const value_type *me_end = me + size();
> +    if (*me == 0xFEFF) {
> +        me ++;
> +    }
> +    std::string ret(me_end - me, '\0');
> +    for (size_type i = 0; me < me_end; ++i) {
>          ret[i] = (char)*me++;
>      }
>      return ret;
>  }

original implementation of ustring::to_latin() does not
care for BOM. now BOM is checked and skipped if there is.

> +static bool has_bom_utf8(const char *c, int len)
> +{
> +    if ( 3 > len ) {
> +        return false;
> +    }
> +    if (c[0] == 0xEF && c[1] == 0xBB && c[2] == 0xBF) {
> +        return true;
> +    }
> +    return false;
> +}
> +
>  ustring ustring::from_utf8(const char *str, int len)
>  {
>      if (len <= 0) {

even if ustring::from_utf8 receives UTF-8 string with BOM,
created ustring object should not include BOM. has_bom_utf8()
is an utility function for it.

> @@ -273,31 +268,27 @@ ustring ustring::from_utf8(const char *str, int len)
>              return ustring();
>          }
>      }
> -
> -    MiniIconv ic("UTF-16", "UTF-8");
> -    if (!ic.is_valid()) {
> -        return ustring();

remove iconv() related function.

> +    int utf16_count = utf8CountUtf16CodeUnits(str);
> +    if (has_bom_utf8(str, len)) {
> +        str += 3;       // skip BOM
> +        len -= 3;       // skip BOM
> +        utf16_count --; // skip BOM
>      }

if UTF-8 starts with BOM, it is skiped, the length of
UTF-8 buffer is decremented (3 octets), the length of
UTF-16 buffer is decremented (16bit).

> +    ustring ret(utf16_count + 1, 0);

allocate ustring object to be returned. to assure the
U+0000l termination, the size is incremented.

> +#if 0xFFFFU == USHRT_MAX
> +    const uint16_t* utf16_buff = reinterpret_cast<const uint16_t *>(ret.data());

this is for popular case where uint16_t == unsigned short.
simple cast is enough to make the array of uint16_t.

> +#else
> +    const uint16_t* utf16_buff = new uint16_t[utf16_count + 1](0);
> +#endif

this is for irregular case. allocate temporal buffer to store
uint16_t data.

> +    utf8ToUtf16(str, (uint16_t*)utf16_buff, utf16_count + 1, len);

fill utf16_buff by converted UTF-16 data.

> +#if 0xFFFFU != USHRT_MAX
> +    for (int i = 0; i < utf16_count; i ++) {
> +        ret[i] = (unsigned short)utf16_buff[i];
>      }

in irregular case, the content should be copied to the object
to be returned.

> +    delete utf16_buff;
> +#endif

the buffer for irregular case is freed here.

>      return ret;
>  }
>  
> --- a/cpp/poppler-page.cpp
> +++ b/cpp/poppler-page.cpp
> @@ -355,7 +365,7 @@ std::vector<text_box> page::text_list() const
>              TextWord *word = word_list->get(i);
>  
>              std::unique_ptr<GooString> gooWord{word->getText()};
> -            ustring ustr = detail::unicode_GooString_to_ustring(gooWord.get());
> +            ustring ustr = ustring::from_utf8(gooWord->getCString());

here, gooWord is the object created by TextOutputDev,
so it is always UTF-8. it should be passed to ustring::from_utf8().

> --- a/cpp/poppler-private.cpp
> +++ b/cpp/poppler-private.cpp
> @@ -23,6 +23,7 @@
>  #include "poppler-private.h"
>  
>  #include "GooString.h"
> +#include "PDFDocEncoding.h"
>  #include "Page.h"
>  
>  #include <ctime>

the strings for PDF metadata is UTF-16BE with BOM or PDFDocEncoding.
to use PDFDocEncoding helper, include PDFDocEncoding.h

> @@ -58,16 +59,16 @@ rectf detail::pdfrectangle_to_rectf(const PDFRectangle &pdfrect)
>  
>  ustring detail::unicode_GooString_to_ustring(const GooString *str)
>  {
> -    const char *data = str->getCString();
>      const int len = str->getLength();
> +    const char *data = str->getCString();
> +    const char *data_end = data + len;

input data may or may not have BOM, so the length of UTF-8 string
to be converted may or may not be len. to clarify the end of buffer,
data_end is calculated, instead of the index "i" in the existing
implementation.

> -    int i = 0;
>      bool is_unicode = false;
>      if ((data[0] & 0xff) == 0xfe && (len > 1 && (data[1] & 0xff) == 0xff)) {
>          is_unicode = true;
> -        i = 2;
> +        data += 2; // skip BOM
>      }

instead of the index i, the beginning of UTF-16BE buffer is shifted
to skip BOM.

> -    ustring::size_type ret_len = len - i;
> +    ustring::size_type ret_len = data_end - data;
>      if (is_unicode) {
>          ret_len >>= 1;
>      }

ditto.

> @@ -75,15 +76,15 @@ ustring detail::unicode_GooString_to_ustring(const GooString *str)
>      size_t ret_index = 0;
>      ustring::value_type u;
>      if (is_unicode) {
> -        while (i < len) {
> -            u = ((data[i] & 0xff) << 8) | (data[i + 1] & 0xff);
> -            i += 2;
> +        while (data < data_end) {
> +            u = ((*data & 0xff) << 8) | (*(data + 1) & 0xff);
> +            data += 2;
>              ret[ret_index++] = u;
>          }
>      } else {
> -        while (i < len) {
> -            u = data[i] & 0xff;
> -            ++i;
> +        while (data < data_end) {
> +            u = pdfDocEncoding[ *data & 0xff ];
> +            data ++;
>              ret[ret_index++] = u;
>          }
>      }

use *data instead of data[i].

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mps-fix-cpp-encoding-issue_20180417-0010.diff.xz
Type: application/octet-stream
Size: 2132 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180417/a051f674/attachment-0001.obj>