[Spice-devel] [PATCH spice-gtk 3/4] util: add unix2dos and dos2unix

Sat Aug 24 05:23:11 PDT 2013

Hi,

On 08/24/2013 02:17 PM, Marc-André Lureau wrote:
> On Sat, Aug 24, 2013 at 12:20 PM, Hans de Goede <hdegoede at redhat.com> wrote:
>> Hi,
>>
>>
>> On 08/23/2013 10:25 PM, Marc-André Lureau wrote:
>>>
>>> Convert line endings from/to LF/CRLF, in utf8.
>>> ---
>>>    gtk/spice-util-priv.h |   2 +
>>>    gtk/spice-util.c      | 122
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 124 insertions(+)
>>>
>>> diff --git a/gtk/spice-util-priv.h b/gtk/spice-util-priv.h
>>> index ee5a42d..cc559dc 100644
>>> --- a/gtk/spice-util-priv.h
>>> +++ b/gtk/spice-util-priv.h
>>> @@ -29,6 +29,8 @@ gboolean spice_strv_contains(const GStrv strv, const
>>> gchar *str);
>>>    gchar* spice_uuid_to_string(const guint8 uuid[16]);
>>>    const gchar* spice_yes_no(gboolean value);
>>>    guint16 spice_make_scancode(guint scancode, gboolean release);
>>> +gchar* spice_unix2dos(const gchar *str, gssize len, GError **error);
>>> +gchar* spice_dos2unix(const gchar *str, gssize len, GError **error);
>>>
>>>    #if GLIB_CHECK_VERSION(2,32,0)
>>>    #define STATIC_MUTEX            GMutex
>>> diff --git a/gtk/spice-util.c b/gtk/spice-util.c
>>> index 774a145..be10edc 100644
>>> --- a/gtk/spice-util.c
>>> +++ b/gtk/spice-util.c
>>> @@ -19,6 +19,7 @@
>>>    #ifdef HAVE_CONFIG_H
>>>    # include "config.h"
>>>    #endif
>>> +
>>>    #include <stdlib.h>
>>>    #include <string.h>
>>>    #include <glib-object.h>
>>> @@ -245,3 +246,124 @@ guint16 spice_make_scancode(guint scancode, gboolean
>>> release)
>>>
>>>        g_return_val_if_reached(0);
>>>    }
>>> +
>>> +typedef enum {
>>> +    NEWLINE_TYPE_LF,
>>> +    NEWLINE_TYPE_CR_LF
>>> +} NewlineType;
>>> +
>>> +static gssize get_line(const gchar *str, gsize len,
>>> +                       NewlineType type, gsize *nl_len,
>>> +                       GError **error)
>>> +{
>>> +    const gchar *p = str;
>>> +    gsize nl = 0;
>>> +
>>> +    if (type == NEWLINE_TYPE_CR_LF) {
>>> +        while ((p - str) < len) {
>>> +            p = g_utf8_strchr(p, len, '\r');
>>> +            if (!p)
>>> +                break;
>>> +            p = g_utf8_next_char(p);
>>> +            if (g_utf8_get_char(p) == '\n') {
>>> +                len = (p - str) - 1;
>>> +                nl = 2;
>>> +                break;
>>> +            }
>>> +        }
>>> +    } else {
>>> +        p = g_utf8_strchr(str, len, '\n');
>>> +        if (p) {
>>> +            len = p - str;
>>> +            nl = 1;
>>> +        }
>>> +    }
>>
>>
>> This looks way more complicated then it needs to be, in UTF-8
>> 0x00 - 0x7f only are valid as a single-byte sequence. multi-byte
>> encoded characters will never contain 0x00 - 0x7f. UTF-8 was designed
>> this way, is so that existing string parsing code for non multi-byte
>> encodings, which make look for example for ' " = or LF characters does
>> not break when parsing strings with multi-byte characters in there.
>>
>> TL;DR: LF and CR will never be part of a multi byte character, so
>> you can simple do: strstr(str, "\r\n") to find the CRLF.
>
> g_utf8_strchr is implemented using a regular strstr. Speed shouldn't
> be different here. I prefer to use utf8 functions on utf8 strings.

Right, but I'm suggesting using strstr instead of strchr replacing:

 >>> +        while ((p - str) < len) {
 >>> +            p = g_utf8_strchr(p, len, '\r');
 >>> +            if (!p)
 >>> +                break;
 >>> +            p = g_utf8_next_char(p);
 >>> +            if (g_utf8_get_char(p) == '\n') {
 >>> +                len = (p - str) - 1;
 >>> +                nl = 2;
 >>> +                break;
 >>> +            }
 >>> +        }

With a single strstr(str, "\r\n") call. Which is one heck of
a lot more readable, the above loop gives me a head-ache, and
I'm quite sure you won't be able to remember what exactly
it is doing in a couple of months from now either, where as
a simple strstr(str, "\r\n") is quite obvious. I really believe
we should take the KISS approach here.

I see that there is no g_utf8_strstr, likely simply because there
would be no difference between a g_utf8_strstr and regular strstr.

>>> +
>>> +    if (!g_utf8_validate(str, len, NULL)) {
>>> +        g_set_error_literal(error, G_CONVERT_ERROR,
>>> +                            G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
>>> +                            "Invalid byte sequence in conversion input");
>>> +        return -1;
>>> +    }
>>
>>
>> And once you simply treat this as a regular C-string without worrying
>> about multi-byte encodings you can also drop this.
>
> Actually, during implementation, I have encountered/produced invalid
> utf8 that will break later on in gtk+, so I prefer to validate the
> production.

Ok.

Regards,

Hans