[PATCH] sms-part-3gpp: decode Unicode SMS with non-BMP code points

Fri Dec 15 09:15:51 UTC 2017

On Fri, Dec 15, 2017 at 9:45 AM, Ben Chan <benchan at chromium.org> wrote:
> Although 3GPP TS 23.038 specifies that Unicode SMS messages are encoded in
> UCS-2, UTF-16 encoding is commonly used instead on many modern platforms to
> allow encoding code points that fall outside the Basic Multilingual Plane
> (BMP), such as Emoji. Most of the UCS-2 code points are identical to their
> equivalent UTF-16 code points.  In UTF-16, non-BMP code points are encoded in a
> pair of surrogate code points (i.e. a high surrogate in 0xD800..0xDBFF,
> followed by a low surrogate in 0xDC00..0xDFFF). An isolated surrogate code
> point has no general interpretation in UTF-16, but could be a valid (though
> unmapped) code point in UCS-2.
>
> This patch modifies the 3GPP SMS decoding to first try UTF-16BE and then fall
> back to UCS-2BE on failure. If both fail, an empty string is returned
> instead of a NULL pointer.
> ---
> Hi Aleksander and Dan,
>
> I found that ModemManager failed to decode SMS with Emoji and thus looked into
> the SMS encoding used by modern mobile platforms. It seems like UTF-16BE is
> used in practice instead of UCS-2 as specified by the 3GPP spec. It seems like
> we can practically use UTF-16BE for most cases as most of the UCS-2 code points
> are identical to their equivalent UTF-16 code points. The UCS-2 fallback is
> mostly to handle isolated surrogate code points, which should be rare given
> that they don't seem to map to a "character" in UCS-2. Given that the 3GPP spec
> assumes UCS-2BE, I also assume UTF-16BE by default as I haven't observed any
> byte order mark (BOM) in SMS so far.
>
> That said, I've only tested with some modems and US carriers. If you have some
> spare cycles, could you try sending SMS with Emoji and other Unicode characters
> to your modems with your local carrier and see if MM correctly decodes the SMS
> with / without this patch?
>
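For reference, the surrogate rule described above can be sketched in plain C roughly as follows. This is only an illustration under my own assumptions, not the code in the patch (which relies on g_convert): `utf16be_decode` is a hypothetical helper that turns big-endian 16-bit units into code points and reports failure on an isolated surrogate, which is the case where a strict UTF-16 decoder gives up and a UCS-2 decoder might still accept the unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ModemManager or GLib.
 * Decodes UTF-16BE bytes into code points.
 * Returns the number of code points written to out, or -1 if the input has
 * an odd length, contains an isolated surrogate, or does not fit in out. */
static int
utf16be_decode (const uint8_t *buf, size_t len, uint32_t *out, size_t out_cap)
{
    size_t i = 0;
    size_t n = 0;

    assert (out != NULL);

    if (len % 2 != 0)
        return -1;

    while (i < len) {
        uint32_t unit = ((uint32_t) buf[i] << 8) | buf[i + 1];

        i += 2;

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            uint32_t low;

            /* A high surrogate must be followed by a low surrogate */
            if (i >= len)
                return -1;
            low = ((uint32_t) buf[i] << 8) | buf[i + 1];
            if (low < 0xDC00 || low > 0xDFFF)
                return -1;
            i += 2;

            /* Combine the pair into a non-BMP code point */
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            /* A low surrogate without a preceding high surrogate */
            return -1;
        }

        if (n >= out_cap)
            return -1;
        out[n++] = unit;
    }

    return (int) n;
}
```

With this sketch, the bytes D8 3D DE 00 decode to the single code point U+1F600, while D8 3D followed by 00 41 is rejected as an isolated high surrogate.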

Probably a good change; will test it once I have a chance.

But regarding the last fallback, why is it better to return an empty
string than NULL? I assume that we return NULL to indicate an error in
parsing, which is what happens here if neither the UTF-16 nor the UCS-2
conversion succeeds.
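
To make the distinction concrete, here is a rough sketch of what a caller can tell apart when NULL is kept as the error signal. The names (`TextStatus`, `classify_decoded_text`) are made up for illustration and are not ModemManager API; the only assumption is that the decoder returns either a UTF-8 string or NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status type, for illustration only */
typedef enum {
    TEXT_OK,
    TEXT_EMPTY,
    TEXT_DECODE_ERROR
} TextStatus;

/* Classify the result of a decoder that returns NULL on failure */
static TextStatus
classify_decoded_text (const char *utf8)
{
    if (utf8 == NULL)
        return TEXT_DECODE_ERROR;
    if (utf8[0] == '\0')
        return TEXT_EMPTY;
    return TEXT_OK;
}

/* Usage sketch: the three outcomes stay distinguishable */
static void
example_usage (void)
{
    assert (classify_decoded_text (NULL) == TEXT_DECODE_ERROR);
    assert (classify_decoded_text ("") == TEXT_EMPTY);
    assert (classify_decoded_text ("hi") == TEXT_OK);
}
```

If the decoder substitutes "" on failure, the first two cases collapse into one and the caller can no longer tell a failed conversion from a genuinely empty message.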

> Thanks,
> Ben
>
>
>  src/mm-sms-part-3gpp.c | 24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/src/mm-sms-part-3gpp.c b/src/mm-sms-part-3gpp.c
> index 0b59b247..f7beaf61 100644
> --- a/src/mm-sms-part-3gpp.c
> +++ b/src/mm-sms-part-3gpp.c
> @@ -247,8 +247,28 @@ sms_decode_text (const guint8 *text, int len, MMSmsEncoding encoding, int bit_of
>          mm_dbg ("   Got UTF-8 text: '%s'", utf8);
>          g_free (unpacked);
>      } else if (encoding == MM_SMS_ENCODING_UCS2) {
> -        mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
> -        utf8 = g_convert ((char *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
> +        /* Although 3GPP TS 23.038 specifies that Unicode SMS messages are
> +         * encoded in UCS-2, UTF-16 encoding is commonly used instead on many
> +         * modern platforms to allow encoding code points that fall outside the
> +         * Basic Multilingual Plane (BMP), such as Emoji. Most of the UCS-2
> +         * code points are identical to their equivalent UTF-16 code points.
> +         * In UTF-16, non-BMP code points are encoded in a pair of surrogate
> +         * code points (i.e. a high surrogate in 0xD800..0xDBFF, followed by a
> +         * low surrogate in 0xDC00..0xDFFF). An isolated surrogate code point
> +         * has no general interpretation in UTF-16, but could be a valid
> +         * (though unmapped) code point in UCS-2. Here we first try to decode
> +         * the SMS message in UTF-16BE, and if that fails, fall back to decode
> +         * in UCS-2BE.
> +         */
> +        mm_dbg ("Converting SMS part text from UTF16BE to UTF8...");
> +        utf8 = g_convert ((const gchar *) text, len, "UTF8", "UTF16BE", NULL, NULL, NULL);
> +        if (!utf8) {
> +            mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
> +            utf8 = g_convert ((const gchar *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
> +        }
> +        if (!utf8)
> +            utf8 = g_strdup ("");
> +
>          mm_dbg ("   Got UTF-8 text: '%s'", utf8);
>      } else {
>          g_warn_if_reached ();
> --
> 2.15.1.504.g5279b80103-goog
>

-- 
Aleksander
https://aleksander.es