[PATCH] sms-part-3gpp: decode Unicode SMS with non-BMP code points
Aleksander Morgado
aleksander at aleksander.es
Fri Dec 15 09:15:51 UTC 2017
On Fri, Dec 15, 2017 at 9:45 AM, Ben Chan <benchan at chromium.org> wrote:
> Although 3GPP TS 23.038 specifies that Unicode SMS messages are encoded in
> UCS-2, UTF-16 encoding is commonly used instead on many modern platforms to
> allow encoding code points that fall outside the Basic Multilingual Plane
> (BMP), such as Emoji. Most of the UCS-2 code points are identical to their
> equivalent UTF-16 code points. In UTF-16, non-BMP code points are encoded in a
> pair of surrogate code points (i.e. a high surrogate in 0xD800..0xDBFF,
> followed by a low surrogate in 0xDC00..0xDFFF). An isolated surrogate code
> point has no general interpretation in UTF-16, but could be a valid (though
> unmapped) code point in UCS-2.
>
> This patch modifies the 3GPP SMS decoding to first try UTF-16BE and then fall
> back to UCS-2BE on failure. If both fail, an empty string is returned
> instead of a NULL pointer.
> ---
> Hi Aleksander and Dan,
>
> I found that ModemManager failed to decode SMS with Emoji and thus looked into
> the SMS encoding used by modern mobile platforms. It seems like UTF-16BE is
> used in practice instead of UCS-2 as specified by the 3GPP spec. It seems like
> we can practically use UTF-16BE for most cases as most of the UCS-2 code points
> are identical to their equivalent UTF-16 code points. The UCS-2 fallback is
> mostly to handle isolated surrogate code points, which should be rare given
> that they don't seem to map to a "character" in UCS-2. Given that the 3GPP spec
> assumes UCS-2BE, I also assume UTF-16BE by default as I haven't observed any
> byte order mark (BOM) in SMS so far.
>
> That said, I've only tested with some modems and US carriers. If you have some
> spare cycles, could you try sending SMS with Emoji and other Unicode characters
> to your modems with your local carrier and see if MM correctly decodes the SMS
> with / without this patch?
>
Probably a good change; will test it once I have a chance.
But regarding the last fallback, why is it better to return an empty
string than NULL? I assume that we return NULL to indicate an error in
parsing, which is what happens here if neither the UTF-16 nor the UCS-2
conversion succeeds.
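
To illustrate the concern, here is a minimal sketch of what a caller can
tell apart as long as a failed conversion stays NULL. The helper name
describe_decoded() is hypothetical and not part of ModemManager:

```c
#include <stddef.h>

/* Hypothetical caller-side check, not ModemManager code.
 * With NULL reserved for failures, three outcomes stay distinguishable. */
static const char *
describe_decoded (const char *decoded)
{
    if (decoded == NULL)
        return "decode error";      /* conversion failed */
    if (decoded[0] == '\0')
        return "empty message";     /* valid SMS with no text */
    return "text";
}
```

If the decoder returned "" on failure, the first two cases would collapse
into one.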
> Thanks,
> Ben
>
>
> src/mm-sms-part-3gpp.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/src/mm-sms-part-3gpp.c b/src/mm-sms-part-3gpp.c
> index 0b59b247..f7beaf61 100644
> --- a/src/mm-sms-part-3gpp.c
> +++ b/src/mm-sms-part-3gpp.c
> @@ -247,8 +247,28 @@ sms_decode_text (const guint8 *text, int len, MMSmsEncoding encoding, int bit_of
> mm_dbg (" Got UTF-8 text: '%s'", utf8);
> g_free (unpacked);
> } else if (encoding == MM_SMS_ENCODING_UCS2) {
> - mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
> - utf8 = g_convert ((char *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
> + /* Although 3GPP TS 23.038 specifies that Unicode SMS messages are
> + * encoded in UCS-2, UTF-16 encoding is commonly used instead on many
> + * modern platforms to allow encoding code points that fall outside the
> + * Basic Multilingual Plane (BMP), such as Emoji. Most of the UCS-2
> + * code points are identical to their equivalent UTF-16 code points.
> + * In UTF-16, non-BMP code points are encoded in a pair of surrogate
> + * code points (i.e. a high surrogate in 0xD800..0xDBFF, followed by a
> + * low surrogate in 0xDC00..0xDFFF). An isolated surrogate code point
> + * has no general interpretation in UTF-16, but could be a valid
> + * (though unmapped) code point in UCS-2. Here we first try to decode
> + * the SMS message in UTF-16BE, and if that fails, fall back to decode
> + * in UCS-2BE.
> + */
> + mm_dbg ("Converting SMS part text from UTF16BE to UTF8...");
> + utf8 = g_convert ((const gchar *) text, len, "UTF8", "UTF16BE", NULL, NULL, NULL);
> + if (!utf8) {
> + mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
> + utf8 = g_convert ((const gchar *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
> + }
> + if (!utf8)
> + utf8 = g_strdup ("");
> +
> mm_dbg (" Got UTF-8 text: '%s'", utf8);
> } else {
> g_warn_if_reached ();
> --
> 2.15.1.504.g5279b80103-goog
>
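
For reference, the surrogate pair rule described in the comment above can be
sketched in plain C as follows. This is only an illustration and not what the
patch does (the patch relies on g_convert()); the helper name utf16be_next()
is made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ModemManager: decode one code point from
 * a UTF-16BE buffer. Returns the number of bytes consumed (2 or 4), or 0 if
 * the input is truncated or contains an isolated surrogate. */
static size_t
utf16be_next (const uint8_t *buf, size_t len, uint32_t *cp)
{
    uint32_t hi, lo;

    if (len < 2)
        return 0;
    hi = ((uint32_t) buf[0] << 8) | buf[1];
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;           /* BMP code unit, same value as in UCS-2 */
        return 2;
    }
    if (hi > 0xDBFF || len < 4)
        return 0;           /* low surrogate first, or input ends too early */
    lo = ((uint32_t) buf[2] << 8) | buf[3];
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;           /* high surrogate not followed by a low one */
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}
```

For the bytes D8 3D DE 00 this consumes 4 bytes and yields U+1F600. For a
high surrogate followed by a non-surrogate (e.g. D8 3D 00 41) it returns 0,
which corresponds to the case where the UCS-2BE fallback would be tried.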
--
Aleksander
https://aleksander.es