[PATCH] sms-part-3gpp: decode Unicode SMS with non-BMP code points

Fri Dec 15 08:45:12 UTC 2017

Although 3GPP TS 23.038 specifies that Unicode SMS messages are encoded in
UCS-2, UTF-16 encoding is commonly used instead on many modern platforms to
allow encoding code points that fall outside the Basic Multilingual Plane
(BMP), such as Emoji. Most of the UCS-2 code points are identical to their
equivalent UTF-16 code points. In UTF-16, non-BMP code points are encoded as a
pair of surrogate code points (i.e. a high surrogate in 0xD800..0xDBFF,
followed by a low surrogate in 0xDC00..0xDFFF). An isolated surrogate code
point has no general interpretation in UTF-16, but could be a valid (though
unmapped) code point in UCS-2.

This patch modifies the 3GPP SMS decoding to first try UTF-16BE and then fall
back to UCS-2BE on failure. If both fail, an empty string is returned
instead of a NULL pointer.
---
Hi Aleksander and Dan,

I found that ModemManager failed to decode SMS with Emoji, so I looked into the
SMS encoding used by modern mobile platforms. In practice, UTF-16BE seems to be
used instead of UCS-2 as specified by the 3GPP spec. We can use UTF-16BE for
most cases, as most of the UCS-2 code points are identical to their equivalent
UTF-16 code points. The UCS-2 fallback is mostly to handle isolated surrogate
code points, which should be rare given that they don't seem to map to a
"character" in UCS-2. Given that the 3GPP spec assumes UCS-2BE, I also assume
UTF-16BE by default, as I haven't observed any byte order mark (BOM) in SMS so
far.
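
To illustrate the surrogate rule described above, here is a rough standalone
sketch in plain C. It is not part of the patch and does not use GLib; the
helper names surrogate_pair_to_code_point() and utf16be_is_well_formed() are
hypothetical and exist only for this example. It shows how a high/low
surrogate pair maps to a non-BMP code point, and how an isolated surrogate can
be detected in a big-endian buffer, which is the kind of input the UCS-2BE
fallback is meant for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a high surrogate (0xD800..0xDBFF) and a low
 * surrogate (0xDC00..0xDFFF) into the non-BMP code point they encode. */
static uint32_t
surrogate_pair_to_code_point (uint16_t high, uint16_t low)
{
    assert (high >= 0xD800 && high <= 0xDBFF);
    assert (low >= 0xDC00 && low <= 0xDFFF);

    return 0x10000u + (((uint32_t) (high - 0xD800) << 10) |
                       (uint32_t) (low - 0xDC00));
}

/* Hypothetical helper: return 1 if buf holds well-formed UTF-16BE, or 0 if
 * the length is odd or the buffer contains an isolated surrogate. */
static int
utf16be_is_well_formed (const uint8_t *buf, size_t len)
{
    size_t i;

    if (len % 2 != 0)
        return 0;

    for (i = 0; i < len; i += 2) {
        uint16_t unit = (uint16_t) ((buf[i] << 8) | buf[i + 1]);

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            uint16_t next;

            /* A high surrogate must be followed by a low surrogate. */
            if (i + 4 > len)
                return 0;
            next = (uint16_t) ((buf[i + 2] << 8) | buf[i + 3]);
            if (next < 0xDC00 || next > 0xDFFF)
                return 0;
            i += 2; /* skip the low surrogate that was just consumed */
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            /* A low surrogate without a preceding high surrogate. */
            return 0;
        }
    }

    return 1;
}
```

For example, the bytes D8 3D DE 00 form the pair 0xD83D, 0xDE00, which encodes
U+1F600, whereas a buffer ending in D8 3D alone contains an isolated high
surrogate and is not well-formed UTF-16BE.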

That said, I've only tested with some modems and US carriers. If you have some
spare cycles, could you try sending SMS with Emoji and other Unicode characters
to your modems with your local carrier and see if MM correctly decodes the SMS
with / without this patch?

Thanks,
Ben


 src/mm-sms-part-3gpp.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/mm-sms-part-3gpp.c b/src/mm-sms-part-3gpp.c
index 0b59b247..f7beaf61 100644
--- a/src/mm-sms-part-3gpp.c
+++ b/src/mm-sms-part-3gpp.c
@@ -247,8 +247,28 @@ sms_decode_text (const guint8 *text, int len, MMSmsEncoding encoding, int bit_of
         mm_dbg ("   Got UTF-8 text: '%s'", utf8);
         g_free (unpacked);
     } else if (encoding == MM_SMS_ENCODING_UCS2) {
-        mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
-        utf8 = g_convert ((char *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
+        /* Although 3GPP TS 23.038 specifies that Unicode SMS messages are
+         * encoded in UCS-2, UTF-16 encoding is commonly used instead on many
+         * modern platforms to allow encoding code points that fall outside the
+         * Basic Multilingual Plane (BMP), such as Emoji. Most of the UCS-2
+         * code points are identical to their equivalent UTF-16 code points.
+         * In UTF-16, non-BMP code points are encoded as a pair of surrogate
+         * code points (i.e. a high surrogate in 0xD800..0xDBFF, followed by a
+         * low surrogate in 0xDC00..0xDFFF). An isolated surrogate code point
+         * has no general interpretation in UTF-16, but could be a valid
+         * (though unmapped) code point in UCS-2. Here we first try to decode
+         * the SMS message in UTF-16BE, and if that fails, fall back to decode
+         * in UCS-2BE.
+         */
+        mm_dbg ("Converting SMS part text from UTF16BE to UTF8...");
+        utf8 = g_convert ((const gchar *) text, len, "UTF8", "UTF16BE", NULL, NULL, NULL);
+        if (!utf8) {
+            mm_dbg ("Converting SMS part text from UCS-2BE to UTF8...");
+            utf8 = g_convert ((const gchar *) text, len, "UTF8", "UCS-2BE", NULL, NULL, NULL);
+        }
+        if (!utf8)
+            utf8 = g_strdup ("");
+
         mm_dbg ("   Got UTF-8 text: '%s'", utf8);
     } else {
         g_warn_if_reached ();
-- 
2.15.1.504.g5279b80103-goog