[PATCH 0/5] Improve text protocol

Tue Apr 16 10:57:45 PDT 2013

There seem to be some claims that you cannot random-access a UTF-8
string with errors in it. This is false if you restrict what counts as
an error to strict patterns that never contain a valid encoding, and it
is easiest with my recommendation that every error be exactly 1 byte
long.

To keep this sample code simple, the buffer has a 0 byte before the
first actual byte and another after the last one. This avoids the need
to pass the buffer ends to the functions. Real implementations may need
to pass a pointer to one or both ends.
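
One way to get a buffer laid out like this is sketched below. The names
utf8_padded_copy and utf8_padded_free are hypothetical and used only for
illustration. The sketch assumes the text itself contains no 0 bytes,
since those would be indistinguishable from the padding.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: copies n bytes of text into a new allocation that
// has one 0 byte in front and one 0 byte behind. Returns a pointer to the
// first actual byte (not to the allocation), or NULL on failure.
unsigned char* utf8_padded_copy(const void* text, size_t n)
{
   if (n > (size_t)-1 - 2) return NULL;     // n + 2 would overflow
   unsigned char* raw = (unsigned char*)malloc(n + 2);
   if (!raw) return NULL;
   raw[0] = 0;                              // 0 before the first byte
   if (n) memcpy(raw + 1, text, n);
   raw[n + 1] = 0;                          // 0 after the last byte
   return raw + 1;
}

// Releases a buffer returned by utf8_padded_copy.
void utf8_padded_free(unsigned char* first)
{
   if (first) free(first - 1);
}
```

The returned pointer can be handed directly to functions that expect the
padded layout, and first[-1] is always safe to read.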

// Returns the length of a UTF-8 code point starting at p,
// or returns 0 if it is not a valid encoding. The rest of this
// code treats 0 as a 1-byte-long "code point"
int utf8_length(const unsigned char* p)
{
   if (*p < 0x80) return 1; // ASCII
   else if (*p < 0xC2) return 0; // continuation and overlong
   else ... // multi-byte codes
}
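
The multi-byte branch is elided above. One possible way to fill it in is
sketched here. It is an assumption, not necessarily the intended code:
it follows the Unicode table of well-formed UTF-8 byte sequences, so it
also rejects surrogates and values above U+10FFFF. The helper is_cont is
a name introduced only for this sketch. The function never reads past
the first non-continuation byte, so the trailing 0 byte stops it.

```c
// Hypothetical helper: true for a continuation byte (10xxxxxx).
static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

// Returns the length of the UTF-8 code point starting at p,
// or 0 if the bytes at p are not a valid encoding.
int utf8_length(const unsigned char* p)
{
   unsigned char c = p[0];
   if (c < 0x80) return 1;                     // ASCII, including 0
   if (c < 0xC2) return 0;                     // continuation and overlong
   if (c < 0xE0)                               // 2 bytes: C2..DF
      return is_cont(p[1]) ? 2 : 0;
   if (c < 0xF0) {                             // 3 bytes: E0..EF
      if (!is_cont(p[1])) return 0;
      if (c == 0xE0 && p[1] < 0xA0) return 0;  // overlong
      if (c == 0xED && p[1] > 0x9F) return 0;  // surrogate range
      return is_cont(p[2]) ? 3 : 0;
   }
   if (c < 0xF5) {                             // 4 bytes: F0..F4
      if (!is_cont(p[1])) return 0;
      if (c == 0xF0 && p[1] < 0x90) return 0;  // overlong
      if (c == 0xF4 && p[1] > 0x8F) return 0;  // above U+10FFFF
      return (is_cont(p[2]) && is_cont(p[3])) ? 4 : 0;
   }
   return 0;                                   // F5..FF are never valid
}
```

Whether surrogates and other edge cases count as errors is a policy
choice. What matters for random access is only that a rejected lead byte
is treated as a 1-byte error.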

// Returns the start of the UTF-8 code point that contains the byte
// at p. An invalid byte counts as its own 1-byte "code point".
const unsigned char* utf8_start(const unsigned char* p)
{
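   for (int i = 0; i < 4; i++) {
      if (utf8_length(p-i) > i) return p-i;
      // Only continuation bytes can sit inside an earlier code point, so
      // stop here. This also avoids reading before the leading 0 byte.
      if ((p[-i] & 0xC0) != 0x80) break;
   }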
   return p;
}

// p must point at the start of a code point. Returns the start of the
// next one, or the 0 byte past the end of the buffer.
const unsigned char* utf8_next(const unsigned char* p)
{
   int n = utf8_length(p);
   return p + (n ? n : 1);
}

// p must point at the start of a code point. Returns the start of the
// previous one, or the 0 byte before the start of the buffer.
const unsigned char* utf8_prev(const unsigned char* p)
{
   return utf8_start(p-1);
}
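
To illustrate the random-access claim, here is a self-contained sketch
that puts the pieces together. It repeats a compact validity check so
that it compiles on its own. The multi-byte rules in it are an
assumption based on the Unicode well-formedness table, and is_cont,
utf8_count_forward and utf8_count_backward are hypothetical names used
only for this example. utf8_start stops scanning back at the first
non-continuation byte so that it never reads before the leading 0 byte.

```c
#include <stddef.h>

static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

// Compact validity check: length of the code point at p, or 0 on error.
int utf8_length(const unsigned char* p)
{
   unsigned char c = p[0];
   if (c < 0x80) return 1;
   if (c < 0xC2 || c > 0xF4 || !is_cont(p[1])) return 0;
   if (c < 0xE0) return 2;
   if ((c == 0xE0 && p[1] < 0xA0) || (c == 0xED && p[1] > 0x9F)) return 0;
   if ((c == 0xF0 && p[1] < 0x90) || (c == 0xF4 && p[1] > 0x8F)) return 0;
   if (!is_cont(p[2])) return 0;
   if (c < 0xF0) return 3;
   return is_cont(p[3]) ? 4 : 0;
}

// Snap an arbitrary byte pointer to the start of its code point.
const unsigned char* utf8_start(const unsigned char* p)
{
   for (int i = 0; i < 4; i++) {
      if (utf8_length(p-i) > i) return p-i;
      if (!is_cont(p[-i])) break;   // cannot be inside an earlier one
   }
   return p;
}

const unsigned char* utf8_next(const unsigned char* p)
{
   int n = utf8_length(p);
   return p + (n ? n : 1);          // an error advances by 1 byte
}

const unsigned char* utf8_prev(const unsigned char* p)
{
   return utf8_start(p-1);
}

// Counts code points (errors count as 1 each) from first to the trailing 0.
size_t utf8_count_forward(const unsigned char* first)
{
   size_t n = 0;
   for (const unsigned char* p = first; *p; p = utf8_next(p)) n++;
   return n;
}

// Counts the same units walking back from the trailing 0 to the leading 0.
size_t utf8_count_backward(const unsigned char* end)
{
   size_t n = 0;
   for (const unsigned char* p = utf8_prev(end); *p; p = utf8_prev(p)) n++;
   return n;
}
```

For the bytes 'a', C3 A9, 80, E2 82, 'b', F0 9F 98 80 between two 0
bytes, both walks count 7 units: four valid code points plus three
1-byte errors (the stray 80 and the two bytes of the truncated E2 82).
Calling utf8_start on any byte of F0 9F 98 80 returns the F0, and
calling it on an error byte returns that byte itself.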