[UTF-8] Aspell and UTF-8/Unicode

Elias Martenson elias-m@algonet.se
Mon, 16 Feb 2004 11:23:17 +0100


On Mon, 2004-02-16 at 03:42, Kevin Atkinson wrote:

> Do you know of any code samples for efficient UTF-8 manipulation?

Unfortunately I don't have much of that. But there has to be some stuff
on the Internet :-)

> I figure if I support 8-bit character sets and UTF-8 that will be enough.

Possibly. Many applications do it that way. When it boils down to it,
wchar_t strings are not needed very often, and when they are, it's
usually for performance reasons.

> This means I can detect when UTF-8 is being used and just handle the UTF-8
> strings more carefully, more efficient than converting to wchar_t just
> to get the length.  What I really need are things like
>   - length of utf-8 strings

    size_t stringLength = mbstowcs(NULL, theString, 0);

Yes, mbstowcs() serves two purposes: converting to wide strings and,
when the destination is NULL, returning the length in characters of an
MBS string.
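A minimal sketch of that length call wrapped in a helper. The name
mbs_char_count is made up for illustration. The NULL-destination form
is the POSIX behaviour, and the result depends on the current LC_CTYPE
locale, so a program would need to select a UTF-8 locale (for example
with setlocale(LC_CTYPE, "")) before UTF-8 input is decoded as such.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters (not bytes) in a multibyte
 * string, decoded according to the current LC_CTYPE locale.
 * Returns (size_t)-1 if the string contains an invalid sequence. */
size_t mbs_char_count(const char *s)
{
    assert(s != NULL);
    /* With a NULL destination nothing is stored; mbstowcs() only
     * reports how many wide characters the conversion would produce. */
    return mbstowcs(NULL, s, 0);
}
```

With a UTF-8 locale active, a string such as "åäö" should count as 3
characters even though it occupies 6 bytes.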

>   - length of the current utf-8 character

I assume that what you mean here is that given the character index into
a UTF-8 string, you want the length of the character at that position?
And perhaps also the actual Unicode code point (wchar_t value) for that
character?

The length in bytes of a character in an MBS string is retrieved using
mbrlen(). The usage goes something like this (disclaimer: I haven't
used this myself, so double-check the man page):

    size_t length = mbrlen(theStringPointer, MB_CUR_MAX, NULL);
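A slightly fuller sketch of the same call, with the special return
values spelled out. The helper name mb_char_bytes is hypothetical, and
it passes an explicit, freshly zeroed mbstate_t instead of NULL so that
each call starts from the initial conversion state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: byte length of the multibyte character that
 * starts at p, according to the current LC_CTYPE locale.
 * Returns 0 at the terminating NUL, (size_t)-1 for an invalid
 * sequence and (size_t)-2 for an incomplete one. */
size_t mb_char_bytes(const char *p)
{
    mbstate_t state;

    assert(p != NULL);
    memset(&state, 0, sizeof state);   /* initial conversion state */
    return mbrlen(p, MB_CUR_MAX, &state);
}
```

Stepping through a string is then a matter of advancing the pointer by
the returned length until the result is 0 or one of the error values.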

You can then extract the actual character using mbrtowc() (assuming
strPointer points at the start of the multibyte character):

    wchar_t theChar;
    size_t length = mbrtowc(&theChar, strPointer, MB_CUR_MAX, NULL);

The Unicode character is now stored in theChar.
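As a rough sketch, the decoding step could be wrapped like this. The
name mb_decode_char is invented for the example. Note that the wchar_t
value is only guaranteed to be a Unicode code point on platforms that
define __STDC_ISO_10646__ (glibc does); elsewhere that is an
assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte character at p into *out
 * and return the number of bytes it occupies. Returns 0 at the
 * terminating NUL, (size_t)-1 for an invalid sequence and (size_t)-2
 * for an incomplete one; *out is only meaningful on success. */
size_t mb_decode_char(const char *p, wchar_t *out)
{
    mbstate_t state;

    assert(p != NULL && out != NULL);
    memset(&state, 0, sizeof state);   /* initial conversion state */
    return mbrtowc(out, p, MB_CUR_MAX, &state);
}
```

The return value doubles as the character length, so a separate
mbrlen() call is not needed when the code point is wanted anyway.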

Hope this helps

Regards

Elias Mårtenson