[UTF-8] Aspell and UTF-8/Unicode
Elias Martenson
elias-m@algonet.se
Mon, 16 Feb 2004 11:23:17 +0100
On Mon, 2004-02-16 at 03:42, Kevin Atkinson wrote:
> Do you know of any code samples for efficient UTF-8 manipulation?
Unfortunately I don't have much of that. But there has to be some stuff
on the Internet :-)
> I figure if I support 8-bit character sets and UTF-8 that will be enough.
Possibly. Many applications do it that way. When it boils down to it,
wchar_t strings are not needed very often, and when they are, it's
usually for performance reasons.
> This means I can detect when UTF-8 is being used and just handle the UTF-8
> strings more carefully, more efficient than converting to wchar_t just
> to get the length. What I really need are things like
> - length of utf-8 strings
size_t stringLength = mbstowcs(NULL, theString, 0);
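Yes, mbstowcs() serves a dual purpose: converting to wide strings and,
when the destination is NULL, getting the length in characters of a
multibyte string (MBS). It decodes according to the current locale, so
a UTF-8 locale has to be active for UTF-8 input.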
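If only UTF-8 has to be handled, the length can also be counted without
going through the locale at all. The following is a minimal sketch, not
anything taken from Aspell: utf8_length is a hypothetical name, and it
assumes the input is already valid UTF-8. It relies on the fact that every
byte of a UTF-8 character except the first has the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
   Every byte that is NOT a continuation byte (10xxxxxx) starts a new
   character, so counting those bytes gives the length.
   Assumption: the input is valid UTF-8; no validation is done here. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For example, utf8_length("m\xC3\xA5n") counts three characters even
though the string occupies four bytes.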
> - length of the current utf-8 character
I assume that what you mean here is that given the character index into
a UTF-8 string, you want the length of the character at that position?
And perhaps also the actual Unicode code point (wchar_t value) for that
character?
The length of a character in an MBS string is retrieved using mbrlen().
The usage goes something like this (disclaimer: I haven't used this
myself to double-check the man page):
size_t length = mbrlen(theStringPointer, MB_CUR_MAX, NULL);
You can then extract the actual character using mbrtowc() (it is assumed
the pointer to the multibyte character is at strPointer):
wchar_t theChar;
size_t length = mbrtowc(&theChar, strPointer, MB_CUR_MAX, NULL);
The Unicode character is now stored in theChar.
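Putting the pieces together, walking a whole string with mbrtowc() might
look roughly like the sketch below. The name decode_chars and its
parameters are made up for illustration. It passes an explicit mbstate_t
instead of NULL, and it limits each call to the bytes that remain in the
string rather than MB_CUR_MAX, so that mbrtowc() never reads past the
terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Decode up to `max` characters of the multibyte string `s` into `out`,
   using the encoding of the current LC_CTYPE locale.
   Returns the number of characters stored, or (size_t)-1 if an invalid
   or truncated sequence is found.
   decode_chars is a hypothetical helper, not a library function. */
size_t decode_chars(const char *s, wchar_t *out, size_t max)
{
    mbstate_t state;
    size_t remaining;
    size_t count = 0;

    assert(s != NULL && out != NULL);
    memset(&state, 0, sizeof state);    /* initial conversion state */
    remaining = strlen(s);

    while (remaining > 0 && count < max) {
        wchar_t wc;
        size_t len = mbrtowc(&wc, s, remaining, &state);

        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (len == 0)
            break;                      /* decoded a NUL character */

        out[count++] = wc;              /* the character itself */
        s += len;                       /* len is its length in bytes */
        remaining -= len;
    }
    return count;
}
```

The caller is expected to have selected a locale first, for instance
with setlocale(LC_CTYPE, ""). For UTF-8 input that locale must use UTF-8;
in the default "C" locale only plain ASCII is guaranteed to decode.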
Hope this helps
Regards
Elias Mårtenson