[UTF-8] Aspell and UTF-8/Unicode

Elias Martenson elias-m@algonet.se
Sun, 15 Feb 2004 23:49:24 +0100


On Sun, 2004-02-15 at 21:55, Kevin Atkinson wrote:

> I should also add that with Aspell 0.50 you can't use all 256 characters.
> You are limited to the upper 128.  With Aspell 0.51 this will change, I
> hope, since the internal encoding will never be visible out side of the
> Speller module.

You asked about curses with wide characters.

Allow me to explain the necessary steps. If you have any questions don't
hesitate to ask.

First of all, at least on my FC1 box, there are two ncurses
installations: one in <ncurses/ncurses.h> and one in
<ncursesw/ncurses.h>. You must make sure you include the latter, or the
Unicode stuff will not work at all. I have absolutely no idea why this
version is not the default.

You have to remember to link with -lncursesw instead of -lncurses.

Also don't forget to call setlocale(LC_ALL, "") at the beginning of the
program, although I suppose you already do. :-)
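
To illustrate the setlocale() step, here is a minimal sketch. The helper
name init_locale() is made up for this example, and nl_langinfo() is a
POSIX call that may be missing on non-POSIX systems. It adopts the
locale from the environment and reports which encoding is in effect.

```c
#include <langinfo.h>   /* nl_langinfo(), CODESET (POSIX) */
#include <locale.h>     /* setlocale() */

/* Hypothetical helper: adopt the user's locale and report its character
 * encoding. In a curses program, call it before initscr(). */
const char *init_locale(void)
{
    /* "" means: take the locale from the environment (LANG, LC_ALL, ...).
     * If this fails, setlocale() returns NULL and the "C" locale stays
     * active. */
    setlocale(LC_ALL, "");

    /* Name of the active encoding, typically "UTF-8" in a UTF-8 locale. */
    return nl_langinfo(CODESET);
}
```

If the returned name is not "UTF-8", the wide-character curses calls
will probably not display non-ASCII text the way you expect.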

Next, you have to decide whether you want the application to work with
and print wide strings (wchar_t *) or UTF-8 strings (char *). It's a
matter of taste really. Both methods work and have their respective
drawbacks and advantages. Usually, especially when changing existing
code, it's much easier to use UTF-8. Nothing prevents you from using a
combination either: UTF-8 most everywhere, and wchar_t where it's
specifically needed.

If you want to use UTF-8, you're pretty much done! Just work with the
UTF-8 strings like any other string. Just remember that strlen() returns
the number of bytes, not the number of characters. If you want the
number of characters, convert the string with mbstowcs() first and use
wcslen() on the result. This is particularly important when doing
formatting for a curses app.
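
To make the difference between bytes and characters concrete, here is a
small sketch. utf8_length() is a made-up helper, not part of any
library. It assumes the input is valid UTF-8 and counts code points by
skipping continuation bytes, so it does not depend on the locale.

```c
#include <stddef.h>

/* Hypothetical helper: number of code points in a valid UTF-8 string. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx. Every other
         * byte starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "s\xC3\xB6n" (the Swedish "sön"), strlen() gives 4 while
utf8_length() gives 3. Note that this counts code points, not screen
columns, so it is only a rough guide for text with double-width or
combining characters.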

If you want to use wchar_t, read up on the mbstowcs() and wcstombs()
functions. mbs means "multibyte string", which is pretty much a string
in the system encoding (which, unless you're on a legacy system, means
UTF-8). wcs is a wchar_t string. Then you can simply use addwstr()
instead of addstr() etc. All of the old "char"-based functions have
Unicode-aware equivalents.

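The mbstowcs() conversion could look roughly like the sketch below. The
helper name to_wide() is made up for this example. It relies on the
POSIX behaviour that mbstowcs() with a NULL destination returns the
required length, and it assumes setlocale(LC_ALL, "") has already been
called so that the input is interpreted in the right encoding.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the current
 * locale's encoding) to a newly allocated wchar_t string.
 * Returns NULL on failure. The caller must free() the result. */
wchar_t *to_wide(const char *mbs)
{
    /* With a NULL destination, mbstowcs() only counts the wide
     * characters needed, not including the terminator (POSIX). */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;    /* invalid byte sequence for this locale */

    wchar_t *wcs = malloc((n + 1) * sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* Second pass: do the conversion, including the terminating L'\0'. */
    mbstowcs(wcs, mbs, n + 1);
    return wcs;
}
```

In a program built against ncursesw, the result can then be handed to
addwstr() and freed afterwards.
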
Well, that's it! I hope this has been of help for you.

Regards

Elias Mårtenson