[Uim] Japanese input

TOKUNAGA Hiroyuki tkng at xem.jp
Tue Jul 19 16:14:40 EEST 2005


On Mon, 18 Jul 2005 14:58:16 +0200
Jeroen Ruigrok/asmodai <asmodai at in-nomine.org> wrote:

> And now to pester the Japanese people,
> 
> which Japanese input methods do we currently support?
> 
> - hiragana
> - half-width katakana
> - full-width katakana
> 
> any others?

Hmm, it seems that I should explain Japanese characters and the ways
to input Japanese text...

Hiragana, half-width katakana and full-width katakana are not Japanese
input methods. They are classes of characters.


Japanese characters
===================

Japanese uses several kinds of characters: as you know, hiragana,
katakana and kanji. In addition, we use the Latin alphabet to write
foreign-language words.

The distinction between half-width katakana and full-width katakana is
a historical thing: the only Japanese characters old computers could
handle were katakana, because JIS X 0201 defined only katakana.
(JIS X 0201 is an 8-bit extension of ASCII.)
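
To make the half-width/full-width distinction concrete, here is a
minimal Python sketch. It assumes only the standard library; the
variable names are made up for illustration. It shows that Unicode
treats the two forms as compatibility equivalents, and that Shift_JIS
keeps the single-byte JIS X 0201 codes for the half-width forms.

```python
import unicodedata

half = "ｶﾀｶﾅ"      # half-width katakana (U+FF76, U+FF80, U+FF76, U+FF85)
full = "カタカナ"  # full-width katakana (U+30AB, U+30BF, U+30AB, U+30CA)

# NFKC compatibility normalization folds half-width forms into the
# ordinary (full-width) katakana.
normalized = unicodedata.normalize("NFKC", half)
print(normalized == full)

# In Shift_JIS, half-width katakana keep their one-byte JIS X 0201
# codes, while full-width katakana take two bytes each.
half_bytes = half.encode("shift_jis")
full_bytes = full.encode("shift_jis")
print(len(half_bytes), len(full_bytes))
```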


The problem when you input Japanese text
========================================

There are at most 100 hiragana, and about the same number of katakana,
so inputting them is not so difficult. Many Japanese use romanized
characters to input hiragana/katakana.

But we have many kanji characters. There are over 10,000 kanji
characters in Japan. To read a Japanese newspaper comfortably, you need
to know over 3,000 of them.


How to input hiragana
=====================

To input hiragana, most Japanese use romanized characters, i.e. to
input "ありがとう", we type "arigatou". (I hope you have a Japanese
font!) This method is called 'roma-ji input'.
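
The roma-ji step can be sketched as a table lookup. This is only an
illustration, not how uim implements it: the table below is a toy that
covers just the syllables of the example, and the names
ROMAJI_TO_HIRAGANA and romaji_to_hiragana are hypothetical.

```python
# Toy romaji table: only the syllables needed for "arigatou".
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "ri": "り", "ga": "が", "to": "と", "u": "う",
}


def romaji_to_hiragana(text: str) -> str:
    """Convert romaji to hiragana by greedy longest-match lookup."""
    max_len = max(len(key) for key in ROMAJI_TO_HIRAGANA)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += n
                break
        else:
            # Unknown input is passed through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(romaji_to_hiragana("arigatou"))  # ありがとう
```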

There is another way to input hiragana, called 'kana input'. The keys
of a Japanese keyboard have kana printed on their tops, and we can use
them to input hiragana. (If you want to see an example, google for
'JIS keyboard'.) When the input method is in kana input mode, we can
type 'ありがとう' directly as 'ありがとう'. (To be exact, that's not
quite true. We have to type が as か゛.)
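
The か゛ detail can be sketched as follows. This assumes the kana-mode
keystrokes arrive as a base kana followed by the spacing voiced sound
mark (U+309B); whether a real input method sees them that way is an
assumption, and compose_kana is a hypothetical name. The sketch swaps
the spacing mark for its combining counterpart and lets Unicode NFC
normalization compose the pair.

```python
import unicodedata

# Spacing sound marks as typed, mapped to their combining forms.
SPACING_TO_COMBINING = {
    "\u309b": "\u3099",  # voiced sound mark (dakuten)
    "\u309c": "\u309a",  # semi-voiced sound mark (handakuten)
}


def compose_kana(keystrokes: str) -> str:
    """Merge kana + sound-mark keystroke pairs into single characters."""
    combining = "".join(SPACING_TO_COMBINING.get(ch, ch) for ch in keystrokes)
    # NFC composes か + U+3099 into が, は + U+309A into ぱ, and so on.
    return unicodedata.normalize("NFC", combining)


typed = "ありか\u309bとう"   # あ り か ゛ と う
print(compose_kana(typed))  # ありがとう
```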

Roma-ji input is the mainstream style. I guess kana input users are
fewer than 10% in Japan.


Kana kanji conversion
=====================

We cannot input kanji directly, because there are too many kanji
characters. (To be exact, that's not correct. Some people do input kanji
directly, but they are quite rare. I'll explain about these rare people
later.)

Therefore, we type hiragana somehow, then convert these hiragana
strings to kanji characters.

More precisely, we convert hiragana text to mixed kana-kanji text. Here
is an example.

before conversion: わたしのなまえはなかのです。
after conversion:  私の名前は中野です。
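
As a very rough sketch of the idea, the example above can be reproduced
with a three-entry toy dictionary. This is not how Anthy or any real
converter works (real ones must handle segmentation and ambiguity);
TOY_DICTIONARY and convert are hypothetical names.

```python
import re

# Toy dictionary: hiragana reading -> kanji spelling.
TOY_DICTIONARY = {"わたし": "私", "なまえ": "名前", "なかの": "中野"}

# Try longer readings first so that a short entry never shadows a long one.
_pattern = re.compile("|".join(
    re.escape(reading)
    for reading in sorted(TOY_DICTIONARY, key=len, reverse=True)
))


def convert(hiragana: str) -> str:
    """Replace every known reading; leave everything else as kana."""
    return _pattern.sub(lambda m: TOY_DICTIONARY[m.group(0)], hiragana)


print(convert("わたしのなまえはなかのです。"))  # 私の名前は中野です。
```

Particles such as の and は simply fall through unchanged, which is why
the result is mixed kana-kanji text.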

This conversion algorithm is very complex and incomplete. For example,
Anthy uses a Hidden Markov Model to determine the most probable word
class. I don't understand what a Hidden Markov Model is. ;-)
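
For readers who want a feel for the idea, here is a generic Viterbi
decoder over a toy Hidden Markov Model. It is not Anthy's model: the
two word classes and all the probabilities are invented for
illustration. The hidden states are word classes, the observations are
words, and the decoder picks the most probable class sequence.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Invented toy numbers, for illustration only.
states = ["noun", "particle"]
start_p = {"noun": 0.7, "particle": 0.3}
trans_p = {
    "noun": {"noun": 0.3, "particle": 0.7},
    "particle": {"noun": 0.8, "particle": 0.2},
}
emit_p = {
    "noun": {"わたし": 0.5, "の": 0.1, "なまえ": 0.4},
    "particle": {"わたし": 0.05, "の": 0.9, "なまえ": 0.05},
}

classes = viterbi(["わたし", "の", "なまえ"], states, start_p, trans_p, emit_p)
print(classes)  # ['noun', 'particle', 'noun']
```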

As I wrote above, the conversion algorithm is incomplete. Consider the
case where 'ひろゆき' (Hiroyuki) is given as hiragana text. Hiroyuki is
a very common name in Japan, and we can list many conversion candidates
for it: 弘幸, 浩之, 裕之, 拓之... (The last one is my name!)

Of course, we can't guess which candidate is the best, because the best
candidate depends on the context.

Instead of solving this unsolvable problem, we provide a way to choose
among the conversion candidates. If the first candidate is not the
intended one, you can choose from the other candidates.
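
Candidate selection can be sketched like this. The candidate table is a
toy holding only the four spellings listed above, and CandidateSelector
is a hypothetical class, not uim's API. It offers the first candidate
and moves to the next one on request, wrapping around at the end of the
list.

```python
# Toy candidate table; a real dictionary holds many more entries.
CANDIDATES = {"ひろゆき": ["弘幸", "浩之", "裕之", "拓之"]}


class CandidateSelector:
    """Step through the conversion candidates for one reading."""

    def __init__(self, reading: str):
        # Fall back to the reading itself when no conversion is known.
        self.candidates = CANDIDATES.get(reading, [reading])
        self.index = 0

    @property
    def current(self) -> str:
        return self.candidates[self.index]

    def next(self) -> str:
        # Wrap around after the last candidate.
        self.index = (self.index + 1) % len(self.candidates)
        return self.current


selector = CandidateSelector("ひろゆき")
first = selector.current         # the candidate offered first
while selector.current != "拓之":
    selector.next()              # the user asks for another candidate
print(first, selector.current)   # 弘幸 拓之
```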


Umm, I don't have enough time now to describe the concrete usage of
Japanese input methods. I have to study computer graphics until this
Thursday. If I don't earn the credit, my courses next term will be
hard.


Regards,

-- 
TOKUNAGA Hiroyuki
tkng at xem.jp


