[Uim] uim-py: Adding idioms to PY.scm
Jon Babcock
jon at kanji.com
Sun Apr 4 20:07:33 EEST 2004
[UTF-8 encoded Unicode.]
Yukiko Bando wrote:
> If someone is interested in giving it a try, please download it from here:
> http://www.h4.dion.ne.jp/~apricots/files/CEDICT_PY.scm.tar.bz2
Adding this is a big step, thanks!
> - With 17,000 entries, lots of common words seem to be still missing!
> "Chocolate" is not included...
You may want to add tsi.src which contains about 130,000 entries. Check
the file 'COPYING' in the libtabe-0.2.3.tgz package for permissions of use.
<http://sourceforge.net/project/showfiles.php?group_id=1519>
The first step is to convert the Zhuyin entries to Pinyin and then to
eliminate duplicates with your CEDICT file. I plan to do this
eventually, but would be *delighted* if I didn't have to. <g> And then
the file must be converted to UTF-8 Unicode, I assume.
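
A rough sketch of those three steps in Python, just to make the shape of the job concrete. Everything specific here is an assumption on my part: the tiny syllable table is a hand-picked sample (a real one needs roughly 400 rows), the entries are assumed to be already parsed into (word, syllable list) pairs, the keys are assumed to be toneless Pinyin, and Big5 is only my guess at the source encoding. None of the function names come from uim or libtabe.

```python
# Sketch only: sample table, assumed entry layout, assumed Big5 input.

# Whole-syllable Zhuyin -> toneless Pinyin (hypothetical sample, not complete)
ZHUYIN_TO_PINYIN = {
    "ㄓㄨㄥ": "zhong",
    "ㄍㄨㄛ": "guo",
    "ㄏㄢ": "han",
    "ㄩ": "yu",
}

# Zhuyin tone marks (2nd, 3rd, 4th, neutral); first tone is unmarked
TONE_MARKS = "ˊˇˋ˙"


def to_utf8(raw, source_encoding="big5"):
    """Re-encode raw bytes as UTF-8 (the Big5 default is an assumption)."""
    return raw.decode(source_encoding).encode("utf-8")


def zhuyin_to_pinyin(syllable):
    """Convert one Zhuyin syllable to toneless Pinyin, or None if unknown."""
    bare = syllable.strip(TONE_MARKS)
    return ZHUYIN_TO_PINYIN.get(bare)


def convert_entry(word, zhuyin_syllables):
    """Return (pinyin_key, word), or None if any syllable is not in the table."""
    parts = [zhuyin_to_pinyin(s) for s in zhuyin_syllables]
    if None in parts:
        return None
    return ("".join(parts), word)


def merge_without_duplicates(cedict_entries, tsi_entries):
    """Append converted tsi entries that are not already in the CEDICT list."""
    seen = set(cedict_entries)
    merged = list(cedict_entries)
    for word, syllables in tsi_entries:
        entry = convert_entry(word, syllables)
        if entry is not None and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


cedict = [("zhongguo", "中國")]
tsi = [("中國", ["ㄓㄨㄥ", "ㄍㄨㄛˊ"]), ("漢語", ["ㄏㄢˋ", "ㄩˇ"])]
merged = merge_without_duplicates(cedict, tsi)
```

Here the duplicate 中國 entry is dropped and only 漢語 is appended. Entries with a syllable missing from the table are skipped rather than guessed at, so they can be reviewed by hand later.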
Jon
[Somewhat OT]
Regarding number of words... actually, there are more than 200,000 words
missing. <g> This may be a bit OT, but still useful to understand the
eventual size of this project. (Also, non-native
students/scholars/translators of Chinese like myself and native writers
of Chinese may have somewhat different needs and expectations regarding
an input method. I'll try to clarify this later.)
* The _Hanyu Da Cidian_ (汉语大词典) has 347,426 multisyllabic entries.
[Notice I didn't say 'words'. There are many expert opinions on what
exactly constitutes a 'word' in Chinese. To avoid this problem, they are
often called 'compounds' or 'binoms', 'trinoms', etc.; 熟語 (jukugo) in
Japanese. The 國語辭典 Chinese dictionary (see below) explanation of
this is: <quote>語言中已定型的固定詞組或句子。包括成語、諺語、格言、歇後
語等。是在人們長期使用語言的過程中逐漸形成的。</quote> Big subject
which, for practical purposes, we can skip.]
* The new and extraordinarily useful _Grand dictionnaire Ricci de la
langue chinoise_ has about 300,000 multisyllabic entries.
* The 広漢和辞典, a very appealing 3-volume + Index volume edition of
the original 12-volume + Index volume 大漢和辭典 by the venerable 諸橋轍
次 (Morohashi Tetsuji) has approximately 200,000 multisyllabic entries.
* I don't know how many entries are in the big 國語辭典 compiled by the
教育部國語推行委員會 in Taiwan, because I don't yet own a copy. (I use
it daily through its web interface:
<http://140.111.1.22/mandr/clc/dict/> and find it to be the most useful
overall reference for Chinese words within arm's reach.) I'd guess it's
about the same size as _Grand dictionnaire Ricci_.
* The popular but hard-to-use _Far East Chinese-English Dictionary_
(from Taiwan) has about 120,000.
* That old standby of 20th-century students of Chinese, _Mathews'
Chinese-English Dictionary_, has 104,000 or more bisyllabic entries.
* The venerable and still useful _A Chinese-English Dictionary_ by
Herbert A. Giles (1st edition 1892) has... hmm, I don't see a number in
the front matter of my copy.
Nevertheless, a professional C-E translator encounters many words that
are not included in any of the above 7 or 8 major dictionaries of
Chinese words. (Just check the daily traffic on the fanyi mailing list
for an endless stream of examples.)
Jon
--
Jon Babcock <jon at kanji.com>